Embedding cognitive search in an HR intranet

What per diem can I claim on my business trip to Hungary? How does a company car impact my taxes? How do I request paternity leave?

How well does your company intranet search answer questions like these? Google-easy? Well done to your CIO.

If fast and accurate content search is not yet a standard capability in your company, you might want to consider the Watson Retrieve & Rank service. Not only can it replace on-premise search engine installations, it also opens the door to cognitive search capabilities in a seamless way.

I would like to share my experience implementing a proof of concept that uses the service to index 24,000 HR pages. The steps below show how we convert HTML pages to JSON, upload the content to the Retrieve & Rank service, and answer employee questions with a JavaScript widget that queries the service securely through a proxy.

Further details, including an overview and an easy tutorial for enabling the service, are also well described on IBM Watson Developer Cloud.

1. IBM Bluemix application

In the first step we need to create a Bluemix account and a new application. This is essential for two reasons. First, it allows you to procure the Watson Retrieve & Rank service as an add-on service on your account. Second, it allows you to proxy the service calls in a way that does not reveal your credentials to users, which matters especially if you intend to query the service directly from a JavaScript widget.
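If you prefer the command line to the Bluemix dashboard, the Cloud Foundry CLI can do the provisioning for you. A minimal sketch, assuming the service is listed under the retrieve_and_rank label with a standard plan, and using a hypothetical instance name hrweb-rr:

#log in to Bluemix (US South region in this example)
cf api https://api.ng.bluemix.net
cf login

#provision the Retrieve & Rank service and generate credentials for it
cf create-service retrieve_and_rank standard hrweb-rr
cf create-service-key hrweb-rr hrweb-rr-key
cf service-key hrweb-rr hrweb-rr-key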

2. Data collection

The recommended way to collect data and to set up the service is to use the provided Java-based Data Crawler. The link to the crawler (Java 8) is available from Bluemix, but a more comprehensive solution is kale, which not only crawls content but also integrates the document conversion service and the set-up of Retrieve & Rank.

Alternatively, you can prepare your content another way. At the end of the day, you just need a simple JSON file, and it does not matter whether you get it from your content management system or directly from the hosting server.
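For reference, the format Solr's JSON update handler expects is just a series of "add" commands followed by a single "commit". A minimal hand-written example (the URLs and field values are made up, but the field names match the script below):

{
  "add": { "doc": { "url": "http://w3.ibm.com/hr/us/perdiem.html", "title": "Per diem rates", "body": "Per diem rates by country ..." } },
  "add": { "doc": { "url": "http://w3.ibm.com/hr/us/companycar.html", "title": "Company car", "body": "How a company car affects your taxes ..." } },
  "commit": { }
}

The repeated "add" keys look unusual, but this is exactly what the Solr update handler accepts.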

The example below comes from a 400-line bash script that converts heavily profiled content from HTML to JSON. The snippets demonstrate the basics of building the JSON file; the full script might require another blog entry. Note that the HTML-to-JSON conversion does not need to be perfect, removing tags is sufficient. We are just merging the contents for search purposes and don't need any structure or links.

#start JSON
  echo "{" >> "$out"
#loop through documents
#for each page separate all html tags on their own line
  sed 's/</\
</g' "$f" | sed 's/>/>\
/g' | sed '/^$/d' > "$tmpPage"
#loop through the page extracting content from the main content area
#send output to the JSON file
  echo ' "add": { "doc": {' >> "$out"
  echo ' "url": "http://w3.ibm.com'"$webpath$page"'",' >> "$out"
  echo ' "body": "'"${maincontent//\"/\\\"}"'",' >> "$out"
  echo ' "title": "'"${mainheading//\"/\\\"}"'"' >> "$out"
  echo ' } },' >> "$out"
#finish JSON with a commit command
  echo ' "commit" : { }' >> "$out"
  echo "}" >> "$out"

Text content extracted this way is extremely small compared to the size of your web directories. All U.S. content, including global content and the U.S. sections within global content, ended up being less than 16 MB in total!

3. Retrieve & Rank set-up

If you don't use kale or another automated tool, here is an example of the curl commands to set up Retrieve & Rank manually:


#create a cluster
curl -X POST -u "OUR_RR_USERNAME":"OUR_RR_PASSWORD" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters" -d ""

#upload a custom configuration for Apache Solr (the POC config I used enables automatic assignment of the ID field to a UUID)
curl -X POST -H "Content-Type: application/zip" -u "OUR_RR_USERNAME":"OUR_RR_PASSWORD" --data-binary @hrweb-solr-config.zip "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/OUR_CLUSTER_ID/config/hrweb"

#create a collection ("us" is our collection name, "hrweb" is our Solr config reference)
curl -X POST -u "OUR_RR_USERNAME":"OUR_RR_PASSWORD" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/OUR_CLUSTER_ID/solr/admin/collections" -d "action=CREATE&name=us&collection.configName=hrweb"

#upload custom data
curl -X POST -H "Content-Type: application/json; charset=UTF-8" -u "OUR_RR_USERNAME":"OUR_RR_PASSWORD" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/OUR_CLUSTER_ID/solr/us/update" --data-binary @us.json

#check stats, e.g. how much space is left
curl -u "OUR_RR_USERNAME":"OUR_RR_PASSWORD" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sce1986209_f7df_4671_a245_0001c2a94dcd/stats"

4. Proxy set-up (Bluemix)

A Node.js-based application performs well as a proxy. You will need to set it up with CORS to allow cross-domain access to the JSON output. The core can be very simple:

//set up Express; allow only specific servers to access the resources, and only via GET or OPTIONS requests:
var express = require('express');
var request = require('request');
var app = express();

app.use(function(req, res, next) {
  var allowedOrigins = ['http://ourdevelopmentserver.ibm.com', 'http://ourtestserver.ibm.com', 'http://ourproductionserver.ibm.com'];
  var origin = req.headers.origin;
  if(allowedOrigins.indexOf(origin) > -1){
       res.setHeader('Access-Control-Allow-Origin', origin);
  }
  res.header('Access-Control-Allow-Methods', 'GET, OPTIONS');
  res.header('Access-Control-Allow-Headers', 'X-Requested-With, Content-Type, Authorization');
  res.header('Access-Control-Allow-Credentials', 'true');
  //answer CORS preflight requests directly
  if (req.method === 'OPTIONS') {
      return res.sendStatus(200);
  }
  return next();
});

//get service content and proxy it to users
app.get('/select', function (req, res) {
    var url = 'https://' + process.env.RR_USERNAME + ':' + process.env.RR_PASSWORD +
        '@gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/' + process.env.RR_CLUSTERID +
        '/solr/' + encodeURIComponent(req.query.c) + '/select?q=' + encodeURIComponent(req.query.q) +
        '&wt=json&fl=' + encodeURIComponent(req.query.fl);
    request({
        'url': url,
        'method': 'GET',
        'json': true //parse the response body so the error check below works
    }, function (error, response, body) {
        if (error) {
            console.error('Watson error:', response ? response.statusCode : null, response ? response.statusMessage : null, error);
            res.sendStatus(502);
        } else if (body && body.error) {
            console.error('Watson body error:', response.statusCode, response.statusMessage, body.error);
            res.sendStatus(502);
        } else if (!('' + response.statusCode).match(/^2\d\d$/)) {
            console.error('Watson status code error:', response.statusCode, response.statusMessage, body);
            res.sendStatus(response.statusCode);
        } else {
            res.send(body);
        }
    });
});
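Once the application is deployed, the proxy can be exercised the same way, this time without exposing any credentials:

#the same query routed through the proxy, no credentials required
curl "http://ourhrwebbluemixapp.mybluemix.net/select?q=per%20diem&c=us&fl=title,url"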

5. Widget set-up (HR pages)

The final step is to plug the search into your pages. In the example below I use a slightly extended version of Dojo's QueryReadStore, but other JavaScript frameworks offer easier methods, and your webmasters will certainly enjoy the challenge of setting up a cognitive search box for users.

  this.initDataStore = function(){
    //extend QueryReadStore to provide our own formatting and query
    dojo.provide("QueryReadStore");
    dojo.declare("QueryReadStore", dojox.data.QueryReadStore, {
      fetch: function(request) {
        request.serverQuery = {q: request.query.q};
        return this.inherited("fetch", arguments);
      },
      _filterResponse: function(data) {
        var items = [];
        dojo.forEach(data.response.docs, function(e){
          //guard against documents that come back without a title field
          var title = (e.title && e.title[0]) || e.url || "-";
          items.push({
            label: '<a href="' + e.url + '" class="link">' + title + '</a>',
            id: e.url,
            title: title});
        }, this);
        return {label: 'ok', items: items};
      }
    });

    this.watsonStore = new QueryReadStore({url: 'http://ourhrwebbluemixapp.mybluemix.net/select?fl=title,url&c=us'});
  }

  this.initInputBox = function(){
    this.watsonBox = new FilteringSelect({
        name: "watson", searchAttr: "q", labelAttr: "label", labelType: "html",
        store: this.watsonStore, trim: true, hasDownArrow: false, "class": "ouribmclass",
        query: {"exclude": false}, queryExpr: "${0}", highlightMatch: "all", pageSize: "10",
        style: "display:inline-block !important;width:100px !important",
        fetchProperties: {sort: [{attribute: 'label', descending: false}]}
      }, dojo.byId(this.inputDivId));
    this.watsonBox.autoComplete = false;
    this.watsonBox.textbox.placeholder = this.dataWidget.label;
  }
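The snippet assumes the relevant Dojo 1.x modules are already loaded. With the classic loader that would look roughly like this:

//modules the store and the input box above rely on
dojo.require("dojox.data.QueryReadStore");
dojo.require("dijit.form.FilteringSelect");
var FilteringSelect = dijit.form.FilteringSelect;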


The proof of concept works really well and is extremely fast, as one would expect from the Apache Solr engine powering it. Now we can automate and deliver the solution, and work with HR and our employees to get the best value out of it. At the end of the day, I want to find my per diem easily too!
