Wednesday, March 19, 2014

An interesting suggester in Solr

Auto-suggestion has evolved over time. Lucene has a number of implementations for type-ahead suggestions. Existing suggesters generally return only suggestions whose prefix matches the current user input. Recently a new AnalyzingInfixSuggester has been developed, which matches tokens from the user input against tokens anywhere in the suggestion, not just at its start. Here is how the implementation looks.




Let us see how to implement this in Solr. First, we need Solr 4.7, which can use the Lucene suggester module. For more details, check out SOLR-5378 and SOLR-5528. To implement this, find the searchComponent named "suggest" in solrconfig.xml and make the following changes.

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="suggestAnalyzerFieldType">text_ws</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str> <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
    <str name="field">cat</str>
    <str name="weightField">price</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

Here we have changed the lookup implementation (lookupImpl) to AnalyzingInfixLookupFactory and set the analyzer used for building the dictionary to text_ws, a field type that uses a simple WhitespaceTokenizerFactory. The field used for providing suggestions is "cat", and the "price" field is used as the weight for sorting the suggestions.
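For reference, this is roughly how text_ws is defined in the example schema.xml that ships with Solr 4.x. It is shown here only as a sketch; if your schema differs, check the definition there before relying on it.

<!-- text_ws as found in the stock example schema.xml (verify against your own schema) -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- splits on whitespace only; no lowercasing, stemming or stop words -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

Because this analyzer only splits on whitespace, matching is presumably case-sensitive, so "Elec" and "elec" may give different results.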

Also change the default dictionary for the suggester to "mySuggester" by adding the following line to the requestHandler named "/suggest".

<str name="suggest.dictionary">mySuggester</str>
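As a sketch of where that line goes, the handler could end up looking like the following. This assumes the "/suggest" handler from the stock Solr 4.7 example solrconfig.xml; the other defaults shown (suggest, suggest.count) are taken from that example and may differ in your configuration.

<!-- "/suggest" handler, assuming the stock example solrconfig.xml -->
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <!-- the added line: use mySuggester when no dictionary is given in the request -->
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <!-- must match the name of the searchComponent configured above -->
    <str>suggest</str>
  </arr>
</requestHandler>

Values placed in "defaults" can still be overridden per request, for example by passing suggest.count in the URL.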

Once these configurations are in place, simply restart the Solr server. To check the suggester, index all the documents in the exampledocs folder. Because buildOnCommit is set to true, the suggester index is built when the documents are committed to the index. To check the implementation, simply use the following URL.

http://localhost:8983/solr/suggest/?suggest=true&suggest.q=and&suggest.count=2

The output will be somewhat similar to this:


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
  </lst>
  <lst name="suggest">
    <lst name="mySuggester">
      <lst name="and">
        <int name="numFound">2</int>
        <arr name="suggestions">
          <lst>
            <str name="term">electronics <b>and</b> computer1</str>
            <long name="weight">2199</long>
            <str name="payload"/>
          </lst>
          <lst>
            <str name="term">electronics <b>and</b> stuff2</str>
            <long name="weight">279</long>
            <str name="payload"/>
          </lst>
        </arr>
      </lst>
    </lst>
  </lst>
</response>

The suggestions already come back highlighted: the matched token is wrapped in <b> tags. Try getting suggestions for other partial words such as "elec" and "comp" and look at the output. Let us note down some limitations that I came across while implementing this.

Check out the type of the field "cat", which is used for providing suggestions, in schema.xml. It is of type "string" and is both indexed and stored. We can change the field name in our solrconfig.xml to provide suggestions based on some other field, but that field has to be both indexed and stored. We would not want to tokenize the field, as that may interfere with the suggestions, so it is recommended to use "string" fields for providing suggestions.
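To make those requirements concrete, here is a sketch of the relevant schema.xml entries. The first line is the "cat" definition as I recall it from the stock example schema; the second is a purely hypothetical field (the name suggest_text is an assumption, not something Solr ships with) that satisfies the same conditions.

<!-- "cat" in the example schema: string, indexed and stored (verify against your schema.xml) -->
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- hypothetical alternative field for suggestions: untokenized, indexed and stored -->
<field name="suggest_text" type="string" indexed="true" stored="true"/>

To use the second field, the "field" entry of the suggester in solrconfig.xml would have to be changed to suggest_text.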

Another flaw that I came across is that the suggestions do not work on multivalued fields. "cat", for example, is multivalued. Search for "electronics" and retrieve the "cat" field.

http://localhost:8983/solr/collection1/select/?q=electronics&fl=cat

We can see that in addition to "electronics", the cat field also contains "connector", "hard drive" and "memory". But a search on those strings does not give any suggestions.

http://localhost:8983/solr/suggest/?suggest=true&suggest.q=hard&suggest.count=2

So the field should be of type "string" and not multivalued. If suggestions are to be provided from multiple fields, it is recommended to merge them into a single "string" field in the index.
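One possible way to do that merge inside Solr, rather than in the indexing client, is an update processor chain that copies the source fields into one field and then concatenates the values. The following is only a sketch under several assumptions: the chain name build-suggest-text and the field suggest_text are hypothetical, "name" and "manu" are used as source fields because they exist in the example schema, and I have not tested this together with the suggester.

<!-- schema.xml: hypothetical single-valued string field that holds the merged text -->
<field name="suggest_text" type="string" indexed="true" stored="true" multiValued="false"/>

<!-- solrconfig.xml: hypothetical chain that builds suggest_text at index time -->
<updateRequestProcessorChain name="build-suggest-text">
  <!-- copy the values of the source fields into suggest_text -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">name</str>
    <str name="source">manu</str>
    <str name="dest">suggest_text</str>
  </processor>
  <!-- join the copied values into one string so the field stays single-valued -->
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">suggest_text</str>
    <str name="delimiter">, </str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain only runs if it is selected, for example by passing update.chain=build-suggest-text on the update request. The suggester's "field" setting would then point to suggest_text. Note that each suggestion becomes the concatenated string, which may or may not be what you want to show to users.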