An interesting suggester in Solr

Auto-suggestion has evolved over time. Lucene has a number of implementations for type-ahead suggestions. Existing suggesters generally find suggestions whose whole prefix matches the current user input. Recently a new AnalyzingInfixSuggester has been developed, which finds matches of tokens anywhere in the user input and in the suggestion. Here is how the implementation looks.

Let us see how to implement this in Solr. First, we will need Solr 4.7, which can use the Lucene suggester module. For more details check out SOLR-5378 and SOLR-5528. To implement this, look at the searchComponent named "suggest" in solrconfig.xml and make the following changes.

<searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">mySuggester</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
      <str name="suggestAnalyzerFieldType">text_ws</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>     <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
      <str name="field">cat</str>
      <str name="weightField">price</str>
      <str name="buildOnCommit">true</str>
    </lst>
</searchComponent>

Here we have changed the lookup type (lookupImpl) to AnalyzingInfixLookupFactory and defined the analyzer used for building the dictionary as text_ws, which is a simple whitespace tokenizer based field type. The field used for providing suggestions is "cat", and the "price" field is used as the weight for sorting the suggestions.

Also change the default dictionary for the suggester to "mySuggester" by adding the following line to the requestHandler named "/suggest".

<str name="suggest.dictionary">mySuggester</str>
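
For reference, here is a rough sketch of how that "/suggest" request handler might look once the line is added; the surrounding defaults shown here follow the stock example configuration and may differ in your solrconfig.xml:

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="suggest">true</str>
      <str name="suggest.count">10</str>
      <!-- point the handler at the suggester defined above -->
      <str name="suggest.dictionary">mySuggester</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
</requestHandler>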

Once these configurations are in place, simply restart the Solr server. To try out the suggester, index all the documents in the exampledocs folder; the suggester index is built when the documents are committed. Then check the implementation with the following URL.

http://localhost:8983/solr/suggest/?suggest=true&suggest.q=and&suggest.count=2

The output will be somewhat similar to this:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
  </lst>
  <lst name="suggest">
    <lst name="mySuggester">
      <lst name="and">
        <int name="numFound">2</int>
        <arr name="suggestions">
          <lst>
            <str name="term">electronics <b>and</b> computer1</str>
            <long name="weight">2199</long>
            <str name="payload"/>
          </lst>
          <lst>
            <str name="term">electronics <b>and</b> stuff2</str>
            <long name="weight">279</long>
            <str name="payload"/>
          </lst>
        </arr>
      </lst>
    </lst>
  </lst>
</response>

We are already getting the suggestions with the matched term highlighted. Try getting suggestions for some other partial words like "elec" or "comp" and see the output. Now let us note down some limitations that I came across while implementing this.

Check out the type of the field "cat", which is being used for providing suggestions, in schema.xml. It is of type "string" and is both indexed and stored. We can change the field name in solrconfig.xml to provide suggestions based on some other field, but that field has to be both indexed and stored. We would not want to tokenize the field, as that may mess up the suggestions, so it is recommended to use "string" fields for providing suggestions.
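
For reference, in the example schema.xml that ships with Solr, the "string" type and the "cat" field are declared roughly like this (a sketch; check your own schema for the exact attributes):

<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>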

Another limitation I came across is that suggestions do not work on multiValued fields. "cat", for example, is multiValued. Do a search for "electronics" and fetch the "cat" field:

http://localhost:8983/solr/collection1/select/?q=electronics&fl=cat

We can see that in addition to "electronics", the cat field also contains "connector", "hard drive" and "memory". But asking for suggestions on those strings does not return anything.

http://localhost:8983/solr/suggest/?suggest=true&suggest.q=hard&suggest.count=2

So, it is recommended that the suggestion field be of type "string" and not multiValued. If suggestions are to be provided from multiple fields, it is recommended to merge them into a single "string" field in the index.
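
One possible way to do that merge (a sketch; the suggest_text field name is hypothetical) is to declare a dedicated single-valued string field in schema.xml, fill it with the concatenated values when building each document, and point the suggester at it:

<!-- schema.xml: a single-valued string field that holds the merged suggestion text -->
<field name="suggest_text" type="string" indexed="true" stored="true" multiValued="false"/>

<!-- solrconfig.xml: make the suggester read from the merged field -->
<str name="field">suggest_text</str>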

Win Free e-copies of Apache Solr PHP Integration

Readers would be pleased to know that I have teamed up with Packt Publishing to organize a giveaway of Apache Solr PHP Integration.

Three lucky winners stand a chance to win e-copies of the book. Keep reading to find out how you can be one of the lucky winners.

Overview


• Understand the tools that can be used to communicate between PHP and Solr, and how they work internally
• Explore the essential search functions of Solr such as sorting, boosting, faceting, and highlighting using your PHP code
• Take a look at some advanced features of Solr such as spell checking, grouping, and auto complete with implementations using PHP code

How to enter?

All you need to do is head on over to the book page, look through the product description, and drop a line via the comments below this post to let us know what interests you the most about this book. It's that simple.

Deadline

The contest will close on 5th March 2014. Winners will be contacted by email, so be sure to use your real email address when you comment!

A book every PHP developer should read

Once upon a time, a long long time ago, before Solr existed and Lucene was just a search engine API, PHP developers struggled to use Lucene. Most people fell back on MySQL full text search, and some ventured into Sphinx, another free full text search engine. Then came Solr, and PHP developers were thrilled with the ease of indexing and searching text over its HTTP interface, with Lucene abstracted away by Solr.

Even then, it was difficult to fully explore and use the features provided by Lucene and Solr through PHP. There is a PHP extension for communicating with Solr, but it has not been in active development; as Solr came out with more and more features, the extension started to look very basic, and most of Solr's advanced features were not available through it. Then came Solarium, an open source library that is very actively developed and supports the latest features of Solr.

But as the features of Solarium and Solr kept growing, PHP developers found it difficult to keep up to date with them. The book Apache Solr PHP Integration provides an up-to-date and in-depth view of the latest features provided by Solr and how they can be explored in PHP via Solarium. PHP developers are generally very comfortable writing code and setting up systems, but if you are not very familiar with how to set up Solr or how to connect to it from PHP, the book hand-holds you through it with configurations, examples and screenshots.

In addition to covering basics like indexing and search, the book goes in depth into advanced queries in Solr such as filter queries and faceting. It also guides a developer through setting up Solr to highlight hits in the results and walks through the implementation with sample code. Other advanced functionality covered includes implementing spell check with Solr and PHP, grouping of results, and the "more like this" feature. The book also discusses distributed search, a feature used for scaling Solr horizontally. Setting up master-slave replication in Solr is discussed with sample configuration files, and load balancing of queries using PHP and Solarium is covered with sample code.

As a PHP developer, you may have some questions like:

Q: Why should I read this book?
A: The book will make you an expert in search using Solr. That is an additional skill you can show off.

Q: I know Solr. What else does the book provide?
A: Are you up to date with the latest features provided by Solr? Have you implemented features like spell check, suggestions, result grouping, and more like this?

Q: I am an expert in all the above features of Solr. What else does the book have?
A: Are you comfortable implementing Solr on large-scale sites with an index of millions of documents? Do you know how Solr calculates relevance and how it can be tweaked? Can you pull index statistics from Solr using PHP?

If you are still undecided, the following article and the table of contents of the book will help you make up your mind.

http://www.packtpub.com/article/apache-solr-php-integration
http://www.packtpub.com/apache-solr-php-integration/book

How to go about Apache Solr

I had explored Lucene for building full text search engines, and had even gone to the depth of modifying Lucene's core classes to change and add functionality. Being good at Lucene, I never looked at Solr; I had the notion that Solr was a simple web interface on top of Lucene and so would not be very customizable. But recently that belief was broken. I have been going through Solr 1.4.1. Here is a list of features that Solr provides by default: http://lucene.apache.org/solr/features.html.

Solr 1.4.1 combines

  • Lucene 2.9.3 – an older version of Lucene. The more recent Lucene 3.0.2 is based on Java 5 and has some really good performance improvements over the 2.x versions; I wish we had Lucene 3.x in Solr. Lucene powers the core full-text search capability of Solr.
  • Tika 0.4 – again, the latest version here is 0.8. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
  • Carrot2 3.1.0 – here too the latest version is 3.4.2. Carrot2 is an open source search results clustering engine that readily integrates with Lucene as well.

To install Solr, simply download it from http://lucene.apache.org/solr/index.html and untar it. You can launch Solr by navigating to the <solr_directory>/example directory and running java -jar start.jar. This will start the sample Solr server without any data. You can go to http://localhost:8983/solr/admin to see the admin page. To post some sample files, run java -jar post.jar *.xml from the <solr_dir>/example/exampledocs directory. This will load the example documents into the Solr server and build an index on them.
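
For example, assuming the untarred directory is named apache-solr-1.4.1 (the name will vary with the version you download), the steps look roughly like this:

cd apache-solr-1.4.1/example
java -jar start.jar

# in another terminal, load the example documents
cd apache-solr-1.4.1/example/exampledocs
java -jar post.jar *.xml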

Now the real fun begins when you want to index your own data in Solr. First of all, you need to define the schema and decide how data will be ported into the index.

For starters:
— copy the example directory : cp example myproject
— go to the solr/conf directory : cd myproject/solr/conf
— ls will show you the directory contents : $ ls
admin-extra.html dataimport.properties mapping-ISOLatin1Accent.txt schema.xml solrconfig.xml stopwords.txt xslt
data-config.xml elevate.xml protwords.txt scripts.conf spellings.txt synonyms.txt

These are the only files in Solr which need to be tweaked to get Solr working.
We will go through the main files one by one.

schema.xml
— You need to define the field types such as string, boolean, binary, int, float, double etc.
— Each field type has a class and certain properties associated with it.
— You can also specify how a field type is analyzed/tokenized/stored in the schema.
— Any filters related to a field type can also be specified here.
— Let's take an example of a text field:

<!-- the text field is of type solr.TextField -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <!-- this analyzer is applied while indexing -->
    <analyzer type="index">
        <!-- pass the text through the following tokenizer -->
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <!-- and then apply these filters -->
        <!-- use the stop words specified in stopwords.txt for the stop word filter -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- avoid stemming words which are in the protwords.txt file -->
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <!-- for queries use the following analyzer and filters -->
    <!-- generally the analyzers/filters for indexing and querying are the same -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
</fieldType>

— Once we are done defining the field types, we can go ahead and define the fields that will be indexed.
— Each field can have additional attributes like type=fieldType, stored=true/false, indexed=true/false, omitNorms=true/false
— If you want some fields to be ignored during indexing, you can create an "ignored" field type and set those fields to that type:

<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
<dynamicField name="*" type="ignored" multiValued="true" />

— In addition to these, you also have to specify the unique key used to enforce uniqueness among documents:

<uniqueKey>table_id</uniqueKey>

— Default Search field and default search operator are among other things that can be specified.
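
Putting the above together, a few field definitions and the default search settings might look like this (the field names here simply mirror the data-config.xml example later in this post and are illustrative):

<!-- fields using the "string", "text" and "date" types from the example schema -->
<field name="table_id" type="string" indexed="true" stored="true" required="true"/>
<field name="rowname" type="text" indexed="true" stored="true"/>
<field name="tags" type="text" indexed="true" stored="true"/>
<field name="date" type="date" indexed="true" stored="true"/>

<!-- default search field and default operator -->
<defaultSearchField>rowname</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>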

solrconfig.xml
– Used to define the configuration for Solr.
– Parameters like dataDir (where the index is to be stored) and indexing parameters can be specified here.
– You can also configure caches like queryResultCache, documentCache or fieldValueCache and their caching parameters (see the sketch after this list).
– It also handles cache warming.
– There are request handlers for performing various tasks.
– Replication and partitioning are sections in request handling.
– Various search components are also available to handle advanced features like faceted search, moreLikeThis and highlighting.
– All you have to do is put the appropriate settings in the XML file and Solr will handle the rest.
– Spellchecking is available, which can be used to generate a list of alternate spelling suggestions.
– Clustering is also a search component; it integrates search with Carrot2 for clustering. You can select which clustering algorithm you want out of those provided by Carrot2.
– Porting of data can be done using multiple formats like XML and CSV. There are request handlers available for all these formats.
– A more interesting way of porting data is the DataImportHandler, which ports data directly from MySQL into the index. We will go into detail on this below.
– There is an inbuilt dedup handler as well, so all you have to do is set it up by telling it which fields to monitor and it will automatically deduplicate the results.
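
As an illustration of the cache settings mentioned in the list above, a queryResultCache entry in solrconfig.xml looks roughly like this (the sizes are arbitrary and should be tuned for your data):

<!-- cache holding the ordered lists of document ids for recent queries -->
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="128"/>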

DataImportHandler

For people who use Sphinx for search, a major benefit is that they do not have to write any code for porting data: you provide a query in Sphinx and it automatically pulls data out of MySQL and pushes it into the Sphinx engine. DataImportHandler is a similar tool available in Solr. You can register a DataImportHandler as a requestHandler:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <!-- specify the file which has the config for the database connection and the query to be fired for getting data -->
        <!-- you can also specify parameters to handle incremental porting -->
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

data-config.xml

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/databaseName?zeroDateTimeBehavior=convertToNull"
                user="root" password="jayant" />
    <document>
        <entity name="outer" pk="table_id"
            query="SELECT table_id, data1, field1, tags, rowname, date FROM mytable"
            deltaImportQuery="SELECT table_id, data1, field1, tags, rowname, date FROM mytable where table_id='${dataimporter.delta.table_id}'"
            deltaQuery="SELECT table_id from mytable where last_modified_time > '${dataimporter.last_index_time}'">

            <!-- this is the map which says which column goes into which field name in the index -->
            <field column="table_id" name="table_id" />
            <field column="data1" name="data1" />
            <field column="field1" name="field1" />
            <field column="tags" name="tags" />
            <field column="rowname" name="rowname" />
            <field column="date" name="date" />

            <!-- getting content from another table for this table_id -->
            <entity name="inner"
                query="SELECT content FROM childtable where table_id = '${outer.table_id}'" >
                <field column="content" name="content" />
            </entity>
        </entity>
    </document>
</dataConfig>

Importing data

Once you start the Solr server using java -jar start.jar, you can see it working at
http://localhost:8983/solr/

It will show you a welcome message.

To import data using dataimporthandler use
http://localhost:8983/solr/dataimport?command=full-import (for full import)
http://localhost:8983/solr/dataimport?command=delta-import (for delta import)

To check the status of dataimporthandler use
http://localhost:8983/solr/dataimport?command=status

Searching

The ultimate aim of Solr is searching. So let's see how we can search in Solr and how to get results back.
http://localhost:8983/solr/select/?q=solr&start=0&rows=10&fl=rowname,table_id,score&sort=date desc&hl=true&wt=json

Now this says that:
– search for the string "solr"
– start from 0 and get 10 results
– return only the fields rowname, table_id and score
– sort by date descending
– highlight the results
– return the output as JSON

All you need to do now is process the output and display the results.

There is a major apprehension about using Solr because it only provides an HTTP interface for communicating with the engine, but I don't think that is a flaw. Of course you can go ahead and create your own layer on top of Lucene for search, but Solr follows a number of standards for search and it would be difficult to replicate all of them. Another option is to create a wrapper around the HTTP interface exposing just the functionality you need. HTTP is an easier way of communicating compared to defining your own server and your own protocols.

Solr definitely provides an easy to use, ready-made solution for search on Lucene, one that is also scalable (remember replication, caching and partitioning). And in case the Solr guys missed something, you can pick up their classes and modify them or create your own to cater to your needs.