how to go about apache-solr

I had explored Lucene for building full text search engines. I had even gone into the depths of modifying the core classes of Lucene to change existing functionality and to add new functionality. Being comfortable with Lucene, I never looked at Solr. I had the notion that Solr was a simple web interface on top of Lucene, so it would not be very customizable. But recently that belief was broken. I have been going through Solr 1.4.1. Here is a list of features that Solr provides by default: http://lucene.apache.org/solr/features.html.

Solr 1.4.1 combines

  • Lucene 2.9.3 – an older version of Lucene. The latest version, Lucene 3.0.2, is based on Java 5 and has some really good performance improvements over the 2.x versions. I wish we had Lucene 3.x in Solr. Lucene powers the core full-text search capability of Solr.
  • Tika 0.4 – again, the latest version is 0.8. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
  • Carrot2 3.1.0 – here too the latest version is 3.4.2. Carrot2 is an open source search results clustering engine. It readily integrates with Lucene as well.

To install Solr, simply download it from http://lucene.apache.org/solr/index.html and untar it. You can launch Solr by navigating to the <solr_directory>/example directory and running java -jar start.jar. This will start the sample Solr server without any data. You can go to http://localhost:8983/solr/admin to see the Solr admin page. To post some sample files to Solr, run java -jar post.jar *.xml from the <solr_dir>/example/exampledocs directory. This will load the example documents into the Solr server and create an index on them.

Now the real fun begins when you want to index your own site with Solr. First of all, you need to define the schema and identify how data will be ported into the index.

For starters
— copy example directory : cp example myproject
— go to solr/conf directory : cd myproject/solr/conf
— ls will show you the directory contents : $ ls
admin-extra.html dataimport.properties mapping-ISOLatin1Accent.txt schema.xml solrconfig.xml stopwords.txt xslt
data-config.xml elevate.xml protwords.txt scripts.conf spellings.txt synonyms.txt

These are the only files in Solr which need to be tweaked to get Solr working…
We will go through these files one by one.

schema.xml
— You need to define the fieldTypes, such as string, boolean, binary, int, float, double etc…
— Each field type has a class and certain properties associated with it.
— You can also specify how a fieldType is analyzed/tokenized/stored in the schema.
— Any filters related to a fieldType can also be specified here.
— Let's take the example of a text field:

<!-- the text field is of type solr.TextField -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <!-- this analyzer is applied during indexing -->
    <analyzer type="index">
        <!-- pass the text through the following tokenizer -->
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <!-- and then apply these filters -->
        <!-- use the stop words specified in stopwords.txt for the stop word filter -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- avoid stemming words which are listed in the protwords.txt file -->
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <!-- for queries, use the following analyzer and filters -->
    <!-- generally the analyzers/filters for indexing and querying are the same -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
</fieldType>

— Once we are done defining the field types, we can go ahead and define the fields that will be indexed.
— Each field can have additional attributes like type=fieldType, stored=true/false, indexed=true/false, omitNorms=true/false
— If you want some fields to be skipped during indexing, you can create an "ignored" field type and set the type of those fields to ignored:

<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
<dynamicField name="*" type="ignored" multiValued="true" />
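
Besides the ignored catch-all, you declare the concrete fields for your own data. As a rough sketch, the fields for the table used in the data-config.xml example later in this post could be declared like this (the type names assume the string/text/date field types defined in the example schema, and the stored/indexed flags are illustrative assumptions):

<field name="table_id" type="string" indexed="true" stored="true" required="true"/>
<field name="rowname" type="text" indexed="true" stored="true"/>
<field name="data1" type="text" indexed="true" stored="true"/>
<field name="field1" type="string" indexed="true" stored="true"/>
<field name="tags" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="date" type="date" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="false"/>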

— In addition to these, you also have to specify the uniqueKey used to enforce uniqueness across documents:

<uniqueKey>table_id</uniqueKey>

— The default search field and the default search operator are among the other things that can be specified here.

solrconfig.xml
– Used to define the configuration of Solr.
– Parameters like the dataDir (where the index is to be stored) and indexing parameters can be specified here.
– You can also configure caches like the queryResultCache, documentCache or fieldValueCache and their caching parameters (see the sketch after this list).
– It also handles warming of the caches.
– There are request handlers for performing various tasks.
– Replication and partitioning are sections in request handling.
– Various search components are also available to handle advanced features like faceted search, moreLikeThis and highlighting.
– All you have to do is put the appropriate settings in the xml file and Solr will handle the rest.
– Spellchecking is available, which can be used to generate a list of alternate spelling suggestions.
– Clustering is also a search component; it integrates search with Carrot2 for clustering, and you can select which of the algorithms provided by Carrot2 you want to use.
– Porting of data can be done using multiple formats like xml and csv; there are request handlers available for all of these formats.
– A more interesting way of porting data is the DataImportHandler, which ports data directly from mysql to lucene. We will go into detail on this below.
– There is inbuilt support for deduplication as well. All you have to do is set it up by telling it which fields to monitor, and it automatically takes care of duplicates.
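
For example, the data directory and the caches mentioned above are configured in solrconfig.xml with entries along these lines (a sketch in the spirit of the stock example config; the cache sizes, autowarm counts and warming query are illustrative values you would tune for your own data):

<dataDir>${solr.data.dir:./solr/data}</dataDir>

<query>
    <!-- caches used while serving queries -->
    <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

    <!-- warm a new searcher with a few queries before it starts serving live requests -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
            <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
        </arr>
    </listener>
</query>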

DataImportHandler

For people who use sphinx for search, a major benefit is that they do not have to write any code for porting data. You provide a query in sphinx and it automatically pulls data out of mysql and pushes it into the sphinx engine. The DataImportHandler is a similar tool available in Solr. You register a DataImportHandler as a requestHandler:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <!-- the file which has the config for the database connection and the queries to be fired for getting data -->
        <!-- you can also specify parameters to handle incremental (delta) imports -->
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

data-config.xml

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/databaseName?zeroDateTimeBehavior=convertToNull"
                user="root" password="jayant" />
    <document>
        <entity name="outer" pk="table_id"
            query="SELECT table_id, data1, field1, tags, rowname, date FROM mytable"
            deltaImportQuery="SELECT table_id, data1, field1, tags, rowname, date FROM mytable where table_id='${dataimporter.delta.table_id}'"
            deltaQuery="SELECT table_id from mytable where last_modified_time > '${dataimporter.last_index_time}'">

            <!-- this is the map which says which column goes into which field name in the index -->
            <field column="table_id" name="table_id" />
            <field column="data1" name="data1" />
            <field column="field1" name="field1" />
            <field column="tags" name="tags" />
            <field column="rowname" name="rowname" />
            <field column="date" name="date" />

            <!-- getting content from another table for this table_id -->
            <entity name="inner"
                query="SELECT content FROM childtable where table_id = '${outer.table_id}'">
                <field column="content" name="content" />
            </entity>
        </entity>
    </document>
</dataConfig>

Importing data

Once you start the Solr server using java -jar start.jar, you can see the server working at
http://localhost:8983/solr/

It will show you a “welcome message”

To import data using the DataImportHandler, use:
http://localhost:8983/solr/dataimport?command=full-import (for full import)
http://localhost:8983/solr/dataimport?command=delta-import (for delta import)

To check the status of the DataImportHandler, use:
http://localhost:8983/solr/dataimport?command=status

Searching

The ultimate aim of Solr is searching. So let's see how we can search in Solr and how to get results back from it.
http://localhost:8983/solr/select/?q=solr&start=0&rows=10&fl=rowname,table_id,score&sort=date desc&hl=true&wt=json

Now this says that
– search for the string “solr”
– start from 0 and get 10 results
– return only fields : rowname, table_id and score
– sort by date descending
– highlight the results
– return output as json

All you need to do now is process the output and display the results.
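
If you want to fire this query from Java without a client library, a minimal sketch using plain java.net might look like the one below. It assumes the same host, port and fields as the URL above and simply prints the raw json; plugging in a json parser to pull out the documents is left to you:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SolrSelect
{
    public static void main(String[] args) throws Exception
    {
        // build the same select URL as above; the sort parameter needs URL encoding because of the space
        String url = "http://localhost:8983/solr/select/?q=" + URLEncoder.encode("solr", "UTF-8")
                   + "&start=0&rows=10&fl=rowname,table_id,score"
                   + "&sort=" + URLEncoder.encode("date desc", "UTF-8")
                   + "&hl=true&wt=json";

        BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        StringBuilder json = new StringBuilder();
        String line;
        while((line = in.readLine()) != null)
        {
            json.append(line);
        }
        in.close();

        // the response contains a responseHeader section and a response section
        // with numFound and the list of docs (rowname, table_id, score)
        System.out.println(json.toString());
    }
}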

There is a major apprehension about using Solr because it provides an http interface for communication with the engine. But I don't think that is a flaw. Of course you can go ahead and create your own layer on top of Lucene for search, but Solr follows a number of standards for search and it would be difficult to replicate all of them. Another option is to create a wrapper around the http interface with the limited functionality that you need. Http is an easier means of communication as compared to defining your own server and your own protocols.

Solr definitely provides an easy to use, ready made solution for search on top of Lucene, and one which is also scalable (remember replication, caching and partitioning). And in case the Solr guys missed something, you can pick up their classes and modify them, or create your own, to cater to your needs.

intro to lucene 2.9

What crap!!! Why do they have to come out with a new version every now and then, and make people rewrite their code to upgrade to it? How much more do they still have to improve their code? Just because of their frequent upgrades, I have to change my code every now and then. Why should I upgrade to Lucene 2.9?

To answer this question, it could be said that you build something and then you figure out that, oh no, if it had been done in some other way it would have been better. So for example you make an omelette and you figure out that putting in a bit of cheese and pepper would have improved its taste. So next time you try that, but then you figure out that making it in butter would bring out more taste. Or you buy a Pentium 3 PC with 1 GB RAM and after 2 years you see that it is outdated: the software has grown and so have the processing powers available. To run the currently available software, you would need to upgrade your PC to a Pentium 4 or Core 2 Duo, and maybe upgrade your graphics card from the previous nvidia 9800 GT to an ATI Radeon 4870 X2 to play recent games more effectively. And maybe upgrade your 20 inch CRT television to a 42 inch HD LCD for a better display.

It is for the same reason that Lucene keeps optimizing its code and improving its features: they realize that better code leads to faster indexing and searching on the same machine.

The reason why you should upgrade your Lucene version is defined by the list of features that Lucene 2.9 provides:

  • Per segment searching and caching (which can lead to much faster reopens, among other things). The FieldCache takes advantage of the fact that most segments of the index are static: only the parts that change are reprocessed, saving time and memory. Searching across multiple segments is also faster.
  • Near real-time search capabilities added to IndexWriter – new way to search the current in-memory segment before index is written to disk.
  • New Query types
  • Smarter, more scalable multi-term queries (wildcard, range, etc)
  • A freshly optimized Collector/Scorer API
  • Improved Unicode support and the addition of Collation contrib
  • A new Attribute based TokenStream API
  • A new QueryParser framework in contrib with a core QueryParser replacement impl included.
  • Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required.
  • New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)
  • New fast-vector-highlighter for large documents
  • Lucene now includes high-performance handling of numeric fields (NumericField & NumericRangeQuery). Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values. This improves Lucene's number indexing, and is faster for searching numbers, geo-locations and dates, faster for sorting, and hugely faster for range searching. A short sketch follows this list.
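
As a rough sketch of the numeric field handling (the field names and values here are made up for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class NumericSketch
{
    public static void main(String[] args)
    {
        // index a price as a trie-encoded numeric field instead of padded text
        Document doc = new Document();
        doc.add(new Field("name", "some product", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new NumericField("price", Field.Store.YES, true).setDoubleValue(149.99));
        // ... add doc to an IndexWriter as usual ...

        // range search over the numeric field without any textual padding of numbers
        Query q = NumericRangeQuery.newDoubleRange("price", 100.0, 200.0, true, true);
        System.out.println(q);
    }
}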

For the newbies, all this techno-rant is just there to make you feel good about upgrading. In brief – faster search and more features.

Let's take a look at how you would go about indexing and searching using Lucene 2.9.

Here is a very rough example. What I have done is use the Twitter API to search for keywords on Twitter, fetch the micro-blogs and create an index on them using Lucene 2.9, and then use the same program to open the index and run a search, displaying only the top n results. You can fetch the Twitter API (twitter4j) from http://yusuke.homeip.net/twitter4j/en/index.html


import twitter4j.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.store.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.document.*;
import java.io.*;
import java.util.Date;
import java.util.ArrayList;
import java.util.List;

public class lucene
{
    public static void main(String[] args) throws Exception
    {
        if(args.length != 3)
        {
            System.out.println("Usage : java lucene <index/search> <dirname> <string>");
            System.exit(1);
        }

        if(!args[0].equalsIgnoreCase("index") && !args[0].equalsIgnoreCase("search"))
        {
            System.out.println("Usage : java lucene <index/search> <dirname> <string>");
            System.exit(1);
        }
        System.out.println(args[0]+","+args[1]+","+args[2]);

        lucene lu = new lucene(args[0], args[1]);
        if(args[0].equalsIgnoreCase("index"))
            lu.indexFiles(args[2]);
        else if(args[0].equalsIgnoreCase("search"))
            lu.searchFiles(args[2]);
    }

    File index_dir;
    String action;

    public lucene(String action, String dirname) throws Exception
    {
        this.index_dir = new File(dirname);
        this.action = action;

        if(index_dir.exists() && action.equalsIgnoreCase("index"))
        {
            System.out.println("Index already exists... enter another directory for indexing...");
            System.exit(1);
        }
    }

    public void indexFiles(String searchstr) throws Exception
    {
        Twitter tw = new Twitter();
        System.out.println("Getting tweets for "+searchstr);
        twitter4j.Query qry = new twitter4j.Query("source:twitter4j "+searchstr);
        qry.setRpp(50);

        QueryResult res = tw.search(qry);
        List<Tweet> tweets = res.getTweets();
        System.out.println("Got "+tweets.size()+" tweets in "+res.getCompletedIn()+" : "+res.getMaxId());

        // constructor changed from lucene 2.4.1
        IndexWriter iw = new IndexWriter(NIOFSDirectory.open(this.index_dir), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

        int docs = 0;
        for(int z=0; z<tweets.size(); z++)
        {
            Tweet twt = (Tweet)(tweets.get(z));
            String user = twt.getFromUser();
            String usrTwt = twt.getText();
            System.out.println("Got : "+user+" => "+usrTwt);

            Document d = new Document();
            // constructor for Field changed - introduced new constants ANALYZED & NOT_ANALYZED. Not storing norms improves performance.
            d.add(new Field("user", user, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS, Field.TermVector.YES));
            d.add(new Field("tweet", usrTwt, Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS));

            iw.addDocument(d);
            docs++;
        }

        System.out.println("optimizing... "+docs+" docs");
        iw.optimize();
        iw.close();
    }

    public void searchFiles(String searchstr) throws Exception
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        QueryParser parser = new QueryParser("tweet", new WhitespaceAnalyzer());
        // new constructor in 2.9 - pass true to open the index in read-only mode.
        IndexReader ir = IndexReader.open(NIOFSDirectory.open(this.index_dir), true);
        Searcher searcher = new IndexSearcher(ir);
        int resultsPerPage = 5;
        do
        {
            org.apache.lucene.search.Query qry = parser.parse(searchstr);
            System.out.println("Searching for : "+searchstr);

            // use TopScoreDocCollector to get results and do paging. Collect 2 pages in one go.
            // The boolean flag tells the collector whether documents arrive in increasing doc id order.
            TopScoreDocCollector collector = TopScoreDocCollector.create(2*resultsPerPage, false);
            searcher.search(qry, collector);
            // get the total no of hits found
            int totalResults = collector.getTotalHits();
            int start = 0;
            int end = Math.min(totalResults, resultsPerPage);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;

            System.out.println("Total hits : "+totalResults+", end : "+end);

            for(int i=start; i<end; i++)
            {
                Document doc = searcher.doc(hits[i].doc);
                System.out.println(i+"] "+doc.get("user")+" => "+doc.get("tweet"));
            }

            System.out.print("\nQuery (enter \"quit\" to exit): ");
            searchstr = br.readLine();
            if(searchstr == null)
            {
                break;
            }
            searchstr = searchstr.trim();
            if(searchstr.length() == 0)
            {
                break;
            }

        } while(!searchstr.equalsIgnoreCase("quit"));
    }
}

monitor and improve lucene search

How ?
Simple – use lucidgaze available at http://www.lucidimagination.com/Downloads/LucidGaze-for-Lucene

With lucidgaze you could analyze

  • Read/write stats on index utilization, distribution and throughput
  • Query efficiency stats – show how effectively user input is analyzed and decomposed for processing by the index
  • Mapping of tokenizers, token-streams and analyzers – makes transparent how text is processed and indexed

So, we went ahead and downloaded the api but were unable to find any example program or source code. After some amount of tinkering around we were finally able to figure out how to use it.

You could get the code here : http://ai-cafe.blogspot.com/2009/09/lucid-gaze-tough-nut.html

lucene search class diagram

Check out the UML diagram for Lucene search & scoring that I found on the net.

Application flow:
1. Create the query
2. Pass it to the searcher
3. The searcher creates a weight
4. The weight creates a scorer
5. The searcher creates a collector and passes it to the scorer
6. The scorer finds matches
7. The scorer calculates scores
8. Matches are written to the collector (see the sketch below)
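
To make the collector part concrete, here is a minimal sketch of a custom Collector against the Lucene 2.9 API. It only counts hits and remembers the best score; the class name and what it records are made up for illustration:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class CountingCollector extends Collector
{
    private Scorer scorer;   // handed to us by the searcher (step 5 above)
    private int docBase;     // doc id offset of the current index segment
    private int hits = 0;
    private float best = 0f;

    public void setScorer(Scorer scorer) throws IOException
    {
        this.scorer = scorer;
    }

    public void setNextReader(IndexReader reader, int docBase) throws IOException
    {
        // called once per segment; 2.9 searches the index segment by segment
        this.docBase = docBase;
    }

    public void collect(int doc) throws IOException
    {
        // steps 6-8: the scorer has found a match; we ask it for the score and record the hit.
        // docBase + doc would give the global document id if we wanted to remember it.
        float score = scorer.score();
        if(score > best) best = score;
        hits++;
    }

    public boolean acceptsDocsOutOfOrder()
    {
        return true; // we do not depend on documents arriving in order
    }

    public int getHits() { return hits; }
    public float getBest() { return best; }
}

You would pass an instance of it to searcher.search(query, new CountingCollector()) instead of one of the built-in collectors like TopScoreDocCollector.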

Document scoring in lucene – Part 2

This is an addition to the previous post on document scoring/calculating relevance in lucene. If you find that post inadequate, you can refer to scoring@lucene.org and the formula at Default similarity formula.

What I did was create a small txt file and some code to index it, and then tried to find out in practice how Lucene calculates relevance using the DefaultSimilarity class. Here are the file and the source code.

file: file_a.txt
jayant project manager project leader team leader java linux c c++ lucene apache solaris aix unix minix gnome kde ubuntu redhat fedora rpm deb media player vlc evolution exchange microsoft java vb vc vc++ php mysql java

source code: FIndexer.java
import java.io.*;
import java.util.Date;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;

public class FIndexer
{
    public static void main (String[] args) throws Exception
    {
        String src_path = args[0];
        String des_path = args[1];

        File f_src = new File(src_path);
        File f_des = new File(des_path);

        if(!f_src.isDirectory() || !f_des.isDirectory())
        {
            System.out.println("Error : "+f_src+" || "+f_des+" is not a directory");
            System.exit(1);
        }

        IndexWriter writer = new IndexWriter(f_des, new WhitespaceAnalyzer(), true);

        File[] files = f_src.listFiles();
        for(int x=0; x < files.length; x++)
        {
            Document doc = new Document();
            if(files[x].isFile())
            {
                BufferedReader br = new BufferedReader(new FileReader(files[x]));
                StringBuffer content = new StringBuffer();
                String line = null;
                while( (line = br.readLine()) != null)
                {
                    content.append(line).append("\n");
                }

                Field f1 = new Field("name", files[x].getName(), Field.Store.YES, Field.Index.NO);
                doc.add(f1);
                Field f2 = new Field("content", content.toString(), Field.Store.NO, Field.Index.TOKENIZED);
                doc.add(f2);
                Field f3 = new Field("content2", content.toString(), Field.Store.NO, Field.Index.TOKENIZED);
                doc.add(f3);
            }
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}

I created copies of file_a.txt as file_b.txt and file_c.txt, and edited file_c.txt such that file_a.txt and file_b.txt have the same content (the word java occurs 3 times in both files) while file_c.txt has the word java occurring only 2 times.

I created an index using FIndexer.java and used Luke to fire queries at the index.

First I did a search for content:java, and as expected I got 3 results with the following scores.

Rank score filename
0 0.1928 file_b.txt
1 0.1928 file_a.txt
2 0.1574 file_c.txt

Let's take the score for file_b.txt and see how it is calculated. The score is a product of
tf = 1.7321 (java occurs 3 times)
idf = 0.7123 (document freq = 3)
fieldNorm = 0.1562 (the length normalization of the content field)

And the score of file_c.txt is a product of
tf = 1.4142 (java occurs 2 times)
idf = 0.7123 (document freq = 3)
fieldNorm = 0.1562

Here the score is equivalent to the fieldWeight, since there is just one field in the query. If more than one field is used in the query, then the weight of each term is the product of its queryWeight and fieldWeight, and the document score is the sum of these weights.

So let's change the query to content:java^5 content2:java^2. Here I am boosting java in content by 5 and java in content2 by 2, that is, java in content is 2.5 times more important than java in content2. Let's check the scores.

Rank score filename
0 0.2506 file_b.txt
1 0.2506 file_a.txt
2 0.2046 file_c.txt

Let's look at how the score was calculated.

Again for file_b.txt:
0.2506 = 0.1790 (weight of content:java^5) + 0.0716 (weight of content2:java^2)

Out of which the weight of content:java^5 is
= 0.9285 (queryWeight) [ 5 (boost) * 0.7123 (idf, docFreq=3) * 0.2607 (queryNorm) ]
* 0.1928 (fieldWeight) [ 1.7321 (tf=3) * 0.7123 (idf, docFreq=3) * 0.1562 (fieldNorm) ]

And the weight of content2:java^2 is
= 0.3714 (queryWeight) [ 2 (boost) * 0.7123 (idf, docFreq=3) * 0.2607 (queryNorm) ]
* 0.1928 (fieldWeight) [ 1.7321 (tf=3) * 0.7123 (idf, docFreq=3) * 0.1562 (fieldNorm) ]

The same formula is used for calculating the score of file_c.txt, except that the term frequency factor is 1.4142 (the tf of content:java or content2:java is 2).
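
If you do not want to go through Luke, the same breakdown can be pulled out programmatically using the explain() API. A minimal sketch against the index built by FIndexer above (the class name is just for illustration, and it uses the same older Lucene API as the rest of this post):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class FExplainer
{
    public static void main(String[] args) throws Exception
    {
        // args[0] is the index directory created by FIndexer
        IndexSearcher searcher = new IndexSearcher(args[0]);
        QueryParser parser = new QueryParser("content", new WhitespaceAnalyzer());
        Query qry = parser.parse("content:java^5 content2:java^2");

        Hits hits = searcher.search(qry);
        for(int i = 0; i < hits.length(); i++)
        {
            // explain() prints the full tf/idf/boost/norm breakdown for this hit
            Explanation exp = searcher.explain(qry, hits.id(i));
            System.out.println(hits.doc(i).get("name") + " : " + hits.score(i));
            System.out.println(exp.toString());
        }
        searcher.close();
    }
}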

This explains a little about scoring in Lucene. The scoring can be altered either by changing the DefaultSimilarity class or by extending DefaultSimilarity and changing certain factors/calculations in it. More on how to change the scoring formula later.
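
As a small taste of that, here is a sketch of a custom similarity that flattens the term frequency factor and switches off length normalization; which factors you override depends entirely on your data, so treat this as an illustration rather than a recommendation:

import org.apache.lucene.search.DefaultSimilarity;

public class MySimilarity extends DefaultSimilarity
{
    // default is sqrt(freq); here any number of occurrences counts the same
    public float tf(float freq)
    {
        return freq > 0 ? 1.0f : 0.0f;
    }

    // default is 1/sqrt(numTerms); here longer fields are not penalized
    public float lengthNorm(String fieldName, int numTerms)
    {
        return 1.0f;
    }
}

Since lengthNorm is baked into the index, you would set this similarity on the IndexWriter (writer.setSimilarity(new MySimilarity())) before indexing, and also on the searcher before searching.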

lucene in php & lucene in java

Something I found out while solving an issue for Mr. Nguyen from Vietnam.

He used php-lucene in the Zend Framework for building a Lucene index and searching on it, and was facing issues with search times. It turned out that the mysql full text index was performing better than the Lucene index.

So I did a quick benchmark and found the following:

1. Indexing using php-lucene takes a huge amount of time as compared to java-lucene. I indexed 30000 records and the time it took was 1673 seconds. Optimization time was 210 seconds. Total time for index creation was 1883 seconds, which is a hell of a lot of time.

2. An index created using php-lucene is compatible with java-lucene. So an index created by php-lucene can be read by java-lucene and vice versa.

3. Search in php-lucene is very slow as compared to java-lucene. The times for 100 searches were:

jayant@jayantbox:~/myprogs/java$ java searcher
Total : 30000 docs
t2-t1 : 231 milliseconds

jayant@jayantbox:~/myprogs/php$ php -q searcher.php
Total 30000 docs
total time : 15 seconds

So I thought that maybe php was retrieving the documents upfront, and changed the code to extract all documents in both php and java. Still, the times for 100 searches were:

jayant@jayantbox:~/myprogs/java$ java searcher
Total : 30000 docs
t2-t1 : 2128 milliseconds

jayant@jayantbox:~/myprogs/php$ php -q searcher.php
Total 30000 docs
total time : 63 seconds

The code for php search for lucene index is:


And the code for java search of lucene index is


/*
* searcher.java
* On 2007-06-06
* By jayant
*
*/

import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;

public class searcher {

    public static void main (String args[]) throws Exception
    {
        IndexSearcher s = new IndexSearcher("/tmp/myindex");
        System.out.println("Total : "+s.maxDoc()+" docs");
        QueryParser q = new QueryParser("content", new StandardAnalyzer());
        Query qry = q.parse("java");

        long t1 = System.currentTimeMillis();
        for(int x=0; x < 100; x++)
        {
            Hits h = s.search(qry);
            // retrieve all documents. Comment this block out if you don't want to retrieve documents
            for(int y=0; y < h.length(); y++)
            {
                Document d = h.doc(y);
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("t2-t1 : "+(t2-t1)+" milliseconds");
    }
}

Hope I haven't missed anything here.

realtime fulltext index

For the past few days, I have been struggling to find something which allows me to create, insert into, update and search fulltext indexes: big fulltext indexes. Not very large, but somewhere around 3-4 GB of data.

You would suggest mysql, but with mysql the inserts on a table with fulltext indexes are very slow. For every insert, data is put into the index, which leads to slow inserts.

What else? Well, I had come across some other fulltext engines like sphinx and senna. I compiled mysql with senna but there was no benefit; the index size was almost double and the searches were also slow, as slow as the mysql fulltext index.

What about sphinx? Well, I had integrated sphinx with mysql. But the sphinx engine has lots of limitations, like the first 2 columns must be integers, the 3rd column should be text, and all the remaining columns should be integers. So if I use sphinx, I would need to write a stored procedure and a trigger to port data from my current table to a parallel sphinx table whenever a new row is inserted. And what would I do if I have to run a query? I would be joining both the tables. Would that be fast? Don't know. Let me figure out how to get this thing running…

You would say: what about lucene, my favourite search engine? Well dear, lucene is in java and there is no way I could integrate it with mysql if I have to do it in a jiffy. I would need to design and build a library to get this thing running. Forget it. In fact dbsight provides a similar type of library. Maybe I will get that thing working and check it out. At a small price, I might get what the organization requires, and that too without wasting any resources on building it.

Hope I get some solution to the situation I am in right now. I have solutions for it, but not one which requires the least amount of resources to get it running. Till I get one, the search will continue…

document scoring/calculating relevance in lucene

How is the scoring of a document calculated in Lucene? Well, first of all, what is scoring and what does it do? Scoring is the factor which decides the order in which documents are returned after a search is done using Lucene, something like "order by relevance". If we provide an order by/sort by clause in the search query, then ordering is done on that. Else, ordering is done on the score of the document. All documents returned from the search have a score number associated with them. The document with the highest score comes at the top and the one with the least score comes at the bottom.

Scoring is implemented in the similarity class of lucene (org.apache.lucene.search.Similarity).

Let's see how the score of a document is calculated.

The score of a query q for a document d is defined as follows (t refers to a term):

score(q,d) = [ Σ over t in q of ( tf(t in d) * idf(t)^2 * getBoost(t in q) * getBoost(t.field in d) * lengthNorm(t.field in d) ) ] * coord(q,d) * queryNorm(sumOfSquaredWeights)

where

sumOfSquaredWeights = Σ over t in q of ( idf(t) * getBoost(t in q) )^2

Now, let's decode it…

tf = term/phrase(t) frequency in a document(d). Terms and phrases repeated in a document indicate the topic of the document, so implementations of this method usually return larger values when freq is large, and smaller values when freq is small.

idf = term document frequency(no of documents which contain the term). Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.

getBoost (t in q) = Gets the boost for this clause in the query. Documents matching this clause will (in addition to the normal weightings) have their score multiplied by b. The boost b=1.0 by default.

getBoost (t.field in d) = Gets the boost for the field in the document if the term is in the field. Again boost for the field b is 1.0 by default.

lengthNorm(t.field in d) = normalization value for a field, given the total number of terms contained in the field. These values, together with field boosts, are stored in the index and multiplied into the scores for hits on each field by the search code. Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large, and larger values when numTokens is small. These values are computed during IndexWriter.addDocument(Document d) and stored. Thus they have limited precision, and documents must be re-indexed if this method is altered.

coord = Score factor based on the fraction of all query terms that a document contains. This value is multiplied into scores. The presence of a large portion of the query terms indicates a better match with the query, so implementations of this method usually return larger values when the ratio between these parameters is large and smaller values when the ratio between them is small.

queryNorm = Computes the normalization value for a query given the sum of the squared weights of each of the query terms. This value is then multiplied into the weight of each query term. It does not affect ranking, but rather just attempts to make scores from different queries comparable.

So effectively, the score of a document returned by a search is calculated keeping in mind the frequency of the term/phrase in the document, the number of documents containing the term, the boosting factor of a clause in the query, the boosting factor of a field, the total number of terms in a field, the fraction of all query terms contained in the document, and a normalization factor based on the idf and boost of the terms in the query.

If you are searching a single field using a query, the results should be very relevant, provided the text entered is meaningful and grammatically correct. With the logic described here, even if you are searching multiple fields, the order of relevance of the results should be good. But relevance can be increased further by assigning a boost factor to fields, either during the query formation process or during the indexing process. Fields can be boosted according to their order of importance, resulting in very relevant results.

Boosting during the query formation process provides the flexibility to change the boost factor dynamically, but slows down the search and scoring due to the dynamic computation of the score. Whereas if fields are boosted during the indexing process, then though the search will be much faster, you lose the flexibility of altering the boost factor dynamically. A short sketch of both approaches follows.
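
A minimal sketch of the two approaches, assuming a document with hypothetical title and body fields (the field names and boost values are made up for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostExample
{
    public static void main(String[] args)
    {
        // Boosting at index time: the boost is folded into the field norm when the
        // document is written, so changing it means re-indexing.
        Document doc = new Document();
        Field title = new Field("title", "lucene scoring explained", Field.Store.YES, Field.Index.TOKENIZED);
        title.setBoost(3.0f); // matches in the title count three times as much
        doc.add(title);
        doc.add(new Field("body", "a longer write up on scoring ...", Field.Store.NO, Field.Index.TOKENIZED));
        // ... add doc to an IndexWriter as usual ...

        // Boosting at query time: the boost lives in the query itself,
        // so it can be tuned per search at the cost of a little extra computation.
        String query = "title:scoring^3 body:scoring";
        System.out.println(query);
    }
}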

Anything I have missed, please let me know and I will add it…

Benchmarking results of mysql, lucene and sphinx…

Finally I was able to do a benchmark of the 3 search engines:
– mysql fulltext search
– lucene search engine
– sphinx www.sphinxsearch.com

I came across sphinx while doing a search on full text search engines. It is a very good engine. A few points regarding sphinx:
-> Very simple to configure, build an index with and search.
-> Very easy to integrate with php and mysql; APIs for both are available. I was able to build an index and search using sphinx in a few hours.
-> The index which is created is a combination of all the fields from mysql. There is no distinction between the different fields being searched, so you can perform a search on the index as a whole but not on individual fields of the index.
-> Of course, since its source code is available, the searching process can be customized according to your needs. Moreover, the 0.9.6 version, which is under development, will provide field wise search.
-> Since it is written in C, it is supposed to be faster as compared to lucene.

I did the benchmarking on my own laptop, a Dell Inspiron 700m running Linux (Fedora Core 4), kernel 2.6.11. Configuration of the machine:
Processor : Intel(R) Pentium(R) M processor 1.60GHz
Cache size : 2048 KB
Memory : 512 MB

I took a table containing 1 lakh (100,000) records. The data size was 456 MB, and I created an index on some fields from the table.

INDEXING TIME

Mysql Version – 5.1.9 (Beta)
Stop words : Built in
Indexing words of length >=2 &

mysql Fulltext search versus lucene

Here is a comparison between the mysql fulltext and lucene search engines. On the surface, the only things that distinguish one from the other are:

==> fulltext search in lucene is much faster as compared to mysql
==> lucene is much more complex to use as compared to mysql

In mysql, you can simply mark an index on a text/varchar column as fulltext, and your work is done. All you need to do next is to fire MATCH AGAINST queries. Addition and modification of indexed data is handled by mysql internally as and when new data is added. On the other hand, in lucene, the addition/modification of documents has to be handled programmatically. Moreover, mysql plugs into your application using the available apis: mysql runs on a port, and all you need to do is connect to that port and fire sql queries at it from your application. Whereas in the case of lucene, you will have to plug your application into the index using the lucene api.

Another difference is that lucene is very efficient at searching a large number of documents, whereas in the case of mysql, as the number of documents increases, the speed of search goes down. Mysql uses RAM to cache the index and uses it while serving a query. So, if the size of your fulltext index exceeds the RAM, you will experience a major fall in search performance. Whereas in the case of lucene, the size of the index does not affect search performance even when it exceeds the RAM on the system.

With mysql, when a fulltext index is created on a table, inserts into the table become very slow. Let's analyze this. For each record, mysql does some major processing to fit the record into the current index. After the record is indexed, the cache/RAM containing the index needs to be rebuilt, since the index which was previously there is no longer correct: it does not have the new record. So, with each record mysql fetches the complete index into the cache/RAM. So if you are performing searches and inserts/updates on a single table with a fulltext index, the performance of both search and indexing goes down very badly. On the other hand, with lucene, the addition of new documents is not a major drawback. Documents can be added on the fly, which makes indexing very fast, and this process does not affect the search. Two things to be noted here.

==> lucene does not allow you to modify a document. Modifying a document is equivalent to deleting the current document and adding the modified document to the index (see the sketch after these notes).

==> lucene requires an object of the index (a searcher/reader) to perform the search. You will know about it when you use the api. Whenever you add a new document to the index, a new object of the index has to be created to include the new document in the search results. But creating a new object is not a major overhead, though it does slow down the searching process to some extent.
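
As a sketch of the first point: with a slightly newer Lucene API (2.1 and above), the delete-plus-add dance is wrapped up in IndexWriter.updateDocument(). The index path, field names and values below are hypothetical:

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateSketch
{
    public static void main(String[] args) throws Exception
    {
        // open an existing index (the path is just an example)
        IndexWriter writer = new IndexWriter("/tmp/myindex", new WhitespaceAnalyzer(), false);

        // build the new version of the document
        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("content", "the modified text goes here", Field.Store.NO, Field.Index.TOKENIZED));

        // delete the old copy identified by its id term and add the new copy;
        // updateDocument() is exactly this delete + add done in one call
        writer.updateDocument(new Term("id", "42"), doc);

        writer.close();
    }
}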

With lucene, you do not have the flexibility to join two indexes into a single query. In mysql you can do something like this:

SELECT TABLE1.COLA, TABLE2.COLB FROM TABLE1, TABLE2 WHERE MATCH(TABLE1.COL1) AGAINST ('TEXT1') AND MATCH(TABLE2.COL2) AGAINST ('TEXT2') AND TABLE1.ID=TABLE2.FOREIGNKEY

(Please don't mind the syntax, look at the meaning/logic behind it. I am not good at syntaxes. 😀 ) This cannot be done with lucene. You will have to arrange the data in such a way that your index contains the data of both TABLE1 and TABLE2, and then you will have to play with the search to get the data that you need. Too complex, right?

Also, mysql comes with an inbuilt list of stopwords and a default word tokenizer, which separates words based on " ", ",", "." etc. Whereas in the case of lucene, both the list of stop words and the word tokenizer have to be defined by us. This is advantageous, because you can define your own stopwords and tokenize the text as per your requirements.

In case of mysql the searches are by default case insensitive. Whereas in case of lucene, you can make the searches case-sensitive or case-insensitive, the way you want it to be.

With mysql there is a minimum length of word to be indexed, which is 4 by default. So all words which have fewer than 4 characters will not be indexed. What will you do if you want to index words like "php", "asp", "c"? You will have to decrease the minimum length from 4 to 1, and this will increase the index size drastically and slow down all your searches as a consequence. There are no such issues in lucene.

In mysql, every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight, and if a word is rare, it has a higher weight. So if a word is present in 50% of the rows in a table, a query searching for that word will return 0 results. This is what mysql terms relevance, but for me it resulted in incorrect results for a query.

This link http://dev.mysql.com/doc/connector/j/en/fulltext-search.html will give a better idea of mysql fulltext search.

In lucene, there are some advanced options like

  • proximity search – find documents where there is one word between searchword1 and searchword2
  • wildcard search – find documents which have word like searchword* or maybe search?word etc etc…
  • fuzzy/similarity searches – find documents with words sounding similar to roam~ (will look for roam, foam etc…)
  • Term boosting – you can boost a term to move relevant documents to the top. So for example, you can say that you want documents with word “lucene” to be more relevant than those with word “mysql”. Then you can do something like -> lucene^4 mysql .

Sorting of results is extremely fast if you are using lucene. In the case of mysql, if you expect your results to come out fast, you will have to forget sorting; sorting takes a huge amount of time. Even retrieving the total number of matching documents in mysql is a chore, whereas with lucene the total number of matches comes out by default.

To sum up: if you are looking for a fulltext index on a small table without much hassle, go for the mysql fulltext index. Whereas if you are looking at searching a large collection of data, then go for lucene.

Whew… I wrote a lot… And there is more to write… Maybe next time… Do let me know if I have missed anything.