Friday, May 12, 2006

discovering lucene

How i discovered lucene and how did i tune it to use it with our organization... Oops, i may have to leave out specific details about its usage in our organization but i will try to give out a brief general idea.

Well.... i found out lucene thru the mysql website. Was just going thru the forums of mysql full text search engine. At that time mysql 4.0 had just rolled out and we were using the full text search engine of mysql. But it was terribly slow. Since the data size was not "that" large and number of searches could also be numbered down easily, we were able to survive on it. But we were aware that we would be hitting a bottle-neck some time later and well, we had to do something about it. Soooo... i was just browsing the forums and somewhere i found someone mention that mysql full text search is slow and a better alternative to it would be "aspseek" or "lucene". I first tried out aspseek, but it did not allow me to do a boolean query using different fields. Later i tried lucene. It is completely in java, but recently some versions of lucene for c and other languages are coming out... Another thing which is better than lucene is "egothor". But there is not much support/awareness about it. And i have tried but have been unable to use it to perform field based searches.

What i did was build up a huge index on one of the small P-III machines and try out the search. Made a program to fire random queries concurrently to the index and checked the load on the system. It turned out that the searches were extremely fast and the total no of results found was obtained in a flash, but the retrieval of large number of documents after search was a tedious process. Another thing that was found is that with compound file structure (lucene has 2 forms of index structure - compound and simple - will explain in detail later) and increased concurrency in searches, the speed of search would go down and load on the system would shoot up. So we decided on using the simple index structure.

The most important thing we did was that we found out a bug in that version of lucene (at that time 1.3 was the version being used) related to thread concurrency. A class which was synchronized and was not supposed to be. Pointed it out and sent thread stack dumps to Doug Cutting - the creator of lucene and he helped us solve it out. So we recompiled lucene with the patches, modified the data to suit the search we wanted , created our own analyzer (lucene has analyzers - will explain in detail later) and then used lucene. It still does serve our purpose. Though from that time till now, numerous changes have been done to the data and the search to optimize the search.

To begin, lucene is just an API. A set of tools that allows you to create an index and then search it. The index created is an inverted index (you will have to do some googling to find out what inverted index is - if u dont know it). And the index is basically a directory containing files related to the index. When compound index structure is used, the index directory contains a single file. But when simple index structure is used, the index directory contains lots of files (generally 1 file/field being indexed and 7 (i think) more files).

An analyzer is the most important part of the index. It defines how data to be searched is being broken and formatted. You can break data into phrases or words, convert all of them to either lower-case or upper-case (search can also be made case sensitive). For example there is the whitespace-analyzer which breaks the text to be indexed into tokens (or words) separated by white space. There is a StandardAnalyzer which retains only alphanumeric stuff from your text and discards special characters. There is a stop-word analyzer which breaks text on the basis of stop word list provided to it. This seems to be too heavy. In fact this part was the most difficult one when i started out with lucene. It may be difficult to get what you exactly want from your analyzer and like me, you may end up making your own analyzer and define your own stop words.

What are stop words? Oh... i forgot. Stop words or noise words are words which are neither indexed nor searched. Something like "the" can be made a stop word, since it is a common word and is not relevant during search.

I would just put down 2 small and basic programs for indexing and search. Dont copy and paste them, it wont work. I dont believe in spoon feeding.

INDEXING: A program which would index text files in a directory.

/** Declare the main class. Index all text files under a directory. */
public class IndexFiles
{
// Name of the index directory. where index would be stored.
static final File INDEX_DIR = new File("index");

// The main method
public static void main(String[] args)
{
// idxsrc is the directory where files to be indexed is stored
final File docDir = new File("idxsrc");

// Start the indexing process
try
{
//create an object of the indexwriter, use standard analyzer and create new index
IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
System.out.println("Indexing to directory '" +INDEX_DIR+ "'...");
//call a function which does the actual indexing
indexDocs(writer, docDir);
System.out.println("Optimizing...");
//optimize the index - very important - increases speed of search
writer.optimize();
writer.close();

} catch (IOException e) //exception handling
{
e.printStackTrace();
}
}

static void indexDocs(IndexWriter writer, File file)
throws IOException
{
System.out.println("adding " + file);
try
{
Document doc = new Document();
doc.add(new Field("path", file.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("contents", new FileReader(file), Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);
}
catch (FileNotFoundException fnfe)
{
fnfe.printStackTrace();
}
}
}
}
}

Now this is a very crude program to index a directory containing text files. The program creates 2 fields in the index - "path" and "contents". The important thing to note over here is how the index is made up. An index is a collection of documents and each document has n number of fields. For each field, you can decide whether you want to store the data or not while indexing. Also you can decide whether you want to break up the data for that field into tokens for search using the analyzer that you have specified.

SEARCH: Now lets search the index created

// Declare the class
public class SearchFiles
{
/** main method */
public static void main(String[] args) throws Exception
{
String index = "index"; //index to search
String field = "contents"; //default field to search
String queries = null;

//open up the index
Searcher searcher = new IndexSearcher(IndexReader.open(index));
//create an analyzer - to convert both indexed data and search query to same format
Analyzer analyzer = new StandardAnalyzer();

BufferedReader in = null;

while (true)
{
if (queries == null) // prompt the user
System.out.print("Query: ");

String line = in.readLine();

if (line == null || line.length() == -1)
break;

//create a query from the string put by user
Query query = QueryParser.parse(line, field, analyzer);
System.out.println("Searching for: " + query.toString(field));

//hits defines the pointer to the result set obtained after search
Hits hits = searcher.search(query);

System.out.println(hits.length() + " total matching documents");

// get and display first 10 results/documents
final int DISPLAY = 10;
for (int start = 0; start < hits.length(); start++)
{
int end = 0;
Document doc = hits.doc(i);
String path = doc.get("path"); //show the path - which was stored
end++;
if(end>DISPLAY)
break;
}
}
//close the index
reader.close();
}
}


Another crude program for search. For search the field where search is to be performed has to be named. Suppose we had 2 fields - say "contents" and "contents1", then we would be giving a query like this :

contents: phrase1 AND/OR contents1: phrase 2

There is lots that could be done using lucene. This is just the starting point. Wellll, i cud have skipped it and just given the link lucene.apache.org, but blogs dont work that way. So..

Maybe some time later, i will write down something advanced about lucene...

4 comments:

Anonymous said...

Excellent! Thanks for sharing knowledge. Lot 2 learn from u...

Anonymous said...

how does this compare to mysql's full-text search.

(I saw this post on planet-mysql .. thats all)

gamegeek said...

Lucene is extremely fast as compared to mysql full text search engine. Both in indexing and search.

In mysql, as the index size exceeds the RAM/System cache, the speed of search reduces drastically. This is not the case with lucene.

Also lucene is much more flexible and has lots of features as compared to mysql full text search.

Will write a blog comparing both soon...

Anonymous said...

Thanks Jayant!
your mysql vs lucene blog post was worth reading

regards
Ian