Whatever....: lucene in php & lucene in java

Thursday, June 07, 2007

lucene in php & lucene in java

Something I found out while helping Mr. Nguyen from Vietnam with an issue.

He was using lucene-php (Zend_Search_Lucene in Zend Framework) to build a Lucene index and search it, and was running into slow search times. It turned out that a MySQL full-text index was performing better than the Lucene index.

So I ran a quick benchmark and found the following:

1. Indexing with php-lucene takes a huge amount of time compared to java-lucene. I indexed 30000 records: indexing took 1673 seconds and optimization took 210 seconds, so total index creation time was 1883 seconds (about 31 minutes). That is a hell of a lot of time.

2. An index created with php-lucene is compatible with java-lucene, so an index created by php-lucene can be read by java-lucene and vice versa.

3. Search in php-lucene is very slow compared to java-lucene. The times for 100 searches are:

jayant@jayantbox:~/myprogs/java$ java searcher
Total : 30000 docs
t2-t1 : 231 milliseconds

jayant@jayantbox:~/myprogs/php$ php -q searcher.php
Total 30000 docs
total time : 15 seconds

So I thought that maybe PHP was retrieving the documents up front, and changed the code to retrieve all the documents in both PHP and Java. Still, the times for 100 searches were:

jayant@jayantbox:~/myprogs/java$ java searcher
Total : 30000 docs
t2-t1 : 2128 milliseconds

jayant@jayantbox:~/myprogs/php$ php -q searcher.php
Total 30000 docs
total time : 63 seconds
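To put the numbers above side by side, here is a small sketch that turns them into ratios. The class and method names are made up for illustration; only the timings come from the runs above. Note that the PHP timings are whole seconds, so the ratios are approximate.

```java
public class BenchmarkMath {

    // How many times slower PHP was than Java; both times in milliseconds.
    static double slowdown(long phpMillis, long javaMillis) {
        return (double) phpMillis / javaMillis;
    }

    // Average time per record in milliseconds, given a total in seconds.
    static double millisPerRecord(long totalSeconds, int records) {
        return totalSeconds * 1000.0 / records;
    }

    public static void main(String[] args) {
        // Search only: PHP 15 s vs Java 231 ms for 100 searches.
        System.out.println("search only       : " + slowdown(15_000, 231) + "x slower in PHP");
        // Search plus retrieving every document: PHP 63 s vs Java 2128 ms.
        System.out.println("search + retrieve : " + slowdown(63_000, 2128) + "x slower in PHP");
        // php-lucene index creation: 1883 s in total for 30000 records.
        System.out.println("php indexing      : " + millisPerRecord(1883, 30000) + " ms per record");
    }
}
```

That works out to roughly 65x slower for search alone, roughly 30x slower when every document is retrieved, and roughly 63 ms per record for php-lucene index creation.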

The PHP code for searching the Lucene index is:

<?php
/*
 *      searcher.php
 *      On 2007-06-06
 *      By jayant
 *
 */

require_once "Zend/Search/Lucene.php";

// Open the existing index
$index = new Zend_Search_Lucene("/tmp/myindex");
echo "Total ".$index->numDocs()." docs\n";
$query = "java";

// time() has 1-second resolution, which is enough for totals this large
$s = time();
for ($i = 0; $i < 100; $i++)
{
    $hits = $index->find($query);
    // retrieve all documents. Comment out this loop if you don't want to retrieve documents
    foreach ($hits as $hit)
    {
        $doc = $hit->getDocument();
    }
}
$total = time() - $s;
echo "total time : $total s\n";
?>

And the Java code for searching the Lucene index is:


/*
 *      searcher.java
 *      On 2007-06-06
 *      By jayant
 *
 */

import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;

public class searcher {

 public static void main (String args[]) throws Exception
 {
  // Open the existing index
  IndexSearcher s = new IndexSearcher("/tmp/myindex");
  // maxDoc() also counts deleted documents, if the index has any
  System.out.println("Total : "+s.maxDoc()+" docs");
  // Parse the query once, outside the timed loop
  QueryParser q = new QueryParser("content",new StandardAnalyzer());
  Query qry = q.parse("java");

  long t1 = System.currentTimeMillis();
  for(int x=0; x< 100; x++)
  {
   Hits h = s.search(qry);
   // retrieve all documents. Comment out this loop if you don't want to retrieve documents
   for(int y=0; y< h.length(); y++)
   {
    Document d = h.doc(y);
   }
  }
  long t2 = System.currentTimeMillis();
  System.out.println("t2-t1 : "+(t2-t1)+" ms");
  s.close();
 }
}
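Both programs report a single total for 100 searches. If you also want a per-search average, the timing can be pulled into a small helper. This is only a sketch: the class name, the method name and the stand-in workload are made up for illustration, and the real search call would go where the placeholder is.

```java
public class TimingSketch {

    // Runs the task `runs` times and returns the total elapsed time in milliseconds.
    static double totalMillis(Runnable task, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Placeholder workload (hypothetical); the search call would replace it.
        Runnable task = () -> Math.sqrt(12345.678);
        int runs = 100;
        double total = totalMillis(task, runs);
        System.out.println("total : " + total + " ms");
        System.out.println("average : " + (total / runs) + " ms per run");
    }
}
```

System.nanoTime() is used here instead of System.currentTimeMillis() because it is meant for measuring elapsed time; for totals in the hundreds of milliseconds either one is fine.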

Hope I haven't missed anything here.

1 comment:

Unknown said...: Good to know.. I am looking for Lucene to use in my application... can you point me where I can find good lucene Java indexing details and how can I use it..
thanks in advance..

Jay; 11/19/2007 10:10 AM