Friday, July 06, 2007

Document scoring in lucene - Part 2

This is a follow-up to the previous post on document scoring/calculating relevance in Lucene. If you find that post inadequate, you can also refer to the formula documented under Default similarity formula.

What I did was create some text files and some code to index them, and then tried to find out in practice how Lucene calculates relevance using the DefaultSimilarity class. Here are the file and the source code.

file: file_a.txt
jayant project manager project leader team leader java linux c c++ lucene apache solaris aix unix minix gnome kde ubuntu redhat fedora rpm deb media player vlc evolution exchange microsoft java vb vc vc++ php mysql java

source code:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;

public class FIndexer
{
    public static void main (String[] args) throws Exception
    {
        String src_path = args[0]; // directory containing the text files
        String des_path = args[1]; // directory where the index is written

        File f_src = new File(src_path);
        File f_des = new File(des_path);

        if(!f_src.isDirectory() || !f_des.isDirectory())
        {
            System.out.println("Error : "+f_src+" || "+f_des+" is not directory");
            return;
        }

        // true = create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(f_des,new WhitespaceAnalyzer(), true);

        File[] files = f_src.listFiles();
        for(int x=0; x < files.length; x++)
        {
            Document doc = new Document();
            BufferedReader br = new BufferedReader(new FileReader(files[x]));
            StringBuffer content = new StringBuffer();
            String line = null;
            // read the whole file into memory, one line at a time
            while( (line = br.readLine()) != null)
            {
                content.append(line).append("\n");
            }
            br.close();

            // the file name is stored but not indexed; the same text is indexed in two fields
            Field f1 = new Field("name",files[x].getName(), Field.Store.YES, Field.Index.NO);
            Field f2 = new Field("content", content.toString(), Field.Store.NO, Field.Index.TOKENIZED);
            Field f3 = new Field("content2", content.toString(), Field.Store.NO, Field.Index.TOKENIZED);
            doc.add(f1);
            doc.add(f2);
            doc.add(f3);
            writer.addDocument(doc);
        }
        writer.close();
    }
}

I created copies of file_a.txt as file_b.txt and file_c.txt, and then edited file_c.txt, so that file_a.txt and file_b.txt have the same content (the word java occurs 3 times in both files), while in file_c.txt the word java occurs only 2 times.

I created an index using the code above and used Luke to fire queries on the index.

First I searched for content:java. As expected, I got 3 results, with file_a.txt and file_b.txt scoring above file_c.txt.


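Let's take the score for file_b.txt and see how it is calculated. The score is a product of
tf = 1.7321 (java occurs 3 times; tf is the square root of the frequency)
idf = 0.7123 (document freq = 3)
fieldNorm = 0.1562 (the same field norm that appears in the breakdown further down)

And the score of file_c.txt is a product of
tf = 1.4142 (java occurs 2 times)
idf = 0.7123 (document freq = 3)
fieldNorm = 0.1562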

Here the score is equal to the fieldWeight, since the query has just one term and its queryWeight normalizes to 1. If more than one field is used in the query, each clause contributes the product of its queryWeight and fieldWeight, and the score is the sum of these contributions.
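The single-term numbers above can be reproduced with a few lines of plain Java. This is only a sketch of the DefaultSimilarity factors (tf as the square root of the frequency, idf as 1 + ln(numDocs / (docFreq + 1))), not Lucene's own code. The class and method names are made up for illustration, and the fieldNorm of 0.15625 is assumed to be the value that Luke displays as 0.1562.

```java
// Plain-Java sketch of the scoring factors, with no Lucene dependency.
// FieldWeightSketch and its method names are hypothetical, for illustration only.
class FieldWeightSketch {

    // tf factor: square root of the number of times the term occurs in the field
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf factor: 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // fieldWeight = tf * idf * fieldNorm
    static double fieldWeight(int freq, int docFreq, int numDocs, double fieldNorm) {
        return tf(freq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        double fieldNorm = 0.15625; // assumption: the value Luke shows as 0.1562
        int numDocs = 3;            // file_a.txt, file_b.txt, file_c.txt
        int docFreq = 3;            // java occurs in all three files

        System.out.printf("tf(3) = %.4f%n", tf(3));
        System.out.printf("tf(2) = %.4f%n", tf(2));
        System.out.printf("idf   = %.4f%n", idf(docFreq, numDocs));
        System.out.printf("fieldWeight (java x3) = %.4f%n", fieldWeight(3, docFreq, numDocs, fieldNorm));
        System.out.printf("fieldWeight (java x2) = %.4f%n", fieldWeight(2, docFreq, numDocs, fieldNorm));
    }
}
```

With these inputs the tf, idf and fieldWeight values for file_b.txt should come out as 1.7321, 0.7123 and 0.1928, matching what Luke reports.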

So I changed the query to content:java^5 content2:java^2. Here I am boosting java in content by 5 and java in content2 by 2. That is, java in content is 2.5 times more important than java in content2. Let's check the scores.


Let's look at how the score was calculated.

Again for file_b.txt
0.2506 = 0.1790 (weight of content:java^5) + 0.0716 (weight of content2:java^2)

Out of which the weight of content:java^5 is
0.1790 = 0.9285 (query weight) [ 5 (boost) * 0.7123 (idf docFreq=3) * 0.2607 (queryNorm) ]
* 0.1928 (field weight) [ 1.7321 (tf, freq=3) * 0.7123 (idf docFreq=3) * 0.1562 (fieldNorm) ]

And the weight of content2:java^2 is
0.0716 = 0.3714 (query weight) [ 2 (boost) * 0.7123 (idf docFreq=3) * 0.2607 (queryNorm) ]
* 0.1928 (field weight) [ 1.7321 (tf, freq=3) * 0.7123 (idf docFreq=3) * 0.1562 (fieldNorm) ]

The same formula is used for calculating the score of file_c.txt, except that the tf factor is 1.4142 (the frequency of java in content and in content2 is 2).
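The boosted two-field calculation can be checked the same way. The sketch below is plain Java, not Lucene code, and its class and method names are hypothetical. It assumes that queryNorm is 1 / sqrt(sum of (boost * idf)^2 over the clauses), that both clauses match (so the coord factor is 1), that content and content2 hold identical text, and that the fieldNorm is 0.15625 (displayed by Luke as 0.1562).

```java
// Plain-Java sketch of the boosted two-clause score, with no Lucene dependency.
// BoostedQuerySketch and its method names are hypothetical, for illustration only.
class BoostedQuerySketch {

    // idf factor: 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // queryNorm = 1 / sqrt(sum over clauses of (boost * idf)^2)
    static double queryNorm(double[] boosts, double idf) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            double w = boosts[i] * idf;
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Score of one document: sum over clauses of queryWeight * fieldWeight.
    // Assumes every clause matches with the same freq, docFreq and fieldNorm.
    static double score(double[] boosts, int freq, int docFreq, int numDocs, double fieldNorm) {
        double idf = idf(docFreq, numDocs);
        double norm = queryNorm(boosts, idf);
        double fieldWeight = Math.sqrt(freq) * idf * fieldNorm;
        double total = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            double queryWeight = boosts[i] * idf * norm;
            total += queryWeight * fieldWeight;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] boosts = {5.0, 2.0}; // content:java^5 content2:java^2
        double fieldNorm = 0.15625;   // assumption: the value Luke shows as 0.1562

        System.out.printf("queryNorm         = %.4f%n", queryNorm(boosts, idf(3, 3)));
        System.out.printf("score, file_b.txt = %.4f%n", score(boosts, 3, 3, 3, fieldNorm));
        System.out.printf("score, file_c.txt = %.4f%n", score(boosts, 2, 3, 3, fieldNorm));
    }
}
```

For file_b.txt this should give a queryNorm of 0.2607 and a score of 0.2506, the same figures as in the breakdown above.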

This explains a little about scoring in Lucene. The scoring can be altered either by changing the DefaultSimilarity class itself or by extending DefaultSimilarity and overriding certain factors/calculations in it. More on how to change the scoring formula later.
