Different Lucene search results when using different search space sizes

Question

I have an application that uses Lucene for searching. The search space is in the thousands of documents. Searching across these thousands, I get only a few results, around 20 (this is normal and expected).

However, when I reduce my search space to just those 20 entries (i.e. I index only those 20 entries and ignore everything else, so that development is easier), I get the same 20 results, but in a different order (and with different scores).

I tried disabling the norm factors using Field#setOmitNorms(true), but I still get different results.

What could be causing the difference in the score calculation?

Thanks

+2

java search size lucene scoring

Franz see Nov 16 '09 at 13:13


2 answers

Scoring depends on all the documents in the index:

In general, the idea behind the Vector Space Model (VSM) is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.

Source: Apache Lucene - Scoring

+2

sfussenegger Nov 16 '09 at 13:34


Yuval F · Accepted Answer · 2009-11-16T13:37:14+0000

See the scoring documentation in the Lucene Similarity API. My bet is on the difference in idf between the two cases (idf depends on both numDocs and docFreq). To know exactly, use the explain() function to debug the scores.
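To make the idf effect concrete, here is a minimal sketch. It assumes the classic DefaultSimilarity formula, idf = 1 + ln(numDocs / (docFreq + 1)), and reimplements it in plain Java rather than calling Lucene. The class name IdfDemo and the document counts (5000 and 20) are hypothetical, chosen only to mirror the two index sizes in the question.

```java
class IdfDemo {
    // Assumed formula (classic Lucene DefaultSimilarity):
    // idf = 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical counts: the same term matches 20 documents in both indexes.
        double largeIndex = idf(20, 5000); // full index with thousands of documents
        double smallIndex = idf(20, 20);   // reduced index holding only the 20 hits

        System.out.println("idf in large index: " + largeIndex);
        System.out.println("idf in small index: " + smallIndex);
    }
}
```

Under these assumptions the same term gets an idf of roughly 6.5 in the large index and just under 1 in the small one. In a multi-term query each term has its own docFreq, so each term's idf shifts by a different amount, which can change the ranking and not only the absolute scores.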

Edit: code snippet for explain():

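// Run the query, then log Lucene's score breakdown for every hit.
TopDocs hits = searcher.search(query, searchFilter, max);
ScoreDoc[] scoreDocs = hits.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
  // explain() lists the factors (tf, idf, norms, ...) behind this document's score
  String explanation = searcher.explain(query, scoreDoc.doc).toString();
  Log.debug(explanation);
}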
