Different Lucene search results with different search space sizes

I have an application that uses Lucene for search. The search space is in the thousands of documents. Searching across these thousands, I get only a few results, about 20 (this is normal and expected).

However, when I reduce my search space to just these 20 entries (i.e. I index only these 20 entries and ignore everything else, to make development easier), I get the same 20 results, but in a different order (and with different scores).

I tried disabling norms using Field#setOmitNorms(true), but I still get different results.

What could be causing the difference in the score calculation?

Thanks.

2 answers

See the scoring documentation in the Lucene Similarity API. My bet is on the difference in idf between the two cases (both numDocs and docFreq change when the index shrinks). To know for sure, use the explain() function to debug the results.
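To make the idf point concrete, here is a minimal sketch, assuming the classic formula idf = 1 + ln(numDocs / (docFreq + 1)) used by DefaultSimilarity in older Lucene releases (check which Similarity your version actually uses). The class name and the index sizes (5000 vs. 20 documents) are made up for illustration, and no Lucene dependency is needed:

```java
class IdfComparison {
    // Classic Lucene-style idf. This is an assumption: your Similarity may differ.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical term that occurs in the same 20 documents in both indexes.
        int docFreq = 20;
        System.out.println("full index (5000 docs): " + idf(docFreq, 5000));
        System.out.println("reduced index (20 docs): " + idf(docFreq, 20));
    }
}
```

The term's idf is much larger in the full index than in the 20-document index, so the same term contributes a different weight to every score, with or without norms.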

Edit: code snippet for explain():

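// Run the search, then log Lucene's score breakdown for each hit
TopDocs hits = searcher.search(query, searchFilter, max);
ScoreDoc[] scoreDocs = hits.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
  // explain() describes how the score for this document was computed
  String explanation = searcher.explain(query, scoreDoc.doc).toString();
  Log.debug(explanation);
}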

The score depends on all the documents in the index:

In general, the idea behind the Vector Space Model (VSM) is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.

Source: Apache Lucene - Scoring
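As a rough illustration of why the order itself can change, here is a toy sketch, assuming a simplified classic score of sqrt(tf) * idf^2 summed over the query terms (norms, coord, queryNorm and boosts are ignored). The two documents, the terms "rare" and "common", and all the counts are hypothetical:

```java
class RankingFlip {
    // Classic Lucene-style idf. This is an assumption: your Similarity may differ.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Simplified contribution of one query term: sqrt(tf) * idf^2.
    static double termScore(int tf, int docFreq, int numDocs) {
        double weight = idf(docFreq, numDocs);
        return Math.sqrt(tf) * weight * weight;
    }

    // Simplified score of one document for the two-term query "rare common".
    static double score(int tfRare, int tfCommon, int dfRare, int dfCommon, int numDocs) {
        return termScore(tfRare, dfRare, numDocs) + termScore(tfCommon, dfCommon, numDocs);
    }

    public static void main(String[] args) {
        // Doc X contains "rare" 4 times and "common" 0 times.
        // Doc Y contains "rare" 1 time and "common" 4 times.
        // Full index: 5000 docs, "rare" in 20 of them, "common" in 2000.
        System.out.println("full index    X=" + score(4, 0, 20, 2000, 5000)
                + " Y=" + score(1, 4, 20, 2000, 5000));
        // Reduced index: only the 20 hits, "rare" in all 20, "common" in 5.
        System.out.println("reduced index X=" + score(4, 0, 20, 5, 20)
                + " Y=" + score(1, 4, 20, 5, 20));
    }
}
```

With these made-up numbers, X outranks Y in the full index and Y outranks X in the 20-document index, even though neither document changed. Only the collection statistics did.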


Source: https://habr.com/ru/post/1736126/

