Lucene custom scoring for numeric fields

Question

In addition to the standard term search with tf-idf similarity over a text content field, I would like to have scoring based on the "similarity" of numeric fields. This similarity would depend on the distance between the value in the query and the value in the document (for example, a Gaussian with m = [user input], s = 0.5).

I.e., let's say documents represent people, and a person document has two fields:

description (full text)
age (number).

I want to search for documents with a query like

Description: (xyz) Age: 30

but with age acting not as a filter but as part of the score (for a person aged 30 the multiplier would be 1.0, for a 25-year-old 0.8, etc.).
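The multiplier described above can be sketched in plain Java, independent of Lucene. This is a minimal sketch assuming the standard Gaussian form exp(-(age - peak)^2 / (2 * sigma^2)); the class and method names are hypothetical, and the 30 / 25 / 0.8 figures are just the illustrative numbers from the question. It also shows how to pick sigma so that a chosen distance from the peak yields a chosen multiplier.

```java
final class GaussianAgeWeight {

    /** Gaussian-shaped multiplier in (0, 1]; equals 1.0 when age == peak. */
    public static double weight(double age, double peak, double sigma) {
        double d = age - peak;
        return Math.exp(-(d * d) / (2.0 * sigma * sigma));
    }

    /** Sigma for which weight(...) equals `target` at `distance` from the peak. */
    public static double sigmaFor(double distance, double target) {
        // Solve exp(-distance^2 / (2 * sigma^2)) = target for sigma.
        return distance / Math.sqrt(-2.0 * Math.log(target));
    }

    public static void main(String[] args) {
        // Illustrative numbers from the question: peak at 30, multiplier 0.8 at age 25.
        double sigma = sigmaFor(5.0, 0.8);
        System.out.println("sigma      = " + sigma);
        System.out.println("weight(30) = " + weight(30, 30, sigma));
        System.out.println("weight(25) = " + weight(25, 30, sigma));
    }
}
```

With these numbers sigma comes out at roughly 7.5 years, so a multiplier of 0.8 five years away from the peak implies a much wider curve than s = 0.5.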

Could this be achieved in a reasonable way?

EDIT: I finally found out that this can be done by wrapping a ValueSourceQuery and a TermQuery in a CustomScoreQuery. See my solution below.

EDIT 2: Since Lucene versions change rapidly, I just want to add that this was tested on Lucene 3.0 (Java).

+6

lucene tf-idf scoring

jakub.g May 08 '11 at 12:41

2 answers

This can be achieved with Solr FunctionQuery.

+1

bajafresh4life May 08 '11 at 18:47

jakub.g · Accepted Answer · 2011-05-09T15:22:21+0000

OK, so here is a (somewhat verbose) proof of concept as a complete JUnit test. I haven't tested its performance on a large index yet, but from what I have read, after a warm-up it should perform well, provided there is enough RAM to cache the numeric fields.

package tests;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.IntFieldSource;
import org.apache.lucene.search.function.ValueSourceQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import junit.framework.TestCase;

public class AgeAndContentScoreQueryTest extends TestCase {

    public class AgeAndContentScoreQuery extends CustomScoreQuery {

        protected float peakX;
        protected float sigma;

        public AgeAndContentScoreQuery(Query subQuery, ValueSourceQuery valSrcQuery, float peakX, float sigma) {
            super(subQuery, valSrcQuery);
            this.setStrict(true); // do not normalize score values from ValueSourceQuery!
            this.peakX = peakX;   // age for which the age relevance is best
            this.sigma = sigma;
        }

        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
            // subQueryScore is the tf-idf score from the content query
            float contentScore = subQueryScore;

            // valSrcScore is the value of the date-of-birth field, represented as a float;
            // convert it to an age, then to a Gaussian-like age relevance score
            float x = (2011 - valSrcScore); // age
            // the whole denominator 2 * sigma^2 must be parenthesized
            float ageScore = (float) Math.exp(-Math.pow(x - peakX, 2) / (2 * sigma * sigma));

            float finalScore = ageScore * contentScore;

            System.out.println("#contentScore: " + contentScore);
            System.out.println("#ageValue:     " + (int) valSrcScore);
            System.out.println("#ageScore:     " + ageScore);
            System.out.println("#finalScore:   " + finalScore);
            System.out.println("+++++++++++++++++");

            return finalScore;
        }
    }

    protected Directory directory;
    protected Analyzer analyzer = new WhitespaceAnalyzer();
    protected String fieldNameContent = "content";
    protected String fieldNameDOB = "dob";

    protected void setUp() throws Exception {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();

        // indexed documents
        String[] contents = {"foo baz1", "foo baz2 baz3", "baz4"};
        int[] dobs = {1991, 1981, 1987}; // date of birth

        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++) {
            Document doc = new Document();
            doc.add(new Field(fieldNameContent, contents[i], Field.Store.YES, Field.Index.ANALYZED)); // store & index
            doc.add(new NumericField(fieldNameDOB, Field.Store.YES, true).setIntValue(dobs[i]));      // store & index
            writer.addDocument(doc);
        }
        writer.close();
    }

    public void testSearch() throws Exception {
        String inputTextQuery = "foo bar";
        float peak = 27.0f;
        // standard deviation in years; a value this small makes the age score drop to ~0
        // for any age other than the peak, so raise it for a softer falloff
        float sigma = 0.1f;

        QueryParser parser = new QueryParser(Version.LUCENE_30, fieldNameContent, analyzer);
        Query contentQuery = parser.parse(inputTextQuery);

        ValueSourceQuery dobQuery = new ValueSourceQuery(new IntFieldSource(fieldNameDOB));
        // or: FieldScoreQuery dobQuery = new FieldScoreQuery(fieldNameDOB, Type.INT);

        CustomScoreQuery finalQuery = new AgeAndContentScoreQuery(contentQuery, dobQuery, peak, sigma);

        IndexSearcher searcher = new IndexSearcher(directory);
        TopDocs docs = searcher.search(finalQuery, 10);

        System.out.println("\nDocuments found:\n");
        for (ScoreDoc match : docs.scoreDocs) {
            Document d = searcher.doc(match.doc);
            System.out.println("CONTENT: " + d.get(fieldNameContent));
            System.out.println("DOB:     " + d.get(fieldNameDOB));
            System.out.println("SCORE:   " + match.score);
            System.out.println("-----------------");
        }
    }
}
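To see how the age factor reshapes a ranking without running Lucene, the arithmetic performed in customScore can be sketched in plain Java. This uses the three birth years and the reference year 2011 from the test above. The standard Gaussian form, sigma = 5 years, the content scores, and the class and method names are all assumptions chosen for illustration, not values from the answer or output of a real search.

```java
final class CombinedScoreSketch {

    /** Gaussian age relevance in (0, 1], peaking when (referenceYear - birthYear) == peakAge. */
    public static double ageScore(int birthYear, int referenceYear, double peakAge, double sigma) {
        double age = referenceYear - birthYear;
        double d = age - peakAge;
        return Math.exp(-(d * d) / (2.0 * sigma * sigma));
    }

    /** Final score: the content score scaled by the age relevance. */
    public static double combined(double contentScore, double ageScore) {
        return contentScore * ageScore;
    }

    public static void main(String[] args) {
        int[] dobs = {1991, 1981, 1987};          // birth years from the test data
        double[] contentScores = {0.6, 0.5, 0.4}; // hypothetical tf-idf scores
        double peakAge = 27.0;
        double sigma = 5.0;                       // assumed spread in years

        for (int i = 0; i < dobs.length; i++) {
            double a = ageScore(dobs[i], 2011, peakAge, sigma);
            System.out.println("dob=" + dobs[i]
                    + " ageScore=" + a
                    + " finalScore=" + combined(contentScores[i], a));
        }
    }
}
```

With a peak age of 27, the people born in 1981 and 1987 are both three years from the peak and get the same age score, while the one born in 1991 is seven years away and is penalized more heavily, whatever the content score is.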
