Lucene: Similarity class ... how to define several similarity measures?

Question

For my experiment, I need to define a specific similarity measure for each field of the documents in my collection.

For example, I need to measure the similarity of the Description field with tf.idf and of the Geolocation field with the Haversine distance, etc.

I am now studying the Similarity class. Is there a good tutorial or example on this that would speed up the process?

thanks

+4

java lucene

aneuryzm Feb 25 '11 at 21:19

1 answer

Yuval F · Accepted Answer · 2011-02-27T08:42:37+0000

EDIT: IIUC, you have a similarity formula for each field, and you want to apply it to each document, comparing it against all other documents. There are several options, all of them applied at indexing time:

Extend the DefaultSimilarity class.
Extend the SimilarityDelegator class if you only need to change part of the methods.

With either approach, you can use payloads to store per-term information (this can be useful for lat-long data).
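For the lat-long case, here is a rough sketch of what the stored bytes and the distance computation could look like. It is plain Java and does not touch the Lucene API; `GeoPayload`, the 16-byte layout, and the `1 / (1 + d)` mapping from distance to similarity are all my own assumptions for illustration, not anything Lucene prescribes.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not part of Lucene): packs a lat/long pair into bytes and compares two points. */
final class GeoPayload {
    static final double EARTH_RADIUS_KM = 6371.0088; // mean Earth radius

    /** Encode latitude and longitude (in degrees) as 16 bytes, usable as an opaque payload. */
    static byte[] encode(double latDeg, double lonDeg) {
        return ByteBuffer.allocate(16).putDouble(latDeg).putDouble(lonDeg).array();
    }

    /** Decode the 16-byte form back into {lat, lon} in degrees. */
    static double[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new double[] { buf.getDouble(), buf.getDouble() };
    }

    /** Great-circle distance in kilometres between two points, using the haversine formula. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Map distance to a similarity in (0, 1]; the 1/(1+d) form is an arbitrary illustrative choice. */
    static double similarity(byte[] p, byte[] q) {
        double[] a = decode(p);
        double[] b = decode(q);
        return 1.0 / (1.0 + haversineKm(a[0], a[1], b[0], b[1]));
    }
}
```

How the bytes get attached to a term and read back at scoring time depends on your Lucene version, so treat this only as the arithmetic part of the problem.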

After implementing a Similarity class using one of these approaches, call Similarity.setDefault(mySimilarity) to set it as the Similarity instance used for both indexing and searching.

Only then index your text corpus, which you can search later. You may have to extend the Searcher class to get at the raw similarity scores.

Having said that, I believe this approach is not right for your use case. Lucene is optimized to retrieve a few similar documents, not to compute a score for every document, so I predict that the run time will be prohibitive. I hope I'm wrong, but nonetheless I suggest you read "Mining of Massive Datasets" for a better approach: min-hashing and shingling.
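To give a feel for that approach, here is a minimal sketch of character shingling plus min-hashing in plain Java. The class name `MinHashSketch`, the shingle length, the number of hash functions, and the `(a*x + b) mod p` hash family are my own illustrative choices, not taken from the book or from Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Hypothetical sketch: estimates Jaccard similarity of two texts from short signatures. */
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a;
    private final long[] b;

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // non-zero multiplier
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    /** All distinct substrings of length k (character shingles). */
    static Set<String> shingles(String text, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= text.length(); i++) {
            out.add(text.substring(i, i + k));
        }
        return out;
    }

    /** For each hash function, keep the minimum hash value over all shingles. */
    long[] signature(Set<String> shingles) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String s : shingles) {
            long x = s.hashCode() & 0x7fffffffL; // non-negative 31-bit value
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME; // product stays below 2^63
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    /** Fraction of signature positions that agree; this estimates the Jaccard similarity. */
    static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) same++;
        }
        return (double) same / s1.length;
    }

    /** Exact Jaccard similarity, for comparison on small inputs. */
    static double jaccard(Set<String> x, Set<String> y) {
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }
}
```

The point is that each document is reduced once to a fixed-size signature, and pairs are then compared on signatures only, which is much cheaper than scoring every document against every other one. The estimate gets closer to the exact Jaccard value as the number of hash functions grows.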

Good luck.

~~Patrick, I will first quote Grant Ingersoll about changing the Similarity class: "There will be Dragons."~~ ~~Customizing the Lucene Similarity class is complicated.~~ ~~I have done it.~~ ~~It is not fun.~~ ~~Only do this if you absolutely need to.~~

I suggest you first read Grant's spatial search paper, his findability paper, and his “Debug Relevance” article. They show other ways of getting the hits you need.
