Lucene: Similarity class ... how to define several similarity measures?

For my experiment, I need to define a specific similarity measure for each field of the documents in my collection.

For example, I need to score the Description field with tf-idf, the Geolocation field with the Haversine distance, and so on.

I am currently studying the Similarity class. Is there a good tutorial or example that would speed up the process?

thanks

1 answer

EDIT: IIUC, you have a similarity formula for each field, and you want to apply it to each document against all other documents. There are several options, all applied at indexing time:

In both methods, you can use payloads to store term-level information (this can be useful for lat-long data).
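For the lat-long case, the distance itself is easy to compute outside of Lucene once both coordinate pairs are available. Below is a minimal sketch of the Haversine formula in plain Java. The class name, the method name, and the Earth-radius constant are my own choices for illustration and are not part of any Lucene API.

```java
// Hypothetical helper, not a Lucene class: great-circle distance via the Haversine formula.
final class Haversine {

    // Mean Earth radius in kilometres (an assumed constant; pick the one your data calls for).
    static final double EARTH_RADIUS_KM = 6371.0088;

    // Returns the great-circle distance in kilometres between two points given in degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);

        double sinHalfPhi = Math.sin(dPhi / 2);
        double sinHalfLambda = Math.sin(dLambda / 2);

        // Haversine of the central angle; clamped to 1.0 to guard against rounding error.
        double a = sinHalfPhi * sinHalfPhi
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfLambda * sinHalfLambda;
        a = Math.min(1.0, a);

        double centralAngle = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_KM * centralAngle;
    }

    public static void main(String[] args) {
        // Example coordinates are arbitrary illustration values.
        System.out.println(distanceKm(52.52, 13.405, 48.8566, 2.3522) + " km");
    }
}
```

A distance is not a similarity, so it would still need to be mapped to a score (for example, something that decreases as the distance grows) before being combined with a text score. How to do that mapping depends on your experiment.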

After implementing a Similarity class using one of these methods, call Similarity.setDefault(mySimilarity) to set it as the similarity instance for both indexing and searching.

Only then index your text corpus, which you can search afterwards. You may have to extend the Searcher class to get the raw similarity scores.

Having said that, I believe this approach is not right for your use case. Lucene is optimized for retrieving a few similar documents, not for computing a score for every document, so I predict that the run time will be prohibitive. I hope I'm wrong, but nonetheless I suggest you read "Mining of Massive Datasets" for a better approach: min-hashing and shingling.
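To give a rough idea of what min-hashing and shingling look like, here is a minimal sketch in plain Java with no Lucene dependency. It breaks each text into character shingles, builds a fixed-length MinHash signature, and estimates the Jaccard similarity as the fraction of matching signature positions. The class name, the shingle length, the number of hash functions, and the simple multiply-add hash family are all assumptions made for illustration; a real experiment would tune these and likely add locality-sensitive hashing on top.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration class: character shingling plus MinHash signatures.
final class MinHashSketch {

    // Character k-shingles of the normalized text (lowercased, whitespace collapsed).
    static Set<String> shingles(String text, int k) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            out.add(s.substring(i, i + k));
        }
        return out;
    }

    // One minimum per hash function; the same seed must be used for all documents.
    static int[] signature(Set<String> shingles, int numHashes, long seed) {
        Random rnd = new Random(seed);
        int[] sig = new int[numHashes];
        for (int h = 0; h < numHashes; h++) {
            int a = rnd.nextInt() | 1; // odd multiplier, so x -> a*x+b is a bijection on 32-bit ints
            int b = rnd.nextInt();
            int min = Integer.MAX_VALUE;
            for (String sh : shingles) {
                int v = a * sh.hashCode() + b; // integer overflow wraps, which is intended here
                v ^= v >>> 16;                 // extra bit mixing (also a bijection)
                if (v < min) {
                    min = v;
                }
            }
            sig[h] = min;
        }
        return sig;
    }

    // Fraction of positions where the two signatures agree: an estimate of Jaccard similarity.
    static double estimateJaccard(int[] s1, int[] s2) {
        if (s1.length != s2.length || s1.length == 0) {
            throw new IllegalArgumentException("signatures must be non-empty and of equal length");
        }
        int equal = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                equal++;
            }
        }
        return (double) equal / s1.length;
    }

    public static void main(String[] args) {
        int[] s1 = signature(shingles("the quick brown fox jumps over the lazy dog", 4), 128, 42L);
        int[] s2 = signature(shingles("the quick brown fox jumped over a lazy dog", 4), 128, 42L);
        System.out.println("estimated Jaccard: " + estimateJaccard(s1, s2));
    }
}
```

The point of the signatures is that they are small and of fixed size, so comparing every document with every other one becomes much cheaper than running a full query per document.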

Good luck.

Patrick, I will first quote Grant Ingersoll on changing the Similarity class: "There will be Dragons." Customizing Lucene's Similarity class is complicated. I have done it. It is not fun. Only do it if you absolutely need to.

I suggest you first read Grant's spatial search paper, his findability paper, and his "Debug Relevance" article. They show other ways of getting the hits you need.


Source: https://habr.com/ru/post/1341368/

