How to determine if there is a similar document stored in the Lucene index

I need to exclude duplicates from my database. The problem is that the duplicates are not exact matches but merely similar documents. For this, I decided to use FuzzyQuery as follows:

var fuzzyQuery = new global::Lucene.Net.Search.FuzzyQuery(
                     new Term("text", queryText),
                     0.8f,   // minimum similarity
                     0);     // prefix length
hits = _searcher.Search(fuzzyQuery);

The idea was to set a minimum similarity of 0.8 (which, in my opinion, is quite high), so that only sufficiently similar documents are found.
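To make the 0.8 threshold concrete, here is a minimal, library-free sketch of an edit-distance similarity between two strings. It assumes the score is 1 minus the Levenshtein distance divided by the length of the shorter string, which is how the classic FuzzyQuery scoring is usually described; this is an assumption and has not been checked against the Lucene.Net version used here. The class and method names are hypothetical.

```csharp
using System;

public static class FuzzySimilaritySketch
{
    // Levenshtein distance: insertions, deletions and substitutions each cost 1.
    public static int EditDistance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Assumed scoring: 1 - distance / length of the shorter string.
    public static float Similarity(string a, string b)
    {
        int shorter = Math.Min(a.Length, b.Length);
        if (shorter == 0) return a.Length == b.Length ? 1f : 0f;
        return 1f - (float)EditDistance(a, b) / shorter;
    }

    public static void Main()
    {
        // distance 1, shorter length 8: 1 - 1/8 = 0.875, above the 0.8 threshold
        Console.WriteLine(Similarity("document", "documents"));
        // distance 5, shorter length 8: 1 - 5/8 = 0.375, below the threshold
        Console.WriteLine(Similarity("document", "documentation"));
    }
}
```

Under this assumed formula the score is computed between two single strings, so 0.8 allows only about one edit for every five characters of the shorter one.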

To check this code, I decided to see whether it finds an existing document. The variable queryText was assigned a value that is stored in the index. The code above did not find anything; in other words, it does not even find an exact match.

The index was built with this code:

 doc.Add(new global::Lucene.Net.Documents.Field(
            "text",
            text,
            global::Lucene.Net.Documents.Field.Store.YES,
            global::Lucene.Net.Documents.Field.Index.TOKENIZED,
            global::Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));

, : TermQuery .

 var _analyzer = new RussianAnalyzer();
 var parser = new global::Lucene.Net.QueryParsers
                .QueryParser("text", _analyzer);
 var query = parser.Parse(queryText);
 var _searcher = new IndexSearcher
       (Settings.General.Default.LuceneIndexDirectoryPath);
 var hits = _searcher.Search(query);

, , , .

+3
3

- , Lucene "" . Luke . Lucent.NET, , .

+2

. , :

  • , TermQuery "". , .
  • "" (), , ( , ).
  • .
+1

Try the MoreLikeThis class in Lucene. It encodes a good heuristic for identifying "similar" documents.
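A rough sketch of what that might look like in Lucene.Net follows. It assumes MoreLikeThis comes from the contrib "Queries" package under the Lucene.Net.Search.Similar namespace, and that the reader and searcher are already open on the same index; the exact namespace and members can differ between Lucene.Net versions, so treat this as an outline rather than a verified implementation. The helper name FindSimilar is hypothetical.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Similar; // assumption: contrib "Queries" package

public static class SimilarDocumentSketch
{
    // Hypothetical helper: builds a query from the most characteristic terms
    // of an already-indexed document and runs it against the same index.
    public static TopDocs FindSimilar(IndexReader reader, IndexSearcher searcher,
                                      int docId, int maxHits)
    {
        var mlt = new MoreLikeThis(reader);
        mlt.SetFieldNames(new[] { "text" }); // the field used in the question

        Query query = mlt.Like(docId);

        // No filter; return at most maxHits of the best-scoring documents.
        return searcher.Search(query, null, maxHits);
    }
}
```

The source document itself will normally be among the hits, so it has to be skipped when looking for duplicates. MoreLikeThis also applies minimum term and document frequency thresholds by default, so on a very small test index the generated query may come out empty until those thresholds are lowered.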

+1

Source: https://habr.com/ru/post/1732060/

