Lucene Highlighter sometimes inexplicably returns empty fragments

I have been working on a document retrieval program built on Lucene for the past few days, and everything has gone well so far. I am trying to use the Lucene.Net.Highlight.Highlighter class to display matching snippets for my search results, but it does not work consistently. Most of the time, calling Highlighter.GetBestFragments() does exactly what I expect (it returns the text fragments that contain the query string), but sometimes it just returns an empty string.

I triple-checked my inputs, and I can verify that the query string I am using exists in the input text, yet the highlighter seemingly at random returns an empty string. The problem is reproducible: documents that return empty fragments keep returning empty fragments for the same query, while documents that return valid fragments keep returning valid fragments.

However, the problem is NOT document-specific. Some queries return valid snippets for a document, while other queries return an empty string for the same document. The problem is also not related to my analyzer; it occurs whether I use StandardAnalyzer or SnowballAnalyzer.

After many hours of digging, I could not find any pattern that distinguishes the queries/documents that fail from those that work. Keep in mind that this happens on documents that were specifically retrieved from the Lucene index using the same query. This means that the Searcher can find the query string in the target document, but the Highlighter cannot.

Is this a bug in Lucene? If so, how can I get around this?

My code is:

    private static SimpleHTMLFormatter _formatter = new SimpleHTMLFormatter("<b>", "</b>");
    private static SimpleFragmenter _fragmenter = new SimpleFragmenter(50);

    ...
    {
        using (var searcher = new IndexSearcher(analyzerInfo.Directory, false))
        {
            QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "Text", analyzerInfo.Analyzer);
            parser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

            //build query
            BooleanQuery booleanQuery = new BooleanQuery();
            booleanQuery.Add(new TermQuery(new Term("PageNum", "0")), BooleanClause.Occur.MUST);
            booleanQuery.Add(parser.Parse(searchQuery), BooleanClause.Occur.MUST);
            Query query = booleanQuery.Rewrite(searcher.GetIndexReader());

            //get results from query
            ScoreDoc[] hits = searcher.Search(query, 50).ScoreDocs;
            List<DVDoc> results = hits.Select(hit => MapLuceneDocumentToData(searcher.Doc(hit.Doc))).ToList();

            //add relevant fragments to search results (shows WHY a certain result was chosen)
            QueryScorer scorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(_formatter, scorer);
            highlighter.SetTextFragmenter(_fragmenter);
            foreach (DVDoc result in results)
            {
                TokenStream stream = analyzerInfo.Analyzer.TokenStream("Text", new StringReader(result.Text));
                result.RelevantFragments = highlighter.GetBestFragments(stream, result.Text, 3, "...");
            }

            //clean up
            analyzerInfo.Analyzer.Close();
            searcher.Close();

            return results;
        }
    }

(Note: DVDoc is essentially just a structure that stores information about the documents found. The MapLuceneDocumentToData method converts a Lucene Document into my custom DVDoc class; there is no magic there.)

And since everyone likes example inputs and outputs:

I am using Lucene.NET Version 2.9.4g.

1 answer

By default, Highlighter only processes the first 51200 characters of a document.
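
That would fit the symptom in the question: the search presumably matches terms anywhere in the indexed text, while the highlighter stops looking once it passes that limit, so a document whose only matches lie beyond it produces no fragments. A rough way to check a failing document/query pair is sketched below. The class and method names are hypothetical, and the plain substring search only approximates what the analyzer does (no tokenization or stemming).

```csharp
using System;

public static class HighlightDiagnostics
{
    // Default limit quoted in the answer (50 * 1024 characters).
    public const int DefaultMaxCharsToAnalyze = 51200;

    // Returns true when the term occurs in the text but its first occurrence
    // starts at or beyond the limit, i.e. past the part the highlighter reads.
    // Returns false when the term is absent or first occurs within the limit.
    public static bool FirstMatchIsPastLimit(string text, string term,
                                             int limit = DefaultMaxCharsToAnalyze)
    {
        int index = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        return index >= limit;
    }
}
```

If this returns true for a document/query pair that comes back with an empty fragment, the limit is the likely cause.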

To increase this limit, set the MaxDocCharsToAnalyze property.

http://lucene.apache.org/core/old_versioned_docs/versions/2_9_2/api/contrib-highlighter/org/apache/lucene/search/highlight/Highlighter.html#setMaxDocCharsToAnalyze(int)
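
Applied to the loop in the question, the change could look like the sketch below. The helper class and method are hypothetical. The property form follows the wording above; whether a given Lucene.NET build exposes it as a property or as a SetMaxDocCharsToAnalyze(int) method (as in the linked Java API) is an assumption to verify against your version.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Highlight;

public static class HighlightHelper
{
    // Hypothetical helper: highlights the whole text rather than only
    // its first 51200 characters.
    public static string GetFragments(Highlighter highlighter, Analyzer analyzer, string text)
    {
        // Raise the limit to cover the entire document.
        // If your build mirrors the Java API, the equivalent call would be
        // highlighter.SetMaxDocCharsToAnalyze(text.Length);
        highlighter.MaxDocCharsToAnalyze = text.Length;

        TokenStream stream = analyzer.TokenStream("Text", new StringReader(text));
        return highlighter.GetBestFragments(stream, text, 3, "...");
    }
}
```

Analyzing the full text costs more time on large documents, so a fixed, larger cap may be a reasonable compromise if highlighting becomes slow.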


Source: https://habr.com/ru/post/916487/

