I am trying to implement Elasticsearch phrase matching to optimize phrase searches over large bodies of text. As suggested in this article, I use a shingle filter to create several shingles per phrase.
Two questions:
In the mentioned article, stopwords are filtered out, and the shingle filter fills the resulting gaps by inserting "_" filler tokens. These tokens must be excluded from the shingles that get indexed by the engine. The reason for removing them is to be able to answer phrase queries that contain these kinds of "useless" words. The standard solution (mentioned in the article) is no longer possible, given that Lucene has deprecated the specific flag (enable_position_increments) needed for this behavior. How can I solve this problem?
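One common workaround (a sketch of my own, not taken from the article; the filter and analyzer names are invented) is to let the shingle filter emit its default "_" filler tokens, then blank out any shingle containing a filler with a `pattern_replace` token filter, and finally drop the resulting empty tokens with a `length` filter:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        },
        "kill_fillers": {
          "type": "pattern_replace",
          "pattern": ".*_.*",
          "replacement": ""
        },
        "drop_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "phrase_shingler": {
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "my_shingles", "kill_fillers", "drop_empty"]
        }
      }
    }
  }
}
```

This avoids relying on the deprecated enable_position_increments flag entirely: shingles containing "_" never reach the index, while shingles built from the remaining words are kept. You can check the resulting token stream with the `_analyze` API before indexing.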
Because punctuation is stripped, I regularly see shingles produced by this process that span sentence boundaries. From a search perspective, any result that matches words from two separate sentences is incorrect. How can I avoid (or at least mitigate) this problem?
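One possible mitigation (again my own sketch, not from the article) is to make sentence boundaries visible to the shingle filter: a `mapping` char filter replaces sentence-ending punctuation with a sentinel token that itself contains "_", so any shingle crossing a boundary is caught by the same filler-stripping filters shown for question 1. A fragment to add to the analysis settings:

```json
{
  "char_filter": {
    "sentence_breaks": {
      "type": "mapping",
      "mappings": [". => _SENT_", "! => _SENT_", "? => _SENT_"]
    }
  }
}
```

The analyzer would then list `"char_filter": ["sentence_breaks"]` before the tokenizer. Note this is an assumption worth verifying with the `_analyze` API: the `standard` tokenizer must keep `_SENT_` as a single token for the pattern filter to see it, and abbreviations such as "e.g." will also be treated as sentence breaks unless handled separately.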