Solr compound word tokenizer: results are treated as an OR expression

In Dutch and German, words can be combined into new words, known as compound words.

For example, "accountmanager" is considered a single word, composed of the words "account" and "manager". Our users will use both "accountmanager" and "account manager" in documents and queries, and they expect the same results for both.

To decompound (split) such words, Solr has a dictionary filter, which I configured in the schema:

<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="../../compound-word-dictionary.txt" minWordSize="8" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/> 

The compound-word-dictionary.txt file contains a list of words that are used to decompose compound words. This list includes, for example, the words "account" and "manager".
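For context, here is a minimal sketch of where such a filter usually sits in a field type. The field type name "text_compound", the tokenizer, and the lowercase filter are assumptions for illustration and are not taken from the question. Only the compound filter line is the original configuration. The dictionary file is assumed to be a plain word list with one entry per line.

```xml
<!-- Hypothetical field type: the name and the surrounding analysis chain are assumptions. -->
<fieldType name="text_compound" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Filter from the question, attributes unchanged. -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="../../compound-word-dictionary.txt"
            minWordSize="8" minSubwordSize="4" maxSubwordSize="15"
            onlyLongestMatch="true"/>
  </analyzer>
</fieldType>
```

Because a single `<analyzer>` element without a `type` attribute is used here, the same chain would run at both index time and query time.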

The decompounding result looks correct when analyzed in the Solr debugger for the query "accountmanager" (term text):

  • accountmanager
  • account
  • manager

This result, however, is treated as an OR query, so it finds all documents that contain at least one of the terms. I want it to behave like an AND query (that is, I want only results that have both "account" and "manager" in the document).

I tried setting the defaultOperator parameter in the schema to "AND", but this is ignored when using edismax. So I set minimum-should-match to 100% (mm=100%), as suggested, again without the desired result. Changing the attributes of the dictionary filter in the schema does not change the behavior to "AND" either.
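The edismax attempt described above might look roughly like the following request handler defaults. This is only a sketch: the handler name "/select" and the "qf" field name "text" are assumptions, not values from the question.

```xml
<!-- Hypothetical handler: the name and the qf field are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text</str>
    <!-- Require all optional top-level clauses to match. -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```

A plausible reason this has no effect here is that mm counts the top-level clauses of the query, while the sub-words produced from one query term end up grouped inside a single clause.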

Has anyone encountered this behavior when using DictionaryCompoundWordTokenFilterFactory, and does anyone know a solution that makes it behave like an AND operator?

+6
2 answers

It works as expected. DictionaryCompoundWordTokenFilterFactory simply adds the "inner words" that it finds, in this case both "account" and "manager". There may be only one, though: if the word is, for example, "accountbanana" and "banana" is not in the dictionary, only "account" is added.

The purpose is that someone searching for "manager" also finds a document that contains "accountmanager".

To get the desired behavior (I understand that you are applying this on the query side), you can use a dictionary that maps accountmanager to "account manager".
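One way to read the mapping suggested above is as a query-time synonym rule. A minimal sketch, assuming Solr's SynonymFilterFactory and a hypothetical synonyms.txt file (neither is named in the answer). The tokenizer and lowercase filter are also assumptions.

```xml
<!-- Query-side analyzer only. The file name synonyms.txt is an assumption. -->
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- synonyms.txt would contain the line: accountmanager => account manager -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="false"/>
</analyzer>
```

With this approach every compound has to be listed explicitly in the synonym file, so it trades the generality of the dictionary filter for control over how each compound is rewritten.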

+3

Just a heads-up, since I am researching this and there is a lot of noise around it. Since Solr 3.6, CompoundWordTokenFilterBase sets the position increment of each decompounded token to 0, so documents are indexed (sort of) correctly. At query time, however, you get a giant OR query over the parts of your compound word, because AnalyzerQueryNodeProcessor checks whether positionCount == 1.

For example, a search for "castaway" becomes the query (castaway OR cast OR away). This adds a lot of noise: a search for the movie Castaway (which is really "Cast Away") will work, but you will also get everything that contains just "away" or just "cast".

We actually patched Lucene to set the PositionIncrement to 1 and added extra code to AnalyzerQueryNodeProcessor so that it produces OR'd PhraseQueryNodes, giving ("castaway" OR "cast away"). This is not correct either, but it reduces noise. Phrase queries can return strange results if you always set the position increment to 1, because (castaway0, cast1, away2) can return "cast away" results. Also, the positions of later terms are now off. For a more detailed description, see http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

+2

Source: https://habr.com/ru/post/918345/

