Multi-word synonyms in Solr

I am trying to implement multi-word synonyms in Solr, in particular of the type

msc divina => divina 

So, if the user enters "msc divina", Solr should return results only for "divina".

The definition in schema.xml looks like this:

 <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" ignoreCase="true" expand="false" />
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords_de.txt" />
     <filter class="solr.SnowballPorterFilterFactory" language="German2" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords_de.txt" />
     <filter class="solr.SnowballPorterFilterFactory" language="German2" />
   </analyzer>
 </fieldType>

This does not work. If I add a synonym filter to the query analyzer, a search for "msc divina" returns every hit for "msc" and for "divina".

How can I solve this?

2 answers

From the Solr documentation:

Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (i.e. "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:

1. The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit, the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.

2. Phrase searching (i.e. "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.

The documentation describes a slightly different problem (you cannot search for sea biscit and match an indexed seabiscuit unless you use expand=true ), but it also explains what happens at query time with a multi-word query, which is your case:

 msc divina -> msc | divina - phrase query 

which matches documents containing msc as well as documents containing divina. If you can indicate at query time that you are searching for the phrase "msc divina" , it will work.

Otherwise, you need either a tokenizer that handles multiple words at query time, or you can extend the FieldQParser plugin to do this for you. An example can be found here .
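
The whitespace-splitting problem from the quoted documentation can be imitated outside Solr. The following is a minimal sketch in plain Python, not Solr code: `apply_synonyms`, `analyze_whole` and `analyze_per_word` are hypothetical helpers, and the greedy longest-match rule is only an assumption about how a synonym filter roughly behaves.

```python
# Hypothetical stand-in for a synonym filter: greedy longest match over tokens.
# The single rule mirrors the question's "msc divina => divina".
SYNONYMS = {("msc", "divina"): ["divina"]}


def apply_synonyms(tokens, rules=SYNONYMS):
    """Replace the longest matching token sequence with its synonym."""
    out, i = [], 0
    max_len = max(len(key) for key in rules)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            # No rule starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def analyze_whole(query):
    """Imitates a phrase query: the analyzer sees all tokens together."""
    return apply_synonyms(query.lower().split())


def analyze_per_word(query):
    """Imitates splitting on whitespace BEFORE analysis: one word at a time."""
    out = []
    for word in query.lower().split():
        out.extend(apply_synonyms([word]))
    return out


print(analyze_whole("msc divina"))     # ['divina']
print(analyze_per_word("msc divina"))  # ['msc', 'divina']
```

When each word is analyzed on its own, the two-word rule never gets a chance to fire, so both msc and divina survive into the query.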


Here is the solution you will find on the Internet: https://dzone.com/articles/solution-multi-term-synonyms

My own solution to this problem was domain-specific. In my case, I was sure of my query lengths (i.e., less than 200, or just 5-10 words).

  • I replaced the spaces with underscores in the synonym entry. Here is one of my synonyms entries:

     "like_to":["love_to","loves_to","need_to","needs_to"] 
  • Use KeywordTokenizerFactory so that the complete query reaches the filters as one token

     <tokenizer class="solr.KeywordTokenizerFactory"/> 
  • Use ShingleFilterFactory to index / query all possible subphrases with sizes between minShingleSize and maxShingleSize .

     <filter class="solr.ShingleFilterFactory" minShingleSize="2" outputUnigrams="true" maxShingleSize="3"/> 
  • Then use PatternReplaceCharFilterFactory to replace spaces with underscores ( _ )

     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement="_"/> 
  • Use the synonym filter factory.

Example

Request: I love to travel

Tokens: I love, I love to, love to, love to travel, to travel, travel

Replaced with _ : I_love, I_love_to, love_to, love_to_travel, to_travel, travel

The synonym filter turns them into: I_love, I_love_to, like_to, love_to_travel, to_travel, travel

Thus, the phrase love to is eventually changed to like to .
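
The example above can be traced with a small script. This is a plain-Python imitation of the idea, not Solr's implementation: `shingles` and `rewrite` are hypothetical helpers, the synonym direction (love_to becomes like_to) is assumed from the example, and all unigrams are emitted because the filter is configured with outputUnigrams="true", so the list is slightly longer than the one shown above.

```python
# Assumed mapping direction, taken from the example: variants -> "like_to".
SYNONYMS = {
    "love_to": "like_to",
    "loves_to": "like_to",
    "need_to": "like_to",
    "needs_to": "like_to",
}


def shingles(words, min_size=2, max_size=3, output_unigrams=True):
    """Return every run of consecutive words with an allowed length."""
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams:
        sizes.insert(0, 1)
    result = []
    for start in range(len(words)):
        for size in sizes:
            if start + size <= len(words):
                result.append(" ".join(words[start:start + size]))
    return result


def rewrite(query):
    """Shingle the query, join words with underscores, then map synonyms."""
    tokens = [s.replace(" ", "_") for s in shingles(query.split())]
    return [SYNONYMS.get(token, token) for token in tokens]


print(rewrite("I love to travel"))
# ['I', 'I_love', 'I_love_to', 'love', 'like_to',
#  'love_to_travel', 'to', 'to_travel', 'travel']
```

The number of shingles grows with the query length and with maxShingleSize, which is why this approach is only reasonable for short queries.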

I hope this trick helps, although it requires expensive operations.


Source: https://habr.com/ru/post/957904/

