Solr: combining EdgeNGramFilterFactory and NGramFilterFactory

I have a situation where I need to use both EdgeNGramFilterFactory and NGramFilterFactory.

I use NGramFilterFactory for "contains"-style search with a minimum gram size of 2 characters. I also want "starts with" search on the first letter, using a front-edge EdgeNGramFilterFactory.

I do not want to lower NGramFilterFactory's minimum gram size to 1, because I do not want to index every single character.

Some help would be greatly appreciated.

Greetings

+6
2 answers

You do not have to do all of this in one field. I would create separate fields, each with its own custom field type, so that each kind of processing is applied independently.

In the following:

  • text contains the original tokens, minimally processed;
  • text_ngram uses NGramFilter for your two-character (and longer) tokens;
  • text_first_letter uses EdgeNGram for single-character first-letter tokens.

If you handle all text fields this way, you can get away with using copyField to populate the extra fields. Otherwise, you can have your Solr client send the same field values to three separate fields, one of each type.

When searching, include all of them in your search queries using the qf parameter.

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_first_letter" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1" side="front"/>
      </analyzer>
    </fieldType>
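As a rough sketch of how these types might be wired into a schema, assuming a hypothetical source field called title (the field names are illustrative and not from the original setup; only the fieldType names match the definitions above):

```xml
<!-- Hypothetical field names; only the type names come from the answer. -->
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_ngram" type="text_ngram" indexed="true" stored="false"/>
<field name="title_first_letter" type="text_first_letter" indexed="true" stored="false"/>

<!-- copyField passes the raw input value to each destination,
     where that field's own analyzer is then applied. -->
<copyField source="title" dest="title_ngram"/>
<copyField source="title" dest="title_first_letter"/>
```

Leaving the copies unstored is a design choice here: they exist only for matching, so the original field can serve as the displayed value.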

The field and dynamicField definitions are up to you. Let me know if you have further questions and I can edit this answer with clarifications.
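For the query side, one possible setup is to set qf as a default on a search handler in solrconfig.xml. This sketch assumes the eDisMax query parser and hypothetical fields title, title_ngram and title_first_letter, one for each of the three types above; the boost values are arbitrary placeholders rather than recommendations.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- qf is read by the DisMax/eDisMax parsers -->
    <str name="defType">edismax</str>
    <!-- Hypothetical field names and placeholder boosts -->
    <str name="qf">title^3 title_first_letter^2 title_ngram</str>
  </lst>
</requestHandler>
```

The same defType and qf values can also be passed as request parameters instead of handler defaults.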

+7

Start by applying EdgeNGramFilter with min = 1 and max = 1000 (a maximum this large ensures that the whole source token is included). Example:

hello => 'h', 'he', 'hel', 'hell', 'hello'

Next, apply NGramFilter with min = 2 (the example also uses max = 2, just for simplicity):

'h', 'he', 'hel', 'hell', 'hello' => 'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo'

You will now have several identical tokens, because NGramFilter was applied to every "partial" token produced by EdgeNGramFilter. Use RemoveDuplicatesTokenFilter to remove them:

'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo' => 'h', 'he', 'el', 'll', 'lo'

Now your field supports single-character "starts with" queries as well as multi-character "contains" queries.
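The chain described above might look roughly like this as a field type. The name text_prefix_contains is made up for illustration, and the tokenizer and lowercase filter are assumptions carried over from typical text types. One caveat to check: depending on the Lucene/Solr version and settings, NGramFilter may drop tokens shorter than minGramSize, in which case the single-character token would not survive the second filter. The Analysis screen in the Solr admin UI shows what each stage actually emits.

```xml
<!-- Hypothetical type name; a sketch of the filter order described above. -->
<fieldType name="text_prefix_contains" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Step 1: every prefix of the token, up to the whole token -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1000"/>
    <!-- Step 2: two-character grams of each prefix (max = 2 as in the example) -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
    <!-- Step 3: drop identical tokens that share a position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```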

+3

Source: https://habr.com/ru/post/896180/
