Can I prioritize more exact matches when using the ngram filter in search results?

When using the ngram filter with Elasticsearch, a search for something like "test" returns the documents "last", "tests" and "test". Is there a way to make a document that exactly matches the query ("test") always rank higher in the search results?

3 answers

This is a known problem with ngrams: you get a lot of false positives in your ranking. The solution is to combine ngrams with shingles. Basically, in addition to the ngrams, you also index the complete word as a separate term, or even combinations of words. The shingle filter is similar to ngrams, but it works on words instead of characters.

That way, a document that exactly matches on the shingle terms scores higher than one that only matches on the ngrams.
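To make the distinction concrete, here is a small standalone sketch (plain Java, not Elasticsearch code; the helper names are mine) that mimics what the two filters emit. Character ngrams of "tests" contain the term "test", so it matches the query like a real hit, while the whole-word (shingle/unigram) terms only contain "test" for the document "test" itself:

```java
import java.util.*;

public class ShingleVsNgram {
    // Character n-grams of a single term (roughly what the ngram filter emits).
    static Set<String> ngrams(String term, int min, int max) {
        Set<String> out = new LinkedHashSet<>();
        for (int n = min; n <= max; n++)
            for (int i = 0; i + n <= term.length(); i++)
                out.add(term.substring(i, i + n));
        return out;
    }

    // Word shingles of a phrase plus its single words (roughly what the shingle filter emits).
    static Set<String> shingles(String phrase, int min, int max) {
        String[] words = phrase.toLowerCase().split("\\s+");
        Set<String> out = new LinkedHashSet<>(Arrays.asList(words));
        for (int n = min; n <= max; n++)
            for (int i = 0; i + n <= words.length; i++)
                out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        return out;
    }

    public static void main(String[] args) {
        // "tests" shares the ngram "test" with the query, hence the false positive...
        System.out.println(ngrams("tests", 2, 4).contains("test"));   // true
        // ...but as a whole-word term, "test" exists only for the document "test".
        System.out.println(shingles("tests", 2, 2).contains("test")); // false
        System.out.println(shingles("test", 2, 2).contains("test"));  // true
    }
}
```

A field indexed with both token streams therefore gives the exact-match document an extra matching term, which is what pushes it up in the ranking.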

Update: here is an example of a custom analyzer. Once it is defined, you can use it in your mappings. In this case I use icu_normalizer, icu_folding, and my suggestions_shingle filter, all set as the default analyzer, so all my string fields are handled this way.

  {
    "analyzer": {
      "default": {
        "tokenizer": "icu_tokenizer",
        "filter": ["icu_normalizer", "icu_folding", "suggestions_shingle"]
      }
    },
    "filter": {
      "suggestions_shingle": {
        "type": "shingle",
        "min_shingle_size": 2,
        "max_shingle_size": 5
      }
    }
  }

You need a multi-field mapping and a multi_match query.

I had the same problem. I had to search by name, so when I typed the search term "And", I wanted "Andy" to come first, not "Mandy". With nGram alone, I couldn't achieve this.

I added another analyzer that uses the edge NGram tokenizer (the code below is for Spring Data Elasticsearch, but you get the idea).

  setting.put("analysis.analyzer.word_parts.type", "custom");
  setting.put("analysis.analyzer.word_parts.tokenizer", "ngram_tokenizer");
  setting.put("analysis.analyzer.word_parts.filter", "lowercase");
  setting.put("analysis.analyzer.type_ahead.type", "custom");
  setting.put("analysis.analyzer.type_ahead.tokenizer", "edge_ngram_tokenizer");
  setting.put("analysis.analyzer.type_ahead.filter", "lowercase");
  setting.put("analysis.tokenizer.ngram_tokenizer.type", "nGram");
  setting.put("analysis.tokenizer.ngram_tokenizer.min_gram", "3");
  setting.put("analysis.tokenizer.ngram_tokenizer.max_gram", "50");
  setting.put("analysis.tokenizer.ngram_tokenizer.token_chars", new String[] { "letter", "digit" });
  setting.put("analysis.tokenizer.edge_ngram_tokenizer.type", "edgeNGram");
  setting.put("analysis.tokenizer.edge_ngram_tokenizer.min_gram", "2");
  setting.put("analysis.tokenizer.edge_ngram_tokenizer.max_gram", "20");
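To see why the edge NGram tokenizer fixes the "Andy" vs. "Mandy" ordering, here is a small standalone sketch (plain Java, with helper names of my own; it only mimics the tokenizer settings above). Plain ngrams of both names contain "and", but only "Andy" produces "and" as an edge ngram, because edge ngrams are prefixes:

```java
import java.util.*;

public class EdgeNgramDemo {
    // All substrings of length min..max (roughly what the nGram tokenizer emits).
    static Set<String> ngrams(String s, int min, int max) {
        Set<String> out = new LinkedHashSet<>();
        for (int n = min; n <= max; n++)
            for (int i = 0; i + n <= s.length(); i++)
                out.add(s.substring(i, i + n));
        return out;
    }

    // Prefixes of length min..max (roughly what the edgeNGram tokenizer emits).
    static Set<String> edgeNgrams(String s, int min, int max) {
        Set<String> out = new LinkedHashSet<>();
        for (int n = min; n <= Math.min(max, s.length()); n++)
            out.add(s.substring(0, n));
        return out;
    }

    public static void main(String[] args) {
        // Plain ngrams: "and" occurs inside both names, so both match the query "and".
        System.out.println(ngrams("andy", 3, 50).contains("and"));      // true
        System.out.println(ngrams("mandy", 3, 50).contains("and"));     // true
        // Edge ngrams: only "andy" starts with "and", so only it matches.
        System.out.println(edgeNgrams("andy", 2, 20).contains("and"));  // true
        System.out.println(edgeNgrams("mandy", 2, 20).contains("and")); // false
    }
}
```

Searching both sub-fields then lets documents that match on the prefix field outscore documents that only match somewhere in the middle of a word.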

I mapped the required fields as multi-fields:

  @MultiField(
      mainField = @Field(type = FieldType.String, indexAnalyzer = "word_parts", searchAnalyzer = "standard"),
      otherFields = @NestedField(dotSuffix = "autoComplete", type = FieldType.String, searchAnalyzer = "standard", indexAnalyzer = "type_ahead"))
  private String firstName;

For the query, I use multi_match, listing 'firstName.autoComplete' first, ahead of plain 'firstName':

 QueryBuilders.multiMatchQuery(searchTerm, new String[]{"firstName.autoComplete", "firstName"}) 

It seems to work correctly.

In your case, since you need an exact match, perhaps instead of "edgeNGram" you can simply use the "standard" tokenizer.


You can copy the contents of a field into sub-fields using a multi-field mapping. Example:

  "fullName": {
    "type": "string",
    "search_analyzer": "str_search_analyzer",
    "index_analyzer": "str_index_analyzer",
    "fields": {
      "fullWord": {
        "type": "string"
      },
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }

Note that str_index_analyzer here is the one that uses nGram. You can then build your query to also search these sub-fields. Example:

  {
    "query": {
      "bool": {
        "should": [{
          "multi_match": {
            "fields": ["firstName.fullWord", ...],
            "query": query,
            "fuzziness": "0"
          }
        }],
        "must": [{
          "multi_match": {
            "fields": ["firstName", ...],
            "query": query,
            "fuzziness": "AUTO"
          }
        }]
      }
    }
  }

Source: https://habr.com/ru/post/948395/
