Elasticsearch multi-field fuzzy search does not return the exact match first

I am running a fuzzy Elasticsearch query against the text and keywords fields. I have two documents in Elasticsearch, one with the text “testPhone 5” and the other with “testPhone 4s”. When I run a fuzzy query for "testPhone 5", both documents get exactly the same score. Why is this happening?

Additional information: I index documents using the uax_url_email tokenizer and the lowercase filter.

This is the request I am making:

{
  query: {
    bool: {
      // match one or the other fuzzy query
      should: [
        {
          fuzzy: {
            text: {
              min_similarity: 0.4,
              value: 'testphone 5',
              prefix_length: 0,
              boost: 5,
            }
          }
        },
        {
          fuzzy: {
            keywords: {
              min_similarity: 0.4,
              value: 'testphone 5',
              prefix_length: 0,
              boost: 1,
            }
          }
        }
      ]
    }
  },
  sort: [ '_score' ],
  explain: true
}

This is the result:

{
  max_score: 0.47213298,
  total: 2,
  hits: [
    {
      _index: 'test',
      _shard: 0,
      _id: '51fbf95f82e89ae8c300002c',
      _node: '0Mtfzbe1RDinU71Ordx-Ag',
      _source: {
        next: { id: '51fbf95f82e89ae8c3000027' },
        cards: [ '51fbf95f82e89ae8c3000027', [length]: 1 ],
        other: false,
        _id: '51fbf95f82e89ae8c300002c',
        category: '51fbf95f82e89ae8c300002b',
        image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
        text: 'testPhone 5',
        keywords: [ [length]: 0 ],
        __v: 0
      },
      _type: 'productgroup',
      _explanation: {
        details: [
          {
            details: [
              {
                details: [
                  {
                    details: [
                      {
                        details: [
                          { value: 3.8888888, description: 'boost' },
                          { value: 1.5108256, description: 'idf(docFreq=2, maxDocs=5)' },
                          { value: 0.17020021, description: 'queryNorm' },
                          [length]: 3
                        ],
                        value: 0.99999994,
                        description: 'queryWeight, product of:'
                      },
                      {
                        details: [
                          {
                            details: [ { value: 1, description: 'termFreq=1.0' }, [length]: 1 ],
                            value: 1,
                            description: 'tf(freq=1.0), with freq of:'
                          },
                          { value: 1.5108256, description: 'idf(docFreq=2, maxDocs=5)' },
                          { value: 0.625, description: 'fieldNorm(doc=0)' },
                          [length]: 3
                        ],
                        value: 0.944266,
                        description: 'fieldWeight in 0, product of:'
                      },
                      [length]: 2
                    ],
                    value: 0.94426596,
                    description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:'
                  },
                  [length]: 1
                ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:'
              },
              [length]: 1
            ],
            value: 0.94426596,
            description: 'sum of:'
          },
          { value: 0.5, description: 'coord(1/2)' },
          [length]: 2
        ],
        value: 0.47213298,
        description: 'product of:'
      },
      _score: 0.47213298
    },
    {
      _index: 'test',
      _shard: 4,
      _id: '51fbf95f82e89ae8c300002d',
      _node: '0Mtfzbe1RDinU71Ordx-Ag',
      _source: {
        next: { id: '51fbf95f82e89ae8c3000027' },
        cards: [ '51fbf95f82e89ae8c3000029', [length]: 1 ],
        other: false,
        _id: '51fbf95f82e89ae8c300002d',
        category: '51fbf95f82e89ae8c300002b',
        image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
        text: 'testPhone 4s',
        keywords: [ 'apple', [length]: 1 ],
        __v: 0
      },
      _type: 'productgroup',
      _explanation: {
        details: [
          {
            details: [
              {
                details: [
                  {
                    details: [
                      {
                        details: [
                          { value: 3.8888888, description: 'boost' },
                          { value: 1.5108256, description: 'idf(docFreq=2, maxDocs=5)' },
                          { value: 0.17020021, description: 'queryNorm' },
                          [length]: 3
                        ],
                        value: 0.99999994,
                        description: 'queryWeight, product of:'
                      },
                      {
                        details: [
                          {
                            details: [ { value: 1, description: 'termFreq=1.0' }, [length]: 1 ],
                            value: 1,
                            description: 'tf(freq=1.0), with freq of:'
                          },
                          { value: 1.5108256, description: 'idf(docFreq=2, maxDocs=5)' },
                          { value: 0.625, description: 'fieldNorm(doc=0)' },
                          [length]: 3
                        ],
                        value: 0.944266,
                        description: 'fieldWeight in 0, product of:'
                      },
                      [length]: 2
                    ],
                    value: 0.94426596,
                    description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:'
                  },
                  [length]: 1
                ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:'
              },
              [length]: 1
            ],
            value: 0.94426596,
            description: 'sum of:'
          },
          { value: 0.5, description: 'coord(1/2)' },
          [length]: 2
        ],
        value: 0.47213298,
        description: 'product of:'
      },
      _score: 0.47213298
    },
    [length]: 2
  ]
}
2 answers

Fuzzy queries are not analyzed, but the field is. So your search for testphone 5 with a min_similarity of 0.4 matches the analyzed term testphone in both documents, and that single term is what gets scored. You can see this in the explain output of both hits:

description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:'

See also @imotov's excellent answer here: Fuzzy ElasticSearch query
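
The effect can be reproduced without a cluster. What follows is a minimal Python sketch, not Elasticsearch's actual implementation. It assumes that, for these inputs, the analyzer behaves like a whitespace split plus lowercasing, and that fuzzy matching accepts at most two edits per term (the MAX_EDITS constant is an assumption standing in for however your version interprets min_similarity: 0.4). All function and variable names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def analyze(text):
    # Assumption: approximates the index-time analysis for these simple inputs.
    return text.lower().split()


MAX_EDITS = 2  # assumption: maximum edit distance allowed by the fuzzy query

# The fuzzy query value is NOT analyzed, so it stays one term, space included.
query_term = "testphone 5"

docs = {
    "doc_5": "testPhone 5",
    "doc_4s": "testPhone 4s",
}


def matching_terms(text):
    """Indexed terms of `text` that lie within MAX_EDITS of the query term."""
    return [t for t in analyze(text) if levenshtein(query_term, t) <= MAX_EDITS]


matches = {name: matching_terms(text) for name, text in docs.items()}
print(matches)
```

Under these assumptions the only term matched in either document is testphone, which is consistent with the single weight(text:testphone...) clause in both explain outputs. The tokens 5 and 4s never take part in scoring, so neither document can outrank the other.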

You can see exactly how a string will be tokenized by using the _analyze API:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html

For example:

http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5

will return:

{
  "tokens": [
    {
      "token": "testphone",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "5",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<NUM>",
      "position": 2
    }
  ]
}
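
If you want to build such a request from a script, here is a small Python sketch that only constructs the URL. The helper name analyze_url is hypothetical. The host, index name and field are taken from the example above and will differ on your cluster, and the field/text query parameters are assumed to be accepted by your Elasticsearch version, as they are in the request shown.

```python
from urllib.parse import urlencode


def analyze_url(host, index, field, text):
    """Build an _analyze URL that asks how `text` is tokenized for `field`."""
    # urlencode escapes the space in the text as '+'.
    query = urlencode({"field": field, "text": text})
    return f"{host}/{index}/_analyze?{query}"


url = analyze_url("http://localhost:9200", "prefix_test", "text", "testphone 5")
print(url)  # http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5
```

The resulting URL can be opened in a browser or fetched with any HTTP client. The sketch deliberately makes no network call.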

So even if you index the value testphone sammsung, a fuzzy query for "testphone samsunk" will not match anything, whereas a fuzzy query for just samsunk would. The whole unanalyzed query string is compared against each indexed token separately.

You may get better results by not analyzing the field (or by using the keyword analyzer).

If you want the same field analyzed in more than one way, you can use the multi_field construct.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
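
As an illustration, here is a hedged sketch of such a mapping, written as a Python dict that serializes to the JSON body. It follows the legacy multi_field syntax described on the page linked above. The sub-field name untouched is only an example, the type name productgroup is taken from the question, and newer Elasticsearch versions use a different syntax for the same idea, so check the documentation for your version.

```python
import json

# Sketch of a legacy multi_field mapping: the same source value is indexed
# twice, once analyzed and once as a single unanalyzed term.
mapping = {
    "productgroup": {
        "properties": {
            "text": {
                "type": "multi_field",
                "fields": {
                    # Default sub-field: analyzed into individual tokens.
                    "text": {"type": "string", "index": "analyzed"},
                    # Hypothetical sub-field name: whole value kept as one term.
                    "untouched": {"type": "string", "index": "not_analyzed"},
                },
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

With a mapping like this, a fuzzy query could target text.untouched, where "testPhone 5" is stored as one term and is compared to the query string as a whole.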


I recently ran into this problem. I can’t say exactly why this is happening, but I can tell you how I fixed it:

I ran two queries against the same field: one with an exact match, and then the same query against the same field with fuzzy matching enabled and a lower boost.

This ensured that my exact matches always ranked higher than the fuzzy matches.

P.S. I think they are scored the same because, with fuzziness enabled, both documents match, and ES does not care whether a match is exact as long as both match. This is pure theory on my part, though, since I am not thoroughly familiar with the scoring algorithm.
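
A rough sketch of that approach, again as a Python dict that serializes to the request body. The helper name and the boost values are placeholders, and using a match query for the exact clause is my assumption about what "exact match" means here. The fuzzy clause reuses the parameters from the question.

```python
import json


def build_exact_then_fuzzy_query(field, value, exact_boost=10, fuzzy_boost=1):
    """Combine an exact (analyzed) clause with a lower-boosted fuzzy clause."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact clause: analyzed, so "5" and "4s" take part in scoring.
                    {"match": {field: {"query": value, "boost": exact_boost}}},
                    # Fuzzy clause: same value, lower boost, as a fallback.
                    {
                        "fuzzy": {
                            field: {
                                "value": value,
                                "min_similarity": 0.4,
                                "prefix_length": 0,
                                "boost": fuzzy_boost,
                            }
                        }
                    },
                ]
            }
        },
        "sort": ["_score"],
    }


body = build_exact_then_fuzzy_query("text", "testphone 5")
print(json.dumps(body, indent=2))
```

A document that satisfies both clauses accumulates score from each, so exact matches should rise above documents that only match fuzzily.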


Source: https://habr.com/ru/post/1495017/

