Elasticsearch: splitting English compound words?

I am looking for a filter in Elasticsearch that breaks English compound words into their component parts. For example, a term like eyewitness should be indexed as eye, witness, and eyewitness, so that queries for eye or witness will also match eyewitness. I noticed the compound word filter, but it requires a word list, which I could not come up with myself.

1 answer

First, ask yourself whether you really need to split compound words. Consider a simpler approach, such as using edge n-grams to match on the leading or trailing edges of a term. This has the side effect of weakly matching fragments such as "ey", but that may be acceptable in your situation.
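As a rough illustration of the edge n-gram idea, here is a minimal Python sketch. The `edge_ngrams` helper only imitates what an edge n-gram filter emits, and the names `leading_edges` and `edge_index` in the settings body are hypothetical. The `edge_ngram` filter type and its `min_gram` / `max_gram` parameters are real Elasticsearch options, but check them against your version before relying on this.

```python
import json


def edge_ngrams(token, min_gram=2, max_gram=10, from_end=False):
    """Return the n-grams anchored at the start (or end) of a token."""
    upper = min(max_gram, len(token))
    if from_end:
        # Trailing edges: "ss", "ess", ..., "witness", ...
        return [token[-n:] for n in range(min_gram, upper + 1)]
    # Leading edges: "ey", "eye", "eyew", ...
    return [token[:n] for n in range(min_gram, upper + 1)]


# Hypothetical index settings; filter and analyzer names are made up.
EDGE_NGRAM_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "leading_edges": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                }
            },
            "analyzer": {
                "edge_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "leading_edges"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(edge_ngrams("eyewitness"))
    print(edge_ngrams("eyewitness", from_end=True))
    print(json.dumps(EDGE_NGRAM_SETTINGS, indent=2))
```

The leading edges contain "eye" and the trailing edges contain "witness", which is why both queries can hit eyewitness. They also contain "ey" and "ss", which is the weak fragment matching mentioned above.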

If you do need to split compounds and want to index the word fragments explicitly, you will need a word list. You can download a list of English words, one example is here. The word list is used to decide which fragments of a compound are words in their own right. This adds indexing overhead, so be sure to test it. An example showing usage is here.
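To make the role of the word list concrete, here is a small Python sketch. The `decompound` function is a simplified imitation of dictionary-based splitting, not the actual Elasticsearch implementation, and the filter name `english_decompounder` and the two-word list are placeholders. The `dictionary_decompounder` filter type and its `word_list` parameter do exist in Elasticsearch; a real setup would more likely point `word_list_path` at a downloaded word file.

```python
def decompound(token, word_list, min_word_size=5,
               min_subword_size=2, max_subword_size=15):
    """Return the token plus every dictionary word found inside it."""
    words = {w.lower() for w in word_list}
    lowered = token.lower()
    out = [token]
    seen = {lowered}
    if len(lowered) < min_word_size:
        return out  # too short to be worth splitting
    for start in range(len(lowered)):
        for size in range(min_subword_size, max_subword_size + 1):
            end = start + size
            if end > len(lowered):
                break
            part = lowered[start:end]
            # Keep a fragment only if the word list says it is a word.
            if part in words and part not in seen:
                seen.add(part)
                out.append(part)
    return out


# Hypothetical index settings; the filter name and word list are placeholders.
DECOMPOUND_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "english_decompounder": {
                    "type": "dictionary_decompounder",
                    "word_list": ["eye", "witness"],
                }
            },
            "analyzer": {
                "compound_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_decompounder"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(decompound("eyewitness", ["eye", "witness"]))
```

Every token is checked against the word list at many offsets and lengths, which is where the extra indexing cost comes from. A larger word list finds more compounds but also more accidental fragments.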

If your text is German, consider https://github.com/jprante/elasticsearch-analysis-decompound


Source: https://habr.com/ru/post/972938/

