How to find the most commonly used phrases in elasticsearch?

I know that you can find the most used terms in an index using facets.

For example, on the following inputs:

 "A B C" "AA BB CC" "A AA B BB" "AA B"

a terms facet returns this:

 B:3 AA:3 A:2 BB:2 CC:1 C:1 

But I wonder if the following can be listed:

 AA B:2 A B:1 BB CC:1 ....etc...

Is there such a feature in ElasticSearch?

2 answers

As mentioned in ramseykhalaf's comment, a shingle filter will produce tokens that are "n" words long.

    {
      "settings": {
        "analysis": {
          "filter": {
            "shingle": {
              "type": "shingle",
              "max_shingle_size": 5,
              "min_shingle_size": 2,
              "output_unigrams": "true"
            },
            "filter_stop": {
              "type": "stop",
              "enable_position_increments": "false"
            }
          },
          "analyzer": {
            "shingle_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": ["standard", "lowercase", "shingle", "filter_stop"]
            }
          }
        }
      },
      "mappings": {
        "type": {
          "properties": {
            "letters": {
              "type": "string",
              "analyzer": "shingle_analyzer"
            }
          }
        }
      }
    }

See the blog post for more details.
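To get a feel for what a terms facet on a shingled field would count, here is a rough Python sketch that imitates the analyzer above on the inputs from the question. It is not Elasticsearch code: the `shingles` helper is hypothetical, it only approximates the whitespace tokenizer, the lowercase filter and the shingle filter (sizes 2 to 5, unigrams kept), and it ignores the stop filter entirely.

```python
from collections import Counter


def shingles(text, min_size=2, max_size=5, output_unigrams=True):
    """Hypothetical helper: word n-grams of a lowercased, whitespace-split text."""
    words = text.lower().split()
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams:
        sizes.insert(0, 1)
    tokens = []
    for n in sizes:
        # Slide a window of n words over the text.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


docs = ["A B C", "AA BB CC", "A AA B BB", "AA B"]

# Count each distinct token once per document, since facet counts
# are numbers of matching documents.
counts = Counter()
for doc in docs:
    counts.update(set(shingles(doc)))

for token, n in counts.most_common():
    print(f"{token}: {n}")
```

Because of the lowercase step the tokens come out as "aa b", "a b", "bb cc" and so on, but the counts for those phrases line up with the ones asked for in the question.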


I'm not sure Elasticsearch will let you do this natively, the way you want it. But you might be interested in checking out Carrot2 (http://project.carrot2.org/index.html) to accomplish what you want, and maybe more.


Source: https://habr.com/ru/post/951802/

