Get the 100 most used three-word phrases in all documents

Question


I have about 15,000 cleaned websites with their text stored in an Elasticsearch index. I need to get the 100 most used three-word phrases across all of these texts:

Something like this:

 Hello there sir: 203
 Big bad pony: 92
 First come first: 56
 [...]

I am new to this. I looked into term vectors, but they seem to apply to individual documents. So I think this will be a combination of term vectors and an aggregation with some kind of n-gram analysis, but I do not know how to implement it. Any pointers would be helpful.

My current mapping and settings:

 {
   "mappings": {
     "items": {
       "properties": {
         "body": {
           "type": "string",
           "term_vector": "with_positions_offsets_payloads",
           "store": true,
           "analyzer": "fulltext_analyzer"
         }
       }
     }
   },
   "settings": {
     "index": {
       "number_of_shards": 1,
       "number_of_replicas": 0
     },
     "analysis": {
       "analyzer": {
         "fulltext_analyzer": {
           "type": "custom",
           "tokenizer": "whitespace",
           "filter": [ "lowercase", "type_as_payload" ]
         }
       }
     }
   }
 }


indexing elasticsearch lucene

Hydera Sep 7 '16 at 23:32


1 answer

Peter Dixon-Moses · Accepted Answer · 2016-09-08T16:50:36+0000

What you are looking for are called shingles. Shingles are like "word n-grams": sequences of more than one consecutive term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")

Take a look here: https://www.elastic.co/blog/searching-with-shingles
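To make the idea of shingles concrete, here is a minimal pure-Python sketch of what a 3-term shingle filter produces and how counting those shingles gives "top phrases". It only imitates the concept on two made-up sentences; the function name `three_word_shingles` is hypothetical, and the tokenization (lowercase, split on whitespace) is an assumption modeled on the whitespace tokenizer and lowercase filter in the question, not Elasticsearch's actual implementation.

```python
from collections import Counter


def three_word_shingles(text):
    """Return every run of three consecutive whitespace-separated tokens, lowercased."""
    tokens = text.lower().split()
    # A text with fewer than three tokens yields no shingles at all.
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


# Two made-up documents, for illustration only.
docs = [
    "We all live in a yellow submarine",
    "We all live in a small town",
]

counts = Counter()
for doc in docs:
    counts.update(three_word_shingles(doc))

# Print the most frequent three-word phrases across both documents.
for phrase, n in counts.most_common(3):
    print(f"{phrase}: {n}")
```

Doing this client-side over 15,000 documents would mean pulling every text out of the index, which is why the approach below lets the analyzer produce the shingles at index time instead.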

Basically, you need a field with a shingle analyzer that produces only 3-term shingles:

Use the configuration from the Elastic blog post, but with:

 "filter_shingle": {
   "type": "shingle",
   "max_shingle_size": 3,
   "min_shingle_size": 3,
   "output_unigrams": "false"
 }
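For context, here is one way the filter above might be wired into a complete index definition, written as a Python dict that can be serialized to JSON. This is a sketch under assumptions, not the blog post's exact configuration: the analyzer name `shingle_analyzer` is hypothetical, the whitespace tokenizer and lowercase filter are carried over from the question's `fulltext_analyzer`, and the `"type": "string"` mapping follows the question's 2016-era index (newer Elasticsearch versions use different field types). `output_unigrams` is written here as a boolean rather than the string `"false"`.

```python
import json

# Hypothetical index definition; names other than "filter_shingle" are illustrative.
index_body = {
    "settings": {
        "index": {"number_of_shards": 1, "number_of_replicas": 0},
        "analysis": {
            "filter": {
                # Emit only 3-term shingles, with no single-word tokens.
                "filter_shingle": {
                    "type": "shingle",
                    "min_shingle_size": 3,
                    "max_shingle_size": 3,
                    "output_unigrams": False,
                }
            },
            "analyzer": {
                # "shingle_analyzer" is an assumed name, not taken from the blog post.
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "filter_shingle"],
                }
            },
        },
    },
    "mappings": {
        "items": {
            "properties": {
                "body": {"type": "string", "analyzer": "shingle_analyzer"}
            }
        }
    },
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

The filter order matters: lowercasing before shingling means "Hello there sir" and "hello there sir" end up as the same phrase.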

After applying the shingle analyzer to the relevant field (as in the blog post) and reindexing your data, you should be able to issue a query that returns a simple terms aggregation on your body field to see the top one hundred 3-word phrases.

 {
   "size": 0,
   "query": {
     "match_all": {}
   },
   "aggs": {
     "three-word-phrases": {
       "terms": {
         "field": "body",
         "size": 100
       }
     }
   }
 }
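As a rough illustration of reading the result, the sketch below turns a terms aggregation response into the "phrase: count" listing from the question. The helper `top_phrases` is hypothetical, and `sample_response` is hand-written to mirror the usual `aggregations` / `buckets` / `key` / `doc_count` shape of a terms aggregation, reusing the example numbers from the question rather than real output.

```python
def top_phrases(response, agg_name="three-word-phrases"):
    """Extract (phrase, doc_count) pairs from a terms aggregation response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Hand-written sample in the shape of a terms aggregation response;
# the phrases and counts are the illustrative ones from the question.
sample_response = {
    "aggregations": {
        "three-word-phrases": {
            "buckets": [
                {"key": "hello there sir", "doc_count": 203},
                {"key": "big bad pony", "doc_count": 92},
                {"key": "first come first", "doc_count": 56},
            ]
        }
    }
}

for phrase, count in top_phrases(sample_response):
    print(f"{phrase}: {count}")
```

Note that `doc_count` is the number of documents containing the phrase, so a phrase repeated many times within a single document still counts once.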
