I have about 15,000 cleaned websites with their text stored in an Elasticsearch index. I need to get the 100 most common three-word phrases used across all of these texts:
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I am new to this. I looked at term vectors, but they seem to apply only to individual documents. So I believe the solution is some combination of term vectors and aggregations with n-gram (shingle) analysis, but I do not know how to implement it. Any pointers would be helpful.
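For example, the only way I have found to inspect term vectors is one document at a time, like this (my_index and the document id 1 are placeholders I made up, and the exact endpoint may depend on the version):

# Term vectors for a single document only; the endpoint is _termvector
# in 1.x and _termvectors in later versions:
curl -XGET 'localhost:9200/my_index/items/1/_termvector?fields=body&pretty'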
My current mapping and settings:
{ "mappings": { "items": { "properties": { "body": { "type": "string", "term_vector": "with_positions_offsets_payloads", "store" : true, "analyzer" : "fulltext_analyzer" } } } }, "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }, "analysis": { "analyzer": { "fulltext_analyzer": { "type": "custom", "tokenizer": "whitespace", "filter": [ "lowercase", "type_as_payload" ] } } } } }