Select distinct documents by author in Elasticsearch

I have a collection of documents that belong to several authors:

 [
   { id: 1, author_id: 'mark',    content: [...] },
   { id: 2, author_id: 'pierre',  content: [...] },
   { id: 3, author_id: 'pierre',  content: [...] },
   { id: 4, author_id: 'mark',    content: [...] },
   { id: 5, author_id: 'william', content: [...] },
   ...
 ]

I would like to retrieve and paginate a collection that contains only the single best-matching document per author:

 [
   { id: 1, author_id: 'mark',    content: [...], _score: 100 },
   { id: 3, author_id: 'pierre',  content: [...], _score: 90 },
   { id: 5, author_id: 'william', content: [...], _score: 80 },
   ...
 ]

Here is what I am doing now (pseudo-code):

 unique_docs = res.results.to_a.uniq { |doc| doc.author_id } 

The problem is proper pagination: how do I select a page of 20 "best" documents when deduplication happens after the search?
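To make the problem concrete, here is a minimal Ruby sketch (with made-up hits and scores) showing why deduplicating inside a page breaks pagination: a page of hits collapses to fewer unique authors, so a "page of 20" is no longer 20.

```ruby
# Hypothetical page of search hits, already sorted by _score descending.
page = [
  { id: 1, author_id: 'mark',    score: 100 },
  { id: 2, author_id: 'pierre',  score: 95 },
  { id: 3, author_id: 'pierre',  score: 90 },
  { id: 4, author_id: 'mark',    score: 85 },
  { id: 5, author_id: 'william', score: 80 }
]

# Keeping only the best document per author shrinks the page:
# 5 hits collapse to 3 unique authors.
unique_docs = page.uniq { |doc| doc[:author_id] }
```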

Some people suggest using term facets, but what I need is not a tag cloud.

Thanks,
Gallery

2 answers

Since ElasticSearch does not currently provide an equivalent of group_by, here is my attempt to do it manually.
Although the ES community is working on a direct solution to this problem (possibly as a plugin), here is a basic approach that works for my needs.

Assumptions:

  • I am searching for relevant content.

  • I assume the first 300 documents are relevant, so I limit my search to this selection, regardless of whether several of them belong to the same author.

  • For my needs I did not really need full pagination; a "show more" button updated via ajax was enough.

Disadvantages:

  • The results are not exact.
    Since we take 300 documents at a time, we don't know how many unique documents will come out of them (in the worst case, all 300 could belong to the same author!). You should check this against your average number of documents per author and possibly raise the limit.

  • You need to make 2 queries (paying the cost of an extra remote call):

    • The first query asks for the 300 most relevant documents, returning only the id and author_id fields.
    • The second query fetches the full documents for one page of the deduplicated ids.

Here is some Ruby pseudo-code: https://gist.github.com/saxxi/6495116
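The linked gist is the authoritative version; below is a self-contained sketch of the same two-query idea. The helper names and the exact query bodies are assumptions (not the gist's API), and the actual Elasticsearch client calls are left out so the pure parts can stand alone:

```ruby
# Query 1: ask for the 300 most relevant hits, fetching only id and
# author_id to keep the payload small. `content` field name is assumed.
def candidates_query(text, limit = 300)
  {
    query: { match: { content: text } },
    _source: ['id', 'author_id'],
    size: limit
  }
end

# Keep only the best-scored document per author; hits arrive sorted
# by _score, so uniq keeps the highest-scoring one for each author.
def unique_ids(hits)
  hits.uniq { |h| h[:author_id] }.map { |h| h[:id] }
end

# Query 2: fetch the full documents for one page of the unique ids.
def page_query(ids, page, per_page = 20)
  { query: { ids: { values: ids.slice(page * per_page, per_page) } } }
end
```

A real client would send `candidates_query`, run `unique_ids` over the hits, and then send `page_query` for the page the user requested.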


Update: the 'group_by' problem has since been addressed; you can use this feature starting with Elasticsearch 1.3.0 (#6124).

If you run the following query,

 {
   "aggs": {
     "user_count": {
       "terms": {
         "field": "author_id",
         "size": 0
       }
     }
   }
 }

you will get this result:

 {
   "took": 123,
   "timed_out": false,
   "_shards": { ... },
   "hits": { ... },
   "aggregations": {
     "user_count": {
       "doc_count_error_upper_bound": 0,
       "sum_other_doc_count": 0,
       "buckets": [
         { "key": "mark",    "doc_count": 87350 },
         { "key": "pierre",  "doc_count": 41809 },
         { "key": "william", "doc_count": 24476 }
       ]
     }
   }
 }
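The terms aggregation above only counts documents per author. To actually retrieve the single best-matching document per author (the original goal), a top_hits sub-aggregation (also available since 1.3.0) can be nested inside the terms bucket. Here is a sketch of such a request body written as a Ruby hash, as you would pass it to a client like elasticsearch-ruby; the method and bucket names, and the `content` field, are assumptions:

```ruby
# Builds a "best document per author" request body: one terms bucket
# per author_id, each carrying its single highest-scoring hit.
def best_per_author_query(text, max_authors = 20)
  {
    query: { match: { content: text } },
    size: 0, # top-level hits are not needed, only the buckets
    aggs: {
      by_author: {
        terms: { field: 'author_id', size: max_authors },
        aggs: {
          best_doc: { top_hits: { size: 1 } }
        }
      }
    }
  }
end
```

Note that paginating across buckets is still not directly supported, so the original caveats about choosing a sensible bucket limit still apply.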

Source: https://habr.com/ru/post/1494307/
