How do I run an Elasticsearch date range query when each document has multiple dates?

I use Elasticsearch to index forum threads and their reply posts. Each post has a date field associated with it. I would like to execute a query with a date range that returns the threads containing posts that match the range. I looked at using nested matching, but the docs say that this feature is experimental and might lead to inaccurate results.

What is the best way to achieve this? I am using the Java API.

1 answer

You didn't say much about your data structure, but I gather from your question that you have post objects that contain a date field and, presumably, a thread_id field, i.e. some way to identify which thread a post belongs to?

Do you also have a thread object or is your thread_id sufficient?

In any case, your stated goal is to return a list of threads that contain posts within a specific date range. This means that you need to group your results by thread (rather than returning the same thread_id once for every post that falls in the date range).
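
To make the grouping concrete, here is a minimal in-memory sketch in Java (the language mentioned in the question) of the result being asked for: one entry per thread_id, with a count of the posts that fall inside the range. It does not use Elasticsearch at all. Post and ThreadGrouping are hypothetical names made up for illustration, and the half-open range (from inclusive, to exclusive) is an assumption.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a post; the field names are assumptions.
class Post {
    final String threadId;
    final LocalDate date;

    Post(String threadId, LocalDate date) {
        this.threadId = threadId;
        this.date = date;
    }
}

class ThreadGrouping {
    // Counts posts per thread_id where gte <= date < lt, then keeps the
    // `size` most frequent thread_ids (ties keep first-seen order).
    static Map<String, Integer> threadsInRange(List<Post> posts, LocalDate gte, LocalDate lt, int size) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Post p : posts) {
            if (!p.date.isBefore(gte) && p.date.isBefore(lt)) {
                counts.merge(p.threadId, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        Map<String, Integer> top = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(size, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

Each thread appears once in the result no matter how many of its posts match, which is the behaviour you want the search engine to reproduce on the server side.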

This grouping can be done using facets.

Thus, the request in JSON will look like this:

    curl -XGET 'http://127.0.0.1:9200/posts/post/_search?pretty=1&search_type=count' -d '
    {
       "facets" : {
          "thread_id" : {
             "terms" : {
                "size" : 20,
                "field" : "thread_id"
             }
          }
       },
       "query" : {
          "filtered" : {
             "query" : {
                "text" : {
                   "content" : "any keywords to match"
                }
             },
             "filter" : {
                "numeric_range" : {
                   "date" : {
                      "lt" : "2011-02-01",
                      "gte" : "2011-01-01"
                   }
                }
             }
          }
       }
    }
    '

Note:

  • I use search_type=count because I don't actually want the posts to be returned, just the thread_ids
  • I've specified that I want the 20 most frequently occurring thread_ids (size: 20). The default would be 10
  • I use numeric_range for the date field because dates typically have many distinct values, and the numeric_range filter takes a different approach from the range filter, which makes it more efficient in this situation
  • If your thread_ids look like how-to-perform-a-date-range-elasticsearch-query, you can use these values directly. But if you have a separate thread object, you can use the multi-get API to retrieve the threads
  • Your thread_id field should be mapped as { "index": "not_analyzed" }, so that the whole value is treated as a single term rather than being analyzed into individual terms
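
Since the question mentions Java, here is a rough sketch of sending the same request from Java using only the JDK's HTTP classes, not the Elasticsearch Java client. ThreadFacetRequest, buildBody, escape and search are hypothetical helper names; the index, type and host are taken from the curl example above. The sketch assumes POST is acceptable in place of GET with a body, and it only escapes quotes and backslashes, so treat it as an illustration rather than production code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the Elasticsearch Java API.
class ThreadFacetRequest {

    // Builds the same JSON body as the curl example: a filtered query
    // (text match plus date range) with a terms facet on thread_id.
    static String buildBody(String keywords, String fromInclusive, String toExclusive, int size) {
        return "{"
            + "\"facets\":{\"thread_id\":{\"terms\":{\"field\":\"thread_id\",\"size\":" + size + "}}},"
            + "\"query\":{\"filtered\":{"
            + "\"query\":{\"text\":{\"content\":\"" + escape(keywords) + "\"}},"
            + "\"filter\":{\"numeric_range\":{\"date\":{"
            + "\"gte\":\"" + escape(fromInclusive) + "\",\"lt\":\"" + escape(toExclusive) + "\"}}}"
            + "}}}";
    }

    // Minimal JSON string escaping: backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Sends the body to the _search endpoint and returns the raw JSON response.
    static String search(String baseUrl, String body) throws IOException {
        URL url = new URL(baseUrl + "/posts/post/_search?search_type=count");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

A call might look like ThreadFacetRequest.search("http://127.0.0.1:9200", ThreadFacetRequest.buildBody("any keywords to match", "2011-01-01", "2011-02-01", 20)). The thread_ids and their counts would then need to be read out of the facets section of the returned JSON.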

Source: https://habr.com/ru/post/1380826/

