How to perform an "OR" filter in an aggregation?

Question

I am trying to retrieve the first 10 documents, grouped by domain. Each of these 10 documents should have a "crawl_date" value that is either old (the page has not been crawled for some time) or missing (the page has never been crawled, for example an empty value). I have:

    curl -XPOST 'http://localhost:9200/tester/test/_search' -d '
    {
      "size": 10,
      "aggs": {
        "group_by_domain": {
          "filter": {
            "or": [
              "term": {"crawl_date": ""},
              "term": {"crawl_date": ""} // how do I put a range here? eg <= '2014-12-31'
            ]
          },
          "terms": { "field": "domain" }
        }
      }
    }'

I am new to ES and am using version 2.2. Since the documentation is not fully updated, I am struggling.

EDIT: To clarify, I need 10 URLs that have not been crawled or have not been crawled for a while. Each of these 10 URLs must come from a unique domain, so when I crawl them, I don’t overload someone's server.

EDIT 2: So, I need something like this (1 link for each of 10 unique domains):

    1. www.domain1.com/page
    2. www.domain2.com/url
    etc...

Instead, I get only the domain and the number of pages:

    "buckets": [
      { "key": "http://www.dailymail.co.uk", "doc_count": 212 },
      { "key": "https://sedo.com", "doc_count": 196 },
      { "key": "http://www.foxnews.com", "doc_count": 118 },
      { "key": "http://data.worldbank.org", "doc_count": 117 },
      { "key": "http://detail.1688.com", "doc_count": 117 },
      { "key": "https://twitter.com", "doc_count": 112 },
      { "key": "http://search.rakuten.co.jp", "doc_count": 104 },
      { "key": "https://in.1688.com", "doc_count": 92 },
      { "key": "http://www.abc.net.au", "doc_count": 87 },
      { "key": "http://sport.lemonde.fr", "doc_count": 85 }
    ]

"hits" returns multiple pages for only one domain:

    "hits": [
      {
        "_index": "tester",
        "_type": "test",
        "_id": "http://www.barnesandnoble.com/w/at-the-edge-of-the-orchard-tracy-chevalier/1121908441?ean=9780525953005",
        "_score": 1,
        "_source": {
          "domain": "http://www.barnesandnoble.com",
          "crawl_date": "0001-01-01T00:00:00Z"
        }
      },
      {
        "_index": "tester",
        "_type": "test",
        "_id": "http://www.barnesandnoble.com/b/bargain-books/_/N-8qb",
        "_score": 1,
        "_source": {
          "domain": "http://www.barnesandnoble.com",
          "crawl_date": "0001-01-01T00:00:00Z"
        }
      },
      etc....

Barnes and Noble will quickly block my UA if I try to crawl that many pages from its domain at the same time.

I need something like this:

    1. "http://www.dailymail.co.uk/page/text.html"
    2. "https://sedo.com/another/page"
    3. "http://www.barnesandnoble.com/b/bargain-books/_/N-8qb"
    4. "http://www.starbucks.com/homepage/"
    etc.

+5

elasticsearch

user776942 Mar 20 '16 at 4:29

3 answers

I suggest you use the exists filter instead of trying to match an empty term (the missing filter is deprecated in 2.2). The range filter then helps you filter out documents that you don't need.

Finally, since you used the absolute URL as the id, be sure to aggregate on the _uid field rather than the domain field. This way you will get unique values for each page.

    curl -XPOST 'http://localhost:9200/tester/test/_search' -d '{
      "size": 10,
      "aggs": {
        "group_by_domain": {
          "filter": {
            "bool": {
              "should": [
                { "bool": { "must_not": { "exists": { "field": "crawl_date" } } } },
                { "range": { "crawl_date": { "lte": "2014-12-31T00:00:00.000" } } }
              ]
            }
          },
          "aggs": {
            "domains": { "terms": { "field": "_uid" } }
          }
        }
      }
    }'
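The "missing OR older than a cutoff" condition used in this request can also be assembled in code, which makes it easy to reuse with a different cutoff. The following is only a minimal Python sketch: it assumes the request body is built as a plain dict and sent by whatever HTTP client you already use, and the helper name `stale_or_missing_filter` is hypothetical, not part of any Elasticsearch client.

```python
import json


def stale_or_missing_filter(field, cutoff):
    """Build a bool filter matching documents where `field` is absent OR <= cutoff."""
    return {
        "bool": {
            # "should" clauses inside a filter act as a logical OR.
            "should": [
                # Clause 1: the field has no value at all (never crawled).
                {"bool": {"must_not": {"exists": {"field": field}}}},
                # Clause 2: the field is on or before the cutoff date.
                {"range": {field: {"lte": cutoff}}},
            ]
        }
    }


# Same request body as the curl example above, built from the helper.
body = {
    "size": 10,
    "aggs": {
        "group_by_domain": {
            "filter": stale_or_missing_filter("crawl_date", "2014-12-31T00:00:00.000"),
            "aggs": {"domains": {"terms": {"field": "_uid"}}},
        }
    },
}

print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d exactly like the hand-written body.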

+2

Val Mar 20 '16 at 5:11

You should use Filter Aggregation and then Sub-Aggregation

    {
      "size": 10,
      "aggs": {
        "filter_date": {
          "filter": {
            "bool": {
              "should": [
                { "bool": { "must_not": [ { "exists": { "field": "crawl_date" } } ] } },
                { "range": { "crawl_date": { "lte": "now-100d" } } }
              ]
            }
          },
          "aggs": {
            "group_by_domain": { "terms": { "field": "domain" } }
          }
        }
      }
    }

0

Richa Mar 20 '16 at 5:27

Michael Stockerl · Accepted Answer · 2016-03-29T17:11:52+0000

Using aggregations

If you want to use aggregations, I would suggest a terms aggregation to remove duplicates from your result set, with a top_hits aggregation as a sub-aggregation. top_hits returns the top-scoring hit among the documents aggregated for each domain (by default, the score of every document within a domain should be the same).

Therefore, the query will look like this:

    POST sites/page/_search
    {
      "size": 0,
      "aggs": {
        "filtered_domains": {
          "filter": {
            "bool": {
              "should": [
                { "bool": { "must_not": { "exists": { "field": "crawl_date" } } } },
                { "range": { "crawl_date": { "lte": "2016-01-01" } } }
              ]
            }
          },
          "aggs": {
            "domains": {
              "terms": { "field": "domain", "size": 10 },
              "aggs": {
                "pages": { "top_hits": { "size": 1 } }
              }
            }
          }
        }
      }
    }
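In the response to this query, each bucket of the `domains` terms aggregation carries a `pages` section with a single top hit, so the crawl list is simply one `_id` per bucket. A minimal Python sketch of that extraction, assuming the response has already been decoded into a dict; the function name `one_url_per_domain` is hypothetical and the sample data is hand-written in the shape of a terms + top_hits response, not real output:

```python
def one_url_per_domain(response):
    """Collect the _id of the first top_hits document from each domain bucket."""
    # The path mirrors the aggregation names in the query:
    # filtered_domains -> domains -> pages.
    buckets = response["aggregations"]["filtered_domains"]["domains"]["buckets"]
    urls = []
    for bucket in buckets:
        hits = bucket["pages"]["hits"]["hits"]
        if hits:  # skip a bucket that unexpectedly has no hits
            urls.append(hits[0]["_id"])
    return urls


# Hand-written sample shaped like a terms + top_hits response.
sample_response = {
    "aggregations": {
        "filtered_domains": {
            "doc_count": 3,
            "domains": {
                "buckets": [
                    {"key": "barnesandnoble.com", "doc_count": 2,
                     "pages": {"hits": {"total": 2, "hits": [
                         {"_id": "barnesandnoble.com/test2.html"}]}}},
                    {"key": "starbucks.com", "doc_count": 1,
                     "pages": {"hits": {"total": 1, "hits": [
                         {"_id": "starbucks.com/index.html"}]}}},
                ]
            },
        }
    }
}

print(one_url_per_domain(sample_response))
```

Because the terms aggregation is limited to 10 buckets and top_hits to 1 document, the list holds at most 10 URLs, each from a different domain.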

The query returns results like the following:

    "aggregations": {
      "filtered_domains": {
        "doc_count": 3,
        "domains": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "barnesandnoble.com",
              "doc_count": 2,
              "pages": {
                "hits": {
                  "total": 2,
                  "max_score": 1,
                  "hits": [
                    {
                      "_index": "test",
                      "_type": "page",
                      "_id": "barnesandnoble.com/test2.html",
                      "_score": 1,
                      "_source": { "crawl_date": "1982-05-16", "domain": "barnesandnoble.com" }
                    }
                  ]
                }
              }
            },
            {
              "key": "starbucks.com",
              "doc_count": 1,
              "pages": {
                "hits": {
                  "total": 1,
                  "max_score": 1,
                  "hits": [
                    {
                      "_index": "test",
                      "_type": "page",
                      "_id": "starbucks.com/index.html",
                      "_score": 1,
                      "_source": { "crawl_date": "1982-05-16", "domain": "starbucks.com" }
                    }
                  ]
                }
              }
            }
          ]
        }
      }
    }

Using Parent / Child Aggregation

If you can change the structure of the index, I would suggest creating an index with a parent / child relationship or nested documents.

If you do, you can select 10 different domains and get one (or more) specific pages for each domain.

Let me show you an example with parent / child (if you use Sense, you can just copy and paste):

First create mappings for documents:

    PUT /sites
    {
      "mappings": {
        "domain": {},
        "page": {
          "_parent": { "type": "domain" },
          "properties": {
            "crawl_date": { "type": "date" }
          }
        }
      }
    }

Insert multiple documents

    PUT sites/domain/barnesandnoble.com
    {}

    PUT sites/domain/starbucks.com
    {}

    PUT sites/domain/dailymail.co.uk
    {}

    POST /sites/page/_bulk
    { "index": { "_id": "barnesandnoble.com/test.html", "parent": "barnesandnoble.com" }}
    { "crawl_date": "1982-05-16" }
    { "index": { "_id": "barnesandnoble.com/test2.html", "parent": "barnesandnoble.com" }}
    { "crawl_date": "1982-05-16" }
    { "index": { "_id": "starbucks.com/index.html", "parent": "starbucks.com" }}
    { "crawl_date": "1982-05-16" }
    { "index": { "_id": "dailymail.co.uk/index.html", "parent": "dailymail.co.uk" }}
    {}

Search for URLs to crawl

    POST /sites/domain/_search
    {
      "query": {
        "has_child": {
          "type": "page",
          "query": {
            "bool": {
              "filter": {
                "bool": {
                  "should": [
                    { "bool": { "must_not": { "exists": { "field": "crawl_date" } } } },
                    { "range": { "crawl_date": { "lte": "2016-01-01" } } }
                  ]
                }
              }
            }
          },
          "inner_hits": { "size": 1 }
        }
      }
    }

We run a has_child query against the parent type, so each domain is returned only once. To get specific pages, we add inner_hits to the query, which returns the child documents that caused each parent to match. If you set the inner_hits size to 1, you get only one page per domain. You can even add sorting to the inner_hits request; for example, you can sort by crawl_date. ;)

The above search yields the following result:

    "hits": [
      {
        "_index": "sites",
        "_type": "domain",
        "_id": "starbucks.com",
        "_score": 1,
        "_source": {},
        "inner_hits": {
          "page": {
            "hits": {
              "total": 1,
              "max_score": 1.9664046,
              "hits": [
                {
                  "_index": "sites",
                  "_type": "page",
                  "_id": "starbucks.com/index.html",
                  "_score": 1.9664046,
                  "_routing": "starbucks.com",
                  "_parent": "starbucks.com",
                  "_source": { "crawl_date": "1982-05-16" }
                }
              ]
            }
          }
        }
      },
      {
        "_index": "sites",
        "_type": "domain",
        "_id": "dailymail.co.uk",
        "_score": 1,
        "_source": {},
        "inner_hits": {
          "page": {
            "hits": {
              "total": 1,
              "max_score": 1.9664046,
              "hits": [
                {
                  "_index": "sites",
                  "_type": "page",
                  "_id": "dailymail.co.uk/index.html",
                  "_score": 1.9664046,
                  "_routing": "dailymail.co.uk",
                  "_parent": "dailymail.co.uk",
                  "_source": {}
                }
              ]
            }
          }
        }
      },
      {
        "_index": "sites",
        "_type": "domain",
        "_id": "barnesandnoble.com",
        "_score": 1,
        "_source": {},
        "inner_hits": {
          "page": {
            "hits": {
              "total": 2,
              "max_score": 1.4142135,
              "hits": [
                {
                  "_index": "sites",
                  "_type": "page",
                  "_id": "barnesandnoble.com/test.html",
                  "_score": 1.4142135,
                  "_routing": "barnesandnoble.com",
                  "_parent": "barnesandnoble.com",
                  "_source": { "crawl_date": "1982-05-16" }
                }
              ]
            }
          }
        }
      }
    ]

Finally, let me point out one thing. Parent / child relationships come with a small overhead at query time. If this is not a problem for your use case, I would choose this solution.
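To turn the parent / child search result into a flat crawl list, the pages can be read out of the inner_hits section of each domain hit. A minimal Python sketch, assuming the decoded response keeps the hit list under `hits.hits` (as Elasticsearch search responses do) and that inner_hits is keyed by the child type name `page`, as in the result above; the function name `pages_from_inner_hits` is hypothetical and the sample is a hand-trimmed copy of that result:

```python
def pages_from_inner_hits(response, child_type="page"):
    """Return the _id of every child page listed under each parent's inner_hits."""
    urls = []
    for parent in response["hits"]["hits"]:
        # inner_hits is keyed by the child type name unless a custom name was given.
        inner = parent["inner_hits"][child_type]["hits"]["hits"]
        urls.extend(hit["_id"] for hit in inner)
    return urls


def _domain_hit(domain, page_id):
    """Build a trimmed parent hit with a single inner page hit (sample data only)."""
    return {
        "_type": "domain",
        "_id": domain,
        "inner_hits": {"page": {"hits": {"hits": [{"_type": "page", "_id": page_id}]}}},
    }


sample_response = {
    "hits": {
        "hits": [
            _domain_hit("starbucks.com", "starbucks.com/index.html"),
            _domain_hit("dailymail.co.uk", "dailymail.co.uk/index.html"),
            _domain_hit("barnesandnoble.com", "barnesandnoble.com/test.html"),
        ]
    }
}

print(pages_from_inner_hits(sample_response))
```

With the inner_hits size set to 1, this yields exactly one page per matching domain, which keeps the crawler from hitting any single server more than once per batch.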
