Use of aggregations
If you want to use aggregations, I would suggest a terms aggregation to remove duplicates from your result set, with a top_hits sub-aggregation that returns the best-scoring document from each domain bucket (by default, the score of every document within a domain should be the same).
Therefore, the query will look like this:
POST sites/page/_search
{
  "size": 0,
  "aggs": {
    "filtered_domains": {
      "filter": {
        "bool": {
          "should": [
            { "bool": { "must_not": { "exists": { "field": "crawl_date" } } } },
            { "range": { "crawl_date": { "lte": "2016-01-01" } } }
          ]
        }
      },
      "aggs": {
        "domains": {
          "terms": { "field": "domain", "size": 10 },
          "aggs": {
            "pages": { "top_hits": { "size": 1 } }
          }
        }
      }
    }
  }
}
This gives results like the following:
"aggregations": { "filtered_domains": { "doc_count": 3, "domains": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "barnesandnoble.com", "doc_count": 2, "pages": { "hits": { "total": 2, "max_score": 1, "hits": [ { "_index": "test", "_type": "page", "_id": "barnesandnoble.com/test2.html", "_score": 1, "_source": { "crawl_date": "1982-05-16", "domain": "barnesandnoble.com" } } ] } } }, { "key": "starbucks.com", "doc_count": 1, "pages": { "hits": { "total": 1, "max_score": 1, "hits": [ { "_index": "test", "_type": "page", "_id": "starbucks.com/index.html", "_score": 1, "_source": { "crawl_date": "1982-05-16", "domain": "starbucks.com" } } ] } } } ] } }
Using a parent/child relationship
If you can change the structure of the index, I would suggest creating an index with a parent / child relationship or nested documents.
If you do, you can select 10 distinct domains and get one (or more) specific pages for each domain.
Let me show you an example with parent/child (if you use Sense, you can simply copy and paste it):
First create mappings for documents:
PUT /sites
{
  "mappings": {
    "domain": {},
    "page": {
      "_parent": { "type": "domain" },
      "properties": {
        "crawl_date": { "type": "date" }
      }
    }
  }
}
Insert multiple documents
PUT sites/domain/barnesandnoble.com
{}
PUT sites/domain/starbucks.com
{}
PUT sites/domain/dailymail.co.uk
{}
POST /sites/page/_bulk
{ "index": { "_id": "barnesandnoble.com/test.html", "parent": "barnesandnoble.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "barnesandnoble.com/test2.html", "parent": "barnesandnoble.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "starbucks.com/index.html", "parent": "starbucks.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "dailymail.co.uk/index.html", "parent": "dailymail.co.uk" }}
{}
Search URLs to crawl
POST /sites/domain/_search
{
  "query": {
    "has_child": {
      "type": "page",
      "query": {
        "bool": {
          "filter": {
            "bool": {
              "should": [
                { "bool": { "must_not": { "exists": { "field": "crawl_date" } } } },
                { "range": { "crawl_date": { "lte": "2016-01-01" } } }
              ]
            }
          }
        }
      },
      "inner_hits": { "size": 1 }
    }
  }
}
We run a has_child query against the parent type, so we get only distinct domains back. To get specific pages, we add inner_hits, which returns the child documents that caused each parent to match. If you set the inner_hits size to 1, you get only one page per domain. You can even add sorting to inner_hits; for example, you can sort by crawl_date. ;)
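The sorting idea can be sketched as follows. This is a hedged Python example that only builds the request body as a dict and prints it as JSON; it does not talk to a cluster. The function name `build_crawl_query` and its parameters are hypothetical, and the choice to put pages without a crawl_date first (via the sort option "missing": "_first") is an assumption about the desired crawl order, not something stated above.

```python
import json


def build_crawl_query(cutoff="2016-01-01", pages_per_domain=1):
    """Build the has_child search body with inner_hits sorted by crawl_date."""
    # Same condition as in the query above: never crawled, or crawled
    # on or before the cutoff date.
    needs_crawl = {
        "bool": {
            "should": [
                {"bool": {"must_not": {"exists": {"field": "crawl_date"}}}},
                {"range": {"crawl_date": {"lte": cutoff}}},
            ]
        }
    }
    return {
        "query": {
            "has_child": {
                "type": "page",
                "query": {"bool": {"filter": needs_crawl}},
                "inner_hits": {
                    "size": pages_per_domain,
                    # Oldest crawl_date first; pages that have no crawl_date
                    # are placed first (assumed crawl priority).
                    "sort": [
                        {"crawl_date": {"order": "asc", "missing": "_first"}}
                    ],
                },
            }
        }
    }


# Print the body so it can be pasted after POST /sites/domain/_search
print(json.dumps(build_crawl_query(), indent=2))
```

Keeping the cutoff and the number of pages per domain as parameters makes it easy to reuse the same body for different crawl runs.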
The above search yields the following result:
"hits": [ { "_index": "sites", "_type": "domain", "_id": "starbucks.com", "_score": 1, "_source": {}, "inner_hits": { "page": { "hits": { "total": 1, "max_score": 1.9664046, "hits": [ { "_index": "sites", "_type": "page", "_id": "starbucks.com/index.html", "_score": 1.9664046, "_routing": "starbucks.com", "_parent": "starbucks.com", "_source": { "crawl_date": "1982-05-16" } } ] } } } }, { "_index": "sites", "_type": "domain", "_id": "dailymail.co.uk", "_score": 1, "_source": {}, "inner_hits": { "page": { "hits": { "total": 1, "max_score": 1.9664046, "hits": [ { "_index": "sites", "_type": "page", "_id": "dailymail.co.uk/index.html", "_score": 1.9664046, "_routing": "dailymail.co.uk", "_parent": "dailymail.co.uk", "_source": {} } ] } } } }, { "_index": "sites", "_type": "domain", "_id": "barnesandnoble.com", "_score": 1, "_source": {}, "inner_hits": { "page": { "hits": { "total": 2, "max_score": 1.4142135, "hits": [ { "_index": "sites", "_type": "page", "_id": "barnesandnoble.com/test.html", "_score": 1.4142135, "_routing": "barnesandnoble.com", "_parent": "barnesandnoble.com", "_source": { "crawl_date": "1982-05-16" } } ] } } } } ]
Finally, let me point out one thing. Parent/child relationships add a small overhead at query time. If this is not a problem for your use case, I would choose this solution.