Elasticsearch aggregation on a URL field

I am indexing documents with a field containing a URL:

[
    'myUrlField' => 'http://google.com/foo/bar'
]

What I would like to get from Elasticsearch is an aggregation on the URL field.

curl -XGET 'http://localhost:9200/myIndex/_search?pretty' -d '{
  "facets": {
    "groupByMyUrlField": {
      "terms": {
        "field": "myUrlField"
      }
    }
  }
}'

This works, but the default analyzer tokenizes the field so that each part of the URL becomes a separate token, and I get buckets for http, google.com, foo and bar. What I actually want is just the host name, google.com.

Can I use facets to group by a specific token?

"field": "myUrlField.0"

or something like that?

Mapping the field as not_analyzed is not suitable either, because I want to group by host name, not by unique URL.

I would like to be able to do this in elasticsearch, and not in my client code. Thanks


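You can index only the domain part of the URL:

To keep just the domain from the URL, use the keyword tokenizer (so the whole value stays a single token, as with not_analyzed), then apply a pattern_capture token filter that captures the host name. Set preserve_original to false so the full URL is not kept as a token.

Index settings and mapping: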

{
  "settings": {
    "analysis": {
      "filter": {
        "capture_domain_filter": {
          "type": "pattern_capture",
          "preserve_original": false,
          "flags": "CASE_INSENSITIVE",
          "patterns": [
            "https?:\/\/([^/]+)"
          ]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_domain_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "weblink": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "domain_analyzer"
        }
      }
    }
  }
}

With this analyzer, a URL is reduced to its domain:

curl -sXGET http://localhost:9200/url_analyzer/_analyze\?analyzer\=domain_analyzer\&pretty -d 'http://en.wikipedia.org/wiki/Wikipedia' | grep token
  "tokens" : [ {
    "token" : "en.wikipedia.org",
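
To see what the analyzer does without a running cluster, the same regular expression can be tried offline. This is only a rough Python sketch of the keyword tokenizer plus pattern_capture combination, not the Elasticsearch implementation; the function name `capture_domain` is hypothetical, and the fallback for non-matching input is an assumption.

```python
import re

# Same pattern as "capture_domain_filter" above; re.IGNORECASE mirrors
# the CASE_INSENSITIVE flag from the filter settings.
DOMAIN_PATTERN = re.compile(r"https?://([^/]+)", re.IGNORECASE)


def capture_domain(url: str) -> str:
    """Treat the whole URL as one token and keep only the captured host part."""
    match = DOMAIN_PATTERN.search(url)
    # Assumption: when the pattern does not match, the input stays as the token.
    return match.group(1) if match else url


print(capture_domain("http://en.wikipedia.org/wiki/Wikipedia"))
```

Note that the captured group keeps everything up to the first slash, so a port number or user info in the URL would become part of the token.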

After indexing a few URLs, a terms aggregation on the field returns buckets per domain (not per full URL).

curl -XGET "http://localhost:9200/url_analyzer/_search?pretty" -d'
{
  "aggregations": {
    "tokens": {
      "terms": {
        "field": "url"
      }
    }
  }
}'

Result:

"aggregations" : {
    "tokens" : {
      "buckets" : [ {
        "key" : "en.wikipedia.org",
        "doc_count" : 2
      }, {
        "key" : "www.elasticsearch.org",
        "doc_count" : 1
      } ]
    }
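
The bucket counts above can be reproduced offline as a sanity check. The following Python sketch only imitates a terms aggregation over the captured domains; the helper `domain_buckets` and the sample URLs are hypothetical and are not the documents from the original index.

```python
import re
from collections import Counter

# Same pattern as in the "capture_domain_filter" settings.
DOMAIN_PATTERN = re.compile(r"https?://([^/]+)", re.IGNORECASE)


def domain_buckets(urls):
    """Count documents per captured domain, most frequent first."""
    counts = Counter()
    for url in urls:
        match = DOMAIN_PATTERN.search(url)
        if match:
            counts[match.group(1)] += 1
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


# Hypothetical sample documents, chosen to mirror the response shape above.
sample_urls = [
    "http://en.wikipedia.org/wiki/Wikipedia",
    "http://en.wikipedia.org/wiki/Elasticsearch",
    "http://www.elasticsearch.org/guide/",
]
print(domain_buckets(sample_urls))
```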


Source: https://habr.com/ru/post/1542024/

