Changing the default analyzer in ElasticSearch or LogStash

I have data coming from Logstash that is being analyzed too aggressively: a field value like "OS X 10.8" gets broken down into the separate terms "OS", "X" and "10.8". I know that I could just change the display and reindex the existing data, but how do I change the default analyzer (either in Elasticsearch or Logstash) to avoid this problem for future data?

Concrete Solution: I created a mapping for the type before I first sent the data to the new cluster.

IRC Solution: Create an Index Template
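
For illustration, a minimal sketch of the index-template approach, mapping the affected field as not_analyzed so it is kept as a single term; the template name logstash_os_template, the logstash-* index pattern, the logs type and the os field are hypothetical placeholders, not names from my setup:

PUT /_template/logstash_os_template
{
  "template": "logstash-*",
  "mappings": {
    "logs": {
      "properties": {
        "os": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}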

+6
2 answers

As you know, Elasticsearch applies the standard analyzer if no analyzer is explicitly specified. So when setting up templates, you can configure your own analyzer called standard, and there set your own rules for the analyzer, tokenizer and token filters.
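
For illustration, a minimal sketch of what such a template might look like; the template name, the logstash-* index pattern and the choice of a keyword tokenizer are assumptions made up for this example, and the analyzer is registered under the name default for the reason spelled out in the second answer below. A keyword tokenizer would keep a value like "OS X 10.8" as a single token:

PUT /_template/logstash_default_analyzer
{
  "template": "logstash-*",
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}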

Here are some useful links to help you better understand:

http://elasticsearch-users.115913.n3.nabble.com/How-we-can-change-Elasticsearch-default-analyzer-td4040411.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html

+7

According to this page, analyzers can be specified for each query, for each field, or for each index.

At index time, Elasticsearch looks for an analyzer in the following order:

  • The analyzer defined in the field mapping.
  • An analyzer named default in the index settings.
  • The standard analyzer.

At query time, there are a few more layers (a combined example follows this list):

  • The analyzer defined in the full-text query itself.
  • The search_analyzer defined in the field mapping.
  • The analyzer defined in the field mapping.
  • An analyzer named default_search in the index settings.
  • An analyzer named default in the index settings.
  • The standard analyzer.
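
To make those layers concrete, here is a minimal sketch that combines them; the index name my_index, the logs type and the os field are invented for this example. The index settings define default and default_search analyzers, the field mapping sets analyzer and search_analyzer, and the query overrides the analyzer once more:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default":        { "type": "custom", "tokenizer": "whitespace" },
        "default_search": { "type": "custom", "tokenizer": "whitespace" }
      }
    }
  },
  "mappings": {
    "logs": {
      "properties": {
        "os": {
          "type": "string",
          "analyzer": "keyword",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}

GET /my_index/_search
{
  "query": {
    "match": {
      "os": {
        "query": "OS X 10.8",
        "analyzer": "keyword"
      }
    }
  }
}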

On the other hand, this page points out an important detail:

An analyzer is registered under a logical name. It can then be referenced from mapping definitions or certain APIs. When none are defined, defaults are used. There is an option to define which analyzers will be used by default when none can be derived.

Thus, the only way to set a custom analyzer as the default is to override one of the predefined analyzers, in this case the default analyzer. This means we cannot use an arbitrary name for our analyzer; it must be called default.

Here is a simple example of the index settings:

 { "settings": { "number_of_shards": 1, "number_of_replicas": 0, "analysis": { "char_filter": { "charMappings": { "type": "mapping", "mappings": [ "\\u200C => " ] } }, "filter": { "persian_stop": { "type": "stop", "stopwords_path": "stopwords.txt" } }, "analyzer": { "default": {<--------- analyzer name must be default "tokenizer": "standard", "char_filter": [ "charMappings" ], "filter": [ "lowercase", "arabic_normalization", "persian_normalization", "persian_stop" ] } } } } } 
+6

Source: https://habr.com/ru/post/957380/

