Does ElasticSearch support Unicode / Chinese?

I am searching text with ElasticSearch and have a problem with the term query. What I do below is basically:

  • Index a document with a Chinese string (你好).
  • Run a text query, which returns the document.
  • Run a term query, which returns nothing.

Why is this happening, and how can I solve it?

➜ curl -XPOST 'http://localhost:9200/test/test/' -d '{ "name" : "你好" }'
{ "ok": true, "_index": "test", "_type": "test", "_id": "VdV8K26-QyiSCvDrUN00Nw", "_version": 1 }

➜ curl -XGET 'http://localhost:9200/test/test/_mapping?pretty=1'
{
  "test" : {
    "properties" : {
      "name" : { "type" : "string" }
    }
  }
}

➜ curl -XGET 'http://localhost:9200/test/test/_search?pretty=1'
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      { "_index": "test", "_type": "test", "_id": "VdV8K26-QyiSCvDrUN00Nw", "_score": 1.0, "_source": { "name": "你好" } }
    ]
  }
}

➜ curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{ "query": { "text": { "name": "你好" } } }'
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.8838835,
    "hits": [
      { "_index": "test", "_type": "test", "_id": "VdV8K26-QyiSCvDrUN00Nw", "_score": 0.8838835, "_source": { "name": "你好" } }
    ]
  }
}

➜ curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{ "query": { "term": { "name": "你好" } } }'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] }
}
2 answers

From the ElasticSearch docs on the term query:

Matches documents that have fields that contain a term (not analyzed).

The name field is analyzed by default, so its full value is not stored as a single term. The term query does not analyze its input and only matches exact indexed terms, so it cannot find the document. You can index another document with a non-Chinese name that the analyzer splits into several tokens (a multi-word phrase, for example), and a term query for the whole value will not find it either. If you are wondering why the following query does return results:

 curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{"query" : {"term" : { "name" : "好" }}}' 

It is because each indexed token is itself a not-analyzed term as far as the term query is concerned. If you index a document with the name "你好吗", a term query for "好吗" or "你好" will still not find it, but a term query for "你", "好", or "吗" will.

For Chinese, you may need to pay special attention to the analyzer you use. For me the standard analyzer seems good enough, though: it tokenizes Chinese phrases character by character rather than by whitespace.
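
To make the difference concrete, here is a toy model in Python of the behavior described above. It is only a sketch, not Elasticsearch code: `analyze`, `ToyIndex`, `term_query`, and `text_query` are hypothetical names, and the analyzer simply emits one token per character, which imitates how the answer says Chinese text is split. The `analyzed=False` case stands in for a field whose whole value is indexed as a single term (in Elasticsearch of that era, a string field mapped with "index": "not_analyzed").

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: one lowercased token per non-space character.

    This only imitates per-character splitting of Chinese text;
    it is not Elasticsearch's standard analyzer.
    """
    return [ch.lower() for ch in text if not ch.isspace()]


class ToyIndex:
    """Tiny inverted index for a single field (hypothetical helper)."""

    def __init__(self, analyzed=True):
        self.analyzed = analyzed
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, value):
        # Analyzed fields index each token; otherwise the whole value is one term.
        terms = analyze(value) if self.analyzed else [value]
        for term in terms:
            self.postings[term].add(doc_id)

    def term_query(self, term):
        # No analysis: the input must equal an indexed term exactly.
        return set(self.postings.get(term, set()))

    def text_query(self, text):
        # Analyze the input first, then match any of the resulting tokens.
        hits = set()
        for token in analyze(text):
            hits |= self.postings.get(token, set())
        return hits


index = ToyIndex(analyzed=True)
index.add(1, "你好")

print(index.term_query("你好"))  # no hits: "你好" was never indexed as one term
print(index.term_query("好"))    # hits document 1: "好" is an indexed token
print(index.text_query("你好"))  # hits document 1: the query is analyzed too

raw = ToyIndex(analyzed=False)
raw.add(1, "你好")
print(raw.term_query("你好"))    # hits document 1: the whole value is one term
```

In this model, the two ways out of the problem are the same as in the thread: either query with something that analyzes its input (the text query), or index the field without analysis so that the exact value is a term.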


The default analyzer is not well suited to Asian languages. Try an analyzer built for Chinese, such as this one: https://github.com/elasticsearch/elasticsearch-analysis-smartcn


Source: https://habr.com/ru/post/957822/

