UTF8 encoding longer than maximum length 32766

I upgraded my Elasticsearch cluster from 1.1 to 1.2, and now I get errors when indexing a few large documents.

{ "error": "IllegalArgumentException[Document contains at least one immense term in field=\"response_body\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[7b 22 58 48 49 5f 48 6f 74 65 6c 41 76 61 69 6c 52 53 22 3a 7b 22 6d 73 67 56 65 72 73 69]...']", "status": 500 } 

The index template:

 { "template": "partner_requests-*", "settings": { "number_of_shards": 1, "number_of_replicas": 1 }, "mappings": { "request": { "properties": { "asn_id": { "index": "not_analyzed", "type": "string" }, "search_id": { "index": "not_analyzed", "type": "string" }, "partner": { "index": "not_analyzed", "type": "string" }, "start": { "type": "date" }, "duration": { "type": "float" }, "request_method": { "index": "not_analyzed", "type": "string" }, "request_url": { "index": "not_analyzed", "type": "string" }, "request_body": { "index": "not_analyzed", "type": "string" }, "response_status": { "type": "integer" }, "response_body": { "index": "not_analyzed", "type": "string" } } } } } 

I searched the documentation and did not find anything about a maximum field size. Looking at the core types section, I do not understand why I should "correct the analyzer" for a field that is not_analyzed.

+46
elasticsearch
Jun 03 '14 at 16:06
8 answers

So you are running into the maximum size for a single term. When you set a field to not_analyzed, its whole value is treated as a single term. The maximum size for a single term in the underlying Lucene index is 32766 bytes, which I believe is hard coded.

Your two primary options are to either change the type to binary or to continue using string but set the index type to "no".
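As a rough sketch of the second option, applied to the response_body field from the question (not a snippet from this answer; check it against the Elasticsearch 1.x mapping docs):

    {
      "mappings": {
        "request": {
          "properties": {
            "response_body": { "type": "string", "index": "no" }
          }
        }
      }
    }

The binary alternative would instead declare "response_body": { "type": "binary" }; either way you give up searching on that field, but the value stays retrievable from _source.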

+51
Jun 03 '14 at 18:07

If you really want the property to be not_analyzed because you want to do exact filtering on it, you can use "ignore_above": 256.

Here is an example of how I use it in PHP:

 'mapping' => [ 'type' => 'multi_field', 'path' => 'full', 'fields' => [ '{name}' => [ 'type' => 'string', 'index' => 'analyzed', 'analyzer' => 'standard', ], 'raw' => [ 'type' => 'string', 'index' => 'not_analyzed', 'ignore_above' => 256, ], ], ], 

In your case you probably want to do what John Petrone suggested and set "index": "no", but for anyone else who finds this question after searching for this exception, like I did, your options are:

  • set "index": "no"
  • set "index": "analyze"
  • set "index": "not_analyzed" and "ignore_above": 256

Which option you choose depends on whether, and how, you want to search or filter on this property.
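A minimal sketch of the third option, applied to the response_body field from the question (the 256 limit is just the example value used above; tune it to your data):

    {
      "mappings": {
        "request": {
          "properties": {
            "response_body": {
              "type": "string",
              "index": "not_analyzed",
              "ignore_above": 256
            }
          }
        }
      }
    }

With ignore_above, values longer than the limit are silently left out of the index but remain in _source, so they still come back with the documents in search results.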

+24
May 29 '15 at 6:49

There is a better option than the one John posted, because with that solution you can no longer search on the value.

Back to the problem:

The problem is that, by default, a field's value is used as a single term (the whole string). If that term/string is longer than 32766 bytes, it cannot be stored in Lucene.

Older versions of Lucene only registered a warning when terms were too long (and ignored the value). Newer versions throw an exception. See the fix: https://issues.apache.org/jira/browse/LUCENE-5472

Solution:

The better option is to define a (custom) analyzer on the field with the long string value. The analyzer can split the long string into smaller strings/terms, which avoids producing terms that are too long.

Do not forget to also add an analyzer to the "_all" field if you use that feature.

Analyzers can be tested using the analyze REST API: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
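As a hedged sketch of this approach (the analyzer name my_long_text_analyzer and its filter chain are made up for illustration; the standard tokenizer already caps individual tokens well below Lucene's 32766-byte limit):

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_long_text_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase"]
            }
          }
        }
      },
      "mappings": {
        "request": {
          "properties": {
            "response_body": {
              "type": "string",
              "analyzer": "my_long_text_analyzer"
            }
          }
        }
      }
    }

You can then check which terms it produces with something like curl -XGET 'localhost:9200/your_index/_analyze?analyzer=my_long_text_analyzer' -d 'some long text' before reindexing.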

+6
Mar 03 '15 at 12:13

I needed to change the index part of the mapping to no instead of not_analyzed. That way the value is not indexed. It is still available in the returned document (from a search, a get, ...), but I cannot query it.

+2
Jun 03 '14 at 16:47

I got around this problem by changing my analyzer.

 { "index" : { "analysis" : { "analyzer" : { "standard" : { "tokenizer": "standard", "filter": ["standard", "lowercase", "stop"] } } } } } 
0
Mar 01 '16 at 23:54

If you are using searchkick, upgrade Elasticsearch to >= 2.2.0 and make sure you are using searchkick 1.3.4 or later.

This version of searchkick sets ignore_above = 256 by default, so you will not get this error when the UTF-8 encoding of a value is longer than 32766 bytes.

It is discussed here.

0
Sep 12 '16 at 6:41

In Solr v6+, I changed the field type to text_general and it solved my problem.

 <field name="body" type="string" indexed="true" stored="true" multiValued="false"/> <field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/> 
0
Oct 13 '17 at 6:12

When using Logstash to index these long messages, I use this filter to truncate the long string:

    filter {
        ruby {
            code => "event.set('message_size', event.get('message').bytesize) if event.get('message')"
        }
        ruby {
            code => "
                if (event.get('message_size'))
                    event.set('message', event.get('message')[0..9999]) if event.get('message_size') > 32000
                    event.tag 'long message' if event.get('message_size') > 32000
                end
            "
        }
    }

It adds a message_size field so that I can sort the longest messages by size.

It also adds a long message tag to the events that exceed 32,000 bytes, so I can easily select them.

This does not solve the problem if you actually want to index these long messages in full, but if, like me, you do not want them in Elasticsearch in the first place and just want to track them down so you can fix them, it is a working solution.

0
Oct 31 '17 at 4:31


