Duplicate documents in the Elasticsearch index with the same _uid

We found several duplicate documents in one of our Elasticsearch indices, and we cannot figure out the cause. There are two copies of each affected document, with exactly the same _id, _type and _uid.

A GET request to /index-name/document-type/document-id returns just one copy, but a search for the document returns two results, which is pretty surprising:

    POST /index-name/document-type/_search
    {
      "filter": {
        "term": {
          "_id": "document-id"
        }
      }
    }
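For completeness, this is the direct lookup by id that returns only a single copy (index, type and id names are placeholders, as above):

    GET /index-name/document-type/document-id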

Aggregating on the _uid field also identifies the duplicate documents:

    POST /index-name/_search
    {
      "size": 0,
      "aggs": {
        "duplicates": {
          "terms": {
            "field": "_uid",
            "min_doc_count": 2
          }
        }
      }
    }

The duplicates are on different shards. For example, a document can have one copy on primary shard 0 and one copy on primary shard 1. We checked this by running the aggregation query above against each shard in turn, using the preference parameter: it does not find any duplicates within a single shard.
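For reference, this is roughly what the per-shard check looks like: the preference parameter accepts a _shards value that restricts the search to a single shard (shard 0 here is just an example; we repeated this for each of the three shards):

    POST /index-name/_search?preference=_shards:0
    {
      "size": 0,
      "aggs": {
        "duplicates": {
          "terms": {
            "field": "_uid",
            "min_doc_count": 2
          }
        }
      }
    }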

Our best guess is that something has gone wrong with routing, but we don't understand how the two copies could have been routed to different shards. According to the routing documentation, the default routing is based on the document id and should consistently route a document to the same shard.
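As a side note, one way to see which shard a given id should map to under the default routing is the search shards API (the id here is a placeholder; with no custom routing, the routing value is simply the document id):

    GET /index-name/_search_shards?routing=document-id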

We are not using custom routing parameters that would override the default routing. We double-checked this by making sure that the duplicate documents do not have a _routing field.

We also do not define any parent/child relationships, which would also affect routing. (See this question on the Elasticsearch forum, for example, which has the same symptoms as our problem. We don't think the cause is the same, because we don't set parents on our documents.)
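A minimal sketch of such a check, assuming the documented 2.x behaviour that the _routing metadata field is accessible in aggregations: a terms aggregation on _routing should return no buckets at all when no document was indexed with a custom routing value.

    POST /index-name/_search
    {
      "size": 0,
      "aggs": {
        "routing_values": {
          "terms": {
            "field": "_routing"
          }
        }
      }
    }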

We fixed the immediate problem by reindexing into a new index, which collapsed the duplicate documents. We have kept the old index around for debugging.
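For reference, a minimal sketch of the reindexing step, assuming the _reindex API that ships with ES 2.3+ (the destination index name is a placeholder): because documents with the same _id overwrite each other in the destination, the duplicates collapse into a single copy.

    POST /_reindex
    {
      "source": {
        "index": "index-name"
      },
      "dest": {
        "index": "index-name-v2"
      }
    }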

We have not found a way to reproduce the problem. The new index is indexing documents correctly, and we tried re-running the nightly processing job that updates documents, but it did not create any more duplicates.

The cluster has 3 nodes, 3 primary shards and 1 replica (i.e. 3 replica shards). minimum_master_nodes is set to 2, which should prevent split-brain. We are running Elasticsearch 2.4 (which, we know, is old - we plan to upgrade soon).
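For reference, this is the quorum setting we mean; it can be set in elasticsearch.yml or applied dynamically (a sketch, with 2 = 3 master-eligible nodes / 2 + 1):

    PUT /_cluster/settings
    {
      "persistent": {
        "discovery.zen.minimum_master_nodes": 2
      }
    }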

Does anyone know what might cause these duplicates? Do you have any suggestions for debugging it?

1 answer

We found the answer! The problem was that the index had unexpectedly switched the hashing algorithm it uses for routing, which caused some updated documents to be stored on different shards from their original versions.

A GET request to /index-name/_settings showed the following:

 "version": { "created": "1070599", "upgraded": "2040699" }, "legacy": { "routing": { "use_type": "false", "hash": { "type": "org.elasticsearch.cluster.routing.DjbHashFunction" } } } 

"1070599" refers to Elasticsearch 1.7 and "2040699" refers to ES 2.4.

It looks like the index tried to upgrade itself from 1.7 to 2.4, even though it was already running on 2.4. This is the problem described here: https://github.com/elastic/elasticsearch/issues/18459#issuecomment-220313383

Here is what we think caused the change:

  • When we upgraded the index from ES 1.7 to 2.4, we decided not to upgrade Elasticsearch in place, as that would have meant downtime. Instead, we created a separate ES 2.4 cluster.

    We loaded the data into the new cluster using a tool that copied all of the index settings along with the data, including the version setting, which you should not set in ES 2.4 (see the sketch after this list).

  • While investigating a recent issue, we happened to close and reopen the index. This normally preserves all the data, but because of the incorrect version setting, it made Elasticsearch think that an upgrade had been carried out.

  • ES automatically set the legacy.routing.hash.type setting because of this spurious upgrade. This meant that any data indexed after that point used the legacy DjbHashFunction instead of the default Murmur3HashFunction that had originally been used to route the data.
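As an illustration of those last two points (a sketch, not exactly what our migration tool did): one way to avoid carrying the bad setting over is to create the destination index explicitly with only the settings you actually need, and let Elasticsearch fill in index.version.created itself. The index name is a placeholder; the shard counts match our cluster layout above.

    PUT /index-name-v2
    {
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
      }
    }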

This means that reindexing the data into a new index was the right way to fix the problem. The new index has the correct version setting and no legacy hash function settings:

 "version": { "created": "2040699" } 