Apply a tag to millions of documents using bulk / update methods

We have about 55,000,000 documents in our ElasticSearch instance. We have CSV files of user_ids; the largest CSV has 9M entries. Our documents are keyed by user_id, so this is convenient.

I'm posting this question because I want to discuss the options and find the best one, since there are several ways to tackle this problem. We need to add a new "label" to a document if the user's document does not already have it, e.g. tagging a user with "stackoverflow" or "github".

  • There is the classic partial update endpoint. This sounds way too slow, since we would need to iterate over 9M user_ids and issue an API call for each of them (a sketch of such a call is shown after this list).
  • There is the bulk API, which gives better performance, but is limited to roughly 1000-5000 documents per call, and knowing when a batch is too big is knowledge you have to pick up as you go.
  • Then there is the official open issue for an /update_by_query endpoint, which gets a lot of traffic but no confirmation that it has been implemented in a standard release.
  • That open issue mentions an update_by_query plugin, which should handle this somewhat better, but there are old, still-open issues where users complain about performance and memory problems.
  • I'm not sure whether this is possible in ES, but my idea was to load all the CSV entries into a separate index, somehow join the two indices, and apply a script that adds the tag if it doesn't exist yet.
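
For reference, the classic partial update from the first option is one call like this per user_id (a sketch; the index/type names are invented, and it assumes the document _id is the user_id):

 curl -XPOST 'localhost:9200/users/user/12345/_update' -d '{
   "doc": { "label": "github" }
 }'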

So the question remains: what is the best way to do this? And if any of you have done this in the past, please share your numbers/performance and what you would do differently this time.

+6
5 answers

Using the aforementioned update_by_query plugin, you simply call:

 curl -XPOST localhost:9200/index/type/_update_by_query -d '{
   "query": {
     "filtered": {
       "filter": {
         "not": { "term": { "tag": "github" } }
       }
     }
   },
   "script": "ctx._source.label = \"github\""
 }'

The update_by_query plugin only accepts a script, not a partial document.
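
Since only a script is accepted, appending to a tags array instead of setting a single field would look something like this sketch (it assumes a tags array field and Groovy list semantics, neither of which comes from the thread):

 curl -XPOST localhost:9200/index/type/_update_by_query -d '{
   "query": {
     "filtered": {
       "filter": {
         "not": { "term": { "tags": "github" } }
       }
     }
   },
   "script": "if (ctx._source.tags == null) { ctx._source.tags = [\"github\"] } else { ctx._source.tags += \"github\" }"
 }'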

As for the performance and memory issues, I think the best thing to do is simply try it out.

+2

While waiting for update_by_query support, I opted to scan over the matching documents with the scan/scroll API and send partial updates back through the bulk API, as in the snippet below.

Additionally, I store the tag data (your CSV) in a separate document type and query it to tag all new documents as they are created, i.e. there is no need to index first and then update.

Python snippet to illustrate the approach:

 def actiongen():
     docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
     for doc in docs:
         yield {
             '_op_type': 'update',
             '_index': doc['_index'],
             '_type': doc['_type'],
             '_id': doc['_id'],
             'doc': {'tags': tags},
         }

 helpers.bulk(es, actiongen(), index=args.index, stats_only=True)
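
For the CSV-of-user_ids case from the question, the same helpers can drive the updates keyed directly by user_id, with no scan at all, provided user_id is the document _id; a sketch, where the index/type names and that _id assumption are mine, not from the thread:

 import csv
 from elasticsearch import Elasticsearch, helpers

 es = Elasticsearch()

 def update_actions(csv_path, tag):
     with open(csv_path) as f:
         for row in csv.reader(f):
             yield {
                 '_op_type': 'update',
                 '_index': 'users',  # hypothetical index name
                 '_type': 'user',    # hypothetical type name
                 '_id': row[0],      # assumes user_id is the document _id
                 'doc': {'label': tag},
             }

 helpers.bulk(es, update_actions('user_ids.csv', 'github'), stats_only=True)

helpers.bulk chunks the action stream itself (500 actions per request by default), which takes care of the batch-size guessing mentioned in the question.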
+3

I would go with the bulk API, with the caveat that you should try to update each document a minimal number of times. Updates are just atomic delete-and-adds, and the deleted document hangs around as a tombstone until it is merged away.

Sending a Groovy script to perform the update probably makes the most sense here, so you don't have to fetch the document first.
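
A scripted update along those lines might look like the following sketch. The tags field, index, type, and id are invented, the documents are assumed to already carry a tags array, and setting ctx.op to "none" is what skips the write, and therefore the tombstone, for documents that already have the tag:

 curl -XPOST 'localhost:9200/index/type/some_id/_update' -d '{
   "script": "if (ctx._source.tags.contains(tag)) { ctx.op = \"none\" } else { ctx._source.tags += tag }",
   "params": { "tag": "github" }
 }'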

0

Could you create a parent/child relationship, where you add a "tags" type that references your "posts" type as its parent? That way you wouldn't need to perform a full reindex of your data: just index each of the appropriate tags against the corresponding post ID.
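
A sketch of such a mapping, with invented index and type names:

 curl -XPUT 'localhost:9200/myindex/_mapping/tags' -d '{
   "tags": {
     "_parent": { "type": "posts" },
     "properties": {
       "tag": { "type": "string", "index": "not_analyzed" }
     }
   }
 }'

Each tag is then indexed as a child document with ?parent=<post_id>, and posts carrying a given tag can be found with a has_child query.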

0

Very old thread. I landed on the GitHub issue for implementing update_by_query to check whether it had made it into version 2.0, but unfortunately it hadn't. Thanks to Teka's plugin, small updates are very doable, but our use case was updating millions of documents daily based on certain complex queries. In the end, we moved to the es-hadoop connector. The infrastructure is a big overhead here, but parallelizing the fetch/update/insert pipeline through Spark helped us anyway. If anyone has discovered any other suggestions :) in the past year, I would love to hear about them.

0
