Is there a predictable order when scanning / scrolling through an Elasticsearch index?

When scanning an Elasticsearch index, it is not possible to apply any sorting, according to the documentation. But is the order of the results during a scan defined in any way? If so, is it predictable?

Background Information:

I need to run an operation on 5M documents, and each batch of 1000 documents takes about 1 minute to process. Since I can’t guarantee that the process will run to completion each time it is started, I would like it to resume where it was last interrupted. For example, if the scroll results were sorted by ID (I know this is not the case), I would track the last processed ID in my code, and on the next start I would skip every document with ID <= lastProcessedId, so that each document is processed exactly once.

Btw: by “document processing” I don’t mean writing additional information back to the index, but updating some other things in my database. Writing a timestamp into the indexed document would not help in my case, since one of the reasons the process may be interrupted is that the index gets replaced with a new one (rebuilt from scratch). Writing a "processed" timestamp to the database is not a desirable option for me either, because iteration performance is the reason I use the index for scrolling in the first place.

1 answer

No, the sort order is not predictable. I was going to suggest using timestamps, but then I read the rest of your question :)

Indeed, the only way to make a scanning search resumable is to split your documents into slices on some field, such as a timestamp or an identifier, and use a range query to scroll through only one slice at a time.
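
To illustrate the slicing idea, here is a minimal sketch in Python. It makes several assumptions that are not in the question: the documents carry a numeric field (called `doc_id` here, a hypothetical name) whose min and max are known, completed slices are remembered in a local JSON file, and `fetch` is a hypothetical callable that runs the scan for one query body and yields documents. With the official Python client, `fetch` could plausibly wrap `elasticsearch.helpers.scan`, but that wiring is left out so the sketch runs without a cluster.

```python
import json
import os
from typing import Callable, Iterable, List, Set, Tuple

Slice = Tuple[int, int]  # half-open ID range [lo, hi)


def make_slices(min_id: int, max_id: int, size: int) -> List[Slice]:
    """Split the inclusive range [min_id, max_id] into half-open slices."""
    return [(lo, min(lo + size, max_id + 1))
            for lo in range(min_id, max_id + 1, size)]


def slice_query(field: str, lo: int, hi: int) -> dict:
    """Build a range query body that matches only one slice."""
    return {"query": {"range": {field: {"gte": lo, "lt": hi}}}}


def load_done(path: str) -> Set[Slice]:
    """Read the set of completed slices; empty if no checkpoint exists yet."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {(lo, hi) for lo, hi in json.load(f)}


def mark_done(path: str, done: Set[Slice], finished: Slice) -> None:
    """Record a finished slice, replacing the checkpoint file atomically."""
    done.add(finished)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def run(slices: Iterable[Slice],
        fetch: Callable[[dict], Iterable],
        process: Callable[[object], None],
        checkpoint_path: str,
        field: str = "doc_id") -> None:
    """Process every slice that is not yet recorded as completed."""
    done = load_done(checkpoint_path)
    for lo, hi in slices:
        if (lo, hi) in done:
            continue  # finished in an earlier run
        for doc in fetch(slice_query(field, lo, hi)):
            process(doc)
        # Only reached if the whole slice was processed without error.
        mark_done(checkpoint_path, done, (lo, hi))
```

Note that a slice interrupted halfway is processed again from its start on the next run, so either the processing step has to tolerate repeats or the slices should be small enough that redoing one is cheap. Because the checkpoint lives outside the index, it also survives the index being rebuilt, as long as the slicing field keeps the same values.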


Source: https://habr.com/ru/post/1469326/

