When scanning an Elasticsearch index (scan and scroll), it is not possible to apply any sorting, according to the documentation. But is the order of the results defined in any way during this process? If so, is it predictable?
Background Information:
I need to run an operation on 5M documents; each batch of 1000 documents takes about 1 minute to process. Since I can't guarantee that the process will run to completion each time it is started, I would like it to resume where it was last interrupted. For example, if the scroll results were sorted by ID (I know this is not the case), I would track the last processed ID in my code, and on the next start I would skip every document with ID <= lastProcessedId, so that each document is processed only once.
Btw: by “document processing” I mean that I don't want to write additional information back to the index, but rather update some other things in my database. Writing a timestamp into the indexed document would not help in my case, since one of the reasons the process may be interrupted is that the index gets replaced with a new one (rebuilt from scratch). Writing a processed timestamp to the database is also not a desirable option for me, because iteration performance is the reason I use the index for scrolling in the first place.
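A minimal sketch of the resume logic described above, to make the requirement concrete. It assumes, hypothetically, that documents arrive sorted by an integer `id` field, which is exactly the guarantee the scan does not give. The names `resume_processing`, `process_batch` and `save_checkpoint` are made up for illustration, and the document source is a plain Python iterable standing in for the scroll results, not a real Elasticsearch call.

```python
from typing import Callable, Dict, Iterable, Iterator, List

BATCH_SIZE = 1000  # batch size mentioned in the question


def batches(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Group an iterable of documents into lists of at most `size` items."""
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def resume_processing(
    docs: Iterable[Dict],
    last_processed_id: int,
    process_batch: Callable[[List[Dict]], None],
    save_checkpoint: Callable[[int], None],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Process every document with id > last_processed_id, batch by batch.

    ASSUMPTION: `docs` yields documents in ascending `id` order. Without
    that ordering, skipping by id would silently drop unprocessed documents.
    """
    pending = (doc for doc in docs if doc["id"] > last_processed_id)
    for batch in batches(pending, batch_size):
        process_batch(batch)
        # Checkpoint only after the whole batch succeeded.
        last_processed_id = batch[-1]["id"]
        save_checkpoint(last_processed_id)
    return last_processed_id


if __name__ == "__main__":
    # Stand-in for scroll results: ten documents with ids 1..10.
    fake_docs = [{"id": i} for i in range(1, 11)]
    seen: List[Dict] = []
    checkpoints: List[int] = []
    # Pretend a previous run stopped after id 4.
    resume_processing(fake_docs, 4, seen.extend, checkpoints.append, batch_size=3)
    print([d["id"] for d in seen], checkpoints)
```

Because the checkpoint is saved only after a batch completes, an interruption in the middle of a batch means that batch is processed again on the next start, so the work done per batch would need to tolerate being repeated.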