I need to write a program that performs a parallel search over a large space of possible states. New regions are discovered (and searched) as the search proceeds, and some regions are abandoned early when intermediate results obtained elsewhere rule out finding any new useful results in them. The search is performed by multiple threads that cooperate closely to avoid recomputing intermediate data.
A complex internal state (including the call stacks of several threads and the state of the synchronization primitives they use) must be maintained and updated throughout the process, and there is no obvious way to break the computation into isolated pieces that can be executed sequentially, each saving a small intermediate result and passing it on to the next. Nor is there a way to split the computation into independent parallel flows that do not communicate with each other without incurring excessive costs from recomputing a large amount of intermediate data.
Because the search domain is large, the program may run for several months before producing the final result. Over that time there is a significant risk of a power outage or a hardware or OS failure, any of which could cause the complete loss of all work done so far. The program would then have to restart its calculations from scratch.
I need a solution that prevents total data loss in such cases. I was thinking of a runtime engine or platform that continuously saves the current state of a process to fault-tolerant storage, such as a redundant disk array or a database. But I understand that this approach could slow the process down significantly, possibly to the point where it offers no benefit over simply running without it, even after counting the time lost to restarts caused by failures.
In fact, I do not need an ideal solution that saves the program state continuously; I can easily tolerate losing hours or even days of work. A heavyweight solution that comes to mind is to run the program inside a virtual machine, take snapshots of it from time to time, and restore the machine from a recent snapshot after a host failure. This approach would also help restore the program state after an accidental or preventable failure of the guest OS.
Is there a similar but lighter-weight solution, limited to preserving the state of a single process? Or can you suggest other approaches that could solve my problem?
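For concreteness, here is a minimal sketch of what saving state "from time to time" could look like at the application level rather than the VM level. It assumes, hypothetically, that the relevant search state could be gathered into a single picklable object, which is not obviously possible for my program given the thread call stacks involved. The names `save_checkpoint`, `load_checkpoint`, and `PeriodicCheckpointer` are illustrative only, not part of any existing library.

```python
import os
import pickle
import tempfile
import time


def save_checkpoint(state, path):
    """Write `state` to `path` atomically, so a crash mid-write
    leaves the previous checkpoint intact instead of a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the target
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomically swap in the new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


class PeriodicCheckpointer:
    """Saves at most once per `interval_seconds` (hypothetical helper)."""

    def __init__(self, path, interval_seconds):
        self.path = path
        self.interval = interval_seconds
        self._last = time.monotonic()

    def maybe_save(self, state):
        now = time.monotonic()
        if now - self._last < self.interval:
            return False  # too soon since the last checkpoint
        save_checkpoint(state, self.path)
        self._last = now
        return True
```

A coordinating thread would call `maybe_save` at a point where the worker threads are paused in a consistent state, and the program would call `load_checkpoint` at startup. The hard part in my case is exactly that consistent, serializable state, which is why I am asking about process-level solutions.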