Distributed, Synchronized Batch Processing

Question

In our current Java project, we need to run a batch job over a huge set of records. Once a pass over all the records has completed, processing should start again from the beginning, and so on indefinitely. The processing must be parallelized and distributed across several nodes.

The records themselves are stored in a database. A range of identifiers (for example, 1-10000) is sufficient to identify a batch.

From a high-level point of view, I see the following steps:

A subtask processes one batch of records.
The main task checks whether any subtask is still running. If not, it creates one subtask for each batch of records.
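The two steps above can be sketched on a single JVM as follows. This is only an illustration under stated assumptions, not a distributed solution: the names `BatchCycle`, `Batch`, `split` and `runOnePass` are made up for the example, and the per-record work is replaced by a counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

final class BatchCycle {
    /** Hypothetical subtask descriptor: an inclusive range of record ids. */
    static final class Batch {
        final long fromId;
        final long toId;

        Batch(long fromId, long toId) {
            this.fromId = fromId;
            this.toId = toId;
        }
    }

    /** Splits the inclusive id range [minId, maxId] into batches of at most batchSize ids. */
    static List<Batch> split(long minId, long maxId, long batchSize) {
        List<Batch> batches = new ArrayList<>();
        for (long from = minId; from <= maxId; from += batchSize) {
            batches.add(new Batch(from, Math.min(from + batchSize - 1, maxId)));
        }
        return batches;
    }

    /**
     * Main task for one pass: starts one subtask per batch and returns only
     * after every subtask has finished. Returns the number of records handled.
     */
    static long runOnePass(List<Batch> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong processed = new AtomicLong();
        try {
            List<CompletableFuture<Void>> running = new ArrayList<>();
            for (Batch b : batches) {
                running.add(CompletableFuture.runAsync(() -> {
                    // Placeholder for the real per-record work (assumption).
                    processed.addAndGet(b.toId - b.fromId + 1);
                }, pool));
            }
            // "No subtask is still running" corresponds to all futures being complete.
            CompletableFuture.allOf(running.toArray(new CompletableFuture[0])).join();
            return processed.get();
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `runOnePass(split(1, 10000, 1000), 4)` in a loop gives the "again and again" behaviour on one machine; spreading the subtasks over several nodes is the part the question is actually about.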

We use MongoDB heavily and are thinking about persisting the subtasks in it. Each node could then pick up subtasks that have not yet been completed, process them, and mark them as completed. Once there are no incomplete subtasks left, the main task creates all the subtasks again. This would probably work, but we are looking for a solution that does not require heavy synchronization.

Could this be a possible use case for Akka?
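The claim-and-complete protocol described above can be sketched with an in-memory stand-in for the subtask collection. Everything here is an assumption made for illustration: `SubtaskStore`, its `Status` values and its methods are hypothetical, and a `ConcurrentHashMap` takes the place of MongoDB. With MongoDB, the claim step would typically be one atomic `findOneAndUpdate` on a status field, but the question does not describe a schema.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class SubtaskStore {
    enum Status { PENDING, IN_PROGRESS, DONE }

    // In-memory stand-in for the persisted subtask documents (assumption).
    private final Map<Integer, Status> statusBySubtask = new ConcurrentHashMap<>();

    /** Main task: creates subtasks 0..count-1, all marked PENDING. */
    void createAll(int count) {
        for (int id = 0; id < count; id++) {
            statusBySubtask.put(id, Status.PENDING);
        }
    }

    /**
     * Worker node: atomically claims one pending subtask, if any.
     * replace(key, expected, updated) succeeds for exactly one caller,
     * so two workers can never claim the same subtask.
     */
    Optional<Integer> claimNext() {
        for (Integer id : statusBySubtask.keySet()) {
            if (statusBySubtask.replace(id, Status.PENDING, Status.IN_PROGRESS)) {
                return Optional.of(id);
            }
        }
        return Optional.empty();
    }

    /** Worker node: marks a previously claimed subtask as completed. */
    void complete(int id) {
        statusBySubtask.replace(id, Status.IN_PROGRESS, Status.DONE);
    }

    /** Main task: true once no incomplete subtask is left, so a new pass can start. */
    boolean allDone() {
        return statusBySubtask.values().stream().allMatch(s -> s == Status.DONE);
    }
}
```

The only synchronization point is the single compare-and-set in `claimNext`. A real deployment would also have to handle a node that dies while a subtask is `IN_PROGRESS` (for example with a lease timestamp), which this sketch leaves out.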
Can akka-persistence be used to synchronize processing between different nodes?
Are there any other Java / JVM frameworks for this to work?

+5

java akka batch-processing distributed-computing

scho Apr 29 '17 at 9:40

1 answer

Diego martinoia · Answer 1 · 2017-05-04T10:30:23+0000

Your question is too broad for the SO format. Please read this guide in the future before asking questions, and do not ask your teammates to upvote your question in order to inflate what is clearly an unsuitable question ( ͡° ͜ʖ ͡°).

Anyway:

1) Yes, you can implement your requirements in Akka. In particular, since you mentioned several nodes, you are looking at the akka-cluster module (for communication between nodes), and you may also need akka-cluster-sharding (in case you want to keep all the data in memory close to where it is processed).

2) No, I would strongly advise against it. Although you can technically bend akka-persistence into synchronizing tasks, the purpose of akka-persistence is simply to make actor state persistent. Akka in its basic form is enough to solve all your synchronization problems. Just have a master actor create a worker for each subtask and monitor its completion.

3) Yes. Please note that the answer to this kind of question is always yes, no matter what the task is.
