Erlang: planning work on a dynamic set of nodes

I need some tips on writing a task scheduler in Erlang that can distribute tasks (external OS processes) across a set of worker nodes. Work can last from a few milliseconds to several hours. The "scheduler" should be a global registry where tasks are received, sorted, and then assigned to and executed on connected "worker nodes". Worker nodes should be able to register with the scheduler, reporting how many jobs they can process in parallel (slots). Worker nodes should be able to join and leave at any time.

Example:

  • Scheduler has 10 tasks waiting
  • Worker node A connects and can process 3 jobs in parallel
  • Worker node B connects and can process 1 job in parallel
  • Some time later, another worker node is added, which can process 2 tasks in parallel

Questions:

I have seriously spent some time thinking about the problem, but I'm still not sure where to go. My current plan is a globally registered gen_server for the scheduler that keeps the jobs in its state. Each worker node spawns N worker processes and registers them with the scheduler. A worker then pulls a job from the scheduler (an indefinitely blocking call, implemented with {noreply, ...} when there are currently no jobs).
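To make the blocking-pull idea concrete, here is a minimal sketch of such a scheduler. All module and function names are illustrative, not from any existing library; the key trick is deferring the reply with {noreply, ...} and answering later via gen_server:reply/2.

```erlang
%% Sketch: globally registered scheduler; workers block in fetch/0
%% until a job arrives. Names (scheduler, submit, fetch) are made up.
-module(scheduler).
-behaviour(gen_server).
-export([start_link/0, submit/1, fetch/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).

submit(Job) ->
    gen_server:cast({global, ?MODULE}, {submit, Job}).

%% Workers call this; it blocks until a job is available.
fetch() ->
    gen_server:call({global, ?MODULE}, fetch, infinity).

init([]) ->
    {ok, #{jobs => queue:new(), waiting => queue:new()}}.

handle_call(fetch, From, State = #{jobs := Jobs, waiting := Waiting}) ->
    case queue:out(Jobs) of
        {{value, Job}, Rest} ->
            {reply, {ok, Job}, State#{jobs := Rest}};
        {empty, _} ->
            %% No job yet: defer the reply, the caller stays blocked.
            {noreply, State#{waiting := queue:in(From, Waiting)}}
    end.

handle_cast({submit, Job}, State = #{jobs := Jobs, waiting := Waiting}) ->
    case queue:out(Waiting) of
        {{value, From}, Rest} ->
            %% A worker is already waiting: hand the job over directly.
            gen_server:reply(From, {ok, Job}),
            {noreply, State#{waiting := Rest}};
        {empty, _} ->
            {noreply, State#{jobs := queue:in(Job, Jobs)}}
    end.
```

A priority sort could be added by replacing the plain queue with an ordered structure keyed on the sort criterion.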

Here are a few questions:

  • Is it reasonable to assign each new job to an existing worker immediately, knowing that I will then have to reassign jobs when new workers connect? (I believe the Erlang SMP scheduler does something like this with its run queues, but reassigning jobs seems like a big headache to me.)
  • Should I start a process for each processing slot, and where should that process live: on the scheduler node or on the worker node? Should the scheduler make rpc calls to the worker nodes, or would it be better if the worker nodes pulled new jobs and then executed them on their own?
  • And finally: has this problem already been solved, and where can I find code for it? :-) I already tried RabbitMQ for scheduling tasks, but the custom sorting and dispatching of tasks added a lot of complexity.

Any advice is appreciated!

2 answers

After reading the clarifications in the comments, I still recommend using pool(3):

  • Spawning 100k processes is not a big problem in Erlang, because spawning a process is much cheaper than in other systems.

  • One process per task is a very good pattern in Erlang: start a new process, run the task in that process, keep all the state in the process, and terminate the process once the task is done.

  • Don't worry about worker processes that handle a job and then wait for the next one. That is the way to go if you use OS processes or threads, because spawning is expensive there, but in Erlang it only adds unnecessary complexity.
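The one-process-per-task pattern is a few lines of Erlang. Since the question's tasks are external OS processes, this sketch (names are illustrative) runs each one through a port and lets the process die when the task finishes:

```erlang
%% One short-lived Erlang process per task: spawn it, run the external
%% command via a port, keep all state in the process, then terminate.
%% run_task/1 and collect/2 are hypothetical names for illustration.
run_task(Cmd) ->
    spawn_link(fun() ->
        Port = open_port({spawn, Cmd}, [exit_status, binary]),
        collect(Port, <<>>)
    end).

collect(Port, Acc) ->
    receive
        {Port, {data, Data}} ->
            collect(Port, <<Acc/binary, Data/binary>>);
        {Port, {exit_status, Status}} ->
            %% The process ends here; report the result however you like.
            {Status, Acc}
    end.
```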

The pool module is useful as a low-level building block; the only feature missing for your use case is the ability to start additional nodes automatically. What I would do is start with pool and a fixed set of nodes to get the basic functionality working.
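Getting started with pool is short. This sketch assumes a `.hosts.erlang` file listing the worker hosts and a hypothetical `mymod:run_task/1` as the task entry point:

```erlang
%% Minimal pool bootstrap. pool:start/1 boots one slave node per host
%% listed in .hosts.erlang; mymod:run_task/1 is a placeholder.
start() ->
    Nodes = pool:start(myworkers),
    io:format("pool nodes: ~p~n", [Nodes]).

run(Job) ->
    %% pspawn/3 spawns on the node pool considers least loaded.
    pool:pspawn(mymod, run_task, [Job]).
```

Note that pool's load balancing is coarse (it polls run-queue statistics), but it comes for free.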

Then add extra logic that monitors the load on the nodes, e.g. with statistics(run_queue), just as pool itself does. If you find that all nodes are above a certain load threshold, simply slave:start/2,3 a new node on an additional machine and use pool:attach/1 to add it to your pool.
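That grow-the-pool check can be sketched like this; `maybe_grow/2` and the node name `extra` are assumptions for illustration, while `pool:get_nodes/0`, `slave:start/2` and `pool:attach/1` are the real OTP calls:

```erlang
%% If every pool node's run queue is above Threshold, boot one more
%% node on Host and attach it to the pool.
maybe_grow(Host, Threshold) ->
    Loads = [rpc:call(N, erlang, statistics, [run_queue])
             || N <- pool:get_nodes()],
    AllBusy = lists:all(fun(L) -> is_integer(L) andalso L > Threshold end,
                        Loads),
    case AllBusy of
        true ->
            {ok, Node} = slave:start(Host, extra),
            pool:attach(Node);
        false ->
            ok
    end.
```

Running this periodically from a timer (e.g. timer:apply_interval/4) keeps the node-management path completely separate from job dispatch.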

This will not rebalance jobs that are already running, but new tasks will automatically be scheduled onto the newly started node from the moment it is attached.

With this you get fast pool-based distribution of incoming tasks, and a slower, completely separate mechanism for adding and removing nodes.

If you get all this working and still find (after some real-world benchmarking, please) that you need to rebalance running tasks, you can always build rebalancing into the main task loops: on a rebalance message, a task can respawn itself via the pool master, passing its current state along as an argument.

The most important thing is to just go ahead and build something simple that works, and optimize it later.


My solution to the problem:

"distributor" is gen_server, "worker" is gen_server.

"distributor" launches "workers" using a slave: start_link, each "worker" starts with the parameter max_processes,

 "distributor" behavior: handle_call(submit,...) * put job to the queue, * cast itself check_queue handle_cast(check_queue,...) * gen_call all workers for load (current_processes / max_processes), * find the least busy, * if chosen worker load is < 1 gen_call(submit,...) worker with next job if any, remove job from the queue, "worker" behavior (trap_exit = true): handle_call(report_load, ...) * return current_process / max_process, handle_call(submit, ...) * spawn_link job, handle_call({'EXIT', Pid, Reason}, ...) * gen_cast distributor with check_queue 

In reality it is more complicated, because I also need to monitor the running tasks and kill them if necessary, but that is easy to add in this architecture.

This is not a dynamic set of nodes, but you can start a new node with a distributor whenever you need to.

PS It looks like pool, but in my case I spawn port processes (external OS commands), so I need to limit their number and have finer control over what runs where.


Source: https://habr.com/ru/post/1339975/
