How do I build a distributed task scheduler?

I want to build a task-scheduler cluster of several hosts that schedules cron jobs. For example, when a job that must run every 5 minutes is submitted to the cluster, the cluster should decide which host performs the next run, making sure that:

  • Disaster tolerance: as long as not all hosts are down, the job must still start successfully.
  • Validity: only one host starts each run of the job.

Because of the disaster-tolerance requirement, a job cannot be bound to a specific host. One approach is for all hosts to poll a database table (with a lock, of course), which ensures that only one host gets the next run. Since this locks the table frequently, is there a better design?

5 answers

Use the Quartz framework. It supports cron-style syntax, can be clustered, and only one host in the cluster will execute a given job at a time. If a host or a job fails, another host will re-run the pending job.

Consider AWS Simple Workflow Service if you are fine with using Amazon Web Services. Its advantage over something like Quartz is that it doesn't depend on a database you have to host yourself, and it can do much more than scheduling. For example, it can trigger actions that repair your cluster, or page someone if scheduling is impossible for any reason. Here is an example cron workflow.

Check out Chronos (https://mesos.github.io/chronos/), which runs on top of Mesos (https://mesos.apache.org/).

I needed something like this a long time ago, back when synchronization was done with floppy disks. You need to be clear about three things that sound simple but aren't in a distributed environment :-)

"Synchronization partitions": If you get a network partition, meaning your cluster splits into two separate parts whose hosts can talk within a part but not across parts, then "fire the task exactly once" can only be achieved per partition.
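
One standard way to keep "exactly once" cluster-wide instead of per partition (a swapped-in technique, not part of the original answer) is a majority quorum: only the side that can reach a strict majority of all hosts may fire, and at most one side of any split can hold a majority. The function names and the "smallest host name fires" rule below are illustrative assumptions.

```python
from __future__ import annotations


def has_quorum(reachable: set[str], cluster: set[str]) -> bool:
    """True if the hosts this node can currently reach (itself included)
    form a strict majority of the full cluster membership."""
    return 2 * len(reachable & cluster) > len(cluster)


def may_fire(me: str, reachable: set[str], cluster: set[str]) -> bool:
    # Only a majority partition may schedule; inside it, a deterministic
    # rule (smallest host name, an arbitrary choice) picks a single host.
    # This assumes all hosts in the partition see the same membership.
    return has_quorum(reachable, cluster) and me == min(reachable & cluster)


cluster = {"a", "b", "c", "d", "e"}
print(may_fire("a", {"a", "b", "c"}, cluster))  # True: majority side, smallest name
print(may_fire("d", {"d", "e"}, cluster))       # False: minority side stays idle
```

The price is availability: in an even split (two of four hosts on each side) nobody fires, which is exactly the tension between the question's "disaster tolerance" and "only one host" requirements.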

"Failures": If almost all computers are up almost all the time, failures are rare, and two simultaneous failures are practically unthinkable, that is a completely different situation from one where each host is up only part-time, connections are unstable, or synchronization happens over dial-up connections or floppy disks. If you want to handle network splits at all, it gets really complicated. If you want to handle malicious nodes, that is yet another problem.

"Timing": To fire each job exactly once ... you need to synchronize faster than the job interval.

edit: A tip on how I schedule tasks: I keep a large text file of lines. Each line is a task: first the job type, then the execution time, then the command, and finally an optional resubmission interval for repeating tasks. Synchronization means merging these files. Completed tasks are deleted; if resubmission is enabled, a new task is added for the next run.
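
The file format and merge step above could look roughly like this. The tab separator, the Unix-seconds time, and the `done` set are assumptions: the answer doesn't say how a merge learns that a task was completed elsewhere, and a plain union of the two files would otherwise bring a deleted task back from the other host's copy. All names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    kind: str                     # job type
    run_at: int                   # execution time (assumed: Unix seconds)
    command: str
    repeat: Optional[int] = None  # resubmission interval in seconds, if any


def parse_line(line: str) -> Task:
    # Assumed layout: "kind<TAB>time<TAB>command[<TAB>interval]".
    parts = line.rstrip("\n").split("\t")
    repeat = int(parts[3]) if len(parts) > 3 and parts[3] else None
    return Task(parts[0], int(parts[1]), parts[2], repeat)


def format_line(t: Task) -> str:
    fields = [t.kind, str(t.run_at), t.command]
    if t.repeat is not None:
        fields.append(str(t.repeat))
    return "\t".join(fields)


def merge(a: list[str], b: list[str], done: set[Task]) -> list[str]:
    """Merge two hosts' task files.

    The union drops duplicate lines; tasks listed in `done` are deleted,
    and repeating ones are resubmitted at run_at + repeat.
    """
    tasks = {parse_line(line) for line in a + b if line.strip()}
    for t in done & tasks:
        tasks.discard(t)
        if t.repeat is not None:
            tasks.add(Task(t.kind, t.run_at + t.repeat, t.command, t.repeat))
    return [format_line(x) for x in sorted(tasks, key=lambda x: (x.run_at, x.kind, x.command))]
```

Because the merge is a set union followed by deletions, merging the same pair of files twice gives the same result, which matters when synchronization is irregular.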

In an ideal world where every host is always connected to the others, I would implement something like a token. If no master has been elected, the hosts elect one, and that master schedules everything until it stops sending heartbeat signals for some time. If there are two masters, they negotiate so that one of them remains master (for example, the one with the lower MAC address ... whatever).
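
A minimal sketch of that election, assuming each host knows the heartbeat times of its peers: a peer counts as alive if its last heartbeat is at most `timeout` seconds old, and the lowest ID among live hosts is master. The class and method names are hypothetical, and MAC addresses are assumed to be written in one consistent format (lowercase, zero-padded) so that string order matches numeric order.

```python
from __future__ import annotations


class Membership:
    """Heartbeat tracking plus a deterministic lowest-ID-wins master choice."""

    def __init__(self, my_id: str, timeout: float = 5.0) -> None:
        self.my_id = my_id
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, host_id: str, now: float) -> None:
        # Record a heartbeat received from a peer at time `now`.
        self.last_seen[host_id] = now

    def alive(self, now: float) -> set[str]:
        live = {h for h, t in self.last_seen.items() if now - t <= self.timeout}
        live.add(self.my_id)  # a host always counts itself as alive
        return live

    def master(self, now: float) -> str:
        # Lowest ID among live hosts wins, so two hosts that both believed
        # they were master agree as soon as they see each other's heartbeats.
        return min(self.alive(now))

    def is_master(self, now: float) -> bool:
        return self.master(now) == self.my_id


m = Membership("00:1a:2b:00:00:05")
m.heartbeat("00:1a:2b:00:00:02", now=100.0)
print(m.is_master(now=102.0))  # False: ...:02 is alive and lower
print(m.is_master(now=110.0))  # True: ...:02 timed out, this host takes over
```

This relies on the "always connected" assumption of the paragraph above: during a network split, each side would elect its own master.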

If you have to deal with malicious nodes, you can use solutions to the Byzantine generals problem. Master election has been studied fairly well for malicious hosts. With a bit of RSA crypto, the elected master can sign each command, and replay attacks can be handled with timestamps or increasing indexes ... voilà.
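
The sign-and-reject-replays idea can be sketched with an increasing index. Note the substitution: Python's standard library has no RSA, so an HMAC over a pre-shared key stands in for the signature here. Unlike RSA, any host holding that key could forge commands, so this shows the replay-protection structure only, not protection against a malicious key holder. Names and the message layout are hypothetical.

```python
from __future__ import annotations

import hashlib
import hmac
import json


def sign(key: bytes, index: int, command: str) -> dict:
    """Master side: bind the command to an increasing index and MAC both."""
    body = json.dumps({"index": index, "command": command}, sort_keys=True)
    tag = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


class Verifier:
    """Host side: accept only authentic messages whose index is new."""

    def __init__(self, key: bytes) -> None:
        self.key = key
        self.last_index = -1

    def accept(self, msg: dict) -> str | None:
        expected = hmac.new(self.key, msg["body"].encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, msg["tag"]):
            return None  # forged or corrupted
        data = json.loads(msg["body"])
        if data["index"] <= self.last_index:
            return None  # replayed or out-of-order message
        self.last_index = data["index"]
        return data["command"]


key = b"shared-secret"  # hypothetical pre-shared key
v = Verifier(key)
msg = sign(key, 1, "run backup")
print(v.accept(msg))  # run backup
print(v.accept(msg))  # None: replay rejected
```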

Just as a story from an embedded programmer, not meant for today's always-connected Internet world: my big problem about 20 years ago was that hosts synchronized anywhere from once an hour or once a day to once a week or once a month. So the solution needed different command types:
  1. execute on every host at a given date (far enough in the future for synchronization to reach everyone);
  2. execute on the host whose "whoami" contains a specific substring;
  3. run on a random node with a small probability, and send a confirmation to everyone else that it has already been done.
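
Each host can decide locally whether to run a command of one of these three types. In this sketch the type names, the `Command` fields, and the `already_done` flag are hypothetical; how confirmations travel is left to the file synchronization described earlier.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Command:
    kind: str                  # "at_date", "host_match" or "random_once" (hypothetical names)
    payload: str
    run_at: float = 0.0        # type 1: date far enough ahead for every host to sync
    pattern: str = ""          # type 2: substring to look for in the host identity
    probability: float = 0.0   # type 3: chance of firing per check


def should_run(cmd: Command, identity: str, now: float,
               already_done: bool, rng: random.Random) -> bool:
    """Decide locally whether this host should execute `cmd` now.

    `already_done` is True if this host already ran the command or, for
    type 3, if another host's confirmation arrived with a sync.
    """
    if already_done:
        return False
    if cmd.kind == "at_date":
        return now >= cmd.run_at               # type 1: every host, once the date passes
    if cmd.kind == "host_match":
        return cmd.pattern in identity         # type 2: e.g. the output of `whoami`
    if cmd.kind == "random_once":
        # type 3: fire rarely; the caller then broadcasts a confirmation
        return rng.random() < cmd.probability
    raise ValueError(f"unknown command kind: {cmd.kind}")
```

A host would call `should_run` for each pending command on every check and, after a type-3 command fires, append a confirmation so that the next synchronization spreads it.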

The third command type achieves something like "fire only once" if synchronization happens much more often than any node is likely to fire. It does not need a master-slave architecture, and it works very well if you know your synchronization intervals.
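
The trade-off can be quantified under a simple model (an assumption, not from the answer): n hosts each check k times per synchronization window and fire independently with probability p per check, and a confirmation reaches every host by the next window. The function name is hypothetical.

```python
from __future__ import annotations


def random_once_stats(n: int, p: float, k: int) -> tuple[float, float]:
    """Duplicate risk and expected delay for the "random node" command type.

    Returns (probability that two or more hosts fire in the same window,
    given that at least one does; expected number of windows until the
    first firing) under the independence model described above.
    """
    if not (0 < p <= 1) or n < 1 or k < 1:
        raise ValueError("need n >= 1, k >= 1 and 0 < p <= 1")
    q = 1 - (1 - p) ** k                      # one host fires at least once in a window
    none = (1 - q) ** n                       # no host fires in the window
    exactly_one = n * q * (1 - q) ** (n - 1)
    some = 1 - none
    if some <= 0:
        raise ValueError("p is too small for float precision")
    duplicate = max(0.0, 1 - none - exactly_one) / some  # clamp rounding noise
    return duplicate, 1 / some
```

For small p, the duplicate risk grows roughly like (n - 1)kp/2 while the delay is roughly 1/(nkp) windows, so lowering p trades fewer double executions for a later start.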

I was looking at Dkron (a distributed job scheduling system). It has an API and looks good. I plan to try it; see the Dkron website.

Source: https://habr.com/ru/post/1206727/