We have a system with several different types of tasks. Call them, for example:
job_1
job_2
job_3
All of them require different sets of parameters (plus optional parameters). That is, we run job_1(x) for different x = A, B, C, .... job_2 runs for a set of parameters that depends on the results of job_1(x), and it also loads the data stored by job_1(x). And so on.
The result is a tree of dependencies. Now, these tasks sometimes fail for one reason or another. If job_1 fails for x=B, that branch of the tree should not start at all; all other branches should still run.
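To make the intended failure semantics concrete, here is a minimal sketch (the job bodies are placeholders, not our real code): a failed parent skips its whole branch while sibling branches continue.

```python
def job_1(x):
    # Placeholder: pretend the x=B run fails, the others succeed.
    if x == "B":
        raise RuntimeError("job_1 failed for x=B")
    return f"data_{x}"

def job_2(data):
    # Placeholder downstream task that consumes job_1's output.
    return data.upper()

def run_tree(xs):
    results, skipped = {}, []
    for x in xs:
        try:
            out = job_1(x)
        except Exception:
            # Parent failed: do not start the dependent branch at all.
            skipped.append(("job_2", x))
            continue
        results[("job_2", x)] = job_2(out)
    return results, skipped

results, skipped = run_tree(["A", "B", "C"])
print(results)   # branches for A and C completed
print(skipped)   # the x=B branch was skipped, never attempted
```

This is exactly the behavior we want the scheduler to give us for free, instead of hand-rolling it.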
All tasks are written in Python and use parallelism (based on spawning SLURM jobs). They are scheduled with cron. This is obviously not ideal and has two main drawbacks:
- Debugging is very complicated. Every task runs regardless of whether the tasks above it in the tree succeeded, so it is hard to locate a problem without a deep understanding of the dependencies.
- If an upstream task (for example job_1) has not completed, job_2 may still be scheduled; it will then either fail or run on outdated data.
To solve this, we are considering Airflow for scheduling and visualization, since it is written in Python and seems to fit our needs. But I see several potential problems:
- Scale: job_2 depends on job_1, and a single run of job_2(y) may need to read the results of up to 100 runs of job_1 (x=A, ...). With roughly 10 such values this quickly adds up to on the order of 300 tasks, many of which only read data. Can Airflow handle that many tasks? And how does it interact with our own parallelism (each task already spawns SLURM jobs itself)?
So: is Airflow suitable for this? The alternatives I have found (luigi, Azkaban, etc.) seem to be tied to Hadoop, which we do not use. Has anyone solved a similar problem? Any advice is appreciated.
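For concreteness, here is a rough, untested sketch of how our tree might map onto an Airflow DAG (the dag_id, schedule, and per-x fan-out are my assumptions, not a working pipeline). Airflow only runs a task when its upstream tasks succeeded, which is the behavior we are missing with cron.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_job_1(x):
    print(f"job_1({x})")  # placeholder for the real SLURM-spawning job

def run_job_2(x):
    print(f"job_2 for {x}")  # placeholder: would load job_1's output

with DAG(
    dag_id="job_tree",          # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",          # replaces the cron entry
    catchup=False,
) as dag:
    for x in ["A", "B", "C"]:
        t1 = PythonOperator(
            task_id=f"job_1_{x}",
            python_callable=run_job_1,
            op_kwargs={"x": x},
        )
        t2 = PythonOperator(
            task_id=f"job_2_{x}",
            python_callable=run_job_2,
            op_kwargs={"x": x},
        )
        # job_2_x runs only if job_1_x succeeded; the other x-branches
        # are independent and keep running if one branch fails.
        t1 >> t2
```

If this is roughly how Airflow is meant to be used, the failing-branch behavior described above would come for free from the task dependencies.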