Limiting the number of concurrent processes scheduled by Condor

I use Condor to run batches of 100 processes that take a few hours. After these processes complete, I need to run the next batch using the results from the first batch, and this is repeated dozens of times. My Condor pool has more than 100 cores, and I would like to limit my Condor cluster to only 100 processes at a time, so that Condor starts working on the next process only after one of the first processes has completed. Is this possible?

2 answers

It sounds like you are just running a job with checkpoints: one job writes a checkpoint, then the next job reads that checkpoint, does some work, and writes a new checkpoint, and so on, 10 times over. I'm not sure why you need to split it up the way you have. Why not have a wrapper script that looks for the checkpoint file and uses it, or runs from scratch if there is none?
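A minimal sketch of such a wrapper, assuming the state fits in a small JSON file. The file name `checkpoint.json`, the state fields, and `run_one_stage` are hypothetical placeholders for the real computation, not anything taken from the question:

```python
import json
import os


def load_state(checkpoint_path):
    """Return the saved state if a checkpoint exists, else a fresh state."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            return json.load(f)
    return {"iteration": 0, "value": 0.0}


def run_one_stage(state):
    """Hypothetical stand-in for one stage of the real computation."""
    return {"iteration": state["iteration"] + 1, "value": state["value"] + 1.0}


def save_state(state, checkpoint_path):
    """Write to a temporary file first, then rename, so that an eviction
    in the middle of a write does not leave a truncated checkpoint."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, checkpoint_path)


def main(checkpoint_path="checkpoint.json", n_stages=10):
    """Resume from the checkpoint if present, otherwise start from scratch."""
    state = load_state(checkpoint_path)
    while state["iteration"] < n_stages:
        state = run_one_stage(state)
        save_state(state, checkpoint_path)
    return state
```

Calling `main()` again after an interruption picks up at the last saved iteration instead of repeating earlier stages.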

Another option is to use "Requirements" in the submit file and list only the 100 machines or cores that your jobs can run on. Something like:

Requirements = (machine == "astrolab01") || (machine == "astrolab02") || (machine == "astrolab03") 

This ensures that you never run more than three jobs at once. If these machines have multiple cores, then you need to name individual slots instead, with something like:

Requirements = (name == "slot1@astrolab01") || (name == "slot1@astrolab02")
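Writing out 100 slot names by hand is tedious, so the expression could be generated instead. This is only a sketch: the helper name `slot_requirements` is made up for illustration, and the host names are the ones from the example above. A real pool may report fully qualified names, so it is worth checking what your pool actually advertises before relying on it.

```python
def slot_requirements(hosts, slots_per_host=1):
    """Build a Requirements line that matches only the listed slots.

    Assumes slots are named slotN@host, as in the example above.
    """
    names = [
        f"slot{slot}@{host}"
        for host in hosts
        for slot in range(1, slots_per_host + 1)
    ]
    clauses = " || ".join(f'(name == "{n}")' for n in names)
    return "Requirements = " + clauses


print(slot_requirements(["astrolab01", "astrolab02"]))
```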

You need to use DAGMan (the Directed Acyclic Graph Manager). It lets you define parent-child relationships between jobs, so that the second job does not start until the results of the first are available.

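DAGMan can also cap the number of active jobs for you, for example with the -maxjobs option of condor_submit_dag or the DAGMAN_MAX_JOBS_SUBMITTED configuration setting. (MAX_JOBS_RUNNING itself is a condor_schedd setting rather than a DAGMan one.)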

All of this is described in Section 2.10 of the HTCondor 8.4 manual. Most likely you will need a script to generate the DAG file, plus an accessible place to store intermediate results between runs, because jobs cannot pass data directly from parent to child. The output of the first run is collected into a working directory and then passed to the next job from there.
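A rough sketch of such a generator script, assuming every job in a batch uses the same submit file and that each batch must finish completely before the next one starts. The submit file name `run.sub`, the node naming scheme, and the `batch`/`index` variables are assumptions made for illustration:

```python
def build_dag(n_batches, batch_size, submit_file="run.sub"):
    """Return the text of a DAG file in which every job of batch k
    is a parent of every job of batch k+1."""
    lines = []
    previous = []
    for b in range(n_batches):
        current = [f"b{b}_j{j}" for j in range(batch_size)]
        for j, name in enumerate(current):
            # One DAG node per process, all sharing one submit file.
            lines.append(f"JOB {name} {submit_file}")
            # Macros the submit file can reference as $(batch) and $(index).
            lines.append(f'VARS {name} batch="{b}" index="{j}"')
        if previous:
            # The new batch waits for the whole previous batch.
            lines.append(f"PARENT {' '.join(previous)} CHILD {' '.join(current)}")
        previous = current
    return "\n".join(lines) + "\n"


# Small example: 3 batches of 2 jobs each.
print(build_dag(3, 2))
```

The generated text can be written to a file such as `batches.dag` and submitted with `condor_submit_dag -maxjobs 100 batches.dag`, which should keep DAGMan from having more than 100 jobs submitted at once.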


Source: https://habr.com/ru/post/1247101/

