Parallel but different Slurm srun job step invocations not working

I would like to run the same program on a large number of different input files. I could simply submit each one as a separate Slurm job, but I don't want to swamp the queue by dumping 1000 jobs on it at once. Instead, I've been trying to process the same set of files by first creating an allocation, and then, within that allocation, looping over all the files with srun, giving each invocation a single core from the allocation. The problem is that no matter what I do, only one job step runs at a time. The simplest test case I could come up with is:

#!/usr/bin/env bash
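# Four one-second job steps, launched in the background so they can overlap.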

srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &

wait

No matter how many cores I assign:

time salloc -n 1 test
time salloc -n 2 test
time salloc -n 4 test

it always takes 4 seconds of wall time. Is it not possible to run several job steps in parallel?
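For context, the full pattern I am ultimately after looks roughly like this (a sketch; my_program, the inputs/*.dat glob, and the core count are placeholders, not part of the test case above):

#!/usr/bin/env bash
# Run under an allocation, e.g.: salloc -n 16 ./run_all
# One backgrounded job step per input file, one core each.
for f in inputs/*.dat; do
    srun --exclusive --ntasks 1 -c 1 ./my_program "$f" &
done
wait    # block until every step has finished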

Memory is most likely the culprit. If DefMemPerCPU is defined in slurm.conf, each job step is by default allocated all of the RAM of the job's node(s), so the steps exclude one another and run sequentially. Request memory explicitly in each srun invocation (for example with --mem-per-cpu) so that every step claims only its own share.
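A minimal sketch of that fix applied to the test script from the question (the 100M value is an arbitrary placeholder, not something from the original post):

#!/usr/bin/env bash

# Explicit per-step memory requests let the four steps coexist,
# instead of each one implicitly claiming all of the job's RAM.
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=100M sleep 1 &
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=100M sleep 1 &
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=100M sleep 1 &
srun --exclusive --ntasks 1 -c 1 --mem-per-cpu=100M sleep 1 &
wait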

For what it's worth, here is the same test with the timing moved inside the script, so that the time spent waiting for the allocation itself is not counted:

#!/usr/bin/env bash

time {
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
wait
}

salloc -n 1 test
salloc -n 2 test
salloc -n 4 test

Note that with n < 4 the output also contains srun: Job step creation temporarily disabled, retrying messages, printed by the steps that are waiting for resources to free up.
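Once the steps are actually able to run concurrently, the wall time should scale as roughly ceil(4/n) seconds for n allocated cores, since each of the four steps sleeps for one second on a single core. A sketch of the expected behaviour (approximate, not measured output):

salloc -n 4 test   # ~1 s: all four steps run at once
salloc -n 2 test   # ~2 s: two waves of two; waiting steps print the retry message
salloc -n 1 test   # ~4 s: the steps run strictly one after another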


Source: https://habr.com/ru/post/1629510/

