Python subprocess returns invalid exit code

I wrote a script that runs several processes (simple unit tests) in parallel. It runs N jobs, with at most num_workers processes at a time.

My first implementation ran the processes in batches of num_workers and seemed to work fine (I used the false command here to test the behavior):

    import subprocess

    errors = 0
    num_workers = 10
    N = 100
    i = 0

    while i < N:
        processes = []
        for j in range(i, min(i+num_workers, N)):
            p = subprocess.Popen(['false'])
            processes.append(p)

        [p.wait() for p in processes]
        exit_codes = [p.returncode for p in processes]

        errors += sum(int(e != 0) for e in exit_codes)
        i += num_workers

    print(f"There were {errors}/{N} errors")

However, the tests do not all take the same amount of time, so a whole batch sometimes ends up waiting for one slow test to finish. So I rewrote the script to keep assigning new tasks as soon as running ones complete:

    import subprocess
    import os

    errors = 0
    num_workers = 40
    N = 100
    assigned = 0
    completed = 0
    processes = set()

    while completed < N:
        if assigned < N:
            p = subprocess.Popen(['false'])
            processes.add((assigned, p))
            assigned += 1
        if len(processes) >= num_workers or assigned == N:
            os.wait()

        for i, p in frozenset(processes):
            if p.poll() is not None:
                completed += 1
                processes.remove((i, p))
                err = p.returncode
                print(i, err)
                if err != 0:
                    errors += 1

    print(f"There were {errors}/{N} errors")

However, this gives incorrect results for the last few processes. For instance, the example above reports 98/100 errors instead of 100/100. I checked, and this has nothing to do with concurrency; the last 2 jobs came back with exit code 0 for some reason.

Why is this happening?

1 answer

The problem is os.wait(). It does not just wait for a child process to exit: it also returns the child's pid and "exit status indication", as the documentation says. That means it reaps the child, and once the child has been reaped, its return code is no longer available to poll. Here is a simple test that reproduces the problem:

false_runner.py

    import os
    import subprocess

    p = subprocess.Popen(['false'], stderr=subprocess.DEVNULL)
    pid, retcode = os.wait()
    print("From os.wait: {}".format(retcode))
    print("From popen object before poll: {}".format(p.returncode))
    p.poll()
    print("From popen object after poll: {}".format(p.returncode))

Output

    njv@organon:~/tmp$ python false_runner.py
    From os.wait: 256
    From popen object before poll: None
    From popen object after poll: 0

The source code for _internal_poll, which Popen.poll calls, makes it clear what happens here: when Popen tries to call _waitpid on its child's pid, it gets ChildProcessError: [Errno 10] No child processes and assigns a returncode of 0, because at that point there is no way to determine the child's real return code.
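To make the reaping effect concrete, here is a minimal sketch for Unix-like systems. The names reap_then_poll and FAILING_CMD are made up for illustration, and a small Python child that exits with code 1 stands in for false. It reaps one specific child with os.waitpid instead of os.wait, which has the same effect on that child, and it decodes the raw wait status with os.WEXITSTATUS, which is why os.wait reported 256 rather than 1 above.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for `false`: a child that exits with code 1.
FAILING_CMD = [sys.executable, "-c", "import sys; sys.exit(1)"]


def reap_then_poll():
    """Reap a child behind Popen's back and report what each API sees."""
    p = subprocess.Popen(FAILING_CMD)

    # Reap this specific child directly, as os.wait() would for any child.
    _, status = os.waitpid(p.pid, 0)
    exit_code = os.WEXITSTATUS(status)  # decode the raw wait status

    # The child has already been reaped, so a second wait on it fails.
    try:
        os.waitpid(p.pid, 0)
        second_wait_failed = False
    except ChildProcessError:
        second_wait_failed = True

    # poll() runs into the same error internally, so it cannot see the real code.
    return status, exit_code, second_wait_failed, p.poll()


if __name__ == "__main__":
    print(reap_then_poll())
```

The real exit code is only recoverable from the status that the first wait returned; whatever poll reports afterwards is a fallback value, not the child's actual result.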

The reason this happens only for the last two subprocesses in your example is that os.wait is only reached through the "or assigned == N" part of the condition, and only once or twice, because your subprocesses finish so quickly. If you slow them down a bit, you will get more random behavior.

As for the fix: I would just replace os.wait() with a short sleep, so that only Popen.poll collects the exit codes.
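One way this could look is sketched below. The function run_jobs and its parameters num_workers and poll_interval are hypothetical names, not part of the original script, and the loop is restructured slightly so it is easy to call with any list of commands. The point is that nothing except Popen.poll reaps the children, and time.sleep only keeps the loop from spinning.

```python
import subprocess
import sys
import time


def run_jobs(commands, num_workers=4, poll_interval=0.01):
    """Run commands with at most num_workers at once; return the failure count."""
    pending = list(commands)
    running = []
    errors = 0

    while pending or running:
        # Top up the pool while there is spare capacity.
        while pending and len(running) < num_workers:
            running.append(subprocess.Popen(pending.pop(0)))

        # Let Popen.poll() collect exit codes; nothing else reaps the children.
        still_running = []
        for p in running:
            if p.poll() is None:
                still_running.append(p)
            elif p.returncode != 0:
                errors += 1
        running = still_running

        if running:
            time.sleep(poll_interval)  # brief pause instead of os.wait()

    return errors


if __name__ == "__main__":
    # Stand-in for `false`: a child process that exits with code 1.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    n = 20
    print(f"There were {run_jobs([failing] * n, num_workers=5)}/{n} errors")
```

The sleep interval is a trade-off: a shorter value picks up finished jobs sooner at the cost of more wake-ups.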


Source: https://habr.com/ru/post/1275483/

