Handling worker death in a multiprocessing Pool

I have a simple server:

    from multiprocessing import Pool, TimeoutError
    import time
    import os

    if __name__ == '__main__':
        # start worker processes
        pool = Pool(processes=1)

        while True:
            # evaluate "os.getpid()" asynchronously
            res = pool.apply_async(os.getpid, ())  # runs in *only* one process
            try:
                print(res.get(timeout=1))  # prints the PID of that process
            except TimeoutError:
                print('worker timed out')
            time.sleep(5)

        pool.close()
        print("Now the pool is closed and no longer available")
        pool.join()
        print("Done")

If I run this, I get something like:

    47292
    47292

Then I kill 47292 while the server is running. A new worker process is started, but the server output is:

    47292
    47292
    worker timed out
    worker timed out
    worker timed out

The pool is still trying to send requests to the old worker process.

I have experimented with catching signals in both the server and the workers, and I can improve the behaviour somewhat, but the server still seems to wait for dead children on shutdown (i.e. pool.join() never returns) after a worker has been killed.
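
The signal handling I experimented with was along these lines (a rough illustrative sketch rather than my exact code, assuming a POSIX platform where the server receives SIGCHLD when a child dies):

    import os
    import signal

    def on_sigchld(signum, frame):
        # Reap any dead children so they do not linger as zombies.
        # Note: this only observes the death; it does not tell the
        # Pool to stop dispatching work to the dead worker, and
        # reaping here can interfere with the Pool's own join logic.
        while True:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break
            if pid == 0:
                break
            print('worker %d exited with status %d' % (pid, status))

    signal.signal(signal.SIGCHLD, on_sigchld)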

What is the proper way to handle workers dying?

Graceful shutdown of the workers from the server process only seems to work if none of the workers has died.

(This is on Python 3.4.4, but I am happy to upgrade if that would help.)

UPDATE: Interestingly, this worker timeout problem does not occur if the pool is created with processes=2 and you kill one worker process, wait a few seconds and then kill the other. However, if you kill both worker processes in rapid succession, the timeout problem shows up again.

Perhaps related: when the problem occurs, killing the server process leaves the worker processes running.

1 answer

This behaviour comes from the design of multiprocessing.Pool. When you kill a worker, you may kill the one holding the call_queue.rlock. When a process is killed while it holds this lock, no other process will ever be able to read from the call_queue again, which breaks the Pool because it can no longer communicate with its workers.
So there is really no way to kill a worker and be sure that your Pool will still be fine afterwards, because you may end up in a deadlock.
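
You can actually see this shared lock by poking at the Pool's internals. A minimal sketch, assuming CPython (the _inqueue and _rlock attributes are private implementation details and may change between versions):

    from multiprocessing import Pool

    if __name__ == '__main__':
        pool = Pool(processes=2)
        # The task queue shared by all workers; every worker must
        # acquire this lock before it can read the next task.
        print(pool._inqueue._rlock)
        # If a worker is SIGKILLed while holding this lock, the other
        # workers block forever trying to acquire it, and the pool
        # silently stops delivering tasks.
        pool.terminate()
        pool.join()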

multiprocessing.Pool does not handle workers dying. You can try using concurrent.futures.ProcessPoolExecutor instead (it has a slightly different API), which handles the failure of a process by default. When a process dies in a ProcessPoolExecutor, the entire executor is shut down and you get back a BrokenProcessPool error.
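
A minimal sketch of that behaviour, assuming a POSIX platform for the kill (on older Python 3 versions BrokenProcessPool must be imported from concurrent.futures.process):

    import os
    import signal
    from concurrent.futures import ProcessPoolExecutor
    from concurrent.futures.process import BrokenProcessPool

    if __name__ == '__main__':
        executor = ProcessPoolExecutor(max_workers=1)
        # First call succeeds and tells us the worker's PID.
        worker_pid = executor.submit(os.getpid).result()
        print(worker_pid)

        # Simulate a crash of the worker.
        os.kill(worker_pid, signal.SIGKILL)

        try:
            # Depending on timing, either submit() or result() raises.
            print(executor.submit(os.getpid).result())
        except BrokenProcessPool:
            # The executor is now unusable; a new one must be created.
            print('worker died, executor is broken')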

Note that there are other deadlocks in this implementation that should be fixed in loky. (DISCLAIMER: I am a maintainer of that library.) In addition, loky lets you resize an existing executor using a ReusablePoolExecutor and its _resize method. Let me know if you are interested; I can give you some help getting started with this package. (I realize we still need to do a bit of work on the documentation... 0_0)
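
For example, a rough sketch assuming loky is installed (pip install loky) and using get_reusable_executor, the public factory that wraps the reusable executor; calling it again with a different max_workers resizes the same executor:

    import os
    from loky import get_reusable_executor

    if __name__ == '__main__':
        executor = get_reusable_executor(max_workers=1)
        print(executor.submit(os.getpid).result())

        # Asking for a different size reuses and resizes the same
        # underlying executor instead of spawning a fresh pool.
        executor = get_reusable_executor(max_workers=4)
        print(executor.submit(os.getpid).result())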


Source: https://habr.com/ru/post/1270443/

