I use Celery to run many thousands of tasks, each of which takes several minutes. The code below is my simple replacement for multiprocessing.pool.Pool.map:
    def map(task, data):
        """
        Perform the *task* on *data* in a distributed way.

        Blocks until finished.
        """
        ret = celery_module.group(task.s(val) for val in data).apply_async()
        return ret.get(interval=0.1)
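For context, a typical call looks like this (a minimal sketch; process_item, expensive_computation, and the app Celery application object are stand-ins for my real code):

    @app.task
    def process_item(val):
        # Stand-in for the real work; each task runs for several minutes
        # and is idempotent by design.
        return expensive_computation(val)

    # Fan the items out across the workers and block until all results arrive.
    results = map(process_item, range(100000))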
It works like a charm as long as no worker breaks. But sometimes a node dies, taking several running tasks down with it. When that happens, all the other tasks finish and the workers go idle, but get() waits forever for the results from the dead worker.
How can I make the dead tasks be retried after some timeout? The tasks are idempotent, so I don't care about duplicate executions at all. I tried playing with CELERY_ACKS_LATE and putting timeouts here and there, but nothing seemed to fix it. I feel like I'm missing something obvious, but I can't find what.
Edit: the transport used for both the broker and the result backend is Redis.
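For reference, the configuration and the knobs I have been experimenting with look roughly like this (a sketch with Celery 3.x-style setting names; the URLs and the timeout value are placeholders):

    BROKER_URL = 'redis://localhost:6379/0'
    CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'

    # One of the things I tried: acknowledge a task only after it finishes,
    # so that tasks held by a dead worker should eventually be redelivered.
    CELERY_ACKS_LATE = True

    # With the Redis transport, unacknowledged tasks are redelivered only
    # after the visibility timeout expires -- one of the timeouts I played with.
    BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 3600}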