I use Celery to run many thousands of tasks, each of which takes several minutes. The code below is my simple replacement for multiprocessing.pool.Pool.map:
    def map(task, data):
        """
        Perform the *task* on *data* in a distributed way.

        Blocks until finished.
        """
        ret = celery_module.group(task.s(val) for val in data).apply_async()
        return ret.get(interval=0.1)
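For context, a typical call looks like this (a minimal sketch; process_item, expensive_computation, and the app Celery application object are stand-ins for my real code):

    @app.task
    def process_item(val):
        # Stand-in for the real work; each task runs for several minutes
        # and is idempotent by design.
        return expensive_computation(val)

    # Fan the items out across the workers and block until all results arrive.
    results = map(process_item, range(100000))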
It works like a charm as long as no worker breaks. But sometimes a node dies, taking several running tasks down with it. When that happens, all the other tasks finish and the workers go idle, but get() waits forever for the results from the dead worker.
How can I make the dead tasks be retried after some timeout? The tasks are idempotent, so I don't care about duplicate executions at all. I tried playing with CELERY_ACKS_LATE and putting timeouts here and there, but nothing seemed to fix it. I feel like I'm missing something obvious, but I can't find what.
Edit: the transport used for both the broker and the result backend is Redis.
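For reference, the configuration and the knobs I have been experimenting with look roughly like this (a sketch with Celery 3.x-style setting names; the URLs and the timeout value are placeholders):

    BROKER_URL = 'redis://localhost:6379/0'
    CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'

    # One of the things I tried: acknowledge a task only after it finishes,
    # so that tasks held by a dead worker should eventually be redelivered.
    CELERY_ACKS_LATE = True

    # With the Redis transport, unacknowledged tasks are redelivered only
    # after the visibility timeout expires -- one of the timeouts I played with.
    BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 3600}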