Celery: remote workers often lose their connection to the broker

I have a Celery broker running on a cloud server (a Django application) and two workers running on local servers in my office, connected from behind NAT. The local workers often lose their connection and have to be restarted before they reconnect to the broker. Usually celeryd restart hangs on the first attempt, so I have to press Ctrl+C and retry once or twice before the worker comes back up and connects. These are the two errors the workers log most often:

    [2014-08-03 00:08:45,398: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 278, in start
        blueprint.start(self)
      File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
        step.start(parent)
      File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 796, in start
        c.loop(*c.loop_args())
      File "/usr/local/lib/python2.7/dist-packages/celery/worker/loops.py", line 72, in asynloop
        next(loop)
      File "/usr/local/lib/python2.7/dist-packages/kombu/async/hub.py", line 320, in create_loop
        cb(*cbargs)
      File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 159, in on_readable
        reader(loop)
      File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 142, in _read
        raise ConnectionError('Socket was disconnected')
    ConnectionError: Socket was disconnected

    [2014-03-07 20:15:41,963: CRITICAL/MainProcess] Couldn't ack 11, reason:RecoverableConnectionError(None, 'connection already closed', None, '')
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 93, in ack_log_error
        self.ack()
      File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 88, in ack
        self.channel.basic_ack(self.delivery_tag)
      File "/usr/local/lib/python2.7/dist-packages/amqp/channel.py", line 1583, in basic_ack
        self._send_method((60, 80), args)
      File "/usr/local/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 50, in _send_method
        raise RecoverableConnectionError('connection already closed')

How do I debug this? Is it because the workers are behind NAT? Is there a good tool for monitoring whether the workers have lost their connection? Then I could at least bring them back by restarting the worker manually.

1 answer

Unfortunately, this is a problem in recent versions of Celery + Kombu: the task handler tries to use a connection that has already been closed. I worked around it as follows:

    CELERY_CONFIG = {
        'CELERYD_MAX_TASKS_PER_CHILD': 1,
        'CELERYD_PREFETCH_MULTIPLIER': 1,
        'CELERY_ACKS_LATE': True,
    }

CELERYD_MAX_TASKS_PER_CHILD set to 1 ensures that each worker process is replaced with a fresh one after it completes a task.
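
For reference, here is the same dictionary again with a comment on what each setting does, since only the first one is explained here. The last lines are a sketch of how it might be applied; the name app is an assumption standing in for your project's Celery instance, not something from the original setup.

```python
# Same values as the workaround, annotated with each setting's effect.
CELERY_CONFIG = {
    # Replace each pool process with a fresh one after it has run one task.
    'CELERYD_MAX_TASKS_PER_CHILD': 1,
    # Reserve only one message per pool process at a time, so fewer
    # prefetched messages are stranded when the connection drops.
    'CELERYD_PREFETCH_MULTIPLIER': 1,
    # Acknowledge a message after the task has run instead of just before,
    # so an interrupted task is redelivered rather than lost.
    'CELERY_ACKS_LATE': True,
}

# Assumption: `app` is your Celery application instance. With these
# Celery 3.1-style setting names, the dict could be applied like this:
#   app.conf.update(CELERY_CONFIG)
```

Because messages are acknowledged late, a task can run more than once if its acknowledgement never reaches the broker, which is why this workaround only makes sense together with idempotent tasks.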

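As for tasks that have already lost their connection, there is nothing you can do right now. Perhaps this will be fixed in version 4. In the meantime I just make sure my tasks are as idempotent as possible, because with CELERY_ACKS_LATE a task whose acknowledgement fails can be delivered and executed again.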
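
To illustrate what "idempotent" means here, below is a minimal sketch in plain Python, not code from my project. The function charge_once, the order ID, and the in-memory _completed dictionary are all hypothetical; in a real deployment the record of completed work would have to live in a shared store such as a database, because worker processes are separate and get replaced.

```python
# Hypothetical in-memory record of finished work. A real deployment would
# need a shared store (database, cache), since each worker is its own process.
_completed = {}

def charge_once(order_id, amount, ledger):
    """Apply a charge at most once per order_id, however often it is called."""
    if order_id in _completed:
        # Redelivered message: return the earlier result, no new side effect.
        return _completed[order_id]
    ledger.append((order_id, amount))  # the side effect being protected
    _completed[order_id] = len(ledger)
    return _completed[order_id]

# In a Celery project this body would sit inside a task function; the
# decorator and the `app` name are assumptions, shown only as a comment:
#   @app.task(acks_late=True)
#   def charge(order_id, amount): ...

ledger = []
first = charge_once("order-42", 10, ledger)
second = charge_once("order-42", 10, ledger)  # simulates a redelivery
```

The second call simulates the broker redelivering a message whose acknowledgement was lost: it returns the stored result and leaves the ledger unchanged.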


Source: https://habr.com/ru/post/973318/

