We have a script that periodically downloads documents from different sources. I'm moving it over to Celery, and while doing so I'd like to take advantage of connection pooling, but I'm not sure how to go about it.
My current thought is to do something like this with Requests:
    import celery
    import requests

    s = requests.Session()

    @celery.task(max_retries=2)
    def get_doc(url):
        doc = s.get(url)
But I'm worried that the connections will be kept open indefinitely.
I really only need connections to stay open while I process new documents.
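To make that concrete, here's a rough sketch of the cleanup I have in mind (untested; I'm assuming Celery's worker_process_shutdown signal is the right hook for this):

    import requests
    from celery.signals import worker_process_shutdown

    # One session per worker process, shared by every task in that process.
    s = requests.Session()

    @worker_process_shutdown.connect
    def close_session(**kwargs):
        # Close the pooled connections when the worker process exits,
        # so they don't linger after the batch is finished.
        s.close()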
Alternatively, something like this might work:
    import celery
    import requests

    def get_all_docs():
        docs = Doc.objects.filter(some_filter=True)
        s = requests.Session()
        for doc in docs:
            t = get_doc.delay(doc.url, s)

    @celery.task(max_retries=2)
    def get_doc(url, s):
        doc = s.get(url)
However, in this case I'm not sure whether the session will actually be shared across the different task instances, or whether requests will just open new connections once the session has been pickled and unpickled.
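As far as I can tell, a Session can't be passed as a task argument without being serialized, so another idea I've sketched out, assuming Celery's worker_process_init signal fires the way I expect, is to give each worker process its own session instead of passing one around:

    import requests
    from celery.signals import worker_process_init

    s = None

    @worker_process_init.connect
    def init_session(**kwargs):
        # Create the session only after the worker process has been forked,
        # so nothing has to be pickled across the process boundary.
        global s
        s = requests.Session()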
Finally, I could try Celery's experimental support for applying task decorators to class methods, so something like this:
    import celery
    import requests

    class GetDoc(object):
        def __init__(self):
            self.s = requests.Session()

        @celery.task(max_retries=2)
        def get_doc(self, url):
            doc = self.s.get(url)
This last one looks like the best approach, and I'm going to test it; but I was wondering whether anyone here has already done something similar, or, failing that, whether any of you can suggest a better approach than the ones above.
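For reference, the variant of the class-based idea I'd test first is a custom Task base class that lazily creates one session per worker process. This is only a sketch (the app name and broker URL are placeholders, and it assumes the app-based Celery API):

    import celery
    import requests

    app = celery.Celery('downloader', broker='redis://localhost:6379/0')

    class SessionTask(celery.Task):
        _session = None

        @property
        def session(self):
            # Create the session on first use inside the worker process.
            if self._session is None:
                self._session = requests.Session()
            return self._session

    @app.task(base=SessionTask, bind=True, max_retries=2)
    def get_doc(self, url):
        # self is the task instance, so self.session reuses the
        # process-local pooled connections across task invocations.
        return self.session.get(url).text

Since Celery instantiates a task class once per worker process, the property should give each process a single pooled session without anything having to be pickled.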