How can I make start_urls in Scrapy consume from a message queue?

I am creating a Scrapy project with several spiders (one spider per domain). The URLs to be scraped arrive dynamically from user-defined queries, so basically I don't need to do any crawling or even follow links: URLs will come in one after another and I just need to extract the data with selectors. So I was wondering whether I could simply push the URLs onto a message queue that a Scrapy spider consumes from, and everything would work. But I can't figure out how. I checked

https://github.com/darkrho/scrapy-redis

but I feel it is not suitable for my purposes, since I need multiple queues (one queue per spider). As I understand it, one way is to override the start_requests method in the spider, but I don't quite understand what to do there (I'm new to Python and Scrapy). Can I just treat it like a regular Python script and have that method consume from (any) message queue? In addition, I need the spider(s) running 24*7, scraping whenever there is a request in the queue. I figured I should use signals and raise DontCloseSpider, but where do I do that? I'm pretty lost. Please help.
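For reference, this is roughly the signal pattern I think is meant, written out as my best guess (untested sketch; the spider name is a placeholder and the part that actually pulls URLs from the queue is omitted):

    import scrapy
    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class KeepAliveSpider(scrapy.Spider):
        name = 'abc-spider'  # placeholder name

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # Call our handler whenever the spider runs out of pending requests
            crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
            return spider

        def spider_idle(self):
            # Raising DontCloseSpider keeps the spider alive; new requests from
            # the message queue would have to be scheduled here (not shown)
            raise DontCloseSpider

Is this the right place to hook in, and how do I feed the queued URLs back into the engine from there?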

Here's the scenario I'm looking at:

User -> Query -> url from abc.com -> abc-spider
              -> url from xyz.com -> xyz-spider
              -> url from ghi.com -> ghi-spider

Every URL from a given website yields the same kind of item, so I have the selectors for that in each spider. That covers the single-user scenario; with multiple users there will be many unrelated URLs for the same spider, so it will look more like this:

query1, query2, query3

abc.com -> url_abc1, url_abc2, url_abc3

xyz.com -> url_xyz1, url_xyz2, url_xyz3

ghi.com -> url_ghi1, url_ghi2, url_ghi3

So for each website the URLs arrive dynamically and get placed in the corresponding message queue. Each spider dedicated to a website should then consume its own queue and give me scraped items whenever there is a request in that queue.
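To make the routing concrete, this is roughly what I have in mind on the producer side (just a sketch of my plan; Redis is only one possible backend here, and all of the names are made up):

    import redis
    from urllib.parse import urlparse

    # One queue (Redis list) per spider, e.g. "queue:abc-spider"
    SPIDER_FOR_DOMAIN = {
        'abc.com': 'abc-spider',
        'xyz.com': 'xyz-spider',
        'ghi.com': 'ghi-spider',
    }

    r = redis.Redis()

    def enqueue(url):
        # Route each incoming URL to the queue of the spider for its domain
        spider = SPIDER_FOR_DOMAIN.get(urlparse(url).netloc)
        if spider:
            r.lpush('queue:%s' % spider, url)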

1 answer

This is a very common and (IMO) great way to run Scrapy as part of a data pipeline; I do this all the time.

You are correct that you want to override the spider's start_requests() method. I don't know how Scrapy behaves if you define start_requests() as well as the start_urls attribute, but I would recommend just using start_requests() when you are pulling from a dynamic source like a database or a queue.

Here is some sample code, untested, but it should give you the right idea. Let me know if you need more information. It also assumes your queue is being filled by another process.

    import scrapy

    class ProfileSpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            # Pull URLs from the queue forever and turn each one into a request
            for url in self._pop_queue():
                yield self.make_requests_from_url(url)

        def _pop_queue(self):
            # Generator that yields URLs as they arrive on the queue
            while True:
                yield self.queue.read()

This exposes your queue as a generator. If you want to minimize the number of empty loops (because the queue may be empty much of the time), you can add a sleep or an exponential backoff inside the _pop_queue loop: if the queue is empty, sleep for a few seconds and try popping again.
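A backoff version of _pop_queue might look like this (again untested, and it assumes self.queue.read() is non-blocking and returns None when the queue is empty):

    import time

    def _pop_queue(self):
        delay = 1
        while True:
            url = self.queue.read()  # assumed non-blocking; None when empty
            if url:
                delay = 1            # reset the backoff once we get work
                yield url
            else:
                time.sleep(delay)
                delay = min(delay * 2, 60)  # exponential backoff, capped at 60s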

Assuming there are no fatal errors in your code, this should never terminate, thanks to the loop/generator construction.
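Alternatively, if your queue backend supports blocking reads, you don't need a polling loop at all. A minimal wrapper around a Redis list, for instance, could sit behind the self.queue.read() call above (a sketch, not part of the original code; the class and queue names are made up):

    import redis

    class RedisQueue(object):
        def __init__(self, name, host='localhost', port=6379):
            self.name = name
            self.redis = redis.Redis(host=host, port=port)

        def read(self):
            # BRPOP blocks until an item is available, so the spider just waits
            _key, url = self.redis.brpop(self.name)
            return url.decode('utf-8')

In the spider you would then set something like self.queue = RedisQueue('queue:%s' % self.name) in __init__, and the producer process pushes URLs onto that list with LPUSH.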

