I am creating a Scrapy project in which I have several spiders (one spider for each domain). The URLs that need to be scraped come in dynamically from user-given queries, so basically I don't need to do broad crawls or even follow links; URLs will arrive one after another and I just need to extract items from them using selectors. So I was wondering if I could simply pass the URLs into a message queue that a Scrapy spider consumes, and everything would work out. But I can't figure out how. I checked
https://github.com/darkrho/scrapy-redis
but I feel it is not suitable for my purposes, since I need several queues (one queue per spider). As I understand it, one way is to override the start_requests method in the spider, but I don't quite understand what to do there (I'm new to Python and Scrapy). Can I just treat it like a regular Python script and have that method consume from an (any) message queue? In addition, I need the spider(s) to run 24*7 and scrape whenever there is a request in the queue. I figured I should use signals and raise DontCloseSpider somewhere, but where do I do that? I'm pretty lost. Please help.
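To make the question concrete, here is roughly what I'm imagining for one of the spiders: a minimal sketch, assuming a Redis list as the message queue with a made-up per-spider key like `queue:abc` (the queue name, the Redis client, and the exact engine.crawl call are just my assumptions, not anything I have working):

```python
import redis
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class AbcSpider(scrapy.Spider):
    name = "abc-spider"
    # hypothetical per-spider queue key; each spider would get its own list
    queue_key = "queue:abc"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.redis = redis.StrictRedis()
        # listen for the idle signal so the spider never shuts down
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        # seed the crawl with whatever URLs are already waiting in the queue
        return self.next_requests()

    def next_requests(self):
        # drain the queue and turn each URL into a Request
        while True:
            url = self.redis.lpop(self.queue_key)
            if url is None:
                break
            yield scrapy.Request(url.decode("utf-8"), callback=self.parse)

    def spider_idle(self):
        # when Scrapy runs out of requests, poll the queue again instead of closing
        # (the exact engine.crawl signature may differ between Scrapy versions)
        for req in self.next_requests():
            self.crawler.engine.crawl(req, spider=self)
        raise DontCloseSpider

    def parse(self, response):
        # the same selectors apply to every URL from this domain
        yield {"title": response.css("title::text").extract_first()}
```

Is this the right general shape, or am I misunderstanding how start_requests and the spider_idle signal are supposed to be used?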
Here's the scenario I'm looking at:
User -> Query -> url from abc.com -> abc-spider
              -> url from xyz.com -> xyz-spider
              -> url from ghi.com -> ghi-spider
Now every URL will have the same item to be scraped, regardless of the website, which is why I have selectors doing that in every spider. What I described above is just the single-user scenario; when there are multiple users, there will be many unrelated URLs arriving for the same spider, so it will look something like this:
query1, query2, query3
abc.com -> url_abc1, url_abc2, url_abc3
xyz.com -> url_xyz1, url_xyz2, url_xyz3
ghi.com -> url_ghi1, url_ghi2, url_ghi3
So for each website, these URLs will arrive dynamically and be placed in the corresponding message queue. Each spider, dedicated to its own website, should then consume its own queue and give me scraped items whenever there is a request in that queue.
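On the producing side, this is what I have in mind: incoming URLs get routed to the queue of the spider responsible for their domain. Again just a sketch with made-up queue keys and a plain Redis client as the queue:

```python
import redis
from urllib.parse import urlparse

# hypothetical mapping from domain to the per-spider queue key
QUEUES = {
    "abc.com": "queue:abc",
    "xyz.com": "queue:xyz",
    "ghi.com": "queue:ghi",
}

r = redis.StrictRedis()

def enqueue(url):
    """Push a URL onto the queue of the spider that handles its domain."""
    domain = urlparse(url).netloc
    key = QUEUES.get(domain)
    if key is None:
        raise ValueError("no spider registered for %s" % domain)
    r.rpush(key, url)

# e.g. enqueue("http://abc.com/item/1") lands in queue:abc, to be picked up by abc-spider
```

Does this overall design (one queue per spider, spiders kept alive with DontCloseSpider) make sense, or is there a more idiomatic Scrapy way to do it?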