Scrapy - How to initiate multiple instances of the same spider process?

I get stuck when running multiple instances of the same spider. I want to run it as 1 URL per 1 spider instance. I have to handle 50k URLs, and for that I need to initiate a separate instance for each. In my main spider script, I set a spider timeout of 7 minutes to make sure that no single crawl runs for too long. See the code below:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    import urlparse

    for start_url in all_urls:
        domain = urlparse.urlparse(start_url).netloc
        if domain.startswith('ww'):
            domain = domain.split(".", 1)[1]
        process = CrawlerProcess(get_project_settings())
        process.crawl('textextractor', start_url=start_url, allowed_domains=domain)
        process.start()
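For reference, the 7-minute timeout mentioned above lives in the project settings; a minimal sketch, assuming it is enforced with Scrapy's built-in CLOSESPIDER_TIMEOUT setting (value in seconds):

    # settings.py -- ask the CloseSpider extension to stop each crawl
    # after 7 minutes (the setting is expressed in seconds)
    CLOSESPIDER_TIMEOUT = 7 * 60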

It runs completely for the 1st URL, but after that, when the 2nd URL is passed, it gives the error below:

        raise error.ReactorNotRestartable()
    ReactorNotRestartable

Please suggest what I should do to run multiple instances of the same spider. Also, I am considering initiating multiple Scrapy instances at a time using threads. Would this be a good approach?

2 answers

How about this: create the process once, schedule a crawl for each URL, and start the reactor a single time after all the crawls have been added:

    process = CrawlerProcess(get_project_settings())
    for start_url in all_urls:
        domain = urlparse.urlparse(start_url).netloc
        if domain.startswith('ww'):
            domain = domain.split(".", 1)[1]
        process.crawl('textextractor', start_url=start_url, allowed_domains=domain)
    process.start()
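Note that this runs all the scheduled spiders concurrently in one reactor. If the crawls must run one after another instead, the pattern from the Scrapy documentation is a CrawlerRunner with chained deferreds; a sketch along those lines, reusing all_urls and the textextractor spider from the question:

    import urlparse

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl_sequentially():
        for start_url in all_urls:
            domain = urlparse.urlparse(start_url).netloc
            if domain.startswith('ww'):
                domain = domain.split(".", 1)[1]
            # runner.crawl() returns a Deferred; yielding it waits for the
            # current spider to finish before the next one is started
            yield runner.crawl('textextractor', start_url=start_url,
                               allowed_domains=domain)
        reactor.stop()

    crawl_sequentially()
    reactor.run()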

Is there a specific reason you want to run 50,000 spider instances? Twisted's reactor can only be started once per process, which is exactly what the ReactorNotRestartable error is telling you (unless you kill the whole process and restart it).
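If a fresh reactor per URL is genuinely required, the usual workaround is to isolate each crawl in its own OS process, so the reactor dies with it. A minimal sketch using the standard library's multiprocessing module, with all_urls and the textextractor spider taken from the question:

    import urlparse
    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_spider(start_url, domain):
        # each child process gets its own, brand-new Twisted reactor,
        # so process.start() never hits ReactorNotRestartable
        process = CrawlerProcess(get_project_settings())
        process.crawl('textextractor', start_url=start_url,
                      allowed_domains=domain)
        process.start()

    for start_url in all_urls:
        domain = urlparse.urlparse(start_url).netloc
        if domain.startswith('ww'):
            domain = domain.split(".", 1)[1]
        p = Process(target=run_spider, args=(start_url, domain))
        p.start()
        p.join()  # run one crawl at a time; drop the join() to parallelize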

Second, "1 URL per 1 spider instance" will cause huge memory overhead. Instead, you should consider passing all the URLs to the same instance.
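A sketch of that single-instance approach, assuming textextractor picks up start_urls the way the default Spider base class does (keyword arguments become spider attributes):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    # one spider instance receives the full URL list; Scrapy's scheduler
    # handles concurrency and politeness across all 50k requests
    process.crawl('textextractor', start_urls=all_urls)
    process.start()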


Source: https://habr.com/ru/post/1235869/

