Scrapy - How to initiate multiple instances of the same spider process?

I get stuck when running multiple instances of the same spider. I want to run it as 1 URL per 1 spider instance. I have to handle 50k URLs, and for that I need to initiate a separate instance for each. In my main spider script, I set a spider timeout of 7 minutes to make sure that no single crawl runs for too long. See the code below:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    import urlparse

    for start_url in all_urls:
        domain = urlparse.urlparse(start_url).netloc
        if domain.startswith('ww'):
            domain = domain.split(".", 1)[1]
        process = CrawlerProcess(get_project_settings())
        process.crawl('textextractor', start_url=start_url, allowed_domains=domain)
        process.start()
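For reference, the 7-minute timeout mentioned above lives in the project settings; a minimal sketch, assuming it is enforced with Scrapy's built-in CLOSESPIDER_TIMEOUT setting (value in seconds):

    # settings.py -- ask the CloseSpider extension to stop each crawl
    # after 7 minutes (the setting is expressed in seconds)
    CLOSESPIDER_TIMEOUT = 7 * 60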

It runs completely for the 1st URL, but after that, when the 2nd URL is passed, it gives the error below:

        raise error.ReactorNotRestartable()
    ReactorNotRestartable

Please suggest what I should do to run multiple instances of the same spider. Also, I am considering initiating multiple Scrapy instances at a time using threads. Would this be a good approach?

2 answers

How about this: create the process once, schedule a crawl for each URL, and start the reactor a single time after all the crawls have been added:

    process = CrawlerProcess(get_project_settings())
    for start_url in all_urls:
        domain = urlparse.urlparse(start_url).netloc
        if domain.startswith('ww'):
            domain = domain.split(".", 1)[1]
        process.crawl('textextractor', start_url=start_url, allowed_domains=domain)
    process.start()
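Note that this runs all the scheduled spiders concurrently in one reactor. If the crawls must run one after another instead, the pattern from the Scrapy documentation is a CrawlerRunner with chained deferreds; a sketch along those lines, reusing all_urls and the textextractor spider from the question:

    import urlparse

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl_sequentially():
        for start_url in all_urls:
            domain = urlparse.urlparse(start_url).netloc
            if domain.startswith('ww'):
                domain = domain.split(".", 1)[1]
            # runner.crawl() returns a Deferred; yielding it waits for the
            # current spider to finish before the next one is started
            yield runner.crawl('textextractor', start_url=start_url,
                               allowed_domains=domain)
        reactor.stop()

    crawl_sequentially()
    reactor.run()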

Is there a specific reason you want to run 50,000 spider instances? Twisted's reactor can only be started once per process, which is exactly what the ReactorNotRestartable error is telling you (unless you kill the whole process and restart it).
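If a fresh reactor per URL is genuinely required, the usual workaround is to isolate each crawl in its own OS process, so the reactor dies with it. A minimal sketch using the standard library's multiprocessing module, with all_urls and the textextractor spider taken from the question:

    import urlparse
    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_spider(start_url, domain):
        # each child process gets its own, brand-new Twisted reactor,
        # so process.start() never hits ReactorNotRestartable
        process = CrawlerProcess(get_project_settings())
        process.crawl('textextractor', start_url=start_url,
                      allowed_domains=domain)
        process.start()

    for start_url in all_urls:
        domain = urlparse.urlparse(start_url).netloc
        if domain.startswith('ww'):
            domain = domain.split(".", 1)[1]
        p = Process(target=run_spider, args=(start_url, domain))
        p.start()
        p.join()  # run one crawl at a time; drop the join() to parallelize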

Second, "1 URL per 1 spider instance" will cause huge memory overhead. Instead, you should consider passing all the URLs to the same instance.
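A sketch of that single-instance approach, assuming textextractor picks up start_urls the way the default Spider base class does (keyword arguments become spider attributes):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    # one spider instance receives the full URL list; Scrapy's scheduler
    # handles concurrency and politeness across all 50k requests
    process.crawl('textextractor', start_urls=all_urls)
    process.start()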


Source: https://habr.com/ru/post/1235869/

