I get stuck when running multiple instances of the same spider. I want to run one spider instance per URL. I have to handle 50k URLs, and for that I need to initiate a separate instance for each. In my main spider script, I set the spider timeout to 7 minutes to make sure that no single crawl runs too long. See the code below:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import urlparse

for start_url in all_urls:
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('www.'):
        domain = domain.split(".", 1)[1]
    process = CrawlerProcess(get_project_settings())
    process.crawl('textextractor', start_url=start_url, allowed_domains=domain)
    process.start()
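(For the 7-minute timeout, I set it in my project settings; assuming the standard CloseSpider extension setting is the right one, it looks like this:)

# settings.py -- close each spider after 7 minutes (420 seconds)
CLOSESPIDER_TIMEOUT = 420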
It runs completely for the 1st URL, but after that, when the 2nd URL is passed, it fails with the error below:
raise error.ReactorNotRestartable()
ReactorNotRestartable
Please suggest what I should do to run multiple instances of the same spider. Also, I'm planning to launch multiple Scrapy instances at a time using threads. Would that be a good approach?
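For reference, one pattern I've seen in the CrawlerProcess docs is to schedule every crawl on a single process and call start() only once. A rough sketch of what I think that would look like for my case (reusing all_urls and my 'textextractor' spider from above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import urlparse

process = CrawlerProcess(get_project_settings())
for start_url in all_urls:
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('www.'):
        domain = domain.split(".", 1)[1]
    # schedule the crawl; nothing runs until start() is called
    process.crawl('textextractor', start_url=start_url, allowed_domains=domain)
process.start()  # starts the reactor once; blocks until all crawls finish

But this runs all the spiders concurrently in one process rather than as one isolated instance per URL, so I'm not sure it fits my requirement.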