How to call individual Scrapy spiders from another Python script

I have a script called algorithm.py and I want to be able to call Scrapy spiders from within it. The file structure:

    algorithm.py
    MySpiders/

where MySpiders is a folder containing several Scrapy projects. I would like to create perform_spider1(), perform_spider2(), ... methods that I can call from algorithm.py.

How do I create this method?

I managed to call one spider using the following code, but it is not wrapped in a method and it only works for one spider. I am a newbie in need of help!

    import sys, os.path
    sys.path.append('path to spider1/spider1')

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals
    from scrapy.xlib.pydispatch import dispatcher

    from spider1.spiders.spider1_spider import Spider1Spider

    def stop_reactor():
        reactor.stop()

    # stop the reactor as soon as the spider closes
    dispatcher.connect(stop_reactor, signal=signals.spider_closed)

    spider = Spider1Spider()
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start()
    log.msg('Running reactor...')
    reactor.run()  # the script will block here
    log.msg('Reactor stopped.')
+6
2 answers

Just loop over your spiders and set each one up by calling configure, crawl and start, and only then call log.start() and reactor.run(). Scrapy will run multiple spiders in the same process.
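A minimal sketch of that loop, assuming the same pre-1.0 Scrapy API the question uses (Spider1Spider and Spider2Spider are placeholders for your own spider classes; newer Scrapy versions wrap this pattern in CrawlerProcess):

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals
    from scrapy.xlib.pydispatch import dispatcher

    # placeholders: substitute your own spider classes here
    spiders = [Spider1Spider(), Spider2Spider()]

    spider_count = 0

    def stop_when_all_closed():
        # each spider fires spider_closed; stop the reactor after the last one
        global spider_count
        spider_count += 1
        if spider_count == len(spiders):
            reactor.stop()

    dispatcher.connect(stop_when_all_closed, signal=signals.spider_closed)

    for spider in spiders:
        crawler = Crawler(Settings())
        crawler.configure()    # wire up the crawler for this spider
        crawler.crawl(spider)  # queue the spider
        crawler.start()        # start its engine; the reactor is not running yet

    log.start()
    reactor.run()  # blocks here until stop_when_all_closed() stops the reactor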

See the documentation and this thread for more details.

Also, consider running your spiders through scrapyd.
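With scrapyd you also get the perform_spider1()-style entry points almost for free: once the projects are deployed, each spider is scheduled with a plain HTTP POST to scrapyd's schedule.json endpoint (port 6800 by default). A rough Python 2 sketch, where the project and spider names are placeholders:

    import urllib
    import urllib2

    def perform_spider(spider_name, project='myspiders'):
        # POST to a locally running scrapyd instance; returns its JSON reply
        data = urllib.urlencode({'project': project, 'spider': spider_name})
        return urllib2.urlopen('http://localhost:6800/schedule.json', data).read()

    perform_spider('spider1')
    perform_spider('spider2')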

Hope this helps.

+5

Based on the good advice from alecxe, here is a possible solution.

    import sys, os.path
    sys.path.append('/path/ra_list/')
    sys.path.append('/path/ra_event/')

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals
    from scrapy.xlib.pydispatch import dispatcher

    from ra_list.spiders.ra_list_spider import RaListSpider
    from ra_event.spiders.ra_event_spider import RaEventSpider

    spider_count = 0
    number_of_spiders = 2

    def stop_reactor_after_all_spiders():
        # stop the reactor only once every spider has closed
        global spider_count
        spider_count = spider_count + 1
        if spider_count == number_of_spiders:
            reactor.stop()

    dispatcher.connect(stop_reactor_after_all_spiders, signal=signals.spider_closed)

    def crawl_resident_advisor():
        global spider_count
        spider_count = 0

        crawler = Crawler(Settings())
        crawler.configure()
        crawler.crawl(RaListSpider())
        crawler.start()

        crawler = Crawler(Settings())
        crawler.configure()
        crawler.crawl(RaEventSpider())
        crawler.start()

        log.start()
        log.msg('Running in reactor...')
        reactor.run()  # the script will block here
        log.msg('Reactor stopped.')
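One caveat worth flagging: a Twisted reactor cannot be restarted once it has stopped, so crawl_resident_advisor() can only be called once per process. A possible way to use it from algorithm.py, assuming the code above lives in a hypothetical my_crawlers module:

    # algorithm.py
    from my_crawlers import crawl_resident_advisor  # hypothetical module name

    crawl_resident_advisor()  # blocks until both spiders have closed
    # ... continue the algorithm with the scraped data ...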
+2

Source: https://habr.com/ru/post/946826/

