Scrapy - ReactorNotRestartable

With the following imports:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

I have always successfully completed this process:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start() 

but since I moved this code into a method, web_crawler(self), for example:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2) 

and started calling the method by creating an instance of the class, for example:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and then running:

test()

I get the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start() 
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

What's wrong?

5 answers

You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process for each run:

import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Run it twice:

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"A day without sunshine is like, you know, night."
...

second run:
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"A day without sunshine is like, you know, night."
...

This is what helped me win the battle against the ReactorNotRestartable error:

0) pip install crochet
1) import: from crochet import setup
2) call setup() at the top of the file
3) remove these 2 lines:
   a) d.addBoth(lambda _: reactor.stop())
   b) reactor.run()

I had the same problem with this error and spent 4+ hours reading every question about it before solving it, so I am sharing what worked. The only lines kept from the Scrapy docs are the 2 last lines in this code:

from importlib import import_module

from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)           # dynamic import of the selected spider module
    spider_cls = scrapy_var.mySpider                  # get the mySpider class from the spider module
    crawler = CrawlerRunner(get_project_settings())   # from the Scrapy docs
    crawler.crawl(spider_cls)                         # from the Scrapy docs

This lets me select which spider to run just by passing its name to run_spider, and after the scraping finishes, select another spider and run it again.
Hope this helps somebody, as it helped me :)
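
Note that run_spider above only fires the crawl and returns; the reactor keeps running in the background thread crochet started. If the caller needs to block until a crawl finishes, crochet's wait_for decorator can wrap the same logic. A minimal sketch under the same assumptions (a first_scrapy project with a mySpider class per spider module; the spider names here are made up):

from importlib import import_module

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()  # crochet starts the reactor once, in a background thread

@wait_for(timeout=60.0)  # block the caller until the crawl's Deferred fires
def run_spider_blocking(spider_name):
    module_name = "first_scrapy.spiders.{}".format(spider_name)
    scrapy_var = import_module(module_name)
    runner = CrawlerRunner(get_project_settings())
    return runner.crawl(scrapy_var.mySpider)  # return the Deferred for wait_for to wait on

run_spider_blocking("spider_one")
run_spider_blocking("spider_two")  # works: the reactor never stops between runs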


As per the Scrapy documentation, the start() method of the CrawlerProcess class does the following:

"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."

The error you are receiving is thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you jimmy-rig some sort of code to restart it (I've seen it done), there's no guarantee it will work.

Honestly, if you think you need to restart the reactor, you are likely doing something wrong.

Depending on what you want to do, I would also review the "Run Scrapy from a Script" portion of the documentation.
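
For reference, that part of the docs shows how to run several crawls in one reactor run by chaining them on a single CrawlerRunner, instead of restarting the reactor. A minimal sketch, reusing the QuotesSpider defined in the first answer:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for that crawl to finish before the next one starts
    yield runner.crawl(QuotesSpider)
    yield runner.crawl(QuotesSpider)
    reactor.stop()  # stop the reactor once all crawls are done

crawl()
reactor.run()  # the script blocks here until the last crawl completes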


The mistake is in this code:

def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here

web_crawler() is called twice, so the reactor is asked to start a second time, which raises ReactorNotRestartable, as pointed out by @Rejected.

Getting the results from one single process run and storing both results in a tuple is the way to go here:

def __call__(self):
    result1, result2 = test.web_crawler()
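
For that to work, web_crawler() has to produce both results from its single process.start() call. One way to do this, sketched here with a placeholder MySpider class and a made-up split of the results, is to collect items through Scrapy's item_scraped signal and derive both values from the one collected list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def web_crawler():
    items = []

    def collect_item(item, response, spider):
        items.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(MySpider)  # MySpider: your spider class
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes; runs exactly once

    result1 = items       # placeholder: derive both results
    result2 = len(items)  # from the single collected list
    return result1, result2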

This solved my problem: put the code below after reactor.run() or process.start():

import os
import sys
import time

time.sleep(0.5)  # give Twisted a moment to finish shutting down
os.execl(sys.executable, sys.executable, *sys.argv)  # replace the process with a fresh run of the script
