High number of yielded requests for each page being scraped

I have a crawl setup where each page contains more than 300 links that I need to follow. Crawling slows down after a minute or so, and sometimes it scrapes 0 pages per minute.

If I crawl a site with only 10-50 links per page, the same problem does not appear.

I have set CONCURRENT_REQUESTS to 10, the number of concurrently processed items to 10, and the reactor threadpool to 400. That means up to 3,000 yielded requests for every 10 processed pages ...
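For reference, this is roughly what that looks like in settings.py; I am assuming the standard Scrapy setting names here (REACTOR_THREADPOOL_MAXSIZE for the reactor threadpool, CONCURRENT_ITEMS for the processed items):

# settings.py (sketch of the concurrency setup described above)
CONCURRENT_REQUESTS = 10           # parallel downloads
CONCURRENT_ITEMS = 10              # items processed in parallel per response
REACTOR_THREADPOOL_MAXSIZE = 400   # size of the Twisted reactor threadpool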

Logging shows that the parse function takes more than 70 seconds per page, and that the time is spent on the yields (up to 2 seconds per yielded request).

Is Scrapy perhaps waiting for the engine, or something like that, to finish one task before it is ready to accept a newly yielded request? Adding requests to the scheduler should not take much time, so it looks to me as if the yield is waiting for something else.

Any clue what I need to configure, or what else might be going wrong?

Is it possible to yield requests in bulk instead of yielding each one individually? Is it possible to add them to the scheduler without yielding them at all?
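To illustrate what I mean, a sketch like the following (simplified from my scrape_data below) is the kind of thing I am asking about; it uses Scrapy's internal engine API, which I am not sure is intended for this, and whose signature differs between Scrapy versions:

def scrape_data(self, response):
    # ... extraction as in scrape_data below ...
    for link in extract_links:
        r = self.get_request(link["url"])
        # push the request to the engine/scheduler directly instead of yielding it;
        # internal API: older Scrapy versions need the spider argument,
        # e.g. self.crawler.engine.crawl(r, self)
        self.crawler.engine.crawl(r)
    yield l.load_item()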

Some additional information:

- It makes no difference whether I use scrapy-redis or just the disk-based scheduler (rough configs for both are sketched below).
- Because of the JavaScript rendering, loading a crawled page can take up to 10 seconds.
- The machine is not under heavy load.
- Giving it more CPU resources does not speed up the crawl.
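The two scheduler variants mentioned in the first point are configured roughly like this (the JOBDIR path is just a placeholder):

# Variant 1: plain Scrapy with disk-based queues
JOBDIR = "crawls/onetwothree-1"

# Variant 2: scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
REDIS_URL = "redis://localhost:6379"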

telnet -> est():

time()-engine.start_time                        : 676.0599975585938
engine.has_capacity()                           : False
len(engine.downloader.active)                   : 7
engine.scraper.is_idle()                        : False
engine.spider.name                              : onetwothree
engine.spider_is_idle(engine.spider)            : False
engine.slot.closing                             : False
len(engine.slot.inprogress)                     : 28
len(engine.slot.scheduler.dqs or [])            : AttributeError (exception)
len(engine.slot.scheduler.mqs)                  : AttributeError (exception)
len(engine.scraper.slot.queue)                  : 0
len(engine.scraper.slot.active)                 : 21
engine.scraper.slot.active_size                 : 3878605
engine.scraper.slot.itemproc_size               : 0
engine.scraper.slot.needs_backout()             : False

The AttributeError seems to come from the scrapy-redis plugin; without it, Scrapy keeps the count of requests still to be made in mqs.
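For reference, the dump above and the one below come from Scrapy's telnet console, opened with the default settings (Scrapy prints the login credentials at startup):

$ telnet localhost 6023
>>> est()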

est() when len(engine.downloader.active) goes down to 0:

time()-engine.start_time                        : 7236.464096784592
engine.has_capacity()                           : False
len(engine.downloader.active)                   : 0
engine.scraper.is_idle()                        : False
engine.spider.name                              : onetwothree
engine.spider_is_idle(engine.spider)            : False
engine.slot.closing                             : False
len(engine.slot.inprogress)                     : 25
len(engine.slot.scheduler.dqs or [])            : AttributeError (exception)
len(engine.slot.scheduler.mqs)                  : AttributeError (exception)
len(engine.scraper.slot.queue)                  : 0
len(engine.scraper.slot.active)                 : 25
engine.scraper.slot.active_size                 : 5357134
engine.scraper.slot.itemproc_size               : 0
engine.scraper.slot.needs_backout()             : True

Spider code:

import logging
import time

import scrapy_splash
from scrapy import Spider
from scrapy_splash import SplashRequest

import util  # project helper providing extract_links_from_response (not shown here)


class robo2Spider(Spider):
    http_pass = None
    http_user = None
    dont_redirect = True
    start_urls = []

    def __init__(self, *args, **kwargs):

        # ... some config ...

        self.start_urls = self.custom_settings["TEST_URLS"]
        # Don't Follow links in test mode

    def start_requests(self):
        for url in self.start_urls:
            r = self.get_request(url)
            yield r

    def parse(self, response):
        # some extraction and co...
        yield from self.scrape_data(response)

    def scrape_data(self, response):
        start_time = time.time()

        # more extraction; build the item via an ItemLoader `l` (used below)

        extract_links = util.extract_links_from_response(response, self.query_pars_to_ignore)
        logging.info(
            "--- logging time 1: %s --- %s seconds ---" % (response.url, time.time() - start_time))

        request_links = []
        for link in extract_links:

            if (not link["nofollow"]) and (l.get_output_value("crawl_meta_nofollow") != "nofollow"):
                r = self.get_request(link["url"])
                request_links.append(r)

        yield from request_links

        logging.info(
            "--- logging time 2 (takes up to 70 sec): %s --- %s seconds ---" % (response.url, time.time() - start_time))

        yield l.load_item()


    def get_request(self, url, rit=None, splash_retry=None):

        # ... setting meta & co ...
        meta = {}
        splash_args = {}
        return SplashRequest(url=url, callback=self.parse, meta=meta,
                             args=splash_args, http_status_from_error_code=True,
                             endpoint='execute', slot_policy=scrapy_splash.SlotPolicy.SCRAPY_DEFAULT)

I also tried yielding each request directly inside the loop over extract_links:

for link in extract_links:

    if (not link["nofollow"]) and (l.get_output_value("crawl_meta_nofollow") != "nofollow"):
        r = self.get_request(link["url"])
        request_links.append(r)
        yield r

Source: https://habr.com/ru/post/1682498/

