Dynamically add to allowed_domains in Scrapy spider

I have a spider that starts with a small allowed_domains list. I need to add more domains to this whitelist dynamically, from within the parser, as the crawl continues. The following code does not work, because subsequent requests are still filtered. Is there another way to update allowed_domains?

class APSpider(BaseSpider):
    name = "APSpider"

    allowed_domains = ["www.somedomain.com"]

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    ...

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            u = urlparse.urlparse(_website)
            self.allowed_domains.append(u.netloc)

            yield Request(url=_website, callback=self.parse_secondary_site)

    ...
2 answers

You can try something like the following:

class APSpider(BaseSpider):
    name = "APSpider"

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    def __init__(self, *args, **kwargs):
        super(APSpider, self).__init__(*args, **kwargs)
        # Start with an empty list (None cannot be appended to).
        self.allowed_domains = []

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        # First response: collect the domains to allow.
        if not self.allowed_domains:
            for link_tag in soup.findAll('td', {'class': 'half-width'}):
                _website = link_tag.find('a')['href']
                u = urlparse.urlparse(_website)
                self.allowed_domains.append(u.netloc)

                yield Request(url=_website, callback=self.parse_secondary_site)

        # Compare the host, not the full URL, against the collected domains.
        if urlparse.urlparse(response.url).netloc in self.allowed_domains:
            yield Request(...)

    ...

(At the time of writing, the latest version of Scrapy is 1.0.3. This answer should work for all recent versions of Scrapy.)

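OffsiteMiddleware reads allowed_domains only once, when it builds its precompiled regex while handling the spider_opened signal; after that, the values in allowed_domains are never accessed again.
So simply updating the contents of allowed_domains does not solve the problem.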
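
The caching behaviour described here can be illustrated without Scrapy. The sketch below uses a simplified stand-in for the middleware (the names CachingOffsiteFilter, should_follow, and FakeSpider are hypothetical, and this is not Scrapy's actual code); it only assumes that the host regex is compiled once in spider_opened and reused afterwards.

```python
import re


class CachingOffsiteFilter:
    """Simplified stand-in for OffsiteMiddleware (not Scrapy's actual code)."""

    def spider_opened(self, spider):
        # The regex is compiled once, from whatever allowed_domains holds now.
        domains = "|".join(re.escape(d) for d in spider.allowed_domains)
        self.host_regex = re.compile(r"^(.*\.)?(%s)$" % domains)

    def should_follow(self, host):
        # Later checks consult only the cached regex, never allowed_domains.
        return bool(self.host_regex.search(host))


class FakeSpider:
    """Hypothetical spider object holding only the attribute we need."""

    def __init__(self):
        self.allowed_domains = ["www.somedomain.com"]


spider = FakeSpider()
mw = CachingOffsiteFilter()
mw.spider_opened(spider)

# Appending alone leaves the cached regex stale.
spider.allowed_domains.append("example.org")
before = mw.should_follow("example.org")

# Re-running spider_opened rebuilds the regex from the updated list.
mw.spider_opened(spider)
after = mw.should_follow("example.org")

print(before, after)
```

Here `before` is False and `after` is True, which mirrors why the request in the question is still filtered after the append.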

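Basically, two steps are required:

  • Update the contents of allowed_domains as needed.
  • Refresh the regex cache in OffsiteMiddleware.

Here is the code I use for step #2: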

# Refresh the regex cache for `allowed_domains`
for mw in self.crawler.engine.scraper.spidermw.middlewares:
    if isinstance(mw, scrapy.spidermiddlewares.offsite.OffsiteMiddleware):
        mw.spider_opened(self)

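The code above is meant to be called inside a response callback, so self here is an instance of the spider class.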
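
Both steps can be folded into one helper. This is a minimal sketch rather than a tested recipe: allow_domain is a hypothetical name, the middleware class is passed in as a parameter so the block runs without Scrapy installed, and the attribute path crawler.engine.scraper.spidermw.middlewares is taken from the snippet above, so it may differ in other Scrapy versions. It uses the URL's hostname rather than netloc so that a port number is not stored as part of the domain.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def allow_domain(spider, url, offsite_cls):
    """Add the URL's host to spider.allowed_domains and refresh the filter.

    Returns True if a new domain was added, False otherwise.
    """
    host = urlparse(url).hostname
    if not host or host in spider.allowed_domains:
        return False
    # Step 1: update allowed_domains.
    spider.allowed_domains.append(host)
    # Step 2: make every offsite middleware rebuild its cached regex.
    for mw in spider.crawler.engine.scraper.spidermw.middlewares:
        if isinstance(mw, offsite_cls):
            mw.spider_opened(spider)
    return True


# Stand-in objects (hypothetical) so the sketch can run on its own.
class FakeOffsite:
    refreshed = 0

    def spider_opened(self, spider):
        self.refreshed += 1


mw = FakeOffsite()
scraper = SimpleNamespace(spidermw=SimpleNamespace(middlewares=[mw]))
spider = SimpleNamespace(
    allowed_domains=["www.somedomain.com"],
    crawler=SimpleNamespace(engine=SimpleNamespace(scraper=scraper)),
)

added = allow_domain(spider, "http://example.org:8080/page", FakeOffsite)
added_again = allow_domain(spider, "http://example.org/other", FakeOffsite)
```

In a real spider, the call inside parse would pass self as the spider and scrapy.spidermiddlewares.offsite.OffsiteMiddleware as offsite_cls, before yielding the Request for the new site.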


Source: https://habr.com/ru/post/1795629/

