I have a spider that starts with a small allowed_domains list defined at the top of the spider. I need to add more domains to this whitelist dynamically, from within the parser, as the crawl continues. The following snippet does not accomplish that: subsequent requests to the newly added domains are still filtered. Is there another way to update allowed_domains from within the parser?
class APSpider(BaseSpider):
    name = "APSpider"
    allowed_domains = ["www.somedomain.com"]
    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    ...

    def parse(self, response):
        soup = BeautifulSoup(response.body)
        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            u = urlparse.urlparse(_website)
            self.allowed_domains.append(u.netloc)
            yield Request(url=_website, callback=self.parse_secondary_site)

    ...
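
My guess at why the append has no effect, which I have not confirmed against Scrapy's source: the offsite filter may build its host pattern once from allowed_domains when the spider opens, so later changes to the list are never seen. The plain-Python sketch below only illustrates that suspected behaviour. build_host_regex and www.otherdomain.com are hypothetical names made up for the example, not Scrapy APIs.

```python
import re


def build_host_regex(allowed_domains):
    # Hypothetical helper: compile one pattern from the domains known right now.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


allowed_domains = ["www.somedomain.com"]

# Snapshot taken once, standing in for "when the spider opens".
host_regex = build_host_regex(allowed_domains)

# Later append, as done inside parse() in the spider above.
allowed_domains.append("www.otherdomain.com")

# The compiled pattern predates the append, so the new host is still rejected.
stale = bool(host_regex.search("www.otherdomain.com"))

# Rebuilding the pattern after the append picks up the new domain.
host_regex = build_host_regex(allowed_domains)
fresh = bool(host_regex.search("www.otherdomain.com"))

print(stale, fresh)
```

If this is what happens, appending to the list would not be enough on its own and the filter's pattern would also have to be refreshed, but I don't know whether Scrapy exposes a supported way to do that.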