Scrapy: crawl all subdomains

I want to use Scrapy on a website whose pages are spread across many subdomains. I know I need a CrawlSpider with a Rule, but I want the rule to simply be "allow all subdomains and let the parsers handle the rest according to the data" (meaning that, in the example, the item_links point to different subdomains).

code example:

from scrapy.http import Request
from scrapy.selector import Selector

def parse_page(self, response):
    sel = Selector(response)
    # "XXXXXXXXX" is a placeholder for the real XPath; the links it
    # extracts point to several different subdomains.
    item_links = sel.xpath("XXXXXXXXX").extract()
    for item_link in item_links:
        yield Request(url=item_link, callback=self.parse_item)

def parse_item(self, response):
    sel = Selector(response)
    # ... extract the item fields here

EDIT: Just to make the question clear, I want to be able to crawl everything under *.example.com, which means not getting Filtered offsite request to 'foo.example.com'.

ANOTHER EDIT: Following @agstudy's answer, make sure you also remove allowed_domains = ["www.example.com"] from the spider.
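
For reference, a minimal before/after sketch of that change (the domain is a placeholder for your own):

# Before: only www.example.com passes the offsite filter, so links to
# foo.example.com are dropped with "Filtered offsite request".
# allowed_domains = ["www.example.com"]

# After: either remove the attribute entirely, or list the bare domain
# so that every *.example.com subdomain is allowed.
allowed_domains = ["example.com"]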

+4
2 answers

You can set allow_domains on the rule:

rules = (
       Rule(SgmlLinkExtractor(allow_domains=('domain1', 'domain2'))),
)

for example:

rules = (
       Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com'))),
)

This will allow URLs such as:

www.example.com/blaa/bla/
www.example1.com/blaa/bla/
www.something.example.com/blaa/bla/
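
For completeness, here is a hedged sketch of how such a rule might sit inside a full spider. It uses LinkExtractor, which replaced SgmlLinkExtractor in later Scrapy releases; the spider name, start URL and XPath are placeholders rather than values from the question:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = "example"                          # placeholder
    start_urls = ["http://www.example.com/"]  # placeholder
    # No allowed_domains attribute, so nothing is dropped as offsite.

    rules = (
        # Follow links on example.com and any of its subdomains and
        # hand every matching response to parse_page.
        Rule(LinkExtractor(allow_domains=("example.com",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # "XXXXXXXXX" stands in for the real XPath from the question.
        for item_link in response.xpath("XXXXXXXXX").extract():
            yield scrapy.Request(url=item_link, callback=self.parse_item)

    def parse_item(self, response):
        pass  # extract the item fields here

Note that with follow=True the rule already follows links across subdomains; the explicit Request in parse_page mirrors the question and is only needed for the links picked out by the XPath.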
+1

Alternatively, if you are using the allowed_domains attribute on the Spider, set allowed_domains = ['example.com']. That allows example.com itself and every subdomain of it, such as foo.example.com.
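
As a quick sanity check (assuming a reasonably recent Scrapy version), the url_is_from_any_domain helper in scrapy.utils.url shows how a bare domain also covers its subdomains:

from scrapy.utils.url import url_is_from_any_domain

# Subdomains of example.com match the bare domain ...
print(url_is_from_any_domain("http://foo.example.com/blaa/bla/", ["example.com"]))   # True
# ... while other domains, even similar ones, do not.
print(url_is_from_any_domain("http://www.example1.com/blaa/bla/", ["example.com"]))  # False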

+3

Source: https://habr.com/ru/post/1542829/

