Scrapy filters unique URLs as duplicate URLs

URLs:

I use CrawlSpider with these rules:

rules = (
    Rule(LinkExtractor(restrict_css=('.resultspagenum'))),
    Rule(LinkExtractor(allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',)), callback='parse_product'),
)
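(The first rule has no callback, so CrawlSpider simply follows the pagination links matched inside '.resultspagenum' elements; the second rule extracts the product URLs and passes the responses to parse_product.)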

I do not understand this behavior. Can someone explain, please? The same code worked last week. I am using Scrapy version 1.3.0.
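For reference when debugging this kind of symptom: Scrapy's default RFPDupeFilter logs only the first request it drops. A minimal settings.py sketch (assuming the default dupefilter, nothing project-specific) makes every filtered request visible, which helps confirm exactly which URLs are being treated as duplicates:

# settings.py
# Log every request dropped by the duplicate filter, not just the first one.
DUPEFILTER_DEBUG = True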

1 answer

@paul trmbrth, you were right. Scrapy was deduplicating correctly after all; the real problem was that the 'a' links in the page's HTML did not carry plain URLs in their href, but a JavaScript call:

<a href='javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")'>

So I added a process_value callback to pull the real URL out of the JavaScript call:

import re

def _process_value(value):
    # Extract the real product URL from the javascript: href;
    # if the pattern does not match, None is returned and the link is ignored.
    m = re.search(r'javascript:gtm\.traceProductClick\("(.*?)"', value)
    if m:
        return m.group(1)


rules = (
    Rule(LinkExtractor(restrict_css=('.resultspagenum'))),
    Rule(LinkExtractor(
        allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',),
        process_value=_process_value
    ), callback='parse_product'),
)
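A quick standalone check of the callback, using the same href as in the page source above:

# Sanity check outside the spider: feed _process_value one raw href value.
href = 'javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")'
print(_process_value(href))
# -> /en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024

Returning None for non-matching values is the documented process_value contract: LinkExtractor simply ignores that link.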

Now the real URLs are extracted from the 'href' attribute of each 'a' tag, nothing is wrongly filtered as a duplicate, and the spider reaches parse_product again. Thanks again, @paul trmbrth.


Source: https://habr.com/ru/post/1679394/

