Scrapy filters unique URLs as duplicate URLs

URLs:

I use CrawlSpider with these rules:

rules = (
    Rule(LinkExtractor(restrict_css=('.resultspagenum'))),
    Rule(LinkExtractor(allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',)), callback='parse_product'),
)
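(The first rule has no callback, so CrawlSpider simply follows the pagination links matched inside '.resultspagenum' elements; the second rule extracts the product URLs and passes the responses to parse_product.)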

I do not understand this behavior. Can someone explain, please? The same code worked last week. I am using Scrapy version 1.3.0.
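For reference when debugging this kind of symptom: Scrapy's default RFPDupeFilter logs only the first request it drops. A minimal settings.py sketch (assuming the default dupefilter, nothing project-specific) makes every filtered request visible, which helps confirm exactly which URLs are being treated as duplicates:

# settings.py
# Log every request dropped by the duplicate filter, not just the first one.
DUPEFILTER_DEBUG = True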

1 answer

@paul trmbrth, you were right. Scrapy was deduplicating correctly after all; the real problem was that the 'a' links in the page's HTML did not carry plain URLs in their href, but a JavaScript call:

<a href='javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")'>

So I added a process_value callback to pull the real URL out of the JavaScript call:

import re

def _process_value(value):
    # Extract the real product URL from the javascript: href;
    # if the pattern does not match, None is returned and the link is ignored.
    m = re.search(r'javascript:gtm\.traceProductClick\("(.*?)"', value)
    if m:
        return m.group(1)


rules = (
    Rule(LinkExtractor(restrict_css=('.resultspagenum'))),
    Rule(LinkExtractor(
        allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',),
        process_value=_process_value
    ), callback='parse_product'),
)
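A quick standalone check of the callback, using the same href as in the page source above:

# Sanity check outside the spider: feed _process_value one raw href value.
href = 'javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")'
print(_process_value(href))
# -> /en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024

Returning None for non-matching values is the documented process_value contract: LinkExtractor simply ignores that link.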

Now the real URLs are extracted from the 'href' attribute of each 'a' tag, nothing is wrongly filtered as a duplicate, and the spider reaches parse_product again. Thanks again, @paul trmbrth.


Source: https://habr.com/ru/post/1679394/

