How to create a href-based LinkExtractor rule in Scrapy

Question

I am trying to create a simple crawler with Scrapy (scrapy.org), based on the item.php example shown below. How can I write a rule that accepts only URLs that start with http://example.com/category/ and whose GET query string contains a page parameter with a value of one or more digits, possibly alongside other parameters? The order of the parameters is arbitrary.

Several valid values:

Below is the code:

    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor


    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/category/']

        rules = (
            Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
        )

        def parse_item(self, response):
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item

+5

python regex web-scraping scrapy

Sandesh Dec 6 '14 at 11:11

2 answers

Try this:

    import re

    # r'' instead of ur'': the ur prefix is a syntax error in Python 3
    p = re.compile(r'<[^>]+href="((http:\/\/example.com\/category\/)([^"]+))"', re.MULTILINE)
    test_str = (
        u'<a class="youarehere" href="http://example.com/category/?sort=newest">newest</a>\n'
        u' \n'
        u'<a href="http://example.com/category/?sot=frequent">frequent</a>'
    )
    p.findall(test_str)

-2

Ahosan Karim Asik Dec 6 '14 at 11:24

alecxe · Accepted Answer · 2014-12-07T00:56:17+0000

Check for http://example.com/category/ at the start of the URL and for a page parameter whose value is one or more digits:

    Rule(LinkExtractor(allow=(r'^http://example.com/category/\?.*?(?=page=\d+)', )), callback='parse_item'),
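For comparison, the same condition can be stated by parsing the query string instead of matching the whole URL. This is a minimal sketch and not part of the original answer: the helper name `has_numeric_page` is hypothetical, and it assumes Python 3's `urllib.parse`.

```python
from urllib.parse import urlsplit, parse_qs

PREFIX = 'http://example.com/category/'  # prefix taken from the question


def has_numeric_page(url):
    """Hypothetical helper: True if url starts with PREFIX and carries
    a page parameter whose value consists only of digits."""
    if not url.startswith(PREFIX):
        return False
    # parse_qs maps each parameter name to a list of its values,
    # so the order of parameters in the URL does not matter.
    params = parse_qs(urlsplit(url).query)
    return any(value.isdigit() for value in params.get('page', []))
```

A predicate like this could probably be applied to extracted links through the `process_links` argument of `Rule`, although the regex passed to `allow` is usually enough for this case.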

Demo (using your example URLs):

    >>> import re
    >>> pattern = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')
    >>> should_match = [
    ...     'http://example.com/category/?sort=az&page=1',
    ...     'http://example.com/category/?page=1&sort=az&cache=1',
    ...     'http://example.com/category/?page=1&sort=az#'
    ... ]
    >>> for url in should_match:
    ...     print("Matches" if pattern.search(url) else "Doesn't match")
    ...
    Matches
    Matches
    Matches
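The demo only covers URLs that should match. The sketch below also tries a few URLs that arguably should not, and compares the answer's pattern with a stricter variant. The `strict` pattern and the extra test URLs are assumptions added for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above.
loose = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')

# Hypothetical stricter variant (an assumption, not from the answer):
# dots are escaped, `page` must be a whole parameter name, and its
# value must consist of digits only.
strict = re.compile(
    r'^http://example\.com/category/\?(?:[^#]*&)?page=\d+(?:[&#]|$)'
)

urls = [
    'http://example.com/category/?sort=az&page=1',  # intended match
    'http://example.com/category/?sort=az',         # no page parameter
    'http://example.com/category/?mypage=1',        # different parameter name
    'http://example.com/category/?page=1abc',       # value is not all digits
]

# Print each URL with the result of the loose and the strict pattern.
for url in urls:
    print(url, bool(loose.search(url)), bool(strict.search(url)))
```

Because the lookahead in the original pattern only requires the text `page=` followed by a digit somewhere in the query, it also accepts the last two URLs. Whether that matters depends on the links the site actually produces.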
