Test for http://example.com/category/ at the beginning of the URL, followed by a query string that contains a page parameter whose value is one or more digits:
Rule(LinkExtractor(allow=(r'^http://example.com/category/\?.*?(?=page=\d+)', )), callback='parse_item'),
Demo (using your example URLs):
>>> import re
>>> pattern = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')
>>> should_match = [
...     'http://example.com/category/?sort=az&page=1',
...     'http://example.com/category/?page=1&sort=az&cache=1',
...     'http://example.com/category/?page=1&sort=az#'
... ]
>>> for url in should_match:
...     print("Matches" if pattern.search(url) else "Doesn't match")
...
Matches
Matches
Matches
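It may also help to check a few URLs that should be rejected. The sketch below reuses the same pattern. The negative URLs, the `is_category_page` helper and the `STRICT_PATTERN` variant are made up for illustration and are not part of your spider.

```python
import re

# Same pattern as in the rule above.
PATTERN = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')

# Hypothetical stricter variant (an assumption, not required by the question):
# escapes the dot and requires "page" to start a parameter name, so a
# parameter such as "homepage=2" is not mistaken for "page=2".
STRICT_PATTERN = re.compile(r'^http://example\.com/category/\?(?:.*&)?page=\d+')


def is_category_page(url, pattern=PATTERN):
    """Return True if the URL is a category URL with a numeric page parameter."""
    return pattern.search(url) is not None


# Made-up URLs that the rule should not follow.
should_not_match = [
    'http://example.com/category/',           # no query string
    'http://example.com/category/?sort=az',   # no page parameter
    'http://example.com/category/?page=abc',  # page value is not numeric
    'http://example.com/other/?page=1',       # different path
]

for url in should_not_match:
    print(url, is_category_page(url))
```

One caveat: because the lookahead only searches for the text `page=` followed by digits, the original pattern also accepts a URL such as `http://example.com/category/?homepage=2`. If that matters for your site, something like `STRICT_PATTERN` could be passed to `allow` instead.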