I am trying to crawl the pages of a website whose next-page links are generated in a rather strange way. Instead of the usual indexing, the successive page URLs look like this:
new/v2.php?cat=69&pnum=2&pnum=3
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4&pnum=5
and as a result my spider gets into a loop and never stops, scraping items from such pages:
DEBUG: Scraped from <200 http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=1&pnum=1&pnum=2&pnum=3>
and so on. Although the scraped items themselves are correct and match what I want, the crawler never finishes, revisiting the same pages again and again.
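The reason the loop is never caught is that Scrapy's default duplicate filter fingerprints the full URL, so each request with a longer query string looks like a brand-new page. Collapsing the repeated pnum parameters to a single value makes all the variants identical again. A minimal sketch using only the standard library (urllib.parse is the Python 3 spelling of the urlparse module imported below):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url):
    """Collapse repeated query parameters, keeping the last value of
    each, so every variant of a page maps to one stable URL."""
    parts = urlsplit(url)
    query = urlencode(dict(parse_qsl(parts.query)))  # later duplicates win
    return urlunsplit(parts._replace(query=query))
```

For example, `canonicalize("http://mymobile.ge/new/v2.php?cat=69&pnum=2&pnum=3&pnum=4")` yields `http://mymobile.ge/new/v2.php?cat=69&pnum=4`, which the duplicate filter can then recognize on a revisit.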
My crawler looks like this:
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from mymobile.items import MymobileItem

class MmobySpider(CrawlSpider):
    name = "mmoby2"
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    rules = (Rule(SgmlLinkExtractor(allow=(r"new/v2\.php\?cat=69&pnum=\d*",)),
                  callback="parse_items", follow=True),)

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re(r'^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re(r'\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re(r'^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])
            items.append(item)
        return items
Any suggestion on how I can tame it?
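One way to tame it, as a sketch rather than a tested fix: Scrapy's Rule accepts a process_links callable that can rewrite every extracted link before it is scheduled. If each link's repeated pnum=... parameters are collapsed to the last one, the default duplicate filter will drop the revisits on its own. Assuming a spider like the one above:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def dedupe_links(links):
    """process_links hook for a CrawlSpider Rule: rewrite each
    extracted link so repeated query parameters (here pnum=...)
    collapse to their last value."""
    for link in links:
        parts = urlsplit(link.url)
        query = urlencode(dict(parse_qsl(parts.query)))  # last duplicate wins
        link.url = urlunsplit(parts._replace(query=query))
    return links
```

The hook would then be wired in as `Rule(SgmlLinkExtractor(...), callback="parse_items", follow=True, process_links=dedupe_links)`; the function name dedupe_links is my own, not part of Scrapy.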