Scrape a site using Scrapy

Question

I am trying to scrape a website using Scrapy, but I have a problem scraping all the products from this website because it uses endless scrolling.

I can get data for only 52 items, but there are 3824 of them.

    hxs.select("//span[@class='itm-Catbrand strong']").extract()
    hxs.select("//span[@class='itm-price ']").extract()
    hxs.select("//span[@class='itm-title']").extract()

If I use hxs.select("//div[@id='content']/div/div/div").extract(), it extracts the entire list of elements, but I cannot filter it any further. How do I scrape all the items?

I tried the following, but I get the same result. Please tell me where I am going wrong.

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        for n in [2,3,4,5,6]:
            req = Request(url="http://www.jabong.com/men/shoes/?page=" + n,
                          headers = {"Referer": "http://www.jabong.com/men/shoes/",
                                     "X-Requested-With": response.header['X-Requested-With']})
            return req

python html scrapy

vaibhav jain May 15, '13 at 8:12

1 answer

Xion345 · Accepted Answer · 2013-05-15T08:57:21+0000

As you might have guessed, this site uses JavaScript to load more elements as you scroll through a page.

Using the developer tools included in my browser (Ctrl+Shift+I in Chromium), I saw on the "Network" tab that the script included in the page performs the following requests to load more elements:

 GET http://www.website-your-are-crawling.com/men/shoes/?page=2 # 2,3,4,5,6 etc...

The web server responds with documents of the following type:

    <li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
      <div id="qa-quick-view-btn" class="quickviewZoom itm-quickview ui-buttonQuickview l-absolute pos-t" title="Quick View" data-url ="phosphorus-Black-Moccasins-233629.html" data-sku="PH969SH70HPTINDFAS" onClick="_gaq.push(['_trackEvent', 'BadgeQV','Shown','OFFER INSIDE']);">Quick view</div>
      <div class="itm-qlInsert tooltip-qlist highlightStar" onclick="javascript:Rocket.QuickList.insert('PH969SH70HPTINDFAS', 'catalog'); return false;" >
        <div class="starHrMsg">
          <span class="starHrMsgArrow">&nbsp;</span>
          Save for later
        </div>
      </div>
      <a id='cat_105_PH969SH70HPTINDFAS' class="itm-link sobrTxt" href="/phosphorus-Black-Moccasins-233629.html" onclick="fireGaq('_trackEvent', 'Catalog to PDP', 'men--Shoes--Moccasins', 'PH969SH70HPTINDFAS--1699.00--', this),fireGaq('_trackEvent', 'BadgePDP','Shown','OFFER INSIDE', this);">
        <span class="lazyImage">
          <span style="width:176px;height:255px;" class="itm-imageWrapper itm-imageWrapper-PH969SH70HPTINDFAS" id="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" itm-img-width="176" itm-img-height="255" itm-img-sprites="4">
            <noscript><img src="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" width="176" height="255" class="itm-img"></noscript>
          </span>
        </span>
        <span class="itm-budgeFlag offInside"><span class="flagBrdLeft"></span>OFFER INSIDE</span>
        <span class="itm-Catbrand strong">Phosphorus</span>
        <span class="itm-title"> Black Moccasins </span>

These documents contain more elements.
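To make the shape of these fragments concrete, here is a minimal sketch that pulls the brand and title out of one of them. It deliberately uses only the standard library's html.parser so it can be run without Scrapy; inside a spider you would use the XPath expressions from the question instead. The names ProductSpanParser, WANTED and SAMPLE are made up for this illustration, and SAMPLE is a trimmed-down version of the fragment above.

```python
from html.parser import HTMLParser

# Span classes taken from the XPath expressions in the question.
WANTED = {"itm-Catbrand strong": "brand", "itm-title": "title"}


class ProductSpanParser(HTMLParser):
    """Collects the brand and title text of each product <li>."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per <li class="itm ..."> seen
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class") or ""
        if tag == "li" and "itm" in css_class.split():
            self.products.append({})
        elif tag == "span" and self.products:
            self._field = WANTED.get(css_class.strip())

    def handle_data(self, data):
        if self._field and data.strip():
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


# Trimmed-down sample of the fragment returned by the server.
SAMPLE = """
<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
  <span class="itm-Catbrand strong">Phosphorus</span>
  <span class="itm-title"> Black Moccasins </span>
</li>
"""

parser = ProductSpanParser()
parser.feed(SAMPLE)
print(parser.products)
```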

So, in order to get the complete list of elements, you will have to return Request objects from your Spider's parse method (see the Spider class documentation) to tell Scrapy that it should load more data:

    import re
    from scrapy.http import Request

    def parse(self, response):
        # ... Extract items in the page using extractors

        # n is the number of the next "page" to parse. You can get it from
        # response.url by extracting the number at the end and adding 1.
        # Assumption: the first page URL has no page number, so the next one is 2.
        match = re.search(r"page=(\d+)", response.url)
        n = int(match.group(1)) + 1 if match else 2

        # It is VERY IMPORTANT to set the Referer and X-Requested-With headers
        # here because that is how the website detects whether the request was
        # made by JavaScript or directly by following a link.
        req = Request(url="http://www.website-your-are-crawling.com/men/shoes/?page=" + str(n),
                      headers={"Referer": "http://www.website-your-are-crawling.com/men/shoes/",
                               "X-Requested-With": "XMLHttpRequest"})
        return req  # and your items
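The parse method above always asks for one more page, so the crawl needs some stopping rule. One possible rule, sketched below, is to work out the number of pages from the totals in the question. This assumes every page returns 52 items like the first one did, which the site does not guarantee; stopping when a page comes back with no items would be an alternative. BASE_URL, page_count, page_url and should_request are hypothetical names for this illustration.

```python
import math
from urllib.parse import urlencode

# Placeholder listing URL used in the answer.
BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"


def page_count(total_items, items_per_page):
    """Number of pages needed to cover total_items."""
    return math.ceil(total_items / items_per_page)


def page_url(n):
    """URL of page n, in the ?page=N form seen in the Network tab."""
    return BASE_URL + "?" + urlencode({"page": n})


def should_request(n, total_items=3824, items_per_page=52):
    """True while page n is still within the expected range."""
    return n <= page_count(total_items, items_per_page)


print(page_count(3824, 52))
print(page_url(2))
```

In the spider, the Request for page n would then only be returned when should_request(n) is true.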

Oh, and by the way (in case you want to test), you can't just open http://www.website-your-are-crawling.com/men/shoes/?page=2 in your browser to see what it returns, because the website will redirect you to the main listing page (i.e. http://www.website-your-are-crawling.com/men/shoes/) if the X-Requested-With header is anything other than XMLHttpRequest.
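If you want to look at one of these responses outside of Scrapy, the request has to carry the same two headers. A minimal sketch using the standard library's urllib follows. LISTING_URL is the placeholder address used in this answer, so fetch_page will not work until you substitute the real site, and build_ajax_request and fetch_page are names invented for this example.

```python
from urllib.request import Request, urlopen

# Placeholder listing URL used in the answer; replace with the real site.
LISTING_URL = "http://www.website-your-are-crawling.com/men/shoes/"


def build_ajax_request(page):
    """Build a request that looks like the one the page's script sends."""
    return Request(
        url=f"{LISTING_URL}?page={page}",
        headers={
            "Referer": LISTING_URL,
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def fetch_page(page, timeout=10):
    """Send the request and return the HTML fragment as text."""
    with urlopen(build_ajax_request(page), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


req = build_ajax_request(2)
print(req.full_url)
print(req.header_items())
```

Note that urllib normalises header names when it stores them, so header_items() shows the second header as "X-requested-with"; HTTP header names are case-insensitive, so this should not matter to the server.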
