You can also solve it with ScrapyJS (no need for selenium and a real browser):
This library provides Scrapy + JavaScript integration using Splash.
Follow the installation instructions for Splash and ScrapyJS, then launch the Splash docker container:
$ docker run -p 8050:8050 scrapinghub/splash
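Before wiring Splash into Scrapy, you can check that the container is up and actually executes JavaScript by hitting its render.html endpoint directly (a quick sketch using requests; localhost:8050 assumes the port mapping from the docker command above, substitute your docker host IP if it differs):

import requests

# Ask Splash to render the target page and wait 0.5s for the JavaScript to run
response = requests.get(
    'http://localhost:8050/render.html',
    params={
        'url': 'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
        'wait': 0.5,
    },
)
print(response.status_code)  # 200 means Splash rendered the page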
Put the following settings in settings.py :
SPLASH_URL = 'http://192.168.59.103:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
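Note that SPLASH_URL has to point at wherever the Splash container is reachable; the IP above is a typical boot2docker address. If the docker command runs directly on your machine, it would usually be:

SPLASH_URL = 'http://localhost:8050'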
And here is your sample spider that can see size availability information:
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["koovs.com"]
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            print option.xpath("text()").extract()
Here is what is printed on the console:
[u' / 34 -- Not Available']
[u'L / 40 -- Not Available']
[u'L / 42']
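If you only need the sizes that are in stock, the same selector can be filtered on the "Not Available" suffix seen above (a sketch of an alternative parse(), not part of the original spider):

    def parse(self, response):
        # Keep only the size options that are not marked "Not Available"
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            text = option.xpath("text()").extract()[0]
            if 'Not Available' not in text:
                print(text)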
alecxe May 21 '15 at 15:56