Scraping dynamic content using python-Scrapy

Disclaimer: I have seen many other similar questions on StackOverflow and tried to solve it the same way, but those solutions don't seem to work on this website.

I am using Python-Scrapy to retrieve data from koovs.com.

However, I cannot get the product size, which is generated dynamically. Specifically, if someone could guide me a bit on getting the "Not available" sizes from the drop-down menu on this link, I would be grateful.

I can get the size list statically, but in doing so I only get the list of sizes, not which of them are available.

+28
python web-scraping scrapy
May 20 '15 at 9:27
4 answers

You can also solve it with ScrapyJS (no need for selenium or a real browser):

This library provides Scrapy + JavaScript integration using Splash.

Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

 $ docker run -p 8050:8050 scrapinghub/splash 

Put the following settings in settings.py :

    SPLASH_URL = 'http://192.168.59.103:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapyjs.SplashMiddleware': 725,
    }

    DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

And here is your sample spider that can see size availability information:

    # -*- coding: utf-8 -*-
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["koovs.com"]
        start_urls = (
            'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
        )

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, self.parse, meta={
                    'splash': {
                        'endpoint': 'render.html',
                        'args': {'wait': 0.5}
                    }
                })

        def parse(self, response):
            for option in response.css("div.select-size select.sizeOptions option")[1:]:
                print(option.xpath("text()").extract())

Here is what is printed on the console:

    [u'S / 34 -- Not Available']
    [u'L / 40 -- Not Available']
    [u'L / 42']
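
If you only need the sizes that are actually in stock, the same parse() callback can filter out the "Not Available" entries. A minimal sketch, assuming the option text keeps the format shown above:

    def parse(self, response):
        # skip the placeholder first option and keep only in-stock sizes
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            text = option.xpath("text()").extract_first()
            if text and "Not Available" not in text:
                yield {"size": text.strip()}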
+37
May 21 '15 at 15:56

From what I understand, size availability is determined dynamically by JavaScript executed in the browser. Scrapy is not a browser and cannot execute JavaScript.
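
You can check this yourself: fetch the page without a browser and the availability markers are missing. A quick sanity check in scrapy shell (the selector is borrowed from the other answers on this page; the exact output is an assumption):

    $ scrapy shell 'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376'
    >>> response.css("div.select-size select.sizeOptions option").xpath("text()").extract()
    # the raw HTML lists the sizes, but without the "-- Not Available"
    # suffixes, because those are added later by JavaScript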

If you are okay with switching to the selenium browser automation tool, here is sample code:

    from selenium import webdriver
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    browser = webdriver.Firefox()  # can be webdriver.PhantomJS()
    browser.get('http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376')

    # wait for the select element to become visible
    select_element = WebDriverWait(browser, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "div.select-size select.sizeOptions")))

    select = Select(select_element)
    for option in select.options[1:]:
        print(option.text)

    browser.quit()

It prints:

    S / 34 -- Not Available
    L / 40 -- Not Available
    L / 42

Note that instead of Firefox you can use other web browsers such as Chrome or Safari. There is also the option of using the PhantomJS headless browser.
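
For example, switching the script above to a headless run is a one-line change (a sketch; PhantomJS, or the driver for the browser you pick, must be installed and on your PATH):

    from selenium import webdriver

    # headless: no browser window is opened
    browser = webdriver.PhantomJS()
    # or a different real browser (requires chromedriver on the PATH):
    # browser = webdriver.Chrome()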

You can also combine Scrapy with selenium if necessary; one way to wire them together is sketched below.
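
A minimal sketch of that combination (the spider name is made up; the URL and selectors are taken from this answer): Scrapy schedules the requests, and a shared webdriver re-renders each page so the JavaScript runs:

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC


    class SizeSpider(scrapy.Spider):
        name = "sizes"
        start_urls = [
            'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
        ]

        def __init__(self, *args, **kwargs):
            super(SizeSpider, self).__init__(*args, **kwargs)
            self.browser = webdriver.Firefox()

        def parse(self, response):
            # re-open the page in the real browser so JavaScript executes
            self.browser.get(response.url)
            select_element = WebDriverWait(self.browser, 10).until(
                EC.visibility_of_element_located(
                    (By.CSS_SELECTOR, "div.select-size select.sizeOptions")))
            for option in Select(select_element).options[1:]:
                yield {"size": option.text}

        def closed(self, reason):
            # called by Scrapy when the spider finishes
            self.browser.quit()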

+6
May 21 '15 at

I ran into this problem and solved it easily by following these steps:

pip install splash
pip install scrapyjs

download and install docker-toolbox

open the Docker Quickstart Terminal and enter:

 $ docker run -p 8050:8050 scrapinghub/splash 

To set SPLASH_URL, check the default IP address configured on the docker machine by entering $ docker-machine ip default (my IP address was 192.168.99.100)

    SPLASH_URL = 'http://192.168.99.100:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapyjs.SplashMiddleware': 725,
    }

    DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

That's it!
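
Before running a spider, you can sanity-check that Splash is reachable (a sketch using Splash's render.html endpoint, which the first answer also uses; adjust the IP to your docker-machine address):

    import requests

    # HTTP 200 with rendered HTML in resp.text means Splash is up
    resp = requests.get('http://192.168.99.100:8050/render.html',
                        params={'url': 'http://example.com', 'wait': 0.5})
    print(resp.status_code)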

+1
Jun 05 '17 at 14:11

You have to interpret the JSON the website returns; see the examples at scrapy.readthedocs and testingcan.imtqy.com:

    import scrapy
    import json


    class QuoteSpider(scrapy.Spider):
        name = 'quote'
        allowed_domains = ['quotes.toscrape.com']
        page = 1
        start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

        def parse(self, response):
            data = json.loads(response.text)
            for quote in data["quotes"]:
                yield {"quote": quote["text"]}
            if data["has_next"]:
                self.page += 1
                url = "http://quotes.toscrape.com/api/quotes?page={}".format(self.page)
                yield scrapy.Request(url=url, callback=self.parse)
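
To find such a JSON endpoint in the first place, open the browser's developer tools, watch the XHR requests in the Network tab while the page loads, and copy the request URL that returns the data as JSON.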
0
May 25 '19 at 18:38


