Scraping dynamic content using python-Scrapy

Disclaimer: I have seen many other similar questions on StackOverflow and tried to solve it the same way, but those solutions don't seem to work on this website.

I am using Python-Scrapy to retrieve data from koovs.com.

However, I cannot get the product size, which is generated dynamically. Specifically, if someone could guide me a bit on getting the "Not available" sizes from the drop-down menu on this link, I would be grateful.

I can get the size list statically, but in doing so I only get the list of sizes, not which of them are available.

+28
python web-scraping scrapy
May 20 '15 at 9:27
4 answers

You can also solve it with ScrapyJS (no need for selenium or a real browser):

This library provides Scrapy + JavaScript integration using Splash.

Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

 $ docker run -p 8050:8050 scrapinghub/splash 

Put the following settings in settings.py :

    SPLASH_URL = 'http://192.168.59.103:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapyjs.SplashMiddleware': 725,
    }

    DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

And here is your sample spider that can see size availability information:

    # -*- coding: utf-8 -*-
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["koovs.com"]
        start_urls = (
            'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
        )

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, self.parse, meta={
                    'splash': {
                        'endpoint': 'render.html',
                        'args': {'wait': 0.5}
                    }
                })

        def parse(self, response):
            for option in response.css("div.select-size select.sizeOptions option")[1:]:
                print(option.xpath("text()").extract())

Here is what is printed on the console:

    [u'S / 34 -- Not Available']
    [u'L / 40 -- Not Available']
    [u'L / 42']
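
If you only need the sizes that are actually in stock, the same parse() callback can filter out the "Not Available" entries. A minimal sketch, assuming the option text keeps the format shown above:

    def parse(self, response):
        # skip the placeholder first option and keep only in-stock sizes
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            text = option.xpath("text()").extract_first()
            if text and "Not Available" not in text:
                yield {"size": text.strip()}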
+37
May 21 '15 at 15:56

From what I understand, size availability is determined dynamically by JavaScript executed in the browser. Scrapy is not a browser and cannot execute JavaScript.
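
You can check this yourself: fetch the page without a browser and the availability markers are missing. A quick sanity check in scrapy shell (the selector is borrowed from the other answers on this page; the exact output is an assumption):

    $ scrapy shell 'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376'
    >>> response.css("div.select-size select.sizeOptions option").xpath("text()").extract()
    # the raw HTML lists the sizes, but without the "-- Not Available"
    # suffixes, because those are added later by JavaScript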

If you are okay with switching to the selenium browser automation tool, here is sample code:

    from selenium import webdriver
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    browser = webdriver.Firefox()  # can be webdriver.PhantomJS()
    browser.get('http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376')

    # wait for the select element to become visible
    select_element = WebDriverWait(browser, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "div.select-size select.sizeOptions")))

    select = Select(select_element)
    for option in select.options[1:]:
        print(option.text)

    browser.quit()

It prints:

    S / 34 -- Not Available
    L / 40 -- Not Available
    L / 42

Note that instead of Firefox you can use other web browsers such as Chrome or Safari. There is also the option of using the PhantomJS headless browser.
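
For example, switching the script above to a headless run is a one-line change (a sketch; PhantomJS, or the driver for the browser you pick, must be installed and on your PATH):

    from selenium import webdriver

    # headless: no browser window is opened
    browser = webdriver.PhantomJS()
    # or a different real browser (requires chromedriver on the PATH):
    # browser = webdriver.Chrome()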

You can also combine Scrapy with selenium if necessary; one way to wire them together is sketched below.
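
A minimal sketch of that combination (the spider name is made up; the URL and selectors are taken from this answer): Scrapy schedules the requests, and a shared webdriver re-renders each page so the JavaScript runs:

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC


    class SizeSpider(scrapy.Spider):
        name = "sizes"
        start_urls = [
            'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
        ]

        def __init__(self, *args, **kwargs):
            super(SizeSpider, self).__init__(*args, **kwargs)
            self.browser = webdriver.Firefox()

        def parse(self, response):
            # re-open the page in the real browser so JavaScript executes
            self.browser.get(response.url)
            select_element = WebDriverWait(self.browser, 10).until(
                EC.visibility_of_element_located(
                    (By.CSS_SELECTOR, "div.select-size select.sizeOptions")))
            for option in Select(select_element).options[1:]:
                yield {"size": option.text}

        def closed(self, reason):
            # called by Scrapy when the spider finishes
            self.browser.quit()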

+6
May 21 '15 at

I ran into this problem and solved it easily by following these steps:

pip install splash
pip install scrapyjs

download and install docker-toolbox

open the Docker Quickstart Terminal and enter:

 $ docker run -p 8050:8050 scrapinghub/splash 

To set SPLASH_URL, check the default IP address configured on the docker machine by entering $ docker-machine ip default (my IP address was 192.168.99.100)

    SPLASH_URL = 'http://192.168.99.100:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapyjs.SplashMiddleware': 725,
    }

    DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

That's it!
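
Before running a spider, you can sanity-check that Splash is reachable (a sketch using Splash's render.html endpoint, which the first answer also uses; adjust the IP to your docker-machine address):

    import requests

    # HTTP 200 with rendered HTML in resp.text means Splash is up
    resp = requests.get('http://192.168.99.100:8050/render.html',
                        params={'url': 'http://example.com', 'wait': 0.5})
    print(resp.status_code)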

+1
Jun 05 '17 at 14:11

You have to interpret the JSON the website returns; see the examples at scrapy.readthedocs and testingcan.imtqy.com:

    import scrapy
    import json


    class QuoteSpider(scrapy.Spider):
        name = 'quote'
        allowed_domains = ['quotes.toscrape.com']
        page = 1
        start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

        def parse(self, response):
            data = json.loads(response.text)
            for quote in data["quotes"]:
                yield {"quote": quote["text"]}
            if data["has_next"]:
                self.page += 1
                url = "http://quotes.toscrape.com/api/quotes?page={}".format(self.page)
                yield scrapy.Request(url=url, callback=self.parse)
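
To find such a JSON endpoint in the first place, open the browser's developer tools, watch the XHR requests in the Network tab while the page loads, and copy the request URL that returns the data as JSON.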
0
May 25 '19 at 18:38


