Scraping a website that requires scrolling down

I am trying to scrape this site here:

However, the page has to be scrolled down to load additional data, and I have no idea how to scroll down using BeautifulSoup or Python. Does anyone know how to do this?

The code is a bit of a mess, but here it is.

import datetime
import re

import scrapy
from bs4 import BeautifulSoup
from selenium import webdriver

# Python 2 import and print statements, matching the original code.
from HTMLParser import HTMLParser

from testtest.items import TesttestItem


class MLStripper(HTMLParser):
    # Body restored: a standard tag stripper that collects text nodes,
    # so that strip_tags() below actually works.
    def __init__(self):
        HTMLParser.__init__(self)
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


class MySpider(scrapy.Spider):
    name = "A1Locker"
    # allowed_domains takes bare domain names, not full URLs.
    allowed_domains = ['www.a1lockerrental.com']
    start_urls = ['http://www.a1lockerrental.com/self-storage/mo/st-louis/'
                  '4427-meramec-bottom-rd-facility/unit-sizes-prices'
                  '#/units?category=all']

    def parse(self, response):
        url = ('http://www.a1lockerrental.com/self-storage/mo/st-louis/'
               '4427-meramec-bottom-rd-facility/unit-sizes-prices'
               '#/units?category=Small')
        driver = webdriver.Firefox()
        driver.get(url)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        url2 = ('http://www.a1lockerrental.com/self-storage/mo/st-louis/'
                '4427-meramec-bottom-rd-facility/unit-sizes-prices'
                '#/units?category=Medium')
        driver2 = webdriver.Firefox()
        driver2.get(url2)
        html2 = driver2.page_source  # was driver.page_source: read from the wrong driver
        soup2 = BeautifulSoup(html2, 'html.parser')

        inside = "Indoor"
        outside = "Outdoor"
        inside_units = ["5 x 5", "5 x 10"]
        outside_units = ["10 x 15", "5 x 15", "8 x 10", "10 x 10",
                         "10 x 20", "10 x 25", "10 x 30"]

        sizeTagz = soup.findAll('span', {"class": "sss-unit-size"})
        sizeTagz2 = soup2.findAll('span', {"class": "sss-unit-size"})
        rateTagz = soup.findAll('p', {"class": "unit-special-offer"})
        specialTagz = soup.findAll('span', {"class": "unit-special-offer"})
        typesTagz = soup.findAll('div', {"class": "unit-info"})
        rateTagz2 = soup2.findAll('p', {"class": "unit-special-offer"})
        specialTagz2 = soup2.findAll('span', {"class": "unit-special-offer"})
        typesTagz2 = soup2.findAll('div', {"class": "unit-info"})

        yield {'date': datetime.datetime.now().strftime("%m-%d-%y"),
               'name': "A1Locker"}

        size = []
        for n in range(len(sizeTagz)):
            print len(rateTagz)
            print len(typesTagz)
            if "Outside" in typesTagz[n].get_text():
                size.append(re.findall(r'\d+', sizeTagz[n].get_text()))
                size.append(re.findall(r'\d+', sizeTagz2[n].get_text()))
                print "logic hit"

        for i in range(len(size)):
            yield {
                'size': size[i],
                # "special": specialTagz[n].get_text(),
                # "rate": re.findall(r'\d+', rateTagz[n].get_text()),
                # "types": "Outside",
            }

        driver.close()
        driver2.close()  # added: the second browser was never closed

The desired output is the data collected from this web page: http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all

To get all of it, the page has to be scrolled down so the rest of the data loads. At least, that is how I understand it.

Thanks DM123

2 answers

There is a webdriver function that provides exactly this. BeautifulSoup does nothing but parse the page source you hand it; it cannot interact with the page.

Check this out: http://webdriver.io/api/utility/scroll.html
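For Python Selenium specifically (which the question's code already uses), the usual approach is to execute JavaScript through the driver. Here is a minimal sketch; `scroll_to_bottom` and its parameters are illustrative names, and the loop simply repeats `window.scrollTo` until the page height stops growing:

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    # Scroll until document.body.scrollHeight stops growing, i.e. the
    # page has (probably) finished lazy-loading its content.
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page's JavaScript time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: assume nothing more will load
        last_height = new_height


# Usage with the question's setup (assumes Firefox/geckodriver installed):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get(url)
#   scroll_to_bottom(driver)
#   soup = BeautifulSoup(driver.page_source, 'html.parser')
```

The `pause` and `max_rounds` values are guesses; tune them to however long the site takes to load each batch of units.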


The website you are trying to scrape loads its content dynamically with JavaScript. Unfortunately, scraping libraries such as BeautifulSoup cannot execute JavaScript on their own. However, there are many options in the form of headless browsers. The classic one is PhantomJS, but take a look at this great list of options on GitHub; some of them, like Selenium, pair well with BeautifulSoup.
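To illustrate that division of labor, here is a sketch (with `extract_unit_sizes` as a made-up helper name, and the CSS class taken from the question's code): the browser renders the JavaScript, and BeautifulSoup only ever sees the final HTML string.

```python
from bs4 import BeautifulSoup


def extract_unit_sizes(rendered_html):
    # BeautifulSoup parses whatever HTML it is handed; it never runs JS,
    # so the HTML must already contain the dynamically loaded units.
    soup = BeautifulSoup(rendered_html, "html.parser")
    return [span.get_text(strip=True)
            for span in soup.find_all("span", class_="sss-unit-size")]


# A browser (headless or not) supplies the rendered source, e.g.:
#   driver = webdriver.Firefox()
#   driver.get(url)          # JavaScript executes inside the browser
#   sizes = extract_unit_sizes(driver.page_source)
```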

With Selenium in mind, the answer at fooobar.com/questions/1270757/... may also help.


Source: https://habr.com/ru/post/1270756/

