Python 3.5 | Scraping data from a website

I want to scrape a specific part of the Kickstarter.com website.

I need the project title strings. The site is structured so that each project has this line:

<div class="Project-title">

My code looks like this:

#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup

#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
thepage = urllib.request.urlopen(theurl)

#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")

#Scraping "Project Title" (project-title)
project_title = soup.find('h6', {'class': 'project-title'}).findChildren('a')
title = project_title[0].text
print (title)

If I use soup.find_all, or use an index other than zero in project_title[0], Python raises an error.

I need a list of all the project titles on this page. For example:

  • The Superbook: Turn your smartphone into a laptop for $99
  • Weighitz: Weigh Smarter
  • Minsk Kafon Drone World First and Only Complete
  • Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux

find() returns only a single element. To get all of them, use findAll:

project_elements = soup.findAll('h6', {'class': 'project-title'})
project_titles = [project.findChildren('a')[0].text for project in project_elements]
print(project_titles)

This collects every h6 element with the class project-title and takes the text of the first a element inside each one.
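To make the difference concrete, here is a minimal sketch run against a small inline HTML string instead of the live page. The markup and the /p/1, /p/2 links are made up for illustration; only the h6 class name comes from the question. It uses find_all, the current spelling of findAll. The last part shows what is probably behind the error in the question: a method that exists on a single element is being called on the whole result list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<h6 class="project-title"><a href="/p/1">First project</a></h6>
<h6 class="project-title"><a href="/p/2">Second project</a></h6>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag (or None if nothing matches).
first = soup.find("h6", {"class": "project-title"})
first_title = first.find("a").text

# find_all() returns a list-like ResultSet of every matching Tag.
every = soup.find_all("h6", {"class": "project-title"})
all_titles = [h6.find("a").text for h6 in every]

# Calling a Tag method on the ResultSet itself fails; loop over it instead.
try:
    every.find_all("a")
    resultset_error = None
except AttributeError as exc:
    resultset_error = type(exc).__name__

print(first_title)
print(all_titles)
print(resultset_error)
```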


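Edit: if some of the h6 elements returned by findAll have no a element inside, indexing with [0] fails on them. In that case, filter them out: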

project_titles = [project.findChildren('a')[0].text for project in project_elements if project.findChildren('a')]

An element is kept only if project.findChildren('a') is non-empty (if [] evaluates to False).
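A small sketch of the filter on made-up markup, where one h6 has no link inside. It uses find("a") instead of findChildren('a')[0]: find returns None when nothing matches, and None is falsy, so the same condition works. The HTML here is an assumption for illustration, not Kickstarter's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second h6 has no <a> child.
html = """
<h6 class="project-title"><a href="/p/1">Linked title</a></h6>
<h6 class="project-title">Title without a link</h6>
"""
soup = BeautifulSoup(html, "html.parser")
project_elements = soup.find_all("h6", {"class": "project-title"})

# Without the condition, h6.find("a").text would raise AttributeError
# on the second element, because find() returns None there.
project_titles = [
    h6.find("a").text
    for h6 in project_elements
    if h6.find("a")  # skip h6 elements that contain no <a>
]
print(project_titles)
```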

Edit: to get the project description (class project-blurb), look at the HTML:

<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>

The description sits in a p element with the class project-blurb. It can be collected the same way as project_elements:

project_desc = [description.text for description in soup.findAll('p', {'class': 'project-blurb'})]
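If the titles and descriptions are needed together, one option is to pair the two lists with zip. This is only a sketch on made-up markup, and it assumes every project has exactly one title and one blurb in the same order; if that does not hold on the real page, the pairs will be misaligned. get_text(strip=True) is used because the blurb text is surrounded by newlines in the HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names come from the thread.
html = """
<div><h6 class="project-title"><a href="/p/1">Bagel</a></h6>
<p class="project-blurb">
A digital tape measure.
</p></div>
<div><h6 class="project-title"><a href="/p/2">Omega2</a></h6>
<p class="project-blurb">
A small Linux computer.
</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = [
    h6.a.get_text(strip=True)
    for h6 in soup.find_all("h6", {"class": "project-title"})
    if h6.a  # skip titles without a link
]
blurbs = [
    p.get_text(strip=True)  # strip the newlines around the blurb text
    for p in soup.find_all("p", {"class": "project-blurb"})
]

# Pair each title with its description by position.
projects = list(zip(titles, blurbs))
for title, blurb in projects:
    print(title, "->", blurb)
```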

See also the pyimagesearch material on scrapy.


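You can use a css selector to pick the a element inside each h6 with the class project-title in the staff-picks section: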

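soup = BeautifulSoup(thepage, "html.parser")

# select() takes a css selector; .text extracts each anchor's text
print([a.text for a in soup.select("section.staff-picks h6.project-title a")])

Which gives (output captured under Python 2, hence the u prefixes):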

[u'The Superbook: Turn your smartphone into a laptop for $99', u'Weighitz: Weigh Smarter', u'Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux', u"Bagel: The World Smartest Tape Measure", u'FireFlies - Truly Wire-Free Earbuds - Music Without Limits!', u'ISOLATE\xae - Switch off your ears!']

Or with find and find_all:

project_titles = soup.find("section",class_="staff-picks").find_all("h6", "project-title")
print([proj.a.text for proj in project_titles])
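As a quick check that the two approaches select the same elements, here is a sketch on made-up markup with two sections. The section named "other" and all link targets are hypothetical; only the staff-picks and project-title class names come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with titles inside and outside the staff-picks section.
html = """
<section class="staff-picks">
  <h6 class="project-title"><a href="/p/1">Staff pick one</a></h6>
  <h6 class="project-title"><a href="/p/2">Staff pick two</a></h6>
</section>
<section class="other">
  <h6 class="project-title"><a href="/p/3">Not a staff pick</a></h6>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# css selector: anchors inside h6.project-title inside section.staff-picks
via_select = [a.text for a in soup.select("section.staff-picks h6.project-title a")]

# The same restriction spelled out with find and find_all
section = soup.find("section", class_="staff-picks")
via_find = [h6.a.text for h6 in section.find_all("h6", "project-title")]

# Searching the whole document also picks up the other section
unrestricted = [h6.a.text for h6 in soup.find_all("h6", "project-title")]

print(via_select)
print(via_find)
print(unrestricted)
```

Which form to use is mostly a matter of taste: the selector is shorter, while the find and find_all chain makes each step explicit.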

Both versions restrict the search to the staff-picks section before looking for the h6 elements.


Source: https://habr.com/ru/post/1649017/

