Web scraping with Python 3?

I am trying to learn Python 3.x so that I can scrape websites. People have recommended Beautiful Soup 4 or lxml.html. Can someone point me in the right direction with a tutorial or examples for BeautifulSoup with Python 3.x?

Thank you for your help.

+4
1 answer

I actually just wrote a complete web scraping guide that includes sample code in Python. I wrote and tested it on Python 2.7, but both of the packages I used (requests and BeautifulSoup) are fully compatible with Python 3 according to the Wall of Shame.

Here is some code to get you started with web scraping in Python:

    import sys

    import requests
    # Beautiful Soup 4 is imported from bs4; the older "BeautifulSoup" module (version 3) does not run on Python 3
    from bs4 import BeautifulSoup


    def scrape_google(keyword):
        # dynamically build the URL that we'll be making a request to
        url = "http://www.google.com/search?q={term}".format(
            term=keyword.strip().replace(" ", "+"),
        )

        # spoof some headers so the request appears to be coming from a browser, not a bot
        headers = {
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
            "accept-encoding": "gzip,deflate",
            "accept-language": "en-US,en;q=0.8",
        }

        # make the request to the search URL, passing in the spoofed headers
        r = requests.get(url, headers=headers)  # assign the response to a variable r

        # check the status code of the response to make sure the request went well
        if r.status_code != 200:
            print("request denied")
            return
        else:
            print("scraping " + url)

        # convert the plaintext HTML markup into a DOM-like structure that we can search
        soup = BeautifulSoup(r.text, "html.parser")

        # each result is an <li> element with class="g"; this is our wrapper
        results = soup.find_all("li", "g")

        # iterate over each of the result wrapper elements
        for result in results:
            # the main link is an <h3> element with class="r"
            heading = result.find("h3", "r")
            result_anchor = heading.find("a") if heading else None
            if result_anchor is None:
                # skip wrappers that do not contain the expected link
                continue

            # print out each link in the results
            print(result_anchor.contents)


    if __name__ == "__main__":
        # you can pass in a keyword to search for when you run the script
        # by default, we'll search for the "web scraping" keyword
        try:
            keyword = sys.argv[1]
        except IndexError:
            keyword = "web scraping"

        scrape_google(keyword)
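One detail in the script above is worth a second look: it builds the query string with replace(" ", "+"), which handles spaces but leaves characters such as & or # unescaped. A minimal sketch of a safer alternative, using only the standard library's urllib.parse.quote_plus on Python 3, might look like this. The helper name build_search_url and its base parameter are hypothetical and do not appear in the original script.

```python
from urllib.parse import quote_plus


def build_search_url(keyword, base="http://www.google.com/search"):
    """Return a search URL with the keyword percent-encoded.

    build_search_url is a hypothetical helper, not part of the original script.
    """
    # quote_plus turns spaces into "+" and escapes characters such as "&" and "#"
    return "{base}?q={term}".format(base=base, term=quote_plus(keyword.strip()))


if __name__ == "__main__":
    print(build_search_url("web scraping"))  # http://www.google.com/search?q=web+scraping
    print(build_search_url("C# & .NET"))     # http://www.google.com/search?q=C%23+%26+.NET
```

The result can be passed to requests.get in place of the hand-built url. Another option is to pass params={"q": keyword} to requests.get and let requests encode the query itself.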

If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then this article about switching from Python 2 to Python 3 may be useful.

+14

Source: https://habr.com/ru/post/1483058/