I'm fairly new to this, but I found BeautifulSoup 4 really good, and I use it together with the requests and lxml modules. The requests module is used to fetch the URL, and lxml is used for parsing (you can also use the built-in html.parser, but lxml is faster, I think).
Simple use:
    import requests
    from bs4 import BeautifulSoup

    url = 'someUrl'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
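As mentioned, the built-in html.parser also works if lxml is not installed. A minimal sketch, assuming the same placeholder URL (the raise_for_status() call is just an optional extra to fail early on HTTP errors):

    import requests
    from bs4 import BeautifulSoup

    url = 'someUrl'  # placeholder
    response = requests.get(url)
    response.raise_for_status()  # optional: raise an error for 4xx/5xx responses

    # html.parser ships with Python, so no extra install is needed
    soup = BeautifulSoup(response.text, 'html.parser')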
A simple example of how to get all the href links from the HTML:
    links = set()
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            links.add(link['href'])  # store the href value, not the tag itself
You will then get a set of unique links from your URL.
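The href values are often relative paths. If you need absolute URLs, one way (a sketch, assuming url still holds the page address from the first snippet) is to resolve each one against the page URL with urljoin:

    from urllib.parse import urljoin

    absolute_links = set()
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            # resolve relative hrefs against the base page URL
            absolute_links.add(urljoin(url, link['href']))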
Another example is how you can parse only certain parts of the HTML, for example collecting the contents of every <p> tag that has the class testClass:
    list_of_p = []
    for p in soup.find_all('p', {'class': 'testClass'}):
        # iterating over a tag yields its children,
        # so this collects the contents of each matching <p>
        for item in p:
            list_of_p.append(item)
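If you only need the visible text of those paragraphs rather than their child elements, a small variation (a sketch using the same soup object) could look like this:

    # get_text() returns only the text; strip=True trims surrounding whitespace
    paragraph_texts = [p.get_text(strip=True)
                       for p in soup.find_all('p', {'class': 'testClass'})]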
You can do much more with it, and it stays just as simple as it looks.