Python 3 Web Scraping Options

I'm new to Python, so I apologize if this is a newbie question.

I am trying to write a program that involves web scraping, and I have noticed that Python 3 seems to have significantly fewer web scraping modules than Python 2.x.

Beautiful Soup, mechanize, and scrapy (the three modules recommended to me) all appear to be incompatible with Python 3.

Can anyone on this forum suggest a good option for web scraping with Python 3?

Any suggestions would be greatly appreciated.

Thanks, Will

+5
2 answers

lxml.html works on Python 3, which at least gives you HTML parsing.

BeautifulSoup 4, which is under development, should support Python 3 (I did a bit of work on this).

+3

I'm fairly new to this too, but I have found BeautifulSoup 4 to be really good, and I am learning and using it together with the requests and lxml modules. requests fetches the URL and lxml does the parsing (you can also use the built-in html.parser for parsing, but lxml is faster, I think).

Simple use:

    import requests
    from bs4 import BeautifulSoup

    url = 'someUrl'  # placeholder: the page you want to fetch
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')

A slightly less trivial example, collecting the href values from the HTML:

    links = set()
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            # store the URL string itself rather than the whole <a> tag
            links.add(link['href'])

You then have a set of the unique links found on that page.
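If you cannot install bs4 or lxml, roughly the same link collection can be sketched with only the standard library's html.parser module. This is a minimal sketch, not a replacement for BeautifulSoup: the names LinkCollector, collect_links and the sample markup are made up for illustration, and it assumes you already have the page's HTML as a string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the unique href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None
        # for attributes written without a value
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.add(value)


def collect_links(html_text):
    """Return the set of href strings found in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up sample markup: one duplicate link and one anchor without href
sample = ('<p><a href="/a">A</a> <a href="/b">B</a> '
          '<a href="/a">A again</a> <a name="x">no href</a></p>')
unique_links = collect_links(sample)
```

Because the hrefs go into a set, the duplicate is stored once and the anchor without an href is skipped, which mirrors the 'href' in link.attrs check in the BeautifulSoup version.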

Another example shows how to parse specific parts of the HTML, for instance if you want to pull out the contents of all the <p> tags that have the testClass class:

    list_of_p = []
    for p in soup.find_all('p', {'class': 'testClass'}):
        # collect each child node (text or nested tag) of the matching <p>
        for item in p:
            list_of_p.append(item)

There is much more you can do with it, and it really is as simple as it looks.

0

Source: https://habr.com/ru/post/894776/

