How to get the content of an HTML page in Python

I have downloaded a web page into an HTML file. I am wondering what the simplest way is to get the content of that page. By content, I mean the text that the browser displays.

To be clear:

Input:

<html><head><title>Page title</title></head> <body><p id="firstpara" align="center">This is paragraph <b>one</b>. <p id="secondpara" align="blah">This is paragraph <b>two</b>. </html> 

Output:

 Page title This is paragraph one. This is paragraph two. 

Putting it together:

    from BeautifulSoup import BeautifulSoup
    import re

    def removeHtmlTags(page):
        p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
        return p.sub('', page)

    def removeHtmlTags2(page):
        soup = BeautifulSoup(page)
        return ''.join(soup.findAll(text=True))

+4
6 answers

Parse the HTML with Beautiful Soup.

To get all the text without tags, try:

 ''.join(soup.findAll(text=True)) 
+12

Personally, I use lxml because it is a Swiss army knife ...

    from lxml import html

    print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())

This tells lxml to fetch the page, locate the <body> element, then extract and print all of its text.

I do a lot of page parsing, and a regular expression is the wrong solution most of the time, unless it is for one-off use. If the page author changes the HTML, you run the risk of breaking the regular expression. A parser is much more likely to keep working.

The hard part with a parser is learning how to address the sections of the document you are after, but there are many XPath tools you can use inside your browser that make the task easier.

+7

Take a look at "Extracting data from HTML documents" in Dive Into Python, because it does (almost) exactly what you want.
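
For readers who want to try a parser-based approach without installing anything, here is a minimal sketch using the standard library's html.parser module. It is not the code from the chapter mentioned above; the names TextExtractor and visible_text are made up for this illustration, and skipping script and style bodies is an assumption about what counts as "displayed" text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(page):
    """Return the text of `page` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = ('<html><head><title>Page title</title></head> <body>'
          '<p id="firstpara" align="center">This is paragraph <b>one</b>. '
          '<p id="secondpara" align="blah">This is paragraph <b>two</b>. </html>')
print(visible_text(sample))
```

On the sample input from the question this prints the title followed by both paragraphs on one line. Note that it keeps the <title> text, as the question asks, even though a browser shows that in the tab rather than in the page body.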

+2

The best modules for this task are lxml or html5lib; in my opinion Beautiful Soup is no longer worth using. And for recursive structures, regular expressions are definitely the wrong method.

+1

If I understand the question correctly, this can be done using the urlopen function from urllib. Just call this function to open the URL and read the response, which will be the HTML code of the page.
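
A rough sketch of that suggestion in Python 3, where urlopen lives in urllib.request. The helper name fetch_html and the UTF-8 fallback are assumptions made for this example. Keep in mind that this only retrieves the markup; the tags still have to be removed in a separate step.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Return the decoded response body for `url`."""
    with urlopen(url) as response:
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_encoding
        return response.read().decode(charset, errors="replace")


# A data: URL keeps the example runnable without network access;
# in practice you would pass an http(s) URL of your own.
demo_url = "data:text/html;charset=utf-8," + quote("<p>Hello</p>")
print(fetch_html(demo_url))
```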

-2

The quickest way to get a usable approximation of what a browser would display is to remove all tags from the HTML and print the rest. This can be done, for example, with Python's re module.
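
A minimal sketch of that idea, assuming a deliberately naive pattern (anything between "<" and the next ">" is treated as a tag). The name strip_tags is invented for this example, and the entity decoding and whitespace collapsing are additions that the answer does not mention.

```python
import html
import re

# Naive on purpose: a ">" inside an attribute value ends the match too early.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(page):
    """Remove tag-like spans, decode entities, and collapse whitespace."""
    text = TAG_RE.sub("", page)
    text = html.unescape(text)
    return " ".join(text.split())


sample = ('<html><head><title>Page title</title></head> <body>'
          '<p id="firstpara" align="center">This is paragraph <b>one</b>. '
          '<p id="secondpara" align="blah">This is paragraph <b>two</b>. </html>')
print(strip_tags(sample))
```

This handles the sample from the question, but it breaks on input such as `<a title="a > b">link</a>`, where part of the attribute leaks into the output. It also keeps the contents of script and style elements, which is one reason the other answers recommend a parser.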

-3

Source: https://habr.com/ru/post/1303667/
