How to get the content of an HTML page in Python

Question

I downloaded a web page into an HTML file. What is the easiest way to get the content of this page? By content, I mean the lines of text that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head> <body><p id="firstpara" align="center">This is paragraph <b>one</b>. <p id="secondpara" align="blah">This is paragraph <b>two</b>. </html>

Output:

 Page title This is paragraph one. This is paragraph two.

Putting it together:

 from BeautifulSoup import BeautifulSoup
 import re

 def removeHtmlTags(page):
     p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
     return p.sub('', page)

 def removeHtmlTags2(page):
     soup = BeautifulSoup(page)
     return ''.join(soup.findAll(text=True))

Related questions:
Python HTML removal
Extracting text from an HTML file using Python
What is a lightweight Python library that can eliminate HTML tags (and keep only the text)?
Remove HTML tags in the AppEngine Python environment (equivalent to Ruby's Sanitize)
RegEx match open tags except XHTML self-contained tags (the well-known "do not use regex to parse HTML" rant)

+4

python html parsing

Yin zhu Mar 10 '10 at 12:32

6 answers

Personally, I use lxml because it is a Swiss army knife ...

 from lxml import html

 print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())

This tells lxml to fetch the page, find the <body>, then extract and print all of its text.

I do a lot of page parsing, and regular expressions are the wrong solution most of the time, unless it is a one-off job. If the page author changes the HTML, the regular expression is likely to break, whereas a parser is much more likely to keep working.
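The fragility is easy to reproduce. Here is a minimal sketch comparing a naive tag-stripping pattern with a parser on a tag whose attribute value contains a ">". The sample markup and the pattern are made up for illustration, and the standard library's html.parser stands in for a full parser such as lxml.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the attribute value contains a '>' character.
page = '<p title="a > b">text</p>'

# Naive pattern (an assumption for illustration): a tag ends at the first '>'.
naive = re.sub(r'<[^>]+>', '', page)

# Standard-library parser: collect every text node it reports.
chunks = []
parser = HTMLParser()
parser.handle_data = chunks.append  # route text nodes into the list
parser.feed(page)
parser.close()
parsed = ''.join(chunks)

print(repr(naive))   # attribute debris is left behind
print(repr(parsed))  # only the element text remains
```

The pattern cuts the opening tag off at the ">" inside the quoted attribute, so part of the attribute leaks into the output, while the parser understands quoting and reports only the text.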

The hard part with a parser is working out how to address the sections of the document you are after, but there are many XPath tools you can use in your browser that make this easier.

+7

the tin man Mar 10 '10 at 19:43

You may want to look at "Extracting data from HTML documents" in Dive Into Python, because it does (almost) exactly what you want.

+2

Pratik deoghare Mar 10 '10 at 13:15

The best modules for this task are lxml or html5lib; Beautiful Soup is no longer worth using. And for recursive structures such as nested HTML, regular expressions are definitely the wrong tool.

+1

Christian hausknecht Mar 10 '10 at 12:49

If I understand the question correctly, this can be done using the urlopen function from urllib. Just use this function to open the URL and read the response, which will be the HTML code of the page.
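A minimal sketch of that approach in Python 3, where urlopen lives in urllib.request. The helper name fetch_html and the data: URL are assumptions made for the example; the data: URL only keeps the demo offline, and a real call would pass an http(s) address instead.

```python
from urllib.parse import quote
from urllib.request import urlopen

def fetch_html(url):
    """Open the URL and return the response body decoded as text."""
    with urlopen(url) as response:
        # Use the charset from the response headers if given, else assume UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)

# Hypothetical page, embedded in a data: URL so that no network is needed.
sample = '<html><head><title>Page title</title></head></html>'
html = fetch_html('data:text/html,' + quote(sample))
print(html)
```

Note that this returns the raw markup, tags included, so it covers only the download step and not the text extraction the question asks about.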

-2

Ankit Mar 10 '10 at 12:46

The fastest way to get a useful approximation of what the browser displays is to strip all tags from the HTML and print the rest. This can be done, for example, using Python's re module.
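A minimal sketch of this idea, applied to the input from the question. The pattern is a simplified assumption (everything between "<" and the next ">"), not the longer expression from the question, and strip_tags is a name chosen for the example.

```python
import re

# The sample input from the question.
page = ('<html><head><title>Page title</title></head> '
        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>. '
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>. </html>')

# Simplified pattern (an assumption): anything between '<' and the next '>'.
TAG = re.compile(r'<[^>]+>')

def strip_tags(markup):
    """Delete everything that looks like a tag, then tidy the whitespace."""
    text = TAG.sub('', markup)
    return ' '.join(text.split())

result = strip_tags(page)
print(result)
```

This works on simple markup like the sample, but it is likely to misbehave on comments, script or style blocks, and attribute values that contain ">".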

-3

Alexander Gessler Mar 10 '10 at 12:34

Oddthinking · Accepted Answer · 2010-03-10T12:35:10+0000

Parse the HTML with Beautiful Soup.

To get all the text without tags, try:

 ''.join(soup.findAll(text=True))
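If installing a third-party package is not an option, the same "join every text node" idea can be sketched with the standard library's html.parser. This is a stand-in and not Beautiful Soup's API; TextCollector and page_text are names invented for the example, and the whitespace clean-up at the end is an added assumption to match the output asked for in the question.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Gather every text node, roughly what findAll(text=True) returns."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.chunks.append(data)

def page_text(page):
    collector = TextCollector()
    collector.feed(page)
    collector.close()
    # Join the text nodes, then collapse runs of whitespace.
    return ' '.join(''.join(collector.chunks).split())

# The sample input from the question.
page = ('<html><head><title>Page title</title></head> '
        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>. '
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>. </html>')

text = page_text(page)
print(text)
```

Like the one-liner above, this keeps every text node, so the contents of script and style elements would also end up in the result on a real page.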
