Personally, I use lxml because it's a Swiss Army knife for this kind of work:
from lxml import html
print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())
This tells lxml to fetch and parse the page, find the <body> element, then extract and print all of its text.
I do a lot of page parsing, and regular expressions are the wrong solution most of the time, unless it's a one-off job. If the page author changes their HTML, you run the risk of the regular expression breaking; a parser is much more likely to keep working.
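To illustrate that fragility, here is a minimal sketch (the HTML snippets are made up): a regex tied to the exact markup breaks as soon as an attribute is added, while the parser keeps extracting the same text.

```python
import re
from lxml import html

page_v1 = "<html><body>Hello</body></html>"
page_v2 = '<html><body class="main">Hello</body></html>'  # author added an attribute

# Regex written against the original markup
pattern = re.compile(r"<body>(.*?)</body>")
print(pattern.search(page_v1).group(1))  # matches: 'Hello'
print(pattern.search(page_v2))           # None -- the regex no longer matches

# The parser is indifferent to the added attribute
for page in (page_v1, page_v2):
    print(html.fromstring(page).xpath('//body')[0].text_content())  # 'Hello' both times
```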
The main difficulty with a parser is figuring out how to address the sections of the document you care about, but there are many XPath tools you can use right in your browser, which makes that much easier.
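Once you've worked out an XPath expression (for example with your browser's element inspector), applying it with lxml is straightforward. A small sketch, using an inline stand-in document and an illustrative id; in practice you would use html.parse(url) as above:

```python
from lxml import html

# Stand-in for a fetched page; the structure and ids are hypothetical
doc = html.fromstring("""
<html><body>
  <div id="nav">Home | About</div>
  <div id="content"><h1>Title</h1><p>First paragraph.</p></div>
</body></html>
""")

# XPath expression found via the browser's inspector
content = doc.xpath('//div[@id="content"]')[0]
print(content.text_content().strip())
```

This pulls out just the content section and ignores the navigation, something that is awkward to do reliably with a regular expression.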