What is a lightweight Python library that can strip HTML tags (and keep only the text)?

Question

I know that NLTK has this. But anything else?

-1

python

TIMEX Oct 25 '09 at 8:31

4 answers

BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

From the home page:

Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects like screen scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as the original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

+4

Boris Gorelik Oct 25 '09 at 8:59

You might want to take a look at the Strip-o-Gram conversion library: http://pypi.python.org/pypi/stripogram/1.5

Usage example from the readme.txt file:

    from stripogram import html2text, html2safehtml

    mylumpofdodgyhtml  # a lump of dodgy html ;-)

    # Only allow <b>, <a>, <i>, <br>, and <p> tags
    mylumpofcoolcleancollectedhtml = html2safehtml(mylumpofdodgyhtml, valid_tags=("b", "a", "i", "br", "p"))

    # Don't process <img> tags, just strip them out. Use an indent of 4 spaces
    # and a page that's 80 characters wide.
    mylumpoftext = html2text(mylumpofcoolcleancollectedhtml, ignore_tags=("img",), indent_width=4, page_width=80)

+1

twils Oct 25 '09 at 10:22

If your licensing allows this, you can use html2text (asciinator) (GPL).

0

ChristopheD Oct 25 '09 at 10:02

Adrien plisson · Accepted Answer · 2009-10-25T08:56:49+0000

The standard Python module html.parser should let you parse simple HTML content and strip the tags. You only need to subclass HTMLParser and override the handle_*() methods so that they emit or drop content depending on the surrounding element tags.
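
A minimal sketch of that approach, assuming Python 3. The names TagStripper and strip_tags are made up for illustration, and dropping the contents of <script> and <style> is my own assumption about what "only text" should mean, not something from the answer:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text nodes; drop anything inside <script> or <style>."""

    # Assumption: the contents of these elements are not wanted as text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def strip_tags(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b>!</p>"))
```

This joins the text chunks as they appear, so block elements are not separated by newlines; if you need that, you could append a line break in handle_endtag for tags like p or br.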
