What is a lightweight Python library that can strip HTML tags (and keep only the text)?

I know that NLTK can do this, but is there anything else?

-1
4 answers

The standard Python module html.parser lets you parse simple HTML content and strip the tags. You only need to subclass HTMLParser and override the handle_*() methods so that they keep or drop content depending on the surrounding element tags.
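As a rough illustration of that approach, here is a minimal sketch using only the standard library. The names TextExtractor and strip_tags are made up for this example, and the choice to discard the contents of <script> and <style> elements is an assumption, not something the answer specifies.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags."""

    # Assumption: text inside these elements is not wanted in the output.
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only collect text that is not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the collected pieces and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling strip_tags("<p>Hello <b>world</b></p>") should return the plain text of the fragment. html.parser is tolerant of sloppy markup, but for badly broken pages a dedicated library may behave more predictably.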

+4

BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

From the home page:

Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects such as screen scraping. Three features make it powerful:

  • Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as the original document. This is usually good enough to collect the data you need and run away.
  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You do not have to create a custom parser for each application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not have to think about encodings unless the document does not specify an encoding and Beautiful Soup cannot auto-detect one. Then you just have to specify the original encoding.
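For the tag-stripping use case in the question, a short sketch might look like the following. It assumes the current beautifulsoup4 package (imported as bs4) rather than the older release the link above may point to, and html_to_text is a made-up helper name.

```python
def html_to_text(html):
    # Imported here so the sketch only requires bs4 when it is actually called.
    from bs4 import BeautifulSoup

    # "html.parser" selects Python's built-in parser, so no extra parser
    # package is needed.
    soup = BeautifulSoup(html, "html.parser")

    # get_text() concatenates the text nodes; strip=True trims each piece
    # and separator=" " keeps adjacent pieces from running together.
    return soup.get_text(separator=" ", strip=True)
```

This is heavier than the standard-library route, but it tends to cope better with malformed markup.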
+4

You might want to take a look at the Strip-o-Gram conversion library: http://pypi.python.org/pypi/stripogram/1.5

Usage example from the readme.txt file:

    from stripogram import html2text, html2safehtml

    mylumpofdodgyhtml # a lump of dodgy html ;-)

    # Only allow <b>, <a>, <i>, <br>, and <p> tags
    mylumpofcoolcleancollectedhtml = html2safehtml(mylumpofdodgyhtml,valid_tags=("b", "a", "i", "br", "p"))

    # Don't process <img> tags, just strip them out. Use an indent of 4 spaces
    # and a page that's 80 characters wide.
    mylumpoftext = html2text(mylumpofcoolcleancollectedhtml,ignore_tags=("img",),indent_width=4,page_width=80)
+1

If your licensing allows it, you can use html2text (asciinator), which is GPL-licensed.

0

Source: https://habr.com/ru/post/1303672/

