Use html5lib to convert an HTML fragment to plain text.

Question

Is there an easy way to use the html5lib library for Python to convert something like this:

<p>Hello World. Greetings from <strong>Mars.</strong></p>

to

 Hello World. Greetings from Mars.

+6

python html html5lib

Jason christa Dec 31 '11 at 0:19

3 answers

I use html2text, which converts HTML to plain text (in Markdown format).

    from html2text import HTML2Text

    handler = HTML2Text()
    html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br> <br><h1>Nullam eget \r\ngravida elit</h1>Integer iaculis elit at risus feugiat: <br><br><ul><li>Egestas non quis \r\nlorem.</li><li>Nam id lobortis felis. </li><li>Sed tincidunt nulla.</li></ul> At massa tempus, quis \r\nvehicula odio laoreet.<br>"""
    text = handler.handle(html)

    >>> text
    u'Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.\n\n \n\n# Nullam eget gravida elit\n\nInteger iaculis elit at risus feugiat:\n\n \n\n * Egestas non quis lorem.\n * Nam id lobortis felis.\n * Sed tincidunt nulla.\nAt massa tempus, quis vehicula odio laoreet.\n\n'

+3

seddonym Nov 11 '13 at 10:58

You can join the results of the itertext() method.

Example:

    import html5lib

    d = html5lib.parseFragment(
        '<p>Hello World. Greetings from <strong>Mars.</strong></p>')
    s = ''.join(d.itertext())
    print(s)

Output:

    Hello World. Greetings from Mars.
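To see what itertext() actually yields before the join, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of html5lib. That swap is an assumption made only so the snippet runs without extra packages: it works here because this particular fragment is well-formed XML, which real-world HTML often is not. The whitespace cleanup at the end is also my addition, not part of the answer above.

```python
import xml.etree.ElementTree as ET

# Assumption: the fragment is well-formed XML, so the standard library can
# parse it; for arbitrary HTML you would use html5lib.parseFragment instead.
fragment = '<p>Hello World. Greetings from <strong>Mars.</strong></p>'
root = ET.fromstring(fragment)

# itertext() walks the tree in document order and yields each text piece.
pieces = list(root.itertext())
print(pieces)

# Joining the pieces gives the plain text; split() followed by ' '.join()
# collapses runs of whitespace and trims both ends.
text = ' '.join(''.join(pieces).split())
print(text)
```

The first print shows the two separate pieces (the paragraph text and the text inside the strong element), and the second shows them joined into one string.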

0

maxschlepzig Apr 19 '17 at 16:34

Niklas B. · Accepted Answer · 2011-12-31T00:37:05+0000

With lxml as a backend parser:

    import html5lib

    body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
    doc = html5lib.parse(body, treebuilder="lxml")
    print(doc.text_content())

To be honest, this is actually cheating, since it is equivalent to the following (only the relevant parts change):

    from lxml import html

    doc = html.fromstring(body)
    print(doc.text_content())

If you really need html5lib itself:

    from lxml.html import html5parser

    doc = html5parser.fromstring(body)
    print(doc.xpath("string()"))
