Use html5lib to convert an HTML fragment to plain text.

Is there an easy way to use the html5lib library for Python to convert something like this:

<p>Hello World. Greetings from <strong>Mars.</strong></p> 

to

 Hello World. Greetings from Mars. 
3 answers

With lxml as a backend parser:

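    import html5lib

    body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
    doc = html5lib.parse(body, treebuilder="lxml")
    # html5lib returns a plain lxml ElementTree here, which has no
    # text_content() method; the XPath string() function collects the text.
    print(doc.xpath("string()"))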

Honestly, this is cheating, since it is essentially equivalent to the following (only the relevant parts change):

    from lxml import html

    doc = html.fromstring(body)
    print(doc.text_content())

If you really need the html5lib parser:

    from lxml.html import html5parser

    doc = html5parser.fromstring(body)
    print(doc.xpath("string()"))

I use html2text, which converts HTML to plain text (formatted as Markdown).

    from html2text import HTML2Text

    handler = HTML2Text()
    html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br> <br><h1>Nullam eget \r\ngravida elit</h1>Integer iaculis elit at risus feugiat: <br><br><ul><li>Egestas non quis \r\nlorem.</li><li>Nam id lobortis felis. </li><li>Sed tincidunt nulla.</li></ul> At massa tempus, quis \r\nvehicula odio laoreet.<br>"""
    text = handler.handle(html)

    >>> text
    u'Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.\n\n \n\n# Nullam eget gravida elit\n\nInteger iaculis elit at risus feugiat:\n\n \n\n * Egestas non quis lorem.\n * Nam id lobortis felis.\n * Sed tincidunt nulla.\nAt massa tempus, quis vehicula odio laoreet.\n\n'

You can join the strings produced by the itertext() method.

Example:

    import html5lib

    d = html5lib.parseFragment(
        '<p>Hello World. Greetings from <strong>Mars.</strong></p>')
    s = ''.join(d.itertext())
    print(s)

Output:

    Hello World. Greetings from Mars.
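
To see what itertext() yields before the join, here is a minimal sketch that uses only the standard library. html5lib's default tree builder produces xml.etree.ElementTree elements, so the method is the same one; the sketch assumes the fragment happens to be well-formed XML, which lets ET.fromstring stand in for html5lib. Real-world HTML usually is not well-formed, and that is the reason to use html5lib in the first place.

```python
import xml.etree.ElementTree as ET

# Assumption: this fragment is well-formed XML, so the standard-library
# parser can stand in for html5lib for illustration purposes only.
fragment = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
root = ET.fromstring(fragment)

# itertext() walks the tree in document order and yields each text node.
pieces = list(root.itertext())
print(pieces)

# Joining the pieces with an empty string gives the plain text.
text = "".join(pieces)
print(text)
```

Because the pieces are joined with an empty string, any spacing in the result comes from the text nodes themselves, not from the join.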

Source: https://habr.com/ru/post/904780/

