How do I add consistent whitespace to existing HTML using Python?

I just started working on a website whose pages have all their HTML on a single line, which is a real pain to read and work with. I am looking for a tool (preferably a Python library) that accepts HTML and returns the same HTML unchanged except for added line breaks and matching indentation. (All tags, markup, and content must remain intact.)

The library does not need to handle invalid HTML; I pass the HTML through html5lib first, so it will receive well-formed HTML. However, as mentioned above, I would prefer that it not change any actual markup; I trust html5lib and would rather let it handle correctness.

First, does anyone know whether this is possible with html5lib alone? (Unfortunately, its documentation seems a bit sparse.) If not, what tool would you suggest? I have seen some people recommend HTML Tidy, but I am not sure whether it can be configured to change only whitespace. (Will it do anything other than insert whitespace if it is given correct HTML?)

+3
3 answers

Algorithm

  • Parse the HTML into some representation
  • Serialize that representation back to HTML

Example using the html5lib parser with the BeautifulSoup tree builder

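#!/usr/bin/env python
# Note: relies on an html5lib release that still ships the "beautifulsoup"
# tree builder; newer releases no longer include it.
from html5lib import HTMLParser, treebuilders

# Build Beautiful Soup trees from html5lib's parse results
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""

# Parse the markup, then serialize it back with indentation
soup = parser.parse(c)
print(soup.prettify())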

Output:

<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  ......
 </body>
</html>
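
The two steps above (parse, then serialize) can also be sketched without third-party packages. This is a minimal illustration that assumes Python 3 and uses only the standard html.parser module; it is not html5lib and does not repair invalid markup. The names IndentingSerializer and reindent are made up for this example. It lowercases tag names, re-quotes attributes, and treats all whitespace as insignificant, so it is not suitable for <pre> or other whitespace-sensitive content.

```python
import html
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndentingSerializer(HTMLParser):
    """Hypothetical helper: re-emits every token on its own indented line."""

    def __init__(self, indent=" "):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.indent = indent
        self.depth = 0
        self.out = []       # finished output lines
        self.pending = []   # text seen since the last tag

    def emit(self, line):
        self.out.append(self.indent * self.depth + line)

    def flush_text(self):
        text = "".join(self.pending).strip()
        self.pending = []
        if text:
            self.emit(text)

    def handle_starttag(self, tag, attrs):
        self.flush_text()
        parts = [tag]
        for name, value in attrs:
            # Attribute values arrive unescaped, so escape them again
            parts.append(name if value is None
                         else '%s="%s"' % (name, html.escape(value)))
        self.emit("<%s>" % " ".join(parts))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        self.flush_text()
        if tag in VOID_TAGS:
            return
        self.depth = max(self.depth - 1, 0)
        self.emit("</%s>" % tag)

    def handle_data(self, data):
        self.pending.append(data)

    def handle_entityref(self, name):
        self.pending.append("&%s;" % name)

    def handle_charref(self, name):
        self.pending.append("&#%s;" % name)

    def handle_comment(self, data):
        self.flush_text()
        self.emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.flush_text()
        self.emit("<!%s>" % decl)


def reindent(markup, indent=" "):
    """Parse markup, then serialize it back with one token per line."""
    serializer = IndentingSerializer(indent)
    serializer.feed(markup)
    serializer.close()
    serializer.flush_text()
    return "\n".join(serializer.out)


print(reindent("<HTML><HEAD><TITLE>Title</TITLE></HEAD>"
               "<BODY>...... </BODY></HTML>"))
```

Keeping the parser and the serializer in one class is only for brevity; with a real tree (Beautiful Soup or minidom) the two steps are separate calls.
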
+2

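Here is another solution for anyone who would rather not install Beautiful Soup. (Also, the Beautiful Soup tree builder is being deprecated in html5lib 1.0.) It builds on Amarghosh's suggestion; I just fleshed it out. Looking at html5lib, I found that it can produce a minidom tree natively, which means toprettyxml() is available. Here is what I came up with: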

from html5lib import HTMLParser, treebuilders

def tidy_html(text):
  """Returns a well-formatted version of input HTML."""

  # The "dom" tree builder produces xml.dom.minidom nodes
  p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
  dom_tree = p.parseFragment(text)

  # Collect the pieces in a list and join once at the end
  pieces = []

  # Pretty-print each top-level node of the fragment in turn
  node = dom_tree.firstChild
  while node:
    pieces.append(node.toprettyxml(indent='  '))
    node = node.nextSibling

  return ''.join(pieces)

Example usage:

>>> text = """<b><i>bold, italic</b></i><div>a div</div>"""
>>> print(tidy_html(text))
<b>
  <i>
    bold, italic
  </i>
</b>
<div>
  a div
</div>

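Why do I iterate over the children of the tree instead of calling toprettyxml() on dom_tree directly? Some of the HTML I work with is really HTML fragments, so it lacks the <head> and <body> tags. To handle that I used parseFragment(), which returns a DocumentFragment (rather than a Document). Unfortunately, a DocumentFragment has no writexml() method (which toprettyxml() calls), so I iterate over its child nodes, which do have it.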
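
The DocumentFragment limitation can be checked with the standard library alone. A minimal sketch, assuming Python 3; the fragment here is built by hand rather than by html5lib, and the element names and text are only for illustration:

```python
from xml.dom.minidom import Document

# Build a small fragment with two top-level elements
doc = Document()
fragment = doc.createDocumentFragment()
for tag, text in (("b", "bold"), ("div", "a div")):
    element = doc.createElement(tag)
    element.appendChild(doc.createTextNode(text))
    fragment.appendChild(element)

# toprettyxml() delegates to writexml(); check whether the fragment
# itself provides that method
has_writexml = hasattr(fragment, "writexml")

# The child nodes are ordinary elements, so each one can be
# pretty-printed individually and the pieces joined afterwards
pretty = "".join(child.toprettyxml(indent="  ")
                 for child in fragment.childNodes)

print(has_writexml)
print(pretty)
```

Iterating over childNodes is equivalent to the firstChild/nextSibling loop; either way, each top-level node is serialized separately.
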

+2

If the HTML is well-formed XML, you can use a DOM parser.

from xml.dom.minidom import parse, parseString

# if you have the HTML string in a variable
html = parseString(theHtmlString)

# or parse the HTML file instead
html = parse(htmlFileName)

print(html.toprettyxml())

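toprettyxml() lets you specify the indentation, the newline character, and the output encoding. You may also want to look at the writexml() method.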
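
As a rough illustration of those parameters, here is a small sketch assuming Python 3 and markup that is already well-formed XML (minidom is an XML parser and rejects anything else). The sample markup and variable names are made up for the example:

```python
import io
from xml.dom.minidom import parseString

markup = ("<html><head><title>Title</title></head>"
          "<body><p>text</p></body></html>")
document = parseString(markup)

# indent and newl control the whitespace that gets inserted
pretty_text = document.toprettyxml(indent="  ", newl="\n")

# Passing an encoding makes toprettyxml() return bytes instead of str
pretty_bytes = document.toprettyxml(indent="  ", newl="\n", encoding="utf-8")

# writexml() writes to any object with a write() method; calling it on
# the root element skips the XML declaration
buffer = io.StringIO()
document.documentElement.writexml(buffer, indent="", addindent="  ", newl="\n")

print(pretty_text)
print(buffer.getvalue())
```

Note that toprettyxml() on the Document prepends an XML declaration, which is usually unwanted in HTML; serializing the root element avoids it.
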

+1

Source: https://habr.com/ru/post/1733152/

