I am clearing the output of some html
, which probably came from WYSIWYG. There are tons of empty formatting tags that I would like to get rid of for the sled.
eg.
<em></em> Here some text <strong> </strong> and here more <em> <span></span></em>
Thanks to Regular-Expressions.info I have a neat reverse replication regular expression to expand one layer at a time
# Returns a string minus one level of empty formatting tags def remove_empty_html_tags(input_string): return re.sub(r'<(?P<tag>strong|span|em)\b[^>]*>(\s*)</(?P=tag)>', r'\1', input_string)
However, I would like to be able to expand all layers at once for <em> <span></span></em>
, and there are potentially 5+ layers of nested empty tags.
Is there a way to group backref a la (?:<?P<tagBackRef>strong|span|em)\b[^>]>(\s)*)+
(or something else) and use it later with (</(?P=tagBackRef>)+
to remove several nested but corresponding empty html
tags?
For posterity:
It was probably an XY question in which the tool I was hoping to use for the result I wanted was not the one anyone else would choose. Henry's answer answered the question, but he and everyone else will point you to the html parser over the regular expression for html analysis. =)
source share