Python group backreferences

I am cleaning up the output of some HTML, which probably came from a WYSIWYG editor. There are tons of empty formatting tags that I would like to get rid of.

e.g.

 <em></em> Here some text <strong> </strong> and here more <em> <span></span></em> 

Thanks to Regular-Expressions.info I have a neat backreference regular expression that strips one layer of empty tags at a time:

    import re

    # Returns a string minus one level of empty formatting tags
    def remove_empty_html_tags(input_string):
        return re.sub(r'<(?P<tag>strong|span|em)\b[^>]*>(\s*)</(?P=tag)>', r'\1', input_string)
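For example, calling it on the nested sample above (a hypothetical illustration, assuming the function and import shown), a single call only peels off the innermost empty pair:

    # One call removes only one nesting level
    print(remove_empty_html_tags('<em> <span></span></em>'))
    # -> '<em> </em>'  (the inner <span></span> is gone; another call would be needed for the <em>)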

However, I would like to be able to strip all layers at once for <em> <span></span></em> , and there are potentially 5+ layers of nested empty tags.

Is there a way to do a group backreference, a la (?:<(?P<tagBackRef>strong|span|em)\b[^>]*>(\s*))+ (or something else), and use it later with (</(?P=tagBackRef)>)+ to remove several nested but corresponding empty HTML tags?
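(A minimal sketch, added here for illustration and assuming Python's standard re module: when the enclosing group is repeated, a named group such as tagBackRef only remembers its last match, so a single pattern of this shape cannot pair up nested opening and closing tags in reverse order.)

    import re

    # Sketch of the "group backreference" idea: (?P=tagBackRef) only holds the
    # last tag matched by the repeated group, so the outer </em> is left behind.
    pattern = re.compile(
        r'(?:<(?P<tagBackRef>strong|span|em)\b[^>]*>\s*)+'
        r'(?:</(?P=tagBackRef)>\s*)+'
    )

    print(pattern.sub('', '<em> <span></span></em>'))  # -> '</em>'
    print(pattern.sub('', '<em> <em></em></em>'))      # -> ''  (only because every tag is <em>)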

For posterity:

It was probably an XY question: the tool I was hoping to use to get the result I wanted is not the one anyone else would choose. Henry's answer answers the question as asked, but he and everyone else will point you to an HTML parser over regular expressions for HTML processing. =)

2 answers

If you really do not want to use an HTML parser, and you are not too concerned about speed (which I assume you are not, or you would not be using regular expressions to clean your HTML), you can simply modify the code you already wrote. Just put your replacement in a loop (or a recursion, your preference) and return once nothing changes.

    import re

    # Returns a string minus all levels of empty formatting tags
    def remove_empty_html_tags(input_string):
        matcher = r'<(?P<tag>strong|span|em)\b[^>]*>(\s*)</(?P=tag)>'
        old_string = input_string
        new_string = re.sub(matcher, r'\1', old_string)
        while new_string != old_string:
            old_string = new_string
            new_string = re.sub(matcher, r'\1', new_string)
        return new_string
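For instance, on a hypothetical input nested three levels deep (not from the original post), the loop keeps substituting until the string stops changing:

    # The fixed-point loop removes one level per pass, so arbitrary nesting is handled
    print(remove_empty_html_tags('<em> <span><strong></strong></span></em>'))
    # -> ' '  (successive passes remove <strong></strong>, then <span></span>, then <em> </em>)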

This is much easier to do with an HTML parser such as BeautifulSoup , for example:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("""
    <body>
    <em></em> Here some <span><strong>text</strong></span>
    <strong> </strong> and here more <em> <span></span></em>
    </body>
    """)

    # Remove formatting tags that have no child elements and no non-whitespace text
    for element in soup.findAll(name=['strong', 'span', 'em']):
        if element.find(True) is None and (not element.string or not element.string.strip()):
            element.extract()

    print(soup)

prints:

 <html><body> Here some <span><strong>text</strong></span> and here more <em> </em> </body></html> 

As you can see, the empty span , strong and em tags (or those containing only whitespace) have been removed.



Source: https://habr.com/ru/post/1502922/

