How can I remove <p></p> using Python's sub?

I have an HTML file and I want to replace the empty paragraphs with a space.

    mystring = "This <p></p><p>is a test</p><p></p><p></p>"
    result = mystring.sub("<p></p>", "&nbsp;")

This does not work.

+3
python string html
Mar 23 2018-11-11T00:
6 answers

Please don't try to parse HTML with regular expressions. Use a proper parsing module, such as HTMLParser or BeautifulSoup, to do this. Accept a short learning curve now and you benefit:

  • Your parsing code will be more robust, handling corner cases you may not have considered and that would break a regular expression
  • For future HTML parsing and munging tasks you will be able to get things done faster, so the time invested pays off in the end.

You won't regret it! Profit guaranteed!
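As a rough illustration of the parser route for the question's input, here is a minimal sketch that uses only the standard library's html.parser module (the Python 3 name for HTMLParser). The class EmptyParagraphReplacer and the helper replace_empty_paragraphs are names invented for this sketch, not part of any library, and treating whitespace-only paragraphs as empty is an assumption about what the asker wants:

```python
from html.parser import HTMLParser


class EmptyParagraphReplacer(HTMLParser):
    """Re-emit the input HTML, swapping empty <p> elements for a placeholder."""

    def __init__(self, placeholder="&nbsp;"):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.placeholder = placeholder
        self.out = []        # output fragments that are final
        self.pending = None  # a <p> start tag plus whitespace, held back

    def _flush(self):
        # The held-back paragraph turned out not to be empty: emit it as is.
        if self.pending is not None:
            self.out.extend(self.pending)
            self.pending = None

    def _emit(self, text):
        self._flush()
        self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()
            self.pending = [self.get_starttag_text()]
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <p/> or <br/>.
        self._emit(self.placeholder if tag == "p" else self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.pending is not None:
            self.pending = None  # <p> followed only by whitespace
            self.out.append(self.placeholder)
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        if self.pending is not None and not data.strip():
            self.pending.append(data)  # the paragraph may still be empty
        else:
            self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def replace_empty_paragraphs(html, placeholder="&nbsp;"):
    parser = EmptyParagraphReplacer(placeholder)
    parser.feed(html)
    parser.close()
    parser._flush()  # an unclosed trailing <p> is kept as written
    return "".join(parser.out)


print(replace_empty_paragraphs("This <p></p><p>is a test</p><p></p><p></p>"))
```

Because the parser lowercases tag names and reports attributes separately, the same sketch also treats `<P align="left"> </P>` and `<P/>` as empty paragraphs.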

+10
Mar 23 2018-11-23T00:

I think it's always nice to give an example of how to do this with a real parser, as well as repeating the sound advice that Eli Bendersky gives in his answer.

Here is an example of how to remove empty <p> elements using lxml. lxml's HTMLParser copes very well with HTML.

    from io import StringIO

    from lxml import etree

    html_input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(html_input), parser)

    for p in tree.xpath("//p"):
        # Skip paragraphs that contain child elements.
        if len(p):
            continue
        t = p.text
        # Remove paragraphs whose text is missing or only whitespace.
        if not (t and t.strip()):
            p.getparent().remove(p)

    print(etree.tostring(tree.getroot(), pretty_print=True, encoding="unicode"))

... which produces the result:

    <html>
      <body>
        <p>This </p>
        <p>is a test</p>
        <p>
          <b>Bye.</b>
        </p>
      </body>
    </html>

Note that I misread the question when I answered this: I only remove the empty <p> elements rather than replacing them with &nbsp;. With lxml I'm not sure of a simple way to do that, so I have asked another question:

  • How to replace an element with text in lxml?
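One possible approach, offered as a sketch rather than an established lxml recipe: lxml keeps the text that follows an element in that element's .tail, so an element can be swapped for text by appending the text to the previous sibling's tail, or to the parent's text when there is no previous sibling. The helper name replace_with_text and the sample input are made up for illustration:

```python
from lxml import etree


def replace_with_text(element, text):
    """Remove `element` from its parent, leaving `text` in its place."""
    parent = element.getparent()
    # Text that followed the element lives in its .tail, so carry it along.
    text += element.tail or ""
    previous = element.getprevious()
    if previous is not None:
        previous.tail = (previous.tail or "") + text
    else:
        parent.text = (parent.text or "") + text
    parent.remove(element)


html_input = "<div>a<p></p>b<p>x</p><p> </p>c</div>"
root = etree.fromstring(html_input, etree.HTMLParser())

# xpath() returns a plain list, so removing elements while looping is safe.
for p in root.xpath("//p"):
    if len(p) == 0 and not (p.text and p.text.strip()):
        replace_with_text(p, "\u00a0")  # U+00A0 is the character behind &nbsp;

div = root.find(".//div")
print(repr(div.text), [child.tag for child in div], repr(div[0].tail))
```

How the non-breaking space is written out (as a literal character or as an entity) depends on the serialization options passed to etree.tostring.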
+5
Mar 23 2018-11-11T00:

I think a parsing module would be overkill for this particular problem.

The built-in str.replace method is enough:

    >>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
    >>> mystring.replace("<p></p>", "&nbsp;")
    'This &nbsp;<p>is a test</p>&nbsp;&nbsp;'
+2
Mar 23 2018-11-23T00:

What if <p> is entered as <P> or < p >, or has an attribute added, or is given using the empty-tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:

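    from pyparsing import makeHTMLTags, replaceWith, withAttribute

    mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'

    # Opening and closing <p> tags, matched case-insensitively with any attributes.
    p, pEnd = makeHTMLTags("P")
    # A self-closing <P/> tag.
    emptyP = p.copy().setParseAction(withAttribute(empty=True))
    # Either <P/> or an opening tag immediately followed by a closing tag.
    null_paragraph = emptyP | p + pEnd
    null_paragraph.setParseAction(replaceWith("&nbsp;"))

    print(null_paragraph.transformString(mystring))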

Prints:

 This &nbsp;<p>is a test</p>&nbsp;&nbsp;&nbsp; 
+2
Mar 23 '11 at 15:56

Using a regexp?

    import re

    mystring = "This <p></p><p>is a test</p><p></p><p></p>"
    result = re.sub(r"<p>\s*</p>", "&nbsp;", mystring, flags=re.MULTILINE)

Compile the regex if you use it often.
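A minimal sketch of the compiled form, assuming the same pattern as above; the names EMPTY_P and sub_empty_paragraphs are invented for illustration. Since \s already matches newlines, no flag is needed for paragraphs that span lines:

```python
import re

# Compiled once at import time and reused on every call.
EMPTY_P = re.compile(r"<p>\s*</p>")


def sub_empty_paragraphs(html, placeholder="&nbsp;"):
    """Replace <p></p> pairs that contain only whitespace."""
    return EMPTY_P.sub(placeholder, html)


print(sub_empty_paragraphs("This <p></p><p>is a test</p><p> \n</p><p></p>"))
```

Like the one-liner, this only matches a lowercase <p> with no attributes.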

+1
Mar 23 2018-11-23T00:

I wrote this code:

    from io import BytesIO

    from lxml import etree

    html_tags = b"""<div><ul><li>PID temperature controller</li>
    <li>Smart and reliable</li>
    <li>Auto-diagnosing</li>
    <li>Auto setting</li>
    <li>Intelligent control</li>
    <li>2-Rows 4-Digits LED display</li>
    <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li>
    <li> </li>
    <p> </p></ul>
    <div> </div></div>"""

    # iterparse reports each element once its closing tag has been read.
    document = etree.iterparse(BytesIO(html_tags), html=True)

    for a, e in document:
        # Drop elements that have no children and no non-whitespace text.
        if not (e.text and e.text.strip()) and len(e) == 0:
            e.getparent().remove(e)

    print(etree.tostring(document.root).decode())
0
Apr 12 2018-12-12T00: