How can I remove empty <p></p> elements using Python's sub?
I have an html file and I want to replace empty paragraphs with a space.
    mystring = "This <p></p><p>is a test</p><p></p><p></p>"
    result = mystring.sub("<p></p>" , " ")

This does not work.
Please do not try to parse HTML with regular expressions. Use a proper parsing module for this, for example htmlparser or BeautifulSoup. Invest in the short learning curve and you will benefit:
- Your parsing code will be more robust, because it handles corner cases you might not have considered and that would break a regular expression.
- Future HTML parsing / munging tasks will go faster, so the time you invest now pays off in the end.
You won't regret it! Profit guaranteed!
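To make the advice above a bit more concrete, here is a minimal sketch using only the standard library's html.parser. It is not code from this answer, and the names EmptyParagraphReplacer and replace_empty_paragraphs are made up for illustration. It re-emits the markup as it parses and swaps every <p> that contains nothing but whitespace for a space. It assumes paragraphs are not nested, and it writes end tags back in lowercase.

```python
from html.parser import HTMLParser


class EmptyParagraphReplacer(HTMLParser):
    """Re-emit HTML, swapping each empty <p>...</p> for a replacement string."""

    def __init__(self, replacement=" "):
        # convert_charrefs=False so entities such as &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.replacement = replacement
        self.out = []        # pieces of the rewritten document
        self.pending = None  # raw text of a <p> that may still turn out to be empty

    def flush(self):
        # The pending <p> had real content after all: emit it unchanged.
        if self.pending is not None:
            self.out.append(self.pending)
            self.pending = None

    def handle_starttag(self, tag, attrs):
        self.flush()
        raw = self.get_starttag_text()  # original spelling, attributes included
        if tag == "p":
            self.pending = raw          # hold it back until we know what follows
        else:
            self.out.append(raw)

    def handle_startendtag(self, tag, attrs):
        self.flush()
        # <p/> is treated as an empty paragraph; other tags like <br/> pass through
        self.out.append(self.replacement if tag == "p" else self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.pending is not None:
            self.pending = None         # nothing but whitespace since <p>: replace
            self.out.append(self.replacement)
        else:
            self.flush()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.pending is not None and not data.strip():
            self.pending += data        # whitespace keeps the paragraph "empty"
        else:
            self.flush()
            self.out.append(data)

    def handle_entityref(self, name):
        self.flush()
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.flush()
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.flush()
        self.out.append(f"<!--{data}-->")


def replace_empty_paragraphs(html, replacement=" "):
    parser = EmptyParagraphReplacer(replacement)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit an unclosed trailing <p>, if any
    return "".join(parser.out)


print(repr(replace_empty_paragraphs("This <p></p><p>is a test</p><p></p><p></p>")))
# 'This  <p>is a test</p>  '
```

Because the parser lowercases tag names before calling the handlers, <P></P> and <p align="left"> </p> are caught as well, which a plain string replace would miss.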
I think it is always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here is an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
    from io import StringIO
    from lxml import etree

    html_input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(html_input), parser)

    for p in tree.xpath("//p"):
        # skip paragraphs that have child elements
        if len(p):
            continue
        t = p.text
        # remove paragraphs whose text is missing or only whitespace
        if not (t and t.strip()):
            p.getparent().remove(p)

    print(etree.tostring(tree.getroot(), pretty_print=True, encoding="unicode"))

... which produces the result:
    <html>
      <body>
        <p>This </p>
        <p>is a test</p>
        <p>
          <b>Bye.</b>
        </p>
      </body>
    </html>

Please note that I misread the question when I answered this: I only remove the empty <p> elements, without replacing them with a space. With lxml, I'm not sure of an easy way to do that, so I asked another question:
- How to replace an element with text in lxml?
I think a parsing module would be overkill for this particular problem.
Just use str.replace:
    >>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
    >>> mystring.replace("<p></p>", " ")
    'This  <p>is a test</p>  '

What if <p> is written as <P> or < p >, or has an attribute added, or is given using the empty-tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
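    from pyparsing import makeHTMLTags, replaceWith, withAttribute

    mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'

    p, pEnd = makeHTMLTags("P")
    # <P/> style: an opening tag that is flagged as empty
    emptyP = p.copy().setParseAction(withAttribute(empty=True))
    # either an empty tag, or an opening tag directly followed by its closing tag
    null_paragraph = emptyP | p + pEnd
    null_paragraph.setParseAction(replaceWith(" "))

    print(null_paragraph.transformString(mystring))

This prints (each empty paragraph becomes a single space):

    This  <p>is a test</p>

Using a regexp?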
    import re
    result = re.sub(r"<p>\s*</p>", " ", mystring, flags=re.MULTILINE)

Compile the regex if you use it often.
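As a rough sketch of what "compile it" could look like, here is a precompiled pattern that is also a little more forgiving about case and attributes. The names EMPTY_P and sub_empty_paragraphs are my own, not from the answer above. It is still a regex, so it assumes no attribute value contains a ">" and it will not catch < p > or <P/>.

```python
import re

# Compiled once at import time and reused on every call.
# <p\b      the tag name, case-insensitive, but not <pre> or <param>
# [^>]*>    any attributes, up to the closing bracket
# \s*       optional whitespace (including newlines) inside the paragraph
# </p\s*>   the closing tag
EMPTY_P = re.compile(r"<p\b[^>]*>\s*</p\s*>", re.IGNORECASE)


def sub_empty_paragraphs(html, replacement=" "):
    """Replace every whitespace-only <p>...</p> in html with replacement."""
    return EMPTY_P.sub(replacement, html)


sample = 'This <p></p><p>is a test</p><P class="x">\n</P>'
print(repr(sub_empty_paragraphs(sample)))
# 'This  <p>is a test</p> '
```

The \s already matches newlines, so no extra flag is needed for paragraphs that span several lines.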
I wrote this code:
    from io import BytesIO
    from lxml import etree

    html_tags = b"""<div><ul><li>PID temperature controller</li>
    <li>Smart and reliable</li>
    <li>Auto-diagnosing</li>
    <li>Auto setting</li>
    <li>Intelligent control</li>
    <li>2-Rows 4-Digits LED display</li>
    <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li>
    <li> </li>
    <p> </p></ul>
    <div> </div></div>"""

    # iterparse reads bytes and yields each element once it is complete
    document = etree.iterparse(BytesIO(html_tags), html=True)

    for event, element in document:
        # remove leaf elements whose text is missing or only whitespace
        if not (element.text and element.text.strip()) and len(element) == 0:
            element.getparent().remove(element)

    print(etree.tostring(document.root, encoding="unicode"))