How can I delete <p> </p> using Python's sub?

Question

I have an HTML file and I want to replace empty paragraphs with a space.

    mystring = "This <p></p><p>is a test</p><p></p><p></p>"
    result = mystring.sub("<p></p>" , "&nbsp;")

This does not work.

+3

python string html

topless Mar 23 2018-11-11T00:

6 answers

I believe it is always nice to give an example of how to do this with a real parser, as well as repeating the sound advice that Eli Bendersky gives in his answer.

Here is an example of how to remove empty <p> elements using lxml. lxml's HTMLParser copes very well with HTML.

    from io import StringIO

    from lxml import etree

    html_input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(html_input), parser)

    for p in tree.xpath("//p"):
        # Keep paragraphs that contain child elements
        if len(p):
            continue
        # Remove paragraphs whose text is missing or whitespace only
        t = p.text
        if not (t and t.strip()):
            p.getparent().remove(p)

    print(etree.tostring(tree.getroot(), pretty_print=True).decode())

... which produces the result:

    <html>
      <body>
        <p>This </p>
        <p>is a test</p>
        <p>
          <b>Bye.</b>
        </p>
      </body>
    </html>

Please note that I misread the question when I answered this: I only remove the empty <p> elements, without replacing them with &nbsp;. With lxml, I'm not sure of an easy way to do that, so I asked another question:

How to replace an element with text in lxml?

+5

Mark Longair Mar 23 2018-11-11T00:

I think a parsing module would be overkill for this particular problem.

Just use str.replace:

    >>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
    >>> mystring.replace("<p></p>","&nbsp;")
    'This &nbsp;<p>is a test</p>&nbsp;&nbsp;'
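For context on why the attempt in the question fails: Python strings have no sub method (sub is a function in the re module), so mystring.sub(...) raises AttributeError. A minimal sketch reusing the question's string; the names error and result are only for illustration:

```python
mystring = "This <p></p><p>is a test</p><p></p><p></p>"

# str has no .sub method, so the question's call raises AttributeError.
error = None
try:
    mystring.sub("<p></p>", "&nbsp;")
except AttributeError as exc:
    error = str(exc)

# str.replace does a plain, literal substring replacement.
result = mystring.replace("<p></p>", "&nbsp;")

print(error)
print(result)
```

Keep in mind that this only matches the exact text <p></p>; a paragraph containing whitespace, or a tag with attributes, is left untouched.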

+2

Xavier Combelle Mar 23 2018-11-23T00:

What if <p> is entered as <P> or < p >, or has an attribute added, or an empty tag is given using the <P/> syntax? Pyparsing's HTML tag support handles all of these variations:

    from pyparsing import makeHTMLTags, replaceWith, withAttribute

    mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'

    p, pEnd = makeHTMLTags("P")
    # An opening tag written with the empty-tag syntax, e.g. <P/>
    emptyP = p.copy().setParseAction(withAttribute(empty=True))
    # A null paragraph is either <P/> or an opening tag followed directly by its closing tag
    null_paragraph = emptyP | p + pEnd
    null_paragraph.setParseAction(replaceWith("&nbsp;"))
    print(null_paragraph.transformString(mystring))

Prints:

 This &nbsp;<p>is a test</p>&nbsp;&nbsp;&nbsp;

+2

PaulMcG Mar 23 '11 at 15:56

Using a regexp?

    import re

    # Raw string so that \s reaches the regex engine unchanged
    result = re.sub(r"<p>\s*</p>", "&nbsp;", mystring, flags=re.MULTILINE)

Compile the regex if you use it often.
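A sketch of what the compiled form might look like. EMPTY_P and blank_empty_paragraphs are hypothetical names chosen for this example. The re.MULTILINE flag is left out here because it only changes how ^ and $ behave, and \s already matches newlines:

```python
import re

# Compile once, reuse for every document (hypothetical module-level name).
EMPTY_P = re.compile(r"<p>\s*</p>")


def blank_empty_paragraphs(html):
    """Replace <p></p> pairs containing only whitespace with &nbsp;."""
    return EMPTY_P.sub("&nbsp;", html)


print(blank_empty_paragraphs("This <p></p><p>is a test</p><p> </p><p>\n</p>"))
```

Like any regex over HTML, this pattern is narrow: it does not match uppercase tags, tags with attributes, or the <P/> form.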

+1

Yannick Loiseau Mar 23 2018-11-23T00:

I wrote this code:

    from io import BytesIO

    from lxml import etree

    html_tags = """<div><ul><li>PID temperature controller</li>
    <li>Smart and reliable</li>
    <li>Auto-diagnosing</li>
    <li>Auto setting</li>
    <li>Intelligent control</li>
    <li>2-Rows 4-Digits LED display</li>
    <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li>
    <li> </li>
    <p> </p></ul>
    <div> </div></div>"""

    # On Python 3, iterparse expects a file-like object that returns bytes
    document = etree.iterparse(BytesIO(html_tags.encode("utf-8")), html=True)

    for a, e in document:
        # Remove elements with no children and no non-whitespace text
        if not (e.text and e.text.strip()) and len(e) == 0:
            e.getparent().remove(e)

    print(etree.tostring(document.root).decode())

0

swietyy Apr 12 2018-12-12T00:

Eli Bendersky · Accepted Answer · 2011-03-23 13:56

Please do not try to parse HTML with regular expressions. Use a proper parsing module for this, such as htmlparser or BeautifulSoup. Put up with a short learning curve now and benefit:

Your parsing code will be more robust, handling corner cases you might not have considered and that would break a regular expression.
For future HTML parsing and processing tasks, you will be able to get things done faster, so the time invested will pay off in the end.

You will not regret it! Profit guaranteed!
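To make the parser-based suggestion concrete, here is a minimal sketch using the standard library's html.parser.HTMLParser. It is only one possible approach, not necessarily what this answer had in mind; the names EmptyParagraphReplacer and replace_empty_paragraphs are hypothetical. It holds back each opening p tag until it knows whether anything other than whitespace follows before the closing tag:

```python
from html.parser import HTMLParser


class EmptyParagraphReplacer(HTMLParser):
    """Re-emit HTML, swapping empty <p> elements for a replacement string."""

    def __init__(self, replacement="&nbsp;"):
        # convert_charrefs=False so entities such as &amp; pass through as written
        super().__init__(convert_charrefs=False)
        self.replacement = replacement
        self.out = []        # finished output pieces
        self.pending = None  # raw text of a <p> that may still turn out empty

    def _flush(self):
        # The held-back <p> was not empty after all: emit it unchanged.
        if self.pending is not None:
            self.out.append(self.pending)
            self.pending = None

    def handle_starttag(self, tag, attrs):
        self._flush()
        if tag == "p":
            self.pending = self.get_starttag_text()
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <P/>
        self._flush()
        self.out.append(self.replacement if tag == "p" else self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.pending is not None:
            self.pending = None
            self.out.append(self.replacement)
        else:
            self._flush()
            self.out.append(f"</{tag}>")  # tag name arrives lowercased

    def handle_data(self, data):
        if self.pending is not None and not data.strip():
            self.pending += data  # whitespace only: paragraph may still be empty
        else:
            self._flush()
            self.out.append(data)

    def handle_entityref(self, name):
        self._flush()
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self._flush()
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self._flush()
        self.out.append(f"<!--{data}-->")


def replace_empty_paragraphs(html):
    parser = EmptyParagraphReplacer()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "".join(parser.out)


sample = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
print(replace_empty_paragraphs(sample))
```

Because the parser normalizes tag names, uppercase tags, attributes, and the self-closing form are handled without extra patterns. Closing tags are re-emitted in lowercase, which is an artifact of this sketch.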
