How to preserve &quot; and &apos; when parsing XML with bs4 in Python

I use bs4 to parse an XML file and write it back out to a new XML file.

Input file:

<tag1>
  <tag2 attr1="a1">&quot; example text &quot;</tag2>
  <tag3>
    <tag4 attr2="a2">&quot; example text &quot;</tag4>
    <tag5>
      <tag6 attr3="a3">&apos; example text &apos;</tag6>
    </tag5>
  </tag3>
</tag1>

Script:

from bs4 import BeautifulSoup

with open("input.xml") as src:
    soup = BeautifulSoup(src, "xml")
with open("output.xml", "wb") as out:  # encode() returns bytes
    out.write(soup.encode(formatter="minimal"))

Output:

<tag1>
  <tag2 attr1="a1"> " example text "  </tag2>
  <tag3>
    <tag4 attr2="a2"> " example text " </tag4>
    <tag5>
      <tag6 attr3="a3"> ' example text ' </tag6>
    </tag5>
  </tag3>
</tag1>

I want to preserve &quot; and &apos;. I tried all the formatter options of encode() (minimal, xml, html, None), but none of them solved the problem.

Then I tried to replace " with &quot; manually.

import re

# Replace every literal double quote in text nodes with the &quot; entity
for tag in soup.find_all(text=re.compile("\"")):
    res = tag.string
    res1 = res.replace("\"", "&quot;")
    tag.string.replace_with(res1)

But it gave the result below:

<tag1>
  <tag2 attr1="a1"> &amp;quot; example text &amp;quot;  </tag2>
  <tag3>
    <tag4 attr2="a2"> &amp;quot; example text &amp;quot; </tag4>
    <tag5>
      <tag6 attr3="a3"> &apos; example text &apos; </tag6>
    </tag5>
  </tag3>
</tag1>

It replaces & with &amp;. I'm confused here. Please help me resolve this.

1 answer

Custom Encoding and Formatting Output

You can define a custom formatter function and pass it to encode().

from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution

def custom_formatter(string):
    """add &quot; and &apos; to entity substitution"""
    return EntitySubstitution.substitute_html(string).replace('"','&quot;').replace("'",'&apos;')

input_file = '''<tag1>
  <tag2 attr1="a1">&quot; example text &quot;</tag2>
  <tag3>
    <tag4 attr2="a2">&quot; example text &quot;</tag4>
    <tag5>
      <tag6 attr3="a3">&apos; example text &apos;</tag6>
    </tag5>
  </tag3>
</tag1>
'''

soup = BeautifulSoup(input_file, "xml")

# encode() returns bytes, so decode before printing (Python 3)
print(soup.encode(formatter=custom_formatter).decode("utf-8"))

<?xml version="1.0" encoding="utf-8"?>
<tag1>
<tag2 attr1="a1">&quot; example text &quot;</tag2>
<tag3>
<tag4 attr2="a2">&quot; example text &quot;</tag4>
<tag5>
<tag6 attr3="a3">&apos; example text &apos;</tag6>
</tag5>
</tag3>
</tag1>

The trick is to do the replacement after EntitySubstitution.substitute_html(), so that the & in your entities does not get turned into &amp;.
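
To see why the order of the two steps matters without involving bs4 at all, here is a minimal sketch. It uses the standard library's xml.sax.saxutils.escape as a stand-in for the escaping step. That is an assumption made for illustration and is not the function bs4 itself uses. The two helper names are hypothetical.

```python
from xml.sax.saxutils import escape  # escapes &, < and > by default


def quotes_then_escape(text):
    """Wrong order: the '&' of the inserted entities is escaped again."""
    replaced = text.replace('"', "&quot;").replace("'", "&apos;")
    return escape(replaced)


def escape_then_quotes(text):
    """Right order: escape &, <, > first, then turn quotes into entities."""
    return escape(text).replace('"', "&quot;").replace("'", "&apos;")


sample = '" a & b \''
print(quotes_then_escape(sample))  # &amp;quot; a &amp; b &amp;apos;
print(escape_then_quotes(sample))  # &quot; a &amp; b &apos;
```

The first helper reproduces the &amp;quot; symptom from the question: the entity text is inserted before escaping, so its & is treated as ordinary data. The second mirrors the structure of the custom formatter, where the quote replacement runs last.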


Source: https://habr.com/ru/post/1584039/

