Python: parsing an XML document while preserving entities

I'd like to ask which well-known Python 2.x libraries can parse an XML document with an embedded DTD without automatically expanding entities. (For those interested, the file in question is JMdict.)

lxml seems to have some ability not to resolve entities, but the last time I tried, the entities just turned into spaces. I also searched just now and found pxdom as another alternative I could try, but since it is pure Python it is likely to be a lot slower than I would like.

Is there anything else out there?

+3
4 answers

, ; , , , XML.

, , , . re.finditer . . , .
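The answer above mentions re.finditer. Assuming the idea is to scan the raw text for entity references before (or instead of) handing it to a parser, a minimal sketch could look like the following. The names ENTITY_REF, PREDEFINED and find_entity_refs are hypothetical, and the pattern is only an ASCII approximation of an XML name. It will also match references inside comments, CDATA sections and the DTD itself, so treat it as a rough locator rather than a parser.

```python
import re

# ASCII-only approximation of an XML entity reference such as &abc;
# (numeric character references like &#123; are deliberately not matched).
ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9._-]*);")

# The five entities predefined by XML itself; these are not custom ones.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def find_entity_refs(xml_text):
    """Yield (name, start, end) for each custom entity reference found."""
    for match in ENTITY_REF.finditer(xml_text):
        name = match.group(1)
        if name not in PREDEFINED:
            yield name, match.start(), match.end()


sample = "<root>&abc; &amp; &n-pl;</root>"
refs = list(find_entity_refs(sample))
for name, start, end in refs:
    print(name, start, end)
```

The offsets make it possible to swap each reference for a placeholder before parsing and to restore it afterwards, if that is the workflow you want.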

+1

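lxml can do this: create the parser with resolve_entities=False and entity references are left unresolved. For example: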

 
from lxml import etree

XML = """
<!DOCTYPE root [
<!ENTITY abc "123">
]>
<root>
&abc;
</root>"""

parser = etree.XMLParser(resolve_entities=False)

root = etree.fromstring(XML, parser)
print "Entity not resolved:"
print etree.tostring(root)
print

print "Entity resolved:"
root = etree.fromstring(XML)
print etree.tostring(root)

Output:

Entity not resolved:
<root>
&abc;
</root>

Entity resolved:
<root>
123
</root>
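As a follow-up sketch (not part of the original answer): when resolve_entities=False is used, lxml keeps each unresolved reference as an entity node in the tree, so the references can be listed as well as serialized. The variable names entity_names and serialized are illustrative, and the snippet uses print() as a function rather than the Python 2 print statement used above.

```python
from lxml import etree

XML = b"""<!DOCTYPE root [
<!ENTITY abc "123">
]>
<root>before &abc; after</root>"""

# Keep entity references as nodes instead of substituting their values.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(XML, parser)

# etree.Entity can be passed to iter() to select only entity nodes.
entity_names = [node.name for node in root.iter(etree.Entity)]
print(entity_names)

# Serializing the tree writes the reference back out unchanged.
serialized = etree.tostring(root).decode("ascii")
print(serialized)
```

The text following a reference is stored in the entity node's tail, so code that walks root.text alone will miss it.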
+3

-, BeautifulStoneSoup BeautifulSoup .

, , ( ).

+1

, . TEI XML, ,

&some_exotic_char;

DTD. , , XML.

BeautifulSoup , XML :

with open('outfile.xml','w') as outfile:
    outfile.write(soup.prettify())

" ", utf8-, , . , XML, prettify ( ).

In the end, what worked for me was Perl's XML::LibXML. With

$parser->expand_entities(0);

entities will not be expanded, and writing the XML back to the file keeps the original layout intact.

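use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
$parser->validation(0);        # do not validate against the DTD
$parser->load_ext_dtd(1);      # still load the external DTD subset
$parser->expand_entities(0);   # keep entity references unexpanded
my $doc = $parser->parse_file('infile.xml');

# ... do whatever you need to do

open my $out, '>', 'outfile.xml' or die "Cannot open outfile.xml: $!";
binmode $out;
print $out $doc->toString();
close $out;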

Perl's XML::LibXML saved my day.

0

Source: https://habr.com/ru/post/1760075/

