Unescape _xHHHH_ XML escape sequences using Python

Question

Unescape _xHHHH_ XML escape sequences using Python

I use Python 2.x [non-negotiable] to read XML documents [created by others] that allow the contents of many elements to contain characters that are not valid XML characters, escaping them using the _xHHHH_ , for example. ASCII BEL aka U + 0007 is represented by the 7-character sequence u"_x0007_" . Neither functionality that allows the representation of any old character in the document, nor the way of escaping, can be negotiated. I parse documents using cElementTree or lxml [half-turn].

Here is my best attempt to maximize parser output:

 import re def unescape(s, subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub, repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)), ): if "_" in s: return subber(repl, s) return s

The above shifts by observing the very low frequency “_” in typical text and is better than doubling the speed, avoiding regular expression if necessary.

Question: Are there any better ideas?

+4

python xml escaping

John machin Jun 23 '09 at 3:39

source share

1 answer

Alex martelli · Accepted Answer · 2009-06-23T04:03:04+0000

You can also check for '_x' , not just _ , which doesn't really matter, but no doubt the two-character sequence is even less common than a single underscore. Besides such details, you seem to be making the best of a bad situation!

Unescape _xHHHH_ XML escape sequences using Python

More articles: