I use Python 2.x [not negotiable] to read XML documents [created by others] in which the content of many elements may include characters that are not valid XML characters. These are escaped using the _xHHHH_ convention; for example, ASCII BEL, aka U+0007, is represented by the 7-character sequence u"_x0007_". Neither the feature that allows any old character to be represented in the document nor the manner of escaping is negotiable. I parse the documents using cElementTree or lxml [semi-negotiable].
Here is my best attempt at unescaping the parser output as efficiently as possible:
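To illustrate why the escaping exists at all, here is a minimal sketch assuming the standard-library `xml.etree.ElementTree` module on Python 2.7 or later (older 2.x versions raise a different exception type). The helper name `root_text_or_none` is made up for this example.

```python
import xml.etree.ElementTree as ET

def root_text_or_none(data):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

raw = b"<a>x\x07y</a>"         # raw BEL: not a legal XML 1.0 character
escaped = b"<a>x_x0007_y</a>"  # same content written with the _xHHHH_ convention

print(root_text_or_none(raw))      # parser rejects the document
print(root_text_or_none(escaped))  # parses, but the text still contains the escape
```

The second document parses, but the escape sequence comes back verbatim, so it still has to be decoded after parsing.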
    import re

    def unescape(s,
        subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub,
        repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)),
        ):
        if "_" in s:
            return subber(repl, s)
        return s
The above is biased by my observation that "_" occurs very rarely in typical text, and that skipping the regex machinery whenever possible more than doubles the speed.
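A rough way to check that speed-up is sketched below. The names `unescape_always`, `unescape_guarded` and `compare` are illustrative only, and the `unichr` fallback is there purely so the sketch also runs on Python 3. Timings depend on the machine and on how often "_" appears in the sample data.

```python
import re
import timeit

try:
    unichr            # Python 2 builtin
except NameError:
    unichr = chr      # Python 3 fallback so the sketch runs on either version

_ESC = re.compile(r'_x[0-9A-Fa-f]{4}_')

def _repl(mobj):
    # Characters 2..5 of the match are the four hex digits.
    return unichr(int(mobj.group(0)[2:6], 16))

def unescape_always(s):
    """Always run the regex substitution."""
    return _ESC.sub(_repl, s)

def unescape_guarded(s):
    """Skip the regex entirely when the string contains no underscore."""
    if u"_" in s:
        return _ESC.sub(_repl, s)
    return s

def compare(samples, number=1000):
    """Return (seconds_always, seconds_guarded) for the given sample strings."""
    t_always = timeit.timeit(
        lambda: [unescape_always(s) for s in samples], number=number)
    t_guarded = timeit.timeit(
        lambda: [unescape_guarded(s) for s in samples], number=number)
    return t_always, t_guarded

if __name__ == "__main__":
    # Mostly underscore-free text, mirroring the assumption stated above.
    samples = [u"plain text without escapes"] * 99 + [u"bell: _x0007_"]
    print(compare(samples))
```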
Question: Are there any better ideas?