Replace numeric character references in an XML document with Python

I am struggling with the following problem: I have an XML string that contains the following tag, and I want to convert it, using cElementTree, into a valid XML document:

<tag>#55296;#57136;#55296;#57149;#55296;#57139;#55296;#57136;#55296;#57151;#55296;#57154;#55296;#57136;</tag>

but each # sign is preceded by an ampersand (&), so the output looks like this: 𐌰𐌽𐌳𐌰𐌿𐍂𐌰

This is a Unicode string, and the encoding is UTF-8. I want to discard these numeric character references because they are not legal in a valid XML document (see "Parser Error using Perl XML::DOM module, 'reference to invalid character number'").

I tried several regexes to match these numeric character references. For example, I tried the following (Python) regex:

 RE_NUMERIC_CHARACTER = re.compile('&#[\d{1,5}]+;') 

This works in an interactive Python session, but as soon as I use the same regex in my code it doesn't match, apparently because the numeric character references have already been interpreted (and are displayed as boxes or question marks).

I also tried the unescape function from http://effbot.org/zone/re-sub.htm , but that doesn't work either.

So: how can I match these numeric character references with a regular expression in Python and create a valid XML document?

1 answer

Eurgh. You have surrogates (UTF-16 code units in the D800-DFFF range) which some fool has incorrectly encoded individually, instead of encoding each pair of code units as a single character. Ideally you would replace this mess with what it should look like:

 <tag>&#66352;&#66365;&#66355;&#66352;&#66367;&#66370;&#66352;</tag> 

Or, equally valid, as literal characters (if you have a font that can display the Gothic alphabet):

 <tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag> 
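The rewrite into single references can be sketched in Python 3 roughly as follows. The names `combine_surrogates` and `merge_surrogate_refs` are made up for this illustration and are not part of any library. The sketch assumes the references are always decimal and that the two halves of a pair sit directly next to each other.

```python
import re

def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# A run of one or more adjacent decimal character references.
REF_RUN_RE = re.compile(r'(?:&#\d+;)+')

def _fix_run(match):
    codes = [int(n) for n in re.findall(r'&#(\d+);', match.group(0))]
    out = []
    i = 0
    while i < len(codes):
        is_pair = (i + 1 < len(codes)
                   and 0xD800 <= codes[i] <= 0xDBFF
                   and 0xDC00 <= codes[i + 1] <= 0xDFFF)
        if is_pair:
            out.append(combine_surrogates(codes[i], codes[i + 1]))
            i += 2
        else:
            out.append(codes[i])  # ordinary reference (or lone surrogate): keep
            i += 1
    return ''.join('&#%d;' % c for c in out)

def merge_surrogate_refs(text):
    """Rewrite '&#high;&#low;' surrogate pairs as single references."""
    return REF_RUN_RE.sub(_fix_run, text)

broken = ('<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;'
          '&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>')
print(merge_surrogate_refs(broken))
```

A lone surrogate reference is left in place rather than dropped, so the result is still invalid XML in that case and the problem stays visible.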

As a general rule, it would be best to do replacement operations like this on parsed text nodes, to avoid messing up sequences that are not character references in other places such as comments or PIs (processing instructions). Of course, that is not possible in this case, since this isn't really XML at all. You could try to fix it up with a crude regex, although it would be better to find out where the invalid input is coming from and kick the person responsible until they fix it.

 >>> # Python 2 session: unichr() and byte-string results
 >>> import re
 >>> def lenient_deccharref(m):
 ...     return unichr(int(m.group(1)))
 ...
 >>> tag = '<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>'
 >>> re.sub('&#(\d+);', lenient_deccharref, tag).encode('utf-8')
 '<tag>\xf0\x90\x8c\xb0\xf0\x90\x8c\xbd\xf0\x90\x8c\xb3\xf0\x90\x8c\xb0\xf0\x90\x8c\xbf\xf0\x90\x8d\x82\xf0\x90\x8c\xb0</tag>'

This is the correct UTF-8 encoding of 𐌰𐌽𐌳𐌰𐌿𐍂𐌰. The utf-8 codec lets you encode a sequence of surrogates to correct UTF-8, even on a wide-Unicode platform where surrogates should not have appeared in the string in the first place.

 >>> _.decode('utf-8')
 u'<tag>\U00010330\U0001033d\U00010333\U00010330\U0001033f\U00010342\U00010330</tag>'
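The session above is Python 2. In Python 3, `unichr` is gone and `str.encode('utf-8')` refuses lone surrogates, so the same trick needs a different route. A minimal sketch, assuming decimal references only: decode just the references in the surrogate range, then round-trip through UTF-16 with the `surrogatepass` error handler so that adjacent halves are joined into real characters. The function name `decode_surrogate_refs` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def decode_surrogate_refs(text):
    """Turn decimal references to surrogates into the characters they spell."""
    def repl(m):
        n = int(m.group(1))
        # Only touch surrogate code units; leave references such as
        # &#60; or &#38; alone so the markup is not damaged.
        return chr(n) if 0xD800 <= n <= 0xDFFF else m.group(0)

    with_surrogates = re.sub(r'&#(\d+);', repl, text)
    # 'surrogatepass' lets the lone surrogates through the encoder; decoding
    # the UTF-16 bytes then pairs them up. An unpaired surrogate makes the
    # decode step raise UnicodeDecodeError.
    return with_surrogates.encode('utf-16', 'surrogatepass').decode('utf-16')

broken = ('<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;'
          '&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>')
fixed = decode_surrogate_refs(broken)

# The repaired string is now well-formed and can be parsed.
root = ET.fromstring(fixed)
print(ascii(root.text))
```

The same caveat applies as for the regex above: this operates on raw text, so it would also rewrite matching sequences inside comments or processing instructions.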

Source: https://habr.com/ru/post/946931/

