Eurgh. You have surrogates (UTF-16 code units in the D800-DFFF range) that some fools incorrectly encoded separately, instead of using a pair of codes for one character. It would be ideal to replace this mess with what it should look like:
<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
Or, just as true, in literal characters (if you have a font that can display the Gothic alphabet):
<tag>π°π½π³π°πΏππ°</tag>
As a general rule, it would be better to perform replacement operations such as this on disassembled text nodes to avoid the erratic use of non-symbolic links in other places, such as comments or UIs. However, of course, this is not possible in this case, since it is not XML at all. You could try to fix it with a rough regular expression, although it would be better to find out where the invalid entry comes from and kick the person in charge until he corrects it.
>>> def lenient_deccharref(m): ... return unichr(int(m.group(1))) ... >>> tag= '<tag>��������������</tag>' >>> re.sub('&#(\d+);', lenient_deccharref, tag).encode('utf-8') '<tag>\xf0\x90\x8c\xb0\xf0\x90\x8c\xbd\xf0\x90\x8c\xb3\xf0\x90\x8c\xb0\xf0\x90\x8c\xbf\xf0\x90\x8d\x82\xf0\x90\x8c\xb0</tag>'
This is the correct encoding of UTF-8 π°π½π³π°πΏππ° . The utf-8 codec allows you to encode a sequence of surrogates to fix UTF-8 even on a wide Unicode platform where surrogates should not appear on the line in the first place.
>>> _.decode('utf-8') u'<tag>\U00010330\U0001033d\U00010333\U00010330\U0001033f\U00010342\U00010330</tag>'
source share