Detect / remove unpaired surrogate character in Python 2 + GTK

Question

In Python 2.7, I can successfully convert the Unicode string "abc\udc34xyz" to UTF-8 (the result is "abc\xed\xb0\xb4xyz"). But when I pass the UTF-8 string to, for example, pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". The GTK/Pango functions seem to detect the "unpaired surrogate" in the string and (rightly?) reject it.

Python 3 does not even allow converting the Unicode string to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF-8 string in which the lone surrogate is replaced by some other character. That is fine for me, but I need a solution for Python 2.
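For reference, the Python 3 behaviour described above can be reproduced with a short sketch. It is Python 3 only, and the variable names are purely illustrative:

```python
# Python 3 only: a lone surrogate cannot be encoded strictly as UTF-8.
text = "abc\udc34xyz"

try:
    text.encode("utf8")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print(exc.reason)  # the codec's explanation for refusing the string

# With the "replace" error handler the encoder substitutes "?" for the
# character it cannot encode, so the result is valid UTF-8.
replaced = text.encode("utf8", "replace")
print(replaced)
```

Note that the substitute here is a plain "?", not U+FFFD.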

So the question is: in Python 2.7, how can I convert this Unicode string to UTF-8 while replacing the lone surrogate with some replacement character like U+FFFD? Preferably using only standard Python functions and GTK/GLib/G... functions.

Btw, iconv can convert the string to UTF-8, but it simply removes the bad character instead of replacing it with U+FFFD.

python unicode utf-8 glib gtk

oliver Sep 7 '13 at 12:18

2 answers

Mark Tolonen · Answer 1 · 2013-09-07T14:04:57+0000

You can make replacements yourself before encoding:

    import re

    lone = re.compile(
        ur'''(?x)            # verbose expression (allows comments)
        (                    # begin group
        [\ud800-\udbff]      # match leading surrogate
        (?![\udc00-\udfff])  # but only if not followed by trailing surrogate
        )                    # end group
        |                    #  OR
        (                    # begin group
        (?<![\ud800-\udbff]) # if not preceded by leading surrogate
        [\udc00-\udfff]      # match trailing surrogate
        )                    # end group
        ''')

    u = u'abc\ud834\ud82a\udfcdxyz'
    print repr(u)
    b = lone.sub(ur'\ufffd',u).encode('utf8')
    print repr(b)
    print repr(b.decode('utf8'))

Output:

    u'abc\ud834\U0001abcdxyz'
    'abc\xef\xbf\xbd\xf0\x9a\xaf\x8dxyz'
    u'abc\ufffd\U0001abcdxyz'
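If it helps, the same pattern can be wrapped in a small helper. This is only a sketch: the helper name and the default replacement argument are made up for illustration, and the pattern is written without the verbose flag and without the ur'' prefix so that the file also parses on Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Same idea as the verbose pattern: a leading surrogate with no trailing
# surrogate after it, or a trailing surrogate with no leading one before it.
LONE_SURROGATE = re.compile(
    u'[\ud800-\udbff](?![\udc00-\udfff])'    # lead not followed by a trail
    u'|(?<![\ud800-\udbff])[\udc00-\udfff]'  # trail not preceded by a lead
)

def to_utf8_replacing_lone_surrogates(text, replacement=u'\ufffd'):
    """Hypothetical helper: swap lone surrogates for `replacement`, then encode."""
    return LONE_SURROGATE.sub(replacement, text).encode('utf-8')

# The string from the question, with its single unpaired trailing surrogate.
encoded = to_utf8_replacing_lone_surrogates(u'abc\udc34xyz')
print(repr(encoded))
```

Only the lone-surrogate input from the question is exercised here. How two adjacent surrogate escapes are treated depends on whether the interpreter stores strings as UTF-16 code units, which appears to be the case for the build that produced the output above.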

Sean Fujiwara · Answer 2 · 2016-08-05T07:39:09+0000

Here is what fixed it for me:

invalid_string.encode('utf16').decode('utf16', 'replace')

My understanding is that surrogate pairs are a UTF-16 concept, and therefore encoding/decoding with UTF-8 does nothing to them.
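A rough Python 3 equivalent of the same round trip, as a sketch only: the helper name is hypothetical. It assumes Python 3, where the UTF-16 encoder rejects surrogates unless the 'surrogatepass' error handler is given.

```python
def scrub_surrogates(text):
    """Hypothetical helper: replace lone surrogates with U+FFFD via UTF-16."""
    # 'surrogatepass' lets the encoder write the surrogate as a raw code unit.
    raw = text.encode("utf-16", "surrogatepass")
    # The decoder then meets an unpaired code unit and, with 'replace',
    # substitutes U+FFFD for it.
    return raw.decode("utf-16", "replace")

clean = scrub_surrogates("abc\udc34xyz")
utf8_bytes = clean.encode("utf-8")  # now encodes without an error
print(ascii(clean))
```

The result can then be encoded to UTF-8 and handed to the GTK/Pango functions from the question.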
