In Python 2.7, I can successfully convert the Unicode string u"abc\udc34xyz" to UTF-8 (the result is "abc\xed\xb0\xb4xyz"). But when I pass that UTF-8 string to, for example, pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". The GTK/Pango functions apparently detect the "unpaired surrogate" in the string and (rightly?) reject it.
Python 3 does not even allow encoding this Unicode string to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF-8 string in which the lone surrogate is replaced by another character. That is good enough for me, but I need a solution for Python 2.
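To make the Python 3 behaviour concrete, here is a small sketch (Python 3 only) that reproduces both cases: the strict encode that fails and the "replace" encode that succeeds. The variable names are just for illustration.

```python
# Python 3 only: strict encoding rejects a lone surrogate,
# while the "replace" error handler substitutes a placeholder.
s = "abc\udc34xyz"

try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print(exc)  # the "surrogates not allowed" error quoted above

# With "replace", the encoder emits "?" for the character it cannot encode.
replaced = s.encode("utf-8", "replace")
print(replaced)
```

Note that the replacement character here is "?", not U+FFFD.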
So the question is: in Python 2.7, how can I convert this Unicode string to UTF-8 while replacing the lone surrogate with a replacement character such as U+FFFD? Preferably using only standard Python functions and GTK/GLib/G... functions.
By the way, iconv can convert the string to UTF-8, but it simply drops the bad character instead of replacing it with U+FFFD.
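For illustration, the kind of thing I am after might look like the following minimal sketch. It is only one possible approach using the standard re module, not a confirmed solution: the helper name replace_lone_surrogates and the regular expression are my own assumptions. The pattern leaves a high surrogate immediately followed by a low surrogate untouched (which matters on narrow Python 2 builds, where characters outside the BMP are stored as such pairs) and replaces every other surrogate with U+FFFD.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical helper, not part of Python or GLib.
# Matches a high surrogate NOT followed by a low surrogate,
# or a low surrogate NOT preceded by a high surrogate.
_LONE_SURROGATE = re.compile(
    u"[\ud800-\udbff](?![\udc00-\udfff])"
    u"|(?<![\ud800-\udbff])[\udc00-\udfff]"
)

def replace_lone_surrogates(text, replacement=u"\ufffd"):
    """Return text with every unpaired surrogate replaced (default: U+FFFD)."""
    return _LONE_SURROGATE.sub(replacement, text)

cleaned = replace_lone_surrogates(u"abc\udc34xyz")
utf8_bytes = cleaned.encode("utf-8")  # U+FFFD encodes as EF BF BD
```

The resulting byte string should then be acceptable to functions such as pango_parse_markup(), since it no longer contains an encoded surrogate; I have not verified this against every GLib function.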