Working With Wacky Encodings In Python

Question

I have a Python script that pulls in data from many sources (databases, files, etc.). Supposedly all the strings are unicode, but what I actually get is any variation on the following theme (as returned by repr()):

u'D\\xc3\\xa9cor'
u'D\xc3\xa9cor'
'D\\xc3\\xa9cor'
'D\xc3\xa9cor'

Is there a reliable way to take any of the four strings above and return the correct unicode string?

u'D\xe9cor' # --> Décor

The only way I can think of right now involves eval(), replace(), and a deep, burning shame that will never wash off.

+3

python encoding unicode character-encoding

Tyson Jun 07 '10 at 5:42

3 answers

Decode the UTF-8 bytes, unescaping first where the backslashes are literal.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'

+2

Ignacio Vazquez-Abrams Jun 07 '10 at 6:17

Here's the solution I arrived at before I saw KennyTM's own, more concise solution:

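def ensure_unicode(string):
    try:
        # Turn literal backslash escapes ('\\xc3') into the bytes they name.
        # On a unicode object holding non-ASCII characters, the implicit
        # ASCII encode raises UnicodeEncodeError.
        string = string.decode('string-escape').decode('string-escape')
    except UnicodeEncodeError:
        # Code points below 256 map back to the single bytes they stand for.
        string = string.encode('raw_unicode_escape')

    return unicode(string, 'utf-8')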

+1

Tyson Jun 07 '10 at 6:35

kennytm · Accepted Answer · 2010-06-07T05:58:51+0000

That's UTF-8 data. Use .decode to convert it to unicode.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'

The same works for the escaped form 'D\\xc3\\xa9cor' if you undo the escaping with 'string-escape' first.

>>> 'D\xc3\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> u'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'

To cover all four cases, check whether the input is unicode and, if it is, convert it to str first.

>>> def conv(s):
...   if isinstance(s, unicode):
...     s = s.encode('iso-8859-1')
...   return s.decode('string-escape').decode('utf-8')
... 
>>> map(conv, [u'D\\xc3\\xa9cor', u'D\xc3\xa9cor', 'D\\xc3\\xa9cor', 'D\xc3\xa9cor'])
[u'D\xe9cor', u'D\xe9cor', u'D\xe9cor', u'D\xe9cor']
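
In Python 3, str has no .decode method and the 'string-escape' codec is gone, so conv does not run there unchanged. A rough equivalent, as a sketch only: the name conv_py3 is hypothetical, and it assumes every input character fits in one byte and that the only escapes present are of the \xNN kind.

```python
def conv_py3(s):
    """Python 3 sketch of conv(); accepts str or bytes."""
    if isinstance(s, str):
        # Assumption: each character stands for exactly one byte.
        s = s.encode('latin-1')
    # 'unicode_escape' resolves literal \xNN sequences and reads every
    # other byte as Latin-1, so the result has one character per byte.
    text = s.decode('unicode_escape')
    # Map those characters back to bytes, then decode them as UTF-8.
    return text.encode('latin-1').decode('utf-8')


inputs = ['D\\xc3\\xa9cor', 'D\xc3\xa9cor', b'D\\xc3\\xa9cor', b'D\xc3\xa9cor']
results = [conv_py3(s) for s in inputs]
```

Because 'unicode_escape' also understands \uXXXX and \N{...} escapes, input containing those can produce characters above U+00FF and make the Latin-1 encode fail, so this is not a general-purpose repair.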
