What is the best way to determine the encoding of a unicode string in Python

Question


I was wondering how to determine Unicode encoding.

I know that I read about it somewhere, I just don’t remember whether it was possible or not, but I want to believe that there is a way.

Say I have a unicode string that was decoded from Latin-1. I would like to dynamically encode it with the same encoding that was used to decode it ...

Basically, I would like to turn it into UTF-8 unicode without corrupting the characters before working with it.

i.e:

    latin1_unicode = 'åäö'.decode('latin-1')
    utf8_unicode = latin1_unicode.encode('latin-1').decode('utf-8')


python unicode codec

Jaylev Jan 26 '12 at 10:18


1 answer

Alien life form · Answer 1 · 2012-01-26T11:08:16+0000

If the "unicode" in "determine Unicode encoding" refers to the Python data type, then you cannot do this, because "encoding" refers to the original byte patterns that represented the string when it was input (for example, read from a file, a database, you name it). By the time it becomes Python's unicode type (the internal representation), the string has either been decoded behind the scenes or a decoding exception has been raised because the byte sequence was not compatible with the system encoding.

Shadyabhi's answer covers the common case in which you read bytes from a file (which may well end up in a str, not in a Python unicode string) and need to guess what encoding they were saved in. Strictly speaking, you cannot have a "Python latin1 unicode string": a Python unicode string has no encoding. Encoding can be defined as the process that translates a character into a byte pattern, and decoding as the reverse process. A decoded string has no encoding, though it can be encoded in several ways for storage or external representation.
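To illustrate that a decoded string carries no encoding of its own, here is a minimal sketch. It assumes Python 3, where `bytes` and `str` play the roles that `str` and `unicode` play in the Python 2 code of this thread; the variable names are only illustrative.

```python
# One decoded character: U+00E8 (LATIN SMALL LETTER E WITH GRAVE).
text = "\u00e8"

# The same text can be encoded into different byte patterns.
as_utf8 = text.encode("utf-8")      # two bytes
as_latin1 = text.encode("latin-1")  # one byte

# Decoding each byte pattern with the matching codec gives back
# identical text; nothing in the result records where it came from.
from_utf8 = as_utf8.decode("utf-8")
from_latin1 = as_latin1.decode("latin-1")

print(as_utf8, as_latin1, from_utf8 == from_latin1)
```

The two byte strings differ, but the decoded results compare equal, so the decoded value alone cannot reveal which encoding the bytes originally used.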

For example, on my machine:

    In [35]: sys.stdin.encoding
    Out[35]: 'UTF-8'
    In [36]: a = 'è'.decode('UTF-8')
    In [37]: b = 'è'.decode('latin-1')
    In [38]: a
    Out[38]: u'\xe8'
    In [39]: b
    Out[39]: u'\xc3\xa8'
    In [41]: sys.stdout.encoding
    Out[41]: 'UTF-8'
    In [42]: print b  # garbage
    Ã¨
    In [43]: print a  # OK
    è

This means that in your example, latin1_unicode will contain garbage if the bytes you typed were actually produced in UTF-8, UTF-16, or anything other than latin1 (for example, by a terminal whose encoding is UTF-8).
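The same effect can be reproduced with explicit bytes. This is a sketch that assumes Python 3 (so the bytes are written out as a `bytes` literal instead of coming from the terminal); the names `raw`, `garbled` and `repaired` are made up for the example.

```python
# The UTF-8 byte pattern for U+00E8, as a UTF-8 terminal would send it.
raw = b"\xc3\xa8"

# Decoding with the wrong codec does not fail: latin-1 maps every byte
# to a character, so two bytes silently become two wrong characters.
garbled = raw.decode("latin-1")

# Because latin-1 is a one-to-one byte mapping, re-encoding with it
# recovers the original bytes, which can then be decoded correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled), repr(repaired))
```

This round trip only works because the wrong codec happened to be latin-1, which never loses information. With other codecs the wrong decode may raise an error or drop bytes.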

So what you can (and probably want to) do is:

1. Determine the encoding of the data source, possibly using one of Shadyabhi's methods.
2. Decode the data according to (1) and keep it in Python unicode strings.
3. Encode it using the source encoding (if that suits your needs) or some other encoding of your choice.
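The three steps can be sketched roughly as follows, assuming Python 3. The helpers `guess_encoding` and `transcode` are hypothetical names invented for this example, and trial decoding against a fixed candidate list is only a stand-in for whatever detection method you actually use; it is not the method from Shadyabhi's answer.

```python
def guess_encoding(raw, candidates=("utf-8", "latin-1")):
    """Return the first candidate codec that decodes raw without error.

    Hypothetical helper: latin-1 accepts any byte sequence, so it has
    to come last or it would shadow every other candidate.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings fit the data")


def transcode(raw, target="utf-8", candidates=("utf-8", "latin-1")):
    """Guess the source encoding, decode, then re-encode to target."""
    source = guess_encoding(raw, candidates)  # step 1: determine encoding
    text = raw.decode(source)                 # step 2: decode to text
    return source, text, text.encode(target)  # step 3: encode as needed


# The three characters from the question, stored as latin-1 bytes.
latin1_bytes = "\u00e5\u00e4\u00f6".encode("latin-1")
source, text, utf8_bytes = transcode(latin1_bytes)
print(source, repr(text), utf8_bytes)
```

A guess like this can be wrong: bytes that are valid in several encodings will be attributed to whichever candidate is tried first, so knowing the source encoding up front is always preferable.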
