If "unicode" in your question means the Python data type, then you cannot do this, because "encoding" refers to the original byte patterns that represented the string when it was input (say, read from a file, a database, you name it). By the time it becomes Python's unicode type (an internal representation), the string has either been decoded behind the scenes or has thrown a decoding exception because a byte sequence did not match the system encoding.
Shadyabhi's answer relates to the common case in which you read bytes from a file (which you could very well store in a plain str, not a Python unicode string) and need to guess what encoding they were saved in. Strictly speaking, you cannot have a "latin1 Python unicode string": a Python unicode string has no encoding. Encoding can be defined as the process that translates a character into a byte pattern, and decoding as the reverse process; a decoded string has no encoding, though it can be encoded in several ways for storage or external representation purposes.
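To make the point concrete, here is a minimal sketch (my own illustration, written so that it should run unchanged on Python 2.6+ and Python 3.3+) showing one piece of text being encoded into two different byte patterns. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
text = u'\xe8'  # the single character è as text; no encoding is attached to it

utf8_bytes = text.encode('utf-8')      # two bytes: 0xC3 0xA8
latin1_bytes = text.encode('latin-1')  # one byte: 0xE8

# The byte patterns differ, but decoding each with its matching codec
# gives back exactly the same text.
assert utf8_bytes != latin1_bytes
assert utf8_bytes.decode('utf-8') == text
assert latin1_bytes.decode('latin-1') == text
```

The encoding is a property of the bytes, not of the text they decode to.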
For example, on my machine:
    In [35]: sys.stdin.encoding
    Out[35]: 'UTF-8'
    In [36]: a='è'.decode('UTF-8')
    In [37]: b='è'.decode('latin-1')
    In [38]: a
    Out[38]: u'\xe8'
    In [39]: b
    Out[39]: u'\xc3\xa8'
    In [41]: sys.stdout.encoding
    Out[41]: 'UTF-8'
    In [42]: print b
This means that in your example, latin1_unicode will contain garbage if the default encoding is UTF-8 or UTF-16 or something other than latin1.
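The same effect can be reproduced without depending on the terminal's encoding by starting from explicit bytes. This is only a sketch under that assumption; `raw`, `right`, `wrong` and `repaired` are names I made up for the illustration.

```python
raw = b'\xc3\xa8'  # the UTF-8 byte pattern for è

right = raw.decode('utf-8')    # one character: u'\xe8'
wrong = raw.decode('latin-1')  # two unrelated characters: u'\xc3\xa8'

# latin-1 maps every byte to a character, so the wrong decode never raises;
# it silently produces garbage instead.
assert right != wrong

# If you know which wrong codec was applied, the damage can be undone by
# re-encoding with it and decoding with the correct one.
repaired = wrong.encode('latin-1').decode('utf-8')
assert repaired == right
```

Note that the repair only works because latin-1 round-trips every byte; with most other codecs a wrong decode either raises or loses information.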
So what you can (and probably want to) do is:
- Determine the encoding of the data source, possibly using one of the methods from Shadyabhi's answer
- Decode the data using the encoding found in the first step, and keep it in Python unicode strings
- Encode it using the source encoding (if that suits your needs) or some other encoding of your choice.
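The three steps above can be sketched roughly as follows. `guess_encoding` and `transcode` are hypothetical helpers written for this illustration; trying a fixed list of candidate codecs is a crude stand-in for a real detection method, not what Shadyabhi's answer proposes.

```python
def guess_encoding(raw, candidates=('utf-8', 'latin-1')):
    """Hypothetical helper: return the first candidate codec that decodes raw.

    latin-1 accepts any byte sequence, so it must come last in the list.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError('none of the candidate encodings fit')


def transcode(raw, target='utf-8'):
    source = guess_encoding(raw)   # step 1: determine the source encoding
    text = raw.decode(source)      # step 2: decode into a unicode string
    return source, text, text.encode(target)  # step 3: encode for output


# b'caf\xe9' is "café" saved as latin-1; it is not valid UTF-8.
source, text, out = transcode(b'caf\xe9')
```

A successful decode only proves the bytes are valid in that codec, not that the codec is the one the author used, so treat the guess as a heuristic.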