What is the best way to determine the encoding of a unicode string in Python

I was wondering how to determine the encoding of a unicode string.

I know I read about this somewhere. I just don't remember whether it was possible or not, but I want to believe that there is a way.

Say I have a unicode string that was decoded as Latin-1. I would like to dynamically encode it with the same encoding that was used to decode it...

Basically, I would like to turn it into UTF-8 unicode without mangling the characters before I work with it.

i.e.:

    latin1_unicode = 'åäö'.decode('latin-1')
    utf8_unicode = latin1_unicode.encode('latin-1').decode('utf-8')
1 answer

If "unicode" in "determine the encoding of a unicode string" means the Python data type, then you cannot do this, because "encoding" refers to the original byte pattern that represented the string when it came in (for example, read from a file, a database, you name it). By the time it becomes Python's unicode type (the internal representation), the string has either been decoded behind the scenes, or a decoding exception has been raised because the byte sequence is not compatible with the system encoding.
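A minimal sketch of that point, written for Python 3 (an assumption on my part: there `bytes` plays the role of the Python 2 `str` and `str` plays the role of `unicode`). The variable names are made up for illustration. The same bytes decode cleanly under one codec and raise under another:

```python
# Bytes that were produced by encoding text as Latin-1.
raw = "åäö".encode("latin-1")        # b'\xe5\xe4\xf6'

# Decoding with the encoding that produced the bytes gives the text back.
text = raw.decode("latin-1")

# Decoding with an incompatible encoding raises instead of guessing.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(text), utf8_ok)
```

Once `text` exists, nothing in it records that Latin-1 was the codec used to get there.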

Shadyabhi's answer relates to the (common) case in which you read bytes from a file (which may very well be stored in a str rather than a Python unicode string) and you need to guess which encoding they were saved in. Strictly speaking, you cannot have a "Python latin1 unicode string": a Python unicode string has no encoding. (Encoding can be defined as the process that translates a character into a byte pattern, and decoding as the reverse process; a decoded string therefore has no encoding, although it can be encoded in several ways for storage or external representation.)
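To illustrate "can be encoded in several ways", here is a small Python 3 sketch (assuming Python 3, where `str` is the unicode type; the names `text` and `encoded` are just for this example). One decoded character yields different byte patterns depending on the codec, and each pattern decodes back to the same character:

```python
text = "è"  # a single code point, U+00E8

# The same unicode string, encoded three different ways.
encoded = {name: text.encode(name) for name in ("utf-8", "latin-1", "utf-16-le")}

for name, data in encoded.items():
    # Every byte pattern round-trips to the same unicode string.
    assert data.decode(name) == text
    print(name, data)
```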

For example, on my machine:

    In [35]: sys.stdin.encoding
    Out[35]: 'UTF-8'

    In [36]: a='è'.decode('UTF-8')

    In [37]: b='è'.decode('latin-1')

    In [38]: a
    Out[38]: u'\xe8'

    In [39]: b
    Out[39]: u'\xc3\xa8'

    In [41]: sys.stdout.encoding
    Out[41]: 'UTF-8'

    In [42]: print b  # garbage
    Ã¨

    In [43]: print a  # OK
    è

This means that in your example, latin1_unicode will contain garbage if the default encoding is UTF-8, UTF-16, or anything other than latin1, because the bytes of the literal 'åäö' are then not latin1 bytes.
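A rough Python 3 reproduction of that garbage, under the assumption that the bytes really are UTF-8 (variable names are illustrative only). It also shows why the round trip from the question can undo the damage in this particular case: Latin-1 maps every byte value to exactly one code point, so re-encoding with it restores the original bytes.

```python
raw = "è".encode("utf-8")             # b'\xc3\xa8', what a UTF-8 terminal sends

good = raw.decode("utf-8")            # decoded with the right codec
garbled = raw.decode("latin-1")       # decoded with the wrong codec: two characters

# Re-encoding with the wrong codec gives the original bytes back,
# which can then be decoded properly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(good), repr(garbled), repr(repaired))
```

This only helps when you already know which wrong codec was applied; it is not a way to discover the encoding.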

So what you can (and probably want to) do is:

  1. Determine the encoding of the data source, possibly using one of the methods from Shadyabhi's answer.
  2. Decode the data according to (1) and keep it in Python unicode strings.
  3. Encode it using the source encoding (if that suits your needs) or some other encoding of your choice.
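The steps above could be sketched roughly like this in Python 3. Note that `guess_encoding` and `transcode` are hypothetical helpers written for this example, not library functions, and the "detection" is only a naive trial of candidate codecs in order; it is a stand-in for whatever detection method you actually use.

```python
def guess_encoding(data, candidates=("utf-8", "latin-1")):
    """Hypothetical helper: return the first candidate that decodes `data`.

    Latin-1 accepts any byte sequence, so it must come last, and a
    successful decode does not prove the guess is right.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings fit the data")


def transcode(data, target="utf-8", candidates=("utf-8", "latin-1")):
    source = guess_encoding(data, candidates)   # step 1: determine the encoding
    text = data.decode(source)                  # step 2: decode to unicode
    return source, text, text.encode(target)    # step 3: encode as needed


source, text, utf8_bytes = transcode("åäö".encode("latin-1"))
print(source, repr(text), utf8_bytes)
```

Keeping the detected `source` name next to the decoded text is the only way to encode it back "with the same encoding" later, since the unicode string itself does not remember it.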

Source: https://habr.com/ru/post/1393129/

