Unicode in Python strings and the interactive Python interpreter

I am trying to understand how Python 2.5 deals with unicode strings. Although I now think I have a good idea of how I should handle them in code, I don't quite understand what happens behind the scenes, especially when you type strings at the interpreter's prompt.

So, Python before 3.0 has two string types: str (byte strings) and unicode, both derived from basestring. The default type for string literals is str.

str objects have no idea of their actual encoding; they are just bytes. Either you encoded a unicode string yourself, and therefore know what encoding the bytes are in, or you read a stream of bytes whose encoding you also know in advance (though not necessarily). You can guess the encoding of a byte string whose encoding is unknown to you, but there is simply no reliable way to figure it out. The best approach is to decode early, use unicode throughout the code, and encode late.
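As a minimal sketch of "decode early, encode late": the snippet below is written in Python 3 syntax, where `bytes` plays the role of Python 2's `str` and `str` the role of `unicode`. The helper name `shout` and the sample bytes are made up for illustration.

```python
# Hypothetical input: bytes that we know were encoded as UTF-8.
raw = b"La ca\xc3\xb1a de Espa\xc3\xb1a"


def shout(text):
    """Hypothetical helper: works on decoded text only, never on raw bytes."""
    return text.upper()


# Decode early: turn bytes into text at the boundary of the program.
text = raw.decode("utf-8")

# Use text everywhere inside the program.
result = shout(text)

# Encode late: turn text back into bytes only when writing it out.
out = result.encode("utf-8")
```

The point of the pattern is that the encoding is named exactly twice, once on the way in and once on the way out, so nothing in between has to guess.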

That's fine. But are strings typed into the interpreter really encoded for you behind your back? Assuming my understanding of strings in Python is correct, what method or setting does Python use to make this decision?

The source of my confusion is the different results I get when I try the same thing in my system's Python interpreter and in the Python console built into my editor.

    # Editor (Sublime Text)
    >>> s = "La caña de España"
    >>> s
    'La ca\xc3\xb1a de Espa\xc3\xb1a'
    >>> s.decode("utf-8")
    u'La ca\xf1a de Espa\xf1a'
    >>> sys.getdefaultencoding()
    'ascii'

    # Windows python interpreter
    >>> s= "La caña de España"
    >>> s
    'La ca\xa4a de Espa\xa4a'
    >>> s.decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 5: unexpected code byte
    >>> sys.getdefaultencoding()
    'ascii'
3 answers

Let me expand on Ignacio's answer: in both cases there is an extra layer between Python and you. In one case it is Sublime Text, and in the other it is cmd.exe. The difference in behavior that you see is caused not by Python but by the different encodings used by Sublime Text (utf-8, it seems) and cmd.exe (cp437).

So, when you type ñ, Sublime Text sends '\xc3\xb1' to Python, while cmd.exe sends '\xa4'. [I'm simplifying here, omitting details that are not relevant to the question.]
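To make the two byte sequences concrete, here is a small sketch in Python 3 syntax (where `str` is what Python 2 calls `unicode`, and `bytes` is what it calls `str`). The variable names are just for illustration.

```python
text = "La ca\xf1a de Espa\xf1a"  # \xf1 is U+00F1, the letter n with tilde

utf8_bytes = text.encode("utf-8")    # what a UTF-8 front end would send
cp437_bytes = text.encode("cp437")   # what a cp437 console would send

# Decoding with the matching codec recovers the same text in both cases.
same = utf8_bytes.decode("utf-8") == cp437_bytes.decode("cp437") == text

# Decoding the cp437 bytes as UTF-8 fails, like the traceback in the question.
try:
    cp437_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```

The same character ends up as two bytes in one encoding and one byte in the other, and the single cp437 byte 0xa4 is not valid UTF-8 on its own.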

However, Python is aware of this. From cmd.exe you will probably get something like:

    >>> import sys
    >>> sys.stdin.encoding
    'cp437'

whereas in Sublime Text you get something like

    >>> import sys
    >>> sys.stdin.encoding
    'utf-8'
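If you want to decode console input yourself, something along these lines should work. This is only a sketch in Python 3 syntax: `decode_console_bytes` is a made-up helper, and the fallback to ASCII when no stdin encoding is reported is an arbitrary choice of mine, not something Python prescribes.

```python
import sys


def decode_console_bytes(raw, encoding=None):
    """Decode bytes that came from a console into text.

    If no encoding is given, use whatever Python reports for stdin,
    and fall back to ASCII if it reports nothing (assumption of this sketch).
    """
    if encoding is None:
        encoding = getattr(sys.stdin, "encoding", None) or "ascii"
    return raw.decode(encoding)


# The same word as sent by the two front ends from the question.
from_cmd = decode_console_bytes(b"Espa\xa4a", "cp437")
from_editor = decode_console_bytes(b"Espa\xc3\xb1a", "utf-8")
```

Once each byte string is decoded with the encoding of the console it came from, both front ends give the same text.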

The interpreter uses your command prompt's native encoding for text input. In your case, that is CP437:

    >>> print '\xa4'.decode('cp437')
    ñ

You are confused because your editor and the interpreter use different encodings. The Python interpreter uses your system's default encoding (in this case cp437), while your editor uses utf-8.

Note that the difference disappears if you specify a unicode string, for example:

    # Windows python interpreter
    >>> s = "La caña de España"
    >>> s
    'La ca\xa4a de Espa\xa4a'
    >>> s = u"La caña de España"
    >>> s
    u'La ca\xf1a de Espa\xf1a'

The moral of the story? Encodings are complicated. Make sure you know which encoding your source files use, or play it safe by always using the escaped form of special characters.
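For instance, a minimal sketch of the "escaped form" approach. The `u` prefix is accepted by Python 2 and by Python 3.3 and later; the variable names are made up.

```python
import unicodedata

# ASCII-only source: the special character is written as an escape, so the
# result does not depend on how the editor or the console encodes it.
escaped = u"La ca\xf1a de Espa\xf1a"
named = u"La ca\N{LATIN SMALL LETTER N WITH TILDE}a de Espa\N{LATIN SMALL LETTER N WITH TILDE}a"

# Both spellings give the same text; index 5 is the escaped character.
name = unicodedata.name(escaped[5])
```

The `\N{...}` form is longer but says which character is meant, which helps when the file is later opened in an editor with a different encoding.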


Source: https://habr.com/ru/post/1303735/

