Encoding used for u"" literals
Consider the following example:
>>> s = u"баба"
>>> s
u'\xe1\xe0\xe1\xe0'
>>> print s
áàáà
I am using the cp1251 encoding in IDLE, but it looks like the interpreter is actually using latin1 to create the unicode string:
>>> print s.encode('latin1')
баба
Why is that? Is there a specification for this behavior?
CPython 2.7.
Edit
The code I was actually looking for:
>>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
True
It seems that when a unicode string is encoded with the latin1 codec, every code point below 256 is simply left as is, which yields exactly the bytes I typed in earlier.
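The observation about latin1 can be checked directly. The following is a minimal sketch written for Python 3 (an assumption, since the transcript above is CPython 2.7); it only relies on the built-in `latin1` codec.

```python
# Python 3 sketch: latin1 maps each code point below 256 to the byte
# with the same numeric value, so encoding is an identity mapping there.
all_low = ''.join(chr(i) for i in range(256))
encoded = all_low.encode('latin1')
assert encoded == bytes(range(256))

# The string from the question, written with \x and \u escapes.
s = '\xe1\xe0\xe1\xe0'
assert s == '\u00e1\u00e0\u00e1\u00e0'

raw = s.encode('latin1')
print(raw)  # b'\xe1\xe0\xe1\xe0'
```

Code points of 256 and above have no latin1 byte, so encoding them raises `UnicodeEncodeError`.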
When you type a character such as б into the terminal, you see б, but what is really entered is a sequence of bytes.
Since your terminal encoding is cp1251, typing баба produces the byte sequence equal to the unicode баба encoded in cp1251:
In [219]: "баба".decode('utf-8').encode('cp1251')
Out[219]: '\xe1\xe0\xe1\xe0'
(Note: I use utf-8 above because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just the unicode for баба.)
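The decode-then-encode step above is a transcoding from one byte encoding to another. A rough Python 3 equivalent might look like this; `transcode` is a hypothetical helper name introduced only for illustration.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the src encoding, then re-encode them as dst."""
    return data.decode(src).encode(dst)

# Bytes a utf-8 terminal would send for the four typed letters.
utf8_bytes = 'баба'.encode('utf-8')
print(utf8_bytes)    # b'\xd0\xb1\xd0\xb0\xd0\xb1\xd0\xb0'

# The same text as a cp1251 terminal would send it.
cp1251_bytes = transcode(utf8_bytes, 'utf-8', 'cp1251')
print(cp1251_bytes)  # b'\xe1\xe0\xe1\xe0'
```

The text is the same in both cases; only the byte representation differs between the two terminals.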
Since typing баба produces the byte sequence \xe1\xe0\xe1\xe0, when you type u"баба" into the terminal, Python gets u'\xe1\xe0\xe1\xe0'. That's why you see
>>> s
u'\xe1\xe0\xe1\xe0'
This unicode string happens to represent áàáà.
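The mix-up can be reproduced without a cp1251 terminal. This is a small Python 3 sketch under the assumption, taken from the question, that the interpreter behaves as if it decoded the typed bytes with latin1; the variable names are illustrative.

```python
typed = 'баба'                   # what the user sees on screen
wire = typed.encode('cp1251')    # bytes a cp1251 terminal actually sends
misread = wire.decode('latin1')  # each byte becomes the code point of equal value

print(wire)            # b'\xe1\xe0\xe1\xe0'
print(ascii(misread))  # '\xe1\xe0\xe1\xe0'

# U+00E1 is á and U+00E0 is à, hence the accented Latin output.
assert misread == '\u00e1\u00e0\u00e1\u00e0'
```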
And when you type
>>> print s.encode('latin1')
the latin1 encoding converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'. The terminal receives the byte sequence '\xe1\xe0\xe1\xe0' and decodes it with cp1251, thus printing баба:
In [222]: print('\xe1\xe0\xe1\xe0'.decode('cp1251'))
баба
Try:
>>> s = "баба"
(without the u). Or,
>>> s = "баба".decode('cp1251')
to make s a unicode string. Or use the verbose but very explicit (and terminal-encoding agnostic) form:
>>> s = u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
Or the shorter but less readable
>>> s = u'\u0431\u0430\u0431\u0430'
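As a cross-check, the mis-decoded string can be repaired by reversing the wrong step, and the result compared with the two escape-based literals. This is a Python 3 sketch; `fix_mojibake` is a hypothetical helper, not a library function.

```python
import unicodedata

def fix_mojibake(text, wrong='latin1', right='cp1251'):
    """Undo a wrong decode: recover the raw bytes, then decode them properly."""
    return text.encode(wrong).decode(right)

broken = '\xe1\xe0\xe1\xe0'  # the string Python ended up with
fixed = fix_mojibake(broken)

by_name = ('\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
           '\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}')
by_escape = '\u0431\u0430\u0431\u0430'
assert fixed == by_name == by_escape

names = [unicodedata.name(c) for c in fixed]
print(names[:2])  # ['CYRILLIC SMALL LETTER BE', 'CYRILLIC SMALL LETTER A']
```

The repair only works because latin1 round-trips every byte value; with a lossy wrong codec the original bytes could not be recovered this way.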