Encoding used for u"" literals

Question

Consider the following example:

 >>> s = u"баба"
 >>> s
 u'\xe1\xe0\xe1\xe0'
 >>> print s
 áàáà

I use cp1251 as the encoding in IDLE, but it looks like the interpreter actually uses latin1 to create the unicode string:

 >>> print s.encode('latin1')
 баба

Why is that? Is there a specification for this behavior?

CPython 2.7.

Edit

What I was really looking for:

 >>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
 True

It seems that when a unicode string is encoded with the latin1 codec, every code point below 256 is simply written out as the byte with the same value, which gives back the bytes I typed in originally.

python encoding unicode

Roman bodnarchuk Jan 15 '12 at 19:50

1 answer

unutbu · Accepted Answer · 2012-01-15T22:09:09+0000

When you type a character such as б in the terminal, you see б, but what is really entered is a sequence of bytes.

Since your terminal encoding is cp1251, typing баба produces the byte sequence you get by encoding the unicode баба in cp1251:

 In [219]: "баба".decode('utf-8').encode('cp1251')
 Out[219]: '\xe1\xe0\xe1\xe0'

(Note: I use utf-8 above because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just the unicode for баба.)

Since typing баба produces the byte sequence \xe1\xe0\xe1\xe0, when you type u"баба" into the terminal, Python receives u'\xe1\xe0\xe1\xe0'. That's why you see

 >>> s
 u'\xe1\xe0\xe1\xe0'

This unicode happens to represent áàáà.
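The chain described above can be reproduced without depending on any particular terminal. The following is only a sketch of the observed behavior, not a statement of how the interpreter is implemented: it encodes the intended text as cp1251 and then reads each byte as a code point, which is what latin1 decoding does. The variable names are illustrative, and the code is kept ASCII-only so it runs on both Python 2.7 and Python 3.

```python
# Sketch: simulate typing u"..." with Cyrillic letters into a cp1251 terminal.
# All names here are illustrative, not part of any API.

intended = u'\u0431\u0430\u0431\u0430'   # the four Cyrillic letters be, a, be, a

# Step 1: the terminal delivers the text as cp1251 bytes.
typed_bytes = intended.encode('cp1251')
assert typed_bytes == b'\xe1\xe0\xe1\xe0'

# Step 2: each byte is taken as a code point below 256,
# which is exactly what decoding with latin1 does.
misread = typed_bytes.decode('latin1')
assert misread == u'\xe1\xe0\xe1\xe0'

# The same string written with \u escapes (a-acute, a-grave, twice).
assert misread == u'\u00e1\u00e0\u00e1\u00e0'
assert misread != intended
```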

And when you type

 >>> print s.encode('latin1')

the latin1 codec converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'. The terminal receives the byte sequence '\xe1\xe0\xe1\xe0' and decodes it with cp1251, thus printing баба:

 In [222]: print('\xe1\xe0\xe1\xe0'.decode('cp1251'))
 баба
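To make the round trip concrete, here is a small sketch showing that latin1 maps every byte value 0..255 to the code point with the same number, so encoding with latin1 gives back the bytes that were originally typed. The helper `repair` is hypothetical (it is not from the question or from any library), and it assumes you already know the encoding the terminal used.

```python
# latin1 is a one-to-one mapping between byte values and code points 0..255.
all_bytes = bytes(bytearray(range(256)))
as_text = all_bytes.decode('latin1')
assert [ord(ch) for ch in as_text] == list(range(256))
assert as_text.encode('latin1') == all_bytes


def repair(misdecoded, terminal_encoding='cp1251'):
    """Hypothetical helper: undo a latin1 mis-decoding.

    Recovers the raw bytes via latin1, then decodes them with the
    encoding the terminal actually used (an assumption by the caller).
    """
    return misdecoded.encode('latin1').decode(terminal_encoding)


# The string from the question turns back into the four Cyrillic letters.
assert repair(u'\xe1\xe0\xe1\xe0') == u'\u0431\u0430\u0431\u0430'
```

This only works for strings whose code points are all below 256; anything else raises UnicodeEncodeError in the latin1 step.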

Try:

 >>> s = "баба"

(without the u). Or,

 >>> s = "баба".decode('cp1251')

to make s unicode. Or use the verbose but very explicit (and terminal-encoding agnostic):

 >>> s = u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'

Or the short but less readable:

 >>> s = u'\u0431\u0430\u0431\u0430'
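As a quick check that these spellings agree, the sketch below compares the \N{...} form, the \u form, and the result of decoding the cp1251 bytes. It uses only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

# Three ways to write the same four-letter string without typing Cyrillic.
by_name = (u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
           u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}')
by_escape = u'\u0431\u0430\u0431\u0430'
from_bytes = b'\xe1\xe0\xe1\xe0'.decode('cp1251')

assert by_name == by_escape == from_bytes

# Look up the official character names to confirm what the escapes mean.
names = [unicodedata.name(ch) for ch in by_escape]
assert names[0] == 'CYRILLIC SMALL LETTER BE'
assert names[1] == 'CYRILLIC SMALL LETTER A'
```

Because none of these forms contains a non-ASCII character, the result does not depend on the terminal or source-file encoding.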
