I am trying to understand how Python 2.5 deals with unicode strings. Although I now think I have a good idea of how I should handle them in code, I don't quite understand what happens behind the scenes, especially when you enter strings at the interpreter's prompt.
So, Python before 3.0 has two types for strings: str (byte strings) and unicode, both of which derive from basestring. The default type for strings is str.
str objects have no idea of their actual encoding; they are just bytes. Either you encoded a unicode string yourself, and therefore know what encoding the bytes are in, or you read a stream of bytes whose encoding you know in advance (though not always). You can guess the encoding of a byte string whose encoding is unknown to you, but there is simply no reliable way to figure it out. Your best bet is to decode early, use unicode everywhere in your code, and encode late.
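As a minimal sketch of that decode-early, encode-late pattern: the byte string below is assumed to be UTF-8, and the names `raw`, `text`, `shouted` and `out` are only illustrative. The `b` and `u` prefixes are there so the snippet also runs on newer Pythons; on 2.5 itself the `b` prefix has to be dropped.

```python
# Bytes as they might arrive from a file or socket (assumed to be UTF-8).
raw = b"La ca\xc3\xb1a de Espa\xc3\xb1a"

# Decode early: turn the bytes into a unicode string at the boundary.
text = raw.decode("utf-8")

# Work on unicode throughout: lengths and case changes are per character.
shouted = text.upper()

# Encode late: go back to bytes only when writing out.
out = shouted.encode("utf-8")

print(len(raw))       # number of bytes
print(len(text))      # number of characters
print(repr(out))
```

The byte string is two items longer than the unicode string because each "ñ" takes two bytes in UTF-8.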
That's fine. But are the strings entered at the interpreter's prompt encoded for you behind your back? Assuming my understanding of strings in Python is correct, which method or setting does Python use to make this decision?
The source of my confusion is the different results I get when I try the same thing in my system's Python interpreter and in the Python console built into my editor.
# Editor (Sublime Text)
>>> s = "La caña de España"
>>> s
'La ca\xc3\xb1a de Espa\xc3\xb1a'
>>> s.decode("utf-8")
u'La ca\xf1a de Espa\xf1a'
>>> sys.getdefaultencoding()
'ascii'
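To compare the two environments, a small sketch like the following could be run in each of them. It only reports what Python says about its default encoding and its standard streams; `describe_encodings` is a hypothetical helper name, and my assumption is that the stream encodings are the settings that matter here, since `sys.getdefaultencoding()` by itself does not describe what the terminal or editor console sends.

```python
import sys


def describe_encodings():
    """Collect the encodings Python reports for this environment."""
    return {
        # Used for implicit str <-> unicode conversions.
        "default": sys.getdefaultencoding(),
        # Encoding of the input/output streams; may be None when the
        # stream is not attached to a terminal (e.g. an editor console).
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


info = describe_encodings()
for name in sorted(info):
    print("%s: %r" % (name, info[name]))
```

If the two environments report different values for `stdin`, that would be one plausible explanation for seeing different bytes for the same typed string.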