Python UTF-8 conversion issue

Question

I have saved some UTF-8 characters in my database, for instance 'α' in the name field.

When I read this back through the Django ORM, I get something like:

 >>> p.name
 u'\xce\xb1'
 >>> print p.name
 Î±

I was hoping for 'α'.

After some digging, I found that if I do:

 >>> a = 'α'
 >>> a
 '\xce\xb1'

So when Python displays '\xce\xb1' I get alpha, but when it displays u'\xce\xb1' I get Î±. Is this double encoding?

Why did I get u'\xce\xb1' in the first place? Is there a way that I can just return '\xce\xb1'?

Thanks. My UTF-8 and unicode handling skills really need some help ...

python django encoding unicode utf-8

Overclocked May 18, '11 at 20:36

5 answers

MBarsi · Answer 1 · 2011-05-18T20:41:55+0000

Try putting the Unicode prefix u in front of your string, for example u'YOUR_ALFA_CHAR', and check the encoding of the database, since Django always supports UTF-8.

Mu mind · Answer 2 · 2011-05-19T13:02:02+0000

It seems that the individual bytes of a UTF-8 encoded string have each been interpreted as a Unicode code point. You can "decode" your string from this strange form with:

 p.name = ''.join(chr(ord(x)) for x in p.name)

or maybe

 p.name = ''.join(chr(ord(x)) for x in p.name).decode('utf8')

One way your strings could end up “encoded” in this form is

 ''.join(unichr(ord(x)) for x in '\xce\xb1')

although I suspect your strings really got into this state because different components of your system disagree about the encoding in use.

You will probably have to fix the source of the bad “encoding”, and not just fix the data in your database. The code above may be fine for converting your bad data once, but I would advise against embedding it in your Django application.
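For readers on Python 3, here is a rough sketch of the same idea. It assumes Python 3 semantics (str is Unicode text, bytes is raw bytes), so chr and bytes stand in for the Python 2 unichr and chr used above; the variable names are only illustrative.

```python
# Python 3 sketch: str is Unicode text, bytes is raw bytes.
utf8_bytes = "\u03b1".encode("utf-8")          # b'\xce\xb1'

# How the bad form can arise: each byte is treated as its own code point.
mangled = "".join(chr(b) for b in utf8_bytes)  # two characters, shown as Î±

# Reversing it: turn each character back into a byte, then decode as UTF-8.
restored = bytes(ord(ch) for ch in mangled).decode("utf-8")
```

As noted above, this is probably best kept as a one-off data repair rather than something that runs inside the application.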

karmakaze · Answer 3 · 2011-05-18T22:45:17+0000

The problem is that p.name was incorrectly stored and / or read in the database.

The Unicode small alpha is code point U+03B1, so p.name should have displayed as u'\u03b1' or, on a terminal with Unicode support, printed as the actual alpha character. Note the difference between u'\xce\xb1' and u'\u03b1'. The first is a two-character string, and the second is a single-character string. The bytes CE B1 are the UTF-8 encoding of U+03B1, so somewhere each of those two bytes was read as a separate character.
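To make the two-character versus one-character difference concrete, here is a small sketch that lists the code points in each string. It is written for Python 3, and describe is a hypothetical helper name, not something from Django or the standard library.

```python
import unicodedata

mangled = "\xce\xb1"   # what the ORM returned, written as Python 3 text
alpha = "\u03b1"       # the character that was expected


def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


mangled_info = describe(mangled)  # two entries: U+00CE and U+00B1
alpha_info = describe(alpha)      # one entry: U+03B1
```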

Brian M. Hunt · Answer 4 · 2011-05-18T20:51:38+0000

Try converting the encoding with p.name.encode('latin-1'). Here is a demo:

 >>> print u'\xce\xb1'
 Î±
 >>> print u'\xce\xb1'.encode('latin-1')
 α
 >>> print '\xce\xb1'
 α
 >>> '\xce\xb1' == u'\xce\xb1'.encode('latin1')
 True

See str.encode and Standard Encodings for more information.
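In Python 3 the same Latin-1 round trip needs an explicit decode step afterwards, because encode returns bytes. A minimal sketch, assuming Python 3 and assuming the only damage is UTF-8 bytes that were read as Latin-1; repair_mojibake is a hypothetical helper name:

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    Text that does not fit that pattern is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 (so it was not
        # produced by this mistake) or the bytes are not valid UTF-8.
        return text


broken = "\xce\xb1"              # two characters, displayed as Î±
fixed = repair_mojibake(broken)  # one character, the Greek alpha
```

The try/except is a design choice for safety: strings that are already correct, such as a real 'α', cannot be encoded as Latin-1 and so pass through untouched. Plain ASCII text also comes back unchanged, since it survives the round trip as-is.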

Dov Grobgeld · Answer 5 · 2011-05-18T20:52:50+0000

You can turn any sequence of bytes into Python's internal Unicode representation with the decode method:

 print '\xce\xb1'.decode('utf-8')

This allows you to import a sequence of bytes from any source, and then turn it into a Python Unicode string.
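A short Python 3 sketch of the same point, showing that the codec passed to decode is what decides whether the result is right. The byte string here is just the example from the question; where the bytes actually come from (file, socket, database driver) is left open.

```python
raw = b"\xce\xb1"               # bytes as they might arrive from any source

text = raw.decode("utf-8")      # one character: the Greek alpha
wrong = raw.decode("latin-1")   # two characters, displayed as Î±

# Encoding the correctly decoded text gives back the original bytes.
round_trip = text.encode("utf-8")
```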

Link: http://docs.python.org/library/stdtypes.html#string-methods
