Python UTF-8 conversion issue

I have saved some UTF-8 characters in my database, for instance 'α' in the name field.

Through Django ORM, when I read this, I get something like

    >>> p.name
    u'\xce\xb1'
    >>> print p.name
    Î±

I was hoping for 'α'.

After some digging, I found that if I do this:

    >>> a = 'α'
    >>> a
    '\xce\xb1'

So when Python displays '\xce\xb1' I get alpha, but when it displays u'\xce\xb1' I do not. Is this double encoding?

Why did I get u'\xce\xb1' in the first place? Is there a way that I can just get back '\xce\xb1'?

Thanks. My UTF-8 and unicode handling skills really need some help ...

+6
5 answers

Try putting the unicode prefix u in front of your string, for example u'YOUR_ALFA_CHAR', and check the encoding of your database, since Django always supports UTF-8.

+2

It seems that you have the individual bytes of a UTF-8 encoded string interpreted as Unicode code points. You can "decode" your string from this strange form with:

 p.name = ''.join(chr(ord(x)) for x in p.name) 

or maybe

 p.name = ''.join(chr(ord(x)) for x in p.name).decode('utf8') 

One way to get your strings "encoded" into this form is

 ''.join(unichr(ord(x)) for x in '\xce\xb1') 

although I suspect that your strings actually got into this state because different components of your system disagree on the encoding in use.

You will probably have to fix the source of the bad “encoding”, and not just fix the data in your database. And the above code may be good at converting your bad data once, but I would advise you not to embed this code in your Django application.
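To make the mechanism concrete, here is a minimal sketch of how the broken value can arise and how the one-off conversion above reverses it. It is written for Python 3 (an assumption; the thread itself uses Python 2, where `str`/`unicode` play the roles of `bytes`/`str` here), and the variable names are only illustrative.

```python
# Python 3 sketch: str is Unicode text, bytes is raw data.
correct = "α"                          # one code point, U+03B1
utf8_bytes = correct.encode("utf-8")   # the two bytes 0xCE 0xB1

# Reading each byte as its own code point (which is what a Latin-1
# decode does) reproduces the two-character value from the question.
broken = utf8_bytes.decode("latin-1")

# Reversing it: map every character back to a single byte,
# then decode those bytes as UTF-8.
repaired = bytes(ord(ch) for ch in broken).decode("utf-8")

print(repr(broken), repr(repaired))
```

The reversal only works while every character in the broken string has a code point below 256; anything else means the data was damaged in some other way.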

+2

The problem is that p.name was incorrectly stored in and/or read from the database.

The Unicode small alpha is U+03B1, so p.name should have printed as u'\u03b1', or, on a Unicode-capable terminal, as the actual alpha character in quotes. Note the difference between u'\xce\xb1' and u'\uceb1': the first is a two-character string, and the second is a one-character string. The bytes CE B1 are the UTF-8 encoding of U+03B1, so it looks as though each byte was read as a separate character somewhere along the way.
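A quick way to check which of the two strings you actually have is to look at the length and the individual code points. This is a small Python 3 sketch (an assumption, since the thread uses Python 2); `alpha` and `mangled` are just illustrative names.

```python
import unicodedata

alpha = "\u03b1"      # the intended single character
mangled = "\xce\xb1"  # the value read back from the database

# Length and code points show the difference immediately.
print(len(alpha), [hex(ord(c)) for c in alpha])
print(len(mangled), [hex(ord(c)) for c in mangled])

# The character names make it clear that the second string is
# two unrelated Latin-1 characters rather than a Greek letter.
print(unicodedata.name(alpha))
print([unicodedata.name(c) for c in mangled])

# The UTF-8 encoding of the real alpha is exactly the bytes CE B1.
print(alpha.encode("utf-8"))
```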

+1

Try converting the encoding with p.name.encode('latin-1'). Here is a demo:

    >>> print u'\xce\xb1'
    Î±
    >>> print u'\xce\xb1'.encode('latin-1')
    α
    >>> print '\xce\xb1'
    α
    >>> '\xce\xb1' == u'\xce\xb1'.encode('latin1')
    True

See str.encode and Standard Encodings for more information.
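If you need to apply this to stored values that may or may not be broken, a guarded version is safer than a bare encode call. The sketch below is for Python 3 (an assumption; the thread uses Python 2), and `repair_mojibake` is a hypothetical helper name, not part of Django or the standard library.

```python
def repair_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were decoded as Latin-1."""
    try:
        # Turn each character back into one byte, then read the bytes as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the resulting
        # bytes are not valid UTF-8. Assume the value was fine and keep it.
        return text


print(repr(repair_mojibake("\xce\xb1")))  # broken value from the question
print(repr(repair_mojibake("α")))         # already correct, left unchanged
print(repr(repair_mojibake("café")))      # plain Latin-1 text, left unchanged
```

The guard is a heuristic: a string that really was meant to contain 'Î±' would also be converted, so it is better used for a one-time cleanup than left in the application.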

0

You can turn any sequence of bytes into Python's internal Unicode representation with the decode method:

 print '\xce\xb1'.decode('utf-8') 

This allows you to import a sequence of bytes from any source, and then turn it into a Python Unicode string.

Link: http://docs.python.org/library/stdtypes.html#string-methods
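For completeness, here is roughly what the same decode looks like in Python 3 (an assumption; the line above is Python 2), including what happens when the bytes are not valid UTF-8. The variable names are illustrative only.

```python
raw = b"\xce\xb1"           # bytes as they might arrive from an external source
text = raw.decode("utf-8")  # the single character U+03B1

# A truncated sequence is not valid UTF-8 and raises by default.
truncated = b"\xce"
try:
    truncated.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# errors="replace" substitutes U+FFFD instead of raising.
replaced = truncated.decode("utf-8", errors="replace")

print(repr(text), failed, repr(replaced))
```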

0

Source: https://habr.com/ru/post/888470/

