Question about WAVY DASH Python UTF-16 Encoding

Today I was doing some work and ran into a problem where something "looked funny". I was interpreting some string data as UTF-8 and checking the encoded form. The data comes from LDAP (specifically, Active Directory) via python-ldap, so no surprises there.

So, I kept running into the byte sequence '\xe3\x80\xb0', which when decoded as UTF-8 is the Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it via .encode('utf-16'). Unfortunately, Python doesn't seem to like this character:

    D:\> python
    Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u"\u3030"
    u'\u3030'
    >>> u"\u3030".encode("utf-8")
    '\xe3\x80\xb0'
    >>> u"\u3030".encode("utf-16-le")
    '00'
    >>> u"\u3030".encode("utf-16-be")
    '00'
    >>> '\xe3\x80\xb0'.decode('utf-8')
    u'\u3030'
    >>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
    '\xff\xfe00'
    >>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
    u'00'

IronPython doesn't seem to be a fan either:

    D:\> ipy
    IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u"\u3030"
    u'\u3030'
    >>> u"\u3030".encode('utf-8')
    u'\xe3\x80\xb0'
    >>> u"\u3030".encode('utf-16-le')
    '00'

If someone can tell me what exactly is happening here, that would be very helpful.

+4
4 answers

This is the right behavior. The character u'\u3030', when encoded in UTF-16-LE, produces exactly the same two bytes as the string '00' encoded in UTF-8 (or ASCII). It looks weird, but it's right.

"\ xff \ xfe" you can only see "Byte rating" .

Are you sure you actually want a WAVY DASH and not some other character? If you were expecting a different character, it may already have been entered into your application incorrectly.
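As a quick check, here is a minimal Python 2 sketch (matching the sessions above; binascii is only used to print the raw byte values):

    import binascii

    # Both byte orders give the same two bytes for U+3030, because the code
    # point's high and low bytes are equal: 0x30 0x30, the ASCII digits "00".
    print binascii.hexlify(u"\u3030".encode("utf-16-le"))  # prints: 3030
    print binascii.hexlify(u"\u3030".encode("utf-16-be"))  # prints: 3030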

+2

But it decodes back just fine:

 >>> u"\u3030".encode("utf-16-le") '00' >>> '00'.decode("utf-16-le") u'\u3030' 

This simply means that the UTF-16-LE encoding of this character consists of two bytes that happen to match the ASCII code for "0". You can also write it as '\x30\x30':

    >>> '00' == '\x30\x30'
    True
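If it helps to see the individual byte values, a small Python 2 sketch (the variable name is just for illustration):

    data = u"\u3030".encode("utf-16-le")
    # Print the ordinal of each byte in the encoded string.
    print [hex(ord(b)) for b in data]   # prints: ['0x30', '0x30'] -- two ASCII '0' bytes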
+2

Two things are confusing you here (they threw me off at first, too):

  • The utf-16 and utf-32 codecs emit a Byte Order Mark (BOM) unless you specify the byte order explicitly with utf-16-le, utf-16-be, etc. That BOM is the '\xff\xfe' in your second-to-last line.
  • '00' is two copies of the digit character zero. It is not a null character. A null character would print differently:

    >>> '\0\0'
    '\x00\x00'
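To make both points concrete, a minimal Python 2 sketch (the repr calls are only there to show the byte strings explicitly; the BOM shown assumes a little-endian build):

    print repr(u"\u3030".encode("utf-16"))     # '\xff\xfe00' -- BOM followed by the payload
    print repr(u"\u3030".encode("utf-16-le"))  # '00'         -- no BOM when the byte order is explicit
    print '00' == '\x30\x30'                   # True  -- two digit-zero bytes
    print '00' == '\x00\x00'                   # False -- not null bytes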
+1

There is a basic error in the last expression above. Remember that you encode from Unicode to a byte string, and you decode from a byte string back to Unicode. So you are doing:

 '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8') 

which translates into the following steps:

    '\xe3\x80\xb0'            # some byte string
        .decode('utf-8')      # decode the bytes above as UTF-8 text, giving u'\u3030'
        .encode('utf-16-le')  # encode u'\u3030' as UTF-16-LE, i.e. '00'
        .decode('utf-8')      # OOPS! decoding with the wrong codec here!

u'\u3030' really is encoded as '00' (ASCII zero, twice) in UTF-16-LE, but you are reading that as if it were a null byte ('\0') or something similar.

Remember that you will not, in general, get the same character back if you encode with one encoding and decode with another:

    >>> import unicodedata as ud
    >>> c = unichr(193)
    >>> ud.name(c)
    'LATIN CAPITAL LETTER A WITH ACUTE'
    >>> ud.name(c.encode("cp1252").decode("cp1253"))
    'GREEK CAPITAL LETTER ALPHA'

In this code I encoded to Windows-1252 and decoded from Windows-1253. In your code, you encoded to UTF-16-LE and decoded from UTF-8.
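For your original data, the symmetric round trip decodes with the same codec that was used to encode; a minimal Python 2 sketch (the variable names are just for illustration):

    raw = '\xe3\x80\xb0'                   # UTF-8 bytes as received from LDAP
    text = raw.decode('utf-8')             # u'\u3030'
    utf16 = text.encode('utf-16-le')       # '\x30\x30', i.e. '00'
    print repr(utf16.decode('utf-16-le'))  # u'\u3030' -- decode with the matching codec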

0

Source: https://habr.com/ru/post/1301379/

