Question about WAVY DASH Python UTF-16 Encoding

Today I was doing some work and ran into a problem where something "looked funny". I was interpreting some string data as UTF-8 and checking the encoded form. The data comes from LDAP (specifically, Active Directory) via python-ldap, so no surprises there.

So, I kept running into the byte sequence '\xe3\x80\xb0', which when decoded as UTF-8 is the Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it via .encode('utf-16'). Unfortunately, Python doesn't seem to like this character:

    D:\> python
    Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u"\u3030"
    u'\u3030'
    >>> u"\u3030".encode("utf-8")
    '\xe3\x80\xb0'
    >>> u"\u3030".encode("utf-16-le")
    '00'
    >>> u"\u3030".encode("utf-16-be")
    '00'
    >>> '\xe3\x80\xb0'.decode('utf-8')
    u'\u3030'
    >>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
    '\xff\xfe00'
    >>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
    u'00'

IronPython doesn't seem to be a fan either:

    D:\> ipy
    IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u"\u3030"
    u'\u3030'
    >>> u"\u3030".encode('utf-8')
    u'\xe3\x80\xb0'
    >>> u"\u3030".encode('utf-16-le')
    '00'

If someone can tell me what exactly is happening here, that would be very helpful.

+4
4 answers

This is the right behavior. The character u'\u3030', when encoded in UTF-16-LE, produces exactly the same two bytes as the string '00' encoded in UTF-8 (or ASCII). It looks weird, but it's right.

"\ xff \ xfe" you can only see "Byte rating" .

Are you sure you actually want a WAVY DASH and not some other character? If you were expecting a different character, it may already have been entered into your application incorrectly.
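As a quick check, here is a minimal Python 2 sketch (matching the sessions above; binascii is only used to print the raw byte values):

    import binascii

    # Both byte orders give the same two bytes for U+3030, because the code
    # point's high and low bytes are equal: 0x30 0x30, the ASCII digits "00".
    print binascii.hexlify(u"\u3030".encode("utf-16-le"))  # prints: 3030
    print binascii.hexlify(u"\u3030".encode("utf-16-be"))  # prints: 3030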

+2

But it decodes back just fine:

 >>> u"\u3030".encode("utf-16-le") '00' >>> '00'.decode("utf-16-le") u'\u3030' 

This simply means that the UTF-16-LE encoding of this character consists of two bytes that happen to match the ASCII code for "0". You can also write it as '\x30\x30':

    >>> '00' == '\x30\x30'
    True
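If it helps to see the individual byte values, a small Python 2 sketch (the variable name is just for illustration):

    data = u"\u3030".encode("utf-16-le")
    # Print the ordinal of each byte in the encoded string.
    print [hex(ord(b)) for b in data]   # prints: ['0x30', '0x30'] -- two ASCII '0' bytes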
+2

Two things are confusing you here (they threw me off at first, too):

  • The utf-16 and utf-32 codecs emit a Byte Order Mark (BOM) unless you specify the byte order explicitly with utf-16-le, utf-16-be, etc. That BOM is the '\xff\xfe' in your second-to-last line.
  • '00' is two copies of the digit character zero. It is not a null character. A null character would print differently:

    >>> '\0\0'
    '\x00\x00'
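To make both points concrete, a minimal Python 2 sketch (the repr calls are only there to show the byte strings explicitly; the BOM shown assumes a little-endian build):

    print repr(u"\u3030".encode("utf-16"))     # '\xff\xfe00' -- BOM followed by the payload
    print repr(u"\u3030".encode("utf-16-le"))  # '00'         -- no BOM when the byte order is explicit
    print '00' == '\x30\x30'                   # True  -- two digit-zero bytes
    print '00' == '\x00\x00'                   # False -- not null bytes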
+1

There is a basic error in the last expression above. Remember that you encode from Unicode to a byte string, and you decode from a byte string back to Unicode. So you are doing:

 '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8') 

which translates into the following steps:

    '\xe3\x80\xb0'            # some byte string
        .decode('utf-8')      # decode the bytes above as UTF-8 text, giving u'\u3030'
        .encode('utf-16-le')  # encode u'\u3030' as UTF-16-LE, i.e. '00'
        .decode('utf-8')      # OOPS! decoding with the wrong codec here!

u'\u3030' really is encoded as '00' (ASCII zero, twice) in UTF-16-LE, but you are reading that as if it were a null byte ('\0') or something similar.

Remember that you will not, in general, get the same character back if you encode with one encoding and decode with another:

    >>> import unicodedata as ud
    >>> c = unichr(193)
    >>> ud.name(c)
    'LATIN CAPITAL LETTER A WITH ACUTE'
    >>> ud.name(c.encode("cp1252").decode("cp1253"))
    'GREEK CAPITAL LETTER ALPHA'

In this code I encoded to Windows-1252 and decoded from Windows-1253. In your code, you encoded to UTF-16-LE and decoded from UTF-8.
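For your original data, the symmetric round trip decodes with the same codec that was used to encode; a minimal Python 2 sketch (the variable names are just for illustration):

    raw = '\xe3\x80\xb0'                   # UTF-8 bytes as received from LDAP
    text = raw.decode('utf-8')             # u'\u3030'
    utf16 = text.encode('utf-16-le')       # '\x30\x30', i.e. '00'
    print repr(utf16.decode('utf-16-le'))  # u'\u3030' -- decode with the matching codec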

0

Source: https://habr.com/ru/post/1301379/

