Today I was doing some work and ran into a problem when something "looked funny". I was interpreting some string data as UTF-8 and checking the encoded form. The data came from LDAP (specifically Active Directory) via python-ldap. No surprises there.
Several times I came across the byte sequence '\xe3\x80\xb0', which decodes as UTF-8 to the Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it with .encode('utf-16'). Unfortunately, it seems that Python doesn't like this character:
    D:\> python
    Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u"\u3030"
    u'\u3030'
    >>> u"\u3030".encode("utf-8")
    '\xe3\x80\xb0'
    >>> u"\u3030".encode("utf-16-le")
    '00'
    >>> u"\u3030".encode("utf-16-be")
    '00'
    >>> '\xe3\x80\xb0'.decode('utf-8')
    u'\u3030'
    >>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
    '\xff\xfe00'
    >>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
    u'00'
IronPython doesn't seem to be a fan either:
    D:\> ipy
    IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u"\u3030"
    u'\u3030'
    >>> u"\u3030".encode('utf-8')
    u'\xe3\x80\xb0'
    >>> u"\u3030".encode('utf-16-le')
    '00'
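For anyone trying to reproduce this, here is a minimal sketch that prints the encoded bytes in hex instead of relying on the interpreter's repr, which displays any printable byte as its ASCII character. It is written for Python 3 (an assumption; the sessions above are Python 2.6 and IronPython 2.6), and the helper name hex_bytes is made up for illustration.

```python
import binascii

ch = u"\u3030"  # WAVY DASH, the character from the sessions above


def hex_bytes(text, encoding):
    """Encode text and return the raw bytes as a lowercase hex string.

    Hypothetical helper for this post; it only wraps str.encode and
    binascii.hexlify so the bytes are shown without repr escaping.
    """
    return binascii.hexlify(text.encode(encoding)).decode("ascii")


# Dump each encoding byte by byte so nothing is hidden by the repr.
for enc in ("utf-8", "utf-16-le", "utf-16-be"):
    print(enc, hex_bytes(ch, enc))
```

Comparing the hex dump with the repr output above may help narrow down whether the encoder or only the display is at fault.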
If someone can tell me what exactly is happening here, that would be very helpful.