Getting bytes from unicode string in python

I have a 16 bit Unicode byte order string presented as u'\u4132' ,

how can i split it into integers 41 and 32 in python?

+8
python unicode byte
Nov 21 '10 at 7:01
source share
6 answers

Here are a few different ways you may wish.

Python 2:

 >>> chars = u'\u4132'.encode('utf-16be') >>> chars 'A2' >>> ord(chars[0]) 65 >>> '%x' % ord(chars[0]) '41' >>> hex(ord(chars[0])) '0x41' >>> ['%x' % ord(c) for c in chars] ['41', '32'] >>> [hex(ord(c)) for c in chars] ['0x41', '0x32'] 

Python 3:

 >>> chars = '\u4132'.encode('utf-16be') >>> chars b'A2' >>> chars = bytes('\u4132', 'utf-16be') >>> chars # Just the same. b'A2' >>> chars[0] 65 >>> '%x' % chars[0] '41' >>> hex(chars[0]) '0x41' >>> ['%x' % c for c in chars] ['41', '32'] >>> [hex(c) for c in chars] ['0x41', '0x32'] 
+15
Nov 21 '10 at 23:15
source share
  • Java: "\u4132".getBytes("UTF-16BE")
  • Python 2: u'\u4132'.encode('utf-16be')
  • Python 3: '\u4132'.encode('utf-16be')

These methods return an array of bytes, which can be easily converted to an int array. But note that the codes above U+FFFF will be encoded using two blocks of code (so with UTF-16BE this means 32 bits or 4 bytes).

+4
Nov 21 '10 at 19:09
source share

"Those" are not integers, it is a hexadecimal number that represents a code point .

If you want to get an integer representation of a code point, you need to use ord(u'\u4132') , if now you want to convert it back to a Unicode character, use unicode() , which will return a Unicode string.

+2
Nov 21 '10 at 7:12
source share
 >>> c = u'\u4132' >>> '%x' % ord(c) '4132' 
+2
Nov 21 '10 at 21:25
source share

Dirty hack: repr(u'\u4132') will return "u'\\u4132'"

+1
Nov 21 '10 at 19:44
source share

Pass the Unicode character to ord() to get its code point, and then split that code point into separate bytes with int.to_bytes() , and then format the output as you want:

 list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big'))) 

returns: ['0', '0', '41', '32']

 list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big'))) 

returns: ['0', '1', 'f4', 'a9']

As I mentioned in another comment, utf16 code point encoding will not work properly for code points outside BMP (the base multilingual plane), since UTF16 will require a surrogate pair to encode these code points.

0
Jul 02 '18 at 19:43
source share



All Articles