Getting bytes from unicode string in python

Question

Getting bytes from unicode string in python

I have a 16 bit Unicode byte order string presented as u'\u4132' ,

how can i split it into integers 41 and 32 in python?

+8

python unicode byte

altunyurt Nov 21 '10 at 7:01

source share

6 answers

Java: "\u4132".getBytes("UTF-16BE")
Python 2: u'\u4132'.encode('utf-16be')
Python 3: '\u4132'.encode('utf-16be')

These methods return an array of bytes, which can be easily converted to an int array. But note that the codes above U+FFFF will be encoded using two blocks of code (so with UTF-16BE this means 32 bits or 4 bytes).

+4

Roland Illig Nov 21 '10 at 19:09

source share

"Those" are not integers, it is a hexadecimal number that represents a code point .

If you want to get an integer representation of a code point, you need to use ord(u'\u4132') , if now you want to convert it back to a Unicode character, use unicode() , which will return a Unicode string.

+2

Ivo Wetzel Nov 21 '10 at 7:12

source share

 >>> c = u'\u4132' >>> '%x' % ord(c) '4132'

+2

jfs Nov 21 '10 at 21:25

source share

Dirty hack: repr(u'\u4132') will return "u'\\u4132'"

+1

seriyPS Nov 21 '10 at 19:44

source share

Pass the Unicode character to ord() to get its code point, and then split that code point into separate bytes with int.to_bytes() , and then format the output as you want:

 list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))

returns: ['0', '0', '41', '32']

 list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))

returns: ['0', '1', 'f4', 'a9']

As I mentioned in another comment, utf16 code point encoding will not work properly for code points outside BMP (the base multilingual plane), since UTF16 will require a surrogate pair to encode these code points.

0

Danilo Souza Morães Jul 02 '18 at 19:43

source share

Chris Morgan · Accepted Answer · 2010-11-21 23:15

Here are a few different ways you may wish.

Python 2:

 >>> chars = u'\u4132'.encode('utf-16be') >>> chars 'A2' >>> ord(chars[0]) 65 >>> '%x' % ord(chars[0]) '41' >>> hex(ord(chars[0])) '0x41' >>> ['%x' % ord(c) for c in chars] ['41', '32'] >>> [hex(ord(c)) for c in chars] ['0x41', '0x32']

Python 3:

 >>> chars = '\u4132'.encode('utf-16be') >>> chars b'A2' >>> chars = bytes('\u4132', 'utf-16be') >>> chars # Just the same. b'A2' >>> chars[0] 65 >>> '%x' % chars[0] '41' >>> hex(chars[0]) '0x41' >>> ['%x' % c for c in chars] ['41', '32'] >>> [hex(c) for c in chars] ['0x41', '0x32']

Getting bytes from unicode string in python

More articles: