Get the number of bytes needed for a Unicode string

Question

I have a Korean string encoded as Unicode, for example u'정정'. How do I know how many bytes are required to represent this string?

I need to know the exact number of bytes because I send the string in an iOS push notification, which has a payload size limit.

len('정정') does not work because it returns the number of characters, not the number of bytes.

+6

python string unicode cjk

jasondinh Aug 6 '12 at 17:11

3 answers

The number of bytes required to represent unicode depends on the encoding you use.

    >>> s = u'정정'
    >>> len(s)
    2
    >>> len(s.encode('UTF-8'))
    6
    >>> len(s.encode('UTF-16'))
    6
    >>> len(s.encode('UTF-32'))
    12

If you intend to reuse the encoded result, I recommend encoding the string once, taking the len of that result, and reusing the same encoded bytes afterwards instead of encoding again.
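As a rough illustration of the encode-once advice, here is a minimal sketch. The names `encode_and_check` and `MAX_PAYLOAD_BYTES` are hypothetical, and the limit value is only a placeholder, so substitute whatever limit your push service actually documents.

```python
# -*- coding: utf-8 -*-
# Sketch: encode the text once, measure it, and hand back the same bytes.
# MAX_PAYLOAD_BYTES and encode_and_check are hypothetical names; the limit
# below is a placeholder, not a documented value.

MAX_PAYLOAD_BYTES = 256  # assumption: replace with your service's real limit


def encode_and_check(text, limit=MAX_PAYLOAD_BYTES, encoding="utf-8"):
    """Encode text once and return (encoded_bytes, size_in_bytes, fits)."""
    encoded = text.encode(encoding)  # the only encode call
    size = len(encoded)              # length in bytes, not characters
    return encoded, size, size <= limit


payload, size, fits = encode_and_check(u"\uc815\uc815")
print("%d bytes, fits: %s" % (size, fits))
```

The caller can send `payload` directly, so the string is never encoded a second time just to measure it.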

+4

zigg Aug 6 '12 at 17:17

Make sure you use the correct standard encoding.

If you do not, you can always call decodedString = myString.decode('UTF-8') (replace UTF-8 with the correct encoding name, which you can find from the previous link, if it is not UTF-8) to get a decoded string; len(decodedString) then returns the number of characters.
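A short sketch of that decode step, assuming the input is a UTF-8 byte string (the variable names are illustrative only). It also shows that the length of the decoded string counts characters, while the length of the original byte string counts bytes.

```python
# Sketch: decoding a byte string and comparing the two lengths.
# The bytes below are the UTF-8 encoding of the two-character example string.
raw = b"\xec\xa0\x95\xec\xa0\x95"

decoded = raw.decode("utf-8")  # byte string -> unicode string

char_count = len(decoded)  # number of characters
byte_count = len(raw)      # number of bytes in the UTF-8 form

print("%d characters, %d bytes" % (char_count, byte_count))
```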

0

Hans z Aug 6 '12 at 17:17

Martijn pieters · Accepted Answer · 2012-08-06T17:17:43+0000

You need to know what encoding you want to measure the byte size in:

    >>> print u'\uC815\uC815'
    정정
    >>> print len(u'\uC815\uC815')
    2
    >>> print len(u'\uC815\uC815'.encode('UTF-8'))
    6
    >>> print len(u'\uC815\uC815'.encode('UTF-16-LE'))
    4
    >>> print len(u'\uC815\uC815'.encode('UTF-16'))
    6
    >>> print len(u'\uC815\uC815'.encode('UTF-32-LE'))
    8
    >>> print len(u'\uC815\uC815'.encode('UTF-32'))
    12
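The gap between UTF-16 and UTF-16-LE (and between UTF-32 and UTF-32-LE) in the session above comes from the byte order mark (BOM) that Python's generic 'UTF-16' and 'UTF-32' codecs prepend. The sketch below makes that overhead visible; the variable names are illustrative.

```python
import codecs

s = u"\uc815\uc815"

# The generic codecs write a BOM first; the -LE variants write data only.
utf16 = s.encode("utf-16")
utf16_le = s.encode("utf-16-le")
utf32 = s.encode("utf-32")
utf32_le = s.encode("utf-32-le")

# The first bytes of the generic encodings are the BOM itself.
starts_with_bom16 = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
starts_with_bom32 = utf32[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

bom_overhead_16 = len(utf16) - len(utf16_le)
bom_overhead_32 = len(utf32) - len(utf32_le)

print("UTF-16 BOM overhead: %d bytes" % bom_overhead_16)
print("UTF-32 BOM overhead: %d bytes" % bom_overhead_32)
```

If the receiver already knows the byte order, an explicit variant such as 'UTF-16-LE' avoids spending those bytes on the BOM.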

You really want to check out the Python Unicode HOWTO to fully appreciate the difference between a unicode object and its byte encoding.

Another great article is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky (one of the people behind Stack Overflow).
