Get the number of bytes needed for a Unicode string

I have a Korean string encoded as Unicode, for example u'정정'. How do I know how many bytes are required to represent this string?

I need to know the exact number of bytes because I send the string in an iOS push notification, which has a payload size limit.

len('정정') does not work because it returns the number of characters, not the number of bytes.

+6
3 answers

You need to know what encoding you want to measure the byte size in:

    >>> print u'\uC815\uC815'
    정정
    >>> print len(u'\uC815\uC815')
    2
    >>> print len(u'\uC815\uC815'.encode('UTF-8'))
    6
    >>> print len(u'\uC815\uC815'.encode('UTF-16-LE'))
    4
    >>> print len(u'\uC815\uC815'.encode('UTF-16'))
    6
    >>> print len(u'\uC815\uC815'.encode('UTF-32-LE'))
    8
    >>> print len(u'\uC815\uC815'.encode('UTF-32'))
    12
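The gap between the 'UTF-16' and 'UTF-16-LE' results (and between 'UTF-32' and 'UTF-32-LE') comes from the byte order mark that Python prepends when no byte order is named. Here is a minimal sketch of that, written for Python 3 (where str is already Unicode) rather than the Python 2 used in the session above. The variable names are only illustrative.

```python
# Python 3 sketch: str is already Unicode, so no u'' prefix is needed.
import codecs

s = "\uC815\uC815"  # the same two-character string as in the session above

utf16 = s.encode("utf-16")        # byte order mark + data (native byte order)
utf16_le = s.encode("utf-16-le")  # data only, byte order stated explicitly

# The plain 'utf-16' codec is longer by exactly the size of the BOM.
bom_len_16 = len(utf16) - len(utf16_le)
starts_with_bom_16 = utf16.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

utf32 = s.encode("utf-32")
utf32_le = s.encode("utf-32-le")
bom_len_32 = len(utf32) - len(utf32_le)

print(bom_len_16, starts_with_bom_16, bom_len_32)  # 2 True 4
```

If the receiver already knows the byte order, naming it explicitly ('UTF-16-LE', 'UTF-32-LE') saves those extra bytes.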

You really want to check out the Python Unicode HOWTO to fully appreciate the difference between a unicode object and its byte encoding.

Another great article is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (one of the people behind Stack Overflow).

+14

The number of bytes required to represent a Unicode string depends on the encoding you use.

    >>> s = u'정정'
    >>> len(s)
    2
    >>> len(s.encode('UTF-8'))
    6
    >>> len(s.encode('UTF-16'))
    6
    >>> len(s.encode('UTF-32'))
    12

If you intend to reuse the encoded result, I recommend that you encode the string once, take the len of that result, and reuse the same bytes later instead of encoding again.
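A rough sketch of that encode-once pattern, written for Python 3. MAX_PAYLOAD_BYTES and fits_in_payload are hypothetical names, and the limit value is an assumption rather than something stated in the question. Check the current limit for your push service, and remember that it normally applies to the whole payload, not just the message text.

```python
# Python 3 sketch. MAX_PAYLOAD_BYTES and fits_in_payload are hypothetical
# names; the limit value is an assumption, not taken from the question.
MAX_PAYLOAD_BYTES = 4096


def fits_in_payload(encoded: bytes, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """Return True if the already-encoded bytes fit within the limit."""
    return len(encoded) <= limit


message = "\uC815\uC815"
encoded = message.encode("utf-8")  # encode once
size = len(encoded)                # byte count, not character count

if fits_in_payload(encoded):
    payload_bytes = encoded        # reuse the same bytes, no re-encoding
else:
    raise ValueError("message is %d bytes, limit is %d" % (size, MAX_PAYLOAD_BYTES))
```

Measuring the bytes that are actually sent avoids any mismatch between the encoding you measured and the encoding you transmit.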

+4

Make sure you are using the correct standard encoding.

If you are not, you can always call decodedString = myString.decode('UTF-8') (replace UTF-8 with the correct encoding name, which you can find from the previous link, if it is not UTF-8) to get a unicode object. Note that len(decodedString) then returns the number of characters; the byte count is the len of the encoded byte string.
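To make the bytes-versus-characters distinction concrete, here is a small Python 3 sketch. In Python 3 the decode step applies to a bytes object, and the variable names are only illustrative.

```python
# Python 3 sketch: bytes.decode() yields a str, and len() on a str counts
# code points rather than bytes.
raw = b"\xec\xa0\x95\xec\xa0\x95"  # UTF-8 bytes for U+C815 U+C815

decoded = raw.decode("utf-8")

byte_count = len(raw)      # length of the encoded form
char_count = len(decoded)  # length of the decoded text

# Encoding again with the same codec gives back the original bytes.
round_trip = decoded.encode("utf-8")

print(byte_count, char_count, round_trip == raw)  # 6 2 True
```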

0

Source: https://habr.com/ru/post/922234/

