Python Unicode - only UTF-16?

Question


I was happy in the Python world, knowing that I was doing everything in Unicode and encoding to UTF-8 whenever I needed to output something to a user. Then one of my colleagues sent me this article on UTF-8, and it confused me.

The author of the article indicates that UCS-2, the Unicode representation that Python uses, is synonymous with UTF-16. He even goes so far as to say that Python uses UTF-16 for its internal string representation.

The author also admits to being a hobbyist and a Windows developer, and states that the way MS has handled character encodings over the years has left that group the most confused, so maybe it is just his own confusion. I don't know...

Can someone explain what the state of UTF-16 versus Unicode is in Python? Are they synonymous, and if not, in what way do they differ?

+4

python unicode character-encoding utf-16

Endophage Oct 26 '12 at 10:55


1 answer

nneonneo · Accepted Answer · 2012-10-26T23:03:33+0000

The internal representation of a Unicode string in Python (versions 2.2 through 3.2) depends on whether Python was compiled in wide or narrow mode. Most Python builds are narrow (you can check with sys.maxunicode: it is 65535 on narrow builds and 1114111 on wide builds).
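As a minimal sketch of that check (the helper name build_kind is hypothetical, not part of any library), the following classifies the running interpreter from sys.maxunicode. Note that on Python 3.3 and later the value is always 1114111, so the function can only report "narrow" on an older narrow build.

```python
import sys

def build_kind():
    """Classify the interpreter by the largest code point one string element can hold.

    build_kind is a hypothetical helper name used only for this illustration.
    """
    # 0xFFFF (65535) indicates a narrow build; 0x10FFFF (1114111) indicates
    # a wide build or any Python 3.3+ interpreter.
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(sys.maxunicode, build_kind())
```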

On wide builds, strings are internally sequences of four-byte characters, i.e. they use the UTF-32 encoding. Every code point is exactly one wide character in length.

On narrow builds, strings are internally sequences of two-byte characters, using UTF-16. Characters beyond the BMP (code points U+10000 and above) are stored using the usual UTF-16 surrogate pairs:

    >>> q = u'\U00010000'
    >>> len(q)
    2
    >>> q[0]
    u'\ud800'
    >>> q[1]
    u'\udc00'
    >>> q
    u'\U00010000'
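The two halves shown in that session can be derived by hand. The sketch below applies the standard UTF-16 surrogate arithmetic; the function name to_surrogate_pair is an assumption made for this illustration and is not something Python exposes.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into its UTF-16 high and low surrogates.

    to_surrogate_pair is a hypothetical helper written for this example.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00
```

This matches q[0] and q[1] from the narrow-build session.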

Note that UTF-16 and UCS-2 are not the same. UCS-2 is a fixed-width encoding: every code point is encoded as 2 bytes. Consequently, UCS-2 cannot encode code points beyond the BMP. UTF-16 is a variable-width encoding; code points outside the BMP are encoded using a pair of 16-bit code units, called a surrogate pair.
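To make the variable width concrete, here is a small sketch, assuming Python 3.3 or later, that counts how many 16-bit code units the built-in utf-16-be codec produces per character. The helper name utf16_code_units is hypothetical. Python ships no separate UCS-2 codec, so only the UTF-16 side is shown.

```python
def utf16_code_units(text):
    """Return how many 16-bit code units UTF-16 needs for text.

    utf16_code_units is a hypothetical helper written for this example.
    """
    # "utf-16-be" is used so that no byte order mark is added to the output.
    return len(text.encode("utf-16-be")) // 2

for ch in ("A", "\u20ac", "\U00010000"):
    print("U+%04X" % ord(ch), utf16_code_units(ch))
```

BMP characters take one code unit each, while U+10000 takes two, which is exactly what a fixed-width two-byte encoding cannot express.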

Note that all of this changes in 3.3, with the implementation of PEP 393. Unicode strings are now represented using characters wide enough to hold the largest code point in the string: 8 bits for ASCII strings, 16 bits for BMP strings, and 32 bits otherwise. This does away with the wide/narrow build distinction and also helps reduce memory usage when many ASCII-only strings are used.
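One rough way to observe this on CPython 3.3 or later is to measure how much a string grows when one more character is appended. This is only a sketch: sys.getsizeof reports CPython implementation details, other interpreters may behave differently, and bytes_per_char is a hypothetical helper name.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the per-character storage by growing a string by one character.

    bytes_per_char is a hypothetical helper; the result relies on CPython's
    PEP 393 layout and is not guaranteed by the language.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

for ch in ("a", "\u20ac", "\U00010000"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))

# Since 3.3, a non-BMP character is a single element on every build.
print(len("\U00010000"))
```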
