The internal representation of a Unicode string in Python (versions 2.2 through 3.2) depends on whether Python was compiled in wide or narrow mode. Most Python builds are narrow (you can check with sys.maxunicode: it is 65535 on narrow builds and 1114111 on wide builds).
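A quick way to run that check is sketched below. The helper name `unicode_build` is made up for illustration; only `sys.maxunicode` comes from Python itself. On Python 3.3 and later it will always report a wide build, because `sys.maxunicode` is always 1114111 there.

```python
import sys

def unicode_build(maxunicode=None):
    """Classify a build as 'narrow' or 'wide' from its largest code point.

    Hypothetical helper: defaults to the running interpreter's value.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow builds top out at U+FFFF (65535); wide builds at U+10FFFF (1114111).
    return "narrow" if maxunicode == 0xFFFF else "wide"

print(sys.maxunicode, unicode_build())
```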
With a wide build, strings are internally sequences of 4-byte characters, i.e. they use the UTF-32 encoding. Every code point is exactly one wide character long.
With a narrow build, strings are internally sequences of 2-byte characters, using UTF-16. Characters outside the BMP (code points U+10000 and above) are stored using the usual UTF-16 surrogate pairs:
    >>> q = u'\U00010000'
    >>> len(q)
    2
    >>> q[0]
    u'\ud800'
    >>> q[1]
    u'\udc00'
    >>> q
    u'\U00010000'
Note that UTF-16 and UCS-2 are not the same. UCS-2 is a fixed-width encoding: every code point is encoded as 2 bytes, so UCS-2 cannot encode code points outside the BMP. UTF-16 is a variable-width encoding: code points outside the BMP are encoded using a pair of 2-byte code units called a surrogate pair.
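To make the surrogate-pair arithmetic concrete, here is a minimal sketch of how UTF-16 splits a code point above U+FFFF into two code units. The function name `to_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 scheme, and the last line cross-checks it against Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low (trail) surrogate
    return high, low

high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))

# Cross-check against the built-in codec (Python 3 syntax).
print(chr(0x10000).encode("utf-16-be"))
```

For U+10000 this gives the same two units, 0xd800 and 0xdc00, that show up as q[0] and q[1] on a narrow build.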
Note that all of this changes in 3.3, with the implementation of PEP 393. Unicode strings are now represented using characters wide enough to hold the largest code point in the string: 8 bits for ASCII (more precisely, Latin-1) strings, 16 bits for BMP strings, and 32 bits otherwise. This removes the wide/narrow distinction, and also helps reduce memory usage when many ASCII strings are used.
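You can observe the effect of PEP 393 roughly like this. It is only a sketch and assumes CPython 3.3 or later, where `sys.getsizeof` reflects the string's internal storage; the helper name `bytes_per_char` is made up for illustration. It measures how much a string grows when one more copy of a character is appended.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate storage per character by growing a string by one character.

    Assumes CPython 3.3+ (PEP 393); other implementations may differ.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

print(bytes_per_char("a"))            # ASCII
print(bytes_per_char("\u20ac"))       # BMP, outside Latin-1
print(bytes_per_char("\U00010000"))   # outside the BMP
```

On CPython 3.3+ this should print 1, 2 and 4. Also note that len('\U00010000') is 1 there, regardless of how the interpreter was built.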