Substring selection using multibyte Unicode characters in Python 3

Question

I'm a little confused about how Python 3 handles multibyte Unicode characters. Here is an example with emoji:

In [1]: print('☺️')
☺️

In [2]: print(len('☺️'))
2

In [3]: print('☺️'[0])
☺

In [4]: print('☺️'[1])
️

In [5]: print(len('👩🏾‍💼'))
4

Since I am working on a small emoji hobby project, this causes problems for me: I would rather deal with each emoji as a single-character string than as a multi-character string, which is how Python 3 treats them. Why doesn't Python 3 recognize each of these as a single character, and how can I work with emojis the way I would prefer?

In case this is a problem with my terminal or REPL instead: I am using the macOS Sierra terminal with IPython 5.1.0.

python python-3.x unicode emoji

Jimmy c Jan 22 '17 at 13:36

Leon · Answer 1 · 2017-01-22T15:07:42+0000

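Python 3 is indexing these strings by code point, and each of your emojis ('☺️' and '👩🏾‍💼') is made up of several code points. Look at the individual elements: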

>>> '☺️'[0]
'☺'
>>> '☺️'[1]
'️'
>>> '☺️'[1].encode('unicode_escape')
b'\\ufe0f'                             # !!!!!!!!!!
>>> '👩🏾‍💼'[0]
'👩'
>>> '👩🏾‍💼'[1]
'🏾'
>>> '👩🏾‍💼'[2]
'\u200d'                               # !!!!!!!!!!
>>> '👩🏾‍💼'[3]
'💼'
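
The same inspection can be done programmatically. The following is a minimal sketch using the standard-library `unicodedata` module; the helper name `describe` is made up for illustration and is not part of the original answer.

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) pair for every code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# The two emojis from the question, written as escapes so that the
# invisible code points are visible in the source.
smiley = "\u263a\ufe0f"
office_worker = "\U0001f469\U0001f3fe\u200d\U0001f4bc"

for emoji in (smiley, office_worker):
    for code_point, name in describe(emoji):
        print(code_point, name)
    print()
```

This prints one line per code point, which makes it clear that `len()` is counting code points, not the glyphs you see on screen.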

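'\ufe0f' (U+FE0F) is Variation Selector-16, an invisible code point that asks for the preceding character to be displayed with emoji presentation.

It is only needed when the preceding character defaults to text presentation. In your case the preceding character, ☺, is a dingbat emoji, so U+FE0F is what selects its emoji form.

'\u200d' (U+200D) is the Zero Width Joiner, which tells the renderer to combine the emojis on either side of it into a single glyph where the font supports it.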
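
To treat each emoji as one unit, one option is to group code points into clusters before indexing. The sketch below is a simplified heuristic and not a full implementation of Unicode grapheme segmentation: it attaches combining marks, variation selectors, skin-tone modifiers, and anything joined by U+200D to the preceding code point. The names `_extends` and `emoji_clusters` are hypothetical, and cases such as flag sequences are deliberately not handled.

```python
import unicodedata

ZWJ = "\u200d"  # Zero Width Joiner

def _extends(ch):
    """Heuristic: True if ch should attach to the preceding cluster."""
    return (
        unicodedata.combining(ch) != 0          # combining marks
        or "\ufe00" <= ch <= "\ufe0f"           # variation selectors 1-16
        or "\U0001f3fb" <= ch <= "\U0001f3ff"   # skin-tone modifiers
        or ch == ZWJ
    )

def emoji_clusters(text):
    """Split text into a list of user-visible units (simplified)."""
    clusters = []
    for ch in text:
        # Attach ch if it extends the previous cluster, or if the
        # previous cluster ends with a joiner waiting for its partner.
        if clusters and (_extends(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

smiley = "\u263a\ufe0f"
office_worker = "\U0001f469\U0001f3fe\u200d\U0001f4bc"

units = emoji_clusters("a" + smiley + office_worker + "b")
print(len(units))  # number of clusters, not number of code points
```

Indexing and `len()` on the resulting list then work per emoji. For anything beyond a hobby project, a library that implements the full Unicode segmentation rules is likely a better fit; the third-party `regex` package, for example, matches grapheme clusters with `\X`.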
