Python3 emoji characters as unicode

I have a string in Python 3 that contains emoji, and I want to treat the emoji as their Unicode representation, because I need to manipulate them in that form.

s = '😬 😎 hello'

Python treats each emoji as its own character, so len(s) == 9 and s[0] == '😬'.

I want to convert the string so that each emoji is spelled out as Unicode code units, like this:

s = '😬 😎 hello'
u = to_unicode(s)   # Some function to change the format.
print(u) # '\ud83d\ude2c \ud83d\ude0e hello'
u[0] == '\ud83d' and u[1] == '\ude2c'
len(u) == 11
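
For reference, the expected length of 11 follows from counting UTF-16 code units, where each character above U+FFFF takes two. A minimal check, using only the standard library (the variable names `code_points` and `code_units` are just illustrative):

```python
s = '\U0001f62c \U0001f60e hello'

# Python 3 strings are sequences of code points, so each emoji counts once.
code_points = len(s)

# UTF-16 uses two 16-bit code units (a surrogate pair) for every code point
# above U+FFFF. 'utf-16-le' adds no BOM, and each code unit is 2 bytes.
code_units = len(s.encode('utf-16-le')) // 2

print(code_points, code_units)
```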

Any thoughts on writing a function to_unicode that takes s and converts it to u? I may be misunderstanding how strings and Unicode work in Python 3, so any help or corrections would be appreciated.

2 answers

One way is to encode each character as UTF-16 and turn the resulting code units back into characters.

s = '\U0001f62c \U0001f60e hello'

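def pairup(b):
    # Combine each pair of big-endian bytes into one 16-bit code unit.
    return [(b[i] << 8 | b[i+1]) for i in range(0, len(b), 2)]

def utf16(c):
    # Encode one character as UTF-16 (big-endian, no BOM) and return
    # a string whose characters are the resulting code units.
    e = c.encode('utf_16_be')
    return ''.join(chr(x) for x in pairup(e))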

u = ''.join(utf16(c) for c in s)
print(repr(u))
print(u[0] == '\ud83d' and u[1] == '\ude2c')
print(len(u))

'\ud83d\ude2c \ud83d\ude0e hello'
True
11
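
The same code units can also be computed arithmetically, without going through an encoder. The sketch below follows the standard UTF-16 surrogate formula; the helper name `to_surrogate_pair` is hypothetical and not part of the answer above.

```python
def to_surrogate_pair(ch):
    """Return ch unchanged if it is in the BMP, else its two surrogates."""
    cp = ord(ch)
    if cp < 0x10000:
        return ch
    v = cp - 0x10000              # 20-bit offset above the BMP
    high = 0xD800 + (v >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return chr(high) + chr(low)

s = '\U0001f62c \U0001f60e hello'
u = ''.join(to_surrogate_pair(ch) for ch in s)
print(ascii(u))
```

For U+1F62C the offset is 0xF62C, which splits into 0x3D and 0x22C, giving 0xD83D and 0xDE2C.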



Alternatively, you can convert only the characters outside the Unicode BMP:

#!/usr/bin/env python3
import re

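def as_surrogates(astral):
    # Encode the matched run of non-BMP characters as UTF-16, then decode
    # each 2-byte code unit on its own; 'surrogatepass' allows the lone
    # surrogates that result.
    b = astral.group().encode('utf-16be')
    return ''.join([b[i:i+2].decode('utf-16be', 'surrogatepass')
                    for i in range(0, len(b), 2)])

s = '\U0001f62c \U0001f60e hello'
# Match only runs of characters outside the BMP (above U+FFFF).
u = re.sub(r'[^\u0000-\uFFFF]+', as_surrogates, s)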
print(ascii(u))
assert u.encode('utf-16', 'surrogatepass').decode('utf-16') == s

'\ud83d\ude2c \ud83d\ude0e hello'
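
Going back from the surrogate form to the original string uses the same round trip as the assert in the code above. A small sketch, where the helper name `from_surrogates` is hypothetical:

```python
def from_surrogates(u):
    # 'surrogatepass' lets the encoder write lone surrogates as raw UTF-16
    # code units; the decoder then recombines each high/low pair.
    return u.encode('utf-16', 'surrogatepass').decode('utf-16')

u = '\ud83d\ude2c \ud83d\ude0e hello'
s = from_surrogates(u)
print(len(u), len(s))
```

Note that a string containing lone surrogates usually cannot be printed directly to a UTF-8 terminal, which is why the answers display it with repr() or ascii().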

Source: https://habr.com/ru/post/1609338/

