Python: find the equivalent surrogate pair for a Unicode character

The answer to "How to work with surrogate pairs in Python?" explains how to convert a surrogate pair such as '\ud83d\ude4f' into a single non-BMP Unicode character (the answer is "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do the reverse. How can I, using Python, find the equivalent surrogate pair for a non-BMP character, i.e. convert '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I could not find a clear answer to this question.

2 answers

You will have to manually replace each non-BMP code point with its surrogate pair. You can do this with a regex:

import re

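# Matches any single code point outside the Basic Multilingual Plane.
_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    # UTF-16 encodes a non-BMP code point as two 16-bit code units (4 bytes).
    encoded = char.encode('utf-16-le')
    # Turn each code unit back into a lone surrogate code point.
    return (
        chr(int.from_bytes(encoded[:2], 'little')) +
        chr(int.from_bytes(encoded[2:], 'little')))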

def with_surrogates(text):
    return _nonbmp.sub(_surrogatepair, text)

Demo:

>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
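
If you would rather skip the encode step, the same pair can be computed with plain arithmetic, following the UTF-16 definition: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. This is only a sketch; the name surrogate_pair is made up for illustration and is not part of the answer above.

```python
def surrogate_pair(char):
    """Return the UTF-16 surrogate pair for one non-BMP character (hypothetical helper)."""
    code_point = ord(char)
    if code_point <= 0xFFFF:
        raise ValueError('character is in the BMP and has no surrogate pair')
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return chr(high) + chr(low)


pair = surrogate_pair('\U0001f64f')
assert pair == '\ud83d\ude4f'
```

It could be used as the replacement callback body in place of the byte slicing, at the cost of spelling out the bit layout yourself.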

This is a bit convoluted, but here is a one-liner that converts a single character:

>>> import struct
>>> emoji = '\U0001f64f'
>>> ''.join(chr(x) for x in struct.unpack('>2H', emoji.encode('utf-16be')))
'\ud83d\ude4f'

To convert a string that mixes BMP and non-BMP characters, wrap this expression in another generator expression that leaves BMP characters unchanged:

>>> emoji_str = 'Here is a non-BMP character: \U0001f64f'
>>> ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in emoji_str)
'Here is a non-BMP character: \ud83d\ude4f'
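
As a sanity check, the result can be fed back through the conversion quoted in the question to confirm that it round-trips. A minimal sketch, assuming the surrogates in the string are correctly paired; from_surrogates is a hypothetical name used only here.

```python
def from_surrogates(text):
    """Merge surrogate pairs back into non-BMP characters (hypothetical helper)."""
    # 'surrogatepass' lets the lone surrogate code points through the encoder;
    # the UTF-16 decoder then reads each high/low pair as one code point.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


paired = 'Here is a non-BMP character: \ud83d\ude4f'
restored = from_surrogates(paired)
assert restored == 'Here is a non-BMP character: \U0001f64f'
assert len(paired) == len(restored) + 1  # two surrogates collapse into one character
```

Note that a string containing lone surrogates cannot be encoded as UTF-8 without an error handler such as 'surrogatepass', so printing or writing the paired form may fail depending on the output encoding.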

Source: https://habr.com/ru/post/1658717/

