I get some data from an API (a Telegram bot). I am using the python-telegram-bot library, which interacts with the Telegram Bot API. Data is returned as UTF-8 encoded JSON. Example (fragment):
{'message': {'text': '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå', 'entities': [{'type': 'url', 'length': 21, 'offset': 11}], 'message_id': 2655}}
You can see that "entities" contains a single entity of type url, with a length and an offset. Now say that I want to extract the URL from the "text" attribute:
data = {'message': {'text': '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå', 'entities': [{'type': 'url', 'length': 21, 'offset': 11}], 'message_id': 2655}}
message = data['message']
text = message['text']
entities = message['entities']
for entity in entities:
    # offset and length come straight from the API response
    start = entity['offset']
    end = start + entity['length']
    print('Url: ', text[start:end])
However, the above code prints '://google.com/æøå', which is clearly not the actual URL.
The reason for this is that the offset and length are given in UTF-16 code units, while Python 3 indexes strings by code point. So my question is: is there a way to work with UTF-16 code units in Python? I do not need more than being able to count them.
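To illustrate the mismatch, here is a minimal sketch, assuming Python 3. The helper name utf16_len is my own and not part of any library. It counts UTF-16 code units by encoding to UTF-16-LE (which adds no BOM) and dividing the byte count by two:

```python
text = '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå'

def utf16_len(s):
    """Return the number of UTF-16 code units in s (hypothetical helper)."""
    # 'utf-16-le' emits exactly two bytes per code unit and no BOM.
    return len(s.encode('utf-16-le')) // 2

# Everything before the URL: four emoji joined by three zero-width joiners.
prefix = text[:7]
code_points = len(prefix)       # what Python 3 slicing counts
code_units = utf16_len(prefix)  # what the API offset counts
print(code_points, code_units)
```

Each emoji is one code point but two UTF-16 code units (a surrogate pair), which is where the offset of 11 comes from.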
I already tried:
text.encode('utf-8').decode('utf-16')
But this gives an error: UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xa5 in position 48: truncated data
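As far as I can tell, this fails because it reinterprets the UTF-8 bytes as UTF-16 instead of re-encoding the text, and the UTF-8 form has an odd number of bytes. One direction I am considering, though I am not sure it is the idiomatic way, is to slice the UTF-16-LE bytes directly. A minimal sketch, assuming Python 3; extract_entity is a hypothetical helper of mine:

```python
text = '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå'

def extract_entity(text, offset, length):
    """Slice text using an offset and length given in UTF-16 code units."""
    raw = text.encode('utf-16-le')  # two bytes per code unit, no BOM
    chunk = raw[offset * 2:(offset + length) * 2]
    return chunk.decode('utf-16-le')

utf8_byte_count = len(text.encode('utf-8'))  # odd, so it cannot be valid UTF-16
url = extract_entity(text, 11, 21)
print('Url: ', url)
```

This assumes the offsets never fall in the middle of a surrogate pair; if they did, the decode step would raise an error.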
Any help would be greatly appreciated. I am using Python 3.5, but since this is for a unified library, it would be nice to get it working in Python 2.x as well.