Counting UTF-16 code units in Python

Question

I get some data from the Telegram Bot API through the python-telegram-bot library. The data is returned as UTF-8 encoded JSON. Example (fragment):

{'message': {'text': '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå', 'entities': [{'type': 'url', 'length': 21, 'offset': 11}], 'message_id': 2655}}

You can see that "entities" contains a single entity of type url, which has a length and an offset. Now say that I want to extract the URL from the "text" attribute:

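data = {'message': {'text': '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå', 'entities': [{'type': 'url', 'length': 21, 'offset': 11}], 'message_id': 2655}}
# Both the text and its entities live under the 'message' key.
text = data['message']['text']
entities = data['message']['entities']
for entity in entities:
    start = entity['offset']
    end = start + entity['length']
    # Slices by Python code point index, which is what goes wrong here.
    print('Url: ', text[start:end])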

However, the above code returns '://google.com/æøå', which is clearly not the actual URL.
The reason for this is that the offset and length are given in UTF-16 code units. So my question is: is there a way to work with UTF-16 code units in Python? I do not need more than being able to count them.

I already tried:

text.encode('utf-8').decode('utf-16')

But this gives an error: UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xa5 in position 48: truncated data

Any help would be greatly appreciated. I am using Python 3.5, but since this is for a unified library, it would be nice to get it working in Python 2.x as well.

python python-3.x encoding utf-8 utf-16

bomjacob Sep 01 '16 at 20:13

1 answer

Martijn Pieters · Accepted Answer · 2016-09-01T20:32:18+0000

Python has already decoded the UTF-8 encoded JSON data into Python (Unicode) strings, so there is no need to handle UTF-8 here.
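As a small illustration of that point, the sketch below parses a made-up payload (the `raw` bytes are an assumption for the example, not real Telegram output) and compares the two ways of counting on Python 3. The parsed value is an ordinary string indexed by code point, which is why offsets expressed in UTF-16 code units do not line up with it:

```python
import json

# Hypothetical raw payload: UTF-8 encoded JSON bytes, as an HTTP client might receive them.
raw = '{"text": "👨http://google.com/"}'.encode('utf-8')

# Decode the bytes and parse the JSON; the result is a Unicode string, not UTF-8 bytes.
text = json.loads(raw.decode('utf-8'))['text']

code_points = len(text)                           # Python 3 counts code points
utf16_units = len(text.encode('utf-16-le')) // 2  # the unit the entity offsets use

print(code_points, utf16_units)  # 19 20
```

The emoji is one code point but two UTF-16 code units, so the two counts differ by one.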

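To count UTF-16 code units, encode the string to UTF-16, take the length of the encoded bytes, and divide by two. Encode to utf-16-le or utf-16-be rather than plain utf-16, so that no BOM is added: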

>>> len(text.encode('utf-16-le')) // 2
32
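To see why the byte order is spelled out, here is a short sketch for Python 3 using the text from the question. The plain 'utf-16' codec prepends a two-byte BOM, which throws the count off by one. The last line is an alternative that avoids encoding altogether by counting two units for every code point above U+FFFF; it is offered as an assumption-level variant rather than part of the original approach:

```python
text = '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå'

# Plain 'utf-16' adds a 2-byte byte order mark, so the count is one too high.
with_bom = len(text.encode('utf-16')) // 2

# An explicit byte order adds no BOM.
without_bom = len(text.encode('utf-16-le')) // 2

# Alternative without encoding: astral code points take two UTF-16 code units.
counted = sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)

print(with_bom, without_bom, counted)  # 33 32 32
```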

To apply the entity offsets, encode to UTF-16, slice using the doubled offsets, and then decode again:

text_utf16 = text.encode('utf-16-le')
for entity in entities:
    start = entity['offset']
    end = start + entity['length']
    entity_text = text_utf16[start * 2:end * 2].decode('utf-16-le')
    print('Url: ', entity_text)
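For reuse, the same encode, slice, and decode steps can be wrapped in a helper. This is a minimal self-contained sketch using the data from the question; the name `utf16_slice` is hypothetical and not part of python-telegram-bot. It is written for Python 3; the same approach should carry over to Python 2 when the text is a `unicode` object, but that is an untested assumption here:

```python
def utf16_slice(text, offset, length):
    """Return the part of text addressed by a UTF-16 code unit offset and length."""
    encoded = text.encode('utf-16-le')  # 2 bytes per UTF-16 code unit, no BOM
    return encoded[offset * 2:(offset + length) * 2].decode('utf-16-le')


data = {'message': {'text': '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå',
                    'entities': [{'type': 'url', 'length': 21, 'offset': 11}],
                    'message_id': 2655}}

message = data['message']
urls = [utf16_slice(message['text'], entity['offset'], entity['length'])
        for entity in message['entities']
        if entity['type'] == 'url']

print(urls)  # ['http://google.com/æøå']
```

Encoding once per message and slicing the bytes, as in the loop above, is cheaper when a message carries many entities; the helper re-encodes on every call for simplicity.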
