Convert indices in str to indices in bytearray

I process some text and find the offsets of certain words in it. These offsets will be used by another application that treats the text as a sequence of bytes, so the str indices will be wrong for it.

Example:

>>> text = "“Hello there!” He said"
>>> text[7:12]
'there'
>>> text.encode('utf-8')[7:12]
b'o the'

So how can I convert indices into a str into indices into the encoded bytearray?

2 answers

Encode substrings and get their lengths in bytes:

text = "“Hello there!” He said"
start = len(text[:7].encode('utf-8'))
count = len(text[7:12].encode('utf-8'))
text.encode('utf-8')[start:start+count]

It gives b'there'.
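
The same idea can be wrapped in a small helper that returns the byte span for a given str span. This is only a sketch: the name char_span_to_byte_span is hypothetical, it assumes UTF-8 and non-negative indices, and it assumes the quotation marks in the example are curly quotes (U+201C and U+201D), which is what the byte offsets shown in the question imply.

```python
def char_span_to_byte_span(text, start, end):
    """Map the str slice text[start:end] to the matching slice
    of text.encode('utf-8'). Assumes non-negative indices."""
    # Bytes taken up by everything before the slice.
    byte_start = len(text[:start].encode('utf-8'))
    # Bytes taken up by the slice itself.
    byte_end = byte_start + len(text[start:end].encode('utf-8'))
    return byte_start, byte_end


text = "\u201cHello there!\u201d He said"
encoded = text.encode('utf-8')

byte_start, byte_end = char_span_to_byte_span(text, 7, 12)
print(byte_start, byte_end)           # 9 14
print(encoded[byte_start:byte_end])   # b'there'
```

The opening curly quote takes three bytes in UTF-8, which is why the byte span starts two positions later than the str span.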


This should work:

def byte_array_index(s, str_index): 
    return len(s[:str_index].encode('utf-8'))
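
If many indices have to be converted for the same text, re-encoding a prefix for every index repeats work. One possible alternative, sketched here under the assumption that the target encoding is UTF-8, is to build a lookup table once and index into it. The name build_byte_offsets is hypothetical, and the sample string again assumes curly quotes (U+201C and U+201D) around the quoted words.

```python
from itertools import accumulate


def build_byte_offsets(text):
    """Return a list where entry i is the UTF-8 byte offset of text[i].
    The final entry is the total encoded length, so the list has
    len(text) + 1 entries and end indices can be looked up too."""
    widths = (len(ch.encode('utf-8')) for ch in text)
    return [0] + list(accumulate(widths))


text = "\u201cHello there!\u201d He said"
offsets = build_byte_offsets(text)

# Convert the str span 7:12 to a byte span with two lookups.
print(offsets[7], offsets[12])                            # 9 14
print(text.encode('utf-8')[offsets[7]:offsets[12]])       # b'there'
```

Building the table costs one pass over the text, after which each conversion is a list lookup.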

Source: https://habr.com/ru/post/1693588/

