Python: the correct way to index into a Unicode string

I'm not sure this is exactly the problem, but I'm trying to wrap the first letter of a Unicode string in a tag, and it doesn't seem to work. Could this be because indexing a Unicode string works differently than indexing a regular string?

Now my code is:

for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>"+string[0]+"</b>"+string[index:]

And I get the output as follows:

<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים. 

The Unicode text gets mangled when I insert the HTML tag. I tried changing the insertion position, but made no progress.

An example of the desired output (Hebrew goes from right to left):

>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"

By the way, this is for Python 2.

+4
3 answers

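You are working with a byte string, i.e. the plain String type in Python (2.x).

In Python (2.x), indexing a String counts bytes, not characters. To work with characters you need a Unicode object: decode the String to Unicode on the way in, do the slicing there, and encode back to a String on the way out.

UTF8 is a raw byte encoding of Unicode (Unicode itself is not UTF8; UTF8 is only one way to store it), and a Hebrew letter takes two bytes in it, so string[0] returns half a letter. Decode first, then slice, then encode: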

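def bold_letters(string, index):
    # Decode the UTF8 bytes so that indexing counts characters, not bytes
    string = string.decode('utf8')
    string = u"<b>"+string[0]+u"</b>"+string[index:]
    # Encode back to a UTF8 byte string for output
    return string.encode('utf8')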

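With ASCII-only text you would not notice this, because UTF8 encodes every ASCII character as a single byte. For more on how Unicode works in Python, see http://nedbatchelder.com/text/unipain.html

In Python 3.x the String type is Unicode, so this problem does not arise there.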
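To see where the `?` characters in the question come from, here is a minimal sketch of the mismatch between bytes and characters. It assumes the text is UTF-8 encoded; the variable names are illustrative only, and the `u""` prefixes let it run on both Python 2.7 and 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: variable names are illustrative, not from the question.
text = u"הקדמה"                  # Unicode string: 5 characters
raw = text.encode("utf-8")       # byte string: each Hebrew letter takes 2 bytes

print(len(text))                 # 5
print(len(raw))                  # 10

# Slicing the byte string cuts the first letter in half ...
first_byte = raw[:1]
# ... and half a letter is not valid UTF-8, so it decodes to U+FFFD.
broken = first_byte.decode("utf-8", "replace")
print(broken == u"\ufffd")       # True

# Slicing the Unicode string returns the whole letter.
print(text[0] == u"\u05d4")      # True
```

A terminal or browser shows that lone byte as `?`, which matches the output in the question.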

+6

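Use Unicode strings. Your strings are UTF-8-encoded byte strings. Unicode strings are indexed by character (at least for characters in the BMP on Python 2... the first 65536 code points):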

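#coding:utf8
import io  # io.open accepts an encoding argument on Python 2 as well as 3

s = u"הקדמה"
# Indexing a Unicode string works per character, not per byte
t = u'<b>'+s[0]+u'</b>'+s[1:]
print(t)
# utf-8-sig writes a BOM so the browser can detect the encoding
with io.open('out.htm','w',encoding='utf-8-sig') as f:
    f.write(t)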

Output:

<b>ה</b>קדמה

Chrome displays out.htm like this:

[image: out.htm rendered in Chrome]
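The BMP caveat only matters for characters above U+FFFF. A narrow Python 2 build stores such a character as two UTF-16 code units, so `len()` reports 2 and `s[0]` returns half of it. A rough sketch that counts UTF-16 code units to show which characters are affected; `utf16_units` is a hypothetical helper name, not part of any library:

```python
# -*- coding: utf-8 -*-
def utf16_units(text):
    """Number of UTF-16 code units needed for text (hypothetical helper)."""
    # utf-16-le writes no BOM, and every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


hebrew_he = u"\u05d4"        # inside the BMP
emoji = u"\U0001F600"        # outside the BMP

print(utf16_units(hebrew_he))   # 1: safe to index on any build
print(utf16_units(emoji))       # 2: a surrogate pair on narrow builds
```

Hebrew letters are all inside the BMP, so the slicing above is safe for the text in the question.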

+3

The behavior you see indicates that you have a byte string rather than a Unicode string. Your code would work with a Unicode string, because Unicode strings are indexed the same way as "normal" ASCII strings, in Python 3 at least.
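One way to guard against this is to decode at the boundary, so that the slicing always sees a Unicode string. A minimal sketch, assuming the input is UTF-8 whenever it arrives as bytes; `bold_first_letter` is a hypothetical helper, not the code from the question:

```python
# -*- coding: utf-8 -*-
def bold_first_letter(text, encoding="utf-8"):
    """Wrap the first character of text in <b> tags (hypothetical helper)."""
    # On Python 2, bytes is an alias for str, so this catches byte strings there too.
    if isinstance(text, bytes):
        text = text.decode(encoding)   # assumption: the bytes are UTF-8
    if not text:
        return text
    return u"<b>" + text[0] + u"</b>" + text[1:]


from_text = bold_first_letter(u"הקדמה")
from_bytes = bold_first_letter(u"הקדמה".encode("utf-8"))
print(from_text == from_bytes)   # True: both inputs give the same Unicode result
```

The helper returns a Unicode string in both cases; encode it only when writing it out.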

+2

Source: https://habr.com/ru/post/1653011/

