Convert hexadecimal character (ligature) to utf-8 character

I have text content that was converted from a PDF file. It contains several unwanted characters, and I want to convert them to ordinary UTF-8 characters.

For instance, “artificial immune system” comes out as “artiﬁcial immune system”: the "fi" is converted to a single character. I used gdex to find out the character's code value, but I don't know how to replace it with the real letters throughout the content.

1 answer

I assume you are seeing ligatures: professional fonts have glyphs that combine several individual characters into one (better-looking) glyph. So instead of drawing "f" and "i" as two glyphs, the font has a single "ﬁ" glyph. Compare "fi" (two letters) with "ﬁ" (one character).

In Python, you can use the unicodedata module to manipulate Unicode text. Converting a string to the NFKD normal form splits ligatures into their component letters:

    >>> import unicodedata
    >>> unicodedata.name('\uFB01')
    'LATIN SMALL LIGATURE FI'
    >>> unicodedata.normalize("NFKD", 'Arti\uFB01cial Immune System')
    'Artificial Immune System'
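As a hedged illustration of what NFKD does beyond ligatures, the Python 3 sketch below normalizes a made-up sample string (the sample text and variable names are assumptions, not taken from the question). It also shows NFKC, a related normal form accepted by the same function, which applies the same compatibility mappings but then recomposes accented letters.

```python
import unicodedata

# Sample text containing a ligature (U+FB01), a precomposed accented
# letter (U+00E9) and a superscript two (U+00B2).
sample = "Arti\ufb01cial caf\u00e9 x\u00b2"

nfkd = unicodedata.normalize("NFKD", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# ascii() makes the otherwise invisible differences visible.
print(ascii(nfkd))  # 'Artificial cafe\u0301 x2'
print(ascii(nfkc))  # 'Artificial caf\xe9 x2'
```

Both forms split the ligature and flatten the superscript. NFKD additionally leaves the accent as a separate combining character, which may or may not be what you want for text extracted from a PDF.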

So normalizing your strings with NFKD should help. If you find that it splits too much apart (NFKD decomposes every compatibility character, not only ligatures), my best suggestion is to make a small mapping table of the ligatures you want to split and replace them manually:

    >>> ligatures = {0xFB00: 'ff', 0xFB01: 'fi'}
    >>> 'Arti\uFB01cial Immune System'.translate(ligatures)
    'Artificial Immune System'

See the Wikipedia article for a list of Unicode ligatures.


Source: https://habr.com/ru/post/1395204/

