I assume you have run into a ligature: professional fonts have glyphs that combine several individual characters into one (better-looking) glyph. So instead of rendering "f" and "i" as two glyphs, the font has a single "ﬁ" glyph. Compare "fi" (two letters) with "ﬁ" (a single character, U+FB01).
In Python, you can use the unicodedata module to work with Unicode text. You can also use conversion to the NFKD normal form to split ligatures:
>>> import unicodedata
>>> unicodedata.name(u'\uFB01')
'LATIN SMALL LIGATURE FI'
>>> unicodedata.normalize("NFKD", u'Arti\uFB01cial Immune System')
u'Artificial Immune System'
So normalizing your strings with NFKD should help. If you find that this changes too much (NFKD also decomposes accented letters and other compatibility characters), my best suggestion is to make a small table of the ligatures you want to split and replace them manually:
>>> ligatures = {0xFB00: u'ff', 0xFB01: u'fi'}
>>> u'Arti\uFB01cial Immune System'.translate(ligatures)
u'Artificial Immune System'
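To see how the two approaches differ, here is a minimal sketch, assuming Python 3 (where every str is Unicode and the u prefix is optional). The sample string is made up for illustration; it contains a ligature, a superscript two, and an accented letter.

```python
import unicodedata

# Hypothetical sample text: "Artiﬁcial x² café"
text = "Arti\ufb01cial x\u00b2 caf\u00e9"

# NFKD rewrites every character that has a decomposition:
# the ligature is split, but the superscript becomes a plain "2"
# and the accented e becomes "e" plus a combining acute accent.
nfkd = unicodedata.normalize("NFKD", text)

# A translation table only touches the code points listed in it.
ligatures = {0xFB00: "ff", 0xFB01: "fi"}
targeted = text.translate(ligatures)

print(ascii(nfkd))
print(ascii(targeted))
```

If only the ligatures bother you, the table keeps the rest of the text byte-for-byte as it was, which is probably what you want when the text is compared or stored afterwards.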
See the Wikipedia article for a list of Unicode ligatures.
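If you would rather not type the table by hand, one possible sketch (again assuming Python 3) is to build it from the Latin ligatures at U+FB00 through U+FB06, letting NFKD supply each replacement. The helper name build_ligature_table and its default range are my own choices for illustration, not part of any library.

```python
import unicodedata

def build_ligature_table(first=0xFB00, last=0xFB06):
    """Map each ligature code point in [first, last] to its NFKD expansion."""
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        # Only keep characters whose Unicode name marks them as ligatures.
        if "LIGATURE" in unicodedata.name(ch, ""):
            table[cp] = unicodedata.normalize("NFKD", ch)
    return table

table = build_ligature_table()
cleaned = "e\ufb03cient \ufb02ow".translate(table)
print(cleaned)
```

This keeps the replacement limited to the code points you chose, while still relying on the Unicode database for the actual letter sequences.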