Python "denormalize" unicode combining characters

Question

I want to normalize some Unicode text in Python. Is there an easy way to get the "denormalized" (composed) form of a character followed by a Unicode combining character? For example, given the sequence u'o\xaf' (i.e. LATIN SMALL LETTER O followed by a combining macron), I want to get ō (LATIN SMALL LETTER O WITH MACRON). It is easy to go the other way:

import unicodedata
o = unicodedata.lookup("LATIN SMALL LETTER O WITH MACRON")
o = unicodedata.normalize('NFD', o)

+3

python unicode

Puzzled79 Jun 27 '10 at 9:11

2 answers

o = unicodedata.normalize('NFC', o)
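
A minimal sketch of what this one-liner does, assuming Python 3 and assuming the input really contains a combining character (here U+0304 COMBINING MACRON, not the U+00AF from the question). The variable names are illustrative only.

```python
import unicodedata

# 'o' followed by U+0304 COMBINING MACRON (a true combining character)
decomposed = "o\u0304"

# NFC applies canonical composition, merging the pair into one code point
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(unicodedata.name(composed))
```

NFC only composes canonically equivalent sequences, so it leaves u'o\xaf' unchanged.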

+3

Ignacio Vazquez-Abrams Jun 27 '10 at 9:20

kennytm · Accepted Answer · 2010-06-27T09:26:11+0000

As I said, U+00AF is not a combining macron. But you can convert it to U+0020 U+0304 with an NFKD transform.

>>> unicodedata.normalize('NFKD', u'o\u00af')
u'o \u0304'

Then you can remove the space and get ō using NFC.
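
The two steps above can be sketched end to end as follows, assuming Python 3. Removing only the space that directly precedes the combining macron is a choice made for this sketch (so ordinary spaces in the text survive), not something the answer prescribes.

```python
import unicodedata

text = "o\u00af"  # 'o' + U+00AF MACRON (a spacing character, not a combining one)

# Step 1: compatibility decomposition turns U+00AF into U+0020 U+0304
decomposed = unicodedata.normalize("NFKD", text)

# Step 2: drop the space that NFKD placed before the combining macron
no_space = decomposed.replace(" \u0304", "\u0304")

# Step 3: canonical composition merges 'o' + U+0304 into U+014D
result = unicodedata.normalize("NFC", no_space)

print(repr(decomposed), repr(result))
```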

(Note that NFKD also applies every other compatibility decomposition, for example:

'½' (U+00BD) ↦ '1' '⁄' (U+2044) '2'
'²' (U+00B2) ↦ '2'
'①' (U+2460) ↦ '1'

.)
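
The compatibility mappings listed above can be checked directly. If those side effects are unwanted, one possible alternative (an assumption of this sketch, not part of the answer) is to swap U+00AF for U+0304 by hand and then apply NFC only. The helper name compose_spacing_macron is hypothetical.

```python
import unicodedata

# NFKD rewrites compatibility characters, not just the macron
for ch in ("\u00bd", "\u00b2", "\u2460"):
    print(repr(ch), "->", repr(unicodedata.normalize("NFKD", ch)))

def compose_spacing_macron(text):
    """Hypothetical helper: treat U+00AF as a combining macron, then compose.

    Only U+00AF is rewritten; other compatibility characters are left alone
    because NFC performs no compatibility decomposition.
    """
    return unicodedata.normalize("NFC", text.replace("\u00af", "\u0304"))

sample = compose_spacing_macron("o\u00af x\u00b2")
print(repr(sample))
```

Here the superscript two is preserved while u'o\xaf' still becomes ō.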
