Python "denormalize" unicode combining characters

I want to standardize some Unicode text in Python. Is there an easy way to get the "denormalized" (composed) form of a Unicode character followed by a combining character? For example, given the sequence u'o\xaf' (i.e. Latin small letter o followed by a combining macron), I want to get ō (Latin small letter o with macron). It is easy to go the other way:

import unicodedata
o = unicodedata.lookup("LATIN SMALL LETTER O WITH MACRON")
o = unicodedata.normalize('NFD', o)  # decomposes to u'o\u0304'
2 answers

As I said, U+00AF is not a combining macron. But you can convert it to U+0020 U+0304 with NFKD normalization.

>>> unicodedata.normalize('NFKD', u'o\u00af')
u'o \u0304'

Then you can remove the space and get ō using NFC.
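The three steps above (NFKD, drop the space, NFC) can be sketched as one helper. This is only an illustration under some assumptions: the function name compose_spacing_marks is made up here, and the code drops a space only when it is immediately followed by a combining character, so that ordinary spaces in the text survive. Keep in mind that NFKD is applied to the whole string, so other compatibility characters in the input will be decomposed too.

```python
import unicodedata

def compose_spacing_marks(text):
    # Hypothetical helper, not part of unicodedata.
    # Step 1: NFKD turns spacing marks such as U+00AF into
    # U+0020 followed by the combining form (U+0304).
    decomposed = unicodedata.normalize('NFKD', text)
    # Step 2: drop a space only when the next character is a
    # combining character (nonzero canonical combining class).
    kept = []
    for i, ch in enumerate(decomposed):
        followed_by_mark = (
            i + 1 < len(decomposed)
            and unicodedata.combining(decomposed[i + 1]) != 0
        )
        if ch == u' ' and followed_by_mark:
            continue
        kept.append(ch)
    # Step 3: NFC composes base letter + combining mark.
    return unicodedata.normalize('NFC', u''.join(kept))

print(compose_spacing_marks(u'o\u00af') == u'\u014d')
```

A real space that happens to sit before a combining character in the input would also be removed, so this is best applied only to text where that cannot occur.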


(Note that NFKD also decomposes other compatibility characters, which can lose information, for example:

  • '½' (U+00BD) ↦ '1' '⁄' (U+2044) '2';
  • '²' (U+00B2) ↦ '2';
  • '①' (U+2460) ↦ '1'.)
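The information loss from NFKD is easy to check directly. The snippet below is a small sketch that runs the three characters listed above through unicodedata.normalize; the variable names samples and results are only illustrative.

```python
import unicodedata

# Compatibility characters from the list above.
samples = [u'\u00bd', u'\u00b2', u'\u2460']  # ½, ², ①

# Map each character to its NFKD decomposition.
results = {ch: unicodedata.normalize('NFKD', ch) for ch in samples}

for ch in samples:
    # Show the code points of the decomposed form.
    print(u'U+%04X -> %s' % (
        ord(ch),
        u' '.join(u'U+%04X' % ord(c) for c in results[ch]),
    ))
```

After decomposition, '²' and '①' are indistinguishable from plain digits, so the original text cannot be recovered with NFC or NFKC.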

o = unicodedata.normalize('NFC', o)

Source: https://habr.com/ru/post/1751963/

