This is a VFAQ. See this SO question or google "python asciify" or "python unaccent".
To build a decent dictionary for use with unicode.translate, you need an approach that automatically detects the simple cases and flags those that need a manual entry. A good approach is to blast through the BMP, looking at what unicodedata.name(unichr(the_ordinal), "") produces.
Automatic detection: re.match("LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ", name)
Otherwise, if the name matches "LATIN (SMALL|CAPITAL) LETTER [A-Z].+", you will need to make a manual entry.
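The scan described above can be sketched roughly as follows. This is a minimal sketch, not the answer's own code: it assumes Python 3 (so `chr` stands in for `unichr`), and the names `AUTO`, `MANUAL`, and `scan_bmp` are hypothetical, introduced only for illustration.

```python
import re
import unicodedata

# Names such as "LATIN SMALL LETTER E WITH ACUTE": the base letter is captured.
AUTO = re.compile(r"LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")
# Any other Latin letter name, e.g. "LATIN SMALL LETTER SHARP S".
MANUAL = re.compile(r"LATIN (SMALL|CAPITAL) LETTER [A-Z].+")


def scan_bmp():
    """Return (table, manual).

    table  maps ordinals to the base letter derived from the character name.
    manual lists ordinals whose names look Latin but need a hand-written entry.
    """
    table = {}
    manual = []
    for ordinal in range(0x80, 0x10000):  # the BMP, skipping ASCII itself
        name = unicodedata.name(chr(ordinal), "")  # "" for unnamed code points
        m = AUTO.match(name)
        if m:
            case, letter = m.groups()
            table[ordinal] = letter if case == "CAPITAL" else letter.lower()
        elif MANUAL.match(name):
            manual.append(ordinal)
    return table, manual


table, manual = scan_bmp()
```

Printing `unicodedata.name(chr(o))` for each entry in `manual` gives the list of characters to review by hand. Note that the automatic pattern also catches names such as "LATIN CAPITAL LETTER O WITH STROKE", so it is worth checking whether the derived base letter is what you want in each case.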
Important note: unicode.translate takes "a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None" ... so you can replace, for example, capital THORN with "Th".
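As an illustration of the three kinds of mapping values, here is a small sketch. It assumes Python 3, where `str.translate` accepts the same kind of mapping; the table contents and the `asciify` helper are hypothetical choices made for the example, not a recommended transliteration.

```python
# Hypothetical hand-written entries; the replacements are illustrative only.
manual_table = {
    0x00DE: "Th",      # LATIN CAPITAL LETTER THORN -> multi-character string
    0x00FE: "th",      # LATIN SMALL LETTER THORN
    0x00DF: "ss",      # LATIN SMALL LETTER SHARP S
    0x00E9: ord("e"),  # LATIN SMALL LETTER E WITH ACUTE -> another ordinal
    0x00AD: None,      # SOFT HYPHEN -> deleted
}


def asciify(text, table=manual_table):
    """Map each character through the table; unmapped characters pass through."""
    return text.translate(table)
```

Characters that have no entry in the table are left unchanged, so gaps in the table show up in the output rather than silently disappearing.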
Here is why using unicodedata.normalize is not a good idea:
Characters whose normalized form does not start with a character in the ASCII range are deleted. This includes not only all punctuation (which you may not care about), but also letters that are NOT "accented", for example ß.
    >>> from unicodedata import name, normalize
    >>> for i in range(0xA0, 0x100):
    ...     c = unichr(i)
    ...     a = normalize('NFKD', c).encode('ascii', 'ignore')
    ...     if not a:
    ...         print("FAIL: U+%04X %s" % (i, name(c)))
    ...
    FAIL: U+00A1 INVERTED EXCLAMATION MARK
    FAIL: U+00A2 CENT SIGN
    FAIL: U+00A3 POUND SIGN
    FAIL: U+00A4 CURRENCY SIGN
    FAIL: U+00A5 YEN SIGN
    FAIL: U+00A6 BROKEN BAR
    FAIL: U+00A7 SECTION SIGN
    FAIL: U+00A9 COPYRIGHT SIGN
    FAIL: U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    FAIL: U+00AC NOT SIGN
    FAIL: U+00AD SOFT HYPHEN
    FAIL: U+00AE REGISTERED SIGN
    FAIL: U+00B0 DEGREE SIGN
    FAIL: U+00B1 PLUS-MINUS SIGN
    FAIL: U+00B5 MICRO SIGN
    FAIL: U+00B6 PILCROW SIGN
    FAIL: U+00B7 MIDDLE DOT
    FAIL: U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    FAIL: U+00BF INVERTED QUESTION MARK
    FAIL: U+00C6 LATIN CAPITAL LETTER AE
    FAIL: U+00D0 LATIN CAPITAL LETTER ETH
    FAIL: U+00D7 MULTIPLICATION SIGN
    FAIL: U+00D8 LATIN CAPITAL LETTER O WITH STROKE
    FAIL: U+00DE LATIN CAPITAL LETTER THORN
    FAIL: U+00DF LATIN SMALL LETTER SHARP S <<<<<<<<<<========== ß
    FAIL: U+00E6 LATIN SMALL LETTER AE
    FAIL: U+00F0 LATIN SMALL LETTER ETH
    FAIL: U+00F7 DIVISION SIGN
    FAIL: U+00F8 LATIN SMALL LETTER O WITH STROKE
    FAIL: U+00FE LATIN SMALL LETTER THORN
    >>>