Regex - match a character and all its diacritic variations (aka accent-insensitive)

Question

I am trying to match a character and all of its possible diacritic variations (aka accent-insensitive) with a regular expression. Of course, I could do the following:

re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é")

but this is not a general solution. If I use Unicode categories like \pL, I cannot restrict the match to a specific base character, in this case e.

python python-3.x regex diacritics accent-insensitive

Felk Mar 03 '16 at 21:12

1 answer

Felk · Accepted Answer · 2016-03-03T21:12:36+0000

A workaround to achieve your desired goal would be to use unidecode to get rid of all diacritics first, and then just match against a regular e:

 re.match(r"^e$", unidecode("é"))

Or, in this simplified case, a plain comparison is enough:

 unidecode("é") == "e"

Another solution, which does not depend on the unidecode library, preserves Unicode and gives you more control, is to remove the diacritics manually as follows:

Use unicodedata.normalize() to turn your input string into normal form D (for decomposed), making sure that composite characters like é are turned into the decomposed form e\u0301 (e + COMBINING ACUTE ACCENT):

 >>> import unicodedata
 >>> input = "Héllô"
 >>> input
 'Héllô'
 >>> normalized = unicodedata.normalize("NFKD", input)
 >>> normalized
 'He\u0301llo\u0302'
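To make the decomposition step more concrete, here is a small sketch that lists the resulting code points by name. The variable names are arbitrary and only for illustration. It also hints at why the session above passes "NFKD" rather than plain "NFD": the K forms additionally apply compatibility mappings, for example splitting the ﬁ ligature into two letters.

```python
import unicodedata

# U+00E9 is the precomposed "é"; NFD splits it into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")
names = [unicodedata.name(c) for c in decomposed]
print(names)  # ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# NFKD also applies compatibility mappings; NFD leaves the ligature U+FB01 alone.
ligature = "\ufb01"
nfd_ligature = unicodedata.normalize("NFD", ligature)
nfkd_ligature = unicodedata.normalize("NFKD", ligature)
print(nfd_ligature == ligature)  # True
print(nfkd_ligature)             # fi
```

Whether the compatibility mappings are wanted depends on the use case; if only accents should be touched, "NFD" may be the more conservative choice.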

Then remove all code points that fall into the category Mark, Nonspacing (Mn for short). These are all characters that have no width of their own and simply decorate the preceding character. Use unicodedata.category() to determine the category.

 >>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
 >>> stripped
 'Hello'

The result can then be used as the input for regex matching, just as in the unidecode example. Here is the whole thing as a function:

import unicodedata

def remove_diacritics(text):
    """
    Returns a string with all diacritics (aka non-spacing marks) removed.
    For example "Héllô" will become "Hello".
    Useful for comparing strings in an accent-insensitive fashion.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")
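To tie this back to the question, the following is a minimal sketch of how the function might be combined with re. The helper name fullmatch_ignoring_diacritics is hypothetical (it is not part of any library), and remove_diacritics is repeated only so that the snippet runs on its own.

```python
import re
import unicodedata


def remove_diacritics(text):
    """Return text with all non-spacing marks (category Mn) removed."""
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")


def fullmatch_ignoring_diacritics(pattern, text):
    """Hypothetical helper: match an accent-free pattern against the stripped text."""
    return re.fullmatch(pattern, remove_diacritics(text)) is not None


# A few of the variants from the question all reduce to a plain "e".
variants = ["e", "\u00e9", "\u00ea", "\u00eb", "\u0113", "\u0119"]  # e é ê ë ē ę
results = [fullmatch_ignoring_diacritics(r"e", v) for v in variants]
print(all(results))  # True

# Letters without a Unicode decomposition, such as "ø" (U+00F8), pass through unchanged.
print(remove_diacritics("\u00f8") == "\u00f8")  # True
```

One limitation of this approach: the pattern is matched against the stripped string, so match positions refer to the stripped text rather than the original input, and letters like ø or ł that Unicode does not decompose are not reduced to o or l.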
