Regex - match a character and all its diacritic variations (aka accent-insensitive)

I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. Of course, I could do the following:

re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é") 

but this is not a generic solution. If I use Unicode categories like \pL, I can't restrict the match to a specific character, in this case e.

1 answer

A workaround to achieve your desired goal would be to use the third-party unidecode library to get rid of all diacritics first, and then just match against the regular e:

 re.match(r"^e$", unidecode("é")) 

Or, in this simplified case:

 unidecode("é") == "e" 

Another solution, which does not depend on the unidecode library, preserves Unicode, and gives more control, is to remove the diacritics manually, as follows:

Use unicodedata.normalize() to turn your input string into normal form D (for decomposed), making sure that composite characters like é get turned into the decomposed form e\u0301 (e + COMBINING ACUTE ACCENT). The example uses NFKD, the compatibility variant of form D.

 >>> import unicodedata
 >>> text = "Héllô"
 >>> text
 'Héllô'
 >>> normalized = unicodedata.normalize("NFKD", text)
 >>> print(ascii(normalized))
 'He\u0301llo\u0302'
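
The choice of normal form matters at this step. As a small sketch (the sample characters are my own choice, not taken from the question), NFD performs only canonical decomposition, while NFKD, used in the session above, also replaces compatibility characters such as ligatures:

```python
import unicodedata

# Canonical decomposition: é becomes e + U+0301 COMBINING ACUTE ACCENT.
nfd_e = unicodedata.normalize("NFD", "\u00e9")
print(ascii(nfd_e))  # 'e\u0301'

# The ligature U+FB01 (LATIN SMALL LIGATURE FI) only has a compatibility
# decomposition, so NFD leaves it alone while NFKD expands it to "fi".
nfd_fi = unicodedata.normalize("NFD", "\ufb01")
nfkd_fi = unicodedata.normalize("NFKD", "\ufb01")
print(ascii(nfd_fi), ascii(nfkd_fi))  # '\ufb01' 'fi'
```

If the text should otherwise stay untouched, NFD is probably the more conservative choice. NFKD is convenient when ligatures and similar compatibility forms should also be folded.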

Then remove all code points that fall into the category Mark, Nonspacing (Mn for short). These are characters that have no width of their own and simply decorate the preceding character. Use unicodedata.category() to determine the category.

 >>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
 >>> stripped
 'Hello'

The result can then be used as the input for regular expression matching, just as in the unidecode example. Here it is all as a function:

 import unicodedata

 def remove_diacritics(text):
     """
     Returns a string with all diacritics (aka non-spacing marks) removed.
     For example "Héllô" will become "Hello".
     Useful for comparing strings in an accent-insensitive fashion.
     """
     normalized = unicodedata.normalize("NFKD", text)
     return "".join(c for c in normalized if unicodedata.category(c) != "Mn")
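
As a rough usage sketch, the function can be applied to the subject string before running an ordinary, accent-free pattern. The helper name matches_ignoring_accents and the sample strings are illustrative assumptions, not part of the original answer:

```python
import re
import unicodedata


def remove_diacritics(text):
    """Same function as above, repeated so the sketch is self-contained."""
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")


def matches_ignoring_accents(pattern, text):
    """Hypothetical helper: match a plain pattern against accent-stripped text."""
    return re.fullmatch(pattern, remove_diacritics(text)) is not None


print(matches_ignoring_accents(r"e", "\u00e9"))   # é -> True
print(matches_ignoring_accents(r"e", "\u1ebd"))   # ẽ -> True
print(matches_ignoring_accents(r"e", "a"))        # False
print(matches_ignoring_accents(r"Hello", "H\u00e9ll\u00f4"))  # Héllô -> True

# Letters such as ø (U+00F8) have no decomposition into base + mark,
# so this approach leaves them unchanged.
print(remove_diacritics("\u00f8") == "\u00f8")    # True
```

The pattern itself has to be written without accents, and matching is still case-sensitive unless re.IGNORECASE is passed separately.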

Source: https://habr.com/ru/post/1244388/

