A workaround to achieve your desired goal is to use unidecode to strip all diacritics first, and then match against the plain e:
import re
from unidecode import unidecode

re.match(r"^e$", unidecode("é"))
Or, in this simplified case:
unidecode("é") == "e"
Another solution, which is independent of the unidecode library, preserves Unicode, and gives you more control, is to remove the diacritics manually:
Use unicodedata.normalize() to convert your input string to NFD or NFKD normal form (the "D" stands for decomposition), which turns composed characters like é into the decomposed form e\u0301 (e + COMBINING ACUTE ACCENT):
>>> import unicodedata
>>> input = "Héllô"
>>> input
'Héllô'
>>> normalized = unicodedata.normalize("NFKD", input)
>>> normalized
'He\u0301llo\u0302'
Then remove all code points that fall into the Mark, Nonspacing category (short: Mn). These are characters that have no width of their own and simply decorate the preceding character. Use unicodedata.category() to determine the category of a character.
>>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
>>> stripped
'Hello'
The result can then be used for regular expression matching, just as in the unidecode example. Here it is all as a function:
import unicodedata

def remove_diacritics(text):
    """
    Returns a string with all diacritics (aka non-spacing marks) removed.
    For example "Héllô" will become "Hello".
    Useful for comparing strings in an accent-insensitive fashion.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")
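As a quick sanity check, here is a self-contained sketch showing how such a function can be combined with re for accent-insensitive matching (the sample strings and the re.match call are illustrative, not part of the original answer):

```python
import re
import unicodedata

def remove_diacritics(text):
    """Strip non-spacing marks (category Mn) after NFKD decomposition."""
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

# Accent-insensitive comparison:
print(remove_diacritics("Héllô"))  # Hello

# Accent-insensitive regex matching, as in the unidecode example:
print(re.match(r"^e$", remove_diacritics("é")) is not None)  # True
```

Note that, unlike unidecode, this only removes combining marks; characters without a decomposition (such as ø or ß) pass through unchanged.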