Accent-insensitive regex replacement in Python

In Python 3, I would like to use re.sub() in an "accent-insensitive" way, just as the re.I flag makes matching case-insensitive.

Maybe something like a re.IGNOREACCENTS flag:

original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"
accent_regex = r'a café'
re.sub(accent_regex, 'X', original_text, flags=re.IGNOREACCENTS)

This would result in "¿It's 80°C, I'm drinking X in X with Chloë。" (note that the accent on "Chloë" is still there) instead of "¿It's 80°C, I'm drinking X in a cafe with Chloë。", which is what real Python returns.

I think that such a flag does not exist. So what would be the best option? Using re.finditer and unidecode on both original_text and accent_regex, then replacing by slicing the original string? Or replacing every character in accent_regex with its accented variants, for example r'[cç][aàâ]f[éèêë]'?
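
For illustration, the second idea (one character class per letter) could be sketched roughly as follows. The helper names strip_accents and accent_insensitive, the VARIANTS table, and the choice of the range U+00C0 to U+017F are assumptions made for this sketch. It only handles literal patterns, not arbitrary regex syntax, and it assumes the text uses precomposed (NFC) characters.

```python
import re
import unicodedata
from collections import defaultdict

def strip_accents(s):
    # Decompose (NFD) and drop combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

# Assumption: only accented letters in U+00C0..U+017F are of interest.
VARIANTS = defaultdict(set)
for codepoint in range(0x00C0, 0x0180):
    char = chr(codepoint)
    base = strip_accents(char)
    if base != char and len(base) == 1:
        VARIANTS[base].add(char)

def accent_insensitive(literal):
    """Turn a literal string into a regex matching any accented variant."""
    parts = []
    for char in strip_accents(literal):
        variants = ''.join(sorted(VARIANTS.get(char, ())))
        if variants:
            parts.append('[' + char + variants + ']')
        else:
            parts.append(re.escape(char))
    return ''.join(parts)

original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"
print(re.sub(accent_insensitive('a café'), 'X', original_text))
# ¿It's 80°C, I'm drinking X in X with Chloë。
```

The generated pattern is long and hard to read, which is one reason to prefer stripping accents from both sides instead.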

2 answers

unidecode is often mentioned for removing accents in Python, but it does more than that: it converts '°' to 'deg', which might not be the desired output.

unicodedata seems to have enough functionality to remove accents.
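
As a small illustration of why unicodedata is enough, the sketch below shows what NFD normalization does to a single character. The helper name describe is only an assumption for this example. An accented letter splits into a base letter plus a combining mark of category 'Mn', while '°' has no decomposition and is left alone.

```python
import unicodedata

def describe(char):
    """Return (name, category) for each character of the NFD form."""
    return [(unicodedata.name(c), unicodedata.category(c))
            for c in unicodedata.normalize('NFD', char)]

print(describe('é'))
# [('LATIN SMALL LETTER E', 'Ll'), ('COMBINING ACUTE ACCENT', 'Mn')]
print(describe('°'))
# [('DEGREE SIGN', 'So')]
```

Dropping every 'Mn' character after decomposition therefore removes the accents and nothing else.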

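With any pattern

This method should work with any pattern and any text.

You can temporarily remove the accents from both the text and the regex pattern. The match information from re.finditer() (start and end indices) can then be used to modify the original, accented text.

Note that the matches must be processed in reverse order, so that a replacement does not shift the indices of the matches that still have to be handled.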

import re
import unicodedata

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."

accented_pattern = r'a café|François Déporte'

def remove_accents(s):
    # Decompose each character (NFD), then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

print(remove_accents('äöüßéèiìììíàáç'))
# aoußeeiiiiiaac

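# Strip the accents from the pattern too, so both sides are compared accent-free.
pattern = re.compile(remove_accents(accented_pattern))

modified_text = original_text
matches = list(pattern.finditer(remove_accents(original_text)))

# Replace from the last match to the first so that earlier indices stay valid.
for match in reversed(matches):
    modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]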

print(modified_text)
# I'm drinking a 80° café in X with Chloë, X and X.
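
One caveat: reusing the match indices on the original text assumes that the stripped text has exactly the same length as the original. That holds when every accented letter is a single precomposed code point (NFC), but not when the text already contains separate combining marks. A possible workaround, sketched here under that assumption, is to record for each stripped character the index it came from. The names strip_with_index_map and sub_ignoring_accents are hypothetical helpers, not part of any library.

```python
import re
import unicodedata

def strip_with_index_map(text):
    """Return (stripped, index_map), where index_map[i] is the index in
    `text` of the character that produced stripped[i]."""
    stripped = []
    index_map = []
    for i, char in enumerate(text):
        for c in unicodedata.normalize('NFD', char):
            if unicodedata.category(c) != 'Mn':
                stripped.append(c)
                index_map.append(i)
    # Sentinel entry so that a match ending at the very end can be mapped.
    index_map.append(len(text))
    return ''.join(stripped), index_map

def sub_ignoring_accents(pattern, repl, text):
    stripped_pattern, _ = strip_with_index_map(pattern)
    stripped_text, index_map = strip_with_index_map(text)
    result = text
    # Work backwards so earlier indices are not shifted by replacements.
    for match in reversed(list(re.finditer(stripped_pattern, stripped_text))):
        start = index_map[match.start()]
        end = index_map[match.end()]
        result = result[:start] + repl + result[end:]
    return result

decomposed = unicodedata.normalize('NFD', "I'm drinking a café in a cafe with Chloë.")
result = sub_ignoring_accents(r'a café', 'X', decomposed)
print(unicodedata.normalize('NFC', result))
# I'm drinking X in X with Chloë.
```

Because the end of a match is mapped to the position of the next kept character, any combining marks that trail the matched text are replaced along with it.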

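If the pattern is a word or a set of words

You could remove the accents from the pattern words, store them in a set for fast lookup, and then:

  • find every word in the text with \w+
  • remove the accents from each word:
    • if the result is in the set, replace the word with X
    • if it is not, leave the word untouched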

import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe with Chloë."

def remove_accents(string):
    return unidecode(string)

accented_words = ['café', 'français']

words_to_remove = set(remove_accents(word) for word in accented_words)

def remove_words(matchobj):
    word = matchobj.group(0)
    if remove_accents(word) in words_to_remove:
        return 'X'
    else:
        return word

print(re.sub(r'\w+', remove_words, original_text))
# I'm drinking a X in a X with Chloë.

You could use Unidecode:

$ pip install unidecode

Then strip the accents from the text before substituting:

import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe."
unidecoded_text = unidecode(original_text)
regex = r'cafe'
print(re.sub(regex, 'X', unidecoded_text))
# I'm drinking a X in a X.

Source: https://habr.com/ru/post/1675702/