I suspect that the [áàâã] part of your regular expression is not actually matching characters, but bytes. To the regex engine, the UTF-8 encoding of those characters makes the class look literally like this:
[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]
So when the regular expression is fed, for example, 'é' (\xC3\xA9), it looks at it byte by byte, matches \xC3, and replaces it with 'a'. It does this for every \xC3 byte it can find, so 'été' turns into 'a\xA9ta\xA9'.
Then the second regular expression, which looks like this:
[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]
comes along, matches the \xA9 bytes, and replaces each of them with 'e'. So 'a\xA9ta\xA9' now turns into 'aetae'.
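The two passes described above can be reproduced with a minimal sketch. It assumes the input arrives as raw UTF-8 bytes and that the script has no use utf8; in effect. The variable names are illustrative, and the byte escapes stand in for what the accented literals would look like to the regex engine in that situation.

```perl
use strict;
use warnings;

# Deliberately no "use utf8" here: every string and character class below
# is a sequence of raw bytes, not of characters.
my $text = "\xC3\xA9t\xC3\xA9";    # the UTF-8 bytes of 'été'

# First pass: what s/[áàâã]/a/g amounts to at the byte level.
# The class contains \xC3, so the lead byte of every 'é' is hit.
(my $pass1 = $text) =~ s/[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]/a/g;

# Second pass: what s/[éèêë]/e/g amounts to at the byte level.
# The leftover \xA9 bytes are in this class and become 'e'.
(my $pass2 = $pass1) =~ s/[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]/e/g;

print "$pass2\n";    # prints "aetae"
```

The intermediate value $pass1 holds the bytes 'a', \xA9, 't', 'a', \xA9, which is the 'a\xA9ta\xA9' stage described above.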
When you replace [áàâã] with (á|à|â|ã), the first pass matches whole characters correctly, but your second regular expression still has the original problem, so this time it is the \xC3 bytes that get replaced with 'e'.
If this still happens even with use utf8; in effect, then there may be a bug (or at least a limitation) in the Perl regex engine.
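For comparison, here is a hedged sketch of the character-level behaviour you would expect once both sides are characters. It assumes the script file is saved as UTF-8, so that use utf8; makes the accented literals characters, and that the input bytes are decoded with the core Encode module before matching. The decoding step is an assumption of this sketch and is not something shown in your code.

```perl
use strict;
use warnings;
use utf8;                              # accented literals in this file are characters
use Encode qw(decode encode);

my $bytes = "\xC3\xA9t\xC3\xA9";       # raw UTF-8 bytes of 'été', e.g. as read from a file
my $text  = decode('UTF-8', $bytes);   # now a 3-character string

# Both classes now match whole characters, so neither pass can
# touch half of a multi-byte sequence.
$text =~ s/[áàâã]/a/g;
$text =~ s/[éèêë]/e/g;

print encode('UTF-8', $text), "\n";    # prints "ete"
```

If the same input still comes out as 'aetae' in a setup like this, that would point toward the engine rather than the encoding of the data.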