Why do Perl string operations for Unicode characters add garbage to a string?

Question


Perl:

 $string =~ s/[áàâã]/a/gi; #This line always prepends an "a"
 $string =~ s/[éèêë]/e/gi;
 $string =~ s/[úùûü]/u/gi;

This regular expression should convert "été" to "ete". Instead, it converts it to "aetae". In other words, it prepends an "a" to each matched element. Even "à" is converted to "aa".

If I change the first line to this

 $string =~ s/(á|à|â|ã)/a/gi;

it works, but now it prepends an "e" to each matched element (for example, "eetee").

Despite the fact that I found a suitable solution, why does it behave this way?

Edit 1:

I added "use utf8;" but it did not change the behavior (although it broke my output in JavaScript/AJAX).

Edit 2:

The string comes from an Ajax request made by jQuery. The site it comes from is set to UTF-8.

I use Perl v5.10 (perl -v returns "This is perl, v5.10.0 built for i586-linux-thread-multi").

+4

regex perl unicode internationalization

Mike Oct 15 '09 at 12:39


7 answers

The problem is very likely that you do not have

 use utf8;

(or the equivalent for whatever encoding you use) in your program. The odd replacements you are seeing look like something more than an ordinary regex substitution problem.

 #!/usr/local/bin/perl
 use warnings;
 use strict;
 use utf8;                  # the source file itself is UTF-8
 binmode STDOUT, ":utf8";   # encode output as UTF-8
 my $string = "été";
 $string =~ s/[áàâã]/a/gi;
 $string =~ s/[éèêë]/e/gi;
 $string =~ s/[úùûü]/u/gi;
 print "$string\n";

prints

ete

If you are reading input from a file or from standard input, make sure you have set the utf8 layer on the stream, or whatever is appropriate for its encoding. For STDIN use

 binmode STDIN, ":utf8";

If you are reading a file, use

 open my $file, "<:utf8", "file_name"

to get the right encoding. If it is not UTF-8, use :encoding(name) instead of :utf8.

+8

user181548 Oct 15 '09 at 12:49


But did you really want to use regular expressions? Maybe something like Text::Unidecode would be better

 $ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
 ete

+7

oylenshpeegul Oct 15 '09 at 13:50


This is probably because your strings are UTF-8 but are being parsed as if they were not, or something similar.

Instead of using something like [áàâã], you should use something like [\xE0-\xE3]

and probably add use utf8; to your code.

+5

Mez Oct 15 '09 at 12:51


Something tells me this is because it does not know how to handle accented characters. Looking at your regular expression, everything seems fine. You could try adding:

 use utf8;

+2

David brunelle Oct 15 '09 at 12:50


This could also be a Unicode normalization problem, as certain systems (I'm looking at you, OS X) store extended Latin-1 glyphs in a specific normalized representation, which can break regular expressions when you refer to a character literally instead of by its Unicode or hexadecimal escape.

+2

squeeks Oct 15 '09 at 12:53


I would say that you should not use regular expressions here. The easiest way to achieve this (although it may be undesirable) is to convert your input string to US-ASCII. A suitable conversion table should know that e is the closest equivalent to é.

Another option is to use Unicode and normalize your string to NFD. This decomposes every accented letter into base letter + combining diacritic. Then you can simply go through the string and delete all the combining diacritical marks.
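The NFD approach described above could be sketched as follows. This is only a minimal illustration: strip_accents is a hypothetical helper name, and it assumes the input is already a decoded character string (not raw UTF-8 bytes). Unicode::Normalize ships with Perl.

```perl
use strict;
use warnings;
use utf8;                          # the "été" literal below is written in UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose accented letters, then drop the marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "é" becomes "e" + U+0301 (combining acute)
    $decomposed =~ s/\p{Mn}//g;    # delete nonspacing (combining) marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents("été"), "\n";  # prints "ete"
```

Unlike the hand-written character classes, this handles any letter that decomposes into a base letter plus combining marks, but it does nothing for characters without such a decomposition (for example "ø").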

+1

Joey Oct 16 '09 at 5:33


Ian clelland · Accepted Answer · 2009-10-15T17:16:09+0000

I suspect that what is happening is that the [áàâã] part of your regular expression is not actually matching characters, but bytes. With those characters encoded as UTF-8, that part of the regular expression literally looks like this:

 [\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]

So, when the regular expression is fed, for example, 'é' (\xC3\xA9), it looks at it one byte at a time, matches \xC3 and replaces it with 'a'. It does this for every \xC3 byte it can find. So 'été' turns into 'a\xA9ta\xA9'.

Then the second regular expression, which looks like this:

 [\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]

comes along, matches the \xA9 bytes and replaces them with 'e'. So now 'a\xA9ta\xA9' turns into 'aetae'.

When you replace [áàâã] with (á|à|â|ã), the first pass matches only complete characters, but then your second regular expression has the original problem: it replaces both the \xC3 and the \xA9 bytes with 'e', so each 'é' becomes 'ee' and 'été' turns into 'eetee'.

If this still happens even with use utf8;, then there may be a bug (or at least a limitation) in the Perl regex engine.
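The byte-versus-character explanation above can be reproduced with a short sketch. The helper names strip_as_bytes and strip_as_chars are made up for illustration, and the sketch assumes the input arrives as undecoded UTF-8 bytes (as it plausibly would from a form post). The /i flag is left out of the byte version to keep the byte-level behaviour easy to follow.

```perl
use strict;
use warnings;
use utf8;                          # accented literals below are characters
use Encode qw(decode encode);

# What the three classes look like when the pattern is seen as UTF-8 bytes.
sub strip_as_bytes {
    my ($bytes) = @_;
    $bytes =~ s/[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]/a/g;   # [áàâã]
    $bytes =~ s/[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]/e/g;   # [éèêë]
    $bytes =~ s/[\xC3\xBA\xC3\xB9\xC3\xBB\xC3\xBC]/u/g;   # [úùûü]
    return $bytes;
}

# The same substitutions applied to a decoded character string.
sub strip_as_chars {
    my ($chars) = @_;
    $chars =~ s/[áàâã]/a/gi;
    $chars =~ s/[éèêë]/e/gi;
    $chars =~ s/[úùûü]/u/gi;
    return $chars;
}

my $raw = encode('UTF-8', 'été');                    # bytes C3 A9 74 C3 A9
print strip_as_bytes($raw), "\n";                    # prints "aetae"
print strip_as_chars(decode('UTF-8', $raw)), "\n";   # prints "ete"
```

If this matches what is happening in the question, decoding the incoming request data before running the substitutions (and encoding again on output) would be the thing to try.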
