ASCII character normalization

Question


I need to normalize a string such as "quée", but I cannot convert extended ASCII characters like é, á, í, etc. into their Roman/English equivalents. I have tried several different methods, but so far nothing works. There is quite a lot of material on this general topic, but I cannot find a working answer to this problem.

Here is my code:

    # transliteration solution (works great with standard chars but doesn't find the
    # special ones) - I've tried looking for both \x{130} and é with the same result.
    $mystring =~ tr/\\x{130}/e/;

    # converting into array, then iterating through and replacing the specific char
    # ( same result as the above solution )
    my @breakdown = split( "", $mystring );

    foreach ( @breakdown ) {
        if ( $_ eq "\x{130}" ) {
            $_ = "e";
            print "\nArray Output: @breakdown\n";
        }
        $lowercase = join( "", @breakdown );
    }

+6

perl ascii normalization

Andrew Coomes May 24, '12 at 17:19


4 answers

The reason your original code doesn't work is that \x{130} is not é. It is LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130, İ). You meant \x{E9} or just \xE9 (the braces are optional for two-digit hex numbers), which is LATIN SMALL LETTER E WITH ACUTE (U+00E9).

Also, you have an extra backslash in the tr; it should look like tr/\xE9/e/.

Your code will work with these changes, although I would still recommend using one of the CPAN modules for this kind of thing. I prefer Text::Unidecode myself, as it handles much more than just accented characters.
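As a minimal sketch of the two fixes above (the correct code point and the single backslash), the transliteration could look like this. The sample string and the variable name `$fixed` are illustrative assumptions, not part of the original code, and the input is written with an escape so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# "quée", with é written as its code point U+00E9 (illustrative input)
my $mystring = "qu\xE9e";

# Copy the string, then transliterate é (U+00E9) to a plain "e".
# Note the single backslash: tr/\\x{130}/e/ would instead list the
# literal characters \, x, {, 1, 3, 0, } as the search set.
( my $fixed = $mystring ) =~ tr/\xE9/e/;

print "$fixed\n";    # prints "quee"
```

This handles only the one character listed in the tr; each additional accented letter would need its own entry, which is part of why a CPAN module is the more practical choice.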

+7

cjm May 24 '12 at 18:00


After work and re-work, this is what I have now. It does everything I want, except that I want to keep spaces in the middle of the input lines to distinguish between words.

    use Unicode::Normalize;    # provides NFD()

    open FILE, "funnywords.txt";

    # Iterate through funnywords.txt
    while ( <FILE> ) {
        chomp;

        # Show initial text from file
        print "In: '$_' -> ";

        my $inputString = $_;

        # $inputString is scoped within a for each loop which dissects
        # unicode characters ( example: "é" splits into "e" and "´" )
        # and throws away accent marks. Also replaces all
        # non-alphanumeric characters with spaces and removes
        # extraneous periods and spaces.
        for ( $inputString ) {
            $inputString = NFD( $inputString );    # decompose/dissect
            s/^\s//; s/\s$//;                      # strip begin/end spaces
            s/\pM//g;                              # strip odd pieces
            s/\W+//g;                              # strip non-word chars
        }

        # Convert to lowercase
        my $outputString = "\L$inputString";

        # Output final result
        print "$outputString\n";
    }

Not quite sure why it colors some regular expressions and comments ...

Here are some sample lines from "funnywords.txt":

Kui

22.

? ÉÉíóñúÑ¿¡

[.this? ]

aquí, aLLí

+3

Andrew Coomes May 25 '12 at 18:56


For your second question, about getting rid of any remaining characters while keeping letters, numbers, and spaces: change your last regular expression from s/\W+//g to s/[^a-zA-Z0-9 ]+//g. Since you already normalize the rest of the input, this regex will remove anything that is not a-z, A-Z, 0-9, or a space. Putting a ^ at the start of the [] means you want to match everything that is NOT listed in the rest of the bracket.
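A minimal sketch of how that character class fits into the normalization steps from the question. The sub name `normalize_line` is hypothetical, the sample input is one of the lines from "funnywords.txt" written with escapes, and trimming after the substitution (rather than before) is my own choice so that spaces left at the edges by removed punctuation are also dropped.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: strip accents and punctuation but keep inner spaces.
sub normalize_line {
    my ($text) = @_;
    $text = NFD($text);               # split "í" into "i" + combining accent
    $text =~ s/\pM//g;                # drop the combining marks
    $text =~ s/[^a-zA-Z0-9 ]+//g;     # keep only letters, digits, and spaces
    $text =~ s/^\s+|\s+$//g;          # trim leading/trailing whitespace
    return lc $text;
}

# "aquí, aLLí" with í written as U+00ED
my $result = normalize_line("aqu\x{ED}, aLL\x{ED}");
print "$result\n";    # prints "aqui alli"
```

Unlike \W, the explicit class also removes the underscore, and it keeps the space between the two words.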

+2

Zephyrie May 30 '12 at 23:40


DVK · Accepted Answer · 2012-05-24T17:26:58+0000

1) This article should provide a good (if complicated) way.

It provides a solution for converting all accented Unicode characters to a base character + accent; once this is done, you can simply remove the accent characters separately.

2) Another option is CPAN: Text::Unaccent::PurePerl (an improved, pure-Perl version of Text::Unaccent)

3) Also, this SO answer suggests Text::Unidecode:

    $ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
    ete
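For option 1, the "base character + accent" decomposition can be sketched with the core Unicode::Normalize module. This is an illustration of the idea only, not the code from the linked article; the variable names are assumptions.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $word       = "\x{E9}t\x{E9}";    # "été", with é written as U+00E9
my $decomposed = NFD($word);         # each é becomes "e" + U+0301

# Remove the combining marks (Unicode category M), leaving base letters.
( my $stripped = $decomposed ) =~ s/\pM//g;

# 3 characters before decomposition, 5 after, then the plain result
printf "%d -> %d -> %s\n", length $word, length $decomposed, $stripped;
# prints "3 -> 5 -> ete"
```

This only affects characters that have a canonical decomposition; letters such as ø or ß pass through NFD unchanged, which is where Text::Unidecode is more thorough.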
