Regexp not working for special characters in Perl

Question

I cannot get rid of the special characters ¤ and ❤ in the following code:

    $word = 'cɞi¤r$c❤u¨s';
    $word =~ s/[^a-zöäåA-ZÖÄÅ]//g;
    printf "$word\n";

In the second line, I try to remove any non-alphabetic characters from the string $word. I would expect the word circus to be printed, but instead I get:

 ci rc us

öäå and ÖÄÅ in the expression are just normal Swedish characters that I need.

+6

regex perl unicode special-characters

Pithikos Nov 25 '11 at 13:45

3 answers

Short answer: add use utf8; so that the string literals in your source code are interpreted as UTF-8. This covers both the contents of the test string and the contents of the regular expression.

Long answer:

    #!/usr/bin/env perl
    use warnings;
    use Encode;

    my $word = 'cɞi¤r$c❤u¨s';
    foreach my $char (split //, $word) {
        print ord($char) . Encode::encode_utf8(":$char ");
    }

    my $allowed_chars = 'a-zöäåA-ZÖÄÅ';
    print "\n";
    foreach my $char (split //, $allowed_chars) {
        print ord($char) . Encode::encode_utf8(":$char ");
    }
    print "\n";

    $word =~ s/[^$allowed_chars]//g;
    printf Encode::encode_utf8("$word\n");

Running it without utf8:

    $ perl utf8_regexp.pl
    99:c 201:É 158: 105:i 194:Â 164:¤ 114:r 36:$ 99:c 226:â 157: 164:¤ 117:u 194:Â 168:¨ 115:s
    97:a 45:- 122:z 195:Ã 182:¶ 195:Ã 164:¤ 195:Ã 165:¥ 65:A 45:- 90:Z 195:Ã 150: 195:Ã 132: 195:Ã 133:
    ci¤rc¤us

Running it with utf8:

    $ perl -Mutf8 utf8_regexp.pl
    99:c 606:ɞ 105:i 164:¤ 114:r 36:$ 99:c 10084:❤ 117:u 168:¨ 115:s
    97:a 45:- 122:z 246:ö 228:ä 229:å 65:A 45:- 90:Z 214:Ö 196:Ä 197:Å
    circus

Explanation:

The non-ASCII characters that you type in your source code are represented by more than one byte, because your source file is encoded as UTF-8. In a plain ASCII or Latin-1 file, each character would be a single byte.

If you do not use the utf8 pragma, perl treats each byte in your source as a separate character, as you can see when you split the string and print each individual character. With the utf8 pragma, it treats a sequence of several bytes as one character, following the UTF-8 encoding rules.

By coincidence, some of the bytes that make up the Swedish characters are the same as some of the bytes that make up characters in your test string, so those bytes are kept. Specifically, ä is encoded in UTF-8 as the two bytes 195 (Ã) and 164 (¤). Byte 164 therefore ends up among the characters you allow, and the 164 bytes that belong to ¤ (194 164) and ❤ (226 157 164) pass through.
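The byte overlap can be checked directly. The following is a minimal sketch, not part of the original script: it builds the strings from code points with \x{...} escapes (so it does not depend on how the file is saved) and uses Encode::encode_utf8 to imitate what perl sees when use utf8 is missing. The variable names are illustrative only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Strings are built from code points, so the result does not depend on
# the encoding this file is saved in.
my @heart_bytes = map { ord } split //, encode_utf8("\x{2764}");  # heart sign
my @a_uml_bytes = map { ord } split //, encode_utf8("\x{E4}");    # a-umlaut

print "heart: @heart_bytes\n";   # heart: 226 157 164
print "a-uml: @a_uml_bytes\n";   # a-uml: 195 164

# The character class as perl sees it without "use utf8": a run of bytes.
my $byte_class = encode_utf8("a-z\x{F6}\x{E4}\x{E5}A-Z\x{D6}\x{C4}\x{C5}");

# The bytes of "c", heart sign, "u", filtered byte by byte.
my $bytes = encode_utf8("c\x{2764}u");
$bytes =~ s/[^$byte_class]//g;

my @left = map { ord } split //, $bytes;
print "left: @left\n";           # left: 99 164 117
```

The last byte of the heart sign (164) survives because the same byte is part of the encoded ä in the character class.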

The solution is to tell perl that your strings should be treated as UTF-8.

The encode_utf8 calls are there to avoid "Wide character in print" warnings when printing to the terminal. As always, you need to decode your input and encode your output according to the character encoding that the input source or output destination uses.
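As a rough illustration of decoding input and encoding output, here is a sketch that assumes the data arrives as raw UTF-8 bytes (for example from a file or a socket). The byte string below is the UTF-8 encoding of the test word from the question, written with escapes; the names $raw_in and $raw_out are made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes of the test word, as they would arrive from outside.
my $raw_in = "c\xC9\x9Ei\xC2\xA4r\$c\xE2\x9D\xA4u\xC2\xA8s";

# Decode bytes into characters before doing any text processing.
my $word = decode('UTF-8', $raw_in);

# Allowed characters, given as code points: a-z, ö ä å, A-Z, Ö Ä Å.
my $allowed = "a-z\x{F6}\x{E4}\x{E5}A-Z\x{D6}\x{C4}\x{C5}";
$word =~ s/[^$allowed]//g;

# Encode characters back into bytes on the way out.
my $raw_out = encode('UTF-8', "$word\n");
print $raw_out;   # circus
```

Once the string is decoded, the regular expression works on whole characters, so no stray byte can match part of an allowed character.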

Hope this makes it clearer.

+3

nicomen Nov 28 '11 at 10:34

As choroba pointed out, adding this to the top of the perl script solves it:

    use utf8;
    binmode(STDOUT, ":utf8");

where use utf8 allows you to correctly use special characters in the regular expression, and binmode(STDOUT, ":utf8") allows you to display special characters correctly in the shell.

-7

Pithikos Nov 25 '11 at 15:21

choroba · Accepted Answer · 2011-11-25T13:51:13+0000

If the characters are in the source code, be sure to use utf8. If they are read from a file, use binmode $FILEHANDLE, ':utf8'.
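A sketch of the file-reading case follows. To keep it self-contained it reads from an in-memory filehandle holding hypothetical UTF-8 bytes instead of a real file, and it uses the stricter ':encoding(UTF-8)' layer rather than ':utf8'; either layer can be passed to binmode in the same way.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical file contents: UTF-8 bytes for "smör", a heart sign, "gås".
my $file_bytes = "sm\xC3\xB6r\xE2\x9D\xA4g\xC3\xA5s\n";

# Open the scalar as a file, then tell perl how to decode what it reads.
open my $fh, '<', \$file_bytes or die "open: $!";
binmode $fh, ':encoding(UTF-8)';
my $line = <$fh>;
close $fh;
chomp $line;

# Allowed characters, given as code points: a-z, ö ä å, A-Z, Ö Ä Å.
my $allowed = "a-z\x{F6}\x{E4}\x{E5}A-Z\x{D6}\x{C4}\x{C5}";
$line =~ s/[^$allowed]//g;

# Encode on output so the terminal receives UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
print "$line\n";   # smörgås
```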

Be sure to read perldoc perlunicode.
