Short answer: add use utf8; so that the string literals in your source code are interpreted as UTF-8. This applies to both the test string and the regular expression.
Long answer:
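    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Encode;

    my $word = 'cɞi¤r$c❤u¨s';

    # Print the ordinal and the glyph of every "character" perl sees
    foreach my $char (split //, $word) {
        print ord($char) . Encode::encode_utf8(":$char ");
    }

    my $allowed_chars = 'a-zöäåA-ZÖÄÅ';
    print "\n";

    foreach my $char (split //, $allowed_chars) {
        print ord($char) . Encode::encode_utf8(":$char ");
    }
    print "\n";

    # Remove everything that is not in the allowed set
    $word =~ s/[^$allowed_chars]//g;
    # print rather than printf: the data must not be used as a format string
    print Encode::encode_utf8("$word\n");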
Running it without utf8:
    $ perl utf8_regexp.pl
    99:c 201:É 158: 105:i 194:Â 164:¤ 114:r 36:$ 99:c 226:â 157: 164:¤ 117:u 194:Â 168:¨ 115:s
    97:a 45:- 122:z 195:Ã 182:¶ 195:Ã 164:¤ 195:Ã 165:¥ 65:A 45:- 90:Z 195:Ã 150: 195:Ã 132: 195:Ã 133:
    ci¤rc¤us
Running it with utf8:
    $ perl -Mutf8 utf8_regexp.pl
    99:c 606:ɞ 105:i 164:¤ 114:r 36:$ 99:c 10084:❤ 117:u 168:¨ 115:s
    97:a 45:- 122:z 246:ö 228:ä 229:å 65:A 45:- 90:Z 214:Ö 196:Ä 197:Å
    circus
Explanation:
The non-ASCII characters that you type into your source code are each represented by more than one byte, because your source file is encoded as UTF-8. In a single-byte encoding such as Latin-1, a character like ö would be one byte.
Without the utf8 pragma, perl treats every byte of a literal as a separate character, as you can see when the string is split and each character is printed. With the utf8 pragma, it treats a sequence of several bytes as one character, according to the UTF-8 encoding rules.
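To see the difference in isolation, here is a minimal sketch. It assumes the file is saved as UTF-8 and relies on the utf8 pragma being lexically scoped, so it can be switched on and off per block; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Same literal, compiled once without and once with the utf8 pragma.
my $without = do { no utf8;  length 'ö' };   # literal is seen as two bytes
my $with    = do { use utf8; length 'ö' };   # literal is seen as one character

print "without utf8: $without, with utf8: $with\n";
# prints: without utf8: 2, with utf8: 1
```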
As you can see, by coincidence some of the bytes that make up the Swedish characters are the same as some of the bytes that make up characters in your test string, and those bytes are kept. Namely: ä, which in UTF-8 consists of 195 (Ã) and 164 (¤). Byte 164 therefore ends up as one of the characters you allow, and since both ¤ (194 164) and ❤ (226 157 164) contain that byte, it passes through.
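The overlap can be checked directly by listing the UTF-8 bytes of the characters involved. This is only a sketch; utf8_bytes is a helper name made up for this example, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                       # the literals below are real characters
use Encode qw(encode_utf8);

# Hypothetical helper: return the UTF-8 byte values of a character.
sub utf8_bytes {
    my ($char) = @_;
    return unpack 'C*', encode_utf8($char);
}

for my $char ('ö', 'ä', '¤', '❤') {
    # Only ASCII is printed, so no output encoding is needed here.
    printf "U+%04X => %s\n", ord($char), join ' ', utf8_bytes($char);
}
# prints:
# U+00F6 => 195 182
# U+00E4 => 195 164
# U+00A4 => 194 164
# U+2764 => 226 157 164
```

Byte 164 shows up in ä, ¤ and ❤, which is why a byte-wise character class built from ä lets part of ¤ and ❤ through.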
The solution is to tell perl that the string literals in your source should be treated as UTF-8.
The encode_utf8 calls are there to avoid the "Wide character in print" warnings when the characters are printed to the terminal. As always, you need to decode input and encode output according to the character encoding that the input or output is supposed to use.
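Put together, a fixed version could look roughly like the sketch below. It assumes a UTF-8 source file and a UTF-8 terminal, and it swaps the explicit encode_utf8 calls for the open pragma, which puts an encoding layer on the standard handles; $clean is just an illustrative name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                              # string literals in this file are UTF-8
use open qw(:std :encoding(UTF-8));    # decode STDIN, encode STDOUT and STDERR

my $word          = 'cɞi¤r$c❤u¨s';
my $allowed_chars = 'a-zöäåA-ZÖÄÅ';

# Work on a copy so the original string stays available.
(my $clean = $word) =~ s/[^$allowed_chars]//g;

print "$clean\n";    # prints: circus
```

With the layer on STDOUT, printing $word itself would also work without a "Wide character" warning.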
Hope this makes things clearer.