Why is umlaut not recognized in UTF-8 Perl script with "use utf8"?

Question

The following script is encoded in UTF-8:

 use utf8;

 $fuer = pack('H*', '66c3bc72');

 $fuer =~ s/ü/!!!/;

 print $fuer;

The ü in s/// is stored in the script as c3 bc, as the xxd hex dump below shows.

 0000000: 75 73 65 20 75 74 66 38 3b 0a 0a 24 66 75 65 72  use utf8;..$fuer
 0000010: 20 3d 20 70 61 63 6b 28 27 48 2a 27 2c 20 27 36   = pack('H*', '6
 0000020: 36 63 33 62 63 37 32 27 29 3b 0a 0a 24 66 75 65  6c3bc72');..$fue
 0000030: 72 20 3d 7e 20 73 2f c3 bc 2f 21 21 21 2f 3b 0a  r =~ s/../!!!/;.
 0000040: 0a 70 72 69 6e 74 20 24 66 75 65 72 3b 0a        .print $fuer;.

c3 bc is the UTF-8 encoding of ü.

Since the script is encoded in UTF-8 and I am using use utf8, I expected the script to replace the ü in the $fuer variable, but it does not.

However, the replacement does work if I remove use utf8. This contradicts my understanding of what use utf8 is for: to indicate that the script is encoded in UTF-8.

+6

perl utf-8 character-encoding

René nyffenegger Feb 11 '17 at 11:02


2 answers

Pull the ü out of the s/// and into its own variable so that we can inspect it.

 use utf8;                              # Script is encoded using UTF-8
 use open ':std', ':encoding(UTF-8)';   # Terminal expects UTF-8.
 use strict;
 use warnings;

 my $uuml = "ü";
 printf("%d %vX %s\n", length($uuml), $uuml, $uuml);   # 1 FC ü

 my $fuer = pack('H*', '66c3bc72');
 printf("%d %vX %s\n", length($fuer), $fuer, $fuer);   # 4 66.C3.BC.72 fÃ¼r

 $fuer =~ s/\Q$uuml/!!!/;
 printf("%d %vX %s\n", length($fuer), $fuer, $fuer);   # 4 66.C3.BC.72 fÃ¼r

As this makes obvious, you are comparing the Unicode code point of ü (FC) against the UTF-8 encoding of ü (C3 BC).

So yes, use utf8; indicates that the script is encoded using UTF-8 ... but it does so so that Perl can correctly decode the script.

Decode all inputs and encode all outputs! The solution is to replace

 my $fuer = pack('H*', '66c3bc72');

with

 use Encode qw( decode_utf8 );
 my $fuer = decode_utf8(pack('H*', '66c3bc72'));

or

 my $fuer = pack('H*', '66c3bc72');
 utf8::decode($fuer);
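To show the "decode all inputs, encode all outputs" advice end to end, here is a minimal sketch of the question's script with the fix applied. The helper name replace_uuml is hypothetical and exists only for illustration, and encoding the output by hand with encode_utf8 is just one option (an output layer such as the use open line above would do the same job).

```perl
use strict;
use warnings;
use utf8;                                   # the source, including the ü below, is UTF-8
use Encode qw( decode_utf8 encode_utf8 );

# Hypothetical helper: decode UTF-8 bytes into characters, then substitute.
sub replace_uuml {
    my ($bytes) = @_;
    my $chars = decode_utf8($bytes);        # bytes 66 C3 BC 72 become characters 66 FC 72
    $chars =~ s/ü/!!!/;                     # both sides are now character strings
    return $chars;
}

my $result = replace_uuml(pack('H*', '66c3bc72'));
print encode_utf8($result), "\n";           # encode on the way out; prints f!!!r
```

Once the packed bytes are decoded, the ü in the pattern and the ü in the string are the same single character, so the substitution matches.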

+4

ikegami Feb 11 '17 at 19:41


Borodin · Accepted Answer · 2017-02-11T11:18:09+0000

The problem is one of character encoding. You are comparing an encoded byte string with a decoded character string.

$fuer = pack('H*', '66c3bc72') creates the four-byte string "\x66\xc3\xbc\x72", while a small u with diaeresis, ü, is the single character "\xfc", so they don't match.

If you used decode_utf8 from the Encode module to process your $fuer variable further, it would decode the UTF-8 to form the three-character string "\x66\xfc\x72", and the substitution would then work.

use utf8 applies the equivalent of decode_utf8 to the entire source file, so without it your ü appears as the two bytes "\xc3\xbc", which match the bytes in the packed variable.
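The effect of the pragma on a literal can be checked directly. The sketch below is only an illustration and assumes the file is saved as UTF-8; it relies on use utf8 being lexically scoped, so the same literal can be read both ways in one script. The variable names are made up for the example.

```perl
use strict;
use warnings;

my ($decoded, $raw);
{
    use utf8;      # literal is decoded: one character, U+00FC
    $decoded = "ü";
}
{
    no utf8;       # literal is left as the raw bytes C3 BC
    $raw = "ü";
}

printf("%d %vX\n", length($decoded), $decoded);   # 1 FC
printf("%d %vX\n", length($raw), $raw);           # 2 C3.BC

# The packed string holds the bytes 66 C3 BC 72, so only the raw form is found in it.
my $packed = pack('H*', '66c3bc72');
print "raw bytes found at offset ", index($packed, $raw), "\n";        # 1
print "decoded character found at ", index($packed, $decoded), "\n";   # -1
```

This is why the question's script appears to work without use utf8: both sides are then undecoded bytes, which happen to agree.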
