Using Encode::encode with "utf8"

Question

As you probably know, in Perl "utf8" refers to Perl's looser understanding of UTF-8, which allows characters that are technically not valid code points in UTF-8. By contrast, "UTF-8" (or "utf-8") is Perl's stricter understanding of UTF-8, which does not allow invalid code points.

I have a few questions related to this difference:

Encode::encode by default replaces invalid characters with a replacement character. Is this true even if you pass the looser "utf8" as the encoding?
What happens when you read and write files that were opened using "UTF-8"? Are bad characters replaced, or does something else happen?
What is the difference between using open with a layer of the form '>:utf8' and a layer of the form '>:encoding(utf8)'? Can both forms be used with "utf8" as well as with "UTF-8"?

+4

perl

Stephen Feb 28 '18 at 21:04

1 answer

ikegami · Answer 1 · 2018-02-28T23:07:30+0000

                   ╔════════════════════════════════════════════╤══════════════════════╗
                   ║                                            │                      ║
                   ║                  On Read                   │       On Write       ║
                   ║                                            │                      ║
        Perl       ╟─────────────────────┬──────────────────────┼──────────────────────╢
        5.26       ║                     │                      │                      ║
                   ║ Invalid encoding    │ Outside of Unicode,  │ Outside of Unicode,  ║
                   ║ other than sequence │ Unicode nonchar, or  │ Unicode nonchar, or  ║
                   ║ length              │ Unicode surrogate    │ Unicode surrogate    ║
                   ║                     │                      │                      ║
╔══════════════════╬═════════════════════╪══════════════════════╪══════════════════════╣
║                  ║                     │                      │                      ║
║ :encoding(UTF-8) ║ Warns and Replaces  │ Warns and Replaces   │ Warns and Replaces   ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :encoding(utf8)  ║ Warns and Replaces  │ Accepts              │ Warns and Accepts    ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :utf8            ║ Corrupt scalar      │ Accepts              │ Warns and Accepts    ║
║                  ║                     │                      │                      ║
╚══════════════════╩═════════════════════╧══════════════════════╧══════════════════════╝
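
For valid text, the three layers in the table should behave identically; they only diverge on invalid input. A minimal sketch of that baseline, using an in-memory file so nothing touches the disk (`bytes_written` is a hypothetical helper name, not part of Perl or Encode):

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through $layer into an in-memory file
# and return the raw bytes that would have been written to disk.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf) or die "Can't open in-memory file: $!";
    print {$fh} $text;
    close($fh) or die "Can't close in-memory file: $!";
    return $buf;
}

for my $layer (':encoding(UTF-8)', ':encoding(utf8)', ':utf8') {
    # Show the bytes as dot-separated hex, like the %vX output in the tests.
    my $hex = join '.', map { sprintf '%02X', ord($_) }
              split //, bytes_written($layer, "\x{E9}\n");
    printf "%-17s %s\n", $layer, $hex;
}
```

Each layer is expected to report C3.A9.0A here, matching the first line of every test transcript in this answer.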

Note that :encoding(UTF-8) actually decodes using utf8 and then checks whether the resulting character is in the valid range (since it recognizes "\x{20_000}" and even "\x{1000_0000_0000_0000}"). This reduces the number of error messages, so it's good.
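
The same strict-versus-lax split can be observed without any file handle by calling Encode::decode directly. This is only a sketch: `try_decode` is a hypothetical helper, and the exact wording of the error message may vary between Encode versions.

```perl
use strict;
use warnings;
no warnings 'utf8';   # 0x20_0000 is outside Unicode; keep the demo quiet
use Encode qw( decode FB_CROAK LEAVE_SRC );

# The five bytes that Perl's lax utf8 uses for code point 0x20_0000.
my $bytes = "\xF8\x88\x80\x80\x80";

# Hypothetical helper: decode with die-on-error and report what happened.
sub try_decode {
    my ($encoding, $input) = @_;
    my $text = eval { decode($encoding, $input, FB_CROAK | LEAVE_SRC) };
    return sprintf('ok, code points: %vX', $text) if defined $text;
    my $err = $@;
    chomp $err;
    return "died: $err";
}

printf "%-6s %s\n", $_, try_decode($_, $bytes) for 'utf8', 'UTF-8';
```

The lax "utf8" decode should succeed and report code point 200000, while the strict "UTF-8" decode should die on the same bytes.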

(Encoding names are not case sensitive.)
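
Case does not matter, but the hyphen does. A quick way to check which implementation a given name resolves to is Encode's find_encoding; the list of spellings below is just an illustration:

```perl
use strict;
use warnings;
use Encode qw( find_encoding );

# Print the canonical encoding name behind each spelling.
for my $name ('UTF-8', 'utf-8', 'UTF8', 'utf8') {
    printf "%-6s resolves to %s\n", $name, find_encoding($name)->name;
}
```

The hyphenated spellings resolve to "utf-8-strict" and the unhyphenated ones to "utf8".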

Tests:

While reading

:encoding(UTF-8)

$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
   perl -MB -nle'
      use open ":std", ":encoding(UTF-8)";
      my $sv = B::svref_2object(\$_);
      printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
   '
utf8 "\xFFFF" does not map to Unicode.
utf8 "\xD800" does not map to Unicode.
utf8 "\x200000" does not map to Unicode.
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)

:encoding(utf8)

$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
   perl -MB -nle'
      use open ":std", ":encoding(utf8)";
      my $sv = B::svref_2object(\$_);
      printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
   '
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
FFFF (internal: EF.BF.BF, UTF8=1)
D800 (internal: ED.A0.80, UTF8=1)
200000 (internal: F8.88.80.80.80, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)

:utf8

$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
   perl -MB -nle'
      use open ":std", ":utf8";
      my $sv = B::svref_2object(\$_);
      printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
   '
E9 (internal: C3.A9, UTF8=1)
FFFF (internal: EF.BF.BF, UTF8=1)
D800 (internal: ED.A0.80, UTF8=1)
200000 (internal: F8.88.80.80.80, UTF8=1)
Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
0 (internal: 80, UTF8=1)

While writing

:encoding(UTF-8)

$ perl -e'
   use open ":std", ":encoding(UTF-8)";
   print "\x{E9}\n";
   print "\x{FFFF}\n";
   print "\x{D800}\n";
   print "\x{20_0000}\n";
' >a
Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
"\x{ffff}" does not map to utf8.
"\x{d800}" does not map to utf8.
"\x{200000}" does not map to utf8.

$ od -t c a
0000000 303 251  \n   \   x   {   F   F   F   F   }  \n   \   x   {   D
0000020   8   0   0   }  \n   \   x   {   2   0   0   0   0   0   }  \n
0000040

$ cat a
é
\x{FFFF}
\x{D800}
\x{200000}

:encoding(utf8)

$ perl -e'
   use open ":std", ":encoding(utf8)";
   print "\x{E9}\n";
   print "\x{FFFF}\n";
   print "\x{D800}\n";
   print "\x{20_0000}\n";
' >a
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.

$ od -t c a
0000000 303 251  \n 355 240 200  \n 370 210 200 200 200  \n
0000015

$ cat a
é
▒
▒

:utf8
The same results as :encoding(utf8).

Tested using Perl 5.26.

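Encode::encode by default replaces invalid characters with a replacement character. Is this true even if you use "utf8" as the encoding?

Perl strings are strings of 32-bit or 64-bit characters, depending on the build. utf8 can encode any 72-bit integer, so it can encode every character it could be asked to encode, and nothing ever needs to be replaced.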
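
A minimal sketch of how Encode::encode, called with its default CHECK, treats the two encoding names. `hex_bytes` is a hypothetical helper, and the three code points are the valid, surrogate, and non-Unicode cases from the tests in this answer:

```perl
use strict;
use warnings;
no warnings 'utf8';   # surrogates and non-Unicode code points are intentional here
use Encode qw( encode );

# Hypothetical helper: render a byte string as dot-separated hex.
sub hex_bytes { join '.', map { sprintf '%02X', ord($_) } split //, $_[0] }

for my $cp (0xE9, 0xD800, 0x20_0000) {
    my $char = chr($cp);
    printf "U+%-6X utf8: %-15s UTF-8: %s\n",
        $cp,
        hex_bytes(encode('utf8',  $char)),
        hex_bytes(encode('UTF-8', $char));
}
```

With "utf8" every character is encoded as-is. With "UTF-8" the surrogate and the non-Unicode code point should each come out as EF.BF.BD, the encoding of the replacement character U+FFFD.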
