                     ╔════════════════════════════════════════════╤══════════════════════╗
                     ║                  On Read                   │       On Write       ║
Perl 5.26            ╟─────────────────────┬──────────────────────┼──────────────────────╢
                     ║ Invalid encoding    │ Outside of Unicode,  │ Outside of Unicode,  ║
                     ║ other than sequence │ Unicode nonchar, or  │ Unicode nonchar, or  ║
                     ║ length              │ Unicode surrogate    │ Unicode surrogate    ║
╔════════════════════╬═════════════════════╪══════════════════════╪══════════════════════╣
║ :encoding(UTF-8)   ║ Warns and Replaces  │ Warns and Replaces   │ Warns and Replaces   ║
╟────────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║ :encoding(utf8)    ║ Warns and Replaces  │ Accepts              │ Warns and Accepts    ║
╟────────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║ :utf8              ║ Corrupt scalar      │ Accepts              │ Warns and Accepts    ║
╚════════════════════╩═════════════════════╧══════════════════════╧══════════════════════╝
Note that :encoding(UTF-8) actually decodes using utf8, then checks whether the resulting characters are in the valid range (you can tell because the error messages name "\x{20_0000}" and even "\x{1000_0000_0000_0000}"). This reduces the number of error messages for bad input, so it's good.
(Encoding names are not case sensitive.)
Tests:
On read
:encoding(UTF-8)
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":encoding(UTF-8)";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
utf8 "\xFFFF" does not map to Unicode.
utf8 "\xD800" does not map to Unicode.
utf8 "\x200000" does not map to Unicode.
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
:encoding(utf8)
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":encoding(utf8)";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
FFFF (internal: EF.BF.BF, UTF8=1)
D800 (internal: ED.A0.80, UTF8=1)
200000 (internal: F8.88.80.80.80, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
:utf8
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":utf8";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
E9 (internal: C3.A9, UTF8=1)
FFFF (internal: EF.BF.BF, UTF8=1)
D800 (internal: ED.A0.80, UTF8=1)
200000 (internal: F8.88.80.80.80, UTF8=1)
Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
0 (internal: 80, UTF8=1)
On write
:encoding(UTF-8)
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print "\x{E9}\n";
print "\x{FFFF}\n";
print "\x{D800}\n";
print "\x{20_0000}\n";
' >a
Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
"\x{ffff}" does not map to utf8.
"\x{d800}" does not map to utf8.
"\x{200000}" does not map to utf8.
$ od -t c a
0000000 303 251 \n \ x { F F F F } \n \ x { D
0000020 8 0 0 } \n \ x { 2 0 0 0 0 0 } \n
0000040
$ cat a
é
\x{FFFF}
\x{D800}
\x{200000}
:encoding(utf8)
$ perl -e'
use open ":std", ":encoding(utf8)";
print "\x{E9}\n";
print "\x{FFFF}\n";
print "\x{D800}\n";
print "\x{20_0000}\n";
' >a
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
$ od -t c a
0000000 303 251 \n 355 240 200 \n 370 210 200 200 200 \n
0000015
$ cat a
é
β
β
:utf8
The same results as :encoding(utf8).
Tested using Perl 5.26.
Encode:: encode . , "utf8" ?
Perl 32- 64- . utf8 72- . , .