Using Encode::encode with "utf8"

As you probably know, in Perl "utf8" denotes a looser interpretation of UTF-8, one that allows characters that are technically not valid code points in UTF-8. In contrast, "UTF-8" (or "utf-8") is Perl's stricter interpretation of UTF-8, which does not allow invalid code points.

I have a few questions related to this difference:

  • Encode::encode replaces invalid characters with a substitution character by default. Is this true even if you pass the looser "utf8" as the encoding?

  • What happens when you read and write files that were opened using "UTF-8"? Are bad characters replaced with a substitution character, or does something else happen?

  • What is the difference between using open with a layer such as '>:utf8' and a layer such as '>:encoding(utf8)'? Can both approaches be used with "utf8" as well as with "UTF-8"?

1 answer
                   ╔════════════════════════════════════════════╤══════════════════════╗
                   ║                                            │                      ║
                   ║                  On Read                   │       On Write       ║
                   ║                                            │                      ║
        Perl       ╟─────────────────────┬──────────────────────┼──────────────────────╢
        5.26       ║                     │                      │                      ║
                   ║ Invalid encoding    │ Outside of Unicode,  │ Outside of Unicode,  ║
                   ║ other than sequence │ Unicode nonchar, or  │ Unicode nonchar, or  ║
                   ║ length              │ Unicode surrogate    │ Unicode surrogate    ║
                   ║                     │                      │                      ║
╔══════════════════╬═════════════════════╪══════════════════════╪══════════════════════╣
║                  ║                     │                      │                      ║
║ :encoding(UTF-8) ║ Warns and Replaces  │ Warns and Replaces   │ Warns and Replaces   ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :encoding(utf8)  ║ Warns and Replaces  │ Accepts              │ Warns and Accepts    ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :utf8            ║ Corrupt scalar      │ Accepts              │ Warns and Accepts    ║
║                  ║                     │                      │                      ║
╚══════════════════╩═════════════════════╧══════════════════════╧══════════════════════╝


Note that :encoding(UTF-8) actually decodes using utf8 and only then checks whether the resulting character is in the valid range (which is why it recognizes "\x{20_0000}" and even "\x{1000_0000_0000_0000}" as single characters). This reduces the number of error messages for bad input, which is a good thing.

(Encoding names are not case sensitive.)
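To see which encoding object each spelling actually selects, a minimal sketch along these lines may help. It assumes only the core Encode module; the %canonical hash is a name made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each spelling to the canonical name of the encoding object Encode picks.
# %canonical is a hypothetical name used only in this sketch.
my %canonical;
for my $name (qw(utf8 UTF-8 utf-8)) {
    my $enc = find_encoding($name)
        or die "Encode does not know '$name'";
    $canonical{$name} = $enc->name;
    printf "%-6s => %s\n", $name, $canonical{$name};
}
```

The hyphenated spellings should resolve to Encode's strict encoding (named "utf-8-strict"), while "utf8" resolves to the lax one, so it is the hyphen, not the letter case, that selects the strict behaviour.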


Tests:

When reading

  • :encoding(UTF-8)

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":encoding(UTF-8)";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "\xFFFF" does not map to Unicode.
    utf8 "\xD800" does not map to Unicode.
    utf8 "\x200000" does not map to Unicode.
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
    5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
    5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    
  • :encoding(utf8)

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":encoding(utf8)";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    
  • :utf8

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":utf8";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
    0 (internal: 80, UTF8=1)
    

When writing

  • :encoding(UTF-8)

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";
       print "\x{E9}\n";
       print "\x{FFFF}\n";
       print "\x{D800}\n";
       print "\x{20_0000}\n";
    ' >a
    Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
    "\x{ffff}" does not map to utf8.
    "\x{d800}" does not map to utf8.
    "\x{200000}" does not map to utf8.
    
    $ od -t c a
    0000000 303 251  \n   \   x   {   F   F   F   F   }  \n   \   x   {   D
    0000020   8   0   0   }  \n   \   x   {   2   0   0   0   0   0   }  \n
    0000040
    
    $ cat a
    é
    \x{FFFF}
    \x{D800}
    \x{200000}
    
  • :encoding(utf8)

    $ perl -e'
       use open ":std", ":encoding(utf8)";
       print "\x{E9}\n";
       print "\x{FFFF}\n";
       print "\x{D800}\n";
       print "\x{20_0000}\n";
    ' >a
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
    
    $ od -t c a
    0000000 303 251  \n 355 240 200  \n 370 210 200 200 200  \n
    0000015
    
    $ cat a
    é
    ▒
    ▒
    
  • :utf8

    The same results as :encoding(utf8).

Tested using Perl 5.26.
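The read-side behaviour of :encoding(UTF-8) can also be checked from a single script instead of a shell pipeline. The following is a sketch only: it reads from an in-memory filehandle rather than STDIN (an assumption made here for self-containment), and the variable names are illustrative.

```perl
use strict;
use warnings;

# Valid "é", then a lone continuation byte, which is malformed
# under both the lax and the strict interpretation.
my $bytes = "\xC3\xA9\n\x80\n";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Read the bytes through the strict layer via an in-memory filehandle.
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory file: $!";
chomp(my @lines = <$fh>);
close $fh;

# %vX prints each character's code point in hex, joined by dots,
# in the same style as the tests above.
printf "%vX\n", $_ for @lines;
print "warning: $_" for @warnings;
```

Swapping the layer string for ':encoding(utf8)' or ':utf8' lets the same script exercise the other two rows of the table.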


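Encode::encode replaces invalid characters with a substitution character by default. Is this true even if you pass "utf8" as the encoding?

Perl strings are strings of 32-bit or 64-bit characters, depending on the build. utf8 can encode any 72-bit integer. It is therefore able to encode every character it can be asked to encode.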
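As a rough illustration of how Encode::encode treats the two encoding names, the sketch below encodes a surrogate and a code point above U+10FFFF with each of them. The hex_bytes helper and the %input and %result hashes are hypothetical names introduced for this example, and CHECK is left at its default.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex for easy comparison.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# A surrogate and a code point above U+10FFFF: neither is a Unicode scalar value.
my %input = (
    surrogate     => "\x{D800}",
    above_unicode => "\x{200000}",
);

my %result;
for my $label (sort keys %input) {
    for my $encoding (qw(utf8 UTF-8)) {
        # CHECK is left at its default, so problems are substituted, not fatal.
        my $bytes = encode($encoding, $input{$label});
        $result{$label}{$encoding} = hex_bytes($bytes);
        printf "%-13s %-5s => %s\n", $label, $encoding, $result{$label}{$encoding};
    }
}
```

With "utf8" both code points should come out as their raw extended encodings (ED A0 80 and F8 88 80 80 80, matching the od output above), whereas "UTF-8" should emit a substitution instead. Noncharacters such as U+FFFF are left out on purpose, since their treatment may depend on the Encode version.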


Source: https://habr.com/ru/post/1694263/