How can I check in Perl whether a file is written as little endian or big endian?

I have to parse files that can be in either byte order (big or little endian). The Perl interpreter dies if I open a file with one encoding and it is actually in the other.

open (my $fh, "<:raw:encoding(UTF-16LE):crlf", $ARGV[0]) or die "cannot open file for reading: $!\n";

or

open (my $fh, "<:raw:encoding(UTF-16BE):crlf", $ARGV[0]) or die "cannot open file for reading: $!\n";

(for files in LE and BE respectively). With the wrong encoding, Perl fails with:

UTF-16BE:Malformed HI surrogate dc00 at toASCII.pl line 123.
+4
2 answers

Most UTF-16le files are also valid UTF-16be files, and vice versa. For example, there is no way to tell whether the bytes 0A 00 mean U+000A (UTF-16le) or U+0A00 (UTF-16be). So, in the absence of a specification, you have to guess.
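To make the ambiguity concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative; the point is that the same two bytes decode without complaint under either byte order.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same two bytes are well formed in both byte orders.
my $bytes = "\x0A\x00";

my $le = decode('UTF-16LE', $bytes);    # read as 0x000A
my $be = decode('UTF-16BE', $bytes);    # read as 0x0A00

printf "UTF-16LE: U+%04X\n", ord $le;
printf "UTF-16BE: U+%04X\n", ord $be;
```

Neither call raises an error, which is why the byte order cannot be read off the data itself in the general case.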

Possible heuristics (in decreasing order of reliability):

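  • U+FFFE is not a valid character (by design), so a byte order mark is unambiguous.
    • If the file starts with FF FE, it is UTF-16le.
    • If the file starts with FE FF, it is UTF-16be.
    • If the file is not valid as UTF-16be, it is UTF-16le.
    • If the file is not valid as UTF-16le, it is UTF-16be.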
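  • U+0A00 is rarely (if ever) used, while U+000A (LINE FEED) is very common.
    U+0D00 is rarely (if ever) used, while U+000D (CARRIAGE RETURN) is very common.
    • If the file contains 0A 00 or 0D 00 at even offsets, it is probably UTF-16le.
    • If the file contains 00 0A or 00 0D at even offsets, it is probably UTF-16be.
    • If decoding as UTF-16be yields U+0A00 or U+0D00, the file is probably UTF-16le.
    • If decoding as UTF-16le yields U+0A00 or U+0D00, the file is probably UTF-16be.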
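  • If the text is mostly ASCII, characters of the form U+xx00 should be rare.
    • If xx 00 pairs outnumber 00 xx pairs, the file is probably UTF-16le.
    • If 00 xx pairs outnumber xx 00 pairs, the file is probably UTF-16be.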
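The heuristics above could be combined roughly as follows. This is only a sketch, not a definitive detector: guess_utf16_byte_order is a hypothetical helper name, it expects the raw bytes of the file, and it folds the line-ending and ASCII checks into a single count of 16-bit units that have one zero byte.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: guess the byte order of raw UTF-16 data.
# Returns 'UTF-16LE', 'UTF-16BE', or nothing when no heuristic decides.
sub guess_utf16_byte_order {
    my ($bytes) = @_;

    # 1. Byte order mark at the start of the data.
    return 'UTF-16LE' if substr($bytes, 0, 2) eq "\xFF\xFE";
    return 'UTF-16BE' if substr($bytes, 0, 2) eq "\xFE\xFF";

    # 2. Strict decoding: if only one byte order is well formed, take it.
    my %valid;
    for my $enc (qw(UTF-16LE UTF-16BE)) {
        my $copy = $bytes;    # decode() may modify its input when CHECK is set
        $valid{$enc} = eval { decode($enc, $copy, FB_CROAK); 1 } ? 1 : 0;
    }
    return 'UTF-16LE' if $valid{'UTF-16LE'} && !$valid{'UTF-16BE'};
    return 'UTF-16BE' if $valid{'UTF-16BE'} && !$valid{'UTF-16LE'};

    # 3. Count units of the form "xx 00" versus "00 xx" (xx non-zero).
    #    ASCII text, including CR and LF, produces many of one kind.
    my ($le, $be) = (0, 0);
    for my $unit (unpack '(a2)*', $bytes) {
        next if length($unit) < 2;    # ignore a trailing odd byte
        my ($first, $second) = unpack 'C2', $unit;
        $le++ if $first != 0 && $second == 0;
        $be++ if $first == 0 && $second != 0;
    }
    return 'UTF-16LE' if $le > $be;
    return 'UTF-16BE' if $be > $le;
    return;    # undecided
}

# Example: a BOM followed by "hi" in little-endian order.
my $guess = guess_utf16_byte_order("\xFF\xFEh\x00i\x00");
print defined $guess ? $guess : 'unknown', "\n";
```

The checks run in the same decreasing order of reliability as the list, so a BOM always wins and the byte-count comparison is only a last resort.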


Open the file with :raw, then decode the data yourself and apply s/\r\n/\n/g.
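A minimal sketch of that approach, assuming the byte order has already been decided by the caller. slurp_utf16 is a hypothetical helper name, and stripping a leading BOM is an extra step added here for convenience, not something the answer spells out.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: read a whole file as raw bytes, decode it with the
# byte order chosen by the caller, then normalise CRLF line endings.
sub slurp_utf16 {
    my ($path, $encoding) = @_;    # $encoding: 'UTF-16LE' or 'UTF-16BE'

    open my $fh, '<:raw', $path
        or die "cannot open $path for reading: $!\n";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    my $text = decode($encoding, $bytes);
    $text =~ s/\A\x{FEFF}//;    # assumption: drop a leading BOM if present
    $text =~ s/\r\n/\n/g;       # what the :crlf layer would otherwise do
    return $text;
}
```

A call might look like `my $text = slurp_utf16($ARGV[0], 'UTF-16LE');`, with the encoding name coming from whichever guessing strategy is in use.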

+4

One option is to try one encoding first and fall back to the other if decoding fails, something like this:

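my $file = $ARGV[0];

# First attempt: assume little endian.
open my $fh, '<:raw:encoding(UTF-16LE):crlf', $file or die $!;

# Decoding errors are fatal, so trap them.
eval { do_stuff_that_may_crash($fh) };

if ( $@ ) {
    if ( $@ =~ /Malformed HI surrogate/ ) {
        # Reopen the same handle as big endian and start over.
        close $fh;
        open $fh, '<:raw:encoding(UTF-16BE):crlf', $file or die $!;
        do_stuff_that_may_crash($fh);
    }
    else {
        die $@;    # unrelated error: rethrow it
    }
}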

Here do_stuff_that_may_crash() stands for whatever code reads and processes the file.

+1

Source: https://habr.com/ru/post/1663883/
