How to use perl to process a file whose format is similar to unicode?

Question


I have a legacy program that generates a log file when it starts. I now need to parse this log file.

But the file format is very strange. When I open it in vi, it looks like a Unicode file, but it does not start with FFFE. After I opened it in Notepad, saved it, and opened it again, I found that Notepad had added FFFE at the start. Then I can use the command "type log.txt > log1.txt" to convert the entire file to ANSI format. After that, I can use /TDD/ in Perl to find the lines I need.

But I cannot process the original file directly in Perl.

Any comments or ideas would be greatly appreciated.

0000000: 5400 4400 4400 3e00 2000 4c00 6f00 6100 TDD>. .Loa

After Notepad saves it:

 0000000: fffe 5400 4400 4400 3e00 2000 4c00 6f00 ..TDD>. .Lo

My current code:

 open STDIN, "< log.txt";
 while (<>) {
     if (/TDD/) {
         # Add my logic.
     }
 }

I read the thread "How to open a Unicode file using Perl?", which was very useful, but it still did not solve my problem.

I cannot add an answer, so I am editing my question.

Thanks, Michael. I tried your script but got the following error. My Perl version is 5.1 and the OS is Windows 2008.

 * ascii
 * ascii-ctrl
 * iso-8859-1
 * null
 * utf-8-strict
 * utf8
 UTF-16:Unrecognised BOM 5400 at test.pl line 12.

Update

I tried UTF-16LE with the command:

 perl.exe open.pl utf-16le utf-16 <my log file>.txt

but I still got an error like this:

 UTF-16LE:Partial character at open.pl line 18, <$fh> line 1824.

I also tried utf-16be and got the same error.

If I use utf-16, I get this error:

 UTF-16:Unrecognised BOM 5400 at open.pl line 18.

Line 18 of open.pl is:

 print while <$fh>;

Any ideas?

Updated: 11/11/2011. Thanks, guys, for your help. I solved the problem. I found that the data in the log file is not UTF-16, so I had to write a .NET project in Visual Studio. It reads the log file as UTF-16 and writes a new file as UTF-8. Then I used a Perl script to parse that file and create the result data. Now it works.

So, if any of you know how to use Perl to read a file that contains a lot of garbage data, please tell me. Many thanks.

E.g., sample garbage data:

 tests.cpp:34) ਍吀䐀䐀㸀 䰀漀愀搀椀渀最 挀挀洀挀漀爀攀⸀搀氀

Opened in a hex viewer:

 0000070: a88d e590 80e4 9080 e490 80e3 b880 e280 ................
 0000080: 80e4 b080 e6bc 80e6 8480 e690 80e6 a480 ................
 0000090: e6b8 80e6 9c80 e280 80e6 8c80 e68c 80e6 ................
 00000a0: b480 e68c 80e6 bc80 e788 80e6 9480 e2b8 ................


file perl encode

Orionpax May 6 '11 at 7:26


1 answer

Lumi · Accepted Answer · 2011-05-06T07:44:59+0000

Your file seems to be encoded in UTF-16LE. The extra bytes that Notepad adds are called the "Byte Order Mark", or simply BOM.
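To check which case a given file falls into, one option is to look at its first two bytes before picking an encoding name. The following is only a rough sketch: `guess_utf16_encoding` is a hypothetical helper (not part of Encode), and the no-BOM branches assume the file starts with an ASCII-range character, as in the `5400 4400` dump from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an Encode name from the first two raw bytes.
sub guess_utf16_encoding {
    my ($head) = @_;
    # A byte order mark is present, so plain "UTF-16" can pick the byte order.
    return 'UTF-16'   if $head =~ /\A(?:\xFF\xFE|\xFE\xFF)/;
    # Assumption: the first character is ASCII, so one of the two bytes is NUL.
    return 'UTF-16LE' if $head =~ /\A[^\x00]\x00/;   # low byte first
    return 'UTF-16BE' if $head =~ /\A\x00[^\x00]/;   # high byte first
    return;                                          # cannot tell
}

# Usage: perl guess.pl log.txt
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "open $ARGV[0]: $!";
    my $head = '';
    read $fh, $head, 2;
    close $fh;
    my $enc = guess_utf16_encoding($head);
    print defined $enc ? $enc : 'unknown', "\n";
}
```

For the original log (starting with `5400`) this would suggest UTF-16LE, and for the Notepad-saved copy (starting with `fffe`) plain UTF-16. It is a heuristic, not a guarantee that the rest of the file is well formed.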

Here is how you can read your file using Perl:

 use strict;
 use warnings;
 use Encode;

 # list loaded encodings
 print STDERR map "* $_\n", Encode->encodings;

 # read arguments
 my $enc = shift || 'utf16';
 die "no files :-(\n" unless @ARGV;

 # process files
 for ( @ARGV ) {
     open my $fh, "<:encoding($enc)", $_ or die "open $_: $!";
     print <$fh>;
     close $fh;
 }

 # loaded more encodings now
 print STDERR map "* $_\n", Encode->encodings;

Run it like this, taking care to pass the correct encoding for your file:

 perl open.pl utf16 open.utf16be.txt
 perl open.pl utf16 open.utf16le.txt
 perl open.pl utf16le open.utf16le.nobom.txt

Here's the revised version following tchrist's suggestions:

 use strict;
 use warnings;
 use Encode;

 # read arguments
 my $enc_in  = shift || die 'pass file encoding as first parameter';
 my $enc_out = shift || die 'pass STDOUT encoding as second parameter';
 print STDERR "going to read files as encoded in: $enc_in\n";
 print STDERR "going to write to standard output in: $enc_out\n";
 die "no files :-(\n" unless @ARGV;

 binmode STDOUT, ":encoding($enc_out)"; # latin1, cp1252, utf8, UTF-8

 print STDERR map "* $_\n", Encode->encodings; # list loaded encodings

 for ( @ARGV ) { # process files
     open my $fh, "<:encoding($enc_in)", $_ or die "open $_: $!";
     print while <$fh>;
     close $fh;
 }

 print STDERR map "* $_\n", Encode->encodings; # more encodings now
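If the log mixes UTF-16LE text with single-byte text, as the garbage sample in the question suggests, a strict `:encoding(...)` layer may keep failing with "Partial character". One crude workaround, sketched below, is to read the file as raw bytes and delete the NUL bytes before matching. This is not the approach from the scripts above, and it relies on a strong assumption: every character of interest is in the ASCII range, so the high byte of each UTF-16 unit is always `00`. The helper names `strip_nuls` and `grep_mixed_log` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop NUL bytes so that ASCII-range UTF-16 text
# and ordinary single-byte text both end up as plain ASCII.
sub strip_nuls {
    my ($bytes) = @_;
    (my $clean = $bytes) =~ tr/\x00//d;
    return $clean;
}

# Hypothetical helper: return the cleaned lines of a file that match a pattern.
sub grep_mixed_log {
    my ($path, $pattern) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    local $/;                                   # slurp the whole file
    my $clean = strip_nuls(scalar <$fh>);
    close $fh;
    return grep { /$pattern/ } split /\r?\n/, $clean;
}

# Usage: perl grep_log.pl log.txt
if (@ARGV) {
    print "$_\n" for grep_mixed_log($ARGV[0], qr/TDD/);
}
```

Any BOM bytes at the start of the file are left in place; they do not affect a match such as `/TDD/`. Non-ASCII characters would be mangled by this approach, so it only fits logs like the one shown.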
