Perl for counting non-printable characters

I have 100,000 files that I would like to parse. In particular, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. Some of these files come from mainframes, Windows, Unix, etc., so binary and control characters are likely to be present.

I started with the Linux file command, but it did not give enough detail for my purposes. The following script conveys what I am trying to do, but does not always work.

    #!/usr/bin/perl -n
    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print   = 0;
    my $cnt_total   = 0;
    my $prc_print   = 0;

    # Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) { $cnt_n_print++ }

    # Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) { $cnt_print++ }

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print / $cnt_total;

    # Print the total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n";

This is a test call that works:

  echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl 

This is how I intend to call it and works for a single file:

  find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl 

This does not work correctly:

  find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl 

Also this:

  find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl 

Instead of executing the script once for EACH file returned by the search, it executes once against ALL of the results combined.
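The batching behavior can be seen with a toy pipeline (the file names here are made up): xargs packs as many arguments as fit into a single invocation of the command, so the consumer runs once for the whole batch, not once per file.

```shell
# xargs passes as many arguments as possible to a single command invocation,
# so "echo" runs once with all three names, not three times.
printf 'a\0b\0c\0' | xargs -0 echo
# prints a single line: a b c
```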

Thanks in advance.


Research so far:

Pipe and XARGS and Separators

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem


Clarifications:
1.) The desired outcome: if there are 932 files in the directory, the output should be 932 lines listing the file name, the total number of bytes read from the file, and the percentage of those bytes that were printable characters.
2.) Many of the files are binary. The script should cope with embedded binary sequences such as eol or eof characters.
3.) Many of the files are large, so I would like to read just the first/last xx bytes. I tried using head -c 256 or tail -c 128 to read the first 256 bytes or the last 128 bytes respectively. The solution can work either in the pipeline or by reading the bytes inside the Perl script.
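One way to get per-file output without changing the Perl logic is to loop over the files in the shell so that head and the counting code run once per file. This is a sketch assuming bash and GNU head; the path and the 256-byte limit are illustrative, and the one-liner stands in for the real script:

```shell
# For each regular file, read only the first 256 bytes and report
# "name|bytes_read|printable_fraction" on its own line.
find /fct/inbound/trans/ -type f -print0 |
while IFS= read -r -d '' f; do
  head -c 256 "$f" | perl -e '
    my $data  = do { local $/; <STDIN> };       # slurp the truncated stream
    my $total = length $data;
    my $bad   = () = $data =~ /[^[:print:]]/g;  # count non-printable bytes
    printf "%s|%d|%.4f\n", $ARGV[0], $total,
           $total ? ($total - $bad) / $total : 0;
  ' "$f"
done
```

Swapping head -c 256 for tail -c 128 samples the end of each file instead; alternatively, the truncation can move into Perl itself with read (or seek plus read for the tail case), avoiding the extra process per file.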

+4
3 answers

Here is my working solution based on the feedback provided.

I would appreciate further feedback on style or more efficient methods:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # This program receives file paths/names on the command line.
    # It attempts to read the first 2000 bytes of each file.
    # The output is one line per file: the file name, the number of
    # bytes actually read, and the percentage of those bytes that are
    # ASCII "printable", aka [\x20-\x7E].

    # Loop through each file named on the command line.
    # (Iterating @ARGV directly avoids the earlier shift-inside-foreach
    # bug, which modified @ARGV while looping over it and skipped files.)
    foreach my $file_name (@ARGV) {
        # Open the file read-only.
        open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";
        # Use binary mode to handle non-printable characters.
        binmode $fh;
        # Try to read 2000 bytes from the file, saving the data in $data
        # and the actual number of bytes read in $n_bytes.
        my $n_bytes = read $fh, my $data, 2000;
        my $cnt_n_print = 0;
        # Count the number of non-printable characters.
        ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);
        my $cnt_print = $n_bytes - $cnt_n_print;
        my $prc_print = $cnt_print / $n_bytes;
        print "$file_name|$n_bytes|$prc_print\n";
        close($fh);
    }

Here is an example call to the above script:

  find /some/path/to/files/ -type f -exec perl this_script.pl {} + 

Here is a list of links that I found useful:

POSIX bracket expressions
Opening files in binmode
The read function
Opening a file read-only

0

The -n switch wraps all of your code in a while (defined($_ = <ARGV>)) { ... } block. This means that your my declarations of $cnt_print and the other variables are repeated for each line of input, essentially resetting all of your variable values.

The workaround is to use global variables (declare them with our if you want to keep using use strict), and not to initialize them to 0 with a plain assignment, as that would be rerun for each line of input. You could say something like

 our $cnt_print //= 0; 

if you don't want $cnt_print and its friends to be undefined for the first line of input.

See this recent question with a similar issue.

+4

You could get find to pass you one argument at a time:

 find /fct/inbound/trans/ -type f -exec perl script.pl {} \; 

But I would continue to pass several files at a time, either via xargs or using GNU find's -exec ... + :

 find /fct/inbound/trans/ -type f -exec perl script.pl {} + 

The following code snippets support both.

You can continue reading a line at a time:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $cnt_total   = 0;
    my $cnt_n_print = 0;

    while (<>) {
        $cnt_total += length;
        ++$cnt_n_print while /[^[:print:]]/g;
    }
    continue {
        if (eof) {
            my $cnt_print = $cnt_total - $cnt_n_print;
            my $prc_print = $cnt_print / $cnt_total;
            print "$ARGV: $cnt_total|$prc_print\n";
            $cnt_total   = 0;
            $cnt_n_print = 0;
        }
    }

Or you can read the whole file at a time:

    #!/usr/bin/perl
    use strict;
    use warnings;

    local $/;
    while (<>) {
        my $cnt_n_print = 0;
        ++$cnt_n_print while /[^[:print:]]/g;
        my $cnt_total = length;
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_print / $cnt_total;
        print "$ARGV: $cnt_total|$prc_print\n";
    }
+1

Source: https://habr.com/ru/post/1447359/

