Perl for counting non-printable characters

I have 100,000 files that I would like to parse. In particular, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. Some of these files come from mainframes, Windows, Unix, etc., so binary and control characters are likely to be present.

I started with the Linux file command, but it did not give enough detail for my purposes. The following script conveys what I am trying to do, but does not always work.

    #!/usr/bin/perl -n
    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print   = 0;
    my $cnt_total   = 0;
    my $prc_print   = 0;

    # Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) { $cnt_n_print++ }

    # Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) { $cnt_print++ }

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print / $cnt_total;

    # Print the total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n";

This is a test call that works:

  echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl 

This is how I intend to call it and works for a single file:

  find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl 

This does not work correctly:

  find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl 

Also this:

  find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl 

Instead of executing the script once for EACH file returned by the search, it executes once against ALL of the results combined.
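The batching behavior can be seen with a toy pipeline (the file names here are made up): xargs packs as many arguments as fit into a single invocation of the command, so the consumer runs once for the whole batch, not once per file.

```shell
# xargs passes as many arguments as possible to a single command invocation,
# so "echo" runs once with all three names, not three times.
printf 'a\0b\0c\0' | xargs -0 echo
# prints a single line: a b c
```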

Thanks in advance.


Research so far:

Pipe and XARGS and Separators

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem


Clarifications:
1.) The desired outcome: if there are 932 files in the directory, the output should be 932 lines listing the file name, the total number of bytes read from the file, and the percentage of those bytes that were printable characters.
2.) Many of the files are binary. The script should cope with embedded binary sequences such as eol or eof characters.
3.) Many of the files are large, so I would like to read just the first/last xx bytes. I tried using head -c 256 or tail -c 128 to read the first 256 bytes or the last 128 bytes respectively. The solution can work either in the pipeline or by reading the bytes inside the Perl script.
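One way to get per-file output without changing the Perl logic is to loop over the files in the shell so that head and the counting code run once per file. This is a sketch assuming bash and GNU head; the path and the 256-byte limit are illustrative, and the one-liner stands in for the real script:

```shell
# For each regular file, read only the first 256 bytes and report
# "name|bytes_read|printable_fraction" on its own line.
find /fct/inbound/trans/ -type f -print0 |
while IFS= read -r -d '' f; do
  head -c 256 "$f" | perl -e '
    my $data  = do { local $/; <STDIN> };       # slurp the truncated stream
    my $total = length $data;
    my $bad   = () = $data =~ /[^[:print:]]/g;  # count non-printable bytes
    printf "%s|%d|%.4f\n", $ARGV[0], $total,
           $total ? ($total - $bad) / $total : 0;
  ' "$f"
done
```

Swapping head -c 256 for tail -c 128 samples the end of each file instead; alternatively, the truncation can move into Perl itself with read (or seek plus read for the tail case), avoiding the extra process per file.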

+4
3 answers

Here is my working solution based on the feedback provided.

I would appreciate further feedback on style or more efficient methods:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # This program receives file paths/names on the command line.
    # It attempts to read the first 2000 bytes of each file.
    # The output is one line per file: the file name, the number of
    # bytes actually read, and the percentage of those bytes that are
    # ASCII "printable", aka [\x20-\x7E].

    # Loop through each file named on the command line.
    # (Iterating @ARGV directly avoids the earlier shift-inside-foreach
    # bug, which modified @ARGV while looping over it and skipped files.)
    foreach my $file_name (@ARGV) {
        # Open the file read-only.
        open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";
        # Use binary mode to handle non-printable characters.
        binmode $fh;
        # Try to read 2000 bytes from the file, saving the data in $data
        # and the actual number of bytes read in $n_bytes.
        my $n_bytes = read $fh, my $data, 2000;
        my $cnt_n_print = 0;
        # Count the number of non-printable characters.
        ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);
        my $cnt_print = $n_bytes - $cnt_n_print;
        my $prc_print = $cnt_print / $n_bytes;
        print "$file_name|$n_bytes|$prc_print\n";
        close($fh);
    }

Here is an example call to the above script:

  find /some/path/to/files/ -type f -exec perl this_script.pl {} + 

Here is a list of links that I found useful:

POSIX bracket expressions
Opening files in binmode
The read function
Opening a file read-only

0

The -n switch wraps all of your code in a while (defined($_ = <ARGV>)) { ... } block. This means that your my declarations of $cnt_print and the other variables are repeated for each line of input, essentially resetting all of your variable values.

The workaround is to use global variables (declare them with our if you want to keep using use strict), and not to initialize them to 0 with a plain assignment, as that would be rerun for each line of input. You could say something like

 our $cnt_print //= 0; 

if you don't want $cnt_print and its friends to be undefined for the first line of input.

See this recent question with a similar issue.

+4

You could get find to pass you one argument at a time:

 find /fct/inbound/trans/ -type f -exec perl script.pl {} \; 

But I would continue to pass several files at a time, either via xargs or using GNU find's -exec ... + :

 find /fct/inbound/trans/ -type f -exec perl script.pl {} + 

The following code snippets support both.

You can continue reading a line at a time:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $cnt_total   = 0;
    my $cnt_n_print = 0;

    while (<>) {
        $cnt_total += length;
        ++$cnt_n_print while /[^[:print:]]/g;
    }
    continue {
        if (eof) {
            my $cnt_print = $cnt_total - $cnt_n_print;
            my $prc_print = $cnt_print / $cnt_total;
            print "$ARGV: $cnt_total|$prc_print\n";
            $cnt_total   = 0;
            $cnt_n_print = 0;
        }
    }

Or you can read the whole file at a time:

    #!/usr/bin/perl
    use strict;
    use warnings;

    local $/;
    while (<>) {
        my $cnt_n_print = 0;
        ++$cnt_n_print while /[^[:print:]]/g;
        my $cnt_total = length;
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_print / $cnt_total;
        print "$ARGV: $cnt_total|$prc_print\n";
    }
+1

Source: https://habr.com/ru/post/1447359/

