I have 100,000 files that I would like to parse. In particular, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. Some of these files come from mainframes, Windows, Unix, etc., so they are likely to contain binary and control characters.
I started with the Linux file command, but it did not give enough detail for my purposes. The following code conveys what I am trying to do, but it does not always work.
#!/usr/bin/perl -n
use strict;
use warnings;
my $cnt_n_print = 0;
my $cnt_print = 0;
my $cnt_total = 0;
my $prc_print = 0;
This is a test call that works:
echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl
This is how I intend to call it, and it works for a single file:
find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl
This does not work correctly:
find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl
Neither does this:
find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl
Instead of executing the script once for EACH file returned by find, it executes ONCE for ALL the results.
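A minimal reproduction of what I think is happening, assuming GNU or BSD head: when xargs hands several files to a single head, everything is written to one stream (with "==> name <==" headers between files), so whatever reads the pipe only runs once. The temporary directory and file names below are made up for illustration.

```shell
# Hypothetical sample files, only for illustration.
dir=$(mktemp -d)
printf 'aaaaa' > "$dir/one"
printf 'bbbbb' > "$dir/two"

# One head process for all files: the consumer sees a single stream that
# contains both samples plus one "==> name <==" header line per file.
combined=$(find "$dir" -type f -print0 | xargs -0 head -c 3 | grep -c '==>')

# One head process per file: each sample reaches its own consumer,
# so one result line is produced per file.
per_file=$(find "$dir" -type f -exec sh -c 'head -c 3 "$1" | wc -c' sh {} \; | wc -l)

echo "header lines in combined stream: $combined"
echo "separate results with -exec:     $per_file"
rm -rf "$dir"
```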
Thanks in advance.
Research so far:
Pipe and XARGS and Separators
http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html
http://en.wikipedia.org/wiki/Xargs#The_separator_problem
Clarification(s):
1.) The desired output: if there are 932 files in the directory, the output should consist of 932 lines, each listing the file name, the total number of bytes read from the file, and the percentage of those bytes that were printable characters.
2.) Many of the files are binary. The script should handle embedded binary sequences that look like EOL or EOF.
3.) Many of the files are large, so I would like to read only the first/last xx bytes. I tried using head -c 256 or tail -c 128 to read the first 256 bytes or the last 128 bytes, respectively. The solution can limit the bytes either in the pipeline or within the Perl script.
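To make the three points above concrete, here is a rough shell sketch of the output I am after. It is not my Perl script. The function name report_printable and the 2000-byte default are made up, and "printable" is assumed to mean the ASCII range 0x20-0x7E (the C locale's [:print:] class).

```shell
# Hypothetical helper: prints "<file> <bytes read> <percent printable>".
# Assumption: "printable" means ASCII 0x20-0x7E (C locale [:print:]).
report_printable() {
    rp_file=$1
    rp_limit=${2:-2000}   # sample size in bytes (assumed default)

    # Bytes actually read; $(( )) also strips any padding wc may add.
    rp_total=$(( $(head -c "$rp_limit" "$rp_file" | wc -c) ))

    # Delete every byte that is NOT printable, then count what remains.
    rp_print=$(( $(head -c "$rp_limit" "$rp_file" | LC_ALL=C tr -cd '[:print:]' | wc -c) ))

    # Guard against empty files to avoid dividing by zero.
    rp_pct=$(awk -v p="$rp_print" -v t="$rp_total" \
        'BEGIN { printf "%.1f", (t ? 100 * p / t : 0) }')

    printf '%s %s %s\n' "$rp_file" "$rp_total" "$rp_pct"
}

# Example call (bash, because of read -d ''):
#   find /fct/inbound/trans/ -type f -print0 |
#     while IFS= read -r -d '' f; do report_printable "$f" 256; done
```

Because each file is sampled on its own, embedded newlines or other control bytes in one file cannot bleed into the counts for the next. Replacing head -c with tail -c would sample the end of the file instead.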