It's much easier than you think. With GNU awk (which you appear to be using, given the length(array) call in your code) for ENDFILE:
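$ cat tst.awk
BEGIN { OFS="\t" }
# Print the header row once, from the first file's header line.
NR==1 { print "", $0, "Total" }
# For every data line, count each value seen in each field.
FNR>1 {
    for (i=1; i<=NF; i++) {
        cnt[i,$i]++
    }
}
# At the end of each file, print its per-field count of 1s and its row count.
ENDFILE {
    printf "%s%s", FILENAME, OFS
    for (i=1; i<=NF; i++) {
        printf "%d%s", cnt[i,1], OFS
    }
    print FNR-1
    delete cnt
}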
$ awk -f tst.awk file1 file2
Header1 Header2 Header3 Header4 Total
file1 1 3 0 4 4
file2 0 1 0 1 3
Note that the above only stores a tiny amount of data in the array (a count per value per field, for one file at a time), so it uses minimal memory, does very little work per line, and should therefore run quickly.
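To try this out, here is a minimal sketch with made-up input. The contents of file1 and file2 below are an assumption (hypothetical tab-separated 0/1 flags), chosen only so that the counts match the output shown above. The script is the same logic inlined on the command line, and it calls gawk explicitly because ENDFILE is gawk-only.

```shell
#!/bin/sh
# Hypothetical sample data: a header line followed by tab-separated 0/1 flags.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT

printf 'Header1\tHeader2\tHeader3\tHeader4\n' > file1
printf '1\t1\t0\t1\n0\t1\t0\t1\n0\t1\t0\t1\n0\t0\t0\t1\n' >> file1

printf 'Header1\tHeader2\tHeader3\tHeader4\n' > file2
printf '0\t1\t0\t0\n0\t0\t0\t1\n0\t0\t0\t0\n' >> file2

# Same logic as the gawk script above, inlined.
out=$(gawk '
BEGIN { OFS="\t" }
NR==1 { print "", $0, "Total" }
FNR>1 { for (i=1; i<=NF; i++) cnt[i,$i]++ }
ENDFILE {
    printf "%s%s", FILENAME, OFS
    for (i=1; i<=NF; i++) printf "%d%s", cnt[i,1], OFS
    print FNR-1
    delete cnt
}' file1 file2)

printf '%s\n' "$out"
```

With that sample data, cnt never holds more than two entries per field (one for 0, one for 1) at any time, which is why memory use stays flat no matter how many files you pass.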
As @ghoti points out, you may not be using gawk after all, so here's a version that doesn't need ENDFILE and counts the data rows itself:
$ cat tst.awk
BEGIN { OFS="\t" }
NR==1 { print "", $0, "Total" }
# On each file's header line, report the previous file (if any) and skip the header.
FNR==1 { prt(); next }
{
    for (i=1; i<=NF; i++) {
        cnt[i,$i]++
    }
    rows++
}
# The last file has no following header line, so report it here.
END { prt() }
function prt() {
    if (prevFilename != "") {
        printf "%s%s", prevFilename, OFS
        for (i=1; i<=NF; i++) {
            printf "%d%s", cnt[i,1], OFS
        }
        print rows
        delete cnt
        rows = 0
    }
    prevFilename = FILENAME
}
$ awk -f tst.awk file1 file2
Header1 Header2 Header3 Header4 Total
file1 1 3 0 4 4
file2 0 1 0 1 3
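One portability note on the non-gawk version: deleting a whole array with `delete cnt` is a widely supported extension, but some older awks may lack it. A minimal sketch of the fallback, assuming nothing beyond POSIX awk, is `split("", cnt)`, which empties the array. The two entries below are made up purely for illustration.

```shell
#!/bin/sh
result=$(awk 'BEGIN {
    cnt[1,1] = 3; cnt[2,0] = 1            # two made-up entries
    before = 0; for (k in cnt) before++   # count elements without length(array)
    split("", cnt)                        # splitting an empty string empties the array
    after = 0; for (k in cnt) after++
    print before, after
}')

printf '%s\n' "$result"   # prints: 2 0
```

If you need to support such an awk, replace `delete cnt` in the script with `split("", cnt)`; nothing else has to change.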