Print the number of unique values for each column in many files

Question

I have huge binary matrices with many columns, and I'm trying to count the zeros and ones in each field of each file, keeping track of the file name and headers. Every file has the same headers and the same number of columns (but a variable number of rows) and looks like this:

File 1:
Header1 Header2 Header3 Header4
0 1 0 1 
0 1 0 1
1 0 0 1
0 1 0 1

File 2:
Header1 Header2 Header3 Header4
0 1 0 0 
0 0 0 0
0 0 0 1

Desired output (number of 1s in each column, plus the total number of rows):

      Header1 Header2 Header3 Header4 Total
File1 1       3       0       4       4
File2 0       1       0       1       3

So far I can only get the number of values equal to 1 for file1, and the result is printed with each header on its own line, whereas I would like the original headers to stay as column headers. It also prints nothing instead of 0 when a column contains no 1s, and it does not include the original file name, so it is not quite right. Can you point me to the right way to do this?

awk 'NF>0{
  for (i=1; i<=NF; i++) 
      if(NR==1)h[i]=$i;else if($i==1) a[i]++;
  } END{for(i=1; i<=length(a); i++) print h[i], a[i], NR}' file1

+4

linux awk

user971102 Oct 2 '15 at 3:29

3

Here is one way to do it:

#!/bin/sh
awk '
    function pr(filename) {
        if (filename) printf ("%s",filename)
        for (i=1; i<=NF; i++) {
            if (filename)
                printf ("%s%s",OFS,a[i])
            else
                printf ("%s%s",OFS,$i) 
            a[i] = 0
            }
        if (filename)
            printf ("%s%s",OFS,prevFNR-1) 
        else 
            printf ("%sTotal",OFS)
        printf ("\n")
        }

    FNR==1  {
            pr(prevFileName)
            prevFileName = FILENAME
            next
            }

    NF>0    {
            for (i=1; i<=NF; i++) 
                if ($i==1) a[i]++
            prevFNR = FNR
            } 

    END {
        pr(FILENAME)
        }' file1 file2

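The first line of each file is detected with FNR==1. The variables prevFNR and prevFileName remember the last line number and the name of the file that was just read. All printing happens in pr(): on the first FNR==1, prevFileName is still empty, so it prints the header row; on every later call it prints the counts for the previous file.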

Output:

 Header1 Header2 Header3 Header4 Total
file1 1 3 0 4 4
file2 0 1 0 1 3
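
The per-file logic above depends on FNR restarting at 1 for every input file while NR keeps counting across all of them. A minimal sketch that makes the difference visible (the function name show_counters is only an illustration, not part of the answer):

```shell
#!/bin/sh
# Sketch only: print the file name, the global record number (NR) and the
# per-file record number (FNR) for every input line.
show_counters() {
    awk '{ print FILENAME, NR, FNR }' "$@"
}
```

Run as show_counters file1 file2, the third column drops back to 1 on the first line of file2, which is the moment the answer uses to print the counts of file1.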

+4

Simon Oct 2 '15 at 4:22

It is much easier than you think. With GNU awk (which you are presumably using, since your code relies on the gawk extension length(array)) you can use ENDFILE:

$ cat tst.awk
BEGIN { OFS="\t" }
NR==1 { $1=$1; print "", $0, "Total" }   # $1=$1 rebuilds $0 so the headers are joined by OFS
FNR>1 {
    for (i=1; i<=NF; i++) {
        cnt[i,$i]++
    }
}
ENDFILE {
    printf "%s%s", FILENAME, OFS
    for (i=1; i<=NF; i++) {
        printf "%d%s", cnt[i,1], OFS
    }
    print FNR-1
    delete cnt
}

$ awk -f tst.awk file1 file2
        Header1 Header2 Header3 Header4 Total
file1   1       3       0       4       4
file2   0       1       0       1       3

The array only ever holds the counts for one file at a time (one entry per column and value), so the script uses very little memory, does very little work per line, and should run quickly.

As @ghoti points out, you may not actually be using gawk, so here is a version that does not need ENDFILE. It prints each file's counts when the next file's header line is read, and once more in END:

$ cat tst.awk
BEGIN { OFS="\t" }
NR==1 { $1=$1; print "", $0, "Total" }   # header row, printed once
FNR==1 { prt(); next }                   # new file: print the previous file's counts
{
    rows++                               # number of data rows in the current file
    for (i=1; i<=NF; i++) {
        cnt[i,$i]++
    }
}
END { prt() }

function prt() {
    if (prevFilename != "") {
        printf "%s%s", prevFilename, OFS
        for (i=1; i<=NF; i++) {
            printf "%d%s", cnt[i,1], OFS
        }
        print rows
        delete cnt
        rows = 0
    }
    prevFilename = FILENAME
}

$ awk -f tst.awk file1 file2
        Header1 Header2 Header3 Header4 Total
file1   1       3       0       4       4
file2   0       1       0       1       3
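
Because cnt is keyed by both column and value, it holds the zero counts as well as the one counts, which is what the question originally asked for. A minimal sketch that prints ones/zeros for every column; the names count_ones_zeros, report, prev and ncols are illustrative assumptions, and the array is cleared with split("", cnt) so that only POSIX awk features are needed:

```shell
#!/bin/sh
# Sketch only: report "ones/zeros" per column for each file, using an array
# keyed by column and value.
count_ones_zeros() {
    awk '
    BEGIN { OFS = "\t" }
    # First line of the first file: print the header row once.
    NR == 1 { ncols = NF; $1 = $1; print "File", $0 }
    # First line of every file: report the previous file, then skip the header.
    FNR == 1 { report(); prev = FILENAME; next }
    # Data lines: count each value seen in each column.
    NF > 0 { for (i = 1; i <= NF; i++) cnt[i, $i]++ }
    END { report() }

    # Print ones/zeros for every column of the previous file, then reset.
    function report(   i) {
        if (prev == "") return
        printf "%s", prev
        for (i = 1; i <= ncols; i++)
            printf "%s%d/%d", OFS, cnt[i, 1], cnt[i, 0]
        printf "\n"
        split("", cnt)    # portable way to empty the whole array
    }' "$@"
}
```

For the sample files in the question, count_ones_zeros file1 file2 gives 1/3, 3/1, 0/4, 4/0 for file1 and 0/3, 1/2, 0/3, 1/2 for file2.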

+3

Ed Morton Oct 2 '15 at 8:02

ghoti · Accepted Answer · 2015-10-02T04:22:32+0000

Another approach, which collects everything and prints it at the end:

awk '
  # Gather headers, only from the first line of the first file.
  NR==1{
    for(i=1;i<=NF;i++){
      h[i]=$i;
    }
  }
  # Do not process header as if they were data.
  FNR==1{ next; }

  NF>limit{ limit=NF; }

  # Step through data 
  {
    f[FILENAME]++;
    for(i=1;i<=NF;i++){
      a[FILENAME,i]+=$i;
    }
  }

  # Display what we found.
  END{
    # Headers...
    printf("File\t");
    for(i=1;i<=length(h);i++){
      printf("%s\t",h[i])
    }
    print "Total";

    # And data.
    for(file in f){
      printf("%s",file);
      for(i=1;i<=limit;i++){
        printf("\t%d",a[file,i])
      }
      printf("\t%d\n",f[file]);
    }
  }' file1 file2

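Note that the files are listed by looping over f[], and awk does not guarantee the order of that loop. The script does not require GNU awk. (I tested it on FreeBSD.)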
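
If the files need to be listed in the order they were given, one option is to record each file name in a numerically indexed array the first time it is seen, instead of relying on for (file in f), whose order awk leaves unspecified. A minimal sketch under that assumption; count_ones_ordered, order, nfiles, ncols, rows and ones are illustrative names, not part of the answer:

```shell
#!/bin/sh
# Sketch only: per-file counts of 1s in each column, printed in the order
# the files were named on the command line.
count_ones_ordered() {
    awk '
    BEGIN { OFS = "\t" }
    # First line of the first file: print the header row once.
    NR == 1 { ncols = NF; $1 = $1; print "File", $0, "Total" }
    # First line of every file: remember the file name in arrival order.
    FNR == 1 { order[++nfiles] = FILENAME; next }
    # Data lines: count rows and 1s per column for the current file.
    NF > 0 {
        rows[FILENAME]++
        for (i = 1; i <= NF; i++)
            if ($i == 1) ones[FILENAME, i]++
    }
    END {
        # Walk the numeric index rather than "for (file in rows)",
        # so the output order matches the command line.
        for (n = 1; n <= nfiles; n++) {
            file = order[n]
            printf "%s", file
            for (i = 1; i <= ncols; i++)
                printf "%s%d", OFS, ones[file, i]
            printf "%s%d\n", OFS, rows[file]
        }
    }' "$@"
}
```

Called as count_ones_ordered file2 file1, this lists file2 before file1. It still keeps one set of counters per file in memory until END, as the answer above does.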
