How to print the first row of column 1 and the last row of column 2 based on unique id in column 3

I have a tab delimited file that looks like this:

Het 157709  157731  Cluster.90  2   +
Het 157739  157760  Cluster.90  2   +
Het 164238  164259  Cluster.97  10  +
Het 164380  164401  Cluster.97  10  +
Het 164396  164417  Cluster.97  10  +
Het 164397  164421  Cluster.97  10  +
Het 164397  164420  Cluster.97  10  +
Het 164399  164420  Cluster.97  10  +
Het 164536  164561  Cluster.97  10  +
Het 164576  164598  Cluster.97  10  +
Het 164599  164615  Cluster.97  10  +
Het 164635  164656  Cluster.97  10  +
Het 198007  198031  Cluster.125 3   +
Het 198007  198028  Cluster.125 3   +
Het 198011  198035  Cluster.125 3   +

I am looking for an efficient way to create a file as follows:

Het 157709  157760  Cluster.90  2   +
Het 164238  164656  Cluster.97  10  +
Het 198007  198035  Cluster.125 3   +

Where for each unique entry in column 4 I write a line that includes the first row for columns 1 and 2, followed by the last row in columns 3, 4, 5 and 6. So far I have tried the following solution but it seems very inefficient:

for i in `awk '{print $4}' filename | sort | uniq`
    do
    fgrep -F $i -w filename | awk 'NR==1 {printf $1"\t"$2"\t"} END {print $3"\t"$4"\t"$5"\t"$6}' >>filename2
done

The problem is that when I have a huge file (487559 lines), it takes forever. Is there a better solution lurking in someone's head out there?

4 answers

Using awk, as a one-line script:

awk 'BEGIN{FS=OFS="\t"} !($4 in a){a[$4]=$1 FS $2; r[++i]=$4} {b[$4]=$3 FS $4 FS $5 FS $6} END{for (k=1; k<=i; k++) print a[r[k]], b[r[k]]}' file
Het 157709  157760  Cluster.90  2   +
Het 164238  164656  Cluster.97  10  +
Het 198007  198035  Cluster.125 3   +

The same script, expanded with comments:

awk 'BEGIN { FS = OFS = "\t" }      # read and write tab-delimited fields
!($4 in a) {                        # first row of a new cluster
    a[$4] = $1 FS $2                # keep its columns 1 and 2
    r[++i] = $4                     # remember the order in which clusters appear
}
{
    b[$4] = $3 FS $4 FS $5 FS $6    # overwritten on every row, so the last row wins
}
END {
    for (k = 1; k <= i; k++)
        print a[r[k]], b[r[k]]
}' file

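This assumes that columns 4, 5 and 6 are identical for every row of a cluster, as they are in the sample. If so, here is an alternative without awk, FWIW: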

paste <(uniq -f3 file | cut -f1,2) <(tac file | uniq -f3 | tac | cut -f3-)

uniq compares only adjacent lines, so the rows of each cluster must be contiguous, as they are here.
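If a file's clusters were interleaved rather than contiguous, one option is to group the rows first with a stable sort on column 4. A minimal sketch, assuming tab-delimited input and a sort that supports the -s (stable) option, as GNU and BSD sort do; group_by_cluster is a hypothetical helper name, not part of the answer above:

```shell
# Hypothetical helper: bring the rows of each cluster together.
# Assumes tab-delimited input and a sort with -s (stable), e.g. GNU or BSD sort.
group_by_cluster() {
  tab=$(printf '\t')
  # -k4,4 compares column 4 only; -s keeps the original row order within a cluster.
  sort -s -t "$tab" -k4,4 "$@"
}
```

For example, group_by_cluster file > grouped, then run the pipeline on grouped. Note that the clusters then come out in sorted order of column 4 (Cluster.125 before Cluster.90) rather than in order of first appearance.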


Another awk solution:

awk '
BEGIN { FS = OFS = "\t" }
!seen[$4]++ {                 # first row of a new cluster
  fld[++i] = $1 OFS $2        # keep its columns 1 and 2
}
{                             # every row: overwrite with the latest columns 3-6
  line[i] = fld[i] OFS $3 OFS $4 OFS $5 OFS $6
}
END {
  for (n = 1; n <= i; n++)
    print line[n]
}' file

Output:

Het 157709  157760  Cluster.90  2   +
Het 164238  164656  Cluster.97  10  +
Het 198007  198035  Cluster.125 3   +

The problem is that you run fgrep and awk over the whole file once for every unique value in column 4.

Since the file appears to be grouped by column 4 already, a single pass over it is enough.

So just write a filter in bash, python, ruby, perl, awk or any language of your choice that reads line by line from stdin and remembers the last value seen in column 4. Whenever this value changes, do what is needed: write a row containing the first values seen in the first two columns and the last values seen in the remaining columns. Then record the new values for columns 1 and 2. This is pretty simple, but can be tricky around the first and last rows.
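The single-pass idea can be sketched in awk as follows. This is only an illustration under stated assumptions: the input is tab-delimited, the rows of each cluster are contiguous, and collapse_clusters is a hypothetical helper name. The first and last rows are handled by the NR checks and the END block.

```shell
# Hypothetical helper: collapse each run of rows sharing column 4 into one row.
# Assumes tab-delimited input whose rows are already grouped by column 4.
collapse_clusters() {
  awk '
    BEGIN { FS = OFS = "\t" }
    # Column 4 changed: write out the cluster that just ended.
    NR > 1 && $4 != id { print first, last }
    # First row of a cluster: remember its id and its columns 1 and 2.
    NR == 1 || $4 != id { id = $4; first = $1 OFS $2 }
    # Every row: remember the most recent columns 3-6.
    { last = $3 OFS $4 OFS $5 OFS $6 }
    # The final cluster has no following row to trigger the write.
    END { if (NR > 0) print first, last }
  ' "$@"
}
```

Usage would be collapse_clusters filename > filename2. Only one cluster is held in memory at a time, so memory use does not grow with the size of the file.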


Source: https://habr.com/ru/post/1530354/

