Awk: how to compare two lines in one line

Question


I have a data set of 20,000 probes arranged in two columns, each probe 21 nt long. From this file I need to extract the lines in which the last nucleotide in column Probe 1 matches the last nucleotide in column Probe 2. So far I have tried awk's substr function, but did not get the expected result. Here is the one-liner I tried:

awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'

Another option would be to anchor on the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I cannot find a way to match the probes in the two columns against each other with a regex. All suggestions and comments will be greatly appreciated.

Dataset example:

      Probe 1                 Probe 2
 4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
 4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
 4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
 4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
 4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC

Desired output:

 4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
 4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
 4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC


bash awk

Bio21 Nov 27 '16 at 14:30


1 answer

janos · Accepted Answer · 2016-11-27T14:38:17+0000

This filters the input, printing only the lines where the last character of the 2nd column equals the last character of the 4th column:

 awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
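As a quick check, the filter can be run against the sample rows from the question. This is only a minimal sketch: the shell variable `probes` is a hypothetical stand-in for the real input file, which the question does not name.

```shell
# Sample rows from the question: ID, Probe 1, label, Probe 2.
# "probes" is a hypothetical stand-in for the real input file.
probes='4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC'

# Print rows whose 2nd and 4th fields end in the same character.
matches=$(printf '%s\n' "$probes" |
  awk 'substr($2, length($2), 1) == substr($4, length($4), 1)')

printf '%s\n' "$matches"
# Prints rows 4736, 4737 and 4740.
```

Note that an empty line would also satisfy the comparison, because both substr calls then return an empty string. Prefixing the condition with `NF >= 4 &&` is one way to skip such lines.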

What I changed from your sample script:

Moved the if out of the { ... } block and into the pattern (filter) position
Used length($2) and length($4) instead of hardcoding the value 21
Dropped { print $0 }, since printing the record is the default action for matching lines
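The three changes above can be sketched step by step. The snippet below is illustrative only: the variable names (`rows`, `out_block`, `out_pattern`, `out_default`) are hypothetical, and the two rows are taken from the question's sample. On clean input with 21-character probes all three forms should select the same row. The hardcoded 21 would only behave differently for fields of another length.

```shell
# One matching and one non-matching row from the question's sample.
rows='4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT'

# 1. Original style: test inside an action block, explicit print.
out_block=$(printf '%s\n' "$rows" |
  awk '{ if (substr($2, 21, 1) == substr($4, 21, 1)) { print $0 } }')

# 2. Condition moved to the pattern, length() instead of 21, explicit action.
out_pattern=$(printf '%s\n' "$rows" |
  awk 'substr($2, length($2), 1) == substr($4, length($4), 1) { print $0 }')

# 3. Pattern only: awk prints matching records by default.
out_default=$(printf '%s\n' "$rows" |
  awk 'substr($2, length($2), 1) == substr($4, length($4), 1)')

printf '%s\n' "$out_block" "$out_pattern" "$out_default"
```

Each of the three variables holds only the 4736 row, so the final printf prints that row three times.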
