Awk: how to compare two fields on one line

I have a data set of 20,000 probe pairs in two columns, each probe 21 nt long. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried awk's substr function, but did not get the expected result. Here is the one-liner I tried:

awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}' 

Another option would be to anchor a match to the last character of columns 2 and 4 (awk '$2 ~ /[A-Z]$/'), but I cannot find a way to compare the probes in the two columns with a regex. All suggestions and comments will be greatly appreciated.

Dataset example:

       Probe 1                 Probe 2
  4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
  4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
  4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
  4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
  4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC

Desired output:

  4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
  4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
  4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
1 answer

This filters the input, keeping only the lines where the last character of the second column equals the last character of the fourth column:

 awk 'substr($2, length($2), 1) == substr($4, length($4), 1)' 

What I changed from your sample script:

  • Moved the if from the { ... } block into the pattern
  • Used length($2) and length($4) instead of hardcoding the value 21
  • Dropped { print $0 }, since printing the line is the default action when the pattern matches
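
As a quick check, here is a minimal sketch that runs the same filter on the sample rows from the question. The function name filter_same_last and the variable matches are made up for this illustration, and the data is fed through a here-document rather than a real file:

```shell
# Hypothetical wrapper around the one-liner above; reads files given
# as arguments, or standard input when called without arguments.
filter_same_last() {
    awk 'substr($2, length($2), 1) == substr($4, length($4), 1)' "$@"
}

# Sample rows from the question, including the header line.
matches=$(filter_same_last <<'EOF'
  Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
EOF
)

# Prints the rows with IDs 4736, 4737 and 4740.
printf '%s\n' "$matches"
```

With this sample, the header line drops out without special handling, because its second and fourth fields are "1" and "2". A header whose fields happened to end in the same character would need an explicit skip, for example by adding NR > 1 && in front of the comparison.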

Source: https://habr.com/ru/post/1012686/

