I have a data set of 20,000 probe pairs in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in column Probe 1 matches the last nucleotide in column Probe 2. So far I have tried awk's substr function, but did not get the expected result. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to match the last character in columns 2 and 4 with a regex (awk '$2 ~ /[A-Z]$/'), but I cannot find a way to compare the probes in the two columns with a regex. All suggestions and comments will be greatly appreciated.
Dataset example:
     Probe 1                 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Expected output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
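For reference, the comparison I am after could be sketched roughly as follows. This is only a sketch under assumptions: fields are whitespace-separated, the probes sit in fields 2 and 4 as in the example, and the file may carry Windows line endings (a trailing carriage return would end up inside field 4 and break the comparison). The function name filter_matching_ends is made up for illustration.

```shell
# Hypothetical helper: print lines whose fields 2 and 4 end in the same character.
filter_matching_ends() {
  awk '
    { sub(/\r$/, "") }   # drop a trailing carriage return, if present
    # Need at least four fields; compare the last character of fields 2 and 4.
    NF >= 4 && substr($2, length($2)) == substr($4, length($4))
  '
}

# Run it on the rows from the dataset example above.
result=$(filter_matching_ends <<'EOF'
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
EOF
)
printf '%s\n' "$result"   # prints rows 4736, 4737 and 4740
```

Using length($2) instead of a fixed position 21 means the comparison does not depend on every probe being exactly 21 nt long. On a real file it would be called as filter_matching_ends < probes.txt, where probes.txt is a placeholder name.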
Bio21