Gawk regex for sequence selection

Question

Sorry for the simple regex question, but I can't get what I need without what seems to me an overly complicated solution. I am parsing a file containing sequences of the three letters A, E, D, such as

AADDEEDDA

EEEEEEEE

AEEEDEEA

AEEEDDAAA

and I would like to select only those that start with E and end with D with just one change in the sequence, for example:

EDDDDDDDD

EEEDDDDDD

EEEEEEEED

I am struggling to find the right regex for this. Here is my latest attempt:

echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E[(ED){1,1}]*D$/ && $2 !~ /^E[(ED){2,}]*D$/) print $0}'

which does not work. Any help?

Thanks in advance.

regex gawk

G. tartifola Nov 11 '15 at 20:40

3 answers

Giuseppe Ricupero · Answer 1 · 2015-11-11T21:09:52+0000

If I understand your request correctly, just

awk '/^E+D+$/' file.input

will do the trick.
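As a quick check, here is a minimal sketch that feeds the sample sequences from the question to that pattern. The function name `select_ed` is made up for illustration, and plain `awk` is assumed to behave like `gawk` for this regex.

```shell
# Hypothetical helper: keep only lines made of one or more E
# followed by one or more D (a single E-to-D change).
select_ed() {
    awk '/^E+D+$/'
}

# Sample sequences taken from the question.
result=$(printf '%s\n' AADDEEDDA EEEEEEEE AEEEDEEA EDDDDDDDD EEEDDDDDD EEEEEEEED | select_ed)
printf '%s\n' "$result"
```

Only the last three sequences (EDDDDDDDD, EEEDDDDDD, EEEEEEEED) should be printed.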

UPDATE: to match the whole line, including the first/last fields (numbers) separated by commas (without using -F,):

awk '/^[0-9]+,E+D+(,[0-9]+)?$/' input.test
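A small sketch of the whole-line variant, run on the three input lines from the question's attempt. The function name `select_ed_csv` is hypothetical, and plain `awk` is assumed in place of `gawk`.

```shell
# Hypothetical helper: match the complete comma-separated line,
# so no field separator (-F,) is needed.
select_ed_csv() {
    awk '/^[0-9]+,E+D+(,[0-9]+)?$/'
}

# Input lines from the question; the third one has no trailing number.
csv_result=$(printf '%s\n' '1,AAEDDEED,1' '2,EEEEDDDD,2' '3,EDEDEDED' | select_ed_csv)
printf '%s\n' "$csv_result"
```

Only `2,EEEEDDDD,2` should come out. The `(,[0-9]+)?` group makes the trailing number optional, so a line such as `5,ED` would also match.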

shadowtalker · Answer 2 · 2015-11-11T21:29:29+0000

Try this regex:

^E+[^ED]*D+$

That is: one or more E, followed by zero or more characters that are neither E nor D, followed by one or more D.

In AWK:

$2 ~ /^E+[^ED]*D+$/

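Here $2 is the second field, ~ is the regex match operator, and the / characters delimit the regex. Also, an AWK program is a list of "pattern { action }" pairs, so the test can stand alone as a pattern and there is no need for an explicit "if" (inside the { }s). Finally, when a pattern has no action, AWK falls back to the default action, { print $0 }, so the print can be dropped too.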
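A minimal sketch of the expression above used as a bare AWK pattern with `-F,`, next to the same test written with an explicit `if`. The function names and the fourth input line (`4,EEAADD,4`) are made up for illustration, and plain `awk` is assumed in place of `gawk`.

```shell
# Bare pattern: AWK prints every record whose second field matches.
pattern_only() {
    awk -F, '$2 ~ /^E+[^ED]*D+$/'
}

# The same test spelled out with an explicit if and print.
explicit_if() {
    awk -F, '{ if ($2 ~ /^E+[^ED]*D+$/) print $0 }'
}

# Three lines from the question plus one invented line with other letters in the middle.
input=$(printf '%s\n' '1,AAEDDEED,1' '2,EEEEDDDD,2' '3,EDEDEDED' '4,EEAADD,4')
printf '%s\n' "$input" | pattern_only
```

Both forms should print `2,EEEEDDDD,2` and `4,EEAADD,4`. The second line shows that `[^ED]*` lets other letters sit between the E run and the D run; if only E and D should be allowed, `^E+D+$` is the stricter choice.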

Roger Lindsjö · Answer 3 · 2015-11-11T21:11:29+0000

If I understand correctly, you want one or more E followed by one or more D.

printf '1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED\n' | gawk -F, '{if($2 ~ /^E+D+$/) print $0}'
