Start and end indices of overlapping matches?

Question

I need to know the start and end match indices from the following regular expression:

pat = re.compile(r"(?=(ATG(?:(?!TAA|TGA|TAG)\w\w\w)*))")

Example string s='GATGDTATGDTAAAA'

pat.findall(s) returns the desired matches ['ATGDTATGD', 'ATGDTAAAA']. How do I extract the start and end indices? I tried:

    iters = pat.finditer(s)
    for it in iters:
        print(it.start())
        print(it.end())

However, it.end() always equals it.start(), because my pattern starts with (?= and so does not consume any characters (I need that in order to capture overlapping matches). Evidently pat.findall retrieves the desired strings, but how do I get the start and end indices?

+6

python regex

ashim Nov 15 '13 at 4:37

2 answers

There are no overlapping matches in regular expressions.

Either you match something or you don't. Every character you match can be part of only one match or sub-group.

Lookaheads are ephemeral; they do not consume characters or advance the actual match position.
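A minimal sketch of this point, reusing the pattern and string from the question but without the lookahead wrapper (the variable names are only illustrative). Because the first match consumes the characters at indices 1 through 9, the second ATG at index 6 is never tried as a starting point:

```python
import re

# The question's pattern without the lookahead wrapper, so it consumes text.
pat = re.compile(r"ATG(?:(?!TAA|TGA|TAG)\w\w\w)*")
s = 'GATGDTATGDTAAAA'

# finditer resumes after the end of each match, so the ATG at index 6,
# which lies inside the first match, is never reported.
spans = [(m.start(), m.end(), m.group()) for m in pat.finditer(s)]
print(spans)  # [(1, 10, 'ATGDTATGD')]
```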

+4

Tomalak Nov 15 '13 at 5:03

Tim peters · Accepted Answer · 2013-11-15T05:16:07+0000

As @Tomalak said, the regexp engine has no built-in notion of overlapping matches, so there is no "smart" solution to be found (which turned out to be wrong; see below). But this is easy to do with a loop:

    import re

    pat = re.compile(r"ATG(?:(?!TAA|TGA|TAG)\w\w\w)*")
    s = 'GATGDTATGDTAAAA'
    i = 0
    while True:
        # Search again, starting at position i.
        m = pat.search(s, i)
        if m:
            start, end = m.span()
            print("match at {}:{} {!r}".format(start, end, m.group()))
            # Resume one character past the start of this match.
            i = start + 1
        else:
            break

which displays

    match at 1:10 'ATGDTATGD'
    match at 6:15 'ATGDTAAAA'

It works by restarting the search one character past the start of the previous match, until no more matches are found.
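The same loop can be packaged as a generator, sketched here under the assumption that you want match objects back. The name iter_overlapping is hypothetical, and the length guard is an addition for patterns that can match an empty string:

```python
import re

def iter_overlapping(pat, text):
    """Yield match objects for pat in text, allowing overlaps.

    Hypothetical helper: the search loop from this answer as a generator.
    """
    i = 0
    # The guard stops the loop for patterns that can match an empty string
    # at the very end of the text.
    while i <= len(text):
        m = pat.search(text, i)
        if m is None:
            return
        yield m
        # Restart one character past the start of this match, not its end.
        i = m.start() + 1

pat = re.compile(r"ATG(?:(?!TAA|TGA|TAG)\w\w\w)*")
spans = [m.span() for m in iter_overlapping(pat, 'GATGDTATGDTAAAA')]
print(spans)  # [(1, 10), (6, 15)]
```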

Smart or a time bomb?

If you want to live dangerously, you can make a 2-character change to your original finditer code:

    print(it.start(1))
    print(it.end(1))

That is, get the start and end of the first (1) capturing group. Without an argument you get the start and end of the match as a whole, but of course a lookahead assertion always matches an empty string (and therefore the start and end are equal).

I say this is dangerous because the semantics of a capturing group inside an assertion (whether lookahead or lookbehind, positive or negative, ...) are fuzzy at best. It's hard to say whether you are stumbling on a bug (or an implementation accident) here! Cute :-)
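To make the difference concrete, here is a small sketch that runs the question's lookahead pattern and compares the whole-match span with the span of group 1 (the list names whole and group1 are illustrative):

```python
import re

pat = re.compile(r"(?=(ATG(?:(?!TAA|TGA|TAG)\w\w\w)*))")
s = 'GATGDTATGDTAAAA'

for m in pat.finditer(s):
    # span() is the empty lookahead match; span(1) is what group 1 captured.
    print(m.span(), m.span(1), m.group(1))

whole = [m.span() for m in pat.finditer(s)]
group1 = [m.span(1) for m in pat.finditer(s)]
```

This prints (1, 1) (1, 10) ATGDTATGD followed by (6, 6) (6, 15) ATGDTAAAA.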

EDIT: After a night's sleep and a brief discussion on Python-Dev, I believe this behavior is intentional (and therefore reliable). To find all (possibly overlapping!) matches for a regexp R, wrap it like this:

 pat = re.compile("(?=(" + R + "))")

and then

    for m in pat.finditer(some_string):
        m.group(1)  # the matched substring
        m.span(1)   # the slice indices of the matched substring
        # etc

works great.

It's best to read (?=(R)) as "match an empty string here, but only if R matches starting here, and if that succeeds, put the information about what R matched into group 1". Then finditer() proceeds as it always does after matching an empty string: it moves the start of the search to the next character and tries again (the same as the loop in the first part of this answer).

Using this trick with findall() is harder, because if R contains capturing groups, you get all of them (you cannot pick and choose as you can with a match object, such as finditer() returns).
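A small sketch of that reading, using an illustrative regexp R = "aa" that is not from the original post. The lookahead succeeds at every position where R starts, and finditer steps forward one character each time:

```python
import re

R = "aa"  # illustrative regexp, chosen so that matches can overlap
pat = re.compile("(?=(" + R + "))")

# At each position the lookahead matches an empty string if R starts there;
# finditer then advances one character and tries again.
spans = [m.span(1) for m in pat.finditer("aaaa")]
print(spans)  # [(0, 2), (1, 3), (2, 4)]

# Without the wrapper, each match consumes its text, so matches cannot overlap.
plain = [m.span() for m in re.finditer(R, "aaaa")]
print(plain)  # [(0, 2), (2, 4)]
```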
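As an illustration of the findall() complication, assume a regexp R with two capturing groups of its own (the pattern and text here are made up for the example). findall() returns every group as a tuple, while finditer() lets you take only group 1:

```python
import re

R = r"(AT)(G)"  # illustrative regexp with two capturing groups of its own
pat = re.compile("(?=(" + R + "))")
text = "ATGATG"

# findall returns a tuple of all groups: the wrapper's group 1 plus R's groups.
all_groups = pat.findall(text)
print(all_groups)  # [('ATG', 'AT', 'G'), ('ATG', 'AT', 'G')]

# With finditer you pick just the group you want from each match object.
whole_matches = [m.group(1) for m in pat.finditer(text)]
print(whole_matches)  # ['ATG', 'ATG']
```

Note that wrapping shifts the numbering of R's own groups up by one, since the wrapper's group becomes group 1.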
