Python Regular Expression with Lookbehind and Alternatives

Question

I want to have a regular expression that finds texts that are “wrapped” between “HEAD or HEADa” and “HEAD”. That is, I can have text starting with the first word as HEAD or HEADa, and the following "heads" are of type HEAD.

HEAD\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....
HEADa\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....

I only want to capture the text between the "heads", so I wrote a regular expression with a lookbehind and a lookahead that look for my "heads". I have the following regex:

 var = "HEADa", "HEAD"
 my_pat = re.compile(r"(?<=^\b"+var[0]+r"|"+var[1]+r"\b) \w*\s\s(.*?)(?=\b"+var[1] +r"\b)",re.DOTALL|re.MULTILINE)

However, when I try to run this regular expression, I get an error saying that a lookbehind cannot contain a variable-length pattern. What is wrong with this regex?

python regex

user963386 Nov 19 '11 at 13:49

1 answer

Alan Moore · Accepted Answer · 2011-11-19T14:56:38+0000

Currently, the first part of your regex looks like this:

 (?<=^\bHEADa|HEAD\b)

You have two alternatives; one matches five characters and the other matches four, and that is why you get the error. Some regex flavors will let you do this even though they say they do not allow variable-length lookbehinds, but Python is not one of them. You could split it into two lookbehinds, for example:

 (?:(?<=^HEADa\b)|(?<=\bHEAD\b))

... but you probably don't need it. Try instead:

 (?:^HEADa|\bHEAD)\b

Whatever is matched by the (.*?) later in the pattern will still be available through group #1. If you really need all of the text between the delimiters, you can capture that in group #1, and the other group will become #2 (or you can use named groups and not have to keep track of the numbers).
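As a rough sketch of the direct-match approach, the snippet below puts the head alternatives in front of a named group instead of inside a lookbehind. The sample text, the group name body, and the \s* that skips whitespace after the head are assumptions made for illustration; they are not taken from the question.

```python
import re

# Hypothetical sample text; the asker's real input is not shown.
text = "HEADa\n\nfirst block HEAD\n\nsecond block HEAD\n\nthird block"

# Match the opening head directly, then capture the body in a named group.
# The lookahead keeps the closing HEAD available as the next opening head.
pattern = re.compile(
    r"(?:^HEADa|\bHEAD)\b\s*(?P<body>.*?)(?=\bHEAD\b)",
    re.DOTALL | re.MULTILINE,
)

bodies = [m.group("body").strip() for m in pattern.finditer(text)]
print(bodies)  # ['first block', 'second block']
```

Note that the text after the last HEAD is not returned here, because the lookahead requires another HEAD to follow.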

Generally speaking, lookbehind should never be your first resort. It may seem like the obvious tool for the job, but you are usually better off doing a straight match and extracting the part you want with a capturing group. That is true of all flavors, not just Python; just because you can do more with lookbehinds in other flavors does not mean you should.
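To make the fixed-width restriction concrete, here is a small sketch using Python's built-in re module. It compiles the combined lookbehind from the question and the split version discussed earlier; the variable names and the sample string are illustrative only.

```python
import re

combined = r"(?<=^\bHEADa|HEAD\b)"          # alternatives of width 5 and 4
split = r"(?:(?<=^HEADa\b)|(?<=\bHEAD\b))"  # each lookbehind has a single width

try:
    re.compile(combined)
    combined_error = None
except re.error as exc:
    # Python's re module rejects lookbehinds whose width can vary.
    combined_error = str(exc)

print(combined_error)

# The split version compiles and matches the position just after the head.
split_match = re.search(split, "HEADa rest")
print(split_match.start())  # 5
```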

By the way, you may have noticed that I rearranged your word boundaries; I think this is what you really intended.
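The effect of moving the word boundaries can be checked with a short sketch. Because alternation binds more loosely than everything else, in the original form ^\b applies only to HEADa and the trailing \b only to HEAD. The sample strings below are made up for illustration.

```python
import re

original = re.compile(r"^\bHEADa|HEAD\b")        # boundaries as in the question
rearranged = re.compile(r"(?:^HEADa|\bHEAD)\b")  # boundaries as in this answer

samples = ("xHEAD y", "HEADabc", "HEADa text", "a HEAD b")

# Map each sample to (original matches?, rearranged matches?).
results = {
    s: (bool(original.search(s)), bool(rearranged.search(s))) for s in samples
}
for sample, outcome in results.items():
    print(repr(sample), outcome)
```

The first two samples match only the original pattern, since it accepts HEAD glued to a preceding letter and HEADa followed by more letters; the rearranged pattern requires a whole word in both cases.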
