Regex - Get a string between two words that do not contain a word

Question

I have looked around and could not get this to work. I am not a complete beginner.

I need to get the text delimited by (and including) START and END that does not itself contain START. Basically, I can't find a way to negate a whole word without using advanced features.

Example line:

abcSTARTabcSTARTabcENDabc

Expected Result:

STARTabcEND

Not good:

STARTabcSTARTabcEND

I cannot use reverse search. I am testing my regex here: www.regextester.com

Thanks for any advice.

+6

regex search word jmeter

rrr Sep 7 '11 at 11:33

5 answers

Try this:

 START(?!.*START).*?END

See it online at Regexr.

(?!.*START) is a negative lookahead. It ensures that the word "START" does not occur anywhere after this point.

.*? is a non-greedy match of all characters up to the next "END". It is needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).
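As a rough check of the explanation above, here is a minimal Python sketch that runs the pattern against the example line from the question. The variable names are illustrative only. The second half uses a longer sample string to show a consequence of the lookahead scanning the rest of the line: only the last START in the input can begin a match.

```python
import re

# Pattern from the answer: START not followed (anywhere later) by another START,
# then the shortest run of characters up to END.
pattern = re.compile(r"START(?!.*START).*?END")

line = "abcSTARTabcSTARTabcENDabc"
match = pattern.search(line)
result = match.group(0) if match else None
print(result)  # STARTabcEND

# With several START...END pairs, only the pair after the last START matches,
# because the lookahead rejects every earlier START.
multi = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
all_matches = pattern.findall(multi)
print(all_matches)  # ['STARTuvwEND']
```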

Update:

Having thought about it a little more: the solution above matches only up to the first "END". If that is not what you want (since you only exclude START from the content), use the greedy version:

 START(?!.*START).*END

This will match up to the last "END".
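The difference between the two versions only shows up when more than one END follows the last START. A small Python sketch, using a made-up input string chosen for that purpose:

```python
import re

# Hypothetical input with two END markers after the only START.
text = "abcSTARTabcENDdefENDabc"

# Non-greedy: stops at the first END.
lazy = re.search(r"START(?!.*START).*?END", text).group(0)

# Greedy: runs on to the last END.
greedy = re.search(r"START(?!.*START).*END", text).group(0)

print(lazy)    # STARTabcEND
print(greedy)  # STARTabcENDdefEND
```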

+10

stema Sep 7 '11 at 11:39

 START(?:(?!START).)*END

will work with any number of START...END pairs. To demonstrate in Python:

 >>> import re
 >>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
 >>> re.findall(r"START(?:(?!START).)*END", a)
 ['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']

If you only care about the content between START and END, use this:

 (?<=START)(?:(?!START).)*(?=END)

See here:

 >>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
 ['def', 'jlk', 'uvw']

+4

Tim Pietzcker Oct 05 '11 at 13:27

May I suggest a possible improvement to Tim Pietzcker's solution? It seems to me that START(?:(?!START).)*?END is better for catching only a START immediately followed by an END, without any START or END in between. I am using .NET, and Tim's solution would also match something like START END END. At least in my case this is not wanted.
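The comment above is about .NET, but the same behavior can be reproduced in Python as a quick sketch. The input is the START END END string mentioned in the comment; the variable names are illustrative.

```python
import re

text = "START END END"

# Greedy repetition: the match extends to the last END.
greedy = re.findall(r"START(?:(?!START).)*END", text)

# Lazy repetition: the match stops at the first END.
lazy = re.findall(r"START(?:(?!START).)*?END", text)

print(greedy)  # ['START END END']
print(lazy)    # ['START END']
```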

+2

Johannes Wentu Jun 04 '14 at 8:05

[EDIT: I have left this post up for the information on capture groups, but the main solution I gave was incorrect. As pointed out in the comments, (?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END) will not work; I forgot that the ignored characters cannot simply be dropped, so you would need something like ...|STA(?![^R])| to still allow that character to be part of END. As written, it fails on something like STARTSTEND, so the lookahead is clearly the better choice. The following should show the correct way to use capture groups...]

Here is the answer that uses the zero-width negative lookahead operator "?!", with capture groups: (?:START)((?!.*START).*)(?:END), which captures the inner text as $1 for use in a replacement. If you want the START and END markers captured as well, you could use (START)((?!.*START).*)(END), which gives $1 = START, $2 = text, and $3 = END, or various other permutations by adding or removing () or ?:.

That way, if you are using it for search and replace, you can use something like BEGIN$1FINISH as the replacement. So, if you started with:

abcSTARTdefSTARTghiENDjkl

you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:

abcSTARTdefBEGINghiFINISHjkl

which lets you change the START/END tokens only where they are properly paired.

Each (x) is a group, but I wrote (?:x) for every group except the middle one, which marks those groups as non-capturing. The only group I left without ?: is the middle one; however, you could also capture the START/END markers if you wanted to move them around or do something else with them.
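The replacement described above can be sketched in Python. Note one assumption about syntax: the answer writes the group reference as $1, whereas Python's re.sub uses \g<1> (or \1) in the replacement string. The variable names are illustrative.

```python
import re

text = "abcSTARTdefSTARTghiENDjkl"

# Non-capturing groups around the markers, one capturing group for the inner text.
pattern = r"(?:START)((?!.*START).*)(?:END)"

inner = re.search(pattern, text).group(1)
replaced = re.sub(pattern, r"BEGIN\g<1>FINISH", text)

print(inner)     # ghi
print(replaced)  # abcSTARTdefBEGINghiFINISHjkl

# Capturing the markers too gives three groups.
groups = re.search(r"(START)((?!.*START).*)(END)", text).groups()
print(groups)    # ('START', 'ghi', 'END')
```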

For more information on Java regular expressions, see the Java regex documentation.

0

shelleybutterfly Sep 7 '11 at 12:11

tripleee · Accepted Answer · 2011-09-07T11:50:16+0000

The truly pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions that do this more elegantly, but I interpret your comment about "reverse search" to perhaps mean that you cannot or do not want to use that feature.

Update: just for completeness, note that the above is greedy with respect to the end delimiter. To capture only the shortest possible string, extend the negation to also cover the end delimiter: START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks exceeding the torture threshold in most cultures, though.

Bug fix: a previous version of this answer had a bug where SSTART could be part of the match (the second S would match [^T], etc.). I fixed this by adding S to [^ST] and adding S* before the non-optional S, to still allow arbitrary repetitions of S.
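Both lookaround-free patterns use only basic syntax, so they can be tried in Python as well. The sketch below checks them against the question's sample line and against one made-up input with two END markers. It is only a spot check on these two strings, not an exhaustive test of the patterns.

```python
import re

# First pattern from the answer: greedy with respect to the end delimiter.
greedy_pat = r"START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END"

# Second pattern: also refuses to step over an END, so it yields the shortest match.
shortest_pat = (r"START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*"
                r"(S(T(AR?)?)?|EN?)?)END")

sample = "abcSTARTabcSTARTabcENDabc"    # example line from the question
two_ends = "abcSTARTabcENDdefENDabc"    # hypothetical input with two END markers

sample_greedy = re.search(greedy_pat, sample).group(0)
sample_shortest = re.search(shortest_pat, sample).group(0)
two_greedy = re.search(greedy_pat, two_ends).group(0)
two_shortest = re.search(shortest_pat, two_ends).group(0)

print(sample_greedy)    # STARTabcEND
print(sample_shortest)  # STARTabcEND
print(two_greedy)       # STARTabcENDdefEND
print(two_shortest)     # STARTabcEND
```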
