Regex - Get a string between two words that does not contain a given word

I looked around and could not find a way to do this. I'm not exactly new to regex.

I need to match text delimited by (and including) START and END that does not contain another START in between. Basically, I can't find a way to negate a whole word without using advanced features.

Example line:

abcSTARTabcSTARTabcENDabc

Expected Result:

STARTabcEND

Not good:

STARTabcSTARTabcEND

I cannot use lookbehind. I am testing my regex here: www.regextester.com

Thanks for any advice.

+6
5 answers

The truly pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END . Modern regex flavors have negative assertions (lookarounds) that make this more elegant, but I interpret your comment about "lookbehind" as possibly meaning that you cannot or do not want to use that feature.

Update: just for completeness, note that the above is greedy with respect to the end delimiter. To capture only the shortest possible string, extend the negation to also cover the end delimiter: START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END . This risks exceeding the torture threshold in most cultures, though.

Bug fix: a previous version of this answer had a bug where SSTART could be part of the match (the second S would match [^T] , etc.). I fixed this by adding S to [^ST] and adding S* before the mandatory S , so that arbitrary repetitions of S are still allowed.
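
To try the two lookaround-free patterns above on concrete input, here is a minimal Python sketch. The helper name first_match and the second test string are made up for illustration; only the sample line comes from the question. It checks just these two inputs and is not evidence that the patterns are correct in general.

```python
import re

# Both patterns are copied verbatim from the answer above.
GREEDY_END = r"START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END"
SHORTEST = (
    r"START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*"
    r"(S(T(AR?)?)?|EN?)?)END"
)


def first_match(pattern, text):
    """Return the first full match of pattern in text, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


sample = "abcSTARTabcSTARTabcENDabc"   # example line from the question
two_ends = "STARTabcENDxyzEND"         # made-up input with two END tokens

print(first_match(GREEDY_END, sample))    # STARTabcEND
print(first_match(SHORTEST, sample))      # STARTabcEND
print(first_match(GREEDY_END, two_ends))  # STARTabcENDxyzEND (runs to the last END)
print(first_match(SHORTEST, two_ends))    # STARTabcEND (stops at the first END)
```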

+4

Try this:

 START(?!.*START).*?END 

See it online at Regexr.

(?!.*START) is a negative lookahead. It ensures that the word "START" does not occur anywhere later in the line.

.*? is a non-greedy match of all characters up to the next "END". It is needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-width assertion).

Update:

I thought about it a little more: the solution above matches only up to the first "END". If that is not what you want (since you only exclude START from the content), use the greedy version

 START(?!.*START).*END 

This will match up to the last "END".
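
To see the difference between the lazy and the greedy version, here is a small Python sketch. The input string is made up for illustration (the sample from the question with a second END appended), and Python's re module is only one possible flavor, so treat the result as an example rather than a guarantee for other engines.

```python
import re

LAZY = r"START(?!.*START).*?END"    # stops at the first END
GREEDY = r"START(?!.*START).*END"   # runs to the last END

# Made-up input: two START tokens followed by two END tokens.
text = "abcSTARTabcSTARTabcENDabcENDabc"

lazy_match = re.search(LAZY, text).group(0)
greedy_match = re.search(GREEDY, text).group(0)

print(lazy_match)    # STARTabcEND
print(greedy_match)  # STARTabcENDabcEND
```

In both cases the match begins at the second START, because the lookahead rejects any START that is followed by another START later in the line.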

+10

 START(?:(?!START).)*END 

will work with any number of START...END pairs. To demonstrate in Python:

 >>> import re
 >>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
 >>> re.findall(r"START(?:(?!START).)*END", a)
 ['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']

If you only care about the content between START and END , use this:

 (?<=START)(?:(?!START).)*(?=END) 

Look here:

 >>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
 ['def', 'jlk', 'uvw']

+4

May I suggest a possible improvement to Tim Pietzcker's solution? It seems to me that START(?:(?!START).)*?END is better at catching a START followed by an END without any START or END in between. I am using .NET, and Tim's solution will also match something like START END END . At least in my case, that is not wanted.
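
The START END END case can be reproduced with a short sketch. This uses Python's re module rather than .NET, on the assumption that both engines treat greedy and lazy quantifiers the same way for these patterns.

```python
import re

GREEDY = r"START(?:(?!START).)*END"   # pattern from the earlier answer
LAZY = r"START(?:(?!START).)*?END"    # lazy variant suggested here

text = "START END END"

greedy_result = re.findall(GREEDY, text)
lazy_result = re.findall(LAZY, text)

print(greedy_result)  # ['START END END']
print(lazy_result)    # ['START END']
```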

+2

[EDIT: I have left this post up for its information on capture groups, but the main solution I gave was incorrect. (?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END) , as pointed out in the comments, will not work. I forgot that the character consumed by a negated class cannot be given back, so you would need something like ...|STA(?![^R])| to still allow that character to be part of END; as written, the pattern fails on something like STARTSTEND. So the lookahead is clearly the better choice. The following shows the correct way to use capture groups ...]

The answer given above uses the zero-width negative lookahead operator "?!". With capture groups it becomes (?:START)((?!.*START).*)(?:END) , which captures the inner text as $1 for use in a replacement. If you want START and END to be captured as well, you could write (START)((?!.*START).*)(END) , which gives $1 = START, $2 = text and $3 = END, or various other permutations obtained by adding or removing () or ?: .

So, if you use it for search and replace, you can use a replacement like BEGIN$1FINISH. If you started with:

abcSTARTdefSTARTghiENDjkl

you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:

abcSTARTdefBEGINghiFINISHjkl

which lets you change START / END tokens only where they are correctly paired.
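
The same replacement can be sketched in Python. Note that Python's re module writes group references in the replacement string as \g<1> (or \1) instead of $1; the variable names below are just for illustration.

```python
import re

# Non-capturing groups around the delimiters, one capturing group in the middle.
INNER_ONLY = r"(?:START)((?!.*START).*)(?:END)"
# Variant that captures the delimiters as well.
ALL_PARTS = r"(START)((?!.*START).*)(END)"

text = "abcSTARTdefSTARTghiENDjkl"

# \g<1> is Python's equivalent of $1 in the replacement string.
replaced = re.sub(INNER_ONLY, r"BEGIN\g<1>FINISH", text)
print(replaced)  # abcSTARTdefBEGINghiFINISHjkl

parts = re.search(ALL_PARTS, text).groups()
print(parts)  # ('START', 'ghi', 'END')
```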

Each (x) is a group, but I wrote (?:x) for all of them except the middle one, which marks them as non-capturing groups; the only group I left without ?: is the one in the middle. However, you could also capture the START / END markers if you want to move them around or do something else with them.

For more information on Java regular expressions, see the Java regex documentation.

0

Source: https://habr.com/ru/post/896781/

