Lazy quantifier and lookahead

I am working on a regex to validate URLs in C#. The regex should match only the first http:// and must not match when another http:// appears inside the URL. This was my first attempt:

 (https?:\/\/.+?)\/(.+?)(?!https?:\/\/) 

But this regex doesn't work (even after removing (?!https?:\/\/) ). Take this input line, for example:

 http://test.test/notwork.http://test 

Here is my first doubt: why doesn't the capture group (.+?) match notwork.http://test ? A lazy quantifier should match as few times as possible, but why doesn't it go on until the end of the input? Clearly I am missing something (at first I thought it might be related to backtracking, but I don't think it is). So I read this and found a solution, even though I'm not sure it is the best one, since the article says that

This technique gives no advantage over the lazy dot-star

In any case, that solution is the tempered dot. This is my next attempt:

 (https?:\/\/.+?)\/((?:(?!https?:\/\/).)*) 

Now this regex works, but not the way I would like: I need a match only when the URL is valid.

By the way, I don't think I fully understand what the new regex is doing: why does the negative lookahead sit before the . and not after it? I tried moving it after the . , and the regex then seems to match the URL only up to the second-to-last character before the second http. Going back to the correct regex, my hypothesis is that the negative lookahead checks what comes after the text the regex has already consumed, before the . reads the next character. Is this correct?

Other solutions are welcome, but first I would like to understand this one. Thanks.

1 answer

The solution you are looking for is

 (?>https?://\S+?/(?:(?!https?://).)*)(?!https?://) 

See the regex demo

More details

  • (?>https?://\S+?/(?:(?!https?://).)*) - an atomic group (the engine cannot backtrack into its subpatterns once it has matched) that matches
    • https?:// - http:// or https://
    • \S+? - one or more non-whitespace characters, as few as possible, up to the first ...
    • / - a / character, followed by ...
    • (?:(?!https?://).)* - zero or more characters (as many as possible), none of which starts an http:// or https:// sequence.
  • (?!https?://) - a negative lookahead that fails the match if http:// or https:// is immediately to the right of the current location.
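
To see why the atomic group matters, here is a minimal sketch in Python rather than C#, chosen purely for illustration. Python's re module only gained (?>...) in version 3.11, so the sketch emulates the atomic group with the lookahead-plus-backreference idiom (?=(...))\1. That substitution and the names TEMPERED, ATOMIC, PLAIN, atomic_match and plain_match are assumptions of this example, not part of the pattern above.

```python
import re

# Host, first slash, then the tempered path (no char may start http:// or https://).
TEMPERED = r"https?://\S+?/(?:(?!https?://).)*"

# Emulated atomic group: a lookahead is never re-entered once it has matched,
# so capturing inside it and replaying the capture with \1 acts like (?>...).
ATOMIC = re.compile(r"(?=(" + TEMPERED + r"))\1(?!https?://)")

# The same pattern without atomicity, for comparison.
PLAIN = re.compile(TEMPERED + r"(?!https?://)")


def atomic_match(text):
    """Return the matched text, or None when the URL embeds a second scheme."""
    m = ATOMIC.search(text)
    return m.group(0) if m else None


def plain_match(text):
    """Same check without the atomic group; backtracking can rescue a match."""
    m = PLAIN.search(text)
    return m.group(0) if m else None


bad = "http://test.test/notwork.http://test"
good = "http://test.test/works"

print(atomic_match(bad))   # None
print(plain_match(bad))    # http://test.test/notwork
print(atomic_match(good))  # http://test.test/works
```

Without atomicity the engine gives back the final . of notwork. and the trailing lookahead then succeeds one character earlier, so the invalid input still yields a match. The atomic version rejects it outright.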

(https?:\/\/.+?)\/(.+?)(?!https?:\/\/) does not work because the .+? pattern matches lazily, i.e. it grabs the first char it finds and then lets the subsequent subpattern try to match. The subsequent subpattern is a negative lookahead, which fails the match only if http:// or https:// is immediately to the right of the current location. Since there is no such substring right after the n in http://test.test/notwork.http://test , the match is successful and a result ending with n is returned. If you do not tell the regex engine to match up to the end of the string, or up to some other delimiter / pattern, it won't.
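
The behaviour described above can be reproduced with a short sketch. It uses Python's re module for illustration only (for these constructs it should behave the same way as the .NET engine, though that is an assumption here). The variable names and the $-anchored variant are hypothetical additions to show what "telling the engine to match up to the end" looks like.

```python
import re

URL = "http://test.test/notwork.http://test"

# The first attempt from the question, unchanged.
first_attempt = re.compile(r"(https?:\/\/.+?)\/(.+?)(?!https?:\/\/)")
m = first_attempt.search(URL)
host, path = m.group(1), m.group(2)
print(host)  # http://test.test
print(path)  # n

# Anchoring with $ forces the lazy group to keep expanding until the end.
anchored = re.compile(r"(https?:\/\/.+?)\/(.+?)$")
full_path = anchored.search(URL).group(2)
print(full_path)  # notwork.http://test
```

The lazy group stops at n because nothing after it in the pattern forces it to consume more; the lookahead is already satisfied there.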

The tempered greedy token has been discussed a lot. The exact question of where to place the lookahead is addressed in this answer .
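
As a rough illustration of the placement question, the sketch below compares the lookahead before the dot with the lookahead after the dot on the input from the question. It is written in Python's re for convenience (an assumption; the thread itself is about C#), and the names before_dot and after_dot are made up for this example.

```python
import re

URL = "http://test.test/notwork.http://test"

# Lookahead BEFORE the dot: check the upcoming text, then consume one char.
before_dot = re.compile(r"(https?://.+?)/((?:(?!https?://).)*)")

# Lookahead AFTER the dot: consume one char, then check what follows it.
after_dot = re.compile(r"(https?://.+?)/((?:.(?!https?://))*)")

path_before = before_dot.search(URL).group(2)
path_after = after_dot.search(URL).group(2)

print(path_before)  # notwork.
print(path_after)   # notwork
```

With the lookahead first, each character is consumed only if it does not itself begin http:// , so matching stops right at the second scheme. With the lookahead second, a character is consumed only if the text after it does not begin http:// , so the last character before the second scheme is dropped as well.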


Source: https://habr.com/ru/post/1270784/

