PHP regex

I have this pattern (I'm using PHP):

'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)\]/i'

When I match it against this string: [link=http://phpquest.zapto.org/users/register.php]

Captured groups (indices 0-5):

  • '[link=http://phpquest.zapto.org/users/register.php]'
  • 'http://phpquest.zapto.org/users/register.php'
  • 'http://'
  • 'phpquest.zapto'
  • 'org'
  • ''

When I replace * with + inside the last subpattern as follows:

 '/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]+)*\/?)\]/i'

Captured groups (indices 0-5):

  • '[link=http://phpquest.zapto.org/users/register.php]'
  • 'http://phpquest.zapto.org/users/register.php'
  • 'http://'
  • 'phpquest.zapto'
  • 'org'
  • '/users/register.php'

If anyone can help me understand why the last group is empty in the first case but not in the second, I would be very grateful. Thank you all, and have a nice day.

2 answers

A simpler example may make the difference clearer. Compare these two regular expressions:

 (a*)* 

and

 (a+)* 

The test string is aaaaaa.

Here is what happens: after the group captures the main run (in my example, the series of a's), the engine tries to match the group again but finds no more a's. But wait! The group can also match nothing, because * means zero or more times.

So after matching all the a's, the group matches once more and captures "nothing". Since a repeated group keeps only its last captured iteration, you get "" as the value of the capture group.

In (a+)*, after matching and capturing aaaaaa, the group cannot match or capture again (unlike *, + does not allow it to match nothing), so aaaaaa remains the last capture.
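A quick way to see this in PHP is the sketch below. It assumes PHP's PCRE engine behaves as described above; the variable names are only illustrative.

```php
<?php
// Same subject, two patterns: only the inner quantifier differs.
$subject = 'aaaaaa';

preg_match('/(a*)*/', $subject, $star);
preg_match('/(a+)*/', $subject, $plus);

// Index 0 is the whole match; index 1 is the LAST iteration of the group.
var_dump($star[0], $star[1]);
var_dump($plus[0], $plus[1]);
```

Both patterns match the whole string at index 0. Only the value at index 1 differs, because the (a*) group gets one extra, empty iteration.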


This can be simplified with the following pattern.

 /\[link=(https?:\/\/)(([a-z0-9]+\.?)+)((\/[^\/]+)+)\/?\]/i 
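As a rough check of what this pattern captures, here is a small PHP sketch run against the string from the question. The labels scheme, host and path are my own names for the groups, not part of the original answer.

```php
<?php
$pattern = '/\[link=(https?:\/\/)(([a-z0-9]+\.?)+)((\/[^\/]+)+)\/?\]/i';
$subject = '[link=http://phpquest.zapto.org/users/register.php]';

if (preg_match($pattern, $subject, $m)) {
    // Groups 2 and 4 wrap the repeated groups, so they hold the full text.
    echo "scheme: {$m[1]}\n";
    echo "host:   {$m[2]}\n";
    echo "path:   {$m[4]}\n";
}
```

Groups 3 and 5 sit inside the repetition, so they keep only their last iteration. That is why the outer groups 2 and 4 are the ones to read.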

Both * and + are greedy. The difference is that * can match nothing, while + needs at least one character. So with + in the second attempt, all components of the path are matched and captured by the group. In the first attempt, the group with the inner * is itself repeated by the outer *, so after the path is consumed it matches one more time against nothing, and that empty match is what ends up in the capture.
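To compare the variants side by side on the original pattern, a minimal PHP sketch follows. The first two patterns are the ones from the question. The third, which drops the outer * so the group is not repeated at all, is a suggested variant and an assumption of mine, not something taken from the answers.

```php
<?php
$subject = '[link=http://phpquest.zapto.org/users/register.php]';

// Prefix shared by all three patterns (scheme, host, top-level domain).
$head = '/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})';

$patterns = [
    'star inside repeated group' => $head . '([\/\w \.-]*)*\/?)\]/i',
    'plus inside repeated group' => $head . '([\/\w \.-]+)*\/?)\]/i',
    // Assumption: hypothetical variant with no outer quantifier.
    'star, group not repeated'   => $head . '([\/\w \.-]*)\/?)\]/i',
];

$paths = [];
foreach ($patterns as $label => $pattern) {
    preg_match($pattern, $subject, $m);
    $paths[$label] = $m[5]; // group 5 is the path group in every variant
    printf("%-28s => '%s'\n", $label, $m[5]);
}
```

Since the character class already includes the slash, a single greedy group can consume the whole path, so repeating the group is not needed here.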


Source: https://habr.com/ru/post/1495572/

