Strange capture group behavior in regular expression

Given the following simple regular expression, the purpose of which is to capture text between quotation marks:

regexp = '"?(.+)"?' 

When the input looks something like this:

 "text" 

The capture group (1) has the following:

 text" 

I was expecting group (1) to contain only text (without quotes). Can someone explain what is happening, and why the regexp captures the " character even though it is outside capture group #1? Another behavior I don't understand is why the second quote character is captured but not the first, given that both of them are optional.

Finally, I fixed this using the following regular expression, but I would like to understand what I was doing wrong:

 regexp = '"?([^"]+)"?' 
5 answers

why the regexp captures the " character even though it is outside capture group #1

The pattern "?(.+)"? contains a greedy dot-matching subpattern. A . can match a " too. The trailing "? is an optional subpattern. This means that if the preceding subpattern is greedy (and .+ is greedy) and can match what the optional subpattern would match (and .+ can match a "), then .+ will consume that optional character itself.

A negated character class is the correct way to match any character except a specific character or range(s) of characters. [^"] will never match a ", so the closing " can never end up inside the group with this pattern.
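
A quick way to see the difference is to run both patterns against the same input. The sketch below uses Python's re module purely for illustration (the question does not say which engine is in use), and the variable names are made up for the example.

```python
import re

text = '"text"'

# Greedy dot: .+ also consumes the closing quote, so the trailing "?
# is left with nothing to match (which is fine, since it is optional).
greedy = re.search(r'"?(.+)"?', text)
print(greedy.group(1))   # text"

# Negated class: [^"]+ has to stop before the closing quote,
# so the trailing "? gets to match it.
negated = re.search(r'"?([^"]+)"?', text)
print(negated.group(1))  # text
```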

why the second quote character is captured, but not the first, given that both of them are optional

The first "? comes before the greedy dot pattern. The engine sees the " (if it is present in the string) and matches that quote with the first "? before .+ ever gets a chance to consume it.
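
To check which characters each part of the pattern consumed, the match offsets can be inspected. This is a minimal Python sketch, assuming the re module behaves like the engine in the question:

```python
import re

m = re.search(r'"?(.+)"?', '"text"')

# The whole match covers all six characters of the input.
print(m.span(0))  # (0, 6)

# Group 1 starts at offset 1: the opening quote at offset 0 was taken
# by the first "? and the closing quote at offset 5 was taken by .+
print(m.span(1))  # (1, 6)
```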

+1
source

Quantifiers in regular expressions are greedy: they try to match as much text as possible. Since your last " is optional (you wrote "? in your regular expression), .+ will match it.

Using [^"] is one acceptable solution. The disadvantage is that the captured text cannot contain any " characters (which may or may not be desirable, depending on the case).

Another is to make the quotes required:

 regexp = '"(.+)"' 
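
As a rough illustration in Python (an assumed engine, chosen only for the example), requiring both quotes forces the greedy .+ to give the closing quote back, at the cost of no longer matching unquoted input:

```python
import re

required = re.compile(r'"(.+)"')

# .+ first grabs text" and then backtracks one character so that the
# mandatory closing quote can match.
quoted = required.search('"text"')
print(quoted.group(1))  # text

# Without quotes there is no match at all.
unquoted = required.search('text')
print(unquoted)         # None
```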

Another is to make + non-greedy (lazy) by using +?. However, you then also need to add the ^ and $ anchors (or similar, depending on the context), otherwise the group will match only the first character (t in the case of "test"):

 regexp = '^"?(.+?)"?$' 

This regular expression allows " characters in the middle of the string, so "t"e"s"t" will result in t"e"s"t being captured by the group.
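
The effect of the anchors can be sketched in Python (an assumption about the engine; the variable names are illustrative only):

```python
import re

# Lazy without anchors: .+? is satisfied after a single character,
# and the optional trailing "? simply matches nothing.
unanchored = re.search(r'"?(.+?)"?', '"test"')
print(unanchored.group(1))  # t

# Lazy with anchors: $ forces .+? to keep expanding until only an
# optional quote remains before the end of the string.
anchored = re.search(r'^"?(.+?)"?$', '"test"')
print(anchored.group(1))    # test

# Inner quotes are kept; only the outermost pair is stripped.
inner = re.search(r'^"?(.+?)"?$', '"t"e"s"t"')
print(inner.group(1))       # t"e"s"t
```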

+3
source

.+ is greedy. It will consume everything, including the ". Your trailing "? does not require a quote, so .+ includes the quote.

The first quote is not captured because it is matched by the first "?

+1
source

Regular expressions are greedy by default: they try to match as much as possible.

Since your capture group contains .+, it will match the closing quote before "? gets a chance to. Then, on leaving the group, the engine is already at the end of your string, which still satisfies the optional "?.

0
source

.+ matches as many characters as it can (including "). When it reaches the end of the input, "? still matches, because it makes the " optional and so can match nothing.

You should use a non-greedy quantifier:

 regexp = '"(.+?)"' 
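
One case where the non-greedy form with required quotes behaves differently from the greedy one is input with several quoted strings. The Python sketch below uses a made-up sample sentence to show this:

```python
import re

sample = 'say "a" and "b"'

# Lazy: each match stops at the nearest closing quote.
lazy = re.findall(r'"(.+?)"', sample)
print(lazy)    # ['a', 'b']

# Greedy: a single match runs from the first quote to the last one.
greedy = re.findall(r'"(.+)"', sample)
print(greedy)  # ['a" and "b']
```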

0
source

Source: https://habr.com/ru/post/1243413/

