Is there a way to shorten this regex?

I want to match strings in the format A0123456, E0123456 or IN:A0123456Q, etc. I originally wrote this regex:

^(IN:)?[AE][0-9]{7}Q?$

but it also matched IN:E0123456 without the Q at the end. So I created this regex:

(^IN:[AE][0-9]{7}Q$)|(^[AE][0-9]{7}$)

Is there a way to shorten this regex so that IN: and Q are required together, that is, either both present or both absent?

Edit: The regex will be used in Ruby.

Edit 2: I changed the regex because the earlier version matched the wrong strings: it would still match IN:A0123456.

Edit 3: Both answers below are valid, but since I use Ruby 2.0 and prefer a regex that I can keep using if I change my application and don't want to depend on the Ruby flavor of subexpression calls, I chose to accept matt's answer.

+6
2 answers

If you are using Ruby 2.0, you can use an if-then-else conditional match (undocumented in the Ruby docs, but it exists):

 /^(IN:)?[AE][0-9]{7}(?(1)Q|)$/ 

The conditional part (?(1)Q|) says: if group number 1 matched, then match Q; otherwise match nothing. Since group number 1 is (IN:), this achieves what you want.
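
As a quick check, here is a minimal sketch that runs the pattern against a few inputs. The constant PAIRED, the helper paired? and the sample strings are made up for illustration, and it assumes Ruby 2.0 or later, where conditionals are available.

```ruby
# Conditional pattern from above; (?(1)Q|) requires Q only when group 1 (IN:) matched.
PAIRED = /^(IN:)?[AE][0-9]{7}(?(1)Q|)$/

# Hypothetical helper: true when the string matches the pattern.
def paired?(str)
  !(PAIRED =~ str).nil?
end

# Hypothetical sample inputs, chosen to exercise each combination of IN: and Q.
samples = ["A0123456", "E0123456", "IN:A0123456Q", "IN:A0123456", "A0123456Q"]

samples.each do |s|
  puts "#{s.ljust(14)} #{paired?(s) ? 'match' : 'no match'}"
end
```

Note that ^ and $ are line anchors in Ruby, so for input that may contain newlines you would probably want \A and \z instead.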

+3

The second regex has a problem:

 ^(IN:[AE][0-9]{7}Q)|([AE][0-9]{7})$ 

| has lower precedence than concatenation, so the regex will be parsed as:

 ^(IN:[AE][0-9]{7}Q)   # Starts with (IN:[AE][0-9]{7}Q)
 |                     # OR
 ([AE][0-9]{7})$       # Ends with ([AE][0-9]{7})

To fix this problem, just wrap the alternation in a non-capturing group:

 ^(?:(IN:[AE][0-9]{7}Q)|([AE][0-9]{7}))$ 

This ensures that the whole input string matches one of the two formats, rather than merely starting or ending with one of them (which is clearly wrong).
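
To see the difference, here is a small sketch comparing the two patterns. The names LOOSE and STRICT and the test inputs are made up for illustration.

```ruby
# Ungrouped alternation: ^ binds only to the first branch, $ only to the second.
LOOSE  = /^(IN:[AE][0-9]{7}Q)|([AE][0-9]{7})$/
# Alternation wrapped in a non-capturing group: both anchors apply to both branches.
STRICT = /^(?:(IN:[AE][0-9]{7}Q)|([AE][0-9]{7}))$/

# Hypothetical inputs: two with junk attached, one missing Q, two valid.
inputs = ["IN:A0123456Qxyz", "xyzA0123456", "IN:A0123456", "IN:A0123456Q", "A0123456"]

inputs.each do |s|
  loose  = !(LOOSE =~ s).nil?
  strict = !(STRICT =~ s).nil?
  puts "#{s.ljust(16)} loose=#{loose} strict=#{strict}"
end
```

The ungrouped version accepts the first three inputs because each one either starts with the first branch or ends with the second; the grouped version rejects them.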


As for shortening the regex, you can replace [0-9] with \d if you want, but it's fine as it is.

I don't think there is any other way to shorten the regex using only the features Ruby supports by default.

Subroutine call

Just for your information: in Perl / PCRE, you can shorten it with a subroutine call:

 ^(?:([AE][0-9]{7})|(IN:(?1)Q))$ 

(?1) refers to the pattern defined by the first capture group, that is, [AE][0-9]{7}. The regular expression is effectively the same; it just looks shorter. A demo with the input IN:E0123463Q shows the whole text being captured by group 2 (and no text captured by group 1).


Ruby has a similar concept called subexpression call, with slightly different syntax. Ruby uses \g<name> or \g<number> to refer to the capture group whose pattern we want to reuse:

 ^(?:([AE][0-9]{7})|(IN:\g<1>Q))$ 

The test case on rubular under Ruby 1.9.7, for the input IN:E0123463Q, returns E0123463 as the match for group 1 and IN:E0123463Q as the match for group 2.

The Ruby implementation (1.9.7) seems to record the captured text for group 1 even when group 1 is not directly involved in the match. In PCRE, subroutine calls do not capture text.
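
Here is a minimal sketch for trying the subexpression call locally rather than on rubular. The constant name REUSED is made up for illustration, and what group 1 holds may depend on your Ruby version, so the script only prints it.

```ruby
# \g<1> re-runs the pattern of group 1 ([AE][0-9]{7}) inside group 2.
REUSED = /^(?:([AE][0-9]{7})|(IN:\g<1>Q))$/

m = REUSED.match("IN:E0123463Q")

# Group 2 wraps the whole IN:...Q branch.
puts "group 2: #{m[2].inspect}"
# Group 1 is only reached through the \g<1> call here; inspect what your Ruby records.
puts "group 1: #{m[1].inspect}"
```

If the contents of group 1 matter to your code, it is probably safer to rely on group 2 or on the whole match.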

Conditional Regular Expression

There is also the conditional regex, which lets you check whether a capture group has matched or not. See matt's answer for more information.

+5

Source: https://habr.com/ru/post/953659/

