How to match "something or nothing" in a bash regex?

I am having trouble matching the number in a string of the form (t|b|bug_|task_|)1234 using a bash regular expression. The following does not work:

 [[ $current_branch =~ ^(t|b|bug_|task_|)([0-9]+) ]] 

But as soon as I change it to something like this:

 [[ $current_branch =~ ^(t|b|bug_|task_)([0-9]+) ]] 

it works, but of course incorrectly, because it no longer covers the case where there is no prefix. I understand that in this case I could do

 [[ $current_branch =~ ^(t|b|bug_|task_)?([0-9]+) ]] 

and achieve the same result, but I would like to know why the first example, the one with the empty alternative, does not work. The same regular expression works fine in Ruby, for example.

(This is on GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin11), OS X Lion)

2 answers

I am fairly sure that the difference between the working and non-working versions of the regular expression comes down to different readings of regex(7). I am going to quote the whole relevant part, because I think it goes to the heart of your problem:


Regular expressions ("RE"s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep; POSIX.2 calls these "extended" REs) and obsolete REs (roughly those of ed(1); POSIX.2 "basic" REs). Obsolete REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. POSIX.2 leaves some aspects of RE syntax and semantics open; "(!)" marks decisions on these aspects that may not be fully portable to other POSIX.2 implementations.

A (modern) RE is one(!) or more nonempty(!) branches, separated by '|'. It matches anything that matches one of the branches.

A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the second, and so on.

A piece is an atom possibly followed by a single(!) '*', '+', '?', or bound. An atom followed by '*' matches a sequence of 0 or more matches of the atom. An atom followed by '+' matches a sequence of 1 or more matches of the atom. An atom followed by '?' matches a sequence of 0 or 1 matches of the atom.

A bound is '{' followed by an unsigned decimal integer, possibly followed by ',' possibly followed by another unsigned decimal integer, always followed by '}'. The integers must lie between 0 and RE_DUP_MAX (255(!)) inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom.

An atom is a regular expression enclosed in "()" (matching a match for the regular expression), an empty set of "()" (matching the null string)(!), a bracket expression (see below), '.' (matching any single character), '^' (matching the null string at the beginning of a line), '$' (matching the null string at the end of a line), a '\' followed by one of the characters "^.[$()|*+?{\" (matching that character taken as an ordinary character), a '\' followed by any other character(!) (matching that character taken as an ordinary character, as if the '\' had not been present(!)), or a single character with no other significance (matching that character). A '{' followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It is illegal to end an RE with '\'.


OK, there is quite a bit to unpack here. First of all, note that the "(!)" symbol marks an issue that is open or non-portable.

The essential issue is in this paragraph:

A (modern) RE is one(!) or more nonempty(!) branches, separated by '|'.

Your problem is that you have an empty branch. As the "(!)" shows, an empty branch is an open or non-portable issue. I think that is why it works on some systems but not on others. (I tested it on Cygwin 4.1.10(4)-release, where it did not work, and then on Linux 3.2.25(1)-release, where it did. The two systems have similar, but not identical, man pages for regex(7).)
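To see which behaviour a given system has, a small probe can help. This is only a sketch: the variable names are made up for illustration, and it relies on the documented behaviour that `[[ ... =~ ... ]]` returns 2 when the regular expression is syntactically incorrect.

```bash
#!/usr/bin/env bash
# Probe how the local regex library treats an empty alternation branch.
# 're', 'status' and 'result' are illustrative names, not from the question.
re='^(t|b|bug_|task_|)([0-9]+)'

# Keep the regex in a variable and leave it unquoted on the right-hand side,
# so bash passes it to the regex library as a pattern, not a literal string.
[[ 1234 =~ $re ]]
status=$?

case $status in
  0) result="accepted, and it matched" ;;
  1) result="accepted, but it did not match" ;;
  2) result="rejected as a syntactically invalid regex" ;;
  *) result="unexpected status $status" ;;
esac

echo "bash $BASH_VERSION: empty branch $result"
```

A status of 1 or 2 here would be consistent with the empty branch being the problem, although the exact status may differ between regex libraries.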

Assuming that branches must be nonempty, a branch can be a piece, which can be an atom.

An atom can be "an empty set of "()" (matching the null string)(!)". <sarcasm> Well, this is really useful. </sarcasm> So POSIX defines a regular expression for the empty string, namely (), but also adds "(!)" to say that this is an open or non-portable issue.

Since you're looking for a branch that matches an empty string, try

 [[ $current_branch =~ ^(t|b|bug_|task_|())([0-9]+) ]] 

which uses the regex () to match an empty string. (This worked for me in my Cygwin 4.1.10(4)-release shell, where your original regex did not.)
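One thing to watch with this variant: the extra () is itself a capture group, so the digits move from BASH_REMATCH[2] to BASH_REMATCH[3]. A minimal sketch, assuming a regex library that accepts the empty () (the variable names are illustrative):

```bash
#!/usr/bin/env bash
# Illustrative only: 're', 'branch' and 'number' are made-up names.
re='^(t|b|bug_|task_|())([0-9]+)'

for branch in task_1234 1234; do
  if [[ $branch =~ $re ]]; then
    # Groups are numbered by opening parenthesis:
    #   1 = the prefix alternation, 2 = the inner (), 3 = the digits.
    number=${BASH_REMATCH[3]}
    echo "$branch -> prefix='${BASH_REMATCH[1]}' number='$number'"
  else
    echo "$branch -> no match (or the regex was rejected on this system)"
  fi
done
```

Code that previously read the number from BASH_REMATCH[2] would need the same adjustment.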

However, although this suggestion will (hopefully) work for you in your current setup, there is no guarantee that it will be portable. Sorry to disappoint.
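For comparison, the `?` form mentioned in the question avoids both the empty branch and the empty (). A sketch of how it might be wrapped up, where the function name ticket_number is hypothetical and not part of the question:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the number from a branch name such as
# "task_1234", "b42" or plain "1234"; return 1 if the name does not fit.
ticket_number() {
  # An optional prefix group followed by the digits; no empty branch needed.
  local re='^(t|b|bug_|task_)?([0-9]+)'
  if [[ $1 =~ $re ]]; then
    # Group 1 is the (possibly absent) prefix, group 2 the digits.
    printf '%s\n' "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

ticket_number task_1234
ticket_number 1234
ticket_number feature || echo "feature: no ticket number"
```

With the prefix absent, group 1 is simply empty and the number is still in BASH_REMATCH[2], so callers do not need to special-case the no-prefix branch.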


[[ $current_branch =~ ^(t|b|bug_|task_|)([0-9]+) ]] works for me in bash 4.1.2, but does not work in bash 3.2.48. It could just be a bug that was fixed between the two versions.


Source: https://habr.com/ru/post/915628/

