Regular expression with all optional components: how to avoid empty matches

I need to process a comma-separated string that contains triplets of values and translate them into runtime types. The input looks like this:

"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..." 

Each substring should be converted as follows:

    "1x2y3z"      should become Vector3 with x = 1, y = 2, z = 3
    "80r160g255b" should become Color with r = 80, g = 160, b = 255
    "48h30m50s"   should become Time with h = 48, m = 30, s = 50

The problem I am facing is that all components are optional (but they keep their order), so the following strings are also valid Vector3, Color and Time values:

    "1x3z" Vector3 x = 1, y = 0, z = 3
    "255b" Color r = 0, g = 0, b = 255
    "1h"   Time h = 1, m = 0, s = 0

What have I tried so far?

All optional components

 ((?:\d+A)?(?:\d+B)?(?:\d+C)?) 

A, B and C are replaced with the correct letters for each case. The expression almost works, but it gives twice the expected results (one match for the string and another match for an empty string immediately after the first match), for example:

    "1h1m1s" two matches [1]: "1h1m1s" [2]: ""
    "11x50z" two matches [1]: "11x50z" [2]: ""
    "11111h" two matches [1]: "11111h" [2]: ""

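A minimal sketch that reproduces this with std::sregex_iterator, assuming the Time letters h, m, s in place of A, B, C (the helper name all_matches is made up for illustration):

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects capture group 1 of every match the all-optional pattern produces.
// Assumption: h, m, s stand in for A, B, C (the Time case).
std::vector<std::string> all_matches(const std::string& input) {
    static const std::regex re(R"re(((?:\d+h)?(?:\d+m)?(?:\d+s)?))re");
    std::vector<std::string> out;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it)
        out.push_back(it->str(1));  // may be an empty string
    return out;
}
```

For "1h1m1s" this should give back two entries: the token itself and a trailing empty string, which is the doubling described above.
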
This is not unexpected: the empty string matches the expression when ALL components are absent. To fix this, I tried the following:

1 to 3 quantifiers

 ((?:\d+[ABC]){1,3}) 

But now the expression matches strings with the wrong order or even repeated components:

    "1s1m1h" one match, should not match at all! (wrong order)
    "11z50z" one match, should not match at all! (repeated components)
    "1r1r1b" one match, should not match at all! (repeated components)

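The same effect can be checked per token with std::regex_match. This is only a sketch: [hms] stands in for [ABC], and matches_one_to_three is a hypothetical helper name:

```cpp
#include <regex>
#include <string>

// True when the whole token matches the "1 to 3 components" pattern.
// Assumption: [hms] stands in for [ABC] (the Time case).
bool matches_one_to_three(const std::string& token) {
    static const std::regex re(R"re((?:\d+[hms]){1,3})re");
    return std::regex_match(token, re);
}
```

The character class only counts components, so "1s1m1h" and "1h1h1h" are accepted just like "1h1m1s"; only the empty string is rejected.
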
As for my last attempt, I tried this version of my first expression:

Match from start ^ to end $

 ^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$ 

It works better than the first version, but it still matches the empty string, and I must first tokenize the input and then pass each token to the expression so that the test string can match the beginning (^) and the end ($).

EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)

After reading about (and trying to understand) the regex lookahead concept, and with the help of Casimir et Hippolyte's answer, I tried the suggested expression:

 \b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b 

Against the following test string:

 "48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h" 

And the results were amazing! It detects complete, valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h", which were not meant to match at all). Unfortunately, it captures an empty match every time a non-matching token is found, so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688". I therefore changed the lookahead condition to get the following expression:

 \b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b 

It gets rid of the empty matches on any string that does not match (?:\d+[ABC]){1,3}, so the empty matches before "adfank" and "12322134445688" are gone, but the ones before "1s1m1h" and "1h1h1h" are still detected.


So the question is: is there a regular expression that matches three triplet values in a given order, where all components are optional but at least one must be present, and that does not match empty strings?

The regex library I'm using is the C++11 one.

2 answers

Yes, you can add a lookahead at the beginning to make sure there is at least one character:

 ^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$ 
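As a rough sketch of how this anchored pattern could be used on one already-tokenized value, here is the Time case with a capture group around each number. The Time struct and the parse_time helper are hypothetical names, not from the question, and missing components are assumed to default to 0 as in the question's examples:

```cpp
#include <regex>
#include <string>

// Hypothetical runtime type standing in for the question's Time.
struct Time {
    int h;
    int m;
    int s;
};

// Parses one token such as "48h30m50s", "30m50s" or "1h".
// Returns false for an empty token and for wrong-order or repeated components.
bool parse_time(const std::string& token, Time& out) {
    // Same shape as the pattern above, plus a capture group per number.
    static const std::regex re(R"re(^(?=.)(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$)re");
    std::smatch match;
    if (!std::regex_match(token, match, re))
        return false;
    // A skipped optional group is reported as unmatched; treat it as 0.
    // Note: std::stoi throws std::out_of_range if the number does not fit in an int.
    out.h = match[1].matched ? std::stoi(match[1].str()) : 0;
    out.m = match[2].matched ? std::stoi(match[2].str()) : 0;
    out.s = match[3].matched ? std::stoi(match[3].str()) : 0;
    return true;
}
```

The lookahead (?=.) is what rejects the empty token; the fixed h, m, s order in the pattern rejects inputs like "1s1m1h".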

If you need to find such a substring in a larger string (so without tokenizing first), you can remove the anchors and use a more explicit subpattern in the lookahead:

 (?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?) 

In this case, to avoid false positives (since you are looking for very short strings that may be part of something else), you can add word boundaries to the pattern:

 \b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b 

Note: in a comma-delimited string, (?=\d+[ABC]) can be replaced by (?=[^,])
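
The question's edit reports that the word-boundary pattern still yields zero-length matches in front of wrong-order tokens such as "1s1m1h". One pragmatic option, which is an assumption on my part and not part of this answer, is to skip zero-length matches while iterating. A sketch for the Time letters, with find_times as a made-up helper name:

```cpp
#include <regex>
#include <string>
#include <vector>

// Scans the whole comma-separated input with the word-boundary pattern and
// keeps only the non-empty hits. Assumption: h, m, s stand in for A, B, C.
std::vector<std::string> find_times(const std::string& input) {
    static const std::regex re(R"re(\b(?=\d+[hms])((?:\d+h)?(?:\d+m)?(?:\d+s)?)\b)re");
    std::vector<std::string> out;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it) {
        if (it->length(0) == 0)
            continue;  // zero-length hit in front of an invalid token
        out.push_back(it->str(1));
    }
    return out;
}
```

Invalid tokens are silently dropped this way, so this only fits if they do not need to be reported.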


I think this can do the trick.

I use either the beginning of the string (^) or the comma delimiter to anchor the start of each match: (?:^|,).

Example:

    #include <iostream>
    #include <regex>
    #include <string>

    // Each match starts at the beginning of the input or right after a comma.
    const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");

    int main()
    {
        std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";

        std::sregex_iterator iter(test.begin(), test.end(), r);
        std::sregex_iterator end_iter;

        // Print capture group 1, i.e. the token without the leading comma.
        for(; iter != end_iter; ++iter)
            std::cout << iter->str(1) << '\n';
    }

Output:

    1x2y3z
    80r160g255b
    48h30m50s
    1x3z
    255b

Is that what you need?

EDIT:

If you really want to go to town and make empty strings not match at all, then as far as I can tell you have to spell out every allowed combination, like this:

    const std::string A = "(?:\\d+[xrh])";
    const std::string B = "(?:\\d+[ygm])";
    const std::string C = "(?:\\d+[zbs])";

    const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");
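
A small sketch of how this alternation behaves when iterated over an input that contains an empty token. The function name find_triplets and the sample inputs are only for illustration:

```cpp
#include <regex>
#include <string>
#include <vector>

// Iterates the "every combination" pattern over a comma-separated input and
// returns capture group 1 of each match.
std::vector<std::string> find_triplets(const std::string& input) {
    static const std::string A = "(?:\\d+[xrh])";
    static const std::string B = "(?:\\d+[ygm])";
    static const std::string C = "(?:\\d+[zbs])";
    // Longest combinations first, so ABC is tried before AB, A, and so on.
    static const std::regex re("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" +
                               B + C + "|" + A + "|" + B + "|" + C + ")");
    std::vector<std::string> out;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it)
        out.push_back(it->str(1));
    return out;
}
```

Because every alternative needs at least one component, an empty token (as in "255b,,7y") produces no match at all. The pattern is not anchored at the end of a token though, so a wrong-order token like "1s1m1h" would still give back its leading component "1s".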

Source: https://habr.com/ru/post/987194/

