Simplify regex for time literals (e.g. "10h50m")

I am writing lexer rules for a custom description language using pyLR1. The language includes time literals, for example:

 10h30m     # meaning 10 hours + 30 minutes
 5m30s      # meaning 5 minutes + 30 seconds
 10h20m15s  # meaning 10 hours + 20 minutes + 15 seconds
 15.6s      # meaning 15.6 seconds

The order of the hour, minute and second parts is fixed as h, m, s. In detail, the valid combinations are hms, hm, h, ms, m and s (with numbers in front of each segment, of course). As a bonus, the regex should check for decimal (i.e. non-integer) numbers in the segments and only allow them in the least significant segment.

So for all but the last group I have an integer match:

 ([0-9]+) 

And for the last group even:

 ([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?) # to allow for .5 and 0.5 and 5.0 and 5 

After going through all the combinations of h, m and s, a nice little Python script gives me the following regex:

 (([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)h|([0-9]+)h([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)h([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s) 

Obviously, this is a bit of a horror of an expression. Is there any way to simplify it? The answer should work with Python's re module, and I will also accept answers that do not work with pyLR1 if this is due to its limited subset of regular expressions.
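The generating script itself is not shown in the question. A minimal sketch of what such a script might look like, assuming the names INT, DEC and build_time_regex (all hypothetical), is the following. It enumerates the contiguous runs of "hms" in the order h, hm, hms, m, ms, s and puts the decimal-capable pattern only on the last segment of each run:

```python
import re

# Patterns taken from the question: integer-only and integer-or-decimal.
INT = r"([0-9]+)"
DEC = r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)"
UNITS = "hms"


def build_time_regex():
    """Build the alternation over all contiguous runs of h, m, s."""
    alternatives = []
    for start in range(len(UNITS)):
        for end in range(start + 1, len(UNITS) + 1):
            run = UNITS[start:end]  # "h", "hm", "hms", "m", "ms", "s"
            # Every segment but the last is integer-only.
            leading = "".join(INT + unit for unit in run[:-1])
            alternatives.append(leading + DEC + run[-1])
    return "(" + "|".join(alternatives) + ")"


TIME_REGEX = build_time_regex()

if __name__ == "__main__":
    print(TIME_REGEX)
    for text in ["10h30m", "15.6s", "10h30s", "1.5h30m"]:
        print(text, bool(re.fullmatch(TIME_REGEX, text)))
```

Generating the expression this way keeps the building blocks readable even when the resulting regex is not.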

+6
5 answers

You can factor your regular expression. Using h, m, s to denote each of the sub-regexes, the most basic version is:

 h|hm|hms|ms|m|s 

which is what you currently have. You can group it as:

 (h|hm|hms)|(ms|m)|s 

and then factor h out of the first group and m out of the second, which gives (using (x|) == x?):

 h(m|ms)?|ms?|s 

Continuing, we get to

 h(ms?)?|ms?|s 

which is probably simpler (and possibly the simplest).
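One way to convince yourself that each factoring step preserves the accepted language is to treat h, m and s as literal letters and compare the forms by brute force. This is only a sanity check under that simplification; the helper name accepted is made up for the illustration:

```python
import re
from itertools import product

# The successive forms from the derivation, with h, m, s as literal letters.
FORMS = [
    "h|hm|hms|ms|m|s",
    "(h|hm|hms)|(ms|m)|s",
    "h(m|ms)?|ms?|s",
    "h(ms?)?|ms?|s",
]


def accepted(pattern, max_len=4):
    """Return every string over {h, m, s} up to max_len that fully matches."""
    words = (
        "".join(letters)
        for length in range(max_len + 1)
        for letters in product("hms", repeat=length)
    )
    return {word for word in words if re.fullmatch(pattern, word)}


LANGUAGES = [accepted(form) for form in FORMS]

if __name__ == "__main__":
    for form, language in zip(FORMS, LANGUAGES):
        print(form, sorted(language))
```

All four forms accept the same six strings, so the rewriting is safe at this level of abstraction.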


Adding d to the notation to stand for a decimal part (as in \.[0-9]+ ), this can be written as

 h(d|m(d|sd?)?)?|m(d|sd?)?|sd? 

(i.e. at each stage, optionally have either a decimal part or a continuation to the next of h, m or s.)

This will result in something like (for just hours and minutes):

 [0-9]+((\.[0-9]+)?h|h[0-9]+(\.[0-9]+)?m)|[0-9]+(\.[0-9]+)?m 

Looking at this, it may not be possible to get it into a form acceptable to pyLR1, so parsing with decimals allowed in every position and then doing a secondary check may be the best way to do this.
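A rough sketch of that two-step approach with Python's re module, assuming the hypothetical names LOOSE, SEGMENT and parse_time_literal, and assuming the caller wants the values back as floats: the regex accepts a decimal number in every segment, and a second pass rejects decimals anywhere but the last segment.

```python
import re

# Integer-or-decimal number, non-capturing so it can be reused freely.
NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"
H, M, S = NUM + "h", NUM + "m", NUM + "s"

# h(ms?)?|ms?|s with decimals allowed everywhere.
LOOSE = re.compile(f"{H}(?:{M}(?:{S})?)?|{M}(?:{S})?|{S}")
# Splits an already validated literal into (number, unit) pairs.
SEGMENT = re.compile(r"([0-9.]+)([hms])")


def parse_time_literal(text):
    """Return {unit: value}; raise ValueError for malformed literals."""
    if not LOOSE.fullmatch(text):
        raise ValueError(f"not a time literal: {text!r}")
    segments = SEGMENT.findall(text)
    # Secondary check: only the last segment may carry a decimal point.
    for number, unit in segments[:-1]:
        if "." in number:
            raise ValueError(f"decimal not allowed in {number}{unit}")
    return {unit: float(number) for number, unit in segments}


if __name__ == "__main__":
    for text in ["10h30m", "15.6s", "1.5h30m", "10h30s"]:
        try:
            print(text, parse_time_literal(text))
        except ValueError as error:
            print(text, "rejected:", error)
```

In a lexer the first step would be the token rule and the second step would live in the token action, which keeps the rule itself within a simple regex subset.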

+3

The notation below should be clear. I don't know the exact regular expression syntax you are using, so you will need to translate it into the actual syntax yourself.

your hours

  [0-9]{1,2}h 

your minutes

 [0-9]{1,2}m 

your seconds

 [0-9]{1,2}(\.[0-9]{1,3})?s 

you want everything in order and can omit any of them (wrap with ? )

 ([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)? 

this, however, also matches things like 10h30s:
the combinations it accepts are hms , hm , hs , h , ms , m and s
in other words, minutes can be skipped while hours and seconds are still present.

another problem is that an empty string also matches, since the three ? make every part optional. so you need to get around this somehow.
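Both problems are easy to reproduce with Python's re module. The snippet below is only an illustration; the name OPTIONAL is made up here:

```python
import re

# Every part optional, exactly as in the expression above.
OPTIONAL = re.compile(
    r"([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?"
)

# A valid literal, an hs combination, and the empty string.
SAMPLES = ["10h30m", "10h30s", ""]
RESULTS = {text: bool(OPTIONAL.fullmatch(text)) for text in SAMPLES}

if __name__ == "__main__":
    for text, matched in RESULTS.items():
        print(repr(text), matched)
```

All three inputs match, including the two that should be rejected.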


looking at @dbaupp's h(ms?)?|ms?|s , you can take the pieces above and map:

 h: [0-9]{1,2}h
 m: [0-9]{1,2}m
 s: [0-9]{1,2}(\.[0-9]{1,3})?s

so you get:

 h(ms?)?: [0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?
 ms?    : [0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?
 s      : [0-9]{1,2}(\.[0-9]{1,3})?s

all those OR'd together give you a big regular expression that is still easy to break down:

 [0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?|[0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?|[0-9]{1,2}(\.[0-9]{1,3})?s

which avoids both the empty-string match and the hs match.


looking at @Donal Fellows' comment on @dbaupp's answer, I will also do (h?m)?S|h?M|H

 (h?m)?s: (([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s
 h?m    : ([0-9]{1,2}h)?[0-9]{1,2}m
 h      : [0-9]{1,2}h

and when combined, you get something shorter than the above:

 (([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s|([0-9]{1,2}h)?[0-9]{1,2}m|[0-9]{1,2}h 
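Rather than typing either long expression by hand, both can be assembled from the three building blocks. The following sketch (variable names nested, suffix and SAMPLES are illustrative only) builds the h(ms?)?|ms?|s form and the (h?m)?s|h?m|h form and compares them on a handful of inputs; agreement on samples is a sanity check, not a proof of equivalence:

```python
import re

# Building blocks from this answer.
h = r"[0-9]{1,2}h"
m = r"[0-9]{1,2}m"
s = r"[0-9]{1,2}(\.[0-9]{1,3})?s"

# h(ms?)?|ms?|s
nested = f"{h}({m}({s})?)?|{m}({s})?|{s}"
# (h?m)?s|h?m|h
suffix = f"(({h})?{m})?{s}|({h})?{m}|{h}"

SAMPLES = ["10h", "10h30m", "10h20m15s", "5m30s", "5m", "15.6s",
           "", "10h30s", "30m10h", "1.5h"]

# For each sample: (matches nested form, matches suffix form).
RESULTS = {
    text: (bool(re.fullmatch(nested, text)), bool(re.fullmatch(suffix, text)))
    for text in SAMPLES
}

if __name__ == "__main__":
    for text, pair in RESULTS.items():
        print(repr(text), pair)
```

Both forms accept the first six samples and reject the last four, including the empty string and the hs combination.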

now we need to find a way to match the .xx decimal form in the other segments

+1

Here is a short Python expression that works:

 (\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms])) 

Inspired by Cameron Martin's answer based on conditionals.

Explanations:

 (\d+h)?                   # optional int "h" (capture 1)
 (\d+m)?                   # optional int "m" (capture 2)
 (\d*\.\d+|\d+(\.\d*)?)    # int or decimal
 (?(2)                     # if "m" (capture 2) was matched:
   s                       #   "s"
 | (?(1)                   # else if "h" (capture 1) was matched:
     m                     #   "m"
   |                       # else (nothing matched):
     [hms]))               #   any of the "h", "m" or "s"
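For reference, here is how the commented form could be used directly with re.VERBOSE, which ignores whitespace and # comments inside the pattern. The names TIME_LITERAL and is_time_literal are made up for this sketch:

```python
import re

TIME_LITERAL = re.compile(r"""
    (\d+h)?                   # optional int "h" (capture 1)
    (\d+m)?                   # optional int "m" (capture 2)
    (\d*\.\d+|\d+(\.\d*)?)    # int or decimal (capture 3)
    (?(2) s                   # if capture 2 matched: "s"
      | (?(1) m               # else if capture 1 matched: "m"
          | [hms] ) )         # else: any of "h", "m" or "s"
""", re.VERBOSE)


def is_time_literal(text):
    """True if the whole string is a valid time literal."""
    return TIME_LITERAL.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["10h30m", "5m30s", "10h20m15s", "15.6s",
                 "10h30s", "1.5h30m", ""]:
        print(repr(text), is_time_literal(text))
```

The trick is that the number in capture 3 always belongs to the last segment, and the conditionals pick the unit letter that may legally follow whatever was captured before it.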
+1

You can have hours, minutes, and seconds.

  /(\d{1,2}h)*(\d{1,2}m)*(\d{1,2}(\.\d+)*s)*/ 

should do the job. Depending on the regular expression library, you will get your data in order, or you will have to analyze it further to check for h, m or s.

In the latter case, see also what this returns:

  /(\d{1,2}(h))*(\d{1,2}(m))*(\d{1,2}(\.\d+)*(s))*/ 
0

The last group should be:

 ([0-9]*\.[0-9]+|[0-9]+(\.[0-9]+)?) 

if you do not want to match a number with a trailing dot and no digits after it, such as 5.
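The difference between the two variants is easy to see on a few inputs. A small illustration (the names WITH_TRAILING_DOT and WITHOUT_TRAILING_DOT are made up here):

```python
import re

# Variant from the question: the fractional digits are optional after the dot.
WITH_TRAILING_DOT = re.compile(r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)")
# Variant from this answer: a dot must be followed by at least one digit.
WITHOUT_TRAILING_DOT = re.compile(r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]+)?)")

SAMPLES = [".5", "0.5", "5.0", "5", "5."]

if __name__ == "__main__":
    for text in SAMPLES:
        print(
            repr(text),
            bool(WITH_TRAILING_DOT.fullmatch(text)),
            bool(WITHOUT_TRAILING_DOT.fullmatch(text)),
        )
```

Only the last sample, the bare trailing dot, is treated differently by the two patterns.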


You can use regex conditionals, for example:

 (([0-9]+h)?([0-9]+m)?([0-9]+s)?)(?(?<=h)(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m)?|(?(?<=m)(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)?|\b(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)[hms])?)) 

Here - http://regexr.com?31dmj

I have not checked that this works, but it tries to match only integers for hours, minutes and seconds, and then, if the last thing matched was hours, it allows fractional minutes; otherwise, if the last thing matched was minutes, it allows fractional seconds.
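One caveat if the target is Python's re module: its conditional syntax only tests whether a numbered or named group matched, (?(id/name)yes|no), and it does not accept a lookbehind as the condition. A quick check, using a reduced pattern rather than the full expression above (the helper compiles and both pattern names are hypothetical):

```python
import re

# Reduced example of a conditional whose condition is a lookbehind.
LOOKBEHIND_CONDITIONAL = r"(?(?<=h)m|s)"
# A conditional on a capture group, which re does support.
GROUP_CONDITIONAL = r"(\d+h)?\d+(?(1)m|[hm])"


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


if __name__ == "__main__":
    print("lookbehind conditional:", compiles(LOOKBEHIND_CONDITIONAL))
    print("group conditional:", compiles(GROUP_CONDITIONAL))
```

So for Python the same idea would have to be expressed with conditions on capture groups instead of lookbehinds.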

0

Source: https://habr.com/ru/post/919501/

