In Python, how to parse a string representing a set of keyword arguments, so the order doesn't matter

Question

I am writing a RecurringInterval class which, based on dateutil.rrule, represents a repeating time interval. I have defined a custom, human-readable __str__ method for it and would also like to define a parse method that (like rrulestr()) parses a string back into an object.

Below is the parse method and some test cases:

    import re
    from dateutil.rrule import FREQNAMES
    import pytest


    class RecurringInterval(object):
        freq_fmt = "{freq}"
        start_fmt = "from {start}"
        end_fmt = "till {end}"
        byweekday_fmt = "by weekday {byweekday}"
        bymonth_fmt = "by month {bymonth}"

        @classmethod
        def match_pattern(cls, string):
            SPACES = r'\s*'

            freq_names = [freq.lower() for freq in FREQNAMES] + [freq.title() for freq in FREQNAMES]  # The frequencies may be either lowercase or start with a capital letter
            FREQ_PATTERN = '(?P<freq>{})?'.format("|".join(freq_names))

            # Start and end are required (their regular expressions match 1 repetition)
            START_PATTERN = cls.start_fmt.format(start=SPACES + r'(?P<start>.+?)')
            END_PATTERN = cls.end_fmt.format(end=SPACES + r'(?P<end>.+?)')

            # The remaining tokens are optional (their regular expressions match 0 or 1 repetitions)
            BYWEEKDAY_PATTERN = cls.optional(cls.byweekday_fmt.format(byweekday=SPACES + r'(?P<byweekday>.+?)'))
            BYMONTH_PATTERN = cls.optional(cls.bymonth_fmt.format(bymonth=SPACES + r'(?P<bymonth>.+?)'))

            PATTERN = SPACES + FREQ_PATTERN \
                    + SPACES + START_PATTERN \
                    + SPACES + END_PATTERN \
                    + SPACES + BYWEEKDAY_PATTERN \
                    + SPACES + BYMONTH_PATTERN \
                    + SPACES + "$"  # The character '$' is needed to make the non-greedy regular expressions parse till the end of the string

            return re.match(PATTERN, string).groupdict()

        @staticmethod
        def optional(pattern):
            '''Encloses the given regular expression in an optional group (ie, one that matches 0 or 1 repetitions of the original regular expression).'''
            return '({})?'.format(pattern)


    '''Tests'''
    def test_match_pattern_with_byweekday_and_bymonth():
        string = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by weekday Monday, Tuesday by month January, February"
        groups = RecurringInterval.match_pattern(string)
        assert groups['freq'] == "Weekly"
        assert groups['start'].strip() == "2017-11-03 15:00:00"
        assert groups['end'].strip() == "2017-11-03 16:00:00"
        assert groups['byweekday'].strip() == "Monday, Tuesday"
        assert groups['bymonth'].strip() == "January, February"

    def test_match_pattern_with_bymonth_and_byweekday():
        string = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by month January, February by weekday Monday, Tuesday "
        groups = RecurringInterval.match_pattern(string)
        assert groups['freq'] == "Weekly"
        assert groups['start'].strip() == "2017-11-03 15:00:00"
        assert groups['end'].strip() == "2017-11-03 16:00:00"
        assert groups['byweekday'].strip() == "Monday, Tuesday"
        assert groups['bymonth'].strip() == "January, February"


    if __name__ == "__main__":
        # pytest.main([__file__])
        pytest.main([__file__+"::test_match_pattern_with_byweekday_and_bymonth"])    # This passes
        # pytest.main([__file__+"::test_match_pattern_with_bymonth_and_byweekday"])  # This fails

Although the parser works if you specify the arguments in the expected order, it is "inflexible" in that it does not allow the optional arguments to be given in arbitrary order. This is why the second test fails.

How can I get the parser to accept the "optional" fields in any order, so that both tests pass? (I thought of creating an iterator over all permutations of the regular expressions and trying re.match with each of them, but that doesn't seem like an elegant solution.)

+6

python regex parsing

Kurt peek Feb 24 '17 at 9:32

2 answers

You have several options here, each with different drawbacks.

One approach is to use repetition, for example (by weekday|by month)*:

 (?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?)(?:\s+by weekday (?P<byweekday>.+?)|\s+by month (?P<bymonth>.+?))*$

This will match strings of the form week month and month week, but also week week or month week month, etc.
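A minimal sketch of how the repeated-group pattern above behaves, using only the standard re module. The helper name parse and the sample strings are illustrative assumptions, not part of the original answer, and only "Weekly" is accepted as a frequency, as in the pattern itself.

```python
import re

# The repetition pattern from above, split over two lines for readability.
# Assumption: "Weekly" is the only frequency, exactly as in that pattern.
PATTERN = re.compile(
    r"(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?)"
    r"(?:\s+by weekday (?P<byweekday>.+?)|\s+by month (?P<bymonth>.+?))*$"
)

BASE = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00"


def parse(string):
    """Hypothetical helper: return the named groups, or None if no match."""
    match = PATTERN.match(string)
    return match.groupdict() if match else None


# Both orders of the optional fields yield the same groups.
weekday_first = parse(BASE + " by weekday Monday, Tuesday by month January, February")
month_first = parse(BASE + " by month January, February by weekday Monday, Tuesday")

# The drawback: a repeated field is accepted too, and a named group that
# matches more than once keeps only its last match.
repeated = parse(BASE + " by weekday Monday by weekday Tuesday")

print(weekday_first)
print(month_first)
print(repeated["byweekday"])  # only the last repetition survives
```

If repeated fields should be rejected rather than silently collapsed, that check would have to be done separately after matching.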

Another option would be to use lookaheads, for example (?=.*by weekday)?(?=.*by month)?:

 (?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?(?=$| by))(?=.*\s+by weekday (?P<byweekday>.+?(?=$| by))|)(?=.*\s+by month (?P<bymonth>.+?(?=$| by))|)

However, this requires a known delimiter (I used "by") to know how far each field may extend. In addition, it will silently ignore any extra text (that is, it will also match strings of the form by weekday [some garbage] by month).
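A minimal sketch of the lookahead variant, again using only the standard re module. The helper name parse and the sample strings are illustrative assumptions; the month group is named bymonth here so that the result lines up with the keys used in the question's tests.

```python
import re

# The lookahead pattern, split over three lines for readability.
# Assumptions: "Weekly" is the only frequency, " by" is the field delimiter,
# and the month group is called "bymonth" to match the question's tests.
PATTERN = re.compile(
    r"(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?(?=$| by))"
    r"(?=.*\s+by weekday (?P<byweekday>.+?(?=$| by))|)"
    r"(?=.*\s+by month (?P<bymonth>.+?(?=$| by))|)"
)

BASE = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00"


def parse(string):
    """Hypothetical helper: return the named groups, or None if no match."""
    match = PATTERN.match(string)
    return match.groupdict() if match else None


# Each lookahead searches the rest of the string independently,
# so the order of the optional fields does not matter.
weekday_first = parse(BASE + " by weekday Monday, Tuesday by month January, February")
month_first = parse(BASE + " by month January, February by weekday Monday, Tuesday")

# The drawback: an unknown "by ..." clause is skipped without any error.
garbage = parse(BASE + " by nonsense by weekday Monday")

print(weekday_first)
print(month_first)
print(garbage)
```

Because the lookaheads consume nothing, the overall match ends right after the end field; anything the lookaheads do not recognise is simply never checked.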

+1

Aran-fey Feb 24 '17 at 10:57

ymbirtt · Accepted Answer · 2017-02-24T11:10:40+0000

At this point, your language is getting complex enough that it's time to ditch regular expressions and learn to use a proper parsing library. I threw this together using pyparsing and have annotated it heavily to try to explain what is going on, but if anything is unclear, ask and I will try to explain.

    from pyparsing import Regex, oneOf, OneOrMore

    # Boring old constants, I'm sure you know how to fill these out...
    months      = ['January', 'February']
    weekdays    = ['Monday', 'Tuesday']
    frequencies = ['Daily', 'Weekly']

    # A datetime expression is anything matching this regex. We could split it down
    # even further to get day, month, year attributes in our results object if we felt
    # like it
    datetime_expr = Regex(r'(\d{4})-(\d\d?)-(\d\d?) (\d{2}):(\d{2}):(\d{2})')

    # A from or till expression is the word "from" or "till" followed by any valid datetime
    from_expr = 'from' + datetime_expr.setResultsName('from_')
    till_expr = 'till' + datetime_expr.setResultsName('till')

    # A range expression is a from expression followed by a till expression
    range_expr = from_expr + till_expr

    # A weekday is any old weekday
    weekday_expr = oneOf(weekdays)
    month_expr = oneOf(months)
    frequency_expr = oneOf(frequencies)

    # A by weekday expression is the words "by weekday" followed by one or more weekdays
    by_weekday_expr = 'by weekday' + OneOrMore(weekday_expr).setResultsName('weekdays')
    by_month_expr = 'by month' + OneOrMore(month_expr).setResultsName('months')

    # A recurring interval, then, is a frequency, followed by a range, followed by
    # a weekday and a month, in any order
    recurring_interval = frequency_expr + range_expr + (by_weekday_expr & by_month_expr)

    # Let's parse!
    if __name__ == '__main__':
        res = recurring_interval.parseString('Daily from 1111-11-11 11:00:00 till 1111-11-11 12:00:00 by weekday Monday by month January February')

        # Note that setResultsName causes everything to get packed neatly into
        # attributes for us, so we can pluck all the bits and pieces out with no
        # difficulty at all.
        # print(...) with a single argument works on both Python 2 and Python 3.
        print(res)
        print(res.from_)
        print(res.till)
        print(res.weekdays)
        print(res.months)
