Relevant dates with regular expressions in Python?

I know that my answers have similar questions, but after reading them I still do not have the solution that I am looking for.

Using Python 3.2.2, I need to match "Month, Day, Year" when the month is a string, Day - two digits of no more than 30, 31 or 28 for February and February 29 in a leap year. (Mostly REAL and Valid Date)

This is what I have so far:

pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])" expression = re.compile(pattern) matches = expression.findall(sampleTextFile) 

I'm still not very familiar with the regex syntax, so I may have characters that are not needed ([,] [] for commas and spaces seem like the wrong way to do this), but when I try to match "January 26, 1991" in my text file, a listing of the elements in the “matches” (“January”, “26”, “1991”, “19”).

Why does the extra “19” appear at the end?

Also, what things can I add or change in my regex that will allow me to check dates correctly? My plan right now is to accept almost all dates and then extrude them later using high-level constructs, comparing the grouping of days with the grouping of the month and year to see if the day should be <31,30,29,28

Any help would be greatly appreciated, including constructive criticism as to how I am going to design my regex.

+6
source share
6 answers

Here is one way to make a regular expression that matches any date in the format you want (although you can obviously configure whether commas are optional, add month abbreviations, etc.):

 years = r'((?:19|20)\d\d)' pattern = r'(%%s) +(%%s), *%s' % years thirties = pattern % ( "September|April|June|November", r'0?[1-9]|[12]\d|30') thirtyones = pattern % ( "January|March|May|July|August|October|December", r'0?[1-9]|[12]\d|3[01]') fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4)) feb = r'(February) +(?:%s|%s)' % ( r'(?:(0?[1-9]|1\d|2[0-8])), *%s' % years, # 1-28 any year r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours) # 29 leap years only result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb)) r = re.compile(result) print result 

Then we have:

 >>> r.match('January 30, 2001') is not None True >>> r.match('January 31, 2001') is not None True >>> r.match('January 32, 2001') is not None False >>> r.match('February 32, 2001') is not None False >>> r.match('February 29, 2001') is not None False >>> r.match('February 28, 2001') is not None True >>> r.match('February 29, 2000') is not None True >>> r.match('April 30, 1908') is not None True >>> r.match('April 31, 1908') is not None False 

And what is this glorious regular expression, you may ask?

 >>> print result (?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000)))) 

(I originally intended to make an elegant listing of the possible dates, but I basically ended up writing all this rude thing, with the exception of a multiple of four.)

+6
source

Here are some quick thoughts:

Everyone who offers you to use something other than regular expression gives you very good advice. On the other hand, this is always a good time to learn more about regular expression syntax ...

The expression in square brackets - [...] - matches any single character inside these brackets. Therefore, the entry [,] , containing only one character, is exactly identical to writing a simple unpainted comma: , .

The .findall method returns a list of all the relevant groups in the string. The group is identified by parenthese - (...) - and they are calculated from left to right, primarily in the first place. Your final expression is as follows:

 ((19|20)[0-9][0-9]) 

Outer parentheses correspond to the whole year, and inner parentheses correspond to the first two digits. Therefore, for a date of type "1989", the last two matching groups will be 1989 and 19 .

+2
source

The group is identified by parentheses (...) , and they are calculated from left to right in the first place. Your final expression is as follows:

((19 | 20) [0-9] [0-9])

Outer parentheses correspond to the whole year, and inner parentheses correspond to the first two digits. Therefore, for a date like "1989", the two groups of matches will be 1989 and 19. Since you do not need an internal group (the first two digits), you should use a non-capture group instead. Non-capture groups start with ?: , Which are used as follows: (?:a|b|c)

By the way, there is good documentation on how to use regular expressions here .

+2
source

Python has a date parser as part of the time module:

 import time time.strptime("December 31, 2012", "%B %d, %Y") 

The above is all you need if the date format is always the same.

So, in real production code, I would write a regular expression that parses the date, and then uses the results from the regular expression to build a date string that always has the same format.

Now that you said in the comments that this is homework, I will post another answer with regular expression tips.

+1
source

You have this regex:

 pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])" 

One of the functions of regular expressions is the "character class". Characters in square brackets form a character class. Thus, [,] is a character class corresponding to one character,, (comma). You could just put a comma.

Perhaps you wanted to make the comma optional? You can do this by putting a question mark after it: ,?

Everything that you put in parentheses makes a “matching group”. I think the mysterious extra “19” came from a group that you did not want to have. You can create an inappropriate group using this syntax: (?:

So for example:

 r'(?:red|blue) socks' 

This will match “red socks” or “blue socks,” but does not make a matching group. If you then put this in regular parentheses:

 r'((?:red|blue) socks)' 

This will create a matching group whose value is "red socks" or "blue socks"

I think that if you apply these comments to your regular expression, this will work. This is currently mostly correct.

As for checking the date in relation to the month, this is beyond the scope of regular expression. Your template will match "February 31" , and there is no easy way to fix this.

+1
source

First of all, as said, I do not think that regular expression is the best choice to solve this problem, but to answer your question. Using parentheses, you parse a string into several subgroups, and when you call findall, you will create a list with the entire matching group that you created and the corresponding string.

 ((19|20)[0-9][0-9]) 

Here is your problem, the regular expression will correspond to whole years as well as 19 or 20, depending on whether the year begins with 19 or 20.

0
source

Source: https://habr.com/ru/post/914086/


All Articles