Prevent RegEx freezes in big matches

This is a great regex for dates ... However, it hangs indefinitely on one page that I tried ... I wanted to try this page ( http://pleac.sourceforge.net/pleac_python/datesandtimes.html ) because it contains many dates, and I want to capture them all. I don't understand why it hangs on this page when it does not hang on others ... Why does my regular expression hang, and how can I clean it up to make it more efficient?

Python Code:

monthnames = "(?:Jan\w*|Feb\w*|Mar\w*|Apr\w*|May|Jun\w?|Jul\w?|Aug\w*|Sep\w*|Oct\w*|Nov(?:ember)?|Dec\w*)"

pattern1 = re.compile(r"(\d{1,4}[\/\\\-]+\d{1,2}[\/\\\-]+\d{2,4})")

pattern4 = re.compile(r"(?:[\d]*[\,\.\ \-]+)*%s(?:[\,\.\ \-]+[\d]+[stndrh]*)+[:\d]*[\ ]?(PM)?(AM)?([\ \-\+\d]{4,7}|[UTCESTGMT\ ]{2,4})*"%monthnames, re.I)

patterns = [pattern4, pattern1]

for pattern in patterns:
    print re.findall(pattern, s)

By the way, when I say I am running this against that site, I mean I am running it against the page's HTML source.

+3
4 answers

The problem is this part of the pattern:

(?:[\d]*[\,\.\ \-]+)*

Replace it with:

(?:[\d,. \-]*[,. \-])?

Also, make the groups non-capturing, for example (?:AM) instead of (AM), so that findall returns the whole match. The result:

[' Aug  6 20:43:20 2003', ' Mar 14 06:02:55 1973', ' March 14 06:02:55 AM 1973', ' Jun 16 20:18:03 1981']
['2003-08-06', '2003-08-07', '2003-07-23', '1973-01-18', '3/14/1973', '16/6/1981', '16/6/1981', '16/6/1981', '16/6/1981', '08/08/2003']
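
The effect of capturing groups on re.findall can be seen in isolation. This is a minimal sketch with a made-up pattern and sample string (both are illustrative, not taken from the question): with a capturing group findall returns only the group, and with a non-capturing group it returns the whole match.

```python
import re

text = "Meet at 6 AM or 11 PM"  # hypothetical sample input

# With a capturing group, findall returns only the group's text.
captured = re.findall(r"\d{1,2} (AM|PM)", text)

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"\d{1,2} (?:AM|PM)", text)

print(captured)  # ['AM', 'PM']
print(whole)     # ['6 AM', '11 PM']
```

With several capturing groups, findall returns one tuple per match, which is why the other answer's output is a list of tuples.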

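The underlying issue is nesting one repetition (* or +) inside another. A backtracking (NFA) engine such as Python's re may then try a very large number of ways to split the same text between the inner and the outer repetition before it gives up, so a failed match can appear to hang.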
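
To see the cost of the nested repetition, here is a rough sketch that times reduced versions of the two prefixes on input that cannot match. The names nested, flat and timed_search are hypothetical, the month list is shortened for brevity, and the timings will vary by machine, so treat the printed numbers as indicative only.

```python
import re
import time

# Reduced versions of the two prefixes, each followed by a (shortened)
# month alternative so that the match fails on input without a month name.
nested = re.compile(r"(?:\d*[,. -]+)*(?:Jan|Feb|Mar)\w*", re.I)
flat = re.compile(r"(?:[\d,. -]*[,. -])?(?:Jan|Feb|Mar)\w*", re.I)

def timed_search(pattern, text):
    """Return (match, elapsed seconds) for a single search."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start

# A run of separators with no month name after it is a bad case for the
# nested form: the run can be split between iterations in many ways.
for n in (8, 12, 16):
    text = " " * n + "x"
    _, t_nested = timed_search(nested, text)
    _, t_flat = timed_search(flat, text)
    print(f"n={n:2d}  nested={t_nested:.5f}s  flat={t_flat:.5f}s")

# On input that does match, both forms find the same text.
sample = "posted 14 - March 1973"
print(repr(nested.search(sample).group()))
print(repr(flat.search(sample).group()))
```

The nested timing should grow much faster than the flat one as n increases, which is consistent with a page full of whitespace and punctuation appearing to hang.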

;)

+5

Python . , .

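Also, check what "s" actually contains. If it is raw HTML, consider parsing it first, for example with BeautifulSoup, and running the regex on the text of each node.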
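
Since the question runs the patterns against a web page's source, one option is to strip the markup first and search only the text. The sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup. The TextCollector class, the HTML fragment and the simplified date pattern are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the text content of a page, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())

# Hypothetical page fragment, not taken from the page in the question.
html = '<p class="d">Released <b>2003-08-06</b></p><pre>3/14/1973</pre>'

collector = TextCollector()
collector.feed(html)
collector.close()

# A simplified numeric date pattern in the spirit of pattern1.
date_pattern = re.compile(r"\d{1,4}[-/]+\d{1,2}[-/]+\d{2,4}")
dates = [m for chunk in collector.chunks for m in date_pattern.findall(chunk)]
print(dates)  # ['2003-08-06', '3/14/1973']
```

Searching short text chunks also keeps each regex call small, which limits how much backtracking a single failed match can cause.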

0

. (, , ) , , .

0

First, read up on what a raw string r"" means: you only need a backslash where you really want a backslash, so your regexes can be simpler:

monthnames = "(?:Jan\w*|Feb\w*|Mar\w*|Apr\w*|May|Jun\w?|Jul\w?|Aug\w*|Sep\w*|Oct\w*|Nov(?:ember)?|Dec\w*)"

pattern1 = re.compile(r"(\d{1,4}[-/]+\d{1,2}[-/]+\d{2,4})")

pattern4 = re.compile(r"(?:\d*[,. -]+)*%s(?:[,. -]+\d+[stndrh]*)+[:\d]*[ ]?(PM)?(AM)?([ -+\d]{4,7}|[UTCESTGMT ]{2,4})*"%monthnames, re.I)
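
As a quick check of the point about backslashes, the following sketch compares one escaped character class from the question with its unescaped form over the ASCII range. The variable names are hypothetical.

```python
import re

# A raw string keeps backslashes as written, so the regex engine sees \d.
assert r"\d" == "\\d"

escaped = re.compile(r"[\,\.\ \-]")  # character class as written in the question
plain = re.compile(r"[,. -]")        # the same class without the extra backslashes

ascii_chars = [chr(c) for c in range(128)]
same = all(
    bool(escaped.fullmatch(c)) == bool(plain.fullmatch(c)) for c in ascii_chars
)
print(same)  # True
```

The hyphen stays literal here only because it is the last character in the class. A hyphen between two other characters is read as a range.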

As for your real problem, Python's re copes badly with a * nested inside another *. Change pattern4 to this (the first \d* becomes \d+):

pattern4 = re.compile(r"(?:\d+[,. -]+)*%s(?:[,. -]+\d+[stndrh]*)+[:\d]*[ ]?(PM)?(AM)?([ -+\d]{4,7}|[UTCESTGMT ]{2,4})*"%monthnames, re.I)

and the regex returns quickly, printing this:

[('', '', '2003'), ('', '', '1973'), ('', 'AM', ' 1973'), ('', '', '1981"')]
['2003-08-06', '2003-08-07', '2003-07-23', '1973-01-18', '3/14/1973', '16/6/1981', '16/6/1981', '16/6/1981', '16/6/1981', '08/08/2003']

although I don't know whether that is the output you wanted.
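
One detail in the simplified pattern4 is worth checking. Dropping the backslashes from [\ \-\+\d] gives [ -+\d], in which " -+" reads as a range from space to plus. That range includes the double quote, which would explain the trailing " in '1981"' in the output above. The sketch below lists what each class matches. The helper matched_chars is a hypothetical name, and [ +\d-] is one possible spelling that keeps the original meaning.

```python
import re

original = re.compile(r"[\ \-\+\d]")  # class from the question
simplified = re.compile(r"[ -+\d]")   # class from the answer: ' -+' is a range
reordered = re.compile(r"[ +\d-]")    # hyphen last, so it is literal

def matched_chars(pattern):
    """Return the printable ASCII characters that the class matches."""
    return "".join(chr(c) for c in range(32, 127) if pattern.fullmatch(chr(c)))

print(repr(matched_chars(original)))    # ' +-0123456789'
print(repr(matched_chars(simplified)))  # space through '+', then the digits
print(repr(matched_chars(reordered)))   # ' +-0123456789'
```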

0

Source: https://habr.com/ru/post/1726086/

