Why is Python regex so slow?

After a long debugging session, I found that my Python regex application is slow. Here is what surprised me:

    import datetime
    import re

    pattern = re.compile('(.*)sol(.*)')
    lst = ["ciao mandi "*10000 + "sol " + "ciao mandi "*10000,
           "ciao mandi "*1000 + "sal " + "ciao mandi "*1000]
    for s in lst:
        print "string len", len(s)
        start = datetime.datetime.now()
        re.findall(pattern, s)
        print "time spent", datetime.datetime.now() - start
        print

Output on my machine:

    string len 220004
    time spent 0:00:00.002844

    string len 22004
    time spent 0:00:05.339580

The first test string is 220K characters long, matches, and the match is found quickly. The second test string is about 22K characters long, does not match, and takes 5 seconds to process!

I was aware of this article http://swtch.com/~rsc/regexp/regexp1.html , which states that the regex implementations in Python, Perl, and Ruby are somewhat suboptimal... Is this the reason? I did not expect this to happen with such a simple expression.

Edit: my original task was to split the string against several regular expressions in turn. Something like:

    for regex in [r'(.*)sol(.*)',
                  r'\\emph{([^{}]*)}(.*)',
                  # ... more patterns ...
                  ]:
        lst = re.findall(regex, text)
        if lst:
            assert len(lst) == 1
            assert len(lst[0]) == 2
            return lst[0]

This explains why I cannot simply use split . I solved the problem by replacing (.*)sol(.*) with (.*?)sol(.*) , as Martijn suggested.

I should probably use match instead of findall ... but I don't think that would solve the problem, since the regex matches the whole input, so findall should stop after the first match anyway.

In any case, my question was really about how easy it is for regex beginners to run into this problem... I think it is not at all obvious that (.*?)sol(.*) is a solution (and that, for example, (.*?)sol(.*?) is not).
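A minimal session that shows the difference (the test string here is just an example of my own): with a lazy second group, the trailing capture stops immediately, so the postfix is lost:

    >>> import re
    >>> # greedy second group keeps the postfix
    >>> re.findall('(.*?)sol(.*)', 'ciao mandi sol ciao mandi')
    [('ciao mandi ', ' ciao mandi')]
    >>> # lazy second group matches the empty string, so the postfix is dropped
    >>> re.findall('(.*?)sol(.*?)', 'ciao mandi sol ciao mandi')
    [('ciao mandi ', '')]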

+6

3 answers

The Thompson NFA approach changes regular expressions from greedy by default to non-greedy by default. Normal regex engines can do the same; simply change .* to .*? . You should not use greedy expressions when non-greedy ones will do.
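For example (a small session of my own, with an arbitrary test string), the visible difference is which 'sol' the first group stops at:

    >>> import re
    >>> re.findall('(.*)sol(.*)', 'a sol b sol c')    # greedy: first group runs to the last 'sol'
    [('a sol b ', ' c')]
    >>> re.findall('(.*?)sol(.*)', 'a sol b sol c')   # non-greedy: first group stops at the first 'sol'
    [('a ', ' b sol c')]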

Someone has already built an NFA-based regular expression engine for Python: https://github.com/xysun/regex

It does outperform the default Python regular expression engine on pathological cases, but it underperforms on everything else:

This regex engine underperforms Python's re module on normal inputs (using Glenn Fowler's test suite - see below)

Fixing the pathological case at the expense of the typical case is probably a good reason not to use the NFA approach as the default engine, not when pathological cases can simply be avoided instead.

Another reason is that certain features (such as backreferences) are either very hard or impossible to implement with the NFA approach. See also the response on the Python-ideas mailing list.
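To illustrate the kind of feature meant here (an example of my own, not from the linked discussion): a backreference makes the match depend on what an earlier group captured, which a plain NFA cannot track:

    >>> import re
    >>> # \1 must repeat whatever group 1 matched, e.g. to find doubled words
    >>> re.findall(r'\b(\w+) \1\b', 'this is is a test test sentence')
    ['is', 'test']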

So your test performs much better if you make at least one of the patterns non-greedy to avoid the catastrophic backtracking:

 pattern = re.compile('(.*?)sol(.*)') 

or do not use a regular expression at all; you can use str.partition() to get the prefix and postfix instead:

 before, sol, after = s.partition('sol') 

Not every text problem is shaped like a regular expression, so put down that hammer and look at the rest of your toolbox!
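If you still need to distinguish "matched" from "not matched", as in the original loop, a small sketch along these lines would work (split_on_sol is a hypothetical helper name, not part of the original answer):

    def split_on_sol(s):
        # str.partition() returns (s, '', '') when 'sol' is absent,
        # so an empty middle element means "no match".
        before, sep, after = s.partition('sol')
        if sep:
            return before, after
        return None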

In addition, you can look at an alternative to the re module, regex . This module implements some basic checks for pathological cases and avoids them deftly, without resorting to a Thompson NFA implementation. Quoting an entry from a Python bug report tracking regex :

The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of those "pathological" regexes.

This engine can run your pathological case more than 100 thousand times faster:

    >>> import re, regex
    >>> import timeit
    >>> p_re = re.compile('(.*)sol(.*)')
    >>> p_regex = regex.compile('(.*)sol(.*)')
    >>> s = "ciao mandi "*1000 + "sal " + "ciao mandi "*1000
    >>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_re as p', number=1)
    2.4578459990007104
    >>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_regex as p', number=100000)
    1.955532732012216

Pay attention to the numbers: I limited the re test to a single run, which took 2.46 seconds, while the regex test runs 100 thousand times in under 2 seconds.

+11

I think this has little to do with catastrophic backtracking (or at least with my own understanding of it).

The problem is caused by the first (.*) in (.*)sol(.*) and by the fact that the regular expression is not anchored anywhere.

re.findall() , after failing at index 0, tries again at index 1, 2, and so on, up to the end of the string.

    badbadbadbad...bad
    ^ Attempt to match (.*)sol(.*) from index 0. Fail
     ^ Attempt to match (.*)sol(.*) from index 1. Fail
      ^ Attempt to match (.*)sol(.*) from index 2. Fail
    (and so on)

It effectively has quadratic complexity O(n²), where n is the length of the string.
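A rough way to see the quadratic growth (a sketch of my own; absolute timings will differ per machine) is to double the size of a non-matching input and watch the time roughly quadruple:

    import re
    import timeit

    pattern = re.compile('(.*)sol(.*)')
    for n in (250, 500, 1000, 2000):
        # a string that can never match: it contains "sal", never "sol"
        s = "ciao mandi " * n + "sal " + "ciao mandi " * n
        t = timeit.timeit(lambda: pattern.findall(s), number=1)
        print "n = %5d  time = %.3f s" % (n, t)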

The problem can be avoided by anchoring your pattern, so that it fails immediately at positions where it cannot possibly match. (.*)sol(.*) looks for sol within a line of text ( . does not cross a line terminator), so if it cannot find a match starting at the beginning of a line, it will not find one anywhere else on that line either.

Therefore you can use:

 ^(.*)sol(.*) 

with the re.MULTILINE flag.

Running this test code (modified from yours):

    import datetime
    import re

    pattern = re.compile('^(.*)sol(.*)', re.MULTILINE)
    lst = ["ciao mandi "*10000 + "sol " + "ciao mandi "*10000,
           "ciao mandi "*10000 + "sal " + "ciao mandi "*10000]
    for s in lst:
        print "string len", len(s)
        start = datetime.datetime.now()
        re.findall(pattern, s)
        print "time spent", datetime.datetime.now() - start
        print

(Note that both the matching and the non-matching strings are 220004 characters long.)

It gives the following result:

    string len 220004
    time spent 0:00:00.002000

    string len 220004
    time spent 0:00:00.005000

This clearly shows that both cases now have the same order of magnitude.

+4
 ^(?=(.*?sol))\1(.*)$ 

You can try this. It reduces the backtracking and should run faster. Try it on your string here:

http://regex101.com/r/hQ1rP0/22
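A quick check of what this pattern actually captures (a session of my own; note that, unlike the original pattern, the first group here includes 'sol' itself):

    >>> import re
    >>> p = re.compile(r'^(?=(.*?sol))\1(.*)$')
    >>> p.findall("ciao mandi sol ciao mandi")
    [('ciao mandi sol', ' ciao mandi')]
    >>> p.findall("ciao mandi sal ciao mandi")
    []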

0
