The Thompson NFA approach makes regular expressions non-greedy by default instead of greedy by default. Normal regex engines can do the same; just change .* to .*? . You should not use a greedy expression when a non-greedy one will do.
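A quick illustration of the difference, using the standard re module (the sample string here is made up):

>>> import re
>>> s = '<a><b>'
>>> re.findall('<.*>', s)   # greedy: .* grabs as much as possible
['<a><b>']
>>> re.findall('<.*?>', s)  # non-greedy: .*? grabs as little as possible
['<a>', '<b>']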
Someone already created an NFA regular expression parser for Python: https://github.com/xysun/regex
It handily beats the default Python regular expression engine on pathological cases. However, it underperforms on everything else:
This regex engine underperforms Python's re module on normal inputs (using Glenn Fowler's test suite; see below)
Fixing the pathological cases at the expense of the typical ones is a good reason not to use the NFA approach as the default engine, especially when the pathological cases can simply be avoided instead.
Another reason is that some features (such as backreferences) are either very hard or impossible to implement with the NFA approach. Also see the answer on the python-ideas mailing list.
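For example, a backreference makes the pattern depend on what an earlier group actually captured, which takes it outside the regular languages a plain NFA can recognize (illustrative snippet on a made-up input):

>>> import re
>>> # \1 must repeat whatever group 1 captured, whatever that was
>>> re.search(r'\b(\w+) \1\b', 'it was was here').group()
'was was'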
Thus, your test can be made to perform much better if you make at least one of the patterns non-greedy, avoiding the catastrophic backtracking:
pattern = re.compile('(.*?)sol(.*)')
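For instance, on an input that does contain 'sol' (a shortened, made-up stand-in for the question's string), the non-greedy first group stops at the first 'sol' instead of scanning back from the end:

>>> import re
>>> pattern = re.compile('(.*?)sol(.*)')
>>> pattern.search('ciao mandi sol ciao mandi').groups()
('ciao mandi ', ' ciao mandi')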
or do not use a regular expression at all; you can use str.partition() to get the prefix and postfix instead:
before, sol, after = s.partition('sol')
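For example (made-up strings; str.partition() always returns a 3-tuple, and when the separator is missing the last two elements are empty):

>>> 'ciao mandi sol ciao mandi'.partition('sol')
('ciao mandi ', 'sol', ' ciao mandi')
>>> 'ciao mandi'.partition('sol')
('ciao mandi', '', '')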
In other words, not all text problems are shaped like regular expressions, so put down that hammer and look at the rest of your toolbox!
In addition, you can look at regex, an alternative to the re module. This module implements some basic checks for pathological cases and avoids them deftly, without resorting to a Thompson NFA implementation. Quoting an entry in the Python bug report that tracks the regex module:
The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of these "pathological" regexes.
This engine can run your pathological case more than 100,000 times faster:
>>> import re, regex
>>> import timeit
>>> p_re = re.compile('(.*)sol(.*)')
>>> p_regex = regex.compile('(.*)sol(.*)')
>>> s = "ciao mandi "*1000 + "sal " + "ciao mandi "*1000
>>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_re as p', number=1)
2.4578459990007104
>>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_regex as p', number=100000)
1.955532732012216
Note the numbers: I limited the re test to just 1 run, and it took 2.46 seconds, while the regex test ran 100,000 times in under 2 seconds (roughly 20 microseconds per run, a speedup of about 125,000x).