Why is a compiled Python regular expression slower?

In another SO question, the performance of regular expressions and Python's in operator were compared. However, the accepted answer uses re.match, which only matches at the beginning of the string and therefore behaves completely differently from in. I also wanted to see the performance gain from not recompiling the RE every time.

Surprisingly, the precompiled version appears to be slower.

Any ideas why?

I know there are quite a few other questions about similar issues. In most of them, the code performs the way it does simply because the compiled regular expression is not reused correctly. If that is also my problem, please explain.

from timeit import timeit
import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

def find():
    assert text.find(pattern) > -1

def re_search():
    assert re.search(pattern, text)

def re_compiled():
    assert re.search(compiled_pattern, text)

def in_find():
    assert pattern in text

print('str.find     ', timeit(find))
print('re.search    ', timeit(re_search))
print('re (compiled)', timeit(re_compiled))
print('in           ', timeit(in_find))

Output:

str.find      0.36285957560356435
re.search     1.047689160564772
re (compiled) 1.575113873320307
in            0.1907925627077569
1 answer

Short answer

If you call compiled_pattern.search(text) directly, it will not call _compile at all. It will be faster than re.search(pattern, text) and much faster than re.search(compiled_pattern, text).

This performance difference is due to KeyErrors in the cache and slow hash calculations for compiled patterns.
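As a rough sketch of the difference, the benchmark from the question can be extended with a variant that calls the method on the compiled pattern directly. The function names and the reduced iteration count are my own choices for illustration, and absolute timings will vary with the machine and the Python version.

```python
from timeit import timeit
import re

pattern = 'sed'
text = ('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod '
        'tempor incididunt ut labore et dolore magna aliqua.')

compiled_pattern = re.compile(pattern)

def re_search():
    # String pattern passed to the module-level function.
    assert re.search(pattern, text)

def re_compiled_via_module():
    # Compiled pattern passed to the module-level function
    # (the slow case from the question).
    assert re.search(compiled_pattern, text)

def re_compiled_direct():
    # Method called on the compiled pattern itself.
    assert compiled_pattern.search(text)

N = 100_000  # illustrative value, smaller than timeit's default of 1,000,000

timings = {
    're.search(pattern, text)': timeit(re_search, number=N),
    're.search(compiled, text)': timeit(re_compiled_via_module, number=N),
    'compiled.search(text)': timeit(re_compiled_direct, number=N),
}

for label, seconds in timings.items():
    print(f'{label:28} {seconds:.4f}')
```

Only the third variant uses the compiled pattern the way it is meant to be used.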


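re functions and SRE_Pattern methods

Whenever a re function is called with a pattern as its first argument (for example, re.search(pattern, string) or re.findall(pattern, string)), Python first tries to compile the pattern with _compile, and then calls the corresponding method on the compiled pattern. For example: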

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

Note that pattern can be either a string or an already compiled pattern (an SRE_Pattern instance).
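A minimal sketch of the three equivalent call styles, assuming a shortened sample text of my own. All three find the same match; they differ only in how much work happens before the search itself.

```python
import re

# Shortened sample text, chosen for illustration only.
text = 'consectetur adipiscing elit, sed do eiusmod'
compiled = re.compile('sed')

m_string = re.search('sed', text)        # string pattern, module-level function
m_compiled = re.search(compiled, text)   # compiled pattern, module-level function
m_direct = compiled.search(text)         # method on the compiled pattern

# The three calls locate the same span in the text.
print(m_string.span() == m_compiled.span() == m_direct.span())
```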

_compile

Here is a compact version of _compile, with the debug and flags checks removed:

_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512

def _compile(pattern, flags):
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    loc = None
    _cache[type(pattern), pattern, flags] = p, loc
    return p

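_compile with a string pattern

When _compile is called with a string pattern, the compiled pattern is saved in the _cache dict. The next time the same function is called (for example, during the many timeit runs), _compile simply checks in _cache whether this string has already been seen and returns the corresponding compiled pattern.

Using the ipdb debugger in Spyder, it is easy to step into re.py during execution.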

import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

re.search(pattern, text)
re.search(pattern, text)

With a breakpoint at the second re.search(pattern, text), we can see that the pair:

{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}

is saved in _cache. The corresponding compiled pattern is returned directly.

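_compile with a compiled pattern

What happens if _compile is called with an already compiled pattern?

First, _compile checks whether the pattern is in _cache. To do so, it has to compute the pattern's hash. This computation is much slower for a compiled pattern than for a string: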

In [1]: import re

In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
   ...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
   ...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
   ...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
   ...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
   ...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
   ...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
   ...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"

In [3]: compiled_pattern = re.compile(pattern)

In [4]: %timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: %timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

hash(compiled_pattern) is about 60 times slower than hash(pattern) here.
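Outside IPython, the same comparison can be sketched with the timeit module. The alternation pattern below is a stand-in I made up to get a reasonably long pattern, and the iteration count is arbitrary. One thing worth keeping in mind when reading the numbers: in CPython, a str object caches its hash after the first computation, so repeated hash(pattern) calls are close to a plain attribute read.

```python
from timeit import timeit
import re

# Made-up long pattern: an alternation of 200 short words.
long_pattern = '|'.join(f'word{i}' for i in range(200))
compiled = re.compile(long_pattern)

N = 100_000  # arbitrary iteration count for illustration

t_str = timeit(lambda: hash(long_pattern), number=N)
t_compiled = timeit(lambda: hash(compiled), number=N)

print(f'hash(str)      {t_str:.4f} s for {N} calls')
print(f'hash(compiled) {t_compiled:.4f} s for {N} calls')
```

The ratio between the two timings depends on the pattern length and the Python version, so it may not match the factor measured above.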

KeyError

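When a pattern is unknown, _cache[type(pattern), pattern, flags] fails with a KeyError.

The KeyError is handled and ignored. Only then does _compile check whether the pattern is already compiled. If it is, it is returned without being written to the cache.

This means that the next time _compile is called with the same compiled pattern, it will compute the useless, slow hash again and still fail with a KeyError.

Error handling is expensive, and I suppose that is the main reason why re.search(compiled_pattern, text) is slower than re.search(pattern, text).

This odd behavior might be a deliberate choice to speed up calls with string patterns, but it might have been a good idea to emit a warning when _compile is called with an already compiled pattern.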
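The repeated miss can be illustrated with a toy model of the lookup logic. This is a simplified sketch, not the actual re source: toy_compile, toy_cache and stats are hypothetical names, and the locale and cache-size handling are left out.

```python
import re

_PATTERN_TYPE = type(re.compile(''))

toy_cache = {}                      # hypothetical stand-in for re's internal cache
stats = {'hits': 0, 'misses': 0}    # counters added for illustration

def toy_compile(pattern, flags=0):
    key = (type(pattern), pattern, flags)   # hashing the key hashes the pattern
    try:
        compiled = toy_cache[key]
        stats['hits'] += 1
        return compiled
    except KeyError:
        stats['misses'] += 1
    if isinstance(pattern, _PATTERN_TYPE):
        # Already compiled: returned as-is and never stored in the cache,
        # so the next call with the same object misses again.
        return pattern
    compiled = re.compile(pattern, flags)
    toy_cache[key] = compiled
    return compiled

compiled_pattern = re.compile('sed')

for _ in range(3):
    toy_compile('sed')              # 1 miss, then 2 hits
for _ in range(3):
    toy_compile(compiled_pattern)   # 3 misses, cache is never populated

print(stats)   # {'hits': 2, 'misses': 4}
```

In this model the string pattern pays for one miss and then always hits, while the compiled pattern pays for the hash and the exception on every single call.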


Source: https://habr.com/ru/post/1689795/

