Pythonic and efficient way to define multiple regular expressions for use in many iterations

I am currently writing a Python script to process about 10,000 input documents. From the script's progress output I notice that the first 400 or so documents are processed very quickly, after which the script slows down, even though all input documents are roughly the same size.

I suspect this is because most of the document processing is done with regular expressions that I do not keep around as compiled regex objects. Instead, I recompile the regular expressions every time I need them.

Since my script has about 10 different functions, each of which uses about 10-20 different regex patterns, I wonder what the more efficient way in Python would be to avoid recompiling the regex patterns over and over (in Perl I could simply add the //o modifier).

My assumption is that if I store the regex objects inside the individual functions with

    pattern = re.compile()

the resulting regex object will not be retained until the next call of that function in the next iteration (each function is called only once per document).
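
Roughly, a simplified sketch of one such function (the pattern and function names here are just placeholders):

    import re

    def extract_dates(document_text):
        # Each call recompiles the patterns from scratch, once per document.
        date_re = re.compile(r'\d{4}-\d{2}-\d{2}')
        time_re = re.compile(r'\d{2}:\d{2}(?::\d{2})?')
        return date_re.findall(document_text), time_re.findall(document_text)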

Creating a global list of pre-compiled regular expressions seems unattractive, since I would have to store that list in a different part of my code from where the expressions are actually used.

Any advice here on how to handle this neatly and efficiently?

+6
4 answers

The re module caches compiled regex patterns. The cache is emptied once it reaches re._MAXCACHE entries, which is 100 by default. (Since you have 10 functions with 10-20 regular expressions each, i.e. 100-200 patterns in total, the slowdown you observe is consistent with the cache being cleared.)

If you are okay with modifying private variables, a quick and dirty fix for your program would be to set re._MAXCACHE to a higher value:

    import re
    re._MAXCACHE = 1000
+9

The last time I looked, re.compile maintained a fairly small cache, and when it filled up, it simply emptied it. Roll your own with no limit:

    import re

    class MyRECache(object):
        def __init__(self):
            self.cache = {}

        def compile(self, regex_string):
            # Compile each pattern only once; later calls return the cached object.
            if regex_string not in self.cache:
                self.cache[regex_string] = re.compile(regex_string)
            return self.cache[regex_string]
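
Usage could then look like this (a small sketch; the shared cache instance and the pattern are placeholders):

    # One shared cache for the whole script, e.g. at module level.
    recache = MyRECache()

    def find_numbers(text):
        # Compiled on the first call only; later calls reuse the cached object.
        return recache.compile(r'\d+').findall(text)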
+5

The compiled regexes are automatically cached by re.compile, re.search and re.match, but the maximum cache size is 100 in Python 2.7, so you are overflowing the cache.

Creating a global list of pre-compiled regular expressions seems unattractive, since I would have to store that list in a different part of my code from where the expressions are actually used.

You can define them near the place where they are used: immediately before the functions that use them. If you reuse the same RE somewhere else, it would be a good idea to define it globally anyway, so you don't have to change it in several places.
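
For example (a minimal sketch; the pattern and function names are made up), the compiled objects can live at module level right above the function that uses them:

    import re

    # Compiled once at import time, right next to the code that uses them.
    SUBJECT_RE = re.compile(r'^Subject:\s*(.+)$', re.MULTILINE)
    DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

    def parse_header(document_text):
        match = SUBJECT_RE.search(document_text)
        dates = DATE_RE.findall(document_text)
        return (match.group(1) if match else None), dates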

+2

In the spirit of "simple is better," I would use a small helper function:

    import re

    def rc(pattern, flags=0):
        # Look up the compiled pattern by (pattern, flags); compile only on the first call.
        key = pattern, flags
        if key not in rc.cache:
            rc.cache[key] = re.compile(pattern, flags)
        return rc.cache[key]

    rc.cache = {}

Usage:

    rc('[a-z]').sub...
    rc('[a-z]').findall   # <- no compilation here

I also recommend that you take a look at the third-party regex module. Among its many other advantages over the stock re module, its MAXCACHE defaults to 500, and the cache is not dropped completely on overflow.
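
If you try it, regex is intended as a drop-in replacement for re, so switching is mostly a matter of the import (a minimal sketch, assuming the third-party regex package is installed):

    import regex  # third-party package: pip install regex

    # Same interface as re.compile; regex keeps a larger pattern cache internally.
    WORD_RE = regex.compile(r'\w+')

    def count_words(text):
        return len(WORD_RE.findall(text))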

+1

Source: https://habr.com/ru/post/911884/

