I am writing a Python script to process about 10,000 input documents. Based on the script's progress output, I notice that the first 400+ documents are processed very quickly, and then the script slows down, even though all the input documents are roughly the same size.
I suspect this is because most of the document processing is done with regular expressions that I don't save as compiled regex objects. Instead, I recompile the regular expressions every time I need them.
Since my script has about 10 different functions, each of which uses about 10-20 different regex patterns, I wonder what a more efficient way would be in Python to avoid recompiling the regex patterns over and over (in Perl I could simply turn on the //o modifier).
My assumption is that if I compile a regex inside a function with
pattern = re.compile(r'...')
the resulting regex object will not persist between calls to that function for the next iteration (each function is called once per document).
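For illustration, here is roughly what one of my functions looks like now (the function name and pattern are made up):

import re

def extract_dates(text):
    # Recompiled on every call -- this is what I currently do for each
    # of the ~10-20 patterns in each of my ~10 functions.
    pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
    return pattern.findall(text)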
Creating a global list of precompiled regular expressions seems unattractive, since I would have to keep that list somewhere in my code other than where the regexes are actually used.
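That is, I would rather not end up with something like this (a sketch with made-up pattern names), where the compiled regexes live far away from the functions that use them:

import re

# All patterns collected in one place, far from the functions that use them.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "money": re.compile(r"\$\d+(?:\.\d{2})?"),
    # ...dozens more...
}

def extract_dates(text):
    return PATTERNS["date"].findall(text)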
Any advice here on how to handle this neatly and efficiently?