Running several consecutive regular expressions in Python. Inefficient?

First off, my code works. It just works slowly, and I wonder whether I am missing something that would make it more efficient. I am parsing PDF files using Python (and yes, I know that this should be avoided if at all possible).

My problem is that I need to do some pretty complicated regular expression substitutions, and when I say substitution, I really mean deletion. I run the expressions that strip out the most data first, so that the later expressions have less text to parse, but that is all I can think of to speed up the process.

I am new to Python and regexes, so it is very possible that this could be done better.

Thank you for reading.

    regexPagePattern = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
    regexCleanPattern = r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
    regexStartPattern = r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"
    regexEndPattern = r"(II.)\d{1,5}\((P|T)\).*"

    contentRaw = re.sub(regexStartPattern, "", contentRaw)
    contentRaw = re.sub(regexEndPattern, "", contentRaw)
    contentRaw = re.sub(regexPagePattern, "", contentRaw)
    contentRaw = re.sub(regexCleanPattern, "", contentRaw)
2 answers

I'm not sure whether you run this inside a loop. If you don't, the following does not apply.

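If you use a pattern several times, you should compile it with re.compile(...), so that the pattern is compiled only once. Inside a tight loop this can give a noticeable speedup, although the re module also caches recently compiled patterns, so the gain is often smaller than you might expect. Minimal example: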

    >>> import re
    >>> a = "abcdef"
    >>> re.sub(' ', '-', a)
    'abcdef'
    >>> p = re.compile(' ')
    >>> p.sub('-', a)
    'abcdef'

Another idea: use re.split(...) instead of re.sub and operate on a list of the resulting fragments of your data. I'm not quite sure how this is implemented, but I think re.sub creates text fragments and combines them into one string at the end, which is expensive. After the last step, you can join the list using " ".join(fragments). Obviously, this method will not work if your patterns overlap somewhere.
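As a rough sketch of the split-and-join idea, assuming a made-up sample string and a simplified pattern (neither comes from the question), it could look like the code below. The function name delete_by_split is hypothetical. Note that re.split also returns the text of any capturing groups, so the patterns from the question would need their groups made non-capturing with (?:...) before they could be used this way.

```python
import re

# Hypothetical pattern for illustration only: a number such as "1.23" plus
# the whitespace around it. It has no capturing groups, because re.split
# would otherwise include the captured text in its result.
NUMBER = re.compile(r"\s*\d\.\d{1,2}\s*")


def delete_by_split(pattern, text):
    """Return the pieces of text that lie between matches of pattern."""
    fragments = pattern.split(text)
    # A match at the very start or end of the text yields an empty string.
    return [fragment for fragment in fragments if fragment]


# Made-up input standing in for the extracted PDF text.
fragments = delete_by_split(NUMBER, "alpha 1.23 beta 4.5 gamma")
result = " ".join(fragments)
print(result)
```

Whether this is actually faster than re.sub for a given input is something only a measurement can tell.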

It would be interesting to see timing information for your program before and after your changes.
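One way to collect such timings is the standard timeit module. The sketch below compares re.sub with a pattern string against a precompiled pattern, using the page pattern from the question. The sample text is invented for illustration and should be replaced with a real chunk of the extracted PDF text; the function names are also assumptions.

```python
import re
import timeit

# Invented sample input; "Wk12.345.67" is a string the page pattern matches.
sample = "Wk12.345.67 some text " * 1000

PAGE_PATTERN = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
compiled_page = re.compile(PAGE_PATTERN)


def with_module_function():
    # re.sub looks the pattern up in the module's cache on every call.
    return re.sub(PAGE_PATTERN, "", sample)


def with_compiled_pattern():
    # The compiled pattern object is reused directly.
    return compiled_page.sub("", sample)


# Both variants must produce the same text before timing means anything.
same_output = with_module_function() == with_compiled_pattern()

t_module = timeit.timeit(with_module_function, number=20)
t_compiled = timeit.timeit(with_compiled_pattern, number=20)
print(f"re.sub: {t_module:.4f}s  compiled: {t_compiled:.4f}s")
```

With one call per large string, the two numbers will probably be close; the difference is more likely to show up when many short strings are processed in a loop.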


Regular expressions should be a last resort when parsing strings, so if you see another way to solve your problem, use that instead.

However, you can use re.compile to precompile your regular expression patterns:

    regexPagePattern = re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})")
    contentRaw = regexPagePattern.sub("", contentRaw)  # sub returns a new string

This should speed things up a bit (quite a bit, actually ;))
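Applied to all four patterns from the question, the same idea might look like the sketch below. The names CLEANUP_PATTERNS and clean are hypothetical, and the order follows the question's own sequence of re.sub calls.

```python
import re

# The four patterns from the question, compiled once when the module loads
# and kept in the order in which the question applies them.
CLEANUP_PATTERNS = [
    re.compile(r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"),  # start
    re.compile(r"(II.)\d{1,5}\((P|T)\).*"),                # end
    re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"),         # page
    re.compile(
        r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
    ),                                                     # clean
]


def clean(text):
    """Delete every match of each pattern, one pattern after another."""
    for pattern in CLEANUP_PATTERNS:
        text = pattern.sub("", text)
    return text


# Made-up input for illustration; real input would be the extracted PDF text.
cleaned = clean("header II INDEX OF CHARTS AFFECTED body 1.23 end")
print(repr(cleaned))
```

If clean is called once per page or per file, the patterns are compiled a single time no matter how many documents are processed.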


Source: https://habr.com/ru/post/1401176/

