First up: my code works. It just runs slowly, and I wonder whether I am missing something that would make it more efficient. I am parsing PDF files with Python (and yes, I know this should be avoided if at all possible).
My problem is that I need to do several fairly complicated regular expression substitutions, and by substitution I really mean deletion. I have ordered them so that the expressions that remove the most data run first, so the later expressions do not have to scan as much text, but that is all I can think of to speed up the process.
I am new to Python and to regexes, so it is very possible that this could be done better.
Thank you for reading.
    import re

    regexPagePattern = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
    regexCleanPattern = r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
    regexStartPattern = r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"
    regexEndPattern = r"(II.)\d{1,5}\((P|T)\).*"

    # Each substitution deletes its matches by replacing them with an empty string.
    contentRaw = re.sub(regexStartPattern, "", contentRaw)
    contentRaw = re.sub(regexEndPattern, "", contentRaw)
    contentRaw = re.sub(regexPagePattern, "", contentRaw)
    contentRaw = re.sub(regexCleanPattern, "", contentRaw)
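For reference, here is a minimal sketch of the same four deletions with the patterns compiled once and applied in a loop. The names `PATTERNS` and `clean_content` are made up for illustration and are not part of my real code. The patterns and their order are unchanged from above. Since the `re` module already caches recently used patterns, I would expect any gain from precompiling alone to be modest.

```python
import re

# Same patterns as above, in the same order they are applied there.
# PATTERNS and clean_content are hypothetical names used only for this sketch.
PATTERNS = [
    re.compile(r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"),   # start marker
    re.compile(r"(II.)\d{1,5}\((P|T)\).*"),                 # end marker
    re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"),          # page header
    re.compile(r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"),
]


def clean_content(text):
    """Delete every match of each pattern, one pattern after another."""
    for pattern in PATTERNS:
        text = pattern.sub("", text)
    return text
```

Calling `contentRaw = clean_content(contentRaw)` would then replace the four separate `re.sub` lines.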
gruvn