A boolean AND regex built from multiple lookahead assertions can be sped up considerably by anchoring them to the beginning of the string. Better yet, use two regexes: one for the ORed list of terms, applied with the re.search method, and a second for the boolean ANDed lists, applied with the re.match method, like so:
    import re

    def boolean_and_new(terms):
        # One lazy lookahead per term; all of them must succeed at the same position.
        return ''.join([r'(?=.*?\b%s\b)' % (term) for term in terms])

    def run_new():
        words_and_phrases = ['apple', 'cherry pie']
        booleans = [boolean_and_new(terms) for terms in [
            ['sweet pie', 'savoury pie', 'meringue'],
            ['chicken pie', 'beef pie']]]
        regex1 = re.compile(r'(?i)\b(?:%s)\b' % ('|'.join(words_and_phrases)))
        regex2 = re.compile(r'(?i)%s' % ('|'.join(booleans)))
        matched_data = list()
        # data() is the data source from the question; it is not defined here.
        for d in data():
            if regex1.search(d) or regex2.match(d):
                matched_data.append(d)
Effective regular expressions for this dataset are:
    regex1 = r'(?i)\b(?:apple|cherry pie)\b'
    regex2 = r'(?i)(?=.*?\bsweet pie\b)(?=.*?\bsavoury pie\b)(?=.*?\bmeringue\b)|(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)'
Note that the second regex is effectively anchored at the start, as if it began with ^, because it is used with the re.match method. This also includes a couple of extra (minor) tweaks: removing unnecessary capture groups and changing the greedy dot-star to lazy. This solution runs nearly 10 times faster than the original on my Win32 box running Python 3.0.1.
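As a quick sanity check, here is a minimal sketch that runs the two effective regexes over a handful of lines. The sample_lines list is made up for illustration and is not the data set from the question; only the two patterns are taken from above.

```python
import re

# Hypothetical sample lines, invented for this illustration only.
sample_lines = [
    "Apple crumble with custard",
    "A sweet pie and a savoury pie, both topped with meringue",
    "Chicken pie or beef pie?",
    "Beef pie on its own",
    "Pineapple upside-down cake",
]

# OR list: any one term anywhere in the line (used with search).
regex1 = re.compile(r'(?i)\b(?:apple|cherry pie)\b')

# AND lists: every lookahead in one alternative must succeed.
# Used with match, so it is only ever tried at the start of the line.
regex2 = re.compile(
    r'(?i)(?=.*?\bsweet pie\b)(?=.*?\bsavoury pie\b)(?=.*?\bmeringue\b)'
    r'|(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)'
)

matched = [d for d in sample_lines if regex1.search(d) or regex2.match(d)]

for d in matched:
    print(d)
```

The first three lines are kept. "Beef pie on its own" fails because the AND list also needs chicken pie, and "Pineapple upside-down cake" fails because \b keeps apple from matching inside a longer word.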
Addendum: So why is this faster? Let's look at a simple example that describes how an NFA regex "engine" works. (Note that the following description derives from the classic work on the subject: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3), which explains all this efficiency stuff in great detail. Highly recommended.) Suppose you have a target string containing one word, and you want a regex to see if that word is "apple". Here are two regexes one might write to do the job:
    re1 = r'^apple'
    re2 = r'apple'
    s = r'orange'
If your target string is apple (or applesauce or apple-pie, etc.), then both regexes will match very quickly. But if your target string is, say, orange, the situation is different. An NFA regex engine must try all possible permutations of the regex on the target string before it can declare a match failure. The way the regex engine works is that it keeps an internal pointer to its current location within the target string, and a second pointer to a location within the regex pattern, and it advances these pointers as it goes about its business. Note that these pointers point to the locations between the characters, and to begin with, the target string pointer is set to the location before the first letter of the target string.
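To see that the two patterns agree on the outcome for these inputs (the difference is only in how much work is needed to get there), here is a small check. The results dictionary is just a name I picked for this sketch.

```python
import re

re1 = r'^apple'
re2 = r'apple'

# Map each target string to (re1 found a match, re2 found a match).
results = {}
for s in ['apple', 'applesauce', 'apple-pie', 'orange']:
    results[s] = (re.search(re1, s) is not None,
                  re.search(re2, s) is not None)

for s, (found1, found2) in results.items():
    print(f'{s!r}: re1={found1}, re2={found2}')
```

Both regexes match the three apple strings and both fail on orange.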
re1: The first token in the regex is the ^ start-of-string anchor. This "anchor" is one of the special "assertion" expressions that match a location in the target string and do not actually match any characters. (Lookahead, lookbehind and the \b word-boundary expression are also assertions that match a location and do not "consume" any characters.) OK, with the target string pointer initialized to the location before the first letter of the word orange, the regex engine checks whether the ^ anchor matches, and it does (because this location is, in fact, the beginning of the string). So the pattern pointer advances to the next token in the regex, the letter a. (The target string pointer is not advanced.) It then checks whether the regex literal a matches the target string character o. It does not match. At this point, the regex engine is smart enough to know that the regex can never succeed at any other location within the target string (because ^ can never match anywhere but at the start). Thus it can declare a match failure.
re2: In this case the engine starts by checking whether the first pattern char a matches the first target char o. Once again, it does not. However, in this case the regex engine is not done! It has determined that the pattern will not match at the first location, but it must try (and fail) at all locations within the target string before it can declare a match failure. Thus, the engine advances the target string pointer to the next location (Friedl refers to this as the transmission's "bump-along"). For each "bump-along", it resets the pattern pointer back to the beginning. Thus it checks the first token in the pattern, a, against the second char in the string: r. This also does not match, so the transmission bumps along again to the next location within the string. At this point it checks the first char of the pattern, a, against the third char of the target: a, which does match. The engine advances both pointers and checks the second char in the regex, p, against the fourth char in the target, n. This fails. At this point the engine declares failure at the location before the a in orange and then bumps along again to the n. It goes on like this until it has failed at every location in the target string, at which point it can declare an overall match failure.
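The walk-through above can be mimicked with a toy model. This is only a teaching sketch for a plain literal pattern: count_comparisons is a name I made up, and the real re module is not implemented this way (real engines layer further optimizations on top).

```python
def count_comparisons(literal, target, anchored):
    """Toy bump-along loop for a plain literal pattern.

    Returns (matched, comparisons), where comparisons counts the
    character-versus-character tests that were performed.
    """
    comparisons = 0
    # An anchored pattern is only tried at position 0; otherwise the
    # "transmission" bumps along through every start position.
    starts = [0] if anchored else range(len(target) + 1)
    for start in starts:
        pos = start  # target pointer; the pattern pointer restarts each time
        for ch in literal:
            if pos >= len(target):
                break  # ran off the end of the target
            comparisons += 1
            if target[pos] != ch:
                break  # mismatch at this start position
            pos += 1
        else:
            return True, comparisons  # every pattern char matched
    return False, comparisons


print(count_comparisons('apple', 'orange', anchored=True))   # (False, 1)
print(count_comparisons('apple', 'orange', anchored=False))  # (False, 7)
```

In this model the anchored pattern gives up after a single comparison, while the unanchored one needs seven before it can report failure on a six-letter word. The gap grows with the length of the target string.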
For long subject strings, this extra unnecessary work can take a lot of time. Crafting an accurate and efficient regex is equal parts art and science. And to craft a really great regex, you must understand exactly how the engine works under the hood. Gaining this knowledge takes time and effort, but the time spent will (in my experience) pay for itself many times over. And there really is only one good place to learn these skills effectively, and that is to sit down and study Mastering Regular Expressions (3rd Edition) and then practice the techniques learned. I can honestly say that this is, hands down, the most useful book I have ever read. (It's even funny!)
Hope this helps! 8 ^)