What is the fastest way in Python to check whether a string matches any of a list of words, phrases, or logical ANDs of terms?

I am trying to find a quick way in Python to check whether any of a list of terms matches strings of between 50 and 50,000 characters.

The term may be:

  • A word, for example 'apple'
  • A phrase, for example 'cherry pie'
  • A logical AND of words and phrases, for example 'sweet pie AND savoury pie AND meringue'

A match means the word or phrase is found at word boundaries, hence:

    match(term='apple', string='An apple a day.')                 # True
    match(term='berry pie', string='A delicious berry pie.')      # True
    match(term='berry pie', string='A delicious blueberry pie.')  # False
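A minimal sketch of these matching semantics (the `match` helper is illustrative, not the poster's actual function), using `\b` word boundaries and a case-insensitive search:

```python
import re

def match(term, string):
    # Wrap the term in \b word-boundary anchors so that 'berry pie'
    # matches only as whole words, case-insensitively.
    return re.search(r'(?i)\b%s\b' % re.escape(term), string) is not None

print(match('apple', 'An apple a day.'))                 # True
print(match('berry pie', 'A delicious berry pie.'))      # True
print(match('berry pie', 'A delicious blueberry pie.'))  # False
```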

I currently have about 40 terms, most of them simple words. The number of terms will increase over time, but I would not expect it to go beyond 400.

I am not interested in which term matches the string, or where in the string it matches; I just need a true/false answer for each string. It is much more likely that no term will match a given string (roughly 1 match in 500 strings), so when one does match I can save the string for further processing.

Speed is the most important criterion, and I would rather reuse existing code written by people smarter than me than try to implement something from an academic paper myself. :)

So far, the fastest solution I came up with is:

    import re

    def data():
        return [
            "The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family (Rosaceae).",
            "This resulted in early armies adopting the style of hunter-foraging.",
            "Beef pie fillings are popular in Australia. Chicken pie fillings are too.",
        ]

    def boolean_and(terms):
        return '(%s)' % (''.join(['(?=.*\\b%s\\b)' % (term) for term in terms]))

    def run():
        words_and_phrases = ['apple', 'cherry pie']
        booleans = [boolean_and(terms) for terms in [
            ['sweet pie', 'savoury pie', 'meringue'],
            ['chicken pie', 'beef pie']]]
        regex = re.compile(r'(?i)(\b(%s)\b|%s)' % (
            '|'.join(words_and_phrases), '|'.join(booleans)))
        matched_data = list()
        for d in data():
            if regex.search(d):
                matched_data.append(d)

The resulting regular expression looks like this:

 (?i)(\b(apple|cherry pie)\b|((?=.*\bsweet pie\b)(?=.*\bsavoury pie\b)(?=.*\bmeringue\b))|((?=.*\bchicken pie\b)(?=.*\bbeef pie\b))) 

So all the terms are OR'ed together, case is ignored, words and phrases are wrapped in \b word boundaries, and the logical ANDs use lookaheads so that all of their terms must match, but not in any particular order.

Timing results:

    print timeit.Timer('run()', 'from __main__ import run').timeit(number=10000)
    1.41534304619

Without the lookaheads (i.e. without the logical ANDs) it is really fast, but as soon as they are added the speed drops significantly.

Does anyone have any ideas on how this can be improved? Is there a way to optimize the lookaheads, or perhaps a completely different approach? I do not think that plain substring matching would work, since it is usually a little too greedy about what it matches.

2 answers

A logical-AND regular expression with multiple lookahead assertions can be sped up significantly by anchoring it to the beginning of the string. Or, better still, use two regular expressions: one for the OR'ed list of terms, applied with the re.search method, and a second regex for the logical AND'ed list, applied with the re.match method, like so:

    import re

    def boolean_and_new(terms):
        return ''.join([r'(?=.*?\b%s\b)' % (term) for term in terms])

    def run_new():
        words_and_phrases = ['apple', 'cherry pie']
        booleans = [boolean_and_new(terms) for terms in [
            ['sweet pie', 'savoury pie', 'meringue'],
            ['chicken pie', 'beef pie']]]
        regex1 = re.compile(r'(?i)\b(?:%s)\b' % ('|'.join(words_and_phrases)))
        regex2 = re.compile(r'(?i)%s' % ('|'.join(booleans)))
        matched_data = list()
        for d in data():
            if regex1.search(d) or regex2.match(d):
                matched_data.append(d)

Effective regular expressions for this dataset are:

    regex1 = r'(?i)\b(?:apple|cherry pie)\b'
    regex2 = r'(?i)(?=.*?\bsweet pie\b)(?=.*?\bsavoury pie\b)(?=.*?\bmeringue\b)|(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)'

Note that the second regular expression is effectively anchored to the beginning of the string, because it is used with the re.match method, which only ever attempts a match at the start. It also includes a couple of additional (minor) tweaks: unnecessary capture groups are removed, and the greedy dot-star is changed to a lazy one. This solution runs almost 10 times faster than the original on my Win32 box running Python 3.0.1.
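A small sketch of that anchoring in practice, using the regex2 from above (behavior only; the sample strings are made up for illustration):

```python
import re

regex2 = re.compile(
    r'(?i)(?=.*?\bsweet pie\b)(?=.*?\bsavoury pie\b)(?=.*?\bmeringue\b)'
    r'|(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)')

# A long string that satisfies only two of the three AND'ed terms.
text = 'A review that mentions sweet pie and meringue but nothing else. ' * 50

# match() attempts the lookaheads exactly once, at position 0; search()
# would retry them at every position in the string before giving up.
print(regex2.match(text))  # None: 'savoury pie' is missing
print(bool(regex2.match('savoury pie, sweet pie and meringue')))  # True
```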

Addendum: So why is it faster? Let's look at a simple example that shows how an NFA regular expression "engine" works. (Note that the following description derives from the classic work on the topic, Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3), which explains all of this performance material in detail; highly recommended.) Suppose you have a target string containing a single word, and you want a regex to test whether that word is "apple". Here are two regexes that could do the job:

    re1 = r'^apple'
    re2 = r'apple'
    s = r'orange'

If your target string is apple (or applesauce or apple-pie, etc.), then both regexes succeed very quickly. But if the target string is orange, the situation is different. An NFA regex engine must try all possible permutations of the regex against the target string before it can report a match failure. The way the engine works is that it keeps an internal pointer to its current location in the target string, and a second pointer to a location within the regex pattern, and it advances these pointers as the match proceeds. Note that these pointers point at the positions between characters, and to begin with, the target string pointer is set to the position before the first letter of the target string.

re1: The first token in the regex is the ^ start-of-string anchor. This "anchor" is one of the special "assertion" expressions, which match a location in the target string and do not actually consume any characters. (Lookaheads, lookbehinds, and the \b word-boundary expression are also assertions that match a location and don't "consume" any characters.) So, with the target string pointer initialized to the location before the first letter of the word orange, the regex engine checks whether the ^ anchor matches, and it does (because that location is, in fact, the start of the string). The pattern pointer then advances to the next token in the regex, the literal a (the target string pointer does not advance). Next it checks whether the regex literal a matches the target string character o. It does not. At this point, the regex engine is smart enough to know that the regex can never succeed anywhere else in the target string (since ^ can never match anywhere but the start). So it can declare a match failure immediately.

re2: In this case, the engine starts by checking whether the first pattern character a matches the first target character o. Again, it does not. Here, however, the regex engine cannot give up yet! It has determined that the pattern will not match at the first position, but it must try (and fail) at every position in the target string before it can declare the match a failure. So the engine advances the target string pointer to the next position (Friedl calls this the "bump-along") and, for each bump-along, resets the pattern pointer back to the start of the pattern. Thus it checks the first pattern token a against the second character in the string, r. That doesn't match either, so the engine bumps along to the next position in the string. Now it checks the first pattern character a against the third target character, a, which matches. The engine advances both pointers and checks the second regex character p against the fourth target character, n. This fails. At that point the engine declares failure at the position before the a in orange, bumps along to the position before the n, and tries again. This continues until an attempt has been made at every position in the target string, after which it can declare overall match failure.
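The bump-along behavior described above can be modeled with a toy matcher (an illustrative sketch, not how the C-level engine is actually implemented):

```python
def naive_search(pattern, text):
    # Toy model of the NFA "bump-along": try the literal pattern at
    # every start position until it fits or positions run out.
    attempts = 0
    for start in range(len(text) - len(pattern) + 1):
        attempts += 1
        if text[start:start + len(pattern)] == pattern:
            return start, attempts
    return -1, attempts

def naive_match(pattern, text):
    # The ^-anchored equivalent: exactly one attempt, at position 0.
    return text[:len(pattern)] == pattern

print(naive_search('apple', 'orange'))  # (-1, 2): tried positions 0 and 1
print(naive_match('apple', 'orange'))   # False, after a single attempt
```

For a 50,000-character string the unanchored version makes tens of thousands of attempts before failing, which is the work the anchored two-regex solution avoids.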

For long target strings, this extra unnecessary work can take a lot of time. Crafting accurate and efficient regexes is equal parts art and science, and to write a really great regex you need to understand exactly how the engine works under the hood. Acquiring that knowledge takes time and effort, but the time spent will (in my experience) pay for itself many times over. And there is really only one good place to learn these skills effectively, namely to sit down and study Mastering Regular Expressions (3rd Edition), then practice the techniques it teaches. I can honestly say that it is, hands down, the most useful book I have ever read. (It's even fun!)

Hope this helps! 8^)


I'm going to give a partial answer here, but why not split both the terms and the strings to be matched at word boundaries and build sets? Set lookups are fast, and only when the set test finds a candidate do you need to run the expensive regex test.
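One way to realize this suggestion (a sketch under assumed names; the word list is taken from the question's terms, and the splitting strategy is illustrative):

```python
import re

# Every single word that appears in any term. If none of these words
# occur in a string, no term can possibly match, so the regex is skipped.
term_words = {'apple', 'cherry', 'pie', 'sweet', 'savoury',
              'meringue', 'chicken', 'beef'}

def might_match(string):
    # Split the string at word boundaries and test for any overlap
    # with the term words; isdisjoint() short-circuits on first hit.
    words = set(re.findall(r'\w+', string.lower()))
    return not words.isdisjoint(term_words)

print(might_match('An apple a day.'))       # True  -> run the full regex
print(might_match('Nothing to see here.'))  # False -> skip the regex
```

Note this prefilter can yield false positives (e.g. 'cherry' alone, without 'pie'), which is why the full regex still runs on the candidates; it never yields false negatives for whole-word terms.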


Source: https://habr.com/ru/post/1345179/

