What is the fastest way to compare text against lots of regular expressions in Python?

I have a list of regular expressions, and I would like to match tweets as they are created so that I can associate them with a specific account. With a small number of rules, like the ones below, it runs very fast, but as soon as the number of rules grows it gets slower and slower.

    import string, re2, datetime, time

    rules = [
        [[1], ["(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
        [[2], ["(?!.*brasil).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
    ]

    # cache the compiled patterns up front
    compilled_rules = []
    for rule in rules:
        compilled_rules.append([[rule[0][0]], [re2.compile(rule[1][0])]])

    def get_rules(text):
        new_tweet = string.lower(text)
        for rule in compilled_rules:
            ok = 1
            if not re2.search(rule[1][0], new_tweet):
                ok = 0
            print ok

    def test():
        t0 = datetime.datetime.now()
        i = 0
        time.sleep(1)
        while i < 1000000:
            get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil")
            i += 1
        t1 = datetime.datetime.now() - t0
        print "test"
        print i
        print t1
        print i / t1.seconds

When I tested with 550 rules, I could not get more than 50 reqs/s. Is there a better way to do this? I need at least 200 reqs/s.

EDIT: after Jonathan's hints I was able to improve the speed about 5x, but I had to tweak my rules a bit. See the code below:

    import re, string, datetime, time, cProfile

    scope_rules = {
        "1": {
            "termo 1": "^(?!.*brasil)(?=.*petrobras).*",
            "termo 2": "^(?!.*petrobras)(?=.*ipiranga).*",
            "termo 3": "^(?!.*petrobras)(?=.*ipiranga).*",
            "termo 4": "^(?!.*petrobras)(?=.*ipiranga).*",
        },
        "2": {
            "termo 1": "^(?!.*ipiranga)(?=.*petrobras).*",
            "termo 2": "^(?!.*petrobras)(?=.*ipiranga).*",
            "termo 3": "^(?!.*brasil)(?=.*ipiranga).*",
            "termo 4": "^(?!.*petrobras)(?=.*ipiranga).*",
        },
    }

    compilled_rules = {}
    for scope, rules in scope_rules.iteritems():
        compilled_rules[scope] = {}
        for term, rule in rules.iteritems():
            compilled_rules[scope][term] = re.compile(rule)

    def get_rules(text):
        new_tweet = string.lower(text)
        for scope, rules in compilled_rules.iteritems():
            ok = 1
            for term, rule in rules.iteritems():
                if ok == 1:
                    if re.search(rule, new_tweet):
                        ok = 0
                        print "found in scope " + scope + " term: " + term

    def test():
        t0 = datetime.datetime.now()
        i = 0
        time.sleep(1)
        while i < 1000000:
            get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto ipiranga da lagoa")
            i += 1
        t1 = datetime.datetime.now() - t0
        print "test"
        print i
        print t1
        print i / t1.seconds

    cProfile.run('test()', 'testproof')
4 answers

Going even further, I created a Cython extension to evaluate the rules, and now it is blazing fast. I can do about 70 requests per second with on the order of 3000 regular expressions.

regex.pyx

    import re2
    import string as pystring

    cpdef list match_rules(char *pytext, dict compilled_rules):
        cdef int ok, scope, term
        cdef list response = []
        text = pystring.lower(pytext)
        for scope, rules in compilled_rules.iteritems():
            ok = 1
            for term, rule in rules.iteritems():
                if ok == 1:
                    if re2.search(rule, text):
                        ok = 0
                        response.append([scope, term])
        return response

python code

    import re2 as re
    import datetime, time, cProfile
    import regex as mregex  # the compiled Cython module built from regex.pyx

    scope_rules = {
        1: {
            1: "^(?!.*brasil)(?=.*petrobras).*",
            2: "^(?!.*petrobras)(?=.*ipiranga).*",
            3: "^(?!.*petrobras)(?=.*ipiranga).*",
            4: "^(?!.*petrobras)(?=.*ipiranga).*",
        },
        2: {
            1: "^(?!.*brasil)(?=.*petrobras).*",
            2: "^(?!.*petrobras)(?=.*ipiranga).*",
            3: "^(?!.*petrobras)(?=.*ipiranga).*",
            4: "^(?!.*petrobras)(?=.*ipiranga).*",
        },
    }

    compilled_rules = {}
    for scope, rules in scope_rules.iteritems():
        compilled_rules[scope] = {}
        for term, rule in rules.iteritems():
            compilled_rules[scope][term] = re.compile(rule)

    def test():
        t0 = datetime.datetime.now()
        i = 0
        time.sleep(1)
        while i < 1000000:
            mregex.match_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil", compilled_rules)
            i += 1
        t1 = datetime.datetime.now() - t0
        print t1
        print i / t1.seconds

    cProfile.run('test()', 'testproof')
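(Not shown above: the .pyx file has to be compiled before it can be imported. A minimal, hypothetical build script, assuming Cython is installed; the extension module's name comes from the filename regex.pyx:)

```python
# setup.py - build regex.pyx into an importable extension module.
# Run with: python setup.py build_ext --inplace
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("regex.pyx"))
```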

Your rules seem to be the culprit here: because of the two .* separated by lookaheads, a very large number of permutations has to be checked to find a match (or to rule one out). This is further compounded by using re.search() without anchors. Also, the alternation that includes the posto part is redundant - the regex matches whether or not posto is in your string, so you can drop it entirely.

For example, your first rule can be rewritten as

 ^(?!.*ipiranga)(?=.*petrobras) 

without any change in the results. You can also optimize it with word boundaries if you are looking for exact words:

 ^(?!.*\bipiranga\b)(?=.*\bpetrobras\b) 
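A quick runnable check of the word-boundary version (the sample strings here are made up for illustration, and the standard re module stands in for re2):

```python
import re

pattern = re.compile(r"^(?!.*\bipiranga\b)(?=.*\bpetrobras\b)")

# "petrobras" appears as a whole word and "ipiranga" does not: match
assert pattern.search("fui ao posto petrobras hoje")

# a longer word that merely contains "petrobras" no longer matches
assert not pattern.search("acoes da petrobrasileira")

# any whole-word "ipiranga" excludes the text
assert not pattern.search("petrobras e ipiranga")
```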

Some measurements (using RegexBuddy):

Your first regex, applied to the string Acabei de ir no posto petrobras. Moro pertinho do posto brasil, takes the regex engine about 4,700 steps to figure out that it matches. If I remove the s from petrobras, it needs more than 100,000 steps to determine a non-match.

Mine matches in 230 steps (and fails in 260), so you get a 20-400x speedup just from constructing the regex correctly.
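The same effect can be seen with a rough, runnable sketch against Python's standard re engine (absolute numbers will differ from RegexBuddy's step counts, but both patterns agree on the result and the simplified one does far less backtracking):

```python
import re
import timeit

text = "Acabei de ir no posto petrobras. Moro pertinho do posto brasil"

original = re.compile(r"(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*")
simplified = re.compile(r"^(?!.*ipiranga)(?=.*petrobras)")

# both patterns classify the text identically
assert bool(original.search(text)) == bool(simplified.search(text))

print("original:   %.3fs" % timeit.timeit(lambda: original.search(text), number=100000))
print("simplified: %.3fs" % timeit.timeit(lambda: simplified.search(text), number=100000))
```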


In addition to optimizing your regex patterns themselves (which will make the biggest difference), you could try Google's RE2 - it is supposed to be faster than Python's standard regular expression module.

It is written in C++, but there is PyRE2, a Python wrapper for RE2, by Facebook :)
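Assuming the wrapper is installed, switching can be as small as the import, since it mimics the re interface; a sketch with a fallback to the standard library (note that RE2 itself does not support lookaheads, so patterns like the ones above may be handed back to the fallback engine by the wrapper):

```python
try:
    import re2 as re  # pyre2, if it is installed
except ImportError:
    import re  # otherwise the standard library, same interface

rule = re.compile(r"\bpetrobras\b")
print(bool(rule.search("acabei de ir no posto petrobras")))  # True
```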

P.S. Thanks to your question I found a great read about regex matching!


In addition to @Tim Pietzcker's recommendations:

If you have many rules, it may make sense to try a multi-level approach based on smaller rules, grouped by commonality.

For example, both of your rules above match on "posto" and "petrobras". If you group those regexes into a list together and gate dispatch to that list with a qualifier, you can avoid running a large number of rules that could never apply.

In pseudocode...

    # a dict of qualifier rules, each with a list of sub-rules
    rules = {
        "+ petrobras": [
            "- ipiranga",
            "- brasil",
            "? posto",
        ],
    }

    for rule_qualifier in rules:
        if regex(rule_qualifier, text):
            for rule in rules[rule_qualifier]:
                if regex(rule, text):
                    yay
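A minimal runnable version of this idea (the names and qualifier strings are illustrative, and the standard re module stands in for re2): a cheap substring test gates each group, so the expensive regexes only run on texts that could possibly match.

```python
import re

# each cheap literal qualifier gates the expensive rules behind it
QUALIFIERS = {
    "petrobras": [
        ("scope 1", re.compile(r"^(?!.*ipiranga)(?=.*petrobras)")),
        ("scope 2", re.compile(r"^(?!.*brasil)(?=.*petrobras)")),
    ],
}

def match_rules(text):
    text = text.lower()
    hits = []
    for qualifier, rules in QUALIFIERS.items():
        if qualifier not in text:   # cheap substring test first
            continue
        for name, rule in rules:    # full regexes only when qualified
            if rule.search(text):
                hits.append(name)
    return hits

print(match_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil"))
# -> ['scope 1']
```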

Source: https://habr.com/ru/post/1437793/
