I have a list of regular expressions, and I would like to match tweets against them as they are created, so I can associate each tweet with a specific account. With a small number of rules, like the ones below, it is very fast, but as soon as I increase the number of rules, it gets slower and slower.
    import datetime
    import time

    import re2

    rules = [
        [[1], ["(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
        [[2], ["(?!.*brasil).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
    ]

    # Cache the compiled patterns so each rule is compiled only once.
    compilled_rules = []
    for rule in rules:
        compilled_rules.append([[rule[0][0]], [re2.compile(rule[1][0])]])

    def get_rules(text):
        new_tweet = text.lower()
        for rule in compilled_rules:
            ok = 1
            # rule[1][0] holds the compiled pattern for this rule.
            if not rule[1][0].search(new_tweet):
                ok = 0
            print(ok)

    def test():
        t0 = datetime.datetime.now()
        i = 0
        time.sleep(1)
        while i < 1000000:
            get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil")
            i += 1
        t1 = datetime.datetime.now() - t0
        print("test")
        print(i)
        print(t1)
        # Throughput in calls per second.
        print(i / t1.total_seconds())
When I tested with 550 rules, I could not get more than 50 req/s. Is there a better way to do this? I need at least 200 req/s.
EDIT: after the tips from Jonathan, I improved the speed about 5 times, but I had to change my rules a bit. See the code below:
    import cProfile
    import datetime
    import re
    import time

    scope_rules = {
        "1": {
            "termo 1": "^(?!.*brasil)(?=.*petrobras).*",
            "termo 2": "^(?!.*petrobras)(?=.*ipiranga).*",
            "termo 3": "^(?!.*petrobras)(?=.*ipiranga).*",
            "termo 4": "^(?!.*petrobras)(?=.*ipiranga).*",
        },
        "2": {
            "termo 1": "^(?!.*ipiranga)(?=.*petrobras).*",
            "termo 2": "^(?!.*petrobras)(?=.*ipiranga).*",
            "termo 3": "^(?!.*brasil)(?=.*ipiranga).*",
            "termo 4": "^(?!.*petrobras)(?=.*ipiranga).*",
        },
    }

    # Compile every rule once, grouped by scope.
    compilled_rules = {}
    for scope, rules in scope_rules.items():
        compilled_rules[scope] = {}
        for term, rule in rules.items():
            compilled_rules[scope][term] = re.compile(rule)

    def get_rules(text):
        new_tweet = text.lower()
        for scope, rules in compilled_rules.items():
            for term, rule in rules.items():
                if rule.search(new_tweet):
                    print("found in scope" + scope + " term:" + term)
                    # Stop at the first matching term in this scope.
                    break

    def test():
        t0 = datetime.datetime.now()
        i = 0
        time.sleep(1)
        while i < 1000000:
            get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto ipiranga da lagoa")
            i += 1
        t1 = datetime.datetime.now() - t0
        print("test")
        print(i)
        print(t1)
        # Throughput in calls per second.
        print(i / t1.total_seconds())

    cProfile.run('test()', 'testproof')
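To make it clearer what the rewritten rules actually check, here is a minimal sketch built around one of them ("termo 1" of scope "1"). It compares the anchored lookahead pattern with a plain substring test that I believe gives the same answer for single-line tweets. The helper `matches_plain`, its `required`/`forbidden` parameters, and the `samples` list are names made up for this illustration, and I have not benchmarked either variant.

```python
import re

# Rule copied from scope "1", "termo 1" above:
# the text must contain "petrobras" and must not contain "brasil".
RULE = re.compile(r"^(?!.*brasil)(?=.*petrobras).*")


def matches_rule(text):
    """Return True if the lowercased text satisfies the lookahead rule."""
    return RULE.search(text.lower()) is not None


def matches_plain(text, required="petrobras", forbidden="brasil"):
    """Hypothetical helper: the same check written as substring tests.

    Assumption: the text is a single line. The regex's `.` does not cross
    newlines, so the two checks can differ on multi-line input.
    """
    lowered = text.lower()
    return required in lowered and forbidden not in lowered


# Illustrative inputs based on the test strings used in the post.
samples = [
    "Acabei de ir no posto petrobras.",
    "Acabei de ir no posto petrobras. Moro pertinho do posto brasil",
    "Moro pertinho do posto ipiranga da lagoa",
]

results = [(matches_rule(s), matches_plain(s)) for s in samples]
for sample, (by_regex, by_substring) in zip(samples, results):
    print(by_regex, by_substring, sample)
```

Only the first sample passes: the second contains the forbidden word "brasil", and the third never mentions "petrobras".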