My first thought was to let the regexp engine handle this. Regexp engines are generally optimized to process large amounts of text, so this should not be a performance problem. It is brute force, but the performance seems acceptable. You can also split the input into chunks and process them in several processes. Here is my moderately tested solution (in Python).
    import random
    import re
    import string


    def create_random_sentence():
        """Build a filler pattern of 4 to 10 random lowercase words."""
        nwords = random.randint(4, 10)
        sentence = []
        for _ in range(nwords):
            length = random.randint(3, 10)
            sentence.append("".join(random.choice(string.ascii_lowercase)
                                    for _ in range(length)))
        ret = " ".join(sentence)
        print(ret)
        return ret


    # The four real patterns; the literal "." and "?" are escaped.
    patterns = [
        r"Hi there, [a-zA-Z]+\.",
        r"What a lovely day today!",
        r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
        r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?",
    ]

    # Pad the list with 95 random filler patterns.
    for i in range(95):
        patterns.append(create_random_sentence())

    # One big alternation with one capturing group per pattern.
    monster_pattern = "|".join("(%s)" % x for x in patterns)
    print(monster_pattern)
    print("--------------")

    monster_regexp = re.compile(monster_pattern)

    # 5 lines x 2000 = 10,000 inputs; the last line never matches.
    inputs = ["Hi there, John.",
              "What a lovely day today!",
              "Lovely sunset today, John, isn't it?",
              "Will you be meeting Linda today, John?",
              "Goobledigoock"] * 2000

    for i in inputs:
        ret = monster_regexp.search(i)
        if ret:
            print(".", end=" ")
        else:
            print("x", end=" ")
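I created about a hundred patterns (4 + 95 in the code above). That was the maximum number of groups the Python regexp library would accept when I wrote this; the limit was lifted in Python 3.5. Four of them are your actual examples, and the rest are random sentences added to stress performance a bit.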
Then I combined them into a single regex with one group per pattern: (group1)|(group2)|(group3)|... I assume you will have to escape anything in your input that has a special meaning in regular expressions (e.g. ?, ., etc.). The result is monster_regexp.
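The escaping step could look roughly like the sketch below. It assumes the sentences are stored as plain-text templates with a `{name}` placeholder; that marker and the helper `template_to_regex` are made up for illustration and are not part of the question. `re.escape` is the standard-library function that escapes regex metacharacters.

```python
import re

# Hypothetical placeholder marker and name pattern (assumptions for this sketch).
PLACEHOLDER = "{name}"
NAME_RE = "[a-zA-Z]+"


def template_to_regex(template):
    """Escape the literal text and put a name pattern where each placeholder was."""
    parts = template.split(PLACEHOLDER)
    return NAME_RE.join(re.escape(part) for part in parts)


templates = [
    "Hi there, {name}.",
    "What a lovely day today!",
    "Lovely sunset today, {name}, isn't it?",
    "Will you be meeting {name} today, {name}?",
]

# Same construction as above: one capturing group per pattern.
monster_pattern = "|".join("(%s)" % template_to_regex(t) for t in templates)
monster_regexp = re.compile(monster_pattern)

print(bool(monster_regexp.fullmatch("Lovely sunset today, John, isn't it?")))  # True
print(bool(monster_regexp.fullmatch("Lovely sunset today, John, isn't i")))    # False
```

With the literal parts escaped, the trailing `?` and `.` have to appear in the input instead of acting as regex operators.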
Testing one line against it checks all the patterns in one shot. The match object has methods that tell you exactly which group matched. I am testing 10,000 lines, 80% of which should match and 20% of which should not. The alternation short-circuits, so a successful match is relatively fast. A failure has to go through the entire regular expression, so it is slower. You can order the patterns by how often they occur in the input to get better performance.
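As a rough sketch of both ideas: `lastindex` on the match object gives the number of the group that matched, and a tally of those numbers over a sample of input can be used to put the most common patterns first. The helper names (`compile_monster`, `matched_index`) and the sample data are invented here, and this relies on the patterns themselves containing no nested capturing groups.

```python
import re
from collections import Counter

patterns = [
    r"Hi there, [a-zA-Z]+\.",
    r"What a lovely day today!",
    r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
    r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?",
]


def compile_monster(pats):
    """Join the patterns into one alternation, one capturing group each."""
    return re.compile("|".join("(%s)" % p for p in pats))


def matched_index(regexp, line):
    """Return the 0-based index of the pattern that matched, or None."""
    m = regexp.search(line)
    # Only one top-level group takes part in a match, so lastindex is its number.
    return m.lastindex - 1 if m else None


# Made-up sample of typical input, used to estimate pattern frequency.
sample = ["What a lovely day today!"] * 3 + ["Hi there, John."] + ["Goobledigoock"]

regexp = compile_monster(patterns)
hits = Counter(matched_index(regexp, line) for line in sample)
hits.pop(None, None)  # ignore lines that matched nothing

# Most frequently matched patterns first; ties keep their original order.
order = sorted(range(len(patterns)), key=lambda i: -hits[i])
reordered = [patterns[i] for i in order]
fast_regexp = compile_monster(reordered)

print(reordered[0])
```

Whether the reordering pays off depends on how skewed the real input is, so it is worth timing it on your own data.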
I ran this on my machine, and this is the timing I got.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not so bad.
However, a line that fails to match has to be tried against the whole of such a large regex, which takes longer. So I changed the input to a lot of randomly generated lines that do not match and tried again. With 10,000 lines, none of which match monster_regexp, I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
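If the all-miss case is too slow, the idea from the start of this answer (split the input and use several processes) might be sketched like this. It uses `multiprocessing.Pool` from the standard library; the function names, the chunking scheme and the default of 4 workers are assumptions for illustration, and I have not benchmarked it.

```python
import re
from multiprocessing import Pool

# A shortened stand-in for the monster regex built above.
patterns = [
    r"Hi there, [a-zA-Z]+\.",
    r"What a lovely day today!",
]
monster_regexp = re.compile("|".join("(%s)" % p for p in patterns))


def split_chunks(lines, n):
    """Split lines into at most n chunks of nearly equal size."""
    size = max(1, -(-len(lines) // n))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_matches(chunk):
    """Count the lines in one chunk that match the monster regex."""
    return sum(1 for line in chunk if monster_regexp.search(line))


def parallel_count(lines, workers=4):
    """Count matching lines, using one worker process per chunk."""
    chunks = split_chunks(lines, workers)
    with Pool(processes=workers) as pool:
        return sum(pool.map(count_matches, chunks))
```

Call `parallel_count(lines)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. For inputs as small as the 10,000 lines here, the cost of starting processes may well outweigh the gain.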