Match string to pattern

At some point in my application, I need to match some lines against a pattern. Let's say some of the example lines look like this:

  • Hi there, John.
  • What a lovely day today!
  • Lovely sunset today, John, isn't it?
  • Will you be meeting Linda today, John?

Most (not all) of these lines are built from predefined patterns:

  • "Hi there, %s."
  • "What a lovely day today!"
  • "Lovely sunset today, %s, isn't it?"
  • "Will you be meeting %s today, %s?"

This template library is constantly expanding (currently around 1,500 entries), but it is maintained manually. The input lines (the first group), though, are pretty much unpredictable. Although most of them will match one of the patterns, some of them will not.

So, here is my question: given a string (from the first group) as input, I need to know which of the patterns (the known second group) it matches. If nothing matches, I need to be told that.

I guess the solution involves building a regular expression from each pattern and matching iteratively. However, I'm not sure what the code for building these regular expressions looks like.

Note: the lines shown here are for illustration only. In reality the strings are not written by humans; they are human-like strings, as shown above, generated by computer systems that I do not control. Since they are not typed in manually, we don't need to worry about typos and other human errors. We just need to find which template each line matches.

Note 2: I could change the template library to a different format if that makes it easier to build the regular expressions. The current printf-style %s structure is not set in stone.

+6
source share
6 answers

My first thought was to let the regexp engine handle this. Regexp engines are generally optimized to work through large amounts of text, so this should not be a performance problem. It is brute force, but the performance seems acceptable. You could also split the input into chunks and handle them in several processes. Here is my moderately tested solution (in Python).

    import random
    import re
    import string


    def create_random_sentence():
        """Build a random lowercase sentence to use as a filler pattern."""
        nwords = random.randint(4, 10)
        sentence = []
        for i in range(nwords):
            sentence.append("".join(random.choice(string.ascii_lowercase)
                                    for x in range(random.randint(3, 10))))
        ret = " ".join(sentence)
        print(ret)
        return ret


    # The four real patterns; the literal . and ? are escaped.
    patterns = [
        r"Hi there, [a-zA-Z]+\.",
        r"What a lovely day today!",
        r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
        r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]

    # Pad the list with 95 random sentences to stress performance.
    for i in range(95):
        patterns.append(create_random_sentence())

    # One alternation with one capturing group per pattern.
    monster_pattern = "|".join("(%s)" % x for x in patterns)
    print(monster_pattern)
    print("--------------")
    monster_regexp = re.compile(monster_pattern)

    inputs = ["Hi there, John.",
              "What a lovely day today!",
              "Lovely sunset today, John, isn't it?",
              "Will you be meeting Linda today, John?",
              "Goobledigoock"] * 2000

    # Print "." for every line that matches and "x" for every line that does not.
    for i in inputs:
        ret = monster_regexp.search(i)
        if ret:
            print(".", end=" ")
        else:
            print("x", end=" ")

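I created about a hundred patterns (4 real ones plus 95 random ones). That is the maximum the Python regexp library I used allows; Python versions before 3.5 cap a pattern at about 100 groups. 4 of them are your actual examples, and the rest are random sentences to stress performance a bit.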

Then I combined them into one regex with one group per pattern: (group1)|(group2)|(group3)|... I assume you will have to escape the patterns for characters that have a special meaning in regular expressions (e.g. ?, etc.). This is monster_regexp.

Testing one line against it checks all the patterns in one shot. There are methods that extract the exact group that matched. I am testing 10,000 lines, 80% of which should match and 20% of which should not. The alternation short-circuits, so a success is relatively fast. A failure has to run through the entire regular expression, so it is slower. You can order the patterns by input frequency to get better performance.

I ran this on my machine, and this is the timing I got.

python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total

which is not so bad.

However, a line that is run against such a large regex and fails takes longer, so I changed the input to a lot of randomly generated lines that would not match and tried again. With 10,000 lines, none of which match monster_regexp, I got this.

python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
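
The "methods that extract the exact group" mentioned above can be sketched roughly as follows. This is a minimal illustration rather than part of the timed script: the three patterns and the helper name which_pattern are assumptions made for the example, and it uses fullmatch instead of search so that a line has to match a pattern from start to end.

```python
import re

# Hypothetical, already-escaped patterns; each becomes one top-level group.
patterns = [
    r"Hi there, [a-zA-Z]+\.",
    r"What a lovely day today!",
    r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?",
]
combined = re.compile("|".join("(%s)" % p for p in patterns))


def which_pattern(line):
    """Return the 0-based index of the alternative that matched, or None."""
    m = combined.fullmatch(line)
    # lastindex is the number of the last group that took part in the match.
    # With exactly one non-nested group per alternative, that is the winner.
    return m.lastindex - 1 if m else None


print(which_pattern("Hi there, John."))
print(which_pattern("Goobledigoock"))
```

This only works as long as the individual patterns contain no capturing groups of their own; if they did, the group numbers would no longer line up with the pattern list.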

+1
source

I see this as a parsing problem. The idea is that the parser function takes a string and determines whether it is valid or not.

The string is valid if you can find it among the given patterns. This means that you need an index of all the templates. The index must be a full-text index. It must also match according to word position, e.g. it should short-circuit if the first input word is not found among the first words of the patterns. It must also handle the wildcard, i.e. %s in the template.

One solution is to put the templates in an in-memory database (for example, redis) and build a full-text index on it. This will not match by word position, but you can narrow things down to the correct template by splitting the input into words and searching for them. The searches will be very fast because you have a small database in memory. Also note that you are looking for the closest match: one or more words will not match. The template with the most matching words is the one you want.
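
As a rough sketch of this narrowing step, here is a plain Python dictionary standing in for the in-memory database. The helper names words and best_candidate are made up for the illustration, and the result is only a candidate that still has to be checked exactly against the line.

```python
from collections import Counter, defaultdict

templates = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]


def words(text):
    """Lower-cased words with surrounding punctuation stripped; %s is dropped."""
    cleaned = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in cleaned if w and w != "%s"]


# Inverted index: word -> set of indices of the templates that contain it.
index = defaultdict(set)
for i, template in enumerate(templates):
    for w in words(template):
        index[w].add(i)


def best_candidate(line):
    """Return the template sharing the most words with the line, or None."""
    votes = Counter()
    for w in set(words(line)):
        for i in index.get(w, ()):
            votes[i] += 1
    if not votes:
        return None
    return templates[votes.most_common(1)[0][0]]


print(best_candidate("Will you be meeting Linda today, John?"))
```

Because word positions are ignored, a line that merely shares a few words with a template still gets a candidate, which is why a position-aware check is needed afterwards.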

An even better solution is to create your own index in dictionary format. Here is an example index for the four templates you specified as a JavaScript object.

    {
      "Hi": {"there": {"%s": null}},
      "What": {"a": {"lovely": {"day": {"today": null}}}},
      "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
      "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
    }

This index is descended recursively by word position. So look up the first word; if it is found, look up the next word in the object returned by the first lookup, and so on. Identical words at a given level share a single key. You must also handle the wildcard case. This should be blindingly fast in memory.
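
A minimal Python sketch of such an index, under a few assumptions of mine: punctuation is ignored, a %s wildcard stands for exactly one word, and the end of a template is marked with a None key that stores the template text. The names tokenize, build_index and lookup are hypothetical.

```python
import re

WILDCARD = "%s"  # placeholder token used in the templates
END = None       # key that marks "a template ends at this node"


def tokenize(text):
    """Split into words (apostrophes kept) and %s tokens; drop punctuation."""
    return re.findall(r"%s|[\w']+", text)


def build_index(templates):
    """Nest one dictionary level per word position."""
    root = {}
    for template in templates:
        node = root
        for token in tokenize(template):
            node = node.setdefault(token, {})
        node[END] = template
    return root


def lookup(node, tokens, pos=0):
    """Walk the index word by word; return the matching template or None."""
    if pos == len(tokens):
        return node.get(END)
    # Try the literal word first, then fall back to the wildcard branch.
    for key in (tokens[pos], WILDCARD):
        child = node.get(key)
        if child is not None:
            found = lookup(child, tokens, pos + 1)
            if found is not None:
                return found
    return None


index = build_index([
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
])
print(lookup(index, tokenize("Lovely sunset today, John, isn't it?")))
```

A lookup touches at most one or two keys per input word, so its cost depends on the length of the line rather than on the number of templates.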

+3
source

Similar to Noufal's solution, but returns a matching pattern or None.

    import re

    patterns = [
        "Hi there, %s.",
        "What a lovely day today!",
        "Lovely sunset today, %s, isn't it?",
        "Will you be meeting %s today, %s?"
    ]


    def make_re_pattern(pattern):
        # Characters like . ? etc. have special meaning in regular expressions.
        # Escape the string to avoid interpreting them differently.
        # Depending on the Python version, re.escape may escape % as well,
        # so swap %s for the placeholder XXX before escaping.
        p = re.escape(pattern.replace("%s", "XXX"))
        return p.replace("XXX", r"\w+")


    # Join all the patterns into a single regular expression.
    # Each pattern is enclosed in () to remember the match.
    # This will help us to find the matched pattern.
    rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))


    def match(s):
        """Given an input string, return the matched pattern or None."""
        m = rx.match(s)
        if m:
            # Find the index of the matching group.
            index = next(i for i, group in enumerate(m.groups())
                         if group is not None)
            return patterns[index]


    # Testing with a couple of patterns
    print(match("Hi there, John."))
    print(match("Will you be meeting Linda today, John?"))
+1
source

Python solution. JS should be similar.

    >>> import re2
    >>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
    False
    >>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
    False
    >>> re2.compile('^ABC(.*)E$').search('ABX') == None
    True
    >>>

The trick is to use ^ and $ to anchor your pattern and make it a "template". Use (.*) or (.+) or whatever it is you want to "search" for.

The main bottleneck for you, imho, will be iterating through the list of these patterns. Regex searches are computationally expensive.

If you want a "does any pattern match" result, build one massive OR-based regular expression and let your regex engine handle the OR'ing for you.

Also, if you only have prefix patterns, check out the trie data structure.

0
source

This could be a job for sscanf. There is an implementation in JS: http://phpjs.org/functions/sscanf/ ; the function it copies is: http://php.net/manual/en/function.sscanf.php .

You should be able to use it without changing the prepared strings, but I have doubts about the performance.

0
source

I don't understand the problem. Do you want to take the patterns and build regular expressions from them? Most regex engines have a "quoted string" option (\Q ... \E). So you can take a string and turn it into ^\QHi, \E(?:.*)\Q.\E$ and these will be regular expressions that match exactly the string you want outside of your variables.
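
Python's re module has no \Q ... \E quoting, but re.escape plays the same role. Here is a small sketch of the idea in Python, assuming %s is the only placeholder in a template; the function name template_to_regex is made up for the example.

```python
import re


def template_to_regex(template, placeholder="%s"):
    """Quote the literal parts of a template and capture each placeholder."""
    literals = template.split(placeholder)
    # re.escape quotes every literal piece, like \Q ... \E would;
    # each placeholder becomes a lazy capturing group.
    body = "(.+?)".join(re.escape(part) for part in literals)
    return re.compile("^" + body + "$")


rx = template_to_regex("Will you be meeting %s today, %s?")
m = rx.match("Will you be meeting Linda today, John?")
print(m.groups() if m else None)
```

The capturing groups also hand back the values that filled the placeholders, which a plain yes/no match does not.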

If you want to use one regular expression to match just one pattern, you can put them in grouped patterns to find out which one matched, but this will not give you EVERY match, only the first.

If you use a proper parser (I have used PEG.js), it may be more convenient. So that is another option if you think you might get stuck with regular expressions.

0
source

Source: https://habr.com/ru/post/944415/

