How to efficiently match a string to a set of wildcards?

Question


I am looking for a way to match a single line against a set of wildcards. For instance:

>>> match("ab", ["a*", "b*", "*", "c", "*b"])
["a*", "*", "*b"]

The order of output does not matter.

I will have about 10^4 wildcards to match against, and I will perform about 10^9 matches. This means I will probably have to restructure my code like this:

>>> matcher = prepare(["a*", "b*", "*", "c", "*b"])
>>> for line in lines:
...     print(matcher.match(line))
["a*", "*", "*b"]
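For reference, a brute-force baseline for this prepare/match interface could look like the sketch below. It assumes shell-style wildcards as understood by the standard fnmatch module (which also treats ? and [...] as special, something the examples above do not use), and the class name WildcardMatcher is made up for illustration. It tests every wildcard against every line, so it only pins down the expected behaviour and is not a solution at the stated scale.

```python
import fnmatch
import re


class WildcardMatcher:
    """Brute-force baseline: one precompiled regex per wildcard."""

    def __init__(self, wildcards):
        # fnmatch.translate turns a shell-style pattern into a regex string
        # that is anchored at the end of the input.
        self._compiled = [
            (w, re.compile(fnmatch.translate(w))) for w in wildcards
        ]

    def match(self, line):
        # Return every wildcard whose pattern matches the whole line.
        return [w for w, rx in self._compiled if rx.match(line)]


def prepare(wildcards):
    # Hypothetical factory mirroring the interface sketched in the question.
    return WildcardMatcher(wildcards)


matcher = prepare(["a*", "b*", "*", "c", "*b"])
print(matcher.match("ab"))  # ['a*', '*', '*b']
```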

I have started writing a trie implementation in Python that handles wildcards, and I just need to get the corner cases right. Even so, I am curious: how would you solve this? Are there any Python libraries that would let me solve this faster?

Some ideas so far:

Named regular expressions (Python, re) will not help me here, because they return only one match.
pyparsing seems like an amazing library, but it is sparsely documented and, as far as I can see, does not support matching multiple patterns.


python

Ztyx Oct 15 '12 at 22:19


2 answers

Ztyx · Answer 1 · 2012-10-15T22:21:44+0000

It looks like the Aho-Corasick algorithm would work. esmre seems to do what I'm looking for. I got this information from this question.
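To make the idea concrete, here is a minimal pure-Python sketch of the Aho-Corasick algorithm itself. It is not the esmre API, and the function names build_automaton and find_literals are made up for illustration. It reports which of a set of non-empty literal strings occur anywhere in a text in a single pass, which is the building block one would apply to the literal parts of the wildcards.

```python
from collections import deque


def build_automaton(literals):
    # Trie stored as parallel lists: child edges, failure links, outputs.
    goto, fail, out = [{}], [0], [set()]
    for lit in literals:
        state = 0
        for ch in lit:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(lit)
    # Breadth-first pass: fill in failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also reports every literal that ends at its failure target.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_literals(automaton, text):
    # Single left-to-right scan; returns the set of literals found in text.
    goto, fail, out = automaton
    found, state = set(), 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


automaton = build_automaton(["a", "b", "c", "ab"])
print(sorted(find_literals(automaton, "ab")))  # ['a', 'ab', 'b']
```

Knowing which literals are present only narrows down the candidate wildcards; position constraints such as "starts with a" still have to be checked afterwards.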

jfs · Answer 2 · 2012-10-22T05:09:49+0000

You could use the FilteredRE2 class from the re2 library together with an Aho-Corasick implementation (or similar). From the re2 docs:

Required substrings. Suppose you have an efficient way to check which of a list of strings appear as substrings in a large text (for example, maybe you implemented the Aho-Corasick algorithm), but now your users want to be able to do regular expression searches efficiently too. Regular expressions often have large literal strings in them; if those could be identified, they could be fed into the string searcher, and then the results of the string searcher could be used to filter the set of regular expression searches that are necessary. The FilteredRE2 class implements this analysis. Given a list of regular expressions, it walks the regular expressions to compute a boolean expression involving literal strings and then returns the list of strings. For example, FilteredRE2 converts (hello|hi)world[a-z]+foo into the boolean expression "(helloworld OR hiworld) AND foo" and returns those three strings. Given multiple regular expressions, FilteredRE2 converts each into a boolean expression and returns all the strings involved. Then, after being told which of the strings are present, FilteredRE2 can evaluate each expression to identify the set of regular expressions that could possibly be present. This filtering can reduce the number of actual regular expression searches significantly.
The feasibility of these analyses depends crucially on the simplicity of their input. The first uses the DFA form, while the second uses the parsed regular expression (Regexp*). These kinds of analyses would be more complicated (maybe even impossible) if RE2 allowed non-regular features in its regular expressions.
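The filtering idea in the quoted passage can be illustrated without re2. The sketch below is not the FilteredRE2 API; the class name PrefilteredMatcher and the helper wildcard_to_regex are made up, and it assumes * is the only special character in a wildcard. Each wildcard is indexed by its longest literal piece, and the full regular expression is run only for wildcards whose piece occurs in the line. The plain substring test stands in for a single Aho-Corasick pass over the line.

```python
import re
from collections import defaultdict


def wildcard_to_regex(wildcard):
    # Assumption: '*' is the only special character; everything else is literal.
    pattern = ".*".join(re.escape(piece) for piece in wildcard.split("*"))
    return re.compile(pattern, re.DOTALL)


class PrefilteredMatcher:
    def __init__(self, wildcards):
        self._regex = {w: wildcard_to_regex(w) for w in wildcards}
        self._by_literal = defaultdict(list)  # literal -> wildcards requiring it
        self._unfiltered = []                 # wildcards with no literal part
        for w in self._regex:
            pieces = [p for p in w.split("*") if p]
            if pieces:
                # Every piece must occur in a matching line, so indexing by
                # the longest one never filters out a true match.
                self._by_literal[max(pieces, key=len)].append(w)
            else:
                self._unfiltered.append(w)

    def match(self, line):
        candidates = list(self._unfiltered)
        for literal, wildcards in self._by_literal.items():
            if literal in line:  # stand-in for one Aho-Corasick scan
                candidates.extend(wildcards)
        # Only the surviving candidates pay for a full regex match.
        return [w for w in candidates if self._regex[w].fullmatch(line)]


matcher = PrefilteredMatcher(["a*", "b*", "*", "c", "*b"])
print(sorted(matcher.match("ab")))  # ['*', '*b', 'a*']
```

With around 10^4 wildcards, the loop over literals is where an Aho-Corasick automaton would pay off, since it finds all present literals in time proportional to the length of the line rather than the number of wildcards.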
