Is there a data structure that provides pattern matching (regular expressions)?

I have run into this situation several times: there are several patterns that a piece of text might match, and you want to do something specific depending on which pattern it is.

In the past, I have always used a list of regular expressions and iterated over it until a match was found.

I am wondering if there is a more efficient data structure for this. For example, in C#, something like a dictionary with Regex keys.

I understand that if the patterns are all prefixes or suffixes, then something like a trie would make sense. It is not clear to me that this works in the general case.

It also seems to me that there may be some ambiguity around key clashes: for example, if some text matches multiple patterns, what should be returned? (I would think a non-deterministic result would be acceptable here; as long as the behavior is documented, I am fine with it.)

Anyway, does such a data structure exist, in .NET or elsewhere?

+4
3 answers

Suppose these regular expressions are really regular. Then each of them can be transformed into a nondeterministic finite automaton (NFA), which can be converted into a deterministic finite automaton (DFA), which can be evaluated in O(n) time in the length of the input.

But this does not yet address matching multiple regular expressions at once. We can do that by creating one regex of the form (regexp1|regexp2|...) and turning it into a single NFA/DFA. Add some bookkeeping to the branches of the machine to track which particular regular expression produced the path that matched the input, and you have your lookup, still O(n) in the length of the input string.

This method will not support any "regular expression" features that make the language non-regular, such as backreferences.

Another disadvantage is that the resulting DFA can be large. It is also possible to evaluate the NFA directly, which is probably slower but has better memory behavior.


In fact, it is quite easy to express this idea in code without worrying about the machinery yourself. Just use capture groups:

combined_regexp = (regexp1)|(regexp2)|...

During evaluation, just check which group matched the input.
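Here is a minimal Python sketch of this idea (the question mentions C#, but the same approach works there; the pattern names and example patterns below are purely illustrative):

```python
import re

# Illustrative patterns; each alternative gets its own named group.
patterns = {
    "number": r"\d+",
    "word": r"[a-z]+",
    "punct": r"[.,;!?]",
}

# Combine into one alternation: (?P<number>\d+)|(?P<word>[a-z]+)|...
combined = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in patterns.items()))

def classify(text):
    """Return the name of the pattern matching the whole text, or None."""
    m = combined.fullmatch(text)
    return m.lastgroup if m else None

print(classify("123"))    # -> "number"
print(classify("hello"))  # -> "word"
```

Note that this also resolves the ambiguity question in a documented way: with an alternation, the first alternative that matches wins, so putting the patterns in priority order gives deterministic results.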

Keep in mind that most regex implementations/libraries have pretty bad worst-case behavior: in some cases they can take exponential time to compile or match a regex. I am not sure how much of a problem this is in practice. Google's RE2 library was specifically designed to avoid this pathological behavior, but there may be others.

Another problem is that if your regex implementation does not specifically advertise O(n) behavior, it may simply try each alternative in turn. In that case, this approach buys you nothing.

+3

The fgrep tool does exactly what you are describing: it matches text against multiple patterns. As I understand it, the original version uses something very similar to the Aho-Corasick string matching algorithm to search for multiple patterns in a single pass. Essentially, it builds a DFA and walks through it.

I do not know of a .NET implementation of fgrep. If you find one, I would be interested to hear about it.

You could dig through the fgrep source code (Google for it; there are many copies available) and see how it is implemented.

Alternatively, you could have your program shell out to fgrep. Or perhaps build a C++ DLL with an fgrep entry point that you could call from your C# program.

If your multiple patterns are constant strings (i.e., not regular expressions), then you might be interested in a C# implementation of the Aho-Corasick algorithm.
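To make the algorithm concrete, here is a compact Aho-Corasick sketch in Python (a C# version would follow the same structure; all names here are my own):

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: a trie plus failure links and output sets."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link per state
    out = [set()]    # patterns recognised at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]   # inherit matches ending here via shorter suffixes
    return goto, fail, out

def search(text, automaton):
    """Yield (end_index, pattern) for every occurrence of any pattern in text."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield (i, pat)
```

For example, searching "ushers" for the patterns he, she, his, hers finds "she" and "he" ending at index 3 and "hers" ending at index 5, all in one left-to-right pass over the text.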

+4

A few years ago, I implemented a regular expression search system based on a deterministic finite automaton, very similar to what Thomas describes in his answer. It compiles a list of regular-expression keys and their associated values into a single state machine that references objects of a given type in its terminal states (for example, a RegexTrie of strings references strings in its terminal states).

The implementation is available here: https://bitbucket.org/tjnieminen/regexkeytrie

A standard search is performed by maintaining a list of active paths through the machine. Each active path is advanced for each character of the search text, and a match is recorded whenever a terminal state is reached. A new path starting at the root of the machine is added for each character of the source text (this allows matches to begin at any position, i.e., substring matching), and paths that get stuck in a non-terminal state are removed from the list.
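The scanning loop described above can be sketched as follows. This Python sketch uses a plain trie of literal patterns rather than a compiled regex machine, just to show the path bookkeeping; the structure and names are illustrative, not taken from the linked implementation:

```python
def build_trie(patterns):
    """A trie where each node records its children and an optional matched pattern."""
    root = {"children": {}, "match": None}
    for pat in patterns:
        node = root
        for ch in pat:
            node = node["children"].setdefault(ch, {"children": {}, "match": None})
        node["match"] = pat
    return root

def scan(text, root):
    """Return (end_index, pattern) for every substring of text matching a pattern."""
    active = []        # currently alive paths, represented by their trie nodes
    results = []
    for i, ch in enumerate(text):
        active.append(root)          # a new path may start at every position
        next_active = []
        for node in active:
            child = node["children"].get(ch)
            if child is None:
                continue             # path stuck in a non-terminal state: drop it
            if child["match"] is not None:
                results.append((i, child["match"]))   # reached a terminal state
            next_active.append(child)
        active = next_active
    return results
```

Unlike Aho-Corasick, this naive multi-start scan can keep O(m) paths alive at once for m patterns, so it is O(n·m) in the worst case; the advantage is that the same bookkeeping generalizes from a trie to a regex-derived state machine.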

As a rule, it is best to tailor the processing to the task at hand. For example, if you use the machine for regular expression replacement, it may be better to build the edited text during the traversal (like a transducer) instead of performing separate search-and-replace operations on the returned matches.

I benchmarked the implementation against regular .NET regular expressions, and it seems to perform better in the scenarios where it should, i.e., matching a large set of regular expressions against a long search text.

The implementation has not been thoroughly tested, so there may be bugs, and it is probably quite easy to run out of memory with complex regular expressions (or they may take forever to compile). But since nothing similar is available at the moment, it may be a useful starting point for anyone looking for the performance characteristics it provides.

0

Source: https://habr.com/ru/post/1495826/
