How to find the "minimal spanning set" for a set of regular expressions?

CONTEXT:

I have a small (currently less than 100) but growing collection of regular expressions, and I want to optimize the definition process for a given text string, which from RE in my collection corresponds to a text string.

Some of the REs are related to ordering - for example, if I know that the string $ t matches / windows / i, then I also know that $ t matches /windows.*2000/i. Therefore, when testing $ t against RE in my collection, I can skip testing / windows / i if I already tested $ t against /windows.*2000/i and found a match (although if /windows.*2000/i doesn’t match, Of course, I can not miss the test against / windows / i).

Please note that none of the REs in my collection is completely equivalent (for any RE pair, there is at least one text string that matches one and does not match the other).

STRATEGY:

I want to build a directed graph G with a node for each RE in my collection and a directed edge for each pair of REs with an ordering relation (A → B means that matching A implies matching B)) and find the “minimum spanning set” of nodes for the graph (the minimum set of nodes S is such that each node in G lies on a directional path that is taken in S).

EASY PART:

There are many freely available algorithms for working with Directed acyclic graphs. Therefore, as soon as the graph G is constructed for my collection of RE (which should be obvious that G is acyclic), I do not expect that I will have difficulty finding the right algorithm to find the minimal set for G.

WHERE I need help:

I would like to find an effective way to find all the ordering relationships between REs in my collection and possibly also ensure that no two REs in the collection are equivalent (I will need to automatically check this as new REs are added).

My (essentially random) Internet searches have thus confirmed at least one plausible claim that a reasonable way to figure out which (if any) ordering relationship exists between two REs does exist, but has not yet found any descriptions full algorithm.

Does anyone know about an existing implementation (for comparison, RE), which is quite effective, freely accessible and (ideally) implemented either in one of the popular scripting languages ​​or in C / C ++?

+6
source share
1 answer

I'm not sure if you have the flexibility in terms of the regex library that you should use, but you can look at the RE2 Set interface can match multiple regexes at the same time. Please note that RE2 mainly uses the DFA approach and does not support all of the regular expression functions that are performed in other, mostly reverse tracking.

+2
source

Source: https://habr.com/ru/post/887135/


All Articles