RegEx: calculating the number of different permutations

This is a slightly unusual use of RegEx: I want to calculate the number (or report infinity, if appropriate) of different strings that match a particular pattern.

For example, consider [a-zA-Z], which yields 52; [a-zA-Z]{1,2}, which yields 2652 (52 + 52 × 52 − 52 × 2, subtracting 52 × 2 for strings such as aa and MM, which are not permutations); or [a-zA-Z]+, which yields ∞.
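For small patterns, figures like these can be checked by brute force. The sketch below is in Python rather than PHP or Ruby, and `count_full_matches` is a hypothetical helper written for this illustration: it enumerates every string up to a given length over a given alphabet and counts those the pattern matches in full. It counts all distinct matching strings, repeated letters included, so for [a-zA-Z]{1,2} it reports 52 + 52 × 52 = 2756; the 2652 quoted in the question comes from the question's own formula, which subtracts 52 × 2.

```python
import itertools
import re
import string


def count_full_matches(pattern, alphabet, max_len):
    """Count strings of length 0..max_len over `alphabet` that the pattern matches in full."""
    compiled = re.compile(pattern)
    return sum(
        1
        for n in range(max_len + 1)
        for chars in itertools.product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    )


letters = string.ascii_letters  # the 52 characters a-z and A-Z

single = count_full_matches(r"[a-zA-Z]", letters, 2)        # 52
total = count_full_matches(r"[a-zA-Z]{1,2}", letters, 2)    # 52 + 52 * 52 = 2756
formula = 52 + 52 * 52 - 52 * 2                             # 2652, the question's figure
```

Enumeration only works when the pattern is bounded and the alphabet is small; it cannot report ∞ for something like [a-zA-Z]+.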

Of course, I would like this mechanism to deal with more complex regular expressions than this. I am particularly interested in solutions for PHP and Ruby. Is it possible?

+4
2 answers

Regular expressions are used to test a given string against a given pattern. A single regular expression can match many different strings, and a more elaborate expression can potentially match even more.

In my opinion, what you need cannot be done with regular expressions themselves. You could write a program that deconstructs a regular expression and tries to work out how many strings it can match. However, writing such a program is most likely not trivial.

For example, in your case [a-zA-Z] will match not only a single letter from a through z (and its uppercase counterpart) but also any string that contains one of these letters, which is basically any string you can imagine that contains at least one letter.

Adding the ^ and $ anchors reduces the number of matches, but even then you can end up with more than 52, because one can argue that {EmptyString}a{EmptyString} is also matched by ^a$, which significantly increases the number of possible results.
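The difference between an unanchored search and a whole-string match is easy to see in practice. This is a small sketch using Python's standard `re` module (PHP's preg_match and Ruby's =~ also search unanchored by default, but their anchor semantics differ in detail, so treat this as an illustration rather than a statement about those engines). The sample strings are made up for the example.

```python
import re

pattern = re.compile(r"[a-zA-Z]")
samples = ["a", "123a456", "hello world", "42"]

# Unanchored search: succeeds on any string containing at least one letter.
unanchored = [s for s in samples if pattern.search(s)]

# Whole-string match (like anchoring both ends): only a single letter passes.
anchored = [s for s in samples if pattern.fullmatch(s)]

# In Python, `$` also matches just before a trailing newline, so even the
# anchored pattern ^a$ accepts more than one input string.
trailing_newline = bool(re.search(r"^a$", "a\n"))
```

Here `unanchored` keeps three of the four samples, `anchored` keeps only "a", and `trailing_newline` is True, so any count has to state precisely which matching semantics it assumes.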

+3

To achieve this, I think you need a solution that is more complex than the regular expression engine itself. Regular expression engines merely "test" (and "capture", but that added complexity is trivial), whereas your task requires either testing the entire universe of potential inputs (completely impractical, of course) or deducing the number of potential inputs mathematically. Note, though, that to deduce the number of potential inputs you inevitably have to go through more or less the same steps as the regular expression engine, except that each step asks: "How many potential inputs are there for this atom?"

I'm not sure why you need such a counter, but if all you are trying to do is compare the potential inputs of two regular expressions, then I recommend sampling: generate a large set of random strings and count how many of them match each regular expression. (This is going off on a tangent and is very speculative, but since purely random strings are unlikely to show the clustering patterns that natural language does, you may have to generate your samples using fractal methods, à la Mandelbrot.)
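A minimal sketch of the sampling idea, in Python and under several assumptions of my own: strings are drawn uniformly at random (no fractal or natural-language modelling), a "match" means a whole-string match, and the helper name `estimate_match_rate`, the alphabet, the length limit and the two patterns are all made up for the illustration.

```python
import random
import re


def estimate_match_rate(pattern, alphabet, max_len, samples=10_000, seed=0):
    """Estimate the fraction of random strings (length 0..max_len) the pattern fully matches."""
    rng = random.Random(seed)  # fixed seed: every pattern sees the same sample strings
    compiled = re.compile(pattern)
    hits = 0
    for _ in range(samples):
        length = rng.randint(0, max_len)
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        if compiled.fullmatch(candidate):
            hits += 1
    return hits / samples


# Compare two patterns on the same random sample.
narrow = estimate_match_rate(r"[a-c]{1,2}", "abcd12", 4)
wide = estimate_match_rate(r"[a-d]{1,3}", "abcd12", 4)
```

Because both calls use the same seed, they are scored against identical strings, so `wide` can never come out smaller than `narrow` here. The result is only a relative comparison for the chosen alphabet and length limit, not a count of matching strings.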

Now, if you want to follow the path of deductive counting anyway, here are two ideas that can help simplify the problem:

  • If you find * or + (one that is not escaped and not inside a character class), then you know the answer is infinite. The same goes for {M,}. EDIT: Well, unless the quantifier sits in an "impossible" piece of the regular expression, e.g. (.*(?=a)(?=b)), which states that the next character must be both "a" and "b"!

  • You can expand many expressions into alternations, so that whatever your final solution is, it can ignore character classes and quantifiers altogether and focus only on the number of atoms per alternation group (which can then be multiplied together). For example:

    • Character classes, such as [0-9a-f] , can be expanded to 0123456789abcdef , which, in turn, can be expanded to (?:0|1|2|...|d|e|f) .

    • Finite quantifiers such as x? (aka x{0,1}), x{M,N} and x{,N} can be expanded to (?:|x), (?:x|xx|xxx|...), etc.
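The two ideas above can be combined into a small counter. What follows is a rough sketch in Python, not a full regex parser: it assumes whole-string matching and supports only literals, escaped literals, non-negated character classes, plain and (?:...) groups, |, ?, *, + and {M,N}-style quantifiers. All function names are hypothetical. Instead of multiplying counts, it expands each piece into an explicit set of strings, so that ambiguous patterns such as (?:a|aa){1,2} are not counted twice; an unbounded quantifier on a non-empty piece short-circuits to infinity.

```python
import math


def concat(a, b):
    """Concatenate two languages (sets of strings); None stands for an infinite one."""
    if a is None or b is None:
        return None
    return {x + y for x in a for y in b}


def repeat(lang, lo, hi):
    """All concatenations of lo..hi members of lang; hi=None means unbounded."""
    if hi == 0:
        return {""}
    if lang is None or hi is None:
        return {""} if lang == {""} else None
    power = {""}
    for _ in range(lo):
        power = concat(power, lang)
    result = set(power)
    for _ in range(hi - lo):
        power = concat(power, lang)
        result |= power
    return result


def parse_class(p, i):
    """Parse a class body starting just after '['; return (chars, next index)."""
    chars = set()
    while p[i] != "]":
        if i + 2 < len(p) and p[i + 1] == "-" and p[i + 2] != "]":
            chars.update(chr(c) for c in range(ord(p[i]), ord(p[i + 2]) + 1))
            i += 3
        else:
            chars.add(p[i])
            i += 1
    return chars, i + 1


def parse_quantifier(p, i):
    """Return (min, max, next index); max is None for *, + and {M,}."""
    if i < len(p):
        if p[i] == "?":
            return 0, 1, i + 1
        if p[i] == "*":
            return 0, None, i + 1
        if p[i] == "+":
            return 1, None, i + 1
        if p[i] == "{":
            end = p.index("}", i)
            lo, sep, hi = p[i + 1:end].partition(",")
            lo_n = int(lo) if lo else 0
            hi_n = lo_n if not sep else (int(hi) if hi else None)
            return lo_n, hi_n, end + 1
    return 1, 1, i


def parse_sequence(p, i):
    """Parse consecutive atoms up to '|', ')' or the end of the pattern."""
    lang = {""}
    while i < len(p) and p[i] not in "|)":
        if p[i] == "(":
            if p.startswith("(?", i) and not p.startswith("(?:", i):
                raise ValueError("only plain and (?:...) groups are supported")
            i += 3 if p.startswith("(?:", i) else 1
            atom, i = parse_alternation(p, i)
            if i >= len(p):
                raise ValueError("missing ')'")
            i += 1  # skip the closing ')'
        elif p[i] == "[":
            atom, i = parse_class(p, i + 1)
        elif p[i] == "\\":
            atom, i = {p[i + 1]}, i + 2  # escaped literal only, no \d or \w
        elif p[i] in ".^$":
            raise ValueError(f"unsupported metacharacter {p[i]!r}")
        else:
            atom, i = {p[i]}, i + 1
        lo, hi, i = parse_quantifier(p, i)
        lang = concat(lang, repeat(atom, lo, hi))
    return lang, i


def parse_alternation(p, i):
    """Parse branches separated by '|'; the result is the union of the branches."""
    lang, i = parse_sequence(p, i)
    while i < len(p) and p[i] == "|":
        branch, i = parse_sequence(p, i + 1)
        lang = None if lang is None or branch is None else lang | branch
    return lang, i


def count_matches(pattern):
    """Number of distinct strings the whole pattern matches, or math.inf."""
    lang, i = parse_alternation(pattern, 0)
    if i != len(pattern):
        raise ValueError("unbalanced ')'")
    return math.inf if lang is None else len(lang)
```

For instance, `count_matches("[0-9a-f]{2}")` gives 256, `count_matches("x{,3}")` gives 4 (the empty string, x, xx and xxx), `count_matches("(?:a|aa){1,2}")` gives 4 rather than 6, and `count_matches("[a-zA-Z]+")` gives `math.inf`. Materialising the sets is only practical for small patterns; a production version would need a smarter representation, and lookaheads like the "impossible" example are deliberately rejected here.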

Good luck

+2

Source: https://habr.com/ru/post/1434211/

