Regular Expression Optimization

I have a regex that is the output of a computer program. It has things like

(((2)|(9)))* 

which a person will undoubtedly write as

 [29]* 

So, I need a program that can do simple conversions that make the regex more understandable. So far I have used a quick script

 $r =~ s/\(([0-9])\)/$1/g; $r =~ s/\(([0-9])\|([0-9])\)/[$1$2]/g; $r =~ s/\(([0-9]|\[[0-9]+\])\)\*/$1*/g; $r =~ s/\((\[[0-9]+\]\*)\)/$1/g; $r =~ s/\|\(([^()]+)\)\|/|$1|/g; 

which reduce the length, but the result still contains pieces, such as

 (ba*b)|ba*c|ca*b|ca*c 

which should be simplified to

 [bc]a*[bc] 

I searched CPAN and found Regexp :: List, Regexp :: Assemble and Regexp :: Optimizer. The first two do not apply, and the third has problems. Firstly, it will not pass the tests, so I cannot use it if I do not force install Regexp::Optimizer in cpan. Secondly, even when I do this, it pinches the expression.


Note. I added this [regular language] in addition to [regex] because regexp only uses the concatenation, interleaving, and Kleene star, so it is actually correct.

+6
source share
1 answer

I feel that there may be a way to do this by converting a regular expression to a grammar, putting the grammar in Chomsky’s normal form, combining common non-terminals, and looking for patterns using some heurstic comparison. You can even get more succinct answers if you don’t put it in the “real” CNF ... I would leave lambdas / epsilon inside.

  ba*b|ba*c|ca*b|ca*c S -> bAb | bAc | cAb | cAc A -> aA | lambda S -> BAB | BAC | CAB | CAC A -> AA | a | lambda B -> b C -> c S -> DB | DC | EB | EC A -> AA | a | lambda B -> b C -> c D -> BA E -> CA 

At this point, you may find a heuristic that recognizes

  S -> (D+E)(B+C) 

Backsubstituting,

  S -> (BA|CA)(b|c) -> (ba*|ca*)(b|c) 

Repeat this for subexpressions like

  S' -> bA' | cA' A' -> aA' | lambda S' -> B'A' | C'A' A' -> A'A' | a | lambda B' -> b C' -> c 

Now, given that S → (B | C) (A), we can get

  S' -> (B'|C')(A') -> (b|c)(a*) 

For final decision

  S -> ((b|c)a*)(b|c) 

Then you can just look for extra parentheses to remove (noting that the union is associative, and this will optimize things into a concatenative normal form, just drop all the parentheses that don't just enclose the list of options | -delimited ... so the above becomes

  (b|c)a*(b|c) 

The trick comes with heuristics, and this may not do all the optimizations that are possible. I do not know how this will work. However, this might be something to consider.

+3
source

Source: https://habr.com/ru/post/895437/


All Articles