Complexity of regex matching and replacement

I haven't been able to get a clear answer on this: what is the complexity of matching and replacing a regular expression?

Edit: I am working in Python, but I would like to know in general about the most popular languages/tools (Java, Perl, sed).

+12
complexity-theory regex
Aug 22 '08 at 3:02
7 answers

From a purely theoretical standpoint:

One implementation I am familiar with builds a deterministic finite automaton (DFA) to recognize the regular expression. The construction takes O(2^m) time in the worst case, where m is the size of the regular expression, using the standard subset-construction algorithm. Once the automaton is built, running a string through it is linear in the length of the string: O(n), where n is the string length. A replacement at a match found in the string should then be constant time.

So, overall, I would say O(2^m + n).
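A minimal sketch of the run phase (my own example, not from this answer): a hand-built DFA for the regex (a|b)*abb, stored as a transition table. Producing such a table via the subset construction is the O(2^m) part; the scan below is the O(n) part.

```python
# Hand-built DFA for (a|b)*abb; states track progress toward the suffix "abb".
DFA = {
    0: {'a': 1, 'b': 0},
    1: {'a': 1, 'b': 2},
    2: {'a': 1, 'b': 3},
    3: {'a': 1, 'b': 0},
}
ACCEPTING = {3}

def dfa_match(text: str) -> bool:
    state = 0
    for ch in text:                  # one table lookup per character: O(n)
        state = DFA[state].get(ch)
        if state is None:            # character outside the alphabet
            return False
    return state in ACCEPTING

print(dfa_match("ababb"))  # True
print(dfa_match("abab"))   # False
```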

+14
Aug 22 '08 at 3:15

Some further theoretical information that may be of interest.

For clarity, assume the standard definition of a regular expression:

http://en.wikipedia.org/wiki/Regular_language

from formal language theory. In practice, this means that the only building material is alphabet symbols, the concatenation, alternation, and Kleene star operators, together with the unit and zero constants (which appear for algebraic reasons). It is generally a good idea not to overload the term, despite the everyday practice in scripting languages that leads to ambiguity.

There is an NFA construction that solves the matching problem for a regular expression r and an input text t in O(|r|·|t|) time and O(|r|) space, where |·| denotes length. This algorithm was further improved by Myers

http://doi.acm.org/10.1145/128749.128755

to O(|r|·|t| / log |t|) time and space, using automaton node listings and the Four Russians paradigm. The paradigm seems to be named after four Russian authors who wrote a seminal paper that is not available online. The paradigm is, however, illustrated in these computational biology lecture notes:

http://lyle.smu.edu/~saad/courses/cse8354/lectures/lecture5.pdf

I find it funny that a paradigm is named after the number and nationality of its authors rather than their last names.
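To make the set-of-states idea concrete, here is a hand-built illustration (my own encoding, not from Myers' paper) of the O(|r|·|t|)-time, O(|r|)-space NFA simulation. The NFA recognizes a(b|c)*; Thompson's construction would produce such an automaton from the regex in O(|r|) time.

```python
# NFA for a(b|c)*, as state -> {symbol -> set of successor states};
# '' marks epsilon (empty-string) transitions.
NFA = {
    0: {'a': {1}},
    1: {'': {2, 4}},           # fork: enter the (b|c) loop, or accept
    2: {'b': {3}, 'c': {3}},
    3: {'': {2, 4}},           # loop back or exit
}
START, ACCEPT = 0, 4

def eps_closure(states):
    """All states reachable from `states` via epsilon transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        for nxt in NFA.get(stack.pop(), {}).get('', set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def nfa_match(text: str) -> bool:
    current = eps_closure({START})
    for ch in text:                        # |t| iterations...
        moved = set()
        for state in current:              # ...each touching at most |r| states
            moved |= NFA.get(state, {}).get(ch, set())
        current = eps_closure(moved)
    return ACCEPT in current

print(nfa_match("abcb"))  # True
print(nfa_match("ba"))    # False
```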

Matching regular expressions extended with backreferences is NP-complete, which Aho proved

http://portal.acm.org/citation.cfm?id=114877

by a reduction from the vertex cover problem, a classical NP-complete problem.

To match regular expressions with backreferences deterministically, we could employ backtracking (not unlike the Perl regex engine) to keep track of the possible subwords of the input text t that can be assigned to the variables in r. There are only O(|t|^2) subwords that can be assigned to any one variable in r. If r has n variables, then there are O(|t|^(2n)) possible assignments. Once an assignment of substrings to variables is fixed, the problem reduces to plain regular expression matching. Therefore the worst-case complexity of matching regular expressions with backreferences is O(|t|^(2n)).
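A hedged sketch of this argument for a single variable (my own toy example): consider the pattern scheme "v a v" — some non-empty word v, the letter a, then v again, i.e. one backreference. Enumerating the O(|t|^2) candidate substrings for v reduces each case to backreference-free matching, here a plain substring test.

```python
def match_v_a_v(t: str) -> bool:
    """Does t contain v + 'a' + v for some non-empty word v?"""
    n = len(t)
    for i in range(n):
        for j in range(i + 1, n + 1):   # O(|t|^2) candidate assignments for v
            v = t[i:j]
            if v + 'a' + v in t:        # plain matching once v is fixed
                return True
    return False

print(match_v_a_v("xyaxy"))  # True  (v = "xy")
print(match_v_a_v("xyaxz"))  # False
```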

Note that regular expressions with backreferences still do not exhaust the feature set of practical regular expressions.

Take, for example, the "don't care" symbol, considered apart from any other operators. There are several polynomial-time algorithms that decide whether a set of such patterns matches an input text. For example, Kucherov and Rusinowitch

http://dx.doi.org/10.1007/3-540-60044-2_46

define a pattern as a word w_1@w_2@...@w_n, where each w_i is a word (not a regular expression) and "@" is a variable-length "don't care" symbol not contained in any of the w_i. They derive an O((|t| + |P|) log |P|) algorithm for matching a set of patterns P against an input text t, where |t| is the length of the text and |P| is the total length of the words in P.
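A naive sketch of matching one such pattern (my own, far simpler than the algorithm in the paper), assuming "@" may also match the empty word: greedy leftmost placement of each w_i suffices, since taking the earliest occurrence leaves the longest possible suffix for the remaining words.

```python
def gap_match(words, text):
    """Does text contain w_1, w_2, ..., w_n in order, as disjoint substrings?"""
    pos = 0
    for w in words:
        i = text.find(w, pos)  # earliest occurrence after the previous word
        if i == -1:
            return False
        pos = i + len(w)
    return True

print(gap_match(["ab", "cd"], "xxabyycdzz"))  # True
print(gap_match(["cd", "ab"], "xxabyycdzz"))  # False
```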

It would be interesting to know how these complexity measures combine, and what the complexity of the matching problem is for regular expressions with backreferences, "don't care" symbols, and the other interesting features of practical regular expressions.

Alas, I haven't said a word about Python... :)

+7
May 19 '09 at 15:57

It depends on what you mean by a regular expression. If you allow the concatenation, alternation, and Kleene star operators, the time can be O(m*n + m), where m is the size of the regular expression and n is the length of the string. You do this by constructing an NFA (linear in m) and then simulating it, maintaining the set of states you are in and updating it (in O(m)) for each letter of input.

Things that make regex matching harder:

  • parentheses and backreferences: capturing is still OK with the above algorithm, although the complexity gets higher, so it may become infeasible. Backreferences raise the recognition power of regular expressions and make matching much harder (NP-complete in general, as other answers note)
  • positive look-ahead: this is just another name for intersection, which raises the complexity of the above algorithm to O(m^2 + n)
  • negative look-ahead: a disaster for automaton construction (O(2^m), possibly PSPACE-complete). But it can still be handled with a dynamic programming algorithm in something like O(n^2 * m)

Note that with a concrete implementation, things can get better or worse. As a rule of thumb, simple features should be fast enough, and unambiguous regular expressions (unlike, say, a*a*) are better.
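As a rough, hedged illustration of the ambiguity caveat (my own demo; timings are machine-dependent): Python's re module uses a backtracking engine, and the classic ambiguous pattern (a+)+b has exponentially many ways to split a run of a's between the two quantifiers, all of which are tried on non-matching input.

```python
import re
import time

pattern = re.compile(r'(a+)+b')

for n in range(16, 25, 2):
    text = 'a' * n                 # contains no 'b', so the match must fail
    start = time.perf_counter()
    pattern.match(text)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.4f}s")  # roughly doubles with each extra 'a'
```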

+4
Apr 20 '09 at 0:33

To expand on the top answer: constructing the automaton in O(2^m) is the worst case, and it really depends on the form of the regular expression (for a very simple one that just matches a word, it is O(m), using for example the Knuth-Morris-Pratt algorithm).
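For instance, a minimal KMP sketch (my own illustration): for a plain word, the "automaton" is just the O(m) failure table, and the scan remains O(n).

```python
def kmp_search(pattern: str, text: str) -> int:
    """Index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it -- built in O(m)
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # scan the text in O(n)
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

print(kmp_search("abab", "aabacabab"))  # 5
```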

+2
Aug 26 '08 at 17:06

It depends on the implementation. Which language/library/class? There may be a best case, but it will be very specific to the feature set of the implementation.

+1
Aug 22 '08 at 3:06

You can trade space for speed by building a deterministic finite automaton rather than simulating the NFA directly. The DFA can be traversed in linear time, though in the worst case it may require O(2^m) space. I expect the trade-off is often worth it.

0
04 Sep '08 at 5:06

If you are doing matching and substitution, that implies grouping and backreferences.

Here is a Perl example where grouping and backreferences are used to solve an NP-complete problem: http://perl.plover.com/NPC/NPC-3SAT.html

This (combined with a few other theoretical tidbits) means that matching and substitution with such regular expressions is NP-complete.

Note that this is different from the formal definition of a regular expression, which has no notion of grouping and which matches in polynomial time, as described by the other answers.
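A compact cousin of the 3-SAT construction above (a classic folklore example, not from the linked page): backreferences can test whether a number written in unary is prime, with the backtracking engine effectively searching for a divisor.

```python
import re

def is_prime(n: int) -> bool:
    """Primality via a backreference: the group must capture a divisor of n."""
    if n < 2:
        return False
    # (11+?) captures a candidate divisor d >= 2; \1+ demands that the rest of
    # the string be made of copies of it, i.e. that d properly divides n.
    return re.fullmatch(r'(11+?)\1+', '1' * n) is None

print([p for p in range(2, 30) if is_prime(p)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```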

0
Jun 03 '09 at 0:43


