Why regex "[^ <] * <\\?" exponential time when the text does not have a "<"?

Using the ICU 4.0 regex library, I find that the following expression looks exponential:

actual: "[^<]*<\?"
C code: "[^<]*<\\?"

Purpose: Find "<?" where there is no other "<" in front of him

When running this regular expression on plain text without "<" characters at all, it seems to take exponential time. If the text has at least one "<" then it is fast. I do not understand why.

Not necessarily matching "<?" prevent this from having a refund? I would think that he will try to find the first "<" and then check the rest of the expression. If he cannot find the "<" then he will refuse, because the pattern obviously cannot match.

Is this an ICU regex error or is it expected?

+2
source share
6 answers

You will find the explanation in Regular expression matching can be simple and quick . As MizardX said, if the match fails at position 0, the engine will try again at position 1, 2, etc. If the text is long, be prepared to wait quite a while ...

The solution is to commit your expression: "^[^<]*<\?"

+6

. Java :

String regex = "[^<]*+<\\?";

:

String regex = "(?>[^<]*)<\\?";

, [^<]* , . , . Java PHP .NET ; .

+2

. <? . , O (n 2).

+1

, . RegularExpressions.info

:
, , , .

. .
, .
, .
, , .

, ABC "ABC", , "BC", , "C" . , , "[^ <]" , <?

+1

Perl re dump

. .

Perl . , .

perl -Mre=debug -e' "abcdefghijklm" =~ /[^<]*<[?]/; '

Compiling REx "[^<]*<[?]"
Final program:
   1: STAR (13)
   2:   ANYOF[\0-;=-\377{unicode_all}] (0)
  13: EXACT <<?> (17)
  17: END (0)
floating "<?" at 0..2147483647 (checking floating) minlen 2 
Guessing start of match in sv for REx "[^<]*<[?]" against "abcdefghijklm"
Did not find floating substr "<?"...
Match rejected by optimizer
Freeing REx: "[^<]*<[?]"

, - , , , .

perl -Mre=debug -e' "ab<?" =~ /[^<]*(?!<)<[?]/; '

Compiling REx "[^<]*(?!<)<[?]"
Final program:
   1: STAR (13)
   2:   ANYOF[\0-;=-\377{unicode_all}] (0)
  13: UNLESSM[0] (19)
  15:   EXACT <<> (17)
  17:   SUCCEED (0)
  18: TAIL (19)
  19: EXACT <<?> (23)
  23: END (0)
floating "<?" at 0..2147483647 (checking floating) minlen 2 
Guessing start of match in sv for REx "[^<]*(?!<)<[?]" against "ab<?"
Found floating substr "<?" at offset 2...
Guessed: match at offset 0
Matching REx "[^<]*(?!<)<[?]" against "ab<?"

# Start at first pos()
#      |
#      V
   0 <> <ab<?>               |  1:STAR(13)
                                  ANYOF[\0-;=-\377{unicode_all}] can match 2 times out of 2147483647...
   2 <ab> <<?>               | 13:  UNLESSM[0](19)
   2 <ab> <<?>               | 15:    EXACT <<>(17)
   3 <ab<> <?>               | 17:    SUCCEED(0)
                                      subpattern success...
                                    failed...
# try with one fewer [^<]*
   1 <a> <b<?>               | 13:  UNLESSM[0](19)
   1 <a> <b<?>               | 15:    EXACT <<>(17)
                                      failed...
# try with one fewer [^<]* again
   1 <a> <b<?>               | 19:  EXACT <<?>(23)
                                    failed...
# try with zero [^<]*
   0 <> <ab<?>               | 13:  UNLESSM[0](19)
   0 <> <ab<?>               | 15:    EXACT <<>(17)
                                      failed...
   0 <> <ab<?>               | 19:  EXACT <<?>(23)
                                    failed...
                                  failed...

# Start at second pos()
#       |
#       V
   1 <a> <b<?>               |  1:STAR(13)
                                  ANYOF[\0-;=-\377{unicode_all}] can match 1 times out of 2147483647...
   2 <ab> <<?>               | 13:  UNLESSM[0](19)
   2 <ab> <<?>               | 15:    EXACT <<>(17)
   3 <ab<> <?>               | 17:    SUCCEED(0)
                                      subpattern success...
                                    failed...
   1 <a> <b<?>               | 13:  UNLESSM[0](19)
   1 <a> <b<?>               | 15:    EXACT <<>(17)
                                      failed...
   1 <a> <b<?>               | 19:  EXACT <<?>(23)
                                    failed...
                                  failed...

# Start at third and final pos()
#        |
#        V
   2 <ab> <<?>               |  1:STAR(13)
                                  ANYOF[\0-;=-\377{unicode_all}] can match 0 times out of 2147483647...
   2 <ab> <<?>               | 13:  UNLESSM[0](19)
   2 <ab> <<?>               | 15:    EXACT <<>(17)
   3 <ab<> <?>               | 17:    SUCCEED(0)
                                      subpattern success...
                                    failed...
                                  failed...
Match failed
Freeing REx: "[^<]*(?!<)<[?]"

, '[^<]*', , . , , , '<?'.

, .


^ - BOL ( ) .

perl -Mre=debug -e' "abcdefghijklm<?" =~ /^[^<]*+(?!<)<[?]/; '

Compiling REx "^[^<]*+(?!<)<[?]"
Final program:
   1: BOL (2)
   2: SUSPEND (18)
   4:   STAR (16)
   5:     ANYOF[\0-;=-\377{unicode_all}] (0)
  16:   SUCCEED (0)
  17: TAIL (18)
  18: UNLESSM[0] (24)
  20:   EXACT <<> (22)
  22:   SUCCEED (0)
  23: TAIL (24)
  24: EXACT <<?> (28)
  28: END (0)
floating "<?" at 0..2147483647 (checking floating) anchored(BOL) minlen 2 
Guessing start of match in sv for REx "^[^<]*+(?!<)<[?]" against "abcdefghijklm<?"
Found floating substr "<?" at offset 13...
Guessed: match at offset 0
Matching REx "^[^<]*+(?!<)<[?]" against "abcdefghijklm<?"
   0 <> <abcdefghij>         |  1:BOL(2)
   0 <> <abcdefghij>         |  2:SUSPEND(18)
   0 <> <abcdefghij>         |  4:  STAR(16)
                                    ANYOF[\0-;=-\377{unicode_all}] can match 13 times out of 2147483647...
  13 <defghijklm> <<?>       | 16:    SUCCEED(0)
                                      subpattern success...
  13 <defghijklm> <<?>       | 18:UNLESSM[0](24)
  13 <defghijklm> <<?>       | 20:  EXACT <<>(22)
  14 <defghijklm<> <?>       | 22:  SUCCEED(0)
                                    subpattern success...
                                  failed...
Match failed
Freeing REx: "^[^<]*+(?!<)<[?]"

, , .

0

, , , (?) . , s, , '<'. [^<]* , : s[0] s[n-1] (s c- , ). ( '<'). [^<]* s[0] s[n-2], '<' . , 0 0 ( * , ). , , 0 , , s[1], . 2 .., . .

Edit: Your regular expression will essentially match any part of the line that ends with <?and does not contain the other <, for example, in <<?it matches <?and in ba<abc<?def, it matches abc<?. Some of the proposed suggestions will behave differently. In particular, ^[^<]*<\?it will not match either of these two examples.

0
source

Source: https://habr.com/ru/post/1715321/


All Articles