Improve regex performance

My software lets users prepare files with regular expressions. I am adding a default regex library of generic expressions that can be reused across various formats. One common task is to remove CRLF line breaks in certain parts of a file but not in others. For example, this:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence
    that should not contain
    any line break.
    </SOURCE>

It should become:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence that should not contain any line break.
    </SOURCE>

I have a regex that does the job pretty nicely:

(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

The problem is that it is processing-intensive: on files above 500 KB it can take 30 seconds or more. (That is with the regex compiled; uncompiled, it is much slower.)

This is not a big problem, but I wonder if there is a better way to achieve the same results with Regex.

Thanks in advance for your suggestions.

+3

Try this:

\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)

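It matches \r\n first, then uses a lookahead to check that the match lies between <SOURCE> and </SOURCE>. It does this by scanning ahead for </SOURCE>; if it reaches a <SOURCE> first, the match fails.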
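As a rough illustration of the lookahead approach, here is a minimal Python sketch run on the sample from the question. The thread itself is about .NET, so the port makes some assumptions: the atomic groups `(?>...)` are written as plain non-capturing groups so the pattern also runs on Python versions before 3.11 (for this pattern that should only affect backtracking, not which line breaks match), and the names `CRLF_IN_SOURCE`, `sample` and `result` are made up for the example.

```python
import re

# Port of the lookahead pattern above.
# Assumption: (?>...) atomic groups are replaced by (?:...) groups so that the
# pattern compiles on Python < 3.11 as well.
CRLF_IN_SOURCE = re.compile(
    r"\r\n(?=[^<>]*(?:<(?!/?SOURCE>)[^<>]*)*</SOURCE>)"
)

# The sample from the question, with explicit CRLF line endings.
sample = (
    "<TU>Lorem \r\nIpsum</TU>\r\n"
    "<SOURCE>This is a sentence\r\nthat should not contain\r\n"
    "any line break.\r\n</SOURCE>\r\n"
)

# Replace every CRLF that is followed by </SOURCE> (with no tag in between)
# by a single space; CRLFs outside the block are left alone.
result = CRLF_IN_SOURCE.sub(" ", sample)
print(repr(result))
```

Because `[^<>]` excludes `>`, the lookahead as written does not pass over complete tags nested inside the block, so line breaks that precede such a tag are left in place.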

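+2

Regex engines come in two kinds: NFA-based (backtracking) and DFA-based. A DFA engine scans the input once, while a backtracking NFA engine may revisit the same text many times. Perl Compatible Regular Expressions and the engines modeled on it are of the NFA kind, so patterns that combine quantifiers with lookarounds can become very slow on large inputs.

For the details, see Russ Cox's articles comparing PCRE-style matching with automaton-based matching.

Also, regular expressions are not the right tool for processing (X|HT)?ML.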

+2

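As far as I can tell, your regex evaluates the lookbehind at every position in the file, not only where there is a \r\n. The lookbehind (the expensive part) should only run once a line break has been found, so check for that first: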

(?=\r\n)(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

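I checked this in RegexBuddy using its .NET flavor. The conditional also looks unnecessary; a plain lookbehind placed after the line break should be enough: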

\r\n(?<=<SOURCE>(?:(?!</?SOURCE>).)*)

Would that not work as well? Either way, RegexOptions.Singleline has to be set so that . also matches line breaks.

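That said, every line break still forces the engine to scan back to <SOURCE>, so on large blocks the lookbehind stays expensive. It may be better to split the work: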

  • Match each whole <SOURCE> block
  • Replace all CRLFs inside the block (no regular expression required)
  • Replace the original <SOURCE> block with the result
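The three steps above can be sketched as follows. This is a Python illustration rather than the .NET code the thread is about; `SOURCE_BLOCK` and `join_lines_in_source` are hypothetical names, and replacing each CRLF with a single space is an assumption taken from the expected output in the question.

```python
import re

# Step 1: match each whole <SOURCE>...</SOURCE> block (non-greedy, so several
# blocks in one file are handled separately). DOTALL lets . cross line breaks.
SOURCE_BLOCK = re.compile(r"<SOURCE>.*?</SOURCE>", re.DOTALL)


def join_lines_in_source(text: str) -> str:
    """Replace CRLFs with spaces inside <SOURCE> blocks only."""
    # Steps 2 and 3: a plain string replace on the matched block, whose
    # result is substituted back in place of the original block.
    return SOURCE_BLOCK.sub(
        lambda m: m.group(0).replace("\r\n", " "), text
    )


sample = (
    "<TU>Lorem \r\nIpsum</TU>\r\n"
    "<SOURCE>This is a sentence\r\nthat should not contain\r\n"
    "any line break.\r\n</SOURCE>\r\n"
)
print(repr(join_lines_in_source(sample)))
```

Each block is scanned once by the regex and once by the string replace, so no lookaround has to run at every line break.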
+2

Source: https://habr.com/ru/post/1753503/

