More efficient regex or alternative?

I have a file with a little over a million lines.

 {<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceVolume> "693702"^^<xsd:long>}
 {<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceId> <uri::20fb8f7d-30ef-dd11-a78d-001f29e570a8>}

Each line is a statement.

struct Statement
{
    public string C;
    public string S;
    public string P;
    public string O;
    public string T;
}

I am currently using TextReader in a while loop and parsing each line with a regex:

Regex lineParse = new Regex(@"[^<|\""]*\w[^>\""]*", RegexOptions.Singleline | RegexOptions.Compiled);

This parsing takes quite a lot of time, and I hope someone can point me to a more efficient parsing strategy.

Some lines have 5 matches and some have 4. Here's how each line is parsed:

{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceVolume> "693702"^^<xsd:long>}

Statement()
    C = uri::rdfserver#null
    S = uri::d41d8cd98f00b204e9800998ecf8427e
    P = uri::TickerDailyPriceVolume
    O = 693702
    T = xsd:long

{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceId> <uri::20fb8f7d-30ef-dd11-a78d-001f29e570a8>}

Statement()
    C = uri::rdfserver#null
    S = uri::d41d8cd98f00b204e9800998ecf8427e
    P = uri::TickerDailyPriceId
    O = uri::20fb8f7d-30ef-dd11-a78d-001f29e570a8

Additional information from the comments: "The poor performance that I saw was actually due to the conditional breakpoint that I set in the code. Without this breakpoint, everything is pretty fast. However, if anyone has any improvements I would be interested." - Eric Shunover

+3

The fastest option (see the timings below) is a simple string split:

line.Split(new char[] { '{', '<', '>', '}', ' ', '^', '"' },
           StringSplitOptions.RemoveEmptyEntries);
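To make the split approach concrete, here is a minimal sketch that maps the resulting fields onto the question's Statement struct. The SplitParser class, the Parse method, and the field-count check are illustrative assumptions, not part of the original answer.

```csharp
using System;

public struct Statement
{
    public string C, S, P, O, T;
}

// Hypothetical helper: wraps the Split call and fills a Statement.
public static class SplitParser
{
    // Same delimiter set as the Split call in the answer.
    private static readonly char[] Delimiters = { '{', '<', '>', '}', ' ', '^', '"' };

    public static Statement Parse(string line)
    {
        string[] parts = line.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        // The sample data yields 4 fields (URI object) or 5 (typed literal).
        if (parts.Length != 4 && parts.Length != 5)
            throw new FormatException("Expected 4 or 5 fields, got " + parts.Length);

        return new Statement
        {
            C = parts[0],
            S = parts[1],
            P = parts[2],
            O = parts[3],
            T = parts.Length == 5 ? parts[4] : string.Empty
        };
    }

    public static void Main()
    {
        Statement s = Parse("{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> "
                          + "<uri::TickerDailyPriceVolume> \"693702\"^^<xsd:long>}");
        Console.WriteLine(s.O + " " + s.T); // prints "693702 xsd:long"
    }
}
```

Note that this only works while no value contains one of the delimiter characters. A literal with a space or a quote in it would split into extra fields, which is why the sketch rejects unexpected field counts instead of guessing.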

Alternatively, a single regex that matches the whole line (using repeated captures):

Regex lineParse
    = new Regex(@"^\{(<([^>]+)>\s*){3,4}(""([^""]+)""\^\^<([^>]+)>\s*)?\}$",
                RegexOptions.Compiled);
Match m = lineParse.Match(line);
if (!m.Success)
    throw new FormatException("Unrecognized line: " + line);
// Group 2 holds one capture per <...> term matched by the {3,4} repetition.
CaptureCollection uris = m.Groups[2].Captures;
// Three URIs: the object is a typed literal (groups 4 and 5).
// Four URIs: the object is itself a URI and there is no type.
bool isLiteral = uris.Count == 3;
Statement data = new Statement {
    C = uris[0].Value, S = uris[1].Value, P = uris[2].Value,
    O = isLiteral ? m.Groups[4].Value : uris[3].Value,
    T = isLiteral ? m.Groups[5].Value : String.Empty };

Results for 1M lines (String.Split as the baseline):

Method                #1  Wall ( Diff)     #2  Wall ( Diff)
------------------------------------------------------------
line.Split                3.6s (1.00x)         3.1s (1.00x)
myRegex.Match             5.1s (1.43x)         3.3s (1.10x)
itDependsRegex.Matches    6.8s (1.85x)         4.4s (1.44x)
stateMachine              8.4s (2.34x)         5.6s (1.82x)
alanM.Matches             9.1s (2.52x)         7.8s (2.56x)
yourRegex.Matches        18.3s (5.06x)        12.1s (3.90x)

The table includes the approaches suggested by @AlanM and @itdepends, and covers both Regex.Matches and Regex.Match; kudos to @RexM. Timings were taken on a Q6600 (#2) and a Xeon (#1).
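For anyone who wants to reproduce a comparison like this, the following is a rough sketch of a timing harness built on Stopwatch. It is not the harness used for the numbers above: the ParseBenchmark class, the synthetic input (the two sample lines repeated) and the per-line return value are all assumptions made for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical harness: times two of the parsers over the same input.
public static class ParseBenchmark
{
    private static readonly char[] Delimiters = { '{', '<', '>', '}', ' ', '^', '"' };

    private static readonly Regex WholeLine = new Regex(
        @"^\{(<([^>]+)>\s*){3,4}(""([^""]+)""\^\^<([^>]+)>\s*)?\}$",
        RegexOptions.Compiled);

    // Each parser returns a small number so its work cannot be skipped.
    public static int ViaSplit(string line)
    {
        return line.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    public static int ViaMatch(string line)
    {
        return WholeLine.Match(line).Success ? 1 : 0;
    }

    // Runs one parser over every line and returns the elapsed wall time.
    public static TimeSpan Time(string[] lines, Func<string, int> parser, out long total)
    {
        total = 0;
        Stopwatch watch = Stopwatch.StartNew();
        foreach (string line in lines)
            total += parser(line);
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        string[] samples =
        {
            "{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceVolume> \"693702\"^^<xsd:long>}",
            "{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceId> <uri::20fb8f7d-30ef-dd11-a78d-001f29e570a8>}"
        };

        // Synthetic input: one million lines alternating between the samples.
        string[] lines = new string[1000000];
        for (int i = 0; i < lines.Length; i++)
            lines[i] = samples[i % samples.Length];

        long total;
        TimeSpan split = Time(lines, ViaSplit, out total);
        Console.WriteLine("line.Split     {0:F2}s", split.TotalSeconds);
        TimeSpan match = Time(lines, ViaMatch, out total);
        Console.WriteLine("myRegex.Match  {0:F2}s", match.TotalSeconds);
    }
}
```

Absolute numbers will differ by machine and runtime, and a run under a debugger (especially with a conditional breakpoint, as the asker found) can be dramatically slower, so measure a release build.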

+18

, .

+6

Try this regex:

@"<(?<capture>[^>]+)>|""(?<capture>[^""]+)"""

Each value is then available as match.Groups[1].Value.

75-80% , , .

There are also two ways to iterate over the matches, timed as follows:

for(Match match = regex.Match(input); match.Success; match = match.NextMatch())
// min 5.01 sec
// max 5.15 sec

foreach(Match match in regex.Matches(input))
// min 5.66 sec
// max 6.07 sec

Match , Match.
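As a sketch of how the named-group regex and the NextMatch loop fit together, the helper below collects the captured values of one line into a list. The TokenRegexParser class and the Fields method are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper around the named-group regex from this answer.
public static class TokenRegexParser
{
    // Both alternatives write to the same named group, which .NET allows.
    private static readonly Regex Token = new Regex(
        @"<(?<capture>[^>]+)>|""(?<capture>[^""]+)""",
        RegexOptions.Compiled);

    public static List<string> Fields(string line)
    {
        var fields = new List<string>(5);
        // Walk the matches one at a time instead of building a MatchCollection.
        for (Match m = Token.Match(line); m.Success; m = m.NextMatch())
            fields.Add(m.Groups["capture"].Value);
        return fields;
    }

    public static void Main()
    {
        List<string> f = Fields("{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> "
                              + "<uri::TickerDailyPriceVolume> \"693702\"^^<xsd:long>}");
        Console.WriteLine(f.Count + ": " + f[3] + " " + f[4]); // prints "5: 693702 xsd:long"
    }
}
```

Unlike a plain split, this pattern keeps a quoted literal together even if it contains spaces, because everything between the two quotes is taken as one capture.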

+2

Following @sixlettervariables, the same set of delimiter characters can be expressed as a regex:

@"[^{}<> ^""]+"

But I would still expect the String.Split method to be faster.
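A quick way to convince yourself that this pattern and the character-based split produce the same tokens is to run both on a sample line and compare. This is only an illustrative check; the DelimiterEquivalence class and its method names are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical check: the negated character class mirrors the split delimiters.
public static class DelimiterEquivalence
{
    private static readonly char[] Delimiters = { '{', '<', '>', '}', ' ', '^', '"' };

    private static readonly Regex Token = new Regex(@"[^{}<> ^""]+", RegexOptions.Compiled);

    public static string[] ViaSplit(string line)
    {
        return line.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static string[] ViaRegex(string line)
    {
        // Each match is a maximal run of non-delimiter characters.
        return Token.Matches(line).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string line = "{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> "
                    + "<uri::TickerDailyPriceVolume> \"693702\"^^<xsd:long>}";
        Console.WriteLine(ViaSplit(line).SequenceEqual(ViaRegex(line))); // prints "True"
    }
}
```

Since both approaches see the input the same way, the choice between them comes down to speed, and the split avoids the regex engine entirely.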

+1

Source: https://habr.com/ru/post/1702727/