Regex.Match performance problem in C#

Hi everyone, I am using Regex.Match in C# to parse a text file line by line. I find that it takes much longer (about 2-4 seconds) when a line does not match the pattern, but less time (under 1 second) when the line does match. Can anyone tell me how to improve the performance?

This is the regex that I use:

^.*?\t.*?\t(?<npk>\d+)\t(?<bol>\w+)\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)\t.*?\t.*?\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})\t.*?\t.*?\t.*?\t.*?\t.*?\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t.*?\t(?<material>[\w\-]+)\tIV$ 
2 answers

Performance problems that only show up when a regular expression fails to match are very often caused by catastrophic backtracking. This happens when the regex allows many different ways of matching the subject text, all of which the regex engine has to try before it can declare failure.

In your case, the reason for the failure is obvious:

First, what you are doing should not really be done with a regular expression, but rather with a CSV parser (or TSV parser in your case).

If you're stuck with a regex, you still need to change something. Your problem is that the delimiter \t can also be matched by the dot ( . ), so if the line as a whole does not match, the regex engine has to try all those permutations, as described above.

So a big step forward would be to change every .*? to [^\t]* , where applicable, and to condense the repetitions using the {m,n} operator:

 ^(?:[^\t]*\t){2}(?<npk>\d+)\t(?<bol>\w+)(?:\t[^\t]*){9}\t\s*(?<netValue>[\d\.,]+)(?:\t[^\t]*){2}\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})(?:\t[^\t]*){5}\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t[^\t]*\t(?<material>[\w\-]+)\tIV$ 

I hope I didn't miscount :)
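As a rough sketch of how the rewritten pattern could be used, the class below builds the regex once and reads a named group from a line. The names SalesLineRegex and GetField are made up for this illustration and do not come from the question; the pattern itself is copied from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SalesLineRegex
{
    // Pattern copied from the answer above, split across lines for readability.
    private const string Pattern =
        @"^(?:[^\t]*\t){2}(?<npk>\d+)\t(?<bol>\w+)(?:\t[^\t]*){9}" +
        @"\t\s*(?<netValue>[\d\.,]+)(?:\t[^\t]*){2}\t(?<item>\d{6})" +
        @"\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})(?:\t[^\t]*){5}" +
        @"\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t[^\t]*" +
        @"\t(?<material>[\w\-]+)\tIV$";

    // Built once and reused for every line of the file.
    private static readonly Regex LineRegex = new Regex(Pattern);

    // Hypothetical helper: returns the text captured by one named group,
    // or null when the line does not match the pattern at all.
    public static string GetField(string line, string groupName)
    {
        Match m = LineRegex.Match(line);
        return m.Success ? m.Groups[groupName].Value : null;
    }
}
```

Calling SalesLineRegex.GetField(line, "npk") for each line read from the file would then return either the captured value or null for a line that does not fit the layout.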


For illustration only:

Matching this text (the fields are separated by tabs)

 1 2 3 4 5 6 7 8 9 0 

with this excerpt from your regex above

 .*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+) 

takes the regex engine 39 steps.

When you feed it this text, though:

 1 2 3 4 5 6 7 8 9 X 

it takes the regex engine 4602 steps to figure out that it cannot match.

If you use

 (?:[^\t]*\t){9}\s*(?<netValue>[\d\.,]+) 

instead, the engine requires 30 steps for a successful match and only 39 for an unsuccessful attempt.
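If you want to check the two excerpts against each other in .NET, something along these lines could work. It is only a sketch: the class and method names are invented for this example, and .NET does not report engine steps, so it can only confirm that both excerpts agree on the result and give a rough wall-clock comparison.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ExcerptComparison
{
    // The excerpt from the question: nine lazy "anything, then tab" fields.
    public const string LazyDot =
        @".*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)";

    // The rewritten excerpt: a field can no longer swallow a tab.
    public const string NegatedClass =
        @"(?:[^\t]*\t){9}\s*(?<netValue>[\d\.,]+)";

    // Returns the captured netValue, or null when the pattern does not match.
    public static string NetValue(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["netValue"].Value : null;
    }

    // Rough timing of repeated match attempts; not a precise benchmark.
    public static TimeSpan TimeMatches(string pattern, string input, int iterations)
    {
        var regex = new Regex(pattern);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.IsMatch(input);
        }
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Running TimeMatches with both patterns on the failing sample ("1\t2\t3\t4\t5\t6\t7\t8\t9\tX") should show the gap, although the exact numbers will depend on the machine and the runtime.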


Precompilation usually helps:

 private static readonly Regex re = new Regex(pattern, RegexOptions.Compiled); 

However, I wonder whether in this particular case the problem is tied to the specific regex, perhaps some expensive backtracking. Regex is not always the right tool to use, for example ...


Edit, now that it is clear this is delimited data:

Delimited data can be handled much more efficiently with a parsing approach than with regular expressions. You might even get away with just var parts = line.Split('\t') (and access the parts by index), but if that fails, this CSV reader has options to control the delimiters, etc.
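A minimal sketch of the Split approach might look like this. The field positions (2, 3, 13, 16, 17, 18, 24, 25, 27 and the trailing "IV" at 28) are an assumption derived by counting the tabs in the regex from the question, and SalesLine and TsvLineParser are hypothetical names. Unlike the regex, this version does not check that the numeric fields really contain digits.

```csharp
using System;

// Hypothetical container for the fields the regex in the question captures.
public sealed class SalesLine
{
    public string Npk { get; set; }
    public string Bol { get; set; }
    public string NetValue { get; set; }
    public string Item { get; set; }
    public string SalesDoc { get; set; }
    public string AcGiDate { get; set; }
    public string Delivery { get; set; }
    public string BillQuantity { get; set; }
    public string Material { get; set; }
}

public static class TsvLineParser
{
    // Assumption: 29 tab-separated fields, counted from the regex in the question.
    private const int ExpectedFieldCount = 29;

    // Returns null when the line does not have the expected layout.
    public static SalesLine Parse(string line)
    {
        string[] parts = line.Split('\t');
        if (parts.Length != ExpectedFieldCount || parts[28] != "IV")
        {
            return null;
        }

        return new SalesLine
        {
            Npk = parts[2],
            Bol = parts[3],
            NetValue = parts[13].Trim(),      // the regex skips leading whitespace here
            Item = parts[16],
            SalesDoc = parts[17],
            AcGiDate = parts[18],
            Delivery = parts[24],
            BillQuantity = parts[25].Trim(),  // same as above
            Material = parts[27]
        };
    }
}
```

A line with the wrong number of fields is rejected after a single pass over the string, so there is nothing to backtrack over.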


Source: https://habr.com/ru/post/1340292/

