Memory usage and known issues with Regex across different versions of the .NET Framework

We have a Windows service built on .NET 4.0. The service parses large text files made up of lines of comma-separated values (several million lines, with 5-10 values each). This part works without problems: we read the lines, split them into a key/value collection, and process the values. To validate the values, we use data parallelism to pass them, essentially an array of values in specific formats, to a method that performs Regex checks on the individual values.

Until now we have used static regular expressions: not the static Regex.IsMatch method, but a static Regex member constructed with RegexOptions.Compiled, as shown below.

private static Regex clientIdentityRegEx = new Regex("^[0-9]{4,9}$", RegexOptions.Compiled); 

With this approach memory usage was fairly stable: it grew slightly with the number of values per line, and processing time was more or less linear in the total number of lines.
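The approach described above, a single static Regex built with RegexOptions.Compiled and reused for every value, can be sketched roughly as follows. The class name ClientValidators and the method IsClientIdentity are hypothetical names chosen for illustration; only the pattern comes from the question.

```csharp
using System.Text.RegularExpressions;

// Hypothetical holder class; only the pattern is taken from the question.
public static class ClientValidators
{
    // Created once when the class is first used. RegexOptions.Compiled asks
    // the runtime to compile the expression to IL instead of interpreting it.
    private static readonly Regex clientIdentityRegEx =
        new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);

    // The instance IsMatch method reuses the same compiled expression on every call.
    public static bool IsClientIdentity(string value)
    {
        return value != null && clientIdentityRegEx.IsMatch(value);
    }
}
```

A call such as ClientValidators.IsClientIdentity("12345") goes through the one shared instance, so the compilation cost is paid only once no matter how many lines are processed.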

To allow the regular expressions to be used in other projects targeting different versions of the Framework, we recently moved the static Regex properties into a common utilities project, which is now compiled against the .NET 2.0 CLR (the regular expressions themselves have not changed). The number of Regex objects exposed has grown to about 60, from 25 or so. Since then we have run into memory problems: memory usage is three or more times that of the original project. When we profile the running service, the memory appears to be "leaking" from Regex.IsMatch, not from one specific Regex but from different ones depending on which are called.

I found the following comment in an old MSDN blog post by a member of the BCL team, relating to Regex in .NET 1.0 / 1.1.

"There are even more costs for compilation that should be mentioned. Emitting IL with Reflection.Emit loads a lot of code and uses a lot of memory, and that is not memory you will ever get back. In addition, in v1.0 and v1.1 we could never free the IL we generated, meaning you leaked memory by using this mode. We have fixed that in Whidbey. But the bottom line is that you should use this mode only for a finite set of expressions that you know will be reused."

I should add that we have profiled the "majority" of the common Regex calls and cannot reproduce the problem with any of them individually.

Is this a known issue with the .Net 2.0 CLR?

The author of the post states: "But the bottom line is that you should use this mode only for a finite set of expressions that you know will be reused." What is a reasonable finite number of expressions to use in this way, and could this be the cause?

Update: Following the answer from @Henk Holterman, are there any recommendations for testing regular expressions, in particular Regex.IsMatch, other than sheer brute force in terms of volume and parameter formats?

Answer: Henk's answer, "The scenario calls for a limited, fixed number of Regex objects", was largely on the mark. We added the static Regex instances to the class until we isolated the expressions that caused a noticeable increase in memory usage. These were moved to separate static classes, which seems to have resolved some of the memory problems.
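One rough way to isolate an expression like this is to compare managed heap size before and after constructing and exercising each pattern. The sketch below is only an illustration: RegexMemoryProbe and Measure are hypothetical names, and GC.GetTotalMemory reports only the managed heap, so it may not capture all of the memory taken by emitted IL. Watching the process's private bytes alongside it is probably worthwhile.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for comparing patterns; not part of the original service.
public static class RegexMemoryProbe
{
    // Returns the approximate change in managed heap size (bytes) after
    // constructing a compiled Regex and matching `sample` `iterations` times.
    // `matchCount` reports how many of those calls matched.
    public static long Measure(string pattern, string sample, int iterations, out int matchCount)
    {
        long before = GC.GetTotalMemory(true);   // collect first for a steadier baseline

        Regex regex = new Regex(pattern, RegexOptions.Compiled);
        matchCount = 0;
        for (int i = 0; i < iterations; i++)
        {
            if (regex.IsMatch(sample))
            {
                matchCount++;
            }
        }

        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(regex);   // keep the instance reachable until after the second reading
        return after - before;
    }
}
```

Running Measure once per expression with representative input gives a relative ranking rather than an exact figure; the numbers vary between runs, so only large differences between patterns are meaningful.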

It seems, although I cannot confirm this, that compiled Regex usage behaves differently between the .NET 2.0 CLR and the .NET 4.0 CLR, since the memory problems do not occur when everything is compiled exclusively for .NET 4.0. (Can anyone confirm this?)

1 answer

The scenario calls for a limited, fixed number of Regex objects. That should not leak. You should make sure that in the new situation the Regex objects are still being reused.
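As an illustration of what "still being reused" can mean after moving the expressions into a shared project, here is a minimal sketch contrasting a property that returns one cached instance with one that rebuilds the Regex on every access. The class SharedPatterns and both property names are hypothetical; whether the utilities project in the question looks like either variant is not known.

```csharp
using System.Text.RegularExpressions;

// Hypothetical shared utilities class, for illustration only.
public static class SharedPatterns
{
    private static readonly Regex clientIdentity =
        new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);

    // Reused: every caller receives the same compiled instance.
    public static Regex ClientIdentity
    {
        get { return clientIdentity; }
    }

    // Not reused: each access constructs and compiles a new Regex.
    // Shown only as a pitfall to look for.
    public static Regex ClientIdentityRebuilt
    {
        get { return new Regex("^[0-9]{4,9}$", RegexOptions.Compiled); }
    }
}
```

A quick check is object.ReferenceEquals on two consecutive reads of the same property: it is true for ClientIdentity and false for ClientIdentityRebuilt.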

The other possibility is the increased number of expressions (60, up from 25). Could one of them be a little more complex, leading to excessive backtracking?


Source: https://habr.com/ru/post/907166/

