Regex MatchCollection object hangs with "Function evaluation timed out" after Regex.Matches

I'm fairly new to C# and regular expressions, but I have already spent a couple of hours searching for a solution to this problem, so hopefully this is easy for you guys :)

My application uses a regex to match the email addresses in a given string, then iterates over the matches:

    String EmailPattern = "\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
    MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
    foreach (Match memail in mcemail)

It works fine, but when I load the string from one specific page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx , the MatchCollection (mcemail) object "hangs" the loop. When I set a breakpoint and inspect the object, I get "Function evaluation timed out" for all of its properties (.Count, etc.).

Update: I tried my pattern and other email patterns on the same string. Everything (regex design tools, Python-based web pages, etc.) fails or times out when trying to match this particular string.

How can I detect that the MatchCollection object is not "ready" to use?

+6
3 answers

I just did a local test, and it looks like either the sheer size of the document or something in the ViewState makes the Regex evaluation time out. (Edit: Actually, I'm pretty sure it's the size. Removing the ViewState reduces the size significantly.)

An admittedly crude way to solve this would be something like this:

    string[] rawHtmlLines = File.ReadAllLines(@"C:\default.aspx");
    string filteredHtml = String.Join(Environment.NewLine,
        rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
    string emailPattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
    var emailMatches = Regex.Matches(filteredHtml, emailPattern);
    foreach (Match match in emailMatches)
    {
        //...
    }

In general, I suspect that the email pattern is simply not optimized (or designed) for extracting email addresses from a large string, but is meant only for validating user input. As a general rule, it is a good idea to limit the string you are searching to the parts you are really interested in and to keep it as small as possible, for example by leaving out the ViewState, which is guaranteed not to contain any readable email addresses.

If performance is important, it might also be better to build the filtered HTML using StringBuilder and IndexOf (etc.) instead of splitting the text into lines and running LINQ over the result :)
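As a rough sketch of that StringBuilder/IndexOf idea: the helper below walks the raw HTML once and copies every line except those that mention the ViewState field. The class and method names (HtmlFilter, RemoveViewStateLines) are hypothetical, and the sketch assumes the ViewState sits on a single line, as the line-based filter above does.

```csharp
using System;
using System.Text;

static class HtmlFilter
{
    // Hypothetical helper: copies all lines that do not mention the
    // ViewState field, without first splitting the input into an array.
    public static string RemoveViewStateLines(string rawHtml)
    {
        var sb = new StringBuilder(rawHtml.Length);
        int start = 0;
        while (start < rawHtml.Length)
        {
            // Find the end of the current line (or the end of the input).
            int end = rawHtml.IndexOf('\n', start);
            if (end < 0) end = rawHtml.Length;
            int length = end - start;

            // Search for the marker inside the current line only.
            bool isViewState = rawHtml.IndexOf(
                "_VIEWSTATE", start, length, StringComparison.Ordinal) >= 0;

            if (!isViewState)
            {
                sb.Append(rawHtml, start, length).Append('\n');
            }
            start = end + 1;
        }
        return sb.ToString();
    }
}
```

It would be called as `string filteredHtml = HtmlFilter.RemoveViewStateLines(rawHtml);` before handing the result to Regex.Matches. Whether this is measurably faster than the LINQ version depends on the input, so it is worth profiling before switching.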

Edit:

To further minimize the length of the string that Regex needs to check, you could include only the lines that contain the @ character, for example:

    string filteredHtml = String.Join(Environment.NewLine,
        rawHtmlLines.Where(line => line.IndexOf('@') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());

+1

If you could post the input that is causing the problem (possibly anonymized in some way), that would give us more information, but I think the problem is this little guy right here:

 ([-.]\\w+)*\\.\\w+([-.]\\w+)* 

To understand the problem, divide it into groups:

 ([-.]\\w+)* \\.\\w+ ([-.]\\w+)* 

Strings that match \\.\\w+ are a subset of those that match [-.]\\w+ . Therefore, if part of your input looks like foo.bar.baz.blah.yadda.com , your regex engine has no way of knowing which group is supposed to match which part. Does that make sense? The first ([-.]\\w+)* could match .bar.baz.blah , then \\.\\w+ could match .yadda , then the last ([-.]\\w+)* could match .com ...

... OR the first group could match .bar.baz , the second could match .blah , and the last could match .yadda.com . Since it does not know which one is right, it will keep trying different combinations. It should stop eventually, but that can take a long time. This is called "catastrophic backtracking".

This problem is compounded by the fact that you are using capturing groups rather than non-capturing groups, i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+) . That makes the engine try to split up and store whatever matches inside the parentheses for later reference. But, as I explained above, it is ambiguous which group each substring belongs to.

You might consider replacing everything after the @ with something like this:

 \\w[-\\w]*\\.[-.\\w]+ 

This could use some refinement to make it more specific, but you get the general idea. I hope I have explained all this well enough; grouping and backreferences are pretty hard to describe.
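Put together, the suggestion might look like the sketch below. It keeps the local part from the question (switched to a non-capturing group) and swaps in the simpler expression after the @ . The sample input and the class name are made up for illustration, and the pattern has not been tested against the page from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class SimplifiedPatternDemo
{
    static void Main()
    {
        // Local part as in the question, but with a non-capturing group;
        // everything after '@' is replaced by the simpler expression.
        string emailPattern = @"\w+(?:[-+.]\w+)*@\w[-\w]*\.[-.\w]+";

        // Made-up sample input, not taken from the page in the question.
        string sample = "Contact john.doe@example.com or sales@example.org for details";

        foreach (Match m in Regex.Matches(sample, emailPattern))
        {
            Console.WriteLine(m.Value);
        }
    }
}
```

One example of the refinement this still needs: because [-.\w]+ accepts dots, an address followed directly by a sentence-ending period would have that period included in the match.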

EDIT:

Looking back at your pattern, there is a deeper problem here that is still related to the backtracking/ambiguity issue I mentioned. The subpattern \\w+([-.]\\w+)* is ambiguous in itself. Breaking it into parts, we have:

 \\w+ ([-.]\\w+)* 

Suppose you have a string like foobar . Where does \\w+ end and ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following might work as a match:

    f(oobar)
    foo(bar)
    f(o)(oba)(r)
    f(o)(o)(b)(a)(r)
    foobar
    etc...

The regex engine does not know which of these is intended, so it will try all of them. This is the same problem I mentioned above, but it means that you have it in several places in your pattern.

Worse, ([-.]\\w+)* is also ambiguous, because of the + after \\w . How many groups are there in blah ? I count 16 possible combinations: (blah) , (b)(lah) , (bl)(ah) ...

The number of different possible combinations will be huge, even for relatively small input, so your engine will be in overdrive. I would definitely simplify this if I were you.

+2

From "Function evaluation timed out", I assume that you are doing this in the debugger. The debugger has fairly short timeouts for how long a method evaluation may take, and not everything finishes that quickly. I would suggest performing the operation in code, storing the result, and then viewing that result in the debugger (i.e., let the call to Matches run and put a breakpoint after it).
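A minimal sketch of that suggestion follows. One detail matters here: Regex.Matches populates the returned MatchCollection lazily, so the call itself returns quickly and the real work happens when the collection is enumerated or .Count is read. Copying the matches into a list forces that work to happen in your own code rather than inside a debugger evaluation. The input string is a placeholder, not the page from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class DebugFriendlyMatching
{
    static void Main()
    {
        // Placeholder input; in the question this is the downloaded page.
        string rawHTML = "<p>Mail us at info@example.com</p>";
        string emailPattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

        MatchCollection mcemail = Regex.Matches(rawHTML, emailPattern);

        // Enumerating the collection runs the actual matching here,
        // in normal code, not under the debugger's evaluation timeout.
        List<Match> matches = mcemail.Cast<Match>().ToList();

        // Put the breakpoint on this line and inspect 'matches'.
        Console.WriteLine(matches.Count);
    }
}
```

On a pathological input the ToList() call will still take as long as the matching takes; this only moves the wait out of the debugger's watch window.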

Now, as for detecting whether a given string will make Matches take a long time: that is a bit of a black art. Basically, you have to do some input validation. Just because you got some value from the Internet does not mean that the value will work well with Matches. The exact validation logic is up to you, but starting with the length of rawHTML may be useful (i.e. if the length is 1,000,000 bytes, Matches may take a while). You then need to decide what to do if the input is too long, for example show the user an error.
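One way this validation could be sketched is a length check combined with a match timeout. This assumes .NET Framework 4.5 or later, where Regex.Matches has an overload that takes a TimeSpan and throws RegexMatchTimeoutException when a match attempt exceeds it. The names EmailScanner, TryFindEmails and MaxInputLength, and the limits chosen, are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EmailScanner
{
    // Assumed limit, in characters (not bytes); tune it for your inputs.
    const int MaxInputLength = 1000000;

    // Hypothetical helper: returns false when the input is rejected or
    // the regex engine gives up, so the caller can report an error.
    public static bool TryFindEmails(string rawHtml, string pattern, out List<string> emails)
    {
        emails = new List<string>();

        // Cheap validation first: refuse obviously oversized input.
        if (rawHtml.Length > MaxInputLength)
        {
            return false;
        }

        try
        {
            // The timeout is checked while the lazy collection is enumerated.
            foreach (Match m in Regex.Matches(
                rawHtml, pattern, RegexOptions.None, TimeSpan.FromSeconds(2)))
            {
                emails.Add(m.Value);
            }
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine exceeded the timeout; treat the input as unusable.
            return false;
        }
    }
}
```

A caller would check the return value and, on false, fall back to whatever error handling fits the application, such as the user-facing error mentioned above.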

0

Source: https://habr.com/ru/post/888152/