Why does this regex take so long to find email addresses in specific files?

Question

I have a regex that looks for email addresses. (It was taken from another SO post that I can no longer find, and it was tested against all kinds of email address formats. Changing it is not really my question, but I understand if it turns out to be the root cause.)

 /[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

I am using preg_match_all() in PHP.

This works fine for 99.99...% of the files I look at and takes about 5 ms, but occasionally it takes a couple of minutes. Those files are larger than the average web page, at about 300 KB, but much larger files usually process without trouble. The only thing I can find in the file contents that stands out is a string of thousands of consecutive "random" alphanumeric characters, such as:

 wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H...

Here are two pages that cause the problem. View their source to see the long strings.

Any thoughts on what causes this?

- FINAL SOLUTION -

I tested the various regular expressions suggested in the answers. @FailedDev's answer helped and reduced the processing time from a few minutes to a few seconds. @hakre's answer solved the problem and reduced the processing time to a few hundred milliseconds. The final regex follows; it is @hakre's second expression.

 /[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

+6

php regex web-scraping

T. Brian Jones Nov 17 '11 at 23:03

2 answers

My best guess is to try possessive quantifiers, changing:

 [a-z0-9_\-\+]+

to

 [a-z0-9_\-\+]++

This makes the regex fail faster, which should improve performance in these situations.
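To illustrate what "possessive" means here, the following minimal sketch uses a toy pattern rather than the email regex. A greedy `+` gives characters back when the rest of the pattern fails, while a possessive `++` keeps everything it consumed. The subject string and variable names are made up for the example.

```php
<?php
// Greedy: a+ first takes "aaa", then backtracks one character
// so that the trailing "a" in the pattern can match.
$greedy = preg_match('/a+a/', 'aaa');

// Possessive: a++ takes "aaa" and never gives anything back,
// so the trailing "a" has nothing left to match and the attempt fails.
$possessive = preg_match('/a++a/', 'aaa');

var_dump($greedy);     // int(1)
var_dump($possessive); // int(0)
```

In the email regex the character after the quantifier is `@`, which is not in the preceding character class, so making the quantifier possessive should not change which addresses match. It only stops the engine from retrying shorter prefixes that cannot succeed.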

Edit:

Maybe atomic grouping can also help:

 /(?>[a-z0-9_\-+]++)@(?>[a-z0-9\-]++\.)(?>[a-z]{2,3})(?:\.[a-z]{2})?/

You should try the first option first. It would be interesting to know whether adding option 2 makes any difference.

+6

FailedDev Nov 17 '11 at 23:10

hakre · Accepted Answer · 2011-11-17T23:30:21+0000

You already know that your regular expression is causing a problem for large files. So maybe you can make it a little smarter?

For example, you use + to match one or more characters. Say you have a string of 10,000 characters. The regex engine has to look at up to 10,000 combinations to find the longest match. On top of that, you combine it with a similar group. Say you have a string of 20,000 characters and two + groups: how many ways can they match in the file? Probably 10,000 x 10,000 possibilities. And so on.
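The growth can be made concrete with a back-of-the-envelope count. The sketch below is a deliberately simplified cost model, not a description of PCRE's internals. It assumes a run of `$n` characters that all match the first character class and contain no `@`, and it counts how many characters the first quantifier consumes when the engine retries from every starting position. `consumedChars` is a hypothetical helper written for this illustration.

```php
<?php
// Hypothetical helper: total characters consumed by the first quantifier
// over all starting positions in a run of $n matching characters.
// $cap models an upper bound such as {1,256}; null means an unbounded "+".
function consumedChars(int $n, ?int $cap = null): int
{
    $total = 0;
    for ($start = 0; $start < $n; $start++) {
        $remaining = $n - $start;              // characters left in the run
        $total += ($cap === null) ? $remaining : min($cap, $remaining);
    }
    return $total;
}

echo consumedChars(10000), "\n";      // 50005000 (grows with n squared)
echo consumedChars(10000, 256), "\n"; // 2527360  (grows roughly with 256 * n)
```

Under this model, capping the quantifier turns quadratic growth into roughly linear growth in the length of the run, which fits the large speedup reported in the question.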

If you can limit the number of characters (it looks like you are searching for email patterns), you could cap the domain name of the email address at 256 characters and the address itself at 256 characters. Then there are "only" 256 x 256 possibilities to check:

 /[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i

This is probably already much faster. Then, making those quantifiers possessive will reduce backtracking in PCRE:

 /[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

Which should speed it up again.
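For completeness, here is a minimal sketch of how the bounded, possessive pattern might be used with preg_match_all(). The sample input, including the long alphanumeric run appended to it, is invented for the example, and the error check is an addition that the question does not show.

```php
<?php
// Bounded, possessive version of the email pattern from this answer.
$pattern = '/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i';

// Made-up input: two addresses followed by a long "random" alphanumeric run
// similar to the one described in the question.
$html = '<p>Contact: alice@example.com or bob@example.co.uk</p>'
      . str_repeat('wEPDwUKMTk0ODI3Nzk5MQ9kFgIC', 2000);

$count = preg_match_all($pattern, $html, $matches);

if ($count === false) {
    // preg_match_all() returns false on failure, for example when the
    // pcre.backtrack_limit setting is exceeded.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    print_r($matches[0]); // full matches: alice@example.com, bob@example.co.uk
}
```

Checking for a `false` return value is worthwhile in a scraper, because a pattern that hits a PCRE limit otherwise looks the same as a page with no addresses.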
