The fastest way to check if a string exists in a large number of files

Question

I am currently iterating over somewhere between 7,000 and 10,000 text definitions, ranging in size from 0 to 5,000 characters, and I want to check whether a particular string exists in any of them. I want to do this for somewhere in the region of 5,000 different string definitions.

In most cases I just need an exact, case-insensitive match, but sometimes a regular expression is required. I was wondering whether it would be faster to use a different “search” technique when a regular expression is not required.

A cut-down version of the code looks something like this.

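foreach (string find in stringsiWantToFind)
{
    // One Regex per search string, built once and reused for every text.
    Regex rx = new Regex(find, RegexOptions.IgnoreCase);
    foreach (string s in listOfText)
    {
        if (rx.IsMatch(s))
        {
            find.FoundIn(s); // FoundIn is not shown here
        }
    }
}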

I have read around a little to see if I was missing something obvious. There are a number of suggestions for using compiled regular expressions, but I don’t see how that is useful, given the “dynamic” nature of the regular expression.

I also read an interesting article on CodeProject, so I'm going to look at using "FastIndexOf" to see how it compares in performance.
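For the non-regex case, a plain ordinal, case-insensitive IndexOf is one baseline to time against the Regex version. The sketch below is only an illustration: the class and method names are hypothetical, and nothing here has been benchmarked.

```csharp
using System;
using System.Collections.Generic;

public static class LiteralSearch
{
    // Returns every text that contains 'term', ignoring case.
    // No Regex is involved, so 'term' is always treated as plain text.
    public static List<string> TextsContaining(string term, IEnumerable<string> texts)
    {
        var hits = new List<string>();
        foreach (string text in texts)
        {
            // IndexOf returns -1 when the term is absent.
            if (text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(text);
            }
        }
        return hits;
    }
}
```

Note that an empty term is found at index 0 of every text, so it would need filtering out beforehand if that is not wanted.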

I was just wondering if anyone has any advice on this problem and on how its performance could be improved?

Thanks

performance c# regex

MrEdmundo Feb 15 '10 at 18:30

3 answers

gingerbreadboy · Answer 1 · 2010-02-15T19:16:19+0000

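Something like this? Create the new Regex once, outside the loop, combining the search strings with alternation, and wrap the stream objects in using blocks.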

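// Build one alternation pattern once, outside the file loop.
// Literal search strings containing regex syntax would need Regex.Escape first.
Regex rx = new Regex("string1|string2|string3|string5|string-etc", RegexOptions.IgnoreCase);

foreach (string fileName in fileNames)
{
  // Read access is enough for searching; using disposes the reader and the stream.
  using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
  using (var sr = new StreamReader(fs))
  {
    string readFile = sr.ReadToEnd();
    MatchCollection matches = rx.Matches(readFile);

    foreach (Match match in matches)
    {
      // do stuff; match.Value holds the text that matched
    }
  }
}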
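The hard-coded pattern above could also be built from the list of search strings. This is a minimal sketch, assuming the terms are literals rather than regular expressions; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CombinedPattern
{
    // Builds one case-insensitive Regex that matches any of the literal terms.
    public static Regex Build(IEnumerable<string> literalTerms)
    {
        // Regex.Escape stops characters such as '.' or '(' being read as regex syntax.
        // Longer terms go first because alternation takes the first branch that matches.
        string pattern = string.Join("|",
            literalTerms.Where(t => !string.IsNullOrEmpty(t))
                        .OrderByDescending(t => t.Length)
                        .Select(Regex.Escape));

        // An empty pattern would match at every position, so reject it.
        if (pattern.Length == 0)
        {
            throw new ArgumentException("At least one non-empty term is required.");
        }
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }
}
```

A call such as CombinedPattern.Build(stringsiWantToFind) would then replace the string literal. Terms that really are regular expressions would have to be kept out of this list, since escaping would turn them into literals.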

Steve Danner · Answer 2 · 2010-02-15T18:38:37+0000

Consider MS or Google Desktop Search. They expose an API that could be used for this.

Marvin smith · Answer 3 · 2010-02-15T18:39:59+0000

One suggestion:

Combine the strings into one big string and run the regular expression over it globally. This gives you “string found xx times” results from a single regex instead of iterating over your list.
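One way the “string found xx times” result might be collected is sketched below. It assumes a single combined Regex is already available; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchCounter
{
    // Counts how often each distinct matched text occurs in 'text', ignoring case.
    public static Dictionary<string, int> Count(Regex combined, string text)
    {
        // The case-insensitive comparer makes "Foo" and "FOO" share one counter.
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        // Matches returns non-overlapping matches, scanning left to right.
        foreach (Match m in combined.Matches(text))
        {
            int seen;
            counts.TryGetValue(m.Value, out seen); // seen stays 0 for a new key
            counts[m.Value] = seen + 1;
        }
        return counts;
    }
}
```

Because matches do not overlap, a term that only occurs inside another matched term is not counted separately.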

Hope this helps,
