Performance improvements for this word processing code

Question

I am writing a program that counts the words in a text file that is already lowercase and separated by spaces. I want to use a dictionary and count a word only IF it is in the dictionary. The problem is that the dictionary is quite large (~100,000 words), and each text document also has ~50,000 words. As a result, the code below runs very slowly: it takes about 15 seconds to process one document on a quad-core i7 machine. Is something wrong with my code, and can its efficiency be improved? Many thanks for your help. Code below:

public static string WordCount(string countInput)
{
    string[] keywords = ReadDic(); /* read dictionary txt file */

    /* then read the main text file */
    Dictionary<string, int> dict = ReadFile(countInput).Split(' ')
        .Select(c => c)
        .Where(c => keywords.Contains(c))
        .GroupBy(c => c)
        .Select(g => new { word = g.Key, count = g.Count() })
        .OrderBy(g => g.word)
        .ToDictionary(d => d.word, d => d.count);

    int s = dict.Sum(e => e.Value);
    string k = s.ToString();
    return k;
}

+3

c# text

johnv Dec 27 '10 at 14:53

4 answers

Since you only return the total count, you can simplify the code to:

return ReadFile(countInput).Split(' ').Count(c => keywords.Contains(c));

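As the others have said, using a HashSet for the keywords will improve performance.
Edit: also, the result of ReadDic() should be cached rather than re-read on every call.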
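
To make the suggestions in this answer concrete, here is a minimal sketch that combines the single Count call with a keyword set that is built only once. The class name KeywordCounter, the method CountKeywords, and the in-memory ReadDic() stand-in are assumptions for illustration; the question does not show how ReadDic() is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordCounter
{
    // Hypothetical stand-in for the question's ReadDic();
    // a real version would read the dictionary text file.
    private static string[] ReadDic() => new[] { "apple", "banana", "cherry" };

    // Built on first use, then reused for every document that is processed.
    private static readonly Lazy<HashSet<string>> Keywords =
        new Lazy<HashSet<string>>(
            () => new HashSet<string>(ReadDic(), StringComparer.Ordinal));

    // Counts the space-separated words in 'text' that appear in the keyword set.
    public static int CountKeywords(string text)
    {
        HashSet<string> keywords = Keywords.Value;
        return text.Split(' ').Count(word => keywords.Contains(word));
    }
}
```

Calling KeywordCounter.CountKeywords repeatedly pays the dictionary-loading cost only on the first call.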

+1

The Smallest Dec 27 '10 at 14:58

Also, try AsParallel().
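
As a rough sketch of what the AsParallel() suggestion might look like, assuming the words are already split into an array and the keywords are held in a HashSet<string> (the class and method names here are hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ParallelKeywordCounter
{
    // Counts matching words using PLINQ to spread the work across cores.
    public static int CountKeywords(string[] words, HashSet<string> keywords)
    {
        // Concurrent reads of a HashSet are safe as long as
        // nothing modifies the set while the query runs.
        return words.AsParallel().Count(word => keywords.Contains(word));
    }
}
```

For inputs of ~50,000 words the hash lookups are already cheap, so the parallel version may not be faster than the sequential one; measuring both is the only way to know.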

+1

Dmitri Nesteruk Dec 27 '10 at 14:59

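Try changing string[] keywords to HashSet<string> keywords. Your call to "Contains" is essentially a loop over the whole array, which will be much, much slower than a lookup by hash key.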

If you want to get REALLY fancy, you could use multiple threads via PLINQ, but I would make sure that you have optimized single-threaded performance before going down that route.

0

Brook Dec 27 '10 at 15:00

SLaks · Accepted Answer · 2010-12-27T14:57:19+0000

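You can improve performance by reading the file one line at a time instead of building one enormous string, like this: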

File.ReadLines(path).SelectMany(s => s.Split(' '))

Do not call ReadAllLines; it has to build an array containing every line of the file.
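
A small sketch of the streaming approach, wrapped in a helper so it can be reused. The names StreamingWordReader and ReadWords are hypothetical, and the extra filter for empty strings is an addition that the question's code does not have.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class StreamingWordReader
{
    // Lazily yields each space-separated word of the file,
    // reading one line at a time instead of the whole file.
    public static IEnumerable<string> ReadWords(string path)
    {
        return File.ReadLines(path)
                   .SelectMany(line => line.Split(' '))
                   // Assumption: skip empty strings produced by
                   // repeated spaces or blank lines.
                   .Where(word => word.Length > 0);
    }
}
```

Because File.ReadLines is lazy, nothing is read until the sequence is enumerated, for example by a later Count or GroupBy call.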

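Your first Select call is completely useless.

Your Contains call loops through the entire keywords array for every word in the file.
Thus, the Where call is an O(n²) operation.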

Change keywords to a HashSet<string>.
Since HashSets can be searched in constant time, the Where call becomes an O(n) operation, which is much better.
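
To see the difference between the two lookups on your own data, something like the following could be used. It is only a rough measurement sketch: the class name LookupComparison is hypothetical, and Stopwatch timings of a single run are noisy.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class LookupComparison
{
    // Counts matches with both lookup strategies and prints rough timings.
    public static (int ArrayMatches, int SetMatches) Compare(
        string[] keywords, string[] words)
    {
        var timer = Stopwatch.StartNew();
        // Array lookup: a linear scan of 'keywords' for every word.
        int arrayMatches = words.Count(w => keywords.Contains(w));
        long arrayMs = timer.ElapsedMilliseconds;

        var keywordSet = new HashSet<string>(keywords);
        timer.Restart();
        // HashSet lookup: one hash probe per word.
        int setMatches = words.Count(w => keywordSet.Contains(w));
        long setMs = timer.ElapsedMilliseconds;

        Console.WriteLine($"array: {arrayMs} ms, HashSet: {setMs} ms");
        return (arrayMatches, setMatches);
    }
}
```

Both strategies return the same count; only the time taken should differ as the keyword list grows.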

Your second Select call can be combined with the GroupBy, like this:

 .GroupBy(c => c, (word, set) => new { word, count = set.Count() })

Finally, your OrderBy call is pointless, because a Dictionary does not preserve ordering.
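
If per-word counts are actually needed, the last points of this answer might be put together roughly as follows. The class name WordFrequencies and the method CountPerWord are hypothetical, and the sketch assumes the keywords are already in a HashSet<string>.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WordFrequencies
{
    // Returns how often each keyword occurs in 'words'.
    public static Dictionary<string, int> CountPerWord(
        IEnumerable<string> words, HashSet<string> keywords)
    {
        return words
            .Where(word => keywords.Contains(word))
            // Result-selector overload of GroupBy: each group is projected
            // directly, so no separate Select is needed.
            .GroupBy(word => word,
                     (word, occurrences) => new { word, count = occurrences.Count() })
            // No OrderBy here: Dictionary does not preserve ordering.
            .ToDictionary(entry => entry.word, entry => entry.count);
    }
}
```

If only the total is required, as in the question's WordCount, the grouping can be skipped entirely and the filtered words counted directly.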
