I am writing a program that counts the words in a text file; the text is already lowercase and separated by spaces. I have a dictionary (a word list) and want to count a word only if it appears in that dictionary. The problem is that the dictionary is quite large (~100,000 words) and each text document also has ~50,000 words, so the code below runs very slowly: it takes about 15 seconds to process one document on a quad-core i7 machine. Is something wrong with my code, and can the efficiency of the program be improved? Many thanks for your help. Code below:
public static string WordCount(string countInput)
{
    string[] keywords = ReadDic();
    Dictionary<string, int> dict = ReadFile(countInput).Split(' ')
        .Select(c => c)
        .Where(c => keywords.Contains(c))
        .GroupBy(c => c)
        .Select(g => new { word = g.Key, count = g.Count() })
        .OrderBy(g => g.word)
        .ToDictionary(d => d.word, d => d.count);
    int s = dict.Sum(e => e.Value);
    string k = s.ToString();
    return k;
}
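A likely cause of the slowdown is `keywords.Contains(c)`: on a `string[]` it is a linear scan, so up to ~100,000 comparisons happen for each of the ~50,000 words. Below is a minimal sketch of one possible alternative that loads the word list into a `HashSet<string>`, where lookups take roughly constant time. The class `WordCounter` and the method `CountDictionaryWords` are hypothetical names, and the text and word list are passed in as parameters because `ReadFile` and `ReadDic` are not shown. Since only the total is returned, the grouping, ordering and intermediate dictionary are left out here; that is an assumption about what the caller needs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original code.
public static class WordCounter
{
    // Counts how many space-separated tokens in 'text' appear in 'keywords'.
    // 'text' is assumed to be lowercase already, as described in the question.
    public static int CountDictionaryWords(string text, IEnumerable<string> keywords)
    {
        // Build the set once; each Contains call is then a hash lookup
        // instead of a scan over the whole word list.
        var keywordSet = new HashSet<string>(keywords, StringComparer.Ordinal);

        // Summing the per-word group counts equals counting the matching
        // tokens directly, so no GroupBy/OrderBy is needed for the total.
        return text.Split(' ').Count(word => keywordSet.Contains(word));
    }
}
```

For example, `WordCounter.CountDictionaryWords("the cat sat on the mat", new[] { "the", "mat" })` returns 3. If many documents are processed, building the set once and reusing it across calls would avoid repeating that work.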
johnv