Performance improvements for this word processing code

I am writing a program that counts the words in a text file that is already lowercase and separated by spaces. I want to use a dictionary and count a word only if it is in the dictionary. The problem is that the dictionary is quite large (~100,000 words), and each text document also has ~50,000 words. As a result, the code I wrote below runs very slowly (it takes about 15 seconds to process one document on a quad-core i7 machine). I am wondering whether something is wrong with my code and whether the efficiency of the program can be improved. Many thanks for your help. Code below:

public static string WordCount(string countInput)
{
    string[] keywords = ReadDic(); /* read dictionary txt file */

    /* then reads the main text file */
    Dictionary<string, int> dict = ReadFile(countInput).Split(' ')
        .Select(c => c)
        .Where(c => keywords.Contains(c))
        .GroupBy(c => c)
        .Select(g => new { word = g.Key, count = g.Count() })
        .OrderBy(g => g.word)
        .ToDictionary(d => d.word, d => d.count);

    int s = dict.Sum(e => e.Value);
    string k = s.ToString();
    return k;
}
+3
4 answers

Rather than loading the whole file into one string, you can read it lazily, line by line:

File.ReadLines(path).SelectMany(s => s.Split(' '))

Do not use ReadAllLines; it has to build an array of every line before you can start.
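
A minimal sketch of the streaming approach, assuming C# 9 or later (top-level statements). The temporary file and its contents are made up for illustration and stand in for the real document:

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical temp file standing in for the real text document.
string path = Path.GetTempFileName();
File.WriteAllText(path, "the quick brown fox\njumps over the lazy dog\n");

// File.ReadLines yields lines lazily; SelectMany flattens each line into its words.
int wordCount = File.ReadLines(path)
    .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    .Count();

Console.WriteLine(wordCount);

// Count() has finished enumerating, so the file is closed and can be deleted.
File.Delete(path);
```

Only one line is held in memory at a time, which matters more as documents get larger.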


Your first Select call does nothing and can be removed.


Your Contains call scans the entire keywords array for every word in the file.
As a result, the Where step is O(n^2).

Change keywords to a HashSet<string>.
HashSets can be searched in constant time, so the Where step becomes O(n), which is much better.
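
As a rough illustration of the HashSet change, assuming C# 9 or later: the arrays and strings below are invented sample data standing in for the results of ReadDic() and ReadFile(countInput), whose implementations are not shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for ReadDic() and ReadFile(countInput).
string[] dictionaryWords = { "apple", "banana", "cherry" };
string text = "apple pear banana apple kiwi";

// Build the set once; each Contains is then a hash lookup (O(1) on average)
// instead of a linear scan over the whole array.
var keywords = new HashSet<string>(dictionaryWords);

int matches = text.Split(' ')
    .Where(word => keywords.Contains(word))
    .Count();

Console.WriteLine(matches);
```

Building the set costs one pass over the dictionary, so it pays off as soon as more than a handful of words are looked up.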


Your second Select can be merged into the GroupBy call, like this:

 .GroupBy(c => c, (word, set) => new { word, count = set.Count() })

Also, a Dictionary has no defined order, so the OrderBy call is wasted work.
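
Putting these suggestions together, the query might look roughly like the sketch below. It assumes C# 9 or later, and the keyword set and text are made-up sample data rather than the asker's real inputs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data in place of ReadDic() and ReadFile(countInput).
var keywords = new HashSet<string> { "apple", "banana" };
string text = "banana apple pear apple";

// No identity Select, no OrderBy; GroupBy's result selector replaces the second Select.
Dictionary<string, int> counts = text.Split(' ')
    .Where(word => keywords.Contains(word))
    .GroupBy(w => w, (word, set) => new { word, count = set.Count() })
    .ToDictionary(d => d.word, d => d.count);

int total = counts.Values.Sum();
Console.WriteLine($"apple={counts["apple"]}, banana={counts["banana"]}, total={total}");
```

If only the total is needed, the dictionary can be skipped entirely; it is kept here to mirror the shape of the original query.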

+7

Since you only return the total, the whole query can be reduced to:

return ReadFile(countInput).Split(' ').Count(c => keywords.Contains(c));

And, as others have said, a HashSet is the better choice for keywords.
Also: ReadDic() is called on every invocation, so check that it is not part of the cost.

+1

In addition, you could try AsParallel().

+1

Change string[] keywords to HashSet<string> keywords. "Contains" on an array is a linear search, whereas on a HashSet it is a hash lookup.

If you want to get REALLY fancy, you can use multiple threads via PLINQ, but I would make sure you have optimized single-threaded performance before going down that route.
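
For reference, a PLINQ version could look something like this sketch (C# 9 or later assumed; the data is invented for illustration). It relies on the assumption that nothing modifies the HashSet while the query runs, so the concurrent reads are safe:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data in place of the real dictionary and document.
var keywords = new HashSet<string> { "apple", "banana" };
string[] words = "banana apple pear apple kiwi banana".Split(' ');

// AsParallel() partitions the words across threads; the set is only read, never written.
int total = words.AsParallel().Count(w => keywords.Contains(w));

Console.WriteLine(total);
```

For inputs of around 50,000 words, the partitioning overhead may outweigh the gain once lookups are already constant-time, so it is worth measuring both versions.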

0

Source: https://habr.com/ru/post/1782296/

