How to write to a file using Parallel.ForEach?

I have a task that reads a large file line by line, applies some logic to each line, and returns a string that I need to write to an output file. The order of the output does not matter. However, when I run the code below, it stalls or becomes very slow after reading about 15-20k lines of the input file.

    public static Object FileLock = new Object();
    ...
    Parallel.ForEach(System.IO.File.ReadLines(inputFile), (line, _, lineNumber) =>
    {
        var output = MyComplexMethodReturnsAString(line);
        lock (FileLock)
        {
            using (var file = System.IO.File.AppendText(outputFile))
            {
                file.WriteLine(output);
            }
        }
    });

Why is my program slowing down after a while? Is there a more correct way to accomplish this task?

2 answers

You have essentially serialized your query by having all the threads try to write to the file. Instead, compute the lines that need to be written in parallel, and then write them out at the end as they come in.

    var processedLines = File.ReadLines(inputFile).AsParallel()
        .Select(l => MyComplexMethodReturnsAString(l));
    File.AppendAllLines(outputFile, processedLines);

If you need to flush the data as you go, open a stream and enable auto-flushing (or flush manually):

    var processedLines = File.ReadLines(inputFile).AsParallel()
        .Select(l => MyComplexMethodReturnsAString(l));
    using (var output = File.AppendText(outputFile))
    {
        output.AutoFlush = true;
        foreach (var processedLine in processedLines)
            output.WriteLine(processedLine);
    }

The slowdown comes from the way the internal load balancer of Parallel.ForEach works. When it sees that your threads spend a lot of time blocking, it reasons that it can speed things up by throwing more threads at the problem. That leads to higher parallel overhead, more contention for your FileLock, and overall performance degradation.

Why is this happening? Because Parallel.ForEach is not designed for I/O work.

How can you fix this? Use Parallel.ForEach only for CPU-bound work and perform all I/O operations outside of the parallel loop.
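One way to keep the I/O outside of the parallel loop is a producer/consumer arrangement: the loop only computes, and a single task owns the output file. The following is a minimal sketch, not the answerer's own code. The class name `ParallelFileProcessor`, the method `Process`, and the `transform` parameter (standing in for `MyComplexMethodReturnsAString`) are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class ParallelFileProcessor
{
    // Hypothetical helper: runs `transform` on every input line in parallel
    // and lets one dedicated task do all of the writing.
    public static void Process(string inputFile, string outputFile, Func<string, string> transform)
    {
        // Unbounded for simplicity; a bounded capacity would cap memory use
        // if the workers produce lines faster than the writer can store them.
        using (var results = new BlockingCollection<string>())
        {
            // Single consumer: the only code that touches the output file.
            var writerTask = Task.Run(() =>
            {
                using (var writer = new StreamWriter(outputFile, append: true))
                {
                    foreach (var processedLine in results.GetConsumingEnumerable())
                        writer.WriteLine(processedLine);
                }
            });

            try
            {
                // Producers: CPU-bound work only, no file writes and no lock.
                Parallel.ForEach(File.ReadLines(inputFile), line => results.Add(transform(line)));
            }
            finally
            {
                // Signal the writer that no more lines are coming, even on failure.
                results.CompleteAdding();
            }

            // Wait until everything queued so far has been written.
            writerTask.Wait();
        }
    }
}
```

It could be called as `ParallelFileProcessor.Process(inputFile, outputFile, MyComplexMethodReturnsAString);`. Because the worker threads never block on a file lock, the load balancer has no reason to keep adding threads, and the output order is still unspecified, which the question allows.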

A quick workaround is to limit the number of threads that Parallel.ForEach is allowed to enlist, using the overload that takes a ParallelOptions argument, for example:

    Parallel.ForEach(
        System.IO.File.ReadLines(inputFile),
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        (line, _, lineNumber) => { ... });
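For reference, here is how that workaround might look when applied to the loop from the question. This is a sketch under assumptions: the class name `BoundedParallelWriter`, the method `Process`, and the `transform` parameter (standing in for `MyComplexMethodReturnsAString`) are hypothetical. It also opens the output file once instead of once per line, which is a change from the question's code rather than part of the answer.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class BoundedParallelWriter
{
    private static readonly object FileLock = new object();

    // Hypothetical helper combining the question's lock with a thread cap.
    public static void Process(string inputFile, string outputFile, Func<string, string> transform)
    {
        // Cap the number of concurrent workers at the number of logical cores.
        var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        // Assumption: open the writer once, rather than reopening it for every line.
        using (var writer = File.AppendText(outputFile))
        {
            Parallel.ForEach(File.ReadLines(inputFile), options, line =>
            {
                // CPU-bound work happens outside the lock.
                var output = transform(line);

                // Only the short write is serialized.
                lock (FileLock)
                {
                    writer.WriteLine(output);
                }
            });
        }
    }
}
```

The threads still contend for the lock, so this remains a workaround; it only stops the scheduler from adding more and more threads while they wait.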

Source: https://habr.com/ru/post/1242920/

