Faster data validation and updating inside the foreach loop

I am reading data from a StreamReader line by line inside the following while loop.

    while (!sr.EndOfStream)
    {
        string[] rows = sr.ReadLine().Split(sep);
        int incr = 0;
        foreach (var item in rows)
        {
            if (item == "NA" | item == "" | item == "NULL" | string.IsNullOrEmpty(item) | string.IsNullOrWhiteSpace(item))
            {
                rows[incr] = null;
            }
            ++incr;
        }
        // another logic ...
    }

The code works fine, but it is very slow because the CSV files are huge (500,000,000 rows and hundreds of columns). Is there a faster way to check the data, so that values such as "NA", "", ... are replaced with null? I am currently using foreach with the incr counter variable to update the array elements inside the loop.

I was wondering whether LINQ or lambdas would be faster, but I am very new to these areas.

1 answer

First, do not use foreach when you are changing the collection you iterate over; it is not a good habit, especially when you are already maintaining a counter variable. A plain for loop gives you the index directly.

The inner loop can also be made multithreaded with Parallel.For. Both versions follow.

Code using a regular for loop:

    while (!sr.EndOfStream)
    {
        string[] rows = sr.ReadLine().Split(sep);
        for (int i = 0; i < rows.Length; i++)
        {
            // I simplified your checks; this is safer and simpler.
            // IsNullOrWhiteSpace already covers null, "" and whitespace-only strings.
            if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
            {
                rows[i] = null;
            }
        }
        // another logic ...
    }
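To try the for-loop version without a real file, here is a minimal self-contained sketch. The class name NullMarkerDemo, the helpers IsMissing and NormalizeRow, the comma separator, and the two sample lines are all assumptions made for illustration; a StringReader stands in for the StreamReader from the question.

```csharp
using System;
using System.IO;

static class NullMarkerDemo
{
    // Hypothetical helper: true when a field should be treated as missing.
    static bool IsMissing(string field)
    {
        return string.IsNullOrWhiteSpace(field) || field == "NA" || field == "NULL";
    }

    // Replaces missing-value markers with null, in place.
    public static void NormalizeRow(string[] row)
    {
        for (int i = 0; i < row.Length; i++)
        {
            if (IsMissing(row[i]))
            {
                row[i] = null;
            }
        }
    }

    static void Main()
    {
        // StringReader stands in for the StreamReader over the real CSV file.
        using (TextReader reader = new StringReader("1,NA,,x\nNULL, ,7,y"))
        {
            string line;
            // ReadLine returns null at the end of the input.
            while ((line = reader.ReadLine()) != null)
            {
                string[] row = line.Split(',');
                NormalizeRow(row);
                // Show null fields as <null> so they are visible in the output.
                Console.WriteLine(string.Join("|", Array.ConvertAll(row, f => f ?? "<null>")));
            }
        }
    }
}
```

With the two sample lines above this prints `1|<null>|<null>|x` and then `<null>|<null>|7|y`.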

Code using Parallel.For:

    while (!sr.EndOfStream)
    {
        string[] rows = sr.ReadLine().Split(sep);
        Parallel.For(0, rows.Length, i =>
        {
            if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
            {
                rows[i] = null;
            }
        });
        // another logic ...
    }

EDIT

We could also approach this from the other side and parallelize over lines instead of over the fields of one line. I do not recommend it, though, because it needs a lot of RAM: the entire file must be read into memory.

    string[] lines = File.ReadAllLines("test.txt");
    Parallel.For(0, lines.Length, x =>
    {
        // rows is local to this iteration, so any further processing must happen here.
        string[] rows = lines[x].Split(sep);
        for (int i = 0; i < rows.Length; i++)
        {
            if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
            {
                rows[i] = null;
            }
        }
    });

But I don’t think it is worth it; you decide. Operations like these do not parallelize well: each check takes so little time to compute that the overhead of distributing the work outweighs the gain.
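One way to check the overhead claim on your own machine is a rough timing sketch like the one below. It is not a rigorous benchmark. The class name ParallelOverheadDemo, the row width of 300 columns, the sample of 100000 rows, and the synthetic field values are assumptions chosen only for illustration; the numbers it prints will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class ParallelOverheadDemo
{
    static bool IsMissing(string field)
    {
        return string.IsNullOrWhiteSpace(field) || field == "NA" || field == "NULL";
    }

    // Sequential version: a plain for loop over the fields of one row.
    public static void NormalizeSequential(string[] row)
    {
        for (int i = 0; i < row.Length; i++)
        {
            if (IsMissing(row[i])) row[i] = null;
        }
    }

    // Parallel version: one Parallel.For call per row, as in the answer.
    public static void NormalizeParallel(string[] row)
    {
        Parallel.For(0, row.Length, i =>
        {
            if (IsMissing(row[i])) row[i] = null;
        });
    }

    static void Main()
    {
        const int columns = 300;     // assumed row width
        const int rowCount = 100000; // assumed sample size

        // Synthetic row: every third field is a missing-value marker.
        string[] template = new string[columns];
        for (int i = 0; i < columns; i++)
        {
            template[i] = (i % 3 == 0) ? "NA" : "42";
        }

        Stopwatch sw = Stopwatch.StartNew();
        for (int r = 0; r < rowCount; r++)
        {
            NormalizeSequential((string[])template.Clone());
        }
        Console.WriteLine("for:          " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int r = 0; r < rowCount; r++)
        {
            NormalizeParallel((string[])template.Clone());
        }
        Console.WriteLine("Parallel.For: " + sw.ElapsedMilliseconds + " ms");
    }
}
```

Both loops do the same work on identical copies of the row, so any difference in the two timings comes from how the work is scheduled rather than from the checks themselves.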


Source: https://habr.com/ru/post/1274377/