Why does Seq give a stack overflow on going through a large csv file

I have a csv file with the following structure:

  • The first line is the title bar
  • The rest of the rows are data rows, each with the same number of commas, so we can think of data in column terms

I wrote a little script to go through each line of the file and return a sequence of tuples containing the column heading and the length of the largest data row in this column:

let getColumnInfo (fileName:string) = let delimiter = ',' let readLinesIntoColumns (sr:StreamReader) = seq { while not sr.EndOfStream do yield sr.ReadLine().Split(delimiter) |> Seq.map (fun c -> c.Length ) } use sr = new StreamReader(fileName) let headers = sr.ReadLine().Split(delimiter) let columnSizes = let initial = Seq.map ( fun h -> 0 ) headers let toMaxColLengths (accumulator:seq<int>) (line:seq<int>) = let chooseBigger ab = if a > b then a else b Seq.map2 chooseBigger accumulator line readLinesIntoColumns sr |> Seq.fold toMaxColLengths initial Seq.zip headers columnSizes; 

This works fine with a small file. However, when it tries to process a large file (> 75 MB), it removes fsi using the StackOverflow exception. If I delete the line

 Seq.map2 chooseBigger accumulator line 

the program ends.

Now, my question is: why does F # use the stack? My understanding of sequences in F # is that the entire sequence is not stored in memory, but only the elements being processed. Therefore, I expected that lines that were already processed would not remain on the stack. Where is my misunderstanding?

+4
source share
3 answers

I think this is a good question. Here's a simpler reprogramming:

 let test n = [for i in 1 .. n -> Seq.empty] |> List.fold (Seq.map2 max) Seq.empty |> Seq.iter ignore 

test creates a sequence of empty sequences, calculates max line by line, and then iterates over the resulting (empty) sequence. You will find that with a high n this will lead to a stack overflow, even though there are no values ​​to iterate at all!

It’s a little difficult to explain why, but here is a hit at him. The problem is that when you add sequences, Seq.map2 returns a new sequence that Seq.map2 its work until it is listed. Thus, when you try to iterate over the resulting sequence, you end up moving deep into the chain of computations of n layers.

As Daniel explains, you can avoid this by evaluating the resulting sequence with impatience (for example, translating it into a list).

EDIT

Here is an attempt to explain what is going wrong. When you call Seq.map2 max s1 s2 , neither s1 nor s2 actually listed; you will get a new sequence, which, when enumerating, will list both of them and compare the obtained values. Thus, if we do something like the following:

 let s0 = Seq.empty let s1 = Seq.map2 max Seq.emtpy s0 let s2 = Seq.map2 max Seq.emtpy s1 let s3 = Seq.map2 max Seq.emtpy s2 let s4 = Seq.map2 max Seq.emtpy s3 let s5 = Seq.map2 max Seq.emtpy s4 ... 

Then the call to Seq.map2 always returns immediately and uses the constant stack space. However, enumeration s5 requires enumeration s4, which requires enumeration s3, etc. This means that listing s99999 will create a huge call stack that looks something like this:

 ... (s99996 enumerator).MoveNext() (s99997 enumerator).MoveNext() (s99998 enumerator).MoveNext() (s99999 enumerator).MoveNext() 

and we get a stack overflow.

+6
source

Your code contains so many sequences that are hard to talk about. I guess you are confusing. You can make it a lot easier and more efficient (striving is not so bad):

 let getColumnInfo (fileName:string) = let delimiter = ',' use sr = new StreamReader(fileName) match sr.ReadLine() with | null | "" -> Array.empty | hdr -> let cols = hdr.Split(delimiter) let counts = Array.zeroCreate cols.Length while not sr.EndOfStream do sr.ReadLine().Split(delimiter) |> Array.iteri (fun i fld -> counts.[i] <- max counts.[i] fld.Length) Array.zip cols counts 

This assumes that all rows are nonempty and have the same number of columns.

You can fix your function by changing this line to:

 Seq.map2 chooseBigger accumulator line |> Seq.toList |> seq 
+2
source

why is f # using the stack? My understanding of sequences in F # is that the entire sequence is not stored in memory, but only the elements being processed. Therefore, I expected that lines that were already processed would not remain on the stack. Where is my misunderstanding?

Lines themselves do not eat stack space. The problem is that you accidentally wrote a function that creates a huge, invaluable computation (tree of tricks) that overflows the stack when it is evaluated, because it makes non-zero O (n) calls deep. This tends to happen whenever you create sequences from other sequences and are not forced to evaluate anything.

+1
source

Source: https://habr.com/ru/post/1400597/


All Articles