Loading CSV into memory using Cassava

I am trying to load a CSV into memory as a Vector of Vectors with Cassava. My program works, but it uses a huge amount of memory for a 50 MB CSV file, and I don't understand why.

I know that Data.Csv.Streaming should be better suited to large files, but I thought 50 MB would still be fine. I tried both Data.Csv and Data.Csv.Streaming with more or less the canonical examples from the GitHub project page, and I also implemented my own parser that outputs a Vector of Vectors (I based my code on attoparsec-csv, https://hackage.haskell.org/package/attoparsec-csv ), and all of these solutions use about 2000 MB of memory! I am sure something is wrong with what I am doing. What is the right way to do this?

My ultimate goal is to load the data fully into memory for further processing later. For example, I could split my data into interesting matrices and work on those with hmatrix.
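For concreteness, this is the kind of thing I have in mind afterwards (just a sketch; toMatrix is a hypothetical helper of mine, fromLists is hmatrix's):

    import Numeric.LinearAlgebra (Matrix, fromLists)

    -- Once rows are parsed into [[Double]], hmatrix can pack them
    -- into a dense Matrix for the actual numerical work.
    toMatrix :: [[Double]] -> Matrix Double
    toMatrix = fromLists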

Here are 2 programs I tried with Cassava:

1 / Using Data.Csv

    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Vector as V
    import Data.Csv
    import Data.Foldable

    main = do
      csv <- BL.readFile "train.csv"
      let Right res = decode HasHeader csv :: Either String (V.Vector (V.Vector BL.ByteString))
      print $ res V.! 0

2 / Using Data.Csv.Streaming

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Vector as V
    import Data.Csv.Streaming
    import Data.Foldable

    main = do
      csv <- BL.readFile "train.csv"
      let !a = decode HasHeader csv :: Records (V.Vector BL.ByteString)
      let !res = V.fromList $ Data.Foldable.toList a
      print $ res V.! 0

Please note that I am not including the program I wrote based on attoparsec-csv, because it is almost identical, just with Vector in place of List. The memory usage of that solution is still pretty bad.

Interestingly, in the Data.Csv.Streaming solution, if I just print my data using Data.Foldable.for_, everything is very fast and uses 2 MB of memory. That made me think my problem is related to the way I build my Vector: most likely an accumulation of thunks, instead of the raw data being stored in a compact data structure.
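For reference, this is more or less the for_ variant I mean (a minimal sketch):

    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Vector as V
    import Data.Csv.Streaming
    import Data.Foldable (for_)

    -- Consume each record as it is parsed, instead of accumulating
    -- them all; this runs in roughly constant memory.
    main :: IO ()
    main = do
      csv <- BL.readFile "train.csv"
      let recs = decode HasHeader csv :: Records (V.Vector BL.ByteString)
      for_ recs print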

Thank you for your help.

Antoine

1 answer

The difference between Data.Csv and Data.Csv.Streaming is probably not quite what you expect. As you saw, the first produces a Data.Vector.Vector of the csv contents. I'm not sure why building this vector should take up so much space, though it begins not to surprise me when I reflect that the resulting vector of pointers to vectors of pointers to lazy bytestrings contains 28,203,420 distinct pointers to tiny lazy bytestrings - 371 for each line - each pointing at a little piece of the original stream of bytes, typically "0". Following http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html , this means that a typical two-byte sequence in the original stream of bytes - almost all of them look like ",0", i.e. [44,48] - is replaced by a heap of pointers and constructors: the lazy bytestring content alone makes each pair of bytes take something like 11 words (the Chunk and Empty constructors for the lazy bytestring, plus the material for a strict bytestring, which J. Tibell puts at 9 words) ... plus the original bytes (minus those representing the commas and spaces). On a 64-bit system this is a pretty gigantic escalation in size.
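As a rough sanity check (my own back-of-envelope arithmetic, taking those figures at face value), the numbers do land in the neighborhood of the ~2000 MB observed:

    -- Back-of-envelope estimate of the blow-up described above, assuming
    -- a 64-bit system (8 bytes per word) and ~11 words per parsed field.
    estimatedBytes :: Int
    estimatedBytes = fields * wordsPerField * bytesPerWord
      where
        fields        = 28203420  -- pointers/fields in the decoded vector
        wordsPerField = 11        -- lazy Chunk/Empty plus strict bytestring overhead
        bytesPerWord  = 8         -- 64-bit words
    -- ~ 2.48e9 bytes, i.e. roughly the 2000 MB the question reports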

Data.Csv.Streaming is actually not that different: basically it builds a slightly decorated list rather than a vector, so in principle it can be evaluated lazily, and in the ideal case the whole thing never needs to be realized in memory at once, as you noticed. In a monadic context like this, however, you are "extracting a list from IO", which is not quite guaranteed to be free of chaos and confusion.
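Concretely, the "slightly decorated list" is cassava's Records type, which (roughly, per the Data.Csv.Streaming documentation) looks like this:

    -- A cons-list of parse results, ending with any parse error message
    -- and the unconsumed remainder of the input.
    data Records a
      = Cons (Either String a) (Records a)
      | Nil (Maybe String) BL.ByteString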

If you want to stream the csv contents properly, you should use ... one of the streaming libraries. (I have no advice for getting the whole thing into memory, apart from the obvious one of arranging for cassava to read each line into a nicely compact data type rather than a vector of pointers to lazy bytestrings; here, though, we have 371 "fields" per row.)

So here is your program using cassava-streams, which uses cassava's (genuine) incremental interface and then uses io-streams to produce a stream of records:

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Vector as V
    import Data.Csv (HasHeader (..))  -- for the bare HasHeader constructor
    import Data.Foldable
    import System.IO.Streams (InputStream, OutputStream)
    import qualified System.IO.Streams as Streams
    import qualified System.IO.Streams.Csv as CSV
    import System.IO

    type StreamOfCSV = InputStream (V.Vector BL.ByteString)

    main = withFile "train.csv" ReadMode $ \h -> do
      input <- Streams.handleToInputStream h
      raw_csv_stream <- CSV.decodeStream HasHeader input
      csv_stream <- CSV.onlyValidRecords raw_csv_stream :: IO StreamOfCSV
      m <- Streams.read csv_stream
      print m

This terminates immediately, using no more memory than hello-world, and prints the first record. You can see a few more manipulations in the tutorial source, https://github.com/pjones/cassava-streams/blob/master/src/System/IO/Streams/Csv/Tutorial.hs . There are similar libraries for the other streaming frameworks. If the data structure you want to build (for example, a matrix) can fit in memory, you should be able to construct it by folding over the lines with Streams.fold, and there should be no problem as long as the information you are trying to extract from each row is properly evaluated before it is consumed by the fold. If you can arrange for cassava to produce a non-recursive data structure with unpacked fields, then you can write an Unbox instance for that type and fold the whole csv into a single tightly packed unboxed vector. Here, though, each row has 371 distinct fields, so I don't think that is an option.
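For instance, a fold that never materializes the stream might look like this (a sketch of my own, not from the tutorial; it reuses the imports of the program above):

    -- Count the rows without ever holding more than one record in memory.
    countRows :: InputStream (V.Vector BL.ByteString) -> IO Int
    countRows = Streams.fold (\n _ -> n + 1) 0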

Here is the equivalent of the Data.Csv.Streaming program above:

    main = withFile "train.csv" ReadMode $ \h -> do
      input <- Streams.handleToInputStream h
      raw_csv_stream <- CSV.decodeStream HasHeader input
      csv_stream <- CSV.onlyValidRecords raw_csv_stream :: IO StreamOfCSV
      csvs <- Streams.toList csv_stream
      print (csvs !! 0)

This has all the same problems as before, since it uses Streams.toList to collect a giant list before trying to find the first item.

Addendum

Here, by the way, is a pipes-csv variant, which just compacts each parsed row into an unboxed vector of Ints by hand, using readInt from the bytestring package (this is simpler than parsing the Doubles this csv actually stores).

    import Data.ByteString (ByteString)
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Vector as V
    import qualified Data.Vector.Unboxed as U
    import Data.Csv
    import qualified Pipes.Prelude as P
    import qualified Pipes.ByteString as Bytes
    import Pipes
    import qualified Pipes.Csv as Csv
    import System.IO
    import Control.Applicative
    import qualified Control.Foldl as L

    main = withFile "train.csv" ReadMode $ \h -> do
      let csvs :: Producer (V.Vector ByteString) IO ()
          csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
          -- shamelessly reading the integral part only, counting bad parses as 0
          simplify bs = case B.readInt bs of
            Nothing       -> 0
            Just (n, bs') -> n
          uvectors :: Producer (U.Vector Int) IO ()
          uvectors = csvs >-> P.map (V.map simplify) >-> P.map (V.foldr U.cons U.empty)
      runEffect $ uvectors >-> P.print

You can fold over the rows using the folds from the foldl library, or any you care to write yourself, by replacing the last line with something like this:

    let myfolds = liftA3 (,,)
          (L.generalize (L.index 13))  -- the thirteenth row, if it exists
          (L.randomN 3)                -- three random rows
          (L.generalize L.length)      -- number of rows
    (thirteen, mvs, len) <- L.impurely P.foldM myfolds uvectors
    case mvs of
      Nothing -> return ()
      Just vs -> print (vs :: V.Vector (U.Vector Int))
    print thirteen
    print len

Here I am collecting the thirteenth row, three random rows, and the total number of records; any number of other folds can be combined with these. In particular, we could collect all the rows into a giant vector using L.vector, which is probably still a bad idea given the size of this csv file. Below we come back round to where we started: we collect everything and print the 17th "line" of the resulting vector of vectors, i.e. a sort of big matrix.

    vec_vec <- L.impurely P.foldM L.vector uvectors
    print $ (vec_vec :: V.Vector (U.Vector Int)) V.! 17

This takes plenty of memory, but doesn't particularly distress my little laptop.

