The difference between Data.Csv and Data.Csv.Streaming is probably not quite what you expect. The first one, as you saw, materializes a Data.Vector.Vector of the csv contents. I'm not sure why building that vector should take up so much space, though it starts not to surprise me when I reflect that the resulting vector of vectors of pointers to lazy bytestrings contains 28,203,420 distinct pointers to lazy bytestrings, 371 for each line, each pointing to a tiny bit of the original byte stream, usually to "0". Following http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html this means that a typical two-byte sequence in the original byte stream - nearly all of them look like ",0", i.e. [44,48] - is replaced by a heap of pointers and constructors: the lazy bytestring content alone makes each pair of bytes occupy something like 11 words (the Chunk and Empty constructors for a lazy bytestring, plus the material for a strict bytestring, which J. Tibell puts at 9 words) ... plus the original bytes (minus the ones that represent commas and spaces). At 11 words per field, that comes to something on the order of 28,203,420 × 11 × 8 bytes, i.e. around 2.5 GB of bookkeeping on a 64-bit system - a pretty gigantic escalation in size.
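For reference, here is a minimal sketch (mine, not one of the programs below) of the kind of non-streaming decode under discussion; the types follow Data.Csv.decode, and "train.csv" is the file used throughout this answer. The whole vector of vectors is built before anything can be printed:

    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Csv as Csv
    import qualified Data.Vector as V

    main :: IO ()
    main = do
      bs <- BL.readFile "train.csv"
      -- decode materializes every row at once as a Vector of Vectors
      -- of lazy bytestrings - this is where the space goes
      let parsed = Csv.decode Csv.HasHeader bs
            :: Either String (V.Vector (V.Vector BL.ByteString))
      case parsed of
        Left err   -> putStrLn err
        Right rows -> print (V.length rows)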
Data.Csv.Streaming isn't really so different: basically it builds a slightly decorated list rather than a vector, so in principle it can be evaluated lazily, and in ideal cases the whole thing never needs to be realized in memory, as you noticed. In a monadic context like this one, though, where you are "extracting a list from IO", it is not quite guaranteed that chaos and confusion won't ensue.
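To make that concrete, here is a minimal sketch of that lazy consumption (again mine; it leans on lazy IO and on the Foldable instance of Records, which traverses the successfully parsed records). A strict fold can count the rows without ever holding them all:

    import qualified Data.ByteString.Lazy as BL
    import           Data.Csv (HasHeader (..))
    import qualified Data.Csv.Streaming as S
    import qualified Data.Vector as V
    import           Data.Foldable (foldl')

    main :: IO ()
    main = do
      bs <- BL.readFile "train.csv"   -- lazy IO: chunks are read as the fold demands them
      let records = S.decode HasHeader bs
            :: S.Records (V.Vector BL.ByteString)
      -- foldl' is strict in the count, so rows are parsed, counted and dropped
      print (foldl' (\n _ -> n + 1) (0 :: Int) records)

In the ideal case this runs in constant space, but it is exactly the "list from IO" situation just mentioned: whether rows are really dropped as you go depends on how records is consumed.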
If you want to stream csv contents properly, you should use ... one of the streaming libraries. (I have no advice for fitting the whole thing in memory, apart from the obvious one of arranging for cassava to read each line into a nice compact data type rather than a vector of pointers to lazy bytestrings; here, though, we have 371 "fields".)

So here is your program written with cassava-streams, which uses cassava's (genuine) incremental interface and then uses io-streams to make a stream of records:
    {-# LANGUAGE BangPatterns #-}

    import qualified Data.ByteString.Lazy as BL
    import           Data.Csv (HasHeader (..))
    import           Data.Foldable
    import qualified Data.Vector as V
    import           System.IO
    import           System.IO.Streams (InputStream, OutputStream)
    import qualified System.IO.Streams as Streams
    import qualified System.IO.Streams.Csv as CSV

    type StreamOfCSV = InputStream (V.Vector BL.ByteString)

    main :: IO ()
    main = withFile "train.csv" ReadMode $ \h -> do
      input          <- Streams.handleToInputStream h
      raw_csv_stream <- CSV.decodeStream HasHeader input
      csv_stream     <- CSV.onlyValidRecords raw_csv_stream :: IO StreamOfCSV
      m              <- Streams.read csv_stream   -- Just the first record, or Nothing
      print m
This finishes promptly, using no more memory than hello-world, after printing the first record. You can see a bit more manipulation in the tutorial source: https://github.com/pjones/cassava-streams/blob/master/src/System/IO/Streams/Csv/Tutorial.hs There are similar libraries for the other streaming libraries. If the data structure you want to build (e.g. a matrix) can fit in memory, you should be able to construct it by folding over the lines with Streams.fold, and there should be no problem as long as the information you are trying to extract from each line is properly evaluated before it is consumed by the fold operation. If you could arrange for cassava to output a non-recursive data structure with unboxed fields, you could then write an Unbox instance for that type and fold the whole csv into a single tightly packed unboxed vector; with 371 distinct fields per line, though, I don't imagine that's an option here.
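For instance, a minimal sketch of such a fold, reusing StreamOfCSV from the program above (the name countRows is mine):

    -- count the records without retaining them; the bang pattern forces the
    -- accumulator as each row is consumed (BangPatterns is already enabled)
    countRows :: StreamOfCSV -> IO Int
    countRows = Streams.fold (\ !n _ -> n + 1) 0

Replacing the last two lines of main with countRows csv_stream >>= print should then stream over the whole file in more or less constant space.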
The following is the equivalent of the Data.Csv.Streaming program:
    main :: IO ()
    main = withFile "train.csv" ReadMode $ \h -> do
      input          <- Streams.handleToInputStream h
      raw_csv_stream <- CSV.decodeStream HasHeader input
      csv_stream     <- CSV.onlyValidRecords raw_csv_stream :: IO StreamOfCSV
      csvs           <- Streams.toList csv_stream
      print (csvs !! 0)
It has all the same troubles, since it uses Streams.toList to accumulate a gigantic list before trying to find the first item.
- Addendum
Here, for what it's worth, is a pipes-csv variant, which just crushes each parsed line into an unboxed vector of Ints by hand, using readInt from the bytestring package (that was easier than digging out the Doubles this csv actually stores):
    import           Control.Applicative
    import qualified Control.Foldl as L
    import           Data.ByteString (ByteString)
    import qualified Data.ByteString.Char8 as B
    import           Data.Csv
    import qualified Data.Vector as V
    import qualified Data.Vector.Unboxed as U
    import           Pipes
    import qualified Pipes.ByteString as Bytes
    import qualified Pipes.Csv as Csv
    import qualified Pipes.Prelude as P
    import           System.IO

    main :: IO ()
    main = withFile "train.csv" ReadMode $ \h -> do
      let csvs :: Producer (V.Vector ByteString) IO ()
          csvs = Csv.decode HasHeader (Bytes.fromHandle h)
                   >-> P.concat   -- drop records that failed to parse

          -- shamelessly read the integral part only, counting bad parses as 0
          simplify bs = case B.readInt bs of
            Nothing     -> 0
            Just (n, _) -> n

          uvectors :: Producer (U.Vector Int) IO ()
          uvectors = csvs >-> P.map (V.map simplify)
                          >-> P.map (V.foldr U.cons U.empty)
      runEffect $ uvectors >-> P.print
You can fold over the lines using the folds in the foldl library, or any you care to write yourself, by replacing the last line with something like this:
    let myfolds = liftA3 (,,)
          (L.generalize (L.index 13))   -- the thirteenth line, if it exists
          (L.randomN 3)                 -- three random lines
          (L.generalize L.length)       -- the number of lines
    (thirteen, mvs, len) <- L.impurely P.foldM myfolds uvectors
    case mvs of
      Nothing -> return ()
      Just vs -> print (vs :: V.Vector (U.Vector Int))
    print thirteen
    print len
Here I'm collecting the thirteenth line, three random lines, and the total number of lines - any number of other folds can be combined with these. In particular, we could just as well collect all the lines into a giant vector using L.vector, which is probably still a bad idea given the size of this csv file. Below, we are back where we started: we collect everything and print the 17th line of the completed vector of vectors, i.e. a sort of big matrix.
    vec_vec <- L.impurely P.foldM L.vector uvectors
    print $ (vec_vec :: V.Vector (U.Vector Int)) V.! 17
This takes plenty of memory, but doesn't particularly distress my little laptop.