How to load a large CSV file with mixed-type columns using the bigmemory package

Is there a way to combine the use of scan() and read.big.matrix() from the bigmemory package to read in a 200 MB CSV file with mixed-type columns, so that the result is a data frame with integer, character, and numeric columns?

4 answers

Try the ff package for this.

  library(ff)
  help(read.table.ffdf)

The function read.table.ffdf reads separated flat files into ffdf objects, very much like (and using) read.table. It can also work with the usual convenience wrappers such as read.csv, and provides its own convenience wrappers (e.g. read.csv.ffdf) for R's standard wrappers.

For a 200 MB file, this should be as simple as:

  x <- read.csv.ffdf(file=csvfile) 

(For much larger files, you will probably need to look into some of the configuration options, depending on your machine and OS.)
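To make the one-liner above concrete, here is a small self-contained sketch (it assumes the ff package is installed; the sample file and its column names are invented for illustration, standing in for the real 200 MB file):

```r
# Sketch: reading a mixed-type CSV into an on-disk ffdf object.
# Assumes install.packages("ff") has been run.
library(ff)

# Create a tiny mixed-type CSV to stand in for the real file.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(id    = 1:5,
                     name  = c("a", "b", "c", "d", "e"),
                     value = runif(5)),
          csvfile, row.names = FALSE)

# read.csv.ffdf keeps the data on disk; each column keeps its own
# type (integer, factor for character data, double).
x <- read.csv.ffdf(file = csvfile)
class(x)                          # "ffdf"
sapply(as.data.frame(x), class)   # per-column types are preserved
```

Note that ff stores character columns as factors, so you may want to convert them back after reading.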


In life there are things that are impossible, and there are things that are misunderstood and lead to unpleasant situations. @Roman is correct: a matrix must be of a single atomic type. It is not a data frame.

Since a matrix must be of a single type, trying to shoehorn several types into a bigmemory matrix is a bad idea. Can it be done? I won't go there. Why? Because everything else will assume it is getting a matrix, not a data frame. That would lead to more questions and much sadness.

Now, what you could do is detect the type of each column and generate a set of separate bigmemory files, each holding elements of a particular type, e.g. charBM = a character big matrix, intBM = an integer big matrix, and so on. Then you could work on a wrapper that builds a data frame out of all of these. Still, I don't recommend it: treat the different objects as what they are, or force homogeneity if you can, rather than trying to build a mixed-type big-data frame on top of bigmemory.
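For illustration only, a hypothetical sketch of that split-by-type idea (it assumes the bigmemory package is installed; the names intBM/charBM follow the answer's own naming). One wrinkle: big.matrix stores only numeric types, so the character column has to be encoded as integer codes plus a lookup vector:

```r
# Hypothetical sketch, not a recommendation: one big.matrix per
# column type, plus a wrapper that reassembles a data frame.
# Assumes install.packages("bigmemory") has been run.
library(bigmemory)

df <- data.frame(id    = 1:5,
                 name  = c("a", "b", "a", "c", "b"),
                 value = runif(5),
                 stringsAsFactors = FALSE)

intBM  <- as.big.matrix(as.matrix(df["id"]),    type = "integer")
numBM  <- as.big.matrix(as.matrix(df["value"]), type = "double")

# big.matrix cannot hold strings: store integer codes and keep a
# lookup table on the R side.
codes  <- unique(df$name)
charBM <- as.big.matrix(matrix(match(df$name, codes)), type = "integer")

# Wrapper that rebuilds a data frame from the pieces.
as_df <- function() data.frame(id    = intBM[, 1],
                               name  = codes[charBM[, 1]],
                               value = numBM[, 1])
```

As the answer says, every such wrapper call materializes an ordinary data frame in RAM, which defeats much of the point of bigmemory.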

@mdsumner is right to point at ff. Another out-of-memory storage option is HDF5, which you can access via ncdf4 in R. Unfortunately, these other packages are not as pleasant to use as bigmemory.


According to the help file, no.

Files must contain only one atomic type (all integer, for example). You, the user, need to know whether your file has row and/or column names, and the various combinations of options should help you get the behavior you want.

I am not familiar with this package/function, but in R, matrices can have only one atomic type (unlike, for example, data.frames).
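This single-type rule is easy to see in base R: mixing types in a matrix silently coerces everything to the most general type, while a data frame keeps each column's own type.

```r
# A matrix holds one atomic type: mixing integer and character
# coerces every element to character.
m <- cbind(1:3, c("a", "b", "c"))
typeof(m)        # "character" -- the integers became "1", "2", "3"

# A data frame keeps each column's own type.
d <- data.frame(x = 1:3, y = c("a", "b", "c"),
                stringsAsFactors = FALSE)
sapply(d, class) # x is "integer", y is "character"
```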


The best solution is to read and parse the file line by line (or in small chunks); that way the reading process needs only a small, roughly constant amount of memory regardless of file size.
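In base R that idea looks roughly like this: open a connection and pull a fixed number of lines at a time (the chunk size of 4 and the sample file are arbitrary choices for illustration):

```r
# Read a CSV in fixed-size chunks so memory use stays roughly
# constant no matter how large the file is.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, val = letters[1:10]),
          csvfile, row.names = FALSE)

con <- file(csvfile, open = "r")
header <- readLines(con, n = 1)          # consume the header row
total <- 0L
repeat {
  lines <- readLines(con, n = 4)         # at most 4 rows per chunk
  if (length(lines) == 0) break
  fields <- strsplit(lines, ",", fixed = TRUE)
  total <- total + length(fields)        # process the chunk here
}
close(con)
total                                    # 10 rows processed
```

Each chunk can be parsed into whatever per-column types you need before being aggregated or written elsewhere.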


Source: https://habr.com/ru/post/894470/
