Loading ffdf data takes up a lot of memory

I have a strange problem: I am saving ffdf data using

 save.ffdf() 

from the ffbase package, and when I load it into a new R session with

 load.ffdf("data.f") 

it occupies about 90% as much RAM as the same data held as a data.frame object in R. With this problem, it makes no sense to use ffdf, right? I cannot use ffsave because I work on a server and do not have a zip application.

    packageVersion("ff")     # 2.2.10
    packageVersion("ffbase") # 0.6.3

Any ideas?

[edit] sample code to help clarify:

    data <- read.csv.ffdf(file = fn, header = TRUE, colClasses = classes)
    # file fn is a csv database with 5 columns and 2.6 million rows,
    # with some factor cols and some integer cols.
    data.1 <- data
    save.ffdf(data.1, dir = my.dir)
    # my.dir is a string pointing to the target directory, "C:/data/R/test.f" for example.

Closing the R session ... opening it again:

    load.ffdf(file.name)
    # file.name is a string pointing to that directory.
    # This gives me the object data, with class(data) = ffdf.

Then I have an ffdf object data (5 columns), and its in-RAM size is almost as large as that of:

    data.R <- data[,]   # which is a data.frame
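One way to quantify that comparison is sketched below; it assumes a Windows session (memory.size() is Windows-only) and the exact numbers will of course differ per machine:

    # Footprint right after load.ffdf(), before materialising anything:
    gc()
    memory.size()                              # MB in use with only the ffdf loaded

    # Materialise the data.frame and compare:
    data.R <- data[, ]
    memory.size()                              # MB in use with the data.frame in RAM
    print(object.size(data.R), units = "MB")   # size of the data.frame alone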

[end of edit]

[SECOND EDIT :: FULL REPRODUCIBLE CODE]

Since my question has not yet been answered and I still see the problem, here is a reproducible example:

    dir1 <- 'P:/Projects/RLargeData'
    setwd(dir1)
    library(ff)
    library(ffbase)
    memory.limit(size = 4000)

    N <- 1e7
    df <- data.frame(
      x = c(1:N),
      y = sample(letters, N, replace = TRUE),
      z = sample(as.Date(sample(c(1:2000), N, replace = TRUE), origin = "1970-01-01")),
      w = factor(sample(c(1:N/10), N, replace = TRUE))
    )
    df[1:10, ]

    dff <- as.ffdf(df)
    head(dff)
    # str(dff)
    save.ffdf(dff, dir = "dframeffdf")
    dim(dff)
    # on disk, the directory "dframeffdf" is 205 MB (215,706,264 bytes)

    ### resetting R :: fresh RStudio session
    dir1 <- 'P:/Projects/RLargeData'
    setwd(dir1)
    library(ff)
    library(ffbase)
    memory.size()                 # 15.63
    load.ffdf(dir = "dframeffdf")
    memory.size()                 # 384.42
    gc()
    memory.size()                 # 287

So we end up with 384 MB of memory in use, and 287 MB after gc(), which is around the size of the data on disk (also checked with the Process Explorer application for Windows).
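For reference, a small sketch of how the on-disk size of the ffdf directory can be compared with the in-session footprint (paths and numbers follow the example above):

    # Total size of the backing files on disk, in MB:
    files <- list.files("dframeffdf", full.names = TRUE)
    sum(file.info(files)$size) / 1024^2   # about 205 MB here

    # RAM in use after load.ffdf() and garbage collection:
    gc()
    memory.size()                         # about 287 MB here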

    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: i386-w64-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252
        LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C LC_TIME=Danish_Denmark.1252

    attached base packages:
    [1] tools stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] ffbase_0.7-1 ff_2.2-10 bit_1.1-9

[END SECOND EDIT]

+1
2 answers

In ff, when you have factor columns, the factor levels are always in RAM. ff character columns do not currently exist; character columns are converted to factors in an ffdf.

As for your example: your "w" column in "dff" contains more than 6 million levels. These levels are all in RAM. If you did not have columns with a lot of levels, you would not see the increase in RAM, as shown below using your example.

    N <- 1e7
    df <- data.frame(
      x = c(1:N),
      y = sample(letters, N, replace = TRUE),
      z = sample(as.Date(sample(c(1:2000), N, replace = TRUE), origin = "1970-01-01")),
      w = sample(c(1:N/10), N, replace = TRUE)
    )
    dff <- as.ffdf(df)
    save.ffdf(dff, dir = "dframeffdf")

    ### resetting R :: fresh RStudio session
    library(ff)
    library(ffbase)
    memory.size()                 # 14.67
    load.ffdf(dir = "dframeffdf")
    memory.size()                 # 14.78
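To see where the memory goes in the original example, one can look at the factor levels directly; this is a sketch using the names from the question, and the exact sizes will vary:

    # With dff built as in the question (w stored as a factor):
    length(levels(dff$w))                            # roughly 6.3 million levels
    print(object.size(levels(dff$w)), units = "MB")  # RAM taken by the level labels alone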
+2

The ff/ffdf packages have mechanisms for splitting an object between physical and virtual storage. I suspect you are implicitly creating items in physical memory, but since you have not offered the code showing how this workspace was created, there is only so much guessing that is possible.
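As an illustration of that split, the ff package exposes physical() and virtual() accessors; the toy data below is made up just to show them:

    library(ff)
    library(ffbase)

    x <- as.ffdf(data.frame(a = 1:10, b = factor(letters[1:10])))
    physical(x)    # the underlying ff vectors, backed by files on disk
    virtual(x)     # the virtual attributes (dimensions, names, ...)
    filename(x$a)  # where the physical data for column "a" lives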

0

Source: https://habr.com/ru/post/1391529/

