
fread uses a lot of memory when "skip" is large

I have a large CSV file (20 GB, almost 200 million lines) that does not fit into memory as a whole, so I want to load it in parts.

I did not find a way to pass a file connection to fread (as is possible with readLines), so I tried using "skip":

    for (i in 1:100) {
      lines = fread(filename, nrows = rowPerRead, skip = (i - 1) * rowPerRead)
    }

It works fine at the start, but it gets slower as "skip" grows, and the slowdown is non-linear. It turns out that although these lines are skipped, reading past them still takes up a lot of memory, which is only released once the call has finished. As soon as memory is used up, the process becomes very slow.

    > system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1,quote="") })
       user  system elapsed
       0.71    0.04    0.73
    > system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1e8,quote="") })
    Read 1000000 rows and 12 (of 12) columns from 20.049 GB file in 00:01:47
       user  system elapsed
      21.89   13.76  106.60
    > system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1.4e8,quote="") })
    Read 1000000 rows and 12 (of 12) columns from 20.049 GB file in 00:02:48
       user  system elapsed
      16.95   12.49  169.76

Memory usage during the 2nd and 3rd runs: (image not available)

So my questions are:

1. Is there a more efficient way to read a file with a large "skip"?
2. Is there a way to start fread from a file connection, so that I can continue from the last read position instead of restarting from the beginning?
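For question 2, the connection-based pattern that readLines allows could be sketched roughly as follows. This is an illustration, not a built-in fread feature: the function name `read_in_chunks`, its arguments, and the idea of handing each block to `fread(text = ...)` are assumptions, and `text =` is only available in reasonably recent data.table versions. It also assumes the first line is a header and that no quoted field contains a line break.

```r
# Sketch: stream a CSV through one open connection so that each chunk
# resumes where the previous one stopped. All names are hypothetical.
read_in_chunks <- function(path, rows_per_read, handle,
                           parse = function(x) data.table::fread(text = x)) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)        # assumes the first line is a header
  repeat {
    lines <- readLines(con, n = rows_per_read)
    if (length(lines) == 0L) break        # end of file reached
    handle(parse(c(header, lines)))       # re-attach the header to every chunk
  }
  invisible(NULL)
}
```

A call such as `read_in_chunks("userinfo4.csv", 1e6, function(d) print(nrow(d)))` would then process the file block by block without rereading earlier lines. Only one block of raw lines is held at a time, at the cost of reading through R's line reader rather than fread's own file reader.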

1 answer

fread can take a shell command as its input: it runs the command and reads the command's output. Using this, we can preprocess the file with a gawk script that emits only the lines we need. Note that you may have to install gawk first (it is usually already present on Linux and other Unix-like machines; on Windows you will probably need to install it).

    n = 100  # lines to skip
    cmd = paste0('gawk "NR > ', n, '" ', filename)  # print only lines after line n
    lines = fread(cmd, nrows = rowPerRead)
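To use this inside the chunked loop from the question, the command can also be told where to stop, so gawk does not keep scanning the rest of the 20 GB file after the block has been printed. The sketch below is one possible way to do that, not part of the original answer: `chunk_cmd` and `read_chunk` are hypothetical helper names, the file is assumed to have a header on line 1, and the `cmd =` argument assumes a reasonably recent data.table version (older ones take the command as the first argument, as above).

```r
# Build a gawk command that prints the header plus one block of data rows.
# chunk_cmd() and read_chunk() are hypothetical helpers for illustration.
chunk_cmd <- function(filename, skip, nrows) {
  # format() avoids scientific notation such as "1e+08" in the command
  first <- format(skip + 1, scientific = FALSE, trim = TRUE)          # last line to drop
  last  <- format(skip + 1 + nrows, scientific = FALSE, trim = TRUE)  # last line to keep
  prog <- paste0("NR == 1 { print; next } ",   # always keep the header line
                 "NR > ", last, " { exit } ",  # stop once the block is complete
                 "NR > ", first)               # print the lines inside the block
  paste("gawk", shQuote(prog), shQuote(filename))
}

read_chunk <- function(filename, skip, nrows) {
  # skip counts data rows, so skip = 0 returns the first nrows rows
  data.table::fread(cmd = chunk_cmd(filename, skip, nrows))
}
```

With these helpers the loop becomes `for (i in 1:100) lines = read_chunk(filename, (i - 1) * rowPerRead, rowPerRead)`. gawk still has to scan the skipped lines one by one on every call, but it processes them as a stream, and each chunk keeps its column names because the header is printed every time.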

Source: https://habr.com/ru/post/1274447/

