If you want a sample of a fixed size n, but you don't know in advance how many rows the file has, here is example code that takes a simple random sample without ever holding the entire data set in memory:
n <- 1000
con <- file("jan08.csv", open = "r")
head <- readLines(con, 1)        # keep the header line
sampdat <- readLines(con, n)     # fill the reservoir with the first n rows
k <- n
while (length(curline <- readLines(con, 1))) {
  k <- k + 1
  if (runif(1) < n/k) {
    # replace a random reservoir row with probability n/k, which keeps
    # every row read so far equally likely to end up in the sample
    sampdat[sample(n, 1)] <- curline
  }
}
close(con)
delaysamp <- read.csv(textConnection(c(head, sampdat)))
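The R loop above is reservoir sampling. For readers more comfortable outside R, here is the same idea as a small Python sketch (an illustration under my own naming, not part of the original answer): fill a reservoir with the first n items, then let each later item k overwrite a random slot with probability n/k.

```python
import random

def reservoir_sample(lines, n, seed=None):
    """Uniform random sample of n items from an iterable of unknown
    length, holding at most n items in memory at a time."""
    rng = random.Random(seed)
    reservoir = []
    for k, line in enumerate(lines, start=1):
        if k <= n:
            reservoir.append(line)            # fill the reservoir first
        elif rng.random() < n / k:
            # replace a random slot with probability n/k, so every item
            # seen so far is equally likely to be in the final sample
            reservoir[rng.randrange(n)] = line
    return reservoir

sample = reservoir_sample(range(10_000), n=100, seed=1)
print(len(sample))  # 100
```

If the stream has fewer than n items, the function simply returns all of them, which matches what the R code does when the file has at most n data rows.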
If you will work with a large data set more than once, it is better to load the data into a database once and then draw samples from there.
The ff package is another option: it keeps the large data set in a file on disk while still letting you pull parts of it into R in a simple way.