How to select a specific proportion of lines from a large file in R?

I have a huge coordinate file of about 125 million lines. I want to sample these lines to get about 1% of them so that I can plot them. Is there a way to do this in R? The file is very simple; it has only 3 columns, and I'm only interested in the first two. A sample of the file looks like this:

 1211 2234 1233
 2348 ...

Any help / pointer is appreciated.

4 answers

As far as I understand your question, this may be useful:

 > set.seed(1)
 > big.file <- matrix(rnorm(1e3, 100, 3), ncol=2)  # simulating your big data
 >
 > # choosing 1% randomly
 > one.percent <- big.file[sample(1:nrow(big.file), 0.01*nrow(big.file)), ]
 > one.percent
           [,1]      [,2]
 [1,]  99.40541 106.50735
 [2,]  98.44774  98.53949
 [3,] 101.50289 102.74602
 [4,]  96.24013 104.97964
 [5,] 101.67546 102.30483

Then you can plot it:

 > plot(one.percent) 

If you have a fixed sample size that you want to select, and you don't know in advance how many rows the file has, here is example code that produces a simple random sample (via reservoir sampling) without keeping the entire data set in memory:

 n <- 1000
 con <- file("jan08.csv", open = "r")
 head <- readLines(con, 1)
 sampdat <- readLines(con, n)
 k <- n
 while (length(curline <- readLines(con, 1))) {
     k <- k + 1
     if (runif(1) < n/k) {
         sampdat[sample(n, 1)] <- curline
     }
 }
 close(con)
 delaysamp <- read.csv(textConnection(c(head, sampdat)))

If you will be working with the large data set more than once, it is better to load the data into a database once and then sample from there.
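A minimal sketch of the database route, assuming the DBI and RSQLite packages; the data frame and table names here are stand-ins for your real file:

```r
library(DBI)

# Stand-in for the real coordinate data; in practice you would read your
# file once (possibly in chunks) and load it into the database.
coords <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))

db <- tempfile(fileext = ".db")            # hypothetical on-disk database
con <- dbConnect(RSQLite::SQLite(), db)
dbWriteTable(con, "coords", coords, overwrite = TRUE)

# On every later run, a 1% random sample is a single cheap query.
n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM coords")$n
samp <- dbGetQuery(con, sprintf(
    "SELECT x, y FROM coords ORDER BY RANDOM() LIMIT %d",
    as.integer(round(n / 100))))
dbDisconnect(con)
```

The one-time load is the expensive step; after that, each sample avoids reading the whole file again.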

The ff package is another option: it stores the large data set in a file on disk while still letting you grab parts of it from within R in a simple way.
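A hedged sketch of the ff approach (the `read.table.ffdf` arguments and column names are assumptions; adjust them to your file's actual format):

```r
library(ff)

# Write a small stand-in file; the real file would be your 125M-line data.
tmp <- tempfile()
write.table(matrix(rnorm(3000), ncol = 3), tmp,
            row.names = FALSE, col.names = FALSE)

# Parse the file into an ffdf, which lives on disk rather than in RAM.
big <- read.table.ffdf(file = tmp, col.names = c("x", "y", "z"))

# Pull a random 1% of the rows into an ordinary in-memory data.frame.
idx <- sample(nrow(big), nrow(big) / 100)
one.percent <- big[idx, 1:2]
```

Subscripting the ffdf pulls only the selected rows into memory, so `one.percent` is small even when `big` is not.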


The sample_lines function from the LaF package is another option for reading a sample of lines from a file:

 library(LaF)
 datafile <- "file.txt"               # file in the working directory
 n <- determine_nlines(datafile)      # count the lines first
 sample_lines(datafile, n / 100)      # this gives 1% of the lines

More on sample_line: https://rdrr.io/cran/LaF/man/sample_lines.html


If you do not want to read the whole file into R, something like this?

 mydata <- matrix(nrow = 1250000, ncol = 2)   # the 2 columns of interest
 for (j in 1:1250000) {
     # read every 100th line, keeping only the first two of the three columns
     mydata[j, ] <- scan('myfile', skip = j*100 - 1, nlines = 1, quiet = TRUE)[1:2]
 }

plus whatever arguments the data format in your file requires (sep, header, etc.). Note that scan rescans the file from the beginning on every call, so this is slow on a file this size. And if you do not want evenly spaced samples, you will instead need to generate 1.25 million integers (1% of 125 million) sampled at random from 1:1.25e8 and use those as the skip values.

EDIT: my apologies, I originally neglected to include the nlines=1 argument.
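The randomly spaced variant can also be done in a single pass: pick the target line numbers up front, then stream through the file keeping only those lines. A sketch, with a small demo file standing in for the real data (for the real file use n_total = 1.25e8, n_keep = 1.25e6):

```r
sample_lines_from_file <- function(path, n_total, n_keep) {
    # Choose which line numbers to keep, sorted so we can stream in order.
    wanted <- sort(sample(n_total, n_keep))
    kept <- character(n_keep)
    con <- file(path, open = "r")
    i <- 0L   # current line number in the file
    j <- 1L   # next slot in 'wanted' / 'kept'
    while (j <= n_keep && length(line <- readLines(con, 1)) > 0) {
        i <- i + 1L
        if (i == wanted[j]) {
            kept[j] <- line
            j <- j + 1L
        }
    }
    close(con)
    kept
}

# Demo on a small stand-in file with three columns per line.
tmp <- tempfile()
writeLines(paste(1:1000, 1:1000, 1:1000), tmp)
samp <- sample_lines_from_file(tmp, n_total = 1000, n_keep = 10)
```

The kept lines can then be parsed with read.table(textConnection(samp)), keeping only the first two columns as in the question.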


Source: https://habr.com/ru/post/1501354/
