How to select a specific proportion of lines from a large file in R?

I have a huge coordinate file of about 125 million lines. I want to sample these lines to get about 1% of them so that I can plot them. Is there a way to do this in R? The file is very simple; it has only 3 columns, and I'm only interested in the first two. A sample of the file looks like this:

 1211 2234 1233
 2348 ...

Any help / pointer is appreciated.

4 answers

As far as I understand your question, this may be useful:

 > set.seed(1)
 > big.file <- matrix(rnorm(1e3, 100, 3), ncol=2)  # simulating your big data
 >
 > # choosing 1% randomly
 > one.percent <- big.file[sample(1:nrow(big.file), 0.01*nrow(big.file)), ]
 > one.percent
           [,1]      [,2]
 [1,]  99.40541 106.50735
 [2,]  98.44774  98.53949
 [3,] 101.50289 102.74602
 [4,]  96.24013 104.97964
 [5,] 101.67546 102.30483

Then you can plot it:

 > plot(one.percent) 

If you have a fixed sample size that you want to select, and you don't know in advance how many rows the file has, here is example code that produces a simple random sample (via reservoir sampling) without keeping the entire data set in memory:

 n <- 1000
 con <- file("jan08.csv", open = "r")
 head <- readLines(con, 1)
 sampdat <- readLines(con, n)
 k <- n
 while (length(curline <- readLines(con, 1))) {
     k <- k + 1
     if (runif(1) < n/k) {
         sampdat[sample(n, 1)] <- curline
     }
 }
 close(con)
 delaysamp <- read.csv(textConnection(c(head, sampdat)))

If you will be working with the large data set more than once, it is better to load the data into a database once and then sample from there.
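A minimal sketch of the database route, assuming the DBI and RSQLite packages; the data frame and table names here are stand-ins for your real file:

```r
library(DBI)

# Stand-in for the real coordinate data; in practice you would read your
# file once (possibly in chunks) and load it into the database.
coords <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))

db <- tempfile(fileext = ".db")            # hypothetical on-disk database
con <- dbConnect(RSQLite::SQLite(), db)
dbWriteTable(con, "coords", coords, overwrite = TRUE)

# On every later run, a 1% random sample is a single cheap query.
n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM coords")$n
samp <- dbGetQuery(con, sprintf(
    "SELECT x, y FROM coords ORDER BY RANDOM() LIMIT %d",
    as.integer(round(n / 100))))
dbDisconnect(con)
```

The one-time load is the expensive step; after that, each sample avoids reading the whole file again.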

The ff package is another option: it stores the large data set in a file on disk while still letting you grab parts of it from within R in a simple way.
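A hedged sketch of the ff approach (the `read.table.ffdf` arguments and column names are assumptions; adjust them to your file's actual format):

```r
library(ff)

# Write a small stand-in file; the real file would be your 125M-line data.
tmp <- tempfile()
write.table(matrix(rnorm(3000), ncol = 3), tmp,
            row.names = FALSE, col.names = FALSE)

# Parse the file into an ffdf, which lives on disk rather than in RAM.
big <- read.table.ffdf(file = tmp, col.names = c("x", "y", "z"))

# Pull a random 1% of the rows into an ordinary in-memory data.frame.
idx <- sample(nrow(big), nrow(big) / 100)
one.percent <- big[idx, 1:2]
```

Subscripting the ffdf pulls only the selected rows into memory, so `one.percent` is small even when `big` is not.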


The sample_lines function from the LaF package is another option for reading a sample of lines from a file:

 library(LaF)
 datafile <- "file.txt"               # file in the working directory
 n <- determine_nlines(datafile)      # count the lines first
 sample_lines(datafile, n / 100)      # this gives 1% of the lines

More on sample_line: https://rdrr.io/cran/LaF/man/sample_lines.html


If you do not want to read the whole file into R, something like this?

 mydata <- matrix(nrow = 1250000, ncol = 2)   # the 2 columns of interest
 for (j in 1:1250000) {
     # read every 100th line, keeping only the first two of the three columns
     mydata[j, ] <- scan('myfile', skip = j*100 - 1, nlines = 1, quiet = TRUE)[1:2]
 }

plus whatever arguments the data format in your file requires (sep, header, etc.). Note that scan rescans the file from the beginning on every call, so this is slow on a file this size. And if you do not want evenly spaced samples, you will instead need to generate 1.25 million integers (1% of 125 million) sampled at random from 1:1.25e8 and use those as the skip values.

EDIT: my apologies, I originally neglected to include the nlines=1 argument.
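The randomly spaced variant can also be done in a single pass: pick the target line numbers up front, then stream through the file keeping only those lines. A sketch, with a small demo file standing in for the real data (for the real file use n_total = 1.25e8, n_keep = 1.25e6):

```r
sample_lines_from_file <- function(path, n_total, n_keep) {
    # Choose which line numbers to keep, sorted so we can stream in order.
    wanted <- sort(sample(n_total, n_keep))
    kept <- character(n_keep)
    con <- file(path, open = "r")
    i <- 0L   # current line number in the file
    j <- 1L   # next slot in 'wanted' / 'kept'
    while (j <= n_keep && length(line <- readLines(con, 1)) > 0) {
        i <- i + 1L
        if (i == wanted[j]) {
            kept[j] <- line
            j <- j + 1L
        }
    }
    close(con)
    kept
}

# Demo on a small stand-in file with three columns per line.
tmp <- tempfile()
writeLines(paste(1:1000, 1:1000, 1:1000), tmp)
samp <- sample_lines_from_file(tmp, n_total = 1000, n_keep = 10)
```

The kept lines can then be parsed with read.table(textConnection(samp)), keeping only the first two columns as in the question.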


Source: https://habr.com/ru/post/1501354/
