Faster coding than using for loop

Question

Suppose I have the following data frame

    set.seed(36)
    n <- 300
    dat <- data.frame(x = round(runif(n, 0, 200)), y = round(runif(n, 0, 500)))
    d <- dat[order(dat$y), ]

For each value of d$y <= 300 I need to create a res variable in which the numerator is the sum of the indicator (d$x <= d$y[i]) and the denominator is the sum of the indicator (d$y >= d$y[i]). I wrote the code as a for loop:

    res <- NULL
    for (i in seq_len(sum(d$y <= 300))) {
      numerator <- sum(d$x <= d$y[i])
      denominator <- sum(d$y >= d$y[i])
      res[i] <- numerator / denominator
    }

But I am worried that as the number of observations of x and y grows, that is, as the number of rows of the data frame increases, the for loop will run slowly. In addition, if I simulate the data 1000 times and run the for loop each time, the program will be inefficient.

What could be a more efficient code solution?

r dplyr

user 31466 Feb 24 '17 at 3:07

2 answers

Instead of running a loop, compute all the numerators and denominators at once. This also lets you keep track of which res is associated with which x and y. Later you can keep only the ones you want.

You can use outer for element-wise comparison between all pairs of values from two vectors.

    numerator = rowSums(outer(d$y, d$x, ">="))    # Compare all y against all x
    denominator = rowSums(outer(d$y, d$y, "<="))  # Compare all y against itself
    res2 = numerator/denominator                  # Obtain 'res' for all rows
    # I would first 'cbind' res2 to d and only then keep the rows with 'y <= 300'
    res2 = res2[d$y <= 300]                       # Keep only those 'res' that you want

Since it uses rowSums, it should be faster.
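To make the outer() idea concrete, here is a minimal sketch on tiny made-up vectors (the values of x and y are assumptions for illustration, not data from the question). It shows that each row sum of the comparison matrix equals the count a loop would produce. Note that outer() allocates a full length(y) by length(x) matrix, so memory use grows with the product of the two lengths.

```r
# Tiny illustrative vectors (assumed values, not from the question)
y <- c(1, 3, 5)
x <- c(0, 2, 4, 6)

# outer() builds a length(y) x length(x) logical matrix;
# entry [i, j] is y[i] >= x[j]
m <- outer(y, x, ">=")

# Row i counts how many x values are <= y[i]
counts <- rowSums(m)
print(counts)

# The same counts from an explicit per-element computation, for comparison
loop_counts <- vapply(y, function(yi) sum(x <= yi), numeric(1))
stopifnot(isTRUE(all.equal(counts, loop_counts)))
```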

db Feb 24 '17 at 3:27

Frank · Accepted Answer · 2017-02-24T04:42:21+0000

This relies on d already being sorted by y, as in the question:

    # example data
    set.seed(36)
    n <- 1e5
    dat <- data.frame(x = round(runif(n, 0, 200)), y = round(runif(n, 0, 500)))
    d <- dat[order(dat$y), ]

My suggestion (thanks @alexis_laz for the denominator):

    system.time(res3 <- {
      xs <- sort(d$x)         # sorted x
      yt <- d$y[d$y <= 300]   # truncated y
      num = findInterval(yt, xs)
      den = length(d$y) - match(yt, d$y) + 1L
      num/den
    })
    #    user  system elapsed
    #       0       0       0

OP approach:

    system.time(res <- {
      res <- NULL
      for (i in seq_len(sum(d$y <= 300))) {
        numerator <- sum(d$x <= d$y[i])
        denominator <- sum(d$y >= d$y[i])
        res[i] <- numerator / denominator
      }
      res
    })
    #    user  system elapsed
    #   50.77    1.13   52.10
    # verify it matched
    all.equal(res, res3) # TRUE

@db's approach:

    system.time(res2 <- {
      numerator = rowSums(outer(d$y, d$x, ">="))
      denominator = rowSums(outer(d$y, d$y, "<="))
      res2 = numerator/denominator
      res2 = res2[d$y <= 300]
      res2
    })
    # Error: cannot allocate vector of size 74.5 Gb
    # ^ This error is common when using outer() on large-ish problems

Vectorization. Typically, tasks are faster in R if they can be vectorized. The key functions for working with ordered vectors have confusing names (findInterval, sort, order and cut), but fortunately they all work on vectors.
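As a small illustration of why findInterval and match reproduce the loop's counts, here is a sketch on made-up data (the vectors x and y are assumptions chosen for readability, with y already sorted and containing one repeated value). It compares the vectorized counts against brute-force sums.

```r
# Small made-up data; y is already sorted and has one repeated value
x <- c(5, 1, 4, 2, 8)
y <- c(2, 3, 3, 6, 9)

xs <- sort(x)

# findInterval(v, xs) returns, for each v, the number of xs values <= v
num <- findInterval(y, xs)

# match(v, y) returns the FIRST position of v in y; because y is sorted,
# every element from that position onward is >= v
den <- length(y) - match(y, y) + 1L

# Brute-force versions of the same counts
num_check <- vapply(y, function(v) sum(x <= v), numeric(1))
den_check <- vapply(y, function(v) sum(y >= v), numeric(1))

stopifnot(all(num == num_check), all(den == den_check))
```

Both helpers depend on sorted input: findInterval needs xs sorted, and the match trick needs y sorted so that the first occurrence marks the start of all values that are at least as large.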

Continuous and discrete. The match above should be a fast way to compute the denominator, whether the data are continuous or have mass points / repeated values. If the data are continuous (and therefore have no repeats), the denominator can simply be seq(length(xs), length = length(yt), by=-1). If the data are fully discrete with many repeats (as here), there may be a faster way, perhaps one of these:

    den2 <- inverse.rle(with(rle(yt), list(
      values = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
      lengths = lengths)))
    tab <- unname(table(yt))
    den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)
    # verify
    all.equal(den, den2) # TRUE
    all.equal(den, den3) # TRUE
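The no-repeats shortcut mentioned above can be checked with a short sketch. The data here are an assumption made for illustration: y is built from sample() without replacement so that it is guaranteed to have no ties, which is the condition the seq() shortcut needs.

```r
set.seed(1)                            # arbitrary seed for this illustration
n  <- 1000
x  <- runif(n, 0, 200)
y  <- sort(sample(5000, n)) / 10       # distinct values in (0, 500], sorted
xs <- sort(x)
yt <- y[y <= 300]

# The shortcut below is only valid when y has no repeated values
stopifnot(anyDuplicated(y) == 0)

# General denominator via match(), as in the answer
den_match <- length(y) - match(yt, y) + 1L

# Shortcut for tie-free data: n, n - 1, n - 2, ...
den_seq <- seq(length(xs), by = -1, length.out = length(yt))

stopifnot(all(den_match == den_seq))
```

With rounded data such as the question's, the anyDuplicated check would fail, which is why the answer falls back to match.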

findInterval will still work for the numerator with continuous data. I think it is not ideal for the repeated-values case considered here, since we redundantly find the interval for many duplicate yt values. Similar ideas for speeding this up probably apply.
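One way to avoid the redundant lookups is to call findInterval on the distinct yt values only and then expand the result back. This is a sketch of that idea, not something benchmarked in the answer; the smaller n and the name num_uniq are assumptions for illustration, and findInterval is already fast enough that the gain may be negligible.

```r
set.seed(36)
n   <- 1000                 # smaller than the answer's 1e5, just for illustration
dat <- data.frame(x = round(runif(n, 0, 200)), y = round(runif(n, 0, 500)))
d   <- dat[order(dat$y), ]

xs <- sort(d$x)
yt <- d$y[d$y <= 300]

# Look up each distinct y value once, then map the results back to
# the full-length vector with match()
uy       <- unique(yt)
num_uniq <- findInterval(uy, xs)[match(yt, uy)]

# Same result as calling findInterval on every element of yt
stopifnot(identical(num_uniq, findInterval(yt, xs)))
```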

Other options. As @chinsoon suggested, the data.table package might be appropriate if findInterval is too slow, as it has many functions oriented towards sorted data, but it is not obvious to me how to apply it here.
