This relies on d already being sorted by y, as below:
```r
# example data
set.seed(36)
n <- 1e5
dat <- data.frame(x = round(runif(n, 0, 200)),
                  y = round(runif(n, 0, 500)))
d <- dat[order(dat$y), ]
```
My suggestion (thanks @alexis_laz for the denominator):
```r
system.time(res3 <- {
    xs <- sort(d$x)                                   # sorted x values
    yt <- d$y[d$y <= 300]                             # y values of interest (d is sorted by y)
    numerator   <- findInterval(yt, xs)               # count of x <= each yt
    denominator <- length(d$y) - match(yt, d$y) + 1L  # count of y >= each yt
    numerator / denominator
})
```
The OP's approach:
```r
system.time(res <- {
    res <- NULL
    for (i in seq_len(sum(d$y <= 300))) {
        numerator   <- sum(d$x <= d$y[i])
        denominator <- sum(d$y >= d$y[i])
        res[i] <- numerator / denominator
    }
    res
})
```
@db:
```r
system.time(res2 <- {
    numerator   = rowSums(outer(d$y, d$x, ">="))
    denominator = rowSums(outer(d$y, d$y, "<="))
    res2 = numerator / denominator
    res2 = res2[d$y <= 300]
    res2
})
```
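As a quick sanity check (not part of the original answer), the loop and the findInterval/match approach can be compared on a smaller instance of the same simulated data; the names res and res3 mirror those above:

```r
# Smaller n so the check runs quickly (and outer()-style approaches would fit in memory)
set.seed(36)
n <- 1e3
dat <- data.frame(x = round(runif(n, 0, 200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y), ]

# loop version (OP)
res <- NULL
for (i in seq_len(sum(d$y <= 300))) {
    res[i] <- sum(d$x <= d$y[i]) / sum(d$y >= d$y[i])
}

# vectorized version (findInterval numerator, match-based denominator)
xs <- sort(d$x)
yt <- d$y[d$y <= 300]
res3 <- findInterval(yt, xs) / (length(d$y) - match(yt, d$y) + 1L)

stopifnot(all.equal(res, res3))   # both approaches agree
```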
Vectorization. Tasks are typically faster in R if they can be vectorized. The key functions related to ordered vectors have confusing names (findInterval, sort, order and cut), but fortunately they all operate on vectors.
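For instance (a small illustration, not from the original answer), findInterval counts, for each query value, how many elements of a sorted vector are less than or equal to it, which is exactly the numerator sum(d$x <= d$y[i]) computed in one vectorized call:

```r
xs <- c(1, 3, 3, 7, 10)       # must be sorted non-decreasing
findInterval(c(0, 3, 8), xs)
#> [1] 0 3 4
# 0 elements of xs are <= 0, three are <= 3, four are <= 8
```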
Continuous vs. discrete. The match above should be a quick way to compute the denominator whether the data are continuous or have mass points / repeated values. If the data are continuous (and hence have no repeats), the denominator is simply seq(length(xs), length = length(yt), by = -1). If they are heavily discrete with many repeats (as here), there may be faster ways, perhaps one of:
```r
den2 <- inverse.rle(with(rle(yt), list(
    values  = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
    lengths = lengths)))

tab  <- unname(table(yt))
den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)
```
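A quick way to convince yourself these agree (a sketch, not from the original answer; it assumes xs = sort(d$x) and yt the sorted subset d$y[d$y <= 300], here replaced by tiny toy vectors of the same shape):

```r
# Toy stand-ins: y sorted with repeats, x of the same length
y  <- c(1, 1, 2, 4, 4, 4, 8, 9)
x  <- c(5, 2, 9, 9, 1, 7, 0, 3)
xs <- sort(x)
yt <- y[y <= 4]                           # sorted subset with ties

den1 <- length(y) - match(yt, y) + 1L     # match-based denominator from above

den2 <- inverse.rle(with(rle(yt), list(
    values  = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
    lengths = lengths)))

tab  <- unname(table(yt))
den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)

stopifnot(identical(den1, as.integer(den2)),
          identical(den1, as.integer(den3)))   # all three denominators match
```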
findInterval will still work for the numerator with continuous data. It is not ideal for the case with duplicated values considered here, though, I think (since we redundantly find the interval for many repeated yt values). Similar ideas for speeding that step up probably apply.
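One such idea (a sketch, not from the original answer): since yt is sorted, its runs are exactly its unique values, so findInterval can be called once per unique value and the result expanded back with inverse.rle:

```r
# Numerator via findInterval on unique yt values only, expanded by run lengths.
# Assumes yt is sorted (as it is here, since d is sorted by y) and xs is sorted.
num_fast <- function(yt, xs) {
    r <- rle(yt)                       # runs of yt == its unique values, in order
    inverse.rle(list(values  = findInterval(r$values, xs),
                     lengths = r$lengths))
}

# toy check against the plain vectorized call
xs <- sort(c(0, 2, 2, 5, 9))
yt <- c(1, 1, 2, 6, 6)
stopifnot(identical(num_fast(yt, xs), findInterval(yt, xs)))
```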
Other options. As @chinsoon suggested, the data.table package might be worth a look if findInterval turns out to be too slow, since it has many features oriented towards sorted data, but it is not obvious to me how to apply it here.