A more efficient strategy for which () or match ()

I have a vector of positive and negative numbers

vec<-c(seq(-100,-1), rep(0,20), seq(1,100)) 

the vector is larger than the example, and takes a random set of values. I have to re-find the number of negative numbers in the vector ... I find this to be pretty inefficient.

Since I only need to find the number of negative numbers, and the vector is sorted, I need to know only the index of the first 0 or a positive number (in actual random vectors, there may be 0).

I am currently using this code to find the length

 length(which(vec<0)) 

but it makes R go through the whole vector, but since it is sorted, there is no need.

I could use

 match(0, vec) 

but my vector does not always have 0s

So my question is: is there some kind of match () function that applies the condition instead of finding a specific value? Or is there a more efficient way to run my code ()?

thanks

+6
source share
3 answers

The solutions proposed so far include creating a logical(length(vec)) and performing a full or partial scan. As you noticed, the vector is sorted. We can use this by doing a binary search. I began to think that I would be super-smart and implement it in C even faster, but I had problems debugging the indexing of the algorithm (which is the difficult part!). So I wrote this in R:

 f3 <- function(x) { imin <- 1L imax <- length(x) while (imax >= imin) { imid <- as.integer(imin + (imax - imin) / 2) if (x[imid] >= 0) imax <- imid - 1L else imin <- imid + 1L } imax } 

For comparison with other offers

 f0 <- function(v) length(which(v < 0)) f1 <- function(v) sum(v < 0) f2 <- function(v) which.min(v < 0) - 1L 

and for pleasure

 library(compiler) f3.c <- cmpfun(f3) 

Going to

 > vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6)) > identical(f0(vec), f1(vec)) [1] TRUE > identical(f0(vec), f2(vec)) [1] TRUE > identical(f0(vec), f3(vec)) [1] TRUE > identical(f0(vec), f3.c(vec)) [1] TRUE > microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec)) Unit: microseconds expr min lq median uq max neval f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903 100 f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293 100 f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889 100 f3(vec) 51.715 56.050 75.4495 78.5295 100.730 100 f3.c(vec) 11.612 17.147 28.5570 31.3160 49.781 100 

There are probably a few difficult cases where I was wrong! Moving to C, I did

 library(inline) f4 <- cfunction(c(x = "numeric"), " int imin = 0, imax = Rf_length(x) - 1, imid; while (imax >= imin) { imid = imin + (imax - imin) / 2; if (REAL(x)[imid] >= 0) imax = imid - 1; else imin = imid + 1; } return ScalarInteger(imax + 1); ") 

with

 > identical(f3(vec), f4(vec)) [1] TRUE > microbenchmark(f3(vec), f3.c(vec), f4(vec)) Unit: nanoseconds expr min lq median uq max neval f3(vec) 52096 53192.0 54918.5 55539.0 69491 100 f3.c(vec) 10924 12233.5 12869.0 13410.0 20038 100 f4(vec) 553 796.0 893.5 1004.5 2908 100 

findInterval appeared when a similar question was asked in R-help . It is slow but safe, checking that the vec actually sorted and dealing with NA values. If someone wants to live on the edge (perhaps no worse than the f3 or f4 implementation), then

 f5.i <- function(v) .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE)) 

almost as fast as the C implementation, but probably more robust and vectorized (i.e., finding the vector of values ​​in the second argument for simplified calculations in the form of a range).

+15
source

Use sum() and a logical comparison:

 sum( vec < 0 ) [1] 100 

This will be pretty fast, and when you sum the boolean, TRUE is 1 and FALSE is 0, then the total will be the number of negative values.

Oh, I feel the need for a comparative comparison ... :-) Vector length 2e5

 library(microbenchmark) vec<-c(seq(-100,-1,length.out=1e5), rep(0,20), seq(1,100,length.out=1e5)) microbenchmark( (which.min(vec < 0) - 1L) , (sum( vec < 0 )) ) Unit: milliseconds expr min lq median uq max neval (which.min(vec < 0) - 1L) 1.883847 2.130746 2.554725 3.141787 75.943911 100 (sum(vec < 0)) 1.398100 1.500639 1.508688 1.745088 2.662164 100 
+3
source

You can use which.min

  which.min(vec < 0) - 1L 

This will return the first FALSE value, i.e. first 0.

+2
source

Source: https://habr.com/ru/post/943614/


All Articles