Why is R's standard median function much slower than a simple C++ alternative?

Question

I wrote the following implementation of the median in C++ and used it in R through Rcpp:

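#include <Rcpp.h>
#include <algorithm>
#include <vector>

// [[Rcpp::export]]
double median2(std::vector<double> x){
  // x is a copy of the R vector, so sorting it leaves the caller's data intact
  double median;
  size_t size = x.size();
  std::sort(x.begin(), x.end());
  if (size  % 2 == 0){
      // even length: average the two middle elements
      median = (x[size / 2 - 1] + x[size / 2]) / 2.0;
  }
  else {
      median = x[size / 2];
  }
  return median;
}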

When I then compare its performance with R's built-in median function using microbenchmark, I get the following results:

> x = rnorm(100)
> microbenchmark(median(x),median2(x))
Unit: microseconds
       expr    min     lq     mean median     uq     max neval
  median(x) 25.469 26.990 34.96888 28.130 29.081 518.126   100
 median2(x)  1.140  1.521  2.47486  1.901  2.281  47.897   100

Why is the standard median function so slow? This is not what I expected.

+4

c++ performance r microbenchmark rcpp

Ruben Jan 13 '16 at 15:50


3 answers

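Your own code can also be improved: it sorts the entire input even though only one or two elements are needed.

You can go from O(n log n) to O(n) by using std::nth_element instead of std::sort. For an even number of elements, use std::nth_element to find the element just before the middle, then std::min_element to find the one immediately after it. Because std::nth_element also partitions the input, std::min_element only has to run over the elements above the middle after the nth_element call, not over the whole array. After nth_element, every element before the selected position is no greater than it, and every element after it is no smaller.

The complexity of std::nth_element is "linear on average", and (of course) std::min_element is linear as well, so the overall complexity is linear.

So, for the simple case (an odd number of elements), you get something like: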

auto pos = x.begin() + x.size()/2;

std::nth_element(x.begin(), pos, x.end());
return *pos;

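...and for the more complex case (an even number of elements), where pos must point at the element just before the middle:

auto pos = x.begin() + x.size()/2 - 1;

std::nth_element(x.begin(), pos, x.end());
auto pos2 = std::min_element(pos+1, x.end());
return (*pos + *pos2) / 2.0;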
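The two fragments in this answer can be combined into one function. The following is only a sketch in plain C++ (no Rcpp); the name median_nth and the choice to throw on empty input are assumptions made for illustration, not part of the original answer.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Median via partial selection instead of a full sort.
// x is taken by value so the caller's data is not reordered.
// median_nth is a hypothetical name used only for this sketch.
double median_nth(std::vector<double> x) {
    if (x.empty()) throw std::invalid_argument("median of empty input");
    const std::size_t n = x.size();
    if (n % 2 == 1) {
        // Odd size: the middle element is the median.
        auto pos = x.begin() + n / 2;
        std::nth_element(x.begin(), pos, x.end());
        return *pos;
    }
    // Even size: select the element just before the middle ...
    auto pos = x.begin() + n / 2 - 1;
    std::nth_element(x.begin(), pos, x.end());
    // ... then take the smallest element of the upper partition.
    auto pos2 = std::min_element(pos + 1, x.end());
    return (*pos + *pos2) / 2.0;
}
```

In the even case only the upper partition is scanned, which is what keeps the whole computation linear on average.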

+2

Jerry Coffin Jan 13 '16 at 16:58


0

tofro Jan 13 '16 at 16:07

nrussell · Accepted Answer · 2016-01-13T16:09:25+0000

As @joran noted, your code is very specialized, and generally speaking, less generalized functions, algorithms, etc. are often more performant. Take a look at median.default:

median.default
# function (x, na.rm = FALSE) 
# {
#   if (is.factor(x) || is.data.frame(x)) 
#     stop("need numeric data")
#   if (length(names(x))) 
#     names(x) <- NULL
#   if (na.rm) 
#     x <- x[!is.na(x)]
#   else if (any(is.na(x))) 
#     return(x[FALSE][NA])
#   n <- length(x)
#   if (n == 0L) 
#     return(x[FALSE][NA])
#   half <- (n + 1L)%/%2L
#   if (n%%2L == 1L) 
#     sort(x, partial = half)[half]
#   else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
# }

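Several operations are there to accommodate the possibility of missing values, and they add to the overall execution time of the function. Since your function does not replicate this behavior, it can skip those calculations, but as a consequence it does not give the same result for vectors with missing values: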

median(c(1, 2, NA))
#[1] NA

median2(c(1, 2, NA))
#[1] 2
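The difference in results comes from the checks that median.default performs before it selects the middle element. A minimal sketch of what comparable handling might look like in plain C++ follows. It assumes NaN stands in for R's NA, and median_na and its na_rm parameter are hypothetical names chosen to mirror the R signature.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper that mimics the missing-value handling of
// median.default. Assumption: NaN plays the role of R's NA.
double median_na(std::vector<double> x, bool na_rm = false) {
    auto is_nan = [](double v) { return std::isnan(v); };
    if (na_rm) {
        // Drop missing values, like x[!is.na(x)] in R.
        x.erase(std::remove_if(x.begin(), x.end(), is_nan), x.end());
    } else if (std::any_of(x.begin(), x.end(), is_nan)) {
        // Any missing value makes the result missing.
        return std::numeric_limits<double>::quiet_NaN();
    }
    if (x.empty()) return std::numeric_limits<double>::quiet_NaN();
    const std::size_t n = x.size();
    auto mid = x.begin() + n / 2;
    std::nth_element(x.begin(), mid, x.end());
    if (n % 2 == 1) return *mid;
    // Even size: average the middle element with the largest element below it.
    return (*mid + *std::max_element(x.begin(), mid)) / 2.0;
}
```

The extra pass over the data for the NaN check is the kind of cost the specialized median2 avoids.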

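A couple of other factors probably have less effect than the handling of NAs, but are worth pointing out:

median, along with several of the functions it uses, is an S3 generic, so a small amount of time is spent on method dispatch.
median works with more than just integer and numeric vectors; it also handles Date, POSIXt, and probably other classes, and preserves their attributes correctly: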

median(Sys.Date() + 0:4)
#[1] "2016-01-15"

median(Sys.time() + (0:4) * 3600 * 24)
#[1] "2016-01-15 11:14:31 EST"

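Edit: I should mention that the function below sorts the original vector in place, because a NumericVector is a proxy object. To avoid this, either Rcpp::clone the input vector and operate on the clone, or use your original signature (with a std::vector<double>), which implicitly makes a copy in the conversion from SEXP to std::vector.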
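The effect of working on the caller's data versus a copy can be illustrated without Rcpp. The sketch below is a loose analogy only: a C++ reference plays the role of the NumericVector proxy, and median_inplace and median_copy are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Operates directly on the caller's storage, so the caller's vector ends up
// sorted. Assumes x is non-empty.
double median_inplace(std::vector<double>& x) {
    std::sort(x.begin(), x.end());
    const std::size_t n = x.size();
    if (n % 2 == 0) return (x[n / 2 - 1] + x[n / 2]) / 2.0;
    return x[n / 2];
}

// Pass by value: a copy is made at the call, so the caller's data is untouched.
double median_copy(std::vector<double> x) {
    return median_inplace(x);
}
```

Calling median_copy on a vector leaves its element order unchanged, while median_inplace leaves it sorted. The copy costs time on every call, which is the trade-off described above.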

Also note that you can shave off a little more time by using a NumericVector instead of a std::vector<double>:

#include <Rcpp.h>

// [[Rcpp::export]]
double cpp_med(Rcpp::NumericVector x){
  std::size_t size = x.size();
  std::sort(x.begin(), x.end());
  if (size  % 2 == 0) return (x[size / 2 - 1] + x[size / 2]) / 2.0;
  return x[size / 2];
}

microbenchmark::microbenchmark(
  median(x),
  median2(x),
  cpp_med(x),
  times = 200L
)
# Unit: microseconds
#       expr    min      lq      mean  median      uq     max neval
#  median(x) 74.787 81.6485 110.09870 92.5665 129.757 293.810   200
# median2(x)  6.474  7.9665  13.90126 11.0570  14.844 151.817   200
# cpp_med(x)  5.737  7.4285  11.25318  9.0270  13.405  52.184   200

The full sort is not needed either. Using std::nth_element on a clone of the input:

#include <Rcpp.h>

// [[Rcpp::export]]
double cpp_med2(Rcpp::NumericVector xx) {
  Rcpp::NumericVector x = Rcpp::clone(xx);
  std::size_t n = x.size() / 2;
  std::nth_element(x.begin(), x.begin() + n, x.end());

  if (x.size() % 2) return x[n]; 
  return (x[n] + *std::max_element(x.begin(), x.begin() + n)) / 2.;
}

set.seed(123)
xx <- rnorm(10e5)

all.equal(cpp_med2(xx), median(xx))
all.equal(median2(xx), median(xx))

microbenchmark::microbenchmark(
  cpp_med2(xx), median2(xx), 
  median(xx), times = 200L
)
# Unit: milliseconds
#         expr      min       lq     mean   median       uq       max neval
# cpp_med2(xx) 10.89060 11.34894 13.15313 12.72861 13.56161  33.92103   200
#  median2(xx) 84.29518 85.47184 88.57361 86.05363 87.70065 228.07301   200
#   median(xx) 46.18976 48.36627 58.77436 49.31659 53.46830 250.66939   200
