How to operate on each row of a data.table without using the apply function

I wrote a simple function below:

mcs <- function(v) { ifelse(sum((diff(sort(v)) > 6) > 0), NA, sd(v)) }

It is supposed to take a vector, sort it, and check whether any difference between successive sorted values is greater than 6. It returns NA if such a difference exists, and the standard deviation of the vector otherwise.

I would like to apply this function to every row of a data.table (using only certain columns) and add the value returned for each row as a new column of the data.table.

For example, for a data table such as

> dat <- data.table(A=c(1,2,3,4,5), B=c(2,3,4,10,6), C=c(3,4,10,6,8), D=c(3,3,3,3,3))  
> dat  
   A  B  C D  
1: 1  2  3 3  
2: 2  3  4 3  
3: 3  4 10 3  
4: 4 10  6 3  
5: 5  6  8 3  

I would like to generate the output below. (The function is applied to columns 2, 3, and 4 of each row.)

> dat
   A  B  C D        sd
1: 1  2  3 3 0.5773503
2: 2  3  4 3 0.5773503
3: 3  4 10 3 3.7859389
4: 4 10  6 3 3.5118846
5: 5  6  8 3 2.5166115

I found that a row-wise operation can be performed on a data.table as follows:

> dat[, sd:=apply(.SD, 1, mcs), .SDcols=(c(2,3,4))]

This works, but it is too slow for my script: [...] ~300 000 [...] ~800 [...] R [...]

Edit
[...] mcs [...]


If you rewrite the function in C++ with Rcpp, it runs about 100 times faster.

For example, take a table with 1000 rows and 5 columns:

set.seed(123)
dat <- data.table(A = rnorm(1e3, sd=4), B = rnorm(1e3, sd=4), C = rnorm(1e3, sd=4),
                  D = rnorm(1e3, sd=4), E = rnorm(1e3, sd=4))

The C++ function below does the row-wise work itself, replacing R's apply:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector mcs2(DataFrame x) {
    int n = x.nrows();  // number of rows
    int m = x.size();   // number of columns
    // Copy the columns into a numeric matrix so that rows can be extracted.
    NumericMatrix mat(n, m);
    for ( int j = 0; j < m; ++j ) {
        mat(_, j) = NumericVector(x[j]);
    }
    NumericVector result(n);
    for ( int i = 0; i < n; ++i ) {
        // Copy row i and sort the copy.
        NumericVector tmp = mat(i, _);
        std::sort(tmp.begin(), tmp.end());
        // The result is NA if any gap between sorted neighbours exceeds 6 ...
        bool do_sd = true;
        for ( int j = 1; j < m; ++j ) {
            if ( tmp[j] - tmp[j-1] > 6.0 ) {
                result[i] = NA_REAL;
                do_sd = false;
                break;
            }
        }
        // ... and the standard deviation of the row otherwise.
        if ( do_sd ) {
            result[i] = sd(tmp);
        }
    }
    return result;
}
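
The per-row logic inside mcs2() can also be tried outside R. The following is a minimal standalone C++ sketch of it, not part of the original answer: the names mcs_row, sample_sd and max_gap are hypothetical, and NaN stands in for R's NA_REAL, since plain C++ has no NA value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Sample standard deviation (n - 1 denominator), the definition R's sd() uses.
// Precondition: at least two values.
double sample_sd(const std::vector<double>& v) {
    assert(v.size() >= 2);
    const double n = static_cast<double>(v.size());
    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= n;
    double ss = 0.0;  // sum of squared deviations from the mean
    for (double x : v) ss += (x - mean) * (x - mean);
    return std::sqrt(ss / (n - 1.0));
}

// Hypothetical helper mirroring the per-row logic of mcs2():
// returns NaN (standing in for NA) if any gap between sorted neighbours
// exceeds max_gap, and the sample standard deviation otherwise.
// The row is taken by value, so the caller's data is not reordered.
double mcs_row(std::vector<double> row, double max_gap = 6.0) {
    const double na = std::numeric_limits<double>::quiet_NaN();
    if (row.size() < 2) return na;  // sd is undefined for fewer than two values
    std::sort(row.begin(), row.end());
    for (std::size_t j = 1; j < row.size(); ++j) {
        if (row[j] - row[j - 1] > max_gap) return na;
    }
    return sample_sd(row);
}
```

For instance, mcs_row({4.0, 10.0, 3.0}) evaluates to about 3.786, which matches row 3 of the expected output in the question, while mcs_row({1.0, 2.0, 9.0}) yields NaN because the sorted values contain a gap of 7.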

We can check that it gives the same result as the R function (mcs in the question, called mcs1 here):

all.equal(apply(dat[, 2:4], 1, mcs1), mcs2(dat[,2:4]))

[1] TRUE

And benchmark the two (benchmark() here is from the rbenchmark package):

benchmark(mcs1 = dat[, sd:=apply(.SD, 1, mcs1), .SDcols=(c(2,3,4))],
          mcs2 = dat[, sd:=mcs2(.SD), .SDcols=(c(2,3,4))],
          order = 'relative',
          columns = c('test', 'elapsed', 'relative', 'user.self'))


  test elapsed relative user.self
2 mcs2    0.19    1.000     0.183
1 mcs1   21.34  112.316    20.044

To learn C++ with Rcpp, see Hadley Wickham's Advanced R, which has a chapter on Rcpp. [...]

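To use this, you first need to install Rcpp. Run

install.packages("Rcpp")

in R. You also need a working C++ toolchain; on Debian-based Linux systems such as Ubuntu, run

sudo apt install r-base-dev

in a terminal. On Mac or Windows, follow the setup instructions Wickham gives.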

With Rcpp, the C++ code goes in its own file, say "SOanswer.cpp". To make mcs2() available in R, add the following to your R script:

library(Rcpp)
sourceCpp("SOanswer.cpp") # assuming the file is in your working directory

That's it! Your R script can now call mcs2(). [...] Rcpp [...] Wickham [...] RStudio [...]


Source: https://habr.com/ru/post/1690506/

