Loop optimization on a large data file in R, possibly using Rcpp

I have a loop in R that is pretty slow (but it works). Currently, this calculation takes about 3 minutes on my laptop, and I think it can be improved. In the end, I go through many data files that perform calculations based on the results of this code, and would like to make the current code faster if possible.

In principle, for each date, for 11 different values ​​of X, the cycle captures the last values ​​of summer precipitation (Y) for X years, finds linear inverse weighing (Z), so that the oldest values ​​of precipitation have the least value, (Y) and weight ( Z) to get the vector A, then takes the sum of A as the final result. This is done for thousands of dates.

However, I could not come up with or find advice in order to make it faster in R, so I tried to rewrite it in Rcpp, in which I had limited knowledge. My Rcpp code does not duplicate the R code exactly, as the resulting matrix is ​​different (wrong) from what it should be (out1 vs out2; I know that out1 is correct). It seems that the Rcpp code is faster, but I can only check it with a few columns because it starts to fail (fatal error in RStudio) if I try to run all 11 columns (i <= 10).

I am looking for feedback on how I can improve the R code and / or fix the Rcpp code to ensure the correct result, and not a failure in the process.

(Although the code I posted below does not show it, the data is loaded into R in the same way as [dataframe] for several calculations performed outside the code shown. Only column 2 of the data frame is used.)

The data file is here: https://drive.google.com/file/d/0Bw_Ca37oxVmJekFBR2t4eDdKeGM/view?usp=sharing

Attempt at R

library(readxl)

library(readxl)
library(Rcpp)
file = data.frame(read_excel("lake.xlsx", trim_ws=T)
col_types=c("date","numeric","numeric","date",rep("numeric",4),"text")))
file[,1] = as.Date(file[,1], "%Y/%m/%d", tz="UTC")
file[,4] = as.Date(file[,4], "%Y/%m/%d", tz="UTC")

rainSUM = function(df){
rainsum = data.frame("6m"=as.numeric(), "1yr"=as.numeric(), "2yr"=as.numeric(), "3yr"=as.numeric(), "4yr"=as.numeric(), "5yr"=as.numeric(), "6yr"=as.numeric(), "7yr"=as.numeric(), "8yr"=as.numeric(), "9yr"=as.numeric(), "10yr"=as.numeric()) # create dataframe for storing the sum of weighted last d values

  Tdays <- length(df[,1])

  for(i in 1:11) {           # loop through the lags
    if (i==1) {
      d <- 183               # 6 month lag only has 183 days,
    } else {
      d <- (i-1)*366         # the rest have 366 days times the number of years
    }
    w <- 0:(d-1)/d

    for(k in 1:Tdays) {      # loop through rows of rain dataframe (k = row)
      if(d>k){               # get number of rain values needed for the lag
        rainsum[k,i] <- sum(df[1:k,2] * w[(d-k+1):d])                                
      } else{
        rainsum[k,i] <- sum(df[(k-d+1):k,2] * w)                          
      }
    }
  }
  return(rainsum)
}

out1 <- rainSUM(file)

Trying in Rcpp

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]

NumericVector myseq(int first, int last) {  // simulate R X:Y sequence (step of 1)
  NumericVector y(0);
  for (int i=first; i<=last; ++i)
    y.push_back(i);
  return(y);
}

// [[Rcpp::export]]   
NumericVector splicer(NumericVector vec, int first, int last) {  // splicer
  NumericVector y(0);
  for (int i=first; i<=last; ++i)
    y.push_back(vec[i]);
  return(y);
}

// [[Rcpp::export]]              
NumericVector weighty(int d) {             // calculate inverse linear weight according to the number of days in lag 
  NumericVector a = myseq(1,d);            // sequence 1:d; length d
  NumericVector b = (a-1)/a;               // inverse linear
  return(b);                               // return vector                              
} 

// [[Rcpp::export]]              
NumericMatrix rainsumCPP(DataFrame df, int raincol) {
  NumericVector q(0);
  NumericMatrix rainsum(df.nrows(), 11);       // matrix with number of row days as data file and 11 columns 
  NumericVector p = df( raincol-1 );           // grab rain values (remember C++ first index is 0)
  for(int i = 0; i <= 10; i++) {                // loop through 11 columns (C++ index starts at 0!)
    if (i==0) {
      int d = 183;                             // 366*years lag days 
      NumericVector w = weighty(d);            // get weights for this lag series
      for(int k = 0; k < df.nrows(); k++) {    // loop through days (rows)
        if(d>k){                               // if not enough lag days for row, use what available
          NumericVector m = splicer(p, 0, k);  // subset rain values according to the day being considered
          NumericVector u = splicer(w, (d-k), (d-1));   // same for weight
          m = m*u;                              // multiply rain values by weights
          rainsum(k,i) = sum(m);               // add the sum of the weighted rain to the rainsum matrix
        } else{
          NumericVector m = splicer(p, k-d+1, k);
          m = m*w;
          rainsum(k,i) = sum(m);
        }
      }
    }
    else {
      int d = i*366;                           // 183 lag days if column 0
      NumericVector w = weighty(d);            // get weights for this lag series
      for(int k = 0; k < df.nrows(); k++) {    // loop through days (rows)
        if(d>k){                               // if not enough lag days for row, use what available
          NumericVector m = splicer(p, 0, k);  // subset rain values according to the day being considered
          NumericVector u = splicer(w, (d-k), (d-1));   // same for weight
          m = m*u;                             // multiply rain values by weights
          rainsum(k,i) = sum(m);               // add the sum of the weighted rain to the rainsum matrix
        } else{
          NumericVector m = splicer(p, k-d+1, k);
          m = m*w;
          rainsum(k,i) = sum(m);
        }
      } 
    }
  }
  return(rainsum);
}


/*** R
out2 = rainsumCPP(file, raincol) # raincol currently = 2
  */
+4
source share
1 answer

Congratulations! You have index out of bounds (OOB) causing undefined behavior (UB) ! You may discover this in the future by changing a vector accessory from []to ()and for matrix accessories from ()to .at().

Switching to these accessors gives:

Error in rainsumCPP(file, 2) : 
  Index out of bounds: [index=182; extent=182].

, , 0 1 , (, - 1).

, .

myseq(), splicer() weighty() R- . , all.equal(R_result, Rcpp_Result). : 1. myseq, splicer 2. weighty.

, , , .

// [[Rcpp::export]]
NumericVector myseq(int first, int last) {  // simulate R X:Y sequence (step of 1)
  int vec_len = abs(last - first);

  NumericVector y = no_init(vec_len);

  int count = 0;
  for (int i = first; i < last; ++i) {
    y(count) = count;
    count++;
  }

  return y;
}

// [[Rcpp::export]]   
NumericVector splicer(NumericVector vec, int first, int last) {  // splicer

  int vec_len = abs(last - first);

  NumericVector y = no_init(vec_len);

  int count = 0;
  for (int i = first; i < last; ++i) {
    y(count) = vec(i);

    count++;
  }

  return y;
}

// [[Rcpp::export]]              
NumericVector weighty(int d) {       // calculate inverse linear weight according to the number of days in lag 
  NumericVector a = myseq(0, d - 1); // (fixed) sequence 1:d; length d
  NumericVector b = a / d;           // (fixed) inverse linear
  return(b);                         // return vector                              
}

, , rainsumCpp, , R.

+1

Source: https://habr.com/ru/post/1687334/


All Articles