I have a loop in R that is pretty slow (but it works). Currently, this calculation takes about 3 minutes on my laptop, and I think it can be improved. In the end, I go through many data files that perform calculations based on the results of this code, and would like to make the current code faster if possible.
In principle, for each date, for 11 different values of X, the cycle captures the last values of summer precipitation (Y) for X years, finds linear inverse weighing (Z), so that the oldest values of precipitation have the least value, (Y) and weight ( Z) to get the vector A, then takes the sum of A as the final result. This is done for thousands of dates.
However, I could not come up with or find advice in order to make it faster in R, so I tried to rewrite it in Rcpp, in which I had limited knowledge. My Rcpp code does not duplicate the R code exactly, as the resulting matrix is different (wrong) from what it should be (out1 vs out2; I know that out1 is correct). It seems that the Rcpp code is faster, but I can only check it with a few columns because it starts to fail (fatal error in RStudio) if I try to run all 11 columns (i <= 10).
I am looking for feedback on how I can improve the R code and / or fix the Rcpp code to ensure the correct result, and not a failure in the process.
(Although the code I posted below does not show it, the data is loaded into R in the same way as [dataframe] for several calculations performed outside the code shown. Only column 2 of the data frame is used.)
The data file is here: https://drive.google.com/file/d/0Bw_Ca37oxVmJekFBR2t4eDdKeGM/view?usp=sharing
Attempt at R
library(readxl)
library(readxl)
library(Rcpp)
file = data.frame(read_excel("lake.xlsx", trim_ws=T)
col_types=c("date","numeric","numeric","date",rep("numeric",4),"text")))
file[,1] = as.Date(file[,1], "%Y/%m/%d", tz="UTC")
file[,4] = as.Date(file[,4], "%Y/%m/%d", tz="UTC")
rainSUM = function(df){
rainsum = data.frame("6m"=as.numeric(), "1yr"=as.numeric(), "2yr"=as.numeric(), "3yr"=as.numeric(), "4yr"=as.numeric(), "5yr"=as.numeric(), "6yr"=as.numeric(), "7yr"=as.numeric(), "8yr"=as.numeric(), "9yr"=as.numeric(), "10yr"=as.numeric())
Tdays <- length(df[,1])
for(i in 1:11) {
if (i==1) {
d <- 183
} else {
d <- (i-1)*366
}
w <- 0:(d-1)/d
for(k in 1:Tdays) {
if(d>k){
rainsum[k,i] <- sum(df[1:k,2] * w[(d-k+1):d])
} else{
rainsum[k,i] <- sum(df[(k-d+1):k,2] * w)
}
}
}
return(rainsum)
}
out1 <- rainSUM(file)
Trying in Rcpp
#include <Rcpp.h>
using namespace Rcpp;
NumericVector myseq(int first, int last) {
NumericVector y(0);
for (int i=first; i<=last; ++i)
y.push_back(i);
return(y);
}
NumericVector splicer(NumericVector vec, int first, int last) {
NumericVector y(0);
for (int i=first; i<=last; ++i)
y.push_back(vec[i]);
return(y);
}
NumericVector weighty(int d) {
NumericVector a = myseq(1,d);
NumericVector b = (a-1)/a;
return(b);
}
NumericMatrix rainsumCPP(DataFrame df, int raincol) {
NumericVector q(0);
NumericMatrix rainsum(df.nrows(), 11);
NumericVector p = df( raincol-1 );
for(int i = 0; i <= 10; i++) {
if (i==0) {
int d = 183;
NumericVector w = weighty(d);
for(int k = 0; k < df.nrows(); k++) {
if(d>k){
NumericVector m = splicer(p, 0, k);
NumericVector u = splicer(w, (d-k), (d-1));
m = m*u;
rainsum(k,i) = sum(m);
} else{
NumericVector m = splicer(p, k-d+1, k);
m = m*w;
rainsum(k,i) = sum(m);
}
}
}
else {
int d = i*366;
NumericVector w = weighty(d);
for(int k = 0; k < df.nrows(); k++) {
if(d>k){
NumericVector m = splicer(p, 0, k);
NumericVector u = splicer(w, (d-k), (d-1));
m = m*u;
rainsum(k,i) = sum(m);
} else{
NumericVector m = splicer(p, k-d+1, k);
m = m*w;
rainsum(k,i) = sum(m);
}
}
}
}
return(rainsum);
}