A while back, I wrote a function to fill time series matrices that had NA values according to the required specifications, and it has been used occasionally on several matrices of about 50,000 rows and 350 columns. A matrix can contain either numeric or character values. The main problem is that fixing a matrix is slow, and I would appreciate some expert advice on how to do this faster.
I think switching to Rcpp or parallel processing could help, but I suspect it is my design, not R itself, that is inefficient. I generally vectorize everything in R, but since the missing values do not follow a pattern, I have not found any way other than working through the matrix row by row.
The function needs to be callable in two ways: to carry forward missing values, and to quickly fill the trailing values with the last known value.
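To make the two call modes concrete, here is a minimal single-row sketch in base R. It is not the original function: `FillRowSketch` is a hypothetical name, and the rules it encodes (leading NAs are never filled, gaps longer than `fillGapMax` are left untouched, and `forwardLooking=FALSE` fills only a trailing gap) are assumptions inferred from the example outputs shown later in this post.

```r
# Hypothetical single-row illustration; NOT the original GetMatrixWithBlanksFilled.
FillRowSketch <- function(x, fillGapMax, forwardLooking) {
  n <- length(x)
  runs <- rle(is.na(x))                  # consecutive runs of NA / non-NA
  ends <- cumsum(runs$lengths)           # last index of each run
  starts <- ends - runs$lengths + 1L     # first index of each run
  for (k in which(runs$values)) {        # visit NA runs only
    if (starts[k] == 1L) next            # leading NAs: no earlier value to carry
    isTrailing <- ends[k] == n
    if (!forwardLooking && !isTrailing) next  # assumption: FALSE fills only the trailing gap
    if (runs$lengths[k] > fillGapMax) next    # assumption: longer gaps stay NA
    x[starts[k]:ends[k]] <- x[starts[k] - 1L] # carry the last known value forward
  }
  x
}

# Values taken from row ID_j of the example matrix.
rowJ <- c(4.03, NA, 4.76, 4.76, 4.76, 4.39, 4.39, 4.39, NA)
FillRowSketch(rowJ, fillGapMax = 12, forwardLooking = TRUE)
# [1] 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39
FillRowSketch(rowJ, fillGapMax = 1, forwardLooking = FALSE)
# [1] 4.03   NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39
```

This only pins down the intended behaviour on one row; it does nothing for speed.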
Here is an example matrix:
    testMatrix <- structure(c(
        NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, 4.03,
        NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA,
        NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76,
        NA, NA, NA, NA, 66.89, NA, -12.78, NA, NA, 4.76,
        NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76,
        NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.39,
        NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39,
        NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39,
        NA, NA, 4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA),
      .Dim = c(10L, 9L),
      .Dimnames = list(
        c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e", "ID_f", "ID_g", "ID_h", "ID_i", "ID_j"),
        c("2010-09-30", "2010-10-31", "2010-11-30", "2010-12-31", "2011-01-31",
          "2011-02-28", "2011-03-31", "2011-04-30", "2011-05-31")))

    print(testMatrix)
         2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
    ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
    ID_d      29.98      29.98      29.98         NA      29.98      29.98      29.98      29.98         NA
    ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
    ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
    ID_h     -11.65     -11.65         NA         NA         NA         NA     -11.65     -11.65     -38.61
    ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
    ID_j       4.03         NA       4.76       4.76       4.76       4.39       4.39       4.39         NA
This is the function I'm using right now:
# ----------------------------------------------------------------------------
Then I call it with something like:
    > fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE)
    > print(fixedMatrix1)
         2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
    ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
    ID_d      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98
    ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
    ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
    ID_h     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -38.61
    ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
    ID_j       4.03       4.03       4.76       4.76       4.76       4.39       4.39       4.39       4.39
or
    > fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE)
    > print(fixedMatrix2)
         2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
    ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
    ID_d      29.98      29.98      29.98         NA      29.98      29.98      29.98      29.98      29.98
    ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
    ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
    ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
    ID_h     -11.65     -11.65         NA         NA         NA         NA     -11.65     -11.65     -38.61
    ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
    ID_j       4.03         NA       4.76       4.76       4.76       4.39       4.39       4.39       4.39
This example is fast, but is there a way to do it faster for large matrices?
    > n <- 38
    > m <- 5000
    > bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE)
    > system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
       user  system elapsed
      86.47    0.06   87.24

This dummy matrix has many NA-only rows and completely filled rows, but real ones can take about 15-20 minutes.
UPDATE
Regarding Charles's comment about na.locf: it does not fully reflect the logic above. The following is a simplified version of the final function, with input checks etc. left out:
    FillGaps <- function( dataMatrix, fillGapMax ) {
      require("zoo")
      numRow <- nrow(dataMatrix)
      numCol <- ncol(dataMatrix)
      iteration <- (numCol-fillGapMax)
      if(length(iteration)>0) {
        for (i in iteration:1) {
          tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE]
          tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax))
          dataMatrix[,i:(i+fillGapMax)] <- tempMatrix
        }
      }
      return(dataMatrix)
    }
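For reference, the `maxgap` argument passed to `zoo::na.locf` can be seen in isolation on a plain vector. This is a minimal sketch that assumes the zoo package is installed; the input vector is made up for illustration and is not taken from the data above.

```r
library(zoo)

# Made-up input: one run of three NAs and one run of a single NA.
x <- c(1, NA, NA, NA, 2, NA, 3)

# Runs of more than maxgap consecutive NAs are kept as NA;
# shorter runs are filled with the last observed value.
filled <- na.locf(x, na.rm = FALSE, maxgap = 2)
print(filled)
# [1]  1 NA NA NA  2  2  3
```

Note that a run longer than `maxgap` is left entirely unfilled rather than being filled for its first `maxgap` positions.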