How can I make this matrix fill function R faster?

Question

How can I make this matrix fill function R faster?

A back, I wrote a function to populate time series matrices that had NA values in accordance with the required specifications and had octave applications on several matrices, which are about 50,000 rows, 350 columns. A matrix can contain either numeric or symbolic values. The main problem is that the correction of the matrix is slow, and I thought that I would appreciate some experts on how to do this faster.

I think switching to rcpp or parallel can help, but I think it could be my design, not R itself, which is inefficient. I generally vecotrize everything in R, but since the missing values do not match the pattern, I did not find any other way than working with a matrix based on each row.

The function must be called so that it can carry the missing missing values, and also called to quickly fill the last values with the last known.

Here is an example matrix:

testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, 4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39, NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA, 4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L, 9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e", "ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31", "2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31", "2011-04-30", "2011-05-31"))) print(testMatrix) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA

This is the function I'm using right now:

 # ---------------------------------------------------------------------------- # GetMatrixWithBlanksFilled # ---------------------------------------------------------------------------- # # Arguments: # inputMatrix --- A matrix with gaps in the time series rows # fillGapMax --- The max number of columns to carry a number # forward if there are no more values in the # time series row. # # Returns: # A matrix with gaps filled. GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) { if("DEBUG_ON" %in% ls(globalenv())){browser()} cntRow <- nrow(inputMatrix) cntCol <- ncol(inputMatrix) # if (forwardLooking) { for (i in 1:cntRow) { # Store the location of the first non NA element in the row firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1] if (!(is.na(firstValueCol))) { if (!(firstValueCol == cntCol)) { nextValueCol <- firstValueCol # If there is aa value number in the row and it not at the end of the time # series, start iterating through the row while there are more NA values and # more data values and not at the end of the row continue. while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) { # Find the next NA element nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1] # Find the next value element nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1] # If there is another value element then fill up all NA elements in between with the last known value if (!is.na(nextValueCol)) { inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)] } else { # If there is no other value element then fill up all NA elements up to the max number supplied # with the last known value unless it close to the end of the row then just fill up to the end. inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)] nextValueCol <- cntCol } } } } } } else { for (i in 1:cntRow) { if (is.na(inputMatrix[i,ncol(inputMatrix)])) { tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])] if (length(tempRow[!is.na(tempRow)])>0) { lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])] inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])] } } } } return(inputMatrix) }

Then I call it with something like:

 > fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE) > print(fixedMatrix1) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39

or

 > fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE) > print(fixedMatrix2) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 29.98 ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39

This example is fast, but is there a way to do it faster for large matrices?

 > n <- 38 > m <- 5000 > bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE) > system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE)) user system elapsed 86.47 0.06 87.24

In this fictitious there are many NSs only lines and completely filled, but normal ones can take about 15-20 minutes.

UPDATE

Regarding Charles's comment on na.locf, which does not fully reflect the logic of the above: The following is a simplified version of how the final function excludes input checks, etc.:

 FillGaps <- function( dataMatrix, fillGapMax ) { require("zoo") numRow <- nrow(dataMatrix) numCol <- ncol(dataMatrix) iteration <- (numCol-fillGapMax) if(length(iteration)>0) { for (i in iteration:1) { tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE] tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax)) dataMatrix[,i:(i+fillGapMax)] <- tempMatrix } } return(dataMatrix) }

+6

matrix r

Hansi Jun 16 '11 at 10:30

source share

1 answer

daroczig · Accepted Answer · 2011-06-16T10:50:16+0000

Maybe I'm wrong, but I think this is implemented in the zoo package: use the na.locf function.

With your given sample matrix, we must first transfer it, and after calling the function na we “relay” the result matrix. For instance:

 > t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=12)) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39

And with a little maxgap :

 > t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=0)) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA

The performance obtained with na.locf can be seen:

 > system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE)) user system elapsed 79.238 0.540 80.398 > system.time(output <- t(na.locf(t(bigM), na.rm=FALSE, maxgap=12))) user system elapsed 17.129 0.267 17.513

How can I make this matrix fill function R faster?

More articles: