How can I make this matrix fill function R faster?

A back, I wrote a function to populate time series matrices that had NA values โ€‹โ€‹in accordance with the required specifications and had octave applications on several matrices, which are about 50,000 rows, 350 columns. A matrix can contain either numeric or symbolic values. The main problem is that the correction of the matrix is โ€‹โ€‹slow, and I thought that I would appreciate some experts on how to do this faster.

I think switching to rcpp or parallel can help, but I think it could be my design, not R itself, which is inefficient. I generally vecotrize everything in R, but since the missing values โ€‹โ€‹do not match the pattern, I did not find any other way than working with a matrix based on each row.

The function must be called so that it can carry the missing missing values, and also called to quickly fill the last values โ€‹โ€‹with the last known.

Here is an example matrix:

testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, 4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39, NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA, 4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L, 9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e", "ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31", "2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31", "2011-04-30", "2011-05-31"))) print(testMatrix) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA 

This is the function I'm using right now:

 # ---------------------------------------------------------------------------- # GetMatrixWithBlanksFilled # ---------------------------------------------------------------------------- # # Arguments: # inputMatrix --- A matrix with gaps in the time series rows # fillGapMax --- The max number of columns to carry a number # forward if there are no more values in the # time series row. # # Returns: # A matrix with gaps filled. GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) { if("DEBUG_ON" %in% ls(globalenv())){browser()} cntRow <- nrow(inputMatrix) cntCol <- ncol(inputMatrix) # if (forwardLooking) { for (i in 1:cntRow) { # Store the location of the first non NA element in the row firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1] if (!(is.na(firstValueCol))) { if (!(firstValueCol == cntCol)) { nextValueCol <- firstValueCol # If there is aa value number in the row and it not at the end of the time # series, start iterating through the row while there are more NA values and # more data values and not at the end of the row continue. while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) { # Find the next NA element nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1] # Find the next value element nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1] # If there is another value element then fill up all NA elements in between with the last known value if (!is.na(nextValueCol)) { inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)] } else { # If there is no other value element then fill up all NA elements up to the max number supplied # with the last known value unless it close to the end of the row then just fill up to the end. inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)] nextValueCol <- cntCol } } } } } } else { for (i in 1:cntRow) { if (is.na(inputMatrix[i,ncol(inputMatrix)])) { tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])] if (length(tempRow[!is.na(tempRow)])>0) { lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])] inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])] } } } } return(inputMatrix) } 

Then I call it with something like:

 > fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE) > print(fixedMatrix1) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39 

or

 > fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE) > print(fixedMatrix2) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 29.98 ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39 

This example is fast, but is there a way to do it faster for large matrices?

 > n <- 38 > m <- 5000 > bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE) > system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE)) user system elapsed 86.47 0.06 87.24 

In this fictitious there are many NSs only lines and completely filled, but normal ones can take about 15-20 minutes.

UPDATE

Regarding Charles's comment on na.locf, which does not fully reflect the logic of the above: The following is a simplified version of how the final function excludes input checks, etc.:

 FillGaps <- function( dataMatrix, fillGapMax ) { require("zoo") numRow <- nrow(dataMatrix) numCol <- ncol(dataMatrix) iteration <- (numCol-fillGapMax) if(length(iteration)>0) { for (i in iteration:1) { tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE] tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax)) dataMatrix[,i:(i+fillGapMax)] <- tempMatrix } } return(dataMatrix) } 
+6
source share
1 answer

Maybe I'm wrong, but I think this is implemented in the zoo package: use the na.locf function.

With your given sample matrix, we must first transfer it, and after calling the function na we โ€œrelayโ€ the result matrix. For instance:

 > t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=12)) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39 

And with a little maxgap :

 > t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=0)) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA 

The performance obtained with na.locf can be seen:

 > system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE)) user system elapsed 79.238 0.540 80.398 > system.time(output <- t(na.locf(t(bigM), na.rm=FALSE, maxgap=12))) user system elapsed 17.129 0.267 17.513 
+6
source

Source: https://habr.com/ru/post/890665/


All Articles