Is it possible to vectorize code when the data is in a list?

I am optimizing my code and running into some problems. I know that the biggest speed-ups in R come from vectorizing code instead of using loops. However, my data is stored in lists, and I'm not sure whether I can vectorize my code or not. I tried the apply functions (e.g. lapply, vapply), but I have read that these functions are mainly for writing cleaner code and actually use loops under the hood!

Here are the three biggest bottlenecks in my code, although I don't think anything can be done about the first one.

1) Reading data

I work with batches of 1000 matrices, each of size 277x349. This is the biggest bottleneck in my script, but I have eased the problem a bit by using the doMC package to take advantage of multiple cores with the foreach function. The result is a list containing 1000 matrices of size 277x349.

For the purpose of this question, let's say we have a list of 1000 matrices of dimensions 277 x 349:

    # Fake data
    l <- list()
    for(i in 1:1000) {
        l[[i]] <- matrix(rnorm(277*349), nrow=277, ncol=349)
    }

2) Bottleneck number 1

I need to compare each matrix against a reference matrix (of the same size). This means comparing the 1000 matrices in my list with my reference matrix to get a vector of 1000 distances. Given that the matrices are all the same size, can I vectorize this step?

Here is the code:

    # The reference matrix
    r <- matrix(rnorm(277*349), nrow=277, ncol=349)

    # The number of non-NA values in the matrix. Do not need to worry about this...
    K <- 277*349

    # Make a function to calculate distances
    distance <- function(xi, xj, K, na.rm=TRUE) {
        sqrt(sum((xi - xj)^2, na.rm=na.rm) / K)
    }

    # Get a vector containing all the distances
    d <- vapply(l, distance, c(0), xj=r, K=K)

This step is already done with vapply, but it is still the third slowest part of the code.

3) Bottleneck number 2

Now I want to compute a weighted average matrix of the J "closest" matrices to my reference matrix. (There is a sorting step, but suppose d[1] < d[2] < ... < d[1000] for simplicity.) I want to get the weighted average matrix for J = 1, 2, ..., 1000.

    # Get the weighted matrix
    weightedMatrix <- function(listOfData, distances, J) {
        # Calculate the weights:
        w <- distances[1:J]^{-2} / sum(distances[1:J]^{-2})
        # Get the weighted average matrix
        # *** I use a loop here ***
        x_bar <- matrix(0, nrow=nrow(listOfData[[1]]), ncol=ncol(listOfData[[1]]))
        for(i in 1:J) {
            x_bar <- x_bar + {listOfData[[i]] * w[i]}
        }
        return(x_bar)
    }

    # Oh no! Another loop...
    res <- list()
    for(i in 1:length(l)) {
        res[[i]] <- weightedMatrix(l, d, J=i)
    }

I'm a bit stumped. I don't see an intuitive way to vectorize operations over a list of matrices.

The script I am writing will be called quite often, so even small improvements add up!


EDIT:

RE: 1) Reading data

I forgot to mention that my data is in a special format, so I have to use special reading functions to get it into R. The files are in netCDF4 format; I use the nc_open function from the ncdf4 package to access a file and then ncvar_get to read the variable of interest. The nice thing is that the data stays on disk until I pull it into memory with ncvar_get, at which point I can work with it in R.

Although I do know the size of my matrices and how many of them I will have, I posed my question in terms of a list of data because the foreach function, which lets me do the reads in parallel, returns the results of the parallel loop as a list. I found that using foreach made the data-reading step about 3 times faster.
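In rough outline, the parallel read looks something like the sketch below (the file names, the variable name, and the core count are placeholders, not my real setup):

    library(ncdf4)
    library(foreach)
    library(doMC)
    registerDoMC(12)                            # core count is machine-specific

    files   <- sprintf("data_%04d.nc", 1:1000)  # placeholder file names
    varName <- "myvar"                          # placeholder variable name

    # Each iteration opens one file, pulls the 277x349 variable into memory,
    # closes the file, and returns the matrix; foreach collects them in a list.
    l <- foreach(f = files) %dopar% {
        nc <- nc_open(f)
        m  <- ncvar_get(nc, varName)
        nc_close(nc)
        m
    }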

I suppose that after reading I could rearrange the data into a 3D array, but perhaps the time spent building and indexing the 3D array would cost more than it saves? I will have to try this tomorrow.
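A minimal way to test that, assuming everything fits in memory at once (simplify2array is base R; the object names here are just for illustration):

    # Build a 277 x 349 x 1000 array from the existing list and check the cost
    system.time(a <- simplify2array(l))
    dim(a)                        # 277 349 1000
    identical(a[ , , 5], l[[5]])  # slices match the corresponding list elements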


EDIT 2:

Here are some of the timings I took for my script.

Original Script:

 [1] "Reading data to memory" user system elapsed 176.063 44.070 26.611 [1] "Calculating Distances" user system elapsed 2.312 0.000 2.308 [1] "Calculating the best 333 weighted matrices" user system elapsed 63.697 28.495 9.092 

So far I have made the following improvements: (1) pre-allocated the list before reading the data in, (2) improved the weighted matrix calculation, following Martin Morgan's suggestion.

 [1] "Reading data to memory" user system elapsed 192.448 38.578 27.872 [1] "Calculating Distances" user system elapsed 2.324 0.000 2.326 [1] "Calculating all 1000 weighted matrices" user system elapsed 1.376 0.000 1.374 

Some notes:

I use 12 cores in my foreach to read in the data (registerDoMC(12)). The whole script takes about 40 / 36 seconds to run before / after the improvements.

The time for my bottleneck number 2 improved quite a bit. Previously, I only calculated the top third (i.e. the best 333) of the weighted matrices, but now the script can calculate all 1000 weighted matrices in a fraction of the original time.

Thanks for the help. I will try to tweak my code later to see if I can rework the script to operate on 3D arrays instead of lists. I'm going to spend some time checking the calculations to make sure they are correct!

1 answer

My "low hanging fruits" ( scan , pre-selection and filling) seem to be irrelevant, so ...

The operations in the distance calculation look vectorized enough to me. You could probably squeeze out some extra speed by doing a single distance calculation across all your matrices at once, but that would probably make the code less clear.
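For illustration, a minimal sketch of that single-calculation idea, assuming everything fits in memory as one 3D array and reusing r, K, and d from the question (a and d2 are new names introduced here):

    # Stack the list into a 277 x 349 x 1000 array, subtract r (recycled
    # across the 3rd dimension), and sum the squares over the first two dims
    a  <- simplify2array(l)
    d2 <- sqrt(colSums((a - as.vector(r))^2, na.rm = TRUE, dims = 2) / K)
    all.equal(d, d2)   # should be TRUE

Whether this beats the vapply version depends on whether the cost of building the 3D array is amortized across other steps.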

The weightedMatrix calculation looks like it has room for improvement. Let's calculate

    w <- d^(-2) / cumsum(d^(-2))

For a weighted matrix m, I think the relation between successive matrices is simply m' = m * (1 - w[i]) + l[[i]] * w[i], so

 res <- vector("list", length(l)) for (i in seq_along(l)) if (i == 1L) { res[[i]] = l[[i]] * w[[i]] } else { res[[i]] = res[[i - 1]] * (1 - w[[i]]) + l[[i]] * w[[i]] } 

This changes the calculation of res from quadratic to linear. My thoughts about better-than-linear performance were just (perhaps mistaken) musings; I did not pursue them.
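As a quick sanity check, the linear version can be compared against the question's weightedMatrix for an arbitrary J, say J = 10:

    all.equal(res[[10]], weightedMatrix(l, d, J = 10))   # should be TRUE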

Returning to pre-allocate and fill, and @flodel's comment, we have

    f0 <- function(n) {
        ## good: pre-allocate and fill
        l = vector("list", n)
        for (i in seq_along(l))
            l[[i]] = 1
        l
    }

    f1 <- function(n) {
        ## bad: copy and append
        l = list()
        for (i in seq_len(n))
            l[[i]] = 1
        l
    }

which give the same result

    > identical(f0(100), f1(100))
    [1] TRUE

but with different performance

    > sapply(10^(1:5), function(i) system.time(f0(i))[3])
    elapsed elapsed elapsed elapsed elapsed 
      0.000   0.000   0.002   0.014   0.134 
    > sapply(10^(1:5), function(i) system.time(f1(i))[3])
    elapsed elapsed elapsed elapsed elapsed 
      0.000   0.001   0.005   0.253  24.520 

Even though the absolute difference does not matter at the scale of the current problem, it seems worth adopting the better pre-allocate-and-fill strategy so you don't have to guess whether it is relevant or not. Better still, use the *apply family, or in this case replicate, so you don't have to think about it:

    l <- replicate(1000, matrix(rnorm(277*349), nrow=277, ncol=349), simplify=FALSE)
