I am optimizing my code and running into some problems. I know that the biggest speed-ups in R come from vectorizing code instead of using loops. However, my data lives in lists, and I'm not sure whether my code can be vectorized at all. I tried the apply functions (e.g. lapply, vapply), but I read that these functions are mainly for writing cleaner code and actually use loops under the hood!
Here are the three biggest bottlenecks in my script, although I don't think anything can be done about the first one.
1) Reading data
I work with batches of 1000 matrices, each of size 277x349. Reading them in is the biggest bottleneck in my script, but I eased the problem a bit by using the doMC package to take advantage of multiple cores with the foreach function. The result is a list containing 1000 277x349 matrices.
For the purpose of the question, let's say we have a list of 1000 matrices of dimensions 277 x 349:
    # Fake data
    l <- list()
    for (i in 1:1000) {
      l[[i]] <- matrix(rnorm(277 * 349), nrow = 277, ncol = 349)
    }
2) Bottleneck number 1
I need to compare each matrix against a reference matrix (of the same size). This means comparing the 1000 matrices in my list to my reference matrix to get a vector of 1000 distances. Given that the matrices are all the same size, can I vectorize this step?
This step is currently done with vapply (a sketch follows), but it is still the third slowest part of the code.
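A minimal sketch of this step, using a plain Euclidean distance as a stand-in for the actual comparison:

    # Reference matrix of the same dimensions
    r <- matrix(rnorm(277 * 349), nrow = 277, ncol = 349)

    # One distance per matrix in the list
    d <- vapply(l, function(x) sqrt(sum((x - r)^2)), numeric(1))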
3) Bottleneck number 2
Now I want to compute a weighted average matrix from the J "closest" matrices to my reference matrix. (There is a sorting step, but assume d[1] < d[2] < ... < d[1000] for simplicity.) I want to obtain the weighted average matrix for J = 1, 2, ..., 1000.
    # Get the weighted matrix
    weightedMatrix <- function(listOfData, distances, J) {
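The body of the function didn't make it into the post, but for concreteness here is a minimal sketch using inverse-square distance weights (the weighting scheme here is illustrative, not necessarily the one my script uses):

    # Sketch: weighted average of the J "closest" matrices.
    # NOTE: the 1/d^2 weighting is an illustrative assumption.
    weightedMatrix <- function(listOfData, distances, J) {
      w <- 1 / distances[1:J]^2          # assumed weights
      m <- listOfData[[1]] * w[1]        # running weighted sum
      if (J > 1) {
        for (i in 2:J) {
          m <- m + listOfData[[i]] * w[i]
        }
      }
      m / sum(w)                         # normalize by the total weight
    }

For example, weightedMatrix(l, sort(d), 333) would give the weighted average of the 333 closest matrices (with l reordered by distance in the real script).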
I'm a bit stumped. I do not see an intuitive way to vectorize operations on a list of matrices.
The script I am writing will be called quite often, so even a small improvement adds up!
EDIT:
RE: 1) Reading data
I forgot to mention that my data is in a special format, so I have to use a special function to read it into R. The files are in the netCDF-4 format; I use the nc_open function from the ncdf4 package to open a file, and then ncvar_get to read the variable of interest into memory so I can operate on it in R.
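For illustration, reading a single file looks roughly like this (the file and variable names are placeholders, not my real ones):

    library(ncdf4)

    nc <- nc_open("file_0001.nc")        # placeholder file name
    m  <- ncvar_get(nc, "myVariable")    # placeholder variable name; one 277x349 matrix
    nc_close(nc)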
I do know the dimensions of my matrices and how many of them I will have, but I posed my question with a list of data because the foreach function, which lets me do the reads in parallel, returns the results of the parallel loop as a list. I found that the data-reading step was about 3 times faster with foreach.
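Roughly, the parallel read looks like this (again with placeholder names):

    library(doMC)      # loads foreach as a dependency
    library(ncdf4)
    registerDoMC(12)

    files <- sprintf("file_%04d.nc", 1:1000)   # placeholder file names
    l <- foreach(f = files) %dopar% {
      nc <- nc_open(f)
      m  <- ncvar_get(nc, "myVariable")        # placeholder variable name
      nc_close(nc)
      m                                        # foreach collects these into a list
    }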
I assume that I could then rearrange the data into a 3D array, but perhaps the time spent building and indexing the 3D array would outweigh what it saves? I will have to try this tomorrow.
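For what it's worth, the array-based version of the distance step would look something like this (l and r as above, still with Euclidean distance as a stand-in):

    # Stack the list into a 277 x 349 x 1000 array
    a <- simplify2array(l)

    # One distance per slice, no explicit loop; r recycles correctly
    # because the first two dimensions vary fastest
    d <- sqrt(colSums((a - as.vector(r))^2, dims = 2))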
EDIT 2:
Here are some of the timings I took for my script.
Original Script:
[1] "Reading data to memory" user system elapsed 176.063 44.070 26.611 [1] "Calculating Distances" user system elapsed 2.312 0.000 2.308 [1] "Calculating the best 333 weighted matrices" user system elapsed 63.697 28.495 9.092
So far I have made the following improvements: (1) pre-allocated the list before reading in the data, and (2) improved the weighted-matrix calculations following Martin Morgan's suggestion. The new timings:
[1] "Reading data to memory" user system elapsed 192.448 38.578 27.872 [1] "Calculating Distances" user system elapsed 2.324 0.000 2.326 [1] "Calculating all 1000 weighted matrices" user system elapsed 1.376 0.000 1.374
Some notes:
I use 12 cores for reading in the data with foreach (registerDoMC(12)). The whole script takes about 40 s to run before the improvements and about 36 s after.
The timing for my bottleneck number 2 improved quite a bit. Previously I computed only the top third (i.e. the 333 closest) of the weighted matrices, but now the script can compute all 1000 weighted matrices in a fraction of the original time (a sketch of the idea is below).
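For reference, one way to get all the cumulative weighted averages in a single pass (not necessarily identical to the answer I used, and still assuming the illustrative 1/d^2 weights from above):

    o  <- order(d)                     # sort by distance, closest first
    w  <- 1 / d[o]^2                   # illustrative inverse-square weights
    a  <- simplify2array(l[o])         # 277 x 349 x 1000 array
    wa <- sweep(a, 3, w, "*")          # scale each matrix by its weight
    cs <- apply(wa, c(1, 2), cumsum)   # 1000 x 277 x 349: running sums over J
    cw <- cumsum(w)                    # running total weight
    J  <- 333
    weightedAvgJ <- cs[J, , ] / cw[J]  # the J-th weighted average matrix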
Thanks for the help! I will try tweaking my code later to see whether my script can work with 3D arrays instead of lists, and I'm going to spend some time verifying the calculations to make sure they're correct!