Writing to large matrices inside a function is fast and slow

Question

[Question changed after answers]

Thanks for the answers. My question was unclear, for which I apologize.

I will try to give more detailed information about our situation. We have c. 100 matrices that we store in an environment. Each of them is very large. If at all possible, we want to avoid copying these matrices when performing updates. We often run into the 2 GB memory limit, so this is very important for us.

So, our two requirements: 1) avoid copies and 2) indirectly access matrices by name. Speed, although important, is a side issue that could be addressed by avoiding copying.

It seems to me that Tommy's solution involves making a copy (although it fully answered my original question, so the fault is mine).

Below is the code that seems most obvious to us, but it clearly creates a copy (as shown by the increase in memory.size()):

    myenv <- new.env()
    myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)

    testfnDirect <- function(paramEnv) {
      print(memory.size())
      for (i in 1:300) {
        temp <- paramEnv$testmat1[10,]
        paramEnv$testmat1[10,] <- temp * 0
      }
      print(memory.size())
    }
    system.time(testfnDirect(myenv))

Using with() seems to avoid this, as shown below:

    myenv <- new.env()
    myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)

    testfnDirect <- function(paramEnv) {
      print(gc())
      varname <- "testmat1"  # unused, but see text
      with(paramEnv, {
        for (i in 1:300) {
          temp <- testmat1[10,]
          testmat1[10,] <- temp * 0
        }
      })
      print(gc())
    }
    system.time(testfnDirect(myenv))

However, this code works by accessing testmat1 directly by name. Our problem is that we need to refer to the matrices indirectly (we do not know in advance which matrices we will update).

Is there a way to change testfnDirect so that we use the varname variable rather than hardcoding testmat?
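To make the request concrete, here is a rough sketch of the kind of indirect version being asked for. It is only an illustration under assumptions: the function name testfnIndirect and the placeholder symbol M are made up, and it builds the same loop as the with() version by using substitute() to swap the placeholder for the name held in varname, then evaluates it inside the environment. Whether this actually avoids the copy depends on the R version and has not been measured here; comparing gc() output before and after would be the way to check.

```r
myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow = 6000, ncol = 200)

# Hypothetical indirect variant: `varname` is a character string chosen at run time.
testfnIndirect <- function(paramEnv, varname) {
  stopifnot(exists(varname, envir = paramEnv, inherits = FALSE))
  # Replace the placeholder symbol M with the symbol named by varname.
  expr <- substitute(
    for (i in 1:300) {
      temp <- M[10, ]
      M[10, ] <- temp * 0
    },
    list(M = as.name(varname))
  )
  # Evaluate the loop inside the environment, as with() would do.
  eval(expr, envir = paramEnv)
  invisible(paramEnv)
}

testfnIndirect(myenv, "testmat1")
```

As with the with() version, the loop variables i and temp end up as objects inside paramEnv, which may or may not be acceptable.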


matrix r

Sjc Nov 17 '11 at 15:02


2 answers

Patrick Burns · Answer 1 · 2011-11-18T18:43:29+0000

A fairly recent change to the data.table package was to avoid copying when changing values. Therefore, if your application can handle data.tables for other operations, this might be the solution. (And it will be fast.)

Tommy · Answer 2 · 2011-11-17T17:49:24+0000

Well, it would be nice if you could explain why the first solution is not OK... It looks much neater and it runs faster.

To answer the questions:

The "nested replacement" operation, such as foo[bar][baz] <- 42, is very complex and is optimized in certain cases to avoid copying. But it is very likely that your specific use case is not optimized. This will result in more copies and lower performance.
A way to test this theory is to call gcinfo(TRUE) before your tests. Then you will see that the first solution triggers 2 garbage collections, while the second triggers about 160!
Here's a variant of your second solution that converts the environment to a list, does its work and converts it back to an environment. It is as fast as your first solution.

code:

    testfnList <- function() {
      mylist <- as.list(myenv, all.names=TRUE)
      thisvar <- "testmat2"
      for (i in 1:300) {
        temp <- mylist[[thisvar]][10,]
        mylist[[thisvar]][10,] <- temp * 0
      }
      myenv <<- as.environment(mylist)
    }
    system.time(testfnList())  # 0.02 secs

... which, of course, would be neater if you passed myenv to the function as an argument. A small improvement (if you loop many more times than 300) is to index by number instead of by name (this works for lists but not for environments). Just change thisvar:

 thisvar <- match("testmat2", names(mylist))
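Putting those two suggestions together, a minimal sketch might look like the following. The helper name zero_row_via_list and its arguments row and reps are made up for illustration. Writing the result back with assign() into the same environment, rather than rebuilding the environment with as.environment(), is an assumption and not part of the code above. The first write into the list probably still copies the one matrix being modified, since it is shared with the environment at that point.

```r
# Hypothetical helper: zero one row of a matrix stored in `env`,
# looking the matrix up by name at run time.
zero_row_via_list <- function(env, varname, row = 10, reps = 300) {
  mylist <- as.list(env, all.names = TRUE)
  idx <- match(varname, names(mylist))  # numeric index into the list
  if (is.na(idx)) stop("no object named '", varname, "' in env")
  for (i in seq_len(reps)) {
    temp <- mylist[[idx]][row, ]
    mylist[[idx]][row, ] <- temp * 0
  }
  # Assumption: write only the modified matrix back into the same environment.
  assign(varname, mylist[[idx]], envir = env)
  invisible(env)
}

myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow = 6000, ncol = 200)
myenv$testmat2 <- matrix(1.0, nrow = 6000, ncol = 200)
zero_row_via_list(myenv, "testmat2")
```

Because the environment object itself is never replaced, any other references to myenv keep pointing at the updated matrices.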
