Very slow assignment to a vector by name in R

My code ran into a performance problem that I could reproduce with this snippet:

    rm(z)
    z = c()
    system.time({z[as.character(1:10^5)] = T})
       user  system elapsed 
     48.716   0.023  48.738 

I tried pre-allocating z with

 z = logical(10^5) 

but it made no difference. Then I pre-assigned the names

 names(z) = character(10^5) 

There is still no difference in speed.

    system.time({z[as.character(1:10^5)] = T})
       user  system elapsed 
     50.345   0.035  50.381 

If I repeat the test, with or without pre-allocation, the timing drops back to reasonable levels (more than 100 times faster).

    system.time({z[as.character(1:10^5)] = T})
       user  system elapsed 
      0.037   0.001   0.039 

Finally, I found a sort-of workaround:

    names(z) = as.character(1:10^5)
    system.time({z[as.character(1:10^5)] = T})
       user  system elapsed 
      0.035   0.001   0.035 

To get back to the slow timing you can rm(z) and initialize it differently, but even changing the names to something else brings the slowness back. I call this only a sort-of workaround because I don't understand why it works, so it is hard to generalize to my real use case, where I don't know the names in advance. Given the difference of two orders of magnitude, one might suspect that something non-vectorized or interpreted is involved, but my code has no loop and does not call any interpreted code that I can think of. Then, trying smaller vectors, I saw that the run time grows much faster than linearly, possibly quadratically, which points to something else. So the question is: what causes this behavior, and how can I make it faster?
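The faster-than-linear growth mentioned above can be checked with a small sketch like the following. The sizes are arbitrary and kept small so that it finishes quickly, and `sizes` and `elapsed` are names made up for this illustration. The timings will differ by machine and R version, and may not show the slowdown at all on a given setup.

```r
# Time the by-name assignment for a few doubling sizes (illustrative sizes only).
sizes <- c(1000, 2000, 4000)

elapsed <- sapply(sizes, function(n) {
  z <- logical(n)                                  # pre-allocated, unnamed
  system.time(z[as.character(seq_len(n))] <- TRUE)[["elapsed"]]
})

# If the cost were linear, doubling n would roughly double the time;
# roughly quadrupling would suggest quadratic growth.
print(data.frame(n = sizes, elapsed = elapsed))
```

With timings this short the numbers are noisy, so it is worth repeating the run or raising the sizes before reading much into the ratios.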

Platform: OS X Mountain Lion with R 15.2. Thanks

Antonio

+4
4 answers

That does seem odd. It looks as if R extends the vector one element at a time for each unmatched name. The function below (a) keeps only the last value when names are duplicated, then (b) updates the existing named elements and (c) appends the new elements:

    updateNamed <- function(z, z1) {
        z1 <- z1[!duplicated(names(z1), fromLast=TRUE)] # last value of any dup
        idx <- names(z1) %in% names(z)                  # existing names...
        z[ names(z1)[idx] ] <- z1[idx]                  # ...updated
        c(z, z1[!idx])                                  # new names appended
    }

It works like this

    > z <- setNames(logical(2), c("a", 2))
    > updateNamed(z, setNames(c(TRUE, FALSE, TRUE, FALSE), c("a", 2, 2, "c")))
        a     2     c 
     TRUE  TRUE FALSE 

and is faster:

    > n <- 3*10^4
    > z <- logical(n)
    > z1 <- setNames(rep(TRUE, n), as.character(1:n))
    > system.time(updateNamed(z, z1))
       user  system elapsed 
      0.036   0.000   0.037 

Some care is needed about how names are treated, for example when appending to a previously unnamed vector

    > length(updateNamed(z, z1))
    [1] 60000

versus updating (with the "last" value) an already named vector

    > length(updateNamed(z1, !z1))
    [1] 30000

Also, as documented in ?"[<-", zero-length strings "" are never matched.

    > z = TRUE; z[""] = FALSE; z
                
     TRUE FALSE 
+3

I can guess at what is happening, because the timings seem to agree with my hypothesis.

Here are three relevant runs:

    # run 1 - slow
    rm(z)
    n <- 3*10^4
    z <- vector("logical", n)
    system.time({
        z[as.character(1:n)] <- T
    })
    #    user  system elapsed 
    #    5.08    0.00    5.10 

    # run 2 - fast
    rm(z)
    n <- 3*10^4
    z <- vector("logical", n)
    system.time({
        names(z) <- as.character(1:n)
        z[as.character(1:n)] <- T
    })
    #    user  system elapsed 
    #    0.03    0.00    0.03 

    # run 3 - slow again
    rm(z)
    n <- 3*10^4
    z <- vector("logical", n)
    system.time({
        for (i in 1:n) names(z)[i] <- as.character(i)
        z[as.character(1:n)] <- T
    })
    #    user  system elapsed 
    #    6.10    0.00    6.09 

Run 3 is what I think happens behind the scenes, or at least something like it: when assigning by name, R looks the names up one at a time and, if a name is not found, appends it to the end of the names vector. Doing this one element at a time is what kills the performance...
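To make that hypothesis concrete, here is a small model of it in plain R. This is only a sketch of the conjectured behavior, not R's actual internals; `slow_assign` is a hypothetical helper written for this illustration, and it assumes non-empty keys. It looks each key up in turn and appends when there is no match, which copies the whole vector every time and would explain a roughly quadratic cost.

```r
# Hypothetical model of the conjectured one-at-a-time behavior (not R's real code).
slow_assign <- function(z, keys, value) {
  for (k in keys) {
    nm  <- names(z)
    pos <- if (is.null(nm)) NA_integer_ else match(k, nm)
    if (is.na(pos)) {
      z <- c(z, setNames(value, k))   # no match: append (copies the whole vector)
    } else {
      z[pos] <- value                 # match: update in place
    }
  }
  z
}

z0   <- logical(3)                    # pre-allocated but unnamed
keys <- as.character(1:3)

direct <- z0
direct[keys] <- TRUE                  # the assignment from the question

modelled <- slow_assign(z0, keys, TRUE)

identical(direct, modelled)           # both append three new named elements
length(direct)
```

On a small input the model and the built-in assignment produce the same vector, which also shows that the pre-allocated elements are not reused: the named elements are appended after them.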


You also pointed out that pre-assigning the names with names(z) <- character(n) did not help. Hehe, note that character(n) returns a vector of "", so it does not set the names as you thought. No wonder it does not help. You meant to use as.character rather than character.
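The difference between the two calls is easy to see on a tiny input. This is a minimal sketch; `n`, `empty_names`, `real_names` and `any_match` are illustrative names only.

```r
n <- 3

empty_names <- character(n)           # a vector of n empty strings
real_names  <- as.character(1:n)      # the strings "1", "2", "3"

z <- logical(n)
names(z) <- empty_names               # names are set, but all of them are ""

# None of the keys used in the assignment is among the names,
# so every key is still treated as new.
any_match <- any(real_names %in% names(z))

print(empty_names)
print(real_names)
print(any_match)
```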


Finally, you ask what the solution is to make it faster. I would say you have already found one (run 2). You can also do:

    keys <- as.character(1:n)
    values <- rep(T, n)
    z <- setNames(values, keys)
+3

To work around this problem (in the general case), you can separate the naming from the assignment:

    z[1:10^5] = T
    names(z) = as.character(1:10^5)

But I really don't know why the slowdown occurs (it looks as if the full as.character call were evaluated for every element of z in your expression, but that is only a guess).
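As a quick sanity check of the approach, the sketch below compares the two-step version (assign by position, then name) with building the named vector in one go via setNames. It assumes z starts out empty, as in the first snippet of the question, and uses a small `n` purely for illustration; `two_step` and `one_step` are made-up names.

```r
n <- 5

# Two steps: fill by integer position first, then attach the names.
two_step <- c()
two_step[1:n] <- TRUE
names(two_step) <- as.character(1:n)

# One step: build values and names together.
one_step <- setNames(rep(TRUE, n), as.character(1:n))

identical(two_step, one_step)
```

Note that this only matches the by-name assignment when the names are new and z is empty. On a pre-allocated unnamed vector, assigning by name appends elements rather than filling the existing ones.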

-1

I can't put my finger on it, but I suspect that a simplified example may help explain something:

    R> z = logical(6); z[1:3] = T; z[as.character(1:3)] = T; z
                                            1     2     3 
     TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE 

Moreover, while z[1:5] can be a direct, presumably vectorized, access, z[as.character(1:5)] has to involve looking each name up in the index and, failing that, appending elements one at a time, and so on.

-1

Source: https://habr.com/ru/post/1481445/

