Why is this so slow? (loop over a DF column compared to a standalone vector)

I have a piece of code whose total elapsed time is around 30 seconds, of which the following code accounts for about 27 seconds. I narrowed the offending code down to:

d$dis300[i] <- h 

So I switched to this other version, and now it runs really fast (as expected).

My question is why the first is so slow compared to the second. The data DF is around 7500 x 18 variables.

First: (27 seconds elapsed)

 d$dis300 <- 0
 for (i in 1:netot) {
   h <- aaa[d$ent[i], d$dis[i]]
   if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
   d$dis300[i] <- h
 }

Second: (0.2 seconds elapsed)

 d$dis300 <- 0
 for (i in 1:netot) {
   h <- aaa[d$ent[i], d$dis[i]]
   if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
   foo[i] <- h
 }
 d$foo <- foo

You can see that both are "the same", but the offending one assigns into the DF instead of into a standalone vector.

Any comment is really appreciated. I come from a different kind of language, and this puzzled me for a while. At least I have a solution, but I would like to prevent similar problems in the future.

Thank you for your time,

2 answers

The reason is that d$dis300[i] <- h calls $<-.data.frame.

This is a pretty complicated function, as you can see:

 `$<-.data.frame` 
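As a rough sketch of what happens on each iteration (the small data frame and the values of i and h below are made up for illustration, not taken from the question), the nested assignment expands into an extract, a vector update, and a call to $<- on the whole data frame:

```r
# Hypothetical stand-in for the question's data frame
d <- data.frame(ent = 1:3, dis300 = 0)
i <- 2L
h <- 7

# d$dis300[i] <- h is evaluated roughly as these three steps:
tmp <- d$dis300                  # extract the column
tmp[i] <- h                      # `[<-` on an atomic vector
d2 <- `$<-`(d, "dis300", tmp)    # dispatches to `$<-.data.frame`

# The one-line form gives the same result
d$dis300[i] <- h
identical(d, d2)
```

So every pass through the loop pays for a method call on the whole data frame, not just for a single element update.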

You don't say what foo is, but if it is an atomic vector, the assignment foo[i] <- h goes through [<-, which is implemented in C for speed.

Still, I hope you are declaring foo as follows:

 foo <- numeric(netot) 

This ensures that the vector does not need to be reallocated for each assignment in the loop:

 foo <- 0 # BAD!
 system.time( for(i in 1:5e4) foo[i] <- 0 ) # 4.40 secs
 foo <- numeric(5e4) # Pre-allocate
 system.time( for(i in 1:5e4) foo[i] <- 0 ) # 0.09 secs

With the *apply family, you do not have to worry about this:

 d$foo <- vapply(1:netot, function(i, aaa, ent, dis) {
   h <- aaa[ent[i], dis[i]]
   if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", ent[i], dis[i]))
   h
 }, numeric(1), aaa=aaa, ent=d$ent, dis=d$dis)

... here I also extracted d$ent and d$dis outside the loop, which should improve things as well. I can't run it myself, though, because you did not provide reproducible data. But here is a similar example:

 d <- data.frame(x=1)
 system.time( vapply(1:1e6, function(i) d$x, numeric(1)) ) # 3.20 secs
 system.time( vapply(1:1e6, function(i, x) x, numeric(1), x=d$x) ) # 0.56 secs

... but finally, it seems that all of this can be reduced to the following (apart from your error detection code):

 d$foo <- aaa[cbind(d$ent, d$dis)] 
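To illustrate how that one-liner works: indexing a matrix with a two-column matrix picks one element per row of the index, reading each row as a (row, column) pair. Here is a minimal sketch with made-up stand-ins for aaa and d (the real ones are not shown in the question), plus one possible vectorized version of the error check:

```r
# Hypothetical stand-ins for the question's objects
aaa <- matrix(c(10, 0, 30, 40, 50, 60), nrow = 2)   # 2 x 3 lookup matrix
d <- data.frame(ent = c(1L, 2L, 2L), dis = c(1L, 1L, 3L))

# Each row of idx is one (row, column) pair into aaa
idx <- cbind(d$ent, d$dis)
d$foo <- aaa[idx]

# Vectorized replacement for the per-iteration error message
bad <- which(d$foo == 0)
if (length(bad) > 0) {
  writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[bad], d$dis[bad]))
}
```

The data frame is assigned to only once, so the cost of $<-.data.frame is paid a single time instead of netot times.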

Tommy's is the best answer. This was too big for a comment, so I am adding it as an answer ...

Here is how you can see the copies (of the whole DF, as joran commented):

 > DF = data.frame(a=1:3,b=4:6)
 > tracemem(DF)
 [1] "<0x0000000003104800"
 > for (i in 1:3) {DF$b[i] <- i; .Internal(inspect(DF))}
 tracemem[0000000003104800 -> 000000000396EAD8]:
 tracemem[000000000396EAD8 -> 000000000396E4F0]: $<-.data.frame $<-
 tracemem[000000000396E4F0 -> 000000000399CDC8]: $<-.data.frame $<-
 @000000000399CDC8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
   @000000000399CD90 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
   @000000000399CCE8 13 INTSXP g0c2 [] (len=3, tl=0) 1,5,6
 ATTRIB:  # .. snip ..

 tracemem[000000000399CDC8 -> 000000000399CC40]:
 tracemem[000000000399CC40 -> 000000000399CAB8]: $<-.data.frame $<-
 tracemem[000000000399CAB8 -> 000000000399C9A0]: $<-.data.frame $<-
 @000000000399C9A0 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
   @000000000399C968 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
   @000000000399C888 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,6
 ATTRIB:  # .. snip ..

 tracemem[000000000399C9A0 -> 000000000399C7E0]:
 tracemem[000000000399C7E0 -> 000000000399C700]: $<-.data.frame $<-
 tracemem[000000000399C700 -> 00000000039C78D8]: $<-.data.frame $<-
 @00000000039C78D8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
   @00000000039C78A0 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
   @0000000003E07890 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
 ATTRIB:  # .. snip ..

 > DF
   a b
 1 1 1
 2 2 2
 3 3 3

Each of these tracemem[] lines corresponds to a copy of the entire object. You can also see that the hexadecimal address of the column vector a changes every time, even though it is not modified by the assignment to b.

AFAIK, without dropping into C code yourself, the only way (currently) in R to change an element of a data.frame without copying any memory at all is the := operator and set(), both in the data.table package. There are 17 questions about assignment by reference with := here on Stack Overflow.
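For illustration, a minimal sketch of both forms, assuming the data.table package is installed (the table DT and its values are made up for this example):

```r
library(data.table)  # assumption: data.table is installed

# Hypothetical table mirroring the DF used in the tracemem example
DT <- data.table(a = 1:3, b = 4:6)

# set() updates one cell in place on each iteration
for (i in 1:3) set(DT, i = i, j = "b", value = i)

# := does the same kind of by-reference update inside [ ]
DT[2L, b := 20L]

DT$b
```

Note that the values assigned here are integers, matching the type of column b.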

But in this case, Tommy's one-liner is definitely better, since you do not even need a loop.


Source: https://habr.com/ru/post/914081/

