I have a large data frame in R with two columns, a and b (a small sample is generated below).
set.seed(12); n = 5; n_a = 5; n_b = 5
id_lengths = sample(1:n, n_a, replace = TRUE)
a = rep(1:n_a, id_lengths)
b = sample(1:n_b, length(a), replace = TRUE)
data = data.frame(a = a, b = b)
For each value of "a", I want a vector of all unique "a" values, sorted in decreasing order of their overlap with it in column "b". I use the code below to achieve this.
a_list = split(data$b, data$a)   # b values observed for each a

get_similar_ids = function(z) {
  # for every a, count the distinct b values it shares with z
  tmp = sapply(a_list, FUN = function(z1) length(intersect(z1, z)))
  sort(tmp, decreasing = TRUE)
}

lapply(a_list, FUN = get_similar_ids)
Results:
$`1`
1 2 3 4 5
1 1 0 0 0
$`2`
2 1 3 5 4
3 1 1 1 0
$`3`
3 2 4 1 5
3 1 1 0 0
$`4`
3 4 1 2 5
1 1 0 0 0
$`5`
2 5 1 3 4
1 1 0 0 0
The problem is that the actual data has large n_a (~ 1700000), n_b (~ 250000) and n (~ 15), which gives a data frame of more than 13 million rows, and this code simply does not finish for such large values. Any ideas on how to speed up these operations?
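One direction I have been considering is to avoid the n_a^2 intersect() calls entirely: encode the (a, b) pairs as a sparse incidence matrix and get all pairwise overlap counts in a single sparse cross-product. A sketch using the Matrix package is below; this is only an idea of mine and is untested at the real size.

library(Matrix)

# rows = a ids, cols = b values; duplicate (a, b) pairs are summed
m <- sparseMatrix(i = data$a, j = data$b, x = 1)
m@x <- pmin(m@x, 1)      # binarize, so only distinct b values are counted
overlap <- tcrossprod(m) # overlap[i, j] = length(intersect(b_i, b_j))
ids <- as.character(seq_len(nrow(overlap)))
dimnames(overlap) <- list(ids, ids)

# reproduce the per-id sorted vectors for the small example; for the real
# data the full n_a x n_a result may still be too large to hold at once,
# so the rows would have to be processed in blocks
lapply(seq_len(nrow(overlap)), function(i) sort(overlap[i, ], decreasing = TRUE))

On the sample data this returns the same counts as the intersect()-based code (including the self-overlap, e.g. 3 for id 2), but I do not know whether the cross-product stays tractable at n_a ~ 1.7 million.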