Speed up sapply code in R

I have a large data frame in R with 2 columns (a sample with columns a and b is generated below).

set.seed(12); n = 5; n_a = 5; n_b = 5
id_lengths = sample(1:n, n_a, replace = T)   # number of b entries per value of a
a = rep(1:n_a, id_lengths)
b = sample(1:n_b, length(a), replace = T)
data = data.frame(a = a, b = b)

For each value of "a" I want a vector of all the unique "a" values, sorted by how many distinct "b" values they share with it (i.e. by maximum overlap in column "b"). I use the code below to get this result.

get_similar_ids = function(z){
    # count, for every a, how many distinct b values it shares with z
    tmp = sapply(a_list, FUN = function(z1){ length(intersect(z1, z)) })
    sort(tmp, decreasing = T)
}
a_list = split(data$b, data$a)    # b values grouped by a
lapply(a_list, FUN = get_similar_ids)

Results:

$`1`
1 2 3 4 5
1 1 0 0 0

$`2`
2 1 3 5 4
3 1 1 1 0

$`3`
3 2 4 1 5
3 1 1 0 0

$`4`
3 4 1 2 5
1 1 0 0 0

$`5`
2 5 1 3 4
1 1 0 0 0

The problem is that the real data has large n_a (~1,700,000), n_b (~250,000) and n (~15), which gives a data frame of more than 13 million rows, and this code does not finish at all at that scale. Any ideas on how to speed up these operations?


You can get all pairwise overlap counts at once with table and matrix multiplication:

(x <- with(data,(table(a,b)>0) %*% (table(b,a)>0)))
   a
a   1 2 3 4 5
  1 1 1 0 0 0
  2 1 3 1 0 1
  3 0 1 3 1 0
  4 0 0 1 1 0
  5 0 1 0 0 1

Here (table(a,b) > 0) is the 0/1 incidence matrix recording which b values occur for each a, so the product counts, for every pair of a values, how many distinct b values they share. Then, to get the per-a sorted vectors:

lapply(unique(data$a), function(y) sort(x[,y],decreasing=TRUE))
[[1]]
1 2 3 4 5 
1 1 0 0 0 

[[2]]
2 1 3 5 4 
3 1 1 1 0 

[[3]]
3 2 4 1 5 
3 1 1 0 0 

[[4]]
3 4 1 2 5 
1 1 0 0 0 

[[5]]
2 5 1 3 4 
1 1 0 0 0 
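At the sizes in the question (n_a ~1,700,000, n_b ~250,000), a dense table(a, b) will not fit in memory, but the same incidence-matrix idea carries over to sparse matrices. A minimal sketch, assuming the Matrix package and that a and b are integer codes starting at 1 (whether the full product fits will still depend on how widely the b values are shared):

library(Matrix)
# 0/1 sparse incidence matrix: entry (i, j) is 1 iff a = i ever occurs with b = j
inc <- (sparseMatrix(i = data$a, j = data$b, x = 1) > 0) * 1
# x[i, j] = number of distinct b values shared by a = i and a = j
x <- tcrossprod(inc)
# for one value of a (say a = 2): the other a ids, ordered by overlap
ov  <- x[2, ]
ids <- order(ov, decreasing = TRUE)
setNames(ov[ids], ids)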

Note that intersect only counts distinct matches, so multiple identical b values within one a contribute once. If n is small (how many distinct a and b values are there?), you could deduplicate first and then self-merge on b:

d2 <- data[!duplicated(data), ]        # drop duplicated (a, b) pairs
mer <- merge(d2, d2, by = "b")         # all pairs of a values sharing a b value
table(paste0(mer$a.x, "-", mer$a.y))   # separator avoids ambiguity for multi-digit ids

This counts each pair of a values by pasting them together; alternatively, table(mer$a.x, mer$a.y) gives the full cross-tabulation directly.
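To get from here back to the sorted vectors the question asks for, one possible follow-up (a sketch building on mer from above):

tab <- table(mer$a.x, mer$a.y)
lapply(rownames(tab), function(y) sort(tab[y, ], decreasing = TRUE))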

