I have a large data frame in R with two columns, a and b (a small sample is generated below).
set.seed(12); n = 5; n_a = 5; n_b = 5
id_lengths = sample(1:n, n_a, replace = TRUE)
a = rep(1:n_a, id_lengths)
b = sample(1:n_b, length(a), replace = TRUE)
data = data.frame(a = a, b = b)
For each value of "a", I want a vector of all unique "a" values, sorted in decreasing order of their overlap with it in column "b". I use the code below to achieve this.
a_list = split(data$b, data$a)   # b values observed for each a

get_similar_ids = function(z) {
  # for every a, count the distinct b values it shares with z
  tmp = sapply(a_list, FUN = function(z1) length(intersect(z1, z)))
  sort(tmp, decreasing = TRUE)
}

lapply(a_list, FUN = get_similar_ids)
Results:
$`1`
1 2 3 4 5
1 1 0 0 0
$`2`
2 1 3 5 4
3 1 1 1 0
$`3`
3 2 4 1 5
3 1 1 0 0
$`4`
3 4 1 2 5
1 1 0 0 0
$`5`
2 5 1 3 4
1 1 0 0 0
The problem is that the actual data has large n_a (~ 1700000), n_b (~ 250000) and n (~ 15), which gives a data frame of more than 13 million rows, and this code simply does not finish for such large values. Any ideas on how to speed up these operations?
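One direction I have been considering is to avoid the n_a^2 intersect() calls entirely: encode the (a, b) pairs as a sparse incidence matrix and get all pairwise overlap counts in a single sparse cross-product. A sketch using the Matrix package is below; this is only an idea of mine and is untested at the real size.

library(Matrix)

# rows = a ids, cols = b values; duplicate (a, b) pairs are summed
m <- sparseMatrix(i = data$a, j = data$b, x = 1)
m@x <- pmin(m@x, 1)      # binarize, so only distinct b values are counted
overlap <- tcrossprod(m) # overlap[i, j] = length(intersect(b_i, b_j))
ids <- as.character(seq_len(nrow(overlap)))
dimnames(overlap) <- list(ids, ids)

# reproduce the per-id sorted vectors for the small example; for the real
# data the full n_a x n_a result may still be too large to hold at once,
# so the rows would have to be processed in blocks
lapply(seq_len(nrow(overlap)), function(i) sort(overlap[i, ], decreasing = TRUE))

On the sample data this returns the same counts as the intersect()-based code (including the self-overlap, e.g. 3 for id 2), but I do not know whether the cross-product stays tractable at n_a ~ 1.7 million.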