Merging huge datasets

Several two-column matrices need to be combined as shown below.

 matrix1    matrix2    output
 1,3        1,4        1,3,1
 1,5        1,5        1,4,1
 3,6        3,6        1,5,2
            3,7        3,6,2
                       3,7,1

The third column in the output is a count of how many times the pair has been seen across all matrices. I wrote the following code for this:

 require(data.table)
 set.seed(1000)
 data.lst <- lapply(1:200, function(n) {
     x <- matrix(sample(1:1000, 2000, replace = T), ncol = 2)
     x[!duplicated(x), ]
 })

 # method 1
 pair1.dt <- data.table(i = integer(0), j = integer(0), cnt = integer(0))
 for (mat in data.lst) {
     pair1.dt <- rbind(pair1.dt,
                       data.table(i = mat[, 1], j = mat[, 2], cnt = 1))[, .(cnt = sum(cnt)), .(i, j)]
 }

 # method 2
 pair2.dt <- data.table(i = integer(0), j = integer(0), cnt = integer(0))
 for (mat in data.lst) {
     pair2.dt <- merge(pair2.dt, data.table(i = mat[, 1], j = mat[, 2], cnt = 1),
                       by = c("i", "j"), all = T)[
         , cnt := rowSums(.SD, na.rm = T), .SDcols = c("cnt.x", "cnt.y")][
         , c("cnt.x", "cnt.y") := NULL]
 }

 cat(sprintf("num.rows => pair1: %d, pair2: %d", pair1.dt[, .N], pair2.dt[, .N]), "\n")

In the real task, each of the matrices has about 10 million rows, and the matrices can overlap by 30-40%. I am trying to find the fastest way to do this. I tried Matrix::sparseMatrix; although it is much faster, I ran into the error "long vectors not supported yet". The above are a couple of different data.table approaches. I am looking for suggestions to speed up this code and/or for alternative approaches.

3 answers

Using Rcpp. This approach exploits the hashing of std::unordered_map.

 #include "Rcpp.h"
 #include <stdint.h>
 #include <unordered_map>
 using namespace std;
 using namespace Rcpp;

 // [[Rcpp::export]]
 Rcpp::XPtr<int> CreateMap() {
     std::unordered_map<int64_t, int>* myMap = new std::unordered_map<int64_t, int>();
     Rcpp::XPtr<int> p((int*)myMap, false);
     return p;
 }

 // [[Rcpp::export]]
 void FreeMap(Rcpp::XPtr<int> map_ptr) {
     std::unordered_map<int64_t, int>* myMap = (std::unordered_map<int64_t, int>*)(int*)map_ptr;
     delete myMap;
 }

 // [[Rcpp::export]]
 void AccumulateValues(Rcpp::XPtr<int> map_ptr, SEXP mat) {
     NumericMatrix m(mat);
     std::unordered_map<int64_t, int>* myMap = (std::unordered_map<int64_t, int>*)(int*)map_ptr;
     for (int i = 0; i < m.nrow(); i++) {
         int c1 = m(i, 0);
         int c2 = m(i, 1);
         // pack both columns into a single 64-bit key
         int64_t key = ((int64_t)c1 << 32) + c2;
         (*myMap)[key]++;
     }
 }

 // [[Rcpp::export]]
 SEXP AsMatrix(Rcpp::XPtr<int> map_ptr) {
     std::unordered_map<int64_t, int>* myMap = (std::unordered_map<int64_t, int>*)(int*)map_ptr;
     NumericMatrix m(myMap->size(), 3);
     int index = 0;
     for (auto it = myMap->begin(); it != myMap->end(); ++it) {
         int64_t key = it->first;
         m(index, 0) = (int)(key >> 32);  // unpack first column
         m(index, 1) = (int)key;          // unpack second column
         m(index, 2) = it->second;
         index++;
     }
     return m;
 }

R code:

 myMap <- CreateMap()
 AccumulateValues(myMap, matrix1)
 AccumulateValues(myMap, matrix2)
 result <- AsMatrix(myMap)
 FreeMap(myMap)

You also need

 PKG_CXXFLAGS = "-std=c++0x"

in the package's Makevars file.


First make data.tables of them:

 dt.lst = lapply(data.lst, as.data.table) 

Stacking. For comparison, here is the fast way, which relies on stacking:

 res0 = rbindlist(dt.lst)[, .(n = .N), by=V1:V2] 

The OP said this was not possible because the intermediate result made by rbindlist would be too large.

Enumerating first. If the range of values is small, I'd suggest enumerating them all up front:

 res1 = CJ(V1 = 1:1000, V2 = 1:1000)[, n := 0L]
 for (k in seq_along(dt.lst))
     res1[dt.lst[[k]], n := n + .N, by = .EACHI]

 fsetequal(res0, res1[n > 0]) # TRUE

The OP pointed out that there are 1e12 possible values, so this does not seem to be a good idea. Instead, we can use

 res2 = dt.lst[[1L]][0L]
 for (k in seq_along(dt.lst))
     res2 = funion(res2, dt.lst[[k]])
 res2[, n := 0L]
 setkey(res2, V1, V2)
 for (k in seq_along(dt.lst))
     res2[dt.lst[[k]], n := n + .N, by = .EACHI]

 fsetequal(res0, res2) # TRUE

This is the slowest of the three options shown here on the example data, but it looks like the best bet to me in light of the OP's constraints.

Growing inside the loop. Finally...

 res3 = dt.lst[[1L]][0L][, n := NA_integer_][]
 for (k in seq_along(dt.lst)) {
     setkey(res3, V1, V2)
     res3[dt.lst[[k]], n := n + .N, by = .EACHI]
     res3 = rbind(
         res3,
         fsetdiff(dt.lst[[k]], res3[, !"n", with = FALSE], all = TRUE)[, .(n = .N), by = V1:V2]
     )
 }

 fsetequal(res0, res3) # TRUE

Growing an object inside a loop is strongly discouraged and inefficient in R, but this way it can be done in a single loop instead of two.

Other options and notes. I suspect it is best to use a hash. They are available in the hash package and probably also through the Rcpp package.

fsetequal, fsetdiff and funion are recent additions to the development version of the package. For details, see the data.table project's official website.

By the way, if the rows within each matrix are distinct, you can replace .N with 1L everywhere above and drop by=.EACHI and all=TRUE.


You could perhaps process your data in batches, as your memory allows:

 maxRows = 5000  # approximately how many rows you can hold in memory

 tmp.lst = list()
 nrows = 0
 idx = 1
 for (i in seq_along(data.lst)) {
     tmp.lst[[idx]] = as.data.table(data.lst[[i]])[, cnt := 1]
     idx = idx + 1
     nrows = nrows + nrow(data.lst[[i]])

     # if too many rows, collapse (could also be replaced by a memory condition)
     if (nrows > maxRows) {
         tmp.lst = list(rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2])
         idx = 2
         nrows = nrow(tmp.lst[[1]])
     }
 }

 # final collapse
 res = rbindlist(tmp.lst)[, .(cnt = sum(cnt)), by = V1:V2]

Source: https://habr.com/ru/post/1247384/

