Merging data in R on a pre-sorted column?

I usually work with large data frames that are pretty well sorted (or can be easily sorted).

Given two data frames, both sorted by 'user':

 some.data <user> <data_1> <data_2>
 user      <user> <user_attr_1> <user_attr_2>

when I run m = merge(some.data, user), I get the result as:

 m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2> 

And that is wonderful.

But merge does not take advantage of the fact that the data is already sorted on the common column, which makes the merge heavy on CPU and memory. Yet a merge of two sorted inputs can be done in O(n).

I am wondering if there is a way in R to merge sorted datasets efficiently?
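To make the O(n) claim concrete, here is a minimal base R sketch of a merge join over two key vectors that are already sorted. The function name `sorted_merge` and the toy data are hypothetical, and it assumes the `user` table has unique keys. An explicit R loop like this illustrates the single linear pass; it is not necessarily faster than merge() in practice.

```r
# Sketch: inner join of x (repeated keys allowed) with y (unique keys),
# both already sorted ascending by `key`. One linear pass over both key vectors.
sorted_merge <- function(x, y, key = "user") {
  kx <- x[[key]]
  ky <- y[[key]]
  idx <- integer(length(kx))  # idx[i] = matching row of y, or NA if none
  j <- 1L
  for (i in seq_along(kx)) {
    # Advance the y pointer until its key is >= the current x key.
    while (j <= length(ky) && ky[j] < kx[i]) j <- j + 1L
    idx[i] <- if (j <= length(ky) && ky[j] == kx[i]) j else NA_integer_
  }
  keep <- !is.na(idx)
  out <- cbind(x[keep, , drop = FALSE],
               y[idx[keep], setdiff(names(y), key), drop = FALSE])
  rownames(out) <- NULL
  out
}

# Hypothetical toy data following the schema above.
some.data <- data.frame(user = c(1, 1, 2, 4), data_1 = c(10, 11, 20, 40))
user <- data.frame(user = c(1, 2, 3), user_attr_1 = c("x", "y", "z"))

m <- sorted_merge(some.data, user)
```

Because the pointer `j` never moves backwards, the total work is proportional to the combined number of rows in the two inputs.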

+6
2 answers

I have no experience with this, but as far as I know, this is one of the problems that the data.table package was developed to address.

For most practical purposes, data.table = data.frame + index. As a result, when used properly, it improves the performance of several operations on large data.

There is a risk that turning your data.frame into a data.table (i.e. adding an index) may take some time (although I expect it to be well optimized), but once you have it, functions like merge can readily use the index for better performance.
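As a rough illustration of that suggestion, the sketch below keys two tables on `user` and joins them. The toy data is hypothetical and the column names follow the question's schema. `setkeyv()` sorts a table by the key column and marks it as keyed, which is what the answer calls adding an index.

```r
library(data.table)

# Hypothetical toy data; column names follow the question's schema.
some.data <- data.table(user = c(1L, 1L, 2L, 3L),
                        data_1 = c(10, 11, 20, 30),
                        data_2 = c("a", "b", "c", "d"))
user <- data.table(user = 1:3,
                   user_attr_1 = c("x", "y", "z"),
                   user_attr_2 = c(TRUE, FALSE, TRUE))

# Key both tables on the common column (sorts by reference).
setkeyv(some.data, "user")
setkeyv(user, "user")

# merge() has a data.table method that joins on the `by` column.
m <- merge(some.data, user, by = "user")

# The keyed-join syntax: look up each row of some.data in user.
m2 <- user[some.data]
```

Both forms return one row per row of `some.data`, differing only in column order.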

+5

If your set of common keys / indices completely overlaps, then ...

Reduce(`&`, user$user.id %in% some.data$user.id)

... returns TRUE, and the data are, as you said, sorted with no duplicate keys, then your join problem comes down to adding columns to a data.frame. Something along the lines of ...

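    library(log4r)
    # my.logger is not defined in the original post; a default console logger is assumed here.
    my.logger <- logger("INFO")

    # Baseline: time the ordinary merge().
    t1 <- system.time(z <- merge(user, some.data, by='user.id'))
    info(my.logger, paste('Elapsed time with merge():', t1['elapsed']))

    # Alternative: add columns directly (assumes aligned, unique, fully overlapping keys).
    t2 <- Sys.time()
    r <- data.frame(user.id=user$user.id, V1.x=user$V1, V2.x=user$V2)
    r[,names(some.data)] <- some.data[,names(some.data)]
    t3 <- Sys.time()
    info(my.logger, paste('Elapsed time without:', t3-t2))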

If the above assumptions do not hold, it gets a little messier (you need the union of both key sets and a lookup function, padding with NA), but the sorted-and-overlapping assumption alone takes you a long way.
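One possible reading of that fallback case, as a sketch: take the union of both key sets and use `match()` as the lookup, which pads missing keys with NA. The toy data and column names `V1` and `V2` are hypothetical, and keys are still assumed to be unique within each table. Note that `match()` does not itself exploit the sort order.

```r
# Hypothetical toy data: sorted, unique keys, only partially overlapping.
user <- data.frame(user.id = c(1, 2, 4), V1 = c("a", "b", "d"))
some.data <- data.frame(user.id = c(2, 3, 4), V2 = c(20, 30, 40))

# Union of both key sets, kept sorted.
keys <- sort(union(user$user.id, some.data$user.id))

# match() is the lookup step: it returns NA where a key is absent,
# and indexing a column with NA yields NA in the result.
r <- data.frame(user.id = keys,
                V1 = user$V1[match(keys, user$user.id)],
                V2 = some.data$V2[match(keys, some.data$user.id)])
```

This corresponds to what `merge(user, some.data, by = 'user.id', all = TRUE)` produces for unique keys.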

Note also that the timing of the second approach is biased, since it calls Sys.time() twice, whereas the merge() timing calls system.time() only once. (Sorry for my lame use of SO markup.)

0

Source: https://habr.com/ru/post/900282/

