Merging large data frames multiple times runs out of memory

I filter a large data frame twice (producing DF1 and DF2), join the two filtered data frames into one (DF1 + DF2 -> DF3), repeat this many times, and then combine the results into a single data frame (DF = DF3[1] + DF3[2] ... DF3[n]), but I keep running out of memory (8 GB). The starting and final data frames fit comfortably on the laptop; it is the intermediate processing that exhausts memory.

Which approach is fastest and uses the least memory? Should I run the code in parts and recombine the results, get a bigger machine, or is this a job for a relational database or MapReduce?

The code below illustrates the problem.

library(dplyr)

#create combination df
Combn <- data.frame(t(combn(as.vector(rep(LETTERS[1:26])),2))) %>%
  mutate_all(as.character)

#create data df
Nrows <- 1000000
Data <- data.frame(Symbol=rep(LETTERS[1:26])) %>%
  mutate(Symbol=as.character(Symbol)) %>%
  bind_rows(replicate(Nrows-1,.,simplify=FALSE)) %>%
  arrange(Symbol) %>%
  group_by(Symbol) %>%
  mutate(Idx=seq(1:Nrows)) %>%
  mutate(Px=round(runif(Nrows)*20))

#join the two filtered frames for each pair of symbols, returning a list of data frames
FnPDList <- function(Combn,Data){
  Dfs <- list()
  for(i in 1:nrow(Combn)){
    print(i)
    Symbol.1 <- Combn$X1[i]
    Symbol.2 <- Combn$X2[i]
    Sym.2 <- Data %>% filter(Symbol==Symbol.2)
    Df <- Data %>%
      filter(Symbol==Symbol.1) %>%
      left_join(Sym.2,by="Idx",suffix=c(".1",".2"))
    Dfs[[i]] <- Df
  }
  return(Dfs)
}

#splitting into n parts works
X <- FnPDList(slice(Combn,1:10),Data)
Z <- do.call(bind_rows,X)

#trying to solve in one go exhausts memory
X <- FnPDList(Combn,Data)
Z <- do.call(bind_rows,X)
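For reference, here is a minimal sketch of the "run it in parts" option I am considering: process the symbol pairs in batches, bind each batch immediately, and keep only the accumulated results so the per-pair intermediates can be freed. The batch size of 50 is an arbitrary choice on my part; FnPDList, Combn and Data are as defined above. Whether this actually stays within 8 GB clearly still depends on the size of the final result, which is the part I am unsure about.

#process the pair combinations in batches and collapse each batch right away
BatchSize <- 50
Starts <- seq(1, nrow(Combn), by = BatchSize)
Parts <- list()
for (j in seq_along(Starts)) {
  Rows <- Starts[j]:min(Starts[j] + BatchSize - 1, nrow(Combn))
  #only the bound result of this batch is retained, not the list of per-pair frames
  Parts[[j]] <- do.call(bind_rows, FnPDList(slice(Combn, Rows), Data))
}
Z <- do.call(bind_rows, Parts)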

Source: https://habr.com/ru/post/1273374/

