Improving the performance of updating a large data frame with the contents of a similar data frame

I am looking for a general solution for updating one large data frame with the contents of a second, similar data frame. I have dozens of datasets, each with thousands of rows and more than 10,000 columns. The "update" dataset will overlap the corresponding "base" dataset anywhere from a few percent to maybe 50 percent, row-wise. Each dataset has a key column, and any given dataset has only one row per unique key value.

Basic rule: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. ("The same cell" means the same key-column value and the same column name.)
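In other words, for any pair of aligned columns the rule is a per-cell coalesce. A minimal sketch (the helper name is mine, just for illustration):

    # Hypothetical helper: take the update value unless it is NA.
    coalesce_update <- function(base, upd) ifelse(is.na(upd), base, upd)
    coalesce_update(c(1, 1, 1), c(2, NA, 2))
    # [1] 2 1 2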

Note that the update dataset will most likely also contain new rows ("inserts"), which I can handle with rbind.

So the base data frame is "df1", where column "K" is the unique key column and "P1" .. "P3" stand in for the 10,000 columns, whose names will differ from one pair of datasets to the next:

      K P1 P2 P3
    1 A  1  1  1
    2 B  1  1  1
    3 C  1  1  1

... and the "df2" update data frame:

      K P1 P2 P3
    1 B  2 NA  2
    2 C NA  2  2
    3 D  2  2  2

The result I need is the following, where the 1s for "B" and "C" were overwritten with 2s, but never overwritten with NA:

      K P1 P2 P3
    1 A  1  1  1
    2 B  2  1  2
    3 C  1  2  2
    4 D  2  2  2

This doesn't seem like a job for merge, since merge gives me either duplicate rows (relative to the key column) or duplicate columns (like P1.x, P1.y) that I then have to collapse somehow.
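For illustration, a small sketch of the duplicate-column problem, using the df1/df2 created by the code at the end of this question:

    merge(df1, df2, by="K", all=TRUE)
    #   K P1.x P2.x P3.x P1.y P2.y P3.y
    # 1 A    1    1    1   NA   NA   NA
    # 2 B    1    1    1    2   NA    2
    # 3 C    1    1    1   NA    2    2
    # 4 D   NA   NA   NA    2    2    2

Every P*.x / P*.y pair would still have to be collapsed by hand, across 10,000 columns.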

I tried pre-allocating a matrix with the dimensions of the final result, filling it with the contents of df1, and then iterating over the overlapping rows of df2, but I cannot get more than about 20 cells per second that way, meaning hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
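The slow approach looks roughly like this (a sketch, assuming both frames share the same column order):

    rows <- match(df2$K, df1$K)           # base row for each update row (NA = new row)
    for (i in which(!is.na(rows))) {      # intersecting rows only
        for (j in 2:ncol(df2)) {
            v <- df2[i, j]
            if (!is.na(v)) df1[rows[i], j] <- v  # one slow data.frame assignment per cell
        }
    }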

I'm sure I'm missing something, but I cannot find a comparable example.

I see approaches using ddply that look close, but no general solution. The data.table package didn't seem to help either, since it isn't obvious to me that this is a join problem, at least not across so many columns.

A solution that handles only the intersecting rows would also be adequate, since I can identify the new rows myself and rbind them in afterwards, for example as in the sketch below.
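(A sketch; `df1.updated` is a placeholder for df1 after the cell updates have been applied.)

    common  <- intersect(df2$K, df1$K)       # keys present in both datasets
    updates <- df2[df2$K %in% common, ]      # rows that overwrite existing cells
    inserts <- df2[!(df2$K %in% common), ]   # genuinely new rows
    result  <- rbind(df1.updated, inserts)   # append the inserts afterwards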

Here is the code to create the data frames above:

 cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n"); cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n"); df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE); df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE); 

Thanks.

4 answers

This loops over the columns, setting dt1 by reference, and should (hopefully) be quick.

    library(data.table)
    dt1 = as.data.table(df1)
    dt2 = as.data.table(df2)
    if (!identical(names(dt1), names(dt2)))
        stop("Assumed for now. Can relax later if needed.")
    w = chmatch(dt2$K, dt1$K)            # fast character match: base row for each update row
    for (i in 2:ncol(dt2)) {
        nna = !is.na(dt2[[i]])           # only non-NA cells overwrite
        set(dt1, w[nna], i, dt2[[i]][nna])  # update by reference, no copies
    }
    dt1 = rbind(dt1, dt2[is.na(w)])      # append the genuinely new rows
    dt1
         K P1 P2 P3
    [1,] A  1  1  1
    [2,] B  2  1  2
    [3,] C  1  2  2
    [4,] D  2  2  2

This is most likely not the fastest solution, but it is done entirely in base R.

(Answer updated in response to Tommy's comments.)

    # READ IN YOUR DATA FRAMES
    df1 <- read.table(text="
      K P1 P2 P3
    1 A  1  1  1
    2 B  1  1  1
    3 C  1  1  1", header=TRUE)

    df2 <- read.table(text="
      K P1 P2 P3
    1 B  2 NA  2
    2 C NA  2  2
    3 D  2  2  2", header=TRUE)

    all <- c(levels(df1$K), levels(df2$K))  # all values of the key column
    dups <- all[duplicated(all)]            # the overlapping keys
    ndups <- all[!all %in% dups]            # the non-overlapping keys

    df3 <- rbind(df1[df1$K %in% ndups, ],
                 df2[df2$K %in% ndups, ])   # bind the non-overlapping rows

    decider <- function(x, y) ifelse(is.na(x), y, x)  # replaces NAs with existing values
    df4 <- data.frame(mapply(df2[df2$K %in% dups, ],
                             df1[df1$K %in% dups, ],
                             FUN = decider))  # replace all NAs of df2 with df1 values where they exist

    df5 <- rbind(df3, df4)          # bind the unique rows to the NA-replaced overlap
    df5 <- df5[order(df5$K), ]      # reorder on the key column
    rownames(df5) <- 1:nrow(df5)    # restore clean, non-duplicated rownames
    df5

This gives:

      K P1 P2 P3
    1 A  1  1  1
    2 B  2  1  2
    3 C  1  2  2
    4 D  2  2  2

On closer reading I see that not all of your columns share the same names, though I assume they are in the same order. If so, this might be a more useful approach:

    all <- c(levels(df1$K), levels(df2$K))
    dups <- all[duplicated(all)]
    ndups <- all[!all %in% dups]

    LS <- list(df1, df2)
    LS2 <- lapply(seq_along(LS), function(i) {
        colnames(LS[[i]]) <- colnames(LS[[2]])  # force both frames to share column names
        return(LS[[i]])
    })

    LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K %in% ndups, ])  # unique rows
    LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K %in% dups, ])   # overlapping rows

    decider <- function(x, y) ifelse(is.na(x), y, x)
    DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
    DF$K <- LS4[[1]]$K
    LS3[[3]] <- DF

    df5 <- do.call("rbind", LS3)
    df5 <- df5[order(df5$K), ]
    rownames(df5) <- 1:nrow(df5)
    df5

EDIT: ignore this answer. Looping row by row is a bad idea: it works, but it is very slow. Left for posterity! See my second attempt, posted as a separate answer.

    require(data.table)
    dt1 = as.data.table(df1)
    dt2 = as.data.table(df2)
    setkey(dt1, K)   # J(k) below joins on the key, so dt1 must be keyed
    K = dt2[[1]]
    for (i in 1:nrow(dt2)) {
        k = K[i]
        p = unlist(dt2[i, -1, with=FALSE])   # this row's values, minus the key
        p = p[!is.na(p)]                     # keep only the non-NA updates
        dt1[J(k), names(p) := as.list(p), with=FALSE]
    }

Alternatively, could you use a matrix instead of a data.frame? If so, it could be a one-liner using the A[B] syntax, where B is a two-column matrix containing the row and column numbers to update.
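A hypothetical sketch of that idea, assuming all the value columns are numeric so the data can live in a matrix keyed by rownames:

    m1 <- as.matrix(df1[-1]); rownames(m1) <- df1$K
    m2 <- as.matrix(df2[-1]); rownames(m2) <- df2$K

    upd  <- m2[rownames(m2) %in% rownames(m1), , drop=FALSE]  # overlapping rows only
    keep <- which(!is.na(upd), arr.ind=TRUE)                  # (row, col) of non-NA cells
    B    <- cbind(match(rownames(upd), rownames(m1))[keep[, "row"]], keep[, "col"])
    m1[B] <- upd[keep]   # one vectorized assignment updates every cell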


Below is an answer that is correct for the small example data. It tries to minimize the number of copies of the tables, and it uses the new fread and the (new?) rbindlist. Does it work with your large actual dataset? I did not entirely follow all the comments in the original post about the memory problems you experienced while trying to melt/normalize/stack, so apologies if you have already tried this route.

    library(data.table)
    library(reshape2)

    cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
    cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")

    # Read each file, melt to long/stacked format, and convert to a keyed data.table
    dt1s <- data.table(melt(fread("f1.dat"), id.vars="K"),
                       key=c("K","variable"))
    dt2s <- data.table(melt(fread("f2.dat"), id.vars="K", na.rm=TRUE),
                       key=c("K","variable"))
    setnames(dt2s, "value", "value.new")

    dt1s[dt2s, value := value.new]   # update the overlapping cells

    # Use rbindlist to insert the new records, then reshape back to wide format
    dtout <- reshape(rbindlist(list(dt1s,
                                    dt1s[dt2s][is.na(value),
                                               list(K, variable, value=value.new)])),
                     direction="wide", idvar="K", timevar="variable")
    setkey(dtout, K)
    setnames(dtout, colnames(dtout),
             sub("value.", "", colnames(dtout)))   # clean up the column names

Source: https://habr.com/ru/post/912416/

