I am looking for a general solution for updating one large data frame with the contents of a second, similar data frame. I have dozens of data sets, each of which contains thousands of rows and more than 10,000 columns. The "update" dataset will overlap the corresponding "base" dataset anywhere from a few percent to maybe 50 percent, row-wise. Each dataset has a key column, and any given dataset will have only one row for each unique key value.
Basic rule: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with this value. ("The same cell" means the same value of the key column and the same column name.)
Note that the update dataset will most likely also contain new rows ("inserts"), which I can handle with rbind.
So the base data frame is "df1", where column "K" is the unique key column and "P1" .. "P3" stand in for the 10,000 columns, whose names will differ from one pair of data sets to the next:
      K P1 P2 P3
    1 A  1  1  1
    2 B  1  1  1
    3 C  1  1  1
... and the "df2" update data frame:
      K P1 P2 P3
    1 B  2 NA  2
    2 C NA  2  2
    3 D  2  2  2
The result I need is the following, where the 1s for "B" and "C" have been overwritten by 2s, but never by NA:
      K P1 P2 P3
    1 A  1  1  1
    2 B  2  1  2
    3 C  1  2  2
    4 D  2  2  2
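To pin down the rule precisely, here is a minimal sketch of it on the toy data, using match() on the key plus logical matrix indexing. It is only an illustration of the intended semantics, not a tested answer to the performance problem: it assumes every non-key column has the same type (numeric here) so that as.matrix() is safe, and the helper names (cols, idx, hit, base, upd, target, keep, result) are made up for this example.

```r
# Toy data built inline so the sketch is self-contained
df1 <- data.frame(K = c("A", "B", "C"),
                  P1 = c(1, 1, 1), P2 = c(1, 1, 1), P3 = c(1, 1, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"),
                  P1 = c(2, NA, 2), P2 = c(NA, 2, 2), P3 = c(2, 2, 2),
                  stringsAsFactors = FALSE)

cols <- setdiff(names(df1), "K")   # every column except the key
idx  <- match(df2$K, df1$K)        # row of df1 for each df2 key, NA if the key is new
hit  <- !is.na(idx)                # TRUE for df2 rows that update an existing row

# Assumption: all non-key columns share one type, so a matrix is safe
base   <- as.matrix(df1[cols])
upd    <- as.matrix(df2[hit, cols, drop = FALSE])
target <- base[idx[hit], , drop = FALSE]

keep <- !is.na(upd)                # only non-NA update cells may overwrite
target[keep] <- upd[keep]
base[idx[hit], ] <- target

# Reattach the key and append the rows of df2 whose keys are new ("inserts")
result <- rbind(
  data.frame(K = df1$K, base, stringsAsFactors = FALSE),
  df2[!hit, , drop = FALSE]
)
rownames(result) <- NULL
print(result)
```

On the toy data this reproduces the table above. Whether it scales acceptably to thousands of rows and 10,000 columns is not something this sketch establishes.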
This doesn't seem to be a merge candidate, since merge gives me either duplicate rows (relative to the key column) or duplicate columns (like P1.x, P1.y) that I have to sort out to collapse somehow.
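For illustration, this is what the duplicate-column outcome looks like on the toy data with a full outer merge on the key. The data frames are rebuilt inline so the snippet runs on its own, and the name m is just a placeholder.

```r
# Toy data built inline so the snippet is self-contained
df1 <- data.frame(K = c("A", "B", "C"),
                  P1 = c(1, 1, 1), P2 = c(1, 1, 1), P3 = c(1, 1, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"),
                  P1 = c(2, NA, 2), P2 = c(NA, 2, 2), P3 = c(2, 2, 2),
                  stringsAsFactors = FALSE)

# Full outer join on the key: one row per key, but every P column appears twice
m <- merge(df1, df2, by = "K", all = TRUE)
print(names(m))
# "K" "P1.x" "P2.x" "P3.x" "P1.y" "P2.y" "P3.y"
print(m)
```

Each .x/.y pair would then have to be collapsed into a single column, which is the step that gets awkward with 10,000 columns.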
I tried pre-allocating a matrix with the dimensions of the final rows and columns, filling it with the contents of df1, and then iterating over the overlapping rows of df2, but I cannot get more than about 20 cells per second, which would take hours to complete (compared to minutes for the equivalent DATA UPDATE functionality in SAS).
I'm sure I'm missing something, but I cannot find a comparable example.
I have seen uses of ddply that look close, but nothing that amounts to a general solution. The data.table package did not seem to help, since it is not obvious to me that this is a join problem, at least not across this many columns.
A solution that handles only the intersecting rows would also be adequate, as I can identify the other rows and append them myself.
Here is the code to create the data frames above:
    cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
    cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
    df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE)
    df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE)
thanks