Merge data frames and combine columns with the same name but incomplete data

I have two data frames that have multiple columns with the same names and others with different names. Data frames look something like this:

    df1
      ID hello world hockey soccer
    1  1    NA    NA      7      4
    2  2    NA    NA      2      5
    3  3    10     8      8     23
    4  4     4    17      5     12
    5  5    NA    NA      3     43

    df2
      ID hello world football baseball
    1  1     2     3       43        6
    2  2     5     1       24       32
    3  3    NA    NA        2       23
    4  4    NA    NA        5       15
    5  5     9     7       12       23

As you can see, in two common columns (“hello” and “world”), some data is in one of the data frames, and the rest is in the other.

What I'm trying to do is (1) merge the 2 data frames by "ID", (2) combine all the data from the hello and world columns of both data frames into 1 hello column and 1 world column, and (3) have the final data frame also contain all of the other columns from the 2 original data frames (hockey, soccer, football, baseball). So, I want the end result to look like this:

      ID hello world hockey soccer football baseball
    1  1     2     3      7      4       43        6
    2  2     5     1      2      5       24       32
    3  3    10     8      8     23        2       23
    4  4     4    17      5     12        5       15
    5  5     9     7      3     43       12       23

I am new to R, so the only code I have tried is variants of merge, plus the answer I found here to a similar question: R: merging copies of the same variable. However, my data sets are actually much larger than what I show here (about 20 matching columns like "hello" and "world", and 100 non-matching ones like "hockey" and "football"), so I'm looking for something that doesn't require me to write everything out by hand.

Any idea if this can be done? Sorry I can't provide more of an example of my efforts, but I really don't know where to start beyond this:

 mydata <- merge(df1, df2, by=c("ID"), all = TRUE) 

To reproduce the data frames:

    # Note: relative to the printed tables above, the names df1 and df2 are swapped here
    df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L),
                          hello = c(2, 5, NA, NA, 9),
                          world = c(3, 1, NA, NA, 7),
                          football = c(43, 24, 2, 5, 12),
                          baseball = c(6, 32, 23, 15, 23)),
                     .Names = c("ID", "hello", "world", "football", "baseball"),
                     class = "data.frame", row.names = c(NA, -5L))

    df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L),
                          hello = c(NA, NA, 10, 4, NA),
                          world = c(NA, NA, 8, 17, NA),
                          hockey = c(7, 2, 8, 5, 3),
                          soccer = c(4, 5, 23, 12, 43)),
                     .Names = c("ID", "hello", "world", "hockey", "soccer"),
                     class = "data.frame", row.names = c(NA, -5L))
+18
6 answers

Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to wide form. I've added comments to explain what is going on.

    ## Required packages
    library(data.table)
    library(reshape2)

    dcast.data.table(
      merge(
        ## melt the first data.frame and set the key as ID and variable
        setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable),
        ## melt the second data.frame
        melt(as.data.table(df2), id.vars = "ID"),
        ## you'll have 2 value columns...
        all = TRUE)[, value := ifelse(
          ## ... combine them into 1 with ifelse
          is.na(value.x), value.y, value.x)],
      ## This is your reshaping formula
      ID ~ variable, value.var = "value")
    #    ID hello world football baseball hockey soccer
    # 1:  1     2     3       43        6      7      4
    # 2:  2     5     1       24       32      2      5
    # 3:  3    10     8        2       23      8     23
    # 4:  4     4    17        5       15      5     12
    # 5:  5     9     7       12       23      3     43
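For comparison, the same idea of collapsing two value columns into one can be sketched in base R without reshaping. This is only a minimal sketch, not the method used above: the toy frames `a` and `b` are made up for illustration, it assumes `ID` uniquely identifies rows in each frame, and it relies on the default `.x`/`.y` suffixes that `merge()` adds to clashing column names.

```r
# Toy data (hypothetical, smaller than the question's frames)
a <- data.frame(ID = 1:3, hello = c(NA, 10, 4), hockey = c(7, 8, 5))
b <- data.frame(ID = 1:3, hello = c(2, NA, NA), football = c(43, 2, 5))

# Columns present in both frames, apart from the key
shared <- setdiff(intersect(names(a), names(b)), "ID")

# merge() keeps both copies of each shared column as <name>.x and <name>.y
merged <- merge(a, b, by = "ID", all = TRUE)

# Collapse each .x/.y pair into a single column, preferring the .x value
for (nm in shared) {
  x <- merged[[paste0(nm, ".x")]]
  y <- merged[[paste0(nm, ".y")]]
  merged[[nm]] <- ifelse(is.na(x), y, x)
  merged[[paste0(nm, ".x")]] <- NULL
  merged[[paste0(nm, ".y")]] <- NULL
}

merged
```

Because `shared` is computed from the column names, nothing has to be listed by hand, which matters when there are about 20 shared columns. The combined columns end up at the right-hand side of the result.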
+12

Nobody has posted a dplyr solution, so here is a short one. The approach is simply to do a full_join on all of the shared columns, which keeps every row from both data frames, then group by ID and summarise to remove the redundant missing cells.

    library(tidyverse)

    # Same data as in the question, as integer tibbles
    df1 <- tibble(ID = 1:5,
                  hello = c(NA, NA, 10L, 4L, NA),
                  world = c(NA, NA, 8L, 17L, NA),
                  hockey = c(7L, 2L, 8L, 5L, 3L),
                  soccer = c(4L, 5L, 23L, 12L, 43L))

    df2 <- tibble(ID = 1:5,
                  hello = c(2L, 5L, NA, NA, 9L),
                  world = c(3L, 1L, NA, NA, 7L),
                  football = c(43L, 24L, 2L, 5L, 12L),
                  baseball = c(6L, 32L, 23L, 15L, 2L))

    df1 %>%
      # join on every column the two frames have in common
      full_join(df2, by = intersect(colnames(df1), colnames(df2))) %>%
      group_by(ID) %>%
      # keep the single non-missing value per ID in each column
      summarize_all(na.omit)
    #> # A tibble: 5 x 7
    #>      ID hello world hockey soccer football baseball
    #>   <int> <int> <int>  <int>  <int>    <int>    <int>
    #> 1     1     2     3      7      4       43        6
    #> 2     2     5     1      2      5       24       32
    #> 3     3    10     8      8     23        2       23
    #> 4     4     4    17      5     12        5       15
    #> 5     5     9     7      3     43       12        2

Created on 2018-07-13 by reprex package (v0.2.0).
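Note that `summarize_all(na.omit)` relies on each column having exactly one non-missing value per ID. A hedged sketch of a more tolerant variant follows; it is not the code above, and the toy frames `x` and `y` and the helper `first_non_na` are hypothetical names introduced only for illustration. It uses `across()`, which supersedes `summarize_all()` in current dplyr.

```r
library(dplyr)

# Toy data (hypothetical); ID 3 has the same `hello` value in both frames
x <- data.frame(ID = 1:3, hello = c(NA, 10L, 4L), hockey = c(7L, 8L, 5L))
y <- data.frame(ID = 1:3, hello = c(2L, NA, 4L), football = c(43L, 2L, 5L))

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(v) v[!is.na(v)][1]

result <- x %>%
  full_join(y, by = intersect(names(x), names(y))) %>%
  group_by(ID) %>%
  summarise(across(everything(), first_non_na))

result
```

If the two frames hold different non-missing values for the same ID and column, this sketch silently keeps the first one it meets, so it is worth checking for such conflicts beforehand.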

+8

Here's another data.table approach, using keyed (binary search) joins:

    library(data.table)
    setkey(setDT(df1), ID) ; setkey(setDT(df2), ID) # Converting to data.table objects and setting keys
    df1 <- df1[df2][, `:=`(i.hello = NULL, i.world = NULL)] # Full left join
    df1[df2[complete.cases(df2)], `:=`(hello = i.hello, world = i.world)][] # Joining only on non-missing values
    #    ID hello world football baseball hockey soccer
    # 1:  1     2     3       43        6      7      4
    # 2:  2     5     1       24       32      2      5
    # 3:  3    10     8        2       23      8     23
    # 4:  4     4    17        5       15      5     12
    # 5:  5     9     7       12       23      3     43
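The "fill the gaps from the other table" idea can also be sketched in base R with `match()`, looping over the shared columns so that none of them has to be named explicitly. This is an illustrative sketch under assumptions, not the data.table code above: the frames `left` and `right` are made-up toy data, and `ID` is assumed to be unique within each frame.

```r
# Toy data (hypothetical); note that `right` is in a different row order
left  <- data.frame(ID = 1:4, hello = c(2, NA, NA, 9), football = c(43, 24, 2, 12))
right <- data.frame(ID = c(4, 3, 2), hello = c(NA, 10, 4), hockey = c(3, 8, 5))

# Columns present in both frames, apart from the key
shared <- setdiff(intersect(names(left), names(right)), "ID")

# Position of each left ID in right (NA where right has no such ID)
idx <- match(left$ID, right$ID)

# Fill only the missing cells of the shared columns from the matching rows
filled <- left
for (nm in shared) {
  gap <- is.na(filled[[nm]])
  filled[[nm]][gap] <- right[[nm]][idx[gap]]
}

# Bring in the columns that exist only in right
result <- merge(filled, right[setdiff(names(right), shared)], by = "ID", all = TRUE)

result
```

Matching on `ID` rather than on row position means the two frames do not need to be sorted the same way. IDs that appear only in `right` would get their shared columns left as NA by the final `merge()`, so this sketch fits best when `left` contains every ID.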
+6

@ananda-mahto's answer is more elegant, but here is my suggestion:

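    library(reshape2)

    # melt both data frames to long form, dropping the missing values
    df1 <- melt(df1, id = 'ID', na.rm = TRUE)
    df2 <- melt(df2, id = 'ID', na.rm = TRUE)
    # stack the two long data frames
    DF <- rbind(df1, df2)
    # Not needed, added na.rm=TRUE based on @ananda-mahto's valid comment
    # DF <- DF[!is.na(DF$value), ]
    # cast back to wide form, one column per variable
    dcast(DF, ID ~ variable, value.var = 'value')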
+5

Here is a more tidyr-oriented approach that does something similar to the currently accepted answer. The approach is simply to stack the data frames on top of each other using bind_rows (which matches column names), gather all non-ID columns with na.rm = TRUE, and then spread them back out. Compared with the summarise option, this should be robust to situations where the condition "if the value is NA in df1, it is present in df2 (and vice versa)" does not always hold.

    library(tidyverse)

    # Same data as in the question, as integer tibbles
    df1 <- tibble(ID = 1:5,
                  hello = c(NA, NA, 10L, 4L, NA),
                  world = c(NA, NA, 8L, 17L, NA),
                  hockey = c(7L, 2L, 8L, 5L, 3L),
                  soccer = c(4L, 5L, 23L, 12L, 43L))

    df2 <- tibble(ID = 1:5,
                  hello = c(2L, 5L, NA, NA, 9L),
                  world = c(3L, 1L, NA, NA, 7L),
                  football = c(43L, 24L, 2L, 5L, 12L),
                  baseball = c(6L, 32L, 23L, 15L, 2L))

    df1 %>%
      # stack the frames; columns are matched by name
      bind_rows(df2) %>%
      # long form, dropping the missing values
      gather(variable, value, -ID, na.rm = TRUE) %>%
      # back to wide form
      spread(variable, value)
    #> # A tibble: 5 x 7
    #>      ID baseball football hello hockey soccer world
    #>   <int>    <int>    <int> <int>  <int>  <int> <int>
    #> 1     1        6       43     2      7      4     3
    #> 2     2       32       24     5      2      5     1
    #> 3     3       23        2    10      8     23     8
    #> 4     4       15        5     4      5     12    17
    #> 5     5        2       12     9      3     43     7

Created on 2018-07-13 by reprex package (v0.2.0).
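In current tidyr, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. A minimal sketch of the same stack, lengthen, widen chain with the newer functions follows; the toy frames `x` and `y` are hypothetical and only there to keep the example short.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical, smaller than the question's frames)
x <- data.frame(ID = 1:3, hello = c(NA, 10L, 4L), hockey = c(7L, 8L, 5L))
y <- data.frame(ID = 1:3, hello = c(2L, NA, NA), football = c(43L, 2L, 5L))

wide <- bind_rows(x, y) %>%
  # long form: one row per ID/variable pair, missing values dropped
  pivot_longer(-ID, names_to = "variable", values_to = "value",
               values_drop_na = TRUE) %>%
  # back to wide form: one column per variable
  pivot_wider(names_from = variable, values_from = value)

wide
```

If an ID has a non-missing value for the same column in both frames, `pivot_wider()` warns and returns list-columns instead of picking one, which at least makes the conflict visible.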

+4

With the tidyverse, we could use coalesce.

None of the solutions below creates extra rows; the data keeps roughly the same size and shape throughout the chain.

Solution 1

    library(tidyverse)

    list(df1, df2) %>%
      transpose(union(names(df1), names(df2))) %>%
      map_dfc(. %>% compact %>% invoke(coalesce, .))
    # # A tibble: 5 x 7
    #      ID hello world football baseball hockey soccer
    #   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
    # 1     1     2     3       43        6      7      4
    # 2     2     5     1       24       32      2      5
    # 3     3    10     8        2       23      8     23
    # 4     4     4    17        5       15      5     12
    # 5     5     9     7       12       23      3     43

Explanations

  • Wrap both data frames in a list.
  • transpose it, so that each new top-level element is named after an output column. By default, transpose uses the first element as a template, so unfortunately we have to be explicit to get all of the names.
  • compact these elements: each has length 2, but one of its items is NULL when the column is missing on one side.
  • coalesce them, which basically means returning the first non-NA value found when the arguments are placed side by side.
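The steps above can be sketched in base R to show what happens column by column. This is an illustration under assumptions rather than the purrr code itself: the helper `coalesce_base` and the toy frames `x` and `y` are hypothetical. Like the original, it assumes both frames hold the same IDs in the same row order, because values are combined by position and not by key.

```r
# Toy data (hypothetical); rows are in the same ID order in both frames
x <- data.frame(ID = 1:3, hello = c(NA, 10, 4), hockey = c(7, 8, 5))
y <- data.frame(ID = 1:3, hello = c(2, NA, NA), football = c(43, 2, 5))

# Hypothetical helper: element-wise first non-NA value across several vectors
coalesce_base <- function(vectors) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), vectors)
}

# Every column name that appears in either frame
cols <- union(names(x), names(y))

combined <- as.data.frame(lapply(setNames(cols, cols), function(nm) {
  # x[[nm]] is NULL when the column is absent; drop those entries
  present <- Filter(Negate(is.null), list(x[[nm]], y[[nm]]))
  coalesce_base(present)
}))

combined
```

If the row order might differ between the two frames, sorting both by `ID` first (or using one of the join-based answers) avoids silently mixing values from different IDs.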

If repeating df1 and df2 in the second line is a problem, use the following instead:

 transpose(invoke(union, setNames(map(., names), c("x","y")))) 

Solution 2

The same philosophy, but this time we focus on the names:

    map_dfc(set_names(union(names(df1), names(df2))),
            ~ invoke(coalesce, compact(list(df1[[.x]], df2[[.x]]))))
    # # A tibble: 5 x 7
    #      ID hello world football baseball hockey soccer
    #   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
    # 1     1     2     3       43        6      7      4
    # 2     2     5     1       24       32      2      5
    # 3     3    10     8        2       23      8     23
    # 4     4     4    17        5       15      5     12
    # 5     5     9     7       12       23      3     43

Here it is written with pipes, for those who prefer:

    union(names(df1), names(df2)) %>%
      set_names %>%
      map_dfc(~ list(df1[[.x]], df2[[.x]]) %>% compact %>% invoke(coalesce, .))

Explanations

  • set_names gives the character vector names identical to its values, so that map_dfc can name the output columns.
  • df1[[.x]] returns NULL when .x is not a column of df1; we take advantage of this.
  • df1 and df2 are each mentioned twice, and I can't think of a way around it.

Solution 1 is cleaner regarding these points, so I recommend it.

+4

Source: https://habr.com/ru/post/1207794/

