Merge a list of data frames on different id columns

I have lists of data frames, and the lists vary in length. I want to merge the dfs in each list into one df on a specified column name or index, which differs from df to df. Here is an example with 3 dfs:

    my.list <- list(
      data.frame(a = 1:10, b = letters[1:10], c = 101:110),
      data.frame(d = 6:15, e = letters[1:10], f = 1:10),
      data.frame(l = 2:11, m = letters[11:20], o = 1:10)
    )

and I want to merge on a specific column of each df, given in ids:

    ids <- c('a', 'f', 'l')

to get something like

    id    b   c  d    e    m  o
     1    a 101  6    a   NA NA
     2    b 102  7    b    k  1
     3    c 103  8    c    l  2
     4    d 104  9    d    m  3
     5    e 105 10    e    n  4
     6    f 106 11    f    o  5
     7    g 107 12    g    p  6
     8    h 108 13    h    q  7
     9    i 109 14    i    r  8
    10    j 110 15    j    s  9
    11   NA  NA NA   NA    t 10

I tried to do this with merge and/or Reduce but could not work out how to pass in the ids.

+5
5 answers

We can give the key column the same name in every list element by renaming the column that corresponds to the matching entry of ids to 'id', and then use Reduce with merge:

    lst <- Map(function(x, y) {names(x)[match(y, names(x))] <- 'id'; x}, my.list, ids)
    Reduce(function(...) merge(..., by = 'id', all = TRUE), lst)
    #   id    b   c  d    e    m  o
    #1   1    a 101  6    a <NA> NA
    #2   2    b 102  7    b    k  1
    #3   3    c 103  8    c    l  2
    #4   4    d 104  9    d    m  3
    #5   5    e 105 10    e    n  4
    #6   6    f 106 11    f    o  5
    #7   7    g 107 12    g    p  6
    #8   8    h 108 13    h    q  7
    #9   9    i 109 14    i    r  8
    #10 10    j 110 15    j    s  9
    #11 11 <NA>  NA NA <NA>    t 10
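
To see what the Map step does before anything is merged, the renamed column names can be inspected directly. This is only a minimal sketch using the data from the question; the variable key_names and the final guard are illustrative additions, not part of the answer's code.

```r
# Data from the question
my.list <- list(
  data.frame(a = 1:10, b = letters[1:10], c = 101:110),
  data.frame(d = 6:15, e = letters[1:10], f = 1:10),
  data.frame(l = 2:11, m = letters[11:20], o = 1:10)
)
ids <- c('a', 'f', 'l')

# Guard (an assumption about what is wanted, not in the answer):
# each entry of ids must be a column of the data frame at the same position
stopifnot(all(mapply(function(x, y) y %in% names(x), my.list, ids)))

# Rename the key column of each data frame to 'id'; Map pairs the
# i-th data frame with the i-th entry of ids
lst <- Map(function(x, y) {names(x)[match(y, names(x))] <- 'id'; x}, my.list, ids)

# Column names after renaming: 'a', 'f' and 'l' have each become 'id',
# all other columns are untouched
key_names <- lapply(lst, names)
print(key_names)
```

Because Map walks my.list and ids in parallel, match() is only ever applied to the one id that belongs to the current data frame.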
+6

Here is a data.table answer with an approach similar to @akrun's answer.

However, instead of renaming the columns, we set them as keys. Then we can merge by key rather than by name. This preserves the column names.

    library(data.table)
    funky <- function(x) {
      setDT(my.list[[x]])
      setkeyv(my.list[[x]], ids[x])
      return(NULL)
    }

The index x is passed to this function. First, it converts the data.frame at position x of my.list to a data.table. Then it sets the key of this new data.table to the column named at the same position in ids. Finally, since all of this is done in place, it returns NULL to prevent useless printing.

Now apply this function to all objects in the list.

    a <- lapply(seq_along(ids), funky)
    Reduce(function(x, y) merge(x, y, by.x = key(x), by.y = key(y), all = TRUE), my.list)

Inside Reduce, we specify the columns to merge on using key(x) and key(y). This is the step that lets us avoid renaming columns.

    #      a  b   c  d  e  m  o
    #  1:  1  a 101  6  a NA NA
    #  2:  2  b 102  7  b  k  1
    #  3:  3  c 103  8  c  l  2
    #  4:  4  d 104  9  d  m  3
    #  5:  5  e 105 10  e  n  4
    #  6:  6  f 106 11  f  o  5
    #  7:  7  g 107 12  g  p  6
    #  8:  8  h 108 13  h  q  7
    #  9:  9  i 109 14  i  r  8
    # 10: 10  j 110 15  j  s  9
    # 11: 11 NA  NA NA NA  t 10
+6

One idea is to turn the columns of interest into row names and then merge on the row names, i.e.

    l1 <- Map(function(x, y) {rownames(x) <- x[[y]]; x}, my.list, ids)
    Reduce(function(x, y) merge(x, y, all = TRUE),
           lapply(l1, function(x) data.frame(x, id = rownames(x))))
    #   id  a    b   c  d    e  f  l    m  o
    #1   1  1    a 101  6    a  1 NA <NA> NA
    #2  10 10    j 110 15    j 10 10    s  9
    #3   2  2    b 102  7    b  2  2    k  1
    #4   3  3    c 103  8    c  3  3    l  2
    #5   4  4    d 104  9    d  4  4    m  3
    #6   5  5    e 105 10    e  5  5    n  4
    #7   6  6    f 106 11    f  6  6    o  5
    #8   7  7    g 107 12    g  7  7    p  6
    #9   8  8    h 108 13    h  8  8    q  7
    #10  9  9    i 109 14    i  9  9    r  8
    #11 11 NA <NA>  NA NA <NA> NA 11    t 10
+5

@Frank's comment made me think of a plain, simple for loop:

    # initialise result
    result <- my.list[[1L]]
    # add/merge remaining data.frames from list using the given column in ids to merge on
    for (i in tail(seq_along(my.list), -1L)) {
      result <- merge(result, my.list[[i]], by.x = ids[1L], by.y = ids[i], all = TRUE)
    }
    result

        a    b   c  d    e    m  o
    1   1    a 101  6    a <NA> NA
    2   2    b 102  7    b    k  1
    3   3    c 103  8    c    l  2
    4   4    d 104  9    d    m  3
    5   5    e 105 10    e    n  4
    6   6    f 106 11    f    o  5
    7   7    g 107 12    g    p  6
    8   8    h 108 13    h    q  7
    9   9    i 109 14    i    r  8
    10 10    j 110 15    j    s  9
    11 11 <NA>  NA NA <NA>    t 10

This approach does not require renaming any column of any of the data.frames in the list before merging. However, to match the OP's expected result, the id column can be renamed afterwards:

    tmp <- colnames(result)
    colnames(result) <- replace(tmp, tmp == ids[1L], "id")
    result

       id    b   c  d    e    m  o
    1   1    a 101  6    a <NA> NA
    2   2    b 102  7    b    k  1
    3   3    c 103  8    c    l  2
    4   4    d 104  9    d    m  3
    5   5    e 105 10    e    n  4
    6   6    f 106 11    f    o  5
    7   7    g 107 12    g    p  6
    8   8    h 108 13    h    q  7
    9   9    i 109 14    i    r  8
    10 10    j 110 15    j    s  9
    11 11 <NA>  NA NA <NA>    t 10
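
Since the lists vary in length, the loop and the final rename could be wrapped in a small function. The following is only a sketch under assumptions: the name merge_by_ids, its id_name argument and the input checks are hypothetical and not part of the original answer, and it does not handle non-key columns that share a name across data frames.

```r
# Hypothetical helper: merge a list of data frames, each on its own key column
merge_by_ids <- function(dfs, ids, id_name = "id") {
  stopifnot(length(dfs) == length(ids), length(dfs) >= 1L)
  # Each key must be a column of the data frame at the same position
  has_key <- mapply(function(df, key) key %in% names(df), dfs, ids)
  stopifnot(all(has_key))

  result <- dfs[[1L]]
  # After each merge the key column keeps the name from by.x, i.e. ids[1L]
  for (i in seq_along(dfs)[-1L]) {
    result <- merge(result, dfs[[i]], by.x = ids[1L], by.y = ids[i], all = TRUE)
  }
  # Rename the key column once, after all merges are done
  names(result)[names(result) == ids[1L]] <- id_name
  result
}

# Usage with the data from the question
my.list <- list(
  data.frame(a = 1:10, b = letters[1:10], c = 101:110),
  data.frame(d = 6:15, e = letters[1:10], f = 1:10),
  data.frame(l = 2:11, m = letters[11:20], o = 1:10)
)
ids <- c('a', 'f', 'l')
res <- merge_by_ids(my.list, ids)
print(res)
```

The data frames in my.list are left unchanged; only the merged result gets the renamed key column.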

Note that the OP has stated several times that the ids vector contains the name of the column to merge on for each of the data.frames:

"I want to combine on a specific column of each df specified in ids"

and

"Essentially, I know the variables (ids), but they differ between dfs"

Therefore, I am afraid that answers using match() might be wrong.

+3

To join tables, I can recommend the sqldf function from the sqldf package. You can do it like this:

    library(sqldf)
    A = data.frame(a = 1:10, b = letters[1:10], c = 101:110)
    B = data.frame(d = 6:15, e = letters[1:10], f = 1:10)
    C = data.frame(l = 2:11, m = letters[11:20], o = 1:10)
    joined_df <- sqldf('select A.*, B.*, C.* from A left join B on A.a = B.f left join C on A.a = C.l')
-2

Source: https://habr.com/ru/post/1270058/

