Combine panel data to obtain balanced panel data

I have several data frames in panel form, and I want to combine them into a single panel. The data frames share some variables and differ in others, as illustrated below:

df1:

    Month  variable Beta1 Beta2 Beta3 Beta4 Beta5 Beta6
    Jan-05 A            1     2     3     4     5     6
    Feb-05 A            2     3     4     5     6     7
    Mar-05 A            3     4     5     6     7     8
    Apr-05 A            4     5     6     7     8     9
    May-05 A            5     6     7     8     9    10
    Jun-05 A            6     7     8     9    10    11
    Jul-05 A            7     8     9    10    11    12
    Aug-05 A            8     9    10    11    12    13
    Sep-05 A            9    10    11    12    13    14
    Oct-05 A           10    11    12    13    14    15
    Nov-05 A           11    12    13    14    15    16
    Dec-05 A           12    13    14    15    16    17
    Jan-05 B           12    12    12    12    12    12
    Feb-05 B           12    12    12    12    12    12
    Mar-05 B           12    12    12    12    12    12
    Apr-05 B           12    12    12    12    12    12
    May-05 B           12    12    12    12    12    12
    Jun-05 B           12    12    12    12    12    12
    Jul-05 B           12    12    12    12    12    12
    Aug-05 B           12    12    12    12    12    12
    Sep-05 B           12    12    12    12    12    12
    Oct-05 B           12    12    12    12    12    12
    Nov-05 B           12    12    12    12    12    12
    Dec-05 B           12    12    12    12    12    12

df2:

    Month  variable Beta1 Beta2 Beta3 Beta4 Beta5 Beta6
    Jan-06 A            1     2     3     4     5     6
    Feb-06 A            2     3     4     5     6     7
    Mar-06 A            3     4     5     6     7     8
    Apr-06 A            4     5     6     7     8     9
    May-06 A            5     6     7     8     9    10
    Jun-06 A            6     7     8     9    10    11
    Jul-06 A            7     8     9    10    11    12
    Aug-06 A            8     9    10    11    12    13
    Sep-06 A            9    10    11    12    13    14
    Oct-06 A           10    11    12    13    14    15
    Nov-06 A           11    12    13    14    15    16
    Dec-06 A           12    13    14    15    16    17
    Jan-06 C           12    12    12    12    12    12
    Feb-06 C           12    12    12    12    12    12
    Mar-06 C           12    12    12    12    12    12
    Apr-06 C           12    12    12    12    12    12
    May-06 C           12    12    12    12    12    12
    Jun-06 C           12    12    12    12    12    12
    Jul-06 C           12    12    12    12    12    12
    Aug-06 C           12    12    12    12    12    12
    Sep-06 C           12    12    12    12    12    12
    Oct-06 C           12    12    12    12    12    12
    Nov-06 C           12    12    12    12    12    12
    Dec-06 C           12    12    12    12    12    12

The desired result is shown below. I want to combine the panel data frames so that, within each variable, the rows are ordered chronologically, and where no data is available for a month the row has NA under Beta1, Beta2, etc.

    Month  variable Beta1 Beta2 Beta3 Beta4 Beta5 Beta6
    Jan-05 A            1     2     3     4     5     6
    Feb-05 A            2     3     4     5     6     7
    Mar-05 A            3     4     5     6     7     8
    Apr-05 A            4     5     6     7     8     9
    May-05 A            5     6     7     8     9    10
    Jun-05 A            6     7     8     9    10    11
    Jul-05 A            7     8     9    10    11    12
    Aug-05 A            8     9    10    11    12    13
    Sep-05 A            9    10    11    12    13    14
    Oct-05 A           10    11    12    13    14    15
    Nov-05 A           11    12    13    14    15    16
    Dec-05 A           12    13    14    15    16    17
    Jan-06 A            1     2     3     4     5     6
    Feb-06 A            2     3     4     5     6     7
    Mar-06 A            3     4     5     6     7     8
    Apr-06 A            4     5     6     7     8     9
    May-06 A            5     6     7     8     9    10
    Jun-06 A            6     7     8     9    10    11
    Jul-06 A            7     8     9    10    11    12
    Aug-06 A            8     9    10    11    12    13
    Sep-06 A            9    10    11    12    13    14
    Oct-06 A           10    11    12    13    14    15
    Nov-06 A           11    12    13    14    15    16
    Dec-06 A           12    13    14    15    16    17
    Jan-05 B           12    12    12    12    12    12
    Feb-05 B           12    12    12    12    12    12
    Mar-05 B           12    12    12    12    12    12
    Apr-05 B           12    12    12    12    12    12
    May-05 B           12    12    12    12    12    12
    Jun-05 B           12    12    12    12    12    12
    Jul-05 B           12    12    12    12    12    12
    Aug-05 B           12    12    12    12    12    12
    Sep-05 B           12    12    12    12    12    12
    Oct-05 B           12    12    12    12    12    12
    Nov-05 B           12    12    12    12    12    12
    Dec-05 B           12    12    12    12    12    12
    Jan-06 B           NA    NA    NA    NA    NA    NA
    Feb-06 B           NA    NA    NA    NA    NA    NA
    Mar-06 B           NA    NA    NA    NA    NA    NA
    Apr-06 B           NA    NA    NA    NA    NA    NA
    May-06 B           NA    NA    NA    NA    NA    NA
    Jun-06 B           NA    NA    NA    NA    NA    NA
    Jul-06 B           NA    NA    NA    NA    NA    NA
    Aug-06 B           NA    NA    NA    NA    NA    NA
    Sep-06 B           NA    NA    NA    NA    NA    NA
    Oct-06 B           NA    NA    NA    NA    NA    NA
    Nov-06 B           NA    NA    NA    NA    NA    NA
    Dec-06 B           NA    NA    NA    NA    NA    NA
    Jan-05 C           NA    NA    NA    NA    NA    NA
    Feb-05 C           NA    NA    NA    NA    NA    NA
    Mar-05 C           NA    NA    NA    NA    NA    NA
    Apr-05 C           NA    NA    NA    NA    NA    NA
    May-05 C           NA    NA    NA    NA    NA    NA
    Jun-05 C           NA    NA    NA    NA    NA    NA
    Jul-05 C           NA    NA    NA    NA    NA    NA
    Aug-05 C           NA    NA    NA    NA    NA    NA
    Sep-05 C           NA    NA    NA    NA    NA    NA
    Oct-05 C           NA    NA    NA    NA    NA    NA
    Nov-05 C           NA    NA    NA    NA    NA    NA
    Dec-05 C           NA    NA    NA    NA    NA    NA
    Jan-06 C           12    12    12    12    12    12
    Feb-06 C           12    12    12    12    12    12
    Mar-06 C           12    12    12    12    12    12
    Apr-06 C           12    12    12    12    12    12
    May-06 C           12    12    12    12    12    12
    Jun-06 C           12    12    12    12    12    12
    Jul-06 C           12    12    12    12    12    12
    Aug-06 C           12    12    12    12    12    12
    Sep-06 C           12    12    12    12    12    12
    Oct-06 C           12    12    12    12    12    12
    Nov-06 C           12    12    12    12    12    12
    Dec-06 C           12    12    12    12    12    12

As mentioned above, I have several such data frames, and merging them will probably produce hundreds of thousands of rows, so I may run into memory and space problems. I would really appreciate your help.

2 answers

tidyr has a function for this. Combine the data frames with rbind, then use complete. It looks at every combination of Month and variable and adds rows for the missing ones, filled with NA:

    library(tidyr)
    library(zoo)  # provides as.yearmon()

    # Stack the data frames on top of each other
    df3 <- do.call(rbind.data.frame, list(df1, df2))
    df3$Month <- as.character(df3$Month)

    # Add a row for every Month/variable combination; missing ones get NA
    df4 <- complete(df3, Month, variable)

    # Convert Month to yearmon so that it sorts chronologically.
    # The format must match the data: "%b-%y" parses "Jan-05"
    # (use "%b %Y" if the months are stored as "Jan 2005").
    df4$Month <- as.yearmon(df4$Month, "%b-%y")
    df5 <- df4[order(df4$variable, df4$Month), ]
    df5
    # Source: local data frame [72 x 8]
    #
    #       Month variable Beta1 Beta2 Beta3 Beta4 Beta5 Beta6
    #      (yrmn)   (fctr) (int) (int) (int) (int) (int) (int)
    # 1  Jan 2005        A     1     2     3     4     5     6
    # 2  Feb 2005        A     2     3     4     5     6     7
    # 3  Mar 2005        A     3     4     5     6     7     8
    # 4  Apr 2005        A     4     5     6     7     8     9
    # 5  May 2005        A     5     6     7     8     9    10
    # 6  Jun 2005        A     6     7     8     9    10    11
    # 7  Jul 2005        A     7     8     9    10    11    12
    # 8  Aug 2005        A     8     9    10    11    12    13
    # 9  Sep 2005        A     9    10    11    12    13    14
    # 10 Oct 2005        A    10    11    12    13    14    15
    # ..      ...      ...   ...   ...   ...   ...   ...   ...

Alternative implementation with dplyr and tidyr:

    library(dplyr)
    library(tidyr)

    df3 <- bind_rows(df1, df2) %>%
      complete(Month, variable)

Two alternative possibilities, of which the data.table one is especially of interest when speed and memory are a concern:

base R:

Bind the data frames together into one:

 df3 <- rbind(df1,df2) 

Create a reference data frame with all possible combinations of Month and variable using expand.grid:

 ref <- expand.grid(Month = unique(df3$Month), variable = unique(df3$variable)) 

Merge them with all.x = TRUE to ensure that the missing combinations are filled with NA values:

 merge(ref, df3, by = c("Month", "variable"), all.x = TRUE) 

Or, selecting the key columns by position (thanks to @PierreLafortune):

 merge(ref, df3, by=1:2, all.x = TRUE) 
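The base R steps above can be tried end to end on a small example. The sketch below uses made-up two-month panels with a single Beta1 column (hypothetical stand-ins for df1 and df2), and adds one step that is not part of the answer: since merge() sorts Month as text, the rows are reordered chronologically by splitting the "Jan-05" label into month and year. It assumes that all two-digit years belong to the 2000s.

```r
# Toy panels (hypothetical data) standing in for df1 and df2
df1 <- data.frame(Month = c("Jan-05", "Feb-05", "Jan-05", "Feb-05"),
                  variable = c("A", "A", "B", "B"),
                  Beta1 = c(1, 2, 12, 12),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Month = c("Jan-06", "Feb-06", "Jan-06", "Feb-06"),
                  variable = c("A", "A", "C", "C"),
                  Beta1 = c(1, 2, 12, 12),
                  stringsAsFactors = FALSE)

# Stack the panels
df3 <- rbind(df1, df2)

# Every Month/variable combination: 4 months x 3 variables = 12 rows
ref <- expand.grid(Month = unique(df3$Month),
                   variable = unique(df3$variable),
                   stringsAsFactors = FALSE)

# Left join: combinations absent from df3 get NA in Beta1
balanced <- merge(ref, df3, by = c("Month", "variable"), all.x = TRUE)

# Order chronologically within each variable.
# month.abb is the built-in vector "Jan", "Feb", ..., which avoids
# locale-dependent date parsing. Assumption: years are 20xx.
month_num <- match(substr(balanced$Month, 1, 3), month.abb)
year_num <- 2000 + as.integer(substr(balanced$Month, 5, 6))
balanced <- balanced[order(balanced$variable, year_num, month_num), ]
rownames(balanced) <- NULL

balanced
```

Here A has values for all four months, while B has NA for the 2006 months and C has NA for the 2005 months, mirroring the layout of the desired result in the question.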

data.table:

Bind the data frames into one with rbindlist, which returns a data.table:

    library(data.table)
    DT <- rbindlist(list(df1, df2))

Join with a cross join (CJ) of Month and variable to make sure that all combinations are present and the missing ones are filled with NA:

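    # Naming the CJ() columns explicitly keeps the join independent of how
    # CJ() auto-names its output (older data.table versions used V1, V2)
    DT[CJ(Month = Month, variable = variable, unique = TRUE), on = c("Month", "variable")]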

All together in one call:

    DT <- rbindlist(list(df1, df2))[CJ(Month = Month, variable = variable, unique = TRUE),
                                    on = c("Month", "variable")]

An alternative is wrapping rbindlist in setkey and then expanding with CJ (cross join):

 DT <- setkey(rbindlist(list(df1,df2)), Month, variable)[CJ(Month, variable, unique = TRUE)] 

Source: https://habr.com/ru/post/1243784/

