R elegant way to balance unbalanced panel data

Is there an elegant way to balance an unbalanced panel dataset? I would like to start with an unbalanced panel (i.e., some people lack some data) and end up with a balanced panel (i.e., every person has complete data). The following is sample code. The correct end result keeps all observations for Frank and Edward and deletes all of Tony's observations, since he has some missing data. Thanks.

    unbal <- data.frame(
      PERSON = c(rep('Frank', 5), rep('Tony', 5), rep('Edward', 5)),
      YEAR   = c(2001, 2002, 2003, 2004, 2005, 2001, 2002, 2003, 2004, 2005, 2001, 2002, 2003, 2004, 2005),
      Y      = c(21, 22, 23, 24, 25, 5, 6, NA, 7, 8, 31, 32, 33, 34, 35),
      X      = c(1:15)
    )
    unbal
4 answers

One way to balance the panel is to remove people with incomplete data; another is to fill in a value, such as NA or 0, for missing observations. For the first approach, you can use complete.cases to find the rows that contain no NA. From that you can find every PERSON with at least one incomplete row and drop them.

    missing.at.least.one <- unique(unbal$PERSON[!complete.cases(unbal)])
    unbal[!(unbal$PERSON %in% missing.at.least.one), ]
    #    PERSON YEAR  Y  X
    # 1   Frank 2001 21  1
    # 2   Frank 2002 22  2
    # 3   Frank 2003 23  3
    # 4   Frank 2004 24  4
    # 5   Frank 2005 25  5
    # 11 Edward 2001 31 11
    # 12 Edward 2002 32 12
    # 13 Edward 2003 33 13
    # 14 Edward 2004 34 14
    # 15 Edward 2005 35 15
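The second approach mentioned above (filling in missing observations) is not shown in this answer. A minimal base R sketch could look like the following. Note the assumptions: the sample data already has a row for every person and year, so the hypothetical `gappy` data frame below drops Tony's 2003 row to create a real gap, and the names `gappy`, `grid` and `filled` are made up for illustration.

```r
# Sample data from the question (YEAR written as 2001:2005 for brevity)
unbal <- data.frame(
  PERSON = rep(c("Frank", "Tony", "Edward"), each = 5),
  YEAR   = rep(2001:2005, times = 3),
  Y      = c(21:25, 5, 6, NA, 7, 8, 31:35),
  X      = 1:15
)

# Hypothetical variant: the Tony/2003 row is absent altogether
gappy <- unbal[!(unbal$PERSON == "Tony" & unbal$YEAR == 2003), ]

# Every PERSON x YEAR combination that a balanced panel must contain
grid <- expand.grid(
  PERSON = unique(gappy$PERSON),
  YEAR   = unique(gappy$YEAR),
  stringsAsFactors = FALSE
)

# Left join: combinations missing from gappy come back with NA in Y and X
filled <- merge(grid, gappy, by = c("PERSON", "YEAR"), all.x = TRUE)
```

Here the panel becomes balanced in the sense that every person has a row for every year, but the filled rows carry NA, so whether this counts as "balanced" depends on what the downstream model needs.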

I'm not sure if this satisfies the "elegant" requirement, but here is a general-purpose function that you can use to get balanced data.

    balanced <- function(data, ID, TIME, VARS, required = c("all", "shared")) {
      # Accept column names or column indices
      if (is.character(ID)) {
        ID <- match(ID, names(data))
      }
      if (is.character(TIME)) {
        TIME <- match(TIME, names(data))
      }
      if (missing(VARS)) {
        VARS <- setdiff(1:ncol(data), c(ID, TIME))
      } else if (is.character(VARS)) {
        VARS <- match(VARS, names(data))
      }
      required <- match.arg(required)
      # One factor for the (possibly multi-column) ID, one for TIME
      idf <- do.call(interaction, c(data[, ID, drop = FALSE], drop = TRUE))
      timef <- do.call(interaction, c(data[, TIME, drop = FALSE], drop = TRUE))
      # Count complete rows per ID/TIME cell
      complete <- complete.cases(data[, VARS])
      tbl <- table(idf[complete], timef[complete])
      if (required == "all") {
        # Keep IDs that have exactly one complete row for every TIME
        keep <- which(rowSums(tbl == 1) == ncol(tbl))
        idx <- as.numeric(idf) %in% keep
      } else if (required == "shared") {
        # Keep TIMEs that have exactly one complete row for every ID
        keep <- which(colSums(tbl == 1) == nrow(tbl))
        idx <- as.numeric(timef) %in% keep
      }
      data[idx, ]
    }

You can get the desired result with

    balanced(unbal, "PERSON", "YEAR")
    #    PERSON YEAR  Y  X
    # 1   Frank 2001 21  1
    # 2   Frank 2002 22  2
    # 3   Frank 2003 23  3
    # 4   Frank 2004 24  4
    # 5   Frank 2005 25  5
    # 11 Edward 2001 31 11
    # 12 Edward 2002 32 12
    # 13 Edward 2003 33 13
    # 14 Edward 2004 34 14
    # 15 Edward 2005 35 15
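To see why Tony drops out, it may help to look at the kind of count table the function builds internally. The sketch below rebuilds that table by hand for the question's data; the variable names `full_rows` and `full_cols` are made up for illustration and do not appear in the function itself.

```r
# Sample data from the question (YEAR written as 2001:2005 for brevity)
unbal <- data.frame(
  PERSON = rep(c("Frank", "Tony", "Edward"), each = 5),
  YEAR   = rep(2001:2005, times = 3),
  Y      = c(21:25, 5, 6, NA, 7, 8, 31:35),
  X      = 1:15
)

# Rows with no NA in the value columns
complete <- complete.cases(unbal[, c("Y", "X")])

# Number of complete rows per PERSON (rows) and YEAR (columns)
tbl <- table(unbal$PERSON[complete], unbal$YEAR[complete])
print(tbl)

# required = "all": a person is kept only if every year has a count of 1
full_rows <- rowSums(tbl == 1) == ncol(tbl)   # FALSE for Tony only

# required = "shared": a year is kept only if every person has a count of 1
full_cols <- colSums(tbl == 1) == nrow(tbl)   # FALSE for 2003 only
```

The Tony/2003 cell is 0 because that row has an NA in Y, which is what makes Tony's row sum fall short under "all" and the 2003 column sum fall short under "shared".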

The first parameter is the data.frame that you want to subset. The second parameter ( ID= ) is a character vector of column names that identify each "person" in the data set. The TIME= parameter is also a character vector, naming the columns that define the observation times for each ID. You can then specify the VARS= argument to indicate which fields must be non-NA (by default, all columns other than ID and TIME). Finally, there is one last parameter called required , which indicates whether each ID must have an observation for every TIME (the default, "all"); if you set it to "shared", the function instead returns only the TIMEs for which all IDs have non-missing values.

So for example

    balanced(unbal, "PERSON", "YEAR", "X")
    #    PERSON YEAR  Y  X
    # 1   Frank 2001 21  1
    # 2   Frank 2002 22  2
    # 3   Frank 2003 23  3
    # 4   Frank 2004 24  4
    # 5   Frank 2005 25  5
    # 6    Tony 2001  5  6
    # 7    Tony 2002  6  7
    # 8    Tony 2003 NA  8
    # 9    Tony 2004  7  9
    # 10   Tony 2005  8 10
    # 11 Edward 2001 31 11
    # 12 Edward 2002 32 12
    # 13 Edward 2003 33 13
    # 14 Edward 2004 34 14
    # 15 Edward 2005 35 15

requires only that "X" be non-NA for every PERSON/YEAR, and since this is true for all records, no rows are dropped.

If you do

    balanced(unbal, "PERSON", "YEAR", required = "shared")
    #    PERSON YEAR  Y  X
    # 1   Frank 2001 21  1
    # 2   Frank 2002 22  2
    # 4   Frank 2004 24  4
    # 5   Frank 2005 25  5
    # 6    Tony 2001  5  6
    # 7    Tony 2002  6  7
    # 9    Tony 2004  7  9
    # 10   Tony 2005  8 10
    # 11 Edward 2001 31 11
    # 12 Edward 2002 32 12
    # 14 Edward 2004 34 14
    # 15 Edward 2005 35 15

then you get data for the years 2001, 2002, 2004, 2005 for ALL people, since all of them have data for these years.

Now let's create a slightly different sample dataset

    unbal2 <- unbal
    unbal2[15, 2] <- 2006
    tail(unbal2)
    #    PERSON YEAR  Y  X
    # 10   Tony 2005  8 10
    # 11 Edward 2001 31 11
    # 12 Edward 2002 32 12
    # 13 Edward 2003 33 13
    # 14 Edward 2004 34 14
    # 15 Edward 2006 35 15

Note that Edward is now the only person with a value in 2006 (and he no longer has one for 2005). This means that

    balanced(unbal2, "PERSON", "YEAR")
    # [1] PERSON YEAR   Y      X
    # <0 rows> (or 0-length row.names)

now returns nothing but

    balanced(unbal2, "PERSON", "YEAR", required = "shared")
    #    PERSON YEAR  Y  X
    # 1   Frank 2001 21  1
    # 2   Frank 2002 22  2
    # 4   Frank 2004 24  4
    # 6    Tony 2001  5  6
    # 7    Tony 2002  6  7
    # 9    Tony 2004  7  9
    # 11 Edward 2001 31 11
    # 12 Edward 2002 32 12
    # 14 Edward 2004 34 14

will return data for 2001, 2002 and 2004, since all persons have data for these years.
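Whichever variant you use, it can be handy to confirm that the result really is balanced. The helper below is a hypothetical check written for this purpose (the name `is_balanced` and its arguments are not part of the answer's function): it returns TRUE only when every ID/TIME cell holds exactly one complete row.

```r
# Hypothetical helper: TRUE if every id/time cell has exactly one row
# that is complete in the columns listed in `vars`
is_balanced <- function(data, id, time,
                        vars = setdiff(names(data), c(id, time))) {
  ok <- complete.cases(data[, vars, drop = FALSE])
  # Build the factors before subsetting so that emptied cells count as 0
  ids   <- factor(data[[id]])
  times <- factor(data[[time]])
  tbl <- table(ids[ok], times[ok])
  all(tbl == 1)
}

# Sample data from the question (YEAR written as 2001:2005 for brevity)
unbal <- data.frame(
  PERSON = rep(c("Frank", "Tony", "Edward"), each = 5),
  YEAR   = rep(2001:2005, times = 3),
  Y      = c(21:25, 5, 6, NA, 7, 8, 31:35),
  X      = 1:15
)

is_balanced(unbal, "PERSON", "YEAR")                             # FALSE
is_balanced(unbal[unbal$PERSON != "Tony", ], "PERSON", "YEAR")   # TRUE
is_balanced(unbal, "PERSON", "YEAR", vars = "X")                 # TRUE
```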


The solution I used was to temporarily reshape the data frame into wide format, with years as columns and units as rows, and then check each row for completeness. This is easiest if you have a single variable of interest whose absence means that the entire observation is missing.

I use the following libraries:

    library(data.table)
    library(reshape2)

First, take a subset of your main data frame (unbal) containing just an ID variable ("NAME"; in the question's sample data this column is called "PERSON"), a time variable ("YEAR"), and a variable of interest ("X" or "Y").

    df <- unbal[c("NAME", "YEAR", "X")]

Second, reshape the new data frame into wide format. This creates a data frame in which each "NAME" is one row and "X" has one column per year.

 df <- dcast(df, NAME ~ YEAR, value.var = "X") 

Third, run complete.cases for each row. Any NAME with missing data will be completely deleted.

 df <- df[complete.cases(df),] 

Fourth, reshape the data frame back to long format (by default this gives your variables generic names, so you may want to change the names back to what they were before).

    df <- melt(df, id.vars = "NAME")   # the ID column kept by dcast above
    setnames(df, "variable", "YEAR")

NOTE: With this approach YEAR becomes a factor by default. If the original YEAR variable is numeric, you need to convert it back accordingly. For instance:

    df$YEAR <- as.character(df$YEAR)
    df$YEAR <- as.numeric(df$YEAR)

Fifth and sixth, keep only the "NAME" and "YEAR" variables of the data frame you created, and then merge it with the original data frame (making sure that cases in the original data frame that are not found in the data frame you created are dropped).

    df <- df[c("NAME", "YEAR")]
    balanced <- merge.data.frame(df, unbal, by = c("NAME", "YEAR"), all.x = TRUE)
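The same wide-format idea can also be sketched in base R without reshape2, which avoids the round trip through melt and merge. This is a swapped-in variant, not the code from this answer: it uses tapply to build the person-by-year matrix, it uses the question's actual column names (PERSON, Y), and the names `wide`, `keep` and `balanced_df` are made up for illustration.

```r
# Sample data from the question (YEAR written as 2001:2005 for brevity)
unbal <- data.frame(
  PERSON = rep(c("Frank", "Tony", "Edward"), each = 5),
  YEAR   = rep(2001:2005, times = 3),
  Y      = c(21:25, 5, 6, NA, 7, 8, 31:35),
  X      = 1:15
)

# Wide matrix: one row per PERSON, one column per YEAR, cells hold Y.
# Combinations with no row at all also come out as NA.
wide <- tapply(unbal$Y, list(unbal$PERSON, unbal$YEAR), function(v) v[1])

# Persons whose wide row has no NA in any year
keep <- rownames(wide)[complete.cases(wide)]

# Subset the original long data, so no reshaping back is needed
balanced_df <- unbal[unbal$PERSON %in% keep, ]
```

Because the original rows are subset directly, YEAR keeps its original type and the other columns (here X) come along without a merge.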

This is the solution I use. It relies on convenient functions (including good merge capabilities) of the data.table package and assumes that your data is already a data.table object. It is relatively simple and hopefully easy to follow. It returns a balanced panel with an entry for each unique combination of "individual" and "time period", i.e. a panel in which there is an observation of each person for each period of time. Combinations that are absent from the original data appear as rows filled with NA.

    library(data.table)

    Balance_Panel = function(Data, Indiv_ColName, Time_ColName){
      Individuals = unique(Data[, get(Indiv_ColName)])
      Times = unique(Data[, get(Time_ColName)])

      # Every individual/time combination
      Full_Panel = data.table(expand.grid(Individuals, Times))
      setnames(Full_Panel, c(Indiv_ColName, Time_ColName))
      setkeyv(Full_Panel, c(Indiv_ColName, Time_ColName))

      # Note: setkeyv sorts and keys Data by reference
      setkeyv(Data, c(Indiv_ColName, Time_ColName))

      # Keyed join: one row per combination, NA where Data has no match
      return(Data[Full_Panel])
    }

Usage example:

 Balanced_Data = Balance_Panel(Data, "SubjectID", "ObservationTime") 
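For comparison, a shorter sketch of the same idea uses data.table's CJ() to build the cross join and an `on=` join instead of setting keys, which leaves the input table's order untouched. This is a variant under assumptions, not the function above: it is applied to a hypothetical copy of the question's data with Tony's 2003 row removed, and the names `dt`, `grid` and `full` are made up for illustration.

```r
library(data.table)

# Sample data from the question, with the Tony/2003 row dropped
# (hypothetical, to create a genuinely missing combination)
dt <- data.table(
  PERSON = rep(c("Frank", "Tony", "Edward"), each = 5),
  YEAR   = rep(2001:2005, times = 3),
  Y      = c(21:25, 5, 6, NA, 7, 8, 31:35),
  X      = 1:15
)[!(PERSON == "Tony" & YEAR == 2003)]

# All PERSON x YEAR combinations
grid <- CJ(PERSON = unique(dt$PERSON), YEAR = unique(dt$YEAR))

# Join dt onto the grid: one row per combination, NA where dt has none
full <- dt[grid, on = c("PERSON", "YEAR")]
```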

Source: https://habr.com/ru/post/974841/

