So, I'm not sure if this meets your "elegant" requirement, but here is a general-purpose function that you can use to get balanced data.
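balanced <- function(data, ID, TIME, VARS, required = c("all", "shared")) {
  # Convert column names to column positions
  if (is.character(ID)) {
    ID <- match(ID, names(data))
  }
  if (is.character(TIME)) {
    TIME <- match(TIME, names(data))
  }
  if (missing(VARS)) {
    # Default: check every column other than the ID and TIME columns
    VARS <- setdiff(seq_len(ncol(data)), c(ID, TIME))
  } else if (is.character(VARS)) {
    VARS <- match(VARS, names(data))
  }
  required <- match.arg(required)

  # One factor level per distinct combination of ID (or TIME) columns
  idf <- do.call(interaction, c(data[, ID, drop = FALSE], drop = TRUE))
  timef <- do.call(interaction, c(data[, TIME, drop = FALSE], drop = TRUE))

  # Count complete (non-NA) observations in each ID x TIME cell
  complete <- complete.cases(data[, VARS])
  tbl <- table(idf[complete], timef[complete])

  if (required == "all") {
    # Keep IDs with exactly one complete observation at every TIME
    keep <- which(rowSums(tbl == 1) == ncol(tbl))
    idx <- as.numeric(idf) %in% keep
  } else if (required == "shared") {
    # Keep TIMEs at which every ID has exactly one complete observation
    keep <- which(colSums(tbl == 1) == nrow(tbl))
    idx <- as.numeric(timef) %in% keep
  }
  data[idx, ]
}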
You can get the desired result with
balanced(unbal, "PERSON","YEAR")
The first parameter is the data.frame that you want to subset. The second parameter (ID=) is a character vector of column names that identify each "person" in the data set. The TIME= parameter is also a character vector of column names, specifying the different observation times for each ID. You can optionally specify the VARS= argument to indicate which fields must not be NA (by default, all columns other than the ID and TIME columns). Finally, there is one last parameter named required, which states whether each ID must have an observation for every TIME ("all", the default) or, if you set it to "shared", returns only the TIMEs for which all IDs have non-missing values.
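To make the difference between "all" and "shared" concrete, here is a minimal sketch of the counting step on a made-up data set. The `toy` data frame and the `keep_ids` / `keep_times` names are only for illustration and are not part of the function above.

```r
# Hypothetical toy panel: three people, three years, one missing X value.
toy <- data.frame(
  PERSON = rep(c("A", "B", "C"), each = 3),
  YEAR   = rep(2001:2003, times = 3),
  X      = c(1, 2, 3, 4, NA, 6, 7, 8, 9)
)

id   <- factor(toy$PERSON)
time <- factor(toy$YEAR)

# Rows with no missing value in the checked column.
complete <- complete.cases(toy[, "X", drop = FALSE])

# Complete observations per PERSON (rows) and YEAR (columns).
tbl <- table(id[complete], time[complete])

# "all": people observed exactly once in every year.
keep_ids <- rownames(tbl)[rowSums(tbl == 1) == ncol(tbl)]

# "shared": years in which every person is observed exactly once.
keep_times <- colnames(tbl)[colSums(tbl == 1) == nrow(tbl)]

all_rows    <- toy[id %in% keep_ids, ]
shared_rows <- toy[time %in% keep_times, ]
```

Here person B is missing X in 2002, so "all" drops B entirely, while "shared" keeps everyone but drops 2002.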
So for example
balanced(unbal, "PERSON","YEAR", "X")
only requires that "X" be non-NA for each PERSON/YEAR, and since this is true for all records, no subsetting takes place.
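As a rough illustration of what restricting VARS changes, the completeness check can be run by hand on a small made-up data frame (the `toy` data and variable names here are assumptions for the example, not your data):

```r
# Hypothetical toy data: Y has one missing value, X has none.
toy <- data.frame(
  PERSON = c("A", "A", "B", "B"),
  YEAR   = c(2001, 2002, 2001, 2002),
  Y      = c(10, NA, 30, 40),
  X      = c(1, 2, 3, 4)
)

# Default behaviour: every non-ID, non-TIME column must be non-NA.
complete_all <- complete.cases(toy[, c("Y", "X")])

# Restricting the check to X ignores the missing Y value.
complete_x <- complete.cases(toy[, "X", drop = FALSE])
```

With the default, the second row counts as incomplete because of the NA in Y; checking only X treats every row as complete.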
If you do
balanced(unbal, "PERSON","YEAR", required="shared")
then you get data for the years 2001, 2002, 2004, 2005 for ALL people, since all of them have data for these years.
Now let's create a slightly different sample data set
unbal2 <- unbal
unbal2[15, 2] <- 2006
tail(unbal2)
Note that Edward is the only person who has a value for 2006. This means that
balanced(unbal2, "PERSON","YEAR")
# [1] PERSON YEAR Y X
# <0 rows> (or 0-length row.names)
now returns nothing but
balanced(unbal2, "PERSON","YEAR", required="shared")
will return the data for 2001, 2002, and 2004, since all persons have data for those years (2005 drops out because Edward no longer has an observation for it).
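If you want to double-check a result, one possible sanity check is to count the rows per ID/TIME cell and confirm that every cell holds exactly one. This is only a sketch: `is_balanced` is a hypothetical helper, and it assumes a single ID column and a single TIME column.

```r
# Hypothetical helper: TRUE when every ID x TIME cell has exactly one row.
is_balanced <- function(data, id, time) {
  tbl <- table(data[[id]], data[[time]])
  all(tbl == 1)
}

# Made-up example: a full 3 x 3 panel, then the same panel minus one row.
toy <- data.frame(
  PERSON = rep(c("A", "B", "C"), each = 3),
  YEAR   = rep(2001:2003, times = 3)
)

full_ok    <- is_balanced(toy, "PERSON", "YEAR")
missing_ok <- is_balanced(toy[-5, ], "PERSON", "YEAR")
```

Note that this only checks that the rows exist; it does not look at NA values in the other columns.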