I wrote a small function to split my dataset into training and test sets. However, I ran into trouble with factor variables. In the model validation phase of my code, I get an error if the model was built on a training set that does not contain every level of the factor. How can I fix this partition() function so that the training set includes at least one observation from each level of the factor variable?
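# Set the seed first so the example data and the split are both reproducible
set.seed(123)
test.df <- data.frame(
  a = sample(c(0, 1), 100, replace = TRUE),
  b = factor(sample(letters, 100, replace = TRUE)),
  c = factor(sample(c("apple", "orange"), 100, replace = TRUE))
)

partition <- function(data, train.size = .7) {
  # Draw the training rows at random, without replacement
  train.index <- sample(seq_len(nrow(data)), round(train.size * nrow(data)), replace = FALSE)
  train <- data[train.index, ]
  # Index by position rather than row names, so non-default row names also work
  test <- data[-train.index, ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}

part.data <- partition(test.df)
table(part.data$train[, 'b'])
table(part.data$test[, 'b'])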
EDIT: Here is a new function using the caret package and createDataPartition():
partition <- function(data, factor = NULL, train.size = .7) {
  # createDataPartition() comes from caret, so the package must be attached
  if (!("package:caret" %in% search())) {
    stop("Install and Load 'caret' package")
  }
  if (is.null(factor)) {
    # No factor supplied: partition on the numeric row names
    train.index <- createDataPartition(as.numeric(row.names(data)), times = 1, p = train.size, list = FALSE)
  } else {
    # Factor supplied: partition on the factor itself
    train.index <- createDataPartition(factor, times = 1, p = train.size, list = FALSE)
  }
  train <- data[train.index, ]
  test <- data[-train.index, ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}
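For comparison, here is a minimal base R sketch of one way to get "at least one observation per level" without caret. It first picks one random row for each level of the factor, then fills the rest of the training set at random. The name partition_by_level and its factor.col argument are hypothetical and only for illustration. The sketch assumes the column is a factor and handles a single factor at a time.

```r
# Hypothetical helper (not from the original post): split `data` so that
# every level of the factor column `factor.col` appears in the training set.
partition_by_level <- function(data, factor.col, train.size = 0.7) {
  n <- nrow(data)
  n.train <- round(train.size * n)

  # Pick one random row index for each level that actually occurs in the data.
  rows.by.level <- split(seq_len(n), droplevels(data[[factor.col]]))
  guaranteed <- vapply(rows.by.level,
                       function(idx) idx[sample.int(length(idx), 1)],
                       integer(1))

  # Fill the rest of the training set at random from the remaining rows.
  remaining <- setdiff(seq_len(n), guaranteed)
  n.extra <- min(max(n.train - length(guaranteed), 0), length(remaining))
  extra <- remaining[sample.int(length(remaining), n.extra)]

  train.index <- c(guaranteed, extra)
  list(train = data[train.index, , drop = FALSE],
       test  = data[-train.index, , drop = FALSE])
}

# Usage with the same kind of example data as in the question.
set.seed(123)
test.df <- data.frame(
  a = sample(c(0, 1), 100, replace = TRUE),
  b = factor(sample(letters, 100, replace = TRUE)),
  c = factor(sample(c("apple", "orange"), 100, replace = TRUE))
)
part.data <- partition_by_level(test.df, "b")
table(part.data$train$b)
```

If the factor has more levels than round(train.size * nrow(data)), the training set ends up larger than requested, because one row per level is always kept. The test set may still be missing some levels, which should not matter for prediction as long as the model has seen every level during training.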