How can I ensure that a section has representative observations from each factor level?

Question


I wrote a small function to split my dataset into training and test sets. However, I ran into trouble with factor variables. At the model-validation stage of my code, I get an error if the model was built on a training set that does not contain every level of a factor. How can I fix this partition() function so that the training set includes at least one observation from each level of the factor variable?

    test.df <- data.frame(
      a = sample(c(0, 1), 100, replace = TRUE),
      b = factor(sample(letters, 100, replace = TRUE)),
      c = factor(sample(c("apple", "orange"), 100, replace = TRUE))
    )

    set.seed(123)

    partition <- function(data, train.size = .7) {
      train <- data[sample(1:nrow(data), round(train.size * nrow(data)), replace = FALSE), ]
      test <- data[-as.numeric(row.names(train)), ]
      partitioned.data <- list(train = train, test = test)
      return(partitioned.data)
    }

    part.data <- partition(test.df)
    table(part.data$train[, 'b'])
    table(part.data$test[, 'b'])

EDIT: here is a new function using the caret package and createDataPartition():

    partition <- function(data, factor = NULL, train.size = .7) {
      if (("package:caret" %in% search()) == FALSE) {
        stop("Install and Load 'caret' package")
      }
      if (is.null(factor)) {
        train.index <- createDataPartition(as.numeric(row.names(data)),
                                           times = 1, p = train.size, list = FALSE)
        train <- data[train.index, ]
        test <- data[-train.index, ]
      } else {
        train.index <- createDataPartition(factor,
                                           times = 1, p = train.size, list = FALSE)
        train <- data[train.index, ]
        test <- data[-train.index, ]
      }
      partitioned.data <- list(train = train, test = test)
      return(partitioned.data)
    }


r statistics partitioning categorical-data factors

zap2008 May 11 '13 at 5:01


1 answer

Tommy levi · Accepted Answer · 2013-05-11T06:53:16+0000

Try the caret package, in particular its createDataPartition() function. It should do exactly what you need. The package is available on CRAN; here is the relevant page:

caret - data splitting

The function below is based partly on code I found online some time ago, which I then modified slightly to better handle edge cases (for example, when you ask for a sample size larger than the set or a subset).

    stratified <- function(df, group, size) {
      #  USE: * Specify your data frame and grouping variable (as column
      #         number) as the first two arguments.
      #       * Decide on your sample size. For a sample proportional to the
      #         population, enter "size" as a decimal. For an equal number
      #         of samples from each group, enter "size" as a whole number.
      #
      #  Example 1: Sample 10% of each group from a data frame named "z",
      #             where the grouping variable is the fourth variable, use:
      #
      #                 > stratified(z, 4, .1)
      #
      #  Example 2: Sample 5 observations from each group from a data frame
      #             named "z"; grouping variable is the third variable:
      #
      #                 > stratified(z, 3, 5)
      #
      require(sampling)
      temp = df[order(df[group]),]
      colsToReturn <- ncol(df)

      # Don't want to attempt to sample more than possible
      dfCounts <- table(df[group])
      if (size > min(dfCounts)) {
        size <- min(dfCounts)
      }

      if (size < 1) {
        size = ceiling(table(temp[group]) * size)
      } else if (size >= 1) {
        size = rep(size, times=length(table(temp[group])))
      }
      strat = strata(temp, stratanames = names(temp[group]),
                     size = size, method = "srswor")
      (dsample = getdata(temp, strat))

      dsample <- dsample[order(dsample[1]),]
      dsample <- data.frame(dsample[,1:colsToReturn], row.names=NULL)
      return(dsample)
    }
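If neither caret nor sampling is installed, the underlying idea (sample within each factor level so that every level reaches the training set) can be sketched in base R. This is only a minimal sketch, not how createDataPartition() or strata() are implemented. The name stratified_partition and its arguments are hypothetical, and it assumes the grouping column has no missing values (rows with NA in that column would all end up in the test set).

```r
# Hypothetical helper (not part of caret or sampling): split `data` so that
# every level of the column `group` that occurs in the data contributes
# at least one row to the training set.
stratified_partition <- function(data, group, train.size = 0.7) {
  # Row indices grouped by factor level; drop = TRUE skips unused levels
  idx.by.level <- split(seq_len(nrow(data)), data[[group]], drop = TRUE)

  train.index <- unlist(lapply(idx.by.level, function(idx) {
    # Take at least one row per level, and never more than the level has
    n.train <- min(length(idx), max(1, round(train.size * length(idx))))
    # Index into idx instead of calling sample(idx, ...), because
    # sample() treats a single number n as the range 1:n
    idx[sample.int(length(idx), n.train)]
  }), use.names = FALSE)

  list(train = data[train.index, , drop = FALSE],
       test  = data[-train.index, , drop = FALSE])
}

set.seed(123)
test.df <- data.frame(
  a = sample(c(0, 1), 100, replace = TRUE),
  b = factor(sample(letters, 100, replace = TRUE)),
  c = factor(sample(c("apple", "orange"), 100, replace = TRUE))
)

part.data <- stratified_partition(test.df, "b")

# Compare level counts in the two sets
table(part.data$train[, "b"])
table(part.data$test[, "b"])
```

Note that a level with a single observation ends up only in the training set, so the test set may still lack some levels. That is harmless for prediction, since the model has seen every level, but it does mean the split is only approximately 70/30 when levels are small.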
