Stratified 10-fold cross-validation

Question

I started creating training and test sets for 10-fold cross-validation on an artificial dataset:

    rows <- 1000
    X1 <- sort(runif(n = rows, min = -1, max = 1))
    occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
    true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)

    # combine data as data frame and save
    data <- data.frame(X1, true.presence)

    id <- sample(1:10, nrow(data), replace = TRUE)
    ListX <- split(data, id)

    fold1 <- data[id==1,]
    fold2 <- data[id==2,]
    fold3 <- data[id==3,]
    fold4 <- data[id==4,]
    fold5 <- data[id==5,]
    fold6 <- data[id==6,]
    fold7 <- data[id==7,]
    fold8 <- data[id==8,]
    fold9 <- data[id==9,]
    fold10 <- data[id==10,]

    trainingset <- subset(data, id %in% c(2,3,4,5,6,7,8,9,10))
    testset <- subset(data, id %in% c(1))

I'm wondering whether there is a simpler way to achieve this, and how I can do a stratified cross-validation that ensures the class priors (the proportion of true.presence) are about the same in all folds.

+6

r

cs0815 May 01, '12 at 15:15

3 answers

The createFolds function in the caret package performs stratified partitioning. Here is the relevant paragraph from the help page:

... the random sampling is done within the levels of y (= outcomes) when y is a factor in an attempt to balance the class distributions within the splits.

Here is the answer to your problem:

    library(caret)
    folds <- createFolds(factor(data$true.presence), k = 10, list = FALSE)

and the proportions per fold:

    > library(plyr)
    > data$fold <- folds
    > ddply(data, 'fold', summarise, prop=mean(true.presence))
       fold      prop
    1     1 0.5000000
    2     2 0.5050505
    3     3 0.5000000
    4     4 0.5000000
    5     5 0.5000000
    6     6 0.5049505
    7     7 0.5000000
    8     8 0.5049505
    9     9 0.5000000
    10   10 0.5050505

+15

gkcn May 15, '14 at 9:36

@joran is correct regarding his assumption (b): dismo::kfold() is what you are looking for.

So, using data from the original question:

    require(dismo)
    folds <- kfold(data, k=10, by=data$true.presence)

This gives a vector of length nrow(data) containing the fold membership of each row of data. Hence data[folds==1,] returns the first fold and data[folds!=1,] returns the remaining nine, so one part can be used for training and the other for validation.
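Once a fold vector like this exists, the whole cross-validation can be run in a loop instead of creating fold1 to fold10 by hand. The following is only a minimal sketch: the logistic regression via glm and the 0.5 cut-off are assumptions (the question does not name a model), and the fold vector is a base R stand-in so that the code runs without dismo installed.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated data as in the question
rows <- 1000
X1 <- sort(runif(n = rows, min = -1, max = 1))
occ.prob <- 1 / (1 + exp(-(0.0 + 3.0 * X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
data <- data.frame(X1, true.presence)

# Stand-in fold vector (assumption): labels 1..k are recycled and shuffled
# within each class. Any integer vector with one fold label per row,
# such as the one described above, can be used in its place.
k <- 10
folds <- ave(seq_len(nrow(data)), data$true.presence,
             FUN = function(i) sample(rep(1:k, length.out = length(i))))

# Hold out each fold once, fit on the other folds, score the held-out rows
accuracy <- sapply(1:k, function(f) {
  trainingset <- data[folds != f, ]
  testset     <- data[folds == f, ]
  fit  <- glm(true.presence ~ X1, family = binomial, data = trainingset)
  pred <- predict(fit, newdata = testset, type = "response") > 0.5
  mean(pred == (testset$true.presence == 1))
})

mean(accuracy)  # average held-out accuracy over the k folds
```

Each pass uses one fold as the test set and the other nine as the training set, which replaces the manual subset() calls from the question.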

+6

Janhoo Jan 31 '14 at 8:33

joran · Accepted Answer · 2012-05-01T15:47:53+0000

I am sure that (a) there is a more efficient way to code this, and (b) there is almost certainly a function in some package that will simply return the folds, but here is some simple code that gives you an idea of how to do this:

    rows <- 1000
    X1 <- sort(runif(n = rows, min = -1, max = 1))
    occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
    true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)

    # combine data as data frame and save
    dat <- data.frame(X1, true.presence)

    require(plyr)
    createFolds <- function(x,k){
        n <- nrow(x)
        x$folds <- rep(1:k,length.out = n)[sample(n,n)]
        x
    }

    folds <- ddply(dat,.(true.presence),createFolds,k = 10)

    #Proportion of true.presence in each fold:
    ddply(folds,.(folds),summarise,prop = sum(true.presence)/length(true.presence))

       folds      prop
    1      1 0.5049505
    2      2 0.5049505
    3      3 0.5100000
    4      4 0.5100000
    5      5 0.5100000
    6      6 0.5100000
    7      7 0.5100000
    8      8 0.5100000
    9      9 0.5050505
    10    10 0.5050505
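The same idea (recycle the labels 1..k within each class, then shuffle them) can also be written without plyr. This is only a sketch in base R: stratified_folds is a hypothetical helper name, not a function from any package, and the outcome vector y is a simple stand-in rather than the data from the question.

```r
# Hypothetical helper: assign fold labels 1..k separately within each class
stratified_folds <- function(y, k = 10) {
  folds <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Recycle 1..k over the rows of this class, then shuffle the labels
    folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  folds
}

set.seed(1)  # arbitrary seed, only for reproducibility
y <- rbinom(1000, size = 1, prob = 0.5)  # stand-in binary outcome
folds <- stratified_folds(y, k = 10)

table(folds)            # fold sizes
tapply(y, folds, mean)  # proportion of 1s in each fold
```

Because every class is spread over the folds as evenly as possible, the per-class count in any two folds differs by at most one, which is why the proportions stay nearly identical across folds.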
