I started to create training and test sets using 10-fold cross-validation for an artificial dataset:
    rows <- 1000
    X1 <- sort(runif(n = rows, min = -1, max = 1))
    occ.prob <- 1/(1 + exp(-(0.0 + 3.0*X1)))
    true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)

    # combine data as data frame and save
    data <- data.frame(X1, true.presence)

    # assign each row to one of 10 folds at random
    id <- sample(1:10, nrow(data), replace = TRUE)
    ListX <- split(data, id)

    fold1 <- data[id == 1, ]
    fold2 <- data[id == 2, ]
    fold3 <- data[id == 3, ]
    fold4 <- data[id == 4, ]
    fold5 <- data[id == 5, ]
    fold6 <- data[id == 6, ]
    fold7 <- data[id == 7, ]
    fold8 <- data[id == 8, ]
    fold9 <- data[id == 9, ]
    fold10 <- data[id == 10, ]

    # fold 1 as test set, folds 2-10 as training set
    trainingset <- subset(data, id %in% c(2,3,4,5,6,7,8,9,10))
    testset <- subset(data, id %in% c(1))
I'm just wondering whether there are simpler ways to achieve this, and how I can do stratified cross-validation so that the class proportions of true.presence are roughly the same in every fold.
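To clarify what I mean by "stratified", something along these lines is what I have in mind: draw the fold ids separately within each class of true.presence, so every fold keeps roughly the same class proportions. This is just a rough sketch, not necessarily the idiomatic way, and k and fold are names I made up for it:

    # sample fold ids within each class so class proportions stay balanced
    k <- 10
    data$fold <- NA
    for (cls in unique(data$true.presence)) {
      idx <- which(data$true.presence == cls)
      # cycle the labels 1..k over a random permutation of this class's rows
      data$fold[idx] <- sample(rep(1:k, length.out = length(idx)))
    }

    # e.g. fold 1 as test set, the remaining folds as training set
    testset <- data[data$fold == 1, ]
    trainingset <- data[data$fold != 1, ]

    # check that each fold has roughly the same class balance
    table(data$fold, data$true.presence)

Is there a built-in or package function that does this kind of assignment more directly?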