Randomized stratified k-fold cross validation in scikit-learn?

Is there any built-in way to get scikit-learn to do shuffled stratified k-fold cross validation? This is one of the most common CV methods, and I am surprised that I could not find a built-in method for this.

I saw that cross_validation.KFold() has a shuffle flag, but it is not stratified. Unfortunately, cross_validation.StratifiedKFold() does not have this option, and cross_validation.StratifiedShuffleSplit() does not create disjoint folds.

Am I missing something? Is this planned?

(obviously, I can implement this myself)
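For reference, a self-implementation is short. This is a minimal sketch using only the standard library; `shuffled_stratified_kfold` is a hypothetical helper, not part of scikit-learn:

```python
import random
from collections import defaultdict

def shuffled_stratified_kfold(y, n_folds, seed=None):
    """Yield (train_indices, test_indices) pairs whose test sets are
    disjoint and roughly preserve the class proportions of y."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's shuffled indices round-robin across the folds,
        # so every fold gets (almost) the same number of each class.
        for j, idx in enumerate(indices):
            folds[j % n_folds].append(idx)
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test

y = [0] * 10 + [1] * 20
for train, test in shuffled_stratified_kfold(y, n_folds=5, seed=0):
    print(len(train), len(test))  # 24 6 for every fold
```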

+6
4 answers

A shuffle flag for cross_validation.StratifiedKFold was introduced in version 0.15:

http://scikit-learn.org/0.15/modules/generated/sklearn.cross_validation.StratifiedKFold.html

This can be found in the change list:

http://scikit-learn.org/stable/whats_new.html#new-features

Shuffle option for cross_validation.StratifiedKFold. Jeffrey Blackburne.
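For illustration, here is how the flag is used with a current scikit-learn, where the class has moved to sklearn.model_selection and takes the labels in split() rather than the constructor (in 0.15 the call was cross_validation.StratifiedKFold(y, n_folds, shuffle=True)):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((30, 2))              # features are irrelevant for the split
y = np.array([0] * 10 + [1] * 20)  # 1:2 class ratio

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold keeps the class ratio: 2 zeros and 4 ones,
    # and the folds are disjoint, unlike StratifiedShuffleSplit.
    print(np.bincount(y[test_idx]))  # [2 4]
```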

+5

I thought I would post my solution in case it is useful to anyone else.

    from collections import defaultdict
    import random

    import numpy as np
    from sklearn.cross_validation import StratifiedKFold  # pre-0.18 API used in this answer

    def strat_map(y):
        """Returns permuted indices that maintain class"""
        smap = defaultdict(list)
        for i, v in enumerate(y):
            smap[v].append(i)
        for values in smap.values():
            random.shuffle(values)
        y_map = np.zeros_like(y)
        for i, v in enumerate(y):
            y_map[i] = smap[v].pop()
        return y_map

    ##########
    # Example use
    ##########
    skf = StratifiedKFold(y, nfolds)
    sm = strat_map(y)
    for test, train in skf:
        test, train = sm[test], sm[train]
        # then cv as usual

    #######
    # tests
    #######
    import numpy.random as rnd
    for _ in range(100):
        y = np.array([0] * 10 + [1] * 20 + [3] * 10)
        rnd.shuffle(y)
        sm = strat_map(y)
        shuffled = y[sm]
        assert (sm != range(len(y))).any(), "did not shuffle"
        assert (shuffled == y).all(), "classes not in right position"
        assert (set(sm) == set(range(len(y)))), "missing indices"

    for _ in range(100):
        nfolds = 10
        skf = StratifiedKFold(y, nfolds)
        sm = strat_map(y)
        for test, train in skf:
            assert (sm[test] != test).any(), "did not shuffle"
            assert (y[sm[test]] == y[test]).all(), "classes not in right position"
+2

Here is my implementation of a stratified shuffle split into training and testing sets:

    import numpy as np

    def get_train_test_inds(y, train_proportion=0.7):
        '''Generates indices, making a random stratified split into a training
        set and a testing set with proportions train_proportion and
        (1 - train_proportion) of the initial sample.
        y is any iterable indicating the class of each observation in the
        sample. Initial proportions of classes inside the training and testing
        sets are preserved (stratified sampling).
        '''
        y = np.array(y)
        train_inds = np.zeros(len(y), dtype=bool)
        test_inds = np.zeros(len(y), dtype=bool)
        values = np.unique(y)
        for value in values:
            value_inds = np.nonzero(y == value)[0]
            np.random.shuffle(value_inds)
            n = int(train_proportion * len(value_inds))
            train_inds[value_inds[:n]] = True
            test_inds[value_inds[n:]] = True
        return train_inds, test_inds

    y = np.array([1, 1, 2, 2, 3, 3])
    train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
    print(y[train_inds])
    print(y[test_inds])

This code outputs:

    [1 2 3]
    [1 2 3]
+1

As far as I know, this is already implemented in scikit-learn, as StratifiedShuffleSplit. From its docstring:

"Stratified ShuffleSplit cross-validation iterator.

Provides train/test indices to split data in train/test sets.

This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets."
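(As the question notes, this is exactly why StratifiedShuffleSplit does not answer it: each test set is drawn independently, so test sets from different iterations can overlap and the "folds" are not disjoint. A minimal stdlib illustration of that sampling scheme, with sizes chosen only for the example:)

```python
import random

rng = random.Random(0)
population = list(range(30))
# Five independent random "test sets" of size 6, as ShuffleSplit draws them.
test_sets = [set(rng.sample(population, 6)) for _ in range(5)]

overlaps = any(a & b for i, a in enumerate(test_sets)
               for b in test_sets[i + 1:])
print(overlaps)  # almost always True: the "folds" are not disjoint
```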

-3

Source: https://habr.com/ru/post/944567/
