Stratified sampling means that the class distribution is preserved in each fold. If that is what you are after, you can use StratifiedKFold or StratifiedShuffleSplit with any categorical variable whose distribution you want to keep the same in each fold: simply pass that variable in place of the target. For example, if the categorical variable is in column i:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
folds = skf.split(X, X[:, i])  # stratify on column i instead of the target
However, if I understand you correctly, you want to resample to a specific target distribution (for example, 50/50) of one of the categorical features. I think you will have to roll your own method to get such a sample (split the data set by the values of the variable, then draw the same number of random samples from each group). If your main motivation is to balance the training set for a classifier, you can instead pass sample weights to fit(). Set the weights so that they balance the training set with respect to the desired variable:
from sklearn import svm
from sklearn.utils.class_weight import compute_sample_weight

sample_weights = compute_sample_weight('balanced', X[:, i])
clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weights)
For a non-uniform target distribution, adjust the sample weights accordingly (for example, pass a dict mapping each value to its desired weight instead of 'balanced').
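The manual re-sampling approach suggested above (split the data set by the variable's values, then draw the same number of random rows from each group) could be sketched like this; the helper name `balanced_sample` and the `cat_col` parameter are my own, not part of scikit-learn:

```python
import numpy as np

def balanced_sample(X, y, cat_col, rng=None):
    """Draw an equal number of random rows for each value of X[:, cat_col]."""
    rng = np.random.default_rng(rng)
    values = np.unique(X[:, cat_col])
    # the smallest group bounds how many rows we can take from each group
    n_per_group = min(int((X[:, cat_col] == v).sum()) for v in values)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(X[:, cat_col] == v), n_per_group, replace=False)
        for v in values
    ])
    rng.shuffle(idx)  # avoid returning the groups in sorted blocks
    return X[idx], y[idx]
```

Every value of the categorical column then appears equally often in the returned sample, at the cost of discarding rows from the larger groups.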