Sklearn train_test_split on pandas stratify across multiple columns

I am a relatively new user of sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas DataFrame that I would like to split into a training set and a test set. I would like to stratify my data by at least 2, but ideally 4, columns in my DataFrame.

sklearn gave no warnings when I tried this, but I later discovered that rows were repeated in my final dataset. I created a small test case to show this behavior:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

a = np.arange(1000000)
b = a % 10
c = a % 5
df = pd.DataFrame({'a': a, 'b': b, 'c': c})

It seems to work as expected if I stratify by either column on its own:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

But when I try to stratify by both columns, I get duplicate values:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 640000

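Which version of scikit-learn are you using? You can check with sklearn.__version__.

Before version 0.19.0, scikit-learn did not handle two-dimensional stratification correctly. This was fixed in 0.19.0.

It is described in issue #9044.

Updating scikit-learn should fix the problem. If you cannot update scikit-learn, see the commit history for the fix.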
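
A quick way to run that check is sketched below. The version-string parsing and the name `has_fix` are my own illustration, not something scikit-learn provides, and the sketch assumes the version string starts with numeric major and minor parts.

```python
import sklearn

# Take the first two dot-separated parts, e.g. "0.19.0" -> (0, 19).
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])

# Hypothetical flag: True when the installed version is 0.19 or newer.
has_fix = (major, minor) >= (0, 19)

print(sklearn.__version__, has_fix)
```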


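The reason you get duplicates is that train_test_split() ultimately defines the strata as the set of unique values across whatever you pass to stratify. Because the strata come from the union of the two columns, a single row can belong to more than one stratum, so the sampler may pick the same row twice, treating it as a draw from two different classes.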

train_test_split() calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code:

classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]
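
To see what that call does with a two-column input, here is a small sketch on a made-up array (the three rows are hypothetical). When no axis is given, np.unique flattens its input, so the "classes" it returns are the individual values from both columns rather than row-wise combinations.

```python
import numpy as np

# Hypothetical two-column stratify input: one row per sample.
y = np.array([["foo", "y"],
              ["bar", "z"],
              ["foo", "z"]])

# Same call as in the excerpt above; with no axis argument the array is flattened.
classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]

print(classes)         # ['bar' 'foo' 'y' 'z']
print(n_classes)       # 4
print(y_indices.size)  # 6: one class index per cell, not one per row
```

Three rows yield six class indices, which is why a single row can end up being sampled more than once.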

Here is a simplified example, a variation on the one you provided:

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

print(df)
     a    b  c
0    0  bar  y
1    1  foo  y
2    2  bar  z
3    3  bar  y
4    4  foo  z
5    5  bar  y
...

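The stratification function thinks there are four classes to split on: foo, bar, y and z. But because these classes are effectively nested, meaning that y and z both appear under b == foo and under b == bar, we get duplicates when the splitter tries to sample from each class.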

train, test = train_test_split(df, test_size=0.2, random_state=0, 
                               stratify=df[['b', 'c']])
print(len(train.a.values))  # 16
print(len(set(train.a.values)))  # 12

print(train)
     a    b  c
3    3  bar  y   # selecting a = 3 for b = bar*
5    5  bar  y
13  13  foo  y
4    4  foo  z
14  14  bar  z
10  10  foo  z
3    3  bar  y   # selecting a = 3 for c = y
6    6  bar  y
16  16  foo  y
18  18  bar  z
6    6  bar  y
8    8  foo  y
18  18  bar  z
7    7  bar  z
4    4  foo  z
19  19  bar  y

#* We can't be sure which row is selecting for `bar` or `y`, 
#  I'm just illustrating the idea here.

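There is a larger design question here: do you want nested stratified sampling, or do you just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that is what you are already getting. The former is more complicated, and it is not what train_test_split is set up to do.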
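
If the nested version is what you are after, one common workaround is to collapse the columns into a single label per combination and stratify on that one-dimensional key. This is only a minimal sketch: the toy data and the name `strata` are mine, and it assumes every (b, c) combination has at least two rows, since train_test_split cannot stratify on a class with a single member.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 values of b crossed with 3 values of c.
a = np.arange(1000)
df = pd.DataFrame({"a": a, "b": a % 10, "c": a % 3})

# One string label per (b, c) combination, e.g. "7_1".
strata = df["b"].astype(str) + "_" + df["c"].astype(str)

train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=strata
)

print(len(train), train["a"].is_unique)  # 800 True
```

Because the stratify argument is now one-dimensional, each row belongs to exactly one stratum and cannot be drawn twice.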



Source: https://habr.com/ru/post/1683050/