Scikit-learn train_test_split: can I get the same splits on different datasets?

I understand that the train_test_split method splits a dataset into random train and test subsets, and that passing an integer random_state guarantees the same splits on that dataset for each method call.

My problem is a little different.

I have two datasets, A and B. They contain the same set of examples, and the order in which these examples appear in each dataset is identical too. The main difference is that each dataset uses a different set of features.

I would like to check whether the features used in A work better than the features used in B. So I want to make sure that when I call train_test_split on A and B, I get the same splits on both, so that the comparison is meaningful.

Is that possible? Do I just need to make sure that random_state is the same in both method calls?
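To make it concrete, here is a minimal sketch of the setup I have in mind (A, B, and labels are hypothetical stand-ins for my real data):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical data: 100 examples, two different feature sets over the same rows.
    rng = np.random.RandomState(0)
    A = rng.random_sample((100, 8))    # feature set A
    B = rng.random_sample((100, 12))   # feature set B, same examples in the same order
    labels = rng.randint(0, 2, 100)    # shared targets

    # What I want: both calls should put the same examples into train and test,
    # so that yA_train == yB_train and yA_test == yB_test row for row.
    A_train, A_test, yA_train, yA_test = train_test_split(
        A, labels, test_size=0.25, random_state=42)
    B_train, B_test, yB_train, yB_test = train_test_split(
        B, labels, test_size=0.25, random_state=42)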

Thanks.

+7
3 answers

Yes, using the same random_state is enough.

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X2 = np.hstack((X, X))
>>> X_train, X_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train2, X_test2, _, _ = train_test_split(X2, y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_train2
array([[4, 5, 4, 5],
       [0, 1, 0, 1],
       [6, 7, 6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> X_test2
array([[2, 3, 2, 3],
       [8, 9, 8, 9]])
+10

Looking at the code of the train_test_split function, it sets the random seed inside the function on every call from the random_state you pass. So the same random_state will lead to the same split every time. We can verify that this works quite simply:

    import numpy as np
    from sklearn import model_selection

    X1 = np.random.random((200, 5))
    X2 = np.random.random((200, 5))
    y = np.arange(200)

    X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(
        X1, y, test_size=0.1, random_state=42)
    X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(
        X2, y, test_size=0.1, random_state=42)

    # If the label splits match, the example splits match too.
    print(np.all(y1_train == y2_train))
    print(np.all(y1_test == y2_test))

Which outputs:

    True
    True

It's good! Another way to solve this problem is to make one train/test split over all of your features and then separate the features before training. However, if you are in the strange situation where you need to do both at once (this sometimes happens with similarity matrices, where you do not want test features to appear in your training set), you can use the StratifiedShuffleSplit class to return the indices of the data belonging to each set. For instance:

    from sklearn import model_selection

    n_splits = 1
    sss = model_selection.StratifiedShuffleSplit(
        n_splits=n_splits, test_size=0.1, random_state=42)
    # y must hold class labels here, since the split is stratified on it.
    train_idx, test_idx = list(sss.split(X, y))[0]
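The returned indices can then be used to slice both datasets identically. A minimal sketch, assuming A and B are NumPy feature matrices over the same examples and y holds class labels (all stand-in names):

    import numpy as np
    from sklearn import model_selection

    # Stand-in data: 100 examples, two feature sets, binary class labels.
    rng = np.random.RandomState(0)
    A = rng.random_sample((100, 8))
    B = rng.random_sample((100, 12))
    y = rng.randint(0, 2, 100)

    sss = model_selection.StratifiedShuffleSplit(
        n_splits=1, test_size=0.1, random_state=42)
    train_idx, test_idx = next(sss.split(A, y))

    # One set of indices slices both datasets the same way.
    A_train, A_test = A[train_idx], A[test_idx]
    B_train, B_test = B[train_idx], B[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]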
+4

As mentioned above, you can use the random_state parameter. But if you want the same results globally, meaning across all future calls, you can set NumPy's global random seed instead:

    np.random.seed(42)  # any fixed integer seed
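For example, when random_state is not passed, train_test_split draws from NumPy's global generator, so re-seeding right before each call reproduces the split. A minimal sketch (note this relies on global state, so any NumPy call between the seed and the split changes the outcome):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape((10, 2))
    y = np.arange(10)

    # Re-seed the global generator immediately before each un-seeded call.
    np.random.seed(42)
    _, _, _, y_test_1 = train_test_split(X, y, test_size=0.3)
    np.random.seed(42)
    _, _, _, y_test_2 = train_test_split(X, y, test_size=0.3)

    print(np.array_equal(y_test_1, y_test_2))  # True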
0

Source: https://habr.com/ru/post/1266076/

