What does KFold do in python?

I am working through this tutorial: https://www.dataquest.io/mission/74/getting-started-with-kaggle

I got to part 9, making predictions. There is some data in a data frame called titanic, which is then divided into folds using:

 # Generate cross-validation folds for the titanic dataset.
 # It returns the row indices corresponding to train and test.
 # We set random_state to ensure we get the same splits every time we run this.
 kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

I'm not sure what it does, or what kind of object kf is. I tried reading the documentation, but that didn't help much. Also, there are three folds (n_folds=3), so why does this line only give access to train and test (and how do I know that they are called train and test)?

 for train, test in kf: 
2 answers

KFold provides train/test indices to split the data into train and test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a test set, while the remaining k - 1 folds form the training set (source).

Suppose you have data with indices from 0 to 11. If you use n_folds=k, then in the i-th iteration (i <= k) the indices of the i-th fold become the test indices, and the remaining k - 1 folds (everything except that i-th fold) together form the train indices.

Example

 import numpy as np
 from sklearn.cross_validation import KFold

 x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
 kf = KFold(12, n_folds=3)
 for train_index, test_index in kf:
     print(train_index, test_index)

Output

Fold 1: [4 5 6 7 8 9 10 11] [0 1 2 3]

Fold 2: [0 1 2 3 8 9 10 11] [4 5 6 7]

Fold 3: [0 1 2 3 4 5 6 7] [8 9 10 11]

Import update for sklearn 0.20+:

The KFold class now lives in the sklearn.model_selection module (the old sklearn.cross_validation module was removed in version 0.20). To import KFold in sklearn 0.20+, use from sklearn.model_selection import KFold . Current KFold documentation
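
For reference, a minimal sketch of the same example with the new API (the constructor now takes n_splits instead of the number of samples, and the index pairs come from the split method):

 import numpy as np
 from sklearn.model_selection import KFold

 x = np.arange(12)       # same 12 samples as above
 kf = KFold(n_splits=3)  # n_folds was renamed to n_splits
 for train_index, test_index in kf.split(x):
     print(train_index, test_index)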


Sharing the theoretical information about KFold that I have learned so far.

KFold is a model validation technique that does not use a pre-trained model. Rather, it takes the chosen hyperparameters, trains a new model on k - 1 folds of the data, and tests that model on the k-th fold.

The k models built this way are used only for validation.

It returns k different scores (e.g., accuracy percentages), one for each k-th test fold, and we usually take their average when analyzing the model.
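
For instance, a minimal sketch of collecting the k scores and averaging them; the dataset and model here are placeholders I chose for illustration, not anything from the original post:

 import numpy as np
 from sklearn.datasets import load_iris
 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import KFold

 X, y = load_iris(return_X_y=True)  # placeholder dataset

 kf = KFold(n_splits=3, shuffle=True, random_state=1)
 scores = []
 for train_index, test_index in kf.split(X):
     model = LogisticRegression(max_iter=1000)  # a fresh model per fold
     model.fit(X[train_index], y[train_index])  # train on k - 1 folds
     scores.append(model.score(X[test_index], y[test_index]))  # test on the k-th fold

 print(scores)           # k different scores
 print(np.mean(scores))  # average score used to analyze the model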

We repeat this process for all the different models that we want to analyze. Short algorithm:

  1. Split the data into training and test sets.
  2. Train different models, say SVM, RF, and LR, on this training data.
     a. Take the whole dataset and divide it into k folds.
     b. Create a new model with the hyperparameters obtained after the training in step 2.
     c. Fit the newly created model on the k - 1 folds.
     d. Test it on the k-th fold.
     e. Take the average score.
  3. Analyze the various average scores and select the best model among SVM, RF, and LR (see the sketch after this list).
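
As promised above, a minimal sketch of this comparison; the dataset and the default hyperparameters are assumptions made for illustration:

 from sklearn.datasets import load_iris
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import cross_val_score
 from sklearn.svm import SVC

 X, y = load_iris(return_X_y=True)  # placeholder dataset

 models = {
     "SVM": SVC(),
     "RF": RandomForestClassifier(),
     "LR": LogisticRegression(max_iter=1000),
 }

 # Compare the average k-fold scores and pick the best model.
 for name, model in models.items():
     scores = cross_val_score(model, X, y, cv=5)  # one score per fold
     print(name, scores.mean())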

The simple reason for this is that, as a rule, we have a scarcity of data, and if we divide the whole dataset into:

  1. Training
  2. Validation
  3. Testing

we may cut out a relatively small chunk of data that could have improved our model. It is also possible that some of the data never takes part in training, so we never analyze the model's behavior on such data.

KFold overcomes both of these problems.

