Column Based Stratified Sample

I have a rather large CSV file of Amazon review data that I read into a pandas DataFrame. I want to split the data 80-20 (train-test), but I also want the split to proportionally represent the values of one column (Categories), i.e. every category of review should be present in both the train and the test data in the same proportion.

The data is as follows:

| **ReviewerID** | **ReviewText** | **Categories** | **ProductId** |
|---|---|---|---|
| 1212 | good product | Mobile | 14444425 |
| 1233 | will buy again | drugs | 324532 |
| 5432 | not recomended | dvd | 789654123 |

I'm using the following code:

    import pandas as pd
    Meta = pd.read_csv('C:\\Users\\xyz\\Desktop\\WM Project\\Joined.csv')
    import numpy as np
    from sklearn.cross_validation import train_test_split
    train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)

It gives the following error:

 NameError: name 'y' is not defined 

Since I'm relatively new to Python, I can't figure out what I'm doing wrong, or whether this code will actually stratify on the Categories column. It seems to work fine when I remove the stratify parameter and the Categories column from the train-test split.

Any help would be appreciated.

2 answers
    >>> import pandas as pd
    >>> Meta = pd.read_csv('C:\\Users\\*****\\Downloads\\so\\Book1.csv')
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> y = Meta.pop('Categories')
    >>> Meta
       ReviewerID      ReviewText  ProductId
    0        1212    good product   14444425
    1        1233  will buy again     324532
    2        5432  not recomended  789654123
    >>> y
    0    Mobile
    1     drugs
    2       dvd
    Name: Categories, dtype: object
    >>> X = Meta
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42, stratify=y)
    >>> X_test
       ReviewerID    ReviewText  ProductId
    0        1212  good product   14444425
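
To check that the split really is proportional, it can help to compare the category shares in both halves. The following is only a minimal sketch on made-up data: the 100-row frame, its 50/30/20 category mix, and the ID ranges are assumptions standing in for the real CSV. Note that stratification needs at least two rows per category, so a realistic check needs more than one row per class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CSV: 100 rows, three categories at 50/30/20.
df = pd.DataFrame({
    "ReviewerID": range(1000, 1100),
    "ReviewText": ["sample review text"] * 100,
    "Categories": ["Mobile"] * 50 + ["drugs"] * 30 + ["dvd"] * 20,
    "ProductId": range(5000, 5100),
})

# Separate the label column from the remaining columns.
y = df["Categories"]
X = df.drop(columns="Categories")

# stratify=y asks for the same category mix in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare the share of each category in the train and test labels.
train_share = y_train.value_counts(normalize=True).sort_index()
test_share = y_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

If the two columns of the printed table are close to each other, the split is stratified on Categories as intended.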

sklearn.model_selection.train_test_split

stratify : array-like or None (default is None)

If not None, the data is split in a stratified fashion, using this as the class labels.

According to the API documentation, I think you should try something like:

    X_train, X_test, y_train, y_test = train_test_split(Meta_X, Meta_Y, test_size = 0.2, stratify=Meta_Y)

Meta_X and Meta_Y have to be assigned properly by you (based on your code, I think Meta_Y should be Meta.categories). The NameError occurs because stratify=y refers to a variable y that was never defined.
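
If the goal is two DataFrames (train, test) rather than separate X and y pieces, the whole frame can be passed in while the column is used only for stratification. This is a rough sketch, not a definitive recipe: the data is invented, the "rare" category and the names df_ok and counts are hypothetical, and dropping single-row categories is just one possible way to handle them, since a category with only one row cannot be stratified.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; "Categories" follows the column name in the question.
df = pd.DataFrame({
    "ReviewText": [f"review {i}" for i in range(21)],
    "Categories": ["Mobile"] * 10 + ["drugs"] * 10 + ["rare"],
})

# Keep only categories with at least two rows (assumed handling of
# rare categories; they could also be merged into an "other" group).
counts = df["Categories"].value_counts()
df_ok = df[df["Categories"].isin(counts[counts >= 2].index)]

# Split the full DataFrame; the column is passed only via stratify.
train, test = train_test_split(
    df_ok, test_size=0.2, random_state=0, stratify=df_ok["Categories"]
)

print(train["Categories"].value_counts())
print(test["Categories"].value_counts())
```

Both train and test keep the Categories column here, which is convenient when later steps still need it.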


Source: https://habr.com/ru/post/1268109/

