Column Based Stratified Sample

Question


I have a rather large CSV file containing Amazon review data that I read into a pandas data frame. I want to split the data 80-20 (train-test), but at the same time I want to make sure that the split data proportionally represents the values of one column (Categories), i.e. every category of review is present in both the train and the test data in the same proportion.

The data is as follows:

| **ReviewerID** | **ReviewText** | **Categories** | **ProductId** |
|---|---|---|---|
| 1212 | good product | Mobile | 14444425 |
| 1233 | will buy again | drugs | 324532 |
| 5432 | not recomended | dvd | 789654123 |

I'm using the following code:

    import pandas as pd
    Meta = pd.read_csv('C:\\Users\\xyz\\Desktop\\WM Project\\Joined.csv')
    import numpy as np
    from sklearn.cross_validation import train_test_split
    train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)

It gives the following error:

 NameError: name 'y' is not defined

Since I'm relatively new to Python, I can't figure out what I'm doing wrong, or whether this code will stratify based on the Categories column. It seems to work fine when I remove the stratify parameter as well as the Categories column from the train-test split.

Any help would be appreciated.

+10

python pandas scikit-learn sklearn-pandas

Muhammad Ali Zia May 03 '16 at 6:56


2 answers

From the documentation for sklearn.model_selection.train_test_split:

    stratify : array-like or None (default is None)
        If not None, data is split in a stratified fashion, using this as the class labels.

According to the API documentation, I think you should try something like:

    X_train, X_test, y_train, y_test = train_test_split(Meta_X, Meta_Y, test_size = 0.2, stratify=Meta_Y)

You need to assign Meta_X and Meta_Y appropriately yourself (I think Meta_Y should be Meta.categories based on your code).
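If you want to keep whole rows together rather than separate X and y, a minimal sketch along these lines might help. It assumes the column is named Categories as in the sample data; the toy data frame and the `meta` name are made up for illustration and stand in for the real CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real CSV:
# 10 reviews per category, so every class can be split 80/20.
meta = pd.DataFrame({
    "ReviewerID": range(30),
    "ReviewText": ["some review text"] * 30,
    "Categories": ["Mobile"] * 10 + ["drugs"] * 10 + ["dvd"] * 10,
})

# Split whole rows, using the Categories column as the class labels.
# Passing the column itself (not an undefined name like `y`) avoids the NameError.
train, test = train_test_split(
    meta, test_size=0.2, stratify=meta["Categories"], random_state=0
)
```

Here `train` and `test` are both data frames that still contain the Categories column, so nothing has to be popped or re-joined afterwards.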

+9

su79eu7k May 03 '16 at 7:17


nEO · Accepted Answer · 2016-05-03T07:24:24+0000

    >>> import pandas as pd
    >>> Meta = pd.read_csv('C:\\Users\\*****\\Downloads\\so\\Book1.csv')
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> y = Meta.pop('Categories')
    >>> Meta
       ReviewerID      ReviewText  ProductId
    0        1212    good product   14444425
    1        1233  will buy again     324532
    2        5432  not recomended  789654123
    >>> y
    0    Mobile
    1     drugs
    2       dvd
    Name: Categories, dtype: object
    >>> X = Meta
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42, stratify=y)
    >>> X_test
       ReviewerID    ReviewText  ProductId
    0        1212  good product   14444425
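To check that the split really is proportional, one option is to compare the category shares before and after splitting. The sketch below follows the same pop-then-split pattern on a made-up data frame with uneven categories (the data and the `meta` name are assumptions, not the original file). As far as I know, stratification needs at least two rows in every category, which is why the toy data is larger than the three-row sample.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with uneven categories: 60% Mobile, 30% drugs, 10% dvd.
meta = pd.DataFrame({
    "ReviewerID": range(100),
    "Categories": ["Mobile"] * 60 + ["drugs"] * 30 + ["dvd"] * 10,
})

# Same pattern as above: pop the label column, then stratify on it.
y = meta.pop("Categories")
X = meta
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of each category in the full data, the train part and the test part.
proportions = pd.DataFrame({
    "all": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(proportions)
```

If the three columns of `proportions` are close to each other, the split is stratified on that column as intended.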
