I have a rather large CSV file containing Amazon review data, which I read into a pandas DataFrame. I want to split the data 80-20 (train-test), but at the same time I want to make sure the split proportionally represents the values of one column (Categories), i.e. every category of reviews is present in both the train and the test data in the same proportion.
The data is as follows:

| ReviewerID | ReviewText | Categories | ProductId |
|---|---|---|---|
| 1212 | good product | Mobile | 14444425 |
| 1233 | will buy again | drugs | 324532 |
| 5432 | not recomended | dvd | 789654123 |

I'm using the following code:

    import pandas as pd
    import numpy as np
    from sklearn.cross_validation import train_test_split

    Meta = pd.read_csv('C:\\Users\\xyz\\Desktop\\WM Project\\Joined.csv')
    train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)

It gives the following error:
NameError: name 'y' is not defined
Since I'm relatively new to Python, I can't figure out what I'm doing wrong, or whether this code will even stratify on the Categories column. The split seems to work fine when I remove both the stratify parameter and the Categories column from the train_test_split call.
Any help would be appreciated.
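For reference, here is a minimal sketch of the kind of split described above. It assumes a scikit-learn version where `train_test_split` lives in `sklearn.model_selection`, and it passes the label column itself to `stratify` instead of an undefined `y`. The small DataFrame and the `random_state` value are made-up stand-ins, not the real file or a required setting.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real CSV: 10 reviews per category.
meta = pd.DataFrame({
    "ReviewerID": range(30),
    "ReviewText": ["some text"] * 30,
    "Categories": ["Mobile"] * 10 + ["drugs"] * 10 + ["dvd"] * 10,
    "ProductId": range(100, 130),
})

# Split the whole frame so every column is kept, and stratify on the
# Categories column so each category keeps its share in both parts.
train, test = train_test_split(
    meta,
    test_size=0.2,
    stratify=meta["Categories"],
    random_state=0,  # assumption: fixed seed, only for reproducibility
)

# Per-category counts in each part, to check the proportions.
print(train["Categories"].value_counts())
print(test["Categories"].value_counts())
```

Note that stratification needs at least two rows in every category, so very rare categories may have to be merged or dropped first.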