Sklearn train_test_split; retaining unique values from column(s) in the training set

Is there a way to use sklearn.model_selection.train_test_split to retain all unique values from a specific column(s) in the training set?

Let me set up an example. The most common matrix factorization problem I know of is predicting movie ratings for users in, say, the Netflix Challenge or MovieLens datasets. Now this question is not centered on any single matrix factorization method, but among the range of possibilities there is a group of methods that will make predictions only for known combinations of users and items.

So in MovieLens 100k, for example, we have 943 unique users and 1682 unique films. Even if we used train_test_split with a high train_size ratio (say 0.9), the number of unique users and films in the training set would not match the full data. This presents a problem, because the group of methods I mentioned cannot predict anything other than 0 for films or users they have not been trained on. Here is an example of what I mean.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

ml = pd.read_csv('ml-100k/u.data', sep='\t', names=['User_id', 'Item_id', 'Rating', 'ts'])
ml.head()   
   User_id  Item_id Rating         ts
0      196      242      3  881250949
1      186      302      3  891717742
2       22      377      1  878887116
3      244       51      2  880606923
4      166      346      1  886397596
ml.User_id.unique().size
943
ml.Item_id.unique().size
1682
utrain, utest, itrain, itest, rtrain, rtest = train_test_split(ml.User_id, ml.Item_id, ml.Rating, train_size=0.9)
np.unique(utrain).size
943
np.unique(itrain).size
1644
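To see concretely which items fall out of a split like this, `numpy.setdiff1d` can compare the unique values of the full data against the training portion. A self-contained sketch on synthetic ids (not the MovieLens file itself):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# 600 synthetic "ratings" spread over up to 500 item ids,
# so many items occur only once or twice
items = rng.randint(0, 500, size=600)

itrain, itest = train_test_split(items, train_size=0.9, random_state=0)

# Item ids present in the full data but absent from the training split
missing = np.setdiff1d(np.unique(items), np.unique(itrain))
print(missing.size)
```

With rarely occurring ids, some typically land entirely in the test split, which is exactly the failure mode described above.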

As you can see, the training set retains only 1644 of the 1682 unique films. Users are less of a problem here, since in MovieLens 100k every user has rated at least 20 films, but many films have only a handful of ratings and can easily end up entirely in the test set. A method that predicts only for known user-item combinations would then return 0 for those films.

So train_test_split on its own is not going to cut it here.

Is there a way in sklearn to split a sample while ensuring that all unique values of a specific column(s) are retained in the training set?

My rudimentary way of doing it is as follows:

  • Separate out the items and/or users that have a low rating count.
  • Run train_test_split on the data excluding these rare items/users (making sure that the split size plus the size of the excluded data adds up to the desired overall split size).
  • Combine the two to obtain a final, representative training set.

Example:

item_counts = ml.groupby(['Item_id']).size()
user_counts = ml.groupby(['User_id']).size()
rare_items = item_counts.loc[item_counts <= 5].index.values
rare_users = user_counts.loc[user_counts <= 5].index.values
rare_items.size
384
rare_users.size
0
# We can ignore users in this example
rare_ratings = ml.loc[ml.Item_id.isin(rare_items)]
rare_ratings.shape[0]
968
ml_less_rare = ml.loc[~ml.Item_id.isin(rare_items)]
items = ml_less_rare.Item_id.values
users = ml_less_rare.User_id.values
ratings = ml_less_rare.Rating.values
# Establish number of items desired from train_test_split
desired_ratio = 0.9
train_size = desired_ratio * ml.shape[0] - rare_ratings.shape[0]
train_ratio = train_size / ml_less_rare.shape[0]
itrain, itest, utrain, utest, rtrain, rtest = train_test_split(items, users, ratings, train_size=train_ratio)
itrain = np.concatenate((itrain, rare_ratings.Item_id.values))
np.unique(itrain).size
1682
utrain = np.concatenate((utrain, rare_ratings.User_id.values))
np.unique(utrain).size
943
rtrain = np.concatenate((rtrain, rare_ratings.Rating.values))
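The steps above could also be wrapped into a small reusable helper. This is only a sketch of the same idea (the function name `split_keep_rare` and the `min_count` threshold are my own, nothing built into sklearn): rows whose value in the chosen column is rare are forced into the training set, and the remaining rows are split with an adjusted ratio so the overall train size stays close to the target.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_keep_rare(df, col, min_count=5, train_size=0.9, random_state=None):
    """Split df into train/test, forcing rows whose value in `col`
    occurs at most `min_count` times into the training set."""
    counts = df.groupby(col).size()
    rare_values = counts.index[counts <= min_count]
    rare_rows = df[df[col].isin(rare_values)]
    rest = df[~df[col].isin(rare_values)]
    # Shrink the ratio so rare rows + split rows approximate the target size
    adjusted = (train_size * len(df) - len(rare_rows)) / len(rest)
    train, test = train_test_split(rest, train_size=adjusted,
                                   random_state=random_state)
    return pd.concat([train, rare_rows]), test
```

Called as `train, test = split_keep_rare(ml, 'Item_id')`, it reproduces the manual steps above for the item column; running it a second time on `'User_id'` would not be safe, since the two adjustments interact.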

Is this going to work? And is there a way to do this that is built into train_test_split or another splitting method in sklearn?


Source: https://habr.com/ru/post/1690397/

