Resampling in scikit-learn and/or pandas

Is there a built-in function in pandas or scikit-learn for resampling according to a specified strategy? I want to resample my data based on a categorical variable.

For example, if my data contains 75% men and 25% women, but I would like to train my model on 50% men and 50% women. (I would also like to be able to generalize to splits that are not 50/50.)

I need something that resamples my data according to the specified proportions.

3 answers

Stratified sampling means that the class distribution is maintained. If that is what you are looking for, you can still use StratifiedKFold and StratifiedShuffleSplit when you have a categorical variable whose distribution you want to keep the same in each fold; just use that variable in place of the target variable. For example, if you have a categorical variable in column i,

    from sklearn import cross_validation  # scikit-learn < 0.18 API

    skf = cross_validation.StratifiedKFold(X[:, i])
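Note that the cross_validation module is the pre-0.18 scikit-learn API; with the current model_selection module the same idea looks roughly like the sketch below (assuming X is a NumPy array with the categorical variable in column i):

    from sklearn.model_selection import StratifiedKFold

    # Stratify on column i instead of the target, so each fold
    # preserves the distribution of that categorical variable.
    skf = StratifiedKFold(n_splits=5)
    for train_idx, test_idx in skf.split(X, X[:, i]):
        X_train, X_test = X[train_idx], X[test_idx]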

However, if I understand you correctly, you want to resample to a specific target distribution (for example, 50/50) of one of the categorical features. I think you will have to come up with your own method to draw such a sample (split the dataset on the values of the variable, then draw the same number of random samples from each split; see the sketch at the end of this answer). If your main motivation is to balance the training set for a classifier, you could use the sample_weight argument instead. You can set the weights so that they balance the training set with respect to the desired variable:

    import sklearn.preprocessing
    from sklearn import svm

    # Weight each sample inversely proportional to the frequency of its category
    sample_weights = sklearn.preprocessing.balance_weights(X[:, i])
    clf = svm.SVC()
    clf.fit(X, y, sample_weight=sample_weights)
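balance_weights has since been removed from scikit-learn; in current versions a minimal equivalent sketch uses compute_sample_weight, which also accepts a {value: weight} dict if you want something other than a perfectly balanced set:

    from sklearn.utils.class_weight import compute_sample_weight

    # 'balanced' weights each sample inversely to the frequency of its
    # category in column i, mimicking the old balance_weights helper.
    sample_weights = compute_sample_weight('balanced', X[:, i])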

To target an uneven distribution, you would need to adjust the sample_weights accordingly.
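For the do-it-yourself route mentioned above (split on the variable, then draw the same number of random samples from each split), a minimal pandas sketch could look like the following; X as a DataFrame and the column name 'sex' are assumptions for illustration:

    import pandas as pd

    def equal_sample(df, col, n_per_group, seed=None):
        # Sample the same number of rows from each category; replace=True
        # lets minority categories be oversampled up to n_per_group.
        return (df.groupby(col, group_keys=False)
                  .apply(lambda g: g.sample(n=n_per_group, replace=True,
                                            random_state=seed)))

    # e.g. balanced = equal_sample(X, 'sex', n_per_group=500, seed=0)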


My attempt at a function that does what I want is below. Hope this helps someone else.

X and y are assumed to be a pandas DataFrame and Series, respectively.

    import numpy as np
    import pandas as pd

    def resample(X, y, sample_type=None, sample_size=None, class_weights=None, seed=None):
        # If sample_type is 'abs' or not set, sample_size should be an int
        # (absolute number of samples per class). If sample_type is 'min' or
        # 'max', sample_size should be a float multiplier of the smallest or
        # largest class count.
        if sample_type == 'min':
            sample_size_ = np.round(sample_size * y.value_counts().min()).astype(int)
        elif sample_type == 'max':
            sample_size_ = np.round(sample_size * y.value_counts().max()).astype(int)
        else:
            sample_size_ = max(int(sample_size), 1)
        if seed is not None:
            np.random.seed(seed)
        if class_weights is None:
            class_weights = dict()
        X_resampled = pd.DataFrame()
        for yi in y.unique():
            size = np.round(sample_size_ * class_weights.get(yi, 1.)).astype(int)
            X_yi = X[y == yi]
            sample_index = np.random.choice(X_yi.index, size=size)
            X_resampled = X_resampled.append(X_yi.reindex(sample_index))
        return X_resampled
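A hypothetical usage sketch, downsampling every class to the size of the smallest one; note the function returns only the resampled X, so matching labels have to be recovered through the index:

    # 'min' with sample_size=1.0 draws min-class-count rows per class;
    # a class_weights dict such as {0: 1.0, 1: 0.5} would skew away from 50/50.
    X_balanced = resample(X, y, sample_type='min', sample_size=1.0, seed=42)
    y_balanced = y.loc[X_balanced.index]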

If you are open to importing a library, I find the imbalanced-learn library useful for resampling. Here the categorical variable is y, and the data to resample is X. In the example below, the fish are oversampled to equal the number of dogs, 3:3.

The code is slightly modified from the imbalanced-learn documentation, section 2.1.1, "Naive random over-sampling". This method works with numeric data as well as strings.

    import numpy as np
    from collections import Counter
    from imblearn.over_sampling import RandomOverSampler

    y = np.array([1, 1, 0, 0, 0])  # 1 = fish, 0 = dog
    X = np.array([['red fish'], ['blue fish'], ['dog'], ['dog'], ['dog']])
    print('target:\n', y)
    print('data:\n', X)

    print('Original dataset shape {}'.format(Counter(y)))
    # Original dataset shape Counter({0: 3, 1: 2})

    ros = RandomOverSampler(ratio='auto', random_state=42)
    X_res, y_res = ros.fit_sample(X, y)

    print('Resampled dataset shape {}'.format(Counter(y_res)))
    # Resampled dataset shape Counter({0: 3, 1: 3})
    print(X_res)
    print(y_res)
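Since the question also asks about splits other than 50/50: in newer imbalanced-learn (0.4 and later, where ratio became sampling_strategy and fit_sample became fit_resample), you can pass a dict of desired per-class counts. A sketch reusing X and y from above:

    from imblearn.over_sampling import RandomOverSampler

    # Ask for 3 dogs (unchanged) and 6 fish, a 2:1 fish-to-dog split; for
    # over-sampling each requested count must be >= the original class count.
    ros = RandomOverSampler(sampling_strategy={0: 3, 1: 6}, random_state=42)
    X_res, y_res = ros.fit_resample(X, y)
    print(Counter(y_res))  # Counter({1: 6, 0: 3})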
