Subsampling + classification using scikit-learn

I am using scikit-learn for a binary classification task, and I have: Class 0 with 200 observations and Class 1 with 50 observations.

Because the data are unbalanced, I want to take a random subsample of the majority class with the same number of observations as the minority class, and use the new data set as input to the classifier. The selection and classification steps can be repeated many times. I have the following code for subsampling, written mainly with the help of Ami Tavory:

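import numpy as np
from sklearn.datasets import load_files

docs_train = load_files(rootdir, categories=categories, encoding='latin-1')

# load_files returns a list of strings; wrap it in a NumPy array for indexing
X_train = np.array(docs_train.data, dtype=object)
y_train = docs_train.target

# assuming that class 0 is the majority class
majority_x, majority_y = X_train[y_train == 0], y_train[y_train == 0]
minority_x, minority_y = X_train[y_train == 1], y_train[y_train == 1]

# draw 50 random majority samples (with replacement, the np.random.choice default)
inds = np.random.choice(range(majority_x.shape[0]), 50)
majority_x = majority_x[inds]
majority_y = majority_y[inds]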

This works like a charm. However, once majority_x and majority_y have been processed, I want to replace the old class 0 samples in X_train and y_train with the new, smaller set, so that I can pass the result to a classifier or a pipeline, as shown below:

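from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokens, binary=True)),
    ('classifier', SVC(C=1, kernel='linear')),
])

pipeline.fit(X_train, y_train)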

What I tried: since the resulting arrays are NumPy arrays, and because I am new to this whole field and trying hard to learn, I tried to combine the two arrays with majority_x + minority_x to form the training data I want. That failed with errors that I am still trying to solve. But even if it had worked, how can I keep the indices aligned, so that majority_y and minority_y still match their samples?


You can combine majority_x and minority_x, and their labels in the same order, with np.concatenate:

X_train = np.concatenate((majority_x,minority_x))
y_train = np.concatenate((majority_y,minority_y))

X_train and y_train now contain the samples with y = 0 first, followed by the samples with y = 1 (50 of each in this example).
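As a minimal sketch of why np.concatenate keeps samples and labels aligned, the toy example below joins both pairs of arrays in the same order and then shuffles them with one shared permutation. The three documents per class and the names rng, perm, X_shuffled, and y_shuffled are illustrative assumptions, not part of the original code.

```python
import numpy as np

# Toy stand-ins for the subsampled majority class and the minority class
# (hypothetical data, three documents per class).
majority_x = np.array(["maj doc a", "maj doc b", "maj doc c"], dtype=object)
majority_y = np.zeros(3, dtype=int)
minority_x = np.array(["min doc a", "min doc b", "min doc c"], dtype=object)
minority_y = np.ones(3, dtype=int)

# Concatenate samples and labels in the same order, so that element i of
# X_train still belongs to label i of y_train.
X_train = np.concatenate((majority_x, minority_x))
y_train = np.concatenate((majority_y, minority_y))

# Optional: shuffle both arrays with one shared permutation, which moves
# each sample together with its label.
rng = np.random.default_rng(0)
perm = rng.permutation(len(y_train))
X_shuffled, y_shuffled = X_train[perm], y_train[perm]
```

Because both arrays are indexed with the same perm, the pairing survives the shuffle. Whether shuffling is needed at all depends on the classifier.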

Also, consider passing replace=False to np.random.choice, so that the same majority observation is not drawn more than once.
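The effect of replace=False, and the repeated selection mentioned in the question, could be sketched roughly as follows. The helper undersample, its parameters, and the synthetic 200/50 data are assumptions made for illustration. It uses NumPy's Generator.choice rather than np.random.choice, and it assumes a binary problem in which every label other than majority_label belongs to the minority class.

```python
import numpy as np

def undersample(X, y, majority_label=0, rng=None):
    """Return a balanced copy of (X, y) by subsampling the majority class.

    Hypothetical helper: assumes a binary problem and that the majority
    class has at least as many samples as the minority class.
    """
    rng = np.random.default_rng() if rng is None else rng
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # replace=False guarantees that no majority sample is drawn twice
    chosen = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate((chosen, min_idx))
    # indexing X and y with the same indices keeps them aligned
    return X[keep], y[keep]

# Synthetic stand-in for the 200 / 50 split described in the question
X = np.array([f"doc {i}" for i in range(250)], dtype=object)
y = np.array([0] * 200 + [1] * 50)

rng = np.random.default_rng(42)
for run in range(3):
    X_bal, y_bal = undersample(X, y, rng=rng)
    # a call such as pipeline.fit(X_bal, y_bal) would go here
    print(run, np.bincount(y_bal))
```

Each pass draws a fresh subsample of the majority class, so the classifier sees a different balanced training set on every repetition.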


Source: https://habr.com/ru/post/1626685/

