I am using scikit-learn for a binary classification task, and my data looks like this:
Class 0: 200 observations
Class 1: 50 observations
Because the data is imbalanced, I want to take a random subsample of the majority class with the same number of observations as the minority class, and use the resulting data set as input for the classifier. The subsampling and classification steps may be repeated many times. I have the following code for the subsampling, written mainly with the help of Ami Tavory:
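from sklearn.datasets import load_files

# rootdir and categories are defined elsewhere
docs_train = load_files(rootdir, categories=categories, encoding='latin-1')
X_train = docs_train.data    # list of raw documents (strings)
y_train = docs_train.target  # NumPy array of class labels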
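import numpy as np

# X_train is a list of documents, so x is a 1-D array and is indexed without ",:"
x, y = np.array(X_train, dtype=object), np.array(y_train)
majority_x, majority_y = x[y == 0], y[y == 0]  # assuming that class 0 is the majority class
minority_x, minority_y = x[y == 1], y[y == 1]
# draw as many majority samples as there are minority samples (50 here), without repeats
inds = np.random.choice(majority_x.shape[0], minority_x.shape[0], replace=False)
majority_x = majority_x[inds]
majority_y = majority_y[inds]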
This works like a charm. However, once majority_x and majority_y have been processed, I want to replace the old class 0 samples in X_train and y_train with the new, smaller set, so that I can pass the result to the classifier or the pipeline as shown below:
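from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokens, binary=True)),  # tokens is defined elsewhere
    ('classifier', SVC(C=1, kernel='linear')),
])
pipeline.fit(X_train, y_train)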
What I have tried so far: since the resulting arrays are NumPy arrays, and because I am new to this whole field and really trying hard to learn, I tried to combine the two arrays with majority_x + minority_x to form the training data I want. That did not work; it gave errors that I am still trying to resolve. But even if it had worked, how can I keep the indices aligned so that majority_y and minority_y still match their samples?
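To make the goal concrete, here is a minimal sketch of the kind of combination being asked about. It is only an illustration under some assumptions: the documents are plain strings, class 0 is the majority class, and `undersample_majority` is a hypothetical helper name, not a scikit-learn function. It uses `np.concatenate` because `+` on NumPy arrays works element-wise instead of appending, and it selects rows by index so that the labels stay aligned with their documents. The toy `docs` and `labels` stand in for `docs_train.data` and `docs_train.target`.

```python
import numpy as np

def undersample_majority(x, y, majority_label=0, rng=None):
    """Return a balanced, shuffled copy of (x, y).

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=object)  # 1-D array of documents
    y = np.asarray(y)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    # Pick as many majority rows as there are minority rows, without repeats.
    kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    # Combine *indices* into the original arrays, so x and y stay aligned.
    keep = np.concatenate([kept_majority, minority_idx])
    rng.shuffle(keep)
    return x[keep], y[keep]

# Toy stand-in for the real data: 200 majority and 50 minority documents.
docs = ["majority doc %d" % i for i in range(200)] + \
       ["minority doc %d" % i for i in range(50)]
labels = np.array([0] * 200 + [1] * 50)

X_balanced, y_balanced = undersample_majority(docs, labels, rng=0)
print(X_balanced.shape, np.bincount(y_balanced))
```

With something like this, the balanced set could be passed on as `pipeline.fit(list(X_balanced), y_balanced)`, and calling the helper again with a different `rng` value would give a new random subsample for each repetition.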