Classification: PCA and logistic regression using sklearn

Question

Step 0: Description of the Problem

I have a classification problem, i.e. I want to predict a binary target from a set of numerical features using logistic regression, after running a principal component analysis (PCA).

I have 2 data sets: df_train and df_valid (training set and validation set, respectively), as pandas data frames containing the features and a target. As a first step, I used the pandas get_dummies function to convert all categorical variables to booleans. For example, I would have:

    import numpy as np
    import pandas as pd

    n_train = 10
    np.random.seed(0)
    df_train = pd.DataFrame({"f1": np.random.random(n_train),
                             "f2": np.random.random(n_train),
                             "f3": np.random.randint(0, 2, n_train).astype(bool),
                             "target": np.random.randint(0, 2, n_train).astype(bool)})

    In [36]: df_train
    Out[36]:
             f1        f2     f3 target
    0  0.548814  0.791725  False  False
    1  0.715189  0.528895   True   True
    2  0.602763  0.568045  False   True
    3  0.544883  0.925597   True   True
    4  0.423655  0.071036   True   True
    5  0.645894  0.087129   True  False
    6  0.437587  0.020218   True   True
    7  0.891773  0.832620   True  False
    8  0.963663  0.778157  False  False
    9  0.383442  0.870012   True   True

    n_valid = 3
    np.random.seed(1)
    df_valid = pd.DataFrame({"f1": np.random.random(n_valid),
                             "f2": np.random.random(n_valid),
                             "f3": np.random.randint(0, 2, n_valid).astype(bool),
                             "target": np.random.randint(0, 2, n_valid).astype(bool)})

    In [44]: df_valid
    Out[44]:
             f1        f2     f3 target
    0  0.417022  0.302333  False  False
    1  0.720324  0.146756   True  False
    2  0.000114  0.092339   True   True

Now I would like to use PCA to reduce the dimensionality of my problem, then use LogisticRegression from sklearn to train a model and get predictions on my validation set, but I'm not sure that the procedure I'm following is correct. This is what I'm doing:

Step 1: PCA

The idea is that I need to transform both my training and validation sets in the same way with PCA. In other words, I cannot fit PCA separately on each set; otherwise, they would be projected onto different eigenvectors.

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)  # assume to keep 2 components, but doesn't matter
    newdf_train = pca.fit_transform(df_train.drop("target", axis=1))
    newdf_valid = pca.transform(df_valid.drop("target", axis=1))  # not sure here if this is right
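The point about projecting both sets onto the same eigenvectors can be checked directly. The following is only a minimal sketch: the arrays X_train and X_valid are random stand-ins (not the question's data), and it assumes the default PCA settings (no whitening), where transform() subtracts the mean learned during fit and projects onto the learned components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.random_sample((10, 3))  # hypothetical stand-in for the training features
X_valid = rng.random_sample((3, 3))   # hypothetical stand-in for the validation features

pca = PCA(n_components=2)
Z_train = pca.fit_transform(X_train)  # learns mean_ and components_ from the training data
Z_valid = pca.transform(X_valid)      # reuses them; nothing is re-fitted here

# With default settings, transform() centers with the training mean
# and projects onto the training components.
Z_manual = (X_valid - pca.mean_) @ pca.components_.T
same_projection = np.allclose(Z_valid, Z_manual)
print("validation set projected with training axes:", same_projection)

# A PCA fitted on the validation set alone learns its own axes,
# which in general differ from the training ones.
pca_valid = PCA(n_components=2).fit(X_valid)
print("training components:\n", pca.components_)
print("validation-only components:\n", pca_valid.components_)
```

So calling fit_transform on the training set and transform on the validation set appears to be the combination that keeps both in the same reduced space.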

Step 2: Logistic Regression

This is not necessary, but I prefer to store things as a dataframe:

    features_train = pd.DataFrame(newdf_train)
    features_valid = pd.DataFrame(newdf_valid)

And now I perform the logistic regression:

    from sklearn.linear_model import LogisticRegression

    cls = LogisticRegression()
    cls.fit(features_train, df_train["target"])
    predictions = cls.predict(features_valid)

I think step 2 is correct, but I have more doubts about step 1: is this the right way to chain PCA and then a classifier?

python scikit-learn classification pca logistic-regression

ldocao 30 sept '15 at 7:59

1 answer

Alexander Fridman · Answer 1 · 2016-01-25T15:26:45+0000

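sklearn provides Pipeline for this. It chains the PCA step and the classifier, so PCA is fitted on the training data only and the same transformation is reused at prediction time.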

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    pca = PCA(n_components=2)
    clf = LogisticRegression()

    # The pipeline applies PCA itself, so features_train / features_valid
    # are presumably meant to hold the original (untransformed) feature columns.
    pipe = Pipeline([('pca', pca), ('logistic', clf)])
    pipe.fit(features_train, df_train["target"])
    predictions = pipe.predict(features_valid)
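For completeness, here is a self-contained sketch that runs the pipeline on data shaped like the question's frames and compares it with the manual two-step approach. The helper make_frame and the names X_train, X_valid, y_train are introduced only for illustration, and the explicit cast to float is an assumption made to avoid mixed bool/float columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def make_frame(n, seed):
    """Hypothetical helper: random frame shaped like df_train / df_valid."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame({
        "f1": rng.random_sample(n),
        "f2": rng.random_sample(n),
        "f3": rng.randint(0, 2, n).astype(bool),
        "target": rng.randint(0, 2, n).astype(bool),
    })


df_train = make_frame(10, seed=0)
df_valid = make_frame(3, seed=1)

# Raw (untransformed) features, cast to float so every column is numeric.
X_train = df_train.drop(columns="target").astype(float)
X_valid = df_valid.drop(columns="target").astype(float)
y_train = df_train["target"]

# Pipeline: fit() fits PCA on the training data and then the classifier;
# predict() applies the already-fitted PCA before classifying.
pipe = Pipeline([
    ("pca", PCA(n_components=2)),
    ("logistic", LogisticRegression()),
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_valid)

# Manual equivalent of the question's step 1 and step 2.
pca = PCA(n_components=2)
clf = LogisticRegression()
clf.fit(pca.fit_transform(X_train), y_train)
manual_predictions = clf.predict(pca.transform(X_valid))

print(predictions)
print(np.array_equal(predictions, manual_predictions))
```

Under these assumptions the two approaches should give the same predictions, which suggests the procedure in the question is sound and the pipeline is mainly a more compact way to write it.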
