In general, you would want to use the first option.
Your normalization places your data in a new space which is seen by the PCA, and its transform basically expects the data to be in that same space.
Scikit-learn provides tools to do this conveniently and transparently by combining estimators in a pipeline. Try:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import numpy as np

data = np.random.randn(20, 40)

pipeline = Pipeline([('scaling', StandardScaler()), ('pca', PCA(n_components=5))])
pipeline.fit_transform(data)
The attached scaler will always apply its transformation to the data before it is passed on to the PCA object.
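As a minimal sketch of that behavior (the variable names here are illustrative, not from the original post), a fitted pipeline reuses the scaling learned during fit when projecting new samples, and matches chaining the fitted scaler and PCA by hand:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

train = np.random.randn(20, 40)
new_samples = np.random.randn(5, 40)

pipeline = Pipeline([('scaling', StandardScaler()), ('pca', PCA(n_components=5))])
pipeline.fit(train)

# transform() applies the already-fitted scaler, then the already-fitted PCA
projected = pipeline.transform(new_samples)  # shape (5, 5)

# Equivalent manual chain for comparison
scaler = StandardScaler().fit(train)
pca = PCA(n_components=5).fit(scaler.transform(train))
manual = pca.transform(scaler.transform(new_samples))
assert np.allclose(projected, manual)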
As @larsmans points out, you can use sklearn.preprocessing.Normalizer instead of StandardScaler or, equivalently, remove the mean centering from StandardScaler by passing the keyword argument with_mean=False.
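A short sketch of those two variants (the step names and n_components value are just placeholders): Normalizer rescales each sample to unit norm, while StandardScaler(with_mean=False) scales features to unit variance without centering them.

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

data = np.random.randn(20, 40)

# Per-sample normalization before PCA
norm_pipeline = Pipeline([('norm', Normalizer()),
                          ('pca', PCA(n_components=5))])
norm_pipeline.fit_transform(data)

# Variance scaling only, no mean centering
scale_only_pipeline = Pipeline([('scaling', StandardScaler(with_mean=False)),
                                ('pca', PCA(n_components=5))])
scale_only_pipeline.fit_transform(data)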