How to normalize with PCA and scikit-learn

I want to know: should I do this,

    pca.fit(normalize(x))
    new = pca.transform(normalize(x))

or

    pca.fit(normalize(x))
    new = pca.transform(x)

I know that we must normalize our data before using PCA, but which of the above procedures is correct with sklearn?

1 answer

In general, you want to use the first option.

Your normalization places your data in a new space, and that is the space the PCA sees; its transform expects incoming data to live in that same space.
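A minimal sketch of the first option done by hand, assuming x is a 2-D NumPy array and using StandardScaler as the normalization step (the variable names here are illustrative, not part of the original question):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    x = np.random.randn(100, 10)        # stand-in for your data

    scaler = StandardScaler().fit(x)    # learn the scaling on x
    x_scaled = scaler.transform(x)      # move x into the scaled space

    pca = PCA(n_components=5).fit(x_scaled)
    new = pca.transform(x_scaled)       # transform data in the same space the PCA was fit on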

Scikit-learn provides tools to do this conveniently and transparently by combining estimators in a Pipeline. Try:

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    import numpy as np

    data = np.random.randn(20, 40)

    pipeline = Pipeline([('scaling', StandardScaler()), ('pca', PCA(n_components=5))])
    pipeline.fit_transform(data)

The pipelined scaler will always apply its transformation to the data before it reaches the PCA object.
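Once fitted, the same pipeline can be reused on new data, and both steps are applied in order. A small sketch continuing the example above; new_data is a hypothetical array with the same number of features as the training data:

    # Project previously unseen data with the already-fitted pipeline:
    # the scaler and the PCA learned above are applied in sequence.
    new_data = np.random.randn(5, 40)
    projected = pipeline.transform(new_data)
    print(projected.shape)   # (5, 5): 5 samples, 5 principal components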

As @larsmans points out, you can use sklearn.preprocessing.Normalizer instead of StandardScaler or, similarly, remove the mean centering from StandardScaler by passing the keyword argument with_mean=False.
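For illustration only, here is what those two alternatives look like in the same pipeline setup; which one is appropriate depends on whether you want per-sample unit norms (Normalizer) or per-feature scaling without centering (StandardScaler with with_mean=False):

    from sklearn.preprocessing import Normalizer, StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline

    # Per-sample normalization (each row scaled to unit norm) before PCA.
    pipe_norm = Pipeline([('scaling', Normalizer()), ('pca', PCA(n_components=5))])

    # Per-feature scaling without subtracting the mean.
    pipe_no_center = Pipeline([('scaling', StandardScaler(with_mean=False)), ('pca', PCA(n_components=5))])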


Source: https://habr.com/ru/post/974273/

