In general, you would want to use the first option.
Your normalization places your data in a new space which is seen by the PCA, and its transform basically expects the data to be in that same space.
Scikit-learn provides tools to do this conveniently and transparently by combining estimators in a pipeline. Try:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import numpy as np

data = np.random.randn(20, 40)

pipeline = Pipeline([('scaling', StandardScaler()), ('pca', PCA(n_components=5))])
pipeline.fit_transform(data)
The attached scaler will always apply its transformation to the data before it is passed on to the PCA object.
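As a minimal sketch of that behavior (the variable names here are illustrative, not from the original post), a fitted pipeline reuses the scaling learned during fit when projecting new samples, and matches chaining the fitted scaler and PCA by hand:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

train = np.random.randn(20, 40)
new_samples = np.random.randn(5, 40)

pipeline = Pipeline([('scaling', StandardScaler()), ('pca', PCA(n_components=5))])
pipeline.fit(train)

# transform() applies the already-fitted scaler, then the already-fitted PCA
projected = pipeline.transform(new_samples)  # shape (5, 5)

# Equivalent manual chain for comparison
scaler = StandardScaler().fit(train)
pca = PCA(n_components=5).fit(scaler.transform(train))
manual = pca.transform(scaler.transform(new_samples))
assert np.allclose(projected, manual)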
As @larsmans points out, you can use sklearn.preprocessing.Normalizer instead of StandardScaler or, equivalently, remove the mean centering from StandardScaler by passing the keyword argument with_mean=False.
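A short sketch of those two variants (the step names and n_components value are just placeholders): Normalizer rescales each sample to unit norm, while StandardScaler(with_mean=False) scales features to unit variance without centering them.

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

data = np.random.randn(20, 40)

# Per-sample normalization before PCA
norm_pipeline = Pipeline([('norm', Normalizer()),
                          ('pca', PCA(n_components=5))])
norm_pipeline.fit_transform(data)

# Variance scaling only, no mean centering
scale_only_pipeline = Pipeline([('scaling', StandardScaler(with_mean=False)),
                                ('pca', PCA(n_components=5))])
scale_only_pipeline.fit_transform(data)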