Normalize PCA with scikit-learn when data is split

Question

I have the following question: how do I normalize data when using PCA with scikit-learn?

I am building an emotion detection system, and this is what I currently do:

1. Split the data by emotion (distributing it among several subsets).
2. Add all the data together (several subsets into one set).
3. Get the PCA parameters for the combined data (self.pca = RandomizedPCA(n_components=self.n_components, whiten=True).fit(self.data)).
4. For each emotion (each subset), apply the PCA to the data for that emotion.

I need to normalize in two places: in step 2, all of the combined data, and in step 4, each subset.
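For reference, the four steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the subset names and values are made up, and it uses PCA with svd_solver="randomized" as a stand-in for RandomizedPCA, which newer scikit-learn releases no longer provide.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

# Step 1: hypothetical toy subsets, one per emotion (names and values are illustrative).
subsets = {
    "happy": np.array([[52, 254, 10], [4, 128, 7], [60, 200, 33]], dtype=float),
    "sad": np.array([[39, 213, 5], [123, 7, 80], [90, 15, 60]], dtype=float),
}

# Step 2: stack the subsets into one set and scale each row to unit L2 norm.
combined = normalize(np.vstack(list(subsets.values())))

# Step 3: fit PCA once on the combined, normalized data.
# Assumption: PCA(svd_solver="randomized") replaces the older RandomizedPCA class.
pca = PCA(n_components=2, whiten=True, svd_solver="randomized", random_state=0)
pca.fit(combined)

# Step 4: normalize each subset in the same way, then project it with the fitted PCA.
projected = {name: pca.transform(normalize(rows)) for name, rows in subsets.items()}
```

Projecting the subsets one by one should give the same rows as projecting the combined set in one call, as long as the same normalization is applied in both places.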

Edit

I was wondering whether normalizing all the data and normalizing a subset give the same result. While simplifying my example at the suggestion of @BartoszKP, I realized that my understanding of how normalization works was wrong. Normalization works the same in both cases, so this is the right way to do it, right? (see code)

    from sklearn.preprocessing import normalize
    from sklearn.decomposition import RandomizedPCA
    import numpy as np

    data_1 = np.array(([52, 254], [4, 128]), dtype='f')
    data_2 = np.array(([39, 213], [123, 7]), dtype='f')
    data_combined = np.vstack((data_1, data_2))

    #print(data_combined)
    """
    Output
    [[  52.  254.]
     [   4.  128.]
     [  39.  213.]
     [ 123.    7.]]
    """

    #Normalize all data
    data_norm = normalize(data_combined)
    print(data_norm)
    """
    [[ 0.20056452  0.97968054]
     [ 0.03123475  0.99951208]
     [ 0.18010448  0.98364753]
     [ 0.99838448  0.05681863]]
    """

    pca = RandomizedPCA(n_components=20, whiten=True)
    pca.fit(data_norm)

    #Normalize subset of data
    data_1_norm = normalize(data_1)
    print(data_1_norm)
    """
    [[ 0.20056452  0.97968054]
     [ 0.03123475  0.99951208]]
    """

    pca.transform(data_1_norm)

python scikit-learn pca

NumesSanguis Dec 25 '14 at 11:46

1 answer

Bartoszkp · Accepted Answer · 2014-12-25T14:39:59+0000

Yes. As explained in the documentation, normalize scales individual samples independently of each other:

Normalization is the process of scaling individual samples to have unit norm.

This is further explained in the documentation of the Normalizer class:

Each sample (i.e., each row of the data matrix) with at least one nonzero component is scaled independently of other samples, so that its norm (l1 or l2) is equal to one.

(my emphasis)
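To make the row-wise behaviour concrete, here is a minimal sketch of L2 row normalization written in plain Python, using the data from the question. The helper l2_normalize_rows is a hypothetical name for illustration only; it is not scikit-learn's implementation, just the same arithmetic spelled out.

```python
import math


def l2_normalize_rows(rows):
    """Scale each row to unit L2 norm; all-zero rows are returned unchanged."""
    result = []
    for row in rows:
        # The norm is computed from this row alone, never from other rows.
        norm = math.sqrt(sum(x * x for x in row))
        result.append([x / norm for x in row] if norm > 0 else list(row))
    return result


data_1 = [[52.0, 254.0], [4.0, 128.0]]
data_2 = [[39.0, 213.0], [123.0, 7.0]]
combined = data_1 + data_2

norm_combined = l2_normalize_rows(combined)
norm_subset = l2_normalize_rows(data_1)

# The first two rows of the combined result equal the subset result,
# because no row's scaling depends on any other row.
print(norm_combined[:2] == norm_subset)  # prints True
```

Because each row is divided only by its own norm, it makes no difference whether a sample is normalized as part of the combined set or as part of its subset.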
