My question is how to normalize data when using PCA with scikit-learn.
I am building an emotion detection system, and currently I do the following:
1. Split the data by emotion (distribute the data among several subsets).
2. Combine all the subsets into one set.
3. Fit PCA on the combined data (`self.pca = RandomizedPCA(n_components=self.n_components, whiten=True).fit(self.data)`).
4. For each emotion (each subset), apply the fitted PCA to that subset's data.
Now I want to add normalization: at step 2, normalize the combined data before fitting the PCA, and at step 4, normalize each subset before projecting it. A sketch of the full pipeline with these steps is below.
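For concreteness, here is a minimal sketch of that pipeline. The emotion labels, array shapes, and `n_components` value are made up for illustration, not taken from my real code:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.decomposition import RandomizedPCA  # in sklearn >= 0.18, use PCA(svd_solver='randomized')

# Hypothetical per-emotion subsets (rows = samples, columns = features)
subsets = {
    'happy': np.random.rand(10, 5),
    'sad':   np.random.rand(10, 5),
}

# Steps 1-2: combine the subsets and normalize the combined data
combined = np.vstack(list(subsets.values()))
combined_norm = normalize(combined)

# Step 3: fit PCA once, on the normalized combined data
pca = RandomizedPCA(n_components=3, whiten=True).fit(combined_norm)

# Step 4: normalize each subset, then project it with the already-fitted PCA
projected = {name: pca.transform(normalize(X)) for name, X in subsets.items()}
```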
Edit
I was wondering whether normalizing across all the data and normalizing across a subset give the same result. When I tried to simplify my example at @BartoszKP's suggestion, I realized that the way I understood normalization to work was wrong. It works the same in both cases, so this is the right way to do it, right? (see code below)
```python
from sklearn.preprocessing import normalize
from sklearn.decomposition import RandomizedPCA
import numpy as np

data_1 = np.array(([52, 254], [4, 128]), dtype='f')
data_2 = np.array(([39, 213], [123, 7]), dtype='f')
data_combined = np.vstack((data_1, data_2))
#print(data_combined)
"""
Output
[[  52.  254.]
 [   4.  128.]
 [  39.  213.]
 [ 123.    7.]]
"""

# Normalize all data
data_norm = normalize(data_combined)
print(data_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]
 [ 0.18010448  0.98364753]
 [ 0.99838448  0.05681863]]
"""

pca = RandomizedPCA(n_components=2, whiten=True)  # was 20; cannot exceed the number of features (2)
pca.fit(data_norm)

# Normalize subset of data
data_1_norm = normalize(data_1)
print(data_1_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]]
"""
pca.transform(data_1_norm)
```
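To double-check the equivalence, here is a minimal assertion on the same arrays as above: because normalize() rescales each row independently (to unit L2 norm by default), the first two rows of the normalized combined data must match the normalized subset exactly.

```python
# Reuses data_combined and data_1 from the snippet above.
assert np.allclose(normalize(data_combined)[:2], normalize(data_1))
```

Note that this only holds because normalize() is stateless and per-row; a feature-wise scaler such as StandardScaler would instead have to be fitted once on the combined data and then reused on each subset, just like the PCA.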