PCA Projection and Reconstruction in scikit-learn

I can run a PCA in scikit-learn with the code below; X_train has 279180 rows and 104 columns.

 from sklearn.decomposition import PCA

 pca = PCA(n_components=30)
 X_train_pca = pca.fit_transform(X_train)

Now, to project my own vectors onto the PCA space, I would do the following:

 """ Projection """ comp = pca.components_ #30x104 com_tr = np.transpose(pca.components_) #104x30 proj = np.dot(X_train,com_tr) #279180x104 * 104x30 = 297180x30 

But I am not sure this step is right, because the scikit-learn documentation says:

components_ : array, [n_components, n_features]

Principal axes in feature space, representing the directions of maximum variance in the data.

That makes it sound as if components_ already contains the projections, but when I checked the source code, it returns only the eigenvectors.

How can I do the projection?

Ultimately, I want to calculate the MSE of the reconstruction.

 """ Reconstruct """ recon = np.dot(proj,comp) #297180x30 * 30x104 = 279180x104 """ MSE Error """ print "MSE = %.6G" %(np.mean((X_train - recon)**2)) 
2 answers

You can do

 proj = pca.inverse_transform(X_train_pca) 

That way, you do not have to worry about how to do the multiplication yourself.
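
As a sanity check, here is a minimal sketch (with random stand-in data, since the real X_train is not available here) showing that pca.inverse_transform is equivalent to the manual multiplication, provided the data is centered with pca.mean_ on both sides; note that the code in the question skips that centering step:

 import numpy as np
 from sklearn.decomposition import PCA

 X_train = np.random.randn(200, 104)  # stand-in for the real 279180x104 data
 pca = PCA(n_components=30)
 X_train_pca = pca.fit_transform(X_train)

 recon = pca.inverse_transform(X_train_pca)
 # manual equivalent: center, project onto the components, project back, un-center
 recon_manual = ((X_train - pca.mean_).dot(pca.components_.T)
                 .dot(pca.components_) + pca.mean_)
 assert np.allclose(recon, recon_manual)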

From pca.fit_transform or pca.transform you get what is usually called the "loadings" for each sample, meaning how much of each component you need in order to describe that sample best using a linear combination of the components_ (the principal axes in feature space).

The projection you are aiming at is back into the original signal space. That means you need to go back into signal space using the components and the loadings.
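
To make that concrete, here is a small illustrative sketch (again with stand-in random data): a single sample is rebuilt as its loadings times the principal axes, plus the mean:

 import numpy as np
 from sklearn.decomposition import PCA

 X = np.random.randn(100, 50)     # stand-in data
 pca = PCA(n_components=10)
 loadings = pca.fit_transform(X)  # one row of loadings per sample

 # sample 0 rebuilt as a linear combination of the principal axes
 sample_recon = pca.mean_ + loadings[0].dot(pca.components_)
 assert np.allclose(sample_recon, pca.inverse_transform(loadings)[0])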

So there are three steps to disambiguate here. Here is, step by step, what you can do with the PCA object and how it is actually calculated:

  1. pca.fit estimates the components (via an SVD on the centered X_train):

     from sklearn.decomposition import PCA
     import numpy as np
     from numpy.testing import assert_array_almost_equal

     X_train = np.random.randn(100, 50)
     pca = PCA(n_components=30)
     pca.fit(X_train)

     U, S, VT = np.linalg.svd(X_train - X_train.mean(0))
     # compare up to sign: scikit-learn may flip the sign of individual
     # components relative to the raw SVD
     assert_array_almost_equal(np.abs(VT[:30]), np.abs(pca.components_))
  2. pca.transform calculates the loadings as you describe:

     X_train_pca = pca.transform(X_train)
     X_train_pca2 = (X_train - pca.mean_).dot(pca.components_.T)
     assert_array_almost_equal(X_train_pca, X_train_pca2)
  3. pca.inverse_transform gives the projection back into signal space that you are interested in:

     X_projected = pca.inverse_transform(X_train_pca)
     X_projected2 = X_train_pca.dot(pca.components_) + pca.mean_
     assert_array_almost_equal(X_projected, X_projected2)

Now you can evaluate the reconstruction loss:

 loss = ((X_train - X_projected) ** 2).mean() 
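
As an additional cross-check (not required for the answer above): for a linear PCA this reconstruction loss equals the variance left in the discarded components, which gives an independent way to verify the number. A minimal self-contained sketch, assuming stand-in random data:

 import numpy as np
 from sklearn.decomposition import PCA

 X = np.random.randn(100, 50)
 n, p = X.shape
 k = 30

 pca = PCA(n_components=k).fit(X)
 X_proj = pca.inverse_transform(pca.transform(X))
 loss = ((X - X_proj) ** 2).mean()

 # the same number from the eigenvalue spectrum: variance in the discarded
 # components (explained_variance_ uses an n - 1 denominator)
 pca_full = PCA().fit(X)
 loss_from_spectrum = pca_full.explained_variance_[k:].sum() * (n - 1) / (n * p)
 assert np.isclose(loss, loss_from_spectrum)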

In addition to @eickenberg's post, here is how to do PCA reconstruction of digit images:

 import numpy as np
 import matplotlib.pyplot as plt
 from sklearn.datasets import load_digits
 from sklearn import decomposition

 n_components = 10
 image_shape = (8, 8)

 digits = load_digits()
 digits = digits.data
 n_samples, n_features = digits.shape

 estimator = decomposition.PCA(n_components=n_components,
                               svd_solver='randomized', whiten=True)
 digits_recons = estimator.inverse_transform(estimator.fit_transform(digits))

 # show 5 randomly chosen digits and their PCA reconstructions
 # with 10 dominant eigenvectors
 indices = np.random.choice(n_samples, 5, replace=False)

 plt.figure(figsize=(5, 2))
 for i in range(len(indices)):
     plt.subplot(1, 5, i + 1)
     plt.imshow(np.reshape(digits[indices[i], :], image_shape))
     plt.axis('off')
 plt.suptitle('Original', size=25)
 plt.show()

 plt.figure(figsize=(5, 2))
 for i in range(len(indices)):
     plt.subplot(1, 5, i + 1)
     plt.imshow(np.reshape(digits_recons[indices[i], :], image_shape))
     plt.axis('off')
 plt.suptitle('PCA reconstructed with {} components'.format(n_components), size=25)
 plt.show()

(figures: the five randomly chosen original digits, followed by their PCA reconstructions)


Source: https://habr.com/ru/post/1246922/

