Selecting the PCA components with the greatest variance

I have a huge dataset (32000 × 2500) that I need to use for training. This seems too big for my classifier, so I decided to read up on dimensionality reduction, and in particular on PCA.

As I understand it, PCA takes the existing data and re-expresses it in a new coordinate system. The new coordinates have no physical meaning, but the data is rearranged so that the first axes capture the most variance. After transforming into these new coordinates, I should be able to drop the coefficients along the axes with minimal variance.

Now I am trying to implement this in Matlab, and I am having trouble interpreting the result. Matlab always treats rows as observations and columns as variables, so the input to the pca function is my 32000 × 2500 matrix. This returns the PCA coefficients in an output matrix coeff of size 2500 × 2500.

The help for pca states:

Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance.

Given this, which dimension corresponds to an observation of my data? I mean, if I have to feed this to the classifier, will the rows of coeff represent my observations, or the columns?

And how do I remove the components with the least variance?
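To make the intended workflow concrete, here is a small sketch in Python/NumPy (not Matlab; the shapes are toy stand-ins for the 32000 × 2500 matrix, and all variable names are illustrative): center the data, obtain the coefficient matrix via SVD (which plays the role of Matlab's coeff, columns sorted by descending variance), and project onto the first k columns to discard the low-variance components.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # stand-in for the 32000 x 2500 data: rows = observations

# Center each column (Matlab's pca does this internally)
Xc = X - X.mean(axis=0)

# SVD of the centered data; the columns of V correspond to Matlab's coeff columns
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coeff = Vt.T                      # 10 x 10, columns in descending order of variance

k = 3                             # keep only the k highest-variance components
scores = Xc @ coeff[:, :k]        # 100 x 3: reduced data, rows are still observations

print(scores.shape)               # (100, 3)
```

Note that the reduced representation to hand to a classifier is the projected data (scores), not coeff itself: each row of scores is one observation, now described by k features instead of the original 10.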

1 answer

(Disclaimer: it has been a long time since I switched from Matlab to scipy, but the principles are the same.)

If you use the svd function

 [U,S,V] = svd(X) 

to reduce the dimensionality of X to k, you multiply X by the first k columns of V. In Matlab, I believe that would be

 X * V(:, 1:k); 

Refer to The Elements of Statistical Learning for the theory.
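As a sanity check of the svd route (again a Python/NumPy sketch, in keeping with the scipy disclaimer above; the 95% threshold is an arbitrary illustrative choice): the singular values give each component's variance, so you can pick k from the cumulative explained variance before truncating.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))  # correlated toy data
Xc = X - X.mean(axis=0)                                  # center before the SVD

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance carried by each component
explained = S**2 / np.sum(S**2)
cum = np.cumsum(explained)

# Smallest k whose components explain at least 95% of the variance
k = int(np.searchsorted(cum, 0.95) + 1)

X_reduced = Xc @ Vt.T[:, :k]     # the answer's X * V(:, 1:k)
print(k, X_reduced.shape)
```

The projection X @ V(:, 1:k) is exactly the first k columns of U*S, so either form gives the same reduced data.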


Source: https://habr.com/ru/post/1244010/

