Sklearn PCA memory error: alternative dimensionality reduction?

I am trying to reduce the dimensionality of a very large matrix using PCA in scikit-learn, but it causes a memory error (it needs more than the 128 GB of RAM I have). I have already set copy=False, and I am using the less costly randomized PCA.

Is there a workaround? If not, what other dimensionality reduction methods could I use that require less memory? Thanks.


Update: The matrix I'm trying to run PCA on is a collection of feature vectors, obtained by passing a set of training images through a pretrained CNN. The matrix has shape [300000, 51200]. I have tried PCA with 100 to 500 components.

I want to reduce its dimensionality so I can use these features to train an ML algorithm, for example XGBoost. Thanks.
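For context, a minimal sketch of the kind of call described above; the variable name, file path, and exact parameters are assumptions, not the original code. A 300000 x 51200 float64 matrix already occupies roughly 120 GB on its own, which is why even the randomized solver with copy=False exhausts a 128 GB machine.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical reconstruction of the failing setup described above.
train_features = np.load("train_features.npy")  # assumed file, shape (300000, 51200)

pca = PCA(n_components=250, copy=False, svd_solver='randomized')
train_features_reduced = pca.fit_transform(train_features)  # runs out of memory here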

2 answers

In the end, I used TruncatedSVD instead of PCA, which is able to handle large matrices without memory problems:

from sklearn import decomposition

# Keep 250 components; the ARPACK solver computes only the leading singular
# vectors instead of a full decomposition, which keeps memory usage manageable.
n_comp = 250
svd = decomposition.TruncatedSVD(n_components=n_comp, algorithm='arpack')
svd.fit(train_features)
print(svd.explained_variance_ratio_.sum())  # fraction of variance retained

# Project both feature matrices onto the fitted components.
train_features = svd.transform(train_features)
test_features = svd.transform(test_features)
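The reduced features can then be fed to XGBoost, as the question intends. A minimal sketch, assuming xgboost is installed; train_labels and the hyperparameters are placeholders, not values from the question:

from xgboost import XGBClassifier

# train_features is now (300000, 250) after the TruncatedSVD transform above.
# train_labels and the hyperparameters are assumptions for illustration.
model = XGBClassifier(n_estimators=200, max_depth=6)
model.fit(train_features, train_labels)
test_predictions = model.predict(test_features)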

You can use IncrementalPCA, available in scikit-learn: from sklearn.decomposition import IncrementalPCA. The rest of the interface is the same as for PCA. You need to pass an extra argument, batch_size, which must be at least n_components.
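A minimal sketch of what that looks like; only the IncrementalPCA call itself comes from this answer, while the memory-mapped file and the chunked feeding are assumptions about how the data might be streamed so the full 300000 x 51200 matrix never has to sit in RAM at once:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_comp = 250
batch_size = 1000  # must be at least n_components

ipca = IncrementalPCA(n_components=n_comp, batch_size=batch_size)

# Assumed setup: memory-map the stored feature matrix and fit it batch by batch.
train_features = np.load("train_features.npy", mmap_mode="r")
for start in range(0, train_features.shape[0], batch_size):
    ipca.partial_fit(train_features[start:start + batch_size])

# Transform in chunks as well and stack the much smaller reduced batches.
train_reduced = np.vstack([
    ipca.transform(train_features[start:start + batch_size])
    for start in range(0, train_features.shape[0], batch_size)
])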

However, if you need a non-linear variant such as KernelPCA, nothing similar seems to be supported. KernelPCA's memory requirements absolutely explode; see the Wikipedia article on nonlinear dimensionality reduction.
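As a rough illustration of why: KernelPCA builds and decomposes a dense n_samples x n_samples kernel matrix, so with the 300,000 samples from the question the kernel matrix alone dwarfs 128 GB. A quick back-of-the-envelope check:

# Memory needed just for the dense float64 kernel matrix in KernelPCA,
# using the 300,000 samples mentioned in the question.
n_samples = 300_000
kernel_matrix_gb = n_samples ** 2 * 8 / 1e9
print(f"Dense kernel matrix: ~{kernel_matrix_gb:.0f} GB")  # ~720 GB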


Source: https://habr.com/ru/post/1016482/

