Sklearn PCA memory error: alternative dimensionality reduction?

I am trying to reduce the dimensionality of a very large matrix using PCA in scikit-learn, but it causes a memory error (it needs more than the 128 GB of RAM I have). I have already set copy=False, and I am using the less costly randomized PCA.

Is there a workaround? If not, what other dimensionality reduction methods could I use that require less memory? Thanks.


Update: The matrix I'm trying to run PCA on is a collection of feature vectors, obtained by passing a set of training images through a pretrained CNN. The matrix has shape [300000, 51200]. I have tried PCA with 100 to 500 components.

I want to reduce its dimensionality so I can use these features to train an ML algorithm, for example XGBoost. Thanks.
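For context, a minimal sketch of the kind of call described above; the variable name, file path, and exact parameters are assumptions, not the original code. A 300000 x 51200 float64 matrix already occupies roughly 120 GB on its own, which is why even the randomized solver with copy=False exhausts a 128 GB machine.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical reconstruction of the failing setup described above.
train_features = np.load("train_features.npy")  # assumed file, shape (300000, 51200)

pca = PCA(n_components=250, copy=False, svd_solver='randomized')
train_features_reduced = pca.fit_transform(train_features)  # runs out of memory here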

2 answers

In the end, I used TruncatedSVD instead of PCA, which is able to handle large matrices without memory problems:

from sklearn import decomposition

# Keep 250 components; the ARPACK solver computes only the leading singular
# vectors instead of a full decomposition, which keeps memory usage manageable.
n_comp = 250
svd = decomposition.TruncatedSVD(n_components=n_comp, algorithm='arpack')
svd.fit(train_features)
print(svd.explained_variance_ratio_.sum())  # fraction of variance retained

# Project both feature matrices onto the fitted components.
train_features = svd.transform(train_features)
test_features = svd.transform(test_features)
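The reduced features can then be fed to XGBoost, as the question intends. A minimal sketch, assuming xgboost is installed; train_labels and the hyperparameters are placeholders, not values from the question:

from xgboost import XGBClassifier

# train_features is now (300000, 250) after the TruncatedSVD transform above.
# train_labels and the hyperparameters are assumptions for illustration.
model = XGBClassifier(n_estimators=200, max_depth=6)
model.fit(train_features, train_labels)
test_predictions = model.predict(test_features)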

You can use IncrementalPCA, available in scikit-learn: from sklearn.decomposition import IncrementalPCA. The rest of the interface is the same as for PCA. You need to pass an extra argument, batch_size, which must be at least n_components.
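A minimal sketch of what that looks like; only the IncrementalPCA call itself comes from this answer, while the memory-mapped file and the chunked feeding are assumptions about how the data might be streamed so the full 300000 x 51200 matrix never has to sit in RAM at once:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_comp = 250
batch_size = 1000  # must be at least n_components

ipca = IncrementalPCA(n_components=n_comp, batch_size=batch_size)

# Assumed setup: memory-map the stored feature matrix and fit it batch by batch.
train_features = np.load("train_features.npy", mmap_mode="r")
for start in range(0, train_features.shape[0], batch_size):
    ipca.partial_fit(train_features[start:start + batch_size])

# Transform in chunks as well and stack the much smaller reduced batches.
train_reduced = np.vstack([
    ipca.transform(train_features[start:start + batch_size])
    for start in range(0, train_features.shape[0], batch_size)
])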

However, if you need a non-linear variant such as KernelPCA, nothing similar seems to be supported. KernelPCA's memory requirements absolutely explode; see the Wikipedia article on nonlinear dimensionality reduction.
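As a rough illustration of why: KernelPCA builds and decomposes a dense n_samples x n_samples kernel matrix, so with the 300,000 samples from the question the kernel matrix alone dwarfs 128 GB. A quick back-of-the-envelope check:

# Memory needed just for the dense float64 kernel matrix in KernelPCA,
# using the 300,000 samples mentioned in the question.
n_samples = 300_000
kernel_matrix_gb = n_samples ** 2 * 8 / 1e9
print(f"Dense kernel matrix: ~{kernel_matrix_gb:.0f} GB")  # ~720 GB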


Source: https://habr.com/ru/post/1016482/

