PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what they implement, but, as I understand it (as someone new to the whole Spark ecosystem), Spark ML is the way forward, and MLlib is kept around mostly for backward compatibility.

My question is very specific and concerns PCA. The MLlib implementation seems to limit the number of columns:

spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.

In addition, if you look at the Java code example, there is also this:

The number of columns should be small, for example, less than 1000.

On the other hand, if you look at the ML documentation, no such restriction is mentioned.

So my question is: does this limitation also exist in Spark ML? And if so, why is it there, and is there any workaround that makes it possible to use this implementation even when the number of columns is large?
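For concreteness, here is a minimal sketch of the two APIs being compared; the variable names (mllibRows, df) and the choice of k = 10 are placeholders of mine, not from the documentation:

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.ml.feature.PCA

// spark.mllib: PCA on a RowMatrix; this is the API whose docs mention the column limit.
// mllibRows is assumed to be an RDD[org.apache.spark.mllib.linalg.Vector].
val mat = new RowMatrix(mllibRows)
val pc = mat.computePrincipalComponents(10)   // local d x 10 matrix of principal components

// spark.ml: PCA as a DataFrame estimator; its docs mention no such limit.
// df is assumed to be a DataFrame with a vector column named "features".
val pcaModel = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(10)
  .fit(df)
val projected = pcaModel.transform(df)
```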

1 answer

PCA consists of finding a set of decorrelated random variables with which you can represent your data, sorted in decreasing order with respect to the amount of variance they retain.

These variables can be found by projecting your data points onto a particular orthogonal subspace. If your (mean-centered) data matrix is X, this subspace is spanned by the eigenvectors of X^T X.
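In symbols (assuming X is the n x d mean-centered data matrix, with one sample per row):

```latex
% Eigendecomposition of the d x d matrix X^T X:
X^{\top} X = V \Lambda V^{\top}, \qquad
\Lambda = \operatorname{diag}(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d)
% Keeping the k leading eigenvectors V_k (d x k), the projected data are
Y = X V_k \in \mathbb{R}^{n \times k}
% and each \lambda_i is proportional to the variance retained by component i.
```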

When X is large, say of dimensions n x d, you can compute X^T X by computing the outer product of each row with itself and then adding all the results up. This is of course amenable to a simple map-reduce procedure if d is small, no matter how large n is, because the outer product of each row with itself is a d x d matrix that every worker must hold in main memory. That is why you can run into trouble when handling many columns.
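A simplified sketch of that map-reduce idea (illustrative only, not Spark's actual code):

```scala
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.rdd.RDD

// Compute the d x d Gramian X^T X as a sum of per-row outer products.
// Every partial result is a full d x d matrix held in executor memory,
// which is why d has to stay small no matter how large n is.
def gramian(rows: RDD[Array[Double]]): BDM[Double] =
  rows
    .map { r => val v = BDV(r); v * v.t }   // map: outer product, d x d per row
    .treeReduce(_ + _)                      // reduce: sum the d x d partials
```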

If the number of columns is large (and the number of rows not so much), you can still compute the PCA. Just compute the SVD of your (mean-centered) transposed data matrix and multiply it by the resulting eigenvectors and the inverse of the diagonal matrix of eigenvalues. There is your orthogonal subspace.
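A rough sketch of that second route using MLlib's distributed SVD; the name rowsOfXt is mine, and this assumes you can materialize the mean-centered, transposed matrix as an RDD:

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// rowsOfXt holds the rows of the mean-centered, TRANSPOSED data matrix X^T,
// i.e. d rows of length n, with d large and n small.
def principalDirectionsWide(rowsOfXt: RDD[Vector], k: Int): RowMatrix = {
  val Xt = new RowMatrix(rowsOfXt)              // d x n, tall and skinny

  // SVD of X^T = U * S * V^T; because n is small, s and V stay local.
  val svd = Xt.computeSVD(k, computeU = false)

  // The principal directions of X are the columns of U = X^T * V * S^{-1},
  // a d x k distributed matrix: the transposed matrix multiplied by the
  // eigenvectors and the inverse diagonal matrix of singular values.
  val sInv = DenseMatrix.diag(Vectors.dense(svd.s.toArray.map(1.0 / _)))
  Xt.multiply(svd.V.multiply(sInv))
}
```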

Bottom line: if the spark.ml implementation follows the first approach every time, then the limitation should be the same. If it checks the dimensions of the input dataset to decide whether to switch to the second approach, then you should have no problems with a large number of columns as long as the number of rows is small.

Regardless, the limit is imposed by how much memory your workers have, so perhaps they let users hit that ceiling themselves rather than state a limitation that may not apply to everyone. That may be why they decided not to mention it in the new documentation.

Update: the source code shows that they take the first approach every time, regardless of the dimensionality of the input. The actual limit is 65535 columns, and a warning is issued at 10000.
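For reference, the guard in question has roughly this shape; this is paraphrased from memory rather than verbatim Spark source, and only the 65535 / 10000 thresholds come from the update above:

```scala
// Roughly the shape of the column check before the Gramian is materialized
// (illustrative paraphrase, not Spark's exact code).
def checkNumColumns(cols: Int): Unit = {
  require(cols < 65536, s"Argument with more than 65535 cols: $cols")
  if (cols > 10000) {
    // a d x d matrix of doubles needs 8 * d^2 bytes on each executor
    val memMB = (cols.toLong * cols * 8) / (1024 * 1024)
    println(s"$cols columns will require at least $memMB MB of memory!")
  }
}
```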


Source: https://habr.com/ru/post/1011717/

