PCA output in Spark does not match scikit-learn

I am testing PCA (principal component analysis) in Spark ML.

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([4.0, 4.0]),),
        (Vectors.dense([5.0, 4.0]),)]

df = spark.createDataFrame(data, ["features"])
pca = PCA(k=1, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
transformed_feature = model.transform(df)
transformed_feature.show()

Output:

+---------+--------------------+
| features|         pcaFeatures|
+---------+--------------------+
|[1.0,1.0]|[-1.3949716649258...|
|[1.0,2.0]|[-1.976209858644928]|
|[4.0,4.0]|[-5.579886659703326]|
|[5.0,4.0]|[-6.393620130910061]|
+---------+--------------------+

When I ran PCA on the same data in scikit-learn, as shown below, it gave a different result:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
pca = PCA(n_components=1)
pca.fit(X)
X_transformed = pca.transform(X)
for x, y in zip(X, X_transformed):
    print(x, y)

Output:

[ 1.  1.] [-2.44120041]
[ 1.  2.] [-1.85996222]
[ 4.  4.] [ 1.74371458]
[ 5.  4.] [ 2.55744805]

As you can see, there is a difference in output.

To check the result, I calculated the PCA for the same data by hand and got the same answer as scikit-learn. For the first data point (1.0, 1.0), with the mean vector MX = (2.75, 2.75) and the first eigenvector (0.814, 0.581), the transformation gives Y = 0.814*(1.0 - 2.75) + 0.581*(1.0 - 2.75) ≈ -2.441.

As you can see, this matches the scikit-learn result.

It seems that Spark ML does not subtract the mean vector MX from the data vector X, i.e. it computes Y = A*X instead of Y = A*(X - MX).

For the point (1.0,1.0):

Y = (0.814*1.0) + (0.581*1.0) = 1.395

which, up to the sign of the eigenvector, is the result we got from Spark ML.
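The discrepancy can be verified numerically with plain NumPy (a quick sketch, not part of the original question): projecting the raw data onto the first principal component reproduces Spark's values, while projecting the mean-centered data reproduces scikit-learn's. The eigenvector sign is arbitrary, so signs may be flipped relative to either library's output.

```python
import numpy as np

# Same data as in the question
X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])

# First principal component of the covariance of the centered data
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc)
pc = vt[0] if vt[0, 0] > 0 else -vt[0]  # fix the arbitrary sign; pc ~ [0.814, 0.581]

uncentered = X @ pc   # projection of the raw data (what Spark ML returns, up to sign)
centered = Xc @ pc    # projection of the centered data (what scikit-learn returns)

print(uncentered)     # ~ [1.395, 1.976, 5.580, 6.394]
print(centered)       # ~ [-2.441, -1.860, 1.744, 2.557]
```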

Is Spark ML producing a wrong result, or am I missing something?


Spark's PCA does not center the data before computing the principal components, while scikit-learn does. To get matching results, center the features yourself with StandardScaler before applying PCA:

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
scaled_df = scaler.fit(df).transform(df)

Running PCA on scaled_df then gives the same result as scikit-learn.


This is the expected behavior in Spark ML: its PCA does not mean-center the input. You can chain the centering step and PCA together in a Pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
pca = PCA(k=1, inputCol=scaler.getOutputCol(), outputCol="pcaFeatures")
pipeline = Pipeline(stages=[scaler, pca])

model = pipeline.fit(df)
transformed_feature = model.transform(df)

Source: https://habr.com/ru/post/1690600/
