I am testing PCA (principal component analysis) in Spark ML.
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([4.0, 4.0]),),
        (Vectors.dense([5.0, 4.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=1, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
transformed_feature = model.transform(df)
transformed_feature.show()
Output:
+---------+--------------------+
| features| pcaFeatures|
+---------+--------------------+
|[1.0,1.0]|[-1.3949716649258...|
|[1.0,2.0]|[-1.976209858644928]|
|[4.0,4.0]|[-5.579886659703326]|
|[5.0,4.0]|[-6.393620130910061]|
+---------+--------------------+
When I ran PCA on the same data in scikit-learn, as shown below, it gave a different result:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
pca = PCA(n_components=1)
pca.fit(X)
X_transformed = pca.transform(X)
for x, y in zip(X, X_transformed):
    print(x, y)
Output:
[ 1. 1.] [-2.44120041]
[ 1. 2.] [-1.85996222]
[ 4. 4.] [ 1.74371458]
[ 5. 4.] [ 2.55744805]
As you can see, there is a difference in output.
To check the result, I worked out the PCA transformation for the same data by hand and got the same result as scikit-learn. Here is the calculation for the first data point (1.0, 1.0). The mean vector is MX = (2.75, 2.75) and the first principal axis is A = (0.814, 0.581), so:
Y = (0.814*(1.0-2.75)) + (0.581*(1.0-2.75)) = -2.441
As you can see, this matches the scikit-learn result.
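To double-check the hand calculation, here is a short NumPy sketch of my own (not taken from either library's docs) that centers the data and projects it onto the first eigenvector of the covariance matrix:

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
MX = X.mean(axis=0)            # mean vector, (2.75, 2.75)
cov = np.cov(X, rowvar=False)  # sample covariance matrix
w, V = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
A = V[:, -1]                   # eigenvector of the largest eigenvalue, ~(0.814, 0.581)
A = A if A[0] > 0 else -A      # the sign of a principal axis is arbitrary; fix it
print((X - MX) @ A)            # ~[-2.4412 -1.8600 1.7437 2.5574], matching scikit-learn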
It seems that Spark ML does not subtract the mean vector MX from the data vector X, i.e. instead of Y = A*(X-MX) it uses Y = A*X.
For the point (1.0,1.0):
Y = (0.814*1.0) + (0.581*1.0) = 1.395
which, up to the arbitrary sign of the principal axis, is the same result that we got from Spark ML.
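Dropping the centering and projecting the raw data reproduces the Spark numbers for all four points (a quick check; A is the rounded principal axis from the hand calculation above):

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
A = np.array([0.814, 0.581])  # first principal axis, rounded
print(X @ A)                  # ~[1.395 1.976 5.580 6.394] -- Spark's output with the sign flipped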
Is the Spark ML result wrong, or am I missing something?
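For what it's worth, if the missing centering is the cause, centering the features manually before the PCA stage should make Spark reproduce the scikit-learn numbers up to sign. A minimal sketch, assuming StandardScaler with withMean=True does the centering (I have not verified this end to end):

from pyspark.ml.feature import PCA, StandardScaler

# withMean=True subtracts the column means; withStd=False leaves the scale unchanged
scaler = StandardScaler(withMean=True, withStd=False,
                        inputCol="features", outputCol="centered")
df_centered = scaler.fit(df).transform(df)

pca = PCA(k=1, inputCol="centered", outputCol="pcaFeatures")
pca.fit(df_centered).transform(df_centered).show()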