Are (...) rows here in the same order in which I specified my input columns?
Yes, they are. Let's trace what happens:
    from pyspark.ml.feature import PCA, VectorAssembler

    data = [
        (0.0, 1.0, 0.0, 7.0, 0.0),
        (2.0, 0.0, 3.0, 4.0, 5.0),
        (4.0, 0.0, 0.0, 6.0, 7.0)
    ]

    df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])
VectorAssembler follows the order of the input columns:
    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    vectors = assembler.transform(df).select("features")

    # The column metadata records each input column name with its vector index.
    vectors.schema[0].metadata
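To read the ordering out of that metadata programmatically, something like the sketch below may help. It assumes the usual layout VectorAssembler writes for its output column: an `ml_attr` entry whose `attrs` groups attributes by type, each with an `idx` and a `name`. The helper `feature_names` and the `sample_metadata` literal are hypothetical, hand-written for illustration, and are not captured Spark output.

```python
def feature_names(metadata):
    """Return attribute names ordered by their index in the assembled vector."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    # Attributes are grouped by type (e.g. "numeric", "binary", "nominal"); merge the groups.
    merged = [attr for group in attrs.values() for attr in group]
    return [attr["name"] for attr in sorted(merged, key=lambda a: a["idx"])]


# Hand-written sample in the assumed shape, not real output from the session above.
sample_metadata = {
    "ml_attr": {
        "attrs": {"numeric": [{"idx": i, "name": n} for i, n in enumerate("uvxyz")]},
        "num_attrs": 5,
    }
}

names = feature_names(sample_metadata)
print(names)
```

With a live session, the same helper could presumably be called as `feature_names(vectors.schema["features"].metadata)`.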
So do the principal components:
    model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)

    # IPython help syntax: shows the docstring for the principal components matrix.
    ?model.pc
Finally, a sanity check:
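    import numpy as np

    # Input rows as a 3 x 5 array, columns in the order u, v, x, y, z.
    x = np.array(data)
    # model.pc.values is a flat array; rebuild the 5 x 3 matrix of components.
    y = model.pc.values.reshape(3, 5).transpose()
    # Projections computed by Spark, one row per input row.
    z = np.array(model.transform(vectors).rdd.map(lambda row: row.pc_features).collect())

    # Should be close to zero if the rows of model.pc follow the input column order.
    np.linalg.norm(x.dot(y) - z)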
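The `reshape(3, 5).transpose()` step in the check is easy to get wrong, so here is a NumPy-only sketch of why it is needed. It assumes `DenseMatrix.values` stores entries column by column (column-major) and that the matrix is not flagged as transposed. The `pc` array below is a made-up stand-in for `model.pc`, not real PCA output.

```python
import numpy as np

n_features, k = 5, 3

# Hypothetical 5 x 3 matrix standing in for model.pc
# (rows: input columns, columns: principal components).
pc = np.arange(n_features * k, dtype=float).reshape(n_features, k)

# Flatten column by column, mimicking the assumed column-major layout of `values`.
values = pc.flatten(order="F")

# Reshaping to (k, n_features) puts one component per row; transposing restores 5 x 3.
via_transpose = values.reshape(k, n_features).transpose()

# Equivalent and arguably more direct: reshape in Fortran (column-major) order.
via_fortran = values.reshape((n_features, k), order="F")

# A naive row-major reshape scrambles the entries.
naive = values.reshape(n_features, k)

print(np.array_equal(via_transpose, pc))
print(np.array_equal(via_fortran, pc))
print(np.array_equal(naive, pc))
```

In a live session, `model.pc.toArray()` should give the same 5 x 3 array without any manual reshaping.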