PySpark: narrowing down the most important features with PCA

Question


I am using Spark 2.2 with Python. I am using PCA from the ml.feature module, and I use VectorAssembler to feed my features to the PCA. To clarify, let's say I have a table with three columns col1, col2 and col3; then I do:

    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
    df = assembler.transform(table).select("features")

    from pyspark.ml.feature import PCA

    pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)

At this point I have fitted a PCA with two components, and I can look at its values as:

 m = model.pc.values.reshape(3, 2)

which has 3 rows (= number of columns in my original table) and 2 columns (= number of components in my PCA). My question is: are the three rows here in the same order in which I specified my input columns to the vector assembler above? To clarify further, does the matrix above correspond to:

              | PC1 | PC2 |
     ---------|-----|-----|
     col1     |     |     |
     ---------|-----|-----|
     col2     |     |     |
     ---------|-----|-----|
     col3     |     |     |
     ---------+-----+-----+

Please note that this example is only for clarity. In my real problem I am dealing with ~1600 columns and a lot of choices. I could not find a definitive answer to this in the Spark documentation. I want to do this in order to pick the best columns / features from my original table to train my model, based on the top principal components. Or is there something else / better in Spark ML PCA that I should be looking at to get such a result?

Or can I not use PCA for this at all, and must I use other techniques such as Spearman rank correlation?

+5

machine-learning pca apache-spark pyspark feature-selection

Sameer majajan Jan 30 '18 at 16:43


2 answers

You can see the actual order of the columns here:

 df.schema["features"].metadata["ml_attr"]["attrs"]

There will usually be two classes, "binary" and "numeric":

    import pandas as pd

    # Merge both attribute groups (either may be absent) and sort by vector index.
    attrs = df.schema["features"].metadata["ml_attr"]["attrs"]
    pd.DataFrame(attrs.get("binary", []) + attrs.get("numeric", [])).sort_values("idx")

This should give the exact order of all columns. You can verify that the order of the input and output remains the same.
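As a minimal sketch of that check without pandas, assuming the metadata has the shape shown by `ml_attr` (an `attrs` mapping from attribute type to a list of entries with `idx` and `name`), the hypothetical helper `feature_names` below flattens every attribute group and sorts by `idx`. The sample dict is hand-written for illustration and is not taken from a real DataFrame.

```python
def feature_names(ml_attr):
    """Return assembled feature names ordered by their vector index."""
    # Collect the entries of every attribute group ("binary", "numeric", ...).
    entries = [e for group in ml_attr["attrs"].values() for e in group]
    return [e["name"] for e in sorted(entries, key=lambda e: e["idx"])]


# Hand-written sample mimicking df.schema["features"].metadata["ml_attr"].
sample = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "col1"}, {"idx": 2, "name": "col3"}],
        "binary": [{"idx": 1, "name": "col2"}],
    },
    "num_attrs": 3,
}

names = feature_names(sample)
print(names)
```

With a real DataFrame, the argument would be `df.schema["features"].metadata["ml_attr"]`, and the resulting list can be compared against the `inputCols` passed to the assembler.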

0

pratiklodha Feb 05 '18 at 13:03


user8371915 · Accepted Answer · 2018-02-02T08:43:58+0000

are (...) rows here in the same order in which I specified my input columns

Yes, they are. Let's track what is going on:

    from pyspark.ml.feature import PCA, VectorAssembler

    data = [
        (0.0, 1.0, 0.0, 7.0, 0.0),
        (2.0, 0.0, 3.0, 4.0, 5.0),
        (4.0, 0.0, 0.0, 6.0, 7.0)
    ]

    df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])

VectorAssembler follows the order of the columns:

    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    vectors = assembler.transform(df).select("features")

    vectors.schema[0].metadata
    # {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
    #     {'idx': 1, 'name': 'v'},
    #     {'idx': 2, 'name': 'x'},
    #     {'idx': 3, 'name': 'y'},
    #     {'idx': 4, 'name': 'z'}]},
    #   'num_attrs': 5}}

So do the principal components:

    model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)

    ?model.pc
    # Type:        property
    # String form: <property object at 0x7feb5bdc1d68>
    # Docstring:
    # Returns a principal components Matrix.
    # Each column is one principal component.
    #
    # .. versionadded:: 2.0.0
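One detail worth spelling out: `model.pc` is a `DenseMatrix`, which stores its `values` in column-major order, so a plain row-major `reshape(n_features, k)` gives the right shape but the wrong layout. The NumPy-only sketch below illustrates this with a made-up flat buffer (the numbers are hypothetical, not real PCA output).

```python
import numpy as np

n_features, k = 3, 2  # shape of the principal components matrix

# Hypothetical flat buffer in column-major order:
# the first n_features entries are PC1, the next n_features are PC2.
values = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Either of these yields an (n_features, k) array with one component per column.
pcs = values.reshape(k, n_features).T
pcs_f = values.reshape((n_features, k), order="F")

# A plain row-major reshape has the same shape but scrambles the entries.
naive = values.reshape(n_features, k)

print(pcs)
```

On a fitted model, `model.pc.toArray()` should return the matrix already in the `(n_features, k)` layout, which avoids the manual reshape altogether.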

Finally, a sanity check:

    import numpy as np

    x = np.array(data)
    y = model.pc.values.reshape(3, 5).transpose()
    z = np.array(model.transform(vectors).rdd.map(lambda x: x.pc_features).collect())

    np.linalg.norm(x.dot(y) - z)
    # 8.881784197001252e-16
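Since the rows follow the input column order, one possible way to shortlist original columns is to score each feature by its absolute loadings, weighted by the variance each component explains. This is only a heuristic sketch and not something this answer prescribes; the loadings and variance shares below are made up, and in practice they would presumably come from `model.pc.toArray()` and `model.explainedVariance`. Large loadings describe the variance structure of the inputs, not predictive power for a target.

```python
import numpy as np

feature_names = ["u", "v", "x", "y", "z"]

# Made-up (n_features, k) loadings, one principal component per column.
pcs = np.array([
    [0.9, 0.1],
    [0.0, 0.2],
    [-0.3, 0.8],
    [0.2, -0.5],
    [0.1, 0.1],
])

# Made-up share of variance explained by each component.
explained = np.array([0.7, 0.3])

# Score each feature by its absolute loadings, weighted by explained variance.
scores = np.abs(pcs) @ explained

# Feature indices from highest to lowest score.
order = np.argsort(scores)[::-1]
ranked = [(feature_names[i], float(scores[i])) for i in order]
print(ranked)
```

Taking the first few entries of `ranked` gives a candidate subset of columns, which can then be compared against other selection methods.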
