PySpark: narrowing down the most important features with PCA

I am using Spark 2.2 with Python, and PCA from the ml.feature module. I use VectorAssembler to feed my features to the PCA. To clarify, let's say I have a table with three columns col1, col2 and col3; then I do:

    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
    df = assembler.transform(table).select("features")

    from pyspark.ml.feature import PCA

    pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)

At this point I have fitted the PCA with two components, and I can look at its values as:

 m = model.pc.values.reshape(3, 2) 

which has 3 rows (= number of columns in my source table) and 2 columns (= number of components in my PCA). My question is: are the three rows here in the same order in which I passed my input columns to the VectorAssembler above? To clarify, does the matrix above correspond to:

             | PC1 | PC2 |
    ---------|-----|-----|
    col1     |     |     |
    ---------|-----|-----|
    col2     |     |     |
    ---------|-----|-----|
    col3     |     |     |
    ---------+-----+-----+

Please note that this example is only for clarity. In my real problem, I am dealing with ~1600 columns and a lot of choices. I could not find a definitive answer to this in the Spark documentation. I want to do this in order to select the best columns / features from my original table to train my model, based on the top principal components. Or is there something else / better in the Spark ML PCA that I should look at to get such a result?

Or can I not use PCA for this at all, and must I use other techniques such as Spearman ranking, etc.?

2 answers

are (...) rows here in the same order in which I specified my input columns

Yes, they are. Let's trace what happens:

    from pyspark.ml.feature import PCA, VectorAssembler

    data = [
        (0.0, 1.0, 0.0, 7.0, 0.0),
        (2.0, 0.0, 3.0, 4.0, 5.0),
        (4.0, 0.0, 0.0, 6.0, 7.0)
    ]

    df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])

VectorAssembler follows the column order:

    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    vectors = assembler.transform(df).select("features")

    vectors.schema[0].metadata
    # {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
    #     {'idx': 1, 'name': 'v'},
    #     {'idx': 2, 'name': 'x'},
    #     {'idx': 3, 'name': 'y'},
    #     {'idx': 4, 'name': 'z'}]},
    #   'num_attrs': 5}}

And so do the principal components:

    model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)

    # In IPython, `?` prints the docstring of the property:
    ?model.pc
    # Type:        property
    # String form: <property object at 0x7feb5bdc1d68>
    # Docstring:
    # Returns a principal components Matrix.
    # Each column is one principal component.
    #
    # .. versionadded:: 2.0.0

Finally, a sanity check:

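    import numpy as np

    x = np.array(data)
    # model.pc.values is stored column-major, so reshape to
    # (components, features) and transpose to get (features, components)
    y = model.pc.values.reshape(3, 5).transpose()
    z = np.array(model.transform(vectors).rdd.map(lambda row: row.pc_features).collect())

    # The manual projection matches the model output up to rounding error
    np.linalg.norm(x.dot(y) - z)
    # 8.881784197001252e-16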
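The `reshape(3, 5).transpose()` step in the check is needed because Spark's `DenseMatrix` keeps its `values` in column-major order, while NumPy reshapes row-major by default. Here is a minimal NumPy-only sketch of that layout, using the 3 x 2 shape from the question. The matrix `pc` and its numbers are made up for illustration and do not come from a fitted model.

```python
import numpy as np

n_features, k = 3, 2  # 3 input columns, 2 principal components

# Hypothetical loadings: one row per input column, one column per component.
pc = np.array([[0.1, 0.4],
               [0.2, 0.5],
               [0.3, 0.6]])

# A column-major store writes the whole first column, then the second.
values = pc.flatten(order="F")

# Reshaping straight to (n_features, k) mixes the two components together.
wrong = values.reshape(n_features, k)

# Reshaping to (k, n_features) and transposing restores the layout.
right = values.reshape(k, n_features).T

# Equivalent: tell NumPy that the buffer is column-major.
also_right = values.reshape((n_features, k), order="F")
```

So `model.pc.values.reshape(3, 2)` on its own would not line up rows with input columns. Calling `model.pc.toArray()` should avoid the manual reshape altogether.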

You can see the actual column order here:

 df.schema["features"].metadata["ml_attr"]["attrs"] 

There will usually be two groups of attributes, "binary" and "numeric":

    import pandas as pd

    attrs = df.schema["features"].metadata["ml_attr"]["attrs"]
    pd.DataFrame(attrs["binary"] + attrs["numeric"]).sort_values("idx")

This should give the exact order of all columns. You can use it to check that the input and output order remain the same.
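As a rough sketch of how that ordering can be used without pandas, the snippet below turns an `ml_attr` metadata dictionary into a list of column names ordered by `idx`. The `metadata` literal is a hand-written stand-in shaped like the metadata printed in the first answer, with one invented binary attribute named `flag`. `ordered_names` is a hypothetical helper, not part of PySpark.

```python
def ordered_names(metadata):
    """Return attribute names sorted by their position in the assembled vector."""
    attrs = metadata["ml_attr"]["attrs"]
    # Merge all attribute groups ("binary", "numeric", ...) before sorting,
    # because idx is the position across every group.
    merged = [attr for group in attrs.values() for attr in group]
    return [attr["name"] for attr in sorted(merged, key=lambda attr: attr["idx"])]


# Stand-in for df.schema["features"].metadata (values are illustrative only).
metadata = {
    "ml_attr": {
        "attrs": {
            "binary": [{"idx": 1, "name": "flag"}],
            "numeric": [{"idx": 0, "name": "u"}, {"idx": 2, "name": "x"}],
        },
        "num_attrs": 3,
    }
}

names = ordered_names(metadata)
print(names)  # ['u', 'flag', 'x']
```

With a real DataFrame the dictionary would come from `df.schema["features"].metadata`, and row `i` of the principal component matrix would then belong to `names[i]`.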


Source: https://habr.com/ru/post/1265752/

