How to map features from the VectorAssembler output back to column names in Spark ML?

I am trying to run a linear regression in PySpark, and I want to create a table of summary statistics such as the coefficients, p-values, and t-values for each column in my dataset. However, to train the linear regression model I had to create a feature vector using Spark's VectorAssembler, so now each row has a single feature vector and the target column. When I access Spark's built-in regression summary statistics, I get a bare list of numbers for each statistic, with no way of knowing which attribute corresponds to which value, and that is very hard to work out manually with a large number of columns. How do I map these values back to the column names?

For example, my current output looks something like this:

Coefficients: [-187.807832407, -187.058926726, 85.1716641376, 10595.3352802, -127.258892837, -39.2827730493, -1206.47228704, 33.7078197705, 99.9956812528]

P-values: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18589731365614548, 0.275173571416679, 0.0]

t-statistics: [-23.348593508995318, -44.72813283953004, 19.836508234714472, 144.49248881747755, -16.547272230754242, -9.560681351483941, -19.563547400189073, 1.3228378389036228, ...]

Coefficient standard errors: [8.043646497811427, 4.182131353367049, 4.293682291754585, 73.32793120907755, 7.690626652102948, 4.108783841348964, 61.669402913526625, 25.481445101737247, ...]

These numbers mean nothing unless I know which attribute they correspond to. But my DataFrame has only one column, called "features", that contains rows of sparse vectors.

This is an even bigger problem when I have one-hot encoded features, because a single variable with an encoding of length n yields n corresponding coefficients, p-values, t-values, and so on.

2 answers

To date, Spark does not provide any method that can do this for you, so you have to create your own. Say your data looks like this:

    import random

    random.seed(1)

    # Toy data: a binary label, two categorical columns, two numeric columns
    df = sc.parallelize([(
        random.choice([0.0, 1.0]),
        random.choice(["a", "b", "c"]),
        random.choice(["foo", "bar"]),
        random.randint(0, 100),
        random.random(),
    ) for _ in range(100)]).toDF(["label", "x1", "x2", "x3", "x4"])

and is processed using the following pipeline:

    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml import Pipeline
    from pyspark.ml.regression import LinearRegression

    # Index the categorical columns, then one-hot encode the indices
    indexers = [
        StringIndexer(inputCol=c, outputCol="{}_idx".format(c))
        for c in ["x1", "x2"]]
    encoders = [
        OneHotEncoder(
            inputCol=idx.getOutputCol(),
            outputCol="{0}_enc".format(idx.getOutputCol()))
        for idx in indexers]

    # Assemble the encoded and numeric columns into one "features" vector
    assembler = VectorAssembler(
        inputCols=[enc.getOutputCol() for enc in encoders] + ["x3", "x4"],
        outputCol="features")

    pipeline = Pipeline(
        stages=indexers + encoders + [assembler, LinearRegression()])
    model = pipeline.fit(df)

Get the LinearRegressionModel:

 lrm = model.stages[-1] 

Transform the data:

 transformed = model.transform(df) 

Extract and flatten the ML attributes:

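    from itertools import chain

    # Merge every attribute group into one list of (index, name) pairs,
    # sorted by position in the feature vector
    attrs = sorted(
        (attr["idx"], attr["name"])
        for attr in chain(*transformed
            .schema[lrm.summary.featuresCol]
            .metadata["ml_attr"]["attrs"].values()))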

and map the output to the names:

    [(name, lrm.summary.pValues[idx]) for idx, name in attrs]

    [('x1_idx_enc_a', 0.26400012641279824),
     ('x1_idx_enc_c', 0.06320192217171572),
     ('x2_idx_enc_foo', 0.40447778902400433),
     ('x3', 0.1081883594783335),
     ('x4', 0.4545851609776568)]

    [(name, lrm.coefficients[idx]) for idx, name in attrs]

    [('x1_idx_enc_a', 0.13874401585637453),
     ('x1_idx_enc_c', 0.23498565469334595),
     ('x2_idx_enc_foo', -0.083558932128022873),
     ('x3', 0.0030186112903237442),
     ('x4', -0.12951394186593695)]
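
To get the single table the question asks for, the per-statistic lookups can be combined. The following is only a minimal sketch in plain Python: `build_summary_rows` and the `"(intercept)"` label are hypothetical names, not part of Spark, and the function assumes the statistics have already been converted to ordinary Python lists. It also assumes that, when the model is fitted with an intercept, the summary statistics carry one extra trailing element for that intercept, which is why those lists can be one longer than the coefficients.

```python
def build_summary_rows(attrs, coefficients, p_values, t_values, std_errors,
                       intercept=None):
    """Pair each named feature with its regression statistics.

    attrs is an iterable of (index, name) pairs; the remaining arguments
    are plain sequences of floats indexed by feature position.
    """
    rows = []
    for idx, name in sorted(attrs):
        rows.append({
            "feature": name,
            "coefficient": coefficients[idx],
            "std_error": std_errors[idx],
            "t_value": t_values[idx],
            "p_value": p_values[idx],
        })
    # Assumption: one extra trailing statistic means it belongs to the intercept.
    if len(p_values) == len(coefficients) + 1:
        rows.append({
            "feature": "(intercept)",
            "coefficient": intercept,
            "std_error": std_errors[-1],
            "t_value": t_values[-1],
            "p_value": p_values[-1],
        })
    return rows
```

With the model above, the inputs would presumably be `lrm.coefficients.toArray().tolist()`, `lrm.summary.pValues`, `lrm.summary.tValues`, `lrm.summary.coefficientStandardErrors`, and `lrm.intercept`; the resulting list of dicts can then be passed to whatever table library you prefer.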

You can see the actual column order here (df must be the DataFrame that already has the assembled "features" column):

 df.schema["features"].metadata["ml_attr"]["attrs"] 

There will usually be two classes, "binary" and "numeric":

    import pandas as pd

    attrs = df.schema["features"].metadata["ml_attr"]["attrs"]
    pd.DataFrame(attrs["binary"] + attrs["numeric"]).sort_values("idx")

This should give the exact order of all the columns.
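
The lookup above hard-codes the "binary" and "numeric" groups, so it may fail if one of them is absent or another group is present. As a hedged alternative, here is a small sketch that walks whichever groups exist. `ordered_feature_names` is a hypothetical helper, and the sample dictionary is hand-written to mimic the shape of the `ml_attr` metadata rather than taken from a real run.

```python
def ordered_feature_names(ml_attr):
    """Return feature names ordered by their position in the assembled vector."""
    groups = ml_attr.get("attrs", {})
    # Collect (idx, name) from every group that happens to be present.
    pairs = [(attr["idx"], attr["name"])
             for group in groups.values()
             for attr in group]
    return [name for _, name in sorted(pairs)]


# Hand-written sample mimicking the metadata layout (an assumption).
sample_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 3, "name": "x3"}, {"idx": 4, "name": "x4"}],
        "binary": [
            {"idx": 0, "name": "x1_idx_enc_a"},
            {"idx": 1, "name": "x1_idx_enc_c"},
            {"idx": 2, "name": "x2_idx_enc_foo"},
        ],
    },
    "num_attrs": 5,
}

names = ordered_feature_names(sample_ml_attr)
```

On a real DataFrame the argument would be `df.schema["features"].metadata["ml_attr"]`.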


Source: https://habr.com/ru/post/1265751/

