I am using Apache Spark with Scala to build an ML Pipeline. One of the Transformers in my pipeline performs an expensive join operation early in the process. Since I have many parameters in my ParamGrid, the program has to hold this huge, joined DataFrame in memory while it optimizes over each parameter combination in the grid.
To try to solve this problem, I created a custom Transformer that caches this large intermediate DataFrame by writing it out to parquet in S3 and returning a DataFrame that reads back from that parquet. This worked and sped up the model, until I added parameters to the ParamGrid for stages that come before the caching step. When I write the parquet to S3, I use a path defined like this:
class Cacher(override val uid: String) extends Transformer {
  lazy val cachePath = Identifiable.randomUID(uid + "transformer-cache")
I think I don't understand how uid works... I assumed that whenever Spark optimizes over the ParamGrid, it takes all the stages currently in the pipeline, creates new instances of them, and gives each one a new, unique uid to keep track of it. I suspect the caching is going awry because Spark does not give a unique uid to the new Transformer instances it creates, which means the cached parquet is constantly overwritten whenever a new Cacher instance is created. Can someone give me pointers on how to generate a unique, random uid for each instance of the stage that the pipeline creates?
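For reference, here is roughly what the whole stage looks like. This is a simplified sketch rather than my exact code: the transform body is trimmed down, the S3 bucket prefix on the path is omitted, and the no-arg constructor and copy override follow the standard Transformer pattern as I understand it.

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class Cacher(override val uid: String) extends Transformer {

  // Usual pattern: the no-arg constructor asks Spark for a random uid.
  def this() = this(Identifiable.randomUID("cacher"))

  // The cache path is derived from the instance's uid, so two stages only
  // write to different locations if they end up as distinct instances.
  // (In reality this is prefixed with my S3 bucket; omitted here.)
  lazy val cachePath: String = Identifiable.randomUID(uid + "transformer-cache")

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Persist the expensive intermediate result, then return a DataFrame
    // that reads from the cached parquet instead of recomputing the join.
    dataset.toDF().write.mode("overwrite").parquet(cachePath)
    dataset.sparkSession.read.parquet(cachePath)
  }

  override def transformSchema(schema: StructType): StructType = schema

  // defaultCopy builds the new instance with the *same* uid and then copies
  // the params over, which is where I suspect the path stops being unique.
  override def copy(extra: ParamMap): Cacher = defaultCopy(extra)
}

My expectation was that each combination in the grid would get its own copy of this stage, and therefore its own cachePath, but that does not seem to be what happens once the upstream parameters start varying.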
Cheers!