Encode an ADT / sealed trait hierarchy in a Spark Dataset column

If I want to store an Algebraic Data Type (ADT) (i.e. a Scala sealed trait hierarchy) inside a Spark Dataset, what is the best encoding strategy?

For example, if I have an ADT where the leaf types store different kinds of data:

sealed trait Occupation
case object SoftwareEngineer extends Occupation
case class Wizard(level: Int) extends Occupation
case class Other(description: String) extends Occupation

What is the best way to build:

org.apache.spark.sql.Dataset[Occupation]
1 answer

TL;DR There is currently no good solution, and given the Spark SQL / Dataset implementation, it is unlikely there will be one in the foreseeable future.

You can use the generic kryo or java encoders:

val occupation: Seq[Occupation] = Seq(SoftwareEngineer, Wizard(1), Other("foo"))
spark.createDataset(occupation)(org.apache.spark.sql.Encoders.kryo[Occupation])

but this is hardly useful in practice: the kryo encoder serializes each value into a single opaque binary column, so the ADT's fields are invisible to Spark SQL.
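To see why, a minimal sketch (assuming the Occupation hierarchy above and an active SparkSession named spark):

import org.apache.spark.sql.Encoders

val ds = spark.createDataset(occupation)(Encoders.kryo[Occupation])

// The whole object is stored as one binary blob:
ds.printSchema()
// root
//  |-- value: binary (nullable = true)

// Typed operations still work, but only by deserializing every row back
// into JVM objects; column pruning and predicate pushdown are unavailable:
val wizards = ds.filter(_.isInstanceOf[Wizard])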

The UserDefinedType (UDT) API provides another possible approach (Spark 1.6, 2.0, 2.1-SNAPSHOT), but it is private and requires quite a lot of boilerplate code (you can look at org.apache.spark.ml.linalg.VectorUDT for an example implementation).
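For a sense of the boilerplate involved, below is a rough, hypothetical sketch of what an Occupation UDT could look like, flattening the hierarchy into a struct with a type tag plus nullable payload fields. Because the API is private in Spark 2.x, the file would have to live under an org.apache.spark package to compile (or target the public 1.6 API), and the UDT would still need to be registered (e.g. via the private UDTRegistration internals or an @SQLUserDefinedType annotation). The name OccupationUDT and the tag values are made up for illustration.

// Hypothetical sketch only: UserDefinedType is private[spark] in Spark 2.x,
// so this must be compiled inside an org.apache.spark package.
package org.apache.spark.sql.types

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.unsafe.types.UTF8String
// assumes the Occupation hierarchy defined above is importable here

class OccupationUDT extends UserDefinedType[Occupation] {

  // Flatten the ADT into a struct: a tag naming the variant plus
  // nullable fields for each variant's payload.
  override def sqlType: DataType = StructType(Seq(
    StructField("tag", StringType, nullable = false),
    StructField("level", IntegerType, nullable = true),
    StructField("description", StringType, nullable = true)))

  override def serialize(obj: Occupation): Any = {
    val row = new GenericInternalRow(3)
    obj match {
      case SoftwareEngineer =>
        row.update(0, UTF8String.fromString("software_engineer"))
        row.setNullAt(1)
        row.setNullAt(2)
      case Wizard(level) =>
        row.update(0, UTF8String.fromString("wizard"))
        row.update(1, level)
        row.setNullAt(2)
      case Other(description) =>
        row.update(0, UTF8String.fromString("other"))
        row.setNullAt(1)
        row.update(2, UTF8String.fromString(description))
    }
    row
  }

  override def deserialize(datum: Any): Occupation = datum match {
    case row: InternalRow => row.getUTF8String(0).toString match {
      case "software_engineer" => SoftwareEngineer
      case "wizard"            => Wizard(row.getInt(1))
      case "other"             => Other(row.getUTF8String(2).toString)
    }
  }

  override def userClass: Class[Occupation] = classOf[Occupation]
}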


Source: https://habr.com/ru/post/1260957/

