Encode an ADT / sealed trait hierarchy in a Spark Dataset column

If I want to store an Algebraic Data Type (ADT) (i.e. a Scala sealed trait hierarchy) inside a Spark Dataset, what is the best encoding strategy?

For example, if I have an ADT where the leaf types store different kinds of data:

sealed trait Occupation
case object SoftwareEngineer extends Occupation
case class Wizard(level: Int) extends Occupation
case class Other(description: String) extends Occupation

What is the best way to build:

org.apache.spark.sql.Dataset[Occupation]
1 answer

TL;DR There is currently no good solution, and given the Spark SQL / Dataset implementation, it is unlikely there will be one in the foreseeable future.

You can use the generic kryo or java encoders:

val occupation: Seq[Occupation] = Seq(SoftwareEngineer, Wizard(1), Other("foo"))
spark.createDataset(occupation)(org.apache.spark.sql.Encoders.kryo[Occupation])

but this is hardly useful in practice: the kryo encoder serializes each value into a single opaque binary column, so the ADT's fields are invisible to Spark SQL.
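To see why, a minimal sketch (assuming the Occupation hierarchy above and an active SparkSession named spark):

import org.apache.spark.sql.Encoders

val ds = spark.createDataset(occupation)(Encoders.kryo[Occupation])

// The whole object is stored as one binary blob:
ds.printSchema()
// root
//  |-- value: binary (nullable = true)

// Typed operations still work, but only by deserializing every row back
// into JVM objects; column pruning and predicate pushdown are unavailable:
val wizards = ds.filter(_.isInstanceOf[Wizard])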

The UserDefinedType (UDT) API provides another possible approach (Spark 1.6, 2.0, 2.1-SNAPSHOT), but it is private and requires quite a lot of boilerplate code (you can look at org.apache.spark.ml.linalg.VectorUDT for an example implementation).
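For a sense of the boilerplate involved, below is a rough, hypothetical sketch of what an Occupation UDT could look like, flattening the hierarchy into a struct with a type tag plus nullable payload fields. Because the API is private in Spark 2.x, the file would have to live under an org.apache.spark package to compile (or target the public 1.6 API), and the UDT would still need to be registered (e.g. via the private UDTRegistration internals or an @SQLUserDefinedType annotation). The name OccupationUDT and the tag values are made up for illustration.

// Hypothetical sketch only: UserDefinedType is private[spark] in Spark 2.x,
// so this must be compiled inside an org.apache.spark package.
package org.apache.spark.sql.types

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.unsafe.types.UTF8String
// assumes the Occupation hierarchy defined above is importable here

class OccupationUDT extends UserDefinedType[Occupation] {

  // Flatten the ADT into a struct: a tag naming the variant plus
  // nullable fields for each variant's payload.
  override def sqlType: DataType = StructType(Seq(
    StructField("tag", StringType, nullable = false),
    StructField("level", IntegerType, nullable = true),
    StructField("description", StringType, nullable = true)))

  override def serialize(obj: Occupation): Any = {
    val row = new GenericInternalRow(3)
    obj match {
      case SoftwareEngineer =>
        row.update(0, UTF8String.fromString("software_engineer"))
        row.setNullAt(1)
        row.setNullAt(2)
      case Wizard(level) =>
        row.update(0, UTF8String.fromString("wizard"))
        row.update(1, level)
        row.setNullAt(2)
      case Other(description) =>
        row.update(0, UTF8String.fromString("other"))
        row.setNullAt(1)
        row.update(2, UTF8String.fromString(description))
    }
    row
  }

  override def deserialize(datum: Any): Occupation = datum match {
    case row: InternalRow => row.getUTF8String(0).toString match {
      case "software_engineer" => SoftwareEngineer
      case "wizard"            => Wizard(row.getInt(1))
      case "other"             => Other(row.getUTF8String(2).toString)
    }
  }

  override def userClass: Class[Occupation] = classOf[Occupation]
}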


Source: https://habr.com/ru/post/1260957/

