Using Spark 2.x, it seems that I cannot create a DataFrame from an RDD[Row] whose rows contain case class instances.
It works fine on Spark 1.6.x, but on 2.x it fails at runtime with the following exception:
java.lang.RuntimeException: Timestamp is not a valid external type for schema of struct<seconds:bigint,nanos:int>
preceded by a bunch of generated code from Catalyst.
Here is a snippet (simplified version of what I'm doing):
package main

import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

object Test {
  case class Timestamp(seconds: Long, nanos: Int)

  val TIMESTAMP_TYPE = StructType(List(
    StructField("seconds", LongType, false),
    StructField("nanos", IntegerType, false)
  ))

  val SCHEMA = StructType(List(
    StructField("created_at", TIMESTAMP_TYPE, true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val rowRDD = spark.sparkContext.parallelize(Seq((0L, 0))).map {
      case (seconds: Long, nanos: Int) => Row(Timestamp(seconds, nanos))
    }
    spark.createDataFrame(rowRDD, SCHEMA).show(1)
  }
}
I'm not sure if this is a Spark bug or something I missed in the documentation (I know that Spark 2.x introduced runtime validation of Row contents against the schema, so maybe this is related).
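For what it's worth, here is a sketch of a variant I would expect to line up with the declared schema: since `created_at` is declared as a struct, the nested value is built as a `Row` rather than a case class instance. This is only my guess at what the 2.x check expects, not a confirmed fix:

```scala
// Same setup as above, but the nested struct value is a Row,
// matching TIMESTAMP_TYPE field-by-field (seconds: Long, nanos: Int).
// Assumption: Spark 2.x's runtime validation accepts Row for struct columns.
val rowRDD = spark.sparkContext.parallelize(Seq((0L, 0))).map {
  case (seconds, nanos) => Row(Row(seconds, nanos))
}
spark.createDataFrame(rowRDD, SCHEMA).show(1)
```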
Help appreciated.