Unable to create DataFrame from an RDD of Rows using a case class

Using Spark 2.x, it seems I cannot create a DataFrame from an RDD of Rows containing case classes.

It works fine on Spark 1.6.x, but on 2.x it fails at runtime with the following exception:

java.lang.RuntimeException: Timestamp is not a valid external type for schema of struct<seconds:bigint,nanos:int>

preceded by a bunch of generated code from Catalyst.

Here is a snippet (simplified version of what I'm doing):

package main

import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

object Test {

  case class Timestamp(seconds: Long, nanos: Int)

  val TIMESTAMP_TYPE = StructType(List(
    StructField("seconds", LongType, false),
    StructField("nanos", IntegerType, false)
  ))

  val SCHEMA = StructType(List(
    StructField("created_at", TIMESTAMP_TYPE, true)
  ))

  def main(args: Array[String]) {

    val spark = SparkSession.builder().getOrCreate()

    val rowRDD = spark.sparkContext.parallelize(Seq((0L, 0))).map {
      case (seconds: Long, nanos: Int) => {
        Row(Timestamp(seconds, nanos))
      }
    }

    spark.createDataFrame(rowRDD, SCHEMA).show(1)
  }
}

I'm not sure whether this is a Spark bug or something I missed in the documentation (I know that Spark 2.x introduced runtime validation of Row contents, so maybe that's related).

Help appreciated.

1 answer

The exception tells you exactly what is wrong: a Row must not contain case class instances where the schema expects a struct. Nested structs have to be Rows themselves:

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

// the nested struct is a Row, not a Timestamp case class
spark.createDataFrame(ArrayBuffer(Row(Row(0L, 0))).asJava, SCHEMA)
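
The same fix applies if you want to keep the RDD-based version from the question: build the nested struct as a Row inside the map instead of instantiating Timestamp. A minimal sketch, reusing spark, SCHEMA and the imports from the question:

val rowRDD = spark.sparkContext.parallelize(Seq((0L, 0))).map {
  case (seconds, nanos) => Row(Row(seconds, nanos))  // a Row wrapping a Row, no case class
}

spark.createDataFrame(rowRDD, SCHEMA).show(1)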

Alternatively, you can skip Row and the explicit schema altogether and let Spark encode the case class directly:

import spark.implicits._

Seq(Tuple1(Timestamp(0L, 0))).toDF("created_at")

which yields an equivalent schema.
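
You can confirm the inferred schema with printSchema(); for the snippet above it should print something along these lines:

Seq(Tuple1(Timestamp(0L, 0))).toDF("created_at").printSchema()

// root
//  |-- created_at: struct (nullable = true)
//  |    |-- seconds: long (nullable = false)
//  |    |-- nanos: integer (nullable = false)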

On a side note, if you want nullable fields, use Options:

case class Record(created_at: Option[Timestamp])
case class Timestamp(seconds: Long, nanos: Option[Int])

Seq(Record(Some(Timestamp(0L, Some(0))))).toDF

This lets you represent, for example, a record where created_at.nanos is NULL while created_at.seconds is set, as well as a record where created_at is NULL as a whole.
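
For instance, with the Option-based definitions above (a small sketch, assuming spark.implicits._ is still in scope):

Seq(
  Record(Some(Timestamp(0L, Some(0)))),  // fully populated
  Record(Some(Timestamp(0L, None))),     // nanos is NULL, seconds is set
  Record(None)                           // created_at is NULL as a whole
).toDF.show()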


Source: https://habr.com/ru/post/1653368/

