Initial value of Spark 2

Question

Initial value of Spark 2

Getting this zero error in the Dataset.filter spark box

CSV Input:

name,age,stat
abc,22,m
xyz,,s

Work code:

case class Person(name: String, age: Long, stat: String)

val peopleDS = spark.read.option("inferSchema","true")
  .option("header", "true").option("delimiter", ",")
  .csv("./people.csv").as[Person]
peopleDS.show()
peopleDS.createOrReplaceTempView("people")
spark.sql("select * from people where age > 30").show()

Code error (adding the following line returns an error):

val filteredDS = peopleDS.filter(_.age > 30)
filteredDS.show()

Returns zero error

java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "age")
- root class: "com.gcp.model.Person"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).

+7

scala apache-spark apache-spark-sql apache-spark-dataset

xstack2000 Jan 15 '17 at 19:08

source share

1 answer

user6910411 · Accepted Answer · 2017-01-15T19:53:09+0000

The exception you get should explain everything, but step by step:

When loading data using a data source, csvall fields are marked as nullable:

val path: String = ???

val peopleDF = spark.read
  .option("inferSchema","true")
  .option("header", "true")
  .option("delimiter", ",")
  .csv(path)

peopleDF.printSchema

root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- stat: string (nullable = true)

Missing field appears as SQL NULL

peopleDF.where($"age".isNull).show

+----+----+----+
|name| age|stat|
+----+----+----+
| xyz|null|   s|
+----+----+----+

Dataset[Row] Dataset[Person] Long age. Long Scala null. nullable, nullable :

val peopleDS = peopleDF.as[Person]

peopleDS.printSchema

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- stat: string (nullable = true)

, as[T] .

Dataset SQL ( ) DataFrame API, Spark . nullable :

peopleDS.where($"age" > 30).show

+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+

. SQL, NULL - .

API Dataset:
```
peopleDS.filter(_.age > 30)
```
. Long null (SQL NULL), , .
, NPE.

Optional :

case class Person(name: String, age: Option[Long], stat: String)

:

peopleDS.filter(_.age.map(_ > 30).getOrElse(false))

+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+

, :

peopleDS.filter {
  case Some(age) => age > 30
  case _         => false     // or case None => false
}

, ( ) name stat. Scala String - String Java, null. , , , null .

Spark 2.0 DataFrame

Initial value of Spark 2

More articles: