When creating a Dataset from a statically typed structure (independent of the schema argument), Spark uses a relatively simple set of rules to define the nullable property.
- If an object of this type can be null, then its DataFrame representation is nullable.
- If the object is an Option[_], then its DataFrame representation is nullable, with None treated as SQL NULL.
- In any other case, it is marked as not nullable.
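A minimal plain-Scala sketch of the three cases above (no Spark required; the values are hypothetical examples, only illustrating which Scala types can actually hold null):

```scala
object NullabilityRules {
  def main(args: Array[String]): Unit = {
    // Case 1: a reference type can be null -> its column would be nullable
    val s: String = null

    // Case 2: Option[_] -> nullable column, None maps to SQL NULL
    val o: Option[Int] = None

    // Case 3: a value type like scala.Int can never be null -> not nullable
    val i: Int = 0
    // val bad: Int = null  // would not compile

    println(s == null)  // true
    println(o.isEmpty)  // true
    println(i)          // 0
  }
}
```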
Since Scala's String is java.lang.String, which can be null, the resulting column can be nullable. For the same reason, the bar column is nullable in the source dataset:
val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
Boolean = true
but foo is not (scala.Int cannot be null):
df1.schema("foo").nullable
Boolean = false
If we change the data definition to:
val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
foo will be nullable (Integer is java.lang.Integer, a boxed integer, which can be null):
data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true
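The difference between the two definitions can be seen without Spark at all; this sketch (names are illustrative) shows that only the boxed java.lang.Integer can carry null, which is exactly what makes the corresponding column nullable:

```scala
object BoxedVsPrimitive {
  def main(args: Array[String]): Unit = {
    // java.lang.Integer is a reference type, so null is a legal value
    val boxed: Integer = null
    println(boxed == null)  // true

    // scala.Int is a value type backed by a JVM primitive; it always
    // has a concrete numeric value and can never be null
    val primitive: Int = 1
    // val broken: Int = null  // would not compile
    println(primitive)  // 1
  }
}
```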
See also: SPARK-20668 Modify ScalaUDF to handle nullability.