When creating a Dataset from a statically typed structure (independent of the schema argument), Spark uses a relatively simple set of rules to determine the nullable property:
- If an object of this type can be null, then its DataFrame representation is nullable.
- If the object is an Option[_], then its DataFrame representation is nullable, with None considered SQL NULL (see the sketch after this list).
- In any other case, it is marked as not nullable.
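A minimal sketch of the Option[_] rule, assuming the same Spark shell session as the examples below (the data0/df0 and column names are illustrative):

val data0 = Seq[(Int, Option[String])]((1, Some("A")), (2, None))
val df0 = data0.toDF("id", "opt")

df0.schema("opt").nullable
Boolean = true

df0.filter($"opt".isNull).count
Long = 1

The count of 1 confirms that the None row surfaces as a SQL NULL.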
Since Scala's String is java.lang.String, which can be null, a column generated from it is nullable. For the same reason, the bar column is nullable in the source dataset:
val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")

df1.schema("bar").nullable
Boolean = true
but foo is not (scala.Int cannot be null):
df1.schema("foo").nullable
Boolean = false
If we change the data definition to:
val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
foo will be nullable (Integer is java.lang.Integer, and a boxed Integer can be null):
data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true
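If it helps to see all the flags at once, printSchema shows them for the whole schema (output from a sketch run of the data2 example above):

data2.toDF("foo", "bar").printSchema
root
 |-- foo: integer (nullable = true)
 |-- bar: string (nullable = true)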
See also: SPARK-20668 Modify ScalaUDF to handle nullability.