When creating a Dataset from a statically typed structure (without relying on an explicit schema argument), Spark uses a relatively simple set of rules to determine the nullable property.
- If an object of the given type can be null, then its DataFrame representation is nullable.
- If the object is an Option[_], then its DataFrame representation is nullable, with None treated as SQL NULL.
- In any other case, it will be marked as not nullable.
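The three rules above can be checked without building any data, by asking an encoder for the schema it derives from a type. This is only a minimal sketch: the case class Rec and its field names are hypothetical, chosen so that each field exercises one rule, and it assumes Spark SQL is on the classpath (for example in spark-shell).

```scala
import org.apache.spark.sql.Encoders

// Hypothetical record type: one field per case described by the rules.
case class Rec(
  prim: Int,                // scala.Int cannot be null
  str: String,              // java.lang.String can be null
  opt: Option[Int],         // Option[_] maps None to SQL NULL
  boxed: java.lang.Integer  // boxed integer can be null
)

// Derive the schema from the static type only; no rows are involved.
val schema = Encoders.product[Rec].schema

// Print the inferred nullability of each field.
schema.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
```

Under these assumptions only prim should come out as not nullable; the other three fields should be nullable.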
Since the Scala String is java.lang.String, which can be null, the generated column is nullable. For the same reason, the bar column is nullable in the initial dataset:
 val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
 val df1 = data1.toDF("foo", "bar")
 df1.schema("bar").nullable
 Boolean = true 
but foo is not (scala.Int cannot be null):
 df1.schema("foo").nullable 
 Boolean = false 
If we change the data definition to:
 val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C")) 
foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):
 data2.toDF("foo", "bar").schema("foo").nullable 
 Boolean = true 
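The Option[_] rule can be illustrated in the same way. The following is a sketch rather than part of the original example: data3, df3 and the local SparkSession settings are assumptions made for illustration, and in spark-shell the existing session would simply be reused.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for a standalone run; getOrCreate reuses an existing one.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("nullable-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: foo is Option[Int], so None stands in for a missing value.
val data3 = Seq[(Option[Int], String)]((Some(2), "A"), (None, "B"))
val df3 = data3.toDF("foo", "bar")

// The column should be marked nullable because of the Option wrapper.
val fooNullable = df3.schema("foo").nullable

// None should surface as SQL NULL, so exactly one row matches isNull.
val nullCount = df3.filter($"foo".isNull).count()
```

Compared with java.lang.Integer, Option[Int] keeps the possibility of a missing value visible in the Scala type while producing the same nullable column.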
See also: SPARK-20668 Modify ScalaUDF to handle nullability.