Spark: why do columns change to nullable = true?

Why is nullable = true after applying some functions, even though there are no NaN values in the df?

    val myDf = Seq((2, "A"), (2, "B"), (1, "C"))
      .toDF("foo", "bar")
      .withColumn("foo", 'foo.cast("Int"))

    myDf
      .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
      .select("foo", "foo_2")
      .show

When df.printSchema is called at this point, nullable is false for both columns.

    val fooMap = Map(
      1 -> "small",
      2 -> "big"
    )

    val foo: (Int => String) = (t: Int) => fooMap.get(t) match {
      case Some(tt) => tt
      case None     => "notFound"
    }

    val fooUDF = udf(foo)

    myDf
      .withColumn("foo", fooUDF(col("foo")))
      .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
      .select("foo", "foo_2")
      .printSchema

However, now the nullable value is true for at least one column that used to be false. How can this be explained?

2 answers

When a Dataset is created from a statically typed structure (without an explicit schema argument), Spark uses a relatively simple set of rules to determine the nullable property (a short sketch of these rules follows the list below).

  • If an object of the given type can be null, then its DataFrame representation is nullable.
  • If the object is an Option[_], then its DataFrame representation is nullable, with None treated as SQL NULL.
  • In any other case, it will be marked as not nullable.
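
A quick way to see these rules in action (a minimal sketch, assuming a local SparkSession named spark; the column names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Int is a primitive and cannot be null; String can be null;
    // Option[_] is nullable, with None stored as SQL NULL.
    val rulesDf = Seq((1, "a", Some(1)), (2, "b", Option.empty[Int])).toDF("i", "s", "opt")
    rulesDf.schema("i").nullable    // false
    rulesDf.schema("s").nullable    // true
    rulesDf.schema("opt").nullable  // true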

Since Scala's String is java.lang.String, which can be null, the generated column can be nullable. For the same reason the bar column is nullable in the initial dataset:

    val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
    val df1 = data1.toDF("foo", "bar")
    df1.schema("bar").nullable
 Boolean = true 

but foo is not (scala.Int cannot be null):

 df1.schema("foo").nullable 
 Boolean = false 

If we change the data definition to:

 val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C")) 

foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):

 data2.toDF("foo", "bar").schema("foo").nullable 
 Boolean = true 

See also: SPARK-20668 Modify ScalaUDF to handle nullability.
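
The same rules apply to the output of a UDF: fooUDF in the question returns a String, which can be null, so Spark marks the resulting foo column as nullable. A minimal check (a sketch, assuming the myDf and fooUDF definitions from the question are in scope):

    // myDf and fooUDF are assumed to be defined as in the question.
    val transformed = myDf.withColumn("foo", fooUDF(col("foo")))
    transformed.schema("foo").nullable  // true: the UDF returns a String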


You can also change the schema of a DataFrame very quickly. Something like this would do the job:

    def setNullableStateForAllColumns(df: DataFrame, columnMap: Map[String, Boolean]): DataFrame = {
      import org.apache.spark.sql.types.{StructField, StructType}
      // get the current schema
      val schema = df.schema
      // rebuild each field, overriding nullability where columnMap provides an entry
      val newSchema = StructType(schema.map {
        case StructField(c, d, n, m) =>
          StructField(c, d, columnMap.getOrElse(c, default = n), m)
      })
      // apply the new schema
      df.sqlContext.createDataFrame(df.rdd, newSchema)
    }
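
For example, to force foo back to non-nullable after the UDF (a hypothetical call; transformedDf stands for the DataFrame produced in the question, and this is only safe if the column really contains no nulls, since createDataFrame does not validate the data against the new schema):

    // transformedDf is an assumed name for the DataFrame built in the question.
    val fixedDf = setNullableStateForAllColumns(transformedDf, Map("foo" -> false))
    fixedDf.printSchema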

Source: https://habr.com/ru/post/1012331/

