Spark: why do columns change to nullable = true?

Why is nullable = true after applying some functions, even though there are no NaN values in the df?

    val myDf = Seq((2, "A"), (2, "B"), (1, "C"))
      .toDF("foo", "bar")
      .withColumn("foo", 'foo.cast("Int"))

    myDf
      .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
      .select("foo", "foo_2")
      .show

When df.printSchema is called at this point, nullable is false for both columns.

    val fooMap = Map(
      1 -> "small",
      2 -> "big"
    )

    val foo: (Int => String) = (t: Int) => fooMap.get(t) match {
      case Some(tt) => tt
      case None     => "notFound"
    }

    val fooUDF = udf(foo)

    myDf
      .withColumn("foo", fooUDF(col("foo")))
      .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
      .select("foo", "foo_2")
      .printSchema

However, now the nullable value is true for at least one column that used to be false. How can this be explained?

2 answers

When a Dataset is created from a statically typed structure (without an explicit schema argument), Spark uses a relatively simple set of rules to determine the nullable property (a short sketch of these rules follows the list below).

  • If an object of the given type can be null, then its DataFrame representation is nullable.
  • If the object is an Option[_], then its DataFrame representation is nullable, with None treated as SQL NULL.
  • In any other case, it will be marked as not nullable.
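
A quick way to see these rules in action (a minimal sketch, assuming a local SparkSession named spark; the column names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Int is a primitive and cannot be null; String can be null;
    // Option[_] is nullable, with None stored as SQL NULL.
    val rulesDf = Seq((1, "a", Some(1)), (2, "b", Option.empty[Int])).toDF("i", "s", "opt")
    rulesDf.schema("i").nullable    // false
    rulesDf.schema("s").nullable    // true
    rulesDf.schema("opt").nullable  // true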

Since Scala's String is java.lang.String, which can be null, the generated column can be nullable. For the same reason the bar column is nullable in the initial dataset:

    val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
    val df1 = data1.toDF("foo", "bar")
    df1.schema("bar").nullable
 Boolean = true 

but foo is not (scala.Int cannot be null):

 df1.schema("foo").nullable 
 Boolean = false 

If we change the data definition to:

 val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C")) 

foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):

 data2.toDF("foo", "bar").schema("foo").nullable 
 Boolean = true 

See also: SPARK-20668 Modify ScalaUDF to handle nullability.
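
The same rules apply to the output of a UDF: fooUDF in the question returns a String, which can be null, so Spark marks the resulting foo column as nullable. A minimal check (a sketch, assuming the myDf and fooUDF definitions from the question are in scope):

    // myDf and fooUDF are assumed to be defined as in the question.
    val transformed = myDf.withColumn("foo", fooUDF(col("foo")))
    transformed.schema("foo").nullable  // true: the UDF returns a String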


You can also change the schema of a DataFrame very quickly. Something like this would do the job:

    def setNullableStateForAllColumns(df: DataFrame, columnMap: Map[String, Boolean]): DataFrame = {
      import org.apache.spark.sql.types.{StructField, StructType}
      // get the current schema
      val schema = df.schema
      // rebuild each field, overriding nullability where columnMap provides an entry
      val newSchema = StructType(schema.map {
        case StructField(c, d, n, m) =>
          StructField(c, d, columnMap.getOrElse(c, default = n), m)
      })
      // apply the new schema
      df.sqlContext.createDataFrame(df.rdd, newSchema)
    }
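
For example, to force foo back to non-nullable after the UDF (a hypothetical call; transformedDf stands for the DataFrame produced in the question, and this is only safe if the column really contains no nulls, since createDataFrame does not validate the data against the new schema):

    // transformedDf is an assumed name for the DataFrame built in the question.
    val fixedDf = setNullableStateForAllColumns(transformedDf, Map("foo" -> false))
    fixedDf.printSchema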

Source: https://habr.com/ru/post/1012331/

