Spark SQL nested using column

I have a DataFrame that has several columns, some of which are structures. Something like that

root |-- foo: struct (nullable = true) | |-- bar: string (nullable = true) | |-- baz: string (nullable = true) |-- abc: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- def: struct (nullable = true) | | | |-- a: string (nullable = true) | | | |-- b: integer (nullable = true) | | | |-- c: string (nullable = true) 

I want to apply a UserDefinedFunction in a baz column to replace baz with a baz function, but I cannot figure out how to do this. Here is an example of the desired output (note that baz now int )

 root |-- foo: struct (nullable = true) | |-- bar: string (nullable = true) | |-- baz: int (nullable = true) |-- abc: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- def: struct (nullable = true) | | | |-- a: string (nullable = true) | | | |-- b: integer (nullable = true) | | | |-- c: string (nullable = true) 

It appears that DataFrame.withColumn only works on top-level columns, but not on nested columns. I am using Scala for this problem.

Can someone help me with this?

thanks

+11
source share
1 answer

it's simple, just use a dot to select nested structures, for example. $"foo.baz" :

 case class Foo(bar:String,baz:String) case class Record(foo:Foo) val df = Seq( Record(Foo("Hi","There")) ).toDF() df.printSchema root |-- foo: struct (nullable = true) | |-- bar: string (nullable = true) | |-- baz: string (nullable = true) val myUDF = udf((s:String) => { // do something with s s.toUpperCase }) df .withColumn("udfResult",myUDF($"foo.baz")) .show +----------+---------+ | foo|udfResult| +----------+---------+ |[Hi,There]| THERE| +----------+---------+ 

If you want to add the result of your UDF to the existing foo structure, then get:

 root |-- foo: struct (nullable = false) | |-- bar: string (nullable = true) | |-- baz: string (nullable = true) | |-- udfResult: string (nullable = true) 

There are two options:

with withColumn :

 df .withColumn("udfResult",myUDF($"foo.baz")) .withColumn("foo",struct($"foo.*",$"udfResult")) .drop($"udfResult") 

with select :

 df .select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo")) 

EDIT: Replacing an existing attribute in a structure with a UDF result: unfortunately, this does not work:

 df .withColumn("foo.baz",myUDF($"foo.baz")) 

but it can be done as follows:

 // get all columns except foo.baz val structCols = df.select($"foo.*") .columns .filter(_!="baz") .map(name => col("foo."+name)) df.withColumn( "foo", struct((structCols:+myUDF($"foo.baz").as("baz")):_*) ) 
+13
source

Source: https://habr.com/ru/post/1269343/


All Articles