How to apply a function to a Spark DataFrame column?

Suppose we have a Spark DataFrame

 df.getClass
 // Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame

with the following schema:

 df.printSchema
 root
  |-- rawFV: string (nullable = true)
  |-- tk: array (nullable = true)
  |    |-- element: string (containsNull = true)

Given that each entry of the tk column is an array of strings, how do I write a Scala function that returns the number of elements in each row?

+5
2 answers

You do not need to write a custom function, because there is a built-in one:

 import org.apache.spark.sql.functions.size

 df.select(size($"tk"))
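
A minimal, self-contained sketch of this built-in approach, assuming a Spark 1.6-style `sqlContext` (e.g. in the spark-shell) and made-up sample rows that match the schema above:

 import org.apache.spark.sql.functions.size
 import sqlContext.implicits._

 // Hypothetical sample data with the question's schema (rawFV: string, tk: array<string>)
 val df = sqlContext.createDataFrame(Seq(
   ("some raw text", Seq("some", "raw", "text")),
   ("more text", Seq("more", "text"))
 )).toDF("rawFV", "tk")

 // size() returns the number of elements of the array in each row
 df.select($"rawFV", size($"tk").alias("n_tokens")).show()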

If you really want to, you can write a udf:

 import org.apache.spark.sql.functions.udf

 val size_ = udf((xs: Seq[String]) => xs.size)

or even create a custom expression, but there is really no point in doing so.
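
Applying the udf defined above looks like this (a sketch; `size_` and `df` are as in this answer, and the null-safe variant is a hypothetical extension, relevant because tk is nullable in the schema):

 // Use the udf exactly like the built-in size()
 df.select(size_($"tk"))

 // Hypothetical null-safe variant: returns 0 instead of failing on null arrays
 val sizeOrZero = udf((xs: Seq[String]) => Option(xs).map(_.size).getOrElse(0))
 df.select(sizeOrZero($"tk"))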

+10

One way is to access the array elements using SQL, as shown below.

 df.registerTempTable("tab1")

 val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")
 df2.show()

To get the size of the array column:

 val df3 = sqlContext.sql("select size(tk) from tab1")
 df3.show()
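
For reference, on Spark 2.x and later the same SQL approach is written against a SparkSession, since registerTempTable was deprecated in favour of createOrReplaceTempView; a sketch assuming a `spark` session is in scope:

 df.createOrReplaceTempView("tab1")
 spark.sql("select tk, size(tk) as n_tokens from tab1").show()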

If your Spark version is older, you can use a HiveContext instead of the plain Spark SQLContext.
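
A sketch of that HiveContext variant for old Spark 1.x setups (assuming `sc` is the SparkContext and the spark-hive module is on the classpath; the DataFrame has to be registered through the same context it is queried from):

 import org.apache.spark.sql.hive.HiveContext

 val hiveContext = new HiveContext(sc)
 // Re-create the DataFrame in the HiveContext so its temp table is visible there
 val dfHive = hiveContext.createDataFrame(df.rdd, df.schema)
 dfHive.registerTempTable("tab1")
 hiveContext.sql("select size(tk) from tab1").show()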

I would also try an approach that traverses the array elements directly.

+1

Source: https://habr.com/ru/post/1239901/

