With Spark 2.x and Scala 2.11
I can think of 3 possible ways to convert the values of a specific column to a list.
Common code snippets for all approaches
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for the .toDF() method

val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")
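For reference, df.show() on this DataFrame should print something like the following (expected output, not from the original answer):

+------+---+
|    id|val|
+------+---+
| first|2.0|
|  test|1.5|
|choose|8.0|
+------+---+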
Approach 1
df.select("id").collect().map(_(0)).toList
What is happening here? We collect the data to the driver with collect() and pick element zero from each record.
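Note that _(0) goes through Row's untyped apply method, so the result is a List[Any], not a List[String] (a quick sketch using the df defined above; the element order matches the source here only because nothing reshuffles this small local DataFrame):

val ids: List[Any] = df.select("id").collect().map(_(0)).toList
// ids: List[Any] = List(first, test, choose)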
This may not be the best way to do it; let's improve it with the following approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
What is better here? We distributed the load of the map transformation among the workers instead of putting it all on a single driver.
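As in approach 1, r(0) is untyped, so this still produces a List[Any] (a sketch with the same df as above):

val ids: List[Any] = df.select("id").rdd.map(r => r(0)).collect.toList
// ids: List[Any] = List(first, test, choose)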
I know rdd.map(r => r(0)) may not look elegant to you. So, let's address that in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
Here we do not convert the DataFrame to an RDD. Look at map: it will not accept r => r(0) (or _(0)) as in the previous approach, because of encoder issues in the DataFrame. So we end up using r => r.getString(0); this would be addressed in future versions of Spark.
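As a side note that isn't in the original answer: the same encoder machinery lets you skip Row access entirely by converting the single-column DataFrame to a typed Dataset, assuming spark.implicits._ is in scope:

// Typed-Dataset variation: Dataset[String], so no getString call is needed
val ids: List[String] = df.select("id").as[String].collect.toList
// ids: List[String] = List(first, test, choose)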
Conclusion
All the options give the same output, but approaches 2 and 3 are efficient; the third one is also elegant (I think).
Link to the Databricks notebook, which will be available for up to 6 months from 2017/05/20.
mrsrinivas, May 20 '17 at 6:44