Retrieving DataFrame Column Values as a List in Apache Spark

I would like to convert a string column of a DataFrame to a list. What I can find in the DataFrame API is RDD, so I tried converting it to an RDD first and then applying the toArray function to the RDD. In this case, the length and SQL work fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there is a way to convert a column to a list, or a way to remove the square brackets.
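For reference, a minimal sketch of what I am doing (the session setup, column name, and data are illustrative):

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder.master("local[*]").getOrCreate()
 import spark.implicits._

 val df = Seq("A00001", "A00002").toDF("id")

 // Collecting the column without any mapping yields an Array[Row]
 df.select("id").rdd.collect().foreach(println)
 // prints [A00001] then [A00002] -- note the square brackets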

Any suggestions would be appreciated. Thanks!

+71
scala apache-spark apache-spark-sql spark-dataframe
Aug 14 '15 at 0:39
5 answers

This should return the collection containing a single list:

 dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect() 

Without the mapping, you just get a Row object, which contains every column from the database.

Keep in mind that this approach will probably give you a List of Any type. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the r => r(0).asInstanceOf[YOUR_TYPE] mapping.

P.S. Due to automatic conversion, you can skip the .rdd part.
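For example, here is a minimal self-contained sketch of the typed variant (the session setup, column name, and data are illustrative):

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder.master("local[*]").getOrCreate()
 import spark.implicits._

 val dataFrame = Seq("A00001", "A00002").toDF("id")

 // Map each Row to its first field, cast it, then collect to the driver.
 val ids: List[String] = dataFrame
   .select("id")
   .rdd
   .map(r => r(0).asInstanceOf[String])
   .collect()
   .toList
 // ids: List[String] = List(A00001, A00002)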

+94
Aug 14 '15 at 7:49

With Spark 2.x and Scala 2.11

I can think of three possible ways to convert the values of a specific column to a List.

Common code snippets for all approaches

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder.getOrCreate
 import spark.implicits._ // for the .toDF() method

 val df = Seq(
   ("first", 2.0),
   ("test", 1.5),
   ("choose", 8.0)
 ).toDF("id", "val")

Approach 1

 df.select("id").collect().map(_(0)).toList // res9: List[Any] = List(one, two, three) 

What is happening here? We collect the data to the driver with collect() and pick element zero from each record.

This may not be a great way to do it; let's improve it with the next approach.




Approach 2

 df.select("id").rdd.map(r => r(0)).collect.toList //res10: List[Any] = List(one, two, three) 

How is this better? We have distributed the load of the map transformation among the workers rather than a single driver.

I know rdd.map(r => r(0)) may not look elegant to you. Let's address that in the next approach.




Approach 3

 df.select("id").map(r => r.getString(0)).collect.toList //res11: List[String] = List(one, two, three) 

Here we are not converting the DataFrame to an RDD. Look at map: it will not accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues in the DataFrame. So we end up using r => r.getString(0), and this should be addressed in future versions of Spark.
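To make the encoder issue concrete, a short sketch against the df from the common snippet above (the explicit-encoder variant is just an illustration):

 // df.select("id").map(r => r(0)) // does not compile: no Encoder[Any] in scope

 // Works: the typed getter returns String, and Encoder[String]
 // is provided by import spark.implicits._
 df.select("id").map(r => r.getString(0)).collect.toList

 // Equivalent, passing the encoder explicitly:
 import org.apache.spark.sql.Encoders
 df.select("id").map(r => r.getString(0))(Encoders.STRING).collect.toList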

Conclusion

All the options give the same output, but 2 and 3 are efficient; finally, the third one is both efficient and elegant (I'd say).

Link to the Databricks notebook, which will be available for up to 6 months from 2017/05/20.

+49
May 20 '17 at 6:44

I know the answer given and asked for is assumed to be for Scala, so I am just providing a small snippet of Python code in case a PySpark user is curious. The syntax is similar to the answer above, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.

i.e., given a DataFrame containing a column named "Raw".

To get each row value in "Raw" combined as a list, where each entry is a row value from "Raw", I simply use:

 MyDataFrame.rdd.map(lambda x: x.Raw).collect() 
+17
Sep 30 '16 at 23:41

In Scala and Spark 2+, try this (assuming your column name is "s"): df.select('s).as[String].collect
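A minimal runnable sketch of that typed-Dataset approach (the session setup and data are illustrative):

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder.master("local[*]").getOrCreate()
 import spark.implicits._

 val df = Seq("A00001", "A00002").toDF("s")

 // .as[String] turns the single-column DataFrame into a Dataset[String],
 // so collect() returns Array[String] with no Row wrappers.
 val values: Array[String] = df.select('s).as[String].collect()
 // values: Array[String] = Array(A00001, A00002)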

+5
Jul 10 '17 at 17:20
 sqlContext.sql("select filename from tempTable")
   .rdd
   .map(r => r(0))
   .collect.toList
   .foreach(out_streamfn.println) // remove brackets

works great

+1
Dec 16 '17 at 5:58


