Spark DataFrame: working with groups

I have a DataFrame I'm working with, and I want to group by a set of columns and operate on each group over the remaining columns. In plain RDD-land, I think it would look something like this:

    rdd.map(tup => ((tup._1, tup._2, tup._3), tup))
       .groupByKey()
       .foreachPartition(iter => doSomeJob(iter))

In DataFrame-land, I would start like this:

 df.groupBy("col1", "col2", "col3") // Reference by name 

but then I'm not sure how to operate on the groups if what I need is more complicated than the avg / min / max / count offered by GroupedData.

For example, I want to build a single MongoDB document per ("col1", "col2", "col3") group (by iterating over the Rows associated with the group), repartition into N partitions, and then push the documents into MongoDB, where N is the maximum number of concurrent connections I want.
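To make that concrete, here is roughly what I have in mind in RDD terms. This is only a sketch: buildDocument is a placeholder for my own document-building logic, and the MongoDB host, database, collection names and driver calls are just illustrative.

    import com.mongodb.MongoClient
    import org.bson.Document
    import scala.collection.JavaConverters._

    val N = 8  // the maximum number of concurrent MongoDB connections I want

    df.rdd
      .map(row => ((row.getAs[Any]("col1"), row.getAs[Any]("col2"), row.getAs[Any]("col3")), row))
      .groupByKey()
      .map { case (key, rows) => buildDocument(key, rows) } // placeholder: builds one org.bson.Document per group
      .repartition(N)                                       // at most N partitions => at most N concurrent writers
      .foreachPartition { docs =>
        val client = new MongoClient("mongo-host")          // one connection per partition
        try {
          val coll = client.getDatabase("mydb").getCollection("mycoll")
          docs.grouped(1000).foreach(batch => coll.insertMany(batch.toList.asJava))
        } finally {
          client.close()
        }
      }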

Any tips?

1 answer

You could do a self-join. First, get the groups:

 val groups = df.groupBy($"col1", $"col2", $"col3").agg($"col1", $"col2", $"col3") 
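If I remember the API correctly, selecting the distinct key combinations should give you the same thing:

    val groups = df.select($"col1", $"col2", $"col3").distinct()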

Then you can join it back to the original DataFrame:

    val joinedDF = groups
      .select($"col1" as "l_col1", $"col2" as "l_col2", $"col3" as "l_col3")
      .join(df, $"col1" <=> $"l_col1" and $"col2" <=> $"l_col2" and $"col3" <=> $"l_col3")

While this gives you exactly the same data you had originally (plus three redundant columns), you can now do another join to add a column with, say, the MongoDB document ID of the (col1, col2, col3) group that each row belongs to.
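For example, something along these lines (just a sketch; I'm using monotonically_increasing_id from Spark 2.x merely as a stand-in for however you actually generate the MongoDB document IDs):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // one synthetic id per (col1, col2, col3) group -- a stand-in for a real MongoDB ObjectId
    val groupIds = groups
      .select($"col1" as "l_col1", $"col2" as "l_col2", $"col3" as "l_col3")
      .withColumn("doc_id", monotonically_increasing_id())

    // every original row now carries the id of the group it belongs to
    val withIds = df
      .join(groupIds, $"col1" <=> $"l_col1" and $"col2" <=> $"l_col2" and $"col3" <=> $"l_col3")
      .drop("l_col1").drop("l_col2").drop("l_col3")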

Anyway, in my experience, joins and self-joins are the way to handle complicated things with DataFrames.



