Why does dataset.count cause a shuffle? (Spark 2.2)

Here is my dataframe:

[screenshot of the DataFrame omitted]

There are 2 partitions in the underlying RDD:

[screenshots of the two RDD partitions omitted]

When I do df.count, the DAG created is:

[screenshot of the DAG omitted]

When I do df.rdd.count, the created DAG is:

[screenshot of the DAG omitted]

Question: count is an action in Spark, and the official definition says: "Returns the number of rows in the Dataset." Since I am just counting the rows of a DataFrame, why does a shuffle happen? And why, when I do the same count on the underlying RDD, is there no shuffle?

It makes no sense to me that a shuffle should occur at all. I tried to look at the source code of count in the Spark repo on GitHub, but that didn't clear things up for me either. Is the groupBy in there the culprit?

P.S. df.coalesce(1).count
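A minimal sketch that reproduces the two DAGs (the sample data and session setup here are illustrative, not from the screenshots above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("count-vs-rdd-count")
  .getOrCreate()
import spark.implicits._

// A DataFrame whose underlying RDD has 2 partitions, as above.
val df = spark.sparkContext
  .parallelize(Seq(1, 2, 3, 4), numSlices = 2)
  .toDF("value")

df.count()      // DAG has two stages joined by an Exchange (shuffle)
df.rdd.count()  // DAG has a single stage, no shuffle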


Because DataFrame's count is implemented via groupBy, and that is where the shuffle comes from. From Dataset.scala: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

/**
 * Returns the number of rows in the Dataset.
 * @group action
 * @since 1.6.0
 */
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}
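Since count() is literally a key-less groupBy().count(), you can surface the shuffle yourself with explain(); a sketch, where df is any DataFrame and the output is trimmed to the operators that matter:

// The same physical plan that df.count() executes:
df.groupBy().count().explain()
// *HashAggregate(keys=[], functions=[count(1)])
// +- Exchange SinglePartition              <- the shuffle
//    +- *HashAggregate(keys=[], functions=[partial_count(1)])
//       +- ... (scan of df)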

RDD's count, by contrast, runs one job that counts the elements inside each partition and then combines the per-partition results on the driver with .sum, so no shuffle is involved.

Source: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
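Utils.getIteratorSize just walks an iterator and counts its elements, so conceptually the whole thing amounts to the sketch below (countNoShuffle is a hypothetical helper, not part of Spark; only one Long per partition ever travels to the driver):

import org.apache.spark.rdd.RDD

// Conceptual re-implementation of rdd.count (hypothetical helper):
def countNoShuffle[T](rdd: RDD[T]): Long = {
  // Count each partition locally on the executors; only one Long per
  // partition is sent back to the driver.
  val perPartition: Array[Long] =
    rdd.sparkContext.runJob(rdd, (it: Iterator[T]) => {
      var n = 0L
      while (it.hasNext) { it.next(); n += 1 }
      n
    })
  // Combine the partial counts on the driver. Executors never exchange
  // data with each other, hence no shuffle stage in the DAG.
  perPartition.sum
}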

When you count a DataFrame, Spark first computes a partial count inside each partition and then merges those partial results into one final value. Bringing the partial counts together on a single partition is the exchange, i.e. the shuffle you are seeing.

You can see this in the SQL tab of the Spark UI; the physical plan for the count looks like this:

*HashAggregate(keys=[], functions=[count(1)], output=[count#202L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#206L])
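For the RDD side of the comparison, the lineage confirms there is nothing to exchange; a quick check (sketch):

println(df.rdd.toDebugString)
// The lineage contains only narrow MapPartitionsRDD steps and no
// ShuffledRDD, matching the single-stage DAG for df.rdd.count.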

Source: https://habr.com/ru/post/1688982/
