Why does dataset.count cause a shuffle? (Spark 2.2)

Here is my dataframe:

[screenshot of the DataFrame omitted]

There are 2 partitions in the underlying RDD:

[screenshots of the two RDD partitions omitted]

When I do df.count, the DAG created is:

[screenshot of the DAG omitted]

When I do df.rdd.count, the created DAG is:

[screenshot of the DAG omitted]

Question: count is an action in Spark, and the official definition says: "Returns the number of rows in the Dataset." Since I am just counting the rows of a DataFrame, why does a shuffle happen? And why, when I do the same count on the underlying RDD, is there no shuffle?

It makes no sense to me that a shuffle should occur at all. I tried to look at the source code of count in the Spark repo on GitHub, but that didn't clear things up for me either. Is the groupBy in there the culprit?

P.S. df.coalesce(1).count
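A minimal sketch that reproduces the two DAGs (the sample data and session setup here are illustrative, not from the screenshots above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("count-vs-rdd-count")
  .getOrCreate()
import spark.implicits._

// A DataFrame whose underlying RDD has 2 partitions, as above.
val df = spark.sparkContext
  .parallelize(Seq(1, 2, 3, 4), numSlices = 2)
  .toDF("value")

df.count()      // DAG has two stages joined by an Exchange (shuffle)
df.rdd.count()  // DAG has a single stage, no shuffle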


Because DataFrame's count is implemented via groupBy, and that is where the shuffle comes from. From Dataset.scala: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

/**
 * Returns the number of rows in the Dataset.
 * @group action
 * @since 1.6.0
 */
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}
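Since count() is literally a key-less groupBy().count(), you can surface the shuffle yourself with explain(); a sketch, where df is any DataFrame and the output is trimmed to the operators that matter:

// The same physical plan that df.count() executes:
df.groupBy().count().explain()
// *HashAggregate(keys=[], functions=[count(1)])
// +- Exchange SinglePartition              <- the shuffle
//    +- *HashAggregate(keys=[], functions=[partial_count(1)])
//       +- ... (scan of df)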

RDD's count, by contrast, runs one job that counts the elements inside each partition and then combines the per-partition results on the driver with .sum, so no shuffle is involved.

Source: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
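Utils.getIteratorSize just walks an iterator and counts its elements, so conceptually the whole thing amounts to the sketch below (countNoShuffle is a hypothetical helper, not part of Spark; only one Long per partition ever travels to the driver):

import org.apache.spark.rdd.RDD

// Conceptual re-implementation of rdd.count (hypothetical helper):
def countNoShuffle[T](rdd: RDD[T]): Long = {
  // Count each partition locally on the executors; only one Long per
  // partition is sent back to the driver.
  val perPartition: Array[Long] =
    rdd.sparkContext.runJob(rdd, (it: Iterator[T]) => {
      var n = 0L
      while (it.hasNext) { it.next(); n += 1 }
      n
    })
  // Combine the partial counts on the driver. Executors never exchange
  // data with each other, hence no shuffle stage in the DAG.
  perPartition.sum
}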

When you count a DataFrame, Spark first computes a partial count inside each partition and then merges those partial results into one final value. Bringing the partial counts together on a single partition is the exchange, i.e. the shuffle you are seeing.

You can see this in the SQL tab of the Spark UI; the physical plan for the count looks like this:

*HashAggregate(keys=[], functions=[count(1)], output=[count#202L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#206L])
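For the RDD side of the comparison, the lineage confirms there is nothing to exchange; a quick check (sketch):

println(df.rdd.toDebugString)
// The lineage contains only narrow MapPartitionsRDD steps and no
// ShuffledRDD, matching the single-stage DAG for df.rdd.count.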

Source: https://habr.com/ru/post/1688982/
