Here is my dataframe:
The underlying RDD has 2 partitions.
When I do df.count, the created DAG is:
When I do df.rdd.count, the created DAG is:
Ques. count is an action in Spark; the official definition is "Returns the number of rows in a DataFrame." Now, when I call count on a DataFrame, why does a shuffle happen? Also, when I do the same on the underlying RDD, there is no shuffle.
It makes no sense to me why a shuffle would occur at all. I tried to read the source code of count here (spark github), but I could not make sense of it. Is the "groupBy" used by the action the culprit?
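To make the comparison easier to reproduce, here is a minimal sketch that prints the plans behind both counts. It assumes a local SparkSession and uses hypothetical stand-in data built with `spark.range` in 2 partitions (my real dataframe is not shown here). As far as I can tell from the Dataset source, count runs `groupBy().count()` internally, so the sketch explains that query directly; the object name and the `value` column are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object CountShuffleRepro {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session with 2 cores; the post does not show the real setup.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("count-shuffle-repro")
      .getOrCreate()

    // Hypothetical stand-in data: 4 rows spread over 2 partitions, created without a shuffle.
    val df = spark.range(0L, 4L, 1L, 2).toDF("value")
    println(s"partitions: ${df.rdd.getNumPartitions}")

    // Physical plan of the global aggregation that count appears to use.
    // Look for an Exchange node between the partial and final aggregates.
    df.groupBy().count().explain()

    // Lineage of the underlying RDD, for comparison with the RDD-level count.
    println(df.rdd.toDebugString)

    // Both actions return the same number; only the execution plan differs.
    println(s"df.count     = ${df.count()}")
    println(s"df.rdd.count = ${df.rdd.count()}")

    spark.stop()
  }
}
```

If the explained plan shows a partial aggregate per partition, then an exchange to a single partition, then a final aggregate, that would account for the extra stage in the df.count DAG, while df.rdd.count sums the per-partition counts on the driver.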
PS. df.coalesce(1).count