Partitioning and joining Apache Spark RDDs

Question

When I join two RDDs, where does the actual data go? That is, is the data aggregated on the driver and then sent back to the worker nodes, or is one of the nodes randomly selected to "receive" the data? Also, if I call partition on a pair RDD, is it automatically partitioned by key?

apache-spark rdd

monster Apr 26 '15 at 0:16

1 answer

Sean Owen · Accepted Answer · 2015-04-26T00:43:25+0000

No, the join does not go through the driver or any single node. A shuffle occurs, in which each of many tasks on the executors collects all the values (from both parents) for a subset of the keys. Each task then forms the join product for each key as it iterates over them. As for partitioning: yes, it is by key. Joining two identically partitioned RDDs is beneficial because it avoids the shuffle.
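To make the idea concrete, here is a minimal sketch in plain Python that simulates a key-partitioned join. It is not Spark's implementation and uses no Spark API; the function names (partition_by_key, join_partition, shuffle_join) and the sample data are hypothetical, chosen only for illustration. It assumes simple hash partitioning, so that each "task" only ever sees the keys that hash to its partition.

```python
from collections import defaultdict


def partition_by_key(pairs, num_partitions):
    """Route each (key, value) pair to the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_partition(left_part, right_part):
    """One simulated task: join the pairs of a single partition, key by key."""
    right_by_key = defaultdict(list)
    for key, value in right_part:
        right_by_key[key].append(value)
    # Emit every (left value, right value) combination for each matching key.
    return [
        (key, (left_value, right_value))
        for key, left_value in left_part
        for right_value in right_by_key.get(key, [])
    ]


def shuffle_join(left, right, num_partitions=3):
    """Partition both inputs the same way, then join partition i with partition i."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    # Result partition i depends only on partition i of each input,
    # so no single node ever has to hold all of the data.
    return [join_partition(l, r) for l, r in zip(left_parts, right_parts)]


# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "mug")]

joined = shuffle_join(users, orders)
result = sorted(pair for part in joined for pair in part)
print(result)
```

If both inputs were already split with the same partitioning function and the same number of partitions, the two partition_by_key calls would be redundant and only the per-partition join would remain. That is the situation the answer describes as avoiding the shuffle. In Spark itself, the corresponding operations on pair RDDs are partitionBy and join.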
