Partitioning and joining Apache Spark RDDs

When I join two RDDs, where does the actual data end up? Is it aggregated on the driver and then sent back to the worker nodes, or is one of the nodes randomly selected to "receive" the data? Also, if I call partition on a pair RDD, is it automatically partitioned by key?

1 answer

No, the join does not go through the driver or through any single node. A shuffle takes place, in which each of the many tasks running on the executors collects all the values (from both parent RDDs) for a subset of the keys. Each task builds the join product for every key as it iterates over them. Partitioning is done by key. Joining two RDDs that are partitioned the same way is beneficial, because it avoids the shuffle.
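To make this concrete, here is a minimal plain-Python sketch of the idea. It is not Spark code and not Spark's actual implementation: it only imitates the shuffle, meaning the redistribution of records by key across tasks. The names partition_of, shuffle, join_task and simulated_join, the partition count, and the sample data are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import product

NUM_PARTITIONS = 3  # hypothetical number of join tasks


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Hash partitioning: a given key always maps to the same partition index.
    return hash(key) % num_partitions


def shuffle(records, num_partitions=NUM_PARTITIONS):
    # Route every (key, value) record to the bucket that owns its key.
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition_of(key, num_partitions)][key].append(value)
    return buckets


def join_task(left_bucket, right_bucket):
    # One task: for each key it owns, pair every left value with every right value.
    out = []
    for key, left_values in left_bucket.items():
        for lv, rv in product(left_values, right_bucket.get(key, [])):
            out.append((key, (lv, rv)))
    return out


def simulated_join(left, right, num_partitions=NUM_PARTITIONS):
    # Both inputs are shuffled with the same partitioning function, so
    # bucket i of the left side holds exactly the keys of bucket i on the right.
    left_buckets = shuffle(left, num_partitions)
    right_buckets = shuffle(right, num_partitions)
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(join_task(lb, rb))
    return result


# Hypothetical sample data keyed by integer IDs.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
joined = sorted(simulated_join(users, orders))
```

No single bucket ever sees all the data; each one handles only its own keys. If both inputs were already laid out by the same partitioning function and partition count (in Spark, for example, after calling partitionBy with the same partitioner on both pair RDDs), the shuffle step would be redundant and only the per-partition join_task work would remain.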


Source: https://habr.com/ru/post/1584681/

