Spark: force two RDD[Key, Value] with co-located partitions using a custom partitioner

I have two RDD[K, V], where K = Long and V = Object. Let's call them rdd1 and rdd2. Both use the same custom Partitioner. I am trying to find a way to take a union or join while avoiding or minimizing data movement.

val kafkaRdd1 = /* from kafka sources */
val kafkaRdd2 = /* from kafka sources */

val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))

val rdd3 = rdd1.union(rdd2)         // without shuffle
val rdd4 = rdd1.leftOuterJoin(rdd2) // without shuffle

Is it possible to assume (or is there a way to ensure) that the nth partition of both rdd1 and rdd2 ends up on the same worker node?

+4
1 answer

* Colocation of partitions is not guaranteed by Spark in general. What Spark does offer is PartitionerAwareUnionRDD, which unions RDDs that share the same partitioner without a shuffle; its getPreferredLocations picks, for each output partition, the host holding most of the corresponding parent partitions (see the sketches after these notes).


* For details, see High Performance Spark by Holden Karau and Rachel Warren, which distinguishes co-partitioned from co-located RDDs.

In short, RDDs that share a partitioner are co-partitioned, but they are co-located only if they were shuffled with that partitioner as part of the same action and their partitions are still cached on the same executors. Co-partitioning is enough to make the union and the join shuffle-free, but it does not by itself pin the nth partition of both RDDs to one node.
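For the union in the question, recent Spark releases apply this optimization automatically: SparkContext.union returns a PartitionerAwareUnionRDD whenever all non-empty inputs define the same partitioner. Below is a minimal sketch, runnable in local mode; HashPartitioner and the inline data are stand-ins for the question's MyCustomPartitioner and Kafka sources:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("colocation-demo").setMaster("local[4]"))

// Stand-ins for kafkaRdd1/kafkaRdd2, both hashed by one shared partitioner instance
val partitioner = new HashPartitioner(24)
val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(partitioner)
val rdd2 = sc.parallelize(Seq(1L -> "x", 3L -> "y")).partitionBy(partitioner)

// Same partitioner on both sides, so union builds a PartitionerAwareUnionRDD:
// the partitioner survives and the result keeps 24 partitions instead of 48
val rdd3 = rdd1.union(rdd2)
println(rdd3.partitioner.isDefined)  // true
println(rdd3.getNumPartitions)       // 24

Co-partitioning also keeps the join shuffle-free: when both sides of leftOuterJoin already share a partitioner, the underlying cogroup uses narrow one-to-one dependencies, so no new shuffle stage is added. Continuing the sketch:

// With matching partitioners the cogroup behind leftOuterJoin is narrow;
// the lineage shows no new ShuffledRDD beyond the two earlier partitionBy calls
val rdd4 = rdd1.leftOuterJoin(rdd2)
println(rdd4.partitioner.isDefined)  // true: the shared partitioner is reused
println(rdd4.toDebugString)

One caveat: Spark compares partitioners with equals when deciding whether a dependency can be narrow, so a custom partitioner like MyCustomPartitioner should override equals and hashCode; otherwise two separately constructed instances will never match and the join will shuffle anyway.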

+7

Source: https://habr.com/ru/post/1627798/

