Spark: force two RDD[Key, Value] with co-located partitions using a custom partitioner

I have two RDD[K, V], where K = Long and V = Object. Let's call them rdd1 and rdd2. Both use the same custom Partitioner. I am trying to find a way to take a union or join while avoiding or minimizing data movement.

val kafkaRdd1 = /* from kafka sources */
val kafkaRdd2 = /* from kafka sources */

val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))

val rdd3 = rdd1.union(rdd2)         // without shuffle
val rdd4 = rdd1.leftOuterJoin(rdd2) // without shuffle

Is it possible to assume (or is there a way to ensure) that the nth partition of both rdd1 and rdd2 ends up on the same worker node?

+4
1 answer

* Colocation of partitions is not guaranteed by Spark in general. What Spark does offer is PartitionerAwareUnionRDD, which unions RDDs that share the same partitioner without a shuffle; its getPreferredLocations picks, for each output partition, the host holding most of the corresponding parent partitions (see the sketches after these notes).


* For details, see High Performance Spark by Holden Karau and Rachel Warren, which distinguishes co-partitioned from co-located RDDs.

In short, RDDs that share a partitioner are co-partitioned, but they are co-located only if they were shuffled with that partitioner as part of the same action and their partitions are still cached on the same executors. Co-partitioning is enough to make the union and the join shuffle-free, but it does not by itself pin the nth partition of both RDDs to one node.
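For the union in the question, recent Spark releases apply this optimization automatically: SparkContext.union returns a PartitionerAwareUnionRDD whenever all non-empty inputs define the same partitioner. Below is a minimal sketch, runnable in local mode; HashPartitioner and the inline data are stand-ins for the question's MyCustomPartitioner and Kafka sources:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("colocation-demo").setMaster("local[4]"))

// Stand-ins for kafkaRdd1/kafkaRdd2, both hashed by one shared partitioner instance
val partitioner = new HashPartitioner(24)
val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(partitioner)
val rdd2 = sc.parallelize(Seq(1L -> "x", 3L -> "y")).partitionBy(partitioner)

// Same partitioner on both sides, so union builds a PartitionerAwareUnionRDD:
// the partitioner survives and the result keeps 24 partitions instead of 48
val rdd3 = rdd1.union(rdd2)
println(rdd3.partitioner.isDefined)  // true
println(rdd3.getNumPartitions)       // 24

Co-partitioning also keeps the join shuffle-free: when both sides of leftOuterJoin already share a partitioner, the underlying cogroup uses narrow one-to-one dependencies, so no new shuffle stage is added. Continuing the sketch:

// With matching partitioners the cogroup behind leftOuterJoin is narrow;
// the lineage shows no new ShuffledRDD beyond the two earlier partitionBy calls
val rdd4 = rdd1.leftOuterJoin(rdd2)
println(rdd4.partitioner.isDefined)  // true: the shared partitioner is reused
println(rdd4.toDebugString)

One caveat: Spark compares partitioners with equals when deciding whether a dependency can be narrow, so a custom partitioner like MyCustomPartitioner should override equals and hashCode; otherwise two separately constructed instances will never match and the join will shuffle anyway.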

+7

Source: https://habr.com/ru/post/1627798/

