How to join two RDDs in Spark with Python?

Suppose that

 rdd1 = ( (a, 1), (a, 2), (b, 1) ) and rdd2 = ( (a, ?), (a, *), (c, .) ).

I want to produce

 ( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ).

Is there a simple way to do this? I think it is different from a cross (Cartesian) product, but I cannot find a good solution. My current solution:

 (rdd1
  .cartesian(rdd2)                                    # every (left, right) pair across the two RDDs
  .filter(lambda kv: kv[0][0] == kv[1][0])            # keep only pairs whose keys match
  .map(lambda kv: (kv[0][0], (kv[0][1], kv[1][1]))))  # reshape to (key, (left value, right value))
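For reference, here is a runnable sketch of that approach (a minimal example assuming local mode; the strings "?", "*", "." stand in for the placeholder values above):

 from pyspark import SparkContext

 sc = SparkContext("local", "cartesian-join")
 rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 1)])
 rdd2 = sc.parallelize([("a", "?"), ("a", "*"), ("c", ".")])
 pairs = (rdd1
          .cartesian(rdd2)                                    # all combinations
          .filter(lambda kv: kv[0][0] == kv[1][0])            # matching keys only
          .map(lambda kv: (kv[0][0], (kv[0][1], kv[1][1]))))  # (key, (v1, v2))
 print(pairs.collect())
 # [('a', (1, '?')), ('a', (1, '*')), ('a', (2, '?')), ('a', (2, '*'))]  (order may vary)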
1 answer

You are just looking for a simple join:

 rdd = sc.parallelize([("red", 20), ("red", 30), ("blue", 100)])
 rdd2 = sc.parallelize([("red", 40), ("red", 50), ("yellow", 10000)])
 rdd.join(rdd2).collect()
 # Gives [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]
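Applied to the data from the question (a quick sketch assuming the same SparkContext sc, again with strings for the ?, *, . placeholders), join gives exactly the desired result. Unlike cartesian followed by a filter, join only combines values that already share a key, so it never materializes the full cross product:

 rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 1)])
 rdd2 = sc.parallelize([("a", "?"), ("a", "*"), ("c", ".")])
 rdd1.join(rdd2).collect()
 # [('a', (1, '?')), ('a', (1, '*')), ('a', (2, '?')), ('a', (2, '*'))]  (order may vary)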

Source: https://habr.com/ru/post/989553/

