How to find the intersection of two RDD keys in PySpark?

I have two RDDs:

rdd1 = sc.parallelize([("www.page1.html", "word1"), ("www.page2.html", "word1"), 
    ("www.page1.html", "word3")])

rdd2 = sc.parallelize([("www.page1.html", 7.3), ("www.page2.html", 1.25), 
    ("www.page3.html", 5.41)])

intersection_rdd = rdd1.keys().intersection(rdd2.keys())       

When I do this, I only get the intersection of the keys, i.e. ['www.page1.html', 'www.page2.html'].

But I need the keys together with the values from both RDDs. The result should look like this:

[www.page1.html, (word1, word3, 7.3)]

[www.page2.html, (word1, 1.25)]
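
For reference, a minimal self-contained script reproducing the setup (a sketch; the explicit SparkContext is an assumption, since the pyspark shell already provides sc):

from pyspark import SparkContext

## Only needed in a standalone script; the pyspark shell provides sc.
sc = SparkContext("local[*]", "key-intersection")

rdd1 = sc.parallelize([("www.page1.html", "word1"), ("www.page2.html", "word1"),
    ("www.page1.html", "word3")])
rdd2 = sc.parallelize([("www.page1.html", 7.3), ("www.page2.html", 1.25),
    ("www.page3.html", 5.41)])

## Intersecting only the keys drops the values:
print(rdd1.keys().intersection(rdd2.keys()).collect())
## ['www.page1.html', 'www.page2.html']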
2 answers

You can, for example, cogroup and then filter:

## This depends on an empty pyspark.resultiterable.ResultIterable
## evaluating to False

intersection_rdd = rdd1.cogroup(rdd2).filter(lambda x: x[1][0] and x[1][1])
intersection_rdd.map(lambda x: (x[0], (list(x[1][0]), list(x[1][1])))).collect()

## [('www.page1.html', (['word1', 'word3'], [7.3])),
##  ('www.page2.html', (['word1'], [1.25]))]
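
If you want the two value lists flattened into a single tuple per key, as in the question's expected output, one more map over the same intersection_rdd does it (a small sketch, not part of the original answer):

## Merge the grouped words and scores into one flat tuple per key.
intersection_rdd.map(
    lambda x: (x[0], tuple(list(x[1][0]) + list(x[1][1])))
).collect()

## [('www.page1.html', ('word1', 'word3', 7.3)),
##  ('www.page2.html', ('word1', 1.25))]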

Since you apply the set operation (intersection) only to the keys, your output contains only the keys. Instead, you can union the two RDDs and group by key:

rdd1.union(rdd2).groupByKey().mapValues(tuple).collect()

"GroupByKey


('www.page1.html', 'word1')                         ('www.page1.html', ['word1', 'word3', 7.3])
('www.page2.html', 'word1')                       ('www.page2.html', ['word1', 1.25])
('www.page1.html', 'word3')                       ('www.page3.html', [5.41])
('www.page1.html', 7.3)
('www.page2.html', 1.25)
('www.page3.html', 5.41)
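
Note that union + groupByKey also keeps keys that occur in only one RDD, such as www.page3.html above. If only the shared keys are wanted, one option (a sketch, not from the original answer) is to group rdd1 and join it with rdd2, since join keeps exactly the keys present in both:

## Group the words per page, keep only pages that also have a score,
## then merge the words and the score into one tuple.
result = (rdd1.groupByKey()
              .mapValues(list)
              .join(rdd2)
              .mapValues(lambda v: tuple(v[0] + [v[1]])))
result.collect()

## [('www.page1.html', ('word1', 'word3', 7.3)),
##  ('www.page2.html', ('word1', 1.25))]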

