Counting individual texts in Spark RDD with array objects

Question

I have a Spark RDD (words) that consists of arrays of texts. For example,

words.take(3)

will return something like:

[ ["A", "B"], ["B", "C"], ["C", "A", "D"] ]

Now I want to find the total number of texts, as well as the number of unique texts. If these 3 entries were the only ones in the RDD, then

total_words = 7
unique_words = 4 (only A, B, C, D)

Now, to get the total, I did something like this:

text_count_rdd = words.map(lambda x: len(x))
text_count_rdd.sum()

But I am stuck on how to get the unique count.

rclakmal Feb 11 '16 at 18:03

1 answer

zero323 · Answer 1 · 2016-02-11T18:30:12+0000

Just flatMap, take distinct, and count:

words.flatMap(set).distinct().count()
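
To check what this pipeline computes without starting Spark, here is a minimal plain-Python sketch on the sample data from the question. It is only an illustration: the list `words` stands in for the RDD, and the variable names `flattened`, `unique_words`, and `total_words` are chosen here, not taken from the answer.

```python
from itertools import chain

# Sample data from the question; in Spark this would be the contents of the `words` RDD.
words = [["A", "B"], ["B", "C"], ["C", "A", "D"]]

# Rough equivalent of words.flatMap(set):
# remove duplicates within each array, then flatten everything into one sequence.
flattened = list(chain.from_iterable(set(row) for row in words))

# Rough equivalent of .distinct().count():
# remove duplicates across arrays and count what is left.
unique_words = len(set(flattened))

# Rough equivalent of words.map(lambda x: len(x)).sum() from the question.
total_words = sum(len(row) for row in words)

print(total_words, unique_words)  # prints "7 4"
```

Passing set to flatMap only removes duplicates inside each array before the global distinct step; flattening the arrays unchanged (for example with lambda x: x) should give the same unique count.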