I have a Spark RDD (words) that consists of arrays of texts. For example,
words.take(3)
will return something like:
[ ["A", "B"], ["B", "C"], ["C", "A", "D"] ]
Now I want to find the total number of texts, as well as the number of unique texts. If the RDD contained only the 3 entries above, then
total_words = 7
unique_words = 4 (only A, B, C, D)
Now, to get the total, I did something like this:
text_count_rdd = words.map(lambda x: len(x))
text_count_rdd.sum()
But I am stuck on how to get the unique count.
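To make the expected numbers concrete, here is a minimal sketch in plain Python that runs on the three sample rows without a Spark cluster. The names sample_rows and flat_texts are only illustrative. On the RDD itself, the analogous calls would presumably be words.flatMap(lambda x: x) to flatten, followed by count() for the total and distinct().count() for the unique count, but I have not verified that on the real data.

```python
from itertools import chain

# Sample rows copied from the words.take(3) output above.
sample_rows = [["A", "B"], ["B", "C"], ["C", "A", "D"]]

# Flatten the arrays into a single list of texts
# (assumed RDD analogue: words.flatMap(lambda x: x)).
flat_texts = list(chain.from_iterable(sample_rows))

# Total number of texts (assumed RDD analogue: .count()).
total_words = len(flat_texts)

# Number of distinct texts (assumed RDD analogue: .distinct().count()).
unique_words = len(set(flat_texts))

print(total_words, unique_words)  # prints: 7 4
```

Flattening first means both counts come from the same sequence of texts, rather than from per-array lengths.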