Counting total and unique strings in a Spark RDD of arrays

I have an RDD (words) in Spark that consists of arrays of strings. For example,

words.take(3)

will return something like:

[ ["A", "B"], ["B", "C"], ["C", "A", "D"] ]

Now I want to find the total number of strings, as well as the number of unique strings. If these 3 entries were the only ones in the RDD,

total_words = 7
unique_words = 4 (only A, B, C, D)

Now, to get the total, I did something like this:

text_count_rdd = words.map(lambda x: len(x))
text_count_rdd.sum()

But I am stuck on how to get the unique count.

1 answer

Just flatMap, take distinct, and count:

words.flatMap(set).distinct().count()
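
To sanity-check the expected numbers without a cluster, here is a minimal plain-Python sketch of what the two pipelines compute on the sample data from the question. It does not use Spark: the list rows is a hypothetical local stand-in for the words RDD, and the itertools calls only mimic the semantics of flatMap, distinct and count.

```python
from itertools import chain

# Sample data copied from the question; a plain list stands in for the RDD.
rows = [["A", "B"], ["B", "C"], ["C", "A", "D"]]

# Analogue of words.map(lambda x: len(x)).sum():
# add up the length of every array.
total_words = sum(len(row) for row in rows)

# Analogue of words.flatMap(set).distinct().count():
# turn each row into a set, flatten all sets into one stream,
# deduplicate across rows, then count what is left.
unique_words = len(set(chain.from_iterable(set(row) for row in rows)))

print(total_words, unique_words)
```

Passing set to flatMap only removes duplicates inside each array before the global distinct step; flattening the arrays unchanged and then calling distinct would give the same count.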

Source: https://habr.com/ru/post/1628363/

