I would like to group by value and then find the maximum value in each group using PySpark. I have the following code, but now I am a little stuck on how to extract the maximum value.
# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')
After grouping by user, it returns something like:
[[(u'u1', u's1', 20), (u'u1', u's2', 5)], [(u'u2', u's3', 5), (u'u2', u's2', 10)]]
I now want to find the triplet with the maximum occurrences for each user. After taking the max, the resulting RDD should look like this:
[[(u'u1', u's1', 20)], [(u'u2', u's2', 10)]]
Only the triplet with the maximum occurrences should remain for each user in the file. In other words, I want to change the RDD so that each user's value contains a single triplet: the one with the highest number of occurrences.
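For reference, one possible way to get this result is to key each triplet by user and reduce each key down to the triplet with the largest count. The following is only a minimal sketch, assuming the RDD already holds ('user', 'item', 'occurrences') triplets with a numeric third field. The names keep_larger, max_per_user_rdd and max_per_user_local are made up for illustration, and the plain-Python version is there only so the logic can be checked without a Spark cluster.

```python
def keep_larger(a, b):
    """Return whichever triplet has the larger occurrence count (index 2)."""
    return a if a[2] >= b[2] else b


def max_per_user_rdd(rdd):
    """PySpark version (hypothetical helper): rdd holds (user, item, occurrences)."""
    return (rdd.map(lambda r: (r[0], r))   # key each triplet by its user
               .reduceByKey(keep_larger)   # keep only the max triplet per user
               .values())                  # drop the key, leaving the triplets


def max_per_user_local(records):
    """Plain-Python equivalent of the reduction above, for checking the logic."""
    best = {}
    for r in records:
        user = r[0]
        best[user] = keep_larger(best[user], r) if user in best else r
    return list(best.values())


# Sample data taken from the grouped output shown in the question.
sample = [("u1", "s1", 20), ("u1", "s2", 5), ("u2", "s3", 5), ("u2", "s2", 10)]
print(max_per_user_local(sample))
```

This yields flat triplets rather than one-element lists. If the nested shape from the question is needed, a final .map(lambda t: [t]) on the RDD should produce it. Reducing per key also avoids building the full group for each user first, which groupBy would do.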