Suppose I have a data set with a volume of about 2.1 billion records.
This dataset contains customer information, and I want to know how many times each customer has performed a certain action. So I have to group by the identifier and sum one column (it has 0 and 1 values, where 1 indicates the action).
Now I can use a simple groupBy and agg(sum), but in my opinion this is not very efficient: groupBy will shuffle a lot of data between partitions.
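Here is a minimal sketch of what I mean by the groupBy approach, assuming hypothetical column names customer_id for the identifier and action_flag for the 0/1 column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder source; the real table has ~2.1 billion rows.
df = spark.read.parquet("/path/to/customer_actions")

# One row per customer with the total number of actions.
counts = (
    df.groupBy("customer_id")
      .agg(F.sum("action_flag").alias("action_count"))
)
```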
Alternatively, I can use a Window function with a partitionBy clause and then sum over the window. One drawback is that I have to apply an additional deduplication step afterwards, because the window keeps every input row, and I only want one record per ID. Something like the sketch below.
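This is roughly the window variant, using the same assumed column names as above:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("customer_id")

windowed = (
    df.withColumn("action_count", F.sum("action_flag").over(w))
      # The window preserves all input rows, so reduce to one
      # record per customer_id afterwards.
      .select("customer_id", "action_count")
      .dropDuplicates(["customer_id"])
)
```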
But I do not see how the window actually processes the data. Is it better than the groupBy and sum approach, or is it essentially the same thing?