PySpark Window.partitionBy vs groupBy

Suppose I have a dataset of about 2.1 billion records.

It is a dataset with customer information, and I want to know how many times each customer has performed a certain action. So I have to group by the customer identifier and sum one column (it contains 0 and 1 values, where 1 indicates the action).

Now, I could use a simple groupBy with agg(sum), but in my opinion this is not very efficient: groupBy will move a lot of data between partitions.
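
For concreteness, this is roughly what I mean; the column names customer_id and action_flag and the input path are just placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("action-counts").getOrCreate()

    # One row per event: a customer identifier plus a 0/1 flag
    # marking whether the action happened (placeholder schema).
    df = spark.read.parquet("/path/to/events")

    # groupBy + sum: produces exactly one output row per customer.
    action_counts = (
        df.groupBy("customer_id")
          .agg(F.sum("action_flag").alias("action_count"))
    )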

Alternatively, I could use a window function with a partitionBy clause and then sum over the window. One drawback is that the window keeps every row, so I would have to apply an additional filter, because I want only one record per ID.
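
The window version I have in mind, with the same placeholder names, would look roughly like this:

    from pyspark.sql import Window

    w = Window.partitionBy("customer_id")

    # The windowed sum is attached to every input row, so an extra
    # step (dropDuplicates here) is needed to keep one row per ID.
    windowed = (
        df.withColumn("action_count", F.sum("action_flag").over(w))
          .select("customer_id", "action_count")
          .dropDuplicates(["customer_id"])
    )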

But I do not see how such a window processes the data. Is it better than groupBy plus sum, or is it the same thing?

1 answer

As far as I know, when working with Spark DataFrames, the groupBy operation is optimized by Catalyst. groupBy on a DataFrame is not the same as groupBy on an RDD.

For example, groupBy on a DataFrame first performs partial aggregation within each partition and then shuffles the partial results for the final aggregation phase. Consequently, only the partially aggregated results are shuffled, not all of the data. This is similar to reduceByKey or aggregateByKey on an RDD. See this related SO answer for a good example.
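
You can see this two-phase aggregation by printing the physical plan of the groupBy query from the question. A rough illustration (the exact plan text varies by Spark version):

    # A partial HashAggregate runs before the Exchange (shuffle) and a
    # final HashAggregate runs after it, so only per-partition partial
    # sums cross the network.
    action_counts.explain()
    # == Physical Plan ==  (abridged)
    # HashAggregate(keys=[customer_id], functions=[sum(action_flag)])
    # +- Exchange hashpartitioning(customer_id, 200)
    #    +- HashAggregate(keys=[customer_id], functions=[partial_sum(action_flag)])
    #       +- ...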

Also, see slide 5 in this presentation by Yin Huai, which covers the benefits of using DataFrames in conjunction with Catalyst.

In conclusion, I think you are better off using groupBy when working with Spark DataFrames. Using Window does not seem appropriate for your requirement.

