Multiple PySpark data frame aggregation criteria

I have a PySpark data frame that looks like this:

+-------------+----------+
|          sku|      date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+

I want to group by sku and then calculate the minimum and maximum dates. If I do this:

df_testing.groupBy('sku') \
    .agg({'date': 'min', 'date':'max'}) \
    .limit(10) \
    .show()

The behaviour is the same as in Pandas: I only get the sku and max(date) columns, because the duplicate 'date' key keeps only the last aggregation (see the small demonstration after the error message below). In Pandas I would normally do the following to get the results I want:

df_testing.groupBy('sku') \
    .agg({'date': ['min', 'max']}) \
    .limit(10) \
    .show()

However, this does not work in PySpark, and I get the error java.util.ArrayList cannot be cast to java.lang.String. Can someone tell me the correct syntax?
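
As an aside, a quick check in plain Python (outside Spark) shows why the first attempt only returns max(date): a dict literal keeps just the last value for a duplicate key, so .agg() only ever receives one aggregation for 'date'.

>>> {'date': 'min', 'date': 'max'}
{'date': 'max'}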

Thanks.

1 answer

You cannot use a dict for this: agg() expects each dict value to be a single aggregate function name given as a string, so passing a list is what triggers the ArrayList cast error, and duplicate 'date' keys would collapse to one entry anyway. Use the column functions instead:

>>> from pyspark.sql import functions as F
>>>
>>> df_testing.groupBy('sku').agg(F.min('date'), F.max('date'))
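
If tidier column names are wanted, here is a minimal sketch along the same lines, assuming the df_testing frame from the question (the min_date / max_date aliases are just illustrative names):

>>> # alias() renames the aggregated columns in the result
>>> df_testing.groupBy('sku') \
...     .agg(F.min('date').alias('min_date'),
...          F.max('date').alias('max_date')) \
...     .show()

One caveat worth checking: if date is stored as a string in that dd/MM/yyyy layout, min and max compare the text lexicographically rather than chronologically, so converting it first (for example with F.to_date) may be necessary.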
