The problem with the mode is much the same as with the median. Although it is easy to compute, the computation is rather expensive. It can be done either with a sort followed by local and global aggregations, or with a plain word-count-style aggregation and a filter:
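    import numpy as np
    # Note: this import shadows Python's built-in max with the Spark aggregate.
    from pyspark.sql.functions import col, max

    np.random.seed(1)

    # `sc` is assumed to be an existing SparkContext (as in the pyspark shell).
    df = sc.parallelize([
        (int(x), ) for x in np.random.randint(50, size=10000)
    ]).toDF(["x"])

    # Count occurrences of each value, then keep a row whose count equals the maximum.
    cnts = df.groupBy("x").count()
    mode = cnts.join(
        cnts.agg(max("count").alias("max_")), col("count") == col("max_")
    ).limit(1).select("x")
    mode.first()[0]
    ## 0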
In either case, a complete shuffle may be required for each column.
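The first alternative mentioned above (a sort followed by local and global aggregations) can be illustrated without Spark. The following is only a rough sketch in plain Python: the names `local_runs` and `mode_via_sort` are hypothetical, the list slices merely stand in for sorted partitions, and breaking ties by taking the smallest value is an arbitrary assumption, not something the Spark snippet guarantees.

```python
from collections import Counter
from itertools import groupby


def local_runs(partition):
    """Local aggregation: collapse a sorted partition into (value, count) runs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(partition)]


def mode_via_sort(values, num_partitions=4):
    """Hypothetical helper: mode via sort, per-partition runs, then a global merge."""
    # Sort once, then split into contiguous chunks (stand-ins for sorted partitions).
    ordered = sorted(values)
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    partitions = [ordered[i:i + size] for i in range(0, len(ordered), size)]

    # Global aggregation: a value that straddles a partition boundary appears
    # in two run lists, so its partial counts are summed here.
    totals = Counter()
    for partition in partitions:
        for value, count in local_runs(partition):
            totals[value] += count

    if not totals:
        return None
    best = max(totals.values())
    # Assumption: ties are broken by returning the smallest value.
    return min(v for v, c in totals.items() if c == best)
```

For example, `mode_via_sort([3, 1, 2, 3, 2, 3])` returns `3`. In a real cluster the sort itself is what forces the data movement, which is why this route is no cheaper than the count-and-filter one.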