Computing many exact aggregations can be quite expensive. I suggest using approximation methods instead. In this case, an approximate distinct count:
val df = Seq((1, 3, 4), (1, 2, 3), (2, 3, 4), (2, 3, 5)).toDF("col1", "col2", "col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// |                          2|                          2|                          3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
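As a side note, approx_count_distinct also accepts an optional second argument: the maximum allowed relative standard deviation (rsd, 0.05 by default). A smaller rsd trades a larger sketch for better accuracy. A minimal self-contained sketch, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

val spark = SparkSession.builder().master("local[*]").appName("hll-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 3, 4), (1, 2, 3), (2, 3, 4), (2, 3, 5)).toDF("col1", "col2", "col3")

// Tighter rsd (1% instead of the default 5%): a larger sketch, a better estimate.
// For a cardinality this small the estimate is exact.
val approx = df.agg(approx_count_distinct($"col1", 0.01)).first().getLong(0)
println(approx) // 2

spark.stop()
```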
The HyperLogLog algorithm and its variant HyperLogLog++ (the one implemented in Spark) are based on the following clever observation.
If the numbers are uniformly distributed over a range, then the number of distinct elements can be approximated from the largest number of leading zeros in the binary representation of those numbers.
For example, if we observe a number whose binary representation has the form 0…(k times)…01…1, that is, k leading zeros followed by a one, then we can estimate that there are about 2^k distinct elements in the set. This is a very rough estimate, but it can be refined to high precision with a sketching algorithm.
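To make the observation concrete, here is a toy (non-Spark) illustration: hash each element so the values behave like uniform random bit strings, track the maximum run of leading zeros across all hashes, and report 2^k as the estimate. This is only the raw observation, without the bucketing and bias correction that make HyperLogLog accurate; the helper name roughDistinctEstimate is made up for this sketch.

```scala
import scala.util.hashing.MurmurHash3

// Raw HyperLogLog intuition: hashes of distinct items look like uniform
// random bit strings, so the longest run of leading zeros observed grows
// roughly like log2(number of distinct elements).
def roughDistinctEstimate(items: Seq[String]): Long = {
  val maxLeadingZeros = items
    .map(x => Integer.numberOfLeadingZeros(MurmurHash3.stringHash(x)))
    .max
  1L << maxLeadingZeros // estimate: about 2^k distinct elements
}

val estimate = roughDistinctEstimate((1 to 1000).map(i => s"user-$i"))
println(estimate) // a rough power-of-two estimate; the variance is large
```

The large variance of this single estimator is exactly why HyperLogLog splits the hashes into many buckets and averages the per-bucket estimates.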
A detailed explanation of the mechanism underlying this algorithm can be found in the original article.
Note. Starting with Spark 1.6, when Spark executes SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, it has to run a separate aggregation for each DISTINCT clause. This differs from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where everything is aggregated in a single pass. Consequently, the performance of count(distinct(_)) is not comparable to that of approxCountDistinct (or approx_count_distinct).
This is one of the behavior changes since Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query with a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by the Spark 1.5 planner, set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
Link: Approximate algorithms in Apache Spark: HyperLogLog and Quantiles.