Computing many exact aggregations can be quite expensive. I suggest using approximation methods instead. In this case, an approximate distinct count:
val df = Seq((1, 3, 4), (1, 2, 3), (2, 3, 4), (2, 3, 5)).toDF("col1", "col2", "col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// |                          2|                          2|                          3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
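As a side note, approx_count_distinct also accepts an optional second argument: the maximum allowed relative standard deviation (rsd, 0.05 by default). A smaller rsd trades a larger sketch for better accuracy. A minimal self-contained sketch, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

val spark = SparkSession.builder().master("local[*]").appName("hll-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 3, 4), (1, 2, 3), (2, 3, 4), (2, 3, 5)).toDF("col1", "col2", "col3")

// Tighter rsd (1% instead of the default 5%): a larger sketch, a better estimate.
// For a cardinality this small the estimate is exact.
val approx = df.agg(approx_count_distinct($"col1", 0.01)).first().getLong(0)
println(approx) // 2

spark.stop()
```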
The HyperLogLog algorithm and its variant HyperLogLog++ (the one implemented in Spark) are based on the following clever observation.
If the numbers are uniformly distributed over a range, then the number of distinct elements can be approximated from the largest number of leading zeros in the binary representation of those numbers.
For example, if we observe a number whose binary representation has the form 0…(k times)…01…1, that is, k leading zeros followed by a one, then we can estimate that there are about 2^k distinct elements in the set. This is a very rough estimate, but it can be refined to high precision with a sketching algorithm.
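To make the observation concrete, here is a toy (non-Spark) illustration: hash each element so the values behave like uniform random bit strings, track the maximum run of leading zeros across all hashes, and report 2^k as the estimate. This is only the raw observation, without the bucketing and bias correction that make HyperLogLog accurate; the helper name roughDistinctEstimate is made up for this sketch.

```scala
import scala.util.hashing.MurmurHash3

// Raw HyperLogLog intuition: hashes of distinct items look like uniform
// random bit strings, so the longest run of leading zeros observed grows
// roughly like log2(number of distinct elements).
def roughDistinctEstimate(items: Seq[String]): Long = {
  val maxLeadingZeros = items
    .map(x => Integer.numberOfLeadingZeros(MurmurHash3.stringHash(x)))
    .max
  1L << maxLeadingZeros // estimate: about 2^k distinct elements
}

val estimate = roughDistinctEstimate((1 to 1000).map(i => s"user-$i"))
println(estimate) // a rough power-of-two estimate; the variance is large
```

The large variance of this single estimator is exactly why HyperLogLog splits the hashes into many buckets and averages the per-bucket estimates.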
A detailed explanation of the mechanism underlying this algorithm can be found in the original article.
Note. Starting with Spark 1.6, when Spark executes SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, it has to run a separate aggregation for each DISTINCT clause. This differs from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where everything is aggregated in a single pass. Consequently, the performance of count(distinct(_)) is not comparable to that of approxCountDistinct (or approx_count_distinct).
This is one of the behavior changes since Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query with a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by the Spark 1.5 planner, set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
Link: Approximate algorithms in Apache Spark: HyperLogLog and Quantiles.