Spark DataFrame: Count distinct values for each column

The question is pretty much in the title: is there an efficient way to count the distinct values in each column of a DataFrame?

The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.

+20
6 answers

Multiple distinct aggregations would be quite expensive to compute. I suggest using approximate methods instead. In this case, an approximate distinct count:

 val df = Seq((1,3,4), (1,2,3), (2,3,4), (2,3,5)).toDF("col1", "col2", "col3")
 val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap

 df.agg(exprs).show()
 // +---------------------------+---------------------------+---------------------------+
 // |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
 // +---------------------------+---------------------------+---------------------------+
 // |                          2|                          2|                          3|
 // +---------------------------+---------------------------+---------------------------+

The approx_count_distinct method relies on HyperLogLog under the hood.

The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) are based on the following clever observation.

If the numbers are spread uniformly across a range, then the number of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.

For example, if we observe a number whose binary digits are of the form 0…(k times)…01…1, then we can estimate that there are on the order of 2^k elements in the set. This is a very rough estimate, but it can be refined to great precision with a sketching algorithm.

A detailed explanation of the mechanics behind this algorithm can be found in the original paper.

Note: since Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, a separate aggregation must be run for each DISTINCT clause. This differs from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. Thus, the performance of count(distinct(_)) and of approxCountDistinct (or approx_count_distinct) will not be comparable.

This is one of the behavior changes introduced in Spark 1.6:

With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)

Reference: Approximate algorithms in Apache Spark: HyperLogLog and Quantiles.

+30

In PySpark you can do something like this using countDistinct():

 from pyspark.sql.functions import col, countDistinct

 df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))

Similarly in Scala:

 import org.apache.spark.sql.functions.{col, countDistinct}

 df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

If you want to speed things up, at the cost of a potential loss of accuracy, you can also use approxCountDistinct().

+33

If you just want a count for a particular column, the following may help. It's a late reply, but it might help someone. (Tested with PySpark 2.2.0.)

 from pyspark.sql.functions import col, countDistinct

 df.agg(countDistinct(col("colName")).alias("count")).show()
+9

Adding to desaiankitb's answer, this would provide you a more intuitive alternative:

 from pyspark.sql.functions import count

 df.groupBy(colname).count().show() 
+1

You can use the count(column_name) function of SQL.

Alternatively, if you are doing data analysis and want a rough estimate rather than an exact count for each and every column, you can use the approx_count_distinct(expr[, relativeSD]) function.

0

I wanted to get the number of distinct values for multiple columns from a DataFrame using Spark and Java 8.

Input DataFrame (the code needs to handle dynamic columns; columns can be added later):

 +----+----+----+
 |Col1|Col2|Col3|
 +----+----+----+
 |A1|Y|B2|Y|C3|Y|
 |A1|Y|B2|N|C3|Y|
 |A1|Y|B2|Y|C3|N|
 +----+----+----+

Output DataFrame:

 +---------+---------------------+--------------------+
 |Col1     | Col2                | Col3               |
 +---------+---------------------+--------------------+
 |A1|Y - 3 | B2|Y - 2 & B2|N - 1 | C3|Y - 3 & C3|N -1 |
 +---------+---------------------+--------------------+
0

Source: https://habr.com/ru/post/1012757/

