Apache Spark: get the number of records per partition

I want to know how to get information about each partition, for example the total number of records in each partition, on the driver side when a Spark job is submitted with deploy mode "cluster" on YARN, so that I can log it or print it to the console.

+4

3 answers

You can get the number of records per partition like this:

import spark.implicits._  // needed for toDF; assumes spark is the active SparkSession

df
  .rdd
  // emit one (partitionIndex, recordCount) pair per partition
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .toDF("partition_number", "number_of_records")
  .show

But it will also launch a Spark job of its own, because Spark has to read the data in order to count the records in each partition.
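If you need those counts on the driver side (for example to log them when the application is submitted in YARN cluster mode, as in the question), the result is small enough to collect. A minimal sketch, assuming df is the DataFrame in question:

val countsPerPartition = df
  .rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .collect()  // tiny result: one (partitionIndex, recordCount) pair per partition

countsPerPartition.foreach { case (partition, n) =>
  // println goes to the driver's stdout; use your logger instead if preferred
  println(s"partition $partition: $n records")
}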

Spark can also read Hive table statistics, but I don't know how to display that metadata.
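For what it's worth, here is a sketch of one way to look at those statistics (not from the original answer; my_table is a hypothetical Hive table name, and note these are table-level statistics, not per-Spark-partition counts):

spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")               // populate size / row count stats
spark.sql("DESCRIBE EXTENDED my_table").show(100, truncate = false)  // look for the "Statistics" row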

+8

You can also use the built-in spark_partition_id function:

import org.apache.spark.sql.functions.spark_partition_id

// spark_partition_id() tags each row with the id of the partition it belongs to
df.groupBy(spark_partition_id()).count
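For readability, the generated column can be aliased and the output sorted; a small sketch under the same assumptions about df:

df.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()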
+4

Spark 1.5 solution:

(sparkPartitionId() exists in org.apache.spark.sql.functions)

import org.apache.spark.sql.functions._

// tag each row with its partition id, then count per partition
df.withColumn("partitionId", sparkPartitionId()).groupBy("partitionId").count.show

As @Raphael Roth mentioned, mapPartitionsWithIndex is the most general approach: it is based on the RDD API, so it works with all versions of Spark.
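To illustrate, the same idea applied directly to an RDD (a sketch assuming an active SparkContext named sc; the data is made up):

val rdd = sc.parallelize(1 to 100, numSlices = 4)

val counts = rdd
  .mapPartitionsWithIndex { case (index, iter) => Iterator((index, iter.size)) }
  .collect()  // Array[(Int, Int)] of (partitionIndex, recordCount)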

+1

Source: https://habr.com/ru/post/1685008/

