Apache Spark: get the number of records per partition

I want to know how to get information about each partition, for example the total number of records in each partition, on the driver side when a Spark job is submitted with deploy mode "cluster" on YARN, so that I can log it or print it to the console.

+4

3 answers

You can get the number of records per partition like this:

import spark.implicits._  // needed for toDF; assumes spark is the active SparkSession

df
  .rdd
  // emit one (partitionIndex, recordCount) pair per partition
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .toDF("partition_number", "number_of_records")
  .show

But it will also launch a Spark job of its own, because Spark has to read the data in order to count the records in each partition.
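If you need those counts on the driver side (for example to log them when the application is submitted in YARN cluster mode, as in the question), the result is small enough to collect. A minimal sketch, assuming df is the DataFrame in question:

val countsPerPartition = df
  .rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .collect()  // tiny result: one (partitionIndex, recordCount) pair per partition

countsPerPartition.foreach { case (partition, n) =>
  // println goes to the driver's stdout; use your logger instead if preferred
  println(s"partition $partition: $n records")
}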

Spark can also read Hive table statistics, but I don't know how to display that metadata.
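For what it's worth, here is a sketch of one way to look at those statistics (not from the original answer; my_table is a hypothetical Hive table name, and note these are table-level statistics, not per-Spark-partition counts):

spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")               // populate size / row count stats
spark.sql("DESCRIBE EXTENDED my_table").show(100, truncate = false)  // look for the "Statistics" row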

+8

You can also use the built-in spark_partition_id function:

import org.apache.spark.sql.functions.spark_partition_id

// spark_partition_id() tags each row with the id of the partition it belongs to
df.groupBy(spark_partition_id()).count
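For readability, the generated column can be aliased and the output sorted; a small sketch under the same assumptions about df:

df.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()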
+4

Spark 1.5 solution:

(sparkPartitionId() exists in org.apache.spark.sql.functions)

import org.apache.spark.sql.functions._

// tag each row with its partition id, then count per partition
df.withColumn("partitionId", sparkPartitionId()).groupBy("partitionId").count.show

As @Raphael Roth mentioned, mapPartitionsWithIndex is the most general approach: it is based on the RDD API, so it works with all versions of Spark.
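To illustrate, the same idea applied directly to an RDD (a sketch assuming an active SparkContext named sc; the data is made up):

val rdd = sc.parallelize(1 to 100, numSlices = 4)

val counts = rdd
  .mapPartitionsWithIndex { case (index, iter) => Iterator((index, iter.size)) }
  .collect()  // Array[(Int, Int)] of (partitionIndex, recordCount)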

+1

Source: https://habr.com/ru/post/1685008/

