I ran into a strange problem while reading a large chunk of data. I repartitioned the dataframe into 50k partitions and saved it. However, when I read the data back and run a count action on the dataframe, I find that the underlying RDD has only 2143 partitions when I use Spark 2.0.
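For context, the write step was essentially the following (df here stands in for my actual dataframe):

val repartitioned = df.repartition(50000)
repartitioned.write.parquet("/repartitionedData/")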
So I checked the path where I had saved the repartitioned data and found:
hadoop fs -ls /repartitionedData/ | wc -l
50476
So, while saving the data, Spark did create 50k partitions.
However, with Spark 2.0,
val d = spark.read.parquet("repartitionedData")
d.rdd.getNumPartitions
res4: Int = 2143
But with Spark 1.5,
val d = spark.read.parquet("repartitionedData")
d.rdd.partitions.length
res4: Int = 50474
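In case it is relevant: I understand that Spark 2.0 can pack multiple small files into a single read partition, controlled by settings such as spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, and I suspect this is involved. They can be inspected in the shell like this (the values in the comments are the documented 2.0 defaults; I have not knowingly changed them):

spark.conf.get("spark.sql.files.maxPartitionBytes") // default 134217728 (128 MB)
spark.conf.get("spark.sql.files.openCostInBytes")   // default 4194304 (4 MB)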
Can someone help me understand why the partition count differs between the two versions?