I ran into a strange problem while reading a large chunk of data. I repartitioned the dataframe into 50k partitions and saved it. However, when I read the data back and run a count action on the dataframe, I find that the underlying RDD has only 2143 partitions when I use Spark 2.0.
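For context, the write step was essentially the following (df here stands in for my actual dataframe):

val repartitioned = df.repartition(50000)
repartitioned.write.parquet("/repartitionedData/")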
So I checked the path where I had saved the repartitioned data and found:
hadoop fs -ls /repartitionedData/ | wc -l
50476
So, while saving the data, Spark did create 50k partitions.
However, with Spark 2.0,
val d = spark.read.parquet("repartitionedData")
d.rdd.getNumPartitions
res4: Int = 2143
But with Spark 1.5,
val d = spark.read.parquet("repartitionedData")
d.rdd.partitions.length
res4: Int = 50474
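In case it is relevant: I understand that Spark 2.0 can pack multiple small files into a single read partition, controlled by settings such as spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, and I suspect this is involved. They can be inspected in the shell like this (the values in the comments are the documented 2.0 defaults; I have not knowingly changed them):

spark.conf.get("spark.sql.files.maxPartitionBytes") // default 134217728 (128 MB)
spark.conf.get("spark.sql.files.openCostInBytes")   // default 4194304 (4 MB)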
Can someone help me understand why the partition count differs between the two versions?