Spark DataFrame loses partitions

I ran into a strange problem while reading a large chunk of data. I repartitioned the dataframe into 50k partitions and saved it. However, when I read the data back and run a count action on the dataframe, the underlying RDD has only 2143 partitions when I use Spark 2.0.
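
For reference, the save step was roughly the following (a sketch; df stands for my original dataframe, the real job is omitted):

// Repartition into 50k partitions, then write out as parquet.
val repartitioned = df.repartition(50000)
repartitioned.write.parquet("/repartitionedData/")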

So I went to the path where I had saved the repartitioned data and found:

hadoop fs -ls /repartitionedData/ | wc -l
50476

So, while saving the data, it really did create 50 thousand partition files.

However, with Spark 2.0,

val d = spark.read.parquet("repartitionedData")
d.rdd.getNumPartitions
res4: Int = 2143

But with Spark 1.5,

val d = spark.read.parquet("repartitionedData")
d.rdd.partitions.length
res4: Int = 50474

Can someone help me?

1 answer

This is not a bug, it is how Spark 2.0 reads files. The new FileSourceStrategy packs many small files into a single partition instead of creating one partition per file, so reading tens of thousands of tiny files no longer produces tens of thousands of partitions.

In other words, no data is lost; the same files are simply grouped into far fewer partitions in Spark 2.0. If you need more partitions, repartition the dataframe after reading it.
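
If more read-side partitions are needed, one option (a sketch, not part of the original answer; these are Spark 2.x's standard file-source settings) is to lower the thresholds that control how files are packed into partitions, or to repartition explicitly after reading:

// The reader packs files into partitions of up to
// spark.sql.files.maxPartitionBytes (default 128 MB), charging
// spark.sql.files.openCostInBytes (default 4 MB) per opened file.
// Lowering both yields more, smaller partitions (values illustrative):
spark.conf.set("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024)
spark.conf.set("spark.sql.files.openCostInBytes", 1024L * 1024)

val d = spark.read.parquet("repartitionedData")
d.rdd.getNumPartitions // considerably more than 2143 now

// Alternatively, keep the defaults and repartition after reading:
val d2 = spark.read.parquet("repartitionedData").repartition(50000)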


Source: https://habr.com/ru/post/1683504/

