Very slow write of a DataFrame to a file on a Spark cluster

I have a test program that writes data to a file. The DataFrame is generated by filling each row with consecutive numbers, for example:

1,2,3,4,5,6,7.....11
2,3,4,5,6,7,8.....12
......

There are 100,000 rows in the DataFrame, which I don't think is too big. Yet when I submit the Spark job, it takes almost 20 minutes to write the data to a file on HDFS. I am wondering why it is so slow and how to improve the performance.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Master and application name are assumed to come from spark-submit
val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val numCol = 11

// Rows of 11 consecutive doubles: (1,2,...,11), (2,3,...,12), ...
val arraydataInt = (1 to 100000).toArray
val arraydata = arraydataInt.map(x => x.toDouble)
val slideddata = arraydata.sliding(numCol).toSeq
val rows = arraydata.sliding(numCol).map { x => Row(x: _*) }
val datasetsize = arraydataInt.size

val myrdd = sc.makeRDD(rows.toSeq, arraydata.size - numCol).persist()

val schemaString = "value1 value2 value3 value4 value5 " +
                   "value6 value7 value8 value9 value10 label"

val schema =
  StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))

val df = sqlContext.createDataFrame(myrdd, schema).cache()

// Split into training and test sets
val splitsH = df.randomSplit(Array(0.8, 0.1))
val trainsetH = splitsH(0).cache()
val testsetH = splitsH(1).cache()

println("now saving training and test samples into files")

trainsetH.write.save("TrainingSample.parquet")
testsetH.write.save("TestSample.parquet")
1 answer

Change

val myrdd = sc.makeRDD(rows.toSeq, arraydata.size - numCol).persist()

to

val myrdd = sc.makeRDD(rows.toSeq, 100).persist()

You are creating the RDD with arraydata.size - numCol partitions, i.e. 99,989 partitions for roughly 100,000 rows. Each task then handles about one row, so the parallelism is far too fine-grained and the job spends almost all of its time on task scheduling and startup overhead instead of on the actual write. With 100 partitions each task writes a reasonable chunk of data and the job finishes quickly.
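
To make the mismatch concrete, here is a small sketch that only prints the partition counts. It reuses rows and arraydata from the question; since rows is an iterator, it is materialised once before being reused:

val rowSeq = rows.toSeq                      // materialise the iterator once

val tooFine = sc.makeRDD(rowSeq, arraydata.size - numCol)
println(tooFine.partitions.length)           // 99989 partitions, about one row each

val coarser = sc.makeRDD(rowSeq, 100)
println(coarser.partitions.length)           // 100 partitions, about 1000 rows each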

BTW, a common rule of thumb is to use about 2 to 3 partitions per CPU core available in your cluster.
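
If you would rather derive the partition count from the cluster than hard-code 100, a sketch of that idea (sc.defaultParallelism is a standard SparkContext property, typically the total number of cores Spark sees; the factor 3 is just the upper end of the rule of thumb):

val numPartitions = 3 * sc.defaultParallelism
val myrdd = sc.makeRDD(rows.toSeq, numPartitions).persist()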


Source: https://habr.com/ru/post/1665180/

