How to add partitionBy column name as prefix in spark output file in scala

I have done quite a bit of research on this question and have not found a satisfactory answer. I need to rename the output files that Spark writes.

Currently I write my Spark DataFrame to S3, then read the files back, rename them, and write them to S3 again. The problem is that the Spark job itself takes 16 minutes, but reading from S3, renaming, and writing back to S3 takes another 15 minutes.

Is there any way to rename the output files as they are written? I'm fine with keeping the part-00000 part in the name.

This is how I save my data frame:

dfMainOutputFinalWithoutNull.repartition(50).write.partitionBy("DataPartition", "PartitionYear")
      .format("csv")
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("nullValue", "")
      .option("delimiter", "\t")
      .option("quote", "\u0000")
      .option("header", "true")
      .option("codec", "bzip2")
      .save(outputFileURL)
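
For context, the fs handle used in the rename snippet further down is presumably obtained something like the sketch below. The spark session name and the reuse of outputFileURL are assumptions on my part:

import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Sketch: get a Hadoop FileSystem handle for the S3 output location,
// using the same Hadoop configuration the Spark job runs with.
val fs = FileSystem.get(new URI(outputFileURL), spark.sparkContext.hadoopConfiguration)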

Alternatively, is there a way to do this with the Hadoop FileSystem API?

I am currently doing it as shown below:

val finalFileName = finalPrefix + DataPartitionName + "." + YearPartition + "." + intFileCounter + "." + fileVersion + currentTime + fileExtention
val dest = new Path(mainFileURL + "/" + finalFileName)
fs.rename(urlStatus.getPath, dest)
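
Putting the pieces together, here is a rough sketch of how the whole rename pass could look with the Hadoop FileSystem API, without re-reading the data through Spark. The variable names (fs, outputFileURL, finalPrefix, fileVersion, currentTime, fileExtention) follow the snippets above and are assumptions about the surrounding code:

import org.apache.hadoop.fs.Path

val outputRoot = new Path(outputFileURL)

// partitionBy("DataPartition", "PartitionYear") writes one directory level
// per partition column: .../DataPartition=X/PartitionYear=Y/part-*.csv.bz2
for (dataPartDir <- fs.listStatus(outputRoot) if dataPartDir.isDirectory;
     yearDir     <- fs.listStatus(dataPartDir.getPath) if yearDir.isDirectory) {

  // Recover the partition values from the "column=value" directory names.
  val dataPartitionName = dataPartDir.getPath.getName.split("=").last
  val yearPartition     = yearDir.getPath.getName.split("=").last

  var intFileCounter = 1
  for (urlStatus <- fs.listStatus(yearDir.getPath)
       if urlStatus.getPath.getName.startsWith("part-")) {
    // Prefix the file name with the partition values, as in the question.
    val finalFileName = finalPrefix + dataPartitionName + "." + yearPartition +
      "." + intFileCounter + "." + fileVersion + currentTime + fileExtention
    fs.rename(urlStatus.getPath, new Path(yearDir.getPath, finalFileName))
    intFileCounter += 1
  }
}

Note that on S3 a rename is still a server-side copy plus delete, but it avoids pulling the data back through Spark, which is where the extra 15 minutes go.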

Altogether this produces 50 files that need to be renamed.

I am running this on EMR.


Source: https://habr.com/ru/post/1694783/

