I have done quite a bit of research on this question and have not found a satisfactory answer. I need to rename the output files that my Spark job writes.
Currently I write the Spark output to S3, then read it back, rename the files, and write them to S3 again. The problem is that the Spark job itself takes 16 minutes, but reading from S3, renaming, and writing back to S3 takes another 15 minutes.
Is there any way to rename the output files as part of the job itself, rather than renaming the part-00000 files afterwards?
This is how I save my DataFrame:
dfMainOutputFinalWithoutNull.repartition(50).write.partitionBy("DataPartition", "PartitionYear")
.format("csv")
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.option("nullValue", "")
.option("delimiter", "\t")
.option("quote", "\u0000")
.option("header", "true")
.option("codec", "bzip2")
.save(outputFileURL)
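If the goal is only to change the file names, the data does not need to be read back through Spark at all: listing the part-* files and renaming them is a metadata-level operation. Below is a minimal sketch of that pattern. The helper name and naming scheme are hypothetical, and it uses java.nio against a local directory for clarity; on S3/HDFS you would list and rename with org.apache.hadoop.fs.FileSystem (globStatus plus rename) instead. Note that on S3 a "rename" is implemented as a server-side copy and delete, but it still avoids re-reading the data through Spark.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical helper: rename Spark "part-*" output files in place instead of
// re-reading and re-writing the data. Files are renamed to
// "<prefix>.<index><extension>" with a zero-padded index.
def renamePartFiles(dir: Path, prefix: String, extension: String): Seq[Path] = {
  val parts = Files.list(dir).iterator().asScala
    .filter(_.getFileName.toString.startsWith("part-"))
    .toSeq
    .sortBy(_.getFileName.toString)
  parts.zipWithIndex.map { case (src, i) =>
    val dest = dir.resolve(f"$prefix.$i%05d$extension")
    Files.move(src, dest) // pure metadata operation on a real file system
    dest
  }
}
```

The same loop structure carries over to the Hadoop API: replace Files.list with fs.globStatus(new Path(dir, "part-*")) and Files.move with fs.rename.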
Alternatively, is there a way to do this through the Hadoop FileSystem API?
This is the rename step I currently run, where fs is the Hadoop FileSystem for the bucket and urlStatus is a FileStatus obtained by listing the output directory:
val finalFileName = finalPrefix + DataPartitionName + "." + YearPartition + "." + intFileCounter + "." + fileVersion + currentTime + fileExtention
val dest = new Path(mainFileURL + "/" + finalFileName)
fs.rename(urlStatus.getPath, dest)
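For completeness, the finalFileName expression above is plain string assembly; a self-contained sketch of it follows. The variable names mirror the snippet, but the example values and the timestamp format are assumptions, not taken from the question.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical reconstruction of the finalFileName expression above.
// The "yyyyMMddHHmmss" timestamp format is an assumption.
def buildFileName(prefix: String, dataPartition: String, year: String,
                  counter: Int, version: Int, now: LocalDateTime, ext: String): String = {
  val ts = now.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"))
  s"$prefix$dataPartition.$year.$counter.$version$ts$ext"
}
```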
For context: the job writes around 50 output files per run, and it runs on EMR.
user9175539