Change output file name prefix for DataFrame.write()

Output files generated using the Spark SQL DataFrame.write() method start with the base name prefix "part". For example:

    DataFrame sample_07 = hiveContext.table("sample_07");
    sample_07.write().parquet("sample_07_parquet");

Results in:

    $ hdfs dfs -ls sample_07_parquet/
    Found 4 items
    -rw-r--r--   1 rob rob      0 2016-03-19 16:40 sample_07_parquet/_SUCCESS
    -rw-r--r--   1 rob rob    491 2016-03-19 16:40 sample_07_parquet/_common_metadata
    -rw-r--r--   1 rob rob   1025 2016-03-19 16:40 sample_07_parquet/_metadata
    -rw-r--r--   1 rob rob  17194 2016-03-19 16:40 sample_07_parquet/part-r-00000-cefb2ac6-9f44-4ce4-93d9-8e7de3f2cb92.gz.parquet

I would like to change the output file prefix used when writing files with the Spark SQL DataFrame.write() method. I tried setting the mapreduce.output.basename property in the Hadoop configuration of the Spark context. For example:

    public class MyJavaSparkSQL {
        public static void main(String[] args) throws Exception {
            SparkConf sparkConf = new SparkConf().setAppName("MyJavaSparkSQL");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            ctx.hadoopConfiguration().set("mapreduce.output.basename", "myprefix");
            HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(ctx.sc());
            DataFrame sample_07 = hiveContext.table("sample_07");
            sample_07.write().parquet("sample_07_parquet");
            ctx.stop();
        }
    }

This did not change the output file prefix for the generated files.

Is there a way to override the output file name prefix when using the DataFrame.write() method?

1 answer

You cannot change the "part" prefix when using any of the standard output formats (such as Parquet). See this snippet from the ParquetRelation source code:

    private val recordWriter: RecordWriter[Void, InternalRow] = {
      val outputFormat = {
        new ParquetOutputFormat[InternalRow]() {
          // ...
          override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
            // ...
            // prefix is hard-coded here:
            new Path(path, f"part-r-$split%05d-$uniqueWriteJobId$bucketString$extension")
          }
        }
      }

If you really need to control the part file names, you will probably have to implement your own FileOutputFormat subclass and use one of the Spark save methods that accept a FileOutputFormat class (for example, saveAsHadoopFile).
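Alternatively, if you only need the final file names to differ, a common workaround is to rename the part files after the write job completes (on HDFS you would do the equivalent with the Hadoop FileSystem API). A minimal sketch of the renaming step using plain java.nio on a local directory — the renamePartFiles helper, the myprefix name, and the sample file names are illustrative assumptions, not part of Spark:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RenameParts {

    // Rename every file in dir that starts with "part" so it starts
    // with the given prefix instead; other files (_SUCCESS, _metadata)
    // are left untouched.
    static void renamePartFiles(Path dir, String prefix) throws IOException {
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-*")) {
            for (Path f : parts) {
                String suffix = f.getFileName().toString().substring("part".length());
                Files.move(f, f.resolveSibling(prefix + suffix));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a Spark output directory with one part file and a marker file.
        Path dir = Files.createTempDirectory("sample_07_parquet");
        Files.createFile(dir.resolve("part-r-00000-cefb2ac6.gz.parquet"));
        Files.createFile(dir.resolve("_SUCCESS"));

        renamePartFiles(dir, "myprefix");

        System.out.println(Files.exists(dir.resolve("myprefix-r-00000-cefb2ac6.gz.parquet")));
        System.out.println(Files.exists(dir.resolve("_SUCCESS")));
    }
}
```

This avoids writing a custom FileOutputFormat at the cost of an extra pass over the output directory after the job finishes.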


Source: https://habr.com/ru/post/1245396/

