Output files generated by the Spark SQL DataFrame.write() method start with the base name prefix "part". For example:
DataFrame sample_07 = hiveContext.table("sample_07");
sample_07.write().parquet("sample_07_parquet");
This results in:
hdfs dfs -ls sample_07_parquet/
Found 4 items
-rw-r--r--   1 rob rob      0 2016-03-19 16:40 sample_07_parquet/_SUCCESS
-rw-r--r--   1 rob rob    491 2016-03-19 16:40 sample_07_parquet/_common_metadata
-rw-r--r--   1 rob rob   1025 2016-03-19 16:40 sample_07_parquet/_metadata
-rw-r--r--   1 rob rob  17194 2016-03-19 16:40 sample_07_parquet/part-r-00000-cefb2ac6-9f44-4ce4-93d9-8e7de3f2cb92.gz.parquet
I would like to change the output file prefix used when writing files with Spark SQL's DataFrame.write(). I tried setting the mapreduce.output.basename property in the Hadoop configuration of the Spark context, e.g.:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class MyJavaSparkSQL {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("MyJavaSparkSQL");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        ctx.hadoopConfiguration().set("mapreduce.output.basename", "myprefix");
        HiveContext hiveContext = new HiveContext(ctx.sc());
        DataFrame sample_07 = hiveContext.table("sample_07");
        sample_07.write().parquet("sample_07_parquet");
        ctx.stop();
    }
}
This did not change the output file prefix of the generated files.
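One workaround (not taken from the question itself, just a sketch) is to rename the "part-*" files after the write completes. The example below demonstrates the rename step with java.nio.file against a local directory; for files on HDFS the analogous calls would be listStatus() and rename() on org.apache.hadoop.fs.FileSystem. The class and method names (RenameParts, renameParts) are illustrative, not part of any Spark API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RenameParts {
    // Rename every "part-*" file in outputDir so it starts with newPrefix
    // instead of "part"; returns the number of files renamed.
    // Other files (e.g. _SUCCESS, _metadata) are left untouched.
    public static int renameParts(Path outputDir, String newPrefix) throws IOException {
        int renamed = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                String oldName = p.getFileName().toString();
                String newName = newPrefix + oldName.substring("part".length());
                Files.move(p, p.resolveSibling(newName));
                renamed++;
            }
        }
        return renamed;
    }
}
```

For the listing above, calling renameParts on sample_07_parquet with "myprefix" would leave part-r-00000-cefb2ac6-...gz.parquet renamed to myprefix-r-00000-cefb2ac6-...gz.parquet, while _SUCCESS and the metadata files keep their names.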
Is there a way to override the output file name prefix when using the DataFrame.write() method?