How to save parquet files to S3 using bucketBy / sortBy?

It looks like this leads to an error:

    df.write()
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .mode("overwrite")
      .bucketBy(32, "column")
      .sortBy("column")
      .parquet("s3://....");

with this error:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
        at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:314)

I see that saveAsTable("myfile") is still supported, but it writes only locally. How do I take this saveAsTable(...) output and put it on S3 after the job has run?
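For reference, a minimal sketch of the session setup (the app name and warehouse path are placeholders, not from the original post). Without an explicit path, saveAsTable stores the table data under spark.sql.warehouse.dir, which defaults to a local spark-warehouse directory, which is why the output ends up only locally:

    import org.apache.spark.sql.SparkSession;

    // Hypothetical setup, for illustration only: without .option("path", ...),
    // saveAsTable writes the table data under spark.sql.warehouse.dir,
    // which by default points to a local "spark-warehouse" directory.
    SparkSession spark = SparkSession.builder()
            .appName("bucketed-parquet-example")                         // placeholder name
            .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")   // placeholder local path
            .getOrCreate();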

1 answer
You can use it like below:

    df.write()
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .option("path", "s3://....")
      .mode("overwrite")
      .format("parquet")
      .bucketBy(32, "column")
      .sortBy("column")
      .saveAsTable("tableName");

This creates an external table whose data lives at the S3 location. The trick here is .option("path", "s3://....").
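To sanity-check the result, here is a minimal sketch (assuming a SparkSession named spark and the tableName used above): the table metadata sits in the metastore while the parquet files themselves live under the S3 path, so it can be read back like any other table:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Read the bucketed external table back; Spark resolves the S3 location
    // from the table metadata, so no explicit path is needed here.
    Dataset<Row> bucketed = spark.table("tableName");
    bucketed.show();

    // Because the table is bucketed and sorted on "column", Spark can avoid
    // a shuffle for aggregations or joins keyed on that column.
    bucketed.groupBy("column").count().show();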


Source: https://habr.com/ru/post/1272967/

