How to save parquet files to S3 using bucketBy / sortBy?

It looks like this leads to an error:

    df.write()
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .mode("overwrite")
      .bucketBy(32, "column")
      .sortBy("column")
      .parquet("s3://....");

with this error:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
        at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:314)

I see that saveAsTable("myfile") is still supported, but it writes only locally. How do I take this saveAsTable(...) output and put it on S3 after the job has run?
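For reference, a minimal sketch of the session setup (the app name and warehouse path are placeholders, not from the original post). Without an explicit path, saveAsTable stores the table data under spark.sql.warehouse.dir, which defaults to a local spark-warehouse directory, which is why the output ends up only locally:

    import org.apache.spark.sql.SparkSession;

    // Hypothetical setup, for illustration only: without .option("path", ...),
    // saveAsTable writes the table data under spark.sql.warehouse.dir,
    // which by default points to a local "spark-warehouse" directory.
    SparkSession spark = SparkSession.builder()
            .appName("bucketed-parquet-example")                         // placeholder name
            .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")   // placeholder local path
            .getOrCreate();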

1 answer
You can use it like below:

    df.write()
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .option("path", "s3://....")
      .mode("overwrite")
      .format("parquet")
      .bucketBy(32, "column")
      .sortBy("column")
      .saveAsTable("tableName");

This creates an external table whose data lives at the S3 location. The trick here is .option("path", "s3://....").
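To sanity-check the result, here is a minimal sketch (assuming a SparkSession named spark and the tableName used above): the table metadata sits in the metastore while the parquet files themselves live under the S3 path, so it can be read back like any other table:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Read the bucketed external table back; Spark resolves the S3 location
    // from the table metadata, so no explicit path is needed here.
    Dataset<Row> bucketed = spark.table("tableName");
    bucketed.show();

    // Because the table is bucketed and sorted on "column", Spark can avoid
    // a shuffle for aggregations or joins keyed on that column.
    bucketed.groupBy("column").count().show();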


Source: https://habr.com/ru/post/1272967/

