Spark DataFrame saveAsTable with partitionBy does not create ORC file in HDFS

Hi, I have a Spark DataFrame that I want to save as a partitioned Hive table. I tried the following two statements, but they do not work: the HDFS directory stays empty, with no ORC files in it. I can see baseTable in the Hive console, but it is obviously empty because there are no files in HDFS. The two lines below using saveAsTable() and insertInto() do not work. The registerDataFrameAsTable() method does work, but it creates an in-memory table and causes an OOM in my use case, since I have thousands of Hive partitions to process. Please bear with me, I am new to Spark. Thanks in advance.

dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable"); 

dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");

// the following works, but it creates an in-memory table and seems to be the reason for the OOM in my case

hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");
1 answer

I hope you have already found your answer, but I am posting this for anyone else who lands here: partitionBy was initially applicable only to Parquet; support for ORC, JSON, text, and Avro was added in Spark 1.5+. Please refer to the documentation below:

https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrameWriter.html
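For reference, here is a minimal sketch of how the write could look on Spark 1.5+/1.6 with the Java API. It assumes a HiveContext, the same table and partition column names as in the question ("baseTable", "entity", "date"), and a hypothetical input path "/path/to/input"; the dynamic-partition settings at the end are only needed if you append with insertInto() into a table that already exists.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

public class OrcPartitionedSave {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("OrcPartitionedSave");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Hypothetical input; replace with however the DataFrame is actually built.
        DataFrame dataFrame = hiveContext.read().format("orc").load("/path/to/input");

        // On Spark 1.5+ the ORC data source supports partitionBy, so this call
        // creates the Hive table and writes partition directories under it in HDFS.
        dataFrame.write()
                 .mode(SaveMode.Append)
                 .format("orc")
                 .partitionBy("entity", "date")
                 .saveAsTable("baseTable");

        // insertInto() ignores partitionBy and relies on the partitioning of the
        // existing table; enable Hive dynamic partitioning before appending this way.
        hiveContext.setConf("hive.exec.dynamic.partition", "true");
        hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict");
        dataFrame.write().mode(SaveMode.Append).insertInto("baseTable");

        sc.stop();
    }
}

If you are stuck on Spark 1.4 or earlier, the usual options are upgrading, or writing each partition's data to its own HDFS directory and registering the partitions on the Hive table with ALTER TABLE ... ADD PARTITION.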



