Spark writes data to a partitioned Hive table very slowly

I want to store a Spark DataFrame in a Hive table in a normal, readable text format. To do this, I first ran:

sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")

My DataFrame is created like this:

final_data1_df = sqlContext.sql("select a, b from final_data")

and I'm trying to write it:

final_data1_df.write.partitionBy("b").mode("overwrite").saveAsTable("eefe_lstr3.final_data1")

but it is very slow, even slower than writing through Hive itself. To work around this, I decided to define the table first through a Hive DDL statement and then load the data into it:

sqlContext.sql("""
CREATE TABLE IF NOT EXISTS eefe_lstr3.final_data1(
a BIGINT
)
PARTITIONED BY (b INT)
"""
)
sqlContext.sql("""
INSERT OVERWRITE TABLE eefe_lstr3.final_data1 PARTITION (b)
select * from final_data1""")

This does give a partitioned Hive table, but the data is still written as Parquet. Did I miss something?

1 answer

When you define the table through DDL, it is created with the default storage format of your environment, which here is evidently Parquet, because the CREATE statement does not specify a format.

Add STORED AS TEXTFILE to the CREATE statement; the table will then be declared as a text table and the INSERT will write plain text files.


Source: https://habr.com/ru/post/1650507/

