I want to store a Spark DataFrame in a Hive table in a plain, readable text format. To do this, I first ran:
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
My DataFrame is like:
final_data1_df = sqlContext.sql("select a, b from final_data")
and I'm trying to write it:
final_data1_df.write.partitionBy("b").mode("overwrite").saveAsTable("eefe_lstr3.final_data1")
but it is very slow, even slower than a plain Hive table write. To work around this, I decided to define the table through Hive DDL first and then load the data like this:
sqlContext.sql("""
CREATE TABLE IF NOT EXISTS eefe_lstr3.final_data1 (
    a BIGINT
)
PARTITIONED BY (b INT)
""")
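For the INSERT below to resolve final_data1, the DataFrame has to be visible to SQL; a minimal sketch of that step, assuming the registerTempTable API that goes with sqlContext (this registration isn't shown above):

# assumed step: expose the DataFrame to SQL under the name the INSERT selects from
final_data1_df.registerTempTable("final_data1")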
sqlContext.sql("""
INSERT OVERWRITE TABLE eefe_lstr3.final_data1 PARTITION (stategroup)
select * from final_data1""")
This does give a partitioned Hive table, but the data is still stored as Parquet. Did I miss something?
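For reference, this is a sketch of how one can confirm the storage format (assuming DESCRIBE FORMATTED, whose output lists the table's SerDe and input/output formats):

# inspect the table's SerDe and input/output formats to confirm it is Parquet
sqlContext.sql("DESCRIBE FORMATTED eefe_lstr3.final_data1").show(100, truncate=False)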