Spark SQL saveAsTable is not compatible with Hive when a partition is specified

Kind of an edge case: when saving a parquet table in Spark SQL with a partition,

// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("time", DataTypes.StringType, true),
    DataTypes.createStructField("accountId", DataTypes.StringType, true),
    ...

DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);

df.coalesce(1)
    .write()
    .mode(SaveMode.Append)
    .format("parquet")
    .partitionBy("year")
    .saveAsTable("tblclick8partitioned");

Spark warns:

Persisting partitioned data source relation into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive

In Hive:

hive> describe tblclick8partitioned;
OK
col                     array<string>           from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)

The schema is obviously incorrect. However, if I use saveAsTable in Spark SQL without a partition, the table can be queried without problems.

Question: how can I make a parquet table in Spark SQL compatible with Hive when partition information is specified?

2 answers

This happens because DataFrame.saveAsTable creates RDD partitions but not Hive partitions; the workaround is to create the table via hql before calling DataFrame.saveAsTable. An example from SPARK-14927:

hc.sql("create external table tmp.partitiontest1(val string) partitioned by (year int)")

Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val")
  .write
  .partitionBy("year")
  .mode(SaveMode.Append)
  .saveAsTable("tmp.partitiontest1")

, Hive, ...partitionBy("year").insertInto("default.mytable").

, Hive, ...partitionBy("year").saveAsTable("default.mytable") . Spark 1.6.2.


Source: https://habr.com/ru/post/1653075/
