In Spark 2.1+ you can use Spark SQL's bucketing feature to create bucketed tables, join them, and so on, much as Hive does, avoiding both the data exchange (shuffle) and the sort step. See https://issues.apache.org/jira/browse/SPARK-15453 for details.
First, build a test DataFrame:
// 80,000 rows, coalesced to a single partition so that each bucket is written as one file
val df = (0 until 80000).map(i => (i, i.toString, i.toString)).toDF("item_id", "country", "state").coalesce(1)
", Hive."
df.write.bucketBy(100, "country", "state").sortBy("country", "state").saveAsTable("kofeng.lstg_bucket_test")
17/03/13 15:12:01 WARN HiveExternalCatalog: Persisting bucketed data source table `kofeng`.`lstg_bucket_test` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
df.write.bucketBy(100, "country", "state").sortBy("country", "state").saveAsTable("kofeng.lstg_bucket_test2")
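To confirm that the bucket spec was persisted, you can inspect the table metadata. A quick sketch (the exact layout of the output varies across Spark versions; for bucketed tables it includes entries such as Num Buckets and Bucket Columns):
// Print the full table metadata, including the bucketing information.
spark.sql("DESCRIBE FORMATTED kofeng.lstg_bucket_test").collect().foreach(println)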
Next, disable broadcast joins so that the planner chooses a sort merge join:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.sql.autoBroadcastJoinThreshold", "0").getOrCreate()
The plan below is from Spark 2.1.0; in Spark 2.0 the exchange is already avoided for bucketed tables, but the sort is not (SPARK-15453, fixed in 2.1.0, removed the unnecessary sort).
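If you already have a running session, the same setting can also be changed at runtime; a small sketch (setting the threshold to -1, like 0, effectively disables broadcast joins):
// Disable broadcast joins on the current session.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")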
val query = """
|SELECT *
|FROM
| kofeng.lstg_bucket_test a
|JOIN
| kofeng.lstg_bucket_test2 b
|ON a.country=b.country AND
| a.state=b.state
""".stripMargin
val joinDF = spark.sql(query)
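For reference, the same join expressed through the DataFrame API produces the identical plan; a sketch (joinDF2 is just an illustrative name):
val a = spark.table("kofeng.lstg_bucket_test")
val b = spark.table("kofeng.lstg_bucket_test2")
// Joining on the bucket/sort columns yields the same sort merge join.
val joinDF2 = a.join(b, Seq("country", "state"))
joinDF2.explain()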
scala> joinDF.queryExecution.executedPlan
res10: org.apache.spark.sql.execution.SparkPlan =
*SortMergeJoin [country#..., state#...], [country#..., state#...], Inner
:- *Project [item_id#..., country#..., state#...]
:  +- *Filter (isnotnull(country#...) && isnotnull(state#...))
:     +- *FileScan parquet kofeng.lstg_bucket_test[item_id#...,country#...,state#...] ...
+- *Project [item_id#..., country#..., state#...]
   +- *Filter (isnotnull(country#...) && isnotnull(state#...))
      +- *FileScan parquet kofeng.lstg_bucket_test2[item_id#...,country#...,state#...] ...
Note that the plan contains neither an Exchange nor a Sort operator: both tables are already bucketed and sorted on the join keys, so the SortMergeJoin reads them directly.
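For contrast, joining unbucketed copies of the same data shows the operators that bucketing removes. A hedged sketch, with kofeng.lstg_plain_test and kofeng.lstg_plain_test2 as hypothetical unbucketed tables:
// Hypothetical unbucketed copies of the same DataFrame.
df.write.saveAsTable("kofeng.lstg_plain_test")
df.write.saveAsTable("kofeng.lstg_plain_test2")
val plainJoin = spark.table("kofeng.lstg_plain_test").join(spark.table("kofeng.lstg_plain_test2"), Seq("country", "state"))
// Expect Exchange hashpartitioning(...) and Sort operators above each
// FileScan in this plan; these are exactly what the bucketed tables avoid.
plainJoin.queryExecution.executedPlan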