Does computing table statistics in Hive or Impala speed up Apache Spark?

Question

To improve performance (for example, for joins), it is recommended to compute table statistics first.

In Hive, I can do:

analyze table <table name> compute statistics;

In Impala:

compute stats <table name>;

Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics? If so, which of the two commands do I need to run? Do both store their statistics in the Hive metastore? I am using Spark 1.6.1 on Cloudera 5.5.4.

Note: In the Spark 1.6.1 docs ( https://spark.apache.org/docs/1.6.1/sql-programming-guide.html ), for the parameter spark.sql.autoBroadcastJoinThreshold, I found a hint:

Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.

hive apache-spark impala

Raphael roth Sep 22 '16 at 7:23

3 answers

user24225 · Answer 1 · 2016-09-29T01:17:42+0000

I assume that you are using Hive on Spark, or Spark SQL with a HiveContext. If so, you should run ANALYZE in Hive.

ANALYZE TABLE <...> should usually be run after the table has been created or after significant inserts or changes. You can do this at the end of your load step, whether that step is an MR or a Spark job.

Reference: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
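The advice to refresh statistics at the end of the load step could look roughly like the following for a Spark 1.6 job using a HiveContext. This is only a sketch: the object name, the helper `analyzeStatement`, and the table names `target_table` and `staging_table` are placeholders that do not come from the answer, and the `noscan` variant is used because it is the form the Spark 1.6 documentation mentions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object AnalyzeAfterLoad {
  // Builds the ANALYZE statement. With noscan, only statistics that need no
  // data scan (such as the total size) are collected.
  def analyzeStatement(table: String, noscan: Boolean): String = {
    val suffix = if (noscan) " noscan" else ""
    s"ANALYZE TABLE $table COMPUTE STATISTICS$suffix"
  }

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("analyze-after-load"))
    val hiveContext = new HiveContext(sc)

    // Hypothetical load step: both table names are placeholders.
    hiveContext.sql("INSERT INTO TABLE target_table SELECT * FROM staging_table")

    // Refresh the statistics once the load has finished.
    hiveContext.sql(analyzeStatement("target_table", noscan = true))

    sc.stop()
  }
}
```

Running the statement from the same job keeps the statistics in step with the data; running it from Hive after the job completes would be an equally valid choice.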

Vivek · Answer 2 · 2016-09-23T11:11:11+0000

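As far as I understand, COMPUTE STATS in Impala is the newer implementation, and it frees you from tuning Hive settings.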

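From the official documentation:

If you use the Hive-based methods of gathering statistics, see the Hive wiki for the configuration required on the Hive side. Cloudera recommends using the Impala COMPUTE STATS statement instead.
If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, Impala can only use the resulting column statistics if the table is unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned table.

Source: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_perf_stats.html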

Jacek Laskowski · Answer 3 · 2018-01-18T21:24:52+0000

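This answer is about the upcoming Spark 2.3.0 (some of these features may already be available in 2.2.1 or earlier).

Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics?

It could, if Impala or Hive recorded the table statistics (for example, table size or row count) in the Hive metastore, in table metadata that Spark can read (and translate into its own Spark statistics for query planning).

You can easily check this with the DESCRIBE EXTENDED SQL command in spark-shell.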

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      |
+--------------+----------+

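ANALYZE TABLE COMPUTE STATISTICS noscan computes one statistic that Spark uses, i.e. the total size of the table (there is no row count because of the noscan option). If Impala and Hive recorded it in a "proper" location, Spark SQL would show it in DESC EXTENDED.

Use DESC EXTENDED tableName for table-level statistics and check whether you find the ones generated by Impala or Hive. If they appear in the DESC EXTENDED output, they will be used to optimize joins (and, with cost-based optimization enabled, also aggregations and filters).

Column statistics are stored (in a Spark-specific serialized format) in table properties, and I really doubt that Impala or Hive could compute the statistics and store them in a format compatible with Spark SQL.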
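Besides DESC EXTENDED, one way to see which statistics the optimizer actually picks up is to read them off the optimized plan. The sketch below assumes Spark 2.3 or later (where `stats` on a logical plan takes no arguments); `TableStats` and `optimizerStats` are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object TableStats {
  // Returns (sizeInBytes, rowCount) as the optimizer sees them for `table`.
  // rowCount is typically populated only when spark.sql.cbo.enabled is true
  // and a row count has been recorded for the table.
  def optimizerStats(spark: SparkSession, table: String): (BigInt, Option[BigInt]) = {
    val stats = spark.table(table).queryExecution.optimizedPlan.stats
    (stats.sizeInBytes, stats.rowCount)
  }
}
```

In spark-shell, with the table t1 from the transcript above, one might call `spark.conf.set("spark.sql.cbo.enabled", "true")` and then `TableStats.optimizerStats(spark, "t1")`. If the size reported here matches what Hive or Impala recorded, Spark is reading those statistics; if it only reflects the default size estimate, it is not.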
