Does computing table statistics in Hive or Impala speed up Apache Spark?

To improve performance (for example, for joins), it is recommended to compute table statistics first.

In Hive, I can do:

analyze table <table name> compute statistics;

In Impala:

compute stats <table name>;

Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics? If so, which of the two commands do I need to run? Do both store their statistics in the Hive metastore? I am using Spark 1.6.1 on Cloudera 5.5.4.

Note: In the Spark 1.6.1 docs ( https://spark.apache.org/docs/1.6.1/sql-programming-guide.html ), I found a hint under the parameter spark.sql.autoBroadcastJoinThreshold:

Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
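The hint matters because the threshold is compared against a table's known size. A minimal sketch of that comparison in plain Scala, assuming a table with no size statistic is treated as too large to broadcast; the object and method names are hypothetical and are not Spark APIs:

```scala
// Illustrative model only: these names are hypothetical, not Spark APIs.
object BroadcastSketch {
  // Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
  val DefaultThreshold: Long = 10L * 1024 * 1024

  // sizeInBytes is None when no size statistic is known for the table.
  def canBroadcast(sizeInBytes: Option[Long],
                   threshold: Long = DefaultThreshold): Boolean =
    sizeInBytes match {
      // In this sketch a non-positive threshold (e.g. -1) disables broadcasting.
      case Some(size) => threshold > 0 && size <= threshold
      // Unknown size: assume the table is too large to broadcast.
      case None => false
    }
}
```

For example, BroadcastSketch.canBroadcast(Some(1024L)) is true, while BroadcastSketch.canBroadcast(None) is false, which is why a missing size statistic can rule out a broadcast join.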

+4
3 answers

I assume that you are using Hive on Spark, or Spark SQL with a HiveContext. If so, you should run the analyze command in Hive.

ANALYZE TABLE <...> should usually be run after the table is created or after significant inserts/changes. You can do this at the end of your load step, whether that step is an MR job or a Spark job.
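As a rough sketch of what "at the end of the load step" could look like, the helper below only builds the statements quoted in the question. The object name, method names, and the table name mydb.events are hypothetical; the commented call assumes a Spark 1.6 job that already has a HiveContext named hiveContext.

```scala
// Hypothetical helper: builds the statistics statements quoted in the question.
object StatsStatements {
  // Hive: ANALYZE TABLE <table> COMPUTE STATISTICS [noscan]
  def hiveAnalyze(table: String, noscan: Boolean = false): String = {
    val suffix = if (noscan) " noscan" else ""
    s"ANALYZE TABLE $table COMPUTE STATISTICS$suffix"
  }

  // Impala: COMPUTE STATS <table>
  def impalaComputeStats(table: String): String =
    s"COMPUTE STATS $table"
}

// Assumed usage as the last step of a load job (not executed here):
//   hiveContext.sql(StatsStatements.hiveAnalyze("mydb.events", noscan = true))
```

The Impala statement has to be submitted through an Impala client such as impala-shell; it is included only to show the two commands side by side.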

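When running the analysis, if you are using Hive on Spark, please also use the configuration from the link below. You can set it at the session level for each query. I have used the parameters from https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started in production, and they work fine.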

+1

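As far as I understand, compute stats in Impala is the newer implementation, and it frees you from tuning Hive settings.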

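From the official documentation:

If you use the Hive-based methods of gathering statistics, see the Hive wiki for information about the required configuration on the Hive side. Cloudera recommends using the Impala COMPUTE STATS statement to avoid potential configuration and scalability issues with the statistics-gathering process.

If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, Impala can only use the resulting column statistics if the table is unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned table.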

Useful link: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_perf_stats.html

0

This answer refers to the upcoming Spark 2.3.0 (some of these features may already be available in 2.2.1 or earlier).

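Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics?

It could, if Impala or Hive recorded the table statistics (e.g. table size or row count) in the Hive metastore, in table metadata that Spark can read (and translate into its own Spark statistics for query planning).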

You can easily check this with the DESCRIBE EXTENDED SQL command in spark-shell.

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |id        |
|data_type     |int       |
|comment       |NULL      |
|min           |0         |
|max           |1         |
|num_nulls     |0         |
|distinct_count|2         |
|avg_col_len   |4         |
|max_col_len   |4         |
|histogram     |NULL      |
+--------------+----------+

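ANALYZE TABLE COMPUTE STATISTICS noscan computes the one statistic that Spark uses, i.e. the total size of the table (with no row count, because of the noscan option). If Impala and Hive recorded it in the "proper" location, Spark SQL would show it in DESC EXTENDED.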
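Hive keeps table-level statistics as table parameters in the metastore, under keys such as totalSize and numRows. A minimal sketch, in plain Scala with hypothetical helper names, of reading those two values from a parameter map; the assumption here is that a missing, non-numeric, or non-positive value means "unknown":

```scala
import scala.util.Try

// Hypothetical helper, not a Spark or Hive API.
object MetastoreStatsSketch {
  // Keys Hive uses for table-level statistics in table parameters.
  val TotalSize = "totalSize"
  val NumRows = "numRows"

  // Parse a parameter as a Long and keep it only if it is positive.
  private def positiveLong(params: Map[String, String], key: String): Option[Long] =
    params.get(key).flatMap(v => Try(v.trim.toLong).toOption).filter(_ > 0)

  // Total table size in bytes, if recorded.
  def tableSizeInBytes(params: Map[String, String]): Option[Long] =
    positiveLong(params, TotalSize)

  // Row count, if recorded (a noscan run does not count rows).
  def rowCount(params: Map[String, String]): Option[Long] =
    positiveLong(params, NumRows)
}
```

With a parameter map such as Map("totalSize" -> "2048"), tableSizeInBytes yields Some(2048) and rowCount yields None, which mirrors the size-only result of a noscan run.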

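Use DESC EXTENDED tableName for table-level statistics and check whether it shows the ones generated by Impala or Hive. If they appear in the DESC EXTENDED output, they will be used to optimize joins (and, with cost-based optimization turned on, also aggregations and filters).

Column statistics are stored (in a Spark-specific serialized format) in table properties, and I really doubt that Impala or Hive could compute the statistics and store them in a format compatible with Spark SQL.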

0

Source: https://habr.com/ru/post/1655430/

