Tuning Hive queries that use an HBase table

I have a table in HBase, let's say "tbl", and I would like to query it using Hive. So I mapped the table in Hive as follows:

CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");

Queries like:

 select * from tbl
 select id from tbl
 select id, data from tbl

are really fast.

But queries like:

 select id from tbl where substr(id, 0, 5) = "12345"
 select id from tbl where data["777"] IS NOT NULL

are incredibly slow.

On the other hand, the equivalent scans run from the HBase shell:

 scan 'tbl', { COLUMNS => 'data', STARTROW => '12345', ENDROW => '12346' }
 scan 'tbl', { COLUMNS => 'data', FILTER => "QualifierFilter(=, 'binary:777')" }

are lightning fast!

When I looked at the job that Hive created on the JobTracker, I found that "map.input.records" counts ALL the rows in the HBase table, which means that the job performs a full table scan before it even starts any mappers! Moreover, I suspect that it copies all the data from the HBase table to the mapper's temporary input folder in HDFS before execution.

So my questions are: why does the HBase storage handler for Hive not translate Hive queries into the corresponding HBase functions? Why does it scan all the records and only then filter them using the WHERE clause? How can this be improved?

Any suggestions for improving the performance of Hive queries (mapped to an HBase table) are welcome.

Is it possible to create a secondary index in HBase tables?

We are using HBase and Hive integration and trying to tune the performance of Hive queries.

1 answer

A lot of questions! I will try to answer all of them and give you some performance tips:

Data is not copied to HDFS, but the MapReduce jobs created by Hive will store their intermediate data in HDFS.

Secondary indexes or alternative query paths are not supported by HBase.

Hive will translate everything into MapReduce jobs, which take time to distribute and initialize. If you have very few rows, a simple SCAN operation in the HBase shell may well be faster than a Hive query, but on large datasets distributing the job among the datanodes is necessary.

The Hive HBase handler does not do a very good job of extracting start and stop row keys from a query; queries like substr(id, 0, 5) = "12345" will not use start and stop row keys.
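To illustrate why the missing start and stop row keys matter, here is a minimal Python sketch that models a table as a sorted list of row keys. It is only a toy model, not HBase or Hive code: the function names and the sample keys are invented for the illustration. A scan without bounds has to visit every key and filter afterwards, while a bounded scan can jump straight to the matching range.

```python
from bisect import bisect_left


def full_scan_with_filter(sorted_keys, prefix):
    """Visit every key and filter afterwards (no start/stop keys available)."""
    visited = 0
    matches = []
    for key in sorted_keys:
        visited += 1
        if key[:len(prefix)] == prefix:
            matches.append(key)
    return matches, visited


def range_scan(sorted_keys, start, stop):
    """Seek to the first key >= start and stop before the first key >= stop."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi], hi - lo


if __name__ == "__main__":
    # Hypothetical row keys, sorted lexicographically like HBase row keys.
    keys = sorted(str(n) for n in range(10000, 200000))
    matches, visited = full_scan_with_filter(keys, "12345")
    ranged, touched = range_scan(keys, "12345", "12346")
    print("same result:", matches == ranged)
    print("rows visited by full scan:", visited)
    print("rows touched by range scan:", touched)
```

Both approaches return the same rows; the difference is only in how many rows have to be read to find them.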

Before executing your queries, run EXPLAIN [your_query]; and check the output for a filterExpr: entry. If you do not find it, your query will perform a full table scan. On a side note, all expressions within the Filter Operator: will be converted to the appropriate filters.

 EXPLAIN SELECT * FROM tbl WHERE (id>='12345') AND (id<'12346')
 STAGE PLANS:
   Stage: Stage-1
     Map Reduce
       Alias -> Map Operator Tree:
         tbl
           TableScan
             alias: tbl
             filterExpr:
               expr: ((id>= '12345') and (id < '12346'))
               type: boolean
             Filter Operator
               ....
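If you want to automate that check, a small helper can look for the filterExpr: marker in the plan text. This is a rough sketch: it assumes you have already captured the EXPLAIN output as a string by some means, the helper name is made up, and the exact layout of EXPLAIN output differs between Hive versions, so it only searches for the marker named above.

```python
def has_filter_expr(explain_output: str) -> bool:
    """Return True if the captured EXPLAIN text contains a filterExpr: entry."""
    return "filterExpr:" in explain_output


# Sample plan fragments modelled on the output shown above (abbreviated).
plan_with_filter = """
TableScan
  alias: tbl
  filterExpr:
    expr: ((id >= '12345') and (id < '12346'))
    type: boolean
"""

plan_without_filter = """
TableScan
  alias: tbl
Filter Operator
"""

if __name__ == "__main__":
    print(has_filter_expr(plan_with_filter))
    print(has_filter_expr(plan_without_filter))
```

A plan without the marker is the signal to rewrite the WHERE clause before running the query.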

Fortunately, there is an easy way to make sure start and stop row keys are used when you are looking for row-key prefixes: just convert substr(id, 0, 5) = "12345" into a simpler query, id>="12345" AND id<"12346". The handler will detect it, and the start and stop rows will be passed to the SCAN (12345, 12346).
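The prefix-to-range conversion can be done mechanically: the stop key is the prefix with its last byte incremented. The following Python sketch shows one way to do it. The function names are hypothetical, and it assumes ASCII prefixes that contain no quote characters, so treat it as an illustration rather than a safe query builder.

```python
def prefix_stop_key(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with `prefix`."""
    # Trailing 0xff bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        raise ValueError("prefix has no upper bound")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_predicate(column: str, prefix: str) -> str:
    """Build a 'col >= start AND col < stop' predicate for a key prefix.

    Assumes an ASCII prefix without quote characters (illustration only).
    """
    stop = prefix_stop_key(prefix.encode("ascii")).decode("ascii")
    return f"{column} >= '{prefix}' AND {column} < '{stop}'"


if __name__ == "__main__":
    print("SELECT id FROM tbl WHERE " + prefix_predicate("id", "12345"))
```

Incrementing the last byte works because row keys are compared byte by byte, so every key that starts with the prefix sorts before the incremented value.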


Now, here are some tips to speed up your queries (by a lot):

  • Make sure you set the following properties to take advantage of batching and reduce the number of RPC calls (the right value depends on the size of your columns)

    SET hbase.scan.cache=10000;

    SET hbase.client.scanner.caching=10000;

  • Make sure that you set the following properties so that a distributed job is launched on your JobTracker instead of a local job.

    SET mapred.job.tracker=[YOUR_JOB_TRACKER]:8021;

    SET hbase.zookeeper.quorum=[ZOOKEEPER_NODE_1],[ZOOKEEPER_NODE_2],[ZOOKEEPER_NODE_3];

  • Reduce the number of columns in the SELECT statement to a minimum. Avoid SELECT *.

  • Whenever you want to use start and stop row keys to prevent a full table scan, always provide the expressions key>=x and key<y (do not use the BETWEEN operator).

  • Always run EXPLAIN on your queries before executing them.
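To get a feel for the effect of the scanner caching tip, here is a back-of-the-envelope calculation in Python. It uses a deliberately simplified model in which each scanner RPC returns up to a fixed number of rows; it ignores response size limits and region boundaries, and the row count is an invented example.

```python
def scanner_round_trips(rows: int, caching: int) -> int:
    """Approximate number of scanner RPCs needed to read `rows` rows."""
    if rows < 0 or caching <= 0:
        raise ValueError("rows must be >= 0 and caching must be > 0")
    # Integer ceiling division: a partially filled last batch still costs one RPC.
    return -(-rows // caching)


if __name__ == "__main__":
    total_rows = 1_000_000  # hypothetical table size
    for caching in (1, 100, 10_000):
        print(caching, scanner_round_trips(total_rows, caching))
```

Larger values mean fewer round trips but more memory per batch on both client and server, which is why the right setting depends on the size of your rows.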


Source: https://habr.com/ru/post/986646/