I have a table in Hbase, let them say "tbl", and I would like to query it using Hive. So I matched the table with the hive as follows:
CREATE EXTERNAL TABLE tbl(id string, data map<string,string>) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:") TBLPROPERTIES("hbase.table.name" = "tbl");
Requests like:
select * from tbl", "select id from tbl", "select id, data from tbl
really fast.
But queries like
select id from tbl where substr(id, 0, 5) = "12345" select id from tbl where data["777"] IS NOT NULL
incredibly slow.
Otherwise, when starting from the Hbase shell:
"scan 'tbl', { COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or "scan 'tbl', { COLUMNS=>'data', "FILTER" => FilterList.new([qualifierFilter('777')])}"
lightning fast!
When I looked at the given job created by the hive on jobtracker, I found that "map.input.records" counts ALL the elements in the Hbase table, which means that the job performs a full scan of the table before it even starts any mappers! Moreover, I suspect that it copies all the data from the Hbase table to the hdfs mapper tmp input folder before execution.
So my questions are: why does the hbase hive storage handler not translate the hive of requests into the corresponding hbase functions? Why does he scan all the records and then cut them using the where clause? How can this be improved?
Any suggestions for improving the performance of Hive queries (mapped to the HBase table).
Is it possible to create a secondary index in HBase tables?
We are using HBase and Hive integration and trying to tune the performance of Hive requests.