Mapping a data warehouse star schema to HBase

Suppose, hypothetically, that I have a star schema in a data warehouse setting. There is one VERY, VERY long fact table (read: billions to trillions of rows) and many low-cardinality dimension tables (read: 100 dimension tables). Each fact table foreign key that points to a dimension table's primary key is bitmap indexed. Each dimension table's primary key is also bitmap indexed. This is all for fast joins. All pretty standard.

Suppose the data warehouse starts showing performance degradation. The time it takes to return results from the bitmapped joins gets worse as the fact table grows. The business requirement is that the fact table keep growing (we cannot move data older than a year off to archival storage).

I am thinking of the following solutions:

  • Hash-partition the fact table, but this only postpones the inevitable growth problem.
  • Shard the physical star schema into multiple schemas/databases: 1..N fact tables plus their own copies of the dimension tables, each storing the data assigned to it by a hash function (1..N) that is applied in a separate intermediate ETL database to decide which database/schema a fact row belongs in (as part of the ETL process). If any dimension changes, propagate that change to the dimension copies in the other databases. Again, this will not work as a permanent solution.
  • Collapse the dimensions and store all dimension values directly in the fact table. Then import that fact table into HBase on Hadoop. You end up with a massive HBase table, a key-value store with no dimension tables. I would do it this way because joins are effectively off the table in HBase (so no fact-to-dimension joins; you just repeat the dimension values in denormalized columns); see the sketch just after this list.
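
To make solution #3 concrete, here is a rough sketch (using the HBase 1.x Java client) of what one fully denormalized fact row might look like. The table name, column families ("d" for dimension attributes, "m" for measures), qualifiers, and row-key layout are made-up examples, not a recommendation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DenormalizedFactWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table facts = conn.getTable(TableName.valueOf("fact_sales"))) {

            // Row key packs the dimension values that drive most queries,
            // most selective first, joined by a delimiter (no joins needed later).
            byte[] rowKey = Bytes.toBytes("2012-06-01#store_042#prod_9981");

            Put put = new Put(rowKey);
            // Dimension attributes copied ("collapsed") straight into the fact row.
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("store_region"), Bytes.toBytes("EMEA"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("product_category"), Bytes.toBytes("widgets"));
            // Measures.
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("units_sold"), Bytes.toBytes(17L));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("revenue"), Bytes.toBytes(423.50d));

            facts.put(put);
        }
    }
}
```

Most of the design effort would go into the row-key layout, since that is the only access path HBase gives you for free.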

Has anyone ever done this before?

Does anyone have any advice on solution #3?

Does HBase scale well for fast reads?

As for writes, I don't care about write speed, since they will run after hours as batch processes.

If anyone has gone with solution 1 or 2, did you use a consistent hashing algorithm (to avoid remapping everything, as a plain old hash would require, when partitions are added and hash keys are generated dynamically)? Dynamically growing the number of partitions without a full remapping is probably not an option (I have not seen it done in practice with partitioned tables), so it seems to me that any partitioning solution will eventually hit a scaling wall.
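
For what it's worth, here is a toy consistent-hash ring in plain Java (no particular database's partitioning feature implied) that illustrates the property I am after: adding a partition only takes keys from its neighbors on the ring instead of remapping everything.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring: each partition is placed on the ring at several
// virtual-node positions, and a fact row's key is routed to the first
// partition at or after its own hash position (wrapping around at the end).
public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 128;
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public void addPartition(String partition) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(partition + "#" + i), partition);
        }
    }

    public String partitionFor(String factKey) {
        long h = hash(factKey);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);   // first 8 digest bytes -> long
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The intermediate ETL step would call ring.addPartition("shard_1") once per shard and then partitionFor(factRowKey) for every incoming fact row; only rows whose ring segment is taken over by a newly added shard would have to move.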

Any thoughts, advice, or experience with moving a giant fact table with many dimensions (a traditional star-schema DW) into one giant, dimension-less HBase table?

Related Question:

How do aggregates that traditionally live in materialized views (or, alternatively, in separate fact tables tied to the same dimensions as the most granular fact table, i.e. hourly/daily/weekly/monthly, where the base fact table is hourly) map from the data warehouse onto HBase?

(Image: aggregate fact tables in a partial star schema.)

My thinking is that, since there are no materialized views in HBase, the aggregates would be stored as additional HBase tables that are updated/upserted whenever the lowest-level, most granular fact table changes.

Any thoughts on aggregate tables in HBase? Has anyone used Hive scripts to essentially mimic materialized-view behavior, updating the aggregate columns in secondary HBase tables (e.g. daily_aggregates_fact_table, weekly_aggregates_fact_table, month_aggregates_fact_table) whenever the granular fact table changes?
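
In case it helps frame the question, this is the kind of thing I imagine the nightly batch load doing in place of a materialized-view refresh: bumping counters in the aggregate HBase tables as each granular fact is loaded. The table names mirror my hypothetical examples above, and the "m" column family is likewise made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AggregateUpdater {
    // Called from the batch load for every granular fact row that is inserted,
    // so the aggregate tables stay in step without a materialized view.
    static void bumpAggregates(Connection conn, String storeId, String day,
                               String week, long unitsSold) throws Exception {
        try (Table daily = conn.getTable(TableName.valueOf("daily_aggregates_fact_table"));
             Table weekly = conn.getTable(TableName.valueOf("weekly_aggregates_fact_table"))) {

            Increment dailyInc = new Increment(Bytes.toBytes(day + "#" + storeId));
            dailyInc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("units_sold"), unitsSold);
            daily.increment(dailyInc);

            Increment weeklyInc = new Increment(Bytes.toBytes(week + "#" + storeId));
            weeklyInc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("units_sold"), unitsSold);
            weekly.increment(weeklyInc);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            bumpAggregates(conn, "store_042", "2012-06-01", "2012-W22", 17L);
        }
    }
}
```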

+4
2 answers

The dimensions would become the rowkey in HBase. The value is your measure's value. If your fact table has no measures, the value in the HBase row can be null.

Based on the scant resources available on the Internet, I think the idea is this:

| RowKey | Value |
|---|---|
| DimensionA | XX |
| DimensionA:DimensionB | XX |
| DimensionB:DimensionC | XX |
| DimensionA:DimensionB:DimensionC | XXX |

Is this suitable for your problems?
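
A rough sketch of how a read against that layout could look with the HBase Java client: asking for everything under one DimensionA value becomes a row-key prefix scan. The table name "fact_cube" and the "v:measure" column are assumptions for illustration:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DimensionPrefixScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table cube = conn.getTable(TableName.valueOf("fact_cube"))) {

            // All rows whose key starts with this DimensionA value: the DimensionA
            // row itself plus every DimensionA:DimensionB... combination under it.
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes("someDimensionAValue"));

            try (ResultScanner rs = cube.getScanner(scan)) {
                for (Result r : rs) {
                    System.out.println(Bytes.toString(r.getRow()) + " -> "
                            + Bytes.toString(r.getValue(Bytes.toBytes("v"), Bytes.toBytes("measure"))));
                }
            }
        }
    }
}
```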

+1

HBase is not a good choice for a general-purpose data warehouse (with real-time querying). Any single table will only let you drill along one dimension, or along one dimension path (if you set up the composite key correctly). That's not to say it can't be done (for example, eBay built its new search engine on HBase), but it doesn't come out of the box.

There are several attempts to provide high-performance SQL over Hadoop (e.g. Hadapt or Rainstor), but they will not give you the performance of good massively parallel databases such as Vertica, Greenplum, Asterdata, Netezza, etc.

0

Source: https://habr.com/ru/post/1436484/

