Why does the author recommend the Tall-Thin HBase schema over the Short-Wide schema described below?

I read about Tall-Thin vs Short-Wide HBase schemas, and the author offers the following argument, which I don't understand:

It’s better to consider the Tall-Thin design, because we know it helps retrieve data faster by letting us read the single column family for a user’s blog entries at once, instead of traversing many rows. In addition, since HBase splits happen on rows, data related to a specific user can be found on a single region server.

The proposed Short-Wide design for the blog site is below (one row per author, and each new blog entry is a new column):

+-----------------+---------------------------------------+
|                 | BECF (Blog entry Column family)       |
+-----------------+---------+---------+---------+---------+
| RowKey (UserID) | BECF:BT | BECF:BT | BECF:BT | BECF:BT |
+-----------------+---------+---------+---------+---------+
| WriterA         | Entry1  | Entry2  | Entry3  |         |
| WriterB         | EntryA  | EntryB  | ...     |         |
+-----------------+---------+---------+---------+---------+

The proposed Tall-Thin design is below (where each new blog entry is a new row):

+---------------------------+---------------------------------+
|                           | BECF (Blog entry Column family) |
+---------------------------+---------------------------------+
| RowKey (UserID+TimeStamp) | BlogEntriesCF:Entries           |
+---------------------------+---------------------------------+
| WriterATimeStamp1         | Entry1                          |
| WriterATimeStamp2         | Entry2                          |
| WriterATimeStamp3         | Entry3                          |
| WriterBTimeStamp1         | EntryA                          |
| WriterBTimeStamp2         | EntryB                          |
+---------------------------+---------------------------------+

  • Why does the author believe the Tall-Thin design is better because "it lets you read the single column family for a user's blog entries at once, instead of traversing many rows"?

  • Wouldn't the Short-Wide design read just one row to retrieve all the entries? Why isn't that the better design?

+6
3 answers

Well, the first thing you avoid is row locking.

Say you have a wide row and you need to update it. That means the row must be locked. No other writer can update it at that moment, because it is locked. They must wait until the lock is released.

With tall and thin, the data sits in a single field in a short row, so updating it causes no problems for other writers who want to update their own data, which lives in a separate row.
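A rough way to picture the contention, as a minimal sketch only: it assumes one lock per row key and reuses the product/season pricing example from further down. The update values and the helper names (`wide_row_key`, `tall_row_key`, `writers_per_row`) are made up for illustration and are not from any real database API.

```python
from collections import Counter

# Three writers each change one product's price for the same season.
# The prices here are hypothetical.
updates = [
    ("ProductA", "xmas", 205),
    ("ProductB", "xmas", 215),
    ("ProductC", "xmas", 195),
]

def wide_row_key(product, period):
    # Wide layout: one row per period, one column per product.
    return period

def tall_row_key(product, period):
    # Tall layout: one row per (product, period) pair.
    return (product, period)

def writers_per_row(updates, row_key):
    """Count how many writers need the lock of each row."""
    return Counter(row_key(product, period) for product, period, _price in updates)

wide = writers_per_row(updates, wide_row_key)
tall = writers_per_row(updates, tall_row_key)

print("wide:", dict(wide))  # all three writers queue on the 'xmas' row
print("tall:", dict(tall))  # each writer gets a row of its own
```

In the wide layout every writer needs the same row, so they take turns; in the tall layout no two writers touch the same row.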

Tall and thin also lends itself to dynamic relationships, a growing user base, faster indexes, and better response times.

It is not really human-readable, but it is far easier for machines to handle, join, modify, and restructure.

If you have an object-relational mapping layer (e.g. Java Hibernate, PHP Eloquent, etc.), it becomes absurdly easy to turn this into a oneToMany or ManyToMany relationship and to update, modify, and query the objects in general.

Tall and thin also makes it easy to reuse the same data objects elsewhere without having to create views to sanitize or strip out unwanted data.

For instance:

I have a database of prices for product A, product B, and product C. Prices have dates during which they are active, corresponding to seasons (Christmas, etc.). All products in my example are governed by the same seasons.

Wide:

  date_from  | date_to    | ProductA_price | ProductB_price | ProductC_price
  22-10-2000 | 22-11-2000 | 100            | 110            | 90
  23-11-2000 | 26-12-2000 | 200            | 210            | 190
  27-12-2000 | 22-01-2001 | 100            | 110            | 90

Now, if you want to add an additional product, you need to do the following:

  • Alter the table. This can be very expensive on a large table and can cause downtime.
  • Update the prices, causing many row locks.
  • Change the queries. Queries are used ALL OVER THE PLACE. All of them must account for the additional column, especially if they use select *.
  • Change the application code. The extra column can break sloppy loops. Array iterators must be modified to account for the additional product.
  • Expect a long period of fixes after the change if the codebase is somewhat legacy.
  • Update hard-coded table name references.

Tall:

 table: Products
 id | product_name
 1  | ProductA
 2  | ProductB
 3  | ProductC

 table: Periods
 id | name    | date_from  | date_to
 1  | autumn  | 22-10-2000 | 22-11-2000
 2  | xmas    | 23-11-2000 | 26-12-2000
 3  | newyear | 27-12-2000 | 22-01-2001

 table: Prices
 product_id | period_id | price
 1          | 1         | 100
 2          | 1         | 110
 3          | 1         | 90
 1          | 2         | 200
 2          | 2         | 210
 3          | 2         | 190
 1          | 3         | 100
 2          | 3         | 110
 3          | 3         | 90

Now, if you want to add an additional product, you need to do the following:

  • Add the product to the Products table.
  • Add entries to the Prices table for the periods whose date is > now().

Since it is all relational, the code already treats it as relational, reads it as such, and simply picks up the new product in the existing code flow.
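As a minimal sketch of that claim, here is the tall layout in an in-memory SQLite database using Python's standard `sqlite3` module. The lower-case table names, the query, the `price_list` helper, and the new `ProductD` with its price of 120 are all assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE periods  (id INTEGER PRIMARY KEY, name TEXT, date_from TEXT, date_to TEXT);
CREATE TABLE prices   (product_id INTEGER, period_id INTEGER, price INTEGER);

INSERT INTO products VALUES (1, 'ProductA'), (2, 'ProductB'), (3, 'ProductC');
INSERT INTO periods VALUES
    (1, 'autumn',  '22-10-2000', '22-11-2000'),
    (2, 'xmas',    '23-11-2000', '26-12-2000'),
    (3, 'newyear', '27-12-2000', '22-01-2001');
INSERT INTO prices VALUES
    (1, 1, 100), (2, 1, 110), (3, 1, 90),
    (1, 2, 200), (2, 2, 210), (3, 2, 190),
    (1, 3, 100), (2, 3, 110), (3, 3, 90);
""")

# The one query the application uses; it never names a product column.
QUERY = """
SELECT p.product_name, pe.name, pr.price
FROM prices pr
JOIN products p ON p.id = pr.product_id
JOIN periods pe ON pe.id = pr.period_id
ORDER BY pe.id, p.id
"""

def price_list(conn):
    return conn.execute(QUERY).fetchall()

before = price_list(conn)

# Adding a product is two INSERTs: no ALTER TABLE, no query changes.
conn.execute("INSERT INTO products VALUES (4, 'ProductD')")
conn.execute("INSERT INTO prices VALUES (4, 3, 120)")  # hypothetical price

after = price_list(conn)
print(len(before), len(after))
print(after[-1])
```

The same `QUERY` returns the new product's row without being edited, which is the point of the tall layout.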

+5

Your quote is from the book Learning HBase. The quote is inaccurate, but that is good news :)

Look at how the author actually describes Tall-Thin:

In Tall-Thin table design, a table grows downward faster than it grows to the right. [...]

RowKey (UserID + TimeStamp) | BlogEntriesCF:Posts
----------------------------+--------------------
WriterATimeStamp1           | HBaseEntry
WriterBTimeStamp2           | HadoopEntry
WriterATimeStamp3           | HadoopEntry
...                         | ...

Please note that the row keys are not sorted here, unlike in your example, which explains the confusion. This example is what motivates the need, for WriterA for instance, to traverse many rows.

However, HBase doesn’t work like that: it actually sorts the keys before they are written to disk (technically, the mutations are not sorted in the WAL, but if all goes well the WAL is never replayed, and if it is, the mutations are replayed into the MemStore, which keeps them sorted).
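To make the effect of sorted keys concrete, here is a toy sketch in plain Python. It is not HBase code: `SortedStore` is a made-up stand-in for a store that keeps row keys in order, and the `|` separator and zero-padded timestamps in the keys are assumptions for illustration.

```python
import bisect

class SortedStore:
    """Toy stand-in for a store that keeps row keys in sorted order."""

    def __init__(self):
        self._keys = []     # row keys, always kept sorted
        self._values = {}   # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert at the sorted position
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Return all rows whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        rows = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            rows.append((key, self._values[key]))
        return rows

store = SortedStore()
# Written in interleaved order: WriterA, WriterB, WriterA.
store.put("WriterA|0001", "Entry1")
store.put("WriterB|0002", "EntryA")
store.put("WriterA|0003", "Entry2")

rows = store.scan_prefix("WriterA|")
print(rows)
```

Even though the writes were interleaved, WriterA's rows end up next to each other, so reading them is one contiguous scan rather than a walk over unrelated rows.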

Since HBase splits happen on rows, data related to a specific user can be found on a single region server.

This part seems, logically, to apply to Short-Wide ...

So, to summarize, I think this part of the book may need a revision. See this great MapR blog post for a quick overview of HBase under the hood.

+1

"Narrow or stacked data is presented with one column containing all the values and another column that lists the context of the value. It is often easier to implement, adding a new field does not require any changes in the structure, but it may be more difficult for people to understand."

From "Wide and narrow data", Wikipedia https://en.wikipedia.org/wiki/Wide_and_narrow_data [accessed 12.29.16]

I take this to mean that if you want a plain list of values without caring about their context, you just read the one column. To do the same with a short-wide structure, you have to find the row and then reach the desired column, and do that for every row instead of just one.
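A small sketch of that difference in plain Python. The sample data and field names are made up for illustration; they only mirror the wide-versus-narrow shapes described in the Wikipedia quote.

```python
# Wide: one record per entity, one field per measured thing (hypothetical data).
wide_rows = [
    {"person": "Ann", "age": 32, "weight": 60},
    {"person": "Bob", "age": 41, "weight": 82},
]

# Narrow: one record per value, plus a column naming its context.
narrow_rows = [
    ("Ann", "age", 32),
    ("Ann", "weight", 60),
    ("Bob", "age", 41),
    ("Bob", "weight", 82),
]

# Narrow: every value lives in the same column, so one pass reads them all.
narrow_values = [value for _person, _variable, value in narrow_rows]

# Wide: visit every row, then every value field inside that row.
wide_values = []
for row in wide_rows:
    for field, value in row.items():
        if field != "person":  # skip the identifier field
            wide_values.append(value)

print(narrow_values)
print(wide_values)
```

Both loops end up with the same values; the narrow form just does not need to know which fields exist.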


0
source

Source: https://habr.com/ru/post/1013537/

