Automatically split HBase regions using hbase.hregion.max.filesize

I use the Cloudera distribution of HBase (hbase-0.94.6-cdh4.5.0) and Cloudera Manager to manage all cluster configuration.

I set the following property for HBase:

<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value>
  <source>hbase-default.xml</source>
</property>

NB: 10737418240 bytes = 10 GB
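To double-check which value the client stack actually resolves, here is a minimal sketch using the standard HBase client API (it assumes the cluster's hbase-site.xml is on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class CheckMaxFilesize {
        public static void main(String[] args) {
            // Loads hbase-default.xml and hbase-site.xml from the classpath.
            Configuration conf = HBaseConfiguration.create();
            long max = conf.getLong("hbase.hregion.max.filesize", -1L);
            System.out.println("hbase.hregion.max.filesize = " + max
                    + " bytes (" + (max / (1024.0 * 1024 * 1024)) + " GB)");
        }
    }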

So, according to all the documentation I have read, data should accumulate in a single region until that region reaches 10 GB in size.

But it does not seem to work that way... maybe I missed something...

Here are all the regions of my HBase table and their sizes:

root@hadoopmaster01:~# hdfs dfs -du -h /hbase/my_table
719      /hbase/my_table/.tableinfo.0000000001
0        /hbase/my_table/.tmp
222.2 M  /hbase/my_table/08e225d0ae802ef805fff65c89a15de6
602.7 M  /hbase/my_table/0f3bb09af53ebdf5e538b50d7f08786e
735.1 M  /hbase/my_table/1152669b3ef439f08614e3785451c305
2.8 G    /hbase/my_table/1203fbc208fc93a702c67130047a1e4f
379.3 M  /hbase/my_table/1742b0e038ece763184829e25067f138
7.3 G    /hbase/my_table/194eae40d50554ce39c82dd8b2785d96
627.1 M  /hbase/my_table/28aa1df8140f4eb289db76a17c583028
274.6 M  /hbase/my_table/2f55b9760dbcaefca0e1064ce5da6f48
1.5 G    /hbase/my_table/392f6070132ec9505d7aaecdc1202418
1.5 G    /hbase/my_table/4396a8d8c5663de237574b967bf49b8a
1.6 G    /hbase/my_table/440964e857d9beee1c24104bd96b7d5c
1.5 G    /hbase/my_table/533369f47a365ab06f863d02c88f89e2
2.5 G    /hbase/my_table/6d86b7199c128ae891b84fd9b1ccfd6e
1.2 G    /hbase/my_table/6e5e6878028841c4d1f4c3b64d04698b
1.6 G    /hbase/my_table/7dc1c717de025f3c15aa087cda5f76d2
200.2 M  /hbase/my_table/8157d48f833bb3b708726c703874569d
118.0 M  /hbase/my_table/85fb1d24bf9d03d748f615d3907589f2
2.0 G    /hbase/my_table/94dd01c81c73dc35c02b6bd2c17d8d22
265.1 M  /hbase/my_table/990d5adb14b2d1c936bd4a9c726f8e03
335.0 M  /hbase/my_table/a9b673c142346014e01d7cf579b0e58a
502.1 M  /hbase/my_table/ae3b1f6f537826f1bdb31bfc89d8ff9a
763.3 M  /hbase/my_table/b6039c539b6cca2826022f863ed76c7b
470.7 M  /hbase/my_table/be091ead2a408df55999950dcff6e7bc
5.9 G    /hbase/my_table/c176cf8c19cc0fffab2af63ee7d1ca45
512.0 M  /hbase/my_table/cb622a8a55ba575549759514281d5841
1.9 G    /hbase/my_table/d201d1630ffdf08e4114dfc691488372
787.9 M  /hbase/my_table/d78b4f682bb8e666488b06d0fd00ef9b
862.8 M  /hbase/my_table/edd72e02de2a90aab086acd296d7da2b
627.5 M  /hbase/my_table/f13a251ff7154f522e47bd54f0d1f921
1.3 G    /hbase/my_table/fde68ec48d68e7f61a0258b7f8898be4

As you can see, there are many regions, and none of them is anywhere near 10 GB in size...

If someone has encountered this problem, or knows of another setting that needs to be configured, please help me!

thanks

+6
2 answers

@mpiffaretti, what you are seeing is expected behavior. I was also a little shocked the first time I saw the region sizes after automatic splitting.

In HBase 0.94+, the default split policy is IncreasingToUpperBoundRegionSplitPolicy. The split size is determined by the algorithm described below.

The split size is the number of regions on this region server that belong to the same table, cubed, times 2x the region flush size, OR the maximum region split size, whichever is smaller. For example, if the flush size is 128 MB, then after two flushes (256 MB) the region splits, which makes two regions that will each split when their size reaches 2^3 * 128 MB * 2 = 2048 MB. If one of those regions splits, there are three regions, and now the split size is 3^3 * 128 MB * 2 = 6912 MB, and so on, until the configured maximum file size is reached, and from there onward the maximum file size is used.
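To make the arithmetic concrete, here is a small sketch of the formula as described above (my own illustration, not the HBase source), assuming a 128 MB flush size and the 10 GB maximum from the question:

    public class SplitSizeDemo {
        // Split threshold per the policy description:
        // min(regionCount^3 * 2 * flushSize, maxFileSize)
        static long splitSize(long regionCount, long flushSize, long maxFileSize) {
            return Math.min(regionCount * regionCount * regionCount * 2 * flushSize,
                            maxFileSize);
        }

        public static void main(String[] args) {
            long flushSize   = 128L << 20;   // hbase.hregion.memstore.flush.size: 128 MB
            long maxFileSize = 10L  << 30;   // hbase.hregion.max.filesize: 10 GB
            for (int regions = 1; regions <= 5; regions++) {
                System.out.printf("%d region(s) -> split at %d MB%n",
                        regions, splitSize(regions, flushSize, maxFileSize) >> 20);
            }
            // Prints: 256, 2048, 6912, 10240 (capped), 10240 (capped)
        }
    }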

This is a pretty good strategy, as you start getting a nice distribution of regions across the region servers early on, without waiting until they reach the 10 GB limit.

As an alternative, you would do better to pre-split your tables, since you want to make sure you get the most out of the computing power of your cluster: if you have a single region, all requests go to the one region server that hosts that region. Pre-splitting gives you control over how the regions partition the row key space, as sketched below.
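For example, with the 0.94 Java client a table can be pre-split at creation time (the table name, column family, and split keys below are placeholders you would adapt to your own key space):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
                HTableDescriptor desc = new HTableDescriptor("my_presplit_table");
                desc.addFamily(new HColumnDescriptor("cf"));
                // Three split points -> four regions covering the row key space.
                byte[][] splitKeys = {
                    Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")
                };
                admin.createTable(desc, splitKeys);
            } finally {
                admin.close();
            }
        }
    }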

+7

Pre-splitting is the best option. Without it, your data keeps being inserted into a single region, and when that region reaches its size limit, a split or compaction takes place.

In this state, writes are not evenly distributed, and while the region is being compacted it becomes a bottleneck for the writers.

The number of requests hitting the single active region will be high; a common way to avoid this hotspotting is sketched below.
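When row keys are monotonically increasing, a common complement to pre-splitting is to salt the key so writes spread across regions. A generic sketch (the bucket count and key format are arbitrary choices, not from the question):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKey {
        static final int NUM_BUCKETS = 16;  // ideally matches the number of pre-split regions

        // Prefix the row key with a stable hash-derived bucket id so
        // consecutive keys land in different regions.
        static byte[] salt(byte[] rowKey) {
            int bucket = (Bytes.hashCode(rowKey) & Integer.MAX_VALUE) % NUM_BUCKETS;
            return Bytes.add(Bytes.toBytes(String.format("%02d-", bucket)), rowKey);
        }
    }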

0

Source: https://habr.com/ru/post/969866/
