Does HBase use a primary index?

How does HBase look up and retrieve a record? For example, what is the HBase equivalent of an RDBMS B-tree?

[EDIT]

I understand how HBase uses the -ROOT- and .META. tables to find out which region holds the data. But how is the local lookup performed?

To better illustrate, here is an example:

  • I run a lookup (a get or a scan) for the record with key 77.
  • The HBase client determines that the key falls in region 50-100, which is hosted on RegionServer X.
  • The HBase client contacts RegionServer X to get the data.

How does RegionServer X find the location of record 77?

Does the RegionServer use some kind of lookup table (e.g. something like RDBMS B-trees) for the keys within a region? Or does it have to read through the contents of the StoreFiles, covering all records from 50 to 77?

1 answer

TL;DR: It looks like HBase (like BigTable) uses a structure similar to a B+ tree to perform the lookup. So the row key is the primary index (and, by default, the only index of any kind in HBase).

Long answer: From this Cloudera blog post about the HBase write path, it seems that HBase works as follows:

Each HBase table is hosted and managed by sets of servers that fall into three categories:

  • One active master server
  • One or more backup master servers
  • Many region servers

Region servers contribute to handling the HBase tables. Because HBase tables can be large, they are broken up into partitions called regions. Each region server handles one or more of these regions.

The next paragraph goes into more detail:

Since the row key is sorted, it is easy to determine which region server manages which key .... Each row key belongs to a specific region, which is served by a region server. So, based on the key of the put or delete, an HBase client can locate the proper region server. First, it gets the address of the region server hosting the -ROOT- region from the ZooKeeper quorum. From the root region server, the client finds out the location of the region server hosting the -META- region. From the meta region server, we finally locate the actual region server that serves the requested region. This is a three-step process, so the region location is cached to avoid this expensive series of operations.
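To make the three-step lookup and the caching concrete, here is a minimal Python sketch. It is a toy model, not HBase client code: the dictionaries, the server names (`rs-root`, `rs-meta`, `rs-a`, `rs-x`, `rs-b`) and the `RegionLocator` class are all hypothetical, and the region boundaries are taken from the question's example (key 77 in region 50-100 on RegionServer X).

```python
import bisect

# Hypothetical toy catalog; none of these names are real HBase APIs.
ZOOKEEPER = {"root_server": "rs-root"}   # step 1: who hosts -ROOT-
ROOT_TABLE = {"rs-root": "rs-meta"}      # step 2: who hosts the meta region
# Step 3: the meta region lists (start_key, server) pairs sorted by start key.
META_TABLE = {"rs-meta": [(0, "rs-a"), (50, "rs-x"), (100, "rs-b")]}


class RegionLocator:
    def __init__(self):
        self._cache = None   # region list, filled in by the first lookup
        self.lookups = 0     # number of times the full three-step walk ran

    def _load_regions(self):
        # The "expensive" walk: ZooKeeper -> root region -> meta region.
        self.lookups += 1
        root_server = ZOOKEEPER["root_server"]
        meta_server = ROOT_TABLE[root_server]
        return META_TABLE[meta_server]

    def locate(self, row_key):
        if self._cache is None:
            self._cache = self._load_regions()
        start_keys = [start for start, _ in self._cache]
        # Pick the last region whose start key is <= row_key.
        idx = bisect.bisect_right(start_keys, row_key) - 1
        if idx < 0:
            raise KeyError(row_key)
        return self._cache[idx][1]


locator = RegionLocator()
print(locator.locate(77))   # region starting at 50
print(locator.locate(12))   # served from the cache, no second walk
```

Because the region start keys are sorted, picking the region is itself a binary search, and after the first call every later lookup is answered from the cached region list.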

From another Cloudera blog post, it seems that the exact format used to store HBase data on the file system keeps changing, but the above mechanism for locating row keys should be more or less consistent.

This mechanism is very, very similar to Google BigTable's lookup (for details, see section 5.1, starting at the end of page 4 of the PDF), which uses a three-level hierarchy to find the location of a row: Chubby -> Root tablet -> METADATA tablets -> actual tablet

UPDATE: To answer the question about the lookup inside the RegionServer itself: I don't know for sure, but since the row keys are sorted and HBase knows the start and end keys, I suspect it uses a binary search or an interpolation search, both of which are very fast: O(log n) and, on average for uniformly distributed keys, O(log log n), respectively. I don't think HBase would ever need to scan rows from the start key to the one it needs to find, since searching over sorted keys is a well-known problem with several efficient solutions.
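As an illustration of that guess (and only of the guess, not of what HBase actually does internally), here is a small Python sketch of a binary search over sorted row keys. The `find_row` helper and the sample keys are hypothetical; the key range mirrors the 50-100 region from the question.

```python
import bisect


def find_row(sorted_keys, values, row_key):
    """Return the value stored under row_key, or None if it is absent."""
    # bisect_left does a binary search: O(log n) comparisons, no linear scan.
    i = bisect.bisect_left(sorted_keys, row_key)
    if i < len(sorted_keys) and sorted_keys[i] == row_key:
        return values[i]
    return None


# Hypothetical contents of the region covering keys 50-100.
keys = [50, 53, 61, 77, 84, 99]
vals = ["row-%d" % k for k in keys]

print(find_row(keys, vals, 77))   # key present
print(find_row(keys, vals, 78))   # key absent
```

The point is that record 77 is found without reading the entries from 50 upward: each comparison halves the remaining range.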


Source: https://habr.com/ru/post/1439974/

