TL;DR: It looks like HBase (like BigTable) uses a structure similar to a B+ tree to perform lookups. So the row key is the primary index (and the only index of any type in HBase by default).
Long answer: From this Cloudera blog post about the HBase write path, it seems that HBase works as follows:
Each HBase table is hosted and managed by sets of servers that fall into three categories:
- One active master server
- One or more backup master servers
- Many region servers
Region servers contribute to handling the HBase tables. Because HBase tables can be large, they are partitioned into sections called regions. Each region server handles one or more of these regions.
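To make the partitioning concrete, here is a minimal sketch of how a sorted key space split into regions can be mapped to servers. The start keys, server names, and function names are all hypothetical and chosen for illustration; this is not HBase's actual code.

```python
from bisect import bisect_right

# Hypothetical layout: each region is identified by its start key and covers
# [start_key, next_start_key). The empty string marks the first region.
REGION_START_KEYS = ["", "g", "n", "t"]
# One region server may host several regions (here rs1 hosts two).
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]


def region_index(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key strictly greater than row_key;
    # the region just before that position holds the key.
    return bisect_right(REGION_START_KEYS, row_key) - 1


def server_for(row_key: str) -> str:
    """Return the (hypothetical) region server responsible for row_key."""
    return REGION_SERVERS[region_index(row_key)]
```

Because the start keys are sorted, finding the owning region is a single binary search over the region boundaries.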
The next paragraph goes into more detail:
Since the row key is sorted, it is easy to determine which region server manages which key. ... Each row key belongs to a specific region, which is served by a region server. So, based on the put or delete's key, an HBase client can locate the proper region server. First, it locates the address of the region server hosting the -ROOT- region from the ZooKeeper quorum. From the root region server, the client finds the location of the region server hosting the -META- region. From the meta region server, we finally locate the actual region server that serves the requested region. This is a three-step process, so the region location is cached to avoid this expensive series of operations.
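The three-step lookup and the client-side cache described in that quote can be modeled roughly as follows. This is a toy simulation under my own assumptions: the class name `RegionLocator`, the dictionary layout, and the `remote_lookups` counter are all hypothetical and do not correspond to the real HBase client API.

```python
class RegionLocator:
    """Toy model of the ZooKeeper -> -ROOT- -> -META- -> region server lookup."""

    def __init__(self, zookeeper, servers):
        self.zookeeper = zookeeper  # hypothetical: {"root": server hosting -ROOT-}
        self.servers = servers      # hypothetical: server name -> catalog data
        self.cache = []             # cached (start_key, end_key, server) entries
        self.remote_lookups = 0     # counts simulated network round trips

    @staticmethod
    def _contains(entry, row_key):
        start, end, _ = entry
        # end is None for the last region, which has no upper bound.
        return start <= row_key and (end is None or row_key < end)

    def locate(self, row_key):
        # Cached region locations avoid the three remote steps entirely.
        for entry in self.cache:
            if self._contains(entry, row_key):
                return entry[2]

        # Step 1: ask ZooKeeper which server hosts the -ROOT- region.
        root_server = self.zookeeper["root"]
        self.remote_lookups += 1
        # Step 2: ask the root region server where the -META- region lives.
        meta_server = self.servers[root_server]["meta_location"]
        self.remote_lookups += 1
        # Step 3: ask the meta region server which server holds the key's region.
        meta_table = self.servers[meta_server]["meta_table"]
        self.remote_lookups += 1

        for entry in meta_table:
            if self._contains(entry, row_key):
                self.cache.append(entry)
                return entry[2]
        raise KeyError(row_key)


# Hypothetical cluster state used only for illustration.
EXAMPLE_ZOOKEEPER = {"root": "rs1"}
EXAMPLE_SERVERS = {
    "rs1": {"meta_location": "rs2"},
    "rs2": {"meta_table": [("", "m", "rs3"), ("m", None, "rs4")]},
}
```

In this model, the first lookup for a key in a given region costs three round trips, and later lookups for any key in the same region cost none, which is the point of the cache.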
From another Cloudera blog post, it seems that the exact format used to store HBase data on the file system keeps changing, but the above mechanism for row key lookups should be more or less consistent.
This mechanism is very, very similar to Google BigTable's lookup (for details, see section 5.1, starting at the end of page 4 of the PDF), which uses a three-level hierarchy to query the location of a row key: Chubby -> Root tablet -> METADATA tablets -> actual tablet
UPDATE: to answer the question about lookups inside the region server itself: I don't know for sure, but since the row keys are sorted, and HBase knows the start and end keys, I suspect that it uses binary search or interpolation search, both of which are very fast: O(log n) and, on average for uniformly distributed keys, O(log log n), respectively. I don't think HBase would ever need to scan rows from the start key to the one it needs to find, since searching on sorted keys is a well-known problem with several efficient solutions.
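For reference, here is a minimal sketch of the two search techniques I mentioned, run over a sorted list of keys. This only illustrates the algorithms; I am not claiming it is what HBase does internally, and the function names are my own. Interpolation search is shown on integer keys, since it needs keys it can do arithmetic on.

```python
from bisect import bisect_left


def binary_search(sorted_keys, target):
    """Return the index of target in sorted_keys, or -1 if absent. O(log n)."""
    i = bisect_left(sorted_keys, target)
    if i < len(sorted_keys) and sorted_keys[i] == target:
        return i
    return -1


def interpolation_search(sorted_nums, target):
    """Return an index of target in sorted_nums, or -1 if absent.

    Estimates the position from the key's value; about O(log log n) on
    average for uniformly distributed keys, but O(n) in the worst case.
    """
    lo, hi = 0, len(sorted_nums) - 1
    while lo <= hi and sorted_nums[lo] <= target <= sorted_nums[hi]:
        if sorted_nums[lo] == sorted_nums[hi]:
            # All remaining keys are equal; avoid dividing by zero.
            return lo if sorted_nums[lo] == target else -1
        # Guess the position proportionally to where target falls in the range.
        span = sorted_nums[hi] - sorted_nums[lo]
        pos = lo + (target - sorted_nums[lo]) * (hi - lo) // span
        if sorted_nums[pos] == target:
            return pos
        if sorted_nums[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

Either way, a sorted key space means a lookup never has to scan from the start key.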