Sequential row identifiers in column-oriented databases (HBase, Cassandra)?

I have seen two conflicting pieces of advice about designing row identifiers in HBase (in particular, but I think this applies to Cassandra as well):

  • Group keys that you will often fetch together, to take advantage of data locality. (White, Hadoop: The Definitive Guide, and I remember seeing it on the HBase website, but I cannot find it ...)
  • Distribute keys so that the work can be shared among multiple machines. (Twitter, Pig and HBase at Twitter, slide 14)

I guess which one is optimal may depend on your use case, but does anyone have experience with either strategy?

1 answer

In HBase, a table is broken down into regions, which partition the key space; the key space is sorted lexicographically. Each region of a table belongs to a single region server, so all reads and writes for that region are handled by that server (which is what provides the strong consistency guarantee). This means that if all your reads or writes are concentrated on a small range of your key space, you can only scale to what a single region server can handle. For example, if your data is a time series keyed by timestamp, then all writes go to the last region in the table, and you are limited to the write rate that one server can sustain.
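To make the hotspot concrete, here is a minimal sketch in plain Python that only simulates how sorted keys fall into regions; it does not use any HBase API. The split points, the key formats, and the helper names (`region_for`, `sequential_key`, `hashed_prefix_key`, `load_per_region`) are all assumptions made up for illustration. The hashed prefix is one common way to spread keys, not something prescribed above.

```python
import bisect
import hashlib
from collections import Counter


def region_for(key: bytes, splits: list) -> int:
    """Index of the region whose range contains `key`.

    Region i holds keys in [splits[i-1], splits[i]) in lexicographic order.
    """
    return bisect.bisect_right(splits, key)


def sequential_key(ts: int) -> bytes:
    """Timestamp as zero-padded decimal, so byte order matches numeric order."""
    return str(ts).zfill(13).encode()


def hashed_prefix_key(ts: int) -> bytes:
    """Same key, prefixed with one hex digit of its MD5 hash (hypothetical scheme)."""
    base = sequential_key(ts)
    prefix = hashlib.md5(base).hexdigest()[:1].encode()
    return prefix + base


def load_per_region(keys, splits) -> Counter:
    """Count how many of `keys` land in each region."""
    return Counter(region_for(k, splits) for k in keys)


# Assumed state: a time-series table that has already split at older timestamps.
SEQ_SPLITS = [sequential_key(1000), sequential_key(2000), sequential_key(3000)]
# Assumed state: a table pre-split on the hex prefix into four regions.
HASH_SPLITS = [b"4", b"8", b"c"]

new_timestamps = range(3000, 4000)  # the most recent writes

seq_load = load_per_region((sequential_key(t) for t in new_timestamps), SEQ_SPLITS)
hash_load = load_per_region((hashed_prefix_key(t) for t in new_timestamps), HASH_SPLITS)

print("sequential keys:", dict(seq_load))   # every write lands in the last region
print("hashed prefix:  ", dict(hash_load))  # writes are spread over all regions
```

The trade-off is the one from the question: with the hashed prefix, rows that are adjacent in time are no longer adjacent in the key space, so a scan over a time range has to touch every region instead of one.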



Source: https://habr.com/ru/post/1754305/

