As the author, I really appreciate the question and your engagement with the material!
Regarding the initial questions - remember that this is not a formula for calculating the size of a table; it is a formula for calculating the size of a single partition. The goal is to use this formula with the worst-case number of rows to identify partitions that are too large. You will need to multiply the result of this equation by the number of partitions to get an estimate of the total data size for the table. And of course, this does not account for replication.
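For example, with made-up numbers: if the worst-case partition works out to roughly 100 MB and the table holds 10,000 partitions, the raw table size is approximately 100 MB × 10,000 = 1 TB, before replication.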
Also, thanks to those who answered the original question. Based on your feedback, I spent some time with the new storage format (3.0) to see whether it affects the formula. I agree that Aaron Morton's article is a useful resource (link above).
The basic approach of the formula remains sound for the 3.0 storage format. The formula basically adds the following (see the sketch after this list):
- partition key and static column sizes
- clustering column sizes for each row, multiplied by the number of rows
- 8 bytes of metadata for each cell
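Here is a minimal Python sketch of that calculation. All of the names and byte sizes are hypothetical inputs chosen for illustration, not values from the book, and it ignores the same factors the formula does (replication, compression, SSTable layout):

```python
# A minimal sketch of the partition-size estimate described above.
# All byte sizes below are hypothetical inputs, not values from the book.

CELL_METADATA_BYTES = 8  # pre-3.0 assumption: one 8-byte timestamp per cell

def partition_size_bytes(partition_key_bytes, static_bytes,
                         clustering_bytes_per_row, rows, cells_per_row,
                         cell_overhead=CELL_METADATA_BYTES):
    """Estimate the size of a single partition (no replication)."""
    fixed = partition_key_bytes + static_bytes       # once per partition
    per_row = clustering_bytes_per_row * rows        # once per row
    metadata = cell_overhead * cells_per_row * rows  # per-cell overhead
    return fixed + per_row + metadata

# Worst-case partition with hypothetical numbers:
worst = partition_size_bytes(partition_key_bytes=16, static_bytes=0,
                             clustering_bytes_per_row=24, rows=100_000,
                             cells_per_row=4)
print(worst)  # rough per-partition estimate in bytes
```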
Updating the formula for storage format 3.0 is mostly a matter of revising constants. For example, the original equation assumes 8 bytes of metadata per cell to store a timestamp. The new format treats the timestamp on a cell as optional, since it can be applied at the row level. As a result, the amount of metadata per cell is now variable, and can be as little as 1-2 bytes, depending on the type of data.
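As a rough illustration of why that constant matters (again with hypothetical row and cell counts), compare the total per-cell metadata under the two assumptions:

```python
# Hypothetical: total per-cell metadata for 100,000 rows x 4 cells each
rows, cells_per_row = 100_000, 4
overhead_2x = 8 * rows * cells_per_row  # pre-3.0: 8-byte timestamp per cell
overhead_30 = 2 * rows * cells_per_row  # 3.0: as little as 1-2 bytes per cell
print(overhead_2x, overhead_30)  # 3200000 vs 800000 bytes
```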
After reading this feedback and re-reading this section of the chapter, I plan to update the text to add some clarifications, and stronger warnings that this formula is useful as an approximation, not an exact value. There are factors it does not take into account at all, such as rows spread across multiple SSTables, as well as tombstones. We're planning a new printing this spring (2017) to fix a few errors, so look for these changes soon.