Prior to 3.0, sstable2json was a useful utility for understanding how data is organized in SSTables. This feature is not currently present in Cassandra 3.0, but there will eventually be an alternative. In the meantime, Chris Lohfink and I have developed an alternative to sstable2json (sstable-tools) for Cassandra 3.0, which you can use to understand how the data is organized. There is some discussion about bringing this into Cassandra proper in CASSANDRA-7464.
The key difference between the storage format of older versions of Cassandra and that of Cassandra 3.0 is that an SSTable used to be a representation of partitions and their cells (identified by their clustering and column name), whereas a Cassandra 3.0 SSTable represents partitions and their rows.
You can read about these changes in more detail in this blog post by the primary developer of these changes, who does a great job of explaining them.
The biggest benefit you will likely see is that, in the general case, your data size will shrink (in some cases by a large factor), since much of the overhead introduced by CQL has been eliminated by some key enhancements.
Here is an example showing the difference between C* 2 and 3.
Schema:
    create keyspace demo with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
    use demo;
    create table phonelists (user text, person text, phonenumbers text, primary key (user, person));
    insert into phonelists (user, person, phonenumbers) values ('scott', 'bill', '555-7382');
    insert into phonelists (user, person, phonenumbers) values ('scott', 'jane', '555-8743');
    insert into phonelists (user, person, phonenumbers) values ('scott', 'patricia', '555-4326');
    insert into phonelists (user, person, phonenumbers) values ('john', 'doug', '555-1579');
    insert into phonelists (user, person, phonenumbers) values ('john', 'patricia', '555-4326');
sstable2json output in C* 2.2:
    [
      {"key": "scott",
       "cells": [["bill:","",1451767903101827],
                 ["bill:phonenumbers","555-7382",1451767903101827],
                 ["jane:","",1451767911293116],
                 ["jane:phonenumbers","555-8743",1451767911293116],
                 ["patricia:","",1451767920541450],
                 ["patricia:phonenumbers","555-4326",1451767920541450]]},
      {"key": "john",
       "cells": [["doug:","",1451767936220932],
                 ["doug:phonenumbers","555-1579",1451767936220932],
                 ["patricia:","",1451767945748889],
                 ["patricia:phonenumbers","555-4326",1451767945748889]]}
    ]
sstable-tools toJson output in C* 3.0:
    [
      {
        "partition" : { "key" : [ "scott" ] },
        "rows" : [
          { "type" : "row", "clustering" : [ "bill" ], "liveness_info" : { "tstamp" : 1451768259775428 }, "cells" : [ { "name" : "phonenumbers", "value" : "555-7382" } ] },
          { "type" : "row", "clustering" : [ "jane" ], "liveness_info" : { "tstamp" : 1451768259793653 }, "cells" : [ { "name" : "phonenumbers", "value" : "555-8743" } ] },
          { "type" : "row", "clustering" : [ "patricia" ], "liveness_info" : { "tstamp" : 1451768259796202 }, "cells" : [ { "name" : "phonenumbers", "value" : "555-4326" } ] }
        ]
      },
      {
        "partition" : { "key" : [ "john" ] },
        "rows" : [
          { "type" : "row", "clustering" : [ "doug" ], "liveness_info" : { "tstamp" : 1451768259798802 }, "cells" : [ { "name" : "phonenumbers", "value" : "555-1579" } ] },
          { "type" : "row", "clustering" : [ "patricia" ], "liveness_info" : { "tstamp" : 1451768259908016 }, "cells" : [ { "name" : "phonenumbers", "value" : "555-4326" } ] }
        ]
      }
    ]
While the output is larger (that has more to do with the tool than with the storage format), the key differences you can see are:
- Data is now a collection of partitions and their rows (which include cells), rather than a collection of partitions and their cells.
- Timestamps are now at the row level (liveness_info), rather than at the cell level. If some cells in a row differ in their timestamps, the new storage engine uses delta encoding to save space and records the difference at the cell level. This also applies to TTLs. As you can imagine, this saves a lot of space if you have many non-key columns, since the timestamp does not need to be repeated.
- Clustering information (in this case, we are clustered on "person") is now present at the row level instead of the cell level, which saves a good deal of overhead, since the values of the clustering columns do not have to be repeated in every cell.
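To make the regrouping concrete, here is a minimal Python sketch that turns one partition in the old sstable2json shape into the row-oriented shape shown by toJson. It only illustrates the JSON layouts above, not how Cassandra or sstable-tools work internally. The function name cells_to_rows is made up for this example, and it assumes a single text clustering column with no ":" in its values, with the row marker (the cell with an empty column name) appearing first in each row.

```python
def cells_to_rows(partition):
    """Regroup a pre-3.0 style partition (flat cells) into 3.0 style rows.

    Illustrative only: assumes one clustering column and cell names of the
    form "<clustering>:<column>", as in the sstable2json output above.
    """
    rows = {}  # clustering value -> row dict (insertion order is preserved)
    for name, value, tstamp in partition["cells"]:
        clustering, _, column = name.partition(":")
        row = rows.setdefault(clustering, {
            "type": "row",
            "clustering": [clustering],
            "liveness_info": {"tstamp": tstamp},
            "cells": [],
        })
        if column == "":
            # Row marker cell: it only contributes the row-level timestamp.
            row["liveness_info"]["tstamp"] = tstamp
            continue
        cell = {"name": column, "value": value}
        # Keep a cell-level timestamp only when it differs from the row's.
        if tstamp != row["liveness_info"]["tstamp"]:
            cell["tstamp"] = tstamp
        row["cells"].append(cell)
    return {"partition": {"key": [partition["key"]]},
            "rows": list(rows.values())}


old_john = {"key": "john", "cells": [
    ["doug:", "", 1451767936220932],
    ["doug:phonenumbers", "555-1579", 1451767936220932],
    ["patricia:", "", 1451767945748889],
    ["patricia:phonenumbers", "555-4326", 1451767945748889],
]}
new_john = cells_to_rows(old_john)
```

Applied to the "john" partition, this produces two rows, each carrying its clustering value and timestamp once, with a single "phonenumbers" cell. The timestamps differ from the toJson listing only because the data was inserted separately on each version.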
It should be noted that in this particular example, the benefits of the new storage engine are not fully realized, since there is only one non-clustering column.
There are a number of other improvements not shown here (for example, the ability to store row-level range tombstones).
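To get a rough feel for why more columns widen the gap, you can count how many times the clustering value and timestamp appear per row in each layout. This is only a back-of-the-envelope sketch in Python: it assumes every cell in a row shares one timestamp and that the row has a row marker (as in the sstable2json output above). It counts repeated fields, not actual bytes on disk. The function name is hypothetical.

```python
def copies_per_row(non_clustering_columns):
    """Return (before_3_0, in_3_0) counts of clustering/timestamp copies per row."""
    # Pre-3.0 layout: one row-marker cell plus one cell per column, each
    # carrying the clustering prefix in its name and its own timestamp.
    before = 1 + non_clustering_columns
    # 3.0 layout: clustering and timestamp are stored once on the row
    # (assuming no cell needs its own timestamp).
    after = 1
    return before, after


for n in (1, 5, 20):
    before, after = copies_per_row(n)
    print(f"{n} column(s): {before} copies before 3.0, {after} in 3.0")
```

With the single "phonenumbers" column from the example, the count only drops from 2 to 1 per row. With 20 columns it would drop from 21 to 1.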