When does Cassandra run into Amdahl's law?

I am trying to understand the claim that Cassandra scales linearly with the number of nodes. A brief look around the web did not turn up much discussion of this topic. Surely there are elements of sequential processing in Cassandra that should limit the speedup gained by increasing N. Any thoughts, pointers, or links on the topic would be appreciated.

Edit to provide perspective:
I am working on a project that currently calls for a 1000+ node Cassandra infrastructure. I did not come up with this specification. I am proposing that N be reduced to the 200-500 range, with each node at least twice as fast for sequential computation. This could easily be achieved without a cost-per-node penalty through simple changes to the server configuration.

1 answer

Cassandra's scaling is better described by Gustafson's law than by Amdahl's law. Gustafson scaling asks how much more data can be processed as the number of nodes increases: if you have N times as many nodes, you can process a dataset N times as large in the same amount of time.
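In symbols, the standard statement of Gustafson's law gives the scaled speedup as

    S(N) = s + p * N

where s is the serial fraction of the work, p = 1 - s is the parallel fraction, and N is the number of nodes. As long as s stays small, throughput grows almost linearly with N, because the problem size grows along with the cluster.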

This is possible because Cassandra requires very little cluster-wide coordination, with the exception of schema and ring changes. Most operations involve only a number of nodes equal to the replication factor, which stays constant as the dataset grows; hence the near-linear scaling.
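Here is a toy sketch of that point (not Cassandra's actual code; the MD5 token function and node names are stand-ins for illustration). On a consistent-hash ring, each key is owned by the replication-factor nodes that follow its token, so a request touches the same number of nodes whether the cluster has 10 members or 1000:

    import hashlib
    from bisect import bisect_right

    def token(key: str) -> int:
        # Stand-in token function; Cassandra actually uses Murmur3.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(key: str, ring: list, owners: dict, rf: int = 3) -> list:
        # The rf nodes at or after the key's token, wrapping around the ring.
        i = bisect_right(ring, token(key))
        return [owners[ring[(i + k) % len(ring)]] for k in range(rf)]

    for n in (10, 1000):
        owners = {token(f"node{j}"): f"node{j}" for j in range(n)}
        ring = sorted(owners)
        # A request for any key involves exactly rf nodes, independent of n.
        print(n, replicas("user:42", ring, owners))

Growing the cluster only spreads the keys across more owners; it never widens the set of nodes a single operation has to coordinate.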

In contrast, Amdahl scaling asks how much faster you can process a fixed dataset as the number of nodes increases: if you have N times as many nodes, can you process the same dataset N times faster?
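The corresponding formula is

    S(N) = 1 / (s + p / N)

which is capped at 1/s no matter how large N grows; even a 5% serial fraction limits the best possible speedup to 20x. That cap is exactly where the question's concern about sequential processing bites.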

Clearly, at some point you hit a limit where adding more nodes no longer makes your queries faster, because there is a minimum amount of time needed to service a single request. Cassandra is not linear here.

In your case, it sounds like you are asking whether it is better to have 1000 slow nodes or 200 fast ones. How big is your dataset? It depends on your workload, but the usual recommendation is about 1 TB of data per node, with each node having enough RAM and CPU to match (see Cassandra node limitations). 1000 nodes sounds like too many unless you have petabytes of data.
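As a back-of-the-envelope sketch of that sizing rule (the dataset size below is an assumed example, not a figure from the question):

    # Rough cluster sizing; the dataset size is an assumed example.
    dataset_tb = 300        # unique (unreplicated) data, assumed
    replication_factor = 3  # common Cassandra default
    per_node_tb = 1.0       # the ~1 TB/node rule of thumb above

    raw_tb = dataset_tb * replication_factor
    nodes = raw_tb / per_node_tb
    print(f"{raw_tb:.0f} TB raw storage -> ~{nodes:.0f} nodes")
    # 300 TB * RF 3 = 900 TB -> ~900 nodes: you only approach a
    # 1000-node cluster with hundreds of terabytes of unique data.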

