Counting distinct values in Cassandra

I need to count a large number of "things" in Cassandra, incrementing the counts by roughly 100-200 increments every few seconds.

However, I need to count only distinct "things."

To avoid counting something twice, I set a marker key in a column family, which the program reads before incrementing the counter, something like:

result = get cf[key][x];
if (result == NULL) {
    set cf[key][x] = 1;
    incr counter_cf[key][x];
}

However, this read operation slows the cluster down significantly. I tried to reduce the number of reads by fetching multiple columns at once, something like:

result = multiget cf[key1..keyN];
if (result[key1] == NULL) { set cf[key1][x] = 1; incr counter_cf[key1][x]; }
if (result[key2] == NULL) { set cf[key2][x] = 1; incr counter_cf[key2][x]; }
// etc...

That reduced the number of reads from 200+ to about 5-6, but it still slows down the cluster.

I don't need an exact count, but I can't use bitmasks or Bloom filters, because there will be 1M+ counters and some may grow past 4,000,000,000.

I am aware of HyperLogLog counting, but I see no easy way to use it with that many counters (1M+).
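For scale, here is the rough memory arithmetic behind ruling out per-counter bitmaps while keeping HyperLogLog on the table (the per-register sizes are assumptions for a typical precision of p=14, not figures from the question):

```python
# Rough memory budget for 1M+ distinct counters with up to ~4e9 items each.
counters = 1_000_000
cardinality = 4_000_000_000

# Exact bitmap: one bit per possible item -> 0.5 GB per counter.
bitmap_bytes_per_counter = cardinality // 8

# HyperLogLog at precision p=14: 2^14 one-byte registers -> 16 KB per counter.
hll_bytes_per_counter = 1 << 14

bitmap_total_gb = counters * bitmap_bytes_per_counter / 1e9
hll_total_gb = counters * hll_bytes_per_counter / 1e9

print(bitmap_total_gb)  # 500000.0 GB -- hopeless
print(hll_total_gb)     # 16.384 GB -- feasible spread across nodes
```

So the objection is really to exact structures; approximate sketches stay within a few bytes-per-counter-times-millions budget.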

I'm currently thinking of using Tokyo Cabinet as an external key/value store, but that solution, even if it works, will not be as scalable as Cassandra.

1 answer

Using Cassandra for distinct counting is not ideal when the number of distinct values is large. Any time you need to read before you write, you should ask yourself whether Cassandra is the right fit.

If the number of distinct elements is small, you can simply store them as column keys and do a count. Counting is not free: Cassandra still has to assemble the row to count its columns, but if the number of distinct values is on the order of thousands it will probably be fine. I assume you have already considered this option and found it infeasible; I just thought I'd mention it.

The way people usually do this is to keep HLLs or Bloom filters in memory and periodically flush them to Cassandra. That is, not doing the actual operations in Cassandra, just using it for durability. It's a more complex system, but there is no easy way to count distinct values, especially when you have a huge number of counters.
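To make the "HLL in memory, flush to Cassandra" idea concrete, here is a toy HyperLogLog sketch in pure Python (standard library only). It is an illustration of what each in-memory counter would look like, not production code; the `registers` byte array is what you would periodically persist as a small blob per counter:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch: ~2^p bytes of state per counter."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                    # number of registers
        self.registers = bytearray(self.m)  # this is what you'd flush to storage

    def add(self, value):
        # 64-bit hash of the value.
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)           # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Position of the first set bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        est = alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

At p=14 each counter costs 16 KB of memory regardless of cardinality, and the estimate is typically within ~1% of the true distinct count.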

Even if you switched to something else, for example to a store where you can perform bitwise operations on values, you would still need to protect yourself against race conditions. I suggest you just bite the bullet and do all your counting in memory. Shard the increment operations over your processing nodes by key and keep all the counter state (both the increments and the distinct sets) in memory on those nodes. Periodically flush the state to Cassandra, performing the increment operations as you do. When a node receives an increment for a key it does not have in memory, it loads that state from Cassandra (or creates fresh state if there is nothing in the database). If a node crashes, the operations that have not been flushed are lost and must be re-delivered (you need a good message queue in front of the nodes to take care of this). Since you shard the increment operations by key, you can be sure that a counter's state is only ever touched by a single node.
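The node-local state plus periodic flush described above can be sketched as follows. This is a minimal illustration under stated assumptions: `store` is a plain dict standing in for Cassandra's counter column family, exact sets stand in for the HLLs, and the class and method names (`CounterNode`, `increment`, `flush`) are hypothetical, not from any library:

```python
from collections import defaultdict

class CounterNode:
    """One processing node owning a shard of keys.

    `store` is a stand-in for Cassandra's counter CF (an assumption);
    in a real system flush() would issue `incr` operations instead.
    """

    def __init__(self, store):
        self.store = store
        self.seen = defaultdict(set)     # key -> items already counted (HLL in practice)
        self.pending = defaultdict(int)  # key -> increments not yet flushed

    def increment(self, key, item):
        # No read from the database: dedup happens entirely in memory.
        if item not in self.seen[key]:
            self.seen[key].add(item)
            self.pending[key] += 1

    def flush(self):
        # Apply only the deltas accumulated since the last flush.
        for key, delta in self.pending.items():
            self.store[key] = self.store.get(key, 0) + delta
        self.pending.clear()
```

Because each key is routed to exactly one node, `increment` needs no coordination, and the expensive round trip to the database happens only on `flush`.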


Source: https://habr.com/ru/post/1468908/
