Get the first row for each section key in Cassandra

Question

Get the first row for each section key in Cassandra

I view Cassandra as an intermediate storage during my ETL job to perform data deduplication.

Suppose I have a stream of events, each of which has a business object identifier, timestamp, and some value. I need to get only the last value in terms of the timestamp for each business key, but events can be unordered.

My idea was to create a staging table with a business identifier as the partition key and a timestamp as the clustering key:

CREATE TABLE sample_keyspace.table1_copy1 ( id uuid, time timestamp, value text, PRIMARY KEY (id, time) ) WITH CLUSTERING ORDER BY ( time DESC )

Now, if I insert some data into this table, I can get the last value for a specific section key:

 select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;

But this will require a request for each business key that interests me.

Is there any efficient way to do this in CQL?

I know that we have the ability to list all the available partition keys ( select distinct id from table1 ). Therefore, if I look at the Cassandra storage model, getting the first row for each section key should not be too complicated.

Is it supported?

+5

cassandra

Yura taras Nov 02 '16 at 12:12

source share

2 answers

In a word: no.

A separation key is why Cassandra can work with almost any amount of data: it decides where to put / search data using the hash of the partition key. This is why CQL SELECTs always need to make an equality filter on the entire key of a section. To find the first time for each id , Cassandra would need to ask all the nodes for any data section, and then perform a complex operation for each of them. Relational databases allow this; Cassandra does not. All that it allows is a full table scan ( SELECT * from table1 ) or section scan ( SELECT DISTINCT id FROM table1 ), but they cannot be associated with any complex operation.

*) Here I omit ALLOW FILTERING , as this does not help in this context.

0

Jan Doerrenhaus Nov 02 '16 at 12:36

source share

Chris lohfink · Accepted Answer · 2016-11-02T13:37:35+0000

If you are using a version after 3.6, there is an option in your request called PER PARTITION LIMIT ( CASSANDRA-7017 ), which you can set to 1 . This will not be automatically completed in cqlsh until 3.10 using CASSANDRA-12803 .

 SELECT * FROM table1 PER PARTITION LIMIT 1;

Get the first row for each section key in Cassandra

More articles: