I am considering Cassandra as intermediate storage in my ETL job, to perform data deduplication.
Suppose I have a stream of events, each of which has a business object identifier, a timestamp, and some value. For each business key I need only the latest value by timestamp, but events can arrive out of order.
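To make the requirement concrete, here is a minimal sketch in plain Python of the result I am after, with no Cassandra involved. The `Event` type and the `latest_per_key` function are hypothetical names used only for illustration; the fields mirror the id, timestamp, and value described above.

```python
from datetime import datetime
from typing import Dict, Iterable, NamedTuple
from uuid import UUID


class Event(NamedTuple):
    """Hypothetical event shape: business key, event time, payload."""
    id: UUID
    time: datetime
    value: str


def latest_per_key(events: Iterable[Event]) -> Dict[UUID, Event]:
    """Return the event with the greatest timestamp for each business key.

    Arrival order does not matter. On equal timestamps the event seen
    first is kept (an arbitrary choice made for this sketch).
    """
    latest: Dict[UUID, Event] = {}
    for event in events:
        current = latest.get(event.id)
        if current is None or event.time > current.time:
            latest[event.id] = event
    return latest
```

This holds one entry per business key in memory, which is the state I would like Cassandra to keep for me instead.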
My idea was to create a staging table with a business identifier as the partition key and a timestamp as the clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
    id uuid,
    time timestamp,
    value text,
    PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);
Now, if I insert some data into this table, I can get the latest value for a specific partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But this requires a separate query for every business key I am interested in.
Is there any efficient way to do this in CQL?
I know that we can list all the available partition keys ( select distinct id from table1 ). So, given the Cassandra storage model, getting the first row of each partition should not be too complicated.
Is it supported?