Cassandra CQL3: select distinct rows from a table with a composite primary key

I am using Cassandra 1.2.7 with the official Java driver that uses CQL3.

Suppose the table is created as follows:

CREATE TABLE foo ( row int, column int, txt text, PRIMARY KEY (row, column) ); 

Then I would like to run the equivalent of SELECT DISTINCT row FROM foo.

As I understand it, it should be possible to execute this query efficiently inside Cassandra's data model (given the way composite primary keys are stored), because it would only query the underlying "raw" table.

I searched the CQL documentation, but I did not find any option for this.

My backup plan is to create a separate table - something like

 CREATE TABLE foo_rows ( row int, PRIMARY KEY (row) ); 

But this brings the hassle of keeping two tables in sync: every write to foo also needs a write to foo_rows (which is also a performance cost).

So, is there a way to query the distinct row (partition) keys?

+4
3 answers

According to the documentation, as of CQL version 3.1.1 Cassandra understands the DISTINCT modifier. So now you can write

 SELECT DISTINCT row FROM foo 
+4

I will give you the bad way to do this first. If you insert these rows:

 insert into foo (row,column,txt) values (1,1,'First Insert');
 insert into foo (row,column,txt) values (1,2,'Second Insert');
 insert into foo (row,column,txt) values (2,1,'First Insert');
 insert into foo (row,column,txt) values (2,2,'Second Insert');

Running

 'select row from foo;' 

will provide you with the following:

  row
 -----
    1
    1
    2
    2

Not distinct, since it shows every row/column combination. To get a single value per row, you can add a column value:

 select row from foo where column = 1; 

But then you get this warning:

 Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING 

OK. Then with this:

 select row from foo where column = 1 ALLOW FILTERING;

  row
 -----
    1
    2

Great. That is what I wanted. Let's not ignore that warning, though. If you only have a small number of rows, say 10,000, then this will work without a huge hit on performance. But what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance will take a serious hit. First, the query has to check every row in the table (read: a full table scan) and then filter out the unique values for the result set. In some cases, this query will simply time out. Given that, it is probably not what you were looking for.
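To make the cost concrete, here is a minimal in-memory sketch in Python. It is not driver code: the `scan_with_filter` helper and the extra sample partition are made up for illustration. It models the filtered query as a pass over every stored row, and it also shows that a partition with no column = 1 entry would be missing from the result.

```python
# Toy model of table foo: each tuple is one CQL row (row, column, txt).
FOO = [
    (1, 1, "First Insert"),
    (1, 2, "Second Insert"),
    (2, 1, "First Insert"),
    (2, 2, "Second Insert"),
    (3, 2, "Only Insert"),  # hypothetical partition with no column = 1
]


def scan_with_filter(rows, column_value):
    """Mimic `SELECT row FROM foo WHERE column = ? ALLOW FILTERING`.

    Returns the matching row keys and how many rows had to be examined.
    """
    examined = 0
    matches = []
    for row, column, _txt in rows:
        examined += 1  # every row is touched, matching or not
        if column == column_value:
            matches.append(row)
    return matches, examined


keys, examined = scan_with_filter(FOO, 1)
print(keys)      # row keys that have a column = 1 entry
print(examined)  # number of rows read to produce them
```

In this model the work grows with the total number of rows, not with the number of distinct keys returned.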

You mentioned that you were worried about the performance hit of inserting into multiple tables. Inserting into multiple tables is a perfectly valid data modeling technique. Cassandra can handle an enormous number of writes. As for the pain of keeping the tables in sync, I do not know your specific application, but I can give general advice.
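As a rough illustration of the multi-table write, here is a small Python sketch that uses in-memory stand-ins for foo and foo_rows. The `write_foo` helper is hypothetical; a real application would issue two INSERT statements through its driver instead.

```python
# Toy stand-ins for the two tables.
foo = {}          # (row, column) -> txt
foo_rows = set()  # distinct row keys


def write_foo(row, column, txt):
    """Write to foo and record the row key in foo_rows."""
    foo[(row, column)] = txt
    # Like a CQL INSERT, this behaves as an upsert: adding a key that
    # is already present changes nothing, so no read-before-write is
    # needed to keep foo_rows free of duplicates.
    foo_rows.add(row)


write_foo(1, 1, "First Insert")
write_foo(1, 2, "Second Insert")
write_foo(2, 1, "First Insert")

print(sorted(foo_rows))
```

The sketch ignores deletes; removing the last column of a row would need its own handling in foo_rows.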

If you need a distinct scan, you need to think about the partition columns. This is what we call building an index or query table. The important thing to consider in any Cassandra data model is the application's queries. If I were using the IP address as the row key, I might create something like this to scan all the IP addresses I have.

 CREATE TABLE ip_addresses ( first_quad int, last_quads ascii, PRIMARY KEY (first_quad, last_quads) ); 

Now, to insert some rows into the 192.x.x.x address space:

 insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
 insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
 insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
 insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');

To get the distinct rows in the 192 space, I do this:

 SELECT * FROM ip_addresses WHERE first_quad = 192;

  first_quad | last_quads
 ------------+------------
         192 |  000000001
         192 |  000000002
         192 |  000001001
         192 |  000001255

To get every single address, you would just need to iterate over all possible row keys from 0-255. In my example, I would expect the application to ask for specific ranges to keep things performant. Your application may have different needs, but hopefully you can see the pattern here.
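The iteration over row keys 0-255 can be sketched as follows. This is a Python model with an in-memory dictionary standing in for ip_addresses; `select_partition` and `all_addresses` are hypothetical helpers, and in a real application each call would be one single-partition query sent through the driver.

```python
# Toy stand-in for ip_addresses: first_quad -> sorted last_quads.
IP_ADDRESSES = {
    192: ["000000001", "000000002", "000001001", "000001255"],
}


def select_partition(first_quad):
    """Mimic `SELECT * FROM ip_addresses WHERE first_quad = ?`."""
    return [(first_quad, last) for last in IP_ADDRESSES.get(first_quad, [])]


def all_addresses():
    """Visit every possible partition key, one query per key."""
    for first_quad in range(256):
        for item in select_partition(first_quad):
            yield item


addresses = list(all_addresses())
print(len(addresses))
```

Each lookup touches exactly one partition, so the cost is bounded by the 256 possible keys plus the rows actually stored, and an application that only needs some ranges can narrow the loop accordingly.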

+7

@edofic

Partition row keys are used as a unique index to distinguish between rows in the storage engine, so by their nature row keys are always distinct. You do not need to put DISTINCT in the SELECT clause.

Example

 INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
 INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
 INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');

Then

 SELECT row FROM foo 

will return 2 values: 1 and 2

The following describes how things are stored in Cassandra

+----------+-------------------+------------------+
| row key  | column1 / value   | column2 / value  |
+----------+-------------------+------------------+
| 1        | 1 / '1'           | 2 / '2'          |
| 2        | 1 / '1'           |                  |
+----------+-------------------+------------------+
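The layout above can be modeled as a nested mapping. This is only a toy Python sketch of the idea, not Cassandra's actual on-disk format, and the variable names are made up: the outer keys play the role of row keys and are unique by construction, while each inner mapping holds that row's columns.

```python
from collections import defaultdict

# row (partition) key -> {column: txt}
storage = defaultdict(dict)

# The three INSERT statements from the example above.
for row, column, txt in [(1, 1, "1-1"), (2, 1, "2-1"), (1, 2, "1-2")]:
    storage[row][column] = txt

partition_keys = sorted(storage)
print(partition_keys)  # the outer keys of the mapping
print(storage[1])      # all columns stored under row key 1
```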

0

Source: https://habr.com/ru/post/1500027/

