Cassandra: query with a WHERE clause containing greater-than or less-than (< and >)

I am using Cassandra 1.1.2. I am trying to convert an RDBMS application to Cassandra. In my RDBMS application, I have the following table called table1:

 | Col1 | Col2 | Col3 | Col4 | 
  • Col1: String (primary key)
  • Col2: String (primary key)
  • Col3: Bigint (index)
  • Col4: Bigint

This table has over 200 million records. The most commonly used query looks something like this:

 Select * from table where col3 < 100 and col3 > 50; 

In Cassandra, I used the following statement to create a table:

 create table table1 (primary_key varchar, col1 varchar, col2 varchar, col3 bigint, col4 bigint, primary key (primary_key));
 create index on table1(col3);

I changed the primary key to an extra column (I am calculating the key inside my application). After importing several records, I tried to execute the following cql:

 select * from table1 where col3 < 100 and col3 > 50; 

This resulted in:

 Bad Request: No indexed columns present in by-columns clause with Equal operator 

The query select col1, col2, col3, col4 from table1 where col3 = 67 works.

Google said that such queries are not possible. Is that right? Any advice on how to write such a query?

2 answers

Cassandra indexes do not actually support sequential (range) access; see http://www.datastax.com/docs/1.1/ddl/indexes for a quick explanation of where they are useful. But do not despair; the more classic way of using Cassandra (and many other NoSQL systems) is to denormalize, denormalize, denormalize.

In your case, it might be a good idea to use the classic bucket-range pattern, which lets you use the recommended RandomPartitioner and keep your rows well distributed around your cluster, while still allowing sequential access to your values. The idea is to create a second, dynamic column family that maps col3 values (bucketed and ordered) back to the related primary_key values. For example, if your col3 values range from 0 to 10^9 and are distributed fairly evenly, you might want to put them in 1000 buckets with a range of 10^6 each (the best level of granularity will depend on the kind of queries you need, the kind of data you have, query round-trip time, etc.). Example schema for cql3:

 CREATE TABLE indexotron ( rangestart int, col3val int, table1key varchar, PRIMARY KEY (rangestart, col3val, table1key) ); 

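When inserting into table1, you should also insert a corresponding row into indexotron, with rangestart = int(col3val / 1000000). Then, when you need to enumerate all the rows in table1 with col3 > X, you need to query up to 1000 indexotron buckets, but the col3val values within each bucket will be sorted. An example query to find all table1.primary_key values for which table1.col3 < 4021 (for illustration, this example uses a bucket width of 1000 and stores the first value of each bucket in rangestart, that is, rangestart = int(col3val / 1000) * 1000):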

 SELECT * FROM indexotron WHERE rangestart = 0 ORDER BY col3val;
 SELECT * FROM indexotron WHERE rangestart = 1000 ORDER BY col3val;
 SELECT * FROM indexotron WHERE rangestart = 2000 ORDER BY col3val;
 SELECT * FROM indexotron WHERE rangestart = 3000 ORDER BY col3val;
 SELECT * FROM indexotron WHERE rangestart = 4000 AND col3val < 4021 ORDER BY col3val;
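
The per-bucket queries can be generated on the client. Below is a minimal Python sketch, not part of the original answer: it assumes a bucket width of 1000 to match the example queries, with rangestart holding the first value of each bucket. The helper names bucket_start and range_queries are hypothetical, and the sketch only builds the CQL strings; running them is left to whatever driver you use.

```python
def bucket_start(col3val, width=1000):
    """First value of the bucket that contains col3val (hypothetical helper)."""
    return (col3val // width) * width


def range_queries(low, high, width=1000):
    """Build one CQL string per bucket for low < col3val < high (both exclusive)."""
    if high - low < 2:
        return []  # no integer lies strictly between low and high
    first = bucket_start(low + 1, width)
    last = bucket_start(high - 1, width)
    queries = []
    for start in range(first, last + width, width):
        clauses = [f"rangestart = {start}"]
        # Only the edge buckets need an extra filter on col3val.
        if start == first and low >= start:
            clauses.append(f"col3val > {low}")
        if start == last and high < start + width:
            clauses.append(f"col3val < {high}")
        where = " AND ".join(clauses)
        queries.append(f"SELECT * FROM indexotron WHERE {where} ORDER BY col3val;")
    return queries


if __name__ == "__main__":
    # The range from the question: col3 > 50 and col3 < 100 touches one bucket.
    for query in range_queries(50, 100):
        print(query)
```

The same value passed to bucket_start when writing a row to table1 gives the rangestart to use for the matching indexotron row, so reads and writes agree on the bucket boundaries.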

If col3 is always known to hold small values / ranges, you may be able to get away with a simpler table that also maps back to the original table, for example:

  create table table2 (col3val int, table1key varchar, primary key (col3val, table1key)); 

and use

  insert into table2 (col3val, table1key) values (55, 'foreign_key');
  insert into table2 (col3val, table1key) values (55, 'foreign_key3');
  select * from table2 where col3val = 51;
  select * from table2 where col3val = 52;
  ...

This may be OK if your ranges are not too large. (You can get the same effect with your secondary index, but secondary indexes are not highly recommended?) In theory, you could also parallelize the lookups on the client side.
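
A rough Python sketch of that client-side parallelization, under stated assumptions: run_query is a hypothetical callable (CQL string in, list of rows out) standing in for your driver call, and value_queries and fetch_range are illustrative names, not part of any Cassandra library.

```python
from concurrent.futures import ThreadPoolExecutor


def value_queries(low, high):
    """One equality query per integer strictly between low and high."""
    return [
        f"SELECT * FROM table2 WHERE col3val = {value};"
        for value in range(low + 1, high)
    ]


def fetch_range(run_query, low, high, max_workers=8):
    """Run the per-value queries concurrently and return rows in col3val order.

    run_query is an assumed stand-in for the real driver call.
    """
    queries = value_queries(low, high)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input queries.
        results = list(pool.map(run_query, queries))
    return [row for rows in results for row in rows]
```

For the range in the question, fetch_range(run_query, 50, 100) issues 49 single-value lookups, which is why this approach only makes sense for narrow ranges.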

It seems that the "Cassandra way" is to have some key, like "userid", that you use as the first part of all your queries, so you may need to rethink your data model. Then you can have queries like select * from table1 where userid='X' and col3val > 3, and they will work (provided that col3val is a clustering key).


Source: https://habr.com/ru/post/919781/

