Cassandra: IN clause on an indexed column

Suppose a simple table with a single inserted row (whether or not the row is there does not matter).

    CREATE TABLE test (
        x int,
        y int,
        z int,
        PRIMARY KEY (x, y, z)
    );
    CREATE INDEX z_index ON test (z);
    INSERT INTO test (x, y, z) VALUES (1, 2, 3);

I am trying to understand why I cannot query with an IN clause on the indexed column z:

    cqlsh:test> select * from test where z in (3);
    Bad Request: PRIMARY KEY part z cannot be restricted (preceding part y is either not restricted or by a non-EQ relation)

The same query works with a simple equality predicate:

    cqlsh:test> select * from test where z = 3;

     x | y | z
    ---+---+---
     1 | 2 | 3

    (1 rows)

I thought the index on z would keep a mapping from specific z values to rows, but this assumption seems wrong.

Why does this not work the way I expected? I assume the index works differently.

EDIT: I am using [cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift Protocol 19.39.0]

1 answer

Although the DataStax documentation is generally really good, I could not find anything that discusses the details behind this. However, I did come across an article entitled "Breakdown of the CQL WHERE clause." Section 2 is called "The last column in the partition key supports the IN operator."

Paraphrasing, it basically says this:

For single-column partition keys, the IN operator is allowed without restriction. For composite partition keys, I have to use the = operator on the first N-1 columns of the partition key in order to use the IN operator on the last column.
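The rule paraphrased above can be sketched in CQL as follows. This is only an illustration: `test_composite` and its columns a, b, c are hypothetical names that do not appear in the question, and the comments describe the behavior on the Cassandra 2.0.x release mentioned in the question (later releases relaxed some of these restrictions).

```cql
-- Single-column partition key (the question's table, partition key x):
-- IN on x is allowed.
SELECT * FROM test WHERE x IN (1, 2);

-- Hypothetical table with a composite partition key (a, b).
CREATE TABLE test_composite (
    a int,
    b int,
    c int,
    PRIMARY KEY ((a, b), c)
);

-- Allowed: = on the first N-1 partition key columns, IN on the last one.
SELECT * FROM test_composite WHERE a = 1 AND b IN (2, 3);

-- Rejected on 2.0.x: IN on a partition key column that is not the last one.
SELECT * FROM test_composite WHERE a IN (1, 2) AND b = 3;
```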

In your case, x is your partition key, which means that x is the only column on which CQL will support the IN operator. If you really need to support IN queries on column z, you will have to denormalize your data and create a (redundant) table designed to support that query. For instance:

 CREATE TABLE test ( x int, y int, z int, PRIMARY KEY (z) ); 

... will support the query, but the values of z may not be unique. In that case, you could define x and/or y as LIST<int>, and that would do it.
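As an alternative to the list columns suggested above (this variant is my own assumption, not part of the original answer), the redundant table could keep z as the partition key and make x and y clustering columns, so that several rows can share the same z. The table name `test_by_z` is hypothetical.

```cql
-- Hypothetical query table: z is the partition key, x and y are
-- clustering columns, so rows with the same z remain distinct.
CREATE TABLE test_by_z (
    z int,
    x int,
    y int,
    PRIMARY KEY (z, x, y)
);

-- The application writes every row to both tables.
INSERT INTO test_by_z (z, x, y) VALUES (3, 1, 2);

-- IN is now applied to the partition key, which is allowed.
SELECT * FROM test_by_z WHERE z IN (3, 4);
```

The trade-off is the usual one for denormalization: each write has to go to both tables, and keeping them consistent is the application's job.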

In addition, DataStax has documentation on when not to use an index, and it states that the same conditions apply to using the IN operator:

Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single-key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried is most likely even higher, up to 20 nodes depending on where the keys fall in the token range.


Source: https://habr.com/ru/post/969871/

