Cassandra read takes longer than expected

I am using Cassandra 1.2 with CQL3. I have three column families in my keyspace. When I query one of the column families (phones), it takes a long time to return. Here is my query:

**select * from phones where phone_no in ('9038487582');** 

Here is the trace result for the query.

     activity                                        | timestamp    | source      | source_elapsed
    -------------------------------------------------+--------------+-------------+----------------
                                  execute_cql3_query | 16:35:47,675 | 10.1.26.155 |              0
                                   Parsing statement | 16:35:47,675 | 10.1.26.155 |             58
                                  Peparing statement | 16:35:47,675 | 10.1.26.155 |            335
          Executing single-partition query on phones | 16:35:47,676 | 10.1.26.155 |           1069
                        Acquiring sstable references | 16:35:47,676 | 10.1.26.155 |           1097
                           Merging memtable contents | 16:35:47,676 | 10.1.26.155 |           1143
     Partition index lookup complete for sstable 822 | 16:35:47,676 | 10.1.26.155 |           1376
     Partition index lookup complete for sstable 533 | 16:35:47,686 | 10.1.26.155 |          10659
          Merging data from memtables and 2 sstables | 16:35:47,704 | 10.1.26.155 |          29192
                  Read 1 live cells and 0 tombstoned | 16:35:47,704 | 10.1.26.155 |          29332
                                    Request complete | 16:35:47,704 | 10.1.26.155 |          29601

My keyspace has a replication factor of 1, and the cluster has 3 nodes. The phones column family has about 40 million rows, with only two columns per row. The query returns in 29 ms, 15 ms, 8 ms, 5 ms, or 3 ms, so the latency is inconsistent. Can you suggest what mistake I might be making? My use case will also have an extremely low cache hit rate, so key caching is not a solution for me. Here is the definition of my column family:

    CREATE TABLE phones (
      phone_no text PRIMARY KEY,
      ypids set<int>
    ) WITH
      bloom_filter_fp_chance=0.100000 AND
      caching='KEYS_ONLY' AND
      comment='' AND
      dclocal_read_repair_chance=0.000000 AND
      gc_grace_seconds=864000 AND
      read_repair_chance=0.100000 AND
      replicate_on_write='true' AND
      populate_io_cache_on_flush='false' AND
      compaction={'class': 'LeveledCompactionStrategy'} AND
      compression={'sstable_compression': 'SnappyCompressor'};
+4
3 answers

The index lookups look fast enough (the index file is probably cached by the OS because it is accessed frequently); where you lose all the time is between them and the "Merging data" step. What happens in between is the actual seek to the data location in the sstable. (I added a new trace entry in 1.2.6 to make this clear.)

This explains why the query is sometimes fast and sometimes not: if the seek is uncontended or, even better, served from cache, the query will be fast. Otherwise, it will be slower.
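To see where the time goes, the `source_elapsed` column (cumulative microseconds) can be differenced row by row. The following is a minimal sketch that does this for the trace in the question; the `TRACE` list and the `step_deltas` helper are illustrative names, not part of Cassandra or cqlsh.

```python
# (activity, source_elapsed in microseconds), copied from the trace in the question.
TRACE = [
    ("execute_cql3_query", 0),
    ("Parsing statement", 58),
    ("Peparing statement", 335),
    ("Executing single-partition query on phones", 1069),
    ("Acquiring sstable references", 1097),
    ("Merging memtable contents", 1143),
    ("Partition index lookup complete for sstable 822", 1376),
    ("Partition index lookup complete for sstable 533", 10659),
    ("Merging data from memtables and 2 sstables", 29192),
    ("Read 1 live cells and 0 tombstoned", 29332),
    ("Request complete", 29601),
]


def step_deltas(trace):
    """Return (activity, microseconds since the previous trace event) pairs."""
    deltas = []
    previous = 0
    for activity, elapsed in trace:
        deltas.append((activity, elapsed - previous))
        previous = elapsed
    return deltas


if __name__ == "__main__":
    # Print the three largest gaps, biggest first.
    largest = sorted(step_deltas(TRACE), key=lambda d: d[1], reverse=True)[:3]
    for activity, delta in largest:
        print(f"{delta:>6} us  {activity}")
```

The two largest gaps are 18533 us (ending at the "Merging data" event) and 9283 us (ending at the index lookup for sstable 533), together about 27.8 ms of the 29.6 ms total. Each gap is attributed to the event that closes it, so it shows when the time passed, not which internal step consumed it.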

I see several options that may help:

You will notice that only the first option does not involve more or different hardware, so I would evaluate it first. But the potential gain is limited: at best, you reduce the number of sstables to 1.

+4

From the trace you posted above, most of the query time is spent in index lookups and in merging SSTables. This is quite common; I do not believe you did anything wrong.

Index lookups can be eliminated by denormalizing the data. Typically, Cassandra column families are designed around queries rather than around tables, as is typical of relational systems. This shifts the burden to write time, where Cassandra is strongest. It does, of course, put data consistency at risk, both because the data is duplicated and because of Cassandra's natural tendency to present clients with different views of the data in order to optimize cluster availability.

Merging sstables is Cassandra's Achilles heel, so to speak. Cassandra optimizes for write speed and reliability at the cost of read latency and latency variance. It is perfectly normal for Cassandra to have "slower" reads that vary in duration. There are two approaches to reducing this problem. The first is to avoid updating or deleting data in the column family, since these operations lead to subsequent merges. Even then, that only delays the sstable merges, because inserts will still cause memtables to be flushed. So the other solution to consider, if the variance or duration is still too large, is to front Cassandra with a cache such as Memcached. This is the approach Netflix documented in its benchmarking of Cassandra.

For completeness, I should add that the Cassandra column family settings can be tuned, tweaked, and tuned again to reduce this problem. But that only goes so far, because the problem is inherent in Cassandra's design. The options to look at are the cache sizes and the memtable size and flush threshold, which is the point at which a new SSTable is created. Compression can also help, since it fits more data into memory. I usually expect non-indexed reads to take 2-10 ms (5 ms on average), depending on the hardware and cluster activity, in Amazon EC2 (the environment I work in these days).
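The caching approach can be sketched as a read-through cache placed in front of the database call. This is only an illustration under stated assumptions: `ReadThroughCache` is a hypothetical in-process LRU standing in for an external cache such as Memcached, and `loader` is any function you supply that performs the real Cassandra read.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Small in-process LRU cache (a stand-in for an external cache service)."""

    def __init__(self, loader, capacity=1024):
        self._loader = loader          # hypothetical: function that reads from Cassandra
        self._capacity = capacity
        self._entries = OrderedDict()  # key -> value, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            # Cache hit: mark the key as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cache miss: fall through to the database and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used key
        return value


if __name__ == "__main__":
    def fake_loader(phone_no):
        # Placeholder for the real query; returns a made-up set of ids.
        return {1, 2, 3}

    cache = ReadThroughCache(fake_loader, capacity=2)
    cache.get("9038487582")
    cache.get("9038487582")
    print(cache.hits, cache.misses)
```

The `hits` and `misses` counters make it easy to check whether the cache pays off. With the extremely low hit rate described in the question, most calls would still fall through to Cassandra, so the gain would be small.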

+3

Cassandra queries are usually very fast and usually take a constant amount of time. If you query a single column in your column family, how long does it take compared to a query for all columns? Some overhead is expected with more columns, but not much, for example about 1 or 2 ms.

If there is a big difference (more than double) between the all-columns query and the single-column query, even though the column family does not hold much data, your query may not be constructed correctly. If you expect predictable columns in a row, you can try naming them explicitly instead of querying with a wildcard. This can have a dramatic effect on query speed.
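One way to run this comparison is a small timing harness. The sketch below assumes a hypothetical `execute` callable that sends a CQL string to the cluster and waits for the result (for example, a thin wrapper around whatever client driver you use); the helper names and the `QUERIES` dictionary are illustrative.

```python
import statistics
import time


def median_latency_ms(execute, query, runs=20):
    """Run `query` `runs` times through `execute` and return the median latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        execute(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(execute, queries, runs=20):
    """Return {label: median latency in ms} for each labelled query."""
    return {label: median_latency_ms(execute, q, runs) for label, q in queries.items()}


# The wildcard query from the question next to one that names the column explicitly.
QUERIES = {
    "all columns": "SELECT * FROM phones WHERE phone_no = '9038487582'",
    "one column": "SELECT ypids FROM phones WHERE phone_no = '9038487582'",
}

if __name__ == "__main__":
    # Stand-in executor so the sketch runs without a cluster; replace with a real one.
    results = compare(lambda query: None, QUERIES, runs=5)
    for label, ms in results.items():
        print(f"{label}: {ms:.3f} ms")
```

The median is used because single samples vary widely, as the latencies in the question show. Since phones has only two columns, the two queries would be expected to differ very little.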

+1

Source: https://habr.com/ru/post/1481227/