High Disk I/O on Cassandra Nodes

Setup:
We have a 3-node Cassandra cluster with about 850G of data on each node. The Cassandra data directory sits on LVM (currently composed of 3 disks: 800G + 100G + 100G), and there is a separate volume (not LVM) for cassandra_logs.

Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1

Question:
After adding a 3rd (100G) volume to the LVM on each node, disk I/O went very high on all the nodes and they go down quite often. The servers also become inaccessible and we need to restart them; they don't stay stable, and we have to reboot every 10 to 15 minutes.

Additional Information:
We have the DSE-recommended server parameters (vm.max_map_count, file descriptor limits) configured on all nodes (see the sketch after this list)
RAM on each node: 24G
CPU on each node: 6 cores / 2600 MHz
Disk on each node: 1000G (Data dir) / 8G (logs)
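
For reference, the settings referred to above look roughly like this (illustrative values only; confirm the exact numbers against the DSE 4.6 recommended production settings documentation):

    # /etc/sysctl.conf
    vm.max_map_count = 131072

    # /etc/security/limits.d/cassandra.conf
    # (file descriptor / process limits for the cassandra user)
    cassandra - memlock unlimited
    cassandra - nofile  100000
    cassandra - nproc   32768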

1 answer

As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you some background. The nodetool tpstats output from your three nodes had these lines:

    Pool Name      Active   Pending   Completed   Blocked   All time blocked
    FlushWriter         0         0          22         0                  8
    FlushWriter         0         0          80         0                  6
    FlushWriter         0         0          38         0                  9
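
A quick way to pull just these lines from every node (a minimal sketch; node1, node2, node3 are placeholder hostnames for your actual nodes):

    # Grab the FlushWriter line from each node's thread-pool stats
    for h in node1 node2 node3; do
        echo "== $h =="
        nodetool -h "$h" tpstats | grep -E 'Pool Name|FlushWriter'
    done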

The column I'm worried about is All Time Blocked. As a ratio to Completed, you have a lot of blocking. The FlushWriter is responsible for flushing memtables to disk so the JVM doesn't run out of memory or create massive GC problems. Memtables are the in-memory representation of your tables. As your nodes take more writes, the memtables fill up and need to be flushed. That operation is a long sequential write to disk. Bookmark that; I'll come back to it.
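
The relevant knobs live in cassandra.yaml. A quick way to check what your nodes are running with (the path below is the default for DSE package installs; adjust for your layout):

    # Flush behaviour is governed by two settings:
    #   memtable_total_space_in_mb - heap budget for memtables; when it
    #                                fills, the largest ones get flushed
    #   memtable_flush_writers     - threads writing memtables to disk
    grep -E 'memtable_(total_space_in_mb|flush_writers)' \
        /etc/dse/cassandra/cassandra.yaml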

When the FlushWriters are blocked, the heap begins to fill up. If they stay blocked, you will see requests start to queue up, and eventually the node will OOM.
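
If you want to watch this happening, heap usage is visible from nodetool (a sketch; the 30-second interval is arbitrary):

    # The "Heap Memory (MB)" line climbing toward its maximum while
    # FlushWriter stays blocked is the failure pattern described above
    watch -n 30 'nodetool info | grep -i heap'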

Compaction could be running as well. Compaction is a long sequential read of SSTables into memory, followed by a long sequential flush of the merge-sorted result. More sequential IO.
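
You can check whether compactions are competing with flushes, and throttle them if so (the 16 MB/s below is just an example value; it is also Cassandra's shipped default):

    # What compactions are running right now, and how far along they are
    nodetool compactionstats

    # Cap compaction throughput (in MB/s) so flushes keep more of the
    # disk's sequential write bandwidth; 0 disables the throttle
    nodetool setcompactionthroughput 16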

So all of these disk operations are sequential, not random IOPS. If your disk is not capable of handling simultaneous sequential reads and writes, IOWait shoots up, requests get blocked, and then Cassandra has a very bad day.
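
You can see this directly with iostat (from the sysstat package); high await and %util on the data volume during flushes and compactions is the signature:

    # Extended per-device statistics every 5 seconds; watch await
    # (request latency) and %util on the LVM data volume
    iostat -x 5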

You mentioned that you are using Ceph. I have not seen a successful deployment of Cassandra on Ceph. It will hold up for a while and then tip over under sequential load. The simplest short-term solution is to add more nodes to spread out the load. The medium term is to find ways to optimize your stack for sequential disk loads, but that will eventually fail. The long-term fix is to get your data onto real disks and off of shared storage.

As I've been telling clients for years when consulting on Cassandra: "If your storage has a network plug, you are doing it wrong." Good rule of thumb.


Source: https://habr.com/ru/post/1246638/

