As I suspected, you are having throughput problems with your disk. Here's the background on what I looked at. The output of nodetool tpstats from your three nodes contains the following lines:
```
Pool Name      Active   Pending   Completed   Blocked   All time blocked
FlushWriter         0         0          22         0                  8
FlushWriter         0         0          80         0                  6
FlushWriter         0         0          38         0                  9
```
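For reference, a quick way to pull just these rows on each node is something like the sketch below (it assumes nodetool is on the PATH and can reach the local JMX port):

```
# Run on each node; prints the tpstats header plus the FlushWriter row only.
nodetool tpstats | grep -E 'Pool Name|FlushWriter'
```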
The column I'm worried about is All time blocked. As a ratio to Completed, you have a lot of blocking: roughly 8/22, 6/80, and 9/38 on the three nodes. The FlushWriter pool is responsible for flushing memtables to disk so that the JVM does not run out of memory or create massive GC problems. A memtable is the in-memory representation of your tables. As your nodes take in more writes, the memtables fill up and need to be flushed. That operation is a long sequential write to disk. Bookmark that; I'll come back to it.
When the flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see requests start to queue up, and eventually the node will OOM.
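If you want to watch that heap pressure build alongside the flush backlog, a rough sketch is below (the exact wording of the heap fields in nodetool info output can vary between Cassandra versions):

```
# Snapshot heap usage and the FlushWriter backlog every 30 seconds.
while true; do
    date
    nodetool info | grep -i 'heap'
    nodetool tpstats | grep -E 'Pool Name|FlushWriter'
    sleep 30
done
```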
Compaction may be running as well. Compaction is a long sequential read of SSTables into memory followed by a long sequential write of the merge-sorted result. More sequential IO.
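You can check whether compactions are competing with flushes for the same disk at the moment you see blocking; a minimal check, assuming a reasonably recent nodetool:

```
# Shows pending and active compactions; run this while FlushWriter is blocking.
nodetool compactionstats
```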
So all of these disk operations are sequential. Not random IOPS. If your disk cannot handle simultaneous sequential reads and writes, iowait climbs, requests get blocked, and then Cassandra has a very bad day.
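One way to confirm the disk is the bottleneck is to watch iowait and device utilization while flushes and compactions are running; a sketch, assuming the sysstat package is installed:

```
# 5-second samples of extended device stats.
# Look at %iowait in the avg-cpu section and the utilization/await columns for the
# Cassandra data device: sustained high values during flush/compaction point at the disk.
iostat -x 5
```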
You mentioned that you are using Ceph. I have not yet seen a successful deployment of Cassandra on Ceph. It will hold up for a while and then tip over under sequential load. The easiest short-term fix is to add more nodes to spread out the load. The medium-term approach is to find ways to optimize your stack for sequential disk load, but that will eventually fail too. Long term, get your data onto real disks and off shared storage.
I have been telling consulting clients this for years when they use Cassandra: "If your storage has a network plug, you are doing it wrong." It's a good rule of thumb.