Update - Short Version:
The cassandra-topology.properties file used by PropertyFileSnitch on the first 3 nodes (Rack1-3) lists only those nodes as members of DC1 and assigns everything else to DC2 via the default entry default=DC2:r1. When the cluster was expanded with nodes 4 and 5, the file on the new nodes was updated to add them to DC1 on Rack4 and Rack5, but the file on the first 3 nodes was left unchanged, so the cluster ended up with an inconsistent view of its own topology.
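To illustrate, this is my reconstruction of what the files currently look like (the IP-to-rack mapping is taken from the nodetool status output further down; this is not a verbatim copy of the files):

# cassandra-topology.properties on nodes 1-3 (stale, pre-expansion)
10.0.0.10=DC1:Rack1
10.0.0.11=DC1:Rack2
10.0.0.12=DC1:Rack3
# nodes 4 and 5 are not listed, so they fall through to the default
default=DC2:r1

# cassandra-topology.properties on nodes 4-5 (updated during the expansion)
10.0.0.10=DC1:Rack1
10.0.0.11=DC1:Rack2
10.0.0.12=DC1:Rack3
10.0.0.13=DC1:Rack4
10.0.0.14=DC1:Rack5
default=DC2:r1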
My question is whether this cluster can be rebalanced (fixed). Would it be enough to fix cassandra-topology.properties and then completely restart the cluster?
Please advise how I can safely rebalance the cluster.
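To make the question concrete, this is roughly what I have in mind; the file location and service name are assumptions on my part and depend on how Cassandra was installed:

# on each of the 5 nodes, one at a time, after preparing a corrected topology file
sudo cp cassandra-topology.properties /etc/cassandra/cassandra-topology.properties
sudo systemctl restart cassandra   # or should this be a full cluster stop/start instead?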
Longer version:
I am new to Cassandra and I started working on an already built cluster.
I have 5 nodes in one data center, on different racks, running Cassandra 3.0.5 with vnodes (num_tokens: 256) and a keyspace with replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true.
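The keyspace in question (name anonymized here) is defined roughly like this:

CREATE KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}
  AND durable_writes = true;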
Originally there were only 3 nodes; the cluster was later expanded with 2 more. I have an automatic repair script that runs nodetool repair with the parameters parallelism: parallel, primary range: false, incremental: true, job threads: 1.
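As far as I can tell, that is roughly equivalent to running the following on each node (in 3.0, parallel and incremental repair are already the defaults, so only the job-thread count is passed explicitly):

nodetool repair -j 1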
Problems started after a large amount of data was written. When the repair script runs on node 4 or 5, node 2 becomes overloaded: CPU load stays at 100%, the MutationStage queue grows, and GC pauses last at least 1 s each, until the Cassandra process finally dies. The repair typically fails with the error Stream failed (progress: 0%).
When I run nodetool status on nodes 1, 2, or 3, I get the following output:

Datacenter: DC2
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID   Rack
UN  10.0.0.13  10.68 GB  256     0.0%              75e17b8a  r1
UN  10.0.0.14  9.43 GB   256     0.0%              21678ddb  r1

Datacenter: DC1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID   Rack
UN  10.0.0.10  16.14 GB  256     100.0%            cf9d327f  Rack1
UN  10.0.0.11  22.83 GB  256     100.0%            e725441e  Rack2
UN  10.0.0.12  19.66 GB  256     100.0%            95b5c8e3  Rack3
But when I run nodetool status on nodes 4 or 5, I get the following output:

Datacenter: DC1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID   Rack
UN  10.0.0.13  10.68 GB  256     58.9%             75e17b8a  Rack4
UN  10.0.0.14  9.43 GB   256     61.1%             21678ddb  Rack5
UN  10.0.0.10  16.14 GB  256     60.3%             cf9d327f  Rack1
UN  10.0.0.11  22.83 GB  256     61.4%             e725441e  Rack2
UN  10.0.0.12  19.66 GB  256     58.3%             95b5c8e3  Rack3
After further research, it seems that the PropertyFileSnitch file cassandra-topology.properties was not updated on nodes 1, 2, and 3 (which are also the seeds for this cluster) when the cluster was expanded.
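For what it's worth, this is how I compared the file across the nodes (the path is an assumption on my part; it may live under /etc/cassandra/conf on some installs):

for h in 10.0.0.10 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14; do
  ssh "$h" md5sum /etc/cassandra/cassandra-topology.properties
done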
Thanks!