I tested rotation through a 4-node cluster by adding and removing nodes in a circular manner, so that the cluster membership followed this repeating sequence:
{1, 2, 3} -> {2, 3} -> {2, 3, 4} -> {3, 4} -> {1, 3, 4} -> {1, 4} -> {1, 2, 4} -> {1, 2} -> {1, 2, 3} -> {2, 3} -> {2, 3, 4} -> {3, 4} -> {1, 3, 4} -> {1, 4} -> ...
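For reference, the rotation can be reproduced with a small generator. This is only a sketch of the membership schedule, not of anything Cassandra does; the function name `rotation` and its parameters are made up for illustration.

```python
from itertools import islice


def rotation(n=4, size=3):
    """Yield successive cluster memberships as sorted tuples of node numbers.

    Each step alternates between a full window of `size` consecutive nodes
    (wrapping around after node n) and the same window with its oldest
    node removed.
    """
    start = 0
    while True:
        # State after an "add node": `size` consecutive nodes.
        yield tuple(sorted((start + i) % n + 1 for i in range(size)))
        # State after a "remove node": the oldest node in the window is gone.
        yield tuple(sorted((start + 1 + i) % n + 1 for i in range(size - 1)))
        start = (start + 1) % n


# First nine memberships, ending back at the starting set.
states = list(islice(rotation(), 9))
for members in states:
    print(members)
```

Flattening the first 14 states gives back the sequence of node numbers listed above.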
Adding a node was done by stopping cassandra, clearing /var/lib/cassandra/* and restarting cassandra (with the same cassandra.yaml file, which lists nodes 1 and 2 as seeds). Removing a node was done by stopping cassandra on it and then issuing nodetool removenode $nodeId from another node. In all cases, the next operation did not start until the previous one had completed.
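The two operations can be written down as command lists, which makes the order of the steps explicit. This is a rough sketch that only builds the command strings and runs nothing. The `service cassandra stop/start` commands are an assumption about how the service is managed (the question does not say), and the helper names are hypothetical; `nodetool removenode` and the data path come from the description above.

```python
# Assumption: Cassandra is managed through `service`; adjust to your init system.
STOP = "service cassandra stop"
START = "service cassandra start"


def add_node_commands(data_dir="/var/lib/cassandra"):
    """Commands run on the node being (re)added, in order."""
    return [
        STOP,
        f"rm -rf {data_dir}/*",  # wipe old data so the node joins as a fresh member
        START,                   # same cassandra.yaml, seeds are nodes 1 and 2
    ]


def remove_node_commands(host_id):
    """Commands for removing a node, grouped by where they are run."""
    return {
        "on_departing_node": [STOP],
        "on_another_node": [f"nodetool removenode {host_id}"],
    }


if __name__ == "__main__":
    print(add_node_commands())
    # Host ID taken from the nodetool status output in this question.
    print(remove_node_commands("ff18b636-6287-4c70-bf23-0a1a1814b864"))
```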
I repeated the above sequence several times. After 4 iterations, I performed the "add node" operation to move from the cluster of nodes {1, 2} to the cluster of nodes {1, 2, 3}. On this iteration, my custom keyspace did not propagate to node 3. The nodetool status output looked fine:
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.12.206  164.88 KB  256     66.2%             7018ef8a-af08-40e9-b3d3-065f4ba6eb0d  rack1
UN  192.168.12.207  60.85 KB   256     63.2%             ff18b636-6287-4c70-bf23-0a1a1814b864  rack1
UN  192.168.12.205  217.19 KB  256     70.6%             2bc38fa8-42a1-457f-84d7-35b3b46e1daa  rack1
But cqlsh on node 3 did not know about my keyspace. I tried running nodetool repair, which seemed to hang indefinitely while spewing the following pair of stack traces into the log:
WARN [Thread-9781] 2014-09-16 19:34:30,081 IncomingTcpConnection.java (line 83) UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=08768b1d-97a1-3528-8191-9acee7b08ef4
    at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:178)
    at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:103)
    at org.apache.cassandra.service.paxos.Commit$CommitSerializer.deserialize(Commit.java:145)
    at org.apache.cassandra.service.paxos.Commit$CommitSerializer.deserialize(Commit.java:134)
    at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:153)
    at org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion(IncomingTcpConnection.java:130)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:74)
ERROR [Thread-9782] 2014-09-16 19:34:31,484 CassandraDaemon.java (line 199) Exception in thread Thread[Thread-9782,5,main]
java.lang.NullPointerException
    at org.apache.cassandra.db.RangeSliceCommandSerializer.deserialize(RangeSliceCommand.java:247)
    at org.apache.cassandra.db.RangeSliceCommandSerializer.deserialize(RangeSliceCommand.java:156)
    at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:153)
    at org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion(IncomingTcpConnection.java:130)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:74)
Any ideas what is happening and how to fix it (ideally, both a reliable way to repair the cluster and a way to avoid getting into this state in the first place)?