I would use ZooKeeper + Norbert to track which hosts come up and go down:
http://www.ibm.com/developerworks/library/j-zookeeper/
Now every node in my chat server farm knows all the hosts in the logical cluster, and each will receive a callback when a node goes offline (or comes online). Any node can now keep a sorted list of the current cluster members, hash the chat identifier, and mod by the size of the list to get an index into that list: that is the node which should host any given chat. To compute a second host that stores a redundant copy of the chat, we add 1 to the hash input and rehash, looping until we land on a different index. On each of the two chat hosts there is a chat room actor that simply forwards every chat message to each websocket actor that is a member of the chat.
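The host-selection step above can be sketched as follows. This is a minimal, language-agnostic illustration of "hash, mod by cluster size, then add 1 and rehash until a distinct index appears"; the function name `pick_hosts` and the use of md5 as the hash are my own assumptions, not part of the original design.

```python
import hashlib

def pick_hosts(chat_id: str, members: list[str]) -> tuple[str, str]:
    """Pick the primary and secondary host for a chat: hash the chat id,
    mod by cluster size, then 'add a salt and rehash' (looping) until a
    second, distinct index is found."""
    sorted_members = sorted(members)   # every node keeps the same sorted view
    n = len(sorted_members)
    h = int(hashlib.md5(chat_id.encode()).hexdigest(), 16)
    primary = h % n
    secondary, salt = primary, 1
    while secondary == primary and n > 1:
        h2 = int(hashlib.md5(f"{chat_id}:{salt}".encode()).hexdigest(), 16)
        secondary = h2 % n
        salt += 1
    return sorted_members[primary], sorted_members[secondary]

print(pick_hosts("room-42", ["host-c", "host-a", "host-b"]))
```

Because every node sorts the same membership list and hashes deterministically, all nodes independently agree on the same primary/secondary pair for a given chat.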
Now we can send chat messages through both active chat room actors with a regular Akka router. The client sends each message once, and the router performs the hash-mod and sends it to the two remote chat room actors. I would use the Twitter Snowflake algorithm to generate unique 64-bit identifiers for sent messages; see the algorithm in the nextId() method of the code at the link below. The datacenterId and workerId can be set from Norbert properties to ensure that clashing identifiers are never generated on different servers:
https://github.com/twitter/snowflake/blob/master/src/main/scala/com/twitter/service/snowflake/IdWorker.scala
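For reference, a sketch of the Snowflake id layout from the linked IdWorker.scala, transliterated to Python: 41 bits of millisecond timestamp, 5 bits of datacenterId, 5 bits of workerId, and a 12-bit per-millisecond sequence. This is an illustration of the bit layout, not Twitter's production code; the epoch constant is Twitter's published twepoch.

```python
import time
import threading

# Bit layout: timestamp(41) | datacenterId(5) | workerId(5) | sequence(12)
TIMESTAMP_SHIFT = 22
DATACENTER_SHIFT = 17
WORKER_SHIFT = 12
SEQUENCE_MASK = 0xFFF

class IdWorker:
    def __init__(self, datacenter_id: int, worker_id: int,
                 epoch: int = 1288834974657):  # Twitter's twepoch (ms)
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self.epoch = epoch
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000)
            if ts == self.last_ts:
                # same millisecond: bump the sequence field
                self.sequence = (self.sequence + 1) & SEQUENCE_MASK
                if self.sequence == 0:
                    # sequence exhausted: spin until the next millisecond
                    while ts <= self.last_ts:
                        ts = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ts = ts
            return ((ts - self.epoch) << TIMESTAMP_SHIFT) \
                 | (self.datacenter_id << DATACENTER_SHIFT) \
                 | (self.worker_id << WORKER_SHIFT) \
                 | self.sequence
```

Ids generated by one worker are strictly increasing, and the embedded datacenterId/workerId bits are what the deduplication step below relies on.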
Now two copies of each message will arrive at each client endpoint, one through each of the two active chat room actors. For each websocket client, I would unmask the snowflake identifiers to recover the datacenterId+workerId of the sending node, and track the highest chat message id yet seen from each host in the cluster. Any message whose id is not higher than what that client has already seen from that sender node is then ignored. This deduplicates the message pairs coming through the two active chat room actors.
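The per-client deduplication rule can be sketched like this, assuming the Snowflake bit layout above (datacenterId at bits 17–21, workerId at bits 12–16); the class name `Deduplicator` is mine, for illustration only.

```python
class Deduplicator:
    """Per-client filter: unmask datacenterId+workerId from each snowflake
    id and drop any message whose id is not higher than the last one seen
    from that sender node."""

    def __init__(self):
        # (datacenter_id, worker_id) -> highest message id seen so far
        self.highest: dict[tuple[int, int], int] = {}

    def accept(self, msg_id: int) -> bool:
        sender = ((msg_id >> 17) & 0x1F, (msg_id >> 12) & 0x1F)
        if msg_id <= self.highest.get(sender, -1):
            return False  # duplicate delivered via the second chat room actor
        self.highest[sender] = msg_id
        return True
```

Because each sender node's ids are strictly increasing, "not higher than the highest seen" is a safe duplicate test per sender, while different sender nodes are tracked independently.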
So far so good: we have redundant messaging, in that if any node dies we still have one surviving copy of each chat, and messages keep flowing through the second chat room actor.
Next we need to deal with nodes being removed from, or added back to, the cluster. We will get a Norbert callback on every node notifying us of cluster membership changes. In this callback we can send an Akka message through a custom router carrying the new membership list and the current host name. The custom router on each node will see this message and update its state with the new cluster membership, so that it can compute the new pair of nodes to send any given chat's traffic to. This acknowledgement of the new cluster membership is sent by the router to all nodes, so that every server can track when all servers have caught up with the membership change and are routing correctly again.
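The "track when all servers have caught up" bookkeeping can be sketched as a small acknowledgement tracker; the class and method names here are hypothetical, and the membership "version" is simply my stand-in for identifying a particular membership change.

```python
class MembershipAckTracker:
    """Sketch: each router broadcasts an ack carrying the membership
    version it has adopted; a node knows the cluster has converged once
    every current member has acked that version."""

    def __init__(self):
        self.acks: dict[int, set[str]] = {}  # version -> hosts that acked

    def record_ack(self, version: int, host: str) -> None:
        self.acks.setdefault(version, set()).add(host)

    def converged(self, version: int, members: set[str]) -> bool:
        # converged when the current members are a subset of the ackers
        return members <= self.acks.get(version, set())
```

This is the signal the decommissioning logic below waits on before an old chat copy may safely shut down.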
A long-lived chat may remain active across a membership change. In that case all routers on all nodes keep sending to it as before, but also speculatively send to the new chat host. That second chat copy may not be up yet, but this is not a problem, since messages still get through via the survivor. If the surviving chat's node is no longer one of the two designated hosts after the membership change, all routers on all hosts will initially send to three hosts: the survivor and the two new nodes. The Akka death-watch mechanism can be used so that all nodes eventually see the surviving chat shut down and fall back to routing its traffic through just the two new hosts.
Then we need to hand the chat off from the surviving server to one or two new hosts, depending on the circumstances. At some point the surviving chat actor will receive the message announcing the new cluster membership. It starts by sending a copy of the chat's membership to the new hosts; that message creates new copies of the chat room actor, with the correct membership, on those hosts. If the survivor's node is no longer one of the two nodes that should hold the chat, it goes into decommissioning mode. In decommissioning mode it only forwards messages to the new primary and secondary nodes, rather than delivering them to chat members itself. Akka message forwarding is perfect for this.
The decommissioning chat actor will listen for the cluster-membership acknowledgement messages from each node. Eventually it will see that every node in the cluster has acknowledged the new membership. It then knows that no further messages will arrive for forwarding, and it can stop itself. Akka hot-swapping of actor behaviour (become/unbecome) is ideal for implementing the decommissioning mode.
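Putting the last two paragraphs together, a decommissioning chat's lifecycle can be sketched as below. This is a plain-Python stand-in for what would be an Akka actor with swapped-in forwarding behaviour; the names `DecommissioningChat`, `on_message`, and `on_membership_ack` are illustrative only.

```python
class DecommissioningChat:
    """Sketch: a chat copy whose node is no longer one of the two owning
    hosts forwards all traffic to the new primary/secondary, and stops
    itself once every cluster member has acked the new membership."""

    def __init__(self, new_hosts: list[str], members: set[str]):
        self.new_hosts = list(new_hosts)   # new primary and secondary
        self.pending = set(members)        # nodes yet to ack the change
        self.stopped = False
        self.forwarded = []                # (host, msg) pairs, for the sketch

    def on_message(self, msg) -> None:
        if not self.stopped:
            for host in self.new_hosts:    # forward only; no local delivery
                self.forwarded.append((host, msg))

    def on_membership_ack(self, host: str) -> None:
        self.pending.discard(host)
        if not self.pending:
            self.stopped = True  # no further traffic can arrive; stop self
```

Waiting for the full set of acks before stopping is what guarantees no message can still be in flight toward the old copy when it goes away.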
So far so good: we have a fault-tolerant messaging setup that will not lose messages on a node failure. At the moment of a cluster membership change we will see a burst of intra-cluster network traffic as chats are copied to new nodes, plus a residual burst of node-to-node message forwarding until every server has caught up on which two servers each chat has moved to. If we want to scale the system out, we can wait for a low point in user traffic and simply bring up a new node; chats will be redistributed across the new set of nodes automatically.
The above description is based on reading the following paper and translating its ideas into Akka concepts:
https://www.dropbox.com/s/iihpq9bjcfver07/VLDB-Paper.pdf