Avoiding overuse of consensus protocols in a distributed system

I am new to distributed systems and I read about "simple Paxos". This creates a lot of chatter, and I think about the implications for the job.

Suppose you are creating a globally distributed database with several small clusters located in different places. It seems important to minimize the number of cross-site messages.

  • What decisions should you definitely use to reach consensus? The only thing I thought about for sure was to decide whether to add or remove a node (or a set of nodes?) From the network. It seems that this is necessary for the vector clock to work. Another, of which I was less sure, was deciding to record in the same place, but should a leader who is elected through Paxos have to do this?

  • It would be nice to avoid having all the nodes in the system make decisions together. Can several nodes in each local cluster participate in solutions with several clusters, and all local nodes exchange data using local Paxos to determine local answers to cross-site site questions? The delay will be the same if the network is not saturated, but network traffic between sites will be much easier.

  • Say you can split a database table into rows and assign each subset of rows a subset of nodes. Is it normal to select a node set to host each subset of data using Paxos on all the machines in the system, and then start Paxos between these nodes for all operations related to this subset of data?

And yet all: are there any other design-related or algorithmic optimizations that they handle?

+6
source share
1 answer

Good questions and good ideas!

This creates a lot of chatter, and I think about the implications for the job.

Suppose you are creating a globally distributed database with several small clusters located in different places. It seems important to minimize the number of cross-site messages.

What decisions should you use to reach consensus? The only thing I thought about for sure was to decide whether to add or remove a node (or a set of nodes?) From the network. It seems that this is necessary for the vector clock to work. Another, of which I was less sure, was deciding to record in the same place, but should a leader who is elected through Paxos have to do this?

Yes, performance is a problem that my team also observed in practice. We maintain a consistent database and distributed lock manager; and orignally used Paxos for all records, some reads and cluster membership updates.

Here are some of our optimizations:

  • As far as possible, the nodes sent transitions to the Distinguished Proposer / Learner (selected through Paxos), which
    • decided on a record order and
    • batch transitions, waiting for a response from the previous instance. (But too many failures also caused problems.)
  • We examined the use of multi-paxos, but in the end we did something cooler (see below).

With these optimizations, we still lacked performance, so we divided our server into three levels. The bottom layer is Paxos; he does what you offer; namely, it just solves the middle node node membership. The middle tier is a consensus protocol with content based on high-speed chains that makes consensus and order for the database. (BTW, chain consensus can be thought of as Vertical Paxos.) At the top level, databases / locks and client connections are now simply supported. This design has increased latency by several orders of magnitude and increased throughput.


It would be nice to avoid having all the nodes in the system make decisions together. Can several nodes in each local cluster participate in solutions with several clusters, and all local nodes exchange data using local Paxos to determine local answers to cross-site site questions? The delay will be the same if the network is not saturated, but network traffic between sites will be much easier.

Say you can split a database table into rows and assign each subset of rows a subset of nodes. Is it normal to select a set of nodes to host each subset of data using Paxos on all computers in the system, and then run Paxos between these nodes for all operations related to this subset of data?

These two together remind me of Google Wrench . If you skip parts regarding time, this essentially makes 2PCs worldwide and Paxos shards. (IIRC.)

+6
source

Source: https://habr.com/ru/post/943992/


All Articles