Paxos Made Simple


1. Introduction

The Paxos algorithm for implementing fault-tolerant distributed systems has long been considered difficult to understand, perhaps because the original presentation, written in the form of a story about ancient Greece, was Greek to many readers [5]. In fact, it is among the simplest and most obvious of distributed algorithms. At its core is a consistency (consensus) algorithm, the “synod” algorithm of [5]. As the next section shows, this algorithm follows almost unavoidably from the properties that a consistency algorithm must satisfy. The last section describes the complete Paxos algorithm, obtained by applying the consistency algorithm to the state machine approach for building a distributed system. That approach should be well known, since it is the subject of what is probably the most often cited article on the theory of distributed systems [4].

2. Consistency algorithm

2.1 Problem Description

Suppose there is a set of processes that can propose values. A consistency algorithm must guarantee the following:
Of the values proposed, exactly one is chosen.
If no value is proposed, no value is chosen.
Once a value has been chosen, processes should be able to learn the chosen value.

The safety requirements for consistency are:
Only a value that has been proposed may be chosen.
Only a single value is chosen, and
A process never learns that a value has been chosen unless it actually has been chosen.

We will not try to specify the liveness requirements precisely. The goal, however, is to ensure that some proposed value is eventually chosen and that, once a value has been chosen, a process can eventually learn it.
A distributed algorithm has two important kinds of properties, safety and liveness:
Safety properties state that something bad never happens.
Liveness properties state that something good eventually happens.

In this consistency algorithm there are three participating roles: Proposer, Acceptor and Learner. In a concrete implementation, a single process may play more than one role, but the mapping from roles to processes does not concern us here.

Assume that participants communicate by sending messages. We use the customary asynchronous, non-Byzantine model:
Each participant runs at arbitrary speed, may fail by stopping, and may restart. Because all participants may fail after a value is chosen and then restart, no solution is possible unless a participant that has failed and restarted can remember some information.
Messages can take arbitrarily long to be delivered and can be duplicated or lost, but they are never corrupted (they are not tampered with, so Byzantine problems do not occur).

2.2 Selection of proposals

The simplest way to choose a value is to have a single Acceptor. A Proposer sends a proposal to the Acceptor, which chooses the first proposal it receives. Simple as it is, this solution is unsatisfactory, because if the Acceptor fails, no further progress is possible.

So we try another way of choosing a value: using multiple Acceptors to avoid depending on a single one. A Proposer sends a proposed value to a set of Acceptors, and an Acceptor may accept (pass) it. The value is chosen when a large enough set of Acceptors has accepted it. How large is large enough? To ensure that only one value is chosen, we can let a large enough set consist of any majority of the Acceptors. Because any two majorities have at least one Acceptor in common, this works if an Acceptor can accept at most one value (this use of majorities, and its generalizations, has been studied in many papers [3]).
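
The intersection property can be checked by brute force for a small system. The following is a minimal sketch, assuming five Acceptors with hypothetical names a1 to a5; the helper names `majorities` and `all_pairs_intersect` are illustrative and not part of the paper.

```python
from itertools import combinations

def majorities(acceptors):
    """Return every subset of `acceptors` that contains more than half of them."""
    members = sorted(acceptors)
    quorum = len(members) // 2 + 1
    return [set(c)
            for size in range(quorum, len(members) + 1)
            for c in combinations(members, size)]

def all_pairs_intersect(sets):
    """True if every two sets in `sets` share at least one member."""
    return all(a & b for a, b in combinations(sets, 2))

acceptors = {"a1", "a2", "a3", "a4", "a5"}   # hypothetical acceptor names
quorums = majorities(acceptors)
print(len(quorums), all_pairs_intersect(quorums))
```

Two sets that are not majorities, such as {a1, a2} and {a3, a4}, can be disjoint, which is why smaller sets would not be safe.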

In the absence of failures or message loss, we want a value to be chosen even if only one value is proposed by a single Proposer. This implies the following requirement:

P1. An Acceptor must accept the first proposal that it receives.

But this requirement raises a problem. If several values are proposed by different Proposers at about the same time, every Acceptor may have accepted some value while no single value has been accepted by a majority. Even with only two proposed values, if each is accepted by about half the Acceptors, the failure of a single Acceptor can make it impossible to learn which value was chosen.
For example, suppose there are five Acceptors, two of which have accepted proposal A and three of which have accepted proposal B. If one of those three fails, the surviving Acceptors show two acceptances each for A and B, so it is impossible to tell which was chosen.
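
The five-Acceptor example can be reproduced in a few lines. This is only an illustrative sketch: the function `learnable_value` and the Acceptor names are assumptions, and it models a Learner that can see only the acceptances of Acceptors that are still running.

```python
from collections import Counter

def learnable_value(visible_votes, total_acceptors):
    """Return the value accepted by a majority of all acceptors, or None if
    the visible acceptances are not enough to tell."""
    needed = total_acceptors // 2 + 1
    for value, count in Counter(visible_votes.values()).items():
        if count >= needed:
            return value
    return None

votes = {"a1": "A", "a2": "A", "a3": "B", "a4": "B", "a5": "B"}
print(learnable_value(votes, 5))             # all five acceptors visible

surviving = {name: v for name, v in votes.items() if name != "a5"}
print(learnable_value(surviving, 5))         # a5 has failed
```

With all five Acceptors visible the function reports B; once a5 is removed, neither value reaches three visible acceptances and the result is None.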

P1, together with the requirement that a value is chosen only when a majority of Acceptors has accepted it, implies that an Acceptor must be allowed to accept more than one proposal. We keep track of the different proposals an Acceptor may accept by assigning a number to each proposal, so a proposal consists of a proposal number and a value. To avoid confusion, different proposals must have different numbers. How this is achieved depends on the implementation; for now we simply assume it. A value is chosen when a single proposal with that value has been accepted by a majority of the Acceptors. In that case, we say that the proposal (as well as its value) has been chosen.

We can allow multiple proposals to be chosen, but we must guarantee that all chosen proposals have the same value. By induction on the proposal number, it suffices to guarantee:
P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen also has value v.

Because the numbers are totally ordered, condition P2 guarantees the key safety property that only a single value is chosen.

To be chosen, a proposal must be accepted by at least one Acceptor, so we can satisfy P2 by satisfying:

P2a. If a proposal with value v is chosen, then every higher-numbered proposal accepted by any Acceptor also has value v.

We still maintain P1 to ensure that some proposal is chosen. Because communication is asynchronous, a proposal could be chosen while some particular Acceptor c has never received any proposal. Suppose a new Proposer wakes up and issues a higher-numbered proposal with a different value. P1 requires c to accept this proposal, which violates P2a. To maintain both P1 and P2a, we must strengthen P2a to:

P2b. If a proposal with value v is chosen, then every higher-numbered proposal issued by any Proposer also has value v.

A proposal must be issued by a Proposer before it can be accepted by an Acceptor, so P2b implies P2a, which in turn implies P2.

To discover how to satisfy P2b, consider how we would prove that it holds. Assume that some proposal with number m and value v is chosen; we must show that any proposal issued with number n > m also has value v. The proof is easier by induction on n, so that we can prove that proposal number n has value v under the additional assumption that every proposal issued with a number in m..(n-1) has value v, where i..j denotes the set of numbers from i through j. For the proposal numbered m to be chosen, there must be some set C consisting of a majority of Acceptors such that every Acceptor in C accepted it. Combined with the induction hypothesis, the assumption that m is chosen implies:

Every Acceptor in C has accepted a proposal with a number in m..(n-1), and every proposal with a number in m..(n-1) accepted by any Acceptor has value v.

Since any set S consisting of a majority of Acceptors contains at least one member of C, we can conclude that a proposal numbered n has value v by ensuring that the following invariant is maintained:

P2c. For any v and n, if a proposal with number n and value v is issued, then there is a set S consisting of a majority of Acceptors such that one of the following holds:
    (a) No Acceptor in S has accepted any proposal numbered less than n.
    (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the Acceptors in S.

By maintaining the invariance of P2c, we can meet the requirements of P2b.
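
To make the invariant concrete, the following sketch tests whether a candidate proposal (n, v) is consistent with P2c, given a global view of what each Acceptor has accepted. Such a global view is an assumption made purely for illustration; no real participant has it, and the function name `satisfies_p2c` is hypothetical. Checking only the smallest majorities is enough, because any larger majority that satisfies (a) or (b) contains a smallest one that does too.

```python
from itertools import combinations

def satisfies_p2c(n, v, accepted):
    """`accepted` maps each acceptor name to the list of (number, value)
    proposals it has accepted. Return True if some majority S meets
    condition (a) or condition (b) of P2c for the proposal (n, v)."""
    names = sorted(accepted)
    quorum = len(names) // 2 + 1
    for s in combinations(names, quorum):
        earlier = [p for a in s for p in accepted[a] if p[0] < n]
        if not earlier:
            return True                      # condition (a)
        highest = max(earlier, key=lambda p: p[0])
        if highest[1] == v:
            return True                      # condition (b)
    return False

# Proposal (1, "x") has been accepted by a majority of three acceptors.
accepted = {"a1": [(1, "x")], "a2": [(1, "x")], "a3": []}
print(satisfies_p2c(2, "x", accepted), satisfies_p2c(2, "y", accepted))
```

Here a proposal numbered 2 may carry the value x but not y, because every majority contains an Acceptor that has already accepted (1, x).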

To maintain the invariance of P2c, a Proposer that wants to issue a proposal numbered n must learn the highest-numbered proposal with number less than n, if any, that has been or will be accepted by each Acceptor in some majority. Learning about proposals already accepted is easy; predicting future acceptances is hard. Instead of trying to predict the future, the Proposer controls it by extracting a promise that there will be no such acceptances. In other words, the Proposer asks the Acceptors not to accept any more proposals numbered less than n. This leads to the following algorithm for issuing proposals:
A Proposer chooses a new proposal number n and sends a request to each member of some set of Acceptors, asking it to respond with:

    (a) A promise never again to accept a proposal numbered less than n, and
    (b) The proposal with the highest number less than n that it has accepted, if any.
We call such a request a prepare request numbered n.

If the Proposer receives the requested responses from a majority of the Acceptors, it can issue a proposal with number n and value v, where v is the value of the highest-numbered proposal among the responses. If the responses report no proposals, the Proposer is free to choose any value.

The Proposer issues the proposal by sending, to some set of Acceptors, a request that the proposal be accepted (this need not be the same set of Acceptors that responded to the prepare request). We call this an accept request.
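
The rule for picking the value of a proposal from the prepare responses can be written as a small function. This is a sketch under assumed representations: each response is either None (no proposal accepted yet) or a (number, value) pair, and the name `value_to_propose` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

Proposal = Tuple[int, str]   # (proposal number, value)

def value_to_propose(responses: Sequence[Optional[Proposal]], own_value: str) -> str:
    """Pick the value for a new proposal from a majority of prepare responses."""
    reported = [p for p in responses if p is not None]
    if not reported:
        return own_value                     # nothing accepted yet: free choice
    return max(reported, key=lambda p: p[0])[1]

print(value_to_propose([None, (3, "x"), (5, "y")], "z"))
print(value_to_propose([None, None], "z"))
```

In the first call the Proposer must adopt y, the value of the highest-numbered reported proposal; in the second call it may use its own value z.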

This describes the algorithm of a Proposer. What about an Acceptor? It can receive two kinds of requests from Proposers: prepare requests and accept requests. An Acceptor can ignore any request without compromising safety, so we need only say when it is allowed to respond to a request. It can always respond to a prepare request. It can respond to an accept request, accepting the proposal, if and only if it has not promised not to. In other words:

P1a. An Acceptor can accept a proposal numbered n if and only if it has not responded to a prepare request numbered greater than n.

Observe that P1a subsumes P1.

We now have a complete algorithm for choosing a value that satisfies the safety requirements, assuming unique proposal numbers. The final algorithm is obtained with one small optimization.
Suppose an Acceptor receives a prepare request numbered n, but it has already responded to a prepare request numbered greater than n, thereby promising not to accept any new proposal numbered n. There is then no reason for it to respond to the new prepare request, since it will not accept the proposal numbered n that the Proposer wants to issue. So we let the Acceptor ignore such a prepare request. We also let it ignore a prepare request for a proposal it has already accepted.
With this optimization, an Acceptor needs to remember only the highest-numbered proposal it has ever accepted and the number of the highest-numbered prepare request to which it has responded. Because P2c must be kept invariant regardless of failures, an Acceptor must remember this information even if it fails and then restarts. A Proposer, by contrast, can always abandon a proposal and forget all about it, as long as it never tries to issue another proposal with the same number.

Putting the actions of the Proposer and the Acceptor together, the algorithm operates in the following two phases:
Phase 1:
(a) A Proposer selects a proposal number n and sends a prepare request numbered n to a majority of Acceptors.
(b) If an Acceptor receives a prepare request numbered n, and n is greater than the number of every prepare request to which it has already responded, it responds with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.
Phase 2:
(a) If the Proposer receives responses to its prepare request (numbered n) from a majority of Acceptors, it sends an accept request to each of those Acceptors for a proposal numbered n with value v, where v is the value of the highest-numbered proposal among the responses, or any value if the responses reported no proposals.
(b) If an Acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request numbered greater than n.
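
The two phases can be exercised with an in-memory sketch. It rests on several simplifying assumptions that are not in the paper: messages are direct method calls that are never lost, proposal numbers start above 0, the Proposer contacts every Acceptor rather than just a majority, and nothing is written to stable storage. The names `Acceptor`, `on_prepare`, `on_accept` and `run_round` are illustrative.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # number of the highest prepare request answered
        self.accepted = None   # highest-numbered proposal accepted: (number, value)

    def on_prepare(self, n):
        """Phase 1: promise and report the accepted proposal, or ignore (None)."""
        if n <= self.promised:
            return None
        self.promised = n
        return ("promise", self.accepted)

    def on_accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare request was answered."""
        if n < self.promised:
            return False
        self.accepted = (n, value)
        return True

def run_round(acceptors, n, own_value):
    """Run both phases for proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    replies = [r for r in (a.on_prepare(n) for a in acceptors) if r is not None]
    if len(replies) < majority:
        return None                                   # phase 1 failed
    reported = [acc for _, acc in replies if acc is not None]
    value = max(reported, key=lambda p: p[0])[1] if reported else own_value
    accepts = sum(a.on_accept(n, value) for a in acceptors)
    return value if accepts >= majority else None

acceptors = [Acceptor() for _ in range(3)]
first = run_round(acceptors, 1, "x")    # x is chosen
second = run_round(acceptors, 2, "y")   # a later proposer must re-propose x
stale = run_round(acceptors, 1, "z")    # number 1 is now too low and is ignored
print(first, second, stale)
```

The second round shows the safety argument at work: although its Proposer wanted y, the prepare responses force it to propose x again.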

A Proposer can issue multiple proposals, as long as it follows the algorithm for each one. It can abandon a proposal in the middle of the protocol at any time (correctness is maintained even if requests or responses for the proposal arrive long after it was abandoned). It is probably a good idea to abandon a proposal if some Proposer has begun trying to issue a higher-numbered one. Therefore, if an Acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, it should inform the Proposer, which should then abandon its proposal. This is a performance optimization that does not affect correctness.

2.3 Learning the chosen value

To learn that a value has been chosen, a Learner must find out that a proposal has been accepted by a majority of Acceptors. The obvious algorithm is to have each Acceptor, whenever it accepts a proposal, respond to all Learners, sending them the proposal. This lets Learners find out about a chosen value as soon as possible, but it requires each Acceptor to respond to each Learner, so the number of responses equals the product of the number of Acceptors and the number of Learners.
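
A Learner in this scheme only has to tally acceptances per proposal. The sketch below assumes each notification carries the Acceptor's identity, the proposal number and the value; the class and method names are hypothetical. Using a set per proposal makes duplicated messages harmless.

```python
from collections import defaultdict

class Learner:
    def __init__(self, num_acceptors):
        self.majority = num_acceptors // 2 + 1
        self.votes = defaultdict(set)   # (number, value) -> acceptors that accepted it
        self.chosen = None

    def on_accepted(self, acceptor_id, number, value):
        """Record one acceptance; return the chosen value once it is known."""
        self.votes[(number, value)].add(acceptor_id)
        if len(self.votes[(number, value)]) >= self.majority:
            self.chosen = value
        return self.chosen

learner = Learner(num_acceptors=5)
learner.on_accepted("a1", 1, "x")
learner.on_accepted("a1", 1, "x")        # duplicate delivery, counted once
learner.on_accepted("a2", 1, "x")
print(learner.on_accepted("a3", 1, "x"))
```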

The assumption of non-Byzantine failures makes it easy for one Learner to find out from another Learner that a value has been chosen. We can have the Acceptors send their acceptances to a distinguished Learner, which in turn informs the other Learners when a value has been chosen. This approach requires an extra round for all Learners to discover the chosen value, and it is less reliable, since the distinguished Learner could fail. But the number of responses is only the sum of the number of Acceptors and the number of Learners.

More generally, the Acceptors could send their acceptances to some set of distinguished Learners, each of which can then inform all the Learners when a value has been chosen. The larger this set, the greater the reliability, at the cost of greater communication complexity.

Because messages can be lost, a value could be chosen with no Learner ever finding out. A Learner could ask the Acceptors what proposals they have accepted, but the failure of an Acceptor could make it impossible to know whether a majority had accepted a particular proposal. In that case, Learners will find out what value is chosen only when a new proposal is chosen. If a Learner needs to know whether a value has been chosen, it can have a Proposer issue a proposal using the algorithm described above.

2.4 Progress

It is easy to construct a scenario in which two Proposers each keep issuing a sequence of proposals with increasing numbers, none of which is ever chosen. Proposer p completes Phase 1 for a proposal numbered n1. Another Proposer q then completes Phase 1 for a proposal numbered n2 (n2 > n1). The Phase 2 accept requests of Proposer p for the proposal numbered n1 are ignored, because the Acceptors have all promised not to accept any new proposal numbered less than n2. So Proposer p begins and completes Phase 1 for a new proposal numbered n3 (n3 > n2), causing the Phase 2 accept requests of Proposer q to be ignored, and so on.
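
This scenario is easy to replay. In the sketch below, a majority of Acceptors is modelled as a single object that sees every message, and the two Proposers are interleaved in exactly the unlucky order described; both simplifications, and the names `Acceptors` and `duel`, are assumptions made for illustration.

```python
class Acceptors:
    """A majority of acceptors, modelled as one unit that sees every message."""
    def __init__(self):
        self.promised = 0
        self.accepted = None

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True
        return False

    def accept(self, n, value):
        if n >= self.promised:
            self.accepted = (n, value)
            return True
        return False

def duel(steps):
    """Interleave proposers p and q so that each Phase 2 arrives too late."""
    acc = Acceptors()
    number = {"p": 1, "q": 2}          # proposal numbers currently in use
    acc.prepare(number["p"])           # p completes Phase 1 with n1
    turn, other = "q", "p"
    rejected = 0
    for _ in range(steps):
        acc.prepare(number[turn])      # `turn` completes Phase 1 with a higher number
        if not acc.accept(number[other], other):   # `other` sends Phase 2 too late
            rejected += 1
            number[other] = number[turn] + 1       # `other` retries with a higher number
        turn, other = other, turn
    return rejected, acc.accepted

print(duel(10))
```

Every accept request is rejected and nothing is ever accepted, no matter how many steps are run.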

To guarantee progress, a distinguished Proposer must be selected as the only one that tries to issue proposals. If the distinguished Proposer can communicate successfully with a majority of Acceptors, and if it uses a proposal number greater than any already used, then it will succeed in issuing a proposal that is accepted. By abandoning a proposal and trying again whenever it learns of a request with a higher proposal number, the distinguished Proposer will eventually choose a high enough proposal number.

If enough of the system (Proposer, Acceptors and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished Proposer. The well-known result of Fischer, Lynch, and Paterson [1] (the FLP result) implies that a reliable algorithm for electing a Proposer must use either randomness or real time, for example by using timeouts. However, safety is guaranteed whether or not the election succeeds.

2.5 Implementation

The Paxos algorithm [5] assumes a network of processes. In its consistency algorithm, each process plays the roles of Proposer, Acceptor and Learner. The algorithm chooses a Leader, which plays the roles of the distinguished Proposer and the distinguished Learner. The Paxos consistency algorithm is precisely the one described above, with requests and responses sent as ordinary messages (response messages are tagged with the corresponding proposal number to prevent confusion). Stable storage, preserved across failures, is used to hold the information that an Acceptor must remember. An Acceptor records its intended response in stable storage before actually sending it.

All that remains is to describe the mechanism that guarantees no two proposals are ever issued with the same number. Different Proposers choose their numbers from disjoint sets of numbers, so two different Proposers never issue a proposal with the same number. Each Proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue, and begins Phase 1 with a higher proposal number than any it has already used.
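
One common way to obtain disjoint sets of numbers is to give each server the numbers congruent to its own identifier modulo the number of servers. The paper does not prescribe this scheme; it is shown here only as an assumed example, with a hypothetical `NumberGenerator` class and server identifiers 0 to num_servers - 1. The `last_used` field stands in for the value that would be kept in stable storage.

```python
class NumberGenerator:
    """Hands out increasing numbers n with n % num_servers == server_id."""
    def __init__(self, server_id, num_servers, last_used=0):
        self.server_id = server_id
        self.num_servers = num_servers
        self.last_used = last_used   # would be reloaded from stable storage

    def next_number(self, at_least=0):
        """Return an unused number greater than both last_used and at_least."""
        floor = max(self.last_used, at_least)
        n = (floor // self.num_servers) * self.num_servers + self.server_id
        if n <= floor:
            n += self.num_servers
        self.last_used = n           # a real system would persist this first
        return n

g0 = NumberGenerator(server_id=0, num_servers=3)
g1 = NumberGenerator(server_id=1, num_servers=3)
print(g0.next_number(), g1.next_number(), g1.next_number(at_least=3), g0.next_number())
```

The `at_least` argument lets a Proposer jump past a higher number it has heard about from an Acceptor, while staying inside its own set.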

3. State Machine Implementation

A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking a command as input and producing an output and a new state. For example, the clients of a distributed banking system might be tellers, and the state machine state might consist of the account balances of all users. A withdrawal is performed by executing a state machine command that decreases an account balance if and only if the balance is greater than the amount withdrawn, producing the old and new balances as output.

An implementation that uses a single central server fails if that server fails. We therefore use a collection of servers instead, each of which independently implements the state machine. Because the state machine is deterministic, all the servers produce the same sequence of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.
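
The banking example can be written as a deterministic transition function. The command format (operation, account, amount) and the function names below are assumptions made for this sketch; the withdrawal rule follows the text, succeeding only when the balance is greater than the amount.

```python
def apply_command(balances, command):
    """Deterministic step: return (new_balances, (old_balance, new_balance))."""
    op, account, amount = command
    old = balances.get(account, 0)
    if op == "deposit":
        new = old + amount
    elif op == "withdraw" and old > amount:
        new = old - amount
    else:
        new = old                    # refused or unknown command: state unchanged
    updated = dict(balances)
    updated[account] = new
    return updated, (old, new)

def run(commands):
    """Execute a command sequence from the empty state, as one server would."""
    state, outputs = {}, []
    for command in commands:
        state, output = apply_command(state, command)
        outputs.append(output)
    return state, outputs

log = [("deposit", "alice", 100), ("withdraw", "alice", 30), ("withdraw", "alice", 500)]
replica_1 = run(log)
replica_2 = run(log)
print(replica_1 == replica_2, replica_1)
```

Two servers that execute the same log end in the same state with the same outputs, which is the property the replicated design relies on.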

To guarantee that all servers execute the same sequence of state machine commands, we implement a sequence of separate instances of the Paxos consistency algorithm, the value chosen by the i-th instance being the i-th state machine command in the sequence. Each server plays all the roles (Proposer, Acceptor and Learner) in each instance of the algorithm. For now, assume that the set of servers is fixed, so that all instances of the consistency algorithm use the same set of participants.

In normal operation, a single server is elected to be the Leader, which acts as the distinguished Proposer (the only one that tries to issue proposals) in all instances of the consistency algorithm. Clients send commands to the Leader, which decides where in the sequence each command should appear. If the Leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consistency algorithm. It will usually succeed. It might fail because of failures, or because another server also believes itself to be the Leader and has a different idea of what the 135th command should be. But the consistency algorithm ensures that at most one command can be chosen as the 135th one.

The key to the efficiency of this approach is that, in the Paxos consistency algorithm, the value to be proposed is not chosen until Phase 2. Recall that after completing Phase 1 of the algorithm, either the value to be proposed is determined or else the Proposer is free to propose any value.

We now describe how the Paxos state machine implementation works during normal operation; what can go wrong is discussed later. Consider what happens when the previous Leader has just failed and a new Leader has been elected (system startup is a special case in which no commands have yet been proposed).

The new Leader, being a Learner in all instances of the consistency algorithm, should know most of the commands that have already been chosen. Suppose it knows commands 1-134, 138 and 139, that is, the values chosen in instances 1-134, 138 and 139 of the consistency algorithm (we will see later how such a gap in the command sequence can arise). It then executes Phase 1 of instances 135-137 and of all instances greater than 139 (how it does this is described below). Suppose the outcome of these executions determines the value to be proposed in instances 135 and 140, but leaves the proposed value unconstrained in all other instances. The Leader then executes Phase 2 of instances 135 and 140, thereby choosing commands 135 and 140.

The Leader, as well as any other server that has learned all the commands the Leader knows, can now execute commands 1-135. However, it cannot execute commands 138-140, which it also knows, because commands 136 and 137 have not yet been chosen. The Leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special “no-op” command that leaves the state unchanged (it does this by executing Phase 2 of instances 136 and 137 of the consistency algorithm). Once these no-op commands have been chosen, commands 138-140 can be executed.
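
The bookkeeping in this scenario can be sketched as follows. The sketch assumes the chosen commands are held in a dictionary keyed by instance number and that every no-op the Leader proposes is in fact chosen; the names `executable_prefix` and `fill_gaps` are hypothetical.

```python
NOOP = "no-op"

def executable_prefix(chosen):
    """Return the largest i such that commands 1..i have all been chosen."""
    i = 0
    while (i + 1) in chosen:
        i += 1
    return i

def fill_gaps(chosen):
    """Return a copy in which every unchosen instance below the highest
    chosen one holds a no-op (assuming those no-op proposals succeed)."""
    filled = dict(chosen)
    for i in range(1, max(chosen, default=0) + 1):
        filled.setdefault(i, NOOP)
    return filled

# Commands 1-135 and 138-140 are chosen; 136 and 137 are missing.
chosen = {i: f"cmd{i}" for i in list(range(1, 136)) + [138, 139, 140]}
filled = fill_gaps(chosen)
print(executable_prefix(chosen), executable_prefix(filled), filled[136])
```

Before the gap is filled only commands 1-135 can be executed; afterwards execution can proceed through command 140.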

Commands 1-140 have now been chosen. The Leader has also completed Phase 1 for all instances of the consistency algorithm greater than 140, and it is free to propose any value in Phase 2 of those instances. It assigns command number 141 to the next command requested by a client, proposing it as the value in Phase 2 of instance 141 of the consistency algorithm. It proposes the next client command it receives as command 142, and so on.

The Leader can propose command 142 before it learns that its proposed command 141 has been chosen. It is possible for all the messages it sent in proposing command 141 to be lost, and for command 142 to be chosen before any other server has learned what the Leader proposed as command 141. When the Leader fails to receive the expected response to its Phase 2 messages in instance 141, it retransmits those messages. If all goes well, its proposed command will be chosen. However, it could fail first, leaving a gap in the sequence of chosen commands. In general, suppose a Leader can get up to a commands ahead (for some fixed number a), that is, it can propose commands i+1 through i+a after commands 1 through i have been chosen. A gap of up to a-1 commands could then arise.

A newly elected Leader executes Phase 1 for infinitely many instances of the consistency algorithm; in the scenario above, for instances 135-137 and all instances greater than 139. Using the same proposal number for all instances, it can do this by sending a single reasonably short message to the other servers. In Phase 1, an Acceptor responds with more than a simple OK only if it has already received a Phase 2 message from some Proposer (in the scenario above, this was the case only for instances 135 and 140). So a server, acting as an Acceptor, can respond for all instances with a single reasonably short message. Executing these infinitely many instances of Phase 1 therefore poses no problem.
It follows that, while the Leader remains stable, Phase 1 need not be repeated for each new command, as long as the proposal number it uses remains unique.

Since failure of the Leader and election of a new one should be rare events, the effective cost of executing a state machine command, that is, of reaching agreement on the command, is the cost of executing only Phase 2 of the consistency algorithm. It can be shown that Phase 2 of the Paxos consistency algorithm has the minimum possible cost of any algorithm for reaching agreement in the presence of faults [2]. Hence the Paxos algorithm is essentially optimal.

In normal operation of the system, we assume that there is always a single Leader, except for a brief period between the failure of the current Leader and the election of a new one. In abnormal circumstances, the Leader election might fail. If no server is acting as Leader, no new commands are proposed. If multiple servers think they are Leaders, they can all propose values in the same instance of the consistency algorithm, which could prevent any value from being chosen. However, safety is preserved: two different servers will never disagree on the value chosen as the i-th state machine command. Election of a single Leader is needed only to ensure progress.

If the set of servers can change, there must be some way of determining which servers implement which instances of the consistency algorithm. The easiest way to do this is through the state machine itself. The current set of servers can be made part of the state and can be changed by ordinary state machine commands. We can allow a Leader to get a commands ahead by letting the set of servers that execute instance i+a of the consistency algorithm be specified by the state after execution of the i-th state machine command. This permits a simple implementation of an arbitrarily sophisticated reconfiguration algorithm.

References

[1] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.
[2] Idit Keidar and Sergio Rajsbaum. On the cost of fault-tolerant consensus when there are no faults—a tutorial. Technical Report MIT-LCS-TR-821, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, May 2001. Also published in SIGACT News 32(2) (June 2001).
[3] Leslie Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2:95–114, 1978.
[4] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.
[5] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, May 1998.