For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.
Failure detectors were proposed by Chandra and Toueg (Unreliable failure detectors for reliable distributed systems, JACM 43(2):225–267, 1996) as a mechanism for solving consensus in an asynchronous message-passing system with crash failures by distinguishing between slow processes and dead processes. The basic idea is that each process has attached to it a failure detector module that continuously outputs an estimate of which processes in the system have failed. The output need not be correct; indeed, the main contribution of Chandra and Toueg's paper (and of a companion paper by Chandra, Hadzilacos, and Toueg: The weakest failure detector for solving consensus, PODC 1992, pp. 147–158) is characterizing just how bogus the output of a failure detector can be and still be useful.
We will mostly follow Chandra and Toueg in these notes; see the paper for the full technical details.
To emphasize that the output of a failure detector is merely a hint at the actual state of the world, a failure detector (or the process it's attached to) is said to suspect a process at time t if it outputs "failed" at that time. Failure detectors can then be classified based on when their suspicions are correct.
We use the usual AsynchronousMessagePassing model, and in particular assume that non-faulty processes execute infinitely often, get all their messages delivered, etc. From time to time we will need to talk about time, and unless we are clearly talking about real time this just means any steadily increasing count (e.g., of total events), and will be used only to describe the ordering of events.
1. How to build a failure detector
Failure detectors are only interesting if you can actually build them. In a fully asynchronous system, you can't (this follows from the FischerLynchPaterson result and the existence of failure-detector-based consensus protocols). But with timeouts, it's not hard: have each process ping each other process from time to time, and suspect the other process if it doesn't respond to the ping within twice the maximum round-trip time for any previous ping. Assuming that ping packets are never lost and there is an (unknown) upper bound on message delay, this gives what is known as an eventually perfect failure detector: once the max round-trip times rise enough and enough time has elapsed for the live processes to give up on the dead ones, all and only dead processes are suspected.
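As a concrete illustration, here is a minimal Python sketch of this timeout scheme for a single monitored peer. It is only a sketch under stated assumptions: send_ping and await_pong are hypothetical transport hooks (not from the paper), and a real system would run one such loop per peer.

    import time

    def monitor(peer, send_ping, await_pong):
        # Eventually-perfect detector for one peer, per the rule above:
        # suspect the peer if it misses a deadline of twice the largest
        # round-trip time seen so far. await_pong(peer, deadline) is a
        # hypothetical hook returning True iff a pong arrives in time.
        max_rtt = 1.0                # initial guess; any positive value works
        while True:
            start = time.monotonic()
            send_ping(peer)
            if await_pong(peer, deadline=start + 2 * max_rtt):
                max_rtt = max(max_rtt, time.monotonic() - start)
                yield False          # peer answered in time: trust it
            else:
                yield True           # missed the deadline: suspect it

Once 2·max_rtt exceeds the true (unknown) bound on round-trip delay, a live peer always answers in time, so eventually all and only the dead peers are suspected.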
2. Classification of failure detectors
Chandra and Toueg define eight classes of failure detectors, based on when they suspect faulty processes and non-faulty processes. Suspicion of faulty processes comes under the heading of completeness; of non-faulty processes, accuracy.
2.1. Degrees of completeness
- Strong completeness
- Every faulty process is eventually permanently suspected by every non-faulty process.
- Weak completeness
- Every faulty process is eventually permanently suspected by some non-faulty process.
There are two temporal logic operators embedded in these statements: "eventually permanently" means that there is some time t0 such that for all times t ≥ t0, the process is suspected. Note that completeness says nothing about suspecting non-faulty processes: a paranoid failure detector that permanently suspects everybody has strong completeness.
2.2. Degrees of accuracy
These describe what happens with non-faulty processes, and with faulty processes that haven't crashed yet.
- Strong accuracy
- No process is suspected (by anybody) before it crashes.
- Weak accuracy
- Some non-faulty process is never suspected.
- Eventual strong accuracy
- After some initial period of confusion, no process is suspected before it crashes. This can be simplified to say that no non-faulty process is suspected after some time, since we can take the end of the initial period of confusion to be the time at which the last crash occurs.
- Eventual weak accuracy
- After some initial period of confusion, some non-faulty process is never suspected.
Note that "strong" and "weak" mean different things for accuracy vs completeness: for accuracy, we are quantifying over suspects, and for completeness, we are quantifying over suspectors. Even a weakly-accurate failure detector guarantees that all processes trust the one visibly good process.
2.3. Failure detector classes
Two degrees of completeness times four degrees of accuracy gives eight classes of failure detectors, each of which gets its own name. Fortunately, the distinction between strong and weak completeness turns out to be spurious; a weakly-complete failure detector can simulate a strongly-complete one (but this requires a proof). We can use this as an excuse to consider only the strongly-complete classes:
- P (perfect)
- Strongly complete and strongly accurate: non-faulty processes are never suspected; faulty processes are eventually suspected by everybody. Easily achieved in synchronous systems.
- S (strong)
- Strongly complete and weakly accurate. The name is misleading if we've already forgotten about weak completeness, but the corresponding W (weak) class is only weakly complete and weakly accurate, so it's the strong completeness that the S is referring to.
- ⋄P (eventually perfect)
- Strongly complete and eventually strongly accurate.
- ⋄S (eventually strong)
- Strongly complete and eventually weakly accurate.
Jumping to the punch line: P can simulate any of the others, S and ⋄P can both simulate ⋄S but can't simulate P or each other, and ⋄S can't simulate any of the others. Thus ⋄S is the weakest class of failure detectors in this list. However, ⋄S is strong enough to solve consensus, and in fact any failure detector (whatever its properties) that can solve consensus is strong enough to simulate ⋄S (this is the result in the Chandra-Hadzilacos-Toueg paper)—this makes ⋄S the "weakest failure detector for solving consensus" as advertised. Continuing our tour through Chandra and Toueg, we'll show the simulation results and that ⋄S can solve consensus, but we'll skip the rather involved proof of ⋄S's special role from Chandra-Hadzilacos-Toueg.
3. Boosting completeness
Recall that the difference between weak completeness and strong completeness is that with weak completeness, somebody suspects a dead process, while with strong completeness, everybody suspects it. So to boost completeness we need to spread the suspicion around a bit. On the other hand, we don't want to break accuracy in the process, so there needs to be some way to undo a premature rumor of somebody's death. The simplest way to do this is to let the alleged corpse speak for itself: I will suspect you from the moment somebody else reports you dead until the moment you tell me otherwise.
Formally, this looks like:
    initially suspects = ∅

    do forever:
        for each process p:
            if my weak-detector suspects p:
                send p to all processes

    upon receiving p from some process q:
        suspects := suspects + p - q
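The same booster, as a hedged Python sketch of one process's event handlers; weak_detector and send_to_all are hypothetical stand-ins (my names, not the paper's) for the local weak detector and the network:

    def make_booster(weak_detector, send_to_all):
        # Boost weak completeness to strong completeness, following the
        # pseudocode above. weak_detector() returns the locally suspected
        # set; send_to_all(p) gossips a suspicion of p. Both are assumed hooks.
        suspects = set()

        def gossip_step():               # run this from time to time
            for p in weak_detector():
                send_to_all(p)

        def on_receive(p, q):            # process q says it suspects p
            suspects.add(p)              # believe the rumor about p ...
            suspects.discard(q)          # ... but q is evidently alive

        return gossip_step, on_receive, suspects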
It's not hard to see that this boosts completeness: if p crashes, somebody's weak-detector eventually suspects it, this process tells everybody else, and p never contradicts it. So eventually everybody suspects p.
What is slightly trickier is showing that it preserves accuracy. The essential idea is this: if there is some good-guy process p that everybody trusts forever (as in weak accuracy), then nobody ever reports p as suspect, so nobody ever suspects p—this also covers strong accuracy, since there the only difference is that every non-faulty process falls into this category. For eventual weak accuracy, wait for everybody to stop suspecting p, wait for every message ratting out p to be delivered, and then wait for p to send a message to everybody. Now everybody trusts p, and nobody ever suspects p again. Eventual strong accuracy is again similar.
This justifies our ignoring the weakly-complete classes.
4. Consensus with S
Here the failure detectors as applied to most processes are completely useless. However, there is some non-faulty process c that nobody ever suspects, and this is enough to solve consensus even with as many as n-1 failures.
Basic idea of the protocol: There are three phases. In the first phase, the processes gossip about input values for n-1 asynchronous rounds. In the second, they exchange all the values they've seen and prune out any that are not universally known. In the third, each process decides on the lowest-id input that hasn't been pruned (min input also works since at this point everybody has the same view of the inputs).
In more detail, in phase 1 each process p maintains two partial functions Vp and Δp, where Vp lists all the input values (q,vq) that p has ever seen and Δp lists only those it learned for the first time in the most recent round; there are n-1 asynchronous rounds in all. Vp and Δp are both initialized to {(p,vp)}. In round i, p sends (i,Δp) to all processes. It then collects (i,Δq) from each q that it doesn't suspect, sets Δp to ∪q(Δq) - Vp (where q ranges over the processes from which p received a message in round i), and sets Vp to Vp∪Δp. In the next round, it repeats the process. Note that each pair (q,vq) is sent by a particular process p only in the first round after p learns it: so any value that is still kicking around in round n-1 had to pass through n-1 distinct processes.
In phase 2, each process p sends (n,Vp), waits to receive (n,Vq) from every process it does not suspect, and sets Vp to the intersection of Vp and all received Vq. At the end of this phase all Vp values will in fact be equal, as we will show.
In phase 3, everybody picks some input value from their Vp set according to a consistent rule.
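To make the three phases concrete, here is a hedged Python simulation of the failure-free case, where the detector suspects nobody and every process hears from everybody; it is a sketch of the data flow only, collapsing the message passing into shared lists, not a faithful implementation:

    def s_consensus(inputs):
        # inputs[p] is process p's input; returns every process's decision.
        n = len(inputs)
        V = [{(p, v)} for p, v in enumerate(inputs)]   # V_p: all pairs seen
        D = [{(p, v)} for p, v in enumerate(inputs)]   # Delta_p: new last round

        # Phase 1: n-1 rounds of gossip, forwarding only newly learned pairs.
        for _ in range(n - 1):
            msgs = list(D)                             # everyone broadcasts Delta_p
            for p in range(n):
                received = set().union(*msgs)          # hears from all (no suspects)
                D[p] = received - V[p]                 # keep only the new pairs
                V[p] = V[p] | D[p]

        # Phase 2: exchange full views and keep only universally known pairs.
        views = list(V)
        for p in range(n):
            V[p] = set.intersection(*views)

        # Phase 3: a consistent rule: decide the input of the lowest-id process.
        return [min(V[p])[1] for p in range(n)]

    print(s_consensus([3, 1, 2]))                      # prints [3, 3, 3]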
4.1. Proof of correctness
Let c be a non-faulty process that nobody ever suspects.
The first observation is that the protocol satisfies validity: every Vp contains vc after round 1 (everybody waits for c's round-1 message, since nobody suspects c), and by examination of the protocol each Vp can only ever contain input values. Whatever it may do to the other values, taking intersections in phase 2 still leaves vc, so in phase 3 all processes pick some input value from a nonempty set of input values.
To get termination we have to prove that nobody ever waits forever for a message it wants; this basically comes down to showing that the first non-faulty process that gets stuck eventually is informed by the S-detector that the process it is waiting for is dead.
For agreement, we must show that in phase 3 all the Vp are equal; in particular, we'll show that every Vp = Vc. First it is necessary to show that at the end of phase 1, Vc ⊆ Vp for all p. This is done by considering two cases:
If (q,vq) ∈ Vc and c learns (q,vq) before round n-1, then c sends (q,vq) to p no later than round n-1, p waits for it (since nobody ever suspects c), and adds it to Vp.
If (q,vq) ∈ Vc and c learns (q,vq) only in round n-1, then (q,vq) was previously relayed through n-1 distinct processes, i.e., through every process other than c. Each process p ≠ c thus added (q,vq) to Vp before sending it, and again (q,vq) is in Vp.
(The missing case where (q,vq) isn't in Vc we don't care about.)
But now phase 2 knocks out any extra elements in Vp, since Vp gets set to Vp∩Vc∩(some other Vq's that are supersets of Vc). It follows that at the end of phase 2 Vp = Vc for all p. Finally in phase 3 everybody applies the same selection rule to these identical sets and we get agreement.
5. Consensus with ⋄S and f < n/2
The consensus protocol for S depends on some process c never being suspected; if c is suspected during the entire (finite) execution of the protocol—as can happen with ⋄S—then it is possible that no process will wait to hear from c (or anybody else) and the processes will all decide their own inputs. So to solve consensus with ⋄S we will need to assume less than n/2 failures, allowing any process to wait to hear from a majority no matter what lies its failure detector is telling it.
The resulting protocol, known as the Chandra-Toueg consensus protocol, is structurally similar to the consensus protocol in Paxos. The difference is that instead of initiators blindly showing up, the protocol is divided into rounds, with a rotating coordinator pi for each round r, where i = r mod n. The termination proof is based on showing that in any round where the coordinator is not faulty and nobody suspects it, the protocol finishes.
Here's the essence of the protocol. It uses as a subroutine a protocol for ReliableBroadcast, which guarantees that any message that is sent is either received by no processes or exactly once by all non-faulty processes.
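Such a broadcast can be built with a standard echo trick: retransmit a message to everybody before delivering it the first time you see it. Here is a hedged Python sketch of that folklore construction for crash failures (send_to_all and deliver are hypothetical hooks; this is not the paper's own formulation):

    def make_reliable_broadcast(send_to_all, deliver):
        # Echo-style reliable broadcast under crash failures: a process only
        # delivers m after relaying it, so if any non-faulty process delivers
        # m, every non-faulty process eventually receives and delivers it.
        seen = set()

        def on_receive(m):
            if m not in seen:            # deliver each message at most once
                seen.add(m)
                send_to_all(m)           # relay before delivering
                deliver(m)

        return on_receive                # call locally to initiate a broadcast

With that subroutine in hand, the rounds run as follows: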
- Each process keeps track of a preference (initially its own input) and a timestamp, the round number in which it last updated its preference.
- The processes go through a sequence of asynchronous rounds, each divided into four phases:
- All processes send (round, preference, timestamp) to the coordinator for the round.
- The coordinator waits to hear from a majority of the processes (possibly including itself). It sets its own preference to some received preference with the largest timestamp and sends (round, preference) to all processes.
- Each process waits for the new proposal from the coordinator or for its failure detector to suspect the coordinator. If it receives a new preference, it adopts it as its own, sets timestamp ← round, and sends (round, ack) to the coordinator. Otherwise, it sends (round, nack) to the coordinator.
- The coordinator waits to receive ack or nack from a majority of processes. If it receives ack from a majority, it announces the current preference as the protocol decision value using ReliableBroadcast.
- Any process that receives a value in a ReliableBroadcast decides on it immediately.
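A hedged Python sketch of a single round in the failure-free case (nobody is suspected, so the round's coordinator succeeds; message passing is collapsed into local lists, and all names are illustrative rather than taken from the paper):

    def chandra_toueg_round(prefs, stamps, rnd):
        # prefs[p]/stamps[p]: process p's preference and timestamp.
        n = len(prefs)
        coord = rnd % n                    # rotating coordinator
        majority = n // 2 + 1

        # Phase 1: everybody sends (round, preference, timestamp) to coord.
        inbox = [(prefs[p], stamps[p]) for p in range(n)]

        # Phase 2: coordinator adopts a preference with the largest timestamp
        # among the first majority it hears from, and proposes it.
        heard = inbox[:majority]
        proposal = max(heard, key=lambda m: m[1])[0]

        # Phase 3: nobody suspects the coordinator here, so everybody adopts
        # the proposal, sets its timestamp to the round, and acks.
        acks = 0
        for p in range(n):
            prefs[p], stamps[p] = proposal, rnd
            acks += 1

        # Phase 4: a majority of acks triggers the ReliableBroadcast of the
        # decision; here that just means returning the decided value.
        return proposal if acks >= majority else None

    prefs, stamps = [2, 0, 1], [0, 0, 0]
    print(chandra_toueg_round(prefs, stamps, 0))   # decides 2 in round 0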
5.1. Proof of correctness
For validity, observe that the decision value is some process's preference, and every preference starts out as an input and is only ever replaced by another process's preference, so all preferences are inputs.
For termination, observe that no process gets stuck in phase 1, 2, or 4, because either it isn't waiting, or it is waiting for messages from a majority of the processes, and the more-than-n/2 non-faulty processes all send the needed messages unless they have already decided (this is why we need the nacks in phase 3). The loophole here is that processes that decide stop participating in the protocol; but because any non-faulty process retransmits the decision value in the ReliableBroadcast, if a process is waiting for a response from a non-faulty process that has already terminated, it eventually gets the ReliableBroadcast instead and terminates itself. In phase 3, a process might get stuck waiting for a dead coordinator, but the strong completeness of ⋄S means that it suspects the dead coordinator eventually and escapes. So at worst we do infinitely many rounds.
Now suppose that after some time t there is a process c that is never suspected by any process. Then in the next round in which c is the coordinator, in phase 3 all surviving processes wait for c's proposal and respond with ack, c announces the current preference as the decision, and triggers the ReliableBroadcast protocol to ensure everybody else decides on the same value. Since ReliableBroadcast guarantees that everybody receives the message, everybody decides this value or some value previously broadcast—but in either case everybody decides.
Agreement is the tricky part. It's possible that two coordinators both initiate a ReliableBroadcast and some processes choose the value from the first and some the value from the second. But in this case the first coordinator collected acks from a majority of processes in some round r, and every subsequent coordinator collects preferences from a majority that overlaps it in each round r' > r. By applying the same induction argument as for Paxos, we get that all subsequent coordinators choose the same preference as the first coordinator, and so we get agreement.
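The counting fact behind that induction is that any two majorities among the n processes overlap: |Q ∩ Q'| = |Q| + |Q'| − |Q ∪ Q'| ≥ 2(⌊n/2⌋ + 1) − n ≥ 1. So at least one process that acked the first coordinator's preference (and therefore carries it with timestamp ≥ r) is heard by every later coordinator, and the largest-timestamp rule then forces each later proposal to be that same preference.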
6. f < n/2 is still required even with ⋄P
We can show that with a majority of failures, we're in trouble with just ⋄P (and thus with ⋄S, which is trivially simulated by ⋄P). The reason is that ⋄P can lie to us for some long initial interval of the protocol, and consensus is required to terminate eventually despite these lies. So the usual partition argument works: start half of the processes with input 0, half with 1, and run both halves independently with ⋄P suspecting the other half until the processes in both halves decide on their common inputs. We can now make ⋄P happy by letting it stop suspecting the processes, but it's too late.
7. Relationships among the classes
It's easy to see that P simulates S and ⋄P simulates ⋄S without modification. It's also immediate that P simulates ⋄P and S simulates ⋄S (make "eventually" be "now"), which gives a diamond-shaped lattice structure between the classes. What is trickier is to show that this structure doesn't collapse: there is no simulation from ⋄P to S, S to ⋄P, or from ⋄S to any of the other classes.
First let's observe that there is no simulation of S by ⋄P: if there were, we could run the protocol of section 4 on the simulated detector and get a ⋄P-based consensus protocol tolerating f ≥ n/2 failures, which section 6 shows is impossible. It follows that ⋄P can't simulate P either (since P can simulate S).
To show that S can't simulate ⋄P, choose some non-faulty victim process v and consider an execution in which S periodically suspects v (which it is allowed to do as long as there is some other non-faulty process it never suspects). If the ⋄P-simulator ever responds to this by refusing to suspect v, there is an execution in which v really is dead, and the simulator violates strong completeness. But if not, we violate eventual strong accuracy. Note that this also implies S can't simulate P, since P can simulate ⋄P. It also shows that ⋄S can't simulate either of ⋄P or P.
We are left with showing ⋄S can't simulate S. Consider a system where p's ⋄S detector suspects q but not r from the start of the execution, and similarly r's ⋄S detector also suspects q but not p. Run p and r in isolation until they give up and decide that q is in fact dead (which they must do eventually by strong completeness, since this run is indistinguishable from one in which q is faulty). Then wake up q and crash p and r. Since q is the only non-faulty process, we've violated weak accuracy.
Chandra and Toueg give, as an example of a natural problem that can be solved only with P, the problem of Terminating Reliable Broadcast, in which a single leader process attempts to send a message, and all other processes eventually agree on the message if the leader is non-faulty but must terminate after finite time with a default "no message" return value if the leader is faulty.[1] The problem is solvable using P by having each process wait either for the message or for P to suspect the leader, which can only occur if the leader does in fact crash. If the leader is dead, the processes must eventually decide on no message; this separates P from ⋄S and ⋄P, since their detectors can falsely suspect a slow leader until everybody returns "no message", after which we can wake the leader up and let it send its message. But it also separates P from S, since we can have the S-detector be accurate only for non-leaders. For other similar problems see the paper.
[1] This is a slight weakening of the problem, which however still separates P from the other classes. For the real problem see Chandra and Toueg.

CategoryDistributedComputingNotes