1. Markov chains
A Markov chain or Markov process is a stochastic process where the distribution of Xt+1 depends only on the value of Xt and not on any previous history. (Formally, for all sets of states A, E[ [Xt+1∈A] | X0...Xt ] = E[ [Xt+1∈A] | Xt ]; contrast with the martingale property.) The state space of the chain is just the set of all values that each Xt can have. A Markov chain is finite or countable if it has a finite or countable state space, respectively. We'll mostly be interested in countable Markov chains. We'll also assume that our Markov chains are homogeneous, which means that Pr[Xt+1 = j | Xt = i] doesn't depend on t.
For a countable Markov chain, we can describe its behavior completely by giving the state space and the one-step transition probabilities pij = Pr[Xt+1 = j | Xt = i]. Given pij, we can calculate two-step transition probabilities p(2)ij = Pr[Xt+2 = j | Xt = i] = ∑k Pr[Xt+2 = j | Xt+1 = k] Pr[Xt+1 = k | Xt = i] = ∑k pik pkj. This is identical to the formula for matrix multiplication, and so we can think of the transition probabilities as given by a transition matrix P, then p(2)ij = (P2)ij and in general the n-step transition probability p(n)ij = (Pn)ij.
Conversely, given any matrix with non-negative entries where the rows sum to 1 (∑j Pij = 1, or P1 = 1), there is a corresponding Markov chain given by pij = Pij. Such a matrix is called a stochastic matrix.
The general formula for (n+m)-step transition probabilities is that p(n+m)ij = ∑k p(n)ikp(m)kj. This is known as the Chapman-Kolmogorov equation and is equivalent to the matrix identity Pn+m = PnPm.
A distribution over states of the Markov chain at some time t can be given by a row vector x, where xi = Pr[Xt = i]. To compute the distribution at time t+1, we use the law of total probability: Pr[Xt+1 = j] = ∑i Pr[Xt = i] Pr[Xt+1 = j | Xt = i] = ∑i xi pij. Again we have the formula for matrix multiplication (where we treat x as a 1×n matrix, i.e. a row vector); so the distribution vector at time t+1 is just xP, and at time t+n is xPn.
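Since all of the above is just matrix algebra, it's easy to check numerically. Here's a minimal sketch (assuming NumPy; the 3-state chain is made up for illustration):

```python
import numpy as np

# Transition matrix for a small hypothetical 3-state chain.
# Each row is non-negative and sums to 1, so P is stochastic.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])

# n-step transition probabilities: p(n)_ij = (P^n)_ij.
n = 10
Pn = np.linalg.matrix_power(P, n)

# Distribution at time t+n is x P^n for a row vector x.
x = np.array([1.0, 0.0, 0.0])   # start in state 0
print(x @ Pn)                    # distribution after n steps
```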
We like Markov chains for two reasons:
- They describe what happens in a randomized algorithm; the state space is just the set of all states of the algorithm, and the Markov property holds because the algorithm can't remember anything that isn't part of its state. So if we want to analyze randomized algorithms, we will need to get good at analyzing Markov chains.
- They can be used to do sampling over interesting distributions. Under appropriate conditions (see below), the state of a Markov chain converges to a stationary distribution. If we build the right Markov chain, we can control what this stationary distribution looks like, run the chain for a while, and get a sample close to the stationary distribution.
In both cases we want to have a bound on how long it takes the Markov chain to converge, either because it tells us when our algorithm terminates, or because it tells us how long to mix it up before looking at the current state.
1.1. Examples
- A fair ±1 random walk. The state space is ℤ, the transition probabilities are pij = 1/2 if |i-j| = 1, 0 otherwise. This is an example of a Markov chain that is also a martingale.
- A fair ±1 random walk on a cycle. As above, but now the state space is ℤ/m, the integers mod m. An example of a finite Markov chain.
- Random walks with absorbing/reflecting barriers.
- Random walk on a graph G = (V,E). The state space is V, the transition probabilities are puv = 1/d(u) if uv ∈ E. (One can also have more general transition probabilities.)
- The Markov chain given by Xt+1 = Xt + 1 with probability 1/2, and 0 with probability 1/2. The state space is ℕ.
- 2-SAT algorithm. State is a truth-assignment. The transition probabilities are messy but arise from the following process: pick an unsatisfied clause, pick one of its two variables uniformly at random, and invert it. Then there is an absorbing state at any satisfying assignment. (A similar process works for 2-colorability, 3-SAT, 3-colorability, etc.; note that for the NP-hard problems, it may take a while to reach an absorbing state.)
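Here's a sketch of the 2-SAT walk from the last example (the clause representation and the step cap are our own choices for illustration):

```python
import random

def two_sat_walk(n, clauses, max_steps=None):
    """Random walk for 2-SAT: pick an unsatisfied clause, pick one of
    its two variables uniformly at random, and invert it. A clause is
    a pair of literals; literal +i / -i means variable i is true/false."""
    if max_steps is None:
        max_steps = 100 * n * n        # absorption takes O(n^2) steps on average
    x = [random.random() < 0.5 for _ in range(n + 1)]   # x[1..n]; x[0] unused
    sat = lambda lit: x[abs(lit)] == (lit > 0)
    for _ in range(max_steps):
        unsat = [c for c in clauses if not any(sat(l) for l in c)]
        if not unsat:
            return x[1:]               # absorbing state: satisfying assignment
        lit = random.choice(random.choice(unsat))   # clause, then variable
        x[abs(lit)] = not x[abs(lit)]  # invert it
    return None

# (x1 or x2) and (not x1 or x2): satisfied by any assignment with x2 true.
print(two_sat_walk(2, [(1, 2), (-1, 2)]))
```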
1.2. Classification of states
Given a Markov chain, define fij(n) = Pr[j ∉ {X1...Xn-1} ∧ Xn = j | X0 = i]; fij(n) is the probability that the first passage time from i to j equals n. Define fij = ∑n∈ℤ⁺ fij(n); this is the probability that the chain ever reaches j starting from i. A state is persistent if fii = 1 and transient if fii < 1. Finally, define the mean recurrence time μi = ∑n∈ℤ⁺ n fii(n) if i is persistent and μi = ∞ if i is transient. We use the mean recurrence time to further classify states:
| fii | μi  | classification      |
|-----|-----|---------------------|
| < 1 | ∞   | transient           |
| 1   | ∞   | null persistent     |
| 1   | < ∞ | non-null persistent |
Examples: In a random walk with absorbing barriers at ±n, states -n and +n are (non-null) persistent, and the rest of the states are transient. In general, any state in a finite chain is either non-null persistent or transient, and there is at least one persistent state. In a random walk on ℤ, all states are null persistent. In the process on ℕ where Xt+1 = Xt + 1 with probability 1, all states are transient.
The period d(i) of a state i is gcd({ n : pii(n) > 0 }). A state i is aperiodic if d(i) = 1, and periodic otherwise. A Markov chain is aperiodic if all its states are aperiodic.
The most well-behaved states are the aperiodic non-null persistent states; these are called ergodic. A Markov chain is ergodic if all its states are.
1.3. Reachability
State i communicates with state j if pij(n) > 0 for some n; i.e., it is possible to reach j from i. This is often abbreviated as i→j. Two states i and j intercommunicate if i→j and j→i; this is similarly abbreviated as i↔j. If i↔j, it's not hard to show that i and j have the same period and classification. A set of states S is closed if pij = 0 for all i∈S, j∉S, and irreducible if i↔j for all i,j in S.
Using graph-theoretic terminology, i→j if j is reachable from i (through edges corresponding to transitions with nonzero probabilities), and a set of states is irreducible if it is strongly connected. We can thus decompose a Markov chain into strongly-connected components, and observe that all states in a particular strongly-connected component are persistent if and only if the strongly-connected component has no outgoing edges (i.e. is closed). For finite chains there must be at least one closed strongly-connected component (corresponding to a sink in the quotient DAG); this gives an instant proof that finite chains have persistent states.
If the entire chain forms a single strongly-connected component, we say it is irreducible.
2. Stationary distributions
The key useful fact about irreducible Markov chains is that (at least in the finite case) they have stationary distributions, where a stationary distribution is a distribution π (non-negative row vector summing to 1) such that πP = π. The full theorem is:
- Theorem
An irreducible Markov chain has a stationary distribution π if and only if all states are non-null persistent, and if π exists, then πi = 1/μi for all i.
The proof of this theorem takes a while, so we won't do it (see GrimmettStirzaker if you want to see the details). For finite chains, the first part follows from the Perron-Frobenius theorem, which says that stochastic matrices always have an eigenvector corresponding to an eigenvalue of 1. The second part (πi = 1/μi) is just a special case of the renewal theorem.
For non-irreducible Markov chains, there is a stationary distribution on each closed irreducible subset, and the stationary distributions for the chain as a whole are all convex combinations of these stationary distributions.
Examples: In the random walk on ℤm the stationary distribution satisfies πi = 1/m for all i (immediate from symmetry). By contrast, the random walk on ℤ has no stationary distribution (the states are all null persistent). The process on ℕ where pij = 1/2 for j = i+1 or j = 0 has a stationary distribution π given by πi = 2^(-i-1); this is an example of an infinite chain that nonetheless has only non-null persistent states. For a random walk with absorbing barriers at ±n, there is no unique stationary distribution; instead, any vector π where πi = 0 unless i = ±n and ∑i πi = 1 is stationary.
Note that it's generally not a good strategy to compute π by computing μ first; instead, solve the matrix equation πP = π by rewriting it as π(P-I) = 0, together with the normalization constraint ∑i πi = 1, and then compute μi = 1/πi.
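Concretely, a sketch of this computation (assuming NumPy; the chain is an arbitrary example):

```python
import numpy as np

def stationary(P):
    """Solve pi (P - I) = 0 together with sum(pi) = 1.
    Transposing gives (P - I)^T pi^T = 0; we overwrite one
    redundant equation with the normalization constraint."""
    n = P.shape[0]
    A = (P - np.eye(n)).T
    A[-1, :] = 1.0               # replace last equation by sum(pi) = 1
    b = np.zeros(n); b[-1] = 1.0
    return np.linalg.solve(A, b)

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])
pi = stationary(P)
print(pi, pi @ P)                # pi and pi P should agree
print(1 / pi)                    # mean recurrence times mu_i = 1/pi_i
```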
2.1. The ergodic theorem
The basic version of the ergodic theorem says that if an aperiodic Markov chain has a stationary distribution π (i.e., it's ergodic), then it converges to π if we run it long enough. This is not terribly hard to prove, using a very clever technique called coupling, where we take two copies of the chain, one of which starts in an arbitrary distribution, and one of which starts in the stationary distribution, and show that we can force them to converge to each other by carefully correlating their transitions. Since coupling is a generally useful technique for bounding rate of convergence, we'll give the proof of the ergodic theorem below.
A more general version of the ergodic theorem says that for any Markov chain and any aperiodic state j, pjj(n) → 1/μj and pij(n) → fij/μj. This allows us in principle to compute the limit distribution for any aperiodic Markov chain with a given starting distribution x by summing ∑i xifij/μj. It also implies that transient and null persistent states vanish in the limit.
2.1.1. Proof
We'll do this by constructing a coupling between two copies of the Markov chain. In the first process {Xt}, we start with an arbitrary initial distribution for X0. In the second {Yt}, we start with Pr[Yt = i] = πi for all i. We now evolve the joint process {(Xt,Yt)} by the rule (i,i')→(j,j') with probability pijpi'j' if i≠i' and (i,i')→(j,j') with probability pij if i=i' and j=j' and 0 if i=i' and j≠j'. In other words, we let Xt and Yt both wander around independently until they collide, after which they stick together and wander around together.
The reason this shows convergence is that we can write
Pr[Xt=j] = Pr[Xt=Yt] Pr[Yt=j|Xt=Yt] + Pr[Xt≠Yt] Pr[Xt=j|Xt≠Yt]
and similarly
Pr[Yt=j] = Pr[Xt=Yt] Pr[Yt=j|Xt=Yt] + Pr[Xt≠Yt] Pr[Yt=j|Xt≠Yt].
Suppose now that Pr[Xt=Yt] → 1. Then the first equation gives Pr[Xt=j] - Pr[Yt=j|Xt=Yt] → 0 and the second gives Pr[Yt=j] - Pr[Yt=j|Xt=Yt] → 0. Since both Pr[Xt=j] and Pr[Yt=j] converge to the same value, they converge to each other, giving Pr[Xt=j] → Pr[Yt=j] = πj.
So now we just need to show Pr[Xt=Yt] → 1. Without loss of generality, suppose that X0 = j, where j is a non-null persistent aperiodic state. With probability πj, X0=Y0 and we are done. If not, let i = Y0 and pick some future time n for which p(n)ij > 0. If we are very lucky, p(n)jj is also > 0, and so we get a nonzero probability of collision. If we are not so lucky, let m be some value such that p(n+m)jj > 0 and p(m)jj > 0 (if no such value exists, then j is periodic). Then we have Pr[Xn+m=Yn+m=j] ≥ Pr[Xn+m=j] Pr[Yn+m=j] ≥ p(n+m)jj p(n)ij p(m)jj > 0. So there is a finite time n+m after which we get some nonzero probability that Xt and Yt collide. If we do not collide, repeat the argument starting with i = Yn+m. If we let k be the maximum value of n+m taken over all initial states i, and p > 0 be the minimum probability of collision over all initial states i after k steps, then we have that Pr[Xαk≠Yαk] ≤ (1-p)^α → 0.
3. Bounding convergence rates
In order to use the stationary distribution of a Markov chain to do sampling, we need to have a bound on the rate of convergence to tell us when it is safe to take a sample. There are two standard techniques for doing this: coupling, where we show that a copy of the process starting in an arbitrary state can be made to converge to a copy starting in the stationary distribution; and spectral methods, where we bound the rate of convergence by looking at the second-largest eigenvalue of the transition matrix. We'll start with coupling because it requires less development.
(See also this survey of the relationship between the various methods.)
3.1. Coupling
Note: these notes will be somewhat sketchy. If you want to read more about coupling, a good place to start might be Chapter 11 of MitzenmacherUpfal, this chapter of the unpublished but nonetheless famous Aldous-Fill manuscript (which is a good place to learn about Markov chains and Markov chain Monte Carlo methods in general)1, or even an entire book: Lindvall, Torgny, Lectures on the Coupling Method, Wiley, 1992. We'll mostly be using examples from the Aldous-Fill text.
3.1.1. The basic coupling lemma
The same sort of coupling trick that proves eventual convergence to π can in some cases be used to prove fast convergence. The idea is that we can bound the total variation distance dTV(p, π) = maxA |Prp[A]-Prπ[A]| = ∑i max(pi-πi, 0) = (1/2) ∑i |pi-πi| by bounding Pr[Xt≠Yt], where Xt is the copy of the process that starts in some arbitrary distribution and Yt is the copy that starts in π. Recall that, for all j,
Pr[Xt=j] = Pr[Xt=Yt] Pr[Yt=j|Xt=Yt] + Pr[Xt≠Yt] Pr[Xt=j|Xt≠Yt]
and
Pr[Yt=j] = Pr[Xt=Yt] Pr[Yt=j|Xt=Yt] + Pr[Xt≠Yt] Pr[Yt=j|Xt≠Yt].
Subtract to get
Pr[Xt=j] - Pr[Yt=j] = Pr[Xt≠Yt] (Pr[Xt=j|Xt≠Yt] - Pr[Yt=j|Xt≠Yt]).
Now observe that the parenthesized term is somewhere in the range -1 to 1, giving
|Pr[Xt=j] - Pr[Yt=j]| ≤ Pr[Xt≠Yt].
But in fact we can do much better than this. If we sum the second-to-last equation over all j, we get
∑j |Pr[Xt=j] - Pr[Yt=j]|
= Pr[Xt≠Yt] ∑j |Pr[Xt=j|Xt≠Yt]-Pr[Yt=j|Xt≠Yt]|
≤ Pr[Xt≠Yt] ∑j (Pr[Xt=j|Xt≠Yt]+Pr[Yt=j|Xt≠Yt])
≤ 2 Pr[Xt≠Yt].
So the total variation distance between the distributions of Xt and Yt is bounded by half of this, or Pr[Xt≠Yt]. Armed with this fact, we can now go hunting for couplings that make Xt=Yt with high probability for large enough t.
3.1.2. Random walk on a cycle
Let's suppose we do a random walk on ℤm, where to avoid periodicity at each step we stay put with probability 1/2, move counterclockwise with probability 1/4, and move clockwise with probability 1/4. To make our lives slightly easier, we'll assume m is even. What's a good choice for a coupling to show this process converges quickly?
We'll assume we choose X's move first and then try to figure out Y's move in response. If Xt=Yt, we'll just have Y move with X (the usual outcome). If Xt-Yt is even, we'll have Yt+1-Yt = -(Xt+1-Xt); X and Y move in opposite directions. If Xt-Yt is odd, we'll have Y move if and only if X doesn't; this makes the distance between them (in both directions, because m is even) even after one step, and the other transitions will keep the distance between them even. In each case the movement of Y follows the distribution given by the underlying Markov chain; we just happen to correlate it in a somewhat convoluted way with the movement of X.
Now let us ask how long it takes on average until Xt and Yt collide. We have at most one step to make the distance between them even. After that point, the distance increases or decreases by exactly 2 with probability 1/4 each, and stays the same with probability 1/2. If we consider only the steps where the distance changes, we have a random ±2 walk with absorbing barriers at 0 and m, which is equivalent to a random ±1 walk with absorbing barriers at 0 and m/2; the worst-case starting position for this second walk is exactly in the middle, giving an expected time to reach a barrier of (m/4)² = m²/16. But we have to multiply this by 2 to account for the fact that we only take a step every other time unit on average, and add one for the possible initial parity-fixing step, which gives a bound of m²/8 + 1 on the expected time to collide. Using Markov's inequality, after m²/4 + 2 steps we have Pr[Xt≠Yt] ≤ 1/2, and by iterating this argument, after α(m²/4 + 2) steps we will have Pr[Xt≠Yt] ≤ 2^-α. This gives a mixing time to reach dTV ≤ ε of at most (m²/4 + 2) lg(1/ε), which means that the process is rapid mixing: the mixing time is polynomial in the size of the underlying process and log(1/ε).
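The m²/8 + 1 bound is easy to sanity-check by simulation; here's a sketch that implements exactly the coupling rules above (m and the number of trials are arbitrary):

```python
import random

def coupled_collision_time(m, x0, y0):
    """Simulate the coupled lazy +-1 walks on Z_m described above
    and return the time until they collide (m assumed even)."""
    x, y = x0, y0
    t = 0
    while x != y:
        dx = random.choice([0, 0, 1, -1])       # stay 1/2, +-1 w.p. 1/4 each
        if (x - y) % 2 == 0:
            dy = -dx                             # even distance: move oppositely
        else:
            # odd distance: Y moves exactly when X doesn't
            dy = random.choice([1, -1]) if dx == 0 else 0
        x, y = (x + dx) % m, (y + dy) % m
        t += 1
    return t

m = 64
times = [coupled_collision_time(m, 0, m // 2) for _ in range(1000)]
print(sum(times) / len(times), m * m / 8 + 1)    # empirical mean vs. bound
```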
It's worth noting that this example was very carefully rigged to make the coupling argument clean. It still works (perhaps with a slight change in the bound) if m is not even or Pr[Xt+1=Xt] ≠ Pr[Xt+1≠Xt], but the details are messier.
3.1.3. Random walk on a hypercube
Start with a bit-vector of length n. At each step, choose an index uniformly at random, and set the value of the bit-vector at that index to 0 or 1 with equal probability. How long until we get a nearly-uniform distribution over all 2n possible bit-vectors?
Here we apply the same transformation to both the X and Y vectors. It's easy to see that the two vectors will be equal once every index has been selected once. The waiting time for this to occur is just the waiting time nHn for the coupon collector problem. We can either use this expected time directly to show that the process mixes in time O(n log n log (1/ε)) as above, or we can use known sharp concentration bounds on coupon collector (see for example MotwaniRaghavan §3.6.3, which shows limn→∞ Pr[T ≥ n(ln n + c)] = 1-exp(-exp(-c))) to show that in the limit n ln n + n ln(1/ln(1/(1-ε))) = n ln n + O(n log(1/ε)) is enough.2
We can improve the bound slightly by observing that, on average, half the bits in X0 and Y0 are already equal; doing this right involves summing over a lot of cases, so we won't do it.
3.1.4. Various shuffling algorithms
Here we have a deck of n cards, and we repeatedly apply some random transformation to the deck to converge to a stationary distribution that is uniform over all permutations of the cards (usually this is obvious by symmetry, so we won't bother proving it). Our goal is to show that the expected coupling time at which our deck ends up in the same permutation as an initially-stationary deck is small. We do this by counting how many cards k are in the same position in both decks, and showing that, for a suitable coupling, (a) this quantity never decreases, and (b) it increases with some nonzero probability at each step. The expected coupling time is then ∑k 1/Pr[k+1|k].
3.1.4.1. Move-to-top
This is a variant of card shuffling that is interesting mostly because it gives about the easiest possible coupling argument. At each step, we choose one of the cards uniformly at random (including the top card) and move it to the top of the deck. How long until the deck is fully shuffled, i.e. until the total variation distance between the actual distribution and the stationary distribution is bounded by ε?
Here the trick is that when we choose a card to move to the top in the X process, we choose the same card in the Y process. It's not hard to see that this links the two cards together so that they are always in the same position in the deck in all future states. So to keep track of how well the coupling is working, we just keep track of how many cards are linked in this way, and observe that as soon as n-1 are, the two decks are identical.
Note: Unlike some of the examples below, we don't consider two cards to be linked just because they are in the same position. We are only considering cards that have gone through the top position in the deck (which corresponds to some initial segment of the deck, viewed from above). The reason is that these cards never become unlinked: if we pick two cards from the initial segment, the cards above them move down together. But deeper cards that happen to match might become separated if we pull a card from one deck that is above the matched pair while its counterpart in the other deck is below the matched pair.
Having carefully processed the above note, given k linked cards the probability that the next step links another pair of cards is exactly (n-k)/n. So the expected time until we get k+1 cards is n/(n-k), and if we sum these waiting times for k=0..n-1, we get nHn, the waiting time for the coupon collector problem. So the bound on the mixing time is the same as for the random walk on a hypercube.
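Here's a quick simulation sketch of this coupling (our own illustration; the empirical average should come out at most nHn, since some cards may already match by luck):

```python
import random

def move_to_top_coupling_time(n):
    """Couple two decks by moving the same card to the top in both;
    count steps until the decks agree (they stay equal afterwards)."""
    X = list(range(n))
    Y = random.sample(range(n), n)     # an arbitrary second deck
    t = 0
    while X != Y:
        c = random.randrange(n)        # choose a card (not a position)
        X.remove(c); X.insert(0, c)
        Y.remove(c); Y.insert(0, c)
        t += 1
    return t

n = 52
samples = [move_to_top_coupling_time(n) for _ in range(200)]
Hn = sum(1 / k for k in range(1, n + 1))
print(sum(samples) / len(samples), n * Hn)   # empirical vs. n H_n
```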
3.1.4.2. Random exchange of arbitrary cards
Here we pick two cards uniformly and independently at random and swap them. (Note there is a 1/n chance they are the same card; if we exclude this case, the Markov chain has period 2.) To get a coupling, we reformulate this process as picking a random card and a random location, and swapping the chosen card with whatever is in the chosen location in both the X and Y processes.
First let's observe that the number of linked cards never decreases. Let xi, yi be the position of card i in each process, and suppose xi=yi. If neither card i nor position xi is picked, i doesn't move, and so it stays linked. If card i is picked, then both copies are moved to the same location; it stays linked. If position xi is picked, then it may be that i becomes unlinked; but this only happens if the card j that is picked has xj≠yj. In this case j becomes linked, and the number of linked cards doesn't drop.
Now we need to know how likely it is that we go from k to k+1 linked cards. We've already seen a case where the number of linked cards increases; we pick a card that isn't linked and a location that contains a card that isn't linked. The probability of doing this is ((n-k)/n)², so our total expected waiting time is n² ∑k (n-k)^-2 = n² ∑k k^-2 ≤ n²π²/6 (see Basel problem for an extensive discussion of the useful but non-obvious fact that ∑k∈ℕ⁺ k^-2 = ζ(2) = π²/6). The final bound is O(n² log(1/ε)).
This bound is much worse than the bound for move-to-top, which is surprising. In fact, the real bound is O(n log n) with high probability, although the proof uses very different methods (see Chapter 7 of Aldous-Fill). This shows that the coupling method doesn't always give tight bounds (perhaps we need a better coupling?).
3.1.4.3. Random exchange of adjacent cards
Suppose now that we only swap adjacent cards. Specifically, we choose one of the n positions i in the deck uniformly at random, and then swap the cards at positions i and i+1 (mod n) with probability 1/2. (The 1/2 is there for the usual reason of avoiding periodicity.)
So now we want a coupling between the X and Y processes where each possible swap occurs with probability 1/(2n) on both sides, but somehow we correlate things so that like cards are pushed together but never pulled apart. The trick is that we will use the same position i on both sides, but be sneaky about when we swap.
The coupled process works like this. Let D be the set of indices i where the same card appears in both decks at position i or at position i+1. Then we do:
- For i∈D, swap (i,i+1) in both decks with probability 1/2n.
- For i∉D, swap (i,i+1) in the X deck only with probability 1/2n.
- For i∉D, swap (i,i+1) in the Y deck only with probability 1/2n.
- Do nothing with probability |D|/2n.
It's worth checking that the total probability of all these events is |D|/2n + 2(n-|D|)/2n + |D|/2n = 1. More important is that if we consider only one of the decks, the probability of doing a swap at (i,i+1) is exactly 1/2n (since we catch either case 1 or 2 for the X deck or 1 or 3 for the Y deck).
Now suppose that some card c is at position x in X and y in Y. If x=y, then both x and x-1 are in D, so the only way the card can move is if it moves in both decks: linked cards stay linked. If x≠y, then c moves in deck X or deck Y, but not both. (The only way it can move in both is in case 1, where i=x and i+1=y or vice versa; but in this case i can't be in D since the copy of c at position x doesn't match whatever is in deck Y, and the copy at position y doesn't match what's in deck X.) In this case the distance x-y goes up or down by 1 with equal probability 1/2n. Considering x-y (mod n), we have a "lazy" random walk that moves with probability 1/n, with absorbing barriers at 0 and n. The worst-case expected time to converge is n(n/2)² = n³/4, giving Pr[time for c to become linked ≥ αn³/8] ≤ 2^-α using the usual argument. Now apply the union bound to get Pr[time for every c to become linked ≥ αn³/8] ≤ n2^-α to get an expected coupling time of O(n³ log n). In this case (sayeth Aldous-Fill, quoting a then-unpublished result of David Bruce Wilson that eventually appeared in this very nice paper) the bound is optimal up to a constant factor.
3.1.4.4. Real-world shuffling
In real life, the handful of people who still use physical playing cards tend to use a dovetail shuffle, which is closely approximated by the reverse of a process where each card in a deck is independently assigned to a left or right pile and the left pile is placed on top of the right pile. Coupling doesn't really help much here; instead, the process can be analyzed using more sophisticated techniques due to Bayer and Diaconis; the short version of the result is that Θ(log n) shuffles are needed to randomize a deck of size n.
3.2. Reversible chains
(See also Aldous-Fill chapter 3, from which many of the details in the notes below are taken.)
Some chains have the property of being reversible. Formally, a chain with transition probabilities pij is reversible if there is a distribution π such that πipij = πjpji (the detailed balance equations—this says that in the stationary distribution, the probability of seeing a transition from i to j is equal to the probability of seeing a transition from j to i). If this is the case, then ∑i πipij = ∑i πjpji = πj, which means that π is stationary.
This often gives a very quick way to compute the stationary distribution, since if we know πi, and pij≠0, then πj = πipij/pji.
The reason that such a chain is called reversible is that if we start in the stationary distribution at time 0, then the sequence of random variables (X0,...,Xt) has exactly the same distribution as the reversed sequence (Xt,...,X0). Proof: Start with Pr[∀i Xi=xi] = πx0 px0x1 px1x2 ⋯ pxt-1xt. Now observe that we can move the π across the first factor using detailed balance (πx0 px0x1 = px1x0 πx1) to get px1x0 πx1 px1x2 ⋯ pxt-1xt, and in general, pushing π through the first j factors, px1x0 px2x1 ⋯ pxjxj-1 πxj pxjxj+1 ⋯ pxt-1xt. At j=t we get px1x0 px2x1 ⋯ pxtxt-1 πxt = πxt pxtxt-1 ⋯ px1x0 = Pr[∀i Xi = xt-i].
Note that reversible chains can't have a period higher than 2, since we can always step back to where we came from.
Some examples of reversible chains:
- Random walk on a graph
Given two adjacent vertices u and v, we have πv = πud(v)/d(u). This suggests πv = c d(v) for some constant c and all v (the stationary distribution is proportional to the degree), and the constant c is determined by ∑v πv = c ∑ d(v) = 1, giving πv = d(v)/∑u d(u).
- Random walk on a weighted graph
Here each edge has a weight wuv where 0 < wuv = wvu < ∞, with self-loops permitted. A step of the random walk goes from u to v with probability wuv/∑v' wuv'. It is easy to show that this random walk has stationary distribution πu = ∑v wuv / ∑u∑v wuv, generalizing the previous case, and that the resulting Markov chain satisfies the detailed balance equations.
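Both claims are easy to check numerically; a minimal sketch (assuming NumPy, with an arbitrary symmetric weight matrix):

```python
import numpy as np

# Weighted graph on 3 vertices, symmetric weights, self-loop at vertex 0.
W = np.array([[2.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])

P = W / W.sum(axis=1, keepdims=True)   # p_uv = w_uv / sum_v' w_uv'
pi = W.sum(axis=1) / W.sum()           # pi_u = sum_v w_uv / sum_u sum_v w_uv

# Detailed balance: pi_u p_uv should equal pi_v p_vu for all u, v.
F = pi[:, None] * P
print(np.allclose(F, F.T))             # detailed balance holds
print(np.allclose(pi @ P, pi))         # hence pi is stationary
```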
3.2.1. Time-reversed chains
Given any Markov chain with transition matrix P and stationary distribution π, define the time-reversed chain with matrix P* where πipij = πjp*ji. (Check: ∑i p*ji = ∑i pijπi/πj = πj/πj = 1, so it's a Markov chain; the fact that P*'s stationary distribution is the same as P's, and that in general P*'s paths starting from the stationary distribution are a reverse of P's paths starting from the same distribution, follows from an argument similar to that for reversible chains.) This gives an alternate definition of a reversible chain as a chain for which P = P*.
Examples:
- Given a biased random walk on a cycle that moves right with probability p and left with probability q, its time-reversal is the walk that moves left with probability p and right with probability q. (Here the fact that the stationary distribution is uniform makes things simple.)
- Given the random walk defined by Xt+1 = Xt + 1 with probability 1/2 and 0 with probability 1/2, we have πi = 2^(-i-1). This is not reversible (there is a transition from 1 to 2 but none from 2 to 1), but we can reverse it by setting p*ij = 1 for i = j+1 and p*0i = 2^(-i-1). (Check: πipi,i+1 = 2^(-i-1)(1/2) = πi+1p*i+1,i = 2^(-i-2)(1); πipi0 = 2^(-i-1)(1/2) = π0p*0i = (1/2)2^(-i-1).)
Reversed versions of chains with messier stationary distributions are messier.
We can use time-reversal to generate reversible chains from arbitrary chains. The chain with transition matrix (P+P*)/2 (corresponding to moving 1 step forward or back with equal probability at each step) is always a reversible chain.
3.2.2. Spectral properties of a reversible chain
(See Aldous-Fill chapter 3, section 4, or these lecture notes from Ravi Montenegro.)
Suppose P is irreducible and reversible, i.e. that πi≠0 for all i and πipij = πjpji for all i,j. Divide both sides by √(πiπj) to get (πi)1/2 pij (πj)-1/2 = (πj)1/2 pji (πi)-1/2. This shows that the matrix S with entries Sij = (πi)1/2 pij (πj)-1/2 is symmetric.
Why we care about this: Any symmetric real-valued matrix can, by the spectral theorem, be decomposed as S = UΛU' where Λ is a diagonal matrix whose entries are eigenvalues of S and the columns of U are orthonormal eigenvectors. If you didn't take the advice to take a linear algebra course as part of your undergraduate Computer Science major, an eigenvector of a matrix is a vector v such that vS = λv, where λ is the corresponding eigenvalue—in other words, it's a vector that changes in length but not direction when we run it through S. The fact that the columns of U are orthonormal (orthogonal to each other, and having length 1) means that they act like a set of coordinate axes and that vU gives the coordinates of v in this coordinate system, while (vU)U' undoes the transformation by multiplying each coordinate by the corresponding row vector, since UU' equals the identity matrix I.
In more detail, we can write an arbitrary row vector v = c1u1 + c2u2 + ... + cnun, where the vectors ui correspond to the rows of U' and the coefficients ci are unique. If we multiply v by S on the right, we get vUΛU' = ∑ λiciui, and if we multiply it by St, we get v(UΛU')t = vUΛtU' = ∑ λi^t ciui. So repeatedly multiplying by S corresponds to shrinking v along the eigenvectors with eigenvalues whose absolute value is less than 1, while leaving behind the component(s) corresponding to eigenvalues 1 or greater in absolute value.
What if we multiply v by Pt? Let Π be the diagonal matrix given by Πii = (πi)1/2; then we can write S = ΠPΠ-1 and thus P = Π-1SΠ. So Pt = (Π-1SΠ)t = Π-1StΠ = Π-1UΛtU'Π. So we can compute vPt by dividing each component of v by the square root of the stationary distribution, shrinking along all the eigenvectors, and scaling back. If we expand out all the matrix products to get a specific entry (Pt)ij, we obtain the spectral representation formula
p(t)ij = √(πj/πi) ∑m (λm)t umi umj,
where um is the m-th eigenvector of S.
Assuming that P is irreducible and aperiodic, we have a unique limit to this process. This means that S has exactly one eigenvalue 1 (corresponding to the eigenvector whose entries are (πi)1/2), and all the remaining eigenvalues λ satisfy λt → 0 as t → ∞. We can write the eigenvalues of S (which turn out to also be the eigenvalues of P, although the eigenvectors are different) as 1 = λ1 ≥ λ2 ≥ ... ≥ λn > -1.3 Assuming that |λ2| ≥ |λn|, as t grows large λ2t will dominate the other smaller eigenvalues, and so the size of λ2 will control the rate of convergence of the underlying Markov process.
In detail, we have that the first eigenvector u1 satisfies u1i = √πi; so the first term in the spectral representation formula for Pr[Xt = j | X0 = i] is √(πj/πi) 1t √(πiπj) = πj. Each subsequent term is √(πj/πi) (λm)t umi umj ≤ √(πj/πi) (λm)t, giving a bound of (n-1) √(πj/πi) (λmax)t where λmax = max(λ2,|λn|) on the total (in the unlikely event that all the eigenvalues except λ1 are close to λmax) and O(√(πj/πi) (λmax)t) in the typical case. In either case, since λmax < 1, we get an exponentially-decreasing bound.
It is often convenient to express this bound as exp(-t/c) for an appropriate constant. Here we observe that λmax = 1-(1-λmax) ≤ exp(-(1-λmax)) gives (λmax)t ≤ exp(-(1-λmax)t). It is customary to describe this fact by giving the mixing rate or relaxation time τ2 = 1/(1-λmax), this being the time for which this bound drops by a factor of 1/e.
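For a concrete chain we can just compute λmax and τ2 directly. A sketch (assuming NumPy), using the lazy walk on a cycle from before, where we expect τ2 = Θ(m²):

```python
import numpy as np

def relaxation_time(P, pi):
    """Compute lambda_max and tau_2 = 1/(1 - lambda_max) for a
    reversible chain by symmetrizing to S = Pi P Pi^{-1}."""
    d = np.sqrt(pi)
    S = (d[:, None] * P) / d[None, :]      # S_ij = sqrt(pi_i) p_ij / sqrt(pi_j)
    lam = np.linalg.eigvalsh(S)            # real eigenvalues, ascending
    lam_max = max(lam[-2], abs(lam[0]))    # second-largest vs. most negative
    return lam_max, 1.0 / (1.0 - lam_max)

# Lazy random walk on the cycle Z_m: stay w.p. 1/2, step +-1 w.p. 1/4 each.
m = 16
P = np.zeros((m, m))
for i in range(m):
    P[i, i] = 0.5
    P[i, (i + 1) % m] = 0.25
    P[i, (i - 1) % m] = 0.25
pi = np.full(m, 1.0 / m)
print(relaxation_time(P, pi))              # tau_2 should be Theta(m^2)
```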
So far we have just considered the probability that we land in a particular state. To bound total variation distance, we (following the Montenegro notes mentioned above) go through the Lp norm. Given a vector v, define ||v||p,π = (∑i |vi|p πi)1/p. (This is a weighted version of the usual Lp norm.) Then dTV(p,π) = (1/2) ||π-p||1 = (1/2) ||1-p/π||1,π. A general property of Lp norms is that if we increase p, we also increase ||v||p,π, so we can bound dTV(p,π) by (1/2) ||1-p/π||2,π, which is a little easier to compute. Omitting the details of the proof (see here), the result is that, starting from an initial vector x that puts all of its weight on state i,
2 dTV(xPt, π) ≤ ||xPt/π - 1||2,π ≤ (λmax)t √((1-πi)/πi).
So now we just need a tool for bounding λmax.
3.2.3. Conductance
- Definition
The conductance of a set S of states is ΦS = ∑i∈S,j∉S πipij / π(S). Basically the probability of leaving S on the next step starting from the stationary distribution conditioned on being in S: a measure of how easy it is to escape from a set. Can also be thought of as a weighted version of edge expansion. The conductance of the chain as a whole is Φ = min{ ΦS : π(S) ≤ 1/2 }: as above, but now we take the minimum over all S with probability ≤ 1/2. The usefulness of conductance is that it bounds λ2:
- Theorem
In a reversible Markov chain, 1 - 2Φ ≤ λ2 ≤ 1 - Φ2/2.
This result was proved in Alexander Sinclair's Ph.D. dissertation. For lazy walks we always have λ2 = λmax, and so we can convert this to a bound on the relaxation time:
- Corollary
1/2Φ ≤ τ2 ≤ 2/Φ2.
In other words, high conductance implies low relaxation time and vice versa, up to squaring.
3.2.3.1. Computing conductance directly
For very simple Markov chains we can compute the conductance directly. Consider a lazy random walk on an odd cycle. Any proper subset S has at least two outgoing edges, each of which carries a flow of πipij = (1/n)(1/4) = 1/4n, giving ΦS ≥ (1/2n)/π(S). If we now take the minimum of ΦS over all S with π(S) ≤ 1/2, we get Φ ≥ 1/n, which gives τ2 ≤ 2n². This is essentially the same bound as we got from coupling.
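For toy chains we can also find Φ by brute force over all sets S and compare it against such bounds; a sketch (assuming NumPy; exponential in the number of states, so only for tiny examples):

```python
import numpy as np
from itertools import combinations

def conductance(P, pi):
    """Brute-force Phi = min over S with pi(S) <= 1/2 of
    (flow out of S)/pi(S)."""
    n = len(pi)
    best = float('inf')
    for r in range(1, n):
        for S in combinations(range(n), r):
            S = list(S)
            if pi[S].sum() > 0.5:
                continue
            T = [j for j in range(n) if j not in S]
            flow = sum(pi[i] * P[i, j] for i in S for j in T)
            best = min(best, flow / pi[S].sum())
    return best

# Lazy walk on an odd cycle, as in the example above.
m = 9
P = np.zeros((m, m))
for i in range(m):
    P[i, i] = 0.5
    P[i, (i + 1) % m] = 0.25
    P[i, (i - 1) % m] = 0.25
pi = np.full(m, 1.0 / m)
print(conductance(P, pi), 1 / m)   # exact Phi vs. the 1/n lower bound
```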
Here's a slightly more complicated chain. Take two copies of Kn, and join them by a path with n edges. Now consider ΦS for a lazy random walk on this graph where S consists of half the graph, split in the middle of the path (we can't actually do this exactly, but we'll get as close as we can). There is a single outgoing edge uv, where u is a path vertex with d(u) = 2; since |E| = 2(n(n-1)/2) + n = n², we get π(u) = d(u)/2|E| = 2/2n² = n^-2, and puv = 1/4, for π(u)puv = n^-2/4. By symmetry, we have π(S) → 1/2 as n → ∞, giving ΦS → (n^-2/4)/(1/2) = n^-2/2. So we have n² ≤ τ2 ≤ 8n⁴.
How does this compare to the actual mixing time? In the stationary distribution, we have a constant probability of being in each of the copies of Kn. Suppose we start in the left copy. At each step there is a 1/n chance that we are sitting on the path endpoint. We then step onto the path with probability 1/n, and reach the other end before coming back with probability 1/n. So (assuming we can make this extremely sloppy handwaving argument rigorous) it takes at least n3 steps on average before we reach the other copy of Kn, which gives us a rough estimate of the mixing time of Θ(n3). In this case the exponent is exactly in the middle of the bounds derived from conductance.
3.2.3.2. Edge expansion using canonical paths
(Here and below we are mostly following the presentation in guruswami-survey.pdf, except with slightly different examples and substantially more errors.)
For more complicated Markov chains, it is helpful to have a tool for bounding conductance that doesn't depend on intuiting what sets have the smallest boundary. The canonical paths method does this by assigning a unique path γxy from each state x to each state y in a way that doesn't send too many paths across any one edge. So if we have a partition of the state space into sets S and T, then there are |S|⋅|T| paths from states in S to states in T, and since (a) every one of these paths crosses an S-T edge, and (b) each S-T edge carries at most ρ paths, there must be at least |S|⋅|T|/ρ S-T edges.
Let's start with a small example. Let G = Cn□Cm, the n×m torus. A lazy random walk on this graph moves north, east, south, or west with probability 1/4 each, wrapping around when the coordinates reach n or m. Since this is a random walk on a regular graph, the stationary distribution is uniform. What is the relaxation time?
Intuitively, we expect it to be O(max(n,m)2), because we can think of this two-dimensional random walk as consisting of two one-dimensional random walks, one in the horizontal direction and one in the vertical direction, and we know that a random walk on a cycle mixes in O(n2) time. Unfortunately, the two random walks are not independent: every time I take a horizontal step is a time I am not taking a vertical step. We can show that the expected coupling time is O(n2+m2) by running two sequential instances of the coupling argument for the cycle, where we first link the two copies in the horizontal direction and then in the vertical direction. So this gives us one bound on the mixing time. But what happens if we try to use conductance?
Here it is probably easiest to start with just a cycle. Given points x and y on Cn, let the canonical path γxy be a shortest path between them, breaking ties so that half the paths go one way and half the other. Then each edge is crossed by exactly k paths of length k for each k=1..n/2-1, and at most n/4 paths of length n/2 (0 if n is odd), giving a total of ρ ≤ (n/2-1)(n/2)/2 + n/4 = n²/8 paths across the edge.
If we now take an S-T partition where |S| = m, we get at least m(n-m)/ρ = 8m(n-m)/n2 S-T edges. This peaks at m=n/2, where we get 2 edges—exactly the right number—and in general when m≤n/2 we get at least 8m(n/2)/n2 = 4m/n outgoing edges, giving a conductance ΦS ≥ (1/4n)(4m/n)/(m/n) = 1/n. (This is essentially what we got before, except we have to divide by 2 because we are doing a lazy walk. Note that for small m, the bound is a gross underestimate, since we know that every nonempty proper subset has at least 2 outgoing edges.)
Now let's go back to the torus Cn□Cm. Given x and y, define γxy to be the L-shaped path that first changes x1 to y1 by moving the shortest distance vertically, then changes x2 to y2 by moving the shortest distance horizontally. For a vertical edge, the number of such paths that cross it is bounded by n2m/8, since we get at most n2/8 possible vertical path segments and for each such vertical path segment there are m possible horizontal destinations to attach to it. For a horizontal edge, n and m are reversed, giving m2n/8 paths. To make our life easier, let's assume n≥m, giving a maximum of ρ = n2m/8 paths.
For |S| ≤ nm/2, this gives at least |S|⋅|G-S|/ρ ≥ |S|(nm/2)/(n²m/8) = 4|S|/n outgoing edges. We thus have ΦS ≥ (1/8nm)(4|S|/n)/(|S|/nm) = 1/2n. This gives τ2 ≤ 2/Φ² ≤ 8n², pretty much what we expect.
3.2.3.3. Congestion and canonical paths
For less symmetrical chains, we weight paths by the probabilities of their endpoints when counting how many cross each edge, and treat the flow across the edge as a capacity. This gives the congestion of a collection of canonical paths Γ = {γxy}, which is computed as
ρ(Γ) = maxuv (1/(π(u)puv)) ∑γxy∋uv π(x)π(y),
and we define ρ = minΓ ρ(Γ).
Now we can show that
Φ ≥ 1/2ρ,
from which it follows that τ2 ≤ 2/Φ² ≤ 8ρ².
Proof: Pick some set of canonical paths Γ with ρ(Γ) = ρ. Now pick some S-T partition with Φ(S) = Φ. Consider a flow where we route π(x)π(y) units of flow along each path γxy with x∈S and y∈T. This gives a total flow of π(S)π(T) ≥ π(S)/2. We are going to show that we need a lot of capacity across the S-T cut to carry this flow, which will give the lower bound on conductance.
For any S-T edge uv, we have
ρ ≥ ρ(Γ) ≥ (1/(π(u)puv)) ∑γxy∋uv, x∈S, y∈T π(x)π(y),
or
π(u)puv ≥ (1/ρ) ∑γxy∋uv, x∈S, y∈T π(x)π(y).
Since each S-T path crosses at least one S-T edge, we have
∑u∈S,v∈T π(u)puv ≥ (1/ρ) ∑x∈S,y∈T π(x)π(y) = π(S)π(T)/ρ ≥ π(S)/2ρ.
But then
Φ(S) = (∑u∈S,v∈T π(u)puv) / π(S) ≥ 1/2ρ.
4. The Metropolis-Hastings algorithm and simulated annealing
The basic idea of Metropolis-Hastings (sometimes just called Metropolis) is that we start with a reversible Markov chain P with a known stationary distribution π, but we'd rather get a different stationary distribution μ, where μi = f(i) / ∑j f(j) is proportional to some function f≥0 on states that we can compute easily. Define qij for i≠j to be pij min(1, πif(j)/πjf(i)) = pij min(1, πiμj/πjμi), and let qii be whatever probability is left over. Now consider two states i and j, and suppose that πif(j) ≥ πjf(i). Then μiqij = μipij while μjqji = μjpji(πjμi/πiμj) = pji(πjμi/πi) = μi (pjiπj) / πi = μi (pijπi) / πi = μipij (note the use of reversibility of P in the second-to-last step). So the Q chain is a reversible Markov chain with stationary distribution μ.
We can simplify this in some cases by making our underlying chain a random walk on a graph with maximum degree d, where each transition occurs with probability exactly 1/d (for lower-degree nodes, there is some leftover probability of staying put). Then we have π(i) = π(j) for all i, j, so the new transition probabilities qij are just (1/d) min(1, f(j)/f(i)).4
A typical application is that we want to sample according to Pr[i|A], but A is highly improbable (so we can't just use rejection sampling), and Pr[i|A] is easy to compute for any fixed i but tricky to compute for arbitrary events (so we can't use divide-and-conquer). If we let f(i) ∝ Pr[i|A], then Metropolis-Hastings will do exactly what we want, assuming it converges in a reasonable amount of time.
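Here's a sketch of the simplified random-walk version in Python (the target function f and the path graph are made-up illustrations, not anything from the notes):

```python
import random

def metropolis_step(i, neighbors, d, f):
    """One step of the simplified Metropolis chain: propose each
    neighbor with probability 1/d (staying put with any leftover
    probability), accept with probability min(1, f(j)/f(i))."""
    nbrs = neighbors(i)
    k = random.randrange(d)            # d >= maximum degree
    if k >= len(nbrs):
        return i                       # leftover self-loop probability
    j = nbrs[k]
    if random.random() < min(1.0, f(j) / f(i)):
        return j
    return i

# Example: sample integers 0..9 with stationary distribution
# proportional to f(i) = i+1, using the path 0-1-...-9 as the graph.
f = lambda i: i + 1.0
neighbors = lambda i: [j for j in (i - 1, i + 1) if 0 <= j <= 9]
x, counts = 0, [0] * 10
for _ in range(200000):
    x = metropolis_step(x, neighbors, 2, f)
    counts[x] += 1
print([c / 200000 for c in counts])    # should approach (i+1)/55
```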
Another application, which generally involves tinkering with the chain while it's running, is the global optimization heuristic known as simulated annealing. Here we have some function g that we are trying to minimize. So we set f(i) = exp(-αg(i)) for some α>0. Running Metropolis-Hastings gives a stationary distribution that is exponentially weighted to small values of g; if i is the global minimum and j is some state with high g(j), then π(i) = π(j) exp(α(g(j)-g(i))), which for large enough α goes a long way towards compensating for the fact that in most problems there are likely to be exponentially more bad j's than good i's. The problem is that the same analysis applies if i is a local minimum and j is on the boundary of some depression around i; large α means that it is exponentially unlikely that we escape this depression and find the global minimum.
The simulated annealing hack is to vary α over time; initially, we set α small, so that we get conductance close to that of the original Markov chain. This gives us a sample that is roughly uniform, with a small bias towards states with smaller g(i). After some time we increase α to force the process into better states. The hope is that by increasing α slowly, by the time we are stuck in some depression, it's a deep one—optimal or close to it. If it doesn't work, we can randomly restart and/or decrease α repeatedly to jog the chain out of whatever depression it is stuck in. How to do this effectively is deep voodoo that depends on the structure of the underlying chain and the shape of g(i), so most of the time people who use simulated annealing just try it out with some generic annealing schedule and hope it gives some useful result. (This is what makes it a heuristic rather than an algorithm. Its continued survival is a sign that it does work at least sometimes.)
4.1. Examples of simulated annealing with provable convergence times
It turns out to be surprisingly hard to come up with an interesting toy example of simulated annealing with provable convergence (see the Extended examples section below for some failed attempts). Here are some non-interesting examples.
4.1.1. Single peak
Let's suppose x is a random walk on an n-dimensional hypercube (i.e., n-bit vectors where we set 1 bit at a time), g(x) = |x|, and we want to maximize g. Now a transition that increases |x| is always accepted and a transition that decreases |x| is accepted only with probability e^-α. For large enough α, this puts a constant fraction of π on the single peak at x=1; the observation is that there are only ≤ n^k points with k zeros, so the total weight of all points is at most π(1) ∑k≥0 n^k exp(-αk) = π(1) ∑ exp(ln n - α)^k = π(1) / (1 - exp(ln n - α)) = π(1) ⋅ O(1) when α > ln n, giving π(1) = Ω(1) in this case.
So what happens with convergence? Let p = exp(-α). Let's try doing a path coupling (see below) between two adjacent copies x and y of the Metropolis-Hastings process, where we first pick a bit to change, then pick a value to assign to it, accepting the change in both processes if we can. The expected change in |x-y| is then (1/2n) (-1-p), since if we pick the bit where x and y differ, we have probability 1/2n of setting both to 1 and probability p/2n of setting both to 0, and if we pick any other bit, we get the same distribution of outcomes in both processes. This gives a general bound of E[|Xt+1-Yt+1| | |Xt-Yt|] ≤ (1-(1+p)/2n) |Xt-Yt|, from which we have E[|Xt-Yt|] ≤ exp(-t(1+p)/2n) E[|X0-Y0|] ≤ n exp(-t(1+p)/2n). So after t = 2n/(1+p) ln(n/ε) steps we expect to converge to within ε of the stationary distribution in total variation distance. This gives an O(n log n) algorithm for finding the peak. This is kind of a silly example, but in this case the running time for simulated annealing is almost equal to the Θ(n) optimum, if we scramble x by letting g(x) = |x⊕r| for some random constant r.
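A quick simulation sketch of this single-peak process (our own illustration; it just runs the walk until it hits the all-ones peak and compares the hitting time against the n log n prediction):

```python
import random, math

def climb(n, alpha):
    """Metropolis walk on the hypercube maximizing g(x) = |x|:
    pick a bit and a value; moves that increase |x| are always
    accepted, moves that decrease it are accepted w.p. exp(-alpha)."""
    x = [0] * n
    p = math.exp(-alpha)
    t = 0
    while sum(x) < n:
        i = random.randrange(n)
        b = random.randrange(2)
        if b >= x[i] or random.random() < p:   # only 1 -> 0 needs acceptance
            x[i] = b
        t += 1
    return t

n = 64
alpha = 2 * math.log(n)     # alpha > ln n, so the peak carries Omega(1) weight
trials = [climb(n, alpha) for _ in range(50)]
print(sum(trials) / len(trials), 2 * n * math.log(n))   # roughly O(n log n)
```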
4.1.2. Single peak with very small amounts of noise
Now we'll let g:2^n→ℕ be some arbitrary Lipschitz function (i.e., |g(x)-g(y)| ≤ |x-y|) and ask for what values of p = exp(-α) the Metropolis-Hastings walk with f(i) = exp(-αg(i)) can be shown to converge quickly. Given adjacent states x and y, with xi≠yi but xj=yj for all j≠i, we still have a probability of at least (1+p)/2n of coalescing the states by setting xi=yi. But now there is a possibility that if we try to move to (x[j/b], y[j/b]) for some j and b, that x rejects while y does not or vice versa (note if xj=yj=b, we don't move in either copy of the process). Conditioning on j and b, this occurs with probability 1-p precisely when g(x[j/b]) < g(x) and g(y[j/b]) ≥ g(y) or vice versa, giving an expected increase in |x-y| of (1-p)/2n. We still get an expected net change of -2p/2n = -p/n provided there is only one choice of j and b for which this occurs. So we converge in time τ(ε) ≤ (n/p) log(n/ε) in this case.5
One way to think of this is that the shape of the neighborhoods of nearby points is similar. If I go up in a particular direction from point x, it's very likely that I go up in the same direction from some neighbor y of x.
If there are more bad choices for j and b, then we need a much larger value of p: the expected net change is now (k(1-p)-1-p)/2n = (k-1-(k+1)p)/2n, which is only negative if p > (k-1)/(k+1). This gives much weaker pressure towards large values of g, which still tends to put us in high neighborhoods but creates the temptation to fiddle with α to try to push us even higher once we think we are close to a peak.
5. Extended examples of convergence bounds
5.1. k-colorings of a degree d graph
(This is pretty ad hoc. See MitzenmacherUpfal §11.5 for a better analysis that only requires 2d colors, or this paper for even more sophisticated results and a history of the problem.)
Consider the following chain on proper k-colorings of a graph with degree d, where k is substantially larger than d (we need at least k ≥ d+1 to guarantee that a coloring exists, but we will assume k somewhat bigger to show fast convergence). At each step, we choose one of the n nodes and one of the k colors, and recolor the node to the new color if and only if no neighbor already has that color.
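A sketch of this chain in Python (the adjacency-list representation and the 4-cycle example are our own):

```python
import random

def recolor_step(colors, adj, k):
    """One step of the chain on colorings: pick a node and a color
    uniformly; recolor only if no neighbor already has that color."""
    v = random.randrange(len(colors))
    c = random.randrange(k)
    if all(colors[u] != c for u in adj[v]):
        colors = colors.copy()
        colors[v] = c
    return colors

# Example: sample a coloring of the 4-cycle with k = 7 colors
# (degree d = 2, so k >= 3d + 1, matching the coupling bound below).
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
colors = [0, 1, 0, 1]                  # any proper coloring to start
for _ in range(1000):
    colors = recolor_step(colors, adj, 7)
print(colors)
```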
We'll think of colorings as vectors. Given two colorings x and y, let ||x-y|| be the Hamming distance between them, which is the number of nodes assigned different colors by x and y. To show convergence, we will construct a coupling showing that ||Xt-Yt|| converges to 0 over time starting from arbitrary initial points X0 and Y0. The easiest way to do so is to use a variant of coupling called path coupling.
5.1.1. Path coupling
Instead of just looking at Xt and Yt, consider a path of intermediate states Xt = Z0,tZ1,tZ2,t...Zm,t = Yt, where each pair Zi,tZi+1,t is adjacent, i.e., satisfies ||Zi,t-Zi+1,t|| = 1. (In general, we'll look at distance according to some natural metric for whatever Markov chain we've got; a typical choice is just the minimum number of steps needed to reach one state from another.) We now construct a coupling only for adjacent nodes that reduces their distance on average. The idea is that d(Xt,Yt) ≤ ∑ ||Zi,tZi+1,t||, so if the distance between each adjacent pair shrinks on average, so does the total length of the path.
A complication for graph coloring is that it's not immediately evident that d(Xt,Yt) = ||Xt-Yt||. The problem is that it may not be possible to transform Xt into Yt one node at a time without producing improper colorings. With enough colors, we can explicitly construct a short path between Xt and Yt that uses only proper colorings; but for this particular process it is easier to simply extend the Markov chain to allow improper colorings, and show that our coupling works anyway. This also allows us to start with an improper coloring for X0 if we are particularly lazy.
For the graph coloring problem, the natural coupling to consider given adjacent Xt and Yt is to pick the same node and the same new color for both. If we pick the one node on which they differ, and choose a color that is not used by any neighbor (which will be the same for both copies of the process, since all the neighbors have the same colors), then we get Xt+1 = Yt+1; this event occurs with probability at least (k-d)/kn. If we pick a node that is neither the different node nor adjacent to it, then the distance between X and Y doesn't change; either both get a new identical color or both don't. If we pick a node adjacent to the different node, then there is a possibility that the distance increases; if Xt assigns color c to this node and Yt color c', then choosing either c or c' for the neighbor will cause one copy to stay the same (because the new color isn't feasible) and the other to change. This event occurs with probability at most 2d/kn if we sum over all neighbors. So E[||Xt+1-Yt+1|| | ||Xt-Yt|| = 1] ≤ 1-(k-d)/kn + 2d/kn = 1 - (k-3d)/kn. This is less than 1 if k ≥ 3d+1.
Applying this bound to the entire path, we get for general Xt, Yt that E[||Xt+1-Yt+1|| | ||Xt-Yt||] ≤ (1-(k-3d)/kn) ||Xt-Yt|| ≤ exp(-(k-3d)/kn) ||Xt-Yt||. A simple induction then gives Pr[Xt ≠ Yt] ≤ E[||Xt-Yt||] ≤ exp(-t(k-3d)/kn) E[||X0-Y0||] ≤ exp(-t(k-3d)/kn) n.
Setting this bound equal to ε and solving for t gives τ(ε) ≤ kn log(n/ε) / (k-3d).
5.1.2. Counting by sampling
Basic idea is to nail down colors one at a time and use sampling to approximate ratios between (# of colorings with m fixed colors)/(# of colorings with m+1 fixed colors). See MitzenmacherUpfal for more details.
5.1.3. Finding a cheap coloring using Metropolis-Hastings
(I'm not going to do this in class, because the result is not very good, but I will leave the notes here in case anybody is interested.)
Let's see what happens if we try doing simulated annealing with a fixed temperature on the graph coloring chain. The idea is that we will assign cost i to color i=0..k-1, and try to find a proper coloring with low total cost by running a Markov chain with stationary distribution proportional to exp(-α ∑ xi). The chain is easily constructed using Metropolis-Hastings; when we attempt to switch a node from color i to color j, we accept the transition with probability min(1, exp(-αj)/exp(-αi)) = min(1, exp(-α(j-i))). The question then is how quickly this Markov chain converges for reasonable values of k and α.
Let's adapt the path-coupling analysis we did before. Given adjacent states x and y with xi≠yi, we will couple their next transition by having both copies of the process choose the same node and same color. We now have to worry about whether the two transitions will be accepted. Given probabilities px ≤ py of acceptance, we have both copies accept the transition with probability px, and y alone accept with probability py-px; the situation with px ≥ py is symmetric.
Now we have to consider what happens to ||x-y|| as the result of a transition. There are two classes of interesting interactions:
- Node i chooses a new color c that is not a color of any neighbor of i. Then xi and yi both switch to c with probability min(1, exp(-α(c-(min(xi,yi))))). In the worst case, we have min(xi,yi) = 0, and the smallest nonzero color that does not create a monochromatic edge is d+1. Summing over all possible values of c gives a probability of coalescence of (1/nk) (1 + ∑c=d+1..k exp(-αc)) = (1/nk) (1 + (exp(-α(d+1)) - exp(-α(k+1)))/(1-exp(-α))).
- Some neighbor j of i chooses a new color c that is equal to xi or yi. In this case the distance between x and y increases by 1, assuming the transition is accepted. But if i's neighbors all have colors greater than both xi and yi, then they will accept the transition with probability 1. So the worst-case probability that this event occurs is 2d/nk, as in the original process.
Ignoring the 1/nk denominator for the moment, our distance drops if
2d < 1 + (exp(-α(d+1)) - exp(-α(k+1)))/(1 - exp(-α)).
Our only real hope of getting the RHS bigger than the LHS is to make the fraction denominator small, which means making α very small. We have exp(-α) ≥ 1-α ⇒ 1-exp(-α) ≤ α ⇒ 1/(1-exp(-α)) ≥ 1/α. So the above inequality holds provided
2d ≤ (exp(-α(d+1)) - exp(-α(k+1)))/α
(where we've dropped the 1, which only helps us).
We can't do much about the numerator, but if α(d+1) is a small constant and α(k+1) is a much larger constant, we should be happy. Let's let α = β/(d+1) and k+1 = γ(d+1); then the RHS becomes (d+1)(exp(-β)-exp(-γβ))/β. This will easily exceed 2d (giving us an extra 1/nk expected drop) if exp(-β) - exp(-γβ) ≥ 2β, which occurs if exp(-γβ) ≤ exp(-β) - 2β, or -γβ ≤ log(exp(-β) - 2β), or γ ≥ -log(exp(-β) - 2β)/β. For example, setting β = 1/5 and γ = 5 works (this can be determined numerically).
So with α = 1/(5(d+1)) and k ≥ 5d+4, we get E[||Xt+1-Yt+1||] ≤ (1-1/nk) E[||Xt-Yt||]. So the Metropolis-Hastings chain converges to within ε of its stationary distribution (in total variation distance) starting from an arbitrary state after at most nk log(n/ε) steps.
So what does this mean in terms of finding a good coloring? As in the old joke about the field service engineer, it depends on how many bad colorings you bring with you. And in fact most colorings are pretty bad, which means that our exp(-αc) multiplier isn't actually going to push us too far away from them.
Quick calculation: For a single node with no neighbors, the stationary distribution on its color is π(i) = exp(-αi)/∑i=0..k-1 exp(-αi). The expected value is ∑i i exp(-αi) / ∑i exp(-αi). Taking the limit as k→∞ gives E[color] = exp(-α)/(1-exp(-α)) which for small α is in the neighborhood of 1/α, e.g. 5(d+1). Even for k ≅ 1/α (as in the case above), we have π(k-1) ≅ e-1π(0), which doesn't leave a lot of room for pressure toward 0. So we can reasonably expect the chain to be pretty far from optimal almost all of the time for α this small.
I guess this is why people resort to simulated annealing. I will try to come up with a different example for class.
5.2. Another example: Independent sets
Let's try something we know is hard, so we won't feel bad when it doesn't work. We'd like to sample independent sets of vertices on a graph, with a bias toward large sets.
By analogy with the graph coloring problem, a natural way to set up the random walk is to represent each potentially independent set as a bit vector, where 1 indicates membership in the set, and at each step pick one of the n bits uniformly at random and set it to 0 or 1 with probability 1/2 each, assuming that the resulting set is independent.
Easy to see this is a reversible Markov chain with π(x) = 1/|S| where S is the set of all independent sets. Also easy to see that the distance between x and y is exactly |x-y|1, since we can always remove all the extra ones from x and put back the extra ones in y.
Obvious coupling: Pick the same position and value for both copies of the chain. If x and y are adjacent, then they coalesce with probability 1/n (both probability 1/2n transitions are feasible for both copies, since the neighboring nodes always have the same state). What is the probability that they diverge? We can only be prevented from picking a value if the value is 1 and some neighbor is 1. So the bad case is when xi = 1, yi = 0, and we attempt to set some neighbor of i to 1; in the worst case, this happens d/2n of the time, which is at least 1/n when d ≥ 2. No coalescence for you!
We now have two choices. We can try to come up with a more clever random walk (see MitzenmacherUpfal §11.6 for proof that a more sophisticated algorithm converges when d≤4), or we can try to persevere with our dumb random walk, a more clever analysis, and possibly a more restricted version of the problem where it will miraculously work even though it shouldn't really. Let's start with d=2. Here the path coupling argument gives no expected change in |Xt-Yt|, so we have some hope that with enough wandering around they do in fact collide.
How do we prove this? Given arbitrary Xt, Yt, we know from the path coupling argument above that on average they don't move any farther apart. We also know that there is a probability of at least 1/n that they move closer if they are not already identical (this is slightly tricky to see in the general case: we always have a 1/(2n) chance of removing an extra 1 from one copy, and either the other copy also has an extra 1, which gives another 1/(2n) chance, or it doesn't, in which case we can add a missing 1 to it and get 1/(2n) that way instead). And we know that any change resulting from our not-very-smart coupling moves the distance by ±1. So if we sample only at times when a move has just occurred, we see a random walk (with a possible downward bias) with a reflecting barrier at n and an absorbing barrier at 0: this converges in n² steps. But since each of these random-walk steps may take an expected n real steps to occur, we get a bound of n³ on the total number of steps to converge.
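For completeness, here is the absorption-time calculation behind the n² claim (my addition). Write T(i) for the expected number of distance-changing moves needed to reach 0 starting from distance i. Then T(0) = 0, T(i) = 1 + (T(i-1) + T(i+1))/2 for 0 < i < n, and T(n) = 1 + T(n-1) at the reflecting barrier. This recurrence is solved exactly by T(i) = i(2n-i) ≤ n², and a downward bias only reduces T.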
But if d=3, we seem to be doomed. The same applies if we try to bias the walk in favor of adding vertices: since our good case is a 1→0 transition, decreasing its probability breaks the very weak balance we have even with d=2 (for d=1, no problem!). So maybe we should see what we can do with the sneakier random walk from MitzenmacherUpfal.
Here the idea is that we pick a random edge uv, and then try to do one of the following operations, all with equal probability:
- Set u=v=0.
- Set u=0 and v=1.
- Set u=1 and v=0.
In each case, if the result would be a non-independent set, we instead do nothing.
Verifying that this has a uniform stationary distribution is mildly painful if we are not careful, since there may be several different transitions that move from some state x to the same state y. But each transition occurs with probability 1/(3|E|), and for each one we can see that there is a reverse transition that occurs with equal probability; so we get detailed balance etc. and win.
To make things more exciting, we will assume that we accept changes that decrease the number of elements of the independent set only with probability p; this will turn out to be a bad idea, but we can always recover the original process by setting p=1. This changes the stationary distribution so that π(x) is proportional to p^(-|x|).
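A sketch of one step of this chain, including the biased acceptance (my code; edges is a list of edges, adj maps vertices to neighbor sets, and p = 1 recovers the original process):

```python
import random

def edge_step(x, edges, adj, p=1.0):
    """One step: x is the current independent set."""
    u, v = random.choice(edges)        # pick a random edge uv
    move = random.randrange(3)         # three equiprobable operations
    if move == 0:
        new = x - {u, v}               # set u = v = 0: always independent
    elif move == 1:
        new = (x - {u}) | {v}          # set u = 0 and v = 1
        if adj[v] & new:               # a neighbor of v is in the set:
            return x                   # result not independent, do nothing
    else:
        new = (x - {v}) | {u}          # set u = 1 and v = 0
        if adj[u] & new:
            return x
    if len(new) < len(x) and random.random() > p:
        return x                       # accept shrinking moves only w.p. p
    return new
```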
So now what happens if we run two coupled copies of this process, where the copies differ on exactly one vertex i?
First, observe that every neighbor of i is 0 in both processes: the copy with i set to 1 is an independent set, so i's neighbors are 0 there, and the two copies agree everywhere except at i. A transition that doesn't involve any neighbors of i has the same effect on both processes, so we need only consider choices of edges where one endpoint is either i or a neighbor j of i. When the other endpoint of j's edge isn't i, we'll call it k; there may be several such k.
The tables below add up the effects of all the transitions that involve some particular neighbor j of i; each entry in the contribution column should be multiplied by 1/(3|E|) to get its actual effect on the expectation. We consider three cases, depending on the number s of neighbors k of j (other than i) that are set to 1, since this determines when j can be set to 1. The values in the left-hand columns are the new values of i, j, and k; a blank entry means the transition doesn't touch that vertex. For the notes, we assume that xi=0 and yi=1; the reverse case is symmetric.
We'll start with the s=0 case:
| i | j | k | contribution | notes |
|---|---|---|--------------|-------|
| 0 | 0 |   | -p           | Sets xi=yi, but only if both processes move. |
| 0 | 1 |   | -1           | Sets xi=yi always. |
| 1 | 0 |   | -1           | Sets xi=yi always. |
|   | 0 | 0 | 0            | Any change to x also appears in y. |
|   | 0 | 1 | 0            | Any change to x also appears in y. |
|   | 1 | 0 | +(d(j)-1)    | Causes divergence between x and y, since we can only move in x (summed over all d(j)-1 choices of k). |
So in this case, each node j contributes -p - 1 - 1 + (d(j)-1) = d(j)-3-p to the expected change in distance between x and y.
For s=1, we have:
| i | j | k | contribution | notes |
|---|---|---|--------------|-------|
| 0 | 0 |   | -p           | Sets xi=yi, but only if both processes move. |
| 0 | 1 |   | 0            | No effect; can't set j=1 because some k=1. |
| 1 | 0 |   | -1           | Sets xi=yi always. |
|   | 0 | 0 | 0            | Any change to x also appears in y. |
|   | 0 | 1 | 0            | Any change to x also appears in y. |
|   | 1 | 0 | +2           | Diverges at both j and k, but only for one choice of k. |
The total contribution is now -p - 1 + 2 = 1-p.
For s=2, we have:
| i | j | k | contribution | notes |
|---|---|---|--------------|-------|
| 0 | 0 |   | -p           | Sets xi=yi, but only if both processes move. |
| 0 | 1 |   | 0            | No effect; can't set j=1 because some k=1. |
| 1 | 0 |   | -1           | Sets xi=yi always. |
|   | 0 | 0 | 0            | Any change to x also appears in y. |
|   | 0 | 1 | 0            | Any change to x also appears in y. |
|   | 1 | 0 | 0            | No effect; can't set j=1 because some other k'=1. |
Total contribution: -p - 1 = -1-p.
Here the fact that j has two neighbors (besides i) that are set to 1 means that we can't set j to 1 no matter which k we pick.
To calculate the total shift, we multiply the worst per-j contribution by d(i), so that we cover all neighbors j. The resulting expected change is at most (1/(3|E|)) d(i) max(d(j)-3-p, 1-p, -1-p) ≤ (d/(3|E|)) max(d-3-p, 1-p). This bound is never negative (since 1-p ≥ 0), but we can drive it to 0 by setting d ≤ 4 and p = 1, which makes both d-3-p = d-4 and 1-p at most 0.
This means that for graphs of maximum degree 4 or less, E[|Xt+1-Yt+1| | Xt, Yt] ≤ |Xt-Yt|, as in our previous dumb algorithm; and since each distance-changing move occurs with probability at least 1/(3|E|), the same analysis as before shows that we converge in at most 3n²|E| steps on average.
If we could come up with a variant of this process that converged in polynomial time (for d=3, say) even for small p, miraculous things would follow. In particular, we could use it to find near-maximum independent sets in cubic graphs (graphs where every node has degree 3) in polynomial time. The argument is that in a degree-3 graph, the maximum independent set has size m ≥ n/4 (greedy: each vertex we add excludes at most 3 others). If we call an independent set good if its size is at least (1-δ)m, then the bad sets are each assigned weight less than p^(δn/4) times the weight of the optimal set, and even if there are 2^n of them, their total weight is still less than 2^n p^(δn/4) times the weight of the optimal set. So if p ≤ 2^(-4/δ), a single optimal set is sampled at least as often as all the bad sets put together, and if there are other good sets besides the optimal ones, that only increases the probability of getting a good set in the stationary distribution. But if there is a polynomial-time randomized algorithm that finds a (1-δ)-approximation to maximum independent set in cubic graphs with non-trivial probability (for sufficiently small δ), then RP=NP, because of this paper: alimanti-mis-cubic.pdf. So if RP≠NP (as is commonly believed), then for any chain we build, there is some value of p at which it stops converging in polynomial time.
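Spelling out the weight arithmetic (my expansion of the claim above, writing x* for an optimal set): a bad set x has |x| < (1-δ)m, so π(x)/π(x*) = p^(-|x|)/p^(-m) = p^(m-|x|) < p^(δm) ≤ p^(δn/4), using p ≤ 1 and m ≥ n/4. Summing over at most 2^n bad sets gives total relative weight below 2^n p^(δn/4), and 2^n p^(δn/4) ≤ 1 exactly when log₂ p ≤ -4/δ, i.e., when p ≤ 2^(-4/δ).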
CategoryRandomizedAlgorithmsNotes
To put it more strongly, the Aldous-Fill manuscript is the Necronomicon of Markov chain convergence proofs. You can't truly join the cult without reading it. You should stop reading these notes right now and read the Aldous-Fill manuscript instead. Iä! Iä! Markov fhtagn! (1)
This is a little tricky; we don't know from this bound alone how fast the probability converges as a function of n, so to do this right we need to look into the bound in more detail. (2)
For aperiodic chains and chains with period 2, all the eigenvalues are real; the period-2 case includes an eigenvalue of -1. Chains with periods greater than 2 have pairs of complex-valued eigenvalues that are roots of unity, which happen to cancel out to only produce real probabilities in vPt. (3)
MitzenmacherUpfal §10.4.1 concentrates on this case. (4)
You might reasonably ask if such functions g exist. One example is g(x) = (x1⊕x2) + ∑i>2 xi. (5)