A more up-to-date version of these notes can be found at http://www.cs.yale.edu/homes/aspnes/cs469/notes.pdf.
A randomized algorithm flips coins during its execution to determine what to do next. When considering a randomized algorithm, we usually care about its expected worst-case performance, which is the average amount of time it takes on the worst input of a given size. This average is computed over all the possible outcomes of the coin flips during the execution of the algorithm. See ExpectedValue for more information about computing expectations. We may also ask for a high-probability bound, showing that the algorithm doesn't consume too many resources most of the time.
Formally, we think of a randomized algorithm as a machine M that computes M(x,r), where x is the problem input and r is the sequence of random bits. Because the running time may depend on the random bits, it is now a RandomVariable. The expected worst-case cost is then the max over all x (of a given size) of Er[time(M(x,r))].
This is distinct from traditional worst-case analysis, where there is no r and no expectation, and average-case analysis, where there is again no r and the value reported is not a max but an expectation over some distribution on x. The following trivial example shows the distinction.
In studying randomized algorithms, we consider pretty much the same issues as for deterministic algorithms: how to design a good randomized algorithm, and how to prove that it works within given time or error bounds. The main difference is that it is often easier to design a randomized algorithm—randomness turns out to be a good substitute for cleverness more often than one might expect—but harder to analyze it. So much of what one does is develop good techniques for analyzing the often very complex random processes that arise in the execution of an algorithm. Fortunately, in doing so we can often use techniques already developed by probabilists and statisticians for analyzing less overtly algorithmic processes.
1. A trivial example
Suppose we have two doors. Behind one door is a valuable prize, behind the other is nothing. Our goal is to obtain the prize after opening the fewest possible doors.
A deterministic algorithm tries one door, then the next. In the worst case, two doors are opened. In the average case, if we assume that both doors are equally likely to hide the prize, we open one door half the time and the other door half the time, or 3/2 doors on average. We can obtain the same expected time even in the worst case by flipping a coin ourselves to decide which door to open first. This gives a randomized algorithm, and because we flip the coin (instead of nature, in the case of the average-case algorithm), we can guarantee the good expected performance no matter what the person hiding the prize does.
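As a quick sanity check on the 3/2 figure, here is a minimal Python simulation (my own sketch; the setup and names are illustrative, not from the notes). The point is that the expectation is over our own coin flip, so it holds no matter which door hides the prize:

    import random

    def doors_opened(prize_door):
        """Open a uniformly random door first; count doors opened
        (1 if we guessed right, 2 otherwise)."""
        return 1 if random.randrange(2) == prize_door else 2

    trials = 100_000
    for prize_door in (0, 1):   # the adversary's two possible placements
        avg = sum(doors_opened(prize_door) for _ in range(trials)) / trials
        print(prize_door, avg)  # both averages come out near 1.5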
2. A less trivial example: randomized Quicksort
The Quicksort algorithm works as follows. For simplicity, we assume that no two elements of the array being sorted are equal.
If the array has more than one element:
 1. Pick a pivot uniformly at random from the elements of the array.
 2. Split the array into A1 and A2, where A1 contains all elements < the pivot and A2 contains all elements > the pivot.
 3. Sort A1 and A2 recursively and return A1, pivot, A2.
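In Python these steps read almost verbatim (a minimal sketch, assuming distinct elements as above):

    import random

    def quicksort(a):
        """Randomized quicksort; assumes all elements are distinct."""
        if len(a) <= 1:
            return list(a)
        pivot = random.choice(a)          # chosen uniformly at random
        a1 = [x for x in a if x < pivot]  # elements less than the pivot
        a2 = [x for x in a if x > pivot]  # elements greater than the pivot
        return quicksort(a1) + [pivot] + quicksort(a2)

    print(quicksort([3, 1, 4, 1.5, 9, 2, 6]))  # [1, 1.5, 2, 3, 4, 6, 9]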
The splitting step takes exactly n-1 comparisons, since we have to check each non-pivot against the pivot. We assume all other costs are dominated by the cost of comparisons. How many comparisons does randomized quicksort do on average?
There are two ways to solve this problem: the dumb way and the smart way. We'll show both.
2.1. Brute force method: solve the recurrence
Let T(n) be the expected number of comparisons done on an array of n elements. We have T(0) = T(1) = 0 and, for larger n,

    T(n) = (n-1) + (1/n) ∑_{k=0}^{n-1} [T(k) + T(n-1-k)].

Why? Because there are n equally-likely choices for our pivot (hence the 1/n), and for each choice the expected cost is T(k) + T(n-1-k), where k is the number of elements that land in A1. Formally, we are using here the law of total probability, which says that for any random variable X and any partition of the probability space into events B1...Bn, E[X] = ∑_i Pr[Bi] E[X|Bi].
So now we just have to solve this ugly recurrence. We can reasonably guess that when n ≥ 1, T(n) ≤ an log n for some constant a (taking log to be the natural logarithm). Clearly this holds for n = 1. Now apply induction on larger n to get

    T(n) = (n-1) + (1/n) ∑_{k=0}^{n-1} [T(k) + T(n-1-k)]
         = (n-1) + (2/n) ∑_{k=1}^{n-1} T(k)
         ≤ (n-1) + (2/n) ∑_{k=1}^{n-1} ak log k
         ≤ (n-1) + (2a/n) ∫_1^n x log x dx
         = (n-1) + (2a/n) ((n²/2) log n - n²/4 + 1/4)
         = (n-1) + an log n - an/2 + a/(2n).
If we squint carefully at this for a while we notice that setting a = 2 makes the right-hand side less than or equal to an log n, since the remaining terms become (n-1) - n + 1/n = 1/n - 1, which is nonpositive for n ≥ 1. We can thus confidently conclude that T(n) ≤ 2n log n (for n ≥ 1).
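As a numerical sanity check (my own sketch, not part of the notes), we can evaluate the recurrence exactly and watch it stay under 2n log n:

    import math

    def exact_T(nmax):
        """Exact values of T(n) = (n-1) + (1/n) * sum over k of
        (T(k) + T(n-1-k)), with T(0) = T(1) = 0."""
        T = [0.0] * (nmax + 1)
        for n in range(2, nmax + 1):
            T[n] = (n - 1) + sum(T[k] + T[n - 1 - k] for k in range(n)) / n
        return T

    T = exact_T(1000)
    for n in (10, 100, 1000):
        print(n, round(T[n], 1), round(2 * n * math.log(n), 1))
    # T(n) is below 2n log n at every n checked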
2.2. Clever method: use linearity of expectation
Imagine we use the following method for choosing pivots: we generate a random permutation of all the elements in the array, and when asked to sort some subarray A', we use as pivot the first element of A' that appears in our permutation. Since each element is equally likely to be first, this is equivalent to the actual algorithm. Pretend that we are always sorting the numbers 1..n and define for each pair of elements i < j the indicator variable (see RandomVariables) Xij to be 1 if i is compared to j at some point during the execution of the algorithm and 0 otherwise. Amazingly, we can actually compute the probability of this event (and thus E[Xij]): the only time i and j are compared is if one of them is chosen as a pivot before they are split up into different arrays. How do they get split up into different arrays? If some intermediate element k is chosen as pivot first, i.e., if some k with i < k < j appears in the permutation before both i and j. Occurrences of other elements don't affect the outcome, so we can concentrate on the restriction of the permutation to just the numbers i..j, and we win if this restricted permutation starts with either i or j. This event occurs with probability 2/(j-i+1), so we have E[Xij] = 2/(j-i+1). By linearity of expectation, summing over all pairs i < j gives:

    E[total comparisons] = ∑_{i<j} 2/(j-i+1)
                         = ∑_{i=1}^{n-1} ∑_{j=i+1}^{n} 2/(j-i+1)
                         = ∑_{i=1}^{n-1} ∑_{k=2}^{n-i+1} 2/k
                         ≤ ∑_{i=1}^{n-1} 2(H_n - 1)
                         = 2(n-1)(H_n - 1),

where H_n = 1 + 1/2 + ... + 1/n ≤ log n + 1 is the n-th harmonic number.
This is pretty close to the bound of 2 n log n we computed the dumb way. Note that in principle we can compute an exact solution by being more careful about the sum.
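Here is a small simulation sketch (mine, not from the notes) comparing the empirical average against the bound above and against 2(n+1)H_n - 4n, the exact value that the more careful version of the sum yields:

    import math
    import random

    def comparisons(a):
        """Count comparisons made by randomized quicksort on a."""
        if len(a) <= 1:
            return 0
        pivot = random.choice(a)
        a1 = [x for x in a if x < pivot]
        a2 = [x for x in a if x > pivot]
        return (len(a) - 1) + comparisons(a1) + comparisons(a2)

    n, trials = 200, 2000
    avg = sum(comparisons(list(range(n))) for _ in range(trials)) / trials
    H = sum(1 / k for k in range(1, n + 1))
    print(round(avg, 1))                      # empirical average
    print(round(2 * (n + 1) * H - 4 * n, 1))  # exact expectation
    print(round(2 * (n - 1) * (H - 1), 1))    # the bound from the sum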
Which way is better? Solving the recurrence requires less probabilistic handwaving (a more polite term might be "insight") but more grinding out inequalities, which is a pretty common trade-off. Since I am personally not very clever I would try the brute-force approach first. But it's worth knowing about better methods so you can try them in other situations.
3. Karger's min-cut algorithm
Suppose we have a multigraph(1) G, and we want to partition the vertices into nonempty sets S and T such that the number of edges with one endpoint in S and one endpoint in T is as small as possible. There are many sophisticated ways to do this. There is also a randomized algorithm due to David Karger that solves the problem using almost no sophistication at all (at least in the algorithm itself).
The main idea is that given an edge uv, we can construct a new multigraph G1 by contracting the edge: in G1, u and v are replaced by a single vertex, and any edge that used to have either vertex as an endpoint now goes to the combined vertex (edges with both endpoints in {u,v} are deleted). Karger's algorithm is to contract edges chosen uniformly at random until only two vertices remain. All the vertices that got packed into one of these become S, the others become T. This finds a global minimum cut with probability at least 2/(n(n-1)).
Proof: Let (S,T) be a min cut of size k. Then the degree of each vertex v is at least k (otherwise (v,G-v) would be a smaller cut), and G contains at least kn/2 edges. The probability that we contract an S-T edge is thus at most k/(kn/2) = 2/n, and the probability that we don't contract one is 1-2/n = (n-2)/n. Assuming we missed collapsing (S,T) the first time, we now have a new graph G1 with n-1 vertices in which the min cut is still of size k. So now the chance that we miss (S,T) is (n-3)/(n-1). We stop when we have two vertices left, so the last step succeeds with probability 1/3. Multiplying all the probabilities together gives

    (n-2)/n · (n-3)/(n-1) · (n-4)/(n-2) · ... · 2/4 · 1/3 = 2/(n(n-1)),

since the product telescopes.
If the graph has more than one min cut, this only makes our life easier. Note that since each min cut turns up with probability at least 2/(n(n-1)), there can't be more than n(n-1)/2 of them. But even if there is only one, we have a good chance of finding it if we simply re-run the algorithm substantially more than n² times.
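A compact Python sketch of the contraction algorithm plus re-running (my own illustration; the edge-list representation and function names are assumptions, not from the notes):

    import random

    def karger_cut(n, edges):
        """One run of Karger's contraction algorithm on a connected
        multigraph with vertices 0..n-1 given as a list of (u, v) pairs
        (parallel edges allowed). Returns the size of the cut found."""
        parent = list(range(n))        # union-find over merged vertices

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        remaining = n
        while remaining > 2:
            u, v = random.choice(edges)   # uniform over surviving edges
            ru, rv = find(u), find(v)
            if ru != rv:                  # self-loops count as deleted
                parent[ru] = rv           # contract the edge
                remaining -= 1
        # count edges crossing the two surviving super-vertices
        return sum(1 for u, v in edges if find(u) != find(v))

    def min_cut(n, edges, runs):
        """Re-run enough times that missing every min cut is unlikely."""
        return min(karger_cut(n, edges) for _ in range(runs))

    # Example: a 4-cycle has min cut 2.
    print(min_cut(4, [(0, 1), (1, 2), (2, 3), (3, 0)], runs=100))

Sampling uniformly from the original edge list and skipping self-loops is equivalent to sampling uniformly from the edges of the contracted multigraph, since every non-self-loop original edge survives contraction.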
4. Las Vegas vs Monte Carlo algorithms
It is possible for a randomized algorithm to fail some of the time but still be useful; we just need a bound on the probability of failure. We typically consider two classes of algorithms:
- Las Vegas algorithms
- The algorithm fails with some probability, but we can tell when it fails. In particular, we can run it again until it succeeds, which means that we can eventually succeed with probability 1 (but with a potentially unbounded running time). Quicksort is an example of a Las Vegas algorithm.
- Monte Carlo algorithms
- The algorithm fails with some probability, but we can't tell when it fails. If the algorithm produces a yes/no answer and the failure probability is significantly less than 1/2, we can reduce the probability of failure by running it many times and taking a majority of the answers, as in the sketch after this list. Karger's min-cut algorithm is an example of a Monte Carlo algorithm.
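To illustrate the majority trick (a sketch, assuming a yes/no Monte Carlo algorithm whose runs fail independently with probability p < 1/2; the names here are mine):

    import random

    def amplify(algorithm, x, runs):
        """Run a yes/no Monte Carlo algorithm `runs` times on input x and
        return the majority answer. If each run errs with probability
        p < 1/2 independently, the majority's error probability drops
        exponentially in `runs`."""
        yes = sum(1 for _ in range(runs) if algorithm(x))
        return yes > runs // 2

    # Toy stand-in: answers "x == 0" correctly only 60% of the time.
    def noisy_is_zero(x):
        return (x == 0) if random.random() < 0.6 else (x != 0)

    print(amplify(noisy_is_zero, 0, runs=1001))  # True with very high probability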
The heuristic for remembering which class is which is that they were named by English speakers: in Las Vegas, the dealer tells you whether you've won or lost, but in Monte Carlo, le croupier ne parle que français (the croupier speaks only French), so you have no idea what he's saying.
5. Randomized complexity classes
We can also have algorithms with "one-sided" failure properties. For these algorithms, we never get a bogus "yes" answer but may get a bogus "no" answer (or vice versa). This gives us several complexity classes that act like randomized versions of NP, co-NP, etc.:
The class R or RP (randomized P) consists of all languages L for which a poly-time Turing machine M exists such that if x∈L, then Pr[M(x,r) = 1] ≥ 1/2 and if x∉L, then Pr[M(x,r) = 1] = 0. In other words, we can find a witness that x∈L with constant probability. This is the randomized analog of NP (but it's much more practical, since with NP the probability of finding a winning witness may be exponentially small).
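A standard amplification argument shows the constant 1/2 is arbitrary: run M independently t times and accept if any run accepts. This never accepts any x∉L, and for x∈L the probability that every run misses a witness is at most (1/2)^t.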
The class co-R consists of all languages L for which a poly-time Turing machine M exists such that if x∉L, then Pr[M(x,r) = 1] ≥ 1/2 and if x∈L, then Pr[M(x,r) = 1] = 0. This is the randomized analog of co-NP.
The class ZPP (zero-error probabilistic P) is defined as RP∩co-RP. If we run both our RP and co-RP machines for polynomial time, we learn the correct classification of x with probability at least 1/2. The rest of the time we learn only that we've failed (because both machines return 0, telling us nothing). This is the class of (polynomial-time) Las Vegas algorithms. The reason it is called "zero-error" is that we can equivalently define it as the problems solvable by machines that always output the correct answer eventually, but only run in expected polynomial time.
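To fill in the expected-time claim (a standard calculation): run both machines repeatedly until one of them answers 1. Each round succeeds with probability at least 1/2 independently, so the number of rounds R is dominated by a geometric random variable with parameter 1/2, giving

    E[R] ≤ ∑_{k=1}^{∞} k (1/2)^k = 2,

and the expected total time is at most twice the (polynomial) cost of a single round.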
The class BPP (bounded-error probabilistic P) consists of all languages L for which a poly-time Turing machine M exists such that if x∉L, then Pr[M(x,r) = 1] ≤ 1/3, and if x∈L, then Pr[M(x,r) = 1] ≥ 2/3. These are the (polynomial-time) Monte Carlo algorithms: if our machine answers 0 or 1, we can guess whether x∈L or not, but we can't be sure.
The class PP (probabilistic P) consists of all languages L for which a poly-time Turing machine M exists such that if x∈L, then Pr[M(x,r) = 1] > 1/2, and if x∉L, then Pr[M(x,r) = 1] ≤ 1/2. Since the gap between the two probabilities may be exponentially small, such algorithms are not really useful in practice; PP is mostly of interest to complexity theorists.
Assuming we have a source of random bits, any algorithm in RP, co-RP, ZPP, or BPP is good enough for practical use. We can usually even get away with using a pseudo-random number generator, and there are good reasons to suspect that in fact every one of these classes is equal to P.
6. More examples
See MotwaniRaghavan for many more examples.
(1) Unlike ordinary graphs, multigraphs can have more than one edge between two vertices.