[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

1. Peaks and valleys

Suppose we flip a fair coin n times. For each i, 1≤i≤n, let Xi be the indicator of the event that the i-th coin-flip is heads. For each i, 2≤i≤n-1, let Yi be the indicator for the event that Xi-1=0, Xi=1, and Xi+1=0; if this even occurs we say that the sequence of coin-flips has a peak at i. Let $S = \sum_{i=2}^{n-1} Y_i$ be the random variable counting the number of peaks in the sequence.

  1. What is E[S]?
  2. What is Var[S]?

1.1. Solution

1. To compute E[S], we just need to compute E[Yi] for each i and sum the expectations. Each variable Yi is one provided we get a pattern THT in the three coins surrounding position i; since the coins are fair and independent, this event occurs with probability (1/2)3 = 1/8. Summing over all n-2 values of i gives E[S] = (n-2)/8.

2. To compute Var[S], we use the sum formula for variance, taking into account the fact that not all of the Yi are pairwise independent. First let's compute Var[Yi]. We have previously shown that E[Yi] = 1/8; since Yi2 = Yi this also means E[Yi2] = 1/8. So Var[Yi] = E[Yi2] - (E[Yi])2 = 1/8 - (1/8)2 = 7/64.

Now we need to compute Cov(Yi,Yj) for all i≠j. We'll consider several cases.

If j=i+1, then there is a sequence of 4 coin-flips starting at position i-1 such that Yi=1 iff the first three coins have the pattern THT and Yj=1 iff the last three coins have the pattern THT. But we can't realize both patterns simultaneously because they disagree on the middle two coins. So in this case we have E[YiYi+1] = 0, and Cov(Yi,Yi+1) = 0 - E[Yi]E[Yj] = -1/64.

If j=i+2, then there is a sequence of 5 coin-flips starting at position i-1 such that Yi=Yj=1 iff the coins come up THTHT. This occurs with probability (1/2)5 = 1/32, giving Cov(Yi,Yi+1) = 1/32 - 1/64 = 1/64.

If j>i+2, then there is no overlap between the Yi coins and the Yj coins. The random variables Yi and Yj are thus independent, and we have Cov(Yi,Yj) = 0.

The cases where j<i are symmetric.

Applying the sum formula gives:

\begin{align*}
\Var[S]
&=
\sum_{i=1}^{n} \sum_{j=1}^{n} \Cov(Y_i,Y_j)
\\
&=
\sum_{i=3}^{n} \Cov(Y_i,Y_{i-2})
+ \sum_{i=2}^{n} \Cov(Y_i,Y_{i-1})
+ \sum_{i=1}^{n} \Var(Y_i)
+ \sum_{i=1}^{n-1} \Cov(Y_i,Y_{i+1})
+ \sum_{i=1}^{n-2} \Cov(Y_i,Y_{i+2})
\\
&=
(7/64)n
+ (-1/64)(n-1)
+ (1/64)(n-2)
+ (-1/64)(n-1)
+ (1/64)(n-2)
\\
&=
\frac{7n - (n-1) + (n-2) - (n-1) + (n-2)}{64}
\\
&=
\frac{7n-2}{64}.
\end{align*}

2. Stronger than Markov

Let X1...Xn be the indicator variables for independent coin-flips with bias p; that is, each Xi is 1 with probability p and 0 with probability 1-p. Let $S = \sum_{i=1}^{n} X_i$.

  1. Compute E[eS].

  2. Use the expected value above and Markov's inequality to compute an upper bound on Pr[S≥m] = Pr[eS≥em].

  3. Can you find values of p, n, and m, for which this bound is smaller than the bound E[S]/m given by a direct application of Markov's inequality?

2.1. Solution

1. Since the Xi are all independent, so are the random variables $e^{X_i}$. So we have

\begin{align*}
\E\left[e^S\right]
&=
\E\left[e^{\sum_{i=1}^{n} X_i}\right]
\\
&=
\E\left[\prod_{i=1}^{n} e^{X_i}\right]
\\
&=
\prod_{i=1}^{n} \E\left[e^{X_i}\right]
\\
&= (pe^1 + (1-p)e^0)^n
\\
&= (pe+1-p)^n.
\end{align*}

2. Markov's inequality gives Pr[eS≥em] ≤ E[eS]/em = (pe+1-p)n/em.

3. The bound is generally better when m is significantly greater than E[S] = pn. For example, if p=1/n and m=n, the new bound is (e/n+1-1/n)n/en = (1/n+1/e-1/(en))n; for large n this approaches (1/e)n, which is very small (though not as small as the exact probability (1/n)n). A straight application of Markov's inequality gives only n(1/n)/n = 1/n.

A similar case is when p=1/2 and m=n. Here the bound is ((e+1)/2)n/en ~= (0.6839...)n. The straight Markov's bound in this case is only 1/2; it doesn't even depend on n.

The main difference between applying Markov's inequality directly to S and applying it to eS is that in the latter case we need independence to compute eS. The technique of bounding E[eαS] for appropriate choices of α goes by the name of Chernoff bounds and is a standard method for proving bounds on the tails of the distribution of a random variable.

3. At the genome factory

A custom gene splicing shop charges $2 to splice a thymine (T) amino acid into a single-strand DNA fragment, and $1 each for adenine (A) and cytosine (C). Guanine (G) is free. So, for example, constructing the sequence GATTACA costs 0+1+2+2+1+1+1=8 dollars.

  1. Write a generating function such that the zk coefficient gives the number of ways to buy a single amino acid for k dollars.

  2. Write a generating function such that the zk coefficient gives the number of ways to buy a single DNA strand consisting of n amino acids for k dollars.

  3. Give a simple expression for the number of strands of length n that cost k dollars.

3.1. Solution

  1. 1+2z+z2.

  2. (1+2z+z2)n.

  3. The trick here is that (1+2z+z2) factors as (1+z)2. So (1+2z+z2)n = (1+z)2n. From the binomial theorem, this is equal to $\sum_{k=0}^{2n} {2n \choose k} z^k$. It follows that there are exactly ${2n \choose k}$ strands of length n that cost k dollars.


2014-06-17 11:57