1. Peaks and valleys
Suppose we flip a fair coin n times. For each i, 1≤i≤n, let Xi be the indicator of the event that the i-th coin-flip is heads. For each i, 2≤i≤n-1, let Yi be the indicator for the event that Xi-1=0, Xi=1, and Xi+1=0; if this even occurs we say that the sequence of coin-flips has a peak at i. Let be the random variable counting the number of peaks in the sequence.
- What is E[S]?
- What is Var[S]?
1.1. Solution
1. To compute E[S], we just need to compute E[Yi] for each i and sum the expectations. Each variable Yi is one provided we get a pattern THT in the three coins surrounding position i; since the coins are fair and independent, this event occurs with probability (1/2)3 = 1/8. Summing over all n-2 values of i gives E[S] = (n-2)/8.
2. To compute Var[S], we use the sum formula for variance, taking into account the fact that not all of the Yi are pairwise independent. First let's compute Var[Yi]. We have previously shown that E[Yi] = 1/8; since Yi2 = Yi this also means E[Yi2] = 1/8. So Var[Yi] = E[Yi2] - (E[Yi])2 = 1/8 - (1/8)2 = 7/64.
Now we need to compute Cov(Yi,Yj) for all i≠j. We'll consider several cases.
If j=i+1, then there is a sequence of 4 coin-flips starting at position i-1 such that Yi=1 iff the first three coins have the pattern THT and Yj=1 iff the last three coins have the pattern THT. But we can't realize both patterns simultaneously because they disagree on the middle two coins. So in this case we have E[YiYi+1] = 0, and Cov(Yi,Yi+1) = 0 - E[Yi]E[Yj] = -1/64.
If j=i+2, then there is a sequence of 5 coin-flips starting at position i-1 such that Yi=Yj=1 iff the coins come up THTHT. This occurs with probability (1/2)5 = 1/32, giving Cov(Yi,Yi+1) = 1/32 - 1/64 = 1/64.
If j>i+2, then there is no overlap between the Yi coins and the Yj coins. The random variables Yi and Yj are thus independent, and we have Cov(Yi,Yj) = 0.
The cases where j<i are symmetric.
Applying the sum formula gives:
2. Stronger than Markov
Let X1...Xn be the indicator variables for independent coin-flips with bias p; that is, each Xi is 1 with probability p and 0 with probability 1-p. Let .
Compute E[eS].
Use the expected value above and Markov's inequality to compute an upper bound on Pr[S≥m] = Pr[eS≥em].
- Can you find values of p, n, and m, for which this bound is smaller than the bound E[S]/m given by a direct application of Markov's inequality?
2.1. Solution
1. Since the Xi are all independent, so are the random variables . So we have
2. Markov's inequality gives Pr[eS≥em] ≤ E[eS]/em = (pe+1-p)n/em.
3. The bound is generally better when m is significantly greater than E[S] = pn. For example, if p=1/n and m=n, the new bound is (e/n+1-1/n)n/en = (1/n+1/e-1/(en))n; for large n this approaches (1/e)n, which is very small (though not as small as the exact probability (1/n)n). A straight application of Markov's inequality gives only n(1/n)/n = 1/n.
A similar case is when p=1/2 and m=n. Here the bound is ((e+1)/2)n/en ~= (0.6839...)n. The straight Markov's bound in this case is only 1/2; it doesn't even depend on n.
The main difference between applying Markov's inequality directly to S and applying it to eS is that in the latter case we need independence to compute eS. The technique of bounding E[eαS] for appropriate choices of α goes by the name of Chernoff bounds and is a standard method for proving bounds on the tails of the distribution of a random variable.
3. At the genome factory
A custom gene splicing shop charges $2 to splice a thymine (T) amino acid into a single-strand DNA fragment, and $1 each for adenine (A) and cytosine (C). Guanine (G) is free. So, for example, constructing the sequence GATTACA costs 0+1+2+2+1+1+1=8 dollars.
Write a generating function such that the zk coefficient gives the number of ways to buy a single amino acid for k dollars.
Write a generating function such that the zk coefficient gives the number of ways to buy a single DNA strand consisting of n amino acids for k dollars.
- Give a simple expression for the number of strands of length n that cost k dollars.
3.1. Solution
1+2z+z2.
(1+2z+z2)n.
The trick here is that (1+2z+z2) factors as (1+z)2. So (1+2z+z2)n = (1+z)2n. From the binomial theorem, this is equal to . It follows that there are exactly strands of length n that cost k dollars.