(For a more up-to-date version of these notes, see http://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf.)
In a data stream computation we are presented with a sequence of pairs (it, ct), where 1 ≤ it ≤ n is an index and ct is a count, and we want to maintain a small data structure, known as a sketch, that allows us to approximately answer statistical queries about the vector a given by ai = ∑_{t : it=i} ct. The size of the sketch should be polylogarithmic in the size of a and the length of the stream and polynomial in the inverse of the error bounds, and updating the sketch given a new data point should be cheap. The motivation is the existence of data sets that are too large to store at all (e.g., network traffic statistics), or too large to store in fast memory (e.g., very large database tables). By building a sketch we can make one pass through the data set but still answer queries after the fact, with some loss of accuracy.
The count-min sketch of Cormode and Muthukrishnan (see also Mitzenmacher and Upfal §13.4) gives approximations of ai, ∑_{i=l..r} ai, and a⋅b with various error bounds, and can be used for more complex tasks like finding "heavy hitters"—indices with high weight. The easiest case is approximating ai when all ct are non-negative, so we'll concentrate on that.
1. Structure
A bit like a Bloom filter made out of counters.
Build a d×w array c of counters, with width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, where ε is the error bound and δ is the probability of exceeding the error bound. Choose d hash functions h1, ..., hd, drawn independently from a pairwise-independent family of functions from {1..n} to {1..w}. Initialize c to all zeroes.
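Here is a minimal Python sketch of the structure, assuming indices are integers below the Mersenne prime 2^31 − 1; the class name CountMinSketch is my own, and the family h(x) = ((ax + b) mod p) mod w is a standard choice that is only approximately pairwise-independent once we reduce mod w:

    import math
    import random

    class CountMinSketch:
        P = 2**31 - 1  # prime modulus for the hash family

        def __init__(self, eps, delta):
            self.w = math.ceil(math.e / eps)         # width w = ceil(e/eps)
            self.d = math.ceil(math.log(1 / delta))  # depth d = ceil(ln(1/delta))
            # One random (a, b) pair per row: h_j(x) = ((a*x + b) mod P) mod w.
            self.ab = [(random.randrange(1, self.P), random.randrange(self.P))
                       for _ in range(self.d)]
            self.c = [[0] * self.w for _ in range(self.d)]  # counters start at zero

        def _h(self, j, x):
            # Row j's hash function, mapping an index to a column in 0..w-1.
            a, b = self.ab[j]
            return ((a * x + b) % self.P) % self.w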
2. Updates
Given an update (it, ct), increment c[j, hj(it)] by ct for each j = 1..d. (This is the count part of count-min.)
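Continuing the hypothetical class above, an update is one counter increment per row:

        def update(self, i, c):
            # Add c to the single counter in each row that i hashes to.
            for j in range(self.d):
                self.c[j][self._h(j, i)] += c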
3. Queries
3.1. Point queries
Here we want to estimate ai for some fixed i. There are two cases, depending on whether the increments are all non-negative, or arbitrary. In both cases we will get an estimate whose error is linear in both the error parameter ε and the L1-weight ‖a‖1 of a. It follows that the relative error will be low for heavy points, but we may get a large relative error for light points (and especially large for points that don't appear in the data set at all).
3.1.1. Non-negative case
To estimate ai, compute âi = minj c[j, hj(i)]. (This is the min part.) Then ai ≤ âi, and with probability at least 1−δ, âi ≤ ai + ε‖a‖1.
Proof: The lower bound is easy. Since each update (i, ct) with it = i adds ct to c[j, hj(i)] for every j, we have the invariant ai ≤ c[j, hj(i)] throughout the computation, and thus ai ≤ âi.
For the upper bound, let Iijk be the indicator for the event that (i≠k) ∧ (hj(i) = hj(k)), i.e., that we get a collision between i and k using hj. Pairwise independence of hj gives E[Iijk] = 1/w ≤ ε/e.
Now let Xij = ∑_{k=1..n} Iijk ak. Then c[j, hj(i)] = ai + Xij. (The fact that Xij ≥ 0 gives an alternate proof of the lower bound.) Now use linearity of expectation to get E[Xij] = E[∑k Iijk ak] = ∑k ak E[Iijk] ≤ ∑k ak (ε/e) = (ε/e)‖a‖1. So Pr[c[j, hj(i)] > ai + ε‖a‖1] = Pr[Xij > ε‖a‖1] ≤ Pr[Xij ≥ e·E[Xij]] ≤ 1/e, by Markov's inequality. With d choices for j, and each hash function chosen independently, the probability that every count is too big is at most (1/e)^d = exp(−d) ≤ exp(−ln(1/δ)) = δ.
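As a code sketch, continuing the hypothetical class from before, the query takes the minimum over the d counters that i hashes to:

        def point_query(self, i):
            # Non-negative case: every row overestimates a_i, so take the minimum.
            return min(self.c[j][self._h(j, i)] for j in range(self.d))

For example, after s = CountMinSketch(0.01, 0.05) and s.update(42, 5), s.point_query(42) returns at least 5, and exceeds 5 + ε‖a‖1 with probability at most δ.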
3.1.2. General case
If the increments might be negative, instead of using the min count we use the median count: âi = medianj c[j, hj(i)]. We again define the error term Xij as above, and observe that E[|Xij|] = E[|∑k Iijk ak|] ≤ ∑k |ak| E[Iijk] ≤ ∑k |ak| (ε/e) = (ε/e)‖a‖1. Using Markov's inequality, we get Pr[|Xij| > 3ε‖a‖1] ≤ Pr[|Xij| ≥ 3e·E[|Xij|]] ≤ 1/(3e) < 1/8. In order for the median to be off by more than 3ε‖a‖1, at least d/2 of these low-probability events must occur. Letting S be the number that occur, the expected number is at most μ = d/8, so applying the Chernoff bound we are looking at Pr[S ≥ (1+3)μ] ≤ (e^3/4^4)^{d/8} = (e^{3/8}/2)^d ≤ (e^{3/8}/2)^{ln(1/δ)} = δ^{ln 2 − 3/8} < δ^{1/4} (the actual exponent is about 0.31, but 1/4 is easier to deal with). It follows that
Pr[ai − 3ε‖a‖1 ≤ âi ≤ ai + 3ε‖a‖1] > 1 − δ^{1/4}.
One way to think about this is that getting an estimate within ε‖a‖1 of the right value with probability at least 1-δ requires 3 times the width and 4 times the depth—or 12 times the space and 4 times the time—when we aren't assuming increments are non-negative.
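A corresponding sketch for the general case, again continuing the hypothetical class; sorting and taking an upper median keeps the code self-contained (for even d, averaging the two middle values also works):

        def point_query_median(self, i):
            # General case: rows may over- or under-estimate a_i; take the median.
            vals = sorted(self.c[j][self._h(j, i)] for j in range(self.d))
            return vals[len(vals) // 2]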
3.2. Inner products
Here we want to estimate a⋅b, where a and b are both stored as count-min sketches. The paper concentrates on the case where a and b are both non-negative, which has applications in estimating the size of a join in a database. The method is to estimate a⋅b as minj ∑k ca[j,k]⋅cb[j,k].
For a single j, the sum consists of both good values and bad collisions; we have ∑k ca[j,k]⋅cb[j,k] = ∑_{i=1..n} ai bi + ∑_{p≠q : hj(p)=hj(q)} ap bq. The second term has expectation ∑_{p≠q} Pr[hj(p)=hj(q)] ap bq ≤ ∑_{p≠q} (ε/e) ap bq ≤ ∑_{p,q} (ε/e) ap bq = (ε/e)‖a‖1‖b‖1. So as in the point-query case we get probability at most 1/e that a single j gives a value that's more than ε‖a‖1‖b‖1 too high, so the probability that the min value is too high is at most (1/e)^d = exp(−d) ≤ δ.
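A sketch of this query, assuming both operands are instances of the hypothetical CountMinSketch class built with the same w, d, and hash-function coefficients (the row-by-row products are meaningless otherwise):

        def inner_product(self, other):
            # Row j contributes sum_k c_a[j,k] * c_b[j,k]; take the min over rows.
            return min(sum(x * y for x, y in zip(self.c[j], other.c[j]))
                       for j in range(self.d))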
3.3. Finding heavy hitters
Here we want to find the heaviest elements in the set: those indices i for which ai exceeds φ‖a‖1 for some constant threshold φ. The easy case is when increments are non-negative (for the general case, see the paper); it uses a method from a previous paper by Charikar, Chen, and Farach-Colton (http://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf). Instead of trying to find the elements after the fact, we extend the data structure and update procedure to track all the heavy elements found so far (stored in a heap), as well as ‖a‖1 = ∑ ct. When a new increment (i, c) comes in, we first update the count-min structure and then do a point query on ai; if âi ≥ φ‖a‖1, we insert i into the heap, and if not, we delete i along with any other value whose stored point-query estimate has dropped below threshold.
The trick here is that the threshold φ‖a‖1 only increases over time (remember that we are assuming non-negative increments). So if some element i is below threshold at time t, it can only go above threshold if it shows up again, and we have a probability of at least 1-δ of including it then.
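A minimal sketch of this tracking loop, using the hypothetical CountMinSketch class from earlier; representing the heap as lazily-pruned (estimate, index) pairs and deduplicating at the end are my own simplifications:

    import heapq

    def heavy_hitters(stream, phi, eps, delta):
        # stream yields (i, c) pairs with c >= 0; phi is the heaviness threshold.
        s = CountMinSketch(eps, delta)
        total = 0   # running ||a||_1 = sum of all counts seen so far
        heap = []   # (point-query estimate at insertion time, index) pairs
        for i, c in stream:
            s.update(i, c)
            total += c
            est = s.point_query(i)
            if est >= phi * total:
                heapq.heappush(heap, (est, i))
            # Evict entries whose stored estimate has dropped below the
            # threshold phi * total, which only grows over time.
            while heap and heap[0][0] < phi * total:
                heapq.heappop(heap)
        return {i for est, i in heap}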