For an updated version of these notes, see http://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf. For practical implementation of hash tables in C, see C/HashTables.
These are theoretical notes on hashing, based largely on MotwaniRaghavan §§8.4–8.5 (which is in turn based on the work of Carter and Wegman on universal hashing and of Fredman, Komlós, and Szemerédi on O(1) worst-case hashing).
1. Hashing: basics
- Want to store s elements from a universe M of size m in a table with indices drawn from an index space N of size n.
- Typically we assume M = [m] = {0,...,m-1} and N = [n] = {0,...,n-1}.
- If m ≤ n, just use an array.
Otherwise use a hash function h:M→N.
When m > n, we always get collisions = pairs (x,y) with h(x) = h(y).
- Collisions are handled using some secondary data structure in each bin.
Linked lists are typical (this is hashing with chaining).
- FKS (see below) uses hash tables as the secondary data structures.
- Choosing a random hash function from some good family of hash functions keeps collisions rare, because an adversary choosing which elements to include in the table can't predict which ones will collide.
Goal: Do each search in O(1+s/n) expected time, which for s ≤ n is O(1), much better than the Θ(log s) time for balanced trees or SkipLists.
This only works if we are working in a RAM (random-access machine) model, where we can access arbitrary memory locations in time O(1) and similarly compute arithmetic operations on O(log m)-bit values in time O(1). There is an argument that in reality any actual RAM machine requires either Ω(log N) time to read one of N memory locations (routing costs) or, if one is particularly pedantic, Ω(N^(1/3)) time (speed of light + finite volume for each location). We will ignore this argument.
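To make the setup concrete, here is a minimal sketch in C of a table with chained bins. The struct layout and the placeholder hash function are invented for this example (see C/HashTables for a real implementation); choosing a better hash function is the subject of the next section.

    #include <stdlib.h>

    /* one linked-list node per stored key */
    struct node {
        unsigned long key;
        struct node *next;
    };

    /* a table is n bins, each the head of a (possibly empty) list */
    struct table {
        size_t n;
        struct node **bins;
    };

    /* placeholder hash; a 2-universal choice is described in section 2 */
    static size_t hash(const struct table *t, unsigned long x) {
        return x % t->n;
    }

    /* expected O(1 + s/n) time: one hash plus a walk down one list */
    struct node *table_lookup(const struct table *t, unsigned long x) {
        for (struct node *p = t->bins[hash(t, x)]; p != NULL; p = p->next)
            if (p->key == x)
                return p;
        return NULL;
    }

    /* push the new key onto the front of its bin's list */
    int table_insert(struct table *t, unsigned long x) {
        struct node *p = malloc(sizeof *p);
        if (p == NULL) return -1;
        p->key = x;
        p->next = t->bins[hash(t, x)];
        t->bins[hash(t, x)] = p;
        return 0;
    }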
2. Universal hash families
A family of hash functions H is 2-universal if for any x≠y, Pr[h(x)=h(y)] ≤ 1/n for a uniformly random h∈H.
Strongly 2-universal if for any x1≠x2∈M and any y1,y2∈N, Pr[h(x1)=y1 ∧ h(x2)=y2] ≤ 1/n² for a uniformly random h∈H.
k-universal usually means strongly k-universal: given distinct x1...xk and any y1...yk, Pr[h(xi)=yi ∀i] ≤ 1/n^k. It is possible to generalize the weak version of 2-universality to get a weak version of k-universality (Pr[h(x1),...,h(xk) are all equal] ≤ 1/n^(k-1)), but this generalization is not as useful as strong k-universality.
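For a small universe, these bounds can be checked exhaustively. The C sketch below does this for the affine family ((ax+b) mod p) mod n analyzed later in this section, with toy parameters p=5 and n=2 chosen purely for illustration; it verifies the 2-universal bound (at most a 1/n fraction of the family collides on any pair x≠y) over every pair.

    #include <stdio.h>

    #define P 5   /* small prime, >= size of universe M = {0..P-1} */
    #define N 2   /* number of bins */

    /* h_ab(x) = ((a*x + b) mod P) mod N, with a in 1..P-1, b in 0..P-1 */
    static int h(int a, int b, int x) { return ((a * x + b) % P) % N; }

    int main(void) {
        int sizeH = (P - 1) * P;   /* |H| = p(p-1) members */
        for (int x = 0; x < P; x++)
            for (int y = 0; y < P; y++) {
                if (x == y) continue;
                int collisions = 0;   /* number of h in H with h(x)=h(y) */
                for (int a = 1; a < P; a++)
                    for (int b = 0; b < P; b++)
                        if (h(a, b, x) == h(a, b, y))
                            collisions++;
                /* 2-universal bound: collisions <= |H|/n */
                if (collisions * N > sizeH)
                    printf("bound violated at x=%d y=%d\n", x, y);
            }
        printf("checked all pairs against |H|/n = %d\n", sizeH / N);
        return 0;
    }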
- Let δ(x,y,h) = 1 if x≠y and h(x)=h(y), 0 otherwise.
Abuse of notation: δ(X,Y,H) = ∑_{x∈X, y∈Y, h∈H} δ(x,y,h), with e.g. δ(x,Y,h) = δ({x},Y,{h}).
- H is 2-universal ⇔ δ(x,y,H) ≤ |H|/n for all x≠y.
- If H is all functions M→N, we get equality; but we might do better since h might tend to map distinct values to distinct places.
- Lower bound: For any family H, ∃x,y s.t. δ(x,y,H) ≥ |H|/n - |H|/m
- Proof: Count collisions in the inverse image of each element z.
Have δ(h⁻¹(z), h⁻¹(z), h) = |h⁻¹(z)|⋅(|h⁻¹(z)|-1).
Sum over all z to get δ(M,M,h) = ∑z |h⁻¹(z)|⋅(|h⁻¹(z)|-1), since every colliding pair lands in the same inverse image.
Use convexity or Lagrange multipliers to argue that the RHS is minimized, subject to ∑z |h⁻¹(z)| = m, when all pre-images have the same size m/n.
This gives δ(M,M,h) ≥ n(m/n)(m/n-1) = m²(1/n - 1/m).
Sum over all h to get δ(M,M,H) ≥ |H| m²(1/n - 1/m).
PigeonholePrinciple says some pair x,y has δ(x,y,H) ≥ δ(M,M,H)/(m(m-1)) ≥ |H| m²(1/n - 1/m)/(m(m-1)) = (m/(m-1))⋅|H|(1/n - 1/m) ≥ |H|/n - |H|/m. (Note a bit of slop in dropping the factor m/(m-1) > 1.)
Since 1/m is usually ≪ 1/n, we are happy if we get the 2-universal upper bound of |H|/n.
Claim: h_ab(x) = ((ax + b) mod p) mod n, where a∈ℤp-{0}, b∈ℤp, and p is a prime ≥ m, is 2-universal.
- Proof: Again, count collisions.
Split h_ab(x) as g(f_ab(x)), where f_ab(x) = (ax+b) mod p and g(x) = x mod n.
Claim: If x≠y, then δ(x,y,H) = δ(ℤp,ℤp,g).
- Proof:
Let r≠s ∈ ℤp.
Then ax+b = r and ay+b = s have a unique solution (a,b) mod p (because ℤp is a FiniteField; note that r≠s forces a≠0).
⇒ f_ab gives a 1-1 map between pairs (a,b) and pairs (r,s) with r≠s
⇒ δ(x,y,H) = ∑_{r≠s} δ(r,s,g) = δ(ℤp,ℤp,g).
- Proof (continued): it remains to bound δ(ℤp,ℤp,g).
For fixed r, ⌈p/n⌉ is an upper bound on the number of elements in g⁻¹(g(r)); subtract one to get the number of s≠r that collide with r.
So total collisions δ(ℤp,ℤp,g) ≤ p(⌈p/n⌉-1) ≤ p(p-1)/n = |H|/n, using ⌈p/n⌉-1 ≤ (p-1)/n and |H| = p(p-1).
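Here is what drawing a random member of this family might look like in C. The specific prime (2³¹-1), the use of rand(), and the struct name are illustrative choices, not part of the construction.

    #include <stdint.h>
    #include <stdlib.h>

    /* The Mersenne prime 2^31 - 1; any prime >= m works.  Keys are
       assumed to come from a universe of size m <= p (e.g., 31-bit
       keys), so a*x + b below fits in 64 bits without overflow. */
    #define HASH_P ((uint64_t)0x7fffffff)

    struct hash_ab {
        uint64_t a;   /* 1 <= a <= p-1 */
        uint64_t b;   /* 0 <= b <= p-1 */
        uint64_t n;   /* number of bins */
    };

    /* draw a random member of the family; rand() is merely
       illustrative, not a serious source of randomness */
    void hash_ab_init(struct hash_ab *h, uint64_t n) {
        h->a = 1 + (uint64_t)rand() % (HASH_P - 1);
        h->b = (uint64_t)rand() % HASH_P;
        h->n = n;
    }

    /* h_ab(x) = ((a*x + b) mod p) mod n */
    uint64_t hash_ab(const struct hash_ab *h, uint64_t x) {
        return ((h->a * x + h->b) % HASH_P) % h->n;
    }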
- 2-universal hash family ⇒ hashing with chaining costs O(1+s/n) expected time per operation.
- Proof: E[cost of operation on key x] = O(1 + E[length of linked list at h(x)]) = O(1 + E[# of y∈S that collide with x]) = O(1 + s⋅max_{y≠x} Pr[h(y)=h(x)]) = O(1+s/n).
3. FKS hashing with O(s) space and O(1) worst-case search time
Goal is to hash a static set S so that we never pay more than constant time for search (not just in expectation).
A perfect hash function for a set S is a hash function h:M→N that is injective on S (x≠y ⇒ h(x)≠h(y) for x,y∈S).
Claim: If H is 2-universal and s² ≤ n, then there is a perfect h∈H for S.
- Proof: Usual collision-counting argument.
- For all x≠y, have δ(x,y,H) ≤ |H|/n.
- So δ(S,S,H) ≤ s(s-1) |H|/n.
PigeonholePrinciple says ∃h∈H such that δ(S,S,h) ≤ s(s-1)/n < 1.
δ(S,S,h) < 1 ⇒ δ(S,S,h) = 0 (since δ(S,S,h) is always an integer).
Practical variant: If s(s-1) ≤ n/α, then Pr[h is perfect for S] ≥ 1 - 1/α.
- Proof: Usual collision-counting argument plus Markov's inequality: E[δ(S,S,h)] ≤ s(s-1)/n ≤ 1/α, so Pr[δ(S,S,h) ≥ 1] ≤ 1/α.
This immediately gives O(1) search time with O(s²) space.
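A sketch of this in C: keep drawing random (a,b) until the resulting function is injective on S. The helper names are made up; with n ≥ 2s², by the practical variant above each attempt succeeds with probability at least 1/2, so the loop runs O(1) times in expectation.

    #include <stdint.h>
    #include <stdlib.h>

    #define HASH_P ((uint64_t)0x7fffffff)   /* prime 2^31 - 1, >= m */

    static uint64_t hash_ab(uint64_t a, uint64_t b, uint64_t n, uint64_t x) {
        return ((a * x + b) % HASH_P) % n;
    }

    /* keep drawing random (a,b) until h_ab is injective (perfect)
       on the s keys in S */
    void find_perfect_hash(const uint64_t *S, size_t s, uint64_t n,
                           uint64_t *a_out, uint64_t *b_out) {
        char *used = malloc(n);          /* used[i] = 1 if bin i taken */
        for (;;) {
            uint64_t a = 1 + (uint64_t)rand() % (HASH_P - 1);
            uint64_t b = (uint64_t)rand() % HASH_P;
            int perfect = 1;
            for (uint64_t i = 0; i < n; i++)
                used[i] = 0;
            for (size_t i = 0; i < s && perfect; i++) {
                uint64_t bin = hash_ab(a, b, n, S[i]);
                if (used[bin])
                    perfect = 0;         /* collision: resample (a,b) */
                used[bin] = 1;
            }
            if (perfect) {
                *a_out = a;
                *b_out = b;
                free(used);
                return;
            }
        }
    }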
- Next step: Hash to n = s bins, then rehash perfectly within each bin.
- Time cost is O(1) for outer hash + O(1) for inner (perfect) hash, giving O(1) total.
Space cost is ∑i O(si²), where si = |{x∈S | h(x) = i}| = |h⁻¹(i) ∩ S|.
Observe that ∑i si² = ∑i (si + si(si-1)) = s + δ(S,S,h).
- Since H is 2-universal, have δ(S,S,H) ≤ |H| s(s-1)/n.
PigeonholePrinciple says ∃h such that δ(S,S,h) ≤ (1/|H|) δ(S,S,H) ≤ s(s-1)/n = s-1 < s (recall n = s here).
- So space cost is really O(s) + O(s) = O(s).
- Same practical consideration as above applies: by making n = 2s (say) we get a constant probability of choosing a good h, so we keep trying until we get one.
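Putting the two levels together, a hypothetical C construction might look like the following. This is a sketch of the idea, not FKS's actual implementation: error handling is omitted, keys are assumed nonzero so 0 can mark an empty slot, each inner table gets si²+1 slots (the +1 avoids a mod-0 corner case for empty buckets), and retrying the outer hash until ∑ si² = O(s) is left out.

    #include <stdint.h>
    #include <stdlib.h>

    #define HASH_P ((uint64_t)0x7fffffff)   /* prime 2^31 - 1, >= m */

    /* h_ab(x) = ((a*x + b) mod p) mod n */
    static uint64_t hash_ab(uint64_t a, uint64_t b, uint64_t n, uint64_t x) {
        return ((a * x + b) % HASH_P) % n;
    }

    struct fks_bucket {
        uint64_t a, b;     /* inner hash, perfect on this bucket's keys */
        size_t size;       /* s_i */
        uint64_t *slots;   /* s_i^2 + 1 slots; 0 = empty */
    };

    struct fks_table {
        uint64_t a, b;     /* outer hash into s buckets */
        size_t s;
        struct fks_bucket *bucket;
    };

    /* build a static FKS table for the s (nonzero) keys in S */
    struct fks_table *fks_build(const uint64_t *S, size_t s) {
        struct fks_table *t = malloc(sizeof *t);
        t->s = s;
        t->a = 1 + (uint64_t)rand() % (HASH_P - 1);
        t->b = (uint64_t)rand() % HASH_P;
        t->bucket = calloc(s, sizeof *t->bucket);
        for (size_t i = 0; i < s; i++)        /* compute bucket sizes s_i */
            t->bucket[hash_ab(t->a, t->b, s, S[i])].size++;
        for (size_t i = 0; i < s; i++) {      /* perfect-hash each bucket */
            struct fks_bucket *bk = &t->bucket[i];
            size_t n2 = bk->size * bk->size + 1;
            bk->slots = calloc(n2, sizeof *bk->slots);
            for (int ok = 0; !ok; ) {         /* O(1) expected attempts */
                ok = 1;
                bk->a = 1 + (uint64_t)rand() % (HASH_P - 1);
                bk->b = (uint64_t)rand() % HASH_P;
                for (size_t j = 0; j < n2; j++) bk->slots[j] = 0;
                for (size_t j = 0; j < s && ok; j++) {
                    if (hash_ab(t->a, t->b, s, S[j]) != i) continue;
                    uint64_t slot = hash_ab(bk->a, bk->b, n2, S[j]);
                    if (bk->slots[slot]) ok = 0;   /* collision: retry */
                    else bk->slots[slot] = S[j];
                }
            }
        }
        return t;
    }

    /* O(1) worst-case search: one outer probe, one inner probe */
    int fks_lookup(const struct fks_table *t, uint64_t x) {
        const struct fks_bucket *bk =
            &t->bucket[hash_ab(t->a, t->b, t->s, x)];
        return bk->slots[hash_ab(bk->a, bk->b,
                                 bk->size * bk->size + 1, x)] == x;
    }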
4. Cuckoo hashing
Goal: Get O(1) search time in a dynamic hash table at the cost of a messy insertion procedure. In fact, each search takes only two reads, which can be done in parallel; this is optimal by a lower bound of Pagh (probe.pdf), which also shows a matching upper bound for static dictionaries. Cuckoo hashing extends this result to allow dynamic insertions.
We'll mostly be following the presentation in the original cuckoo hashing paper by Pagh and Rodler: cuckoo-jour.pdf.
4.1. Structure
We have two tables T1 and T2 of size r each, with separate, independent hash functions h1 and h2. These functions are assumed to be k-universal for some sufficiently large value k; as long as we never look at more than k values at once, this means we can treat them effectively as random functions. (In practice, using crummy hash functions seems to work just fine, a common property of hash tables.)
Every key x is stored either in T1[h1(x)] or T2[h2(x)]. So the search procedure just looks at both of these locations and returns whichever one contains x (or fails if neither contains x).
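In C, a lookup might look like this (a sketch: the tables, the choice of 0 as the empty marker, and the hash functions h1 and h2 are all assumed to be defined elsewhere):

    #include <stdint.h>

    extern uint64_t T1[], T2[];           /* the two tables, size r each */
    uint64_t h1(uint64_t x);              /* assumed k-universal hashes */
    uint64_t h2(uint64_t x);              /* with range {0..r-1} */

    /* two reads, independent of each other, so they can be issued in
       parallel; this is the whole search procedure */
    int cuckoo_lookup(uint64_t x) {
        return T1[h1(x)] == x || T2[h2(x)] == x;
    }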
To insert a value x1, we try to put it in T1[h1(x1)] or T2[h2(x1)]. If one of these locations is empty, we put it there and are done. Otherwise we have to kick out some value that is in the way (this is the cuckoo part of cuckoo hashing, named after the bird that lays its eggs in other birds' nests). So we let x2 = T1[h1(x1)], and put x1 in T1[h1(x1)]. We now have a new "nestless" value x2, which we swap with whatever is in T2[h2(x2)]. If that location was empty, we are done; otherwise, we get a new value x3 that we have to put in T1, and so on. The procedure terminates when we find an empty spot, or when enough iterations have passed that we don't expect to find one, in which case we rehash the entire table. The code from the Pagh-Rodler paper looks like this:
    procedure insert(x)
        if lookup(x) then return
        loop MaxLoop times
            x ↔ T1[h1(x)]
            if x = ⊥ then return
            x ↔ T2[h2(x)]
            if x = ⊥ then return
        end loop
        rehash()
        insert(x)
    end
A detail not included in the above code is that we always rehash (in theory) after r² insertions; this avoids potential problems with the hash functions used in the paper not being universal enough.
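A direct C translation of the pseudocode might look like this, under the same assumptions as the lookup sketch above (keys nonzero, 0 playing the role of the empty marker ⊥); rehash() is left as a stub that rebuilds the table with fresh hash functions:

    #include <stdint.h>

    #define MAXLOOP 100                   /* cutoff before rehashing */

    extern uint64_t T1[], T2[];           /* the two tables, size r each */
    uint64_t h1(uint64_t x);              /* assumed k-universal hashes */
    uint64_t h2(uint64_t x);              /* with range {0..r-1} */
    int cuckoo_lookup(uint64_t x);
    void rehash(void);                    /* rebuild with new h1, h2 */

    static void swap(uint64_t *a, uint64_t *b) {
        uint64_t tmp = *a; *a = *b; *b = tmp;
    }

    void cuckoo_insert(uint64_t x) {
        if (cuckoo_lookup(x)) return;
        for (int i = 0; i < MAXLOOP; i++) {
            swap(&x, &T1[h1(x)]);         /* x <-> T1[h1(x)] */
            if (x == 0) return;           /* displaced an empty slot */
            swap(&x, &T2[h2(x)]);         /* x <-> T2[h2(x)] */
            if (x == 0) return;
        }
        rehash();                         /* give up and start over */
        cuckoo_insert(x);
    }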
The main question is how long it takes the insertion procedure to terminate, assuming the table is not too full. Letting s = |S|, we will assume r/s ≥ 1+ε for some fixed ε.
First let's look at what happens during an insert if we have a lot of nestless values. We have a sequence of values x1, x2, ..., where each consecutive pair xi, x_{i+1} collides in h1 or h2. Assuming we don't reach the MaxLoop limit, there are three main possibilities (the cases below):
- Eventually we reach an empty position without seeing the same key twice.
- Eventually we see the same key twice; there is some i and j>i such that xj=xi. Since xi was already moved once, when we reach it the second time we will try to move it back, displacing x_{i-1}. This process continues until we have restored x2 to T1[h1(x1)], displacing x1 to T2[h2(x1)] and possibly creating a new sequence of nestless values. Two outcomes are now possible:
- Some x_l is moved to an empty location. We win!
- Some x_l is moved to a location we've already looked at. We lose! In this case we are playing musical chairs with more players than chairs, and have to rehash.
Let's look at the probability that we get the last case, a closed loop. Following Pagh-Rodler, we let v be the number of distinct nestless keys in the loop. We can now count how many different ways such a loop can form: there are at most v³ choices for i, j, and l, r^(v-1) choices of cells for the loop, and s^(v-1) choices for the non-x1 keys of the loop. We also have 2v edges, each of which occurs with probability r⁻¹, giving a total probability of v³⋅r^(v-1)⋅s^(v-1)⋅r^(-2v) = v³(s/r)^v/(rs). Summing this over all v ≥ 1 gives (1/(rs)) ∑ v³(s/r)^v = O(1/(rs)) (the series converges because s/r ≤ 1/(1+ε) < 1). Since the cost of hitting a closed loop is O(r) (we have to rehash), this adds only O(r)⋅O(1/(rs)) = O(1/s) = O(1) to the expected insertion complexity.
Now we look at what happens if we don't get a closed loop. It's a little messy to analyze the behavior of keys that appear more than once in the sequence, so the trick used in the paper is to observe that for any sequence of nestless keys x1...xp, there is a consecutive subsequence of length at least p/3 with no repetitions that starts with x1. Since there are only two subsequences that start with an occurrence of x1 (no key can show up more than twice), this will either be x1...x_{j-1} or x_{i+j-1}...xp (where x_{i+j-1} = x1), and a case analysis shows that at least one of these has length at least p/3. We can then argue that the probability of getting a sequence of v distinct keys starting with x1 in either T1 or T2 is at most 2(s/r)^(v-1) (since we have to hit a nonempty spot, with probability ≤ s/r, at each step, but there are two possible starting locations), which gives an expected insertion time bounded by ∑ 3v(s/r)^(v-1) = O(1).
Using a slightly more detailed analysis (see the paper), it can be shown that the bound for non-constant ε is O(1+1/ε).
5. Bloom filters
See MitzenmacherUpfal §5.5.3 for basics and a principled analysis or Bloom filter for many variations and the collective wisdom of the unwashed masses.