For an updated version of these notes, see http://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf. For practical implementation of hash tables in C, see C/HashTables.
These are theoretical notes on hashing, based largely on MotwaniRaghavan §§8.4–8.5 (which is in turn based on the work of Carter and Wegman on universal hashing and of Fredman, Komlós, and Szemerédi on O(1) worst-case hashing).
1. Hashing: basics
- Want to store s elements from a universe M of size m in a table with indices drawn from an index space N of size n.
- Typically we assume M = [m] = {0,...,m-1} and N = [n] = {0,...,n-1}.
- If m ≤ n, just use an array.
Otherwise use a hash function h:M→N.
When m > n, we always get collisions = pairs (x,y) with h(x) = h(y).
- Collisions are handled using some secondary data structure in each bin.
Linked lists are typical (this is hashing with chaining).
- FKS (see below) uses hash tables as the secondary data structures.
- Choosing a random hash function from some good family of hash functions keeps collisions rare, because an adversary choosing which elements to include in the table can't predict which ones will collide.
Goal: Do each search in O(1+s/n) expected time, which for s ≤ n is O(1), much better than the Θ(log s) time for balanced trees or SkipLists.
This only works if we are working in a RAM (random-access machine) model, where we can access arbitrary memory locations in time O(1) and similarly compute arithmetic operations on O(log m)-bit values in time O(1). There is an argument that in reality any actual RAM machine requires either Ω(log N) time to read one of N memory locations (routing costs) or, if one is particularly pedantic, Ω(N^(1/3)) time (speed of light + finite volume for each location). We will ignore this argument.
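To make the setup concrete, here is a minimal sketch in C of a table with chained bins. The struct layout and the placeholder hash function are invented for this example (see C/HashTables for a real implementation); choosing a better hash function is the subject of the next section.

    #include <stdlib.h>

    /* one linked-list node per stored key */
    struct node {
        unsigned long key;
        struct node *next;
    };

    /* a table is n bins, each the head of a (possibly empty) list */
    struct table {
        size_t n;
        struct node **bins;
    };

    /* placeholder hash; a 2-universal choice is described in section 2 */
    static size_t hash(const struct table *t, unsigned long x) {
        return x % t->n;
    }

    /* expected O(1 + s/n) time: one hash plus a walk down one list */
    struct node *table_lookup(const struct table *t, unsigned long x) {
        for (struct node *p = t->bins[hash(t, x)]; p != NULL; p = p->next)
            if (p->key == x)
                return p;
        return NULL;
    }

    /* push the new key onto the front of its bin's list */
    int table_insert(struct table *t, unsigned long x) {
        struct node *p = malloc(sizeof *p);
        if (p == NULL) return -1;
        p->key = x;
        p->next = t->bins[hash(t, x)];
        t->bins[hash(t, x)] = p;
        return 0;
    }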
2. Universal hash families
A family of hash functions H is 2-universal if for any x≠y, Pr[h(x)=h(y)] ≤ 1/n for a uniformly random h∈H.
Strongly 2-universal if for any x1≠x2∈M and any y1,y2∈N, Pr[h(x1)=y1 ∧ h(x2)=y2] ≤ 1/n² for a uniformly random h∈H.
k-universal usually means strongly k-universal: given distinct x1...xk and any y1...yk, Pr[h(xi)=yi ∀i] ≤ 1/n^k. It is possible to generalize the weak version of 2-universality to get a weak version of k-universality (Pr[h(x1),...,h(xk) are all equal] ≤ 1/n^(k-1)), but this generalization is not as useful as strong k-universality.
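For a small universe, these bounds can be checked exhaustively. The C sketch below does this for the affine family ((ax+b) mod p) mod n analyzed later in this section, with toy parameters p=5 and n=2 chosen purely for illustration; it verifies the 2-universal bound (at most a 1/n fraction of the family collides on any pair x≠y) over every pair.

    #include <stdio.h>

    #define P 5   /* small prime, >= size of universe M = {0..P-1} */
    #define N 2   /* number of bins */

    /* h_ab(x) = ((a*x + b) mod P) mod N, with a in 1..P-1, b in 0..P-1 */
    static int h(int a, int b, int x) { return ((a * x + b) % P) % N; }

    int main(void) {
        int sizeH = (P - 1) * P;   /* |H| = p(p-1) members */
        for (int x = 0; x < P; x++)
            for (int y = 0; y < P; y++) {
                if (x == y) continue;
                int collisions = 0;   /* number of h in H with h(x)=h(y) */
                for (int a = 1; a < P; a++)
                    for (int b = 0; b < P; b++)
                        if (h(a, b, x) == h(a, b, y))
                            collisions++;
                /* 2-universal bound: collisions <= |H|/n */
                if (collisions * N > sizeH)
                    printf("bound violated at x=%d y=%d\n", x, y);
            }
        printf("checked all pairs against |H|/n = %d\n", sizeH / N);
        return 0;
    }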
- Let δ(x,y,h) = 1 if x≠y and h(x)=h(y), 0 otherwise.
Abuse of notation: δ(X,Y,H) = ∑_{x∈X, y∈Y, h∈H} δ(x,y,h), with e.g. δ(x,Y,h) = δ({x},Y,{h}).
- H is 2-universal ⇔ δ(x,y,H) ≤ |H|/n for all x≠y.
- If H is all functions M→N, we get equality; but we might do better since h might tend to map distinct values to distinct places.
- Lower bound: For any family H, ∃x,y s.t. δ(x,y,H) ≥ |H|/n - |H|/m
- Proof: Count collisions in the inverse image of each element z.
Have δ(h⁻¹(z), h⁻¹(z), h) = |h⁻¹(z)|⋅(|h⁻¹(z)|-1).
Sum over all z to get δ(M,M,h) = ∑z |h⁻¹(z)|⋅(|h⁻¹(z)|-1), since every colliding pair lands in the same inverse image.
Use convexity or Lagrange multipliers to argue that the RHS is minimized, subject to ∑z |h⁻¹(z)| = m, when all pre-images have the same size m/n.
This gives δ(M,M,h) ≥ n(m/n)(m/n-1) = m²(1/n - 1/m).
Sum over all h to get δ(M,M,H) ≥ |H| m²(1/n - 1/m).
PigeonholePrinciple says some pair x,y has δ(x,y,H) ≥ δ(M,M,H)/(m(m-1)) ≥ |H| m²(1/n - 1/m)/(m(m-1)) = (m/(m-1))⋅|H|(1/n - 1/m) ≥ |H|/n - |H|/m. (Note a bit of slop in dropping the factor m/(m-1) > 1.)
Since 1/m is usually ≪ 1/n, we are happy if we get the 2-universal upper bound of |H|/n.
Claim: h_ab(x) = ((ax + b) mod p) mod n, where a∈ℤp-{0}, b∈ℤp, and p is a prime ≥ m, is 2-universal.
- Proof: Again, count collisions.
Split h_ab(x) as g(f_ab(x)), where f_ab(x) = (ax+b) mod p and g(x) = x mod n.
Claim: If x≠y, then δ(x,y,H) = δ(ℤp,ℤp,g).
- Proof:
Let r≠s ∈ ℤp.
Then ax+b = r and ay+b = s have a unique solution (a,b) mod p (because ℤp is a FiniteField; note that r≠s forces a≠0).
⇒ f_ab gives a 1-1 map between pairs (a,b) and pairs (r,s) with r≠s
⇒ δ(x,y,H) = ∑_{r≠s} δ(r,s,g) = δ(ℤp,ℤp,g).
- Proof (continued): it remains to bound δ(ℤp,ℤp,g).
For fixed r, ⌈p/n⌉ is an upper bound on the number of elements in g⁻¹(g(r)); subtract one to get the number of s≠r that collide with r.
So total collisions δ(ℤp,ℤp,g) ≤ p(⌈p/n⌉-1) ≤ p(p-1)/n = |H|/n, using ⌈p/n⌉-1 ≤ (p-1)/n and |H| = p(p-1).
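Here is what drawing a random member of this family might look like in C. The specific prime (2³¹-1), the use of rand(), and the struct name are illustrative choices, not part of the construction.

    #include <stdint.h>
    #include <stdlib.h>

    /* The Mersenne prime 2^31 - 1; any prime >= m works.  Keys are
       assumed to come from a universe of size m <= p (e.g., 31-bit
       keys), so a*x + b below fits in 64 bits without overflow. */
    #define HASH_P ((uint64_t)0x7fffffff)

    struct hash_ab {
        uint64_t a;   /* 1 <= a <= p-1 */
        uint64_t b;   /* 0 <= b <= p-1 */
        uint64_t n;   /* number of bins */
    };

    /* draw a random member of the family; rand() is merely
       illustrative, not a serious source of randomness */
    void hash_ab_init(struct hash_ab *h, uint64_t n) {
        h->a = 1 + (uint64_t)rand() % (HASH_P - 1);
        h->b = (uint64_t)rand() % HASH_P;
        h->n = n;
    }

    /* h_ab(x) = ((a*x + b) mod p) mod n */
    uint64_t hash_ab(const struct hash_ab *h, uint64_t x) {
        return ((h->a * x + h->b) % HASH_P) % h->n;
    }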
- 2-universal hash family ⇒ hashing with chaining costs O(1+s/n) expected time per operation.
- Proof: E[cost of operation on key x] = O(1 + E[length of linked list at h(x)]) = O(1 + E[# of y∈S that collide with x]) = O(1 + s⋅max_{y≠x} Pr[h(y)=h(x)]) = O(1+s/n).
3. FKS hashing with O(s) space and O(1) worst-case search time
Goal is to hash a static set S so that we never pay more than constant time for search (not just in expectation).
A perfect hash function for a set S is a hash function h:M→N that is injective on S (x≠y ⇒ h(x)≠h(y) for x,y∈S).
Claim: If H is 2-universal and s² ≤ n, then there is a perfect h∈H for S.
- Proof: Usual collision-counting argument.
- For all x≠y, have δ(x,y,H) ≤ |H|/n.
- So δ(S,S,H) ≤ s(s-1) |H|/n.
PigeonholePrinciple says ∃h∈H such that δ(S,S,h) ≤ s(s-1)/n < 1.
δ(S,S,h) < 1 ⇒ δ(S,S,h) = 0 (since δ(S,S,h) is always an integer).
Practical variant: If s(s-1) ≤ n/α, then Pr[h is perfect for S] ≥ 1 - 1/α.
- Proof: Usual collision-counting argument plus Markov's inequality: E[δ(S,S,h)] ≤ s(s-1)/n ≤ 1/α, so Pr[δ(S,S,h) ≥ 1] ≤ 1/α.
This immediately gives O(1) search time with O(s²) space.
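A sketch of this in C: keep drawing random (a,b) until the resulting function is injective on S. The helper names are made up; with n ≥ 2s², by the practical variant above each attempt succeeds with probability at least 1/2, so the loop runs O(1) times in expectation.

    #include <stdint.h>
    #include <stdlib.h>

    #define HASH_P ((uint64_t)0x7fffffff)   /* prime 2^31 - 1, >= m */

    static uint64_t hash_ab(uint64_t a, uint64_t b, uint64_t n, uint64_t x) {
        return ((a * x + b) % HASH_P) % n;
    }

    /* keep drawing random (a,b) until h_ab is injective (perfect)
       on the s keys in S */
    void find_perfect_hash(const uint64_t *S, size_t s, uint64_t n,
                           uint64_t *a_out, uint64_t *b_out) {
        char *used = malloc(n);          /* used[i] = 1 if bin i taken */
        for (;;) {
            uint64_t a = 1 + (uint64_t)rand() % (HASH_P - 1);
            uint64_t b = (uint64_t)rand() % HASH_P;
            int perfect = 1;
            for (uint64_t i = 0; i < n; i++)
                used[i] = 0;
            for (size_t i = 0; i < s && perfect; i++) {
                uint64_t bin = hash_ab(a, b, n, S[i]);
                if (used[bin])
                    perfect = 0;         /* collision: resample (a,b) */
                used[bin] = 1;
            }
            if (perfect) {
                *a_out = a;
                *b_out = b;
                free(used);
                return;
            }
        }
    }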
- Next step: Hash to n = s bins, then rehash perfectly within each bin.
- Time cost is O(1) for outer hash + O(1) for inner (perfect) hash, giving O(1) total.
Space cost is ∑i O(si²), where si = |{x∈S | h(x) = i}| = |h⁻¹(i) ∩ S|.
Observe that ∑i si² = ∑i (si + si(si-1)) = s + δ(S,S,h).
- Since H is 2-universal, have δ(S,S,H) ≤ |H| s(s-1)/n.
PigeonholePrinciple says ∃h such that δ(S,S,h) ≤ (1/|H|) δ(S,S,H) ≤ s(s-1)/n = s-1 < s (recall n = s here).
- So space cost is really O(s) + O(s) = O(s).
- Same practical consideration as above applies: by making n = 2s (say) we get a constant probability of choosing a good h, so we keep trying until we get one.
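Putting the two levels together, a hypothetical C construction might look like the following. This is a sketch of the idea, not FKS's actual implementation: error handling is omitted, keys are assumed nonzero so 0 can mark an empty slot, each inner table gets si²+1 slots (the +1 avoids a mod-0 corner case for empty buckets), and retrying the outer hash until ∑ si² = O(s) is left out.

    #include <stdint.h>
    #include <stdlib.h>

    #define HASH_P ((uint64_t)0x7fffffff)   /* prime 2^31 - 1, >= m */

    /* h_ab(x) = ((a*x + b) mod p) mod n */
    static uint64_t hash_ab(uint64_t a, uint64_t b, uint64_t n, uint64_t x) {
        return ((a * x + b) % HASH_P) % n;
    }

    struct fks_bucket {
        uint64_t a, b;     /* inner hash, perfect on this bucket's keys */
        size_t size;       /* s_i */
        uint64_t *slots;   /* s_i^2 + 1 slots; 0 = empty */
    };

    struct fks_table {
        uint64_t a, b;     /* outer hash into s buckets */
        size_t s;
        struct fks_bucket *bucket;
    };

    /* build a static FKS table for the s (nonzero) keys in S */
    struct fks_table *fks_build(const uint64_t *S, size_t s) {
        struct fks_table *t = malloc(sizeof *t);
        t->s = s;
        t->a = 1 + (uint64_t)rand() % (HASH_P - 1);
        t->b = (uint64_t)rand() % HASH_P;
        t->bucket = calloc(s, sizeof *t->bucket);
        for (size_t i = 0; i < s; i++)        /* compute bucket sizes s_i */
            t->bucket[hash_ab(t->a, t->b, s, S[i])].size++;
        for (size_t i = 0; i < s; i++) {      /* perfect-hash each bucket */
            struct fks_bucket *bk = &t->bucket[i];
            size_t n2 = bk->size * bk->size + 1;
            bk->slots = calloc(n2, sizeof *bk->slots);
            for (int ok = 0; !ok; ) {         /* O(1) expected attempts */
                ok = 1;
                bk->a = 1 + (uint64_t)rand() % (HASH_P - 1);
                bk->b = (uint64_t)rand() % HASH_P;
                for (size_t j = 0; j < n2; j++) bk->slots[j] = 0;
                for (size_t j = 0; j < s && ok; j++) {
                    if (hash_ab(t->a, t->b, s, S[j]) != i) continue;
                    uint64_t slot = hash_ab(bk->a, bk->b, n2, S[j]);
                    if (bk->slots[slot]) ok = 0;   /* collision: retry */
                    else bk->slots[slot] = S[j];
                }
            }
        }
        return t;
    }

    /* O(1) worst-case search: one outer probe, one inner probe */
    int fks_lookup(const struct fks_table *t, uint64_t x) {
        const struct fks_bucket *bk =
            &t->bucket[hash_ab(t->a, t->b, t->s, x)];
        return bk->slots[hash_ab(bk->a, bk->b,
                                 bk->size * bk->size + 1, x)] == x;
    }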
4. Cuckoo hashing
Goal: Get O(1) search time in a dynamic hash table at the cost of a messy insertion procedure. In fact, each search takes only two reads, which can be done in parallel; this is optimal by a lower bound of Pagh (probe.pdf), which also shows a matching upper bound for static dictionaries. Cuckoo hashing extends this result to allow dynamic insertions.
We'll mostly be following the presentation in the original cuckoo hashing paper by Pagh and Rodler: cuckoo-jour.pdf.
4.1. Structure
We have two tables T1 and T2 of size r each, with separate, independent hash functions h1 and h2. These functions are assumed to be k-universal for some sufficiently large value k; as long as we never look at more than k values at once, this means we can treat them effectively as random functions. (In practice, using crummy hash functions seems to work just fine, a common property of hash tables.)
Every key x is stored either in T1[h1(x)] or T2[h2(x)]. So the search procedure just looks at both of these locations and returns whichever one contains x (or fails if neither contains x).
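In C, a lookup might look like this (a sketch: the tables, the choice of 0 as the empty marker, and the hash functions h1 and h2 are all assumed to be defined elsewhere):

    #include <stdint.h>

    extern uint64_t T1[], T2[];           /* the two tables, size r each */
    uint64_t h1(uint64_t x);              /* assumed k-universal hashes */
    uint64_t h2(uint64_t x);              /* with range {0..r-1} */

    /* two reads, independent of each other, so they can be issued in
       parallel; this is the whole search procedure */
    int cuckoo_lookup(uint64_t x) {
        return T1[h1(x)] == x || T2[h2(x)] == x;
    }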
To insert a value x1, we try to put it in T1[h1(x1)] or T2[h2(x1)]. If one of these locations is empty, we put it there and are done. Otherwise we have to kick out some value that is in the way (this is the cuckoo part of cuckoo hashing, named after the bird that lays its eggs in other birds' nests). So we let x2 = T1[h1(x1)], and put x1 in T1[h1(x1)]. We now have a new "nestless" value x2, which we swap with whatever is in T2[h2(x2)]. If that location was empty, we are done; otherwise, we get a new value x3 that we have to put in T1, and so on. The procedure terminates when we find an empty spot, or when enough iterations have passed that we don't expect to find one, in which case we rehash the entire table. The code from the Pagh-Rodler paper looks like this:
    procedure insert(x)
        if lookup(x) then return
        loop MaxLoop times
            x ↔ T1[h1(x)]
            if x = ⊥ then return
            x ↔ T2[h2(x)]
            if x = ⊥ then return
        end loop
        rehash()
        insert(x)
    end
A detail not included in the above code is that we always rehash (in theory) after r² insertions; this avoids potential problems with the hash functions used in the paper not being universal enough.
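A direct C translation of the pseudocode might look like this, under the same assumptions as the lookup sketch above (keys nonzero, 0 playing the role of the empty marker ⊥); rehash() is left as a stub that rebuilds the table with fresh hash functions:

    #include <stdint.h>

    #define MAXLOOP 100                   /* cutoff before rehashing */

    extern uint64_t T1[], T2[];           /* the two tables, size r each */
    uint64_t h1(uint64_t x);              /* assumed k-universal hashes */
    uint64_t h2(uint64_t x);              /* with range {0..r-1} */
    int cuckoo_lookup(uint64_t x);
    void rehash(void);                    /* rebuild with new h1, h2 */

    static void swap(uint64_t *a, uint64_t *b) {
        uint64_t tmp = *a; *a = *b; *b = tmp;
    }

    void cuckoo_insert(uint64_t x) {
        if (cuckoo_lookup(x)) return;
        for (int i = 0; i < MAXLOOP; i++) {
            swap(&x, &T1[h1(x)]);         /* x <-> T1[h1(x)] */
            if (x == 0) return;           /* displaced an empty slot */
            swap(&x, &T2[h2(x)]);         /* x <-> T2[h2(x)] */
            if (x == 0) return;
        }
        rehash();                         /* give up and start over */
        cuckoo_insert(x);
    }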
The main question is how long it takes the insertion procedure to terminate, assuming the table is not too full. Letting s = |S|, we will assume r/s ≥ 1+ε for some fixed ε.
First let's look at what happens during an insert if we have a lot of nestless values. We have a sequence of values x1, x2, ..., where each consecutive pair xi, x_{i+1} collides in h1 or h2. Assuming we don't reach the MaxLoop limit, there are three main possibilities (the cases below):
- Eventually we reach an empty position without seeing the same key twice.
- Eventually we see the same key twice; there is some i and j>i such that xj=xi. Since xi was already moved once, when we reach it the second time we will try to move it back, displacing x_{i-1}. This process continues until we have restored x2 to T1[h1(x1)], displacing x1 to T2[h2(x1)] and possibly creating a new sequence of nestless values. Two outcomes are now possible:
- Some x_l is moved to an empty location. We win!
- Some x_l is moved to a location we've already looked at. We lose! In this case we are playing musical chairs with more players than chairs, and have to rehash.
Let's look at the probability that we get the last case, a closed loop. Following Pagh-Rodler, we let v be the number of distinct nestless keys in the loop. We can now count how many different ways such a loop can form: there are at most v³ choices for i, j, and l, r^(v-1) choices of cells for the loop, and s^(v-1) choices for the non-x1 keys of the loop. We also have 2v edges, each of which occurs with probability r⁻¹, giving a total probability of v³⋅r^(v-1)⋅s^(v-1)⋅r^(-2v) = v³(s/r)^v/(rs). Summing this over all v ≥ 1 gives (1/(rs)) ∑ v³(s/r)^v = O(1/(rs)) (the series converges because s/r ≤ 1/(1+ε) < 1). Since the cost of hitting a closed loop is O(r) (we have to rehash), this adds only O(r)⋅O(1/(rs)) = O(1/s) = O(1) to the expected insertion complexity.
Now we look at what happens if we don't get a closed loop. It's a little messy to analyze the behavior of keys that appear more than once in the sequence, so the trick used in the paper is to observe that for any sequence of nestless keys x1...xp, there is a consecutive subsequence of length at least p/3 with no repetitions that starts with x1. Since there are only two subsequences that start with an occurrence of x1 (no key can show up more than twice), this will either be x1...x_{j-1} or x_{i+j-1}...xp (where x_{i+j-1} = x1), and a case analysis shows that at least one of these has length at least p/3. We can then argue that the probability of getting a sequence of v distinct keys starting with x1 in either T1 or T2 is at most 2(s/r)^(v-1) (since we have to hit a nonempty spot, with probability ≤ s/r, at each step, but there are two possible starting locations), which gives an expected insertion time bounded by ∑ 3v(s/r)^(v-1) = O(1).
Using a slightly more detailed analysis (see the paper), it can be shown that the bound for non-constant ε is O(1+1/ε).
5. Bloom filters
See MitzenmacherUpfal §5.5.3 for basics and a principled analysis or Bloom filter for many variations and the collective wisdom of the unwashed masses.