[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

For this assignment, you are to write a procedure findCollision that searches for collisions in a hash function: distinct values that produce the same result. The declaration of findCollision is given in the file findCollision.h:

   1 /* finds values x1 != x2 such that f(x1) & 0x7fffffff == f(x2) & 0x7fffffff */
   2 void findCollision(unsigned int (*f)(unsigned int), unsigned int *x1, unsigned int *x2);
findCollision.h

The first argument f gives the function to test. The remaining arguments x1 and x2 are for returning the collision values. To enforce that collisions exist, only the bottom 31 bits of the output of f are used. A collision is thus a pair of distinct values x1 and x2 such that f(x1) & 0x7fffffff and f(x2) & 0x7fffffff are equal.

The fact that f takes 32-bit inputs and we only care about 31 bits of the output mean that there are going to be a lot of collisions: on average, two input values map to each output value. But finding them efficiently may be tricky. A direct approach of feeding f all 232 possible inputs may go through 231+1 of them before finding the first duplicate output. This is not only going to require a lot of memory to store the past values, it is also going to take a lot of time if f is slow to compute. Instead, you may find it helpful to use a randomized birthday attack.

1. Exploiting the Birthday Paradox

The Birthday Paradox is the surprising fact that, even though there are 366 possible birthdays, a group of 23 people has a 50% chance of having two people with the same birthday, and the probability of two people with the same birthday rises rapidly as the group gets larger. The reason for this is that even though each pair of persons has a probability of only 1/366 or thereabouts1 of sharing a birthday, the number of pairs of people in a group of n people grows as O(n2). So the number of people needed to get a least one pair with the same birthday on average is roughly the square root of the number of possible birthdays.

For 231 possible function values, if we generate inputs at random, we expect to find our first collision after something in the neighborhood of 216 inputs. This is a small enough number that it is feasible both to make 216 calls to f, even if f is moderately expensive, and to store the results in a data structure that lets us detect collisions easily.

2. What you need to write

Your job is to supply the findCollision procedure in a file findCollision.c. The declaration for findCollision is given in findCollision.h above; you should include this file in your findCollision.c file to ensure consistency. We have also provided a simple test program testCollision.c and a Makefile. These can be found in /c/cs223/Hwk8/template on the Zoo if you don't want to download them from here.

Your findCollision.c file should not export any global variables except findCollision. If you need to write any other functions, make them static.

You should not assume that f returns a value in the range 0..0x7fffffff; you may need to mask off the top bit of f's value yourself.

It may be that f is very expensive. You should try to avoid calling it more than you have to.

3. Submitting your assignment

Submit your file findCollision.c using /c/cs223/bin/submit 8 findCollision.c as usual. You can run /c/cs223/bin/testit 8 public to check your submitted file using the public test script in /c/cs223/Hwk8/test.public, which may or may not resemble the final test script.

4. Sample solution

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 
   4 #include "findCollision.h"
   5 
   6 /* a convenient prime */
   7 #define HASH_SIZE ((1<<17)+29)
   8 
   9 /* return a rand 32-bit int */
  10 /* calls rand() twice to get around RAND_MAX limit */
  11 static unsigned int
  12 urand(void)
  13 {
  14     return (rand() << 16) ^ rand();
  15 }
  16 
  17 /* output of f is masked by this */
  18 #define MASK (0x7fffffff)
  19 
  20 /* finds values x1 != x2 such that f(x1) & MASK == f(x2) & MASK */
  21 void
  22 findCollision(unsigned int (*f)(unsigned int), unsigned int *x1, unsigned int *x2)
  23 {
  24     unsigned int input;
  25     unsigned int output;
  26     unsigned int hash;
  27     unsigned int multiplier;  /* for hash function */
  28 
  29     /* make this static so we don't blow out the stack */
  30     static struct {
  31         int full;             /* this slot is used */
  32         unsigned int input;
  33         unsigned int output;
  34     } table[HASH_SIZE];
  35 
  36     int i;
  37 
  38     multiplier = 2*rand()+1;
  39 
  40     /* clear the table */
  41     for(i = 0; i < HASH_SIZE; i++) {
  42         table[i].full = 0;
  43     }
  44 
  45     /* keep going until we win */
  46     for(;;) {
  47         input = urand();
  48         output = f(input) & MASK;
  49 
  50         hash = (output * multiplier) % HASH_SIZE;
  51 
  52         if(table[hash].full && table[hash].input != input && table[hash].output == output) {
  53             *x1 = input;
  54             *x2 = table[hash].input;
  55 
  56             /* let's be safe about what we are claiming */
  57             assert(*x1 != *x2);
  58             assert((f(*x1) & MASK) == (f(*x2) & MASK));
  59 
  60             return;
  61         }
  62 
  63         /* else */
  64         table[hash].full = 1;
  65         table[hash].input = input;
  66         table[hash].output = output;
  67     }
  68 }
findCollision.c
  1. The probability is higher than this, both because most years have only 365 days, and because some birthdays are more likely than others. (1)


2014-06-17 11:58