KernighanPike Chapter 7 gives an excellent overview of performance tuning. This page will be limited to some Linux-specific details and an example.
1. Timing under Linux
Use time, e.g.
$ time wc /usr/share/dict/words
  45378  45378 408865 /usr/share/dict/words

real    0m0.010s
user    0m0.006s
sys     0m0.004s
This measures "real time" (what it sounds like: elapsed wall-clock time), "user time" (the CPU time spent running your program's own code), and "system time" (the CPU time the operating system spends supporting your program, e.g. by loading it from disk and doing I/O). Real time need not equal the sum of user time and system time, since the operating system may be simultaneously running other programs.
Particularly for fast programs, times can vary from one execution to the next, e.g.
$ time wc /usr/share/dict/words
  45378  45378 408865 /usr/share/dict/words

real    0m0.009s
user    0m0.008s
sys     0m0.001s
$ time wc /usr/share/dict/words
  45378  45378 408865 /usr/share/dict/words

real    0m0.009s
user    0m0.007s
sys     0m0.002s
This arises because of measurement error and variation in how long individual operations take, but usually the variation is small.
Note also that time is often a builtin operation of your shell, so the output format may vary depending on what shell you use.
2. Profiling with gprof
The problem with time is that it only tells you how much time your whole program took, not where it spent that time. This is like running a program without a debugger: you can't see what's happening inside. If you want to see where your program is spending its time, you need to use a profiler.
For example, here's a short but slow program for calculating the number of primes less than some limit passed as argv[1]:
#include <stdio.h>
#include <stdlib.h>

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;

    if(n < 2) return 0;
    /* else */
    for(factor = 2; factor < n; factor++) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}

/* return number of primes < n */
int
countprimes(int n)
{
    int i;
    int count;

    count = 0;

    for(i = 0; i < n; i++) {
        if(isprime(i)) count++;
    }

    return count;
}

int
main(int argc, char **argv)
{
    printf("%d\n", countprimes(atoi(argv[1])));

    return 0;
}
And now we'll time countprimes 40000:
$ time ./countprimes 40000
4203
./countprimes 40000  3.99s user 0.01s system 88% cpu 4.544 total
This is not too bad, but the reason I picked 40000 was that the program took too long on larger inputs. We'd like to see why it is taking so long, to get some idea of what to try to speed up. So we'll compile it with the -pg option to gcc, which inserts profiling code that counts how many times each function is called and how long (on average) each call takes.
$ gcc -o countprimes -pg countprimes.c
$ time ./countprimes 40000
4203
./countprimes 40000  4.01s user 0.01s system 94% cpu 4.260 total
Hooray! We've made the program slightly slower (ignore the total time at the end---that's wall-clock time, and it only dropped because we got more of the CPU the second time we ran it). But we also just produced a file gmon.out that we can read with gprof:
$ gprof countprimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 99.95     40.18    40.18    40000     0.00     0.00  isprime
  0.05     40.20     0.02        1     0.02    40.20  countprimes
[much explanatory text deleted]
It looks like we are spending all of our time in isprime. Can we speed isprime up?
What isprime is doing is testing whether a number n is prime in the most direct way possible: dividing by every number less than n until it finds a factor. That's a lot of divisions: if n is indeed prime, the number of trial divisions is linear in n. Since division is a relatively expensive operation, the first thing to try is getting rid of some of them.
Here's a revised version of isprime:
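A sketch of one way to write this (the exact details here are an assumption rather than the original listing; note that 2 has to be handled as a special case so that it still counts as prime):

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;

    if(n < 2) return 0;
    if(n == 2) return 1;        /* 2 is the only even prime */
    if(n % 2 == 0) return 0;    /* every other even number is composite */
    /* else: an odd number can only have odd factors */
    for(factor = 3; factor < n; factor += 2) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}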
The trick is to check first if n is divisible by 2, and only test odd potential factors thereafter. Let's see how well this works:
$ gcc -o countprimes -pg countprimes.c
$ time ./countprimes 40000
4202
./countprimes 40000  2.01s user 0.01s system 96% cpu 2.095 total
$ gprof countprimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
100.00     20.09    20.09    40000     0.00     0.00  isprime
  0.00     20.09     0.00        1     0.00    20.09  countprimes
[...]
Twice as fast! And the answer is still the same, too---this is important.
Can we test even fewer factors? Suppose n has a non-trivial factor x. Then n equals x*y for some y which is also nontrivial. One of x or y will be no bigger than the square root of n. So perhaps we can stop when we reach the square root of n, which we can compute using the math library routine sqrt. Let's try it:
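A sketch of this version (again, the details are an assumption; the new bound appears directly in the loop test):

#include <math.h>               /* for sqrt; goes at the top of the file */

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;

    if(n < 2) return 0;
    if(n == 2) return 1;
    if(n % 2 == 0) return 0;
    /* only need to look for factors up to sqrt(n) */
    for(factor = 3; factor < sqrt(n) + 1; factor += 2) {
        if(n % factor == 0) return 0;
    }
    return 1;
}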
I added +1 to the return value of sqrt because the output of sqrt is not exact, and it would be embarrassing if I announced that 25 was prime because I stopped at 4.9999999997.
Using the math library not only requires including <math.h> but also requires compiling with the -lm flag after all .c or .o files, to link in the library routines:
$ gcc -o countprimes -pg countprimes.c -lm
$ time ./countprimes 40000
4202
./countprimes 40000  0.09s user 0.00s system 98% cpu 0.091 total
$ gprof countprimes | head -20
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 96.88      0.31     0.31    40000     0.01     0.01  isprime
  3.12      0.32     0.01        1    10.00   320.00  countprimes
[...]
Whoosh.
Can we optimize further? Let's see what happens on a bigger input:
$ time ./countprimes 1000000
78497
./countprimes 1000000  7.01s user 0.02s system 92% cpu 7.590 total
$ gprof countprimes | head -8
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 99.32     24.96    24.96  1000000     0.00     0.00  isprime
  0.68     25.13     0.17        1     0.17    25.13  countprimes
This is still very good. Can we do better?
Maybe moving the sqrt out of the loop in isprime will make a difference:
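A sketch of that variant (details assumed as before), with the bound computed once before the loop:

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;
    double limit;

    if(n < 2) return 0;
    if(n == 2) return 1;
    if(n % 2 == 0) return 0;

    limit = sqrt(n) + 1;        /* compute the bound once, outside the loop */

    for(factor = 3; factor < limit; factor += 2) {
        if(n % factor == 0) return 0;
    }
    return 1;
}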
$ gcc -o countprimes -pg countprimes.c -lm
$ time ./countprimes 1000000
78497
./countprimes 1000000  7.01s user 0.01s system 94% cpu 7.419 total
Nope. It looks like the compiler was smart enough to do this already.
What if we get rid of the call to sqrt and test if factor * factor <= n instead? This way we could dump the math library:
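A sketch of this version (same caveat about the details being assumed):

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;

    if(n < 2) return 0;
    if(n == 2) return 1;
    if(n % 2 == 0) return 0;
    /* stop once factor exceeds sqrt(n), without ever calling sqrt;
       note that factor * factor could overflow for very large n */
    for(factor = 3; factor * factor <= n; factor += 2) {
        if(n % factor == 0) return 0;
    }
    return 1;
}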
$ gcc -o countprimes -pg countprimes.c
$ time ./countprimes 1000000
78497
./countprimes 1000000  1.85s user 0.01s system 98% cpu 1.897 total
Hm, looks like it's about 4 times faster this way. This is surprising, since we seem to be doing a lot of multiplications, but perhaps sqrt is slow enough that it's worth it.
At this point we could decide that countprimes is fast enough, or maybe we could look for further improvements (e.g. testing out many small primes at the beginning instead of just 2, calling isprime only on odd values of i, or reading a computational number theory textbook to find out how we ought to be doing this). A reasonable strategy for code for your own use is often to start running one version and make improvements on a separate copy while it's running. If the first version terminates before you are done writing new code, it's probably fast enough.
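For instance, here is a sketch of the second idea, calling isprime only on odd values of i (a hypothetical variant, not one of the versions timed above):

/* return number of primes < n */
int
countprimes(int n)
{
    int i;
    int count;

    count = 0;

    if(n > 2) count++;          /* 2 is the only even prime; count it by hand */

    for(i = 3; i < n; i += 2) { /* only odd candidates from here on */
        if(isprime(i)) count++;
    }

    return count;
}

This halves the number of calls to isprime, although each remaining call does the same work as before.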