KernighanPike Chapter 7 gives an excellent overview of performance tuning. This page will be limited to some Linux-specific details and an example.
1. Timing under Linux
Use time, e.g.
$ time wc /usr/share/dict/words
  45378  45378 408865 /usr/share/dict/words

real    0m0.010s
user    0m0.006s
sys     0m0.004s
This measures "real time" (what it sounds like: elapsed wall-clock time), "user time" (the CPU time spent running your program's own code), and "system time" (the CPU time the operating system spends supporting your program, e.g. by loading it from disk and doing I/O). Real time need not equal the sum of user time and system time, since the operating system may be simultaneously running other programs.
Particularly for fast programs, times can vary from one execution to the next, e.g.
$ time wc /usr/share/dict/words
  45378  45378 408865 /usr/share/dict/words

real    0m0.009s
user    0m0.008s
sys     0m0.001s
$ time wc /usr/share/dict/words
  45378  45378 408865 /usr/share/dict/words

real    0m0.009s
user    0m0.007s
sys     0m0.002s
This arises because of measurement error and variation in how long individual operations take, but usually the variation is small.
Note also that time is often a builtin operation of your shell, so the output format may vary depending on what shell you use.
2. Profiling with gprof
The problem with time is that it only tells you how much time your whole program took, not where it spent that time. This is like running a program without a debugger: you can't see what's happening inside. If you want to see where your program is spending its time, you need to use a profiler.
For example, here's a short but slow program for calculating the number of primes less than some limit passed as argv[1]:
#include <stdio.h>
#include <stdlib.h>

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;

    if(n < 2) return 0;
    /* else */
    for(factor = 2; factor < n; factor++) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}

/* return number of primes < n */
int
countprimes(int n)
{
    int i;
    int count;

    count = 0;

    for(i = 0; i < n; i++) {
        if(isprime(i)) count++;
    }

    return count;
}

int
main(int argc, char **argv)
{
    printf("%d\n", countprimes(atoi(argv[1])));

    return 0;
}
And now we'll time countprimes 40000:
$ time ./countprimes 40000
4203
./countprimes 40000  3.99s user 0.01s system 88% cpu 4.544 total
This is not too bad, but the reason I picked 40000 was that the program took too long on larger inputs. We'd like to see why it is taking so long, to get some idea of what to try to speed up. So we'll compile it with the -pg option to gcc, which inserts profiling code that counts how many times each function is called and how long (on average) each call takes.
$ gcc -o countprimes -pg countprimes.c
$ time ./countprimes 40000
4203
./countprimes 40000  4.01s user 0.01s system 94% cpu 4.260 total
Hooray! We've made the program slightly slower (ignore the total time at the end---that's wall-clock time, and it only dropped because we got more of the CPU the second time we ran it). But we also just produced a file gmon.out that we can read with gprof:
$ gprof countprimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 99.95     40.18    40.18    40000     0.00     0.00  isprime
  0.05     40.20     0.02        1     0.02    40.20  countprimes
[much explanatory text deleted]
It looks like we are spending all of our time in isprime. Can we speed isprime up?
What isprime is doing is testing whether a number n is prime in the most direct way possible: dividing by every number less than n until it finds a factor. That's a lot of divisions: if n is indeed prime, the number of trial divisions is linear in n. Since division is a relatively expensive operation, the first thing to try is getting rid of some of them.
Here's a revised version of isprime:
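A sketch of one way to write this (the exact details here are an assumption rather than the original listing; note that 2 has to be handled as a special case so that it still counts as prime):

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;

    if(n < 2) return 0;
    if(n == 2) return 1;        /* 2 is the only even prime */
    if(n % 2 == 0) return 0;    /* every other even number is composite */
    /* else: an odd number can only have odd factors */
    for(factor = 3; factor < n; factor += 2) {
        if(n % factor == 0) return 0;
    }
    /* else */
    return 1;
}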
The trick is to check first if n is divisible by 2, and only test odd potential factors thereafter. Let's see how well this works:
$ gcc -o countprimes -pg countprimes.c
$ time ./countprimes 40000
4202
./countprimes 40000  2.01s user 0.01s system 96% cpu 2.095 total
$ gprof countprimes
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
100.00     20.09    20.09    40000     0.00     0.00  isprime
  0.00     20.09     0.00        1     0.00    20.09  countprimes
[...]
Twice as fast! And the answer is still the same, too---this is important.
Can we test even fewer factors? Suppose n has a non-trivial factor x. Then n equals x*y for some y which is also nontrivial. One of x or y will be no bigger than the square root of n. So perhaps we can stop when we reach the square root of n, which we can compute using the math library routine sqrt. Let's try it:
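A sketch of this version (again, the details are an assumption; the new bound appears directly in the loop test):

#include <math.h>               /* for sqrt; goes at the top of the file */

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;

    if(n < 2) return 0;
    if(n == 2) return 1;
    if(n % 2 == 0) return 0;
    /* only need to look for factors up to sqrt(n) */
    for(factor = 3; factor < sqrt(n) + 1; factor += 2) {
        if(n % factor == 0) return 0;
    }
    return 1;
}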
I added +1 to the return value of sqrt because the output of sqrt is not exact, and it would be embarrassing if I announced that 25 was prime because I stopped at 4.9999999997.
Using the math library not only requires including <math.h> but also requires compiling with the -lm flag after all .c or .o files, to link in the library routines:
$ gcc -o countprimes -pg countprimes.c -lm
$ time ./countprimes 40000
4202
./countprimes 40000  0.09s user 0.00s system 98% cpu 0.091 total
$ gprof countprimes | head -20
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 96.88      0.31     0.31    40000     0.01     0.01  isprime
  3.12      0.32     0.01        1    10.00   320.00  countprimes
[...]
Whoosh.
Can we optimize further? Let's see what happens on a bigger input:
$ time ./countprimes 1000000
78497
./countprimes 1000000  7.01s user 0.02s system 92% cpu 7.590 total
$ gprof countprimes | head -8
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 99.32     24.96    24.96  1000000     0.00     0.00  isprime
  0.68     25.13     0.17        1     0.17    25.13  countprimes
This is still very good. Can we do better?
Maybe moving the sqrt out of the loop in isprime will make a difference:
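A sketch of that variant (details assumed as before), with the bound computed once before the loop:

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;
    double limit;

    if(n < 2) return 0;
    if(n == 2) return 1;
    if(n % 2 == 0) return 0;

    limit = sqrt(n) + 1;        /* compute the bound once, outside the loop */

    for(factor = 3; factor < limit; factor += 2) {
        if(n % factor == 0) return 0;
    }
    return 1;
}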
$ gcc -o countprimes -pg countprimes.c -lm
$ time ./countprimes 1000000
78497
./countprimes 1000000  7.01s user 0.01s system 94% cpu 7.419 total
Nope. It looks like the compiler was smart enough to do this already.
What if we get rid of the call to sqrt and test if factor * factor <= n instead? This way we could dump the math library:
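A sketch of this version (same caveat about the details being assumed):

/* return 1 if n is prime, 0 otherwise */
int
isprime(int n)
{
    int factor;

    if(n < 2) return 0;
    if(n == 2) return 1;
    if(n % 2 == 0) return 0;
    /* stop once factor exceeds sqrt(n), without ever calling sqrt;
       note that factor * factor could overflow for very large n */
    for(factor = 3; factor * factor <= n; factor += 2) {
        if(n % factor == 0) return 0;
    }
    return 1;
}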
$ gcc -o countprimes -pg countprimes.c
$ time ./countprimes 1000000
78497
./countprimes 1000000  1.85s user 0.01s system 98% cpu 1.897 total
Hm, looks like it's about 4 times faster this way. This is surprising, since we seem to be doing a lot of multiplications, but perhaps sqrt is slow enough that it's worth it.
At this point we could decide that countprimes is fast enough, or maybe we could look for further improvements (e.g. testing out many small primes at the beginning instead of just 2, calling isprime only on odd values of i, or reading a computational number theory textbook to find out how we ought to be doing this). A reasonable strategy for code for your own use is often to start running one version and make improvements on a separate copy while it's running. If the first version terminates before you are done writing new code, it's probably fast enough.
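For instance, here is a sketch of the second idea, calling isprime only on odd values of i (a hypothetical variant, not one of the versions timed above):

/* return number of primes < n */
int
countprimes(int n)
{
    int i;
    int count;

    count = 0;

    if(n > 2) count++;          /* 2 is the only even prime; count it by hand */

    for(i = 3; i < n; i += 2) { /* only odd candidates from here on */
        if(isprime(i)) count++;
    }

    return count;
}

This halves the number of calls to isprime, although each remaining call does the same work as before.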