
Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

Consolidated lecture notes for CS223 as taught in the Spring 2011 semester at Yale by Jim Aspnes.

Contents

  1. WhyYouShouldLearnC
  2. HowToCompileAndRunPrograms
  3. Creating the program
  4. Compiling and running a program
  5. Some notes on what the program does
  6. HowToUseTheComputingFacilities
  7. Using the Zoo
    1. Getting an account
    2. Getting into the room
    3. Remote use
  8. Using Unix
    1. Getting a shell prompt in the Zoo
    2. The Unix filesystem
    3. Unix command-line programs
    4. Stopping and interrupting programs
    5. Running your own programs
    6. Input and output
  9. Editing C programs
    1. Writing C programs with Emacs
      1. My favorite Emacs commands
    2. Using Vi instead of Emacs
  10. Compiling programs
    1. Using gcc
    2. Using make
      1. Make gotchas
  11. Debugging
  12. Version control
  13. Submitting assignments
  14. C/Variables
  15. Machine memory
  16. Variables
    1. Variable declarations
    2. Variable names
  17. Using variables
  18. C/IntegerTypes
  19. Integer types
    1. C99 fixed-width types
  20. Integer constants
  21. Integer operators
    1. Arithmetic operators
    2. Bitwise operators
    3. Logical operators
    4. Relational operators
  22. Input and output
  23. Alignment
  24. C/InputOutput
  25. Character streams
  26. Reading and writing single characters
  27. Formatted I/O
  28. Rolling your own I/O routines
  29. File I/O
  30. C/Statements
  31. Simple statements
  32. Compound statements
    1. Conditionals
    2. Loops
      1. The while loop
      2. The do..while loop
      3. The for loop
      4. Loops with break, continue, and goto
    3. Choosing where to put a loop exit
  33. C/FloatingPoint
  34. Floating point basics
  35. Floating-point constants
  36. Operators
  37. Conversion to and from integer types
  38. The IEEE-754 floating-point standard
  39. Error
  40. Reading and writing floating-point numbers
  41. Non-finite numbers in C
  42. The math library
  43. C/Functions
  44. Function definitions
  45. Calling a function
  46. The return statement
  47. Function declarations and modules
  48. Static functions
  49. Local variables
  50. Mechanics of function calls
  51. C/Pointers
  52. Memory and addresses
  53. Pointer variables
    1. Declaring a pointer variable
    2. Assigning to pointer variables
    3. Using a pointer
    4. Printing pointers
  54. The null pointer
  55. Pointers and functions
  56. Pointer arithmetic and arrays
    1. Arrays and functions
    2. Multidimensional arrays
    3. Variable-length arrays
  57. Void pointers
  58. Run-time storage allocation
  59. The restrict keyword
  60. C/Strings
  61. String processing in general
  62. C strings
  63. String constants
  64. String buffers
  65. Operations on strings
  66. Finding the length of a string
    1. The strlen tarpit
  67. Comparing strings
  68. Formatted output to strings
  69. Dynamic allocation of strings
  70. argc and argv
  71. C/Structs
  72. Structs
  73. Unions
  74. Bit fields
  75. AbstractDataTypes
  76. Abstraction
  77. Example of an abstract data type
    1. Interface
      1. sequence.h
    2. Implementation
      1. sequence.c
    3. Compiling and linking
      1. main.c
      2. Makefile
  78. Designing abstract data types
    1. Parnas's Principle
    2. When to build an abstract data type
  79. C/Definitions
  80. Naming types
  81. Naming constants
  82. Naming values in sequences
  83. Other uses of #define
  84. C/Debugging
  85. Debugging in general
  86. Assertions
  87. gdb
    1. My favorite gdb commands
    2. Debugging strategies
  88. Valgrind
    1. Compilation flags
    2. Automated testing
    3. Examples of some common valgrind errors
      1. Uninitialized values
      2. Bytes definitely lost
      3. Invalid write or read operations
  89. Not recommended: debugging output
  90. AsymptoticNotation
  91. Definitions
  92. Motivating the definitions
  93. Proving asymptotic bounds
  94. Asymptotic notation hints
    1. Remember the difference between big-O, big-Ω, and big-Θ
    2. Simplify your asymptotic terms as much as possible
    3. Remember the limit trick
  95. Variations in notation
    1. Absolute values
    2. Abusing the equals sign
  96. More information
  97. LinkedLists
  98. Stacks and linked lists
    1. Implementation
    2. A more complicated implementation
    3. Building a stack out of an array
  99. Looping over a linked list
  100. Looping over a linked list backwards
  101. Queues
  102. Deques and doubly-linked lists
  103. Circular linked lists
  104. What linked lists are and are not good for
  105. Further reading
  106. C/Recursion
  107. Example of recursion in C
  108. Common problems with recursion
    1. Omitting the base case
    2. Blowing out the stack
    3. Failure to make progress
  109. Tail-recursion versus iteration
  110. An example of useful recursion
  111. C/HashTables
  112. Dictionary data types
  113. Basics of hashing
  114. Resolving collisions
    1. Chaining
    2. Open addressing
  115. Choosing a hash function
    1. Division method
    2. Multiplication method
    3. Universal hashing
  116. Maintaining a constant load factor
  117. Examples
    1. A low-overhead hash table using open addressing
    2. A string to string dictionary using chaining
  118. BinaryTrees
  119. Tree basics
  120. Binary tree implementations
  121. The canonical binary tree algorithm
  122. Nodes vs leaves
  123. Special classes of binary trees
  124. BinarySearchTrees
  125. Searching for a node
  126. Inserting a new node
  127. Costs
  128. BalancedTrees
  129. The basics: tree rotations
  130. AVL trees
  131. 2–3 trees
  132. Red-black trees
  133. B-trees
  134. Splay trees
  135. Skip lists
  136. Implementations
  137. C/AvlTree
  138. Header file
  139. Implementation
  140. Test code and Makefile
  141. Heaps
  142. Priority queues
  143. Expensive implementations of priority queues
  144. Heaps
  145. Packed heaps
  146. Bottom-up heapification
  147. Heapsort
  148. More information
  149. C/FunctionPointers
  150. Basics
  151. Function pointer declarations
  152. Applications
    1. Callbacks
    2. Dispatch tables
    3. Iterators
  153. Closures
  154. Objects
  155. C/Iterators
  156. The problem
    1. nums.h
    2. nums.c
  157. Option 1: Function that returns a sequence
  158. Option 2: Iterator with first/done/next operations
  159. Option 3: Iterator with function argument
  160. Appendix: Complete code for Nums
    1. nums.h
    2. nums.c
    3. test-nums.c
  161. C/Randomization
  162. Generating random values in C
    1. The rand function from the standard library
    2. Better pseudorandom number generators
    3. Random numbers without the pseudo
    4. Issues with RAND_MAX
  163. Randomized algorithms
    1. Randomized search
    2. Quickselect and quicksort
  164. Randomized data structures
    1. Randomized tree balancing
    2. Universal hash families
  165. RadixSort
  166. What's wrong with comparison-based sorting
  167. Bucket sort
  168. Classic LSB radix sort
  169. MSB radix sort
    1. Issues with recursion depth
    2. Implementing the buckets
    3. Further optimization
    4. Sample implementation
  170. RadixSearch
  171. Tries
    1. Searching a trie
    2. Inserting a new element into a trie
    3. Implementation
  172. Patricia trees
  173. Ternary search trees
  174. More information
  175. DynamicProgramming
  176. Memoization
  177. Dynamic programming
    1. More examples
      1. Longest increasing subsequence
      2. All-pairs shortest paths
      3. Longest common subsequence
  178. Dynamic programming: algorithmic perspective
    1. Preserving alternatives
    2. Knapsack
    3. Non-linear structures
      1. Trees
      2. Treewidth
        1. Examples
        2. Treewidth and dynamic programming
  179. C/Graphs
  180. Graphs
  181. Why graphs are useful
  182. Operations on graphs
  183. Representations of graphs
    1. Adjacency matrices
    2. Adjacency lists
      1. An implementation
    3. Implicit representations
  184. Searching for paths in a graph
    1. Depth-first and breadth-first search
    2. Other variations on the basic algorithm
  185. ShortestPath
  186. Single-source shortest paths
    1. Relaxation
    2. Dijkstra's algorithm
    3. Bellman-Ford
  187. All-pairs shortest paths
  188. Implementations
  189. SuffixArrays
  190. Why do we want to do this?
  191. String search algorithms
  192. Suffix trees and suffix arrays
    1. Building a suffix array
    2. Searching a suffix array
  193. Burrows-Wheeler transform
    1. Suffix arrays and the Burrows-Wheeler transform
  194. Sample implementation
  195. C++
  196. Hello world
  197. References
  198. Function overloading
  199. Classes
  200. Operator overloading
  201. Templates
  202. Exceptions
  203. Storage allocation
    1. Storage allocation inside objects
  204. Standard library
  205. Things we haven't talked about

1. WhyYouShouldLearnC

Why should you learn to program in C?

  • It is the de facto substandard of programming languages.
    • C runs on everything.
    • C lets you write programs that use very few resources.
    • C gives you near-total control over the system, down to the level of pushing around individual bits with your bare hands.
    • C imposes very few constraints on programming style: unlike higher-level languages, C doesn't have much of an ideology. There are very few programs you can't write in C.
    • Most of the programming languages people actually use (Visual Basic, perl, PHP, etc.) are executed by interpreters written in C (or C++, an extension to C).
  • You will learn discipline.
    • C makes it easy to shoot yourself in the foot.
    • You can learn to avoid this by being careful about where you point it.
    • Pain is a powerful teacher of caution.
  • You will fail CS323 if you don't learn C really well in CS223 (CS majors only).

On the other hand, there are many reasons why you might not want to use C later in life. It's missing a lot of features of modern programming languages, including:

  • A garbage collector.
  • Minimal programmer-protection features like array bounds-checking.
  • Non-trivial built-in data structures.
  • Language support for exceptions, namespaces, object-oriented programming, etc.

For most problems where minimizing programmer time and maximizing robustness are more important than minimizing runtime, other languages are a better choice. But for CS223 we'll be using C.

If you want to read a lot of flaming about what C is or is not good for, see http://c2.com/cgi/wiki?CeeLanguage.



2. HowToCompileAndRunPrograms

See HowToUseTheComputingFacilities for details of particular commands. The basic steps are

  • Creating the program with a text editor of your choosing. (I like vim for long programs and cat for very short ones.)

  • Compiling it with gcc.

  • Running it.

If any of these steps fail, the next step is debugging. We'll talk about debugging elsewhere.

3. Creating the program

Use your favorite text editor. The program file should have a name of the form foo.c; the .c at the end tells the C compiler the contents are C source code. Here is a typical C program:

   1 #include <stdio.h>
   2 
   3 /* print the numbers from 1 to 10 */
   4 
   5 int
   6 main(int argc, char **argv)
   7 {
   8     int i;
   9 
  10     puts("Now I will count from 1 to 10");
  11     for(i = 1; i <= 10; i++) {
  12         printf("%d\n", i);
  13     }
  14 
  15     return 0;
  16 }
count.c

4. Compiling and running a program

Here's what happens when I compile and run it on the Zoo:

$ gcc -o count count.c
$ ./count
Now I will count from 1 to 10
1
2
3
4
5
6
7
8
9
10
$

The first line is the command to compile the program. The second line runs the output file count. Calling it ./count is necessary because by default the shell (the program that interprets what you type) only looks for programs in certain standard system directories. To make it run a program in the current directory, we have to include the directory name.

5. Some notes on what the program does

Noteworthy features of this program include:

  • The #include <stdio.h> in line 1. This is standard C boilerplate, and will appear in any program you see that does input or output. The meaning is to tell the compiler to include the text of the file /usr/include/stdio.h in your program as if you had typed it there yourself. This particular file contains declarations for the standard I/O library functions like puts (put string) and printf (print formatted), as used in the program. If you don't put it in, your program may or may not still compile. Do it anyway.

  • Line 3 is a comment; its beginning and end is marked by the /* and */ characters. Comments are ignored by the compiler but can be helpful for other programmers looking at your code (including yourself, after you've forgotten why you wrote something).

  • Lines 5 and 6 declare the main function. Every C program has to have a main function declared in exactly this way---it's what the operating system calls when you execute the program. The int on Line 5 says that main returns a value of type int (we'll describe this in more detail later in C/Functions). The declaration also says that main takes two arguments: argc of type int, the number of arguments passed to the program from the command line, and argv, of a pointer type that we will get to eventually (C/Pointers), which is an array of the arguments (essentially all the words on the command line, including the program name). Note that it would also work to do this as one line (as KernighanRitchie typically does); the C compiler doesn't care about whitespace, so you can format things however you like, subject to the constraint that consistency will make it easier for people to read your code.

  • Everything inside the curly braces is the body of the main function. This includes

    • The declaration int i;, which says that i will be a variable that holds an int (C/IntegerTypes).

    • Line 10, which prints an informative message using puts (C/InputOutput).

    • The for loop on Lines 11–13, which executes its body for each value of i from 1 to 10. We'll explain how for loops work later (C/Statements). Note that the body of the loop is enclosed in curly braces just like the body of the main function. The only statement in the body is the call to printf on Line 12; this includes a format string that specifies that we want a decimal-formatted integer followed by a newline (the \n).

    • The return 0; on Line 15 tells the operating system that the program worked (the convention in Unix is that 0 means success). If the program didn't work for some reason, we could have returned something else to signal an error.



6. HowToUseTheComputingFacilities

7. Using the Zoo

The best place for information about the Zoo is at http://zoo.cs.yale.edu/. Below are some points that are of particular relevance for CS223 students.

7.1. Getting an account

To get an account in the Zoo, follow the instructions at http://zoo.cs.yale.edu/cgi-bin/accounts.pl. You will need your NetID and password to sign up for an account.

Even if you already have an account, you still need to use this form to register as a CS 223 student, or you will not be able to submit assignments.

7.2. Getting into the room

The Zoo is located on the third floor of Arthur K Watson Hall, toward the front of the building. If you are a Yale student, your ID should get you into the building and the room. If you are not a student, you will need to get your ID validated in AKW 008a to get in after hours.

7.3. Remote use

See HowToUseTheZooRemotely.

8. Using Unix

The Zoo runs a Unix-like operating system called Linux. Most people run Unix with a command-line interface provided by a shell. Each line typed to the shell tells it what program to run (the first word in the line) and what arguments to give it (remaining words). The interpretation of the arguments is up to the program.

8.1. Getting a shell prompt in the Zoo

When you log in to a Zoo node directly, you may not automatically get a shell window. If you use the default login environment (which puts you into the KDE window manager), you need to click on the picture of the display with a shell in front of it in the toolbar at the bottom of the screen. If you run Gnome instead (you can change your startup environment using the popup menu in the login box), you can click on the foot in the middle of the toolbar. Either approach will pop up a terminal emulator from which you can run emacs, gcc, and so forth.

The default login shell in the Zoo is bash, and all examples of shell command lines given in these notes will assume bash. You can choose a different login shell on the account sign-up page if you want to, but you are probably best off just learning to like bash.

8.2. The Unix filesystem

Most of what one does with Unix programs is manipulate the filesystem. Unix files are unstructured blobs of data whose names are given by paths consisting of a sequence of directory names separated by slashes: for example /home/accts/some-user/cs223/hw1.c. At any time you are in a current working directory (type pwd to find out what it is and cd new-directory to change it). You can specify a file below the current working directory by giving just the last part of the pathname. The special directory names . and .. can also be used to refer to the current directory and its parent. So /home/accts/some-user/cs223/hw1.c is just hw1.c or ./hw1.c if your current working directory is /home/accts/some-user/cs223, cs223/hw1.c if your current working directory is /home/accts/some-user, and ../cs223/hw1.c if your current working directory is /home/accts/some-user/illegal-downloads.
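For example, here is a short session that uses pwd and cd to move around the filesystem (a sketch, reusing the directory names from the example above):

$ pwd
/home/accts/some-user
$ cd cs223
$ pwd
/home/accts/some-user/cs223
$ cd ..
$ pwd
/home/accts/some-user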

All Zoo machines share a common filesystem, so any files you create or change on one Zoo machine will show up in the same place on all the others.

8.3. Unix command-line programs

Here are some handy Unix commands:

man

man program will show you the on-line documentation for a program (e.g., try man man or man ls). Handy if you want to know what a program does. On Linux machines like the ones in the Zoo you can also get information using info program, which has an Emacs-like interface.

ls

ls lists all the files in the current directory. Some useful variants:

  • ls /some/other/dir; list files in that directory instead.

  • ls -l; long output format showing modification dates and owners.

mkdir

mkdir dir will create a new directory in the current directory named dir.

rmdir

rmdir dir deletes a directory. It only works on directories that contain no files.

cd

cd dir changes the current working directory. With no arguments, cd changes back to your home directory.

pwd

pwd ("print working directory") shows what your current directory is.

mv

mv old-name new-name changes the name of a file. You can also use this to move files between directories.

cp

cp old-name new-name makes a copy of a file.

rm

rm file deletes a file. Deleted files cannot be recovered. Use this command carefully.

chmod

chmod changes the permissions on a file or directory. See the man page for the full details of how this works. Here are some common chmod's:

  • chmod 644 file; owner can read or write the file, others can only read it.

  • chmod 600 file; owner can read or write the file, others can't do anything with it.

  • chmod 755 file; owner can read, write, or execute the file, others can read or execute it. This is typically used for programs or for directories (where the execute bit has the special meaning of letting somebody find files in the directory).

  • chmod 700 file; owner can read, write, or execute the file, others can't do anything with it.

emacs, gcc, make, gdb, git
See corresponding sections.

8.4. Stopping and interrupting programs

Sometimes you may have a running program that won't die. Aside from costing you the use of your terminal window, this may be annoying to other Zoo users, especially if the process won't die even if you close the terminal window or log out.

There are various control-key combinations you can type at a terminal window to interrupt or stop a running program.

ctrl-C

Interrupt the process. Many processes (including any program you write unless you trap SIGINT using the sigaction system call) will die instantly when you do this. Some won't.

ctrl-Z

Suspend the process. This will leave a stopped process lying around. Type jobs to list all your stopped processes, fg to restart the last process (or fg %1 to start process %1 etc.), bg to keep running the stopped process in the background, kill %1 to kill process %1 politely, kill -KILL %1 to kill process %1 whether it wants to die or not.

ctrl-D

Send end-of-file to the process. Useful if you are typing test input to a process that expects to get EOF eventually or writing programs using cat > program.c (not really recommended). For test input, you are often better off putting it into a file and using input redirection (./program < test-input-file); this way you can redo the test after you fix the bugs it reveals.

ctrl-\
Quit the process. Sends a SIGQUIT, which asks a process to quit and dump core. Mostly useful if ctrl-C and ctrl-Z don't work.

If you have a runaway process that you can't get rid of otherwise, you can use ps g to get a list of all your processes and their process ids. The kill command can then be used on the offending process, e.g. kill -KILL 6666 if your evil process has process id 6666. Sometimes the killall command can simplify this procedure, e.g. killall -KILL evil kills all processes with command name evil.

8.5. Running your own programs

If you compile your own program, you will need to prefix it with ./ on the command line to tell the shell that you want to run a program in the current directory (called '.') instead of one of the standard system directories. So for example, if I've just built a program called count, I can run it by typing

$ ./count

Here the "" is standing in for whatever your prompt looks like; you should not type it.

Any words after the program name (separated by whitespace---spaces and/or tabs) are passed in as arguments to the program. Sometimes you may wish to pass more than one word as a single argument. You can do so by wrapping the argument in single quotes, as in

$ ./count 'this is the first argument' 'this is the second argument'

8.6. Input and output

Some programs take input from standard input (typically the terminal). If you are doing a lot of testing, you will quickly become tired of typing test input at your program. You can tell the shell to redirect standard input from a file by putting the file name after a < symbol, like this:

$ ./count < huge-input-file

A '>' symbol is used to redirect standard output, in case you don't want to read it as it flies by on your screen:

$ ./count < huge-input-file > huger-output-file

A useful file for both input and output is the special file /dev/null. As input, it looks like an empty file. As output, it eats any characters sent to it:

$ ./sensory-deprivation-experiment < /dev/null > /dev/null

You can also pipe programs together, connecting the output of one to the input of the next. Good programs to put at the end of a pipe are head (eats all but the first ten lines), tail (eats all but the last ten lines), more (lets you page through the output by hitting the space bar), and tee (shows you the output but also saves a copy to a file). A typical command might be something like ./spew | more or ./slow-but-boring | tee boring-output. Pipes can consist of a long train of programs, each of which processes the output of the previous one and supplies the input to the next. A typical case might be:

$ ./do-many-experiments | sort | uniq -c | sort -nr

which, if ./do-many-experiments gives the output of one experiment on each line, produces a list of distinct experimental outputs sorted by decreasing frequency. Pipes like this can often substitute for hours of real programming.

9. Editing C programs

To write your programs, you will need to use a text editor, preferably one that knows enough about C to provide tools like automatic indentation and syntax highlighting. There are three reasonable choices for this in the Zoo: kate, emacs, and vim (which can also be run as vi). Kate is a GUI-style editor that comes with the KDE window system; it plays nicely with the mouse, but Kate skills will not translate well into other environments. Emacs and Vi have been the two contenders for the One True Editor since the 1970s—if you learn one (or both) you will be able to use the resulting skills everywhere. My personal preference is to use Vi, but Emacs has the advantage of using the same editing commands as the shell and gdb command-line interfaces.

9.1. Writing C programs with Emacs

To start Emacs, type emacs at the command line. If you are actually sitting at a Zoo node it should put up a new window. If not, Emacs will take over the current window. If you have never used Emacs before, you should immediately type C-h t (this means hold down the Control key, type h, then type t without holding down the Control key). This will pop you into the Emacs built-in tutorial.

9.1.1. My favorite Emacs commands

General note: C-x means hold down Control and press x; M-x means hold down Alt (Emacs calls it "Meta") and press x. For M-x you can also hit Esc and then x.

C-h

Get help. Everything you could possibly want to know about Emacs is available through this command. Some common versions: C-h t puts up the tutorial, C-h b lists every command available in the current mode, C-h k tells you what a particular sequence of keystrokes does, and C-h l tells you what the last 50 or so characters you typed were (handy if Emacs just garbled your file and you want to know what command to avoid in the future).

C-x u

Undo. Undoes the last change you made to the current buffer. Type it again to undo more things. A lifesaver. Note that it can only undo back to the time you first loaded the file into Emacs--- if you want to be able to back out of bigger changes, use git (described below).

C-x C-s
Save. Saves changes to the current buffer out to its file on disk.
C-x C-f
Edit a different file.
C-x C-c

Quit out of Emacs. This will ask you if you want to save any buffers that have been modified. You probably want to answer yes (y) for each one, but you can answer no (n) if you changed some file inside Emacs but want to throw the changes away.

C-f
Go forward one character.
C-b
Go back one character.
C-n
Go to the next line.
C-p
Go to the previous line.
C-a
Go to the beginning of the line.
C-k

Kill the rest of the line starting with the current position. Useful Emacs idiom: C-a C-k.

C-y
"Yank." Get back what you just killed.
TAB
Re-indent the current line. In C mode this will indent the line according to Emacs's notion of how C should be indented.
M-x compile

Compile a program. This will ask you if you want to save out any unsaved buffers and then run a compile command of your choice (see the section on compiling programs below). The exciting thing about M-x compile is that if your program has errors in it, you can type C-x ` to jump to the next error, or at least where gcc thinks the next error is.

9.2. Using Vi instead of Emacs

If you don't find yourself liking Emacs very much, you might want to try Vim instead. Vim is a vastly enhanced reimplementation of the classic vi editor, which I personally find easier to use than Emacs. Type vimtutor to run the tutorial. You can always get out by hitting the Escape key a few times and then typing :qa! .

For more details, see UsingVim.

10. Compiling programs

10.1. Using gcc

A C program will typically consist of one or more files whose names end with .c. To compile foo.c, you can type gcc foo.c. Assuming foo.c contains no errors egregious enough to be detected by the extremely forgiving C compiler, this will produce a file named a.out that you can then execute by typing ./a.out.

If you want to debug your program using gdb or give it a different name, you will need to use a longer command line. Here's one that compiles foo.c to foo (run it using ./foo) and includes the information that gdb needs: gcc -g3 -o foo foo.c

By default, gcc doesn't check everything that might be wrong with your program. But if you give it a few extra arguments, it will warn you about many (but not all) potential problems: gcc -g3 -Wall -std=c99 -pedantic -o foo foo.c

10.2. Using make

For complicated programs involving multiple source files, you are probably better off using make than calling gcc directly. Make is a "rule-based expert system" that figures out how to compile programs given a little bit of information about their components.

For example, if you have a file called foo.c, try typing make foo and see what happens.

In general you will probably want to write a Makefile, which is named Makefile or makefile and tells make how to compile programs in the same directory. Here's a typical Makefile:

# Any line that starts with a sharp is a comment and is ignored
# by Make.

# These lines set variables that control make's default rules.
# We STRONGLY recommend putting "-Wall -std=c99 -pedantic" in your CFLAGS.
CC=gcc
CFLAGS=-g3 -Wall -std=c99 -pedantic

# The next line is a dependency line.
# It says that if somebody types "make all"
# make must first make "hello-world".
# By default the left-hand-side of the first dependency is what you
# get if you just type "make" with no arguments.
all: hello-world

# How do we make hello-world?
# The dependency line says you need to first make hello-world.o
# and hello-library.o
hello-world: hello-world.o hello-library.o
        # Subsequent lines starting with a TAB character give
        # commands to execute.  Note the use of the CC and CFLAGS
        # variables.
        $(CC) $(CFLAGS) -o hello-world hello-world.o hello-library.o
        echo "I just built hello-world!  Hooray!"

# We can also declare that several things depend on one thing.
# Here we are saying that hello-world.o and hello-library.o
#  should be rebuilt whenever hello-library.h changes.
# There are no commands attached to this dependency line, so
#  make will have to figure out how to do that somewhere else
#  (probably from the builtin .c -> .o rule).
hello-world.o hello-library.o: hello-library.h

# Command lines can do more than just build things.  For example,
# "make test" will rebuild hello-world (if necessary) and then run it.
test: hello-world
        ./hello-world

# This lets you type "make clean" and get rid of anything you can
# rebuild.  The -f tells rm not to complain about files that aren't
# there.
clean:
        rm -f hello-world *.o

Given a Makefile, make looks at each dependency line and asks: (a) does the target on the left-hand side exist, and (b) is it older than any of the files it depends on? If the target is missing or out of date, make rebuilds it, after first rebuilding any of the files it depends on; the commands it runs will be underneath some dependency line where the target appears on the left-hand side. It has built-in rules for doing common tasks like building .o files (which contain machine code) from .c files (which contain C source code). If you have a fake target like all above, it will try to rebuild everything all depends on because there is no file named all (one hopes).

10.2.1. Make gotchas

Make really really cares that the command lines start with a TAB character. TAB looks like eight spaces in Emacs and other editors, but it isn't the same thing. If you put eight spaces in (or a space and a TAB), Make will get horribly confused and give you an incomprehensible error message about a "missing separator". This misfeature is so scary that I avoided using make for years because I didn't understand what was going on. Don't fall into that trap--- make really is good for you, especially if you ever need to recompile a huge program when only a few source files have changed.

If you use GNU Make (on a zoo node), note that beginning with version 3.78, GNU Make prints a message that hints at a possible SPACEs-vs-TAB problem, like this:

$ make
Makefile:23: *** missing separator (did you mean TAB instead of 8 spaces?).  Stop.

If you need to repair a Makefile that uses spaces, one way of converting leading spaces into TABs is to use the unexpand program:

$ mv Makefile Makefile.old
$ unexpand Makefile.old > Makefile

11. Debugging

The standard debugger on the Zoo is gdb. See C/Debugging.

12. Version control

When you are programming, you will make mistakes. If you program long enough, these will eventually include true acts of boneheadedness like accidentally deleting all of your source files. You are also likely to spend some of your time trying out things that don't work, at the end of which you'd like to go back to the last version of your program that did work. All these problems can be solved by using a version control system.

There are six respectable version control systems installed on the Zoo: rcs, cvs, svn, bzr, hg, and git. If you are familiar with any of them, you should use that. If you have to pick one from scratch, I recommend using git. For details, see UsingGit, or look at the tutorials available at http://git-scm.org.

13. Submitting assignments

The submit command is found in /c/cs223/bin on the Zoo. Here is the documentation (adapted from comments in the script):

submit    assignment-number file(s)
unsubmit  assignment-number file(s)
check     assignment-number
makeit    assignment-number [file]
protect   assignment-number file(s)
unprotect assignment-number file(s)
retrieve  assignment-number file[s]
testit    assignment-number test

The submit program can be invoked in eight different ways:

    /c/cs223/bin/submit  1  Makefile tokenize.c unique.c time.log

submits the named source files as your solution to Homework #1;

    /c/cs223/bin/check  2

lists the files that you have submitted for Homework #2;

    /c/cs223/bin/unsubmit  3  error.submit bogus.solution

deletes the named files that you had submitted previously for Homework #3
(i.e., withdraws them from submission, which is useful if you accidentally
submit the wrong file);

    /c/cs223/bin/makeit  4  tokenize unique

runs "make" on the files that you submitted previously for Homework #4;

    /c/cs223/bin/protect  5  tokenize.c time.log

protects the named files that you submitted previously for Homework #5 (so
they cannot be deleted accidentally); and

    /c/cs223/bin/unprotect  6  unique.c time.log

unprotects the named files that you submitted previously for Homework #6
(so they can be deleted); and

     /c/cs223/bin/retrieve  7  Csquash.c

retrieves copies of the named files that you submitted previously for Homework #7

     /c/cs223/bin/testit    8  BigTest

runs the test script /c/cs223/Hwk8/test.BigTest.

The submit program will only work if there is a directory with your name and login on it under /c/cs223/class. If there is no such directory, you need to make sure that you have correctly signed up for CS223 using the web form. Note that it may take up to an hour for this directory to appear after you sign up.



14. C/Variables

15. Machine memory

Basic model: machine memory consists of many bytes of storage, each of which has an address which is itself a sequence of bits. Though the actual memory architecture of a modern computer is complex, from the point of view of a C program we can think of it as simply a large address space that the CPU can store things in (and load things from), provided it can supply an address to the memory. Because we don't want to have to type long strings of bits all the time, the C compiler lets us give names to particular regions of the address space, and will even find free space for us to use.

16. Variables

A variable is a name given in a program for some region of memory. Each variable has a type, which tells the compiler how big the region of memory corresponding to it is and how to treat the bits stored in that region when performing various kinds of operations (e.g. integer variables are added together by very different circuitry than floating-point variables, even though both represent numbers as bits). In modern programming languages, a variable also has a scope (a limit on where the name is meaningful, which allows the same name to be used for different variables in different parts of the program) and an extent (the duration of the variable's existence, controlling when the program allocates and deallocates space for it).

16.1. Variable declarations

Before you can use a variable in C, you must declare it. Variable declarations show up in three places:

  • Outside a function. These declarations declare global variables that are visible throughout the program (i.e. they have global scope). Use of global variables is almost always a mistake.

  • In the argument list in the header of a function. These variables are parameters to the function. They are only visible inside the function body (local scope), exist only from when the function is called to when the function returns (bounded extent—note that this is different from what happens in some garbage-collected languages like Scheme), and get their initial values from the arguments to the function when it is called.

  • At the start of any block delimited by curly braces. Such variables are visible only within the block (local scope again) and exist only when the containing function is active (bounded extent). The convention in C has generally been to declare all such local variables at the top of a function; this is different from the convention in C++ or Java, which encourage variables to be declared when they are first used. This convention may be less strong in C99 code, since C99 adopts the C++ rule of allowing variables to be declared anywhere (which can be particularly useful for index variables in for loops).

Variable declarations consist of a type name followed by one or more variable names separated by commas and terminated by a semicolon (except in argument lists, where each declaration is terminated by a comma). I personally find it easiest to declare variables one per line, to simplify documenting them. It is also possible to assign an initial value to a global or local variable (but not a function argument) by putting in something like = 0 after the variable name. It is good practice to put a comment after each variable declaration that explains what the variable does (with a possible exception for conventionally-named loop variables like i or j in short functions). Below is an example of a program with some variable declarations in it:

   1 #include <stdio.h>
   2 #include <ctype.h>
   3 
   4 /* This program counts the number of digits in its input. */
   5 
   6 /*
   7  * This global variable is not used; it is here only to demonstrate
   8  * what a global variable declaration looks like.
   9  */
  10 unsigned long SpuriousGlobalVariable = 127;
  11 
  12 int
  13 main(int argc, char **argv)
  14 {
  15     int c;              /* character read */
  16     int count = 0;      /* number of digits found */
  17 
  18     while((c = getchar()) != EOF) {
  19         if(isdigit(c)) {
  20             count++;
  21         }
  22     }
  23 
  24     printf("%d\n", count);
  25 
  26     return 0;
  27 }
variables.c

16.2. Variable names

The evolution of variable names in different programming languages:

11101001001001
Physical addresses represented as bits.
#FC27
Typical assembly language address represented in hexadecimal to save typing (and because it's easier for humans to distinguish #A7 from #B6 than to distinguish 10100111 from 10110110.)
A1$
A string variable in BASIC, back in the old days where BASIC variables were one uppercase letter, optionally followed by a number, optionally followed by $ for a string variable and % for an integer variable. These type tags were used because BASIC interpreters didn't have a mechanism for declaring variable types.
IFNXG7

A typical FORTRAN variable name, back in the days of 6-character all-caps variable names. The I at the start means it's an integer variable. The rest of the letters probably abbreviate some much longer description of what the variable means. The default type based on the first letter was used because FORTRAN programmers were lazy, but it could be overridden by an explicit declaration.

i, j, c, count, top_of_stack, accumulatedTimeInFlight

Typical names from modern C programs. There is no type information contained in the name; the type is specified in the declaration and remembered by the compiler elsewhere. Note that there are two different conventions for representing multi-word names: the first is to replace spaces with underscores, and the second is to capitalize the first letter of each word (possibly excluding the first letter), a style called "camel case" (CamelCase). You should pick one of these two conventions and stick to it.

prgcGradeDatabase

An example of Hungarian notation, a style of variable naming in which the type of the variable is encoded in the first few characters. The type is now back in the variable name again. This is not enforced by the compiler: even though iNumberOfStudents is supposed to be an int, there is nothing to prevent you from declaring float iNumberOfStudents if you expect to have fractional students for some reason. See http://web.umr.edu/~cpp/common/hungarian.html or HungarianNotation. Not clearly an improvement on standard naming conventions, but it is popular in some programming shops.

In C, variable names are called identifiers. An identifier in C must start with a lowercase letter, an uppercase letter, or the underscore character _. Typically variables starting with underscores are used internally by system libraries, so it's dangerous to name your own variables this way. Subsequent characters in an identifier can be letters, digits, or underscores. So for example a, ____a___a_a_11727_a, AlbertEinstein, aAaAaAaAaAAAAAa, and ______ are all legal identifiers in C, but $foo and 01 are not.

The basic principle of variable naming is that a variable name is a substitute for the programmer's memory. It is generally best to give identifiers names that are easy to read and describe what the variable is used for, i.e., that are self-documenting. None of the variable names in the preceding list are any good by this standard. Better names would be total_input_characters, dialedWrongNumber, or stepsRemaining. Non-descriptive single-character names are acceptable for certain conventional uses, such as the use of i and j for loop iteration variables, or c for an input character. Such names should only be used when the scope of the variable is small, so that it's easy to see all the places where it is used at once.

C identifiers are case-sensitive, so aardvark, AArDvARK, and AARDVARK are all different variables. Because it is hard to remember how you capitalized something before, it is important to pick a standard convention and stick to it. The traditional convention in C goes like this:

  • Ordinary variables and functions are lowercased or camel-cased, e.g. count, countOfInputBits.

  • User-defined types (and in some conventions global variables) are capitalized, e.g. Stack, TotalBytesAllocated.

  • Constants created with #define or enum are put in all-caps: MAXIMUM_STACK_SIZE, BUFFER_LIMIT.

17. Using variables

Ignoring pointers (C/Pointers) for the moment, there are essentially two things you can do to a variable: you can assign a value to it using the = operator, as in:

   1     x = 2;      /* assign 2 to x */
   2     y = 3;      /* assign 3 to y */

or you can use its value in an expression:

   1     x = y+1;    /* assign y+1 to x */

The assignment operator is an ordinary operator, and assignment expressions can be used in larger expressions:

    x = (y=2)*3; /* sets y to 2 and x to 6 */

This feature is usually only used in certain standard idioms, since it's confusing otherwise.

There are also shorthand operators for expressions of the form variable = variable operator expression. For example, writing x += y is equivalent to writing x = x + y, x /= y is the same as x = x / y, etc.
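For example (a quick sketch of these shorthand operators in action):

    x = 10;
    x += 2;     /* x is now 12 */
    x *= 3;     /* x is now 36 */
    x /= 9;     /* x is now 4 */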

For the special case of adding or subtracting 1, you can abbreviate still further with the ++ and -- operators. These come in two versions, depending on whether you want the result of the expression (if used in a larger expression) to be the value of the variable before or after the variable is incremented:

   1     x = 0;
   2     y = x++;    /* sets x to 1 and y to 0 (the old value) */
   3     y = ++x;    /* sets x to 2 and y to 2 (the new value) */

The intuition is that if the ++ comes before the variable, the increment happens before the value of the variable is read (a preincrement); if it comes after, it happens after the value is read (a postincrement). This is confusing enough that it is best not to use the value of preincrement or postincrement operations except in certain standard idioms. But using x++ by itself as a substitute for x = x+1 is perfectly acceptable style.



18. C/IntegerTypes

19. Integer types

In order to declare a variable, you have to specify a type, which controls both how much space the variable takes up and how the bits stored within it are interpreted in arithmetic operators.

The standard C integer types are:

Name        Typical size    Signed by default?
char        8 bits          Unspecified
short       16 bits         Yes
int         32 bits         Yes
long        32 bits         Yes
long long   64 bits         Yes

The typical size is for 32-bit architectures like the Intel i386. Some 64-bit machines might have 64-bit ints and longs, and some prehistoric computers had 16-bit ints. Particularly bizarre architectures might have even wilder bit sizes, but you are not likely to see this unless you program vintage 1970s supercomputers. Some compilers also support a long long type that is usually twice the length of a long (e.g. 64 bits on i386 machines); this may or may not be available if you insist on following the ANSI specification strictly. The general convention is that int is the most convenient size for whatever computer you are using and should be used by default.

Whether a variable is signed or not controls how its values are interpreted. In signed integers, the first bit is the sign bit and the rest are the value in 2's complement notation; so for example a signed char with bit pattern 11111111 would be interpreted as the numerical value -1 while an unsigned char with the same bit pattern would be 255. Most integer types are signed unless otherwise specified; an n-bit integer type has a range from -2^(n-1) to 2^(n-1)-1 (e.g. -32768 to 32767 for a short). Unsigned variables, which can be declared by putting the keyword unsigned before the type, have a range from 0 to 2^n-1 (e.g. 0 to 65535 for an unsigned short).
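If you want to see the actual ranges on whatever machine you are using, the standard header limits.h defines them as macros. A quick sketch (the macro names are standard C):

    #include <stdio.h>
    #include <limits.h>

    int
    main(int argc, char **argv)
    {
        printf("short: %d..%d\n", SHRT_MIN, SHRT_MAX);
        printf("unsigned short: 0..%u\n", (unsigned int) USHRT_MAX);
        printf("int: %d..%d\n", INT_MIN, INT_MAX);

        return 0;
    }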

For chars, whether the character is signed (-128..127) or unsigned (0..255) is at the whim of the compiler. If it matters, declare your variables as signed char or unsigned char. For storing actual characters that you aren't doing arithmetic on, it shouldn't matter.

19.1. C99 fixed-width types

C99 provides a stdint.h header file that defines integer types with known size independent of the machine architecture. So in C99, you can use int8_t instead of signed char to guarantee a signed type that holds exactly 8 bits, or uint64_t instead of unsigned long long to get a 64-bit unsigned integer type. The full set of types typically defined are int8_t, int16_t, int32_t, and int64_t for signed integers and the same names starting with uint for unsigned integers. There are also types for integers that contain the fewest number of bits greater than some minimum (e.g., int_least16_t is a signed type with at least 16 bits, chosen to minimize space) or that are the fastest type with at least the given number of bits (e.g., int_fast16_t is a signed type with at least 16 bits, chosen to minimize time).

These are all defined using typedef; the main advantage of using stdint.h over defining them yourself is that if somebody ports your code to a new architecture, stdint.h should take care of choosing the right types automatically. The disadvantage is that, like many C99 features, stdint.h is not universally available on all C compilers.

If you need to print types defined in stdint.h, the larger inttypes.h header defines macros that give the corresponding format strings for printf.
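For example, here is a sketch of printing an int64_t portably using the PRId64 macro (inttypes.h also pulls in stdint.h, so one include suffices):

    #include <stdio.h>
    #include <inttypes.h>

    int
    main(int argc, char **argv)
    {
        int64_t big = 1234567890123456789;

        printf("%" PRId64 "\n", big);

        return 0;
    }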

20. Integer constants

Constant integer values in C can be written in any of four different ways:

  • In the usual decimal notation, e.g. 0, 1, -127, 9919291, 97.

  • In octal or base 8, when the leading digit is 0, e.g. 01 for 1, 010 for 8, 0777 for 511, 0141 for 97.

  • In hexadecimal or base 16, when prefixed with 0x. The letters a through f are used for the digits 10 through 15. For example, 0x61 is another way to write 97.

  • Using a character constant, which is a single ASCII character or an escape sequence inside single quotes. The value is the ASCII value of the character: 'a' is 97. Unlike languages with separate character types, C characters are identical to integers; you can (but shouldn't) calculate 97^2 by writing 'a'*'a'. You can also store a character anywhere an integer is expected (see the sketch after this list).
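All four notations can express the same value; a quick sketch:

    int w = 97;      /* decimal */
    int x = 0141;    /* octal */
    int y = 0x61;    /* hexadecimal */
    int z = 'a';     /* character constant */
    /* w, x, y, and z all hold 97 */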

Except for character constants, you can insist that an integer constant is unsigned or long by putting a u or l after it. So 1ul is an unsigned long version of 1. By default integer constants are (signed) ints. For long long constants, use ll, e.g., the unsigned long long constant 0xdeadbeef01234567ull. It is also permitted to write the l as L, which can be less confusing if the l looks too much like a 1.

21. Integer operators

21.1. Arithmetic operators

The usual + (addition), - (negation or subtraction), and * (multiplication) operators work on integers pretty much the way you'd expect. The only caveat is that if the result lies outside of the range of whatever variable you are storing it in, it will be truncated instead of causing an error:

   1     unsigned char c;
   2 
   3     c = -1;             /* sets c = 255 */
   4     c = 255 + 255;      /* sets c = 254 */
   5     c = 256 * 1772717;  /* sets c = 0 */

This can be a source of subtle bugs if you aren't careful. The usual giveaway is that values you thought should be large positive integers come back as random-looking negative integers.

Division (/) of two integers also truncates: 2/3 is 0, 5/3 is 1, etc. For positive integers it will always round down.

Prior to C99, if either the numerator or denominator is negative, the behavior was unpredictable and depended on what your processor does---in practice this meant you should never use / if one or both arguments might be negative. The C99 standard specified that integer division always removes the fractional part, effectively rounding toward 0; so (-3)/2 is -1, 3/-2 is -1, and (-3)/-2 is 1.

There is also a remainder operator % with e.g. 2%3 = 2, 5%3 = 2, 27%2 = 1, etc. The sign of the divisor is ignored, so 2%-3 is also 2. The sign of the dividend carries over to the remainder: (-3)%2 and (-3)%(-2) are both -1. The reason for this rule is that it guarantees that y == x*(y/x) + y%x is always true.
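A quick sketch of these division and remainder rules (assuming C99 semantics):

    int q1 = (-3) / 2;     /* -1: rounds toward zero */
    int q2 = 3 / (-2);     /* -1 */
    int r1 = (-3) % 2;     /* -1: sign follows the dividend */
    int r2 = 2 % (-3);     /*  2: sign of the divisor is ignored */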

21.2. Bitwise operators

In addition to the arithmetic operators, integer types support bitwise logical operators that apply some Boolean operation to all the bits of their arguments in parallel. What this means is that the i-th bit of the output is equal to some operation applied to the i-th bit(s) of the input(s). The bitwise logical operators are ~ (bitwise negation: used with one argument as in ~0 for the all-1's binary value), & (bitwise AND), | (bitwise OR), and ^ (bitwise XOR, i.e. sum mod 2). These are mostly used for manipulating individual bits or small groups of bits inside larger words, as in the expression x & 0x0f, which strips off the bottom four bits stored in x.

Examples:

x      y      expression   value
0011   0101   x&y          0001
0011   0101   x|y          0111
0011   0101   x^y          0110
0011   0101   ~x           1100

The shift operators << and >> shift the bit sequence left or right: x << y produces the value x⋅2^y (ignoring overflow); this is equivalent to shifting every bit in x y positions to the left and filling in y zeros for the missing positions. In the other direction, x >> y produces the value ⌊x⋅2^(-y)⌋, by shifting x y positions to the right. The behavior of the right shift operator depends on whether x is unsigned or signed; for unsigned values, it always shifts in zeros from the left end; for signed values, it shifts in additional copies of the leftmost bit (the sign bit). This makes x >> y have the same sign as x if x is signed.

You might expect a negative y to reverse the direction of the shift, so that x << -2 is equivalent to x >> 2, and the tables below show that interpretation; be warned, however, that the C standard leaves shifts by negative amounts undefined, so you should not rely on this behavior in real code.

Examples (unsigned char x):

x          y    x << y      x >> y
00000001   1    00000010    00000000
11111111   3    11111000    00011111
10111001   -2   00101110    11100100

Examples (signed char x):

x          y    x << y      x >> y
00000001   1    00000010    00000000
11111111   3    11111000    11111111
10111001   -2   11101110    11100100

Shift operators are often used with bitwise logical operators to set or extract individual bits in an integer value. The trick is that (1 << i) contains a 1 in the i-th least significant bit and zeros everywhere else. So x & (1<<i) is nonzero if and only if x has a 1 in the i-th place. This can be used to print out an integer in binary format (which standard printf won't do):

   1 void
   2 print_binary(unsigned int n)
   3 {
   4     unsigned int mask = 0;
   5 
   6     /* this grotesque hack creates a bit pattern 1000... */
   7     /* regardless of the size of an unsigned int */
   8     mask = ~mask ^ (~mask >> 1);
   9 
  10     for(; mask != 0; mask >>= 1) {
  11         putchar((n & mask) ? '1' : '0');
  12     }
  13 }

(See test_print_binary.c for a program that uses this.)

In the other direction, we can set the i-th bit of x to 1 by doing x | (1 << i) or to 0 by doing x & ~(1 << i). See C/BitExtraction for applications of this to build arbitrarily-large bit vectors.
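A quick sketch of these bit-manipulation idioms (here i is some bit position less than the width of x):

    unsigned int x = 0;
    int i = 3;

    x |= (1 << i);              /* set bit i */
    if(x & (1 << i)) {          /* test bit i */
        puts("bit i is set");
    }
    x &= ~(1 << i);             /* clear bit i */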

21.3. Logical operators

To add to the confusion, there are also three logical operators that work on the truth-values of integers, where 0 is defined to be false and anything else is defined to be true. These are && (logical AND), || (logical OR), and ! (logical NOT). The result of any of these operators is always 0 or 1 (so !!x, for example, is 0 if x is 0 and 1 if x is anything else). The && and || operators evaluate their arguments left-to-right and ignore the second argument if the first determines the answer (one of the few places in C where argument evaluation order is specified); so

   1     0 && execute_programmer();
   2     1 || execute_programmer();

is in a very weak sense perfectly safe code to run.

Watch out for confusing & with &&. The expression 1 & 2 evaluates to 0, but 1 && 2 evaluates to 1. The statement 0 & execute_programmer(); is also unlikely to do what you want.

Yet another logical operator is the ternary operator ?:, where x ? y : z equals the value of y if x is nonzero and z if x is zero. Like && and ||, it only evaluates the arguments it needs to:

   1     fileExists(badFile) ? deleteFile(badFile) : createFile(badFile);

Most uses of ?: are better done using an if-then-else statement (C/Statements).

21.4. Relational operators

Logical operators usually operate on the results of relational operators or comparisons: these are == (equality), != (inequality), < (less than), > (greater than), <= (less than or equal to) and >= (greater than or equal to). So, for example,

    if(size >= MIN_SIZE && size <= MAX_SIZE) {
        puts("just right");
    }

tests if size is in the (inclusive) range [MIN_SIZE..MAX_SIZE].

Beware of confusing == with =. The code

   1     /* DANGER! DANGER! DANGER! */
   2     if(x = 5) {
   3         ...

is perfectly legal C, and will set x to 5 rather than testing if it's equal to 5. Because 5 happens to be nonzero, the body of the if statement will always be executed. This error is so common and so dangerous that gcc will warn you about any tests that look like this if you use the -Wall option. Some programmers will go so far as to write the test as 5 == x just so that if their finger slips, they will get a syntax error on 5 = x even without special compiler support.

22. Input and output

To input or output integer values, you will need to convert them from or to strings. Converting from a string is easy using the atoi or atol functions declared in stdlib.h; these take a string as an argument and return an int or long, respectively. (C99 also provides atoll for long long, which the example program below uses.)

Output is usually done using printf (or sprintf if you want to write to a string without producing output). Use the %d format specifier for ints, shorts, and chars that you want the numeric value of, %ld for longs, and %lld for long longs.

A contrived program that uses all of these features is given below:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 
   4 /* This program can be used to see how atoi etc. handle overflow. */
   5 /* For example, try "overflow 1000000000000". */
   6 int
   7 main(int argc, char **argv)
   8 {
   9     char c;
  10     int i;
  11     long l;
  12     long long ll;
  13     
  14     if(argc != 2) {
  15         fprintf(stderr, "Usage: %s n\n", argv[0]);
  16         return 1;
  17     }
  18     
  19     c = atoi(argv[1]);
  20     i = atoi(argv[1]);
  21     l = atol(argv[1]);
  22     ll = atoll(argv[1]);
  23 
  24     printf("char: %d  int: %d  long: %ld  long long: %lld", c, i, l, ll);
  25 
  26     return 0;
  27 }
overflow.c

23. Alignment

Modern CPU architectures typically enforce alignment restrictions on multi-byte values, which mean that the address of an int or long typically has to be a multiple of 4. This is an effect of memory being organized as groups of 32 bits that are read and written in parallel, rather than as individual 8-bit bytes. Such restrictions are not obvious when working with integer-valued variables directly, but will come up when we talk about pointers in C/Pointers.
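One place where you can see alignment indirectly is in struct sizes (structs are covered in C/Structs): the compiler inserts unused padding bytes so that each field starts at a properly aligned offset. A minimal sketch; the exact output depends on the machine, but on a typical 32-bit-aligned architecture it prints 8 rather than 5:

    #include <stdio.h>

    struct padded {
        char c;     /* 1 byte */
                    /* compiler will likely insert 3 padding bytes here */
        int n;      /* 4 bytes, usually must start at a multiple of 4 */
    };

    int
    main(int argc, char **argv)
    {
        printf("sizeof(struct padded) = %lu\n",
                (unsigned long) sizeof(struct padded));

        return 0;
    }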


CategoryProgrammingNotes

24. C/InputOutput

Input and output from C programs is typically done through the standard I/O library, whose functions etc. are declared in stdio.h. A detailed description of the functions in this library is given in Appendix B of KernighanRitchie. We'll talk about some of the more useful functions and about how input-output (I/O) works on Unix-like operating systems in general.

25. Character streams

The standard I/O library works on character streams, objects that act like long sequences of incoming or outgoing characters. What a stream is connected to is often not apparent to a program that uses it; an output stream might go to a terminal, to a file, or even to another program (appearing there as an input stream).

Three standard streams are available to all programs: these are stdin (standard input), stdout (standard output), and stderr (standard error). Standard I/O functions that do not take a stream as an argument will generally either read from stdin or write to stdout. The stderr stream is used for error messages. It is kept separate from stdout so that you can see these messages even if you redirect output to a file:

$ ls no-such-file > /tmp/dummy-output
ls: no-such-file: No such file or directory

26. Reading and writing single characters

To read a single character from stdin, use getchar:

   1     int c;
   2 
   3     c = getchar();

The getchar routine will return the special value EOF (usually -1; short for end of file) if there are no more characters to read, which can happen when you hit the end of a file or when the user types the end-of-file key control-D to the terminal. Note that the return value of getchar is declared to be an int since EOF lies outside the normal character range.

To write a single character to stdout, use putchar:

   1     putchar('!');

Even though putchar can only write single bytes, it takes an int as an argument. Any value outside the range 0..255 will be truncated to its last byte, as in the usual conversion from int to unsigned char.

Both getchar and putchar are wrappers for more general routines getc and putc that allow you to specify which stream you are using. To illustrate getc and putc, here's how we might define getchar and putchar if they didn't exist already:

   1 int
   2 getchar2(void)
   3 {
   4     return getc(stdin);
   5 }
   6 
   7 int
   8 putchar2(int c)
   9 {
  10     return putc(c, stdout);
  11 }

Note that putc, putchar2 as defined above, and the original putchar all return an int rather than void; this is so that they can signal whether the write succeeded. If the write succeeded, putchar or putc will return the value written. If the write failed (say because the disk was full), then putc or putchar will return EOF.

Here's another example of using putc to make a new function putcerr that writes a character to stderr:

   1 int
   2 putcerr(int c)
   3 {
   4     return putc(c, stderr);
   5 }

A rather odd feature of the C standard I/O library is that if you don't like the character you just got, you can put it back using the ungetc function. The limitations on ungetc are that (a) you can only push one character back, and (b) that character can't be EOF. The ungetc function is provided because it makes certain high-level input tasks easier; for example, if you want to parse a number written as a sequence of digits, you need to be able to read characters until you hit the first non-digit. But if the non-digit is going to be used elsewhere in your program, you don't want to eat it. The solution is to put it back using ungetc.

Here's a function that uses ungetc to peek at the next character on stdin without consuming it:

   1 /* return the next character from stdin without consuming it */
   2 int
   3 peekchar(void)
   4 {
   5     int c;
   6 
   7     c = getchar();
   8     if(c != EOF) ungetc(c, stdin);      /* puts it back */
   9     
  10     return c;
  11 }

27. Formatted I/O

Reading and writing data one character at a time can be painful. The C standard I/O library provides several convenient routines for reading and writing formatted data. The most commonly used one is printf, which takes as arguments a format string followed by zero or more values that are filled in to the format string according to patterns appearing in it.

Here are some typical printf statements:

   1     printf("Hello\n");          /* print "Hello" followed by a newline */
   2     printf("%c", c);            /* equivalent to putchar(c) */
   3     printf("%d", n);            /* print n (an int) formatted in decimal */
   4     printf("%u", n);            /* print n (an unsigned int) formatted in decimal */
   5     printf("%o", n);            /* print n (an unsigned int) formatted in octal */
   6     printf("%x", n);            /* print n (an unsigned int) formatted in hexadecimal */
   7     printf("%f", x);            /* print x (a float or double) */
   8 
   9     /* print total (an int) and average (a double) on two lines with labels */
  10     printf("Total: %d\nAverage: %f\n", total, average);

For a full list of formatting codes see Table B-1 in KernighanRitchie, or run man 3 printf.

The inverse of printf is scanf. The scanf function reads formatted data from stdin according to the format string passed as its first argument and stuffs the results into variables whose addresses are given by the later arguments. This requires prefixing each such argument with the & operator, which takes the address of a variable.

Format strings for scanf are close enough to format strings for printf that you can usually copy them over directly. However, because scanf arguments don't go through argument promotion (where all small integer types are converted to int and floats are converted to double), you have to be much more careful about specifying the type of the argument correctly.

   1     scanf("%c", &c);            /* like c = getchar(); c must be a char */
   2     scanf("%d", &n);            /* read an int formatted in decimal */
   3     scanf("%u", &n);            /* read an unsigned int formatted in decimal */
   4     scanf("%o", &n);            /* read an unsigned int formatted in octal */
   5     scanf("%x", &n);            /* read an unsigned int formatted in hexadecimal */
   6     scanf("%f", &x);            /* read a float */
   7     scanf("%lf", &x);           /* read a double */
   8 
   9     /* read total (an int) and average (a float) on two lines with labels */
  10     /* (will also work if input is missing newlines or uses other whitespace, see below) */
  11     scanf("Total: %d\nAverage: %f\n", &total, &average);

The scanf routine eats whitespace (spaces, tabs, newlines, etc.) in its input whenever it sees a conversion specification or a whitespace character in its format string. Non-whitespace characters that are not part of conversion specifications must match exactly. To detect if scanf parsed everything successfully, look at its return value; it returns the number of values it filled in, or EOF if it hits end-of-file before filling in any values.
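For example, here is a sketch of a loop that reads pairs of integers and prints their sums, stopping at the first input that doesn't parse (or at end-of-file); comparing the return value against 2 catches both cases:

    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        int x, y;

        /* scanf returns 2 only if both conversions succeeded */
        while(scanf("%d %d", &x, &y) == 2) {
            printf("%d\n", x + y);
        }

        return 0;
    }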

The printf and scanf routines are wrappers for fprintf and fscanf, which take a stream as their first argument, e.g.:

   1     fprintf(stderr, "BUILDING ON FIRE, %d%% BURNT!!!\n", percentage);

Note the use of "%%" to print a single percent in the output.

28. Rolling your own I/O routines

Since we can write our own functions in C, if we don't like what the standard routines do, we can build our own on top of them. For example, here's a function that reads in integer values without leading minus signs and returns the result. It uses the peekchar routine we defined above, as well as the isdigit routine declared in ctype.h.

   1 /* read an integer written in decimal notation from stdin until the first
   2  * non-digit and return it.  Returns 0 if there are no digits. */
   3 int
   4 readNumber(void)
   5 {
   6     int accumulator;    /* the number so far */
   7     int c;              /* next character */
   8 
   9     accumulator = 0;
  10 
  11     while((c = peekchar()) != EOF && isdigit(c)) {
  12         c = getchar();                  /* consume it */
  13         accumulator *= 10;              /* shift previous digits over */
  14         accumulator += (c - '0');       /* add decimal value of new digit */
  15     }
  16 
  17     return accumulator;
  18 }

Here's another implementation that does almost the same thing:

   1 int
   2 readNumber2(void)
   3 {
   4     unsigned int n;     /* %u expects a pointer to unsigned int */
   5 
   6     if(scanf("%u", &n) == 1) {
   7         return n;
   8     } else {
   9         return 0;
  10     }
  11 }

The difference is that readNumber2 will consume any whitespace before the first digit, which may or may not be what we want.

More complex routines can be used to parse more complex input. For example, here's a routine that uses readNumber to parse simple arithmetic expressions, where each expression is either a number or of the form (expression+expression) or (expression*expression). The return value is the value of the expression after adding together or multiplying all of its subexpressions. (A complete program including this routine and the others defined earlier that it uses can be found in calc.c.)

   1 #define EXPRESSION_ERROR (-1)
   2 
   3 /* read an expression from stdin and return its value */
   4 /* returns EXPRESSION_ERROR on error */
   5 int
   6 readExpression(void)
   7 {
   8     int e1;             /* value of first sub-expression */
   9     int e2;             /* value of second sub-expression */
  10     int c;
  11     int op;             /* operation: '+' or '*' */
  12 
  13     c = peekchar();
  14 
  15     if(c == '(') {
  16         c = getchar();
  17 
  18         e1 = readExpression();
  19         op = getchar();
  20         e2 = readExpression();
  21 
  22         c = getchar();  /* this had better be ')' */
  23         if(c != ')') return EXPRESSION_ERROR;
  24 
  25         /* else */
  26         switch(op) {
  27         case '*':
  28             return e1*e2;
  29             break;
  30         case '+':
  31             return e1+e2;
  32             break;
  33         default:
  34             return EXPRESSION_ERROR;
  35             break;
  36         }
  37     } else if(isdigit(c)) {
  38         return readNumber();
  39     } else {
  40         return EXPRESSION_ERROR;
  41     }
  42 }

Because this routine calls itself recursively as it works its way down through the input, it is an example of a recursive descent parser. Parsers for more complicated languages (e.g. C) are usually not written by hand like this, but are instead constructed mechanically using a Parser generator.

29. File I/O

Reading and writing files is done by creating new streams attached to the files. The function that does this is fopen. It takes two arguments: a filename, and a flag that controls whether the file is opened for reading or writing. The return value of fopen has type FILE * and can be used in putc, getc, fprintf, etc. just like stdin, stdout, or stderr. When you are done using a stream, you should close it using fclose.

Here's a program that reads a list of numbers from a file whose name is given as argv[1] and prints their sum:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 
   4 int
   5 main(int argc, char **argv)
   6 {
   7     FILE *f;
   8     int x;
   9     int sum;
  10 
  11     if(argc < 2) {
  12         fprintf(stderr, "Usage: %s filename\n", argv[0]);
  13         exit(1);
  14     }
  15 
  16     f = fopen(argv[1], "r");
  17     if(f == 0) {
  18         /* perror is a standard C library routine */
  19         /* that prints a message about the last failed library routine */
  20         /* prepended by its argument */
  21         perror(argv[1]);
  22         exit(2);
  23     }
  24 
  25     /* else everything is ok */
  26     sum = 0;
  27     while(fscanf(f, "%d", &x) == 1) {
  28         sum += x;
  29     }
  30 
  31     printf("%d\n", sum);
  32 
  33     /* not strictly necessary but it's polite */
  34     fclose(f);
  35 
  36     return 0;
  37 }

To write to a file, open it with fopen(filename, "w"). Note that as soon as you call fopen with the "w" flag, any previous contents of the file are erased. If you want to append to the end of an existing file, use "a" instead. You can also add + onto the flag if you want to read and write the same file (this will probably involve using fseek).
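As a small illustration of append mode, the following sketch adds one line to a log file every time it is run; the filename log.txt is made up for this example:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        FILE *f;

        f = fopen("log.txt", "a");  /* created if missing, else extended */
        if(f == 0) {
            perror("log.txt");
            exit(1);
        }

        fputs("program was run\n", f);
        fclose(f);

        return 0;
    }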

Some operating systems (Windows) make a distinction between text and binary files. For text files, use the same arguments as above. For binary files, add a b, e.g. fopen(filename, "wb") to write a binary file.

   1 /* leave a greeting in the current directory */
   2 
   3 #include <stdio.h>
   4 #include <stdlib.h>
   5 
   6 #define FILENAME "hello.txt"
   7 #define MESSAGE "hello world"
   8 
   9 int
  10 main(int argc, char **argv)
  11 {
  12     FILE *f;
  13 
  14     f = fopen(FILENAME, "w");
  15     if(f == 0) {
  16         perror(FILENAME);
  17         exit(1);
  18     }
  19 
  20     /* unlike puts, fputs doesn't add a newline */
  21     fputs(MESSAGE, f);
  22     putc('\n', f);
  23 
  24     fclose(f);
  25 
  26     return 0;
  27 }
helloFile.c
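For binary files, I/O is usually done with the standard fread and fwrite routines, which copy raw bytes instead of formatted text. Here is a minimal sketch that writes three ints to a file and reads them back; the filename data.bin is invented for the example:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        FILE *f;
        int out[3] = { 1, 2, 3 };
        int in[3];

        /* write three ints as raw bytes */
        f = fopen("data.bin", "wb");
        if(f == 0) {
            perror("data.bin");
            exit(1);
        }
        fwrite(out, sizeof(int), 3, f);
        fclose(f);

        /* read them back */
        f = fopen("data.bin", "rb");
        if(f == 0) {
            perror("data.bin");
            exit(1);
        }
        if(fread(in, sizeof(int), 3, f) == 3) {
            printf("%d %d %d\n", in[0], in[1], in[2]);
        }
        fclose(f);

        return 0;
    }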


CategoryProgrammingNotes

30. C/Statements

The bodies of C functions (including the main function) are made up of statements. These can either be simple statements that do not contain other statements, or compound statements that have other statements inside them. Control structures are compound statements like if/then/else, while, for, and do..while that control how or whether their component statements are executed.

31. Simple statements

The simplest kind of statement in C is an expression (followed by a semicolon, the terminator for all simple statements). Its value is computed and discarded. Examples:

   1     x = 2;              /* an assignment statement */
   2     x = 2+3;            /* another assignment statement */
   3     2+3;                /* has no effect---will be discarded by smart compilers */
   4     puts("hi");         /* a statement containing a function call */
   5     root2 = sqrt(2);    /* an assignment statement with a function call */

Most statements in a typical C program are simple statements of this form.

Other examples of simple statements are the jump statements return, break, continue, and goto. A return statement specifies the return value for a function (if there is one), and when executed it causes the function to exit immediately. The break and continue statements jump immediately to the end of a loop (or switch; see below) or the next iteration of a loop; we'll talk about these more when we talk about loops. The goto statement jumps to another location in the same function, and exists for the rare occasions when it is needed. Using it in most circumstances is a sin.

32. Compound statements

Compound statements come in two varieties: conditionals and loops.

32.1. Conditionals

These are compound statements that test some condition and execute one or another block depending on the outcome of the condition. The simplest is the if statement:

   1     if(houseIsOnFire) {
   2         /* ouch! */
   3         scream();
   4         runAway();
   5     }

The body of the if statement is executed only if the expression in parentheses at the top evaluates to true (which in C means any value that is not 0).

The braces are not strictly required, and are used only to group one or more statements into a single statement. If there is only one statement in the body, the braces can be omitted:

   1     if(programmerIsLazy) omitBraces();

This style is recommended only for very simple bodies. Omitting the braces makes it harder to add more statements later without errors.

   1     if(underAttack)
   2         launchCounterAttack();   /* executed only when attacked */
   3         hideInBunker();          /* ### DO NOT INDENT LIKE THIS ### executed always */

In the example above, the lack of braces means that the hideInBunker() statement is not part of the if statement, despite the misleading indentation. This sort of thing is why I generally always put in braces in an if.

An if statement may have an else clause, whose body is executed if the test is false (i.e. equal to 0).

   1     if(happy) {
   2         smile();
   3     } else {
   4         frown();
   5     }

A common idiom is to have a chain of if and else if branches that test several conditions:

   1     if(temperature < 0) {
   2         puts("brrr");
   3     } else if(temperature < 100) {
   4         puts("hooray");
   5     } else {
   6         puts("ouch!");
   7     }

This can be inefficient if there are a lot of cases, since the tests are applied sequentially. For tests of the form <expression> == <small constant>, the switch statement may provide a faster alternative. Here's a typical switch statement:

   1     /* print plural of cow, maybe using the obsolete dual number construction */
   2     switch(numberOfCows) {
   3     case 1:
   4         puts("cow");
   5         break;
   6     case 2:
   7         puts("cowen");
   8         break;
   9     default:
  10         puts("cows");
  11         break;
  12     }

This prints the string "cow" if there is one cow, "cowen" if there are two cowen, and "cows" if there are any other number of cows. The switch statement evaluates its argument and jumps to the matching case label, or to the default label if none of the cases match. Cases must be constant integer values.

The break statements inside the block jump to the end of the block. Without them, executing the switch with numberOfCows equal to 1 would print all three lines. This can be useful in some circumstances where the same code should be used for more than one case:

   1     switch(c) {
   2     case 'a':
   3     case 'e':
   4     case 'i':
   5     case 'o':
   6     case 'u':
   7         type = VOWEL;
   8         break;
   9     default:
  10         type = CONSONANT;
  11         break;
  12     }

or when a case "falls through" to the next:

   1     switch(countdownStart) {
   2     case 3:
   3         puts("3");
   4     case 2:
   5         puts("2");
   6     case 1:
   7         puts("1")
   8     case 0:
   9         puts("KABLOOIE!");
  10         break;
  11     default:
  12         puts("I can't count that high!");
  13         break;
  14     }

Note that it is customary to include a break on the last case even though it has no effect; this avoids problems later if a new case is added. It is also customary to include a default case even if the other cases supposedly exhaust all the possible values, as a check against bad or unanticipated inputs.

   1     switch(oliveSize) {
   2     case JUMBO:
   3         eatOlives(SLOWLY);
   4         break;
   5     case COLOSSAL:
   6         eatOlives(QUICKLY);
   7         break;
   8     case SUPER_COLOSSAL:
   9         eatOlives(ABSURDLY);
  10         break;
  11     default:
  12         /* unknown size! */
  13         abort();
  14         break;
  15     }

Though switch statements are better than deeply nested if/else-if constructions, it is often even better to organize the different cases as data rather than code. We'll see examples of this when we talk about function pointers in C/FunctionPointers.

Nothing in the C standards prevents the case labels from being buried inside other compound statements. One rather hideous application of this fact is Duff's device.

32.2. Loops

There are three kinds of loops in C.

32.2.1. The while loop

A while loop tests if a condition is true, and if so, executes its body. It then tests the condition again, and keeps executing the body as long as the condition remains true. Here's a program that deletes every occurrence of the letter e from its input.

   1 #include <stdio.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     int c;
   7 
   8     while((c = getchar()) != EOF) {
   9         switch(c) {
  10         case 'e':
  11         case 'E':
  12             break;
  13         default:
  14             putchar(c);
  15             break;
  16         }
  17     }
  18 
  19     return 0;
  20 }

Note that the expression inside the while argument both assigns the return value of getchar to c and tests to see if it is equal to EOF (which is returned when no more input characters are available). This is a very common idiom in C programs. Note also that even though c holds a single character, it is declared as an int. The reason is that EOF (a constant defined in stdio.h) is outside the normal character range, and if you assign it to a variable of type char it will be quietly truncated into something else. Because C doesn't provide any sort of exception mechanism for signalling unusual outcomes of function calls, designers of library functions often have to resort to extending the output of a function to include an extra value or two to signal failure; we'll see this a lot when the null pointer shows up in C/Pointers.

32.2.2. The do..while loop

The do..while statement is like the while statement except the test is done at the end of the loop instead of the beginning. This means that the body of the loop is always executed at least once.

Here's a loop that does a random walk until it gets back to 0 (if ever). If we changed the do..while loop to a while loop, it would never take the first step, because pos starts at 0.

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <time.h>
   4 
   5 int
   6 main(int argc, char **argv)
   7 {
   8     int pos = 0;       /* position of random walk */
   9 
  10     srandom(time(0));  /* initialize random number generator */
  11 
  12     do {
  13         pos += random() & 0x1 ? +1 : -1;
  14         printf("%d\n", pos);
  15     } while(pos != 0);
  16 
  17     return 0;
  18 }
random_walk.c

The do..while loop is used much less often in practice than the while loop. Note that it is always possible to convert a do..while loop to a while loop by making an extra copy of the body in front of the loop.
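For example, the random-walk program above could be rewritten with a while loop by copying the body once before the loop; this sketch should behave identically:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int
    main(int argc, char **argv)
    {
        int pos = 0;       /* position of random walk */

        srandom(time(0));  /* initialize random number generator */

        /* extra copy of the loop body replaces the do..while */
        pos += random() & 0x1 ? +1 : -1;
        printf("%d\n", pos);

        while(pos != 0) {
            pos += random() & 0x1 ? +1 : -1;
            printf("%d\n", pos);
        }

        return 0;
    }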

32.2.3. The for loop

The for loop is a form of SyntacticSugar that is used when a loop iterates over a sequence of values stored in some variable (or variables). Its argument consists of three expressions: the first initializes the variable and is called once when the statement is first reached. The second is the test to see if the body of the loop should be executed; it has the same function as the test in a while loop. The third sets the variable to its next value. Some examples:

   1     /* count from 0 to 9 */
   2     for(i = 0; i < 10; i++) {
   3         printf("%d\n", i);
   4     }
   5     
   6     /* and back from 10 to 0 */
   7     for(i = 10; i >= 0; i--) {
   8         printf("%d\n", i);
   9     }
  10 
  11     /* this loop uses some functions to move around */
  12     for(c = firstCustomer(); c != END_OF_CUSTOMERS; c = customerAfter(c)) {
  13         helpCustomer(c);
  14     }
  15 
  16     /* this loop prints powers of 2 that are less than n*/
  17     for(i = 1; i < n; i *= 2) {
  18         printf("%d\n", i);
  19     }
  20 
  21     /* this loop does the same thing with two variables by using the comma operator */
  22     for(i = 0, power = 1; power < n; i++, power *= 2) {
  23         printf("2^%d = %d\n", i, power);
  24     }
  25 
  26     /* Here are some nested loops that print a times table */
  27     for(i = 0; i < n; i++) {
  28         for(j = 0; j < n; j++) {
  29             printf("%d*%d=%d ", i, j, i*j);
  30         }
  31         putchar('\n');
  32     }

A for loop can always be rewritten as a while loop.

   1     for(i = 0; i < 10; i++) {
   2         printf("%d\n", i);
   3     }
   4 
   5     /* is exactly the same as */
   6 
   7     i = 0;
   8     while(i < 10) {
   9         printf("%d\n", i);
  10         i++;
  11     }

32.2.4. Loops with break, continue, and goto

The break statement immediately exits the innermost enclosing loop or switch statement.

   1     for(i = 0; i < n; i++) {
   2         openDoorNumber(i);
   3         if(boobyTrapped()) {
   4             break;
   5         }
   6     }

The continue statement skips to the next iteration. Here is a program with a loop that iterates through all the integers from -10 through 10, skipping 0:

   1 #include <stdio.h>
   2 
   3 /* print a table of inverses */
   4 #define MAXN (10)
   5 
   6 int
   7 main(int argc, char **argv)
   8 {
   9     int n;
  10 
  11     for(n = -MAXN; n <= MAXN; n++) {
  12         if(n == 0) continue;
  13         printf("1.0/%3d = %+f\n", n, 1.0/n);
  14     }
  15 
  16     return 0;
  17 }
inverses.c

Occasionally, one would like to break out of more than one nested loop. The way to do this is with a goto statement.

   1     for(i = 0; i < n; i++) {
   2         for(j = 0; j < n; j++) {
   3             doSomethingTimeConsumingWith(i, j);
   4             if(checkWatch() == OUT_OF_TIME) {
   5                 goto giveUp;
   6             }
   7         }
   8     }
   9 giveUp:
  10     puts("done");

The target for the goto is a label, which is just an identifier followed by a colon and a statement (the empty statement ; is ok).

The goto statement can be used to jump anywhere within the same function body, but breaking out of nested loops is widely considered to be its only genuinely acceptable use in normal code.

32.3. Choosing where to put a loop exit

Choosing where to put a loop exit is usually pretty obvious: you want it after any code that you want to execute at least once, and before any code that you want to execute only if the termination test fails.

If you know in advance what values you are going to be iterating over, you will most likely be using a for loop:

   1 for(i = 0; i < n; i++) {
   2     a[i] = 0;
   3 }

Most of the rest of the time, you will want a while loop:

   1 while(!done()) {
   2     doSomething();
   3 }

The do..while loop comes up mostly when you want to try something, then try again if it failed:

   1 do {
   2     result = fetchWebPage(url);
   3 } while(result == 0);

Finally, leaving a loop in the middle using break can be handy if you have something extra to do before trying again:

   1 for(;;) {
   2     result = fetchWebPage(url);
   3     if(result != 0) {
   4         break;
   5     }
   6     /* else */
   7     fprintf(stderr, "fetchWebPage failed with error code %03d\n", result);
   8     sleep(retryDelay);  /* wait before trying again */
   9 }

(Note the empty for loop header means to loop forever; while(1) also works.)


CategoryProgrammingNotes

33. C/FloatingPoint

Real numbers are represented in C by the floating point types float, double, and long double. Just as the integer types can't represent all integers because they fit in a bounded number of bytes, so also the floating-point types can't represent all real numbers. The difference is that the integer types can represent values within their range exactly, while floating-point types almost always give only an approximation to the correct value, albeit across a much larger range. The three floating point types differ in how much space they use (32, 64, or 80 bits on x86 CPUs; possibly different amounts on other machines), and thus how much precision they provide. Most math library routines expect and return doubles (e.g., sin is declared as double sin(double)), but there are usually float versions as well (float sinf(float)).

34. Floating point basics

The core idea of floating-point representations (as opposed to fixed point representations as used by, say, ints) is that a number x is written as m*b^e, where m is a mantissa or fractional part, b is a base, and e is an exponent. On modern computers the base is almost always 2, and for most floating-point representations the mantissa will be scaled to be between 1 and b. This is done by adjusting the exponent, e.g.

1 = 1*2^0

2 = 1*2^1

0.375 = 1.5*2^(-2)

etc.

The mantissa is usually represented in base b, as a binary fraction. So (in a very low-precision format), 1 would be 1.000*2^0, 2 would be 1.000*2^1, and 0.375 would be 1.100*2^(-2), where the first 1 after the binary point counts as 1/2, the second as 1/4, etc. Note that for a properly-scaled (or normalized) floating-point number in base 2, the digit before the binary point is always 1. For this reason it is usually dropped (although this requires a special representation for 0).

Negative values are typically handled by adding a sign bit that is 0 for positive numbers and 1 for negative numbers.
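You can watch this decomposition happen using the standard frexp routine from the math library, which splits a double into a mantissa and a base-2 exponent. Note that frexp scales the mantissa into the range [0.5, 1) rather than [1, 2), so its exponents come out one higher than in the examples above. A minimal sketch (you may need the -lm linker flag described in the math library section below):

    #include <stdio.h>
    #include <math.h>

    int
    main(int argc, char **argv)
    {
        double x = 0.375;
        double m;
        int e;

        m = frexp(x, &e);   /* x == m * 2^e with 0.5 <= m < 1 */

        /* prints 0.375000 = 0.750000 * 2^-1 */
        printf("%f = %f * 2^%d\n", x, m, e);

        return 0;
    }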

35. Floating-point constants

Any number that has a decimal point in it will be interpreted by the compiler as a floating-point number. Note that you have to put at least one digit after the decimal point: 2.0, 3.75, -12.6112. You can specify a floating point number in scientific notation using e for the exponent: 6.022e23.

36. Operators

Floating-point types in C support most of the same arithmetic and relational operators as integer types; x > y, x / y, x + y all make sense when x and y are floats. If you mix two different floating-point types together, the less-precise one will be extended to match the precision of the more-precise one; this also works if you mix integer and floating point types as in 2 / 3.0. Unlike integer division, floating-point division does not discard the fractional part (although it may produce round-off error: 2.0/3.0 gives 0.66666666666666663, which is not quite exact). Be careful about accidentally using integer division when you mean to use floating-point division: 2/3 is 0. Casts can be used to force floating-point division (see below).

Some operators that work on integers will not work on floating-point types. These are % (use modf from the math library if you really need to get a floating-point remainder) and all of the bitwise operators ~, <<, >>, &, ^, and |.

37. Conversion to and from integer types

Mixed uses of floating-point and integer types will convert the integers to floating-point.

You can convert floating-point numbers to and from integer types explicitly using casts. A typical use might be:

   1 /* return the average of a list */
   2 double
   3 average(int n, int a[])
   4 {
   5     int sum = 0;
   6     int i;
   7 
   8     for(i = 0; i < n; i++) {
   9         sum += a[i];
  10     }
  11 
  12     return (double) sum / n;
  13 }

If we didn't put in the (double) to convert sum to a double, we'd end up doing integer division, which would truncate the fractional part of our average.

In the other direction, we can write:

   1    i = (int) f;

to convert a float f to int i. This conversion loses information by throwing away the fractional part of f: if f was 3.2, i will end up being just 3.
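If you want rounding to the nearest integer rather than truncation toward zero, a common trick for nonnegative values is to add 0.5 before converting; C99 also provides a round function in the math library. A sketch:

    double f = 3.7;
    int i;

    i = (int) (f + 0.5);    /* rounds nonnegative f to nearest: i == 4 */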

38. The IEEE-754 floating-point standard

The IEEE-754 floating-point standard is a standard for representing and manipulating floating-point quantities that is followed by all modern computer systems. It defines several standard representations of floating-point numbers, all of which have the following basic pattern (the specific layout here is for 32-bit floats):

bit  31 30    23 22                    0
     S  EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM

The bit numbers are counting from the least-significant bit. The first bit is the sign (0 for positive, 1 for negative). The following 8 bits are the exponent in excess-127 binary notation; this means that the binary pattern 01111111 = 127 represents an exponent of 0, 10000000 = 128 represents 1, 01111110 = 126 represents -1, and so forth. The 24-bit mantissa fits in the remaining 23 bits, with its leading 1 stripped off as described above.

Certain numbers have a special representation. Because 0 cannot be represented in the standard form (there is no 1 before the decimal point), it is given the special representation 0 00000000 00000000000000000000000. (There is also a -0 = 1 00000000 00000000000000000000000, which looks equal to +0 but prints differently.) Numbers with an exponent field of 11111111 = 255 (which would otherwise correspond to a factor of 2^128) are reserved for non-numeric quantities such as "not a number" (NaN), returned by operations like (0.0/0.0), and positive or negative infinity. A table of some typical floating-point numbers (generated by the program float.c) is given below:

         0 =                        0 = 0 00000000 00000000000000000000000
        -0 =                       -0 = 1 00000000 00000000000000000000000
     0.125 =                    0.125 = 0 01111100 00000000000000000000000
      0.25 =                     0.25 = 0 01111101 00000000000000000000000
       0.5 =                      0.5 = 0 01111110 00000000000000000000000
         1 =                        1 = 0 01111111 00000000000000000000000
         2 =                        2 = 0 10000000 00000000000000000000000
         4 =                        4 = 0 10000001 00000000000000000000000
         8 =                        8 = 0 10000010 00000000000000000000000
     0.375 =                    0.375 = 0 01111101 10000000000000000000000
      0.75 =                     0.75 = 0 01111110 10000000000000000000000
       1.5 =                      1.5 = 0 01111111 10000000000000000000000
         3 =                        3 = 0 10000000 10000000000000000000000
         6 =                        6 = 0 10000001 10000000000000000000000
       0.1 =      0.10000000149011612 = 0 01111011 10011001100110011001101
       0.2 =      0.20000000298023224 = 0 01111100 10011001100110011001101
       0.4 =      0.40000000596046448 = 0 01111101 10011001100110011001101
       0.8 =      0.80000001192092896 = 0 01111110 10011001100110011001101
     1e+12 =             999999995904 = 0 10100110 11010001101010010100101
     1e+24 =   1.0000000138484279e+24 = 0 11001110 10100111100001000011100
     1e+36 =   9.9999996169031625e+35 = 0 11110110 10000001001011111001110
       inf =                      inf = 0 11111111 00000000000000000000000
      -inf =                     -inf = 1 11111111 00000000000000000000000
       nan =                      nan = 0 11111111 10000000000000000000000

What this means in practice is that a 32-bit floating-point value (e.g. a float) can represent any number between 1.17549435e-38 and 3.40282347e+38, where the e separates the (base 10) exponent. Operations that would create a smaller value will underflow to 0 (slowly—IEEE 754 allows "denormalized" floating point numbers with reduced precision for very small values) and operations that would create a larger value will produce inf or -inf instead.

For a 64-bit double, the size of both the exponent and mantissa are larger; this gives a range from 2.2250738585072014e-308 to 1.7976931348623157e+308, with similar behavior on underflow and overflow.

Intel processors internally use an even larger 80-bit floating-point format for all operations. Unless you declare your variables as long double, this should not be visible to you from C except that some operations that might otherwise produce overflow errors will not do so, provided all the variables involved sit in registers (typically the case only for local variables and function parameters).

39. Error

In general, floating-point numbers are not exact: they are likely to contain round-off error because of the truncation of the mantissa to a fixed number of bits. This is particularly noticeable for large values (e.g. 1e+12 in the table above), but can also be seen in fractions with values that aren't powers of 2 in the denominator (e.g. 0.1). Round-off error is often invisible with the default float output formats, since they produce fewer digits than are stored internally, but can accumulate over time, particularly if you subtract floating-point quantities with values that are close (this wipes out the mantissa without wiping out the error, making the error much larger relative to the number that remains).

The easiest way to avoid accumulating error is to use high-precision floating-point numbers (this means using double instead of float). On modern CPUs there is little or no time penalty for doing so, although storing doubles instead of floats will take twice as much space in memory.

Note that a consequence of the internal structure of IEEE 754 floating-point numbers is that small integers and fractions with small numerators and power-of-2 denominators can be represented exactly—indeed, the IEEE 754 standard carefully defines floating-point operations so that arithmetic on such exact integers will give the same answers as integer arithmetic would (except, of course, for division that produces a remainder). This fact can sometimes be exploited to get higher precision on integer values than is available from the standard integer types; for example, a double can represent any integer between -2^53 and 2^53 exactly, which is a much wider range than the values from -2^31 to 2^31-1 that fit in a 32-bit int or long. (A 64-bit long long does better.) So double should be considered for applications where large precise integers are needed (such as calculating the net worth in pennies of a billionaire).

One consequence of round-off error is that it is very difficult to test floating-point numbers for equality, unless you are sure you have an exact value as described above. It is generally not the case, for example, that (0.1+0.1+0.1) == 0.3 in C. This can produce odd results if you try writing something like for(f = 0.0; f <= 0.3; f += 0.1): it will be hard to predict in advance whether the loop body will be executed with f = 0.3 or not. (Even more hilarity ensues if you write for(f = 0.0; f != 0.3; f += 0.1), which after not quite hitting 0.3 exactly keeps looping for much longer than I am willing to wait to see it stop, but which I suspect will eventually converge to some constant value of f large enough that adding 0.1 to it has no effect.) Most of the time when you are tempted to test floats for equality, you are better off testing if one lies within a small distance from the other, e.g. by testing fabs(x-y) <= fabs(EPSILON * y), where EPSILON is usually some application-dependent tolerance. This isn't quite the same as equality (for example, it isn't transitive), but it is usually closer to what you want.
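Here is that test packaged as a function; this is a sketch, and the name nearlyEqual and the choice of a relative tolerance are conventions invented for this example rather than anything standard:

    #include <math.h>

    #define EPSILON (1e-9)      /* application-dependent tolerance */

    /* nonzero if x and y agree to within a relative error of EPSILON */
    int
    nearlyEqual(double x, double y)
    {
        return fabs(x - y) <= fabs(EPSILON * y);
    }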

40. Reading and writing floating-point numbers

Any numeric constant in a C program that contains a decimal point is treated as a double by default. You can also use e or E to add a base-10 exponent (see the table for some examples of this.) If you want to insist that a constant value is a float for some reason, you can append F on the end, as in 1.0F.

For I/O, floating-point values are most easily read and written using scanf (and its relatives fscanf and sscanf) and printf. For printf, there is an elaborate variety of floating-point format codes; the easiest way to find out what these do is experiment with them. For scanf, pretty much the only two codes you need are "%lf", which reads a double value into a double *, and "%f", which reads a float value into a float *. Both these formats are exactly the same in printf, since a float is promoted to a double before being passed as an argument to printf (or any other function that doesn't declare the type of its arguments). But you have to be careful with the arguments to scanf or you will get odd results as only 4 bytes of your 8-byte double are filled in, or—even worse—8 bytes of your 4-byte float are.

41. Non-finite numbers in C

The values nan, inf, and -inf can't be written in this form as floating-point constants in a C program, but printf will generate them and scanf seems to recognize them. With some machines and compilers you may be able to use the macros INFINITY and NAN from <math.h> to generate infinite quantities. The macros isinf and isnan can be used to detect such quantities if they occur.

42. The math library

(See also KernighanRitchie Appendix B4.)

Many mathematical functions on floating-point values are not linked into C programs by default, but can be obtained by linking in the math library. Examples would be the trigonometric functions sin, cos, and tan (plus more exotic ones), sqrt for taking square roots, pow for exponentiation, log and exp for base-e logs and exponents, and fmod for when you really want to write x%y but one or both variables is a double. The standard math library functions all take doubles as arguments and return double values; most implementations also provide some extra functions with similar names (e.g., sinf) that use floats instead, for applications where space or speed is more important than accuracy.

There are two parts to using the math library. The first is to include the line

   1 #include <math.h>
   2 

somewhere at the top of your source file. This tells the preprocessor to paste in the declarations of the math library functions found in /usr/include/math.h.

The second step is to link to the math library when you compile. This is done by passing the flag -lm to gcc after your C program source file(s). A typical command might be:

gcc -o program program.c -lm

If you don't do this, you will get errors from the compiler about missing functions. The reason is that the math library is not linked in by default, since for many system programs it's not needed.
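Putting the two steps together, here is a minimal sketch of a complete program that uses the math library; assuming it is saved as hypot_demo.c (an invented name), it would be compiled with gcc -o hypot_demo hypot_demo.c -lm:

    #include <stdio.h>
    #include <math.h>

    int
    main(int argc, char **argv)
    {
        double x = 3.0;
        double y = 4.0;

        /* sqrt and pow both come from the math library */
        printf("%f\n", sqrt(pow(x, 2.0) + pow(y, 2.0)));   /* prints 5.000000 */

        return 0;
    }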


CategoryProgrammingNotes

43. C/Functions

A function, procedure, or subroutine encapsulates some complex computation as a single operation. Typically, when we call a function, we pass as arguments all the information this function needs, and any effect it has will be reflected in either its return value or (in some cases) in changes to values pointed to by the arguments. Inside the function, the arguments are copied into local variables, which can be used just like any other local variable---they can even be assigned to without affecting the original argument.

44. Function definitions

A typical function definition looks like this:

   1 /* Returns the square of the distance between two points separated by 
   2    dx in the x direction and dy in the y direction. */
   3 int
   4 distSquared(int dx, int dy)
   5 {
   6     return dx*dx + dy*dy;
   7 }

The part outside the braces is called the function declaration; the braces and their contents is the function body.

Like most complex declarations in C, once you delete the type names the declaration looks like how the function is used: the name of the function comes before the parentheses and the arguments inside. The ints scattered about specify the type of the return value of the function (Line 3) and of the parameters (Line 4); these are used by the compiler to determine how to pass values in and out of the function and (usually for more complex types, since numerical types will often convert automatically) to detect type mismatches.

If you want to define a function that doesn't return anything, declare its return type as void. You should also declare a parameter list of void if the function takes no arguments.

   1 /* Prints "hi" to stdout */
   2 void
   3 helloWorld(void)
   4 {
   5     puts("hi");
   6 }

It is not strictly speaking an error to omit the second void here. Putting void in for the parameters tells the compiler to enforce that no arguments are passed in. If we had instead declared helloWorld as

   1 /* Prints "hi" to stdout */
   2 void
   3 helloWorld()    /* DANGER! */
   4 {
   5     puts("hi");
   6 }

it would be possible to call it as

   1     helloWorld("this is a bogus argument");

without causing an error. The reason is that a function declaration with no arguments means that the function can take an unspecified number of arguments, and it's up to the user to make sure they pass in the right ones. There are good historical reasons for what may seem like obvious lack of sense in the design of the language here, and fixing this bug would break most C code written before 1989. But you shouldn't ever write a function declaration with an empty argument list, since you want the compiler to know when something goes wrong.

45. Calling a function

A function call consists of the function followed by its arguments (if any) inside parentheses, separated by commas. For a function with no arguments, call it with nothing between the parentheses. A function call that returns a value can be used in an expression just like a variable. A call to a void function can only be used as an expression by itself:

   1     totalDistance += distSquared(x1 - x2, y1 - y2);
   2     helloWorld();
   3     greetings += helloWorld();  /* ERROR */

46. The return statement

To return a value from a function, write a return statement, e.g.

   1     return 172;

The argument to return can be any expression. Unlike the expression in, say, an if statement, you do not need to wrap it in parentheses. If a function is declared void, you can do a return with no expression, or just let control reach the end of the function.

Executing a return statement immediately terminates the function. This can be used like break to get out of loops early.

   1 /* returns 1 if n is prime, 0 otherwise */
   2 int
   3 isPrime(int n)
   4 {
   5     int i;
   6 
   7     if (n < 2) return 0;   /* special case for 0, 1, negative n */
   8  
   9     for(i = 2; i < n; i++) {
  10         if (n % i == 0) {
  11             /* found a factor */
  12             return 0;
  13         }
  14     }
  15 
  16     /* no factors */
  17     return 1;
  18 }

47. Function declarations and modules

By default, functions have global scope: they can be used anywhere in your program, even in other files. If a file doesn't contain a declaration for a function someFunc before it is used, the compiler will assume that it is declared like int someFunc() (i.e., return type int and unknown arguments). This can produce infuriating complaints later when the compiler hits the real declaration and insists that your function someFunc should be returning an int and you are a bonehead for declaring it otherwise.

The solution to such insulting compiler behavior errors is to either (a) move the function declaration before any functions that use it; or (b) put in a declaration without a body before any functions that use it, in addition to the declaration that appears in the function definition. (Note that this violates the no separate but equal rule, but the compiler should tell you when you make a mistake.) Option (b) is generally preferred, and is the only option when the function is used in a different file.

To make sure that all declarations of a function are consistent, the usual practice is to put them in an include file. For example, if distSquared is used in a lot of places, we might put it in its own file distSquared.c:

   1 #include "distSquared.h"
   2 
   3 int
   4 distSquared(int dx, int dy)
   5 {
   6     return dx*dx + dy*dy;
   7 }

The first line of distSquared.c uses #include to insert a copy of the header file distSquared.h:

   1 /* Returns the square of the distance between two points separated by 
   2    dx in the x direction and dy in the y direction. */
   3 int distSquared(int dx, int dy);

Note that the declaration in distSquared.h doesn't have a body; instead, it's terminated by a semicolon like a variable declaration. It's also worth noting that we moved the documenting comment to distSquared.h: the idea is that distSquared.h is the public face of this (very small one-function) module, and so the explanation of how to use the function should be there.

The reason distSquared.c includes distSquared.h is to get the compiler to verify that the declarations in the two files match. But to use the distSquared function, we also put #include "distSquared.h" at the top of the file that uses it:

   1 #include "distSquared.h"
   2 
   3 #define THRESHOLD (100)
   4 
   5 int
   6 tooClose(int x1, int y1, int x2, int y2)
   7 {
   8     return distSquared(x1 - x2, y1 - y2) < THRESHOLD;
   9 }

The #include on line 1 uses double quotes instead of angle brackets; this tells the compiler to look for distSquared.h in the current directory instead of the system include directory (typically /usr/include).
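One refinement often added to header files (not shown in distSquared.h above) is an include guard, which makes it harmless to #include the same header twice. A sketch of what distSquared.h would look like with a guard; the macro name DIST_SQUARED_H is an arbitrary choice for this example:

    #ifndef DIST_SQUARED_H
    #define DIST_SQUARED_H

    /* Returns the square of the distance between two points separated by
       dx in the x direction and dy in the y direction. */
    int distSquared(int dx, int dy);

    #endif /* DIST_SQUARED_H */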

48. Static functions

By default, all functions are global; they can be used in any file of your program whether or not a declaration appears in a header file. To restrict access to the current file, declare a function static, like this:

   1 static void
   2 helloHelper(void)
   3 {
   4     puts("hi!");
   5 }
   6 
   7 void
   8 hello(int repetitions)
   9 {
  10     int i;
  11 
  12     for(i = 0; i < repetitions; i++) {
  13         helloHelper();
  14     }
  15 }

The function hello will be visible everywhere. The function helloHelper will only be visible in the current file.

It's generally good practice to declare a function static unless you intend to make it available, since not doing so can cause namespace conflicts, where the presence of two functions with the same name either prevent the program from linking or---even worse---cause the wrong function to be called. The latter can happen with library functions, since C allows the programmer to override library functions by defining a new function with the same name. I once had a program fail in a spectacularly incomprehensible way because I'd written a select function without realizing that select is a core library function in C.

49. Local variables

A function may contain definitions of local variables, which are visible only inside the function and which survive only until the function returns. These may be declared at the start of any block (group of statements enclosed by braces), but it is conventional to declare all of them at the outermost block of the function.

   1 /* Given n, compute n! = 1*2*...*n */
   2 /* Warning: will overflow on 32-bit machines if n > 12 */
   3 int
   4 factorial(int n)
   5 {
   6     int i;
   7     int product;
   8 
   9     if(n < 2) return n;
  10     /* else */
  11 
  12     product = 1;
  13 
  14     for(i = 2; i <= n; i++) {
  15         product *= i;
  16     }
  17 
  18     return product;
  19 }

50. Mechanics of function calls

Several things happen under the hood when a function is called. Since a function can be called from several different places, the CPU needs to store its previous state to know where to go back. It also needs to allocate space for function arguments and local variables.

Some of this information will be stored in registers, memory locations built into the CPU itself, but most will go on the stack, a region of memory that on typical machines grows downward, even though the most recent additions to the stack are called the "top" of the stack. The location of the top of the stack is stored in the CPU in a special register called the stack pointer.

So a typical function call looks like this internally:

  1. The current instruction pointer or program counter value, which gives the address of the next line of machine code to be executed, is pushed onto the stack.

  2. Any arguments to the function are copied either into specially designated registers or onto new locations on the stack. The exact rules for how to do this vary from one CPU architecture to the next, but a typical convention might be that the first four arguments or so are copied into registers and the rest (if any) go on the stack.
  3. The instruction pointer is set to the first instruction in the code for the function.
  4. The function allocates additional space on the stack to hold its local variables (if any) and to save copies of the values of any registers it wants to use (so that it can restore their contents before returning to its caller).
  5. The function body is executed until it hits a return statement.

  6. Returning from the function is the reverse of invoking it: any saved registers are restored from the stack, the return value is copied to a standard register, and the values of the instruction pointer and stack pointer are restored to what they were before the function call.

From the programmer's perspective, the important point is that both the arguments and the local variables inside a function are stored in freshly-allocated locations that are thrown away after the function exits. So after a function call the state of the CPU is restored to its previous state, except for the return value. Any arguments that are passed to a function are passed as copies, so changing the values of the function arguments inside the function has no effect on the caller. Any information stored in local variables is lost.
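Here is a small sketch demonstrating this; the function name tryToSetToZero is invented for the example, and the assignment inside it changes only the local copy of the argument:

    #include <stdio.h>

    /* the assignment here is invisible to the caller */
    void
    tryToSetToZero(int x)
    {
        x = 0;
    }

    int
    main(int argc, char **argv)
    {
        int n = 5;

        tryToSetToZero(n);
        printf("%d\n", n);    /* still prints 5 */

        return 0;
    }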

Under rare circumstances, it may be useful to have a variable local to a function that persists from one function call to the next. You can do so by declaring the variable static. For example, here is a function that counts how many times it has been called:

   1 /* return the number of times the function has been called */
   2 int
   3 counter(void)
   4 {
   5     static int count = 0;
   6 
   7     return ++count;
   8 }

Static local variables are stored outside the stack with global variables, and have unbounded extent. But they are only visible inside the function that declares them. This makes them slightly less dangerous than global variables---there is no fear that some foolish bit of code elsewhere will quietly change their value---but it is still the case that they usually aren't what you want. It is also likely that operations on static variables will be slightly slower than operations on ordinary ("automatic") variables, since making them persistent means that they have to be stored in (slow) main memory instead of (fast) registers.


CategoryProgrammingNotes

51. C/Pointers

52. Memory and addresses

Memory in a typical modern computer is divided into two classes: a small number of registers, which live on the CPU chip and perform specialized functions like keeping track of the location of the next machine code instruction to execute or the current stack frame, and main memory, which (mostly) lives outside the CPU chip and which stores the code and data of a running program. When the CPU wants to fetch a value from a particular location in main memory, it must supply an address: a 32-bit or 64-bit unsigned integer on typical current architectures, referring to one of up to 2^32 or 2^64 distinct 8-bit locations in the memory. These integers can be manipulated like any other integer; in C, they appear as pointers, a family of types that can be passed as arguments, stored in variables, returned from functions, etc.

53. Pointer variables

53.1. Declaring a pointer variable

The convention in C is that the declaration of a complex type looks like its use. To declare a pointer-valued variable, write a declaration for the thing that it points to, but include a * before the variable name:

   1     int *pointerToInt;
   2     double *pointerToDouble;
   3     char *pointerToChar;
   4     char **pointerToPointerToChar;

53.2. Assigning to pointer variables

Declaring a pointer-valued variable allocates space to hold the pointer but not to hold anything it points to. Like any other variable in C, a pointer-valued variable will initially contain garbage---in this case, the address of a location that might or might not contain something important. To initialize a pointer variable, you have to assign to it the address of something that already exists. Typically this is done using the & (address-of) operator:

   1     int n;              /* an int variable */
   2     int *p;             /* a pointer to an int */
   3 
   4     p = &n;             /* p now points to n */

53.3. Using a pointer

Pointer variables can be used in two ways: to get their value (a pointer), e.g. if you want to assign an address to more than one pointer variable:

   1     int n;              /* an int variable */
   2     int *p;             /* a pointer to an int */
   3     int *q;             /* another pointer to an int */
   4 
   5     p = &n;             /* p now points to n */
   6     q = p;              /* q now points to n as well */

But more often you will want to work on the value stored at the location pointed to. You can do this by using the * (dereference) operator, which acts as an inverse of the address-of operator:

   1     int n;              /* an int variable */
   2     int *p;             /* a pointer to an int */
   3 
   4     p = &n;             /* p now points to n */
   5 
   6     *p = 2;             /* sets n to 2 */
   7     *p = *p + *p;       /* sets n to 4 */

The * operator binds very tightly, so you can usually use *p anywhere you could use the variable it points to without worrying about parentheses. However, a few operators, such as --, ++, and . (used in C/Structs) bind tighter, requiring parentheses if you want the * to take precedence.

   1     (*p)++;             /* increment the value pointed to by p */
   2     *p++;               /* WARNING: increments p itself */

53.4. Printing pointers

You can print a pointer value using printf with the %p format specifier. To do so, you should convert the pointer to type void * first using a cast (see below for void * pointers), although on machines that don't have different representations for different pointer types, this may not be necessary.

Here is a short program that prints out some pointer values:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 
   4 int G = 0;   /* a global variable, stored in BSS segment */
   5 
   6 int
   7 main(int argc, char **argv)
   8 {
   9     static int s;  /* static local variable, stored in BSS segment */
  10     int a;         /* automatic variable, stored on stack */
  11     int *p;        /* pointer variable for malloc below */
  12 
  13     /* obtain a block big enough for one int from the heap */
  14     p = malloc(sizeof(int));
  15 
  16     printf("&G   = %p\n", (void *) &G);
  17     printf("&s   = %p\n", (void *) &s);
  18     printf("&a   = %p\n", (void *) &a);
  19     printf("&p   = %p\n", (void *) &p);
  20     printf("p    = %p\n", (void *) p);
  21     printf("main = %p\n", (void *) main);
  22 
  23     free(p);
  24 
  25     return 0;
  26 }
looking_at_pointers.c

When I run this on a Mac OS X 10.6 machine after compiling with gcc, the output is:

&G   = 0x100001078
&s   = 0x10000107c
&a   = 0x7fff5fbff2bc
&p   = 0x7fff5fbff2b0
p    = 0x100100080
main = 0x100000e18

The interesting thing here is that we can see how the compiler chooses to allocate space for variables based on their storage classes. The global variable G and the static local variable s both persist between function calls, so they get placed in the BSS segment (see .bss) that starts somewhere around 0x100000000, typically after the code segment containing the actual code of the program. Local variables a and p are allocated on the stack, which grows down from somewhere near the top of the address space. The block returned by malloc, which p points to, is allocated off the heap, a region of memory that may also grow over time and starts after the BSS segment. Finally, main appears at 0x100000e18; this is in the code segment, which is a bit lower in memory than all the global variables.

54. The null pointer

The special value 0, known as the null pointer, may be assigned to a pointer of any type. It may or may not be represented by the actual address 0, but it will act like 0 in all contexts (e.g., it has the value false in an if or while statement). Null pointers are often used to indicate missing data or failed functions. Attempting to dereference a null pointer can have catastrophic effects, so it's important to be aware of when you might be supplied with one.
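
For example, the standard fopen function returns a null pointer when it can't open a file. Here is a minimal sketch of the usual defensive pattern (the filename is arbitrary):

    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        FILE *f;

        f = fopen("input.txt", "r");    /* returns a null pointer on failure */

        if(f == 0) {
            fputs("can't open input.txt\n", stderr);
            return 1;                   /* bail out instead of dereferencing f */
        }

        /* ... safe to use f here ... */

        fclose(f);

        return 0;
    }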

55. Pointers and functions

A simple application of pointers is to get around C's limit on having only one return value from a function. Because C arguments are copied, assigning a value to an argument inside a function has no effect on the outside. So the doubler function below doesn't do much:

   1 #include <stdio.h>
   2 
   3 /* doesn't work */
   4 void
   5 doubler(int x)
   6 {
   7     x *= 2;
   8 }
   9 
  10 int
  11 main(int argc, char **argv)
  12 {
  13     int y;
  14 
  15     y = 1;
  16 
  17     doubler(y);                 /* no effect on y */
  18 
  19     printf("%d\n", y);          /* prints 1 */
  20 
  21     return 0;
  22 }
bad_doubler.c

However, if instead of passing the value of y into doubler we pass a pointer to y, then the doubler function can reach out of its own stack frame to manipulate y itself:

   1 #include <stdio.h>
   2 
   3 /* this version works */
   4 void
   5 doubler(int *x)
   6 {
   7     *x *= 2;
   8 }
   9 
  10 int
  11 main(int argc, char **argv)
  12 {
  13     int y;
  14 
  15     y = 1;
  16 
  17     doubler(&y);                /* sets y to 2 */
  18 
  19     printf("%d\n", y);          /* prints 2 */
  20 
  21     return 0;
  22 }
good_doubler.c

Generally, if you pass the value of a variable into a function (with no &), you can be assured that the function can't modify your original variable. When you pass a pointer, you should assume that the function can and will change the variable's value. If you want to write a function that takes a pointer argument but promises not to modify the target of the pointer, use const, like this:

   1 void
   2 printPointerTarget(const int *p)
   3 {
   4     printf("%d\n", *p);
   5 }

The const qualifier tells the compiler that the target of the pointer shouldn't be modified. This will cause it to report an error if you try to assign to it anyway:

   1 void
   2 printPointerTarget(const int *p)
   3 {
   4     *p = 5;  /* produces compile-time error */
   5     printf("%d\n", *p);
   6 }

Passing const pointers is mostly used when passing large structures to functions, where copying a pointer is much cheaper than copying the thing it points to.

If you really want to modify the target anyway, C lets you "cast away const":

   1 void
   2 printPointerTarget(const int *p)
   3 {
   4     *((int *) p) = 5;  /* no compile-time error: the cast removes the const */
   5     printf("%d\n", *p);
   6 }

There is usually no good reason to do this; the one exception might be if the target of the pointer represents an AbstractDataType, and you want to modify its representation during some operation to optimize things somehow in a way that will not be visible outside the abstraction barrier, making it appear to leave the target constant.

Note that while it is safe to pass pointers down into functions, it is very dangerous to pass pointers up. The reason is that the space used to hold any local variable of the function will be reclaimed when the function exits, but the pointer will still point to the same location, even though something else may now be stored there. So this function is very dangerous:

   1 int *
   2 dangerous(void)
   3 {
   4     int n;
   5 
   6     return &n;          /* NO! */
   7 }
   8 
   9 ...
  10 
  11     *dangerous() = 12;  /* writes 12 to some unknown location */

An exception is when you can guarantee that the location pointed to will survive even after the function exits, e.g. when the location is dynamically allocated using malloc (see below) or when the local variable is declared static:

   1 int *
   2 returnStatic(void)
   3 {
   4     static int n;
   5 
   6     return &n;
   7 }
   8 
   9 ...
  10 
  11     *returnStatic() = 12;       /* writes 12 to the hidden static variable */

Usually returning a pointer to a static local variable is not good practice, since the point of making a variable local is to keep outsiders from getting at it. If you find yourself tempted to do this, a better approach is to allocate a new block using malloc (see below) and return a pointer to that. The downside of the malloc method is that the caller has to promise to call free on the block later, or you will get a storage leak.

56. Pointer arithmetic and arrays

Because pointers are just numerical values, one can do arithmetic on them; a short demonstration appears after the list below. Specifically, it is permitted to

  • Add an integer to a pointer or subtract an integer from a pointer. The effect of p+n where p is a pointer and n is an integer is to compute the address equal to p plus n times the size of whatever p points to (this is why int * pointers and char * pointers aren't the same).

  • Subtract one pointer from another. The two pointers must have the same type (e.g. both int * or both char *). The result is an integer value, equal to the numerical difference between the addresses divided by the size of the objects pointed to.

  • Compare two pointers using ==, !=, <, >, <=, or >=.

  • Increment or decrement a pointer using ++ or --.
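
Here is a short demonstration of these operations, using an array (described in the next section) as a block of ints to point into; the names are arbitrary and the fragment assumes stdio.h:

    int a[4] = { 10, 20, 30, 40 };   /* a block of four ints */
    int *p = a;                      /* points to a[0] */
    int *q = a + 3;                  /* skips 3 * sizeof(int) bytes: points to a[3] */

    p++;                             /* p now points to a[1] */

    printf("%d\n", *p);              /* prints 20 */
    printf("%d\n", (int) (q - p));   /* prints 2: measured in elements, not bytes */

    if(p < q) {
        puts("p points to an earlier element than q");
    }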

The main application of pointer arithmetic in C is in arrays. An array is a block of memory that holds one or more objects of a given type. It is declared by giving the type of object the array holds followed by the array name and the size in square brackets:

   1     int a[50];          /* array of 50 ints */
   2     char *cp[100];      /* array of 100 pointers to char */

Declaring an array allocates enough space to hold the specified number of objects (e.g., assuming 4-byte ints and pointers, 200 bytes for a above and 400 for cp---note that a char * is an address, so it is much bigger than a char). The number inside the square brackets must be a constant whose value can be determined at compile time.

The array name acts like a constant pointer to the zeroth element of the array. It is thus possible to set or read the zeroth element using *a. But because the array name is constant, you can't assign to it:

   1     *a = 12;            /* sets zeroth element to 12 */
   2 
   3     a = &n;             /* #### DOESN'T WORK #### */

More common is to use square brackets to refer to a particular element of the array. The expression a[n] is defined to be equivalent to *(a+n); the index n (an integer) is added to the base of the array (a pointer), to get to the location of the n-th element of a. The implicit * then dereferences this location so that you can read its value (in a normal expression) or assign to it (on the left-hand side of an assignment operator). The effect is to allow you to use a[n] just as you would any other variable of type int (or whatever type a was declared as).

Note that C doesn't do any sort of bounds checking. Given the declaration int a[50];, only indices from a[0] to a[49] can be used safely. However, the compiler will not blink at a[-12] or a[10000]. If you read from such a location you will get garbage data; if you write to it, you will overwrite god-knows-what, possibly trashing some other variable somewhere else in your program or some critical part of the stack (like the location to jump to when you return from a function). It is up to you as a programmer to avoid such buffer overruns, which can lead to very mysterious (and in the case of code that gets input from a network, security-damaging) bugs. The valgrind program can help detect such overruns in some cases (see C/valgrind).

Another curious feature of the definition of a[n] as identical to *(a+n) is that it doesn't actually matter which of the array name or the index goes inside the brackets. So all of a[0], *a, and 0[a] refer to the zeroth entry in a. Unless you are deliberately trying to obfuscate your code, it's best to write what you mean.

56.1. Arrays and functions

Because array names act like pointers, they can be passed into functions that expect pointers as their arguments. For example, here is a function that computes the sum of all the values in an array a of size n:

   1 /* return the sum of the values in a, an array of size n */
   2 int
   3 sumArray(int n, const int *a)
   4 {
   5     int i;
   6     int sum;
   7 
   8     sum = 0;
   9     for(i = 0; i < n; i++) {
  10         sum += a[i];
  11     }
  12 
  13     return sum;
  14 }

Note the use of const to promise that sumArray won't modify the contents of a.

Another way to write the function header is to declare a as an array of unknown size:

   1 /* return the sum of the values in a, an array of size n */
   2 int
   3 sumArray(int n, const int a[])
   4 {
   5     ...
   6 }

This has exactly the same meaning to the compiler as the previous definition. Even though normally the declarations int a[10] and int *a mean very different things (the first one allocates space to hold 10 ints, and prevents assigning a new value to a), in a function argument int a[] is just SyntacticSugar for int *a. You can even modify what a points to inside sumArray by assigning to it. This will allow you to do things that you usually don't want to do, like write this hideous routine:

   1 /* return the sum of the values in a, an array of size n */
   2 int
   3 sumArray(int n, const int a[])
   4 {
   5     const int *an;      /* pointer to first element not in a */
   6     int sum;
   7 
   8     sum = 0;
   9     an = a+n;
  10 
  11     while(a < an) {
  12         sum += *a++;
  13     }
  14 
  15     return sum;
  16 }

56.2. Multidimensional arrays

Arrays can themselves be members of arrays. The result is a multidimensional array, where a value in row i and column j is accessed by a[i][j].

Declaration is similar to one-dimensional arrays:

   1 int a[3][6];    /* declares an array of 3 rows of 6 ints each */

This declaration produces an array of 18 int values, packed contiguously in memory. The interpretation is that a is an array of 3 objects, each of which is an array of 6 ints.

If we imagine the array to contain increasing values like this:

 0  1  2  3  4  5
 6  7  8  9 10 11
12 13 14 15 16 17

the actual positions in memory will look like this:

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
 ^                 ^                 ^
a[0]              a[1]              a[2]

To look up a value, we do the usual array-indexing magic. Suppose we want to find a[1][4]. The name a acts as a pointer to the base of the array. The name a[1] says to skip ahead 1 times the size of the things pointed to by a, which are arrays of 6 ints each, for a total size of 24 bytes assuming 4-byte ints. For a[1][4], we start at a[1] and move forward 4 times the size of the thing pointed to by a[1], which is an int; this puts us 24+16 bytes from a, the position of 10 in the picture above.

Like other array declarations, the size must be specified at compile time in pre-C99 C. If this is not desirable, a similar effect can be obtained by allocating each row separately using malloc and building a master list of pointers to rows, of type int **. The downside of this approach is that the array is no longer contiguous (which may affect cache performance) and it requires reading a pointer to find the location of a particular value, instead of just doing address arithmetic starting from the base address of the array. But elements can still be accessed using the a[i][j] syntax. An example of this approach is given in malloc2d.c.
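
Here is a sketch of what such an allocator might look like; this is presumably similar in spirit to what malloc2d.c does, though the details here are guesses:

    #include <stdlib.h>

    /* allocate a rows x cols array of ints, usable as a[i][j] */
    /* returns 0 on allocation failure */
    int **
    malloc2d(int rows, int cols)
    {
        int **a;
        int i;

        /* master list of pointers to rows */
        a = malloc(sizeof(int *) * rows);
        if(a == 0) return 0;

        /* allocate each row separately */
        for(i = 0; i < rows; i++) {
            a[i] = malloc(sizeof(int) * cols);
            if(a[i] == 0) {
                /* clean up the rows allocated so far */
                while(--i >= 0) free(a[i]);
                free(a);
                return 0;
            }
        }

        return a;
    }

    /* free everything allocated by malloc2d */
    void
    free2d(int **a, int rows)
    {
        int i;

        for(i = 0; i < rows; i++) free(a[i]);
        free(a);
    }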

56.3. Variable-length arrays

C99 adds the feature of variable-length arrays, where the size of the array is determined at run-time. These can only appear as local variables in procedures (automatic variables) or in argument lists. In the case of variable-length arrays in argument lists, it is also necessary that the length of the array be computable from previous arguments.

For example, we could make the length of the array explicit in our sumArray function:

   1 /* return the sum of the values in a, an array of size n */
   2 int
   3 sumArray(int n, const int a[n])
   4 {
   5     int i;
   6     int sum;
   7 
   8     sum = 0;
   9     for(i = 0; i < n; i++) {
  10         sum += a[i];
  11     }
  12 
  13     return sum;
  14 }

This doesn't accomplish much, because the declared length of a is not actually used for anything. However, it does become useful if we have a two-dimensional array, as otherwise there is no way to compute the length of each row:

   1 int
   2 sumMatrix(int rows, int cols, const int m[rows][cols])
   3 {
   4     int i;
   5     int j;
   6     int sum;
   7 
   8     sum = 0;
   9     for(i = 0; i < rows; i++) {
  10         for(j = 0; j < cols; j++) {
  11             sum += m[i][j];
  12         }
  13     }
  14 
  15     return sum;
  16 }

Here the fact that each row of m is known to be an array of cols many ints makes the implicit pointer computation in m[i][j] actually work. It is considerably more difficult to do this in ANSI C; the simplest approach is to pack m into a one-dimensional array and do the address computation explicitly:

   1 int
   2 sumMatrix(int rows, int cols, const int a[])
   3 {
   4     int i;
   5     int j;
   6     int sum;
   7 
   8     sum = 0;
   9     for(i = 0; i < rows; i++) {
  10         for(j = 0; j < cols; j++) {
  11             sum += a[i*cols + j];
  12         }
  13     }
  14 
  15     return sum;
  16 }

Variable-length arrays can sometimes be used for run-time storage allocation, as an alternative to malloc and free (see below). A variable-length array allocated as a local variable will be deallocated when the containing scope (usually a function body, but maybe just a compound statement marked off by braces) exits. One consequence of this is that you can't return a variable-length array from a function.

Here is an example of code using this feature:

   1 /* reverse an array in place */
   2 void
   3 reverseArray(int n, int a[n])
   4 {
   5     /* algorithm: copy to a new array in reverse order */
   6     /* then copy back */
   7 
   8     int i;
   9     int copy[n];
  10 
  11     for(i = 0; i < n; i++) {
  12         /* the -1 is needed so that a[0] goes to a[n-1] etc. */
  13         copy[n-i-1] = a[i];
  14     }
  15 
  16     for(i = 0; i < n; i++) {
  17         a[i] = copy[i];
  18     }
  19 }

While using variable-length arrays for this purpose can simplify code in some cases, as a general programming practice it is extremely dangerous. The reason is that, unlike allocations through malloc, variable-length array allocations are typically allocated on the stack (which is often more constrained than the heap) and have no way of reporting failure. So if there isn't enough room for your variable-length array, odds are you won't find out until a segmentation fault occurs somewhere later in your code when you try to use it.

(As an additional annoyance, gdb is confused by two-dimensional variable-length arrays.)

Here's a safer version of the above routine, using malloc and free.

   1 /* reverse an array in place */
   2 void
   3 reverseArray(int n, int a[n])
   4 {
   5     /* algorithm: copy to a new array in reverse order */
   6     /* then copy back */
   7 
   8     int i;
   9     int *copy;
  10 
  11     copy = (int *) malloc(n * sizeof(int));
  12     assert(copy);  /* or some other error check */
  13 
  14     for(i = 0; i < n; i++) {
  15         /* the -1 is needed so that a[0] goes to a[n-1] etc. */
  16         copy[n-i-1] = a[i];
  17     }
  18 
  19     for(i = 0; i < n; i++) {
  20         a[i] = copy[i];
  21     }
  22 
  23     free(copy);
  24 }

57. Void pointers

A special pointer type is void *, a "pointer to void". Such pointers are declared in the usual way:

   1     void *nothing;      /* pointer to nothing */

Unlike ordinary pointers, you can't dereference a void * pointer or do arithmetic on it, because the compiler doesn't know what type it points to. However, you are allowed to use a void * as a kind of "raw address" pointer value that you can store arbitrary pointers in. It is permitted to assign to a void * variable from an expression of any pointer type; conversely, a void * pointer value can be assigned to a pointer variable of any type.

If you need to use a void * pointer as a pointer of a particular type in an expression, you can cast it to the appropriate type by prefixing it with a type name in parentheses, like this:

   1     int a[50];          /* typical array of ints */
   2     void *p;            /* dangerous void pointer */
   3 
   4     a[12] = 17;         /* save that valuable 17 */
   5     p = a;              /* p now holds base address of a */
   6 
   7     printf("%d\n", ((int *) p)[12]);  /* get 17 back */

Usually if you have to start writing casts, it's a sign that you are doing something wrong, and you run the danger of violating the type system---say, by tricking the compiler into treating a block of bits that are supposed to be an int as four chars. But violating the type system like this will be necessary for some applications, because even the weak type system in C turns out to be too restrictive for writing certain kinds of "generic" code that work on values of arbitrary types.
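
The standard library's qsort function (declared in stdlib.h) is a classic example of such generic code: it sorts an array of elements of arbitrary size, handing void * pointers to a comparison function that you supply, which must cast them back to the real type before using them. A minimal sketch:

    #include <stdio.h>
    #include <stdlib.h>

    /* comparison function: receives void * pointers to two elements */
    static int
    compareInts(const void *a, const void *b)
    {
        /* cast back to the real type before dereferencing */
        /* subtraction is safe here because the values are small */
        return *((const int *) a) - *((const int *) b);
    }

    int
    main(int argc, char **argv)
    {
        int a[] = { 3, 1, 4, 1, 5 };
        int i;

        qsort(a, 5, sizeof(int), compareInts);

        for(i = 0; i < 5; i++) {
            printf("%d ", a[i]);
        }
        putchar('\n');              /* prints 1 1 3 4 5 */

        return 0;
    }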

58. Run-time storage allocation

Except for C99's variable-length arrays (described above), C does not permit arrays to be declared with variable sizes. C also doesn't let local variables outlive the function they are declared in. Both features can be awkward if you want to build data structures at run time that have unpredictable (perhaps even changing) sizes and that are intended to persist longer than the functions that create them. To build such structures, the standard C library provides the malloc routine, which asks the operating system for a block of space of a given size (in bytes). With a bit of pushing and shoving, this can be used to obtain a block of space that for all practical purposes acts just like an array.

To use malloc, you must include stdlib.h at the top of your program. The declaration for malloc is

   1 void *malloc(size_t);

where size_t is an integer type (often unsigned long). Calling malloc with an argument of n allocates and returns a pointer to the start of a block of n bytes if possible. If the system can't give you the space you asked for (maybe you asked for more space than it has), malloc returns a null pointer. It is good practice to test the return value of malloc whenever you call it.

Because the return type of malloc is void *, its return value can be assigned to any variable with a pointer type. Computing the size of the block you need is your responsibility---and you will be punished for any mistakes with difficult-to-diagnose buffer overrun errors---but this task is made slightly easier by the built-in sizeof operator that allows you to compute the size in bytes of any particular data type. A typical call to malloc might thus look something like this:

   1 #include <stdlib.h>
   2 
   3 /* allocate and return a new integer array with n elements */
   4 /* calls abort() if there isn't enough space */
   5 int *
   6 makeIntArray(int n)
   7 {
   8     int *a;
   9 
  10     a = malloc(sizeof(int) * n);
  11 
  12     if(a == 0) abort();                 /* die on failure */
  13 
  14     return a;
  15 }

When you are done with a malloc'd region, you should return the space to the system using the free routine, also defined in stdlib.h. If you don't do this, your program will quickly run out of space. The free routine takes a void * as its argument and returns nothing. It is good practice to write a matching destructor that de-allocates an object for each constructor (like makeIntArray) that makes one.

   1 void
   2 destroyIntArray(int *a)
   3 {
   4     free(a);
   5 }

It is a serious error to do anything at all with a block after it has been freed.

It is also possible to grow or shrink a previously allocated block. This is done using the realloc function, which is declared as

   1 void *realloc(void *oldBlock, size_t newSize);

The realloc function returns a pointer to the resized block. It may or may not allocate a new block; if there is room, it may leave the old block in place and return its argument. But it may allocate a new block and copy the contents of the old block, so you should assume that the old pointer has been freed.

Here's a typical use of realloc to build an array that grows as large as it needs to be:

   1 /* read numbers from stdin until there aren't any more */
   2 /* returns an array of all numbers read, or null on error */
   3 /* returns the count of numbers read in *count */
   4 int *
   5 readNumbers(int *count /* RETVAL */)
   6 {
   7     int mycount;        /* number of numbers read */
   8     int size;           /* size of block allocated so far */
   9     int *a;             /* block */
  10     int n;              /* number read */
  11 
  12     mycount = 0;
  13     size = 1;
  14 
  15     a = malloc(sizeof(int) * size);     /* allocating zero bytes is tricky */
  16     if(a == 0) return 0;
  17 
  18     while(scanf("%d", &n) == 1) {
  19         /* is there room? */
  20         while(mycount >= size) {
  21             /* double the size to avoid calling realloc for every number read */
  22             size *= 2;
  23             a = realloc(a, sizeof(int) * size);
  24             if(a == 0) return 0;
  25         }
  26 
  27         /* put the new number in */
  28         a[mycount++] = n;
  29     }
  30 
  31     /* now trim off any excess space */
  32     a = realloc(a, sizeof(int) * mycount);
  33     /* note: if a == 0 at this point we'll just return it anyway */
  34 
  35     /* save out mycount */
  36     *count = mycount;
  37 
  38     return a;
  39 }

Because errors involving malloc and its friends can be very difficult to spot, it is recommended to test any program that uses malloc using valgrind if possible. (See C/valgrind).

(See also C/DynamicStorageAllocation for some old notes on this subject.)

59. The restrict keyword

In C99, it is possible to declare that a pointer variable is the only way to reach its target as long as it is in scope. This is not enforced by the compiler; instead, it is a promise from the programmer to the compiler that any data reached through this pointer will not be changed by other parts of the code, which allows the compiler to optimize code in ways that are not possible if pointers might point to the same place (a phenomenon called pointer aliasing). For example, in the short routine:

   1 // write 1 + *src to *dst and return *src
   2 int
   3 copyPlusOne(int * restrict dst, int * restrict src)
   4 {
   5     *dst = *src + 1;
   6     return *src;
   7 }

the output of gcc -std=c99 -O3 -S includes one more instruction if the restrict qualifiers are removed. The reason is that if dst and src may point to the same location, src needs to be re-read for the return statement, in case it changed, but if they don't, the compiler can re-use the previous value it already has in one of the CPU registers.

For most code, this feature is useless, and potentially dangerous if someone calls your routine with aliased pointers. However, it may sometimes be possible to increase performance of time-critical code by adding a restrict keyword. The cost is that the code might no longer work if called with aliased pointers.


CategoryProgrammingNotes

60. C/Strings

61. String processing in general

Processing strings of characters is one of the oldest applications of mechanical computers, arguably predating numerical computation by at least fifty years. Assuming you've already solved the problem of how to represent characters in memory (e.g. as the C char type encoded in ASCII), there are two standard ways to represent strings:

  • As a delimited string, where the end of a string is marked by a special character. The advantages of this method are that only one extra byte is needed to indicate the length of an arbitrarily long string, that strings can be manipulated by simple pointer operations, and in some cases that common string operations that involve processing the entire string can be performed very quickly. The disadvantage is that the delimiter can't appear inside any string, which limits what kind of data you can store in a string.

  • As a counted string, where the string data is prefixed or supplemented with an explicit count of the number of characters in the string. The advantage of this representation is that a string can hold arbitrary data (including delimiter characters) and that one can quickly jump to the end of the string without having to scan its entire length. The disadvantage is that maintaining a separate count typically requires more space than adding a one-byte delimiter (unless you limit your string length to 255 characters) and that more care needs to be taken to make sure that the count is correct.

62. C strings

Because delimited strings are more lightweight, C went for delimited strings. A string is a sequence of characters terminated by a null character '\0'. Note that the null character is not the same as a null pointer, although both appear to have the value 0 when used in integer contexts. A string is represented by a variable of type char *, which points to the zeroth character of the string. The programmer is responsible for allocating and managing space to store strings, except for explicit string constants, which are stored in a special non-writable string space by the compiler.

If you want to use counted strings instead, you can build your own using a struct (see C/Structs). Most scripting languages written in C (e.g. Perl, Python, PHP, etc.) use this approach internally. (Tcl is an exception, which is one of many good reasons not to use Tcl).

63. String constants

A string constant in C is represented by a sequence of characters within double quotes. Standard C character escape sequences like \n (newline), \r (carriage return), \a (bell), \x17 (character with hexadecimal code 0x17), \\ (backslash), and \" (double quote) can all be used inside string constants. The value of a string constant has type const char *, and can be assigned to variables and passed as function arguments or return values of this type.

Two string constants separated only by whitespace will be concatenated by the compiler as a single constant: "foo" "bar" is the same as "foobar". This feature is not used in normal code much, but shows up sometimes in macros (see C/Macros).
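
One place it does show up in ordinary code is splitting a long message across several source lines, as in this fragment:

    puts("This message would be too long to fit on one source line, "
         "so it is written as two string constants that the compiler "
         "pastes together into one.");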

64. String buffers

The problem with string constants is that you can't modify them. If you want to build strings on the fly, you will need to allocate space for them. The traditional approach is to use a buffer, an array of chars. Here is a particularly painful hello-world program that builds a string by hand:

   1 #include <stdio.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     char hi[3];
   7 
   8     hi[0] = 'h';
   9     hi[1] = 'i';
  10     hi[2] = '\0';
  11 
  12     puts(hi);
  13 
  14     return 0;
  15 }
hi.c

Note that the buffer needs to have size at least 3 in order to hold all three characters. A common error in programming with C strings is to forget to leave space for the null at the end (or to forget to add the null, which can have comical results depending on what you are using your surprisingly long string for).

65. Operations on strings

Unlike many programming languages, C provides only a rudimentary string-processing library. The reason is that many common string-processing tasks in C can be done very quickly by hand.

For example, suppose we want to copy a string from one buffer to another. The library function strcpy declared in string.h will do this for us (and is usually the right thing to use), but if it didn't exist we could write something very close to it using a famous C idiom.

   1 void
   2 strcpy2(char *dest, const char *src)
   3 {
   4     /* This line copies characters one at a time from *src to *dest. */
   5     /* The postincrements increment the pointers (++ binds tighter than *) */
   6     /*  to get to the next locations on the next iteration through the loop. */
   7     /* The loop terminates after copying the null, whose value 0 makes the test false. */
   8     /* There is no loop body because there is nothing to do there. */
   9     while(*dest++ = *src++);
  10 }

The externally visible difference between strcpy2 and the original strcpy is that strcpy returns a char * equal to its first argument. It is also likely that any implementation of strcpy found in a recent C library takes advantage of the width of the memory data path to copy more than one character at a time.

Most C programmers will recognize the while(*dest++ = *src++); from having seen it before, and experienced C programmers can generally figure out what such highly abbreviated constructions mean even without prior exposure. Exposure to such constructions is arguably a form of hazing.

Because C pointers act exactly like array names, you can also write strcpy2 using explicit array indices. The result is longer but may be more readable if you aren't a C fanatic.

   1 char *
   2 strcpy2a(char *dest, const char *src)
   3 {
   4     int i;
   5 
   6     i = 0;
   7     for(i = 0; src[i] != '\0'; i++) {
   8         dest[i] = src[i];
   9     }
  10 
  11     /* note that the final null in src is not copied by the loop */
  12     dest[i] = '\0';
  13 
  14     return dest;
  15 }

An advantage of using a separate index in strcpy2a is that we don't trash dest, so we can return it just like strcpy does. (In fairness, strcpy2 could have saved a copy of the original location of dest and done the same thing.)

Note that nothing in strcpy2, strcpy2a, or the original strcpy will save you if dest points to a region of memory that isn't big enough to hold the string at src, or if somebody forgets to tack a null on the end of src (in which case strcpy will just keep going until it finds a null character somewhere). As elsewhere, it's your job as a programmer to make sure there is enough room. Since the compiler has no idea what dest points to, this means that you have to remember how much room is available there yourself.

If you are worried about overrunning dest, you could use strncpy instead. The strncpy function takes a third argument that gives the maximum number of characters to copy; however, if src doesn't contain a null character in this range, the resulting string in dest won't either. Usually the only practical application of strncpy is to extract the first k characters of a string, as in

   1 /* copy the substring of src consisting of characters at positions
   2     start..end-1 (inclusive) into dest */
   3 /* If end-1 is past the end of src, copies only as many characters as 
   4     available. */
   5 /* If start is past the end of src, the results are unpredictable. */
   6 /* Returns a pointer to dest */
   7 char *
   8 copySubstring(char *dest, const char *src, int start, int end)
   9 {
  10     /* copy the substring */
  11     strncpy(dest, src + start, end - start);
  12 
  13     /* add null since strncpy probably didn't */
  14     dest[end - start] = '\0';
  15 
  16     return dest;
  17 }

Another quick and dirty way to extract a substring of a string you don't care about (and can write to) is to just drop a null character in the middle of the sacrificial string. This is generally a bad idea unless you are certain you aren't going to need the original string again, but it's a surprisingly common practice among C programmers of a certain age.
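
Here is a sketch of the trick, using the standard strchr function (declared in string.h) to find a space and then overwriting it with a null to split a writable buffer in two:

    #include <stdio.h>
    #include <string.h>

    int
    main(int argc, char **argv)
    {
        char buf[] = "hello world";     /* a writable copy, not a string constant */
        char *rest;

        rest = strchr(buf, ' ');        /* find the first space */
        if(rest != 0) {
            *rest++ = '\0';             /* overwrite it to terminate the first word */
        }

        puts(buf);                      /* prints "hello" */
        if(rest) puts(rest);            /* prints "world" */

        return 0;
    }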

A similar operation to strcpy is strcat. The difference is that strcat concatenates src on to the end of dest; so that if dest previously pointed to "abc" and src to "def", dest will now point to "abcdef". Like strcpy, strcat returns its first argument. A no-return-value version of strcat is given below.

   1 void
   2 strcat2(char *dest, const char *src)
   3 {
   4     while(*dest) dest++;
   5     while(*dest++ = *src++);
   6 }

Decoding this abomination is left as an exercise for the reader. There is also a function strncat which has the same relationship to strcat that strncpy has to strcpy.

As with strcpy, the actual implementation of strcat may be much more subtle, and is likely to be faster than rolling your own.

66. Finding the length of a string

Because the length of a string is of fundamental importance in C (e.g., when deciding if you can safely copy it somewhere else), the standard C library provides a function strlen that counts the number of non-null characters in a string. Here's a possible implementation:

   1 int
   2 strlen(const char *s)
   3 {
   4     int i;
   5 
   6     for(i = 0; *s; i++, s++);
   7 
   8     return i;
   9 }

Note the use of the comma operator in the increment step. The comma operator applied to two expressions evaluates both of them and discards the value of the first; it is usually used only in for loops where you want to initialize or advance more than one variable at once.

Like the other string routines, using strlen requires including string.h.

66.1. The strlen tarpit

A common mistake is to put a call to strlen in the header of a loop; for example:

   1 /* like strcpy, but only copies characters at indices 0, 2, 4, ...
   2    from src to dest */
   3 char *
   4 copyEvenCharactersBadVersion(char *dest, const char *src)
   5 {
   6     int i;
   7     int j;
   8 
   9     /* BAD: Calls strlen on every pass through the loop */
  10     for(i = 0, j = 0; i < strlen(src); i += 2, j++) {
  11         dest[j] = src[i];
  12     }
  13 
  14     dest[j] = '\0';
  15 
  16     return dest;
  17 }

The problem is that strlen has to scan all of src every time the test is done, which adds time proportional to the length of src to each iteration of the loop. So copyEvenCharactersBadVersion takes time proportional to the square of the length of src.

Here's a faster version:

   1 /* like strcpy, but only copies characters at indices 0, 2, 4, ...
   2    from src to dest */
   3 char *
   4 copyEvenCharacters(char *dest, const char *src)
   5 {
   6     int i;
   7     int j;
   8     int len;    /* length of src */
   9 
  10     len = strlen(src);
  11 
  12     /* GOOD: uses cached value of strlen(src) */
  13     for(i = 0, j = 0; i < len; i += 2, j++) {
  14         dest[j] = src[i];
  15     }
  16 
  17     dest[j] = '\0';
  18 
  19     return dest;
  20 }

Because it doesn't call strlen all the time, this version of copyEvenCharacters will run much faster than the original even on small strings, and several million times faster if src is a megabyte long.

67. Comparing strings

If you want to test if strings s1 and s2 contain the same characters, writing s1 == s2 won't work, since this tests instead whether s1 and s2 point to the same address. Instead, you should use strcmp, declared in string.h. The strcmp function walks along both of its arguments until it either hits a null on both and returns 0, or hits two different characters, and returns a positive integer if the first string's character is bigger and a negative integer if the second string's character is bigger (a typical implementation will just subtract the two characters). A possible but slow implementation might look like this:

   1 int
   2 strcmp(const char *s1, const char *s2)
   3 {
   4     while(*s1 && *s2 && *s1 == *s2) {
   5         s1++;
   6         s2++;
   7     }
   8 
   9     return *s1 - *s2;
  10 }

(The reason this implementation is slow on modern hardware is that it only compares the strings one character at a time; it is almost always faster to compare four characters at once on a 32-bit architecture, although doing so requires no end of trickiness to detect the end of the strings. It is also likely that whatever C library you are using contains even faster hand-coded assembly language versions of strcmp and the other string routines for most of the CPU architectures you are likely to use. Under some circumstances, the compiler, when running with the optimizer turned on, may even omit a function call entirely and just patch the appropriate assembly-language code directly into whatever routine calls strcmp, strlen, etc. As a programmer, you should not be able to detect that any of these optimizations are happening, but they are another reason to use standard C language or library features when you can.)

To use strcmp to test equality, test if the return value is 0:

   1     if(strcmp(s1, s2) == 0) {
   2         /* strings are equal */
   3         ...
   4     }

You may sometimes see this idiom instead:

   1     if(!strcmp(s1, s2)) {
   2         /* strings are equal */
   3         ...
   4     }

My own feeling is that the first version is clearer, since !strcmp always suggested to me that you were testing that some property does not hold (e.g. not equal). But if you think of strcmp as telling you when two strings are different rather than when they are equal, this may not be so confusing.

68. Formatted output to strings

You can write formatted output to a string buffer with sprintf just like you can write it to stdout with printf or to a file with fprintf. Make sure when you do so that there is enough room in the buffer you are writing to, or the usual bad things will happen.
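
A short sketch; the second call uses snprintf (standardized in C99), which takes the buffer size as an extra argument and refuses to write past the end, making it the safer choice when available:

    char buf[64];
    int n = 17;

    sprintf(buf, "n = %d", n);                   /* trusts buf to be big enough */
    snprintf(buf, sizeof(buf), "n = %d", n);     /* C99: never writes past the end of buf */

    puts(buf);                                   /* prints "n = 17" */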

69. Dynamic allocation of strings

When allocating space for a copy of a string s using malloc, the required space is strlen(s)+1. Don't forget the +1, or bad things may happen. Because allocating space for a copy of a string is such a common operation, many C libraries provide a strdup function that does exactly this. If you don't have one (it's not required by the C standard), you can write your own like this:

   1 /* return a freshly-malloc'd copy of s */
   2 /* or 0 if malloc fails */
   3 /* It is the caller's responsibility to free the returned string when done. */
   4 char *
   5 strdup(const char *s)
   6 {
   7     char *s2;
   8 
   9     s2 = malloc(strlen(s)+1);
  10 
  11     if(s2 != 0) {
  12         strcpy(s2, s);
  13     }
  14 
  15     return s2;
  16 }

Exercise: Write a function strcat_alloc that returns a freshly-malloc'd string that concatenates its two arguments. Exactly how many bytes do you need to allocate?

70. argc and argv

Now that we know about strings, we can finally do something with argv. Recall that argv in main is declared as char **; this means that it is a pointer to a pointer to a char, or in this case the base address of an array of pointers to char, where each such pointer references a string. These strings correspond to the command-line arguments to your program, with the program name itself appearing in argv[0]. The count argc counts all arguments including argv[0]; it is 1 if your program is called with no arguments and larger otherwise.

Here is a program that prints its arguments. If you get confused about what argc and argv do, feel free to compile this and play with it:

   1 #include <stdio.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     int i;
   7 
   8     printf("argc = %d\n\n", argc);
   9 
  10     for(i = 0; i < argc; i++) {
  11 	printf("argv[%d] = %s\n", i, argv[i]);
  12     }
  13 
  14     return 0;
  15 }
print_args.c

Like strings, C terminates argv with a null: the value of argv[argc] is always 0 (a null pointer to char). In principle this allows you to recover argc if you lose it.
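
Here is a fragment sketching that recovery, by scanning for the terminating null pointer:

    int count;

    /* count arguments until the null that terminates argv */
    for(count = 0; argv[count] != 0; count++);

    /* count now equals argc */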


CategoryProgrammingNotes

71. C/Structs

72. Structs

A struct is a way to define a type that consists of one or more other types pasted together. Here's a typical struct definition:

   1 struct string {
   2     int length;
   3     char *data;
   4 };

This defines a new type struct string that can be used anywhere you would use a simple type like int or float. When you declare a variable with type struct string, the compiler allocates enough space to hold both an int and a char * (8 bytes on a typical 32-bit machine). You can get at the individual components using the . operator, like this:

   1 struct string {
   2     int length;
   3     char *data;
   4 };
   5 
   6 int
   7 main(int argc, char **argv)
   8 {
   9     struct string s;
  10 
  11     s.length = 4;
  12     s.data = "this string is a lot longer than you think";
  13 
  14     puts(s.data);
  15 
  16     return 0;
  17 }
struct_example.c

Variables of type struct can be assigned to, passed into functions, and returned from functions, just like any other type. Assignment is applied componentwise; for example, s1 = s2; is equivalent to s1.length = s2.length; s1.data = s2.data;. Note, however, that C does not define == on structs: to test two structs for equality, you must compare their components yourself, as in s1.length == s2.length && s1.data == s2.data.

These operations are not used as often as you might think: typically, instead of copying around entire structures, C programs pass around pointers, as is done with arrays. Pointers to structs are common enough in C that a special syntax is provided for dereferencing them. Suppose we have:

   1     struct string s;            /* a struct */
   2     struct string *sp;          /* a pointer to a struct */
   3 
   4     s.length = 4;
   5     s.data = "another overly long string";
   6 
   7     sp = &s;

We can then refer to elements of the struct string that sp points to (i.e. s) in either of two ways:

   1     puts((*sp).data);
   2     puts(sp->data);

The second is more common, since it involves typing fewer parentheses. It is an error to write *sp.data in this case; since . binds tighter than *, the compiler will attempt to evaluate sp.data first and generate an error, since sp doesn't have a data field.

Pointers to structs are commonly used in defining AbstractDataTypes, since it is possible to declare that a function returns e.g. a struct string * without specifying the components of a struct string. (All pointers to structs in C have the same size and structure, so the compiler doesn't need to know the components to pass around the address.) Hiding the components discourages code that shouldn't look at them from doing so, and can be used, for example, to enforce consistency between fields.

For example, suppose we wanted to define a struct string * type that held counted strings that could only be accessed through a restricted interface that prevented (for example) the user from changing the string or its length. We might create a file myString.h that contained the declarations:

   1 /* make a struct string * that holds a copy of s */
   2 struct string *makeString(const char *s);
   3 
   4 /* destroy a struct string * */
   5 void destroyString(struct string *);
   6 
   7 /* return the length of a struct string * */
   8 int stringLength(struct string *);
   9 
  10 /* return the character at position index in the struct string * */
  11 /* or returns -1 if index is out of bounds */
  12 int stringCharAt(struct string *s, int index);
myString.h

and then the actual implementation in myString.c would be the only place where the components of a struct string were defined:

   1 #include <stdlib.h>
   2 #include <string.h>
   3 
   4 #include "myString.h"
   5 
   6 struct string {
   7     int length;
   8     char *data;
   9 };
  10 
  11 struct string *
  12 makeString(const char *s)
  13 {
  14     struct string *s2;
  15 
  16     s2 = malloc(sizeof(struct string));
  17     if(s2 == 0) return 0;
  18 
  19     s2->length = strlen(s);
  20 
  21     s2->data = malloc(s2->length);
  22     if(s2->data == 0) {
  23 	free(s2);
  24 	return 0;
  25     }
  26 
  27     strncpy(s2->data, s, s2->length);
  28 
  29     return s2;
  30 }
  31 
  32 void
  33 destroyString(struct string *s)
  34 {
  35     free(s->data);
  36     free(s);
  37 }
  38 
  39 int
  40 stringLength(struct string *s)
  41 {
  42     return s->length;
  43 }
  44 
  45 int
  46 stringCharAt(struct string *s, int index)
  47 {
  48     if(index < 0 || index >= s->length) {
  49 	return -1;
  50     } else {
  51 	return s->data[index];
  52     }
  53 }
myString.c

In practice, we would probably go even further and replace all the struct string * types with a new name declared with typedef.
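
A sketch of how the declarations in myString.h might change; the name String is an arbitrary choice:

    /* opaque type: the components remain hidden inside myString.c */
    typedef struct string *String;

    String makeString(const char *s);
    void destroyString(String);
    int stringLength(String);
    int stringCharAt(String s, int index);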

73. Unions

A union is just like a struct, except that instead of allocating space to store all the components, the compiler only allocates space to store the largest one, and makes all the components refer to the same address. This can be used to save space if you know that only one of several components will be meaningful for a particular object. An example might be a type representing an object in a LISP-like language like Scheme:

   1 struct lispObject {
   2     int type;           /* type code */
   3     union {
   4         int     intVal;
   5         double  floatVal;
   6         char *  stringVal;
   7         struct {
   8             struct lispObject *car;
   9             struct lispObject *cdr;
  10         } consVal;
  11     } u;
  12 };

Now if you wanted to make a struct lispObject that held an integer value, you might write

   1     struct lispObject o;
   2 
   3     o.type = TYPE_INT;
   4     o.u.intVal = 27;

where TYPE_INT had presumably been defined somewhere. Note that nothing then prevents you from writing

   1     x = 2.7 * o.u.floatVal;

but the effects will be strange, since it's likely that the bit pattern representing 27 as an int represents something very different as a double. Avoiding such mistakes is your responsibility, which is why most uses of union occur inside larger structs that contain enough information to figure out which variant of the union applies.
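
A common defensive pattern is to funnel all access through functions that check the type code first. A minimal sketch, assuming the TYPE_INT code mentioned above and using assert to catch mismatches:

    #include <assert.h>

    /* return the int stored in o; abort if o holds some other variant */
    /* assumes TYPE_INT is defined elsewhere, as in the text */
    int
    lispObjectAsInt(const struct lispObject *o)
    {
        assert(o->type == TYPE_INT);

        return o->u.intVal;
    }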

74. Bit fields

It is possible to specify the exact number of bits taken up by a member of a struct of integer type. This is seldom useful, but may in principle let you pack more information in less space, e.g.:

   1 struct color {
   2     unsigned int red   : 2;
   3     unsigned int green : 2;
   4     unsigned int blue  : 2;
   5     unsigned int alpha : 2;
   6 };

defines a struct that (probably) occupies only one byte, and supplies four 2-bit fields, each of which can hold values in the range 0-3.
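
A short usage fragment; whether struct color really occupies a single byte is up to the compiler, which sizeof will reveal:

    struct color c;

    c.red   = 3;        /* 3 is the largest value a 2-bit field can hold */
    c.green = 0;
    c.blue  = 1;
    c.alpha = 2;

    printf("%lu\n", (unsigned long) sizeof(struct color));   /* probably prints 1 */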


CategoryProgrammingNotes

75. AbstractDataTypes

76. Abstraction

One of the hard parts about computer programming is that, in general, programs are bigger than brains. Unless you have an unusually capacious brain, it is unlikely that you will be able to understand even a modestly large program in its entirety. So in order to be able to write and debug large programs, it is important to be able to break them up into pieces, where each piece can be treated as a tool whose use and description is simpler (and therefore fits in your brain better) than its actual code. Then you can forget about what is happening inside that piece, and just treat it as an easily-understood black box from the outside.

This process of wrapping functionality up in a box and forgetting about its internals is called abstraction, and it is the single most important concept in computer science. In these notes we will describe a particular kind of abstraction, the construction of abstract data types or ADTs. Abstract data types are data types whose implementation is not visible to their user; from the outside, all the user knows about an ADT is what operations can be performed on it and what those operations are supposed to do.

ADTs have an outside and an inside. The outside is called the interface; it consists of the minimal set of type and function declarations needed to use the ADT. The inside is called the implementation; it consists of type and function definitions, and sometimes auxiliary data or helper functions, that are not visible to users of the ADT.

77. Example of an abstract data type

Too much abstraction at once can be hard to take, so let's look at a concrete example of an abstract data type. This ADT will represent an infinite sequence of ints. Each instance of the Sequence type supports a single operation seq_next that returns the next int in the sequence. We will also need to provide one or more constructor functions to generate new Sequences, and a destructor function to tear them down.

Here is an example of a typical use of a Sequence:

   1 void
   2 seq_print(Sequence s, int limit)
   3 {
   4     int i;
   5 
   6     for(i = seq_next(s); i < limit; i = seq_next(s)) {
   7         printf("%d\n", i);
   8     }
   9 }

Note that seq_print doesn't need to know anything at all about what a Sequence is or how seq_next works in order to print out all the values in the sequence until it hits one greater than or equal to limit. This is a good thing---it means that we can use it with any implementation of Sequence we like, and we don't have to change it if Sequence or seq_next changes.

77.1. Interface

In C, the interface of an abstract data type will usually be declared in a header file, which is included both in the file that implements the ADT (so that the compiler can check that the declarations match up with the actual definitions in the implementation) and in any file that uses the ADT (so that its users know what functions to call). Here's a header file for sequences:

77.1.1. sequence.h

   1 /* opaque struct: hides actual components of struct sequence,
   2  * which are defined in sequence.c */
   3 typedef struct sequence *Sequence;
   4 
   5 /* constructors */
   6 /* all our constructors return a null pointer on allocation failure */
   7 
   8 /* returns a Sequence representing init, init+1, init+2, ... */
   9 Sequence seq_create(int init);
  10 
  11 /* returns a Sequence representing init, init+step, init+2*step, ... */
  12 Sequence seq_create_step(int init, int step);
  13 
  14 /* destructor */
  15 /* destroys a Sequence, recovering all internally-allocated data */
  16 void seq_destroy(Sequence);
  17 
  18 /* accessor */
  19 /* returns the first element in a sequence not previously returned */
  20 int seq_next(Sequence);
sequence.h

Here we have defined two different constructors for Sequences, one of which gives slightly more control over the sequence than the other. If we were willing to put more work into the implementation, we could imagine building a very complicated Sequence type that supported a much wider variety of sequences (for example, sequences generated by functions or sequences read from files); but we'll try to keep things simple for now. We can always add more functionality later, since the users won't notice if the Sequence type changes internally.

77.2. Implementation

The implementation of an ADT in C is typically contained in one (or sometimes more than one) .c file. This file can be compiled and linked into any program that needs to use the ADT. Here is our implementation of Sequence:

77.2.1. sequence.c

   1 #include <stdlib.h>
   2 
   3 #include "sequence.h"
   4 
   5 struct sequence {
   6     int next;   /* next value to return */
   7     int step;   /* how much to increment next by */
   8 };
   9 
  10 Sequence
  11 seq_create(int init)
  12 {
  13     return seq_create_step(init, 1);
  14 }
  15 
  16 Sequence
  17 seq_create_step(int init, int step)
  18 {
  19     Sequence s;
  20 
  21     s = malloc(sizeof(*s));
  22     if(s == 0) return 0;
  23     s->next = init;
  24     s->step = step;
  25     return s;
  26 }
  27 
  28 void
  29 seq_destroy(Sequence s)
  30 {
  31     free(s);
  32 }
  33 
  34 int
  35 seq_next(Sequence s)
  36 {
  37     int ret;            /* saves the old value before we increment it */
  38 
  39     ret = s->next;
  40     s->next += s->step;
  41 
  42     return ret;
  43 }
sequence.c

Things to note here: the definition of struct sequence appears only in this file; this means that only the functions defined here can (easily) access the next and step components. This protects Sequences to a limited extent from outside interference, and defends against users who might try to "violate the abstraction boundary" by examining the components of a Sequence directly. It also means that if we change the components or meaning of the components in struct sequence, we only have to fix the functions defined in sequence.c.

77.3. Compiling and linking

Now that we have sequence.h and sequence.c, how do we use them? Let's suppose we have a simple main program:

77.3.1. main.c

   1 #include <stdio.h>
   2 
   3 #include "sequence.h"
   4 
   5 
   6 void
   7 seq_print(Sequence s, int limit)
   8 {
   9     int i;
  10 
  11     for(i = seq_next(s); i < limit; i = seq_next(s)) {
  12         printf("%d\n", i);
  13     }
  14 }
  15 
  16 
  17 int
  18 main(int argc, char **argv)
  19 {
  20     Sequence s;
  21     Sequence s2;
  22 
  23     puts("Stepping by 1:");
  24 
  25     s = seq_create(0);
  26     seq_print(s, 5);
  27     seq_destroy(s);
  28 
  29     puts("Now stepping by 3:");
  30 
  31     s2 = seq_create_step(1, 3);
  32     seq_print(s2, 20);
  33     seq_destroy(s2);
  34 
  35     return 0;
  36 }
main.c

We can compile main.c and sequence.c together into a single binary with the command gcc main.c sequence.c. Or we can build a Makefile which will compile the two files separately and then link them. Using make may be more efficient, especially for large programs consisting of many components, since if we make any changes, make will only recompile the files affected by them. So here is our Makefile:

77.3.2. Makefile

   1 CC=gcc
   2 CFLAGS=-g3 -ansi -pedantic -Wall
   3 
   4 all: seqprinter
   5 
   6 seqprinter: main.o sequence.o
   7         $(CC) $(CFLAGS) -o $@ $^
   8 
   9 test: seqprinter
  10         ./seqprinter
  11 
  12 # these rules say to rebuild main.o and sequence.o if sequence.h changes
  13 main.o: main.c sequence.h
  14 sequence.o: sequence.c sequence.h
  15 
  16 clean:
  17         $(RM) -f seqprinter *.o

Makefile

And now running make test produces this output. Notice how the built-in make variables $@ and $^ expand out to the left-hand side and right-hand side of the dependency line for building seqprinter.

$ make test
gcc -g3 -ansi -pedantic -Wall   -c -o main.o main.c
gcc -g3 -ansi -pedantic -Wall   -c -o sequence.o sequence.c
gcc -g3 -ansi -pedantic -Wall -o seqprinter main.o sequence.o
./seqprinter
Stepping by 1:
0
1
2
3
4
Now stepping by 3:
1
4
7
10
13
16
19

78. Designing abstract data types

Now we've seen how to implement an abstract data type. How do we choose when to use one, and what operations to give it? Let's try answering the second question first.

78.1. Parnas's Principle

Parnas's Principle is a statement of the fundamental idea of information hiding, which says that abstraction boundaries should be as narrow as possible:

  • The developer of a software component must provide the intended user with all the information needed to make effective use of the services provided by the component, and should provide no other information.
  • The developer of a software component must be provided with all the information necessary to carry out the given responsibilities assigned to the component, and should be provided with no other information.

(David Parnas, "On the Criteria to Be Used in Decomposing Systems into Modules," Communications of the ACM, 15(12): 1059--1062, 1972.)

For ADTs, this means we should provide as few functions for accessing and modifying the ADT as we can get away with. The Sequence type we defined earlier has a particularly narrow interface; the developer of Sequence (whoever is writing sequence.c) needs to know nothing about what its user wants except for the arguments passed in to seq_create or seq_create_step, and the user only needs to be able to call seq_next. More complicated ADTs might provide larger sets of operations, but in general we know that an ADT provides a successful abstraction when the operations are all "natural" ones given our high-level description. If we find ourselves writing a lot of extra operations to let users tinker with the guts of our implementation, that may be a sign that either we aren't taking our abstraction barrier seriously enough, or that we need to put the abstraction barrier in a different place.

78.2. When to build an abstract data type

The short answer: Whenever you can.

A better answer: The best heuristic I know for deciding what ADTs to include in a program is to write down a description of how your program is going to work. For each noun or noun phrase in the description, either identify a built-in data type to implement it or design an abstract data type.

For example: a grade database maintains a list of students, and for each student it keeps a list of grades. So here we might want data types to represent:

  • A list of students,
  • A student,
  • A list of grades,
  • A grade.

If grades are simple, we might be able to make them just be ints (or maybe doubles); to be on the safe side, we should probably create a Grade type with a typedef. The other types are likely to be more complicated. Each student might have in addition to his or her grades a long list of other attributes, such as a name, an email address, etc. By wrapping students up as abstract data types we can extend these attributes if we need to, or allow for very general implementations (say, by allowing a student to have an arbitrary list of keyword-attribute pairs). The two kinds of lists are likely to be examples of sequence types; we'll be seeing a lot of ways to implement these as the course progresses. If we want to perform the same kinds of operations on both lists, we might want to try to implement them as a single list data type, which then is specialized to hold either students or grades; this is not always easy to do in C, but we'll see examples of how to do this, too.

Whether or not this set of four types is the set we will finally use, writing it down gives us a place to start writing our program. We can start writing interface files for each of the data types, and then evolve their implementations and the main program in parallel, adjusting the interfaces as we find that we have provided too little (or too much) data for each component to do what it must.
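For instance, a first sketch of the interface file for the Student type might look something like this (all of the names here are hypothetical, chosen just to show the shape such a file could take):

    /* student.h -- sketch of an interface for a student in the grade database */

    typedef double Grade;            /* grades are just doubles for now */

    typedef struct student *Student; /* struct student is defined in student.c */

    /* constructor and destructor */
    Student studentCreate(const char *name, const char *email);
    void studentDestroy(Student s);

    /* record a new grade for this student */
    void studentAddGrade(Student s, Grade g);

    /* return the average of all grades recorded so far */
    Grade studentAverageGrade(Student s);

As with Sequence, everything about how a Student is actually stored stays hidden in student.c, so we can add attributes later without touching the main program.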


CategoryProgrammingNotes

79. C/Definitions

One of the goals of programming is to make your code readable by other programmers (including your future self). An important tool for doing so is to give good names to everything. Not only can such a name document what it names, it can also be used to hide implementation details that are not interesting or that may change later.

80. Naming types

Suppose that you want to represent character strings as

   1 struct string {
   2     int length;
   3     char *data;         /* malloc'd block */
   4 };
   5 
   6 int string_length(const struct string *s);

If you later change the representation to, say, traditional null-terminated char * strings or some even more complicated type (union string **some_string[2];), you will need to go back and replace every occurrence of struct string * in every program that uses it with the new type. Even if you don't expect to change the type, you may still get tired of typing struct string * all the time, especially if your fingers slip and give you struct string sometimes.

The solution is to use a typedef, which defines a new type name:

   1 typedef struct string *String;
   2 
   3 int string_length(String s);

The syntax for typedef looks like a variable declaration preceded by typedef, except that the variable is replaced by the new type name that acts like whatever type the defined variable would have had. You can use a name defined with typedef anywhere you could use a normal type name, as long as it is later in the source file than the typedef definition. Typically typedefs are placed in a header file (.h file) that is then included anywhere that needs them.

You are not limited to using typedefs only for complex types. For example, if you were writing numerical code and wanted to declare overtly that a certain quantity was not just any double but actually a length in meters, you could write

   1 typedef double LengthInMeters;
   2 typedef double AreaInSquareMeters;
   3 
   4 AreaInSquareMeters rectangleArea(LengthInMeters height, LengthInMeters width);

Unfortunately, C does not do type enforcement on typedef'd types: it is perfectly acceptable to the compiler if you pass a value of type AreaInSquareMeters as the first argument to rectangleArea, since by the time it checks, it has replaced both AreaInSquareMeters and LengthInMeters by double. So this feature is not as useful as it might be, although it does mean that you can write rectangleArea(2.0, 3.0) without having to do anything to convert 2.0 and 3.0 to type LengthInMeters.
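Here is a small, made-up demonstration: the second call passes an area where a length is expected, and the compiler doesn't object:

    typedef double LengthInMeters;
    typedef double AreaInSquareMeters;

    AreaInSquareMeters
    rectangleArea(LengthInMeters height, LengthInMeters width)
    {
        return height * width;
    }

    int
    main(int argc, char **argv)
    {
        AreaInSquareMeters a;

        a = rectangleArea(2.0, 3.0);   /* fine: plain doubles convert silently */
        a = rectangleArea(a, 3.0);     /* also compiles: an area used as a length */

        return 0;
    }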

81. Naming constants

Suppose that you have a function (call it getchar) that sometimes needs to signal that it didn't work. The usual way is to return a value that the function won't normally return. Now, you could just tell the user what value that is:

   1 /* get a character (as an `int` ASCII code) from `stdin` */
   2 /* return -1 on end of file */
   3 int getchar(void);

and now the user can write

   1     while((c = getchar()) != -1) {
   2         ...
   3     }

But then somebody reading the code has to remember that -1 means "end of file" and not "signed version of \xff" or "computer room on fire, evacuate immediately." It's much better to define a constant EOF that happens to equal -1, because among other things if you change the special return value from getchar later then this code will still work (assuming you fixed the definition of EOF):

   1     while((c = getchar()) != EOF) {
   2         ...
   3     }

So how do you declare a constant in C? The traditional approach is to use the C preprocessor, the same tool that gets run before the compiler to expand out #include directives. To define EOF, the file /usr/include/stdio.h includes the text

   1 #define EOF (-1)
   2 

What this means is that whenever the characters EOF appear in a C program as a separate word (e.g. in 1+EOF*3 but not in wOEFully_long_variable_name), then the preprocessor will replace them with the characters (-1). The parentheses around the -1 are customary to ensure that the -1 gets treated as a separate constant and not as part of some larger expression.
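You can watch the preprocessor do this substitution by running it by itself: the -E flag tells gcc to stop after preprocessing and print the expanded text to stdout. For example, if tiny.c (a made-up file) contains

    #include <stdio.h>

    int x = EOF;

then the last line of the output of gcc -E tiny.c will be something like

    int x = (-1);

with everything before it being the expanded contents of stdio.h plus some bookkeeping line markers.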

In general, any time you have a non-trivial constant in a program, it should be #defined. Examples are things like array dimensions, special tags or return values from functions, maximum or minimum values for some quantity, or standard mathematical constants (e.g., /usr/include/math.h defines M_PI as pi to umpteen digits). This allows you to write

   1     char buffer[MAX_FILENAME_LENGTH+1];
   2     
   3     area = M_PI*r*r;
   4 
   5     if(status == COMPUTER_ROOM_ON_FIRE) {
   6         evacuate();
   7     }

instead of

   1     char buffer[513];
   2     
   3     area = 3.141592319*r*r;
   4 
   5     if(status == 136) {
   6         evacuate();
   7     }

which is just an invitation to errors (including the one on line 3).

Like typedefs, #defines that are intended to be globally visible are best done in header files; in large programs you will want to #include them in many source files. The usual convention is to write #defined names in all-caps to remind the user that they are macros and not real variables.

82. Naming values in sequences

C provides the enum construction for the special case where you want to have a sequence of named integer constants, but you don't care what their actual values are, as in

   1 enum color { RED, BLUE, GREEN, MAUVE, TURQUOISE };

This will assign the value 0 to RED, 1 to BLUE, and so on. These values are effectively of type int, although you can declare variables, arguments, and return values as type enum color to indicate their intended interpretation. Despite declaring a variable enum color c (say), the compiler will still allow c to hold arbitrary values of type int; see enums_are_ints.c for some silly examples of this.
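Here is a small example in the spirit of the enums_are_ints.c program mentioned above (a sketch, not the actual file):

    #include <stdio.h>

    enum color { RED, BLUE, GREEN, MAUVE, TURQUOISE };

    int
    main(int argc, char **argv)
    {
        enum color c;

        c = GREEN;            /* normal use: c now holds 2 */
        printf("%d\n", c);    /* prints 2 */

        c = 217;              /* no color has this value, but C allows it */
        printf("%d\n", c);    /* prints 217 */

        return 0;
    }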

It is also possible to specify particular values for particular enumerated constants, as in

   1 enum color { RED = 37, BLUE = 12, GREEN = 66, MAUVE = 5, TURQUOISE };

Any constant that doesn't get an explicit value gets one plus the value of the previous constant; so the above definition would set TURQUOISE to 6.

In practice, enums are seldom used, and you will more commonly see a stack of #defines:

   1 #define RED     (0)
   2 #define BLUE    (1)
   3 #define GREEN   (2)
   4 #define MAUVE   (3)
   5 #define TURQUOISE (4)
   6 

The reason for this is partly historical—enum arrived late in the evolution of C—but partly practical: a table of #defines makes it much easier to figure out which color is represented by 3, without having to count through a list. But if you never plan to use the numerical values, enum is a better choice.

83. Other uses of #define

It is also possible to use #define to define preprocessor macros that take parameters; this will be discussed in C/Macros.


CategoryProgrammingNotes

84. C/Debugging

85. Debugging in general

Basic method of all debugging:

  1. Know what your program is supposed to do.
  2. Detect when it doesn't.
  3. Fix it.

A tempting mistake is to skip step 1, and just try randomly tweaking things until the program works. Better is to see what the program is doing internally, so you can see exactly where and when it is going wrong. A second temptation is to attempt to intuit where things are going wrong by staring at the code or the program's output. Avoid this temptation as well: let the computer tell you what it is really doing inside your program instead of guessing.

86. Assertions

Every non-trivial C program should include <assert.h>, which gives you the assert macro (see KernighanRitchie Appendix B6). The assert macro tests if a condition is true and halts your program with an error message if it isn't:

   1 #include <assert.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     assert(2+2 == 5);
   7     return 0;
   8 }
no.c

$ gcc -o no no.c
$ ./no
no: no.c:6: main: Assertion `2+2 == 5' failed.

Line numbers and everything, even if you compile with the optimizer turned on. Much nicer than a mere segmentation fault, and if you run it under the debugger, the debugger will stop exactly on the line where the assert failed so you can poke around and see why.

87. gdb

The standard debugger on Linux is called gdb. This lets you run your program under remote control, so that you can stop it and see what is going on inside.

Let's look at a contrived example. Suppose you have the following program bogus.c:

   1 #include <stdio.h>
   2 
   3 /* Print the sum of the integers from 1 to 1000 */
   4 int
   5 main(int argc, char **argv)
   6 {
   7     int i;
   8     int sum;
   9 
  10     sum = 0;
  11     for(i = 0; i -= 1000; i++) {
  12         sum += i;
  13     }
  14     printf("%d\n", sum);
  15     return 0;
  16 }
bogus.c

Let's compile and run it and see what happens:

$ gcc -g3 -o bogus bogus.c
$ ./bogus
-34394132
$

That doesn't look like the sum of 1 to 1000. So what went wrong? If we were clever, we might notice that the test in the for loop is using the mysterious -= operator instead of the <= operator that we probably want. But let's suppose we're not so clever right now—it's four in the morning, we've been working on bogus.c for twenty-nine straight hours, and there's a -= up there because in our befuddled condition we know in our bones that it's the right operator to use. We need somebody else to tell us that we are deluding ourselves, but nobody is around this time of night. So we'll have to see what we can get the computer to tell us.

The first thing to do is fire up gdb, the debugger. This runs our program in stop-motion, letting us step through it a piece at a time and watch what it is actually doing. In the example below gdb is run from the command line. You can also run it directly from Emacs with M-x gdb, which lets Emacs track and show you where your program is in the source file with a little arrow, or (if you are logged in directly on a Zoo machine) by running ddd, which wraps gdb in a graphical user interface.

$ gdb bogus
GNU gdb 4.17.0.4 with Linux/x86 hardware watchpoint and FPU support
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
(gdb) run
Starting program: /home/accts/aspnes/tmp/bogus 
-34394132

Program exited normally.

So far we haven't learned anything. To see our program in action, we need to slow it down a bit. We'll stop it as soon as it enters main, and step through it one line at a time while having it print out the values of the variables.

(gdb) break main
Breakpoint 1 at 0x8048476: file bogus.c, line 9.
(gdb) run
Starting program: /home/accts/aspnes/tmp/bogus 

Breakpoint 1, main (argc=1, argv=0xbffff9ac) at bogus.c:9
9           sum = 0;
(gdb) display sum
1: sum = 1
(gdb) n
10          for(i = 0; i -= 1000; i++)
1: sum = 0
(gdb) display i
2: i = 0
(gdb) n
11              sum += i;
2: i = -1000
1: sum = 0
(gdb) n
10          for(i = 0; i -= 1000; i++)
2: i = -1000
1: sum = -1000
(gdb) n
11              sum += i;
2: i = -1999
1: sum = -1000
(gdb) n
10          for(i = 0; i -= 1000; i++)
2: i = -1999
1: sum = -2999
(gdb) quit
The program is running.  Exit anyway? (y or n) y
$

Here we are using break main to tell the program to stop as soon as it enters main, display to tell it to show us the value of the variables i and sum whenever it stops, and n (short for next) to execute the program one line at a time.

When stepping through a program, gdb displays the line it will execute next as well as any variables you've told it to display. This means that any changes you see in the variables are the result of the previous displayed line. Bearing this in mind, we see that i drops from 0 to -1000 the very first time we hit the top of the for loop and drops to -1999 the next time. So something bad is happening in the top of that for loop, and if we squint at it a while we might begin to suspect that i -= 1000 is not the nice simple test we might have hoped it was.

87.1. My favorite gdb commands

help
Get a description of gdb's commands.
run
Runs your program. You can give it arguments that get passed in to your program just as if you had typed them to the shell. Also used to restart your program from the beginning if it is already running.
quit
Leave gdb, killing your program if necessary.
break
Set a breakpoint, which is a place where gdb will automatically stop your program. Some examples:
  • break somefunction stops before executing the first line of somefunction.

  • break 117 stops before executing line number 117.

list
Show part of your source file with line numbers (handy for figuring out where to put breakpoints). Examples:
  • list somefunc lists all lines of somefunc.

  • list 117,123 lists lines 117 through 123.

next
Execute the next line of the program, including completing any procedure calls in that line.
step
Execute the next step of the program, which is either the next line if it contains no procedure calls, or the entry into the called procedure.
finish
Continue until you get out of the current procedure (or hit a breakpoint). Useful for getting out of something you stepped into that you didn't want to step into.
cont

(Or continue). Continue until (a) the end of the program, (b) a fatal error like a Segmentation Fault or Bus Error, or (c) a breakpoint. If you give it a numeric argument (e.g., cont 1000) it will skip over that many breakpoints before stopping.

print

Print the value of some expression, e.g. print i.

display

Like print, but runs automatically every time the program stops. Useful for watching values that change often.

87.2. Debugging strategies

In general, the idea behind debugging is that a bad program starts out sane, but after executing for a while it goes bananas. If you can find the exact moment in its execution where it first starts acting up, you can see exactly what piece of code is causing the problem and have a reasonably good chance of being able to fix it. So a typical debugging strategy is to put in a breakpoint (using break) somewhere before the insanity hits, "instrument" the program (using display) so that you can watch it going insane, and step through it (using next, step, or breakpoints and cont) until you find the point of failure. Sometimes this process requires restarting the program (using run) if you skip over this point without noticing it immediately.

For large or long-running programs, it often makes sense to do binary search to find the point of failure. Put in a breakpoint somewhere (say, on a function that is called many times or at the top of a major loop) and see what the state of the program is after going through the breakpoint 1000 times (using something like cont 1000). If it hasn't gone bonkers yet, try restarting and going through 2000 times. Eventually you bracket the error as occurring (for example) somewhere between the 4000th and 8000th occurrence of the breakpoint. Now try stepping through 6000 times; if the program is looking good, you know the error occurs somewhere between the 6000th and 8000th breakpoint. A dozen or so more experiments should be enough to isolate the bug to a specific line of code.
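A session implementing this strategy might look something like the following (the function and variable names here are invented for the example):

(gdb) break mainLoop
(gdb) run
(gdb) cont 4000
(gdb) print suspectValue
(gdb) cont 2000
(gdb) print suspectValue

If suspectValue still looks sane after the first print but has gone bad by the second, the bug bites somewhere in those last 2000 passes through the breakpoint, and we can narrow further from there.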

The key to all debugging is knowing what your code is supposed to do. If you don't know this, you can't tell the lunatic who thinks he's Napoleon from the lunatic who really is Napoleon. If you're confused about what your code is supposed to be doing, you need to figure out what exactly you want it to do. If you can figure that out, often it will be obvious what is going wrong. If it isn't obvious, you can always go back to gdb.

88. Valgrind

The valgrind program can be used to detect some (but not all) common errors in C programs that use pointers and dynamic storage allocation. On the Zoo, you can run valgrind on your program by putting valgrind at the start of the command line:

valgrind ./my-program arg1 arg2 < test-input

This will run your program and produce a report of any allocations and de-allocations it did. It will also warn you about common errors like using uninitialized memory, dereferencing pointers to strange places, writing off the end of blocks allocated using malloc, or failing to free blocks.

You can suppress all of the output except errors using the -q option, like this:

valgrind -q ./my-program arg1 arg2 < test-input

You can also turn on more tests, e.g.

valgrind -q --tool=memcheck --leak-check=yes ./my-program arg1 arg2 < test-input

See valgrind --help for more information about the (many) options, or look at the documentation at http://valgrind.org/ for detailed information about what the output means. For some common valgrind messages, see the examples section below.

88.1. Compilation flags

You can run valgrind on any program (try valgrind ls); it does not require special compilation. However, the output of valgrind will be more informative if you compile your program with debugging information turned on using the -g or -g3 flags (this is also useful if you plan to watch your program running using gdb). See HowToUseTheComputingFacilities for more information about debugging and debugging flags.

88.2. Automated testing

Unless otherwise specified, automated testing of your program will be done using the script in /c/cs223/bin/vg; this runs /c/cs223/bin/valgrind with the --tool=memcheck, --leak-check=yes, and -q options, throws away your program's output, and replaces it with valgrind's output. If you have a program named ./prog, running /c/cs223/bin/vg ./prog should produce no output.

88.3. Examples of some common valgrind errors

Here are some examples of valgrind output. In each case the example program is compiled with -g3 so that valgrind can report line numbers from the source code.

88.3.1. Uninitialized values

Consider this unfortunate program, which attempts to compare two strings, one of which we forgot to ensure was null-terminated:

   1 #include <stdio.h>
   2 #include <string.h>
   3 int
   4 main(int argc, char **argv)
   5 {
   6     char a[2];
   7 
   8     a[0] = 'a';
   9 
  10     if(!strcmp(a, "a")) {
  11         puts("a is \"a\"");
  12     }
  13 
  14     return 0;
  15 }
uninitialized.c

Run without valgrind, we see no errors, because we got lucky and it turned out our hand-built string was null-terminated anyway:

$ ./uninitialized 
a is "a"

But valgrind is not fooled:

$ valgrind -q ./uninitialized 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x4026663: strcmp (mc_replace_strmem.c:426)
==4745==    by 0x8048435: main (uninitialized.c:10)
==4745== 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x402666C: strcmp (mc_replace_strmem.c:426)
==4745==    by 0x8048435: main (uninitialized.c:10)
==4745== 
==4745== Conditional jump or move depends on uninitialised value(s)
==4745==    at 0x8048438: main (uninitialized.c:10)
==4745== 

Here we get a lot of errors, but they are all complaining about the same call to strcmp. Since it's unlikely that strcmp itself is buggy, we have to assume that we passed it some uninitialized location that it is looking at. The fix is to add an assignment a[1] = '\0' so that no such location exists.

88.3.2. Bytes definitely lost

Here is a program that calls malloc but not free:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 
   4 int
   5 main(int argc, char **argv)
   6 {
   7     char *s;
   8 
   9     s = malloc(26);
  10 
  11     return 0;
  12 }
missing_free.c

With no extra arguments, valgrind will not look for this error. But if we turn on --leak-check=yes, it will complain:

$ valgrind -q --leak-check=yes ./missing_free
==4776== 26 bytes in 1 blocks are definitely lost in loss record 1 of 1
==4776==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==4776==    by 0x80483F8: main (missing_free.c:9)
==4776== 

Here the stack trace in the output shows where the bad block was allocated: inside malloc (specifically the paranoid replacement malloc supplied by valgrind), which was in turn called by main in line 9 of missing_free.c. This lets us go back and look at what block was allocated in that line and try to trace forward to see why it wasn't freed. Sometimes this is as simple as forgetting to include a free statement anywhere, but in more complicated cases it may be because I somehow lose the pointer to the block by overwriting the last variable that points to it or by embedding it in some larger structure whose components I forget to free individually.

88.3.3. Invalid write or read operations

These are usually operations that you do off the end of a block from malloc or on a block that has already been freed.

An example of the first case:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 int
   6 main(int argc, char **argv)
   7 {
   8     char *s;
   9 
  10     s = malloc(1);
  11     s[0] = 'a';
  12     s[1] = '\0';
  13 
  14     puts(s);
  15 
  16     return 0;
  17 }
invalid_operations.c

==7141== Invalid write of size 1
==7141==    at 0x804843B: main (invalid_operations.c:12)
==7141==  Address 0x419a029 is 0 bytes after a block of size 1 alloc'd
==7141==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==7141==    by 0x8048428: main (invalid_operations.c:10)
==7141== 
==7141== Invalid read of size 1
==7141==    at 0x4026063: __GI_strlen (mc_replace_strmem.c:284)
==7141==    by 0x409BCE4: puts (ioputs.c:37)
==7141==    by 0x8048449: main (invalid_operations.c:14)
==7141==  Address 0x419a029 is 0 bytes after a block of size 1 alloc'd
==7141==    at 0x4024F20: malloc (vg_replace_malloc.c:236)
==7141==    by 0x8048428: main (invalid_operations.c:10)
==7141== 

An example of the second:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 int
   6 main(int argc, char **argv)
   7 {
   8     char *s;
   9 
  10     s = malloc(2);
  11     free(s);
  12 
  13     s[0] = 'a';
  14     s[1] = '\0';
  15 
  16     puts(s);
  17 
  18     return 0;
  19 }
freed_block.c

==7144== Invalid write of size 1
==7144==    at 0x804846D: main (freed_block.c:13)
==7144==  Address 0x419a028 is 0 bytes inside a block of size 2 free'd
==7144==    at 0x4024B3A: free (vg_replace_malloc.c:366)
==7144==    by 0x8048468: main (freed_block.c:11)
==7144== 
==7144== Invalid write of size 1
==7144==    at 0x8048477: main (freed_block.c:14)
==7144==  Address 0x419a029 is 1 bytes inside a block of size 2 free'd
==7144==    at 0x4024B3A: free (vg_replace_malloc.c:366)
==7144==    by 0x8048468: main (freed_block.c:11)
==7144== 
==7144== Invalid read of size 1
==7144==    at 0x4026058: __GI_strlen (mc_replace_strmem.c:284)
==7144==    by 0x409BCE4: puts (ioputs.c:37)
==7144==    by 0x8048485: main (freed_block.c:16)
[... more lines of errors deleted ...]

In both cases the problem is that we are operating on memory that is not guaranteed to be allocated to us. For short programs like these, we may get lucky and have the program work anyway. But we still want to avoid bugs like this because we might not get lucky.

How do we know which case is which? If I write off the end of an existing block, I'll see something like Address 0x419a029 is 0 bytes after a block of size 1 alloc'd, telling me that I am working on an address after a block that is still allocated. When I try to write to a freed block, the message changes to Address 0x419a029 is 1 bytes inside a block of size 2 free'd, where the free'd part tells me I freed something I probably shouldn't have. Fixing the first class of bugs is usually just a matter of allocating a bigger block (but don't just do this without figuring out why you need a bigger block, or you'll just be introducing random mutations into your code that may cause other problems elsewhere). Fixing the second class of bugs usually involves figuring out why you freed this block prematurely. In some cases you may need to re-order what you are doing so that you don't free a block until you are completely done with it.

89. Not recommended: debugging output

A tempting but usually bad approach to debugging is to put lots of printf statements in your code to show what is going on. The problem with this compared to using assert is that there is no built-in test to see if the output is actually what you'd expect. The problem compared to gdb is that it's not flexible: you can't change your mind about what is getting printed out without editing the code. A third problem is that the output can be misleading: in particular, printf output is usually buffered, which means that if your program dies suddenly there may be output still in the buffer that is never flushed to stdout. This can be very confusing, and can lead you to believe that your program fails earlier than it actually does.

If you really need to use printf or something like it for debugging output, here are a few rules of thumb to follow to mitigate the worst effects:

  1. Use fprintf(stderr, ...) instead of printf(...); this allows you to redirect your program's regular output somewhere that keeps it separate from the debugging output (but beware of misleading interleaving of the two streams—buffering may mean that output to stdout and stderr appears to arrive out of order). It also helps that output to stderr is usually unbuffered, avoiding the problem of lost output. (See the sketch after this list.)

  2. If you must output to stdout, put fflush(stdout) after any output operation you suspect is getting lost in the buffer. The fflush function forces any buffered output to be emitted immediately.

  3. Keep all arguments passed to printf as simple as possible and beware of faults in your debugging code itself. If you write printf("a[key] == %d\n", a[key]) and key is some bizarre value, you will never see the result of this printf because your program will segfault while evaluating a[key]. Naturally, this is more likely to occur if the argument is a[key]->size[LEFTOVERS].cleanupFunction(a[key]) than if it's just a[key], and if it happens it will be harder to figure out where in this complex chain of array indexing and pointer dereferencing the disaster happened. Better is to wait for your program to break in gdb, and use the print statement on increasingly large fragments of the offending expression to see where the bogus array index or surprising null pointer is hiding.
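Here is a short sketch showing rules 1 and 2 in action (the variable and the messages are invented for the example):

    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        int i = 42;

        /* rule 1: send debugging output to stderr, which is unbuffered */
        fprintf(stderr, "i == %d\n", i);

        /* rule 2: if output must go to stdout, flush it immediately */
        printf("i == %d\n", i);
        fflush(stdout);

        return 0;
    }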


CategoryProgrammingNotes

90. AsymptoticNotation

91. Definitions

O(f(n))

A function g(n) is in O(f(n)) ("big O of f(n)") if there exist constants c and N such that |g(n)| ≤ c |f(n)| for n > N.

Ω(f(n))

A function g(n) is in Ω(f(n)) ("big Omega of f(n)") if there exist constants c and N such that |g(n)| ≥ c |f(n)| for n > N.

Θ(f(n))

A function g(n) is in Θ(f(n)) ("big Theta of f(n)") if there exist constants c_1, c_2, and N such that c_1|f(n)| ≤ |g(n)| ≤ c_2|f(n)| for n > N.

o(f(n))

A function g(n) is in o(f(n)) ("little o of f(n)") if for every c > 0 there exists an N such that |g(n)| ≤ c |f(n)| for n > N. This is equivalent to saying that lim_{n→∞} g(n)/f(n) = 0.

ω(f(n))

A function g(n) is in ω(f(n)) ("little omega of f(n)") if for every c > 0 there exists an N such that |g(n)| ≥ c |f(n)| for n > N. This is equivalent to saying that lim_{n→∞} |g(n)|/|f(n)| diverges to infinity.

92. Motivating the definitions

Why would we use this notation?

  • Constant factors vary from one machine to another. The c factor hides this. If we can show that an algorithm runs in O(n^2) time, we can be confident that it will continue to run in O(n^2) time no matter how fast (or how slow) our computers get in the future.

  • For the N threshold, there are several excuses:
    • Any problem can theoretically be made to run in O(1) time for any finite subset of the possible inputs (e.g. all inputs expressible in 50 MiB or less), by prefacing the main part of the algorithm with a very large TableLookup. So it's meaningless to talk about the relative performance of different algorithms for bounded inputs.

    • If f(n) > 0 for all n, then we can get rid of N (or set it to zero) by making c large enough. But some f(n) take on zero—or undefined—values for interesting n (e.g., f(n) = n^2 is zero when n is zero, and f(n) = log(n) is undefined for n = 0 and zero for n = 1). Allowing the minimum N lets us write O(n^2) or O(log n) for classes of functions that we would otherwise have to write more awkwardly as something like O(n^2+1) or O(log (n+2)).

    • Putting the n > N rule in has a natural connection with the definition of a limit, where the limit as n goes to infinity of g(n) is defined to be x if for each ε > 0 there is an N such that |g(n)-x| < ε for n > N. Among other things, this permits the limit test that says g(n) = O(f(n)) if the limit as n goes to infinity of g(n)/f(n) exists and is finite.

93. Proving asymptotic bounds

Most of the time when we use asymptotic notation, we compute bounds using stock theorems like O(f(n)) + O(g(n)) = O(max(f(n), g(n))) or O(c f(n)) = O(f(n)). But sometimes we need to unravel the definitions to see whether a given function fits in a given class, or to prove these utility theorems to begin with. So let's do some examples of how this works.

  • The function n is in O(n^3).

    Proof

    We must find c, N such that for all n > N, |n| < c|n^3|. Since n^3 is much bigger than n for most values of n, we'll pick c to be something convenient to work with, like 1. So now we need to choose N so that for all n > N, |n| < |n^3|. It is not the case that |n| < |n^3| for all n (try plotting n vs n^3 for n < 1) but if we let N = 1, then we have n > 1, and we just need to massage this into n^3 > n. There are a couple of ways to do this, but the quickest is probably to observe that squaring and multiplying by n (a positive quantity) are both increasing functions, which means that from n > 1 we can derive n^2 > 1^2 = 1 and then n^2 * n = n^3 > 1 * n = n.

  • The function n^3 is not in O(n).

    Proof

    Here we need to negate the definition of O(n), a process that turns all existential quantifiers into universal quantifiers and vice versa. So what we need to show is that for all c, N, there exists some n > N for which |n^3| is not less than c |n|. We can ignore any c ≤ 0 since |n| is always positive. So fix some c > 0 and N. We must find an n > N for which n^3 > c n. Solving for n in this inequality gives n > c^(1/2); so setting n > max(N, c^(1/2)) finishes the proof.

  • If f_1(n) is in O(g(n)) and f_2(n) is in O(g(n)), then f_1(n)+f_2(n) is in O(g(n)).

    Proof

    Since f_1(n) is in O(g(n)), there exist constants c_1, N_1 such that for all n > N_1, |f_1(n)| < c_1 |g(n)|. Similarly there exist c_2, N_2 such that for all n > N_2, |f_2(n)| < c_2 |g(n)|. To show f_1(n)+f_2(n) is in O(g(n)), we must find constants c and N such that for all n > N, |f_1(n)+f_2(n)| < c |g(n)|. Let's let c = c_1+c_2. Then if n is greater than max(N_1, N_2), it is greater than both N_1 and N_2, so we can add together |f_1| < c_1|g| and |f_2| < c_2|g| to get |f_1+f_2| ≤ |f_1| + |f_2| < (c_1+c_2) |g| = c |g|.

94. Asymptotic notation hints

94.1. Remember the difference between big-O, big-Ω, and big-Θ

  • Use big-O when you have an upper bound on a function, e.g. the zoo never got more than O(1) new gorillas per year, so there were at most O(t) gorillas at the zoo in year t.
  • Use big-Ω when you have a lower bound on a function, e.g. every year the zoo got at least one new gorilla, so there were at least Ω(t) gorillas at the zoo in year t.
  • Use big-Θ when you know the function exactly to within a constant-factor error, e.g. every year the zoo got exactly five new gorillas, so there were Θ(t) gorillas at the zoo in year t.

For the others, use little-o and little-ω when one function becomes vanishingly small relative to the other, e.g. new gorillas arrived rarely and with declining frequency, so there were o(t) gorillas at the zoo in year t. These are not used as much as big-O, big-Ω, and big-Θ in the algorithms literature.

94.2. Simplify your asymptotic terms as much as possible

  • O(f(n)) + O(g(n)) = O(f(n)) when g(n) = O(f(n)). If you have an expression of the form O(f(n) + g(n)), you can almost always rewrite it as O(f(n)) or O(g(n)) depending on which is bigger. The same goes for Ω or Θ.
  • O(c f(n)) = O(f(n)) if c is a constant. You should never have a constant inside a big O. This includes bases for logarithms: since log_a x = log_b x / log_b a, you can always rewrite O(lg n), O(ln n), or O(log_1.4467712 n) as just O(log n).

  • But watch out for exponents and products: O(3^n n^3.1178 log^(1/3) n) is already as simple as it can be.

94.3. Remember the limit trick

If you are confused whether e.g. log n is O(f(n)), try computing the limit as n goes to infinity of (log n)/n, and see if it's a constant (zero is ok). You may need to use L'Hôpital's Rule to evaluate such limits if they aren't obvious.
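For example, to check that log n is in O(n), compute lim_{n→∞} (log n)/n; this is an ∞/∞ form, so L'Hôpital's Rule turns it into lim_{n→∞} (1/n)/1 = 0, which is a constant, so log n is in O(n) (and in fact in o(n)).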

95. Variations in notation

As with many tools in mathematics, you may see some differences in how asymptotic notation is defined and used.

95.1. Absolute values

Some authors leave out the absolute values. For example, BiggsBook defines f(n) as being in O(g(n)) if f(n) ≤ c g(n) for sufficiently large n. If f(n) and g(n) are non-negative, this is not an unreasonable definition. But it produces odd results if either can be negative: for example, by this definition, -n^1000 is in O(n^2). (Some authors define O(), Ω(), Θ() only for non-negative functions, avoiding this problem.)

The most common definition (which we will use) says that f(n) is in O(g(n)) if |f(n)| ≤ c |g(n)| for sufficiently large n; by this definition -n^1000 is not in O(n^2), though it is in O(n^1000). This definition was designed for error terms in asymptotic expansions of functions, where the error term might represent a positive or negative error.

You can usually assume that algorithm running times are non-negative, so dropping the absolute value signs is generally harmless in algorithm analysis, but you should remember the absolute value definition in case you run into O() in other contexts.

95.2. Abusing the equals sign

Formally, we can think of O(g(n)) as a predicate on functions, which is true of all functions f(n) that satisfy f(n) ≤ c g(n) for some c and sufficiently large n. This requires writing that n^2 is O(n^2) where most computer scientists or mathematicians would just write n^2 = O(n^2). Making sense of the latter statement involves a standard convention that is mildly painful to define formally but that greatly simplifies asymptotic analyses.

Let's take a statement like the following:

  • O(n^2) + O(n^3) + 1 = O(n^3).

What we want this to mean is that the left-hand side can be replaced by the right-hand side without causing trouble. To make this work formally, we define the statement as meaning:

  • For any f in O(n^2) and any g in O(n^3), there exists an h in O(n^3) such that f(n) + g(n) + 1 = h(n).

In general, any appearance of O(), Ω(), or Θ() on the left-hand side gets a universal quantifier (for all) and any appearance of O(), Ω(), or Θ() on the right-hand side gets an existential quantifier (there exists). So

  • f(n) + o(f(n)) = Θ(f(n))

becomes

  • For any g in o(f(n)), there exists an h in Θ(f(n)) such that f(n)+g(n)=h(n).

and

  • O(f(n))+O(g(n))+1 = O(max(f(n),g(n)))+1

becomes

  • For any r in O(f(n)) and s in O(g(n)), there exists t in O(max(f(n),g(n))) such that r(n)+s(n)+1=t(n)+1.

The nice thing about this definition is that as long as you are careful about the direction the equals sign goes in, you can treat these complicated pseudo-equations like ordinary equations. For example, since O(n^2) + O(n^3) = O(n^3), we can write

  • n^2/2 + n(n+1)(n+2)/6 = n^2/2 + O(n^3) = O(n^2) + O(n^3) = O(n^3),

which is much simpler than what it would look like if we had to talk about particular functions being elements of particular sets of functions.

This is an example of abuse of notation, the practice of redefining some standard bit of notation (in this case, equations) to make calculation easier. It's generally a safe practice as long as everybody understands what is happening. But beware of applying facts about unabused equations to the abused ones. Just because O(n^2) = O(n^3) doesn't mean O(n^3) = O(n^2)—the big-O equations are not reversible the way ordinary equations are.

More discussion of this can be found in CormenEtAl.

96. More information


CategoryAlgorithmNotes CategoryMathNotes CategoryProgrammingNotes

97. LinkedLists

Linked lists are about the simplest data structure beyond arrays. They aren't very efficient for many purposes, but have very good performance for certain specialized applications.

98. Stacks and linked lists

On my desk is a pile of books that I am supposed to put away on my bookshelf. I don't care much about how they are organized, I just want to be able to dump a book quickly so that I can later go through and put all of them back at once. Because it's easiest just to dump each book on the top of the pile, I have effectively built a data structure called a stack, which supports a push operation (add a book to the top of the pile) and a pop operation (take the top book off and return it).

To build a stack in a program, I can adopt a similar strategy. When a new item appears, I will box it up inside a struct that will also include a pointer to the item below it in the stack. Since that item will point to the item below it, and so on, I end up with an arbitrarily long chain of pointers (usually ending in a null pointer).

The costs of push and pop operations are both O(1): they don't depend on the number of elements in the stack, since they are only working on the top element or two. Doing almost anything else (e.g., finding the k-th element of the stack or searching for a particular value) is likely to be much more expensive, O(n) in the worst case. The reason is that unlike an array, a linked list scatters its contents throughout memory, and the only way to get at a particular element is to crawl through all the ones that precede it.

98.1. Implementation

A very concrete implementation of a stack using a linked list might look like this:

   1 #include <stdlib.h>
   2 #include <string.h>
   3 
   4 /* Functional-style stack */
   5 /* All operations consume the old value and return the new value of the stack */
   6 
   7 typedef struct stack {
   8     char *book;         /* malloc'd name of book */
   9     struct stack *next; /* next item in stack, or 0 if there aren't any more */
  10 } *Stack;
  11 
  12 #define EMPTY_STACK (0)
  13 
  14 /* create a new empty stack */
  15 Stack
  16 stackCreate(void)
  17 {
  18     return EMPTY_STACK;
  19 }
  20 
  21 /* push a new book on the stack */
  22 /* copies second argument book (so you don't need to keep it around) */
  23 /* returns 0 on allocation error */
  24 Stack
  25 stackPush(Stack old, const char *book)
  26 {
  27     Stack new;
  28 
  29     new = malloc(sizeof(*new));
  30     if(new == 0) return 0;
  31 
  32     new->next = old;
  33     new->book = strdup(book);
  34 
  35     if(new->book == 0) {
  36         free(new);
  37         return 0;
  38     }
  39 
  40     return new;
  41 }
  42 
  43 /* pops a book off the stack */
  44 /* returns new tail of stack, stores book in *book */
  45 /* *book is malloc'd data that the caller should free */
  46 /* Stores 0 in *book if stack is empty */
  47 Stack
  48 stackPop(Stack old, char **book /*RETVAL*/)
  49 {
  50     Stack new;
  51 
  52     if(old == 0) {
  53         *book = 0;
  54         return 0;
  55     } else {
  56         new = old->next;
  57         *book = old->book;
  58         free(old);
  59         return new;
  60     }
  61 }
  62 
  63 /* frees a stack */
  64 void
  65 stackDestroy(Stack s)
  66 {
  67     char *book;
  68 
  69     while(s) {
  70         s = stackPop(s, &book);
  71         free(book);
  72     }
  73 }
functional_stack.c

The meat of this implementation is in stackPush and stackPop. These act like a constructor/destructor pair for individual stack elements. The stackPush function also does the work of linking the new element into the stack and copying the book argument. The choice to copy book is a design decision: it insulates the contents of the stack from any changes that might happen to book after stackPush exits (perhaps book is a buffer that is reused to read a new line from the input after each call to stackPush), but it may be more than what the user needs, and it forces the user to free the returned value in stackPop. An alternative would be to let the user call strdup for themselves.

98.2. A more complicated implementation

The simple implementation of a stack above doesn't really act like a single object; instead, it acts more like a family of constant values you can do arithmetic on, where applying stackPush or stackPop consumes some old value and generates a new one. For many applications, it is more natural to just think of having a single stack that changes over time.

We can do this with another layer of indirection. Instead of having a Stack be a pointer to the first element, we'll have it be a pointer to a pointer to the first element. This requires a little bit more work inside the implementation but looks nicer from the outside.

   1 #include <stdlib.h>
   2 #include <string.h>
   3 
   4 /* Imperative-style stack */
   5 /* All operations modify the stack in place */
   6 
   7 /* a Stack is a pointer to a pointer to the first element of a linked list */
   8 typedef struct stack {
   9     char *book;         /* malloc'd name of book */
  10     struct stack *next; /* next item in stack, or 0 if there aren't any more */
  11 } **Stack;
  12 
  13 /* create a new empty stack */
  14 Stack
  15 stackCreate(void)
  16 {
  17     Stack s;
  18     
  19     s = malloc(sizeof(struct stack *));
  20     if(s) *s = 0;   /* guard against malloc failure */
  21 
  22     return s;
  23 }
  24 
  25 /* push a new book on the stack */
  26 /* copies second argument book (so you don't need to keep it around) */
  27 /* returns 0 on allocation error or 1 on success */
  28 int
  29 stackPush(Stack s, const char *book)
  30 {
  31     struct stack *new;
  32 
  33     new = malloc(sizeof(*new));
  34     if(new == 0) return 0;
  35     new->next = *s;
  36     new->book = strdup(book);
  37 
  38     if(new->book == 0) {
  39         free(new);
  40         return 0;
  41     }
  42 
  43     *s = new;
  44     return 1;
  45 }
  46 
  47 /* pops a book off the stack */
  48 /* returns 0 if stack is empty */
  49 char *
  50 stackPop(Stack s)
  51 {
  52     struct stack *new;
  53     char *book;
  54 
  55     if(*s == 0) {
  56         return 0;
  57     } else {
  58         book = (*s)->book;
  59 
  60         /* we have to save (*s)->next before we free it */
  61         new = (*s)->next;
  62         free(*s);
  63         *s = new;
  64         return book;
  65     }
  66 }
  67 
  68 /* returns true if s is empty */
  69 int
  70 stackEmpty(Stack s)
  71 {
  72     return (*s) == 0;
  73 }
  74 
  75 /* frees a stack */
  76 void
  77 stackDestroy(Stack s)
  78 {
  79     while(!stackEmpty(s)) {
  80         free(stackPop(s));
  81     }
  82 
  83     free(s);
  84 }
object_stack.c

Here we have added a new stackEmpty routine because it's no longer obvious how to check (and we might someday want to change Stack to some other type where testing (*s) == 0 is no longer the right way to do it).

Note that the structure of the linked list is exactly the same in both implementations: the only difference is that the "imperative" version adds an extra level of indirection at the beginning. Yet another alternative is to have a dummy stack element at the beginning and do pushes and pops starting with the second place in the stack (i.e., replace (*s) with s->next in the second implementation). This consumes more space (we end up allocating space to hold s->book even though we don't put anything there), but can make things simpler for some more complicated linked-list structures.
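Here is a sketch of what the dummy-element variant might look like, reusing struct stack and the includes from the first implementation (only stackCreate and stackPush are shown; the other operations change similarly):

    /* a Stack is a pointer to a permanent dummy element; */
    /* the real top of the stack is s->next */
    typedef struct stack *Stack;

    Stack
    stackCreate(void)
    {
        Stack s;

        s = malloc(sizeof(*s));
        if(s) {
            s->book = 0;    /* the dummy element holds no book */
            s->next = 0;    /* stack starts out empty */
        }

        return s;
    }

    /* returns 0 on allocation error or 1 on success */
    int
    stackPush(Stack s, const char *book)
    {
        struct stack *new;

        new = malloc(sizeof(*new));
        if(new == 0) return 0;

        new->book = strdup(book);
        if(new->book == 0) {
            free(new);
            return 0;
        }

        new->next = s->next;    /* splice in just after the dummy */
        s->next = new;

        return 1;
    }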

98.3. Building a stack out of an array

When the elements of a stack are small, or when a maximum number of elements is known in advance, it often makes sense to build a stack from an array (with a variable storing the index of the top element) instead of a linked list. The reason is that pushes and pops only require updating the stack pointer instead of calling malloc or free to allocate space, and pre-allocating is almost always faster than allocating as needed. This is the strategy used to store the function call stack in almost all programs (the exception is in languages like Scheme, where the call stack is allocated on the heap because stack frames may outlive the function call that creates them).
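Here is a minimal sketch of an array-based stack of ints with a fixed maximum size (the names and the use of assert to catch overflow are choices made for this example):

    #include <assert.h>

    #define STACK_MAX (1024)

    struct intStack {
        int top;                  /* number of elements currently stored */
        int contents[STACK_MAX];  /* contents[0..top-1] hold the elements */
    };

    void
    intStackInit(struct intStack *s)
    {
        s->top = 0;
    }

    void
    intStackPush(struct intStack *s, int value)
    {
        assert(s->top < STACK_MAX);
        s->contents[s->top++] = value;
    }

    int
    intStackPop(struct intStack *s)
    {
        assert(s->top > 0);
        return s->contents[--s->top];
    }

Note that both push and pop just adjust top and touch a single array cell; there is no call to malloc or free anywhere.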

99. Looping over a linked list

Looping over a linked list is not hard if you have access to the next pointers. (For a more abstract way to do this see C/Iterators.)

Let's imagine somebody gave us a pointer to the first struct stack in a list; call this pointer first. Then we can write a loop like this that prints the contents of the stack:

   1 void
   2 stackPrint(struct stack *first)
   3 {
   4     struct stack *elt;
   5 
   6     for(elt = first; elt != 0; elt = elt->next) {
   7         puts(elt->book);
   8     }
   9 }

There's not a whole lot to notice here except that for is perfectly happy to iterate over something that isn't a range of integers. The running time is linear in the length of the list (O(n)).

100. Looping over a linked list backwards

What if we want to loop over a linked list backwards? The next pointers all go the wrong way, so we have to save a trail of breadcrumbs to get back. The safest way to do this is to reverse the original list into an auxiliary list:

   1 void
   2 stackPrintReversed(struct stack *first)
   3 {
   4     struct stack *elt;
   5     Stack s2;                   /* uses functional implementation */
   6 
   7     s2 = stackCreate();
   8 
   9     for(elt = first; elt != 0; elt = elt->next) {
  10         s2 = stackPush(s2, elt->book);
  11     }
  12 
  13     stackPrint(s2);
  14     stackDestroy(s2);
  15 }

Pushing all the elements from the first list onto s2 puts the first element on the bottom, so when we print s2 out, it's in the reverse order of the original stack.

We can also write a recursive function that prints the elements backwards. This function effectively uses the function call stack in place of the extra stack s2 above.

   1 void
   2 stackPrintReversedRecursive(struct stack *first)
   3 {
   4     if(first != 0) {
   5         /* print the rest of the stack */
   6         stackPrintReversedRecursive(first->next);
   7 
   8         /* then print the first element */
   9         puts(first->book);
  10     }
  11 }

The code in stackPrintReversedRecursive is shorter than the code in stackPrintReversed, and it is likely to be faster since it doesn't require allocating a second stack and copying all the elements. But it will only work for small stacks: because the function call stack is really a fixed-size array, if the input to stackPrintReversedRecursive is too big the recursion will go too deep and cause a stack overflow.

If we want to do this sort of thing a lot, we should build a doubly-linked list, with a pointer in each element both to the next element and the previous element instead of a singly-linked list (see below for more).

101. Queues

Stacks are last-in-first-out (LIFO) data structures: when we pop, we get the last item we pushed. What if we want a first-in-first-out (FIFO) data structure? Such a data structure is called a queue and can also be implemented by a linked list. The difference is that if we want O(1) time for both the enqueue (push) and dequeue (pop) operations, we must keep around pointers to both ends of the linked list.

Here is a simple queue holding ints:

   1 #include <stdlib.h>
   2 
   3 struct elt {
   4     int item;
   5     struct elt *next;
   6 };
   7 
   8 struct queue {
   9     struct elt *head;   /* first element in queue, or 0 if empty */
  10     struct elt *tail;   /* last element in queue */
  11 };
  12 
  13 typedef struct queue *Queue;
  14 
  15 Queue
  16 queueCreate(void)
  17 {
  18     Queue q;
  19 
  20     q = malloc(sizeof(*q));
  21 
  22     if(q) {
  23         q->head = 0;
  24     }
  25 
  26     return q;
  27 }
  28 
  29 void
  30 enqueue(Queue q, int item)
  31 {
  32     struct elt *e;
  33 
  34     e = malloc(sizeof(*e));
  35     if(e == 0) abort();
  36 
  37     e->item = item;
  38     e->next = 0;
  39     
  40     if(q->head == 0) {
  41         /* special case for empty queue */
  42         q->head = q->tail = e;
  43     } else {
  44         /* patch e in after tail */
  45         q->tail->next = e;
  46         q->tail = e;
  47     }
  48 }
  49 
  50 int
  51 queueEmpty(Queue q)
  52 {
  53     return q->head == 0;
  54 }
  55 
  56 #define EMPTY_QUEUE (-1)
  57 
  58 /* returns EMPTY_QUEUE if queue is empty */
  59 int
  60 dequeue(Queue q)
  61 {
  62     struct elt *e;
  63     int retval;
  64 
  65     if(queueEmpty(q)) {
  66         return EMPTY_QUEUE;
  67     } else {
  68         /* pop first element off */
  69         e = q->head;
  70         q->head = e->next;
  71 
  72         /* save its contents and free it */
  73         retval = e->item;
  74         free(e);
  75 
  76         return retval;
  77     }
  78 }
queue.c

It is a bit trickier to build a queue out of an array than to build a stack. The difference is that while a stack pointer can move up and down, leaving the base of the stack in the same place, a naive implementation of a queue would have head and tail pointers both marching ever onward across the array leaving nothing but empty cells in their wake. While it is possible to have the pointers wrap around to the beginning of the array when they hit the end, if the queue size is unbounded the tail pointer will eventually catch up to the head pointer. At this point (as in a stack that overflows), it is necessary to allocate more space and copy the old elements over.
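Here is a sketch of the wraparound idea for a fixed-capacity queue of ints, ignoring the problem of growing the array (the names and the assert-based overflow handling are choices made for this example):

    #include <assert.h>

    #define QUEUE_MAX (1024)

    struct intQueue {
        int head;                 /* index of the oldest element */
        int count;                /* number of elements currently stored */
        int contents[QUEUE_MAX];
    };

    void
    intQueueInit(struct intQueue *q)
    {
        q->head = 0;
        q->count = 0;
    }

    void
    intQueueEnqueue(struct intQueue *q, int value)
    {
        assert(q->count < QUEUE_MAX);
        q->contents[(q->head + q->count) % QUEUE_MAX] = value;
        q->count++;
    }

    /* precondition: queue is not empty */
    int
    intQueueDequeue(struct intQueue *q)
    {
        int retval;

        assert(q->count > 0);
        retval = q->contents[q->head];
        q->head = (q->head + 1) % QUEUE_MAX;
        q->count--;

        return retval;
    }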

102. Deques and doubly-linked lists

Suppose we want a data structure that represents a line of elements where we can push or pop elements at either end. Such a data structure is known as a deque (pronounced like "deck"), and can be implemented with all operations taking O(1) time by a doubly-linked list, where each element has a pointer to both its successor and its predecessor.

An ordinary singly-linked list is not good enough. The reason is that even if we keep a pointer to both ends as in a queue, when it comes time to pop an element off the tail, we have no pointer to its predecessor ready to hand; the best we can do is scan from the head until we get to an element whose successor is the tail, which takes O(n) time.

The solution is to build something like the code below. To avoid special cases, we use a lot of 2-element arrays of pointers, so that pushing or popping from either end of the queue is completely symmetric.

   1 typedef struct deque *Deque;
   2 
   3 #define FRONT (0)
   4 #define BACK (1)
   5 
   6 #define NUM_DIRECTIONS (2)
   7 
   8 #define EMPTY_QUEUE (-1)
   9 
  10 Deque dequeCreate(void);
  11 
  12 void dequePush(Deque d, int direction, int value);
  13 
  14 int dequePop(Deque d, int direction);
  15 
  16 int dequeEmpty(Deque d);
  17 
  18 void dequeDestroy(Deque d);
deque.h

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 
   4 #include "deque.h"
   5 
   6 struct elt {
   7     int value;
   8     struct elt *next[NUM_DIRECTIONS];
   9 };
  10 
  11 struct deque {
  12     struct elt *head[NUM_DIRECTIONS];
  13 };
  14 
  15 Deque
  16 dequeCreate(void)
  17 {
  18     Deque d;
  19 
  20     d = malloc(sizeof(*d));
  21     if(d) {
  22         d->head[FRONT] = d->head[BACK] = 0;
  23     } 
  24 
  25     return d;
  26 }
  27 
  28 void
  29 dequePush(Deque d, int direction, int value)
  30 {
  31     struct elt *e;
  32 
  33     assert(direction == FRONT || direction == BACK);
  34 
  35     e = malloc(sizeof(*e));
  36     if(e == 0) abort();  /* kill process on malloc failure */
  37     
  38     e->next[direction] = d->head[direction];
  39     e->next[!direction] = 0;
  40     e->value = value;
  41 
  42     if(d->head[direction]) {
  43         d->head[direction]->next[!direction] = e;
  44     }
  45 
  46     d->head[direction] = e;
  47 
  48     /* special case for empty deque */
  49     /* e becomes the head on both sides */
  50     if(d->head[!direction] == 0) {
  51         d->head[!direction] = e;
  52     }
  53 }
  54 
  55 int
  56 dequePop(Deque d, int direction)
  57 {
  58     struct elt *e;
  59     int retval;
  60 
  61     assert(direction == FRONT || direction == BACK);
  62 
  63     e = d->head[direction];
  64 
  65     if(e == 0) return EMPTY_QUEUE;
  66 
  67     /* else remove it */
  68     d->head[direction] = e->next[direction];
  69 
  70     if(d->head[direction] != 0) {
  71         d->head[direction]->next[!direction] = 0;
  72     } else { 
  73         /* special case when e is the last element */
  74         /* clear other head field */
  75         d->head[!direction] = 0;
  76     }
  77 
  78     retval = e->value;
  79 
  80     free(e);
  81 
  82     return retval;
  83 }
  84 
  85 int
  86 dequeEmpty(Deque d)
  87 {
  88     return d->head[FRONT] == 0;
  89 }
  90 
  91 void
  92 dequeDestroy(Deque d)
  93 {
  94     while(!dequeEmpty(d)) {
  95         dequePop(d, FRONT);
  96     }
  97 
  98     free(d);
  99 }
deque.c

And here is some test code: test_deque.c, Makefile.

103. Circular linked lists

For some applications, there is no obvious starting or ending point to a list, and a circular list (where the last element points back to the first) may be appropriate. Circular doubly-linked lists can also be used to build deques; a single pointer into the list tracks the head of the deque, with some convention adopted for whether the head is an actual element of the list (at the front, say, with its left neighbor at the back) or a dummy element that is not considered to be part of the list.

The selling point of circular doubly-linked lists as a concrete data structure is that insertions and deletions can be done anywhere in the list with only local information. For example, here are some routines for manipulating a doubly-linked list directly. We'll make our lives easy and assume (for the moment) that the list has no actual contents to keep track of.

   1 #include <stdlib.h>
   2 
   3 /* directions for doubly-linked list next pointers */
   4 #define RIGHT (0)
   5 #define LEFT (1)
   6 
   7 struct elt {
   8     struct elt *next[2];
   9 };
  10 
  11 typedef struct elt *Elt;
  12 
  13 /* create a new circular doubly-linked list with 1 element */
  14 /* returns 0 on allocation error */
  15 Elt
  16 listCreate(void)
  17 {
  18     Elt e;
  19 
  20     e = malloc(sizeof(*e));
  21     if(e) {
  22         e->next[LEFT] = e->next[RIGHT] = e;
  23     }
  24 
  25     return e;
  26 }
  27 
  28 /* remove an element from a list */
  29 /* Make sure you keep a pointer to some other element! */
  30 /* does not free the removed element */
  31 void
  32 listRemove(Elt e)
  33 {
  34     /* splice e out */
  35     e->next[RIGHT]->next[LEFT] = e->next[LEFT];
  36     e->next[LEFT]->next[RIGHT] = e->next[RIGHT];
  37 }
  38     
  39 /* insert an element e into list in direction dir from head */
  40 void
  41 listInsert(Elt head, int dir, Elt e)
  42 {
  43     /* fill in e's new neighbors */
  44     e->next[dir] = head->next[dir];
  45     e->next[!dir] = head;
  46 
  47     /* make neighbors point back at e */
  48     e->next[dir]->next[!dir] = e;
  49     e->next[!dir]->next[dir] = e;
  50 }
  51 
  52 /* split a list, removing all elements between e1 and e2 */
  53 /* e1 is the leftmost node of the removed subsequence, e2 rightmost */
  54 /* the removed elements are formed into their own linked list */
  55 /* comment: listRemove could be implemented as listSplit(e,e) */
  56 void
  57 listSplit(Elt e1, Elt e2)
  58 {
  59     /* splice out the new list */
  60     e2->next[RIGHT]->next[LEFT] = e1->next[LEFT];
  61     e1->next[LEFT]->next[RIGHT] = e2->next[RIGHT];
  62 
  63     /* fix up the ends */
  64     e2->next[RIGHT] = e1;
  65     e1->next[LEFT] = e2;
  66 }
  67 
  68 /* splice a list starting at e2 after e1 */
  69 /* e2 becomes e1's right neighbor */
  70 /* e2's left neighbor becomes left neighbor of e1's old right neighbor */
  71 void
  72 listSplice(Elt e1, Elt e2)
  73 {
  74     /* fix up tail end */
  75     e2->next[LEFT]->next[RIGHT] = e1->next[RIGHT];
  76     e1->next[RIGHT]->next[LEFT] = e2->next[LEFT];
  77 
  78     /* fix up e1 and e2 */
  79     e1->next[RIGHT] = e2;
  80     e2->next[LEFT] = e1;
  81 }
  82 
  83 /* free all elements of the list containing e */
  84 void
  85 listDestroy(Elt e)
  86 {
  87     Elt target;
  88     Elt next;
  89 
  90     /* we'll free elements until we get back to e, then free e */
  91     /* note use of pointer address comparison to detect end of loop */
  92     for(target = e->next[RIGHT]; target != e; target = next) {
  93         next = target->next[RIGHT];
  94         free(target);
  95     }
  96 
  97     free(e);
  98 }
circular.c

The above code might or might not actually work. What if it doesn't? It may make sense to include some sanity-checking code that we can run to see if our pointers are all going to the right place:

   1 #include <assert.h>
   2 
   3 /* assert many things about correctness of the list */
   4 /* Amazingly, this is guaranteed to abort or return no matter
   5    how badly screwed up the list is. */
   6 void
   7 listSanityCheck(Elt e)
   8 {
   9     Elt check;
  10 
  11     assert(e != 0);
  12 
  13     check = e;
  14 
  15     do {
  16 
  17         /* are our pointers consistent with our neighbors? */
  18         assert(check->next[RIGHT]->next[LEFT] == check);
  19         assert(check->next[LEFT]->next[RIGHT] == check);
  20 
  21         /* on to the next */
  22         check = check->next[RIGHT];
  23 
  24     } while(check != e);
  25 }

What if we want to store something in this list? The simplest approach is to extend the definition of struct elt:

   1 struct elt {
   2     struct elt *next[2];
   3     char *name;
   4     int socialSecurityNumber;
   5     int gullibility;
   6 };

But then we can only use the code for one particular type of data. An alternative approach is to define a new Elt-plus struct:

   1 struct fancyElt {
   2     struct elt *next[2];
   3     char *name;
   4     int socialSecurityNumber;
   5     int gullibility;
   6 };

and then use pointer casts to pass the fancy structs off as Elts:

   1     struct fancyElt *e;
   2 
   3     e = malloc(sizeof(*e));
   4 
   5     /* fill in fields on e */
   6 
   7     listInsert(someList, (Elt) e);

The trick here is that as long as the initial part of the struct fancyElt looks like a struct elt, any code that expects a struct elt will happily work with it and ignore the fields that happen to be sitting later in memory. (This trick is how C++ inheritance works.)

The downside is that if something needs to be done with the extra fields (e.g., freeing e->name when the element is freed), the generic Elt functions won't know to do it. So if you use this trick, you should be careful.
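
For example, a type-specific cleanup routine along these lines (hypothetical; fancyEltDestroy is not part of the code above) would free the extra fields before freeing the element itself:

    /* hypothetical cleanup for a fancyElt; the generic listDestroy
       above would leak e->name, so we handle it here ourselves */
    void
    fancyEltDestroy(struct fancyElt *e)
    {
        listRemove((Elt) e);   /* splice it out of its list first */
        free(e->name);         /* free the extra field... */
        free(e);               /* ...then the element itself */
    }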

A similar technique using void * pointers is described in C/GenericContainers.

104. What linked lists are and are not good for

Linked lists are good for any task that involves inserting or deleting elements next to an element you already have a pointer to; such operations can usually be done in O(1) time. They generally beat arrays (even resizeable arrays) if you need to insert or delete in the middle of a list, since an array has to copy any elements above the insertion point to make room; if inserts or deletes always happen at the end, an array may be better.

Linked lists are not good for any operation that requires random access, since reaching an arbitrary element of a linked list takes as much as O(n) time. For such applications, arrays are better if you don't need to insert in the middle; if you do, you should use some sort of tree (see BinaryTrees).

105. Further reading

KernighanPike gives an example of a linked list in Section 2.7. A description of many different kinds of linked lists with pictures can be found at Linked_list.


CategoryProgrammingNotes

106. C/Recursion

Recursion is when a function calls itself. Some programming languages (particularly functional programming languages like Scheme, ML, or Haskell) use recursion as a basic tool for implementing algorithms that in other languages would typically be expressed using iteration (loops). Procedural languages like C tend to emphasize iteration over recursion, but can support recursion as well.

107. Example of recursion in C

Here are a bunch of routines that print the numbers from 0 to 9:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 /* all of these routines print numbers i where start <= i < stop */
   6 
   7 void
   8 printRangeIterative(int start, int stop)
   9 {
  10     int i;
  11 
  12     for(i = start; i < stop; i++) {
  13         printf("%d\n", i);
  14     }
  15 }
  16 
  17 void
  18 printRangeRecursive(int start, int stop)
  19 {
  20     if(start < stop) {
  21         printf("%d\n", start);
  22         printRangeRecursive(start+1, stop);
  23     }
  24 }
  25 
  26 void
  27 printRangeRecursiveReversed(int start, int stop)
  28 {
  29     if(start < stop) {
  30         printRangeRecursiveReversed(start+1, stop);
  31         printf("%d\n", start);
  32     }
  33 }
  34 
  35 void
  36 printRangeRecursiveSplit(int start, int stop)
  37 {
  38     int mid;
  39 
  40     if(start < stop) {
  41         mid = (start + stop) / 2;
  42 
  43         printRangeRecursiveSplit(start, mid);
  44         printf("%d\n", mid);
  45         printRangeRecursiveSplit(mid+1, stop);
  46     }
  47 }
  48 
  49 #define Noisy(x) (puts(#x), x)
  50 
  51 int
  52 main(int argc, char **argv)
  53 {
  54 
  55     if(argc != 1) {
  56         fprintf(stderr, "Usage: %s\n", argv[0]);
  57         return 1;
  58     }
  59 
  60     Noisy(printRangeIterative(0, 10));
  61     Noisy(printRangeRecursive(0, 10));
  62     Noisy(printRangeRecursiveReversed(0, 10));
  63     Noisy(printRangeRecursiveSplit(0, 10));
  64 
  65     return 0;
  66 }
recursion.c

And here is the output:

printRangeIterative(0, 10)
0
1
2
3
4
5
6
7
8
9
printRangeRecursive(0, 10)
0
1
2
3
4
5
6
7
8
9
printRangeRecursiveReversed(0, 10)
9
8
7
6
5
4
3
2
1
0
printRangeRecursiveSplit(0, 10)
0
1
2
3
4
5
6
7
8
9

The first function printRangeIterative is simple and direct: it's the sort of loop we've been writing all along. The others are a bit more mysterious.

The function printRangeRecursive is an example of solving a problem using the DivideAndConquer method. If we don't know how to print a range of numbers 0 through 9, maybe we can start by solving a simpler problem of printing the first number 0. Having done that, we have a new, smaller problem: print the numbers 1 through 9. But then we notice we already have a function printRangeRecursive that will do that for us. So we'll call it.

If you aren't used to this, it has the feeling of trying to make yourself fly by pulling very hard on your shoelaces. But in fact the computer will happily generate the eleven nested instances of printRangeRecursive to make this happen. When we hit the bottom, the call stack will look something like this:

printRangeRecursive(0, 10)
 printRangeRecursive(1, 10)
  printRangeRecursive(2, 10)
   printRangeRecursive(3, 10)
    printRangeRecursive(4, 10)
     printRangeRecursive(5, 10)
      printRangeRecursive(6, 10)
       printRangeRecursive(7, 10)
        printRangeRecursive(8, 10)
         printRangeRecursive(9, 10)
          printRangeRecursive(10, 10)

This works because each call to printRangeRecursive gets its own parameters and its own variables separate from the others, even the ones that are still in progress. So each will print out start and then call another copy in to print start+1 etc. In the last call, we finally fail the test start < stop, so the function exits, then its parent exits, and so on until we unwind all the calls on the stack back to the first one.

In printRangeRecursiveReversed, the calling pattern is exactly the same, but now instead of printing start on the way down, we print start on the way back up, after making the recursive call. This means that in printRangeRecursiveReversed(0, 10), 0 is printed only after the results of printRangeRecursiveReversed(1, 10), which gives us the countdown effect.

So far these procedures all behave very much like ordinary loops, with increasing values on the stack standing in for the loop variable. More exciting is printRangeRecursiveSplit. This function takes a much more aggressive approach to dividing up the problem: it splits a range [0, 10) into two ranges [0, 5) and [6, 10) separated by a midpoint 5. We want to print the midpoint in the middle, of course, and we can use printRangeRecursiveSplit recursively to print the two ranges. Following the execution of this procedure is more complicated, with the start of the sequence of calls looking something like this:

printRangeRecursiveSplit(0, 10)
 printRangeRecursiveSplit(0, 5)
  printRangeRecursiveSplit(0, 2)
   printRangeRecursiveSplit(0, 1)
    printRangeRecursiveSplit(0, 0)
    printRangeRecursiveSplit(1, 1)
   printRangeRecursiveSplit(2, 2)
  printRangeRecursiveSplit(3, 5)
   printRangeRecursiveSplit(3, 4)
    printRangeRecursiveSplit(3, 3)
    printRangeRecursiveSplit(4, 4)
   printRangeRecursiveSplit(5, 5)
 printRangeRecursiveSplit(6, 10)
  ... etc.

Here it is not so obvious how one might rewrite this procedure as a loop.
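
One approach (a sketch of my own, not from the original notes) is to simulate the call stack with an explicit stack of pending ranges. Each popped range either prints a single number or splits into three smaller jobs, pushed in reverse order so that they are handled left to right:

    #include <stdio.h>

    /* pending ranges grow only logarithmically with the range size,
       so a small fixed stack is plenty for any int range */
    #define MAX_PENDING (128)

    void
    printRangeIterativeSplit(int start, int stop)
    {
        struct { int start; int stop; } stack[MAX_PENDING];
        int top;    /* number of pending ranges */
        int mid;

        top = 0;
        stack[top].start = start;
        stack[top].stop = stop;
        top++;

        while(top > 0) {
            /* pop the most recently pushed range */
            top--;
            start = stack[top].start;
            stop = stack[top].stop;

            if(stop - start == 1) {
                printf("%d\n", start);
            } else if(start < stop) {
                mid = (start + stop) / 2;

                /* push the three jobs in reverse order of execution */
                stack[top].start = mid+1; stack[top].stop = stop;  top++;
                stack[top].start = mid;   stack[top].stop = mid+1; top++;
                stack[top].start = start; stack[top].stop = mid;   top++;
            }
            /* empty ranges are simply dropped */
        }
    }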

108. Common problems with recursion

Like iteration, recursion is a powerful tool that can cause your program to do much more than you might expect from a few lines of code. While it may seem that errors in recursive functions would be harder to track down than errors in loops, most of the time they come down to one of a few basic causes.

108.1. Omitting the base case

Suppose we leave out the if statement in printRangeRecursive:

   1 void
   2 printRangeRecursiveBad(int start, int stop)
   3 {
   4     printf("%d\n", start);
   5     printRangeRecursiveBad(start+1, stop);
   6 }

This will still work, in a sense. When called as printRangeRecursiveBad(0, 10), it will print 0, call itself with printRangeRecursiveBad(1, 10), print 1, 2, 3, etc., but there is nothing to stop it at 10 (or anywhere else). So our output will be a long string of numbers followed by a segmentation fault, when we blow out the stack.

This is the recursive version of an infinite loop: the same thing happens if we forget a loop test and write

   1 void
   2 printRangeIterativeBad(int start, int stop)
   3 {
   4     int i;
   5 
   6     for(i = 0; ; i++) {
   7         printf("%d\n", i);
   8     }
   9 }

except that now the program just runs forever, since it never runs out of resources. (This is an example of how iteration is more efficient than recursion, at least in C.)

108.2. Blowing out the stack

Blowing out the stack is what happens when a recursion is too deep. Typically, the operating system puts a hard limit on how big the stack can grow, on the assumption that any program that grows the stack too much has gone insane and needs to be killed before it does more damage. One of the ways this can happen is if we forget the base case as above, but it can also happen if we just try to use a recursive function to do too much. For example, if we call printRangeRecursive(0, 1000000), we will probably get a segmentation fault after the first 100,000 numbers or so.

For this reason, it's best to try to avoid linear recursions like the one in printRangeRecursive, where the depth of the recursion is proportional to the number of things we are doing. Much safer are even splits like printRangeRecursiveSplit, since the depth of the stack will now be only logarithmic in the number of things we are doing. Fortunately, linear recursions are often tail-recursive, where the recursive call is the last thing the recursive function does; in this case, we can use a standard transformation (see below) to convert the tail-recursive function into an iterative function.

108.3. Failure to make progress

Sometimes we end up blowing out the stack because we thought we were recursing on a smaller instance of the problem, but in fact we weren't. Consider this broken version of printRangeRecursiveSplit:

   1 void
   2 printRangeRecursiveSplitBad(int start, int stop)
   3 {
   4     int mid;
   5 
   6     if(start == stop) {
   7         printf("%d\n", start);
   8     } else {
   9         mid = (start + stop) / 2;
  10 
  11         printRangeRecursiveSplitBad(start, mid);
  12         printRangeRecursiveSplitBad(mid, stop);
  13     }
  14 }

This will get stuck on as simple a call as printRangeRecursiveSplitBad(0, 1); it will set mid to 0, and while the recursive call to printRangeRecursiveSplitBad(0, 0) will work just fine, the recursive call to printRangeRecursiveSplitBad(0, 1) will put us back where we started, giving an infinite recursion.

Detecting these errors is usually not too hard (segmentation faults that produce huge piles of stack frames when you type where in gdb are a dead give-away). Figuring out how to make sure that you do in fact always make progress can be trickier.

109. Tail-recursion versus iteration

Tail recursion is when a recursive function calls itself only once, and as the last thing it does. The printRangeRecursive function is an example of a tail-recursive function:

   1 void
   2 printRangeRecursive(int start, int stop)
   3 {
   4     if(start < stop) {
   5         printf("%d\n", start);
   6         printRangeRecursive(start+1, stop);
   7     }
   8 }

The nice thing about tail-recursive functions is that we can always translate them directly into iterative functions. The reason is that when we do the tail call, we are effectively replacing the current copy of the function with a new copy with new arguments. So rather than keeping around the old zombie parent copy—which has no purpose other than to wait for the child to return and then return itself—we can reuse it by assigning new values to its arguments and jumping back to the top of the function.

Done literally, this produces this goto-considered-harmful monstrosity:

   1 void
   2 printRangeRecursiveGoto(int start, int stop)
   3 {
   4     topOfFunction:
   5 
   6     if(start < stop) {
   7         printf("%d\n", start);
   8 
   9         start = start+1;
  10         goto topOfFunction;
  11     }
  12 }

But we can always remove goto statements using less offensive control structures. In this particular case, the pattern of jumping back to just before an if matches up exactly with what we get from a while loop:

   1 void
   2 printRangeRecursiveNoMore(int start, int stop)
   3 {
   4     while(start < stop) {
   5         printf("%d\n", start);
   6 
   7         start = start+1;
   8     }
   9 }

In functional programming languages, this transformation is usually done in the other direction, to unroll loops into recursive functions. Since C doesn't like recursive functions so much (they blow out the stack!), we usually do this transformation to get rid of recursion instead of adding it.

110. An example of useful recursion

So far the examples we have given have not been very useful, or have involved recursion that we can easily replace with iteration. Here is an example of a recursive procedure that cannot be as easily turned into an iterative version.

We are going to implement the Mergesort algorithm on arrays. This is a classic DivideAndConquer sorting algorithm that splits an array into two pieces, sorts each piece (recursively!), then merges the results back together. Here is the code, together with a simple test program.

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <string.h>
   4 
   5 /* merge sorted arrays a1 and a2, putting result in out */
   6 void
   7 merge(int n1, const int a1[], int n2, const int a2[], int out[])
   8 {
   9     int i1;
  10     int i2;
  11     int iout;
  12 
  13     i1 = i2 = iout = 0;
  14 
  15     while(i1 < n1 || i2 < n2) {
  16         if(i2 >= n2 || (i1 < n1) && (a1[i1] < a2[i2])) {
  17             /* a1[i1] exists and is smaller */
  18             out[iout++] = a1[i1++];
  19         }  else {
  20             /* a1[i1] doesn't exist, or is bigger than a2[i2] */
  21             out[iout++] = a2[i2++];
  22         }
  23     }
  24 }
  25 
  26 /* sort a, putting result in out */
  27 /* we call this mergeSort to avoid conflict with mergesort in libc */
  28 void
  29 mergeSort(int n, const int a[], int out[])
  30 {
  31     int *a1;
  32     int *a2;
  33 
  34     if(n < 2) {
  35         /* 0 or 1 elements is already sorted */
  36         memcpy(out, a, sizeof(int) * n);
  37     } else {
  38         /* sort into temp arrays */
  39         a1 = malloc(sizeof(int) * (n/2));
  40         a2 = malloc(sizeof(int) * (n - n/2));
  41 
  42         mergeSort(n/2, a, a1);
  43         mergeSort(n - n/2, a + n/2, a2);
  44 
  45         /* merge results */
  46         merge(n/2, a1, n - n/2, a2, out);
  47 
  48         /* free the temp arrays */
  49         free(a1);
  50         free(a2);
  51     }
  52 }
  53 
  54 #define N (10)
  55 
  56 int
  57 main(int argc, char **argv)
  58 {
  59     int a[N];
  60     int b[N];
  61     int i;
  62 
  63     for(i = 0; i < N; i++) {
  64         a[i] = N-i-1;
  65     }
  66 
  67     for(i = 0; i < N; i++) {
  68         printf("%d ", a[i]);
  69     }
  70 
  71     putchar('\n');
  72 
  73     mergeSort(N, a, b);
  74 
  75     for(i = 0; i < N; i++) {
  76         printf("%d ", b[i]);
  77     }
  78 
  79     return 0;
  80 }
mergesort.c

The cost of this is pretty cheap: O(n log n), since each element of a is processed through merge once for each array it gets put in, and the recursion only goes log n layers deep before we hit 1-element arrays.
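
To spell out where that bound comes from (the recurrence is standard, though not written out above): letting T(n) be the running time of mergeSort on n elements, we have

  • T(n) = 2·T(n/2) + O(n) (ignoring rounding when n is odd),

since merge does O(n) work, and the solution to this recurrence is T(n) = O(n log n).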


CategoryProgrammingNotes

111. C/HashTables

A hash table is a randomized data structure that supports the INSERT, DELETE, and FIND operations in expected O(1) time. The core idea behind hash tables is to use a hash function that maps a large keyspace to a smaller domain of array indices, and then use constant-time array operations to store and retrieve the data.

112. Dictionary data types

A hash table is typically used to implement a dictionary data type, where keys are mapped to values, but unlike an array, the keys are not conveniently arranged as integers 0, 1, 2, .... Dictionaries are a fundamental data type found in most scripting languages, such as AWK, Perl, Python, PHP, Lua, and Ruby. For example, here is some Python code that demonstrates use of a dictionary accessed using an array-like syntax:

   1 title = {}   # empty dictionary
   2 title["Barack"] = "President"
   3 user = "Barack"
   4 print("Welcome" + title[user] + " " + user)

In C, we don't have the convenience of reusing [] for dictionary lookups (we'd need C++ for that), but we can still get the same effect with more typing using functions. For example, using an abstract dictionary in C might look like this:

   1 Dict *title;
   2 const char *user;
   3 title = dictCreate();
   4 dictSet(title, "Barack", "President");
   5 user = "Barack";
   6 printf("Welcome %s %s\n", dictGet(title, user), user);

As with other abstract data types, the idea is that the user of the dictionary type doesn't need to know how it is implemented. For example, we could implement the dictionary as an array of structs that we search through, but that would be expensive: O(n) time to find a key in the worst case.

113. Basics of hashing

If our keys were conveniently named 0, 1, 2, ..., n-1, we could simply use an array, and be able to find a record given a key in constant time. Unfortunately, naming conventions for most objects are not so convenient, and even enumerations like Social Security numbers are likely to span a larger range than we want to allocate. But we would like to get the constant-time performance of an array anyway.

The solution is to feed the keys through some hash function H, which maps them down to array indices. So in a database of people, to find "Smith, Wayland", we would first compute H("Smith, Wayland") = 137 (say), and then look in position 137 in the array. Because we are always using the same function H, we will always be directed to the same position 137.

114. Resolving collisions

But what if H("Smith, Wayland") and H("Hephaestos") both equal 137? Now we have a collision, and we have to resolve it by finding some way to either (a) effectively store both records in a single array location, or (b) move one of the records to a new location that we can still find later. Let's consider these two approaches separately.

114.1. Chaining

We can't really store more than one record in an array location, but we can fake it by making each array location be a pointer to a linked list. Every time we insert a new element in a particular location, we simply add it to this list.

Since the cost of scanning a linked list is linear in its size, this means that the worst-case cost of searching for a particular key will be linear in the number of keys in the table that hash to the same location. Under the assumption that the hash function is a random function (which does not mean that it returns random values every time you call it but instead means that we picked one of the many possible hash functions uniformly at random), we can analyze the expected cost of a failed search as a function of the load factor α = n/m, where n is the number of elements in the table and m is the number of locations. We have

  • E[number of elements hashed to the same location as x]

    = ∑ (over all elements y) Pr[y is hashed to the same location as x] = n · (1/m) = n/m.

So as long as the number of elements is proportional to the available space, we get constant-time FIND operations.

114.2. Open addressing

With open addressing, we store only one element per location, and handle collisions by storing the extra elements in other unused locations in the array. To find these other locations, we fix some probe sequence that tells us where to look if A[H(x)] contains an element that is not x. A typical probe sequence (called linear probing) is just H(x), H(x)+1, H(x)+2, ..., wrapping around at the end of the array. The idea is that if we can't put an element in a particular place, we just keep walking up through the array until we find an empty slot. As long as we follow the same probe sequence when looking for an element, we will be able to find the element again. If we are looking for an element and reach an empty location, then we know that the element is not present in the table.
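
As a minimal sketch of the search side of linear probing (the names table, m, EMPTY, and hash here are assumptions for illustration, not taken from the implementation in section 117.1 below):

    #define EMPTY (-1)   /* sentinel stored in unused slots */

    /* returns nonzero if x is in table; assumes table has m slots,
       at least one of which is EMPTY, so the loop must terminate */
    int
    probeContains(int table[], unsigned long m, int x,
                  unsigned long (*hash)(int))
    {
        unsigned long probe;

        for(probe = hash(x) % m;
            table[probe] != EMPTY;
            probe = (probe + 1) % m) {
            if(table[probe] == x) {
                return 1;
            }
        }

        return 0;
    }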

For open addressing, we always have that α = n/m is less than or equal to 1, since we can't store more elements in the table than we have locations. In fact, we must ensure that the load factor is strictly less than 1, or some searches will never terminate because they never reach an empty location. Assuming α < 1 and that the hash function is uniform, we can calculate the worst-case expected cost of a FIND operation, which as before will occur when we have an unsuccessful FIND.

Let T(n,m) be the expected number of probes in an unsuccessful search in a hash table with open addressing, n elements, and m locations. We always do at least one probe. With probability n/m we found something and have to try again in the next location, at cost T(n-1,m-1). So we have

  • T(n,m) = 1 + (n/m) T(n-1,m-1)

and

  • T(0,m) = 1.

This is an annoying recurrence to have to solve exactly. So instead we will get an upper bound by observing that the probability that we keep going is always less than or equal to n/m (since (n-i)/(m-i) < n/m when n < m). If we further leave off the case where m = 0, we get a coin-flipping problem where we are waiting to see a coin with probability n/m of coming up heads come up tails. This has the much simpler recurrence

  • T = 1 + (n/m) T

for which the solution is

  • T = 1/(1-n/m).

115. Choosing a hash function

Here we will describe three methods for generating hash functions. The first two are typical methods used in practice. The last has additional desirable theoretical properties.

115.1. Division method

We want our hash function to look as close as it can to a random function, but random functions are (provably) expensive to store. So in practice we do something simpler and hope for the best. If the keys are large integers, a typical approach is to just compute the remainder mod m. This can cause problems if m is, say, a power of 2, since it may be that the low-order bits of all the keys are similar, which will produce lots of collisions. So in practice with this method m is typically chosen to be a large prime.

What if we want to hash strings instead of integers? The trick is to treat the strings as integers. Given a string a1a2a3...ak, we represent it as the base-b number ∑i ai·b^(k-i) = a1·b^(k-1) + a2·b^(k-2) + ... + ak, where b is a base chosen to be larger than the number of characters. We can then feed this resulting huge integer to our hash function. Typically we do not actually compute the huge integer directly, but instead compute its remainder mod m, as in this short C function:

   1 /* treat strings as base-256 integers */
   2 /* with digits in the range 1 to 255 */
   3 #define BASE (256)
   4 
   5 unsigned long
   6 hash(const char *s, unsigned long m)
   7 {
   8     unsigned long h;
   9     unsigned const char *us;
  10 
  11     /* cast s to unsigned const char * */
  12     /* this ensures that elements of s will be treated as having values >= 0 */
  13     us = (unsigned const char *) s;
  14 
  15     h = 0;
  16     while(*us != '\0') {
  17         h = (h * BASE + *us) % m;
  18         us++;
  19     } 
  20 
  21     return h;
  22 }

The division method works best when m is a prime, as otherwise regularities in the keys can produce clustering in the hash values. (Consider, for example, what happens if m equals 256: since BASE is also 256, the final value of h is just the last character of the string, because all of the higher-order digits vanish mod 256.) But this can be awkward for computing hash functions quickly, because computing remainders is a relatively slow operation.

115.2. Multiplication method

For this reason, the most commonly-used hash functions replace the modulus m with something like 2^32 and replace the base with some small prime, relying on the multiplier to break up patterns in the input. This yields the "multiplication method." Typical code might look something like this:

   1 #define MULTIPLIER (37)
   2 
   3 unsigned long
   4 hash(const char *s)
   5 {
   6     unsigned long h;
   7     unsigned const char *us;
   8 
   9     /* cast s to unsigned const char * */
  10     /* this ensures that elements of s will be treated as having values >= 0 */
  11     us = (unsigned const char *) s;
  12 
  13     h = 0;
  14     while(*us != '\0') {
  15         h = h * MULTIPLIER + *us;
  16         us++;
  17     } 
  18 
  19     return h;
  20 }

The only difference between this code and the division method code is that we've renamed BASE to MULTIPLIER and dropped m. There is still some remainder-taking happening: since C truncates the result of any operation that exceeds the size of the integer type that holds it, the h = h * MULTIPLIER + *us; line effectively has a hidden mod 2^32 or 2^64 at the end of it (depending on how big your unsigned longs are). Now we can't use, say, 256, as the multiplier, because then the hash value h would be determined by just the last four characters of s.

The choice of 37 is based on folklore. I like 97 myself, and 31 also has supporters. Almost any medium-sized prime should work.

115.3. Universal hashing

The preceding hash functions offer no guarantees that the adversary can't find a set of n keys that all hash to the same location; indeed, since they're deterministic, as long as the keyspace contains at least n·m keys the adversary can always do so. Universal families of hash functions avoid this problem by choosing the hash function randomly, from some set of possible functions that is small enough that we can write our random choice down.

The property that makes a family of hash functions {Hr} universal is that, for any distinct keys x and y, the probability that r is chosen so that Hr(x) = Hr(y) is exactly 1/m.

Why is this important? Recall that for chaining, the expected number of collisions between an element x and other elements was just the sum over all particular elements y of the probability that x collides with that particular element. If Hr is drawn from a universal family, this probability is 1/m for each y, and we get the same n/m expected collisions as if Hr were completely random.

Several universal families of hash functions are known. Here is a simple one that works when the size of the keyspace and the size of the output space are both powers of 2. Let the keyspace consist of n-bit strings and let m = 2^k. Then the random index r consists of nk independent random bits organized as n k-bit strings a1, a2, ..., an. To compute the hash function of a particular input x, compute the bitwise exclusive or of ai for each position i where the i-th bit of x is a 1. Formally, using XOR to mean bitwise exclusive or, we might write this as

  • Hr(x) = XORi xi·ai (the XOR of ai over all positions i where xi = 1).

We can implement this in C as

   1 /* implements universal hashing using random bit-vectors in x */
   2 /* assumes number of elements in x is at least BITS_PER_CHAR * MAX_STRING_SIZE */
   3 
   4 #define BITS_PER_CHAR (8)       /* not true on all machines! */
   5 #define MAX_STRING_SIZE (128)   /* we'll stop hashing after this many */
   6 #define MAX_BITS (BITS_PER_CHAR * MAX_STRING_SIZE)
   7 
   8 unsigned long
   9 hash(const char *s, unsigned long x[])
  10 {
  11     unsigned long h;
  12     unsigned const char *us;
  13     int i;
  14     unsigned char c;
  15     int shift;
  16 
  17     /* cast s to unsigned const char * */
  18     /* this ensures that elements of s will be treated as having values >= 0 */
  19     us = (unsigned const char *) s;
  20 
  21     h = 0;
  22     for(i = 0; *us != 0 && i < MAX_BITS; us++) {
  23         c = *us;
  24         for(shift = 0; shift < BITS_PER_CHAR; shift++, i++) {
  25             /* is low bit of c set? */
  26             if(c & 0x1) {
  27                 h ^= x[i];
  28             }
  29             
  30             /* shift c to get new bit in lowest position */
  31             c >>= 1;
  32         }
  33     }
  34 
  35     return h;
  36 }

As you can see, this requires a lot of bit-fiddling. It also fails if we get a lot of strings that are identical for the first MAX_STRING_SIZE characters. Conceivably, the latter problem could be dealt with by growing x dynamically as needed. But we also haven't addressed the question of where we get these random values from---see C/Randomization for some possibilities.

Why is this family universal? Consider two distinct inputs x and y. Because they are distinct, there must be some position j where the bits xj and yj are different. Assume without loss of generality that xj is zero and yj is 1. Let

  • X = XORi≠j xi·ai

and

  • Y = XORi≠j yi·ai,

so that H(x) = X and H(y) = Y XOR aj. Suppose that we fix all the bits except for the ones in aj; then H(x) = H(y) precisely when aj = X XOR Y, where the right-hand side is some constant value independent of aj. The probability that aj is chosen to be exactly this value is 1/m.

In practice, universal families of hash functions are seldom used, since a reasonable fixed hash function is unlikely to be correlated with any patterns in the actual input. But they are useful for demonstrating provably good performance.

116. Maintaining a constant load factor

All of the running time results for hash tables depend on keeping the load factor α small. But as more elements are inserted into a fixed-size table, the load factor grows without bound. The usual solution to this problem is rehashing: when the load factor crosses some threshold, we create a new hash table of size 2n or thereabouts and migrate all the elements to it. This approach raises the worst-case cost of an insertion to O(n); however, we can bring the expected cost down to O(1) by rehashing only with probability O(1/n) for each insert after the threshold is crossed. Alternatively, we can apply AmortizedAnalysis to argue that the amortized cost (total cost divided by number of operations) is O(1), assuming we double the table size on each rehash.
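
As a quick sanity check of the amortized bound (my arithmetic, not part of the original text): starting from a table of size 1 and doubling whenever the table fills, the rehashes triggered by n inserts move at most 1 + 2 + 4 + ... + n < 2n elements in total. Together with the n ordinary insertions this is O(n) work for n operations, or O(1) amortized per insert.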

117. Examples

117.1. A low-overhead hash table using open addressing

Here is a very low-overhead hash table based on open addressing. The application is rapidly verifying ID numbers in the range 000000000 to 999999999 by checking them against a list of known good IDs. Since the quantity of valid ID numbers may be very large, a goal of the mechanism is to keep the amount of extra storage used as small as possible. This implementation uses a tunable overhead parameter. Setting the parameter to a high value makes lookups fast but requires more space per ID number in the list. Setting it to a low value can reduce the storage cost arbitrarily close to 4 bytes per ID, at the cost of increasing search times.

This is idlist.h:

   1 typedef struct id_list *IDList;
   2 
   3 #define MIN_ID (0)
   4 #define MAX_ID (999999999)
   5 
   6 /* build an IDList out of an unsorted array of n good ids */
   7 /* returns 0 on allocation failure */
   8 IDList IDListCreate(int n, int unsorted_id_list[]);
   9 
  10 /* destroy an IDList */
  11 void IDListDestroy(IDList list);
  12 
  13 /* check an id against the list */
  14 /* returns nonzero if id is in the list */
  15 int IDListContains(IDList list, int id);

And this is idlist.c:

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 #include "idlist.h"
   4 
   5 /* overhead parameter that determines both space and search costs */
   6 /* must be strictly greater than 1 */
   7 #define OVERHEAD (1.1)
   8 #define NULL_ID (-1)
   9 
  10 
  11 struct id_list {
  12     int size;
  13     int ids[1];         /* we'll actually malloc more space than this */
  14 };
  15 
  16 IDList
  17 IDListCreate(int n, int unsorted_id_list[])
  18 {
  19     IDList list;
  20     int size;
  21     int i;
  22     int probe;
  23 
  24     size = (int) (n * OVERHEAD + 1);
  25 
  26     list = malloc(sizeof(*list) + sizeof(int) * (size-1));
  27     if(list == 0) return 0;
  28 
  29     /* else */
  30     list->size = size;
  31 
  32     /* clear the hash table */
  33     for(i = 0; i < size; i++) {
  34         list->ids[i] = NULL_ID;
  35     }
  36 
  37     /* load it up */
  38     for(i = 0; i < n; i++) {
  39 
  40         assert(unsorted_id_list[i] >= MIN_ID);
  41         assert(unsorted_id_list[i] <= MAX_ID);
  42 
  43         /* hashing with open addressing by division */
  44         /* this MUST be the same pattern as in IDListContains */
  45         for(probe = unsorted_id_list[i] % list->size;
  46             list->ids[probe] != NULL_ID;
  47             probe = (probe + 1) % list->size);
  48         
  49         assert(list->ids[probe] == NULL_ID);
  50 
  51         list->ids[probe] = unsorted_id_list[i];
  52     }
  53 
  54     return list;
  55 }
  56 
  57 void
  58 IDListDestroy(IDList list)
  59 {
  60     free(list);
  61 }
  62 
  63 int
  64 IDListContains(IDList list, int id)
  65 {
  66     int probe;
  67         
  68     /* this MUST be the same pattern as in IDListCreate */
  69     for(probe = id % list->size;
  70         list->ids[probe] != NULL_ID;
  71         probe = (probe + 1) % list->size) {
  72         if(list->ids[probe] == id) {
  73             return 1;
  74         }
  75     }
  76 
  77     return 0;
  78 }

117.2. A string to string dictionary using chaining

Here is a more complicated string to string dictionary based on chaining.

   1 typedef struct dict *Dict;
   2 
   3 /* create a new empty dictionary */
   4 Dict DictCreate(void);
   5 
   6 /* destroy a dictionary */
   7 void DictDestroy(Dict);
   8 
   9 /* insert a new key-value pair into an existing dictionary */
  10 void DictInsert(Dict, const char *key, const char *value);
  11 
  12 /* return the most recently inserted value associated with a key */
  13 /* or 0 if no matching key is present */
  14 const char *DictSearch(Dict, const char *key);
  15 
  16 /* delete the most recently inserted record with the given key */
  17 /* if there is no such record, has no effect */
  18 void DictDelete(Dict, const char *key);
dict.h

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 #include <string.h>
   4 
   5 #include "dict.h"
   6 
   7 struct elt {
   8     struct elt *next;
   9     char *key;
  10     char *value;
  11 };
  12 
  13 struct dict {
  14     int size;           /* size of the pointer table */
  15     int n;              /* number of elements stored */
  16     struct elt **table;
  17 };
  18 
  19 #define INITIAL_SIZE (1024)
  20 #define GROWTH_FACTOR (2)
  21 #define MAX_LOAD_FACTOR (1)
  22 
  23 /* dictionary initialization code used in both DictCreate and grow */
  24 Dict
  25 internalDictCreate(int size)
  26 {
  27     Dict d;
  28     int i;
  29 
  30     d = malloc(sizeof(*d));
  31 
  32     assert(d != 0);
  33 
  34     d->size = size;
  35     d->n = 0;
  36     d->table = malloc(sizeof(struct elt *) * d->size);
  37 
  38     assert(d->table != 0);
  39 
  40     for(i = 0; i < d->size; i++) d->table[i] = 0;
  41 
  42     return d;
  43 }
  44 
  45 Dict
  46 DictCreate(void)
  47 {
  48     return internalDictCreate(INITIAL_SIZE);
  49 }
  50 
  51 void
  52 DictDestroy(Dict d)
  53 {
  54     int i;
  55     struct elt *e;
  56     struct elt *next;
  57 
  58     for(i = 0; i < d->size; i++) {
  59         for(e = d->table[i]; e != 0; e = next) {
  60             next = e->next;
  61 
  62             free(e->key);
  63             free(e->value);
  64             free(e);
  65         }
  66     }
  67 
  68     free(d->table);
  69     free(d);
  70 }
  71 
  72 #define MULTIPLIER (97)
  73 
  74 static unsigned long
  75 hash_function(const char *s)
  76 {
  77     unsigned const char *us;
  78     unsigned long h;
  79 
  80     h = 0;
  81 
  82     for(us = (unsigned const char *) s; *us; us++) {
  83         h = h * MULTIPLIER + *us;
  84     }
  85 
  86     return h;
  87 }
  88 
  89 static void
  90 grow(Dict d)
  91 {
  92     Dict d2;            /* new dictionary we'll create */
  93     struct dict swap;   /* temporary structure for brain transplant */
  94     int i;
  95     struct elt *e;
  96 
  97     d2 = internalDictCreate(d->size * GROWTH_FACTOR);
  98 
  99     for(i = 0; i < d->size; i++) {
 100         for(e = d->table[i]; e != 0; e = e->next) {
 101             /* note: this recopies everything */
 102             /* a more efficient implementation would
 103              * patch out the strdups inside DictInsert
 104              * to avoid this problem */
 105             DictInsert(d2, e->key, e->value);
 106         }
 107     }
 108 
 109     /* the hideous part */
 110     /* We'll swap the guts of d and d2 */
 111     /* then call DictDestroy on d2 */
 112     swap = *d;
 113     *d = *d2;
 114     *d2 = swap;
 115 
 116     DictDestroy(d2);
 117 }
 118 
 119 /* insert a new key-value pair into an existing dictionary */
 120 void
 121 DictInsert(Dict d, const char *key, const char *value)
 122 {
 123     struct elt *e;
 124     unsigned long h;
 125 
 126     assert(key);
 127     assert(value);
 128 
 129     e = malloc(sizeof(*e));
 130 
 131     assert(e);
 132 
 133     e->key = strdup(key);
 134     e->value = strdup(value);
 135 
 136     h = hash_function(key) % d->size;
 137 
 138     e->next = d->table[h];
 139     d->table[h] = e;
 140 
 141     d->n++;
 142 
 143     /* grow table if there is not enough room */
 144     if(d->n >= d->size * MAX_LOAD_FACTOR) {
 145         grow(d);
 146     }
 147 }
 148 
 149 /* return the most recently inserted value associated with a key */
 150 /* or 0 if no matching key is present */
 151 const char *
 152 DictSearch(Dict d, const char *key)
 153 {
 154     struct elt *e;
 155 
 156     for(e = d->table[hash_function(key) % d->size]; e != 0; e = e->next) {
 157         if(!strcmp(e->key, key)) {
 158             /* got it */
 159             return e->value;
 160         }
 161     }
 162 
 163     return 0;
 164 }
 165 
 166 /* delete the most recently inserted record with the given key */
 167 /* if there is no such record, has no effect */
 168 void
 169 DictDelete(Dict d, const char *key)
 170 {
 171     struct elt **prev;          /* what to change when elt is deleted */
 172     struct elt *e;              /* what to delete */
 173 
 174     for(prev = &(d->table[hash_function(key) % d->size]); 
 175         *prev != 0; 
 176         prev = &((*prev)->next)) {
 177         if(!strcmp((*prev)->key, key)) {
 178             /* got it */
 179             e = *prev;
 180             *prev = e->next;
 181 
 182             free(e->key);
 183             free(e->value);
 184             free(e);
 185 
 186             return;
 187         }
 188     }
 189 }
dict.c

And here is some (very minimal) test code.

   1 #include <stdio.h>
   2 #include <assert.h>
   3 
   4 #include "dict.h"
   5 
   6 int
   7 main()
   8 {
   9     Dict d;
  10     char buf[512];
  11     int i;
  12 
  13     d = DictCreate();
  14 
  15     DictInsert(d, "foo", "hello world");
  16     puts(DictSearch(d, "foo"));
  17     DictInsert(d, "foo", "hello world2");
  18     puts(DictSearch(d, "foo"));
  19     DictDelete(d, "foo");
  20     puts(DictSearch(d, "foo"));
  21     DictDelete(d, "foo");
  22     assert(DictSearch(d, "foo") == 0);
  23     DictDelete(d, "foo");
  24 
  25     for(i = 0; i < 10000; i++) {
  26         sprintf(buf, "%d", i);
  27         DictInsert(d, buf, buf);
  28     }
  29 
  30     DictDestroy(d);
  31 
  32     return 0;
  33 }
  34 
  35     
test_dict.c


CategoryProgrammingNotes CategoryAlgorithmNotes

118. BinaryTrees

DivideAndConquer yields algorithms whose execution has a tree structure. Sometimes we build data structures that are also trees. It is probably not surprising that DivideAndConquer is the natural way to build algorithms that use such trees as inputs.

119. Tree basics

Here is a typical complete binary tree. It is binary because every node has at most two children. It is complete because the nodes consist only of internal nodes with exactly two children and leaves with no children.

     0
    / \
   1   2
      / \
     3   4
    / \
   5   6
  / \
 7   8

Structurally, a complete binary tree consists of either a single node (a leaf) or a root node with a left and right subtree, each of which is itself either a leaf or a root node with two subtrees. The set of all nodes underneath a particular node x is called the subtree rooted at x.

The size of a tree is the number of nodes; a leaf by itself has size 1. The height of a tree is the length of the longest path from the root to a leaf; 0 for a lone leaf, at least 1 in any larger tree. The depth of a node is the length of the path from the root to that node. The height of a node is the height of the subtree of which it is the root, i.e. the length of the longest path from that node to some leaf below it. For example, in the tree drawn above, node 7 has depth 4 and height 0, while the root has depth 0 and height 4. A node u is an ancestor of a node v if v is contained in the subtree rooted at u; we may write equivalently that v is a descendant of u. Note that every node is both an ancestor and a descendant of itself; if we wish to exclude the node itself, we refer to a proper ancestor or proper descendant.

120. Binary tree implementations

In a low-level programming language like C, a binary tree typically looks a lot like a linked list with an extra outgoing pointer from each element, e.g.

   1 struct tree_node {
   2     int key;
   3     struct tree_node *left;  /* left child */
   4     struct tree_node *right; /* right child */
   5 };
   6 
   7 typedef struct tree_node *Tree;

Missing children (and the empty tree) are represented by null pointers. Typically, individual tree nodes are allocated separately using malloc; however, for high-performance use it is not unusual for tree libraries to do their own storage allocation out of large blocks obtained from malloc.

Optionally, the struct may be extended to include additional information such as a pointer to the node's parent, hints for balancing (see BalancedTrees), or aggregate information about the subtree rooted at the node such as its size or the sum/max/average of the keys of its nodes.

When it is not important to be able to move large subtrees around simply by adjusting pointers, a tree may be represented implicitly by packing it into an array. For an example of how this works see Heaps.

121. The canonical binary tree algorithm

Pretty much every DivideAndConquer algorithm for binary trees looks like this:

   1 void
   2 doSomethingToAllNodes(Tree root)
   3 {
   4     if(root) {
   5         doSomethingTo(root);
   6         doSomethingToAllNodes(root->left);
   7         doSomethingToAllNodes(root->right);
   8     }
   9 }

The function processes all nodes in what is called a preorder traversal, where the "preorder" part means that the root of any tree is processed first. Moving the call to doSomethingTo in between or after the two recursive calls yields an inorder or postorder traversal, respectively.
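
For example, the inorder variant would look like this, using the same doSomethingTo placeholder as above:

    void
    doSomethingToAllNodesInorder(Tree root)
    {
        if(root) {
            doSomethingToAllNodesInorder(root->left);
            doSomethingTo(root);               /* root in the middle */
            doSomethingToAllNodesInorder(root->right);
        }
    }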

In practice we usually want to extract some information from the tree. For example, this function computes the size of a tree:

   1 int
   2 tree_size(Tree root)
   3 {
   4     if(root == 0) {
   5         return 0;
   6     } else {
   7         return 1 + tree_size(root->left) + tree_size(root->right);
   8     }
   9 }

and this function computes the height:

   1 int
   2 tree_height(Tree root)
   3 {
   4     int lh;     /* height of left subtree */
   5     int rh;     /* height of right subtree */
   6 
   7     if(root == 0) {
   8         return -1;
   9     } else {
  10         lh = tree_height(root->left);
  11         rh = tree_height(root->right);
  12         return 1 + (lh > rh ? lh : rh);
  13     }
  14 }

Since both of these algorithms have the same structure, they both have the same asymptotic running time. We can compute this running time using the recurrence

  • T(n) = Theta(1) + T(k) + T(n-k-1)

where k is the size of the left subtree.

Now, there's a problem with this recurrence: for an arbitrary tree of size n, we don't know what the size k of the left subtree is. So how can we solve a recurrence that contains an unbound variable?

The trick is that in this case we get the same answer no matter what k is. First let's show that T(n) <= an for some a:

  • T(n) <= c + T(k) + T(n-k-1) <= c + ak + a(n-k-1) = c + a(n-1) <= an [provided c <= a].

Showing that it is greater than an (presumably for a different a) is essentially the same argument, now with c >= a:

  • T(n) >= c + T(k) + T(n-k-1) >= c + ak + a(n-k-1) = c + a(n-1) >= an [provided c >= a].

So these are all Theta(n) algorithms.

122. Nodes vs leaves

For some binary trees we don't store anything interesting in the internal nodes, using them only to provide a route to the leaves. We might reasonably ask if an algorithm that runs in O(n) time where n is the total number of nodes still runs in O(m) time, where m counts only the leaves. For complete binary trees, we can show that we get the same asymptotic performance whether we count leaves only, internal nodes only, or both leaves and internal nodes.

Let T(n) be the number of internal nodes in a complete binary tree with n leaves. It is easy to see that T(1) = 0 and T(2) = 1, but for larger trees there are multiple structures and so it makes sense to write a recurrence:

  • T(n) = 1 + T(k) + T(n-k).

We will show by induction that the solution to this recurrence is exactly T(n) = n-1. We already have the base case T(1) = 0. For larger n, we have

  • T(n) = 1 + T(k) + T(n-k) = 1 + (k-1) + (n-k-1) = n-1.

So a tree with Theta(n) nodes has Theta(n) internal nodes and Theta(n) leaves; if we don't care about constant factors, we won't care which number we use.

123. Special classes of binary trees

So far we haven't specified where particular nodes are placed in the binary tree. Most applications of binary trees put some constraints on how nodes relate to one another. Some possibilities:

  • BinarySearchTrees: Each node has a key, and a node's key must be greater than all keys in the subtree of its left-hand child and less than all keys in the subtree of its right-hand child.

  • Heaps: Each node has a key that is less than the keys of both of its children.


CategoryAlgorithmNotes

124. BinarySearchTrees

A binary search tree is a binary tree (see BinaryTrees) in which each node has a key, and a node's key must be greater than all keys in the subtree of its left-hand child and less than all keys in the subtree of its right-hand child. This allows a node to be searched for using essentially the same binary search algorithm used on sorted arrays.

125. Searching for a node

   1 /* returns node with given target key */
   2 /* or null if no such node exists */
   3 Tree
   4 tree_search(Tree root, int target)
   5 {
   6     if(root == 0 || root->key == target) {
   7         return root;
   8     } else if(root->key > target) {
   9         return tree_search(root->left, target);
  10     } else {
  11         return tree_search(root->right, target);
  12     }
  13 }

This procedure can be rewritten iteratively, which avoids stack overflow and is likely to be faster:

   1 Tree
   2 tree_search(Tree root, int target)
   3 {
   4     while(root != 0 && root->key != target) {
   5         if(root->key > target) {
   6             root = root->left;
   7         } else {
   8             root = root->right;
   9         }
  10     }
  11 
  12     return root;
  13 }

These procedures can be modified in the obvious way to deal with keys that aren't ints, as long as they can be compared (e.g., by using strcmp on strings).
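
For example, a string-keyed version of the iterative search might look like this; the struct here is a hypothetical variant of tree_node whose key field is a char *:

    #include <string.h>

    struct stree_node {
        char *key;
        struct stree_node *left;
        struct stree_node *right;
    };

    struct stree_node *
    stree_search(struct stree_node *root, const char *target)
    {
        int cmp;

        while(root != 0 && (cmp = strcmp(target, root->key)) != 0) {
            /* negative cmp means target sorts before root->key */
            root = (cmp < 0) ? root->left : root->right;
        }

        return root;
    }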

126. Inserting a new node

As in HashTables, the insertion procedure mirrors the search procedure. We must be a little careful to avoid actually walking all the way down to a null pointer, since a null pointer now indicates a missing space for a leaf that we can fill with our new node. So the code is a little more complex.

   1 void
   2 tree_insert(Tree root, int new_key)
   3 {
   4     Tree new_node;
   5 
   6     new_node = malloc(sizeof(*new_node));
   7     assert(new_node);
   8 
   9     new_node->key = new_key;
  10     new_node->left = 0;
  11     new_node->right = 0;
  12 
  13     for(;;) {
  14         if(root->key > new_key) {
  15             /* try left child */
  16             if(root->left) {
  17                 root = root->left;
  18             } else {
  19                 /* put it in */
  20                 root->left = new_node;
  21                 return;
  22             }
  23         } else {
  24             /* right child case is symmetric */
  25             if(root->right) {
  26                 root = root->right;
  27             } else {
  28                 /* put it in */
  29                 root->right = new_node;
  30                 return;
  31             }
  32         }
  33     }
  34 }

Note that this code happily inserts duplicate keys. It also makes no attempt to keep the tree balanced. This may lead to very long paths if new keys are inserted in strictly increasing or strictly decreasing order. Finally, it assumes the tree is nonempty: root must point to at least one node, since the code never checks it against null.

127. Costs

Searching for or inserting a node in a binary search tree with n nodes takes

  • T(n) ≤ T(k) + O(1)

time, where k is the size of the subtree that contains the target. In BalancedTrees, k will always be at most cn for some constant c < 1. In this case, the recurrence has the solution T(n) = O(log n).


CategoryAlgorithmNotes CategoryProgrammingNotes

128. BalancedTrees

BinarySearchTrees are a fine idea, but they only work if they are balanced---if moving from a tree to its left or right subtree reduces the size by a constant fraction. Balanced binary trees add some extra mechanism to the basic binary search tree to ensure balance. Finding efficient ways to balance a tree has been studied for decades, and several good mechanisms are known. We'll try to hit the high points of all of them.

129. The basics: tree rotations

The problem is that as we insert new nodes, some paths through the tree may become very long. So we need to be able to shrink the long paths by moving nodes elsewhere in the tree.

But how do we do this? The basic idea is to notice that there may be many binary search trees that contain the same data, and that we can transform one into another by a local modification called a rotation:

    y            x
   / \   <==>   / \
  x   C        A   y
 / \              / \
A   B            B   C

Single rotation on x-y edge

If A < x < B < y < C, then both versions of this tree have the binary search tree property. By doing the rotation in one direction, we move A up and C down; in the other direction, we move A down and C up. So rotations can be used to transfer depth from the leftmost grandchild of a node to the rightmost and vice versa.

But what if it's the middle grandchild B that's the problem? A single rotation as above doesn't move B up or down. To move B, we have to reposition it so that it's on the end of something. We do this by splitting B into two subtrees B1 and B2, and then moving both of them up using two rotations:

    z              z                y
   / \   ===>     / \     ===>     / \
  x   C          y   C            x   z
 / \            / \              /|   |\
A   y          x  B2            A B1 B2 C
   / \        / \
  B1 B2      A  B1

Double rotation: rotate xy then zy

130. AVL trees

Rotations in principle let us rebalance a tree, but we still need to decide when to do them. If we try to keep the tree in perfect balance (all paths nearly the same length), we'll spend so much time rotating that we won't be able to do anything else.

AVL trees solve this problem by maintaining the invariant that the heights of the two subtrees sitting under each node differ by at most one. This does not guarantee perfect balance, but it does get close. Let S(k) be the size of the smallest AVL tree with height k. This tree will have at least one subtree of height k-1, but its other subtree can be of height k-2 (and should be, to keep it as small as possible). We thus have the recurrence

  • S(k) = 1 + S(k-1) + S(k-2)

which is very close to the Fibonacci recurrence.

It is possible to solve this exactly using GeneratingFunctions. But we can get close by guessing that S(k) ≥ a^k for some constant a. This clearly works for S(0) = a^0 = 1. For larger k, compute

  • S(k) = 1 + S(k-1) + S(k-2) ≥ 1 + a^(k-1) + a^(k-2) = 1 + a^k (1/a + 1/a^2) > a^k (1/a + 1/a^2).

This last quantity is at least a^k provided 1/a + 1/a^2 is at least 1. We can solve exactly for the largest a that makes this work, but a very quick calculation shows that a = 3/2 works: 2/3 + 4/9 = 10/9 > 1. It follows that any AVL tree with height k has at least (3/2)^k nodes, or conversely that any AVL tree with (3/2)^k nodes has height at most k. So the height of an arbitrary AVL tree with n nodes is no greater than log_{3/2} n = O(log n).

How do we maintain this invariant? The first thing to do is add extra information to the tree, so that we can tell when the invariant has been violated. AVL trees store in each node the difference between the heights of its left and right subtrees, which will be one of -1, 0, or 1. In an ideal world this would require lg 3 ≅ 1.58 bits per node, but since fractional bits are difficult to represent on modern computers a typical implementation uses two bits. Inserting a new node into an AVL tree involves

  1. Doing a standard binary search tree insertion.
  2. Updating the balance fields for every node on the insertion path.
  3. Performing a single or double rotation to restore balance if needed.

Implementing this correctly is tricky. Intuitively, we can imagine a version of an AVL tree in which we stored the height of each node (using O(log log n) bits). When we insert a new node, only the heights of its ancestors change---so step 2 requires updating O(log n) height fields. Similarly, it is only these ancestors that can be overtall. It turns out that fixing the closest overtall ancestor fixes all the ones above it (because the rotation shortens their longest paths by one as well). So just one single or double rotation restores balance.

Deletions are also possible, but are uglier: a deletion in an AVL tree may require as many as O(log n) rotations. The basic idea is to use the standard BinarySearchTree deletion trick of either splicing out a node if it has no right child, or replacing it with the minimum value in its right subtree (whose node is then spliced out); we then have to check whether we need to rebalance at every node above whatever node we removed.

If we are not fanatical about space optimization, we can just keep track of the heights of all nodes explicitly, instead of managing the -1, 0, 1 balance values. An example of an implementation that uses this approach is given on the page C/AvlTree.

131. 2–3 trees

An early branch in the evolution of balanced trees was the 2–3 tree. Here all paths have the same length, but internal nodes have either 2 or 3 children. So a 2–3 tree with height k has between 2^k and 3^k leaves and a comparable number of internal nodes. The maximum path length in a tree with n nodes is at most ceiling(lg n), as in a perfectly balanced binary tree.

An internal node in a 2–3 tree holds one key if it has two children (including two nil pointers) and two if it has three children. A search that reaches a three-child node must compare the target with both keys to decide which of the three subtrees to recurse into. As in binary trees, these comparisons take constant time, so we can search a 2–3 tree in O(log n) time.
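
The notes don't commit to a particular representation, but a 2–3 node might look something like this sketch in C (the struct and field names here are invented for illustration):

   1 /* hypothetical layout of a 2-3 tree node */
   2 struct node23 {
   3     int nchildren;            /* 2 or 3 */
   4     int key[2];               /* key[1] is used only when nchildren == 3 */
   5     struct node23 *child[3];  /* child[2] is used only when nchildren == 3 */
   6 };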

Insertion is done by expanding leaf nodes. This may cause a leaf to split when it acquires a third key. When a leaf splits, it becomes two one-key nodes and the middle key moves up into its parent. This may cause further splits up the ancestor chain; the tree grows in height by adding a new root when the old root splits. In practice only a small number of splits are needed for most insertions, but even in the worst case this entire process takes O(log n) time.

It follows that 2–3 trees have the same performance as AVL trees. Conceptually, they are simpler, but having to write separate cases for 2-child and 3-child nodes doubles the size of most code that works on 2–3 trees. The real significance of 2–3 trees is as a precursor to two other kinds of trees, the red-black tree and the B-tree.

132. Red-black trees

A red-black tree is a 2–3–4 tree (i.e. all nodes have 2, 3, or 4 children and 1, 2, or 3 internal keys) where each node is represented by a little binary tree with a black root and zero, one, or two red extender nodes as follows:

redblacknodes.png

The invariant for a red-black tree is that

  1. No two red nodes are adjacent.
  2. Every path from the root to a null pointer contains the same number of black nodes.

For technical reasons, we include the null pointers at the bottom of the tree as black nodes; this has no effect on the invariant, but simplifies the description of the rebalancing procedure.

From the invariant it follows that every path has between k and 2k nodes, where k is the "black-height," the common number of black nodes on each path. From this we can prove that the height of the tree is O(log n).

Searching in a red-black tree is identical to searching in any other binary search tree; we simply ignore the color bit on each node. So search takes O(log n) time. For insertions, we use the standard binary search tree insertion algorithm, and insert the new node as a red node. This may violate the first part of the invariant (it doesn't violate the second because it doesn't change the number of black nodes on any path). In this case we need to fix up the constraint by recoloring nodes and possibly performing a single or double rotation.
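
In code, the only change from an ordinary binary search tree node is the color bit; here is a sketch, with invented names:

   1 /* hypothetical red-black tree node */
   2 #define RED   (0)
   3 #define BLACK (1)
   4 
   5 struct rbNode {
   6     struct rbNode *left;
   7     struct rbNode *right;
   8     int key;
   9     int color;    /* RED or BLACK */
  10 };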

redblackrebalance.png

Which operations we need to do depends on the color of the new node's uncle. If the uncle is red, we can recolor the node's parent, uncle, and grandparent and get rid of the double-red edge between the new node and its parent without changing the number of black nodes on any path. In this case, the grandparent becomes red, which may create a new double-red edge which must be fixed recursively. Thus up to O(log n) such recolorings may occur at a total cost of O(log n).

If the uncle is black (which includes the case where the uncle is a null pointer), a rotation (possibly a double rotation) and recoloring is necessary. In this case (depicted at the bottom of the picture above), the new grandparent is always black, so there are no more double-red edges. So at most two rotations occur after any insertion.

Deletion is more complicated but can also be done in O(log n) recolorings and O(1) (in this case up to 3) rotations. Because deletion is simpler in red-black trees than in AVL trees, and because operations on red-black trees tend to have slightly smaller constants than the corresponding operations on AVL trees, red-black trees are more often used than AVL trees in practice.

133. B-trees

Neither is used as much as a B-tree, a specialized data structure optimized for storage systems where the cost of reading or writing a large block (of typically 4096 or 8192 bytes) is no greater than the cost of reading or writing a single bit. Such systems include typical disk drives, where the disk drive has to spend so long finding data on disk that it tries to amortize the huge (tens of millions of CPU clock cycles) seek cost over many returned bytes.

A B-tree is a generalization of a 2–3 tree where each node has between M/2 and M-1 children, where M is some large constant chosen so that a node (including up to M-1 pointers and up to M-2 keys) will just fit inside a single block. When a node would otherwise end up with M children, it splits into two nodes with M/2 children each, and moves its middle key up into its parent. As in 2–3 trees this may eventually require the root to split and a new root to be created; in practice, M is often large enough that a small fixed height is enough to span as much data as the storage system is capable of holding.
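
Concretely, a B-tree node might look something like the following sketch; in a real implementation M would be computed from the block size, and all of the names here are invented for illustration:

   1 /* hypothetical B-tree node with up to M-1 children */
   2 #define M (1024)
   3 
   4 struct btNode {
   5     int isLeaf;                 /* nonzero if this node is a leaf */
   6     int nchildren;              /* between M/2 and M-1, except at the root */
   7     int key[M-2];               /* nchildren-1 keys are in use */
   8     struct btNode *child[M-1];  /* nchildren children are in use */
   9 };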

Searches in B-trees require looking through log_M n nodes, at a cost of O(M) time per node. If M is a constant the total time is asymptotically O(log n). But the reason for using B-trees is that the O(M) cost of reading a block is trivial compared to the much larger constant time to find the block on the disk; and so it is better to minimize the number of disk accesses (by making M large) than to reduce the CPU time.

134. Splay trees

Yet another approach to balancing is to do it dynamically. Splay trees are binary search trees in which every search operation proceeds by rotating the target to the root. If this is done correctly, the amortized cost (see AmortizedAnalysis) of each tree operation is O(log n), although particular rare operations might take as much as O(n) time. Splay trees require no extra space because they store no balancing information; however, the constant factors on searches can be larger because every search requires restructuring the tree. For some applications this additional cost is balanced by the splay tree's ability to adapt to data access patterns; if some elements of the tree are hit more often than others, these elements will tend to migrate to the top, and the cost of a typical search will drop to O(log m), where m is the size of the "working set" of frequently-accessed elements.

For more details on splay trees, see SedgewickSeries, the original paper, or any number of demos, animations, and other descriptions that can be found via Google.

135. Skip lists

Skip lists are yet another balanced tree data structure, where the tree is disguised as a tower of linked lists. They are described on their own page (SkipLists).

136. Implementations

AVL trees and red-black trees have been implemented for every reasonable programming language you've ever heard of. For C implementations, a good place to start is at http://adtinfo.org/.


CategoryAlgorithmNotes CategoryProgrammingNotes

137. C/AvlTree

Basic implementation of an AVL tree storing ints. This is not particularly optimized, and effectively implements a set data type with the ability to delete the minimum value as in a heap.

138. Header file

   1 /* implementation of an AVL tree with explicit heights */
   2 
   3 typedef struct avlNode *AvlTree;
   4 
   5 /* empty avl tree is just a null pointer */
   6 
   7 #define AVL_EMPTY (0)
   8 
   9 /* free a tree */
  10 void avlDestroy(AvlTree t);
  11 
  12 /* return the height of a tree */
  13 int avlGetHeight(AvlTree t);
  14 
  15 /* return nonzero if key is present in tree */
  16 int avlSearch(AvlTree t, int key);
  17 
  18 /* insert a new element into a tree */
  19 /* note *t is actual tree */
  20 void avlInsert(AvlTree *t, int key);
  21 
  22 /* run sanity checks on tree (for debugging) */
  23 /* assert will fail if heights are wrong */
  24 void avlSanityCheck(AvlTree t);
  25 
  26 /* print all keys of the tree in order */
  27 void avlPrintKeys(AvlTree t);
  28 
  29 /* delete and return minimum value in a tree */
  30 int avlDeleteMin(AvlTree *t);
  31 
  32 /* delete the given value, if present */
  33 void avlDelete(AvlTree *t, int key);
avlTree.h

139. Implementation

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 #include "avlTree.h"
   6 
   7 /* implementation of an AVL tree with explicit heights */
   8 
   9 struct avlNode {
  10     struct avlNode *child[2];    /* left and right */
  11     int key;
  12     int height;
  13 };
  14 
  15 /* free a tree */
  16 void 
  17 avlDestroy(AvlTree t)
  18 {
  19     if(t != AVL_EMPTY) {
  20         avlDestroy(t->child[0]);
  21         avlDestroy(t->child[1]);
  22         free(t);
  23     }
  24 }
  25 
  26 /* return height of an AVL tree */
  27 int
  28 avlGetHeight(AvlTree t)
  29 {
  30     if(t != AVL_EMPTY) {
  31         return t->height;
  32     } else {
  33         return 0;
  34     }
  35 }
  36 
  37 /* return nonzero if key is present in tree */
  38 int
  39 avlSearch(AvlTree t, int key)
  40 {
  41     if(t == AVL_EMPTY) {
  42         return 0;
  43     } else if(t->key == key) {
  44         return 1;
  45     } else {
  46         return avlSearch(t->child[key > t->key], key);
  47     }
  48 }
  49 
  50 #define Max(x,y) ((x)>(y) ? (x) : (y))
  51 
  52 /* assert height fields are correct throughout tree */
  53 void
  54 avlSanityCheck(AvlTree root)
  55 {
  56     int i;
  57 
  58     if(root != AVL_EMPTY) {
  59         for(i = 0; i < 2; i++) {
  60             avlSanityCheck(root->child[i]);
  61         }
  62 
  63         assert(root->height == 1 + Max(avlGetHeight(root->child[0]), avlGetHeight(root->child[1])));
  64     }
  65 }
  66 
  67 /* recompute height of a node */
  68 static void
  69 avlFixHeight(AvlTree t)
  70 {
  71     assert(t != AVL_EMPTY);
  72 
  73     t->height = 1 + Max(avlGetHeight(t->child[0]), avlGetHeight(t->child[1]));
  74 }
  75 
  76 /* rotate child[d] to root */
  77 /* assumes child[d] exists */
  78 /* Picture:
  79  *
  80  *     y            x
  81  *    / \   <==>   / \
  82  *   x   C        A   y
  83  *  / \              / \
  84  * A   B            B   C
  85  *
  86  */
  87 static void
  88 avlRotate(AvlTree *root, int d)
  89 {
  90     AvlTree oldRoot;
  91     AvlTree newRoot;
  92     AvlTree oldMiddle;
  93 
  94     oldRoot = *root;
  95     newRoot = oldRoot->child[d];
  96     oldMiddle = newRoot->child[!d];
  97 
  98     oldRoot->child[d] = oldMiddle;
  99     newRoot->child[!d] = oldRoot;
 100     *root = newRoot;
 101 
 102     /* update heights */
 103     avlFixHeight((*root)->child[!d]);   /* old root */
 104     avlFixHeight(*root);                /* new root */
 105 }
 106 
 107 
 108 /* rebalance at node if necessary */
 109 /* also fixes height */
 110 static void
 111 avlRebalance(AvlTree *t)
 112 {
 113     int d;
 114 
 115     if(*t != AVL_EMPTY) {
 116         for(d = 0; d < 2; d++) {
 117             /* maybe child[d] is now too tall */
 118             if(avlGetHeight((*t)->child[d]) > avlGetHeight((*t)->child[!d]) + 1) {
 119                 /* imbalanced! */
 120                 /* how to fix it? */
 121                 /* need to look for taller grandchild of child[d] */
 122                 if(avlGetHeight((*t)->child[d]->child[d]) > avlGetHeight((*t)->child[d]->child[!d])) {
 123                     /* same direction grandchild wins, do single rotation */
 124                     avlRotate(t, d);
 125                 } else {
 126                     /* opposite direction grandchild moves up, do double rotation */
 127                     avlRotate(&(*t)->child[d], !d);
 128                     avlRotate(t, d);
 129                 }
 130 
 131                 return;   /* avlRotate called avlFixHeight */
 132             }
 133         }
 134                   
 135         /* update height */
 136         avlFixHeight(*t);
 137     }
 138 }
 139 
 140 /* insert into tree */
 141 /* this may replace root, which is why we pass
 142  * in a AvlTree * */
 143 void
 144 avlInsert(AvlTree *t, int key)
 145 {
 146     /* insertion procedure */
 147     if(*t == AVL_EMPTY) {
 148         /* new t */
 149         *t = malloc(sizeof(struct avlNode));
 150         assert(*t);
 151 
 152         (*t)->child[0] = AVL_EMPTY;
 153         (*t)->child[1] = AVL_EMPTY;
 154 
 155         (*t)->key = key;
 156 
 157         (*t)->height = 1;
 158 
 159         /* done */
 160         return;
 161     } else if(key == (*t)->key) {
 162         /* nothing to do */
 163         return;
 164     } else {
 165         /* do the insert in subtree */
 166         avlInsert(&(*t)->child[key > (*t)->key], key);
 167 
 168         avlRebalance(t);
 169 
 170         return;
 171     }
 172 }
 173 
 174 
 175 /* print all elements of the tree in order */
 176 void
 177 avlPrintKeys(AvlTree t)
 178 {
 179     if(t != AVL_EMPTY) {
 180         avlPrintKeys(t->child[0]);
 181         printf("%d\n", t->key);
 182         avlPrintKeys(t->child[1]);
 183     }
 184 }
 185 
 186 
 187 /* delete and return minimum value in a tree */
 188 int
 189 avlDeleteMin(AvlTree *t)
 190 {
 191     AvlTree oldroot;
 192     int minValue;
 193 
 194     assert(*t != AVL_EMPTY);
 195 
 196     if((*t)->child[0] == AVL_EMPTY) {
 197         /* root is min value */
 198         oldroot = *t;
 199         minValue = oldroot->key;
 200         *t = oldroot->child[1];
 201         free(oldroot);
 202     } else {
 203         /* min value is in left subtree */
 204         minValue = avlDeleteMin(&(*t)->child[0]);
 205     }
 206 
 207     avlRebalance(t);
 208     return minValue;
 209 }
 210 
 211 /* delete the given value */
 212 void
 213 avlDelete(AvlTree *t, int key)
 214 {
 215     AvlTree oldroot;
 216 
 217     if(*t == AVL_EMPTY) {
 218         return;
 219     } else if((*t)->key == key) {
 220         /* do we have a right child? */
 221         if((*t)->child[1] != AVL_EMPTY) {
 222             /* give root min value in right subtree */
 223             (*t)->key = avlDeleteMin(&(*t)->child[1]);
 224         } else {
 225             /* splice out root */
 226             oldroot = (*t);
 227             *t = (*t)->child[0];
 228             free(oldroot);
 229         }
 230     } else {
 231         avlDelete(&(*t)->child[key > (*t)->key], key);
 232     }
 233 
 234     /* rebalance */
 235     avlRebalance(t);
 236 }
avlTree.c

140. Test code and Makefile

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 #include "avlTree.h"
   6 
   7 #define N (1024)
   8 #define MULTIPLIER (97)
   9 
  10 int
  11 main(int argc, char **argv)
  12 {
  13     AvlTree t = AVL_EMPTY;
  14     int i;
  15 
  16     if(argc != 1) {
  17         fprintf(stderr, "Usage: %s\n", argv[0]);
  18         return 1;
  19     }
  20 
  21     for(i = 0; i < N; i++) {
  22         avlInsert(&t, (i*MULTIPLIER) % N);
  23     }
  24 
  25     printf("height %d\n", avlGetHeight(t));
  26 
  27     assert(avlSearch(t, N-1) == 1);
  28     assert(avlSearch(t, N) == 0);
  29 
  30     avlSanityCheck(t);
  31 
  32     for(i = 0; i < N-1; i++) {
  33         avlDeleteMin(&t);
  34     }
  35 
  36     avlSanityCheck(t);
  37 
  38     avlPrintKeys(t);
  39 
  40     avlDestroy(t);
  41 
  42     return 0;
  43 }
test_avlTree.c

Makefile


CategoryProgrammingNotes

141. Heaps

A heap is a binary tree data structure (see BinaryTrees) in which each element has a key (or sometimes priority) that is less than the keys of its children. Heaps are used to implement the priority queue abstract data type (see AbstractDataTypes), which we'll talk about first.

142. Priority queues

In a standard queue, elements leave the queue in the same order as they arrive. In a priority queue, elements leave the queue in order of decreasing priority: the DEQUEUE operation becomes a DELETE-MIN operation (or DELETE-MAX, if higher numbers mean higher priority), which removes and returns the highest-priority element of the priority queue, regardless of when it was inserted. Priority queues are often used in operating system schedulers to determine which job to run next: a high-priority job (e.g., turn on the fire suppression system) runs before a low-priority job (floss the cat) even if the low-priority job has been waiting longer.

143. Expensive implementations of priority queues

Implementing a priority queue using an array or linked list is likely to be expensive. If the array or list is unsorted, it takes O(n) time to find the minimum element; if it is sorted, it takes O(n) time (in the worst case) to add a new element. So such implementations are only useful when the numbers of INSERT and DELETE-MIN operations are very different. For example, if DELETE-MIN is called only rarely but INSERT is called often, it may actually be cheapest to implement a priority queue as an unsorted linked list with O(1) INSERTs and O(n) DELETE-MINs. But if we expect that every element that is inserted is eventually removed, we want something for which both INSERT and DELETE-MIN are cheap operations.

144. Heaps

A heap is a binary tree in which each node has a smaller key than its children; this property is called the heap property or heap invariant.

To insert a node in the heap, we add it as a new leaf, which may violate the heap property if the new node has a lower key than its parent. But we can restore the heap property (at least between this node and its parent) by swapping the new node with its parent; the sibling's key is at least as large as the parent's, so the sibling can stay where it is. This may still leave a violation of the heap property one level up in the tree, but by continuing to swap small nodes with their parents we eventually reach the top and have a heap again. The time to complete this operation is proportional to the depth of the heap, which will typically be O(log n) (we will see how to enforce this in a moment).
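
Here is a minimal sketch of this insertion procedure for a min-heap packed into an array as described under "Packed heaps" below; the function name and calling convention are invented for illustration:

   1 /* sketch: insert key into packed min-heap a[0..n-1], which has room for n+1 elements */
   2 void
   3 heapInsert(int *a, int n, int key)
   4 {
   5     int pos;
   6 
   7     /* start at the new leaf and float the key up past larger parents */
   8     for(pos = n; pos > 0 && a[(pos-1)/2] > key; pos = (pos-1)/2) {
   9         a[pos] = a[(pos-1)/2];
  10     }
  11 
  12     a[pos] = key;
  13 }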

To implement DELETE-MIN, we can easily find the value to return at the top of the heap. Unfortunately, removing it leaves a vacuum that must be filled in by some other element. The easiest way to do this is to grab a leaf (which probably has a very high key), and then float it down to where it belongs by swapping it with its smaller child at each iteration. After time proportional to the depth (again O(log n) if we are doing things right), the heap invariant is restored.

Similar local swapping can be used to restore the heap invariant if the priority of some element in the middle changes; we will not discuss this in detail.

145. Packed heaps

It is possible to build a heap using structs and pointers, where each element points to its parent and children. In practice, heaps are instead stored in arrays, with an implicit pointer structure determined by array indices. For zero-based arrays as in C, the rule is that a node at position i has children at positions 2*i+1 and 2*i+2; in the other direction, a node at position i has a parent at position (i-1)/2 (which rounds down in C). This is equivalent to storing a heap in an array by reading through the tree in BreadthFirstSearch order:

   0
  / \
 1   2
/ \ / \
3 4 5 6

becomes

0 1 2 3 4 5 6

This approach works best if there are no gaps in the array. So to maximize efficiency we make this "no gaps" policy part of the invariant. We can do so because we don't care which leaf gets added when we do an INSERT, so we can make it be the position at the end of the array. Similarly, in a DELETE-MIN operation we can promote the last element to the root before floating it down. Of course, the usual implementation considerations with variable-length arrays apply (see C/DynamicStorageAllocation).
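
The index arithmetic is easy to package as macros; Child below matches the macro used in heapsort.c later on this page, while Parent is an extra convenience:

   1 /* index arithmetic for a zero-based packed heap */
   2 #define Child(x, dir) (2*(x)+1+(dir))   /* dir is 0 for left, 1 for right */
   3 #define Parent(x) (((x)-1)/2)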

146. Bottom-up heapification

If we are presented with an unsorted array, we can turn it into a heap more quickly than the O(n log n) time required to do n INSERTs. The trick is to build the heap from the bottom up (i.e. starting with position n-1 and working back to position 0), so that when it comes time to fix the heap invariant at position i we have already fixed it at all later positions (this is a form of DynamicProgramming). Unfortunately, it is not quite enough simply to swap a[i] with its smaller child when we get there, because we could find that a[0] (say) was the largest element in the heap. But the cost of floating a[i] down to its proper place will be proportional to its own height rather than the height of the entire heap. Since most of the elements of the heap are close to the bottom, the total cost will turn out to be O(n).

The detailed analysis for a heap with exactly 2^k - 1 elements is that we pay 0 for the 2^(k-1) bottom elements, 1 for the 2^(k-2) elements one level up, 2 for the 2^(k-3) elements one level above this, and so forth, giving a total cost proportional to

\begin{displaymath}
\sum_{i=1}^{k} i 2^{k-i} = 2^k \sum_{i=1}^{k} \frac{i}{2^i} \le 2^k \sum_{i=1}^{\infty} \frac{i}{2^i} = 2^{k+1}.
\end{displaymath}

The last step depends on recognizing that the sum equals 2, which can be proved using GeneratingFunctions.

147. Heapsort

Bottom-up heapification is used in the Heapsort algorithm, which first does bottom-up heapification in O(n) time and then repeatedly calls DELETE-MIN to extract the next element. This is no faster than the O(n log n) cost of MergeSort or QuickSort in typical use, but it can be faster if we only want to get the first few elements of the array.

Here is a simple implementation of heapsort that demonstrates how both bottom-up heapification and the DELETE-MIN procedure work by floating elements down to their proper places. (It uses a max-heap rather than a min-heap, so that repeatedly swapping the maximum to the end of the array sorts it in place in ascending order.)

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 /* max heap implementation */
   6 
   7 /* compute child 0 or 1 */
   8 #define Child(x, dir) (2*(x)+1+(dir))
   9 
  10 /* float value at position pos down */
  11 static void
  12 floatDown(int n, int *a, int pos)
  13 {
  14     int x;
  15 
  16     /* save original value once */
  17     x = a[pos];
  18 
  19     for(;;) {
  20         if(Child(pos, 1) < n && a[Child(pos, 1)] > a[Child(pos, 0)]) {
  21             /* maybe swap with Child(pos, 1) */
  22             if(a[Child(pos, 1)] > x) {
  23                 a[pos] = a[Child(pos, 1)];
  24                 pos = Child(pos, 1);
  25             } else {
  26                 /* x is bigger than both kids */
  27                 break;
  28             }
  29         } else if(Child(pos, 0) < n && a[Child(pos, 0)] > x) {
  30             /* swap with Child(pos, 0) */
  31             a[pos] = a[Child(pos, 0)];
  32             pos = Child(pos, 0);
  33         } else {
  34             /* done */
  35             break;
  36         }
  37     }
  38 
  39     a[pos] = x;
  40 }
  41 
  42 /* construct a heap bottom-up */
  43 static void
  44 heapify(int n, int *a)
  45 {
  46     int i;
  47 
  48     for(i = n-1; i >= 0; i--) {
  49         floatDown(n, a, i);
  50     }
  51 }
  52 
  53 /* sort an array */
  54 void
  55 heapSort(int n, int *a)
  56 {
  57     int i;
  58     int tmp;
  59 
  60     heapify(n, a);
  61 
  62     for(i = n-1; i > 0; i--) {
  63         /* swap max to a[i] */
  64         tmp = a[0];
  65         a[0] = a[i];
  66         a[i] = tmp;
  67 
  68         /* float new a[0] down */
  69         floatDown(i, a, 0);
  70     }
  71 }
  72 
  73 #define N (100)
  74 #define MULTIPLIER (17)
  75 
  76 int
  77 main(int argc, char **argv)
  78 {
  79     int a[N];
  80     int i;
  81 
  82     if(argc != 1) {
  83         fprintf(stderr, "Usage: %s\n", argv[0]);
  84         return 1;
  85     }
  86 
  87     for(i = 0; i < N; i++) { a[i] = (i*MULTIPLIER) % N; }
  88 
  89     for(i = 0; i < N; i++) { printf("%d ", a[i]); }
  90     putchar('\n');
  91 
  92     heapSort(N, a);
  93 
  94     for(i = 0; i < N; i++) { printf("%d ", a[i]); }
  95     putchar('\n');
  96 
  97     return 0;
  98 }
heapsort.c

148. More information


CategoryProgrammingNotes CategoryAlgorithmNotes

149. C/FunctionPointers

150. Basics

A function pointer, internally, is just the numerical address for the code for a function. When a function name is used by itself without parentheses, the value is a pointer to the function, just as the name of an array by itself is a pointer to its zeroth element. Function pointers can be stored in variables, structs, unions, and arrays and passed to and from functions just like any other pointer type. They can also be called: a variable of type function pointer can be used in place of a function name.

151. Function pointer declarations

A function pointer declaration looks like a function declaration, except that the function name is wrapped in parentheses and preceded by an asterisk. For example:

   1 /* a function taking two int arguments and returning an int */
   2 int function(int x, int y);
   3 
   4 /* a pointer to such a function */
   5 int (*pointer)(int x, int y);

As with function declarations, the names of the arguments can be omitted.

Here's a short program that uses function pointers:

   1 /* Functional "hello world" program */
   2 
   3 #include <stdio.h>
   4 
   5 int
   6 main(int argc, char **argv)
   7 {
   8     /* function for emitting text */
   9     int (*say)(const char *);
  10 
  11     say = puts;
  12 
  13     say("hello world");
  14 
  15     return 0;
  16 }

152. Applications

Function pointers are not used as much in C as in functional languages, but there are many common uses even in C code.

152.1. Callbacks

The classic example is qsort, from the standard library:

   1 /* defined in stdlib.h */
   2 void qsort(void *base, size_t n, size_t size,
   3     int (*cmp)(const void *key1, const void *key2));

This is a generic sorting routine that will sort any array in place. It needs to know (a) the base address of the array; (b) how many elements there are; (c) how big each element is; and (d) how to compare two elements. The only tricky part is supplying the comparison, which could involve arbitrarily-complex code. So we supply this code as a function with an interface similar to strcmp.

   1 static int
   2 compare_ints(const void *key1, const void *key2)
   3 {
   4     return *((const int *) key1) - *((const int *) key2);
   5 }
   6 
   7 void
   8 sort_int_array(int *a, int n)
   9 {
  10     qsort(a, n, sizeof(*a), compare_ints);
  11 }

Other examples might include things like registering an error handler for a library, instead of just having it call abort() or something equally catastrophic, or providing a cleanup function for freeing data passed into a data structure.
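
An error-handler hook might look something like this sketch; the names are invented, and this is not any real library's API:

   1 #include <stdlib.h>
   2 
   3 /* handler to call on errors; null means use the default */
   4 static void (*error_handler)(const char *msg) = 0;
   5 
   6 /* let the user register a handler */
   7 void
   8 lib_set_error_handler(void (*handler)(const char *msg))
   9 {
  10     error_handler = handler;
  11 }
  12 
  13 /* called internally by the library when something goes wrong */
  14 static void
  15 lib_error(const char *msg)
  16 {
  17     if(error_handler) {
  18         error_handler(msg);
  19     } else {
  20         abort();   /* default: something catastrophic */
  21     }
  22 }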

152.2. Dispatch tables

Dispatch tables are an alternative to gigantic if/else if or switch statements. See page 234 of KernighanPike for a good example of this.
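
Here is a minimal made-up example: a table of name/function pairs searched with strcmp (a real program might sort the table and use binary search):

   1 #include <stdio.h>
   2 #include <string.h>
   3 
   4 static void do_start(void) { puts("starting"); }
   5 static void do_stop(void)  { puts("stopping"); }
   6 
   7 /* the dispatch table itself */
   8 static const struct {
   9     const char *name;
  10     void (*action)(void);
  11 } commands[] = {
  12     { "start", do_start },
  13     { "stop",  do_stop },
  14 };
  15 
  16 /* run the command with the given name, if any */
  17 static void
  18 dispatch(const char *name)
  19 {
  20     size_t i;
  21 
  22     for(i = 0; i < sizeof(commands) / sizeof(commands[0]); i++) {
  23         if(strcmp(name, commands[i].name) == 0) {
  24             commands[i].action();
  25             return;
  26         }
  27     }
  28 }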

152.3. Iterators

See C/Iterators.

153. Closures

A closure is a function plus some associated state. A simple way to implement closures in C is to use a static local variable, but then you only get one. Better is to allocate the state somewhere and pass it around with the function. For example, here's a simple functional implementation of infinite sequences, that generalizes the example in AbstractDataTypes:

   1 /* a sequence is an object that returns a new value each time it is called */
   2 struct sequence {
   3     int (*next)(void *data);
   4     void *data;
   5 };
   6 
   7 typedef struct sequence *Sequence;
   8 
   9 Sequence
  10 create_sequence(int (*next)(void *data), void *data)
  11 {
  12     Sequence s;
  13 
  14     s = malloc(sizeof(*s));
  15     assert(s);
  16 
  17     s->next = next;
  18     s->data = data;
  19 
  20     return s;
  21 }
  22 
  23 int
  24 sequence_next(Sequence s)
  25 {
  26     return s->next(s->data);
  27 }

And here are some examples of sequences:

   1 /* make a constant sequence that always returns x */
   2 static int
   3 constant_sequence_next(void *data)
   4 {
   5     return *((int *) data);
   6 }
   7 
   8 Sequence
   9 constant_sequence(int x)
  10 {
  11     int *data;
  12 
  13     data = malloc(sizeof(*data));
  14     if(data == 0) return 0;
  15 
  16     *data = x;
  17 
  18     return create_sequence(constant_sequence_next, data);
  19 }
  20 
  21 /* make a sequence x, x+a, x+2*a, x+3*a, ... */
  22 struct arithmetic_sequence_data {
  23     int cur;
  24     int step;
  25 };
  26 
  27 static int
  28 arithmetic_sequence_next(void *data)
  29 {
  30     struct arithmetic_sequence_data *d;
  31 
  32     d = data;
  33     d->cur += d->step;
  34 
  35     return d->cur;
  36 }
  37 
  38 Sequence
  39 arithmetic_sequence(int x, int a)
  40 {
  41     struct arithmetic_sequence_data *d;
  42 
  43     d = malloc(sizeof(*d));
  44     if(d == 0) return 0;
  45 
  46     d->cur = x - a;             /* back up so first value returned is x */
  47     d->step = a;
  48 
  49     return create_sequence(arithmetic_sequence_next, d);
  50 }
  51 
  52 /* Return the sum of two sequences */
  53 static int
  54 add_sequences_next(void *data)
  55 {
  56     Sequence *s;
  57 
  58     s = data;
  59     return sequence_next(s[0]) + sequence_next(s[1]);
  60 }
  61 
  62 Sequence
  63 add_sequences(Sequence s0, Sequence s1)
  64 {
  65     Sequence *s;
  66 
  67     s = malloc(2*sizeof(*s));
  68     if(s == 0) return 0;
  69 
  70     s[0] = s0;
  71     s[1] = s1;
  72 
  73     return create_sequence(add_sequences_next, s);
  74 }
  75 
  76 /* Return the sequence x, f(x), f(f(x)), ... */
  77 struct iterated_function_sequence_data {
  78     int x;
  79     int (*f)(int);
  80 };
  81 
  82 static int
  83 iterated_function_sequence_next(void *data)
  84 {
  85     struct iterated_function_sequence_data *d;
  86     int retval;
  87 
  88     d = data;
  89 
  90     retval = d->x;
  91     d->x = d->f(d->x);
  92 
  93     return retval;
  94 }
  95 
  96 Sequence
  97 iterated_function_sequence(int (*f)(int), int x0)
  98 {
  99     struct iterated_function_sequence_data *d;
 100 
 101     d = malloc(sizeof(*d));
 102     if(d == 0) return 0;
 103 
 104     d->x = x0;
 105     d->f = f;
 106 
 107     return create_sequence(iterated_function_sequence_next, d);
 108 }

Note that we haven't worried about how to free the data field inside a Sequence, and indeed it's not obvious that we can write a generic data-freeing routine since we don't know what structure it has. The solution is to add more function pointers to a Sequence, so that we can get the next value, get the sequence to destroy itself, etc. When we do so, we have gone beyond building a closure to building an object.

154. Objects

Here's an example of a hierarchy of counter objects. Each counter object has (at least) three operations: reset, next, and destroy. To call the next operation on counter c we pass c itself as the first argument, e.g. c->next(c) (one could write a wrapper to enforce this).

The main trick is that we define a basic counter structure and then extend it to include additional data, using lots of pointer conversions to make everything work.

   1 /* use preprocessor to avoid rewriting these */
   2 #define COUNTER_FIELDS  \
   3     void (*reset)(struct counter *);    \
   4     int (*next)(struct counter *);      \
   5     void (*destroy)(struct counter *);
   6 
   7 struct counter {
   8     COUNTER_FIELDS
   9 };
  10 
  11 typedef struct counter *Counter;
  12 
  13 /* minimal counter--always returns zero */
  14 /* we don't even allocate this, just have one global one */
  15 static void noop(Counter c) { ; }
  16 static int return_zero(Counter c) { return 0; }
  17 static struct counter Zero_counter = { noop, return_zero, noop };
  18 
  19 Counter
  20 make_zero_counter(void)
  21 {
  22     return &Zero_counter;
  23 }
  24 
  25 /* a fancier counter that iterates a function sequence */
  26 /* this struct is not exported anywhere */
  27 struct ifs_counter {
  28 
  29     /* copied from struct counter declaration */
  30     COUNTER_FIELDS
  31 
  32     /* new fields */
  33     int init;
  34     int cur;
  35     int (*f)(int);      /* update rule */
  36 };
  37 
  38 static void
  39 ifs_reset(Counter c)
  40 {
  41     struct ifs_counter *ic;
  42 
  43     ic = (struct ifs_counter *) c;
  44 
  45     ic->cur = ic->init;
  46 }
  47 
  48 static int
  49 ifs_next(Counter c)
  50 {
  51     struct ifs_counter *ic;
  52     int ret;
  53 
  54     ic = (struct ifs_counter *) c;
  55 
  56     ret = ic->cur;
  57     ic->cur = ic->f(ic->cur);
  58 
  59     return ret;
  60 }
  61 
  62 Counter
  63 make_ifs_counter(int init, int (*f)(int))
  64 {
  65     struct ifs_counter *ic;
  66 
  67     ic = malloc(sizeof(*ic));
  68 
  69     ic->reset = ifs_reset;
  70     ic->next = ifs_next;
  71     ic->destroy = (void (*)(struct counter *)) free;
  72 
  73     ic->init = init;
  74     ic->cur = init;
  75     ic->f = f;
  76 
  77     /* it's always a Counter on the outside */
  78     return (Counter) ic;
  79 }

A typical use might be

   1 static int
   2 times2(int x)
   3 {
   4     return x*2;
   5 }
   6 
   7 void
   8 print_powers_of_2(void)
   9 {
  10     int i;
  11     Counter c;
  12 
  13     c = make_ifs_counter(1, times2);
  14 
  15     for(i = 0; i < 10; i++) {
  16         printf("%d\n", c->next(c));
  17     }
  18 
  19     c->reset(c);
  20 
  21     for(i = 0; i < 20; i++) {
  22         printf("%d\n", c->next(c));
  23     }
  24 
  25     c->destroy(c);
  26 }


CategoryProgrammingNotes

155. C/Iterators

156. The problem

Suppose we have an abstract data type that represents some sort of container, such as a list or dictionary. We'd like to be able to do something to every element of the container; say, count them up. How can we write operations on the abstract data type to allow this, without exposing the implementation?

To make the problem more concrete, let's suppose we have an abstract data type that represents the set of all non-negative numbers less than some fixed bound. The core of its interface might look like this:

156.1. nums.h

   1 /*
   2  * Abstract data type representing the set of numbers from 0 to
   3  * bound-1 inclusive, where bound is passed in as an argument at creation.
   4  */
   5 typedef struct nums *Nums;
   6 
   7 /* Create a Nums object with given bound. */
   8 Nums nums_create(int bound);
   9 
  10 /* Destructor */
  11 void nums_destroy(Nums);
  12 
  13 /* Returns 1 if nums contains element, 0 otherwise */
  14 int nums_contains(Nums nums, int element);

156.2. nums.c

   1 #include <stdlib.h>
   2 #include "nums.h"
   3 
   4 struct nums {
   5     int bound;
   6 };
   7 
   8 Nums nums_create(int bound)
   9 {
  10     struct nums *n;
  11     n = malloc(sizeof(*n));
  12     n->bound = bound;
  13     return n;
  14 }
  15 
  16 void nums_destroy(Nums n) { free(n); }
  17 
  18 int nums_contains(Nums n, int element)
  19 {
  20     return element >= 0 && element < n->bound;
  21 }

From the outside, a Nums acts like the set of numbers from 0 to bound - 1; nums_contains will insist that it contains any int that is in this set and contains no int that is not in this set.

Let's suppose now that we want to loop over all elements of some Nums, say to add them together. In particular, we'd like to implement the following pseudocode, where nums is some Nums instance:

   1 sum = 0;
   2 for(each i in nums) {
   3     sum += i;
   4 }

One way to do this would be to build the loop into some operation in nums.c, including its body. But we'd like to be able to substitute any body for the sum += i line. Since we can't see the inside of a Nums, we need to have some additional operation or operations on a Nums that lets us write the loop. How can we do this?

157. Option 1: Function that returns a sequence

A data-driven approach might be to add a nums_contents function that returns a sequence of all elements of some instance, perhaps in the form of an array or linked list. The advantage of this approach is that once you have the sequence, you don't need to worry about changes to (or destruction of) the original object. The disadvantage is that you have to deal with storage management issues, and have to pay the costs in time and space of allocating and filling in the sequence. This can be particularly onerous for a "virtual" container like Nums, since we could conceivably have a Nums instance with billions of elements.

Bearing these facts in mind, let's see what this approach might look like. We'll define a new function nums_contents that returns an array of ints, terminated by a -1 sentinel:

   1 int *
   2 nums_contents(Nums n)
   3 {
   4     int *a;
   5     int i;
   6     a = malloc(sizeof(*a) * (n->bound + 1));
   7     for(i = 0; i < n->bound; i++) a[i] = i;
   8     a[n->bound] = -1;
   9     return a;
  10 }

We might use it like this:

   1     sum = 0;
   2     contents = nums_contents(nums);
   3     for(p = contents; *p != -1; p++) {
   4         sum += *p;
   5     }
   6     free(contents);

Despite the naturalness of the approach, returning a sequence in this case leads to the most code complexity of the options we will examine.

158. Option 2: Iterator with first/done/next operations

If we don't want to look at all the elements at once, but just want to process them one at a time, we can build an iterator. An iterator is an object that allows you to step through the contents of another object, by providing convenient operations for getting the first element, testing when you are done, and getting the next element if you are not. In C, we try to design iterators to have operations that fit well in the top of a for loop.

For the Nums type, we'll make each Nums its own iterator. The new operations are given here:

   1 int nums_first(Nums n) { return 0; }
   2 int nums_done(Nums n, int val) { return val >= n->bound; }
   3 int nums_next(Nums n, int val) { return val+1; }

And we use them like this:

   1     sum = 0;
   2     for(i = nums_first(nums); !nums_done(nums, i); i = nums_next(nums, i)) {
   3         sum += i;
   4     }

Not only do we completely avoid the overhead of building a sequence, we also get much cleaner code. It helps in this case that all we need to find the next value is the previous one; for a more complicated problem we might have to create and destroy a separate iterator object that holds the state of the loop. But for many tasks in C, the first/done/next idiom is a pretty good one.

159. Option 3: Iterator with function argument

Suppose we have a very complicated iteration, say one that might require several nested loops or even a recursion to span all the elements. In this case it might be very difficult to provide first/done/next operations, because it would be hard to encode the state of the iteration so that we could easily pick up in the next operation where we previously left off. What we'd really like to do is to be able to plug arbitrary code into the innermost loop of our horrible iteration procedure, and do it in a way that is reasonably typesafe and doesn't violate our abstraction barrier. This is a job for function pointers, and an example of the functional programming style in action.

We'll define a nums_foreach function that takes a function as an argument:

   1 void nums_foreach(Nums n, void (*f)(int, void *), void *f_data)
   2 {
   3     int i;
   4     for(i = 0; i < n->bound; i++) f(i, f_data);
   5 }

The f_data argument is used to pass extra state into the passed-in function f; it's a void * because we want to let f work on any sort of extra state it likes.

Now to do our summation, we first define an extra function sum_helper, which adds each element to an accumulator pointed to by f_data:

   1 static void sum_helper(int i, void *f_data)
   2 {
   3     *((int *) f_data) += i;
   4 }

We then feed sum_helper to the nums_foreach function:

   1     sum = 0;
   2     nums_foreach(nums, sum_helper, (void *) &sum);

There is a bit of a nuisance in having to define the auxiliary sum_helper function and in all the casts to and from void *, but on the whole the complexity of this solution is not substantially greater than the first/done/next approach. Which you should do depends on whether it's harder to encapsulate the state of the iterator (in which case the functional approach is preferable) or of the loop body (in which case the first/done/next approach is preferable), and whether you need to bail out of the loop early (which would require special support from the foreach procedure, perhaps checking a return value from the function). However, it's almost always straightforward to encapsulate the state of a loop body; just build a struct containing all the variables that it uses, and pass a pointer to this struct as f_data.
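
For example, here is a sketch of a loop body with two pieces of state, a threshold and an accumulator, bundled into a struct (the names are invented for illustration):

   1 /* sum only the elements that are at least threshold */
   2 struct sum_above_data {
   3     int threshold;
   4     int sum;
   5 };
   6 
   7 static void
   8 sum_above_helper(int i, void *f_data)
   9 {
  10     struct sum_above_data *d = f_data;
  11 
  12     if(i >= d->threshold) {
  13         d->sum += i;
  14     }
  15 }

which might be used as:

   1     struct sum_above_data d = { 50, 0 };
   2 
   3     nums_foreach(nums, sum_above_helper, &d);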

160. Appendix: Complete code for Nums

Here's a grand unified Nums implementation that provides all the interfaces we've discussed:

160.1. nums.h

   1 /*
   2  * Abstract data type representing the set of numbers from 0 to
   3  * bound-1 inclusive, where bound is passed in as an argument at creation.
   4  */
   5 typedef struct nums *Nums;
   6 
   7 /* Create a Nums object with given bound. */
   8 Nums nums_create(int bound);
   9 
  10 /* Destructor */
  11 void nums_destroy(Nums);
  12 
  13 /* Returns 1 if nums contains element, 0 otherwise */
  14 int nums_contains(Nums nums, int element);
  15 /*
  16  * Returns a freshly-malloc'd array containing all elements of n,
  17  * followed by a sentinel value of -1.
  18  */
  19 int *nums_contents(Nums n);
  20 
  21 /* Three-part iterator */
  22 int nums_first(Nums n);           /* returns smallest element in n */
  23 int nums_done(Nums n, int val);   /* returns 1 if val is past end */
  24 int nums_next(Nums n, int val);   /* returns next value after val */
  25 
  26 /* Call f on every element of n with extra argument f_data */
  27 void nums_foreach(Nums n, void (*f)(int, void *f_data), void *f_data);

160.2. nums.c

   1 #include <stdlib.h>
   2 #include "nums.h"
   3 
   4 struct nums {
   5     int bound;
   6 };
   7 
   8 Nums nums_create(int bound)
   9 {
  10     struct nums *n;
  11     n = malloc(sizeof(*n));
  12     n->bound = bound;
  13     return n;
  14 }
  15 
  16 void nums_destroy(Nums n) { free(n); }
  17 
  18 int nums_contains(Nums n, int element)
  19 {
  20     return element >= 0 && element < n->bound;
  21 }
  22 
  23 int *
  24 nums_contents(Nums n)
  25 {
  26     int *a;
  27     int i;
  28     a = malloc(sizeof(*a) * (n->bound + 1));
  29     for(i = 0; i < n->bound; i++) a[i] = i;
  30     a[n->bound] = -1;
  31     return a;
  32 }
  33 
  34 int nums_first(Nums n) { return 0; }
  35 int nums_done(Nums n, int val) { return val >= n->bound; }
  36 int nums_next(Nums n, int val) { return val+1; }
  37 
  38 void nums_foreach(Nums n, void (*f)(int, void *), void *f_data)
  39 {
  40     int i;
  41     for(i = 0; i < n->bound; i++) f(i, f_data);
  42 }

And here's some test code to see if it all works:

160.3. test-nums.c

   1 #include <stdio.h>
   2 #include <setjmp.h>
   3 #include <signal.h>
   4 #include <unistd.h>
   5 #include <stdlib.h>
   6 
   7 #include "nums.h"
   8 #include "tester.h"
   9 
  10 static void sum_helper(int i, void *f_data)
  11 {
  12     *((int *) f_data) += i;
  13 }
  14 
  15 int
  16 main(int argc, char **argv)
  17 {
  18     Nums nums;
  19     int sum;
  20     int *contents;
  21     int *p;
  22     int i;
  23 
  24     tester_init();
  25 
  26     TRY { nums = nums_create(100); } ENDTRY;
  27     TEST(nums_contains(nums, -1), 0);
  28     TEST(nums_contains(nums, 0), 1);
  29     TEST(nums_contains(nums, 1), 1);
  30     TEST(nums_contains(nums, 98), 1);
  31     TEST(nums_contains(nums, 99), 1);
  32     TEST(nums_contains(nums, 100), 0);
  33 
  34     sum = 0;
  35     contents = nums_contents(nums);
  36     for(p = contents; *p != -1; p++) {
  37         sum += *p;
  38     }
  39     free(contents);
  40     TEST(sum, 4950);
  41 
  42     sum = 0;
  43     for(i = nums_first(nums); !nums_done(nums, i); i = nums_next(nums, i)) {
  44         sum += i;
  45     }
  46     TEST(sum, 4950);
  47 
  48     sum = 0;
  49     nums_foreach(nums, sum_helper, (void *) &sum);
  50     TEST(sum, 4950);
  51     tester_report(stdout, argv[0]);
  52     return tester_result();
  53 }

$ make test
gcc -g3 -ansi -pedantic -Wall   -c -o test-nums.o test-nums.c
gcc -g3 -ansi -pedantic -Wall   -c -o nums.o nums.c
gcc -g3 -ansi -pedantic -Wall   -c -o tester.o tester.c
gcc -g3 -ansi -pedantic -Wall -o test-nums test-nums.o nums.o tester.o
./test-nums
OK!


CategoryProgrammingNotes

161. C/Randomization

Randomization is a fundamental technique in algorithm design that allows programs to run quickly when the average-case behavior of an algorithm is better than the worst-case behavior. It is also heavily used in games, both in entertainment and gambling. The latter application gives the only known example of a programmer being murdered for writing bad code http://www.zdnet.co.uk/news/security-management/1999/11/11/comdex-99-the-mysterious-death-of-larry-volk-2075068/, which shows how serious a business good random-number generation can be.

162. Generating random values in C

If you want random values in a C program, there are three typical ways of getting them, depending on how good (i.e. uniform, uncorrelated, and unpredictable) you want them to be.

162.1. The rand function from the standard library

E.g.

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 
   4 int
   5 main(int argc, char **argv)
   6 {
   7     printf("%d\n", rand());
   8     return 0;
   9 }

The rand function, declared in stdlib.h, returns a random integer in the range 0 to RAND_MAX (inclusive) every time you call it. On machines using the GNU C library RAND_MAX is equal to INT_MAX or 2^31 - 1, but it may be as small as 32767. There are no particularly strong guarantees about the quality of random numbers that rand returns, but it should be good enough for casual use, and has the advantage that as part of the C standard you can assume it is present almost everywhere.

Note that rand is a pseudorandom number generator: the sequence of values it returns is predictable if you know its starting state (and is still predictable from past values in the sequence even if you don't know the starting state, if you are clever enough). It is also the case that the initial seed is fixed, so that the program above will print the same value every time you run it (this is a feature: it permits debugging randomized programs).

If you want to get different sequences, you need to seed the random number generator using srand. A typical use might be:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <time.h>
   4 
   5 int
   6 main(int argc, char **argv)
   7 {
   8     srand(time(0));
   9     printf("%d\n", rand());
  10     return 0;
  11 }

Here time(0) returns the number of seconds since the epoch (00:00:00 UTC, January 1, 1970, for POSIX systems, not counting leap seconds). Note that this still might give repeated values if you run it twice in the same second, and it's extremely dangerous if you expect to distribute your code to a lot of people who want different results, since two of your users are likely to run it in the same second. See the discussion of /dev/urandom below for a better method.

162.2. Better pseudorandom number generators

There has been quite a bit of research on pseudorandom number generators over the years, and much better pseudorandom number generators than rand are available. The current champion for simulation work is the Mersenne Twister, which runs about 4 times faster than rand in its standard C implementation and passes a much wider battery of statistical tests. Its English-language home page is at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html. As with rand, you still need to provide an initial seed value.

There are also cryptographically secure pseudorandom number generators, of which the most famous is Blum Blum Shub. These cannot be predicted based on their output if seeded with a true random value (under certain cryptographic assumptions: hardness of factoring for Blum Blum Shub). Unfortunately, cryptographic PRNGs are usually too slow for day-to-day use.

162.3. Random numbers without the pseudo

If you really need actual random numbers and are on a Linux or BSD-like operating system, you can use the special device files /dev/random and /dev/urandom. These can be opened for reading like ordinary files, but the values read from them are a random sequence of bytes (including null characters). A typical use might be:

   1 #include <stdio.h>
   2 
   3 int
   4 main(int argc, char **argv)
   5 {
   6     unsigned int randval;
   7     FILE *f;
   8 
   9     f = fopen("/dev/random", "r");
  10     fread(&randval, sizeof(randval), 1, f);
  11     fclose(f);
  12 
  13     printf("%u\n", randval);
  14 
  15     return 0;
  16 }

(A similar construction can also be used to obtain a better initial seed for srand than time(0).)
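
Such a seeding routine might look like this sketch, which falls back on the clock if /dev/urandom can't be read:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <time.h>
   4 
   5 /* seed rand() from /dev/urandom, or from the clock as a fallback */
   6 void
   7 seedFromUrandom(void)
   8 {
   9     unsigned int seed;
  10     FILE *f;
  11 
  12     f = fopen("/dev/urandom", "r");
  13 
  14     if(f == 0 || fread(&seed, sizeof(seed), 1, f) != 1) {
  15         seed = time(0);
  16     }
  17 
  18     if(f) { fclose(f); }
  19 
  20     srand(seed);
  21 }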

Both /dev/random and /dev/urandom derive their random bits from physically random properties of the computer, like time between keystrokes or small variations in hard disk rotation speeds. The difference between the two is that /dev/urandom will always give you some random-looking bits, even if it has to generate extra ones using a cryptographic pseudo-random number generator, while /dev/random will only give you bits that it is confident are in fact random. Since your computer only generates a small number of genuinely random bits per second, this may mean that /dev/random will exhaust its pool if read too often. In this case, a read on /dev/random will block (just like reading a terminal with no input on it) until the pool has filled up again.

Neither /dev/random nor /dev/urandom is known to be secure against a determined attacker, but they are about the best you can do without resorting to specialized hardware.

162.4. Issues with RAND_MAX

The problem with rand is that getting a uniform value between 0 and 2^31 - 1 may not be what you want. It could be that RAND_MAX is too small; in this case, you may have to call rand more than once and paste together the results. But there can be problems with RAND_MAX even if it is bigger than the values you want.
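
For example, on a system where RAND_MAX is only 32767 (15 random bits per call), something like this sketch pastes two calls together to get 30 random bits:

   1 /* sketch: build a 30-bit random value from two 15-bit rand() values */
   2 long
   3 rand30(void)
   4 {
   5     return ((long) (rand() & 0x7fff) << 15) | (rand() & 0x7fff);
   6 }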

For example, suppose you want to simulate a die roll for your video craps machine, but you don't want to get whacked by Johnny "The Debugger" when the Nevada State Gaming Commission notices that 6-6 is coming up slightly less often than it's supposed to. A natural thing to try would be to take the output of rand mod 6:

   1 int d6(void) {
   2     return rand() % 6 + 1;
   3 }

The problem here is that there are 2^31 outputs from rand, and 6 doesn't divide 2^31. So 1 and 2 are slightly more likely to come up than 3, 4, 5, or 6. This can be particularly noticeable if we want a uniform variable from a larger range, e.g. 0..⌊(2/3)⋅2^31⌋.

We can avoid this with a technique called rejection sampling, where we reject excess parts of the output range of rand. For rolling a die, the trick is to reject the leftover piece at the top of the range, past the largest multiple of the die size, and sample again. Here's a routine that does this, returning a uniform value in the range 0 to n-1 for any positive n, together with a program that demonstrates its use for rolling dice:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 #include <time.h>
   5 
   6 /* return a uniform random value in the range 0..n-1 inclusive */
   7 int
   8 randRange(int n)
   9 {
  10     int limit;
  11     int r;
  12 
  13     limit = RAND_MAX - (RAND_MAX % n);
  14 
  15     while((r = rand()) >= limit);
  16 
  17     return r % n;
  18 }
  19 
  20 int
  21 main(int argc, char **argv)
  22 {
  23     int i;
  24 
  25     srand(time(0));
  26 
  27     for(i = 0; i < 40; i++) {
  28         printf("%d ", randRange(6)+1);
  29     }
  30 
  31     putchar('\n');
  32 
  33     return 0;
  34 }
randRange.c

More generally, rejection sampling can be used to get random values with particular properties, where it's hard to generate a value with that property directly. Here's a program that generates random primes:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 #include <time.h>
   5 
   6 /* return 1 if n is prime */
   7 int
   8 isprime(int n)
   9 {
  10     int i;
  11 
  12     if(n % 2 == 0 || n < 2) { return n == 2; }   /* 2 is the only even prime */
  13 
  14     for(i = 3; i*i <= n; i += 2) {
  15         if(n % i == 0) { return 0; }
  16     }
  17 
  18     return 1;
  19 }
  20 
  21 /* return a random prime, uniform over the primes in 0..RAND_MAX */
  22 int
  23 randPrime(void)
  24 {
  25     int r;
  26 
  27     /* extra parens avoid warnings */
  28     while(!isprime((r = rand())));
  29 
  30     return r;
  31 }
  32 
  33 int
  34 main(int argc, char **argv)
  35 {
  36     int i;
  37 
  38     srand(time(0));
  39 
  40     for(i = 0; i < 10; i++) {
  41         printf("%d\n", randPrime());
  42     }
  43 
  44     return 0;
  45 }
randPrime.c

One temptation to avoid is to re-use your random values. If, for example, you try to find a random prime by picking a random x and trying x, x+1, x+2, etc., until you hit a prime, some primes are more likely to come up than others: a prime sitting after a long run of composite numbers is found from every starting point in that run.

163. Randomized algorithms

This is essentially rejection sampling in disguise. Suppose that you want to find one of many needles in a large haystack. One approach is to methodically go through the straws and needles one at a time until you find a needle. But you may find that your good friend the adversary has put all the needles at the end of your list. Picking candidates at random is likely to hit a needle faster if there are many of them.

Here is a (silly) routine that quickly finds a number whose high-order bits match a particular pattern:

   1 int
   2 matchBits(int pattern)
   3 {
   4     int r;
   5 
   6     while(((r = rand()) & 0x70000000) != (pattern & 0x70000000));
   7 
   8     return r;
   9 }

This will find a winning value in 8 tries on average, since the mask 0x70000000 keeps three bits and there are eight equally likely three-bit patterns. In contrast, this deterministic version will take a lot longer for nonzero patterns:

   1 int
   2 matchBitsDeterministic(int pattern)
   3 {
   4     int i;
   5 
   6     for(i = 0; (i & 0x70000000) != (pattern & 0x70000000); i++);
   7 
   8     return i;
   9 }

The downside of the randomized approach is that it's hard to tell when to quit if there are no matches; if we stop after some fixed number of trials, we get a Monte Carlo algorithm that may give the wrong answer with small probability. The usual solution is to either accept a small probability of failure, or interleave a deterministic backup algorithm that always works. The latter approach gives a Las Vegas algorithm whose running time is variable but whose correctness is not.
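
Here is a Monte Carlo version of matchBits as a sketch (the maxTrials parameter and the -1 failure value are our own conventions): it runs in bounded time, but fails with probability (7/8)^maxTrials even though a match always exists.

    #include <stdlib.h>

    /* Monte Carlo variant: bounded running time, small chance of
     * missing a match that exists */
    int
    matchBitsMonteCarlo(int pattern, int maxTrials)
    {
        int i;
        int r;

        for(i = 0; i < maxTrials; i++) {
            r = rand();
            if((r & 0x70000000) == (pattern & 0x70000000)) {
                return r;
            }
        }

        return -1;   /* give up; probably no match */
    }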

163.2. Quickselect and quicksort

QuickSelect, or Hoare's FIND, is an algorithm for quickly finding the k-th smallest element in an unsorted array of n elements. It runs in O(n) time on average, which is the best one can hope for (we have to look at every element of the array to be sure we didn't miss a small one that changes our answer) and better than the O(n log n) time we get if we sort the array first using a comparison-based sorting algorithm.

The idea is to pick a random pivot and divide the input into two piles, each of which is likely to be roughly a constant fraction of the size of the original input. It takes O(n) time to split the input up (we have to compare each element to the pivot once), and in the recursive calls this gives a geometric series. We can even do the splitting up in place if we are willing to reorder the elements of our original array.

If we recurse into both piles instead of just one, we get QuickSort, a very fast and simple comparison-based sorting algorithm. Here is an implementation of both algorithms:

    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <limits.h>

    /* reorder an array to put elements <= pivot
     * before elements > pivot.
     * Returns number of elements <= pivot */
    static int
    splitByPivot(int n, int *a, int pivot)
    {
        int lo;
        int hi;
        int temp;  /* for swapping */

        assert(n >= 0);

        /* Dutch Flag algorithm */
        /* swap everything <= pivot to bottom of array */
        /* invariant is i < lo implies a[i] <= pivot */
        /* and i > hi implies a[i] > pivot */
        lo = 0;
        hi = n-1;

        while(lo <= hi) {
            if(a[lo] <= pivot) {
                lo++;
            } else {
                temp = a[hi];
                a[hi--] = a[lo];
                a[lo] = temp;
            }
        }

        return lo;
    }

    /* find the k-th smallest element of an n-element array */
    /* may reorder elements of the original array */
    int
    quickselectDestructive(int k, int n, int *a)
    {
        int pivot;
        int lo;

        assert(0 <= k);
        assert(k < n);

        if(n == 1) {
            return a[0];
        }

        /* else */
        pivot = a[rand() % n];   /* we will tolerate non-uniformity */

        lo = splitByPivot(n, a, pivot);

        /* lo is now number of values <= pivot */
        /* if lo == n, the pivot was a maximum and the split made no
         * progress; re-split below the pivot so that an array full of
         * duplicates can't make us recurse forever */
        if(lo == n) {
            if(pivot == INT_MIN || (lo = splitByPivot(n, a, pivot - 1)) <= k) {
                /* a[lo..n-1] are all equal to pivot */
                return pivot;
            }
        }

        if(k < lo) {
            return quickselectDestructive(k, lo, a);
        } else {
            return quickselectDestructive(k - lo, n - lo, a + lo);
        }
    }

    /* sort an array in place */
    void
    quickSort(int n, int *a)
    {
        int pivot;
        int lo;

        if(n <= 1) {
            return;
        }

        /* else */
        pivot = a[rand() % n];   /* we will tolerate non-uniformity */

        lo = splitByPivot(n, a, pivot);

        /* same duplicates guard as in quickselectDestructive: if the
         * pivot was a maximum, split out the values equal to it, which
         * are already in their final place at the top of the array */
        if(lo == n) {
            if(pivot == INT_MIN) { return; }   /* everything equals pivot */
            n = lo = splitByPivot(n, a, pivot - 1);
        }

        quickSort(lo, a);
        quickSort(n - lo, a + lo);
    }


    /* shuffle an array */
    void
    shuffle(int n, int *a)
    {
        int i;
        int r;
        int temp;

        for(i = n - 1; i > 0; i--) {
            r = rand() % (i + 1);   /* i+1 so a[i] may also stay put;
                                     * rand() % i would bias the shuffle */
            temp = a[r];
            a[r] = a[i];
            a[i] = temp;
        }
    }

    #define N (1024)

    int
    main(int argc, char **argv)
    {
        int a[N];
        int i;

        srand(0);  /* use fixed value for debugging */

        for(i = 0; i < N; i++) {
            a[i] = i;
        }

        shuffle(N, a);

        for(i = 0; i < N; i++) {
            assert(quickselectDestructive(i, N, a) == i);
        }

        shuffle(N, a);

        quickSort(N, a);

        for(i = 0; i < N; i++) {
            assert(a[i] == i);
        }

        return 0;
    }
quick.c

164. Randomized data structures

164.1. Randomized tree balancing

Suppose we insert n elements into an initially-empty binary search tree in random order with no rebalancing. Then each element is equally likely to be the root, and all the elements less than the root end up in the left subtree, while all the elements greater than the root end up in the right subtree, where they are further partitioned recursively. This is exactly what happens in quicksort, so the structure of the tree will exactly mirror the structure of an execution of quicksort. In particular, the average depth of a node will be O(log n), giving us the same expected search cost as in a balanced binary tree.

The problem with this approach is that we don't have any guarantees that the input will be supplied in random order, and in the worst case we end up with a linked list. The solution is to put the randomization into the algorithm itself, making the structure of the tree depend on random choices made by the program.

A skip list (Pugh, 1990) is a randomized tree-like data structure based on linked lists. It consists of a level 0 list that is an ordinary sorted linked list, together with higher-level lists that contain a random sampling of the elements at lower levels. When inserted into the level i list, an element flips a coin that tells it with probability p to insert itself in the level i+1 list as well.

Searches in a skip list are done by starting in the highest-level list and searching forward for the last element whose key is smaller than the target; the search then continues in the same way on the next level down. The idea is that the higher-level lists act as express lanes to get us to our target value faster. To bound the expected running time of a search, it helps to look at this process backwards; the reversed search path starts at level 0 and continues going backwards until it reaches the first element that is also in a higher level; it then jumps to the next level up and repeats the process. On average, we hit 1+1/p nodes at each level before jumping back up; for constant p (e.g. 1/2), this gives us O(log n) steps for the search.

The space per element of a skip list also depends on p. Every element has at least one outgoing pointer (on level 0), and has 1/(1-p) pointers on average. So the space cost can also be adjusted by adjusting p. For example, if space is at a premium, setting p = 1/10 produces 10/9 pointers per node on average—not much more than in a linked list—but still gives O(log n) search times.

Below is an implementation of a skip list. To avoid having to allocate a separate array of pointers for each element, we put a length-1 array at the end of struct skiplist and rely on C's lack of bounds checking to make the array longer if necessary. A dummy head element stores pointers to all the initial elements in each level of the skip list; it is given the dummy key INT_MIN so that searches for values less than any in the list will report this value. Aside from these nasty tricks, the code for search and insertion is pretty straightforward. Code for deletion is a little more involved, because we have to make sure that we delete the leftmost copy of a key if there are duplicates (an alternative would be to modify skiplistInsert to ignore duplicates).

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 #include <limits.h>
   4 
   5 #include "skiplist.h"
   6 
   7 #define MAX_HEIGHT (32)
   8 
   9 struct skiplist {
  10     int key;
  11     int height;                /* number of next pointers */
  12     struct skiplist *next[1];  /* first of many */
  13 };
  14 
  15 /* choose a height according to a geometric distribution */
  16 static int
  17 chooseHeight(void)
  18 {
  19     int i;
  20 
  21     for(i = 1; i < MAX_HEIGHT && rand() % 2 == 0; i++); 
  22 
  23     return i;
  24 }
  25 
  26 /* create a skiplist node with the given key and height */
  27 /* does not fill in next pointers */
  28 static Skiplist
  29 skiplistCreateNode(int key, int height)
  30 {
  31     Skiplist s;
  32 
  33     assert(height > 0);
  34     assert(height <= MAX_HEIGHT);
  35 
  36     s = malloc(sizeof(struct skiplist) + sizeof(struct skiplist *) * (height - 1));
  37 
  38     assert(s);
  39 
  40     s->key = key;
  41     s->height = height;
  42 
  43     return s;
  44 }
  45 
  46 /* create an empty skiplist */
  47 Skiplist
  48 skiplistCreate(void)
  49 {
  50     Skiplist s;
  51     int i;
  52 
  53     /* s is a dummy head element */
  54     s = skiplistCreateNode(INT_MIN, MAX_HEIGHT);
  55 
  56     /* this tracks the maximum height of any node */
  57     s->height = 1;
  58 
  59     for(i = 0; i < MAX_HEIGHT; i++) {
  60         s->next[i] = 0;
  61     }
  62 
  63     return s;
  64 }
  65 
  66 /* free a skiplist */
  67 void
  68 skiplistDestroy(Skiplist s)
  69 {
  70     Skiplist next;
  71 
  72     while(s) {
  73         next = s->next[0];
  74         free(s);
  75         s = next;
  76     }
  77 }
  78 
  79 /* return maximum key less than or equal to key */
  80 /* or INT_MIN if there is none */
  81 int
  82 skiplistSearch(Skiplist s, int key)
  83 {
  84     int level;
  85 
  86     for(level = s->height - 1; level >= 0; level--) {
  87         while(s->next[level] && s->next[level]->key <= key) {
  88             s = s->next[level];
  89         }
  90     }
  91 
  92     return s->key;
  93 }
  94 
  95 /* insert a new key into s */
  96 void
  97 skiplistInsert(Skiplist s, int key)
  98 {
  99     int level;
 100     Skiplist elt;
 101 
 102     elt = skiplistCreateNode(key, chooseHeight());
 103 
 104     assert(elt);
 105 
 106     if(elt->height > s->height) {
 107         s->height = elt->height;
 108     }
 109 
 110     /* search through levels taller than elt */
 111     for(level = s->height - 1; level >= elt->height; level--) {
 112         while(s->next[level] && s->next[level]->key < key) {
 113             s = s->next[level];
 114         }
 115     }
 116 
 117     /* now level is elt->height - 1, we can start inserting */
 118     for(; level >= 0; level--) {
 119         while(s->next[level] && s->next[level]->key < key) {
 120             s = s->next[level];
 121         }
 122 
 123         /* s is last entry on this level < new element */
 124         /* do list insert */
 125         elt->next[level] = s->next[level];
 126         s->next[level] = elt;
 127     }
 128 }
 129 
 130 /* delete a key from s */
 131 void 
 132 skiplistDelete(Skiplist s, int key)
 133 {
 134     int level;
 135     Skiplist target;
 136 
 137     /* first we have to find leftmost instance of key */
 138     target = s;
 139 
 140     for(level = s->height - 1; level >= 0; level--) {
 141         while(target->next[level] && target->next[level]->key < key) {
 142             target = target->next[level];
 143         }
 144     }
 145 
 146     /* take one extra step at bottom */
 147     target = target->next[0];
 148 
 149     if(target == 0 || target->key != key) {
 150         return;
 151     }
 152 
 153     /* now we found target, splice it out */
 154     for(level = s->height - 1; level >= 0; level--) {
 155         while(s->next[level] && s->next[level]->key < key) {
 156             s = s->next[level];
 157         }
 158 
 159         if(s->next[level] == target) {
 160             s->next[level] = target->next[level];
 161         }
 162     }
 163 
 164     free(target);
 165 }
skiplist.c

Here is the header file, Makefile, and test code: skiplist.h, Makefile.skiplist, test_skiplist.c.
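
Here is a quick usage sketch, assuming skiplist.h declares the Skiplist type and the operations above:

    #include <stdio.h>

    #include "skiplist.h"

    int
    main(int argc, char **argv)
    {
        Skiplist s;
        int i;

        s = skiplistCreate();

        for(i = 0; i < 10; i++) {
            skiplistInsert(s, i * 10);
        }

        printf("%d\n", skiplistSearch(s, 47));   /* prints 40 */

        skiplistDelete(s, 40);

        printf("%d\n", skiplistSearch(s, 47));   /* now prints 30 */

        skiplistDestroy(s);

        return 0;
    }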

164.2. Universal hash families

Randomization can also be useful in hash tables. Recall that in building a hash table, we are relying on the hash function to spread out bad input distributions over the indices of our array. But for any fixed hash function, in the worst case we may get inputs where every key hashes to the same location. Universal hashing (Carter and Wegman, 1979) solves this problem by choosing a hash function at random. We may still get unlucky and have the hash function hash all our values to the same location, but now we are relying on the random number generator to be nice to us instead of the adversary. We can also rehash with a new random hash function if we find out that the one we are using is bad.

The problem here is that we can't just choose a function uniformly at random out of the set of all possible hash functions, because there are too many of them, meaning that we would spend more space representing our hash function than we would on the table. The solution is to observe that we don't need our hash function h to be truly random; it's enough if the probability of collision Pr[h(x) = h(y)] for any fixed keys x≠y is at most 1/m, where m is the size of the hash table. The reason is that the cost of searching for x (with chaining) is linear in the number of keys already in the table that collide with it. The expected number of such collisions is the sum of Pr[h(x) = h(y)] over all keys y in the table, which is at most n/m if we have n keys. So this pairwise collision probability bound is enough to get the desired n/m behavior out of our table. A family of hash functions with this property is called universal.

How do we get a universal hash family? For strings, we can use a table of random values, one for each position and possible character in the string. The hash of a string is then the exclusive or of the random values hashArray[i][s[i]] corresponding to the actual characters in the string. If our table size m is a power of two, this has the universal property, because if two strings x and y differ in some position i, then there is only one possible value of hashArray[i][y[i]] (mod m) that will make the two hash values collide.

Typically, to avoid having to build an arbitrarily huge table of random values, we hash only an initial prefix of the string. Here is a hash function based on this idea, which assumes that the Dict data structure d includes a hashArray field that contains the random values for this particular hash table:

   1 static unsigned long
   2 hash_function(Dict d, const char *s)
   3 {
   4     unsigned const char *us;
   5     unsigned long h;
   6     int i;
   7 
   8     h = 0;
   9 
  10     us = (unsigned const char *) s;
  11 
  12     for(i = 0; i < HASH_PREFIX_LENGTH && us[i] != '\0'; i++) {
  13         h ^= d->hashArray[i][us[i]];
  14     }
  15 
  16     return h;
  17 }
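
The hashArray table itself can be filled in with random bits when the hash table is created, along these lines (a sketch: HASH_PREFIX_LENGTH, initHashArray, and the use of rand here are our assumptions, not necessarily what the actual dict.c does):

    #include <stdlib.h>
    #include <limits.h>

    #define HASH_PREFIX_LENGTH (32)   /* assumed prefix length */

    /* fill the table with random bits; we paste together two calls
     * to rand per entry in case RAND_MAX is small */
    static void
    initHashArray(unsigned long hashArray[HASH_PREFIX_LENGTH][UCHAR_MAX + 1])
    {
        int i;
        int j;

        for(i = 0; i < HASH_PREFIX_LENGTH; i++) {
            for(j = 0; j <= UCHAR_MAX; j++) {
                hashArray[i][j] = ((unsigned long) rand() << 16)
                                ^ (unsigned long) rand();
            }
        }
    }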

A modified version of the Dict hash table from C/HashTables that uses this hash function is given here: dict.c, dict.h, test_dict.c, Makefile.dict.


CategoryProgrammingNotes

165. RadixSort

166. What's wrong with comparison-based sorting

The standard quicksort routine is an example of a comparison-based sorting algorithm. This means that the only information the algorithm uses about the items it is sorting is the return value of the compare routine. This has the rather nice advantage of making the algorithm very general, but has the disadvantage that the algorithm can extract only one bit of information from every call to compare. Since there are n! possible ways to reorder the input sequence, this means we need at least log(n!) = Ω(n log n) calls to compare to finish the sort. If we are sorting something like strings, this can get particularly expensive, because calls to strcmp may take time linear in the length of the strings being compared. In the worst case, sorting n strings of length m each could take O(nm log n) time.

167. Bucket sort

The core idea of radix sort is that if we want to sort values from a small range, we can do it by making one bucket for each possible value and throwing each object into the bucket corresponding to its value. In the old days, when Solitaire was a game played with physical pieces of cardboard, a player who suspected that one of these "cards" had fallen under the couch might sort the deck by dividing it up into Spades, Hearts, Diamonds, and Clubs piles and then sorting each pile recursively. The same trick works in a computer, but there the buckets are typically implemented as an array of some convenient data structure.
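
For bare integer values the buckets can be mere counters, as in this minimal sketch (the RANGE bound and the function name are ours, for illustration):

    #include <assert.h>

    #define RANGE (256)   /* assumed bound: all values lie in 0..RANGE-1 */

    /* sort by counting how many times each value occurs, then
     * writing the values back out in order */
    void
    bucketSortCounting(int n, int *a)
    {
        int count[RANGE] = { 0 };
        int i;
        int j;
        int k;

        for(i = 0; i < n; i++) {
            assert(a[i] >= 0 && a[i] < RANGE);
            count[a[i]]++;
        }

        k = 0;
        for(i = 0; i < RANGE; i++) {
            for(j = 0; j < count[i]; j++) {
                a[k++] = i;
            }
        }
    }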

If the number of possible values is too big, we may still be able to use bucket sort digit-by-digit. The resulting algorithms are known generally as radix sort. These are a class of algorithms designed for sorting strings in lexicographic order—the order used by dictionaries, where one string is greater than another if the first character on which they differ is greater. One particular variant, most-significant-byte radix sort or MSB radix sort, has the beautiful property that its running time is not only linear in the size of the input in bytes, but is also linear in the smallest number of characters in the input that need to be examined to determine the correct order. This algorithm is so fast that it takes not much more time to sort data than it does to read the data from memory and write it back. But it's a little trickier to explain than the original least-significant-byte radix sort or LSB radix sort.

168. Classic LSB radix sort

This is the variant used for punch cards, and works well for fixed-length strings. The idea is to sort on the least significant position first, then work backwards to the most significant position. This works as long as each sort is stable, meaning that it doesn't reorder values with equal keys. For example, suppose we are sorting the strings:

sat
bat
bad

The first pass sorts on the third column, giving:

bad
sat
bat

The second pass sorts on the second column, producing no change in the order (all the characters are the same). The last pass sorts on the first column. This moves the s after the two bs, but preserves the order of the two words starting with b. The result is:

bad
bat
sat
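
Here is what the classic algorithm might look like in code for fixed-width integer keys, using a stable counting sort for each byte position (a sketch; the names are ours):

    #include <stdlib.h>
    #include <string.h>
    #include <limits.h>
    #include <assert.h>

    /* LSB radix sort for unsigned ints: one stable counting-sort
     * pass per byte, least significant byte first */
    void
    lsbRadixSort(int n, unsigned int *a)
    {
        unsigned int *scratch;
        int count[UCHAR_MAX + 1];
        int shift;
        int total;
        int c;
        int i;

        scratch = malloc(sizeof(unsigned int) * n);
        assert(scratch);

        for(shift = 0; shift < (int) sizeof(unsigned int) * CHAR_BIT; shift += CHAR_BIT) {
            /* count keys with each value of the current byte */
            memset(count, 0, sizeof(count));
            for(i = 0; i < n; i++) {
                count[(a[i] >> shift) & UCHAR_MAX]++;
            }

            /* turn counts into starting positions */
            total = 0;
            for(i = 0; i <= UCHAR_MAX; i++) {
                c = count[i];
                count[i] = total;
                total += c;
            }

            /* distribute; equal bytes keep their order, so each pass is stable */
            for(i = 0; i < n; i++) {
                scratch[count[(a[i] >> shift) & UCHAR_MAX]++] = a[i];
            }

            memcpy(a, scratch, sizeof(unsigned int) * n);
        }

        free(scratch);
    }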

There are three downsides to LSB radix sort:

  1. All the strings have to be the same length (this is not necessarily a problem if they are really fixed-width data types like ints).

  2. The algorithm used to sort each position must be stable, which may require more effort than most programmers would like to take.
  3. It may be that the late positions in the strings don't affect the order, but we have to sort on them anyway. If we are sorting aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa and baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, we spend a lot of time matching up the trailing a's against each other.

169. MSB radix sort

For these reasons, MSB radix sort is used more often. This is basically the radix sort version of QuickSort, where instead of partitioning our input data into two piles based on whether each element is less than or greater than a random pivot, we partition the input into 256 piles, one for each initial byte. We can then sort each pile recursively using the same algorithm, taking advantage of the fact that we know that the first byte (or later, the first k bytes) are equal and so we only need to look at the next one. The recursion stops when we get down to 1 value, or in practice when we get down to a small enough number of values that the cost of doing a different sorting algorithm becomes lower than the cost of creating and tearing down the data structures for managing the piles.

169.1. Issues with recursion depth

The depth of recursion for MSB radix sort is equal to the length of the second-longest string in the worst case. Since strings can be pretty long, this creates a danger of blowing out the stack. The solution (as in QuickSort) is to use tail recursion for the largest pile. Now any pile we recurse into with an actual procedure call is at most half the size of the original pile, so we get stack depth at most O(log n).

169.2. Implementing the buckets

There is a trick we can do, analogous to the Dutch flag algorithm, where we sort the array in place. The idea is that we first count the number of elements that land in each bucket and assign a block of the array to each bucket; each block keeps an initial prefix of values that are known to belong in its bucket, with the rest not yet processed. We then walk through the buckets, swapping any element just past a block's good prefix out to the bucket it is supposed to be in. Each swap puts at least one element in the right bucket, so we reorder everything correctly in at most n swaps and O(n) additional work.

To keep track of each bucket, we use two pointers bucket[i] for the first element of the bucket and top[i] for the first unused element. We could make these be integer array indices, but this slows the code down by about 10%. This seems to be a situation where our use of pointers is complicated enough that the compiler can't optimize out the array lookups.

169.3. Further optimization

Since we are detecting the heaviest bucket anyway, there is a straightforward optimization that speeds the sort up noticeably on inputs with a lot of duplicates: if everything would land in the same bucket, we can skip the bucket-sort and just go directly to the next character.

169.4. Sample implementation

Here is an implementation of MSB radix sort using the ideas above:

   1 #include <assert.h>
   2 #include <limits.h>
   3 #include <string.h>
   4 
   5 #include "radixsort.h"
   6 
   7 /* in-place MSB radix sort for null-terminated strings */
   8 
   9 /* helper routine for swapping */
  10 static void
  11 swapStrings(const char **a, const char **b)
  12 {
  13     const char *temp;
  14 
  15     temp = *a;
  16     *a = *b;
  17     *b = temp;
  18 }
  19 
  20 /* this is the internal routine that assumes all strings are equal for the
  21  * first k characters */
  22 static void
  23 radixSortInternal(int n, const char **a, int k)
  24 {
  25     int i;
  26     int count[UCHAR_MAX+1];  /* number of strings with given character in position k */
  27     int mode;                /* most common position-k character */
  28     const char **bucket[UCHAR_MAX+1]; /* position of character block in output */
  29     const char **top[UCHAR_MAX+1];    /* first unused index in this character block */
  30 
  31     /* loop implements tail recursion on most common character */
  32     while(n > 1) {
  33 
  34         /* count occurrences of each character */
  35         memset(count, 0, sizeof(int)*(UCHAR_MAX+1));
  36 
  37         for(i = 0; i < n; i++) {
  38             count[(unsigned char) a[i][k]]++;
  39         }
  40 
  41         /* find the most common nonzero character */
  42         /* we will handle this specially */
  43         mode = 1;
  44         for(i = 2; i < UCHAR_MAX+1; i++) {
  45             if(count[i] > count[mode]) {
  46                 mode = i;
  47             }
  48         }
  49 
  50         if(count[mode] < n) {
  51 
  52             /* generate bucket and top fields */
  53             bucket[0] = top[0] = a;
  54             for(i = 1; i < UCHAR_MAX+1; i++) {
  55                 top[i] = bucket[i] = bucket[i-1] + count[i-1];
  56             }
  57 
  58             /* reorder elements by k-th character */
  59             /* this is similar to dutch flag algorithm */
  60             /* we start at bottom character and swap values out until everything is in place */
  61             /* invariant is that for all i, bucket[i] <= j < top[i] implies a[j][k] == i */
  62             for(i = 0; i < UCHAR_MAX+1; i++) {
  63                 while(top[i] < bucket[i] + count[i]) {
  64                     if((unsigned char) top[i][0][k] == i) {
  65                         /* leave it in place, advance bucket */
  66                         top[i]++;
  67                     } else {
  68                         /* swap with top of appropriate block */
  69                         swapStrings(top[i], top[(unsigned char) top[i][0][k]]++);
  70                     }
  71                 }
  72             }
  73 
  74             /* we have now reordered everything */
  75             /* recurse on all but 0 and mode */
  76             for(i = 1; i < UCHAR_MAX+1; i++) {
  77                 if(i != mode) {
  78                     radixSortInternal(count[i], bucket[i], k+1);
  79                 }
  80             }
  81 
  82             /* tail recurse on mode */
  83             n = count[mode];
  84             a = bucket[mode];
  85             k = k+1;
  86 
  87         } else {
  88 
  89             /* tail recurse on whole pile */
  90             k = k+1;
  91         }
  92     }
  93 }
  94 
  95 void
  96 radixSort(int n, const char **a)
  97 {
  98     radixSortInternal(n, a, 0);
  99 }
radixsort.c

Some additional files: radixsort.h, test_radixsort.c, Makefile, sortInput.c. The last is a program that sorts lines on stdin and writes the result to stdout, similar to the GNU sort utility. When compiled with -O3 and run on my machine, this runs in about the same time as the standard sort program on a 4.7 million line input file consisting of a random shuffle of 20 copies of /usr/share/dict/words, provided sort is run with LANG=C to keep it from having to deal with locale-specific collating issues. On other inputs, sort is faster. This is not bad given how thoroughly sort has been optimized, but is a sign that further optimization is possible.


CategoryProgrammingNotes

170. RadixSearch

Radix search refers to a variety of data structures that support searching for strings considered as sequences of digits in some large base (or radix). These are generally faster than simple BinarySearchTrees because they usually only require examining one byte or less of the target string at each level of the tree, as compared to every byte in the target in a full string comparison. In many cases, the best radix search trees are even faster than HashTables, because they only need to look at a small part of the target string to identify it.

We'll describe several radix search trees, starting with the simplest and working up.

171. Tries

A trie is a binary tree (or more generally, a k-ary tree where k is the radix) where the root represents the empty bit sequence and the two children of a node representing sequence x represent the extended sequences x0 and x1 (or generally x0, x1, ... x(k-1)). So a key is not stored at a particular node but is instead represented bit-by-bit (or digit-by-digit) along some path. Typically a trie assumes that the set of keys is prefix-free, i.e. that no key is a prefix of another; in this case there is a one-to-one correspondence between keys and leaves of the trie. If this is not the case, we can mark internal nodes that also correspond to the ends of keys, getting a slightly different data structure known as a digital search tree. For null-terminated strings as in C, the null terminator ensures that any set of strings is prefix-free.

Given this simple description, a trie storing a single long key would have a very large number of nodes. A standard optimization is to chop off any path with no branches in it, so that each leaf corresponds to the shortest unique prefix of a key. This requires storing the key in the leaf so that we can distinguish different keys with the same prefix.

The name trie comes from the phrase "information retrieval." Despite the etymology, trie is now almost always pronounced like try instead of tree to avoid confusion with other tree data structures.

171.1. Searching a trie

Searching a trie is similar to searching a binary search tree, except that instead of doing a comparison at each step we just look at the next bit in the target. The time to perform a search is proportional to the number of bits in the longest path in the tree matching a prefix of the target. This can be very fast for search misses if the target is wildly different from all the keys in the tree.

171.2. Inserting a new element into a trie

Insertion is more complicated for tries than for binary search trees. The reason is that a new element may add more than one new node. There are essentially two cases:

  1. (The simple case.) In searching for the new key, we reach a null pointer leaving a non-leaf node. In this case we can simply add a new leaf. The cost of this case is essentially the same as for search plus O(1) for building the new leaf.
  2. (The other case.) In searching for the new key, we reach a leaf, but the key stored there isn't the same as the new key. Now we have to generate a new path for as long as the old key and the new key have the same bits, branching out to two different leaves at the end. The cost of this operation is within a constant factor of the cost for searching for the new leaf after it is inserted, since that's how long the newly-built search path will be.

In either case, the cost is bounded by the length of the new key, which is about the best we can hope for in the worst case for any data structure.

171.3. Implementation

A typical trie implementation in C might look like this. It uses a GET_BIT macro similar to the one from C/BitExtraction, except that we reverse the bits within each byte to get the right sorting order for keys.

   1 typedef struct trie_node *Trie;
   2 
   3 #define EMPTY_TRIE (0)
   4 
   5 /* returns 1 if trie contains target */
   6 int trie_contains(Trie trie, const char *target);
   7 
   8 /* add a new key to a trie */
   9 /* and return the new trie */
  10 Trie trie_insert(Trie trie, const char *key);
  11 
  12 /* free a trie */
  13 void trie_destroy(Trie);
  14 
  15 /* debugging utility: print all keys in trie */
  16 void trie_print(Trie);
trie.h

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <string.h>
   4 #include <assert.h>
   5 
   6 #include "trie.h"
   7 
   8 #define BITS_PER_BYTE (8)
   9 
  10 /* extract the n-th bit of x */
  11 /* here we process bits within bytes in MSB-first order */
  12 /* this sorts like strcmp */
  13 #define GET_BIT(x, n) ((((x)[(n) / BITS_PER_BYTE]) & (0x1 << (BITS_PER_BYTE - 1 - (n) % BITS_PER_BYTE))) != 0)
  14 
  15 #define TRIE_BASE (2)
  16 
  17 struct trie_node {
  18     char *key;
  19     struct trie_node *kids[TRIE_BASE];
  20 };
  21 
  22 #define IsLeaf(t) ((t)->kids[0] == 0 && (t)->kids[1] == 0)
  23 
  24 /* returns 1 if trie contains target */
  25 int
  26 trie_contains(Trie trie, const char *target)
  27 {
  28     int bit;
  29 
  30     for(bit = 0; trie && !IsLeaf(trie); bit++) {
  31         /* keep going */
  32         trie = trie->kids[GET_BIT(target, bit)];
  33     }
  34 
  35     if(trie == 0) {
  36         /* we lost */
  37         return 0;
  38     } else {
  39         /* check that leaf really contains the target */
  40         return !strcmp(trie->key, target);
  41     }
  42 }
  43 
  44 /* gcc -pedantic kills strdup! */
  45 static char *
  46 my_strdup(const char *s)
  47 {
  48     char *s2;
  49 
  50     s2 = malloc(strlen(s) + 1);
  51     assert(s2);
  52 
  53     strcpy(s2, s);
  54     return s2;
  55 }
  56 
  57 
  58 /* helper functions for insert */
  59 static Trie
  60 make_trie_node(const char *key)
  61 {
  62     Trie t;
  63     int i;
  64 
  65     t = malloc(sizeof(*t));
  66     assert(t);
  67 
  68     if(key) {
  69         t->key = my_strdup(key);
  70         assert(t->key);
  71     } else {
  72         t->key = 0;
  73     }
  74 
  75     for(i = 0; i < TRIE_BASE; i++) t->kids[i] = 0;
  76 
  77     return t;
  78 }
  79 
  80 /* add a new key to a trie */
  81 /* and return the new trie */
  82 Trie
  83 trie_insert(Trie trie, const char *key)
  84 {
  85     int bit;
  86     int bitvalue;
  87     Trie t;
  88     Trie kid;
  89     const char *oldkey;
  90 
  91     if(trie == 0) {
  92         return make_trie_node(key);
  93     }
  94     /* else */
  95     /* first we'll search for key */
  96     for(t = trie, bit = 0; !IsLeaf(t); bit++, t = kid) {
  97         kid = t->kids[bitvalue = GET_BIT(key, bit)];
  98         if(kid == 0) {
  99             /* woohoo! easy case */
 100             t->kids[bitvalue] = make_trie_node(key);
 101             return trie;
 102         }
 103     }
 104 
 105     /* did we get lucky? */
 106     if(!strcmp(t->key, key)) {
 107         /* nothing to do */
 108         return trie;
 109     }
 110 
 111     /* else */
 112     /* hard case---have to extend the @#!$ trie */
 113     oldkey = t->key;
 114 #ifdef EXCESSIVE_TIDINESS
 115     t->key = 0;      /* not required but makes data structure look tidier */
 116 #endif
 117 
 118     /* walk the common prefix */
 119     while(GET_BIT(oldkey, bit) == (bitvalue = GET_BIT(key, bit))) {
 120         kid = make_trie_node(0);
 121         t->kids[bitvalue] = kid;
 122         bit++;
 123         t = kid;
 124     }
 125 
 126     /* then split */
 127     t->kids[bitvalue] = make_trie_node(key);
 128     t->kids[!bitvalue] = make_trie_node(oldkey);
 129 
 130     return trie;
 131 }
 132 
 133 /* kill it */
 134 void
 135 trie_destroy(Trie trie)
 136 {
 137     int i;
 138 
 139     if(trie) {
 140         for(i = 0; i < TRIE_BASE; i++) {
 141             trie_destroy(trie->kids[i]);
 142         } 
 143 
 144         if(IsLeaf(trie)) {
 145             free(trie->key);
 146         }
 147 
 148         free(trie);
 149     }
 150 }
 151 
 152 static void
 153 trie_print_internal(Trie t, int bit)
 154 {
 155     int i;
 156     int kid;
 157 
 158     if(t != 0) {
 159         if(IsLeaf(t)) {
 160             for(i = 0; i < bit; i++) putchar(' ');
 161             puts(t->key);
 162         } else {
 163             for(kid = 0; kid < TRIE_BASE; kid++) {
 164                 trie_print_internal(t->kids[kid], bit+1);
 165             }
 166         }
 167     }
 168 }
 169 
 170 void
 171 trie_print(Trie t)
 172 {
 173     trie_print_internal(t, 0);
 174 }
trie.c

Here is a short test program that demonstrates how to use it:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 
   4 #include "trie.h"
   5 
   6 /* test for trie.c */
   7 /* reads lines from stdin and echoes lines that haven't appeared before */
   8 
   9 /* read a line of text from stdin
  10  * and return it (without terminating newline) as a freshly-malloc'd block.
  11  * Caller is responsible for freeing this block.
  12  * Returns 0 on error or EOF.
  13  */
  14 char *
   15 get_line(void)   /* renamed: stdio.h may already declare a POSIX getline */
  16 {
  17     char *line;         /* line buffer */
  18     int n;              /* characters read */
  19     int size;           /* size of line buffer */
  20     int c;
  21 
  22     size = 1;
  23     line = malloc(size);
  24     if(line == 0) return 0;
  25     
  26     n = 0;
  27 
  28     while((c = getchar()) != '\n' && c != EOF) {
  29         while(n >= size - 1) {
  30             size *= 2;
  31             line = realloc(line, size);
  32             if(line == 0) return 0;
  33         }
  34         line[n++] = c;
  35     }
  36 
  37     if(c == EOF && n == 0) {
  38         /* got nothing */
  39         free(line);
  40         return 0;
  41     } else {
  42         line[n++] = '\0';
  43         return line;
  44     }
  45 }
  46 
  47 int
  48 main(int argc, char **argv)
  49 {
  50     Trie t;
  51     char *line;
  52 
  53     t = EMPTY_TRIE;
  54 
   55     while((line = get_line()) != 0) {
  56         if(!trie_contains(t, line)) {
  57             puts(line);
  58         }
  59 
  60         /* try to insert it either way */
  61         /* this tests that insert doesn't blow up on duplicates */
  62         t = trie_insert(t, line);
  63 
  64         free(line);
  65     }
  66 
  67     puts("===");
  68 
  69     trie_print(t);
  70 
  71     trie_destroy(t);
  72 
  73     return 0;
  74 }
test_trie.c

172. Patricia trees

Tries perform well when all keys are short (or are distinguished by short prefixes), but can grow very large if one inserts two keys that have a long common prefix. The reason is that a trie has to store an internal node for every bit of the common prefix until the two keys become distinguishable, leading to long chains of internal nodes each of which has only one child. An optimization known as a Patricia tree (due to Morrison, 1968) eliminates these long chains by having each node store the number of the bit to branch on, like this:

   1 struct patricia_node {
   2     char *key;
   3     int bit;
   4     struct patricia_node *kids[2];
   5 };
   6 
   7 typedef struct patricia_node *Patricia;

Now when searching for a key, instead of using the number of nodes visited so far to figure out which bit to look at, we just read the bit number out of the node. This means in particular that we can skip over any bits that we don't actually branch on. We do however have to be more careful to make sure we don't run off the end of our target key, since it is possible that when skipping over intermediate bits we might skip over some that distinguish our target from all keys in the table, including longer keys. For example, a Patricia tree storing the strings abc and abd will first test bit position 21, since that's the first bit (counting from 0, MSB-first within each byte as in trie.c) where abc and abd differ. This can be trouble if we are looking for a short string like x, which doesn't have a bit 21.

Here's the search code:

   1 int
   2 patricia_contains(Patricia t, const char *key)
   3 {
   4     int key_bits;
   5 
   6     key_bits = BITS_PER_BYTE * (strlen(key)+1);   /* +1 for the nul */
   7 
   8     while(t && !IsLeaf(t)) {
   9         if(t->bit >= key_bits) {
  10             /* can't be there */
  11             return 0;
  12         } else {
  13             t = t->kids[GET_BIT(key, t->bit)];
  14         }
  15     }
  16 
  17     return t && !strcmp(t->key, key);
  18 }

The insertion code is similar in many respects to the insertion code for a trie. The differences are that we never construct a long chain of internal nodes when splitting a leaf (although we do have to scan through both the old and new keys to find the first bit position where they differ), but we may sometimes have to add a new internal node between two previously existing nodes if a new key branches off at a bit position that was previously skipped over.
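
Here is a sketch of what this insertion might look like (makeLeaf, getBitSafe, and this particular patricia_insert are our own reconstruction from the description above; GET_BIT, BITS_PER_BYTE, and IsLeaf are as in trie.c):

    #include <stdlib.h>
    #include <string.h>
    #include <assert.h>

    #define BITS_PER_BYTE (8)
    #define GET_BIT(x, n) ((((x)[(n) / BITS_PER_BYTE]) & (0x1 << (BITS_PER_BYTE - 1 - (n) % BITS_PER_BYTE))) != 0)
    #define IsLeaf(t) ((t)->kids[0] == 0 && (t)->kids[1] == 0)

    struct patricia_node {
        char *key;
        int bit;
        struct patricia_node *kids[2];
    };

    typedef struct patricia_node *Patricia;

    /* bit n of s, treating bits past the end of the string as 0 */
    static int
    getBitSafe(const char *s, int sBits, int n)
    {
        return n < sBits ? GET_BIT(s, n) : 0;
    }

    static Patricia
    makeLeaf(const char *key)
    {
        Patricia t;

        t = malloc(sizeof(*t));
        assert(t);

        t->key = malloc(strlen(key) + 1);
        assert(t->key);
        strcpy(t->key, key);

        t->bit = 0;
        t->kids[0] = t->kids[1] = 0;

        return t;
    }

    Patricia
    patricia_insert(Patricia t, const char *key)
    {
        Patricia p;
        Patricia node;
        Patricia *where;
        int keyBits;
        int b;

        keyBits = BITS_PER_BYTE * (strlen(key) + 1);

        if(t == 0) {
            return makeLeaf(key);
        }

        /* walk down to some leaf, following the key's bits */
        for(p = t; !IsLeaf(p); p = p->kids[getBitSafe(key, keyBits, p->bit)]);

        if(!strcmp(p->key, key)) {
            return t;        /* already present */
        }

        /* find the first differing bit; it lies inside both keys,
         * because nul-terminated strings are prefix-free */
        for(b = 0; GET_BIT(p->key, b) == GET_BIT(key, b); b++);

        /* walk down again, stopping just above the first node that
         * branches on a bit past b (or above a leaf) */
        where = &t;
        while(!IsLeaf(*where) && (*where)->bit < b) {
            where = &(*where)->kids[getBitSafe(key, keyBits, (*where)->bit)];
        }

        /* splice in a new internal node that branches on bit b */
        node = malloc(sizeof(*node));
        assert(node);
        node->key = 0;
        node->bit = b;
        node->kids[GET_BIT(key, b)] = makeLeaf(key);
        node->kids[!GET_BIT(key, b)] = *where;
        *where = node;

        return t;
    }

The second descent relies on the Patricia invariant that all keys stored below a node agree on every bit position smaller than that node's branch bit; this is what guarantees that the displaced subtree really belongs on the side opposite the new leaf.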

In the worst case Patricia trees are much more efficient than tries, in both space (linear in the number of keys instead of linear in the total size of the keys) and time complexity, often needing to examine only a very small number of bits for misses (hits still require a full scan in strcmp to verify the correct key). The only downside of Patricia trees is that since they work on bits, they are not quite perfectly tuned to the byte or word-oriented structure of modern CPUs.

173. Ternary search trees

Ternary search trees were described by Jon Bentley and Bob Sedgewick in an article in the April 1998 issue of Dr. Dobb's Journal.

The basic idea is that each node in the tree stores one character from the key and three child pointers lt, eq, and gt. If the corresponding character in the target is equal to the character in the node, we move to the next character in the target and follow the eq pointer out of the node. If the target is less, follow the lt pointer but stay at the same character. If the target is greater, follow the gt pointer and again stay at the same character. When searching for a string, we walk down the tree until we either reach a node that matches the terminating nul (a hit), or follow a null pointer (a miss).

A TST acts a bit like a 256-way trie, except that instead of storing an array of 256 outgoing pointers, we build something similar to a small binary search tree for the next character. Note that no explicit balancing is done within these binary search trees. From a theoretical perspective, the worst case is that we get a 256-node-deep linked-list equivalent at each step, multiplying our search time by 256 = O(1). In practice, only those characters that actually appear in some key at this stage will show up, so the O(1) is likely to be a small O(1), especially if keys are presented in random order.

TSTs are one of the fastest known data structures for implementing dictionaries using strings as keys, beating both hash tables and tries in most cases. They can be slower than Patricia trees if there are many keys with long matching prefixes; however, a Patricia-like optimization can be applied to give a compressed ternary search tree that works well even in this case. In practice, regular TSTs are usually good enough.

Here is a simple implementation of an insert-only TST. The C code includes two versions of the insert helper routine; the first is the original recursive version and the second is an iterative version generated by eliminating the tail recursion from the first.

   1 typedef struct tst_node *TST;
   2 
   3 #define EMPTY_TST (0)
   4 
   5 /* returns 1 if t contains target */
   6 int tst_contains(TST t, const char *target);
   7 
   8 /* add a new key to a TST */
   9 /* and return the new TST */
  10 TST tst_insert(TST t, const char *key);
  11 
  12 /* free a TST */
  13 void tst_destroy(TST);
tst.h

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 #include "tst.h"
   6 
   7 struct tst_node {
   8     char key;                   /* value to split on */
   9     struct tst_node *lt;        /* go here if target[index] < value */
  10     struct tst_node *eq;        /* go here if target[index] == value */
  11     struct tst_node *gt;        /* go here if target[index] > value */
  12 };
  13 
  14 /* returns 1 if t contains key */
  15 int
  16 tst_contains(TST t, const char *key)
  17 {
  18     assert(key);
  19 
  20     while(t) {
  21         if(*key < t->key) {
  22             t = t->lt;
  23         } else if(*key > t->key) {
  24             t = t->gt;
  25         } else if(*key == '\0') {
  26             return 1;
  27         } else {
  28             t = t->eq;
  29             key++;
  30         }
  31     }
  32 
  33     return 0;
  34 }
  35 
  36 /* original recursive insert */
  37 static void
  38 tst_insert_recursive(TST *t, const char *key)
  39 {
  40     if(*t == 0) {
  41         *t = malloc(sizeof(**t));
  42         assert(*t);
  43         (*t)->key = *key;
  44         (*t)->lt = (*t)->eq = (*t)->gt = 0;
  45     }
  46 
  47     /* now follow search */
  48     if(*key < (*t)->key) {
  49         tst_insert_recursive(&(*t)->lt, key);
  50     } else if(*key > (*t)->key) {
  51         tst_insert_recursive(&(*t)->gt, key);
  52     } else if(*key == '\0') {
  53         /* do nothing, we are done */
  54         ;
  55     } else {
  56         tst_insert_recursive(&(*t)->eq, key+1);
  57     }
  58 }
  59 
  60 /* iterative version of above, since somebody asked */
  61 /* This is pretty much standard tail-recursion elimination: */
  62 /* The whole function gets wrapped in a loop, and recursive
  63  * calls get replaced by assignment */
  64 static void
  65 tst_insert_iterative(TST *t, const char *key)
  66 {
  67     for(;;) {
  68         if(*t == 0) {
  69             *t = malloc(sizeof(**t));
  70             assert(*t);
  71             (*t)->key = *key;
  72             (*t)->lt = (*t)->eq = (*t)->gt = 0;
  73         }
  74 
  75         /* now follow search */
  76         if(*key < (*t)->key) {
  77             t = &(*t)->lt;
  78         } else if(*key > (*t)->key) {
  79             t = &(*t)->gt;
  80         } else if(*key == '\0') {
  81             /* do nothing, we are done */
  82             return;
  83         } else {
  84             t = &(*t)->eq;
  85             key++;
  86         }
  87     }
  88 }
  89 
  90 
  91 /* add a new key to a TST */
  92 /* and return the new TST */
  93 TST
  94 tst_insert(TST t, const char *key)
  95 {
  96     assert(key);
  97 
  98 #ifdef USE_RECURSIVE_INSERT
  99     tst_insert_recursive(&t, key);
 100 #else
 101     tst_insert_iterative(&t, key);
 102 #endif
 103     return t;
 104 }
 105 
 106 /* free a TST */
 107 void
 108 tst_destroy(TST t)
 109 {
 110     if(t) {
 111         tst_destroy(t->lt);
 112         tst_destroy(t->eq);
 113         tst_destroy(t->gt);
 114         free(t);
 115     }
 116 }
tst.c

And here is some test code, almost identical to the test code for tries: test_tst.c.

The Dr. Dobb's article contains additional code for doing deletions and partial matches, plus some optimizations for inserts.

174. More information


CategoryProgrammingNotes CategoryAlgorithmNotes

175. DynamicProgramming

Dynamic programming is a general-purpose AlgorithmDesignTechnique that is most often used to solve CombinatorialOptimization problems. There are two parts to dynamic programming. The first part is a programming technique: dynamic programming is essentially DivideAndConquer run in reverse. As in DivideAndConquer, we solve a big instance of a problem by breaking it up recursively into smaller instances; but instead of carrying out the computation recursively from the top down, we start from the bottom with the smallest instances of the problem, solving each increasingly large instance in turn and storing the result in a table. The second part is a design principle: in building up our table, we are careful always to preserve alternative solutions we may need later, by delaying commitment to particular choices to the extent that we can.

The bottom-up aspect of dynamic programming is most useful when a straightforward recursion would produce many duplicate subproblems. It is most efficient when we can enumerate a class of subproblems that doesn't include too many extraneous cases that we don't need for our original problem.

To take a simple example, suppose that we want to compute the n-th Fibonacci number using the defining recurrence

  • F(n) = F(n-1) + F(n-2); F(1) = F(0) = 1.

A naive approach would simply code the recurrence up directly:

   1 int
   2 fib(int n)
   3 {
   4     if(n < 2) {
   5         return 1;
   6     } else {
   7         return fib(n-1) + fib(n-2);
   8     }
   9 }

The running time of this procedure is easy to compute. The recurrence is

  • T(n) = T(n-1) + T(n-2) + Θ(1),

whose solution is Θ(a^n), where a is the golden ratio 1.6180339887498948482.... This is badly exponential.

176. Memoization

The problem is that we keep recomputing values of fib that we've already computed. We can avoid this by memoization, where we wrap our recursive solution in a memoizer that stores previously-computed solutions in a HashTable. Sensible programming languages will let you write a memoizer once and apply it to arbitrary recursive functions. In less sensible programming languages it is usually easier just to embed the memoization in the function definition itself:

   1 int
   2 memoFib(int n)
   3 {
   4     int ret;
   5 
   6     if(hashContains(FibHash, n)) {
   7         return hashGet(FibHash, n);
   8     } else {
   9         ret = memoFib(n-1) + memoFib(n-2);
  10         hashPut(FibHash, n, ret);
  11         return ret;
  12     }
  13 }

The assumption here is that FibHash is a global hash table that we have initialized to map 0 and 1 to 1. The total cost of running this procedure is O(n), because memoFib is called at most twice for each value k in 0..n.

Memoization is a very useful technique in practice, but it is not popular with algorithm designers because computing the running time of a complex memoized procedure is often much more difficult than computing the time to fill a nice clean table. The use of a hash table instead of an array may also add overhead (and code complexity) that comes out in the constant factors. But it is always the case that a memoized recursive procedure considers no more subproblems than a table-based solution, and it may consider many fewer if we are sloppy about what we put in our table (perhaps because we can't easily predict what subproblems will be useful).
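
When the argument is a small integer, as here, the hash table can be replaced by a plain array, which removes most of that overhead. A sketch (the MAX_FIB bound is ours; larger arguments overflow a 32-bit int):

    #include <assert.h>

    #define MAX_FIB (45)   /* fib(46) overflows a 32-bit int */

    static int fibTable[MAX_FIB + 1];   /* 0 marks "not computed yet" */

    int
    memoFibArray(int n)
    {
        assert(n >= 0 && n <= MAX_FIB);

        if(n < 2) {
            return 1;
        }

        if(fibTable[n] == 0) {
            fibTable[n] = memoFibArray(n - 1) + memoFibArray(n - 2);
        }

        return fibTable[n];
    }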

177. Dynamic programming

Dynamic programming comes to the rescue. Because we know what smaller cases we have to reduce F(n) to, instead of computing F(n) top-down, we compute it bottom-up, hitting all possible smaller cases and storing the results in an array:

   1 int
   2 fib2(int n)
   3 {
   4     int *a;
   5     int i;
   6     int ret;
   7     
   8     if(n < 2) {
   9         return 1;
  10     } else {
  11         a = malloc(sizeof(*a) * (n+1));
  12         assert(a);
  13 
  14         a[0] = a[1] = 1;
  15 
  16         for(i = 2; i <= n; i++) {
  17             a[i] = a[i-1] + a[i-2];
  18         }
  19     }
  20 
  21     ret = a[n];
  22     free(a);
  23     return ret;
  24 }

Notice the recurrence is exactly the same in this version as in our original recursive version, except that instead of computing F(n-1) and F(n-2) recursively, we just pull them out of the array. This is typical of dynamic-programming solutions: often the most tedious editing step in converting a recursive algorithm to dynamic programming is changing parentheses to square brackets. As with memoization, the effect of this conversion is dramatic; what used to be an exponential-time algorithm is now linear-time.
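
In this particular case we can go further: the loop only ever reads the top two entries of the table, so the table can shrink to two variables. A sketch:

    /* constant-space version: keep only the last two values */
    int
    fib3(int n)
    {
        int prev = 1;   /* F(i-1) */
        int cur = 1;    /* F(i) */
        int next;
        int i;

        for(i = 2; i <= n; i++) {
            next = prev + cur;
            prev = cur;
            cur = next;
        }

        return cur;
    }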

177.1. More examples

177.1.1. Longest increasing subsequence

Suppose that we want to compute the longest increasing subsequence of an array. This is a sequence, not necessarily contiguous, of elements from the array such that each is strictly larger than the one before it. Since there are 2^n different subsequences of an n-element array, it will take a while to try all of them by BruteForce.

What makes this problem suitable for dynamic programming is that any prefix of a longest increasing subsequence is a longest increasing subsequence of the part of the array that ends where the prefix ends; if it weren't, we could make the big sequence longer by choosing a longer prefix. So to find the longest increasing subsequence of the whole array, we build up a table of longest increasing subsequences for each initial prefix of the array. At each step, when finding the longest increasing subsequence of elements 0..i, we can just scan through all the possible values for the second-to-last element and read the length of the best possible subsequence ending there out of the table. When the table is complete, we can scan for the best last element and then work backwards to reconstruct the actual subsequence.

This last step requires some explanation. We don't really want to store in table[i] the full longest increasing subsequence ending at position i, because it may be very big. Instead, we store the index of the second-to-last element of this sequence. Since that second-to-last element also has a table entry that stores the index of its predecessor, by following the indices we can generate a subsequence of length O(n), even though we only stored O(1) pieces of information in each table entry.

Here's what the code looks like:

   1 /* compute a longest strictly increasing subsequence of an array of ints */
   2 /* input is array a with given length n */
   3 /* returns length of LIS */
   4 /* If the output pointer is non-null, writes LIS to output pointer. */
   5 /* Caller should provide at least sizeof(int)*n space for output */
   6 /* If there are multiple LIS's, which one is returned is arbitrary. */
   7 unsigned long
   8 longest_increasing_subsequence(const int a[], unsigned long n, int *output);
lis.h

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 
   4 #include "lis.h"
   5 
   6 unsigned long
   7 longest_increasing_subsequence(const int a[], unsigned long n, int *output)
   8 {
   9     struct lis_data {
  10         unsigned long length;             /* length of LIS ending at this point */
  11         unsigned long prev;               /* previous entry in the LIS ending at this point */
  12     } *table;
  13 
  14     unsigned long best;      /* best entry in table */
  15     unsigned long scan;      /* used to generate output */
  16 
  17     unsigned long i;            
  18     unsigned long j;
  19     unsigned long best_length;
  20 
  21     /* special case for empty table */
  22     if(n == 0) return 0;
  23 
  24     table = malloc(sizeof(*table) * n);
  25 
  26     for(i = 0; i < n; i++) {
  27         /* default best is just this element by itself */
  28         table[i].length = 1;
  29         table[i].prev = n;              /* default end-of-list value */
  30 
  31         /* but try all other possibilities */
  32         for(j = 0; j < i; j++) {
  33             if(a[j] < a[i] && table[j].length + 1 > table[i].length) {
  34                 /* we have a winner */
  35                 table[i].length = table[j].length + 1;
  36                 table[i].prev = j;
  37             }
  38         }
  39     }
  40 
  41     /* now find the best of the lot */
  42     best = 0;
  43 
  44     for(i = 1; i < n; i++) {
  45         if(table[i].length > table[best].length) {
  46             best = i;
  47         }
  48     }
  49 
  50     /* table[best].length is now our return value */
  51     /* save it so that we don't lose it when we free table */
  52     best_length = table[best].length;
  53 
  54     /* do we really have to compute the output? */
  55     if(output) {
  56         /* yes :-( */
  57         scan = best;
  58         for(i = 0; i < best_length; i++) {
  59             assert(scan >= 0);
  60             assert(scan < n);
  61 
  62             output[best_length - i - 1] = a[scan];
  63 
  64             scan = table[scan].prev;
  65         }
  66     }
  67 
  68     free(table);
  69 
  70     return best_length;
  71 }
lis.c

A sample program that runs longest_increasing_subsequence on a list of numbers passed in by stdin is given in test_lis.c. Here is a Makefile.

177.1.2. All-pairs shortest paths

Suppose we want to compute the distance between any two points in a graph, where each edge uv has a length l(uv) (+∞ for edges not in the graph) and the distance between two vertices s and t is the minimum over all s-t paths of the total length of the edges. There are various algorithms for doing this for a particular s and t, but there is also a very simple dynamic programming algorithm known as Floyd-Warshall that computes the distance between all n^2 pairs of vertices in Θ(n^3) time. This algorithm will not be described here; see ShortestPath.

177.1.3. Longest common subsequence

Given sequences of characters v and w, v is a subsequence of w if every character in v appears in w in the same order. For example, aaaaa, brac, and badar are all subsequences of abracadabra, but badcar is not. A longest common subsequence (LCS for short) of two sequences x and y is the longest sequence that is a subsequence of both: two longest common subsequences of abracadabra and badcar are badar and bacar.

One can find the LCS of two sequences by BruteForce, but it will take a while; there are 2ⁿ subsequences of a sequence of length n, and for each of these subsequences of the first sequence it will take some additional time to check whether it is a subsequence of the second. It is better to solve the problem using dynamic programming. Having sequences gives an obvious linear structure to exploit: the basic strategy will be to compute LCSs for increasingly long prefixes of the inputs. But with two sequences we will have to consider prefixes of both, which will give us a two-dimensional table where rows correspond to prefixes of sequence x and columns correspond to prefixes of sequence y.

The recursive decomposition that makes this technique work looks like this: the LCS of x[1..i] and y[1..j] is the longest of

  • LCS(x[1..i-1], y[1..j-1]) + 1 if x[i] = y[j],
  • LCS(x[1..i-1], y[1..j]), or
  • LCS(x[1..i], y[1..j-1]).

The idea is that we either have a new matching character we couldn't use before (the first case), or we have an LCS that doesn't use one of x[i] or y[j] (the remaining cases). In each case the recursive call to LCS involves a shorter prefix of x or y. So we can fill in these values in a table, as long as we are careful to make sure that the shorter prefixes are always filled first. Here's a short C program that does this:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 #include <string.h>
   5 #include <limits.h>
   6 
   7 /* compute longest common subsequence of argv[1] and argv[2] */
   8 
   9 /* computes longest common subsequence of x and y, writes result to lcs */
  10 /* lcs must be pre-allocated by caller to 1 + minimum length of x or y; x and y must be non-empty */
  11 void
  12 longestCommonSubsequence(const char *x, const char *y, char *lcs)
  13 {
  14     int xLen;
  15     int yLen;
  16     int i;             /* position in x */
  17     int j;             /* position in y */
  18 
  19     xLen = strlen(x);
  20     yLen = strlen(y);
  21 
  22     /* best choice at each position */
  23     /* length gives length of LCS for these prefixes */
  24     /* prev points to previous substring */
  25     /* newChar if non-null is new character */
  26     struct choice {
  27         int length;
  28         struct choice *prev;
  29         char newChar;
  30     } best[xLen][yLen];
  31 
  32     for(i = 0; i < xLen; i++) {
  33         for(j = 0; j < yLen; j++) {
   34             /* the default is the empty common subsequence */
  35             best[i][j].length = 0;
  36             best[i][j].prev = 0;
  37             best[i][j].newChar = 0;
  38 
  39             /* if we have a match, try adding new character */
  40             /* this is always better than the nothing we started with */
  41             if(x[i] == y[j]) {
  42                 best[i][j].newChar = x[i];
  43                 if(i > 0 && j > 0) {
  44                     best[i][j].length = best[i-1][j-1].length + 1;
  45                     best[i][j].prev = &best[i-1][j-1];
  46                 } else {
  47                     best[i][j].length = 1;
  48                 }
  49             }
  50 
  51             /* maybe we can do even better by ignoring a new character */
  52             if(i > 0 && best[i-1][j].length > best[i][j].length) {
  53                 /* throw away a character from x */
  54                 best[i][j].length = best[i-1][j].length;
  55                 best[i][j].prev = &best[i-1][j];
  56                 best[i][j].newChar = 0;
  57             }
  58 
  59             if(j > 0 && best[i][j-1].length > best[i][j].length) {
   60                 /* throw away a character from y */
  61                 best[i][j].length = best[i][j-1].length;
  62                 best[i][j].prev = &best[i][j-1];
  63                 best[i][j].newChar = 0;
  64             }
  65 
  66         }
  67     }
  68 
  69     /* reconstruct string working backwards from best[xLen-1][yLen-1] */
  70     int outPos;        /* position in output string */
  71     struct choice *p;  /* for chasing linked list */
  72 
  73     outPos = best[xLen-1][yLen-1].length;
  74     lcs[outPos--] = '\0';
  75 
  76     for(p = &best[xLen-1][yLen-1]; p; p = p->prev) {
  77         if(p->newChar) {
  78             assert(outPos >= 0);
  79             lcs[outPos--] = p->newChar;
  80         }
  81     }
  82 }
  83 
  84 int
  85 main(int argc, char **argv)
  86 {
  87     if(argc != 3) {
  88         fprintf(stderr, "Usage: %s string1 string2\n", argv[0]);
  89         return 1;
  90     }
  91 
  92     char output[strlen(argv[1]) + 1];
  93 
  94     longestCommonSubsequence(argv[1], argv[2], output);
  95 
  96     puts(output);
  97 
  98     return 0;
  99 }
lcs.c

The whole thing takes O(nm) time, where n and m are the lengths of x and y.

178. Dynamic programming: algorithmic perspective

These are mostly old notes from CS365; CS223 students should feel free to ignore this part.

178.1. Preserving alternatives

Like any DivideAndConquer algorithm, a dynamic programming algorithm needs to have some notion of what problems are big and what are small, so that we know that our recursive decomposition is making progress. Often a dynamic programming algorithm works by walking up this problem ordering. The simplest case (as in Fib2 above) is when the ordering is linear; every problem has an index 1, 2, 3, ..., and we simply solve problems in increasing order. Dynamic programming naturally lends itself to any problem we can put in a line, and a common application in combinatorial optimization is solving problems with an underlying temporal structure, where we want to maximize our profit after n steps by first figuring out how to maximize our profit after n-1. However, in figuring out the n-1 step solution, we may have to preserve multiple potential solutions, because the choices we make during these steps may limit our options later. Preserving these alternatives is what distinguishes dynamic programming from the GreedyMethod, where we just grab what we can now and hope for the best.

To take a simple example, consider the following problem: a large chemical plant can be in any of states a, b, c, d, etc. The plant starts at the beginning of the day in state a (all vats empty and clean, all workers lined up at the entry gate) and ends the day n steps later in the same state. In between, the chemical plant makes its owners money by moving from state to state (for example, moving from a to b might involve filling vat number 126 with toluene and moving from state r to state q might involve dumping the contents of vat 12 into vat 17 and stirring the resulting mixture vigorously). We don't care about the details of the particular operations, except that we assume that there is a profit function p(i,j) that tells us how much money we make (or lose, if p(i,j) is negative) when we move from state i to state j. Our goal is to find a sequence of n+1 states with state a at the start and end, that maximizes our total profit.

This is a good candidate for solution using dynamic programming, and our approach will be to keep around increasingly large partial sequences that tell us what to do for the first k steps. The only tricky part is that certain profitable partial sequences might lead us to final states from which it is very expensive to get back to state a (e.g., chemical plant is on fire and venting toxic fumes over most of New England while board of directors flees to Brazil carrying suitcases full of $100 bills). But on the other hand, if we know what the final state of a partial sequence is, we don't particularly care how we got there. So we can recursively decompose the problem of finding the best sequence of n steps into finding many best sequences of n-1 steps:

Plan(n, final state):
  for each state s:
    BestPlan[n-1, s] = Plan(n-1, s)
  Find s that maximizes profit of BestPlan[n-1, s] + p(s, final state)
  return BestPlan[n-1,s] with final state appended

This is easily turned upside-down to get a dynamic programming algorithm:

Plan(n, final state):
  BestPlan[0,a] = "a"
  for i = 1 to n:
    for each s:
      Find s1 that maximizes profit of BestPlan[i-1,s1] + p(s1,s)
      BestPlan[i,s] = BestPlan[i-1,s1] with s appended
  return BestPlan[n, final state]

The running time of this algorithm is easily seen to be Θ(nm²), where n is the number of steps in the schedule and m is the number of states. Note that we can optimize the storage requirements by keeping around only two rows of the BestPlan array at a time.
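
As a concrete illustration, here is a minimal C sketch of that table-filling loop, keeping only two rows and tracking just the best profit rather than the plan itself; the function name, the representation of p as a two-dimensional array, and the use of state 0 for a are our own choices, not part of the original pseudocode:

#include <limits.h>

/* best total profit over n steps, starting and ending in state 0 (= a);
 * p[i][j] = profit for moving from state i to state j, m = number of states */
long
best_profit(int n, int m, long p[m][m])
{
    long best[2][m];        /* two rows of the BestPlan profit table */
    int cur = 0;            /* index of the row for the current step */
    int i, s, s1;

    for(s = 0; s < m; s++) best[0][s] = LONG_MIN;   /* -infinity = unreachable */
    best[0][0] = 0;         /* step 0: we are in state a with no profit yet */

    for(i = 1; i <= n; i++) {
        int prev = cur;
        cur = !cur;
        for(s = 0; s < m; s++) {
            best[cur][s] = LONG_MIN;
            for(s1 = 0; s1 < m; s1++) {
                if(best[prev][s1] != LONG_MIN
                        && best[prev][s1] + p[s1][s] > best[cur][s]) {
                    best[cur][s] = best[prev][s1] + p[s1][s];
                }
            }
        }
    }

    return best[cur][0];    /* must end back in state a */
}

Recovering the actual plan would require keeping all n rows (or parent pointers), as in the pseudocode above.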

The pattern here generalizes to other combinatorial optimization problems: the subproblems we consider are finding optimal assignments to increasingly large subsets of the variables, and we have to keep around an optimal representative of each set of assignments that interact differently with the rest of the variables. Typically this works best when the partial assignment has a small "frontier" (like the last state of the chemical plant) that is all that is visible to later operations.

178.2. Knapsack

In the knapsack problem, we are offered the opportunity to carry away items 1..n, where item i has value p[i] and weight w[i]. We'd like to maximize the total value of the items we take, but the total weight of the items we take is limited by the capacity K of our knapsack. How can we pick the best subset?

Greedy algorithms will let us down here: suppose item 1 is worth $1000 and weighs K, while items 2..1001 are worth $100 each and weigh K/10000. An algorithm that grabs the most valuable item first will fill the knapsack and get a profit of $1000 vs $100,000 for the optimal solution. It is tempting to try instead to grab first the items with the highest profit/weight ratio, but this too fails in many cases: consider two items, one of which weighs K/100000 and has value $1, and the other of which weighs K but has value $10,000. We can only take one, and the first item is worth ten times as much per pound as the second.

There are two ways to solve knapsack using dynamic programming. One works best when the weights (or the size of the knapsack) are small integers, and the other works best when the profits are small integers. Both process the items one at a time, maintaining a list of the best choices of the first k items yielding a particular total weight or a particular total profit.

Here's the version that tracks total weight:

KnapsackByWeight(p, w, K):
  // MaxProfit[i][j] is the maximum profit on subsets of items 1..i
  // with total weight exactly j
  MaxProfit = array[0..n][0..K] initialized to -infinity

  MaxProfit[0][0] = 0

  for i = 1 to n:
    for j = 0..K:
      MaxProfit[i][j] = MaxProfit[i-1][j]
      if j >= w[i]:
        MaxProfit[i][j] = max(MaxProfit[i][j], MaxProfit[i-1][j-w[i]] + p[i])
  
  return max over all j of MaxProfit[n][j]

This runs in time Θ(nK). Note that even though this time bound is polynomial in the numerical value of K, it is exponential in the size of the input, since the size of K represented in binary is ⌈lg K⌉ bits.
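
Here is a minimal C sketch of this algorithm, with the two-dimensional table collapsed to a single row (scanning j downward so that each item is used at most once); the function name and argument order are our own invention:

#include <limits.h>

/* returns the maximum total value of a subset of items 0..n-1 with
 * values p[i], weights w[i], and total weight at most K */
int
knapsack_by_weight(int n, const int p[], const int w[], int K)
{
    int maxProfit[K+1];     /* maxProfit[j] = best value with total weight exactly j */
    int i, j, best;

    maxProfit[0] = 0;
    for(j = 1; j <= K; j++) maxProfit[j] = INT_MIN;   /* -infinity */

    for(i = 0; i < n; i++) {
        /* scan j downward so item i is counted at most once */
        for(j = K; j >= w[i]; j--) {
            if(maxProfit[j - w[i]] != INT_MIN
                    && maxProfit[j - w[i]] + p[i] > maxProfit[j]) {
                maxProfit[j] = maxProfit[j - w[i]] + p[i];
            }
        }
    }

    best = 0;
    for(j = 0; j <= K; j++) {
        if(maxProfit[j] > best) best = maxProfit[j];
    }

    return best;
}

The KnapsackByProfit variant below is symmetric: swap the roles of p and w, initialize to +infinity, minimize instead of maximize, and return the largest profit achievable with weight at most K.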

We can also solve the problem by tracking the minimum weight needed to obtain a certain profit. This variant is used less often than the previous one, because in practice it is less likely that the total profit on the items is small than that the size of the knapsack is small.

KnapsackByProfit(p, w, K):
  // MinWeight[i][j] is the minimum weight of subsets of items 1..i
  // with total profit j
  MinWeight = array[0..n][0..sum p[i]] initialized to +infinity

  MinWeight[0][0] = 0

  for i = 1 to n:
    for j = 0..sum p[i]:
      MinWeight[i][j] = MinWeight[i-1][j]
      if j >= p[i]:
        MinWeight[i][j] = min(MinWeight[i][j], MinWeight[i-1][j-p[i]] + w[i])
  
  return maximum j for which MinWeight[n][j] <= K

The running time of this algorithm is Θ(n·∑p[i]).

178.3. Non-linear structures

The main thing we need for dynamic programming is some sort of structure that is ordered (so we can build solutions bottom-up) and that doesn't require keeping track of too many alternatives (so that we don't just end up reimplementing BruteForce search).

178.3.1. Trees

One way to get both of these properties is to do dynamic programming on a tree: a tree is naturally ordered by depth (or height), and for many problems the visible part of the solution to a subproblem is confined to a small frontier consisting only of the root of a subtree.

For example, suppose we want to solve minimum vertex cover on a tree. A vertex cover is a set of marked vertices in a graph that covers every edge; in a tree this means that every node or its parent is marked. A minimum vertex cover is a vertex cover that marks the fewest nodes.

Here's a recursive algorithm for finding the size of a minimum vertex cover in a tree, based on the very simple fact that the root must either be put in or not:

VC(root, mustIncludeRoot):
  if mustIncludeRoot:
    return 1 + sum over all children of VC(child, false)
  else:
    withRoot = 1 + sum over all children of VC(child, false)
    withoutRoot = sum over all children of VC(child, true)
    return min(withRoot, withoutRoot)

The running time of this algorithm depends on the structure of the tree in a complicated way, but we can easily see that it will grow at least exponentially in the depth. This is a job for dynamic programming.

The dynamic programming version computes both VC(root, false) and VC(root, true) simultaneously, avoiding the double call for each child. The structure of the resulting algorithm does not look much like the table structures we typically see elsewhere, but the pairs of values passed up through the tree are in fact acting as very small vestigial tables.

DynamicVC(root):

  for each child c:
    Best[c][0], Best[c][1] = DynamicVC(c)

  withoutRoot = sum over all c of Best[c][1]
  withRoot = 1 + sum over all c of min(Best[c][0], Best[c][1])

  return (withoutRoot, withRoot)

The running time of this algorithm is easily seen to be Θ(n).
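
Here is a minimal C sketch of DynamicVC; the struct node representation (children stored in an array) is our own assumption:

/* hypothetical tree node: children stored in an array */
struct node {
    int num_children;
    struct node **child;
};

/* computes sizes of the best covers of the subtree rooted at root
 * that exclude (*without) and include (*with) the root itself */
static void
vc_helper(const struct node *root, int *without, int *with)
{
    int i;
    int cwithout, cwith;

    *without = 0;
    *with = 1;              /* pay 1 for marking the root */

    for(i = 0; i < root->num_children; i++) {
        vc_helper(root->child[i], &cwithout, &cwith);
        *without += cwith;  /* root unmarked: every child must be marked */
        *with += cwithout < cwith ? cwithout : cwith;   /* root marked: child may go either way */
    }
}

/* size of a minimum vertex cover of the tree rooted at root */
int
min_vertex_cover(const struct node *root)
{
    int without, with;

    vc_helper(root, &without, &with);

    return without < with ? without : with;
}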

178.3.2. Treewidth

The important property of trees that makes them suitable for dynamic programming is that they have small separators: if we cut out one vertex, the tree disintegrates into subtrees that only interact (for certain problems) through the missing vertex. We can generalize this property to arbitrary graphs through the notion of treewidth.

Given a graph G = (V,E), a tree decomposition of G is a set of bags X = {X₁, X₂, ..., Xₙ} of vertices in G, together with a tree structure T on the bags satisfying the following axioms:

  1. The union of the bags equals V.
  2. For every edge (u,v) in E, there is a bag containing both u and v.
  3. If there is a path from Xᵢ through Xⱼ to Xₖ in T, then Xᵢ ∩ Xₖ ⊆ Xⱼ. (Equivalently: the set of all bags that contain any given vertex v forms a connected subtree of T.)

The width w(X,T) of a tree decomposition is the size of the largest bag. The treewidth tw(G) of a graph G is min(w(X,T))-1 over all tree decompositions of G. If we limit ourselves to tree decompositions in which T is a path, we get pathwidth instead.

178.3.2.1. Examples

  • If G is a tree, make each edge a bag; rooting G arbitrarily, put (Xᵢ,Xⱼ) ∈ T when the edge for Xⱼ is the parent edge of the edge for Xᵢ. (Simply joining all bags that share a vertex would create cycles, e.g. around the center of a star.) This is a width 2 tree decomposition of G, giving treewidth 2-1 = 1. (Making trees have treewidth 1 is why we subtract the 1.)

  • A k×m grid has treewidth min(k,m). Proof: Suppose k < m. Label the vertices v(i,j) where i ranges over 1..k and j ranges over 1..m. Construct a path of bags of size k+1 as follows: Bag 1 contains all k vertices v(i,1) in column 1 plus v(k,2). Bag 2 drops v(k,1) and adds v(k-1,2). Bag 3 drops v(k-1,1) and adds v(k-2,2). Continue until we have removed all column-1 vertices except v(1,1), and then repeat the process starting with column 2. We've covered every edge (most of them many times) and the subtree condition holds, since once we remove some vertex v(i,j) we never put it back. Since each bag has k+1 vertices we get the desired treewidth. Generalization: the same tree decomposition works if we add edges from v(i,j) to v(i+1,j+1).

  • Any connected subgraph of a graph of treewidth k has treewidth at most k. Proof: Use the tree decomposition of the parent graph, removing any vertices that don't appear in the subgraph. The union axiom still holds; we have only removed edges, so the edge axiom still holds; and we have only removed vertices, so the collections of bags containing any surviving vertex are still subtrees.
  • Any cycle has treewidth 2. Proof: embed it in a 2×m grid with the extra diagonal edges.

178.3.2.2. Treewidth and dynamic programming

Idea: Reconstruct the tree decomposition as a rooted binary tree, then use separation properties to argue that we can build up solutions to e.g. independent set, graph coloring, Hamiltonian circuit, etc. recursively through the tree.

Problem: This whole process is a pain in the neck, and it's not clear that treewidth is any more intuitive than just having small separators. So I suspect we will not be covering this in lecture.


CategoryAlgorithmNotes CategoryProgrammingNotes

179. C/Graphs

These are notes on implementing graphs and graph algorithms in C. For a general overview of graphs, see GraphTheory. For pointers to specific algorithms on graphs, see GraphAlgorithms.

180. Graphs

A graph consists of a set of nodes or vertices together with a set of edges or arcs where each edge joins two vertices. Unless otherwise specified, a graph is undirected: each edge is an unordered pair {u,v} of vertices, and we don't regard either of the two vertices as having a distinct role from the other. However, it is more common in computing to consider directed graphs or digraphs in which edges are ordered pairs (u,v); here the vertex u is the source of the edge and vertex v is the sink or target of the edge. Directed edges are usually drawn as arrows and undirected edges as curves or line segments; see GraphTheory for examples. It is always possible to represent an undirected graph as a directed graph where each undirected edge {u,v} becomes two oppositely directed edges (u,v) and (v,u).

Given an edge (u,v), the vertices u and v are said to be incident to the edge and adjacent to each other. The number of vertices adjacent to a given vertex u is the degree of u; this can be divided into the out-degree (number of vertices v such that (u,v) is an edge) and the in-degree (number of vertices v such that (v,u) is an edge). A vertex v adjacent to u is called a neighbor of u, and (in a directed graph) is a predecessor of u if (v,u) is an edge and a successor of u if (u,v) is an edge. We will allow a node to be its own predecessor and successor.

181. Why graphs are useful

Graphs can be used to model any situation where we have things that are related to each other in pairs; for example, all of the following can be represented by graphs:

Family trees
Nodes are members, with an edge from each parent to each of their children.
Transportation networks
Nodes are airports, intersections, ports, etc. Edges are airline flights, one-way roads, shipping routes, etc.
Assignments

Suppose we are assigning classes to classrooms. Let each node be either a class or a classroom, and put an edge from a class to a classroom if the class is assigned to that room. This is an example of a bipartite graph, where the nodes can be divided into two sets S and T and all edges go from S to T.

182. Operations on graphs

What would we like to do to graphs? Generally, we first have to build a graph by starting with a set of nodes and adding in any edges we need, and then we want to extract information from it, such as "Is this graph connected?", "What is the shortest path in this graph from s to t?", or "How many edges can I remove from this graph before some nodes become unreachable from other nodes?" There are standard algorithms for answering all of these questions; the information these algorithms need is typically (a) given a vertex u, what successors does it have; and sometimes (b) given vertices u and v, does the edge (u,v) exist in the graph?

183. Representations of graphs

A good graph representation will allow us to answer one or both of these questions quickly. There are generally two standard representations of graphs that are used in graph algorithms, depending on which question is more important.

For both representations, we simplify the representation task by insisting that vertices be labeled 0, 1, 2, ..., n-1, where n is the number of vertices in the graph. If we have a graph with different vertex labels (say, airport codes), we can enforce an integer labeling by a preprocessing step where we assign integer labels, and then translate the integer labels back into more useful user labels afterwards. The preprocessing step can usually be done in O(n) time, which is likely to be smaller than the cost of whatever algorithm we are running on our graph, and the savings in code complexity and running time from working with just integer labels will pay this cost back many times over.
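
As an illustration of the preprocessing step, here is a hypothetical sketch that assigns integer labels on first sight; a linear scan keeps the example short, though reaching the O(n) bound mentioned above would require a hash table:

#include <string.h>

/* map label to its integer id, assigning a new id the first time a
 * label is seen; labels[] holds the *count labels seen so far and
 * must be allocated large enough by the caller */
int
label_to_id(const char *label, const char *labels[], int *count)
{
    int i;

    for(i = 0; i < *count; i++) {
        if(strcmp(labels[i], label) == 0) return i;
    }

    labels[*count] = label;
    return (*count)++;
}

Translating back afterwards is just an array lookup: labels[i] is the user label for vertex i.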

183.1. Adjacency matrices

An adjacency matrix is just a matrix a where a[i][j] is 1 if (i,j) is an edge in the graph and 0 otherwise. It's easy to build an adjacency matrix, and adding or testing for the existence of an edge takes O(1) time. The downsides of adjacency matrices are that enumerating the outgoing edges from a vertex takes O(n) time even if there aren't very many, and the O(n²) space cost is high for "sparse graphs," those with far fewer than n² edges.
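
A minimal sketch of such a representation (the function name is our own):

#include <stdlib.h>
#include <assert.h>

/* allocate an n-by-n adjacency matrix with no edges */
char **
adj_matrix_create(int n)
{
    char **a;
    int i;

    a = malloc(sizeof(char *) * n);
    assert(a);

    for(i = 0; i < n; i++) {
        a[i] = calloc(n, 1);    /* one byte per potential edge, all zero */
        assert(a[i]);
    }

    return a;
}

/* then a[u][v] = 1 adds edge (u,v), and testing a[u][v] checks for it,
 * both in O(1) time */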

183.2. Adjacency lists

An adjacency list representation of a graph creates a list of successors for each node u. These lists may be represented as linked lists (the typical assumption in algorithms textbooks), or in languages like C may be represented by variable-length arrays. The cost for adding an edge is still O(1), but testing for the existence of an edge (u,v) rises to O(d⁺(u)), where d⁺(u) is the out-degree of u (i.e., the length of the list of u's successors). The cost of enumerating the successors of u is also O(d⁺(u)), which is clearly the best possible since it takes that long just to write them all down. Finding predecessors of a node u is extremely expensive, requiring looking through every list of every node in time O(n+m), where m is the total number of edges.

Adjacency lists are thus most useful when we mostly want to enumerate outgoing edges of each node. This is common in search tasks, where we want to find a path from one node to another or compute the distances between pairs of nodes. If other operations are important, we can optimize them by augmenting the adjacency list representation; for example, using sorted arrays for the adjacency lists reduces the cost of edge existence testing to O(log d⁺(u)), and adding a second copy of the graph with reversed edges lets us find all predecessors of u in O(d⁻(u)) time, where d⁻(u) is u's in-degree.

Adjacency lists also require much less space than adjacency matrices for sparse graphs: O(n+m) vs O(n²). For this reason adjacency lists are more commonly used than adjacency matrices.

183.2.1. An implementation

Here is an implementation of a basic graph type using adjacency lists.

   1 /* basic directed graph type */
   2 
   3 typedef struct graph *Graph;
   4 
   5 /* create a new graph with n vertices labeled 0..n-1 and no edges */
   6 Graph graph_create(int n);
   7 
   8 /* free all space used by graph */
   9 void graph_destroy(Graph);
  10 
  11 /* add an edge to an existing graph */
  12 /* doing this more than once may have unpredictable results */
  13 void graph_add_edge(Graph, int source, int sink);
  14 
  15 /* return the number of vertices/edges in the graph */
  16 int graph_vertex_count(Graph);
  17 int graph_edge_count(Graph);
  18 
  19 /* return the out-degree of a vertex */
  20 int graph_out_degree(Graph, int source);
  21 
   22 /* return 1 if edge (source, sink) exists, 0 otherwise */
  23 int graph_has_edge(Graph, int source, int sink);
  24 
  25 /* invoke f on all edges (u,v) with source u */
  26 /* supplying data as final parameter to f */
  27 /* no particular order is guaranteed */
  28 void graph_foreach(Graph g, int source,
  29         void (*f)(Graph g, int source, int sink, void *data),
  30         void *data);
graph.h

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 
   4 #include "graph.h"
   5 
   6 /* basic directed graph type */
   7 /* the implementation uses adjacency lists
   8  * represented as variable-length arrays */
   9 
  10 /* these arrays may or may not be sorted: if one gets long enough
  11  * and you call graph_has_edge on its source, it will be */
  12 
  13 struct graph {
  14     int n;              /* number of vertices */
  15     int m;              /* number of edges */
  16     struct successors {
  17         int d;          /* number of successors */
  18         int len;        /* number of slots in array */
  19         char is_sorted; /* true if list is already sorted */
  20         int list[1];    /* actual list of successors */
  21     } *alist[1];
  22 };
  23 
  24 /* create a new graph with n vertices labeled 0..n-1 and no edges */
  25 Graph
  26 graph_create(int n)
  27 {
  28     Graph g;
  29     int i;
  30 
  31     g = malloc(sizeof(struct graph) + sizeof(struct successors *) * (n-1));
  32     assert(g);
  33 
  34     g->n = n;
  35     g->m = 0;
  36 
  37     for(i = 0; i < n; i++) {
  38         g->alist[i] = malloc(sizeof(struct successors));
  39         assert(g->alist[i]);
  40 
  41         g->alist[i]->d = 0;
  42         g->alist[i]->len = 1;
   43         g->alist[i]->is_sorted = 1;
  44     }
  45     
  46     return g;
  47 }
  48 
  49 /* free all space used by graph */
  50 void
  51 graph_destroy(Graph g)
  52 {
  53     int i;
  54 
  55     for(i = 0; i < g->n; i++) free(g->alist[i]);
  56     free(g);
  57 }
  58 
  59 /* add an edge to an existing graph */
  60 void
  61 graph_add_edge(Graph g, int u, int v)
  62 {
  63     assert(u >= 0);
  64     assert(u < g->n);
  65     assert(v >= 0);
  66     assert(v < g->n);
  67 
  68     /* do we need to grow the list? */
  69     while(g->alist[u]->d >= g->alist[u]->len) {
  70         g->alist[u]->len *= 2;
  71         g->alist[u] =
  72             realloc(g->alist[u], 
  73                 sizeof(struct successors) + sizeof(int) * (g->alist[u]->len - 1));
  74     }
  75 
  76     /* now add the new sink */
  77     g->alist[u]->list[g->alist[u]->d++] = v;
  78     g->alist[u]->is_sorted = 0;
  79 
  80     /* bump edge count */
  81     g->m++;
  82 }
  83 
  84 /* return the number of vertices in the graph */
  85 int
  86 graph_vertex_count(Graph g)
  87 {
  88     return g->n;
  89 }
  90 
   91 /* return the number of edges in the graph */
  92 int
  93 graph_edge_count(Graph g)
  94 {
  95     return g->m;
  96 }
  97 
  98 /* return the out-degree of a vertex */
  99 int
 100 graph_out_degree(Graph g, int source)
 101 {
 102     assert(source >= 0);
 103     assert(source < g->n);
 104 
 105     return g->alist[source]->d;
 106 }
 107 
 108 /* when we are willing to call bsearch */
 109 #define BSEARCH_THRESHOLD (10)
 110 
 111 static int
 112 intcmp(const void *a, const void *b)
 113 {
 114     return *((const int *) a) - *((const int *) b);
 115 }
 116 
  117 /* return 1 if edge (source, sink) exists, 0 otherwise */
 118 int
 119 graph_has_edge(Graph g, int source, int sink)
 120 {
 121     int i;
 122 
 123     assert(source >= 0);
 124     assert(source < g->n);
 125     assert(sink >= 0);
 126     assert(sink < g->n);
 127 
 128     if(graph_out_degree(g, source) >= BSEARCH_THRESHOLD) {
 129         /* make sure it is sorted */
 130         if(! g->alist[source]->is_sorted) {
 131             qsort(g->alist[source]->list,
 132                     g->alist[source]->d,
 133                     sizeof(int),
 134                     intcmp);
 135         }
 136         
 137         /* call bsearch to do binary search for us */
 138         return 
 139             bsearch(&sink,
 140                     g->alist[source]->list,
 141                     g->alist[source]->d,
 142                     sizeof(int),
 143                     intcmp)
 144             != 0;
 145     } else {
 146         /* just do a simple linear search */
 147         /* we could call lfind for this, but why bother? */
 148         for(i = 0; i < g->alist[source]->d; i++) {
 149             if(g->alist[source]->list[i] == sink) return 1;
 150         }
 151         /* else */
 152         return 0;
 153     }
 154 }
 155 
 156 /* invoke f on all edges (u,v) with source u */
 157 /* supplying data as final parameter to f */
 158 void
 159 graph_foreach(Graph g, int source,
 160     void (*f)(Graph g, int source, int sink, void *data),
 161     void *data)
 162 {
 163     int i;
 164 
 165     assert(source >= 0);
 166     assert(source < g->n);
 167 
 168     for(i = 0; i < g->alist[source]->d; i++) {
 169         f(g, source, g->alist[source]->list[i], data);
 170     }
 171 }
graph.c

And here is some test code: test_graph.c.
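
For a taste of how the interface fits together, here is a tiny usage sketch of our own (not taken from test_graph.c):

#include <stdio.h>

#include "graph.h"

static void
print_edge(Graph g, int source, int sink, void *data)
{
    printf("%d -> %d\n", source, sink);
}

int
main(int argc, char **argv)
{
    Graph g;

    g = graph_create(3);

    graph_add_edge(g, 0, 1);
    graph_add_edge(g, 0, 2);

    printf("out-degree of 0: %d\n", graph_out_degree(g, 0));
    printf("has edge (0,1): %d\n", graph_has_edge(g, 0, 1));

    graph_foreach(g, 0, print_edge, 0);

    graph_destroy(g);

    return 0;
}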

183.3. Implicit representations

For some graphs, it may not make sense to represent them explicitly. An example might be the word-search graph from CS223/2005/Assignments/HW10, which consists of all words in a dictionary with an edge between any two words that differ only by one letter. In such a case, rather than building an explicit data structure containing all the edges, we might generate edges as needed when computing the neighbors of a particular vertex. This gives us an implicit or procedural representation of a graph.

Implicit representations require the ability to return a vector or list of values from the neighborhood-computing function; some ways of doing this are described in C/Iterators.
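
As a sketch of how this might look in the callback style of graph_foreach above, here is a hypothetical neighbor enumerator for the one-letter-change graph; in_dictionary is assumed to be provided elsewhere, and words are assumed to be lowercase and of bounded length:

#include <string.h>

/* callback invoked once per neighboring word */
typedef void (*neighbor_fn)(const char *neighbor, void *data);

/* assumed to exist elsewhere: dictionary membership test */
int in_dictionary(const char *word);

/* invoke f on every word that differs from word in exactly one letter */
void
word_graph_foreach_neighbor(const char *word, neighbor_fn f, void *data)
{
    char buf[64];
    size_t i;
    char c;

    strncpy(buf, word, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for(i = 0; buf[i] != '\0'; i++) {
        char old = buf[i];

        for(c = 'a'; c <= 'z'; c++) {
            if(c != old) {
                buf[i] = c;
                if(in_dictionary(buf)) f(buf, data);
            }
        }

        buf[i] = old;   /* restore before moving to the next position */
    }
}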

184. Searching for paths in a graph

A path is a sequence of vertices v₁, v₂, ..., vₖ where each pair (vᵢ, vᵢ₊₁) is an edge. Often we want to find a path from a source vertex s to a target vertex t, or more generally to detect which vertices are reachable from a given source vertex s. We can solve these problems by using any of several standard graph search algorithms, of which the simplest and most commonly used are DepthFirstSearch and BreadthFirstSearch.

Both of these search algorithms are a special case of a more general algorithm for growing a directed tree in a graph rooted at a given node s. Here we are using tree as a graph theorist would, to mean a set of k nodes joined by k-1 edges; this is similar to trees used in data structures except that there are no limits on the number of children a node can have and no ordering constraints within the tree.

The general tree-growing algorithm might be described as follows:

  1. Start with a tree consisting of just s.
  2. If there is at least one edge that leaves the tree (i.e. goes from a node in the current tree to a node outside the current tree), pick the "best" such edge and add it and its sink to the tree.
  3. Repeat step 2 until no edges leave the tree.

Practically, steps 2 and 3 are implemented by having some sort of data structure that acts as a bucket for unprocessed edges. When a new node is added to the tree, all of its outgoing edges are thrown into the bucket. The "best" outgoing edge is obtained by applying some sort of pop, dequeue, or delete-min operation to the bucket, depending on which it provides; if this edge turns out to be an internal edge of the tree (maybe we added its sink after putting it in the bucket), we throw it away. Otherwise we mark the edge and its sink as belonging to the tree and repeat.

The output of the generic tree-growing algorithm typically consists of (a) marks on all the nodes that are reachable from s, and (b) for each such node v, a parent pointer back to the source of the edge that brought v into the tree. Often these two values can be combined by using a null parent pointer to represent the absence of a mark (this usually requires making the root point to itself so that we know it's in the tree). Another value that may be useful is a table showing the order in which nodes were added to the tree. For even more possibilities see DepthFirstSearch.

What kind of tree we get depends on what we use for the bucket---specifically, on what edge is returned when we ask for the "best" edge. Two easy cases are:

  1. The bucket is a stack. When we ask for an outgoing edge, we get the last edge inserted. This has the effect of running along as far as possible through the graph before backtracking, since we always keep going from the last node if possible. The resulting algorithm is called DepthFirstSearch and yields a DepthFirstSearch tree. If we don't care about the lengths of the paths we consider, DepthFirstSearch is a perfectly good algorithm for testing connectivity, and has several other useful properties (described on the algorithm's own page).

  2. The bucket is a queue. Now when we ask for an outgoing edge, we get the first edge inserted. This favors edges that are close to the root: we don't start considering edges from nodes adjacent to the root until we have already added all the root's successors to the tree, and similarly we don't start considering edges at distance k until we have already added all the closer nodes to the tree. This gives BreadthFirstSearch, which constructs a shortest-path tree in which every path from the root to a node in the tree has the minimum length.

Structurally, these algorithms are almost completely identical; indeed, if we organize the stack/queue so that it can pop from both ends, we can switch between DepthFirstSearch and BreadthFirstSearch just by choosing one operation or another. This is what is done in the implementation below. Since it's ugly to have a flag parameter to a function that radically changes its behavior, the combined search function is wrapped inside two separate functions dfs and bfs that are exported to the outside of the module.

The running time of either algorithm is very fast: we pay O(1) per vertex in setup costs and O(1) per edge during the search (assuming the input is in adjacency-list form), giving a linear O(n+m) total cost. Often it is more expensive to set up the graph in the first place than to run a search on it.

   1 /* Typical usage:
   2  *
   3  *    struct search_info *s;
   4  *    int n;
   5  *
   6  *    s = search_info_create(g);
   7  *
   8  *    n = graph_vertex_count(g);
   9  *    for(i = 0; i < n; i++) {
  10  *        dfs(s, i);
  11  *    }
  12  *
  13  *    ... use results in s ...
  14  *
  15  *    search_info_destroy(s);
  16  *
  17  */
  18 
  19 /* summary information per node for dfs and bfs */
  20 /* this is not intended to be opaque---user can read it */
  21 /* (but should not write it!) */
  22 
  23 #define SEARCH_INFO_NULL (-1) /* for empty slots */
  24 
  25 struct search_info {
  26     Graph graph;
  27     int reached;        /* count of reached nodes */
  28     int *preorder;      /* list of nodes in order first reached */
  29     int *time;          /* time[i] == position of node i in preorder list */
  30     int *parent;        /* parent in DFS or BFS forest */
  31     int *depth;         /* distance from root */
  32 };
  33 
  34 /* allocate and initialize search results structure */
  35 /* you need to do this before passing it to dfs or bfs */
  36 struct search_info *search_info_create(Graph g);
  37 
  38 /* free search_info data---does NOT free graph pointer */
  39 void search_info_destroy(struct search_info *);
  40 
  41 /* perform depth-first search starting at root, updating results */
  42 void dfs(struct search_info *results, int root);
  43 
  44 /* perform breadth-first search starting at root, updating results */
  45 void bfs(struct search_info *results, int root);
search.h
   1 #include <stdlib.h>
   2 #include <assert.h>
   3 
   4 #include "graph.h"
   5 #include "search.h"
   6 
   7 /* create an array of n ints initialized to SEARCH_INFO_NULL */
   8 static int *
   9 create_empty_array(int n)
  10 {
  11     int *a;
  12     int i;
  13 
  14     a = malloc(sizeof(*a) * n);
  15     assert(a);
  16 
  17     for(i = 0; i < n; i++) {
  18         a[i] = SEARCH_INFO_NULL;
  19     }
  20 
  21     return a;
  22 }
  23 
  24 /* allocate and initialize search results structure */
  25 /* you need to do this before passing it to dfs or bfs */
  26 struct search_info *
  27 search_info_create(Graph g)
  28 {
  29     struct search_info *s;
  30     int n;
  31 
  32     s = malloc(sizeof(*s));
  33     assert(s);
  34 
  35     s->graph = g;
  36     s->reached = 0;
  37 
  38     n = graph_vertex_count(g);
  39 
  40     s->preorder = create_empty_array(n);
  41     s->time = create_empty_array(n);
  42     s->parent = create_empty_array(n);
  43     s->depth = create_empty_array(n);
  44 
  45     return s;
  46 } 
  47 
  48 /* free search_info data---does NOT free graph pointer */
  49 void
  50 search_info_destroy(struct search_info *s)
  51 {
  52     free(s->depth);
  53     free(s->parent);
  54     free(s->time);
  55     free(s->preorder);
  56     free(s);
  57 }
  58 
  59 /* used inside search routines */
  60 struct edge {
  61     int u;          /* source */
  62     int v;          /* sink */
  63 };
  64 
  65 /* stack/queue */
  66 struct queue {
  67     struct edge *e;
  68     int bottom;
  69     int top;
  70 };
  71 
  72 static void
  73 push_edge(Graph g, int u, int v, void *data)
  74 {
  75     struct queue *q;
  76 
  77     q = data;
  78 
  79     assert(q->top < graph_edge_count(g) + 1);
  80 
  81     q->e[q->top].u = u;
  82     q->e[q->top].v = v;
  83     q->top++;
  84 }
  85 
  86 /* this rather horrible function implements dfs if use_queue == 0 */
  87 /* and bfs if use_queue == 1 */
  88 static void
  89 generic_search(struct search_info *r, int root, int use_queue)
  90 {
  91     /* queue/stack */
  92     struct queue q;
  93 
  94     /* edge we are working on */
  95     struct edge cur;
  96 
  97     /* start with empty q */
  98     /* we need one space per edge */
  99     /* plus one for the fake (root, root) edge */
 100     q.e = malloc(sizeof(*q.e) * (graph_edge_count(r->graph) + 1));
 101     assert(q.e);
 102 
 103     q.bottom = q.top = 0;
 104 
 105     /* push the root */
 106     push_edge(r->graph, root, root, &q);
 107 
 108     /* while q.e not empty */
 109     while(q.bottom < q.top) {
 110         if(use_queue) {
 111             cur = q.e[q.bottom++];
 112         } else {
 113             cur = q.e[--q.top];
 114         }
 115 
 116         /* did we visit sink already? */
 117         if(r->parent[cur.v] != SEARCH_INFO_NULL) continue;
 118 
 119         /* no */
 120         assert(r->reached < graph_vertex_count(r->graph));
 121         r->parent[cur.v] = cur.u;
 122         r->time[cur.v] = r->reached;
 123         r->preorder[r->reached++] = cur.v;
 124         if(cur.u == cur.v) {
 125             /* we could avoid this if we were certain SEARCH_INFO_NULL */
 126             /* would never be anything but -1 */
 127             r->depth[cur.v] = 0;
 128         } else {
 129             r->depth[cur.v] = r->depth[cur.u] + 1;
 130         }
 131 
 132         /* push all outgoing edges */
 133         graph_foreach(r->graph, cur.v, push_edge, &q);
 134     }
 135 
 136     free(q.e);
 137 }
 138 
 139 void
 140 dfs(struct search_info *results, int root)
 141 {
 142     generic_search(results, root, 0);
 143 }
 144 
 145 void
 146 bfs(struct search_info *results, int root)
 147 {
 148     generic_search(results, root, 1);
 149 }
search.c

And here is some test code: test_search.c. You will need to compile test_search.c together with both search.c and graph.c to get it to work.

184.2. Other variations on the basic algorithm

Stacks and queues are not the only options for the bucket in the generic search algorithm. Some other choices are:

  • A priority queue keyed by edge weights. If the edges have weights, the generic tree-builder can be used to find a tree containing s with minimum total edge weight. The basic idea is to always pull out the lightest edge. The resulting algorithm runs in O(n + m log m) time (since each heap operation takes O(log m) time), and is known as Prim's algorithm. See Prim's algorithm for more details.

  • A priority queue keyed by path lengths. Here we assume that edges have lengths, and we want to build a shortest-path tree where the length of the path is no longer just the number of edges it contains but the sum of their weights. The basic idea is to keep track of the distance from the root to each node in the tree, and assign each edge a key equal to the sum of the distance to its source and its length. The resulting search algorithm, known as Dijkstra's algorithm, will give a shortest-path tree if all the edge weights are non-negative. See ShortestPath or Dijkstra's algorithm.


CategoryProgrammingNotes

185. ShortestPath

The shortest path problem is to find a path in a graph with given edge weights that has the minimum total weight. Typically the graph is directed, so that the weight w(u,v) of an edge uv may differ from the weight w(v,u) of vu; in the case of an undirected graph, we can always turn it into a directed graph by replacing each undirected edge with two directed edges with the same weight that go in opposite directions. We will use the terms weight and length interchangeably, and use distance for the minimum total path weight between two nodes, even when the weights don't make sense as lengths (for example, when some are negative).

There are two main variants to the problem:

  • The single-source shortest path problem is to compute the distance from some source node s to every other node in the graph. This variant includes the case where what we really want is just the distance from s to some target node t.

  • The all-pairs shortest path problem is to compute the distance between every pair of nodes in the graph. This can be solved by running a single-source algorithm once for each starting vertex, but it can be solved more efficiently by combining the work for different starting vertices.

There are also two different assumptions about the edge weights that can radically change the nature of the problem:

  • All edge weights are non-negative. This is the natural case where edge weights represent distances, and allows a fast greedy solution for the single-source case.

  • Edge weights are arbitrary. This case typically arises when the edge weights represent the net cost of traversing some edge, which may be negative for profitable edges. Now greedy solutions fail; even though it may be very expensive to get to some distant node u, if there is a good enough edge leaving u you can make up all the costs by taking it. Shortest paths with negative edge weights are typically found by algorithms using techniques related to DynamicProgramming.

186. Single-source shortest paths

In the single-source shortest path problem, we want to compute the distance δ(s,t) from a single source node s to every target node t. (As a side effect, we might like to find the actual shortest path, but usually this can be done easily while we are computing the distances.) There are many algorithms for solving this problem, but most are based on the same technique, known as relaxation.

186.1. Relaxation

In general, a relaxation of an optimization problem is a new problem that replaces equality constraints in the original problem, like

  • δ(s,t) = minᵤ (δ(s,u) + w(u,t))

with an inequality constraint, like

  • d(s,t) ≥ minᵤ (d(s,u) + w(u,t)).

When we do this kind of replacement, we are also replacing the exact distances δ(s,t) with upper bounds d(s,t), and guaranteeing that d(s,t) is always greater than or equal to the correct answer δ(s,t).

The reason for relaxing a problem is that we can start off with very high upper bounds and lower them incrementally until they settle on the correct answer. For shortest paths this is done by setting d(s,t) initially to zero when t=s and +∞ when t≠s (this choice doesn't require looking at the graph). We then proceed to lower the d(s,t) bounds by a sequence of edge relaxations (a different use of the same term), where relaxing an edge uv sets the upper bound on the distance to v to the minimum of the old upper bound and the upper bound that we get by looking at a path through u, i.e.

  • d'(s,v) := min(d(s,v), d(s,u) + w(u,v)).

It is easy to see that if d(s,v) ≥ δ(s,v) and d(s,u) ≥ δ(s,u), then it will also be the case that d'(s,v) ≥ δ(s,v).

What is less obvious is that performing an edge relaxation on every edge in some shortest s-t path in order starting from the initial state of the d array will set d(s,t) to exactly δ(s,t), even if other relaxation operations occur in between. The proof is by induction on the number of edges in the path. With zero edges, d(s,t) = δ(s,t) = 0. With k+1 edges, the induction hypothesis says that d(s,u) = δ(s,u) after the first k relaxations, where u is the second-to-last vertex in the path. But then the last relaxation sets d(s,t) ≤ δ(s,u) + w(u,t), which is the length of the shortest path; since d(s,t) never drops below δ(s,t), it must now equal δ(s,t).

We mentioned earlier that it is possible to compute the actual shortest paths as a side-effect of computing the distances. This is done using relaxation by keeping track of a previous-vertex pointer p[v] for each vertex, so that the shortest path is found by following all the previous-vertex pointers and reversing the sequence. Initially, p[v] = NULL for all v; when relaxing an edge uv, p[v] is set to u just in case d(s,u) + w(u,v) is less than the previously computed distance d(s,v). So in addition to getting the correct distances by relaxing all edges on the shortest path in order, we also find the shortest path.

This raises the issue of how to relax all the edges on the shortest path in order if we don't know what the shortest path is. There are two ways to do this, depending on whether the graph contains negative-weight edges.

186.2. Dijkstra's algorithm

If the graph contains no negative-weight edges, we can apply the GreedyMethod, relaxing at each step all the outgoing edges from the apparently closest vertex v that hasn't been processed yet; if this is in fact the closest vertex, we process all vertices in order of increasing distance and thus relax the edges of each shortest path in order. This method gives Dijkstra's algorithm for single-source shortest paths, one of the best and simplest algorithms for the problem. It requires a priority queue Q that provides an EXTRACT-MIN operation that deletes and returns the element v with smallest key, in this case the upper bound d(s,v) on the distance.

Dijkstra(G,w,s):
  Set d[s] = 0 and set d[v] = +infinity for all v != s.
  Add all the vertices to Q.
  while Q is not empty:
    u = EXTRACT-MIN(Q)
    for each edge uv leaving u:
      d[v] = min(d[v], d[u] + w(u,v))
  return d

The running time of Dijkstra's algorithm depends on the implementation of Q. The simplest implementation is just to keep around an array of all unprocessed vertices, and to carry out EXTRACT-MIN by performing a linear-time scan for one with the smallest d[u]. This gives a cost to EXTRACT-MIN of O(V), which is paid V times (once per vertex), for a total of O(V²) time. The additional overhead of the algorithm takes O(V) time, except for the loop over outgoing edges from u, all of whose iterations together take O(E) time. So the total cost is O(V² + E) = O(V²). This can be improved for sparse graphs to O((V+E) log V) using a heap to represent Q (the extra log V on the E comes from the cost of moving elements within the heap when their distances drop), and it can be improved further to O(V log V + E) time using a FibonacciHeap.

Why does Dijkstra's algorithm work? Assuming there are no negative edge weights, there is a simple proof that d[u] = δ(s,u) for any vertex u that has left the priority queue. The proof is by induction on the number of vertices that have left the queue, and requires a rather complicated induction hypothesis, which is that after each pass through the outer loop:

  • If u is any vertex not in the queue, then d[u] = δ(s,u).

  • If u is any vertex not in the queue, and v is any vertex in the queue, then δ(s,u) ≤ δ(s,v).
  • If u is any vertex not in the queue, and v is any vertex in the queue, then d[v] ≤ δ(s,u) + w(u,v) (where w(u,v) is taken to be +∞ if uv is not an edge in the graph).

dijkstra.png

This invariant looks ugly but what it says is actually pretty simple: after i steps, we have extracted the i closest vertices and correctly computed their distances, and for any other vertex v, d[v] is at most the length of the shortest path that consists only of non-queue vertices except for v. If the first two parts of the invariant hold, the third is immediate from the relaxation step in the algorithm. So we concentrate on proving the first two parts.

The base case is obtained by considering the state where s is the only vertex not in the queue; we easily have d[s] = 0 = δ(s,s) ≤ δ(s,v) for any vertex v in the queue.

For later u, we'll assume that the invariant holds at the beginning of the loop, and show that it also holds at the end. For each v in the queue, d[v] is at most the length of the shortest s-v path that uses only vertices already processed. We'll show that the smallest d[v] is in fact equal to δ(s,v) and is no greater than δ(s,v') for any v' in the queue. Consider any vertex v that hasn't been processed yet. Let t be the last vertex before v on some shortest s-v path that uses only previously processed vertices. From the invariant we have that d[v] ≤ δ(s,t) + w(t,v). Now consider two cases:

  1. The s-t-v path is a shortest path. Then d[v] = δ(s,v).
  2. The s-t-v path is not a shortest path. Then d[v] > δ(s,v) and there is some shorter s-v path whose last vertex before v is t'. But this shorter s-t'-v path can only exist if t' is still in the queue. If we let q be the first vertex in some shortest s-t'-v path that is still in the queue, the s-q part of the path is a shortest path that uses only non-queue vertices. So d[q] = δ(s,q) ≤ δ(s,v) < d[v].

Let u be returned by EXTRACT-MIN. If case 1 applies to u, part 1 of the invariant follows immediately. Case 2 cannot occur because in this case there would be some q with d[q] < d[u] and EXTRACT-MIN would have returned q instead. We have thus proved that part 1 continues to hold.

For part 2, consider the two cases for some v≠u. In case 1, δ(s,v) = d[v] ≥ d[u] = δ(s,u). In case 2, δ(s,v) ≥ δ(s,q) = d[q] ≥ d[u] = δ(s,u). Thus in either case δ(s,v) ≥ δ(s,u) and part 2 of the invariant holds.

Part 3 of the invariant is immediate from the code.

To complete the proof of correctness, observe that the first part of the induction hypothesis implies that all distances are correct when the queue is empty.

186.3. Bellman-Ford

What if the graph contains negative edges? Then Dijkstra's algorithm may fail in the usual pattern of misled GreedyAlgorithms: the very last vertex v processed may have spectacularly negative edges leading to other vertices that would greatly reduce their distances, but they have already been processed and it's too late to take advantage of this amazing fact (more specifically, it's not too late for the immediate successors of v, but it's too late for any other vertex reachable from such a successor that is not itself a successor of v).

But we can still find shortest paths using the technique of relaxing every edge on a shortest path in sequence. The Bellman-Ford algorithm does so under the assumption that there are no negative-weight cycles in the graph, in which case all shortest paths are simple---they contain no duplicate vertices---and thus have at most V-1 edges in them. If we relax every edge, we are guaranteed to get the first edge of every shortest path; relaxing every edge again gets the second edge; and repeating this operation V-1 times gets all edges in order.

BellmanFord(G,s,w):
  Set d[s] = 0 and set d[v] = +infinity for all v != s.
  for i = 1 to V-1
    for each edge uv in G:
      d[v] = min(d[v], d[u] + w(u,v))
  return d

The running time of Bellman-Ford is O(VE), which is generally slower than even the simple O(V2) implementation of Dijkstra's algorithm; but it handles any edge weights, even negative ones.

What if a negative cycle exists? In this case, there may be no shortest paths; any short path that reaches a vertex on the cycle can be shortened further by taking a few extra loops around it. The Bellman-Ford algorithm can be used to detect such cycles by running the outer loop one more time---if d[v] drops for any v, then a negative cycle reachable from s exists. The converse is also true; intuitively, this is because further reductions in distance can only propagate around the negative cycle if there is some edge that can be relaxed further in each state. CormenEtAl Section 24.1 contains a real proof.

187. All-pairs shortest paths

There is a very simple DynamicProgramming algorithm known as Floyd-Warshall that computes the distance between all V² pairs of vertices in Θ(V³) time. This is no faster than running Dijkstra's algorithm V times, but it works even if some of the edge weights are negative.

Like any other dynamic programming algorithm, Floyd-Warshall starts with a recursive decomposition of the shortest-path problem. The basic idea is to cut the path in half, by expanding d(i,j) as minₖ (d(i,k) + d(k,j)), but this requires considering n-2 intermediate vertices k and doesn't produce a smaller problem. There are a couple of ways to make the d(i,k) on the right-hand side "smaller" than the d(i,j) on the left-hand side---for example, we could add a third parameter that is the length of the path and insist that the subpaths on the right-hand side be half the length of the path on the left-hand side---but most of these approaches still require looking at Θ(n) intermediate vertices. The trick used by Floyd-Warshall is to make the third parameter be the largest vertex that can be used in the path. This allows us to consider only one new intermediate vertex each time we increase this limit.

Define d(i,j,k) as the length of the shortest i-j path whose intermediate vertices all have indices less than or equal to k. Then

  • d(i,j,0) = w(i,j), d(i,j,k) = min(d(i,j,k-1), d(i,k,k-1) + d(k,j,k-1)).

The reason this decomposition works (for any graph that does not contain a negative-weight cycle) is that every shortest i-j path with no intermediate vertex greater than k either includes k exactly once (the second case) or not at all. The nice thing about this decomposition is that we only have to consider two values in the minimum, so we can evaluate d(i,j,k) in O(1) time if we already have d(i,k,k-1) and d(k,j,k-1) in our table. The natural way to guarantee this is to build the table in order of increasing k. We assume that the input is given as an array of edge weights with +∞ for missing edges; the algorithm's speed is not improved by using an adjacency-list representation of the graph.

FloydWarshall(w):
  // initialize first plane of table
  for i = 1 to V do
    for j = 1 to V do
      d[i,j,0] = w[i,j]
  // fill in the rest
  for k = 1 to V do
    for i = 1 to V do
      for j = 1 to V do
        d[i,j,k] = min(d[i,j,k-1], d[i,k,k-1] + d[k,j,k-1])
  // pull out the distances where all vertices on the path are <= V
  // (i.e. with no restrictions)
  return d' where d'[i,j] = d[i,j,V]

The running time of this algorithm is easily seen to be Θ(V³). As with Bellman-Ford, its output is guaranteed to be correct only if the graph does not contain a negative cycle; if the graph does contain a negative cycle, it can be detected by looking for vertices with d'[i,i] < 0.

188. Implementations

Below are C implementations of Bellman-Ford, Floyd-Warshall, and Dijkstra's algorithm (in a separate file). The Dijkstra's algorithm implementation uses the generic priority queue from the CS223/2005/Assignments/HW08 sample solutions. Both files use an extended version of the Graph structure from C/Graphs that supports weights.

Here are the support files:

graph.h graph.c pq.h pq.c

Here is some test code and a Makefile:

test_shortest_path.c Makefile

And here are the actual implementations:

   1 /* various algorithms for shortest paths */
   2 
   3 #define SHORTEST_PATH_NULL_PARENT (-1)
   4 
   5 /* Computes distance of each node from starting node */
   6 /* and stores results in dist (length n, allocated by the caller) */
   7 /* unreachable nodes get distance MAXINT */
   8 /* If parent argument is non-null, also stores parent pointers in parent */
   9 /* Allows negative-weight edges and runs in O(nm) time. */
  10 /* returns 1 if there is a negative cycle, 0 otherwise */
  11 int bellman_ford(Graph g, int source, int *dist, int *parent);
  12 
  13 /* computes all-pairs shortest paths using Floyd-Warshall given */
  14 /* an adjacency matrix */
  15 /* answer is returned in the provided matrix! */
  16 /* assumes matrix is n pointers to rows of n ints each */
  17 void floyd_warshall(int n, int **matrix);
shortest_path.h
   1 #include <stdlib.h>
   2 #include <assert.h>
   3 #include <values.h>
   4 
   5 #include "graph.h"
   6 #include "shortest_path.h"
   7 
   8 /* data field for relax helper */
   9 struct relax_data {
  10     int improved;
  11     int *dist;
  12     int *parent;
  13 };
  14 
  15 static void
  16 relax(Graph g, int source, int sink, int weight, void *data)
  17 {
  18     int len;
  19     struct relax_data *d;
  20 
  21     d = data;
  22 
  23     if(d->dist[source] < MAXINT && weight < MAXINT) {
  24         len = d->dist[source] + weight;
  25 
  26         if(len < d->dist[sink]) {
  27             d->dist[sink] = len;
  28             if(d->parent) d->parent[sink] = source;
  29             d->improved = 1;
  30         }
  31     }
  32 }
  33 
  34 /* returns 1 if there is a negative cycle */
  35 int
  36 bellman_ford(Graph g, int source, int *dist, int *parent)
  37 {
  38     int round;
  39     int n;
  40     int i;
  41     struct relax_data d;
  42 
  43     assert(dist);
  44 
  45     d.dist = dist;
  46     d.parent = parent;
  47     d.improved = 1;
  48 
  49     n = graph_vertex_count(g);
  50 
  51     for(i = 0; i < n; i++) {
  52         d.dist[i] = MAXINT;
  53         if(d.parent) d.parent[i] = SHORTEST_PATH_NULL_PARENT;
  54     }
  55 
  56     d.dist[source] = 0;
  57     if(d.parent) d.parent[source] = source;
  58 
  59     for(round = 0; d.improved && round < n; round++) {
  60         d.improved = 0;
  61 
  62         /* relax all edges */
  63         for(i = 0; i < n; i++) {
  64             graph_foreach_weighted(g, i, relax, &d);
  65         }
  66     }
  67 
  68     return d.improved;
  69 }
  70 
  71 void
  72 floyd_warshall(int n, int **d)
  73 {
  74     int i;
  75     int j;
  76     int k;
  77     int newdist;
  78 
  79     /* The algorithm:
  80      *
  81      * d(i, j, k) = min distance from i to j with all intermediates <= k
  82      *
  83      * d(i, j, k) = min(d(i, j, k-1), d(i, k, k-1) + d(k, j, k-1)
  84      *
  85      * We will allow shorter paths to sneak in to d(i, j, k) so that
  86      * we don't have to store anything extra.
  87      */
  88 
  89     /* initial matrix is essentially d(:,:,-1) */
  90     /* within body of outermost loop we compute d(:,:,k) */
  91     for(k = 0; k < n; k++) {
  92         for(i = 0; i < n; i++) {
  93             /* skip if we can't get to k */
  94             if(d[i][k] == MAXINT) continue;
  95             for(j = 0; j < n; j++) {
  96                 /* skip if we can't get from k */
  97                 if(d[k][j] == MAXINT) continue;
  98 
  99                 /* else */
 100                 newdist = d[i][k] + d[k][j];
 101                 if(newdist < d[i][j]) {
 102                     d[i][j] = newdist;
 103                 }
 104             }
 105         }
 106     }
 107 }
shortest_path.c
   1 #define DIJKSTRA_NULL_PARENT (-1)
   2 
   3 /* Computes distance of each node from starting node */
   4 /* and stores results in dist (length n, allocated by the caller) */
   5 /* unreachable nodes get distance MAXINT */
   6 /* If parent argument is non-null, also stores parent pointers in parent */
   7 /* Assumes no negative-weight edges */
   8 /* Runs in O(n + m log m) time. */
   9 /* Note: uses pq structure from pq.c */
  10 void dijkstra(Graph g, int source, int *dist, int *parent);
dijkstra.h
   1 #include <stdlib.h>
   2 #include <assert.h>
   3 #include <values.h>
   4 
   5 #include "graph.h"
   6 #include "pq.h"
   7 #include "dijkstra.h"
   8 
   9 /* internal edge representation for dijkstra */
  10 struct pq_elt {
  11     int d;      /* distance to v */
  12     int u;      /* source */
  13     int v;      /* sink */
  14 };
  15 
  16 static int
  17 pq_elt_cmp(const void *a, const void *b)
  18 {
  19     return ((const struct pq_elt *) a)->d - ((const struct pq_elt *) b)->d;
  20 }
  21 
  22 struct push_data {
  23     PQ pq;
  24     int *dist;
  25 };
  26 
  27 static void push(Graph g, int u, int v, int wt, void *data)
  28 {
  29     struct push_data *d;
  30     struct pq_elt e;
  31 
  32     d = data;
  33 
  34     e.d = d->dist[u] + wt;
  35     e.u = u;
  36     e.v = v;
  37 
  38     pq_insert(d->pq, &e);
  39 }
  40 
  41 void
  42 dijkstra(Graph g, int source, int *dist, int *parent)
  43 {
  44     struct push_data data;
  45     struct pq_elt e;
  46     int n;
  47     int i;
  48 
  49     assert(dist);
  50 
  51     data.dist = dist;
  52     data.pq = pq_create(sizeof(struct pq_elt), pq_elt_cmp);
  53     assert(data.pq);
  54 
  55     n = graph_vertex_count(g);
  56 
  57     /* set up dist and parent arrays */
  58     for(i = 0; i < n; i++) {
  59         dist[i] = MAXINT;
  60     }
  61         
  62     if(parent) {
  63         for(i = 0; i < n; i++) {
  64             parent[i] = DIJKSTRA_NULL_PARENT;
  65         }
  66     }
  67 
  68     /* push (source, source, 0) */
  69     /* this will get things started with parent[source] == source */
  70     /* and dist[source] == 0 */
  71     push(g, source, source, -MAXINT, &data);
  72 
  73     while(!pq_is_empty(data.pq)) {
  74         /* pull the min value out */
  75         pq_delete_min(data.pq, &e);
  76 
  77         /* have we already assigned a distance to e.v? */
  78         if(dist[e.v] == MAXINT) {
  79             /* no, so e.d is its final distance */
  80             dist[e.v] = e.d;
  81             if(parent) parent[e.v] = e.u;
  82 
  83             /* throw in the outgoing edges */
  84             graph_foreach_weighted(g, e.v, push, &data);
  85         }
  86     }
  87 
  88     pq_destroy(data.pq);
  89 }
dijkstra.c


CategoryAlgorithmNotes CategoryProgrammingNotes

189. SuffixArrays

These are notes on practical implementations of suffix arrays, a data structure for quickly searching for substrings of a given large string. Some of these notes are adapted from the StringAlgorithms page from CS365.

190. Why do we want to do this?

  • Answer from the old days: Fast string searching is the key to dealing with mountains of information. Why, a modern (c. 1960) electronic computer can search the equivalent of hundreds of pages of text in just a few hours...
  • More recent answer:
    • We still need to search enormous corpuses of text (see http://www.google.com).

    • Algorithms for finding long repeated substrings or patterns can be useful for data compression (see Data_compression) or detecting plagiarism.

    • Finding all occurrences of a particular substring in some huge corpus is the basis of statistical machine translation.

    • We are made out of strings over the four-letter alphabet GATC. String search is a central tool in computational biology.

191. String search algorithms

Without preprocessing, searching an n-character string for an m-character substring can be done using various algorithms, the worst of which run in O(nm) time (run strncmp at each position in the big string) and the best of which run in O(n+m) time (e.g., the Boyer-Moore string search algorithm). But we are interested in the case where we can preprocess our big string into a data structure that will let us do lots of searches for cheap.
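
For comparison, here is a minimal sketch of the strncmp-at-each-position baseline; naive_search is a name invented for this example:

    #include <string.h>

    /* return index of first occurrence of pattern in text, or -1 if none */
    /* worst case O(nm): n positions, up to m characters compared at each */
    long
    naive_search(const char *text, const char *pattern)
    {
        size_t n = strlen(text);
        size_t m = strlen(pattern);
        size_t i;

        for(i = 0; i + m <= n; i++) {
            if(strncmp(text + i, pattern, m) == 0) {
                return (long) i;
            }
        }

        return -1;
    }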

192. Suffix trees and suffix arrays

Suffix trees and suffix arrays are data structures for representing texts that allow substring queries like "where does this pattern appear in the text" or "how many times does this pattern occur in the text" to be answered quickly. Both work by storing all suffixes of a text, where a suffix is a substring that runs to the end of the text. Of course, storing actual copies of all suffixes of an n-character text would take O(n²) space, so instead each suffix is represented by a pointer to its first character.

A suffix array stores all the suffixes sorted in dictionary order. For example, the suffix array of the string abracadabra is shown below. The actual contents of the array are the indices in the left-hand column; the right-hand column shows the corresponding suffixes.

11  \0
10  a\0
 7  abra\0
 0  abracadabra\0
 3  acadabra\0
 5  adabra\0
 8  bra\0
 1  bracadabra\0
 4  cadabra\0
 6  dabra\0
 9  ra\0
 2  racadabra\0

A suffix tree is similar, but instead of using an array, we use some sort of tree data structure to hold the sorted list. A common choice given an alphabet of some fixed size k is a trie (see RadixSearch), in which each node at depth d represents a string of length d, and its up to k children represent all (d+1)-character extensions of the string. The advantage of using a suffix trie is that searching for a string of length m takes O(m) time, since we can just walk down the trie at the rate of one node per character in m. A further optimization is to replace any long chain of single-child nodes with a compressed edge labeled with the concatenation of all the characters in the chain. Such compressed suffix tries can not only be searched in linear time but can also be constructed in linear time with a sufficiently clever algorithm. Of course, we could also use a simple balanced binary tree, which might be preferable if the alphabet is large.
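
Here is a minimal sketch of that O(m) walk, using an uncompressed trie with one child slot per possible byte; struct trie_node and trie_search are names invented for this example:

    #include <stddef.h>

    #define ALPHABET_SIZE 256   /* one child slot per possible byte */

    struct trie_node {
        struct trie_node *child[ALPHABET_SIZE];
        /* a real suffix trie would also record where each suffix starts */
    };

    /* returns the node reached by walking down the pattern, or NULL if */
    /* no suffix starts with the pattern; one step per pattern character */
    struct trie_node *
    trie_search(struct trie_node *root, const char *pattern)
    {
        struct trie_node *n = root;
        size_t i;

        for(i = 0; n != NULL && pattern[i] != '\0'; i++) {
            n = n->child[(unsigned char) pattern[i]];
        }

        return n;
    }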

The disadvantage of suffix trees over suffix arrays is that they generally require more space to store all the internal pointers in the tree. If we are indexing a huge text (or collection of texts), this extra space may be too expensive.

192.1. Building a suffix array

A straightforward approach to building a suffix array is to run any decent comparison-based sorting algorithm on the set of suffixes (represented by pointers into the text). This will take O(n log n) comparisons, but in the worst case each comparison will take O(n) time, for a total of O(n² log n) time. This is the approach used in the sample code below.

The original suffix array paper by Manber and Myers ("Suffix arrays: a new method for on-line string searches," SIAM Journal on Computing 22(5):935-948, 1993) gives an O(n log n) algorithm, somewhat resembling radix sort, for building suffix arrays in place with only a small amount of additional space. They also note that for random text, simple radix sorting works well, since most suffixes become distinguishable after about log_k n characters (where k is the size of the alphabet). Assuming random data would also give an O(n log² n) running time for a comparison-based sort.

The fastest approach is to build a suffix tree in O(n) time and extract the suffix array by traversing the tree. The only complication is that we need the extra space to build the tree, although we get it back when we throw the tree away.

192.2. Searching a suffix array

Suppose we have a suffix array corresponding to an n-character text and we want to find all occurrences in the text of an m-character pattern. Since the suffixes are ordered, the easiest solution is to do binary search for the first and last occurrences of the pattern (if any) using O(log n) comparisons. (The code below does something even lazier than this, searching for some match and then scanning linearly for the first and last matches.) Unfortunately, each comparison may take as much as O(m) time, since we may have to check all m characters of the pattern. So the total cost will be O(m log n) in the worst case.

By storing additional information about the longest common prefix of regions of contiguous suffixes, it is possible to avoid having to re-examine every character in the pattern for every comparison, reducing the search cost to O(m + log n). With a sufficiently clever algorithm, this information can be computed in linear time, and can also be used to quickly solve such problems as finding the longest duplicated substring or the most frequently occurring strings (GusfieldBook §7.14.4).

Using binary search on the suffix array, most searching tasks are now easy:

  • Finding whether a substring appears in the array uses binary search directly.
  • Finding all occurrences requires two binary searches, one for the first occurrence and one for the last. If we only want to count the occurrences and not return their positions, this takes O(m + log n) time. If we want to return their positions, it takes O(m + log n + k) time, where k is the number of times the pattern occurs.
  • Finding duplicate substrings of length m or more can be done by looking for adjacent entries in the array with long common prefixes, which takes O(mn) time in the worst case if done naively (and O(n) time if we have already computed longest common prefix information; see GusfieldBook). A sketch of the naive version appears below.
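
To make the last item concrete, here is a minimal sketch of the naive adjacent-entries scan, written against the suffixArray interface from the sample implementation in section 194 below; lcp and longestRepeat are names invented for this example:

    #include <stddef.h>

    #include "suffixArray.h"

    /* length of the longest common prefix of two nul-terminated strings */
    static size_t
    lcp(const char *a, const char *b)
    {
        size_t k;

        for(k = 0; a[k] != '\0' && a[k] == b[k]; k++);

        return k;
    }

    /* sketch: length of the longest substring occurring at least twice; */
    /* if where is non-null, *where is set to point at one occurrence */
    size_t
    longestRepeat(SuffixArray sa, const char **where)
    {
        size_t i;
        size_t k;
        size_t best = 0;
        const char *loc = sa->string;

        /* in sorted order, the suffix sharing the longest prefix with */
        /* suffix[i] is one of its immediate neighbors */
        for(i = 0; i + 1 < sa->n; i++) {
            k = lcp(sa->suffix[i], sa->suffix[i+1]);

            if(k > best) {
                best = k;
                loc = sa->suffix[i];
            }
        }

        if(where) {
            *where = loc;
        }

        return best;
    }

As noted above, this is the naive version; with precomputed longest-common-prefix information the scan drops to O(n) time.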

193. Burrows-Wheeler transform

Closely related to suffix arrays is the Burrows-Wheeler transform (Burrows and Wheeler, A Block-Sorting Lossless Data Compression Algorithm, DEC Systems Research Center Technical Report number 124, 1994), which is the basis for some of the best currently known algorithms for text compression (it's the technique that is used, for example, in bzip2).

The idea of the Burrows-Wheeler Transform is to construct an array whose rows are all cyclic shifts of the input string in dictionary order, and return the last column of the array. The last column will tend to have long runs of identical characters, since whenever some substring (like "the") appears repeatedly in the input, shifts that put its first character "t" in the last column will put the rest of the substring "he" in the first columns, and the resulting rows will tend to be sorted together. The relative regularity of the last column means that it will compress well with even very simple compression algorithms like run-length encoding.

Below is an example of the Burrows-Wheeler transform in action, with $ marking end-of-text. The transformed value of abracadabra$ is $drcraaaabba, the last column of the sorted array; note the long run of a's (and the shorter run of b's).

abracadabra$     abracadabra$
bracadabra$a     abra$abracad
racadabra$ab     acadabra$abr
acadabra$abr     adabra$abrac
cadabra$abra     a$abracadabr
adabra$abrac     bracadabra$a
dabra$abraca --> bra$abracada
abra$abracad     cadabra$abra
bra$abracada     dabra$abraca
ra$abracadab     racadabra$ab
a$abracadabr     ra$abracadab
$abracadabra     $abracadabra

The most useful property of the Burrows-Wheeler transform is that it can be inverted; this distinguishes it from other transforms that produce long runs like simply sorting the characters. We'll describe two ways to do this; the first is less efficient, but more easily grasped, and involves rebuilding the array one column at a time, starting at the left. Observe that the leftmost column is just all the characters in the string in sorted order; we can recover it by sorting the rightmost column, which we have to start off with. If we paste the rightmost and leftmost columns together, we have the list of all 2-character substrings of the original text; sorting this list gives the first two columns of the array. (Remember that each copy of the string wraps around from the right to the left.) We can then paste the rightmost column at the beginning of these two columns, sort the result, and get the first three columns. Repeating this process eventually reconstructs the entire array, from which we can read off the original string from any row. The initial stages of this process for abracadabra$ are shown below:

$    a       $a    ab       $ab    abr
d    a       da    ab       dab    abr
r    a       ra    ac       rac    aca
c    a       ca    ad       cad    ada
r    a       ra    a$       ra$    a$a
a    b       ab    br       abr    bra
a -> b       ab -> br       abr -> bra
a    c       ac    ca       aca    cad
a    d       ad    da       ada    dab
b    r       br    ra       bra    rac
b    r       br    ra       bra    ra$
a    $       a$    $a       a$a    $ab

Rebuilding the entire array in this fashion takes O(n²) time and O(n²) space. In their paper, Burrows and Wheeler showed that one can in fact reconstruct the original string from just the first and last columns in the array in O(n) time.
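
For concreteness, here is a minimal sketch of this column-at-a-time method; inverse_bwt_naive is a name invented for this example. It assumes the text ends with a unique '$' marker and that the forward transform sorted rows in strcmp order (note that the worked example above happens to sort '$' after the letters; the inverse must use whatever ordering the forward transform used):

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    static int
    row_cmp(const void *a, const void *b)
    {
        return strcmp(*(const char * const *) a, *(const char * const *) b);
    }

    /* sketch: rebuild the original string from the last column of the */
    /* sorted array by repeatedly pasting the last column onto the left */
    /* of the sorted rows and re-sorting */
    char *
    inverse_bwt_naive(size_t n, const char *last)
    {
        char *pool = calloc(n, n + 1);  /* n rows of up to n chars plus nul */
        char **rows = malloc(sizeof(char *) * n);
        char *result = malloc(n + 1);
        size_t i;
        size_t round;

        assert(pool && rows && result);

        for(i = 0; i < n; i++) {
            rows[i] = pool + i * (n + 1);
        }

        for(round = 0; round < n; round++) {
            /* paste the last column onto the left of each row... */
            for(i = 0; i < n; i++) {
                memmove(rows[i] + 1, rows[i], round + 1);
                rows[i][0] = last[i];
            }

            /* ...then restore sorted order */
            qsort(rows, n, sizeof(*rows), row_cmp);
        }

        /* the row that ends with the marker is the original string */
        result[0] = '\0';
        for(i = 0; i < n; i++) {
            if(rows[i][n-1] == '$') {
                memcpy(result, rows[i], n + 1);
                break;
            }
        }

        free(rows);
        free(pool);

        return result;
    }

This does n sorts of n rows and so is far slower than the linear-time method described next; it is mainly useful as a cross-check on the cleverer inverseBWT implementation in section 194.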

Here's the idea: Suppose that all the characters were distinct. Then after reconstructing the first column we would know all pairs of adjacent characters. So we could just start with the last character $ and regenerate the string by appending at each step the unique successor to the last character so far. If all characters were distinct, we would never get confused about which character comes next.

The problem is what to do with pairs with duplicate first characters, like ab and ac in the example above. We can imagine that each a in the last column is labeled in some unique way, so that we can talk about the first a or the third a, but how do we know which a is the one that comes before b or d?

The trick is to look closely at how the original sort works. Look at the rows in the original transformation. If we look at all rows that start with a, the order they are sorted in is determined by the suffix after a. These suffixes also appear as the prefixes of the rows that end with a, since the rows that end with a are just the rows that start with a rotated one position. It follows that all instances of the same letter occur in the same order in the first and last columns. So if we use a stable sort to construct the first column, we will correctly match up instances of letters.

This method is shown in action below. Each letter is annotated uniquely with a count of how many identical letters equal or precede it. Sorting recovers the first column, and combining the last and first columns gives a list of unique pairs of adjacent annotated characters. Now start with $1 and construct the full sequence $1 a1 b1 r1 a3 c1 a4 d1 a2 b2 r2 a5 $1. The original string is obtained by removing the end-of-string markers and annotations: abracadabra.

$1     a1
d1     a2
r1     a3
c1     a4
r2     a5
a1     b1
a2 --> b2
a3     c1
a4     d1
b1     r1
b2     r2
a5     $1

Because we are only sorting single characters, we can perform the sort in linear time using counting sort. Extracting the original string also takes linear time if implemented reasonably.

193.1. Suffix arrays and the Burrows-Wheeler transform

A useful property of the Burrows-Wheeler transform is that each row of the sorted array is essentially the same as the corresponding row in the suffix array, except for the rotated string prefix after the $ marker. This means, among other things, that we can compute the Burrows-Wheeler transform in linear time using suffix trees. Ferragina and Manzini (http://www.imc.pi.cnr.it/~manzini/papers/focs00.html) have further exploited this correspondence (and some very clever additional ideas) to design compressed suffix arrays that compress and index a text at the same time, so that pattern searches can be done directly on the compressed text in time close to that needed for suffix array searches.

194. Sample implementation

As mentioned above, this is a pretty lazy implementation of suffix arrays that doesn't include many of the optimizations that would be necessary to deal with huge source texts.

   1 /* we expose this so user can iterate through it */
   2 
   3 struct suffixArray {
   4     size_t n;               /* length of string INCLUDING final null */
   5     const char *string;     /* original string */
   6     const char **suffix;    /* suffix array of length n */
   7 };
   8 
   9 typedef struct suffixArray *SuffixArray;
  10 
  11 /* construct a suffix array */
  12 /* it is a bad idea to modify string before destroying this */
  13 SuffixArray suffixArrayCreate(const char *string);
  14 
  15 /* destructor */
  16 void suffixArrayDestroy(SuffixArray);
  17 
  18 /* return number of occurrences of substring */
  19 /* if non-null, index of first occurrence is placed in first */
  20 size_t
  21 suffixArraySearch(SuffixArray, const char *substring, size_t *first);
  22 
  23 /* return the Burrows-Wheeler transform of the underlying string 
  24  * as malloc'd data of length sa->n */
  25 /* note that this may have a null in the middle somewhere */
  26 char *suffixArrayBWT(SuffixArray sa);
  27 
  28 /* invert BWT of null-terminated string, returning a malloc'd copy of original */
  29 char *inverseBWT(size_t len, const char *s);
suffixArray.h

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 #include <string.h>
   4 #include <limits.h>
   5 
   6 #include "suffixArray.h"
   7 
   8 static int
   9 saCompare(const void *s1, const void *s2)
  10 {
  11     return strcmp(*((const char **) s1), *((const char **) s2));
  12 }
  13 
  14 SuffixArray
  15 suffixArrayCreate(const char *s)
  16 {
  17     size_t i;
  18     SuffixArray sa;
  19 
  20     sa = malloc(sizeof(*sa));
  21     assert(sa);
  22 
  23     sa->n = strlen(s) + 1;
  24     sa->string = s;
  25 
  26     sa->suffix = malloc(sizeof(*sa->suffix) * sa->n);
  27     assert(sa->suffix);
  28 
  29     /* construct array of pointers to suffixes */
  30     for(i = 0; i < sa->n; i++) {
  31         sa->suffix[i] = s+i;
  32     }
  33 
  34     /* this could be a lot more efficient */
  35     qsort(sa->suffix, sa->n, sizeof(*sa->suffix), saCompare);
  36 
  37     return sa;
  38 }
  39 
  40 void
  41 suffixArrayDestroy(SuffixArray sa)
  42 {
  43     free(sa->suffix);
  44     free(sa);
  45 }
  46 
  47 size_t
  48 suffixArraySearch(SuffixArray sa, const char *substring, size_t *first)
  49 {
  50     size_t lo;
  51     size_t hi;
  52     size_t mid;
  53     size_t len;
  54     int cmp;
  55 
  56     len = strlen(substring);
  57 
  58     /* invariant: suffix[lo] <= substring < suffix[hi] */
  59     lo = 0;
  60     hi = sa->n;
  61 
  62     while(lo + 1 < hi) {
  63         mid = (lo+hi)/2;
  64         cmp = strncmp(sa->suffix[mid], substring, len);
  65 
  66         if(cmp == 0) {
  67             /* we have a winner */
  68             /* search backwards and forwards for first and last */
  69             for(lo = mid; lo > 0 && strncmp(sa->suffix[lo-1], substring, len) == 0; lo--);
  70             for(hi = mid; hi + 1 < sa->n && strncmp(sa->suffix[hi+1], substring, len) == 0; hi++);
  71 
  72             if(first) {
  73                 *first = lo;
  74             }
  75 
  76             return hi - lo + 1;
  77         } else if(cmp < 0) {
  78             lo = mid;
  79         } else {
  80             hi = mid;
  81         }
  82     }
  83 
  84     return 0;
  85 }
  86 
  87 char *
  88 suffixArrayBWT(SuffixArray sa)
  89 {
  90     char *bwt;
  91     size_t i;
  92 
  93     bwt = malloc(sa->n);
  94     assert(bwt);
  95 
  96     for(i = 0; i < sa->n; i++) {
  97         if(sa->suffix[i] == sa->string) {
  98             /* wraps around to nul */
  99             bwt[i] = '\0';
 100         } else {
 101             bwt[i] = sa->suffix[i][-1];
 102         }
 103     }
 104 
 105     return bwt;
 106 }
 107 
 108 char *
 109 inverseBWT(size_t len, const char *s)
 110 {
 111     /* basic trick: stable sort of s gives successor indices */
 112     /* then we just thread through starting from the nul */
 113 
 114     size_t *successor;
 115     int c;
 116     size_t count[UCHAR_MAX+1];
 117     size_t offset[UCHAR_MAX+1];
 118     size_t i;
 119     char *ret;
 120     size_t thread;
 121 
 122     successor = malloc(sizeof(*successor) * len);
 123     assert(successor);
 124 
 125     /* counting sort */
 126     for(c = 0; c <= UCHAR_MAX; c++) {
 127         count[c] = 0;
 128     }
 129 
 130     for(i = 0; i < len; i++) {
 131         count[(unsigned char) s[i]]++;
 132     }
 133 
 134     offset[0] = 0;
 135 
 136     for(c = 1; c <= UCHAR_MAX; c++) {
 137         offset[c] = offset[c-1] + count[c-1];
 138     }
 139 
 140     for(i = 0; i < len; i++) {
 141         successor[offset[(unsigned char) s[i]]++] = i;
 142     }
 143 
 144     /* find the nul */
 145     for(thread = 0; s[thread]; thread++);
 146 
 147     /* thread the result */
 148     ret = malloc(len);
 149     assert(ret);
 150 
 151     for(i = 0, thread = successor[thread]; i < len; i++, thread = successor[thread]) {
 152         ret[i] = s[thread];
 153     }
 154 
 155     return ret;
 156 }
suffixArray.c

Here is a Makefile and test code: Makefile, testSuffixArray.c.

The output of make test shows all occurrences of a target string, the Burrows-Wheeler transform of the source string (second-to-last line), and its inversion (last line, which is just the original string):

$ make test
/bin/echo -n abracadabra-abracadabra-shmabracadabra | ./testSuffixArray abra
Count: 6
abra
abra-abr
abra-shm
abracada
abracada
abracada
aaarrrdddm\x00-rrrcccaaaaaaaaaaaashbbbbbb-
abracadabra-abracadabra-shmabracadabra


CategoryProgrammingNotes

195. C++

Here we will describe some basic features of C++ that are useful for implementing abstract data types. Like all programming languages, C++ comes with an ideology, which in this case emphasizes object-oriented features like inheritance. We will be ignoring this ideology and treating C++ as an improved version of C.

The goal here is not to teach you all of C++, which would take a while, but instead to give you some hints for why you might want to learn C++ on your own. If you decide to learn C++ for real, Bjarne Stroustrup's The C++ Programming Language is the definitive source. A classic tutorial aimed at C programmers introduces C++ features one at a time (some of these features have since migrated into C). The web site http://www.cplusplus.com has extensive tutorials and documentation.

196. Hello world

The C++ version of "hello world" looks like this:

   1 #include <iostream>
   2 
   3 int
   4 main(int argc, const char **argv)
   5 {
   6     std::cout << "hi\n";
   7 
   8     return 0;
   9 }
helloworld.cpp

Compile this using g++ instead of gcc. Make shows how it is done:

$ make helloworld
g++     helloworld.cpp   -o helloworld

Or we could use an explicit Makefile:

CPP=g++
CPPFLAGS=-g3 -Wall

helloworld: helloworld.o
        $(CPP) $(CPPFLAGS) -o $@ $^

Now the compilation looks like this:

$ make helloworld
g++  -g3 -Wall  -c -o helloworld.o helloworld.cpp
g++ -g3 -Wall -o helloworld helloworld.o

The main differences from the C version:

  1. #include <stdio.h> is replaced by #include <iostream>, which gets the C++ version of the stdio library.

  2. printf("hi\n") is replaced by std::cout << "hi\n". The stream std::cout is the C++ wrapper for stdout; you should read this variable name as cout in the std namespace. The << operator is overloaded for streams so that it sends its right argument out on its left argument (see the discussion of operator overloading below). You can also do things like std::cout << 37, std::cout << 'q', std::cout << 4.7, etc. These all do pretty much what you expect.

If you don't like typing std:: before all the built-in functions and variables, you can put using namespace std somewhere early in your program, like this:

   1 #include <iostream>
   2 
   3 using namespace std;
   4 
   5 int
   6 main(int argc, const char **argv)
   7 {
   8     cout << "hi\n";
   9 
  10     return 0;
  11 }
helloworld_using.cpp

197. References

Recall that in C we sometimes pass objects into functions by reference instead of by value, by using a pointer:

   1 void increment(int *x)
   2 {
   3     (*x)++;
   4 }

This becomes even more useful in C++, since many of the objects we are dealing with are quite large, and can defend themselves against dangerous modifications by restricting access to their components. So C++ provides a special syntax allowing function parameters to be declared as call-by-reference rather than call-by-value. The function above could be rewritten in C++ as

   1 void increment(int &x)
   2 {
   3     x++;
   4 }

The int &x declaration says that x is a reference to whatever variable is passed as the argument to increment. A reference acts exactly like a pointer that has already had * applied to it. You can even write &x to get a pointer to the original variable if you want to for some reason.

As with pointers, it's polite to mark a reference with const if you don't intend to modify the original object:

   1 void reportWeight(const SumoWrestler &huge)
   2 {
   3     cout << huge.getWeight();
   4 }

References are also used as a return type to chain operators together; in the expression

   1     cout << "hi" << '\n';

the return type of the first << operator is a reference to an ostream (cout itself is an ostream); this means that the '\n' gets sent to the same object. We could make the return value be just an ostream, but then cout would be copied, which could be expensive and would mean that the copy was no longer working on the same internal state as the original. This same trick is used when overloading the assignment operator.

198. Function overloading

C++ lets you define multiple functions with the same name, where the choice of which function to call depends on the type of its arguments. Here is a program that demonstrates this feature:

   1 #include <iostream>
   2 
   3 using namespace std;
   4 
   5 const char *
   6 typeName(int x)
   7 {
   8     return "int";
   9 }
  10 
  11 const char *
  12 typeName(double x)
  13 {
  14     return "double";
  15 }
  16 
  17 const char *
  18 typeName(char x)
  19 {
  20     return "char";
  21 }
  22 
  23 int
  24 main(int argc, const char **argv)
  25 {
  26     cout << "The type of " << 3 << " is " << typeName(3) << ".\n";
  27     cout << "The type of " << 3.1 << " is " << typeName(3.1) << ".\n";
  28     cout << "The type of " << 'c' << " is " << typeName('c') << ".\n";
  29 
  30     return 0;
  31 }
functionOverloading.cpp

And here is what it looks like when we compile and run it:

$ make functionOverloading
g++     functionOverloading.cpp   -o functionOverloading
$ ./functionOverloading 
The type of 3 is int.
The type of 3.1 is double.
The type of c is char.

Internally, g++ compiles three separate functions with different (and ugly) names, a process known as name mangling, and when you use typeName on an object of a particular type, g++ picks the one whose type matches. This is similar to what happens with built-in operators in straight C, where + means different things depending on whether you apply it to a pair of ints, a pair of doubles, or a pointer and an int, but C++ lets you do it with your own functions.

199. Classes

C++ allows you to declare classes that look suspiciously like structs. The main differences between a class and a C-style struct are that (a) classes provide member functions or methods that operate on instances of the class and that are called using a struct-like syntax; and (b) classes can distinguish between private members (only accessible to methods of the class) and public members (accessible to everybody).

In C, we organize abstract data types by putting the representation in a struct and putting the operations on the data type in functions that work on this struct, often giving the functions a prefix that hints at the type of its target (mostly to avoid namespace collisions). Classes in C++ make this connection between a data structure and the operations on it much more explicit.

Here is a simple example of a C++ class in action:

   1 #include <iostream>
   2 
   3 using namespace std;
   4 
   5 /* counters can be incremented or read */
   6 class Counter {
   7     int value;            /* private value */
   8 public:
   9     Counter();            /* constructor with default value */
  10     Counter(int);         /* constructor with specified value */
  11     ~Counter();           /* destructor; reports the final value */
  12     int read();           /* get the value of the counter */
  13     void increment();     /* add one to the counter */
  14 };
  15 
  16 Counter::Counter() { value = 0; }
  17 Counter::Counter(int initialValue) { value = initialValue; }
  18 Counter::~Counter() { cout << "counter de-allocated with value " << value << '\n'; }
  19 int Counter::read() { return value; }
  20 void Counter::increment() { value++; }
  21 
  22 int
  23 main(int argc, const char **argv)
  24 {
  25     Counter c;
  26     Counter c10(10);
  27 
  28     cout << "c starts at " << c.read() << '\n';
  29     c.increment();
  30     cout << "c after one increment is " << c.read() << '\n';
  31 
  32     cout << "c10 starts at " << c10.read() << '\n';
  33     c10.increment();
  34     c10.increment();
  35     cout << "c10 after two increments is " << c10.read() << '\n';
  36 
  37     return 0;
  38 }
counter.cpp

Things to notice:

  1. In the class Counter declaration, the public: label introduces the public members of the class. The member value is only accessible to member functions of Counter. This enforces much stronger information hiding than the default in C, although one can still use void * trickery to hunt down and extract supposedly private data in C++ objects.

  2. In addition to the member function declarations in the class declaration, we also need to provide definitions. These look like ordinary function definitions, except that the class name is prepended using :: as in Counter::read.

  3. Member functions are called using struct access syntax, as in c.read(). Conceptually, each instance of a class has its own member functions, so that c.read is the function for reading c while c10.read is the function for reading c10. Inside a member function, names of class members refer to members of the current instance; value inside c.read is c.value (which otherwise is not accessible, since c.value is not public).

  4. Two special member functions are Counter::Counter() and Counter::Counter(int). These are constructors, and are identifiable as such because they are named after the class. A constructor is called whenever a new instance of the class is created. If you create an instance with no arguments (as in the declaration Counter c;), you get the constructor with no arguments. If you create an instance with arguments (as in the declaration Counter c10(10);), you get the version with the appropriate arguments. This is just another example of function overloading. If you don't define any constructors, C++ supplies a default constructor that takes no arguments and does nothing. Note that constructors don't have a return type (you don't need to preface them with void).

  5. The special member function Counter::~Counter() is a destructor; it is called when an object of type Counter is de-allocated (say, when returning from a function with a local variable of this type). This particular destructor is not very useful. Destructors are mostly important for objects that allocate their own storage that needs to be de-allocated when the object is; see the section on storage allocation below.

Compiling and running this program gives the following output. Note that the last two lines are produced by the destructor.

c starts at 0
c after one increment is 1
c10 starts at 10
c10 after two increments is 12
counter de-allocated with value 12
counter de-allocated with value 1

One subtle difference between C and C++ is that C++ uses empty parentheses () for functions with no arguments, where C would use (void). This is a bit of a historical artifact, having to do with C allowing () for functions whose arguments are not specified in the declaration (which was standard practice before ANSI C).

Curiously, C++ also allows you to declare structs, with the interpretation that a struct is exactly like a class except that all members are public by default. So if you change class to struct in the program above, it will do exactly the same thing. In practice, nobody who codes in C++ does this; the feature is mostly useful to allow C code with structs to mix with C++ code.

200. Operator overloading

Sometimes when you define a new class, you also want to define new interpretations of operators on that class. Here is an example of a class that defines elements of the max-plus algebra over ints. This gives us objects that act like ints, except that the + operator now returns the larger of its arguments and the * operator now returns the sum (see footnote 14).

The mechanism in C++ for doing this is to define member functions whose names are operator followed by the symbol for the operator being defined, e.g. operator+. These member functions take one less argument than the operator they define; in effect, x + y becomes syntactic sugar for x.operator+(y) (which, amazingly, is actually legal C++). Because these are member functions, they are allowed to access members of other instances of the same class that would normally be hidden.

This same mechanism is also used to define automatic type conversions out of a type: the MaxPlus::operator int() function allows C++ to convert a MaxPlus object to an int whenever it needs to (for example, to feed it to cout). (Automatic type conversions into a type happen if you provide an appropriate constructor.)

   1 #include <iostream>
   2 #include <algorithm> // for max
   3 
   4 using namespace std;
   5 
   6 /* act like ints, except + does max and * does addition */
   7 class MaxPlus {
   8     int value;
   9 public:
  10     MaxPlus(int);
  11     MaxPlus operator+(const MaxPlus &);
  12     MaxPlus operator*(const MaxPlus &);
  13     operator int();
  14 };
  15 
  16 MaxPlus::MaxPlus(int x) { value = x; }
  17 
  18 MaxPlus 
  19 MaxPlus::operator*(const MaxPlus &other)
  20 {
  21     return MaxPlus(value + other.value);
  22 }
  23 
  24 MaxPlus 
  25 MaxPlus::operator+(const MaxPlus &other)
  26 {
  27     /* std::max does what you expect */
  28     return MaxPlus(max(value, other.value));
  29 }
  30 
  31 MaxPlus::operator int() { return value; }
  32 
  33 int
  34 main(int argc, const char **argv)
  35 {
  36     cout << "2+3 == " << (MaxPlus(2) + MaxPlus(3)) << '\n';
  37     cout << "2*3 == " << (MaxPlus(2) * MaxPlus(3)) << '\n';
  38 
  39     return 0;
  40 }
maxPlus.cpp

Avoid the temptation to overuse operator overloading, as it can be dangerous if used to obfuscate what an operator normally does:

   1 MaxPlus::operator--() { godzilla.eat(tokyo); }

The general rule of thumb is that you should probably only do operator overloading if you really are making things that act like numbers (yes, cout << violates this).

Automatic type conversions can be particularly dangerous. The line

   1     cout << (MaxPlus(2) + 3) << '\n';

is ambiguous: should the compiler convert MaxPlus(2) to an int using MaxPlus::operator int() and use ordinary integer addition, or convert 3 to a MaxPlus using the MaxPlus(int) constructor and use funky MaxPlus addition? Fortunately most C++ compilers will complain about the ambiguity and fail rather than guessing wrong.

201. Templates

One of the things we kept running into in CS223 was that if we defined a container type like a hash table, binary search tree, or priority queue, we had to either bake in the type of the data it held or do horrible tricks with void * pointers to work around the C type system. C++ includes a semi-principled work-around for this problem known as templates. These are essentially macros that take a type name as an argument and are expanded as needed to produce functions or classes with specific types (see C/Macros for an example of how to do this if you only have C).

Typical use is to prefix a definition with template <class T> and then use T as a type name throughout:

   1 template <class T>
   2 T add1(T x)
   3 {
   4     return x + ((T) 1);
   5 }

Note the explicit cast of 1 to T; this avoids ambiguities that might arise with automatic type conversions.

If you put this definition in a program, you can then apply add1 to any type that has a + operator and that you can convert 1 to. For example, the output of this code fragment:

   1     cout << "add1(3) == " << add1(3) << '\n';
   2     cout << "add1(3.1) == " << add1(3.1) << '\n';
   3     cout << "add1('c') == " << add1('c') << '\n';
   4     cout << "add1(MaxPlus(0)) == " << add1(MaxPlus(0)) << '\n';
   5     cout << "add1(MaxPlus(2)) == " << add1(MaxPlus(2)) << '\n';

is

add1(3) == 4
add1(3.1) == 4.1
add1('c') == d
add1(MaxPlus(0)) == 1
add1(MaxPlus(2)) == 2

By default, C++ will instantiate a template to whatever type fits in its argument. If you want to force a particular version, you can put the type in angle brackets after the name of whatever you defined. For example,

   1     cout << "add1<int>(3.1) == " << add1<int>(3.1) << '\n';

produces

add1<int>(3.1) == 4

because add1<int> forces its argument to be converted to an int (truncating to 3) before adding one to it.

Because templates are really macros that get expanded as needed, it is common to put templates in header (.h) files rather than in .cpp files. See the stack implementation below for an example of this.

202. Exceptions

C provides no built-in mechanism for signaling that something bad happened. So C programmers are left to come up with ad-hoc mechanisms like:

  1. Calling abort to kill the program, either directly or via assert.

  2. Calling exit with a nonzero exit code.

  3. Returning a special error value from a function. This is often done in library routines, because it's rude for a library routine not to give the caller a chance to figure out how to deal with the error. But it means coming up with some special error value that won't be returned normally, and these can vary widely from one routine to another (null pointers, -1, etc.)

C++ provides a standard mechanism for signaling unusual events known as exceptions. The actual mechanism is similar to return: the throw statement throws an exception that may be caught by a try..catch statement anywhere above it on the execution stack (not necessarily in the same function). Example:

   1 #include <iostream>
   2 
   3 using namespace std;
   4 
   5 int fail()
   6 { 
   7     throw "you lose";
   8 
   9     return 5;
  10 }
  11 
  12 int
  13 main(int argc, const char **argv)
  14 {
  15     try {
  16         cout << fail() << '\n';
  17     } 
  18     catch(const char *s) {
  19         cerr << "Caught error: " << s << '\n';
  20     }
  21 
  22     return 0;
  23 }
exception.cpp

In action:

$ make exception
g++  -g3 -Wall   exception.cpp   -o exception
$ ./exception
Caught error: you lose

Note the use of cerr instead of cout. This sends the error message to stderr.

A try..catch statement will catch an exception only if the type matches the type of the argument to the catch part of the statement. This can be used to pick and choose which exceptions you want to catch. See http://www.cplusplus.com/doc/tutorial/exceptions/ for some examples and descriptions of some C++ standard library exceptions.

203. Storage allocation

C++ programs generally don't use malloc and free, but instead use the built-in C++ operators new and delete. The advantage of new and delete is that they know about types: not only does this mean that you don't have to play games with sizeof to figure out how much space to allocate, but if you allocate a new object from a class with a constructor, the constructor gets called to initialize the object, and if you delete an object, its destructor (if it has one) is called.

There are two versions of new and delete, depending on whether you want to allocate just one object or an array of objects, plus some special syntax for passing constructor arguments:

  • To allocate a single object, use new type.

  • To allocate an array of objects, use new type[size]. As with malloc, both operations return a pointer to type.

  • If you want to pass arguments to a constructor for type, use new type(args). This only works with the single-object version, so you can't do new SomeClass[12] unless SomeClass has a constructor that takes no arguments.

  • To de-allocate a single object, use delete pointer-to-object.

  • To de-allocate an array, use delete [] pointer-to-base-of-array. Mixing new with delete [] or new [] with delete is an error that may or may not be detected by the compiler. Mixing either with malloc or free is a very bad idea.

The program below gives examples of new and delete in action:

   1 #include <iostream>
   2 #include <cassert>
   3 
   4 using namespace std;
   5 
   6 int
   7 main(int argc, const char **argv)
   8 {
   9     int *p;
  10     int *a;
  11     const int n = 100;
  12 
  13     p = new int;
  14     a = new int[n];
  15 
  16     *p = 5;
  17     assert(*p == 5);
  18 
  19     for(int i = 0; i < n; i++) {
  20         a[i] = i;
  21     }
  22 
  23     for(int i = 0; i < n; i++) {
  24         assert(a[i] == i);
  25     }
  26 
  27     delete [] a;
  28     delete p;
  29 
  30     return 0;
  31 }
allocation.cpp

203.1. Storage allocation inside objects

Inside objects, storage allocation gets complicated. The reason is that if the object is copied, either by an assignment or by being passed as a call-by-value parameter, the storage pointed to by the object will not be copied. This can lead to two different objects that share the same internal data structures, which is usually not something you want. Furthermore, when the object is deallocated, it's necessary to also deallocate any space it allocated, which can be done inside the object's destructor.

To avoid all these problems, any object of type T that uses new needs to have all of:

  1. A destructor T::~T().

  2. A copy constructor T::T(const T &), which is a constructor that takes a reference to another object of the same type as an argument and copies its contents.

  3. An overloaded assignment operator T::operator=(const T &) that does the same thing, but also deallocates any internal storage of the current object before copying new data in place of it (or possibly just copies the contents of internal storage without doing any allocation and deallocation). The overloaded assignment operator is particularly tricky, because you have to make sure it doesn't destroy the contents of the object if somebody writes the useless self-assignment a = a, and you also need to return a reference to *this so that you can chain assignments together as in a = b = c.

Here is an example of a Stack class that includes all of these members. Note that it is defined using templates so we can make a stack of any type we like.

   1 template <class T>
   2 class Stack {
   3     static const int initialSize = 32;   /* static means this is shared across entire class */
   4     int top;
   5     int size;
   6     T* contents;
   7 public:
   8     Stack();          /* create a new empty stack */
   9 
  10     /* the unholy trinity of complex C++ objects */
  11     ~Stack();         /* destructor */
  12     Stack(const Stack &);     /* copy constructor */
  13     Stack& operator=(const Stack &); /* overloaded assignment */
  14 
  15     void push(T);     /* push an element onto the stack */
  16     int isEmpty();    /* return 1 if empty */
  17     T pop();          /* pop top element from stack */
  18 };
  19 
  20 template <class T>
  21 Stack<T>::Stack() 
  22 { 
  23     size = initialSize;
  24     top = 0;
  25     contents = new T[size];
  26 }
  27 
  28 template <class T> 
  29 Stack<T>::~Stack()
  30 { 
  31     delete [] contents;
  32 }
  33 
  34 template <class T>
  35 Stack<T>::Stack(const Stack<T> &other)
  36 {
  37     size = other.size;
  38     top = other.top;
  39     contents = new T[size];
  40 
  41     for(int i = 0; i < top; i++) {
  42         contents[i] = other.contents[i];
  43     }
  44 }
  45 
  46 template <class T>
  47 Stack<T> &
  48 Stack<T>::operator=(const Stack<T> &other)
  49 {
  50     if(&other != this) {
  51         /* this is a real assignment */
  52 
  53         delete [] contents;
  54 
  55         size = other.size;
  56         top = other.top;
  57         contents = new T[size];
  58 
  59         for(int i = 0; i < top; i++) {
  60             contents[i] = other.contents[i];
  61         }
  62     }
  63     
  64     return *this;
  65 }
  66 
  67 template <class T>
  68 void 
  69 Stack<T>::push(T elt)
  70 {
  71     if(top >= size) {
  72         int newSize = 2*size;
  73         T *newContents = new T[newSize];
  74 
  75         for(int i = 0; i < top; i++) {
  76             newContents[i] = contents[i];
  77         }
  78 
  79         delete [] contents;
  80 
  81         contents = newContents;
  82         size = newSize;
  83     }
  84         
  85     contents[top++] = elt;
  86 }
  87 
  88 template <class T>
  89 T
  90 Stack<T>::pop()
  91 {
  92     if(top > 0) {
  93         return contents[--top];
  94     } else {
  95         throw "stack empty";
  96     }
  97 }
stack.h

   1 #include <iostream>
   2 
   3 #include "stack.h"
   4 
   5 using namespace std;
   6 
   7 int
   8 main(int argc, const char **argv)
   9 {
  10     Stack<int> s;
  11     Stack<int> s2;
  12 
  13     try {
  14         s.push(1);
  15         s.push(2);
  16         s.push(3);
  17 
  18         s2 = s;
  19 
  20         cout << s.pop() << '\n';
  21         cout << s.pop() << '\n';
  22         cout << s.pop() << '\n';
  23 
  24         cout << s2.pop() << '\n';
  25         cout << s2.pop() << '\n';
  26         cout << s2.pop() << '\n';
  27 
  28         try {
  29             s2.pop();
  30         } catch(const char *err) {
  31             cout << "Caught expected exception " << err << '\n';
  32         }
  33 
  34         for(int i = 0; i < 1000; i++) {
  35             s.push(i);
  36         }
  37 
  38         cout << s.pop() << '\n';
  39     } catch(const char *err) {
  40         cerr << "Caught error " << err << '\n';
  41     }
  42 
  43     return 0;
  44 }
testStack.cpp

204. Standard library

C++ has a large standard library that includes implementations of many of the data structures we've seen in CS223. In most situations, it is easier to use the standard library implementations than to roll your own, although you have to be careful to make sure you understand just what the standard library implementations do. For example, here is a reimplementation of the main routine from testStack.cpp using the stack template from #include <stack>.

   1 #include <iostream>
   2 #include <stack>
   3 
   4 using namespace std;
   5 
   6 int
   7 main(int argc, const char **argv)
   8 {
   9     stack<int> s;
  10     stack<int> s2;
  11 
  12     s.push(1);
  13     s.push(2);
  14     s.push(3);
  15 
  16     s2 = s;
  17 
  18     cout << s.top() << '\n'; s.pop();
  19     cout << s.top() << '\n'; s.pop();
  20     cout << s.top() << '\n'; s.pop();
  21 
  22     cout << s2.top() << '\n'; s2.pop();
  23     cout << s2.top() << '\n'; s2.pop();
  24     cout << s2.top() << '\n'; s2.pop();
  25 
  26     for(int i = 0; i < 1000; i++) {
  27         s.push(i);
  28     }
  29 
  30     cout << s.top() << '\n';
  31 
  32     return 0;
  33 }
stdStack.cpp

One difference between the standard stack and our stack is that std::stack's pop member function doesn't return anything. So we have to use top to get the top element before popping it.

There is a chart of all the standard library data structures at http://www.cplusplus.com/reference/stl/.

205. Things we haven't talked about

The main thing we've omitted here is any discussion of object-oriented features of C++, particularly inheritance. These are not immediately useful for the abstract-data-type style of programming we've used in CS223, but can be helpful for building more complicated systems, where we might want to have various specialized classes of objects that can all be approached using a common interface represented by a class that they inherit from. If you are interested in exploring these tools further, the CS department occasionally offers a class on object-oriented programming; Mike Fischer's lecture notes from the last time this course was offered can be found at http://zoo.cs.yale.edu/classes/cs427/2011a/lectures.html.


CategoryProgrammingNotes

  1. these are also used to identify things that are not variables like functions and user-defined types. (1)

  2. The long long type wasn't added to the language officially until C99, but was supported by most compilers anyway. (2)

  3. Certain ancient versions of C ran on machines with a different character set encoding, like EBCDIC. The C standard does not guarantee ASCII encoding. (3)

  4. C99 also provides atoll for converting to long long. (4)

  5. In this case you will get lucky most of the time, since the odds are that malloc will give you a block that is slightly bigger than strlen(s) anyway. But bugs that only manifest themselves occasionally are even worse than bugs that kill your program every time, because they are much harder to track down. (5)

  6. Some programs (e.g. /c/cs223/bin/submit) will use this to change their behavior depending on what name you call them with. (6)

  7. Arguably, this is a bug in the design of the language: if the compiler knows that sp has type struct string *, there is no particular reason why it can't interpret sp.length as sp->length. But it doesn't do this, so you will have to remember to write sp->length instead. (7)

  8. A small child of my acquaintance once explained that this wouldn't work, because you would hit your head on the ceiling. (8)

  9. The notation [x, y) means all numbers z such that x ≤ z < y. (9)

  10. The actual analysis is pretty complicated, since we are more likely to land in a bigger pile, but it's not hard to show that on average even the bigger pile has no more than 3/4 of the elements. (10)

  11. But it's linear in the numerical value of the output, which means that Fib(n) will actually terminate in a reasonable amount of time on a typical modern computer when run on any n small enough that F(n) fits in 32 bits. Running it using 64-bit (or larger) integer representations will be slower. (11)

  12. I.e., functional. (12)

  13. This only works if the graph is undirected, i.e. if for every edge uv there is a matching edge vu with the same weight. (13)

  14. This otherwise insane-looking modification is useful for modeling scheduling problems, where a+b is the time to do a and b in parallel, and a*b is the time to do a and b sequentially. The reason for making the first case + and the second case * is because this makes the distributive law a*(b+c) = (a*b)+(a*c) work. It also allows tricks like matrix multiplication using the standard definition. See http://maxplus.org for more than you probably want to know about this. (14)

