Abstraction
One of the hard parts about computer programming is that, in general, programs are bigger than brains. Unless you have an unusually capacious brain, it is unlikely that you will be able to understand even a modestly large program in its entirety. So in order to write and debug large programs, it is important to be able to break them up into pieces, where each piece can be treated as a tool whose use and description is simpler (and therefore fits in your brain better) than its actual code. Then you can forget about what is happening inside that piece, and just treat it as an easily-understood black box from the outside.
This process of wrapping functionality up in a box and forgetting about its internals is called abstraction, and it is the single most important concept in computer science. In these notes we will describe a particular kind of abstraction, the construction of abstract data types or ADTs. Abstract data types are data types whose implementation is not visible to their user; from the outside, all the user knows about an ADT is what operations can be performed on it and what those operations are supposed to do.
ADTs have an outside and an inside. The outside is called the interface; it consists of the minimal set of type and function declarations needed to use the ADT. The inside is called the implementation; it consists of type and function definitions, and sometimes auxiliary data or helper functions, that are not visible to users of the ADT.
Example of an abstract data type
Too much abstraction at once can be hard to take, so let's look at a concrete example of an abstract data type. This ADT will represent an infinite sequence of ints. Each instance of the Sequence type supports a single operation seq_next that returns the next int in the sequence. We will also need to provide one or more constructor functions to generate new Sequences, and a destructor function to tear them down.
A typical use of a Sequence is the seq_print function, which appears in full in main.c later in these notes: it repeatedly calls seq_next and prints each value until it reaches its limit argument.
Note that seq_print doesn't need to know anything at all about what a Sequence is or how seq_next works in order to print out all the values in the sequence until it hits one greater than or equal to limit. This is a good thing: it means that we can use it with any implementation of Sequence we like, and we don't have to change it if Sequence or seq_next changes.
Interface
In C, the interface of an abstract data type will usually be declared in a header file, which is included both in the file that implements the ADT (so that the compiler can check that the declarations match up with the actual definitions in the implementation) and in any file that uses the ADT. Here's a header file for sequences:
sequence.h
/* opaque struct: hides actual components of struct sequence,
 * which are defined in sequence.c */
typedef struct sequence *Sequence;

/* constructors */
/* all our constructors return a null pointer on allocation failure */

/* returns a Sequence representing init, init+1, init+2, ... */
Sequence seq_create(int init);

/* returns a Sequence representing init, init+step, init+2*step, ... */
Sequence seq_create_step(int init, int step);

/* destructor */
/* destroys a Sequence, recovering all internally-allocated data */
void seq_destroy(Sequence);

/* accessor */
/* returns the first element in a sequence not previously returned */
int seq_next(Sequence);
Here we have defined two different constructors for Sequences, one of which gives slightly more control over the sequence than the other. If we were willing to put more work into the implementation, we could imagine building a very complicated Sequence type that supported a much wider variety of sequences (for example, sequences generated by functions or sequences read from files); but we'll try to keep things simple for now. We can always add more functionality later, since the users won't notice if the Sequence type changes internally.
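As a rough illustration of the "sequences generated by functions" idea, the sketch below puts a third kind of constructor behind the same seq_next operation. The name seq_create_function, the generator square, and the struct layout are all hypothetical; none of them is part of the sequence.h above, and the block is written as a standalone file rather than as a change to the implementation given in these notes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical variant of the Sequence ADT: values come from a
 * user-supplied function f, so the sequence is f(0), f(1), f(2), ... */
typedef struct sequence *Sequence;

struct sequence {
    int index;          /* argument for the next call to f */
    int (*f)(int);      /* generator function */
};

/* returns a Sequence representing f(0), f(1), f(2), ...,
 * or a null pointer on allocation failure */
Sequence
seq_create_function(int (*f)(int))
{
    Sequence s;

    assert(f != 0);

    s = malloc(sizeof(*s));
    if(s == 0) return 0;
    s->index = 0;
    s->f = f;
    return s;
}

void
seq_destroy(Sequence s)
{
    free(s);
}

/* same signature as in sequence.h; callers cannot tell the difference */
int
seq_next(Sequence s)
{
    return s->f(s->index++);
}

/* example generator: the squares 0, 1, 4, 9, ... */
int
square(int n)
{
    return n * n;
}
```

A caller would create the sequence with seq_create_function(square) and then use seq_next and seq_destroy exactly as with the other constructors, which is the sense in which users won't notice the internal change.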
Implementation
The implementation of an ADT in C is typically contained in one (or sometimes more than one) .c file. This file can be compiled and linked into any program that needs to use the ADT. Here is our implementation of Sequence:
sequence.c
#include <stdlib.h>

#include "sequence.h"

struct sequence {
    int next;   /* next value to return */
    int step;   /* how much to increment next by */
};

Sequence
seq_create(int init)
{
    return seq_create_step(init, 1);
}

Sequence
seq_create_step(int init, int step)
{
    Sequence s;

    s = malloc(sizeof(*s));
    if(s == 0) return 0;
    s->next = init;
    s->step = step;
    return s;
}

void
seq_destroy(Sequence s)
{
    free(s);
}

int
seq_next(Sequence s)
{
    int ret;    /* saves the old value before we increment it */

    ret = s->next;
    s->next += s->step;

    return ret;
}
Things to note here: the definition of struct sequence appears only in this file; this means that only the functions defined here can (easily) access the next and step components. This protects Sequences to a limited extent from outside interference, and defends against users who might try to "violate the abstraction boundary" by examining the components of a Sequence directly. It also means that if we change the components or meaning of the components in struct sequence, we only have to fix the functions defined in sequence.c.
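To make the last point concrete, here is a sketch of an alternative sequence.c in which struct sequence stores different components (the starting value, the step, and a count of values returned so far) and computes each element on demand. This is an assumed variation for illustration only, not the implementation used in these notes; the declarations from sequence.h are repeated inline so that the block stands alone.

```c
#include <assert.h>
#include <stdlib.h>

/* declarations repeated from sequence.h so this sketch stands alone */
typedef struct sequence *Sequence;
Sequence seq_create(int init);
Sequence seq_create_step(int init, int step);
void seq_destroy(Sequence);
int seq_next(Sequence);

/* alternative representation: compute init + count*step on demand */
struct sequence {
    int init;   /* first value in the sequence */
    int step;   /* difference between consecutive values */
    int count;  /* how many values have been returned so far */
};

Sequence
seq_create(int init)
{
    return seq_create_step(init, 1);
}

Sequence
seq_create_step(int init, int step)
{
    Sequence s;

    s = malloc(sizeof(*s));
    if(s == 0) return 0;
    s->init = init;
    s->step = step;
    s->count = 0;
    return s;
}

void
seq_destroy(Sequence s)
{
    free(s);
}

int
seq_next(Sequence s)
{
    assert(s != 0);
    return s->init + s->step * s->count++;
}
```

Only the struct definition and the bodies of seq_create_step and seq_next differ from the original; code that includes sequence.h and calls these four functions would not need to change.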
Compiling and linking
Now that we have sequence.h and sequence.c, how do we use them? Let's suppose we have a simple main program:
main.c
#include <stdio.h>

#include "sequence.h"


void
seq_print(Sequence s, int limit)
{
    int i;

    for(i = seq_next(s); i < limit; i = seq_next(s)) {
        printf("%d\n", i);
    }
}


int
main(int argc, char **argv)
{
    Sequence s;
    Sequence s2;

    puts("Stepping by 1:");

    s = seq_create(0);
    seq_print(s, 5);
    seq_destroy(s);

    puts("Now stepping by 3:");

    s2 = seq_create_step(1, 3);
    seq_print(s2, 20);
    seq_destroy(s2);

    return 0;
}
We can compile main.c and sequence.c together into a single binary with the command gcc main.c sequence.c. Or we can build a Makefile that compiles the two files separately and then links them. Using make may be more efficient, especially for large programs consisting of many components, since after a change make will recompile only the files that depend on what we changed. So here is our Makefile:
Makefile
CC=gcc
CFLAGS=-g3 -ansi -pedantic -Wall

all: seqprinter

seqprinter: main.o sequence.o
	$(CC) $(CFLAGS) -o $@ $^

test: seqprinter
	./seqprinter

# these rules say to rebuild main.o and sequence.o if sequence.h changes
main.o: main.c sequence.h
sequence.o: sequence.c sequence.h

clean:
	$(RM) -f seqprinter *.o
Now running make test produces the output below. Notice how the built-in make variables $@ and $^ expand to the left-hand side (the target) and the right-hand side (the prerequisites) of the dependency line for building seqprinter.
$ make test
gcc -g3 -ansi -pedantic -Wall -c -o main.o main.c
gcc -g3 -ansi -pedantic -Wall -c -o sequence.o sequence.c
gcc -g3 -ansi -pedantic -Wall -o seqprinter main.o sequence.o
./seqprinter
Stepping by 1:
0
1
2
3
4
Now stepping by 3:
1
4
7
10
13
16
19
Designing abstract data types
Now we've seen how to implement an abstract data type. How do we choose when to use one, and what operations to give it? Let's try answering the second question first.
Parnas's Principle
Parnas's Principle is a statement of the fundamental idea of information hiding, which says that abstraction boundaries should be as narrow as possible:
- The developer of a software component must provide the intended user with all the information needed to make effective use of the services provided by the component, and should provide no other information.
- The developer of a software component must be provided with all the information necessary to carry out the given responsibilities assigned to the component, and should be provided with no other information.
(David Parnas, "On the Criteria to Be Used in Decomposing Systems into Modules," Communications of the ACM, 15(12): 1053-1058, 1972.)
For ADTs, this means we should provide as few functions for accessing and modifying the ADT as we can get away with. The Sequence type we defined earlier has a particularly narrow interface; the developer of Sequence (whoever is writing sequence.c) needs to know nothing about what its user wants except for the arguments passed in to seq_create or seq_create_step, and the user only needs to be able to call seq_next. More complicated ADTs might provide larger sets of operations, but in general we know that an ADT provides a successful abstraction when the operations are all "natural" ones given our high-level description. If we find ourselves writing a lot of extra operations to let users tinker with the guts of our implementation, that may be a sign that either we aren't taking our abstraction barrier seriously enough, or that we need to put the abstraction barrier in a different place.
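As one illustration of keeping the interface narrow, suppose a user wants to discard the first n values of a sequence. Instead of adding an operation that pokes at the next field, the hypothetical helper seq_skip below is written entirely in terms of seq_next, so it would keep working if the representation changed. The Sequence implementation from sequence.c is repeated in condensed form only so that the sketch compiles on its own; seq_skip is an assumed name and is not part of the interface in sequence.h.

```c
#include <assert.h>
#include <stdlib.h>

/* --- condensed copy of the Sequence ADT from these notes --- */
typedef struct sequence *Sequence;

struct sequence {
    int next;   /* next value to return */
    int step;   /* how much to increment next by */
};

Sequence
seq_create_step(int init, int step)
{
    Sequence s = malloc(sizeof(*s));

    if(s == 0) return 0;
    s->next = init;
    s->step = step;
    return s;
}

void
seq_destroy(Sequence s)
{
    free(s);
}

int
seq_next(Sequence s)
{
    int ret = s->next;

    s->next += s->step;
    return ret;
}

/* --- client code: uses only the public operations --- */

/* discards the next n values of s (hypothetical helper) */
void
seq_skip(Sequence s, int n)
{
    int i;

    assert(n >= 0);

    for(i = 0; i < n; i++) {
        (void) seq_next(s);
    }
}
```

For example, after seq_create_step(0, 5) and seq_skip(s, 3), the next call to seq_next(s) returns 15. The cost of this design is that skipping takes n calls; whether that is worth a wider interface is the kind of judgment the paragraph above describes.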
When to build an abstract data type
The short answer: Whenever you can.
A better answer: The best heuristic I know for deciding what ADTs to include in a program is to write down a description of how your program is going to work. For each noun or noun phrase in the description, either identify a built-in data type to implement it or design an abstract data type.
For example: a grade database maintains a list of students, and for each student it keeps a list of grades. So here we might want data types to represent:
- A list of students,
- A student,
- A list of grades,
- A grade.
If grades are simple, we might be able to make them just be ints (or maybe doubles); to be on the safe side, we should probably create a Grade type with a typedef. The other types are likely to be more complicated. Each student might have, in addition to his or her grades, a long list of other attributes, such as a name, an email address, etc. By wrapping students up as abstract data types we can extend these attributes if we need to, or allow for very general implementations (say, by allowing a student to have an arbitrary list of keyword-attribute pairs). The two kinds of lists are likely to be examples of sequence types; we'll be seeing a lot of ways to implement these as the course progresses. If we want to perform the same kinds of operations on both lists, we might want to try to implement them as a single list data type, which then is specialized to hold either students or grades; this is not always easy to do in C, but we'll see examples of how to do this, too.
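A first cut at two of these types might look like the sketch below. Everything in it is an assumption made for illustration: the function names (student_create, student_add_grade, and so on) are hypothetical, the list of grades is folded into the Student as a growable array instead of being its own ADT, and the other attributes such as the name are left out. In a real program the struct definition would live in a .c file, with only the typedefs and function declarations in the header.

```c
#include <assert.h>
#include <stdlib.h>

/* a grade is simple enough to be a typedef for a built-in type */
typedef int Grade;

/* would be opaque in a real header; defined here so the sketch is one file */
typedef struct student *Student;

struct student {
    Grade *grades;      /* dynamically-allocated array of grades */
    size_t count;       /* number of grades stored */
    size_t capacity;    /* number of grades the array can hold */
};

/* returns a new Student with no grades, or a null pointer on failure */
Student
student_create(void)
{
    Student s = malloc(sizeof(*s));

    if(s == 0) return 0;
    s->grades = 0;
    s->count = 0;
    s->capacity = 0;
    return s;
}

void
student_destroy(Student s)
{
    if(s != 0) {
        free(s->grades);
        free(s);
    }
}

/* appends g to the student's grades; returns 1 on success, 0 on failure */
int
student_add_grade(Student s, Grade g)
{
    assert(s != 0);

    if(s->count == s->capacity) {
        size_t new_capacity = s->capacity ? 2 * s->capacity : 4;
        Grade *new_grades = realloc(s->grades,
                                    new_capacity * sizeof(*new_grades));

        if(new_grades == 0) return 0;
        s->grades = new_grades;
        s->capacity = new_capacity;
    }
    s->grades[s->count++] = g;
    return 1;
}

/* returns the number of grades recorded for s */
size_t
student_grade_count(Student s)
{
    return s->count;
}

/* returns the sum of all grades recorded for s */
long
student_grade_total(Student s)
{
    long total = 0;
    size_t i;

    for(i = 0; i < s->count; i++) {
        total += s->grades[i];
    }
    return total;
}
```

Because callers only see Grade, Student, and these five functions, the array could later be replaced by a separate list ADT, or Grade changed to a double, without touching code that uses them.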
Whether or not this set of four types is the set we will finally use, writing it down gives us a place to start writing our program. We can start writing interface files for each of the data types, and then evolve their implementations and the main program in parallel, adjusting the interfaces as we find that we have provided too little (or too much) data for each component to do what it must.