B-trees are a data structure designed to take advantage of the block structure of memory devices, particularly disks.
1. The memory hierarchy
Most of the time, programmers will assume that memory is flat, meaning that all memory references are equally expensive. While hardware designers work very hard to make memory references cheap, they are not all equally cheap.
Typical CPUs implement several layers of caching backed by system memory and then virtual memory. The cache layers are very fast memory (typically on the same die as the CPU); system memory is slower; virtual memory is slower by 6 orders of magnitude. Typically data is moved to the fastest available level of the cache when it is accessed; data that hasn't been accessed in a while is pushed out to slower levels to make room. To amortize the overhead costs of this process, data is managed in blocks, which may range from as little as 64 bytes for the on-chip caches to as much as 4096 bytes or more for the virtual memory system.
The effect of storing data in caches is that it's cheaper—sometimes much cheaper—to access data at an address close to other addresses you've looked at recently than to access data far away.
2. Caching and data structures
The structure of the cache system means that, all else being equal, a data structure will run faster if it has good locality, which means that pieces of the data structure accessed around the same time tend to be in about the same place. For example, given the choice between searching through an unsorted array and an unsorted linked list, you are better off searching the array: each position you look at in the array comes right after the previous position, so is likely to be stored in the same block. In contrast, the elements of the linked list will be wherever malloc managed to scrounge up space: in the worst case, one element per virtual memory block, each of which needs to be paged in from disk if you are running low on space. The disadvantage of the array over the linked list is that calling realloc may be expensive; one way to avoid this is to build a linked list of large nodes each of which holds an array, as done in this sample code: stackBlockList.c. This is actually slightly slower than just building a stack as an array (stackArray.c) but much faster than the simple linked-list approach (stackList.c).
A similar issue arises with binary search trees. Because we are only jumping through O(log n) nodes, the added cost of bad locality is not as noticeable as with linked lists, but we'd like to reduce it, especially if we expect our tree nodes to be stored on a slow device like a disk.
3. B-trees
B-trees solve this problem by grouping what would be many nodes together into a single giant node. The idea is that each node contains an array k keys and pointers to k+1 children, for a value of k that lies somewhere between m/2 and m (except for the root) where m is a parameter of the tree called the order. Searching a B-tree consists of finding the child in between the two keys that our target lies between. A picture of one node in a B-tree:
14 38 57 A B C D
Here there are three keys and four children A, B, C, and D. Node A is the root of a B-tree holding all keys less than 14, B the root of a B-tree holding all keys strictly between 14 and 38, etc.
At the bottom of the tree are leaf nodes that don't hold any pointers. A sophisticated implementation of a B-tree may take advantage of this to pack more keys into the leaves while still keeping the size of each node fitting neatly inside the block size (the implementation given below is not that sophisticated). At the top is a root node, which is just like any other internal node except that it may contain fewer than m/2 keys.
3.1. Insertions
To insert a key, we start by inserting it at the leaf, by moving any larger keys up to make room and then writing it into the sorted array. If the leaf becomes full, this may require some rebalancing.
B-trees are always completely balanced, with each leaf the same distance from the root. The mechanism for balancing a B-tree is that when a node becomes full, it splits into two nodes and pushes a key up into its parent. For example, we might split a three-key node into two one-key nodes like this:
14 38 57 A B C D (38) 14 57 A B C D
Pushing the key up into the parent will increase the number of keys in the parent and may force the parent to split as well. If we find ourselves splitting the root, we create a new root with one key and two children. This is the only way the height of the tree can increase.
3.2. Deletions
Deletion is mostly just the reverse of insertion, but has a lot of messy special cases. As in a binary search tree, it's complicated to delete a key from an internal node; the solution is to delete a key from a nearby leaf instead. We do not implement deletion in the code below, but you can read about it in HorowitzEtAl §11.2.4.
4. Implementation
Here is a basic implementation, not-very-tuned implementation of a B-tree. Despite the horrendous amount of copying in btInsert, the test program runs about four times faster with this implementation than an adapted version runs on the AVL tree code from C/AvlTree. There are many possible reasons for this (the AVL tree code is not very optimized), but cache performance and avoiding malloc overhead are probably the big factors.
1 /* implementation of a B-tree */
2 typedef struct btNode *bTree;
3
4 /* create a new empty tree */
5 bTree btCreate(void);
6
7 /* free a tree */
8 void btDestroy(bTree t);
9
10 /* return nonzero if key is present in tree */
11 int btSearch(bTree t, int key);
12
13 /* insert a new element into a tree */
14 void btInsert(bTree t, int key);
15
16 /* print all keys of the tree in order */
17 void btPrintKeys(bTree t);
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <assert.h>
4 #include <string.h>
5
6 #include "bTree.h"
7
8 #define MAX_KEYS (1024)
9
10 struct btNode {
11 int isLeaf; /* is this a leaf node? */
12 int numKeys; /* how many keys does this node contain? */
13 int keys[MAX_KEYS];
14 struct btNode *kids[MAX_KEYS+1]; /* kids[i] holds nodes < keys[i] */
15 };
16
17 bTree
18 btCreate(void)
19 {
20 bTree b;
21
22 b = malloc(sizeof(*b));
23 assert(b);
24
25 b->isLeaf = 1;
26 b->numKeys = 0;
27
28 return b;
29 }
30
31 void
32 btDestroy(bTree b)
33 {
34 int i;
35
36 if(!b->isLeaf) {
37 for(i = 0; i < b->numKeys + 1; i++) {
38 btDestroy(b->kids[i]);
39 }
40 }
41
42 free(b);
43 }
44
45 /* return smallest index i in sorted array such that key <= a[i] */
46 /* (or n if there is no such index) */
47 static int
48 searchKey(int n, const int *a, int key)
49 {
50 int lo;
51 int hi;
52 int mid;
53
54 /* invariant: a[lo] < key <= a[hi] */
55 lo = -1;
56 hi = n;
57
58 while(lo + 1 < hi) {
59 mid = (lo+hi)/2;
60 if(a[mid] == key) {
61 return mid;
62 } else if(a[mid] < key) {
63 lo = mid;
64 } else {
65 hi = mid;
66 }
67 }
68
69 return hi;
70 }
71
72 int
73 btSearch(bTree b, int key)
74 {
75 int pos;
76
77 /* have to check for empty tree */
78 if(b->numKeys == 0) {
79 return 0;
80 }
81
82 /* look for smallest position that key fits below */
83 pos = searchKey(b->numKeys, b->keys, key);
84
85 if(pos < b->numKeys && b->keys[pos] == key) {
86 return 1;
87 } else {
88 return(!b->isLeaf && btSearch(b->kids[pos], key));
89 }
90 }
91
92 /* insert a new key into a tree */
93 /* returns new right sibling if the node splits */
94 /* and puts the median in *median */
95 /* else returns 0 */
96 static bTree
97 btInsertInternal(bTree b, int key, int *median)
98 {
99 int pos;
100 int mid;
101 bTree b2;
102
103 pos = searchKey(b->numKeys, b->keys, key);
104
105 if(pos < b->numKeys && b->keys[pos] == key) {
106 /* nothing to do */
107 return 0;
108 }
109
110 if(b->isLeaf) {
111
112 /* everybody above pos moves up one space */
113 memmove(&b->keys[pos+1], &b->keys[pos], sizeof(*(b->keys)) * (b->numKeys - pos));
114 b->keys[pos] = key;
115 b->numKeys++;
116
117 } else {
118
119 /* insert in child */
120 b2 = btInsertInternal(b->kids[pos], key, &mid);
121
122 /* maybe insert a new key in b */
123 if(b2) {
124
125 /* every key above pos moves up one space */
126 memmove(&b->keys[pos+1], &b->keys[pos], sizeof(*(b->keys)) * (b->numKeys - pos));
127 /* new kid goes in pos + 1*/
128 memmove(&b->kids[pos+2], &b->kids[pos+1], sizeof(*(b->keys)) * (b->numKeys - pos));
129
130 b->keys[pos] = mid;
131 b->kids[pos+1] = b2;
132 b->numKeys++;
133 }
134 }
135
136 /* we waste a tiny bit of space by splitting now
137 * instead of on next insert */
138 if(b->numKeys >= MAX_KEYS) {
139 mid = b->numKeys/2;
140
141 *median = b->keys[mid];
142
143 /* make a new node for keys > median */
144 /* picture is:
145 *
146 * 3 5 7
147 * A B C D
148 *
149 * becomes
150 * (5)
151 * 3 7
152 * A B C D
153 */
154 b2 = malloc(sizeof(*b2));
155
156 b2->numKeys = b->numKeys - mid - 1;
157 b2->isLeaf = b->isLeaf;
158
159 memmove(b2->keys, &b->keys[mid+1], sizeof(*(b->keys)) * b2->numKeys);
160 if(!b->isLeaf) {
161 memmove(b2->kids, &b->kids[mid+1], sizeof(*(b->kids)) * (b2->numKeys + 1));
162 }
163
164 b->numKeys = mid;
165
166 return b2;
167 } else {
168 return 0;
169 }
170 }
171
172 void
173 btInsert(bTree b, int key)
174 {
175 bTree b1; /* new left child */
176 bTree b2; /* new right child */
177 int median;
178
179 b2 = btInsertInternal(b, key, &median);
180
181 if(b2) {
182 /* basic issue here is that we are at the root */
183 /* so if we split, we have to make a new root */
184
185 b1 = malloc(sizeof(*b1));
186 assert(b1);
187
188 /* copy root to b1 */
189 memmove(b1, b, sizeof(*b));
190
191 /* make root point to b1 and b2 */
192 b->numKeys = 1;
193 b->isLeaf = 0;
194 b->keys[0] = median;
195 b->kids[0] = b1;
196 b->kids[1] = b2;
197 }
198 }
See also testBTree.c and Makefile.