These are some quick notes on algorithms for searching strings for particular substrings, patterns, repetitions, etc., and related topics like data compression. For more details see GusfieldBook from CS365/ReservedBooks or the more recent textbook by Smyth (Bill Smyth, Computing Patterns in Strings, Pearson, 2003).
1. Why string algorithms?
- Answer from the old days: Fast string searching is the key to dealing with mountains of information. Why, a modern (c. 1960) electronic computer can search the equivalent of hundreds of pages of text in just a few hours...
- More recent answer:
We still need to search enormous corpuses of text (see http://www.google.com).
Algorithms for finding long repeated substrings or patterns can be useful for data compression (see Data_compression) or detecting plagiarism.
- We are made out of strings over a particular finite alphabet GATC; string algorithms are a central tool in computational biology.
2. What string algorithms?
There are many exciting algorithms for searching an n-character string for an m-character substring, ranging from simple brute-force place-by-place comparison (O(nm) time; see LevitinBook Section 3.2) to more sophisticated linear-time algorithms like Boyer-Moore (LevitinBook Section 7.2). The details of these more sophisticated algorithms will most likely not help you solve any problem that the algorithms themselves don't, and if there is not a canned implementation of Boyer-Moore somewhere in your libraries, you need to get better libraries. So we will not talk about these algorithms much.
Instead, we'll talk about some DataStructures for representing large strings that allow many common tasks to be done very quickly once the original string has been preprocessed to produced the data structure (which can generally be done in linear time). Examples of such tasks are:
- Find a given substring anywhere in the string.
- Find or count all occurrences of a given substring.
- Find all maximal substrings that occur more than once in the string.
3. Suffix trees and suffix arrays
Suffix trees and suffix arrays are data structures for representing texts that allow substring queries like "where does this pattern appear in the text" or "how many times does this pattern occur in the text" to be answered quickly. Both work by storing all suffixes of a text, where a suffix is a substring that runs to the end of the text. Of course, storing actual copies of all suffixes of an n-character text would take O(n2) space, so instead each suffix is represented by a pointer to its first character.
A suffix array stores all the suffixes sorted in dictionary order. For example, the suffix array of the string abracadabra is shown below. The actual contents of the array are the indices in the left-hand column; the right-hand shows the corresponding suffixes.
10 a 7 abra 0 abracadabra 3 acadabra 5 adabra 8 bra 1 bracadabra 4 cadabra 6 dabra 9 ra 2 racadabra
A suffix tree is similar, but instead using an array, we use some sort of tree data structure to hold the sorted list. A common choice given an alphabet of some fixed size k is a trie, in which each node at depth d represents a string of length d, and its up to k children represent all (d+1)-character extensions of the string. The advantage of using a suffix trie is that searching for a string of length m takes O(m) time, since we can just walk down the trie at the rate of one node per character in m. A further optimization is to replace any long chain of single-child nodes with a compressed edge labeled with the concatenation all the characters in the chain. Such compressed suffix tries can not only be searched in linear time but can also be constructed in linear time with a sufficiently clever algorithm (see Chapter 6 of GusfieldBook for details). Of course, we could also use a simple balanced binary tree, which might be preferable if the alphabet is large.
The disadvantage of suffix trees over suffix arrays is that they generally require more space to store all the internal pointers in the tree. If we are indexing a huge text (or collection of texts), this extra space may be too expensive.
3.1. Searching a suffix array
Suppose we have a suffix array corresponding to an n-character text and we want to find all occurrences in the text of an m-character pattern. Since the suffixes are ordered, the easiest solution is to do binary search for the first and last occurrences of the pattern (if any) using O(log n) comparisons. Unfortunately, each comparison may take as much as O(m) time, since we may have to check all m characters of the pattern. So the total cost will be O(m log n) in the worst case.
In practice, we can reduce the cost of the binary search for many inputs by taking advantage of the fact that if A[lo] and A[hi] have a longest common prefix w, then every row between A[lo] and A[hi has the same prefix. This allows us to avoid comparing initial segments of the pattern to the text once we have narrowed the search down enough. However, the worst-case cost of this approach is still O(m log n): for example, imagine doing binary search for bbbbbbbb where A[lo] is stuck at aaaaaaaa and A[hi] drops through cccccccc, cccccccb, ccccccbb, cccccbbb, ccccbbbb, etc., so there is no common prefix between A[lo] and A[hi] at any step. But in a typical case this approach is one of the fastest known.
By storing additional information about the longest common prefix of regisions of contiguous suffixes, it is possible to avoid having to re-examine every character in the pattern for every comparison, reducing the search cost to O(m + log n). With a sufficiently clever algorithm, this information can be computed in linear time, and can also be used to solve quickly such problems as finding the longest duplicate substrings, or most frequently occurring strings (GusfieldBook ยง7.14.4).
Using binary search on the suffix array, most searching tasks are now easy:
- Finding if a subtring appears in the array uses binary search directly.
- Finding all occurrences requires two binary searches, one for the first occurrence and one for the last. If we only want to count the occurrences and not return their positions, this takes O(m + log n) time. If we want to return their positions, it takes O(m + log n + k) time, where k is the number of times the pattern occurs.
Finding duplicate substrings of length m or more can be done by looking for adjacent entries in the array with long common prefixes, which takes O(mn) time in the worst case if done naively (and O(n) time if we have already computed longest common prefix information; see GusfieldBook).
3.2. Building a suffix array
A straightforward approach to building a suffix array is to run any decent comparison-based sorting algorithm on the set of suffixes (represented by pointers into the text). This will take O(n log n) comparisons, but in the worst case each comparison will take O(n) time, for a total of O(n2 log n) time.
The original suffix array paper by Manber and Myers ("Suffix arrays: a new method for on-line string searches," SIAM Journal on Computing 22(5):935-948, 1993) gives an O(n log n) algorithm, somewhat resembling radix sort, for building suffix arrays "in place" with only a small amount of additional space. They also note that for random text, simple radix sorting works well, since most suffixes become distinguishable after about logk n characters (where k is the size of the alphabet). Assuming random data would also give an O(n log2 n) running time for a comparison-based sort.
The fastest approach is to build a suffix tree in O(n) time and extract the suffix array by traversing the tree. The only complication is that we need the extra space to build the tree, although we get it back when we throw the tree away.
4. Burrows-Wheeler transform
Closely related to suffix arrays is the Burrows-Wheeler transform (Burrows and Wheeler, A Block-Sorting Lossless Data Compression Algorithm, DEC Systems Research Center Technical Report number 124, 1994), which is the basis for some of the best currently known algorithms for text compression (it's the technique that is used, for example, in bzip2).
The idea of the Burrows-Wheeler Transform is to construct an array whose rows are all cyclic shifts of the input string in dictionary order, and return the last column of the array. The last column will tend to have long runs of identical characters, since whenever some substring (like the appears repeatedly in the input, shifts that put the first character t in the last column will put the rest of the substring he in the first columns, and the resulting rows will tend to be sorted together. The relative regularity of the last column means that it will compress well with even very simple compression algorithms like run-length encoding.
Below is an example of the Burrows-Wheeler transform in action, with $ marking end-of-text. The transformed value of abracadabra$ is $drcraaaabba, the last column of the sorted array; note the long run of a's (and the shorter run of b's).
abracadabra$ abracadabra$ bracadabra$a abra$abracad racadabra$ab acadabra$abr acadabra$abr adabra$abrac cadabra$abra a$abracadabr adabra$abrac bracadabra$a dabra$abraca --> bra$abracada abra$abracad cadabra$abra bra$abracada dabra$abraca ra$abracadab racadabra$ab a$abracadabr ra$abracadab $abracadabra $abracadabra
The most useful property of the Burrows-Wheeler transform is that it can be inverted; this distinguishes it from other transforms that produce long runs like simply sorting the characters. We'll describe two ways to do this; the first is less efficient, but more easily grasped, and involves rebuilding the array one column at a time, starting at the left. Observe that the leftmost column is just all the characters in the string in sorted order; we can recover it by sorting the rightmost column, which we have to start off with. If we paste the rightmost and leftmost columns together, we have the list of all 2-character substrings of the original text; sorting this list gives the first two columns of the array. (Remember that each copy of the string wraps around from the right to the left.) We can then paste the rightmost column at the beginning of these two columns, sort the result, and get the first three columns. Repeating this process eventually reconstructs the entire array, from which we can read off the original string from any row. The initial stages of this process for abracadabra$ are shown below:
$ a $a ab $ab abr d a da ab dab abr r a ra ac rac aca c a ca ad cad ada r a ra a$ ra$ a$a a b ab br abr bra a -> b ab -> br abr -> bra a c ac ca aca cad a d ad da ada dab b r br ra bra rac b r br ra bra ra$ a $ a$ $a a$a $ab
Rebuilding the entire array in this fashion takes O(n2) time and O(n2) space. In their paper, Burrows and Wheeler showed that one can in fact reconstruct the original string from just the first and last columns in the array in O(n) time.
Here's the idea: Suppose that all the characters were distinct. Then after reconstructing the first column we would know all pairs of adjacent characters. So we could just start with the last character $ and regenerate the string by appending at each step the unique successor to the last character so far. If all characters were distinct, we would never get confused about which character comes next.
The problem is what to do with pairs with duplicate first characters, like ab and ac in the example above. We can imagine that each a in the last column is labeled in some unique way, so that we can talk about the first a or the third a, but how do we know which a is the one that comes before b or d?
The trick is to look closely at how the original sort works. Look at the rows in the original transformation. If we look at all rows that start with a, the order they are sorted in is determined by the suffix after a. These suffixes also appear as the prefixes of the rows that end with a, since the rows that end with a are just the rows that start with a rotated one position. It follows that all instances of the same letter occur in the same order in the first and last columns. So if we use a stable sort to construct the first column, we will correctly match up instances of letters.
This method is shown in action below. Each letter is annotated uniquely with a count of how many identical letters equal or precede it. Sorting recovers the first column, and combining the last and first columns gives a list of unique pairs of adjacent annotated characters. Now start with $1 and construct the full sequence $1 a1 b1 r1 a3 c1 a4 d1 a2 b2 r2 a5 $1. The original string is obtained by removing the end-of-string markers and annotations: abracadabra.
$1 a1 d1 a2 r1 a3 c1 a4 r2 a5 a1 b1 a2 --> b2 a3 c1 a4 d1 b1 r1 b2 r2 a5 $1
Because we are only sorting single characters, we can perform the sort in linear time using counting sort. Extracting the original string also takes linear time if implemented reasonably.
4.1. Suffix arrays and the Burrows-Wheeler transform
A useful property of the Burrows-Wheeler transform is that each row of the sorted array is essentially the same as the corresponding row in the suffix array, except for the rotated string prefix after the $ marker. This means, among other things, that we can compute the Burrows-Wheeler transform in linear time using suffix trees. Ferragina and Manzini (http://www.imc.pi.cnr.it/~manzini/papers/focs00.html) have further exploited this correspondence (and some very clever additional ideas) to design compressed suffix arrays that compress and index a text at the same time, so that pattern searches can be done directly on the compressed text in time close to that needed for suffix array searches.
5. Older notes
The file strings.pdf contains an older version of these notes.