[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

1. A table-based compression algorithm

For this assignment, you are to implement a data compressor and decompressor designed for files containing many duplicate lines.

The input to your compressor is a sequence of lines of text on stdin, each terminated by a newline. The state of the compressor is a table of all distinct lines that have previously been read. When a line is read, the compressor first checks to see if it is in the table.

If the line is already in the table, the compressor outputs to stdout the rank of the line, defined as the number of lines strictly less than it in the table (in strcmp order). For example, if the table contains a, aa, b, and c, then a has rank 0, aa rank 1, b rank 2, etc. This rank is formatted as follows:

If the line is not in the table, the compressor adds it to the table, and outputs the character 0xff, followed by the line (not including the terminating newline), followed by the character 0x00 (i.e. '\0').

For example, on input

abc
abc
def
def
abc
aaa
abc

the output would be (in hexadecimal) ff 61 62 63 00 00 ff 64 65 66 00 01 00 ff 61 61 61 00 01. The actual output would be an unprintable 18-byte string.

2. Your task

Your task is to write both a compressor and a decompressor for this format. Your submission should consist of whatever source files you need together with a Makefile that creates two programs compress and decompress when run with make with no arguments. The compress program should implement the above algorithm. The decompress should take the output of compress and reconstruct the original input: it should be the case that the output of ./compress | ./decompress is equal to its input for any file in which all lines end with a newline. In the case that the last line ends with EOF, your compress program should process it as if it contained a final terminating newline.

3. Hints

4. Sample solution

Here is the sample solution in tar.bz2 format; unpack it with tar jxvf 223-05-09-solution.tar.bz2: 223-05-09-solution.tar.bz2.


2014-06-17 11:58