The Burrows-Wheeler Block-sorting transformation (henceforth BWT, to save my
keyboard) is a
reversible,
lossless data manipulation algorithm originally discovered by
David Wheeler at
Cambridge in 1983 but not described until a paper written jointly with
Mike Burrows for
DEC in 1994. Although the algorithm is unavoidably slightly
expansive, is sometimes used as a
pre-processor in
compression software (most notably
bzip2) because its
output is (usually) significantly more compressible than its input.
Sadly, the reason why this is so is not immediately apparrent when the algorithm
is applied to short strings. Nevertheless, since following through an example is
the easiest way to explain how the algorithm works, I will demonstrate its
application on the word "EFFERVESCENCE"
Stage 1: The Sort
A data block of length
n is imagined as a cycle of the same
circumference, so that the last byte of the original string is immediately
followed by the first byte. This is then expanded into an
n x n matrix,
with each row shifted by one relative to the previous row:
E F F E R V E S C E N C E
F F E R V E S C E N C E E
F E R V E S C E N C E E F
E R V E S C E N C E E F F
R V E S C E N C E E F F E
V E S C E N C E E F F E R
E S C E N C E E F F E R V
S C E N C E E F F E R V E
C E N C E E F F E R V E S
E N C E E F F E R V E S C
N C E E F F E R V E S C E
C E E F F E R V E S C E N
E E F F E R V E S C E N C
The rows of this matrix are then sorted into
lexical order:
C E E F F E R V E S C E N
C E N C E E F F E R V E S
E E F F E R V E S C E N C
E F F E R V E S C E N C E
E N C E E F F E R V E S C
E R V E S C E N C E E F F
E S C E N C E E F F E R V
F E R V E S C E N C E E F
F F E R V E S C E N C E E
N C E E F F E R V E S C E
R V E S C E N C E E F F E
S C E N C E E F F E R V E
V E S C E N C E E F F E R
We take as our output of this stage the final column of this matrix, to which is
appended an
integer containing the
zero-based index of the original first
byte of the data block (This is the cause of the slight expansion mentioned
above, which is sadly unavoidable as we need to know this number in order to
reverse the transformation). In our example here, therefore, our output from
stage one of the process is:
"NSCECFVFEEEER",8
The move-to-front coder is a simple
locally adaptive one-for-one coding mechanism. You start with a stack containing the
ASCII character set, and then, for each entry in your string, you return that particular byte's current position in the stack and then move it to position zero in the stack, pushing everything underneath it down as you go. In
Hexadecimal notation, the output of encoding the string above, and the final uncompressed output of the BWT, is:
4D, 52, 44, 46, 1, 47, 55, 1, 3, 0, 0, 0, 53, F
Undoing the procedure
Reversing the Move-to-front coding is a simple procedure that simply involves applying the previous section again in reverse. Once this is done, however, there is no simple
unsort() function that can transform "NSCECFVFEEEER",8 back to "EFFERVESCENT". To do this we need to know the
transformation vector T{} which tells us which rows of the initial matrix ended up where.
Whe begin by determining which character comes after each character of the input in the initial string. Because this is simply the first column of the initial matrix, which, by definition, was sorted into lexical order, we can determine this simply by sorting the string:
N C
S C
C E
E E
C E
F E
V E
F F
E F
E N
E R
E S
R V
From this, whe can calculate the transformation vector. the 'N' in row 0 of column 1 clearly moves to row 9 of column 2, so we can say that
T{9}=0
Difficulty arises, however, when considering which of the 'C's, 'F's, and 'E's to match with which. Fortunately, however, the choice is not ambiguous as the 'Es' in the first column also appear in sorted order, as they all begin with the same character and are sorted on the second character. Thus the 'F' in row 5 of column 1 matches that in row 7 of column 2, and so on. Following this through to its conclusion, we end up with the vector:
T = { 2, 4, 3, 8, 9, A, B, 5, 7, 0, C, 1, 6 }
From this vector and the output of the sort, we can work out the initial input of the process as follows: Let I0 be the integer we appended to the stage 1 output, and then let Output1 equal the I0th character of the input string and I1 equal T{I0}. Thus:
I0 = 8
O1 = 'E', I1 = 7
O2 = 'F', I2 = 5
O3 = 'F', I3 = A
O4 = 'E', I4 = C
... and so on, until the original input "EFFERVESCENCE" is restored.
So what's the point?
Well, like I said at the start, it's sort of hard to see with only a short string. However, if you look at the output of the move-to-front coder in stage 2, you may notice the beginnings of a definite skew towards low numbers in general and the number 0 in particular. As the size of the block being sorted increases and repeating patterns become more frequent, this skew increases dramatically. And, not by any form of coincedence at all seeing as this is why the algorithm was designed, this same skew is precisely what an
entropy-based compression utility requires in order to achieve a good
compression ratio.
The choice of compression algorithm used with the BWT is almost limitless. Sadly, however, the highly efficient arithmetic compressor reccommended by Burrows and Wheeler in their paper is covered by several software patents which limit its utility. The most popular implementation of the BWT, the open-source program bzip2, therefore uses the slightly less efficient, but still quite good and patent-free, Huffman coding method.
Sources:
"A Block-sorting Lossless Data Compression Algorithm", M. Burrows, D.J. Wheeler, 10 May 1994: ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.Z (Compressed PostScript)
"Data Compression with the Burrows-Wheeler Transform", article by Mark Nelson for Dr. Dobbs Journal, Sept. 1996: http://www.dogma.net/markn/articles/bwt/bwt.htm
The comp.compression FAQ: http://www.faqs.org/faqs/compression-faq/part2/section-9.html