This is how
pkzip and many other popular
compression programs work. I shall give an
example, for those like me who have a hard time just
visualizing the process:
I wish to encode my alias, bitter_engineer.
- I look at the frequency of the letters, and create a table (there is a short code sketch of this step after the table):
e - 4
i - 2
n - 2
r - 2
t - 2
b - 1
g - 1
_ - 1
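If it helps to see that step in code, here is a minimal Python sketch of the counting (the variable names are mine, not from pkzip or any real compressor):

from collections import Counter

text = "bitter_engineer"

# Count how many times each symbol appears in the string.
freq = Counter(text)

# Print the table, most frequent first (ties come out in arbitrary order).
for symbol, count in freq.most_common():
    print(symbol, "-", count)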
- I start building a tree, from the ground up. I can do this recursively, by combining the two least-frequent nodes into a new node, then listing the new combined symbol in an updated table (a code sketch of this loop follows the finished tree below). Combining g and _ first:

  g_
 /  \
g    _

e - 4
i - 2
n - 2
r - 2
t - 2
g_ - 2
b - 1
- So now I can create a new symbol, which is bg_. The next one would be rt, then ni, then rtgb_, then nie, then finally rtgb_nie:
                  15
                 /  \
                /    \
               /      \
              /        \
            7           8
           / \         / \
          /   \       /   \
         3     4      4    4
        / \   / \    / \   e
       2   1 2   2  2   2
      / \  b r   t  n   i
     1   1
     g   _
Numbers at the junctions are sums of all appearances of the symbols below them in the string. It turns out that it makes no difference to the final compressed size if the tree comes out looking different, as long as it follows the rule of always combining the two least-used nodes.
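Here is a rough Python sketch of that tree-building loop, using a heap to grab the two least-used nodes each time. The node layout (plain dicts) and the tie-breaking by name are my own choices, so the merge order it prints may differ from the one I did by hand above; as just noted, that does not change the final size:

import heapq
from collections import Counter

text = "bitter_engineer"
freq = Counter(text)

# One heap entry per symbol: (weight, name, node). The name is only there
# so that ties in weight never force a comparison of the node dicts.
heap = [(count, sym, {"symbol": sym, "left": None, "right": None})
        for sym, count in freq.items()]
heapq.heapify(heap)

# Repeatedly combine the two least-used nodes into a new one.
while len(heap) > 1:
    w1, name1, left = heapq.heappop(heap)
    w2, name2, right = heapq.heappop(heap)
    merged = {"symbol": name1 + name2, "left": left, "right": right}
    print(f"combine {name1} + {name2} -> {name1 + name2} (weight {w1 + w2})")
    heapq.heappush(heap, (w1 + w2, name1 + name2, merged))

weight, name, root = heap[0]    # the finished tree; weight comes out to 15 here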
- Now I create a table of letters, giving the path from the top of the tree down to each letter. Left-hand turns are zeros, right-hand turns are ones (there is a code sketch of this walk after the table):
e - 11
i - 101
n - 100
r - 010
t - 011
b - 001
g - 0000
_ - 0001
Note that even though the codes are of different lengths, no code is the beginning of another one: each code leads to a dead end (a leaf) if followed from the top of the tree. As long as you don't lose your place in the bitstream, you will always get a unique answer.
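To make the path-walking and the decoding concrete, here is a small Python sketch. The nested tuples are just my transcription of the tree drawn above (an internal node is a (left, right) pair, a leaf is a single character); a real compressor would store this differently:

# My transcription of the tree above: pairs are internal nodes, strings are leaves.
tree = (((("g", "_"), "b"), ("r", "t")), (("n", "i"), "e"))

def build_codes(node, path=""):
    # Walk down the tree; a left turn appends "0", a right turn appends "1".
    if isinstance(node, str):
        return {node: path}            # reached a leaf
    left, right = node
    codes = build_codes(left, path + "0")
    codes.update(build_codes(right, path + "1"))
    return codes

def decode(bits, tree):
    # Follow the bits from the top; every code dead-ends at a leaf, so the
    # result is unique as long as you keep your place in the bitstream.
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):
            out.append(node)           # hit a leaf: emit it and restart at the top
            node = tree
    return "".join(out)

codes = build_codes(tree)
for symbol, code in sorted(codes.items(), key=lambda kv: (len(kv[1]), kv[1])):
    print(symbol, "-", code)

print(decode("0011010110111101000011110000001011001111010", tree))  # bitter_engineer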
- Finally, I encode my name into a bitstream, ignoring byte width. The first grouping below shows the code for each letter in turn; the second shows the same bits regrouped into bytes, with spaces as byte separators (a code sketch of this step follows):
b   i   t   t   e  r   _    e  n   g    i   n   e  e  r
001 101 011 011 11 010 0001 11 100 0000 101 100 11 11 010
00110101 10111101 00001111 00000010 11001111 01000000
The last five 0s are just padding to an even byte width.
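The same step as a Python sketch, reusing the code table from above; under those assumptions the byte values it prints line up with the bitstream shown:

codes = {"e": "11", "i": "101", "n": "100", "r": "010",
         "t": "011", "b": "001", "g": "0000", "_": "0001"}

text = "bitter_engineer"

# Concatenate the code for each letter, then pad with 0s to a whole byte.
bits = "".join(codes[ch] for ch in text)
padded = bits + "0" * (-len(bits) % 8)

print(len(bits), "bits before padding")                           # 43
print(" ".join(padded[i:i+8] for i in range(0, len(padded), 8)))  # the 6 bytes above
print(len(padded) // 8, "bytes vs", len(text), "bytes of ASCII")  # 6 vs 15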
The resulting string is 6 bytes long (if you exclude the table). In ASCII, it would be 15 bytes long. Even if I used a flat 3-bit encoding for each of the 8 symbols that appear in the string, the encoded string would be 15*3 = 45 bits, 2 bits longer than Huffman's 43.
Huffman encoding works best on inputs where some symbols are much more common than others (such as text or bitmaps), so that the common symbols can be given shorter codes. When all symbols appear about equally often there is little or nothing to gain; in the worst case every code comes out just as long as a plain fixed-width encoding. Like most compression algorithms, Huffman encoding fails miserably on pseudorandom input data in which all symbol frequencies are similar. Since jpeg images already use Huffman encoding internally, they usually compress very poorly if you run them through it again.