Huffman came up with this algorithm: you build a tree in which each node's value is the sum of its children's values, and each leaf holds an element to be compressed along with its relative frequency. The tree is built by repeatedly making a new node that is the parent of the two lowest-valued nodes. By traversing the tree, 0 for a left branch and 1 for a right branch, you get unique, optimal binary codes for each compressed element (the lower the frequency, the longer the code). Many variations exist as well.
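
For the programmers in the audience, here is a rough sketch of that loop in Python (my own illustration, not the one true implementation), using a min-heap to repeatedly pull out the two lowest-valued nodes:

    import heapq
    from collections import Counter

    def build_huffman_tree(text):
        # Each heap entry is (frequency, tiebreaker, node); a node is either a
        # single symbol (a leaf) or a (left, right) pair (an internal node).
        heap = [(f, i, sym) for i, (sym, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # lowest-valued node
            f2, _, right = heapq.heappop(heap)   # second-lowest-valued node
            heapq.heappush(heap, (f1 + f2, next_id, (left, right)))  # new parent
            next_id += 1
        return heap[0][2]

    def codes_from_tree(node, prefix=""):
        # Traverse the tree: 0 for a left branch, 1 for a right branch.
        if isinstance(node, tuple):
            left, right = node
            return {**codes_from_tree(left, prefix + "0"),
                    **codes_from_tree(right, prefix + "1")}
        return {node: prefix or "0"}             # lone-symbol edge case

    # Ties can be broken differently than in the hand-worked example below, so
    # the exact codes may differ, but the total compressed size is the same.
    print(codes_from_tree(build_huffman_tree("bitter_engineer")))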

This is how pkzip and many other popular compression programs work. I shall give an example, for those like me who have a hard time just visualizing the process:

I wish to encode my alias, bitter_engineer.

  1. I look at the frequency of the letters, and create a table:
    e - 4
    i - 2
    n - 2
    r - 2
    t - 2
    b - 1
    g - 1
    _ - 1
    
  2. I start building the tree from the ground up. I can do this repeatedly, combining the two lowest-valued nodes into a new one and then listing the combined symbol in a new table:
                    e - 4
      g_            i - 2
     /  \           n - 2
    g    _          r - 2
                    t - 2
                   g_ - 2
                    b - 1
    
  3. So now I can create a new symbol, which is bg_. The next one would be rt, then ni, then rtgb_, then nie, then finally rtgb_nie:
                          15
                         /  \
                        /    \
                       /      \
                      /        \
                     7          8
                    / \        / \
                   /   \      /   \
                  3     4    4     4
                 / \   / \   |\     e
                 2 1   2  2  2 2
                /| b   r  t  n  i
               1 1
               g _  
    
    Numbers at the junctions are sums of all appearances of the symbols below them in the string. It turns out that it makes no difference to the final compression if the tree comes out looking different, as long as it follows the rule of combining the two least-used nodes.
  4. Now I create a table of letters, giving the path from the top of the tree down to each letter. Left hand turns are zeros, right hand turns are ones:
    e - 11
    i - 101
    n - 100
    r - 010
    t - 011
    b - 001
    g - 0000
    _ - 0001
    
    Note that even though the codes are of different lengths, they all lead to a dead end if followed from the top. As long as you don't lose your place in the bitstream, you will always get a unique answer.
  5. Finally, I encode my name into a bitstream, ignoring byte width. Spaces are byte separators, and the row of letters above the bits marks where each symbol's code begins.
     b  i  t   t e  r   _  e  n   g   i   n e e  r
    00110101 10111101 00001111 00000010 11001111 01000000

    The last five 0s are just padding to an even byte width.

The resulting string is 6 bytes long (if you exclude the table). In ASCII, it would be 15 bytes long. Even if I used an even 3 bit encoding for each of the 8 symbols that appear in the string, the encoded string would be 15*3 = 45 bits, 2 bits longer than Huffman.
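
If you want to check that arithmetic, a few lines of Python (using the code table from step 4) will reproduce the bitstream and the byte count:

    codes = {'e': '11', 'i': '101', 'n': '100', 'r': '010',
             't': '011', 'b': '001', 'g': '0000', '_': '0001'}

    bits = ''.join(codes[c] for c in "bitter_engineer")
    print(len(bits))                                   # 43 bits before padding
    padded = bits + '0' * (-len(bits) % 8)             # pad to a whole byte
    print(len(padded) // 8)                            # 6 bytes
    print(' '.join(padded[i:i+8] for i in range(0, len(padded), 8)))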

Huffman encoding works best on inputs with a heavily skewed statistical distribution of symbols (such as text or bitmaps), so that the advantage of giving shorter codes to more common symbols can be exploited. In fact, Huffman encoding yields little or no saving unless some symbols occur noticeably more often than others. Like most compression algorithms, it fails miserably on pseudorandom input data in which all symbol frequencies are similar. Since JPEG images already use Huffman encoding internally, they usually compress very poorly when run through this method again.

Huffman codes are a widely used and very effective technique for the compression of electronic data. Using Huffman codes, savings of 20% to 90% are the norm, depending on the characteristics of the file being compressed.

In order to understand what Huffman codes are and how exactly they can compress data so well, we need to consider the underlying problem of designing a binary character code. A binary character code is a code in which each character in the alphabet is represented by a unique binary string (i.e., a unique sequence of 0's and 1's). Huffman codes are a particular type of binary code that is especially well suited to compression.

For a simple example of a binary character code, let's assume we have a text file 100,000 characters in length. The entire file consists of just the letters a, b, c, d, e, and f, so the alphabet size of the file is 6.

Since our alphabet is of size 6, it makes sense that we would only need six different binary strings (repeated over and over again in a particular order) to represent this text file on an electronic system. Let's assume that for simplicity's sake the binary strings should all be the same length, for ease of processing.

If that is the case, each character must be represented with a binary string of length 3; length 2 doesn't provide enough unique strings to represent each letter, and length 4 wastes space with extra unused strings. Let's take a look at the following table, summarizing the contents of this file:

                             |  a  |  b  |  c  |  d  |  e  |  f  |
-----------------------------|-----|-----|-----|-----|-----|-----|
Frequency (in thousands)     | 53  | 11  | 10  | 14  |  7  |  5  |
Fixed-length codeword        | 000 | 001 | 010 | 011 | 100 | 101 |

Using this scheme, it would take 300,000 (i.e., 3 bits/character x 100,000 characters) bits to store the entire file. This is pretty good, but this can easily be topped by using a variable length code. Let's use the following variable length strings to represent the data file:

                             |   a   |   b   |   c   |   d   |   e   |   f   |
-----------------------------|-------|-------|-------|-------|-------|-------|
Frequency (in thousands)     |  53   |  11   |  10   |  14   |   7   |   5   |
Fixed-length codeword        |  000  |  001  |  010  |  011  |  100  |  101  |
Variable-length codeword     |  0    |  101  |  100  |  111  |  1101 |  1100 |

With this variable-length code, it would take 53,000 x 1 + (11,000 + 10,000 + 14,000) x 3 + (7,000 + 5,000) x 4 = 206,000 bits to store the file, a savings of over 31% in storage space.
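
As a sanity check on those figures, here is the same arithmetic in a few lines of Python (frequencies in thousands, as in the table):

    freq = {'a': 53, 'b': 11, 'c': 10, 'd': 14, 'e': 7, 'f': 5}       # in thousands
    var_code = {'a': '0', 'b': '101', 'c': '100',
                'd': '111', 'e': '1101', 'f': '1100'}

    fixed_bits = sum(freq.values()) * 3                        # 300 (thousand) bits
    var_bits = sum(freq[s] * len(var_code[s]) for s in freq)   # 206 (thousand) bits
    print(fixed_bits, var_bits, 1 - var_bits / fixed_bits)     # savings of about 31%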

This seems impressive, but there is a problem: how can we tell the variable-length codewords apart once they are run together in a bitstream? With fixed-length codewords it's easy, since the boundaries fall at fixed intervals. With variable-length codewords, length alone no longer tells us where one codeword ends and the next begins.

In order to do this, we have to make sure that the variable-length code we use is a prefix code. In other words, no codeword may be a prefix of any other codeword in the code. For example, since in the code above 0 is used to represent a, no other codeword can begin with 0. Since 101, 100, and 111 are used to represent b, c, and d, no other codewords can start with 101, 100, or 111, meaning the remaining codewords must all start with 110, which the other two do (1101 and 1100 for e and f, respectively).
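
Checking the prefix property mechanically is easy enough; a small sketch like this one (just my own brute-force test, comparing every pair of codewords) will do:

    def is_prefix_code(codewords):
        # A code is a prefix code iff no codeword is a prefix of another codeword.
        for a in codewords:
            for b in codewords:
                if a != b and b.startswith(a):
                    return False
        return True

    print(is_prefix_code(['0', '101', '100', '111', '1101', '1100']))  # True
    print(is_prefix_code(['0', '01', '110']))             # False: '0' prefixes '01'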

A Huffman code is, then, an optimal variable-length code: the most common symbol gets the shortest codeword, and codeword lengths increase as the frequencies of the corresponding symbols decrease. A Huffman code is usually generated by constructing a binary tree using a simple greedy algorithm, which can be demonstrated as follows.

Let's say we need to construct a binary tree using the characters from the file shown above:

(a: 53)  (b: 11)  (c: 10)  (d: 14)  (e: 7)  (f: 5)

Using a greedy algorithm to build a tree out of these means that at each step we choose the root with the lowest value and merge it with the root with the second-lowest value, creating a new root. We don't re-balance the tree; instead, we leave it as is. Let's get going; first, we merge e and f:

(a: 53)  (b: 11)  (c: 10)  (d: 14)       (12)
                                         /  \
                                        /    \
                                    (e: 7)  (f: 5)

Next, we choose the two roots with the lowest value and merge again. This time, c and b are the two lowest:

(a: 53)        (21)        (d: 14)       (12)
               /  \                      /  \
              /    \                    /    \
         (b: 11)  (c: 10)           (e: 7)  (f: 5)

Now, d and the tree made up of e and f have the two lowest values, so we merge them as follows (note that switching left and right on each individual subtree does not matter while merely constructing the tree as long as proper parent-child relations remain):

(a: 53)        (21)                      (26)
               /  \                      /  \
              /    \                    /    \
         (b: 11)  (c: 10)         (d: 14)    (12)
                                             /  \
                                            /    \
                                        (e: 7)  (f: 5)

Merging 21 and 26, then merging this tree with a gives our final tree:

                    (100)
                    /   \
                   /     \
              (a: 53)    (47)
                         /  \
                        /    \
                       /      \
                      /        \
                   (21)        (26)
                   /  \        /  \
                  /    \      /    \
             (b:11) (c:10) (12)  (d:14)
                           /  \
                          /    \
                      (e: 7)  (f: 5)

Once you have assembled this tree, assigning a 1 to each right branch and a 0 to each left branch builds your Huffman codes for you, and builds them optimally, for the least frequent characters are the furthest from the root and thus have the longest strings. Also, given the nature of the binary tree assembled, each bitstring is unique and is not the prefix of another.
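
To see the whole procedure end to end, here is a small standalone Python sketch (my own illustration) that builds this tree greedily and reads the codes back off it. As noted above, the left/right and thus 0/1 choices may come out mirrored compared to the drawing and the table, but the code lengths, and therefore the 206,000-bit total, are the same:

    import heapq

    freq = {'a': 53, 'b': 11, 'c': 10, 'd': 14, 'e': 7, 'f': 5}   # in thousands

    # Greedy construction: repeatedly merge the two lowest-valued roots.
    heap = [(f, i, s) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tag = len(heap)                     # tie-breaker so tuples never compare nodes
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tag, (a, b)))
        tag += 1
    tree = heap[0][2]

    # Read the codes off the tree: 0 for a left branch, 1 for a right branch.
    # (Left/right may come out mirrored relative to the drawing above; the
    # code lengths, and therefore the total size, are unaffected.)
    def codes(node, prefix=""):
        if isinstance(node, tuple):
            return {**codes(node[0], prefix + "0"), **codes(node[1], prefix + "1")}
        return {node: prefix}

    for sym, code in sorted(codes(tree).items()):
        print(sym, code, '-', len(code), 'bit(s)')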

Huffman codes, usually assembled by a binary tree as above, are used extremely regularly in modern computing for compression of data. Whenever you listen to an mp3 or use a zip file, you are utilizing Huffman codes.

Representing a Huffman Encoding

Since Huffman codes are built dynamically by considering the actual data to be compressed, an analysis of memory savings must also include the space required to represent the encoding itself. In a sufficiently random file a Huffman encoding would not generate any savings (all the leaves could be at the same height resulting in a fixed-length code), and storing the encoding would actually cause the resulting 'compressed' file to be larger. Fortunately real-world files rarely have such an even distribution of symbols.

As it turns out, a Huffman encoding requires (at least) 2n - 1 + n ceiling(lg n) bits to represent it, where n is the number of symbols. Basically we are trying to store the structure of a Huffman tree as efficiently as possible. The critical thing to note about this tree is that any given node either has 2 children or no children. Since each node has only 2 possible configurations, we can represent each node with one bit. The structure of the tree will then be implied by the order of the bits. For example, take the following tree:

    ( )
    / \
   /   \
 ( )   ( )
       / \
      /   \
    ( )   ( )      

Say we assign a 1 for leaves (i.e., nodes with no children) and a 0 for non-leaves. If we define a method of traversing the tree, we can infer its structure. For example, say we start at the top and go left to right until we reach the bottom-rightmost node. Such an encoding would be 01011 for the example given. Now if you read the string left to right, you can reconstruct the tree starting with the root node, adding children every time you come to a 0.

You can see this takes 2n - 1 bits because there are 2n - 1 nodes in any tree that is built this way. Another way to think of it is that you have n objects that need to be connected together. Each connection requires exactly one new node and merges exactly two previously separated groups. Therefore, to connect n objects you need n - 1 parent nodes, hence we have 2n - 1.
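
Here is a small sketch of that top-to-bottom, left-to-right encoding in Python, assuming the same tuple-of-tuples tree representation used in the sketches above (internal nodes are pairs, leaves are symbols; the leaf names here are just placeholders):

    from collections import deque

    def tree_shape_bits(tree):
        # Top-down, left-to-right (breadth-first) walk: emit 0 for an internal
        # node and 1 for a leaf; the tree's shape is implied by the bit order.
        bits, queue = [], deque([tree])
        while queue:
            node = queue.popleft()
            if isinstance(node, tuple):        # internal node: exactly two children
                bits.append('0')
                queue.extend(node)
            else:                              # leaf
                bits.append('1')
        return ''.join(bits)

    # The example tree above: a leaf on the left, an internal node with two
    # leaf children on the right.
    example = ('x', ('y', 'z'))
    print(tree_shape_bits(example))            # 01011
    print(len(tree_shape_bits(example)))       # 5 = 2n - 1 for n = 3 leaves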

The remaining n ceiling(lg n) bits required to complete the encoding constitute a mapping from your Huffman codes to fixed-length encodings that are globally recognized (e.g., ASCII). ceiling(lg n) is the number of bits it takes to provide a unique encoding for n different symbols, and we have n of them. Again, these can be mapped to the leaves in any clearly defined order to restore full meaning to the tree.
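
As a quick illustration (my own arithmetic, just plugging numbers into the formula above), the complete header would cost 29 bits for the six-letter file discussed earlier and 39 bits for the eight symbols of bitter_engineer:

    import math

    def header_bits(n):
        # 2n - 1 bits of tree shape plus n fixed-length symbol labels.
        return (2 * n - 1) + n * math.ceil(math.log2(n))

    print(header_bits(6))   # 11 + 18 = 29 bits for the a..f example
    print(header_bits(8))   # 15 + 24 = 39 bits for bitter_engineer's 8 symbols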

This is straightforward when your symbols already have fixed-length encodings, but for more complex compression the symbols themselves may need to be longer. For example, if you were building a compressor for English text, you might want to define symbols for common words such as 'the' and 'of' in addition to single letters. The simple representation above would then be wasteful, because every symbol would have to be stored in enough bits to hold the largest one. In that case you could add one more layer of abstraction connecting your ceiling(lg n)-length codes to the actual symbols.
