Representing a Huffman Encoding

Since Huffman codes are built dynamically from the actual data to be compressed, an analysis of memory savings must also include the space required to represent the encoding itself. In a file whose symbols are uniformly distributed, a Huffman encoding would generate no savings (all the leaves could sit at the same height, yielding a fixed-length code), and storing the encoding would actually make the resulting 'compressed' file larger. Fortunately, real-world files rarely have such an even distribution of symbols.

As it turns out, a Huffman encoding requires (at least) 2n - 1 + n ceiling(lg n) bits to represent, where n is the number of symbols. Essentially we are trying to store the structure of a Huffman tree as efficiently as possible. The critical thing to note about this tree is that any given node has either 2 children or no children. Since each node has only 2 possible configurations, we can represent each node with one bit, and the structure of the tree is then implied by the order of the bits. For example, take the following tree:

    ( )
    / \
   /   \
 ( )   ( )
       / \
      /   \
    ( )   ( )

Say we assign a 1 for leaves (i.e. nodes with no children) and a 0 for non-leaves. If we fix a method of traversing the tree, we can infer its structure from the bits alone. For example, say we start at the top and sweep each level left to right until we reach the bottom-rightmost node (a breadth-first traversal). Such an encoding would be 01011 for the tree above. Now if you read the string left to right, you can reconstruct the tree starting with the root node and adding two children every time you come to a 0.
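To make the round trip concrete, here is a minimal sketch in Python; the Node class and the serialize/deserialize names are illustrative choices, not any standard API:

    from collections import deque

    class Node:
        # A node in the tree: a leaf has no children, an internal node has two.
        def __init__(self, left=None, right=None):
            self.left = left
            self.right = right

        def is_leaf(self):
            return self.left is None and self.right is None

    def serialize(root):
        # Emit one bit per node in level order: 0 = internal, 1 = leaf.
        bits, queue = [], deque([root])
        while queue:
            node = queue.popleft()
            if node.is_leaf():
                bits.append('1')
            else:
                bits.append('0')
                queue.append(node.left)
                queue.append(node.right)
        return ''.join(bits)

    def deserialize(bits):
        # Rebuild the tree: every 0 spawns two children, queued to be filled in.
        root = Node()
        queue = deque([root])
        for bit in bits:
            node = queue.popleft()
            if bit == '0':
                node.left, node.right = Node(), Node()
                queue.append(node.left)
                queue.append(node.right)
        return root

    # The example tree: a leaf on the left, an internal node with two leaf
    # children on the right.
    tree = Node(Node(), Node(Node(), Node()))
    assert serialize(tree) == '01011'
    assert serialize(deserialize('01011')) == '01011'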

You can see this takes 2n - 1 bits because there are 2n - 1 nodes in any tree built this way. Another way to think of it is that you have n objects that need to be connected together. Each connection requires exactly one new node and merges exactly two previously separated groups, so connecting n objects takes n - 1 parent nodes, giving 2n - 1 nodes in all. The example tree bears this out: n = 3 leaves, 2(3) - 1 = 5 nodes, and a 5-bit string (01011).

The remaining n ceiling(lg n) bits required to complete the encoding constitute a mapping from your Huffman codes to fixed-length encodings that are globally recognized (e.g. ASCII). ceiling(lg n) is the number of bits it takes to give each of n different symbols a unique encoding, and we have n of them to write down. Again, these can be mapped to the leaves in any clearly defined order to restore full meaning to the tree.
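Continuing the sketch, packing those indices might look like this (the function name, and the choice to list the indices in the same level order as the tree bits, are assumptions made for illustration):

    import math

    def symbol_table_bits(leaf_indices):
        # Pack one fixed-width index per leaf, using ceiling(lg n) bits apiece.
        n = len(leaf_indices)
        width = max(1, math.ceil(math.log2(n)))
        return ''.join(format(i, '0%db' % width) for i in leaf_indices)

    # Three leaves (n = 3) need ceiling(lg 3) = 2 bits each, so the complete
    # header for the example tree is 2n - 1 + n*ceiling(lg n) = 5 + 6 = 11 bits.
    header = '01011' + symbol_table_bits([0, 1, 2])  # '01011' + '00' + '01' + '10'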

This is a straightforward matter when you use a set of symbols with fixed-length encodings, but for more complex compression the symbols may need to be variable-length. For example, if you were building a compressor for English text, you might want to define symbols for common words such as 'the' and 'of' in addition to single letters. You wouldn't want to use the simple representation above, because every symbol would then have to be stored in enough bits to hold the largest one. In this case you could add one more layer of abstraction connecting your ceiling(lg n)-length codes to the actual symbols.
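One way to picture that extra layer is a length-prefixed symbol list that the ceiling(lg n)-bit indices point into; the format sketched below is just one illustrative possibility:

    def pack_symbol_list(symbols):
        # Hypothetical extra layer: each entry is a one-byte length followed
        # by the symbol's UTF-8 bytes, so entries may vary in size.
        out = bytearray()
        for s in symbols:
            data = s.encode('utf-8')
            assert len(data) < 256  # one length byte per entry
            out.append(len(data))
            out.extend(data)
        return bytes(out)

    # Index 0 -> 'e', index 1 -> 'the', index 2 -> 'of', and so on.
    table = pack_symbol_list(['e', 'the', 'of'])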