Let's delve a little deeper, shall we?

It will help if you first read the node on Huffman coding and understand the motivation behind it. (Also take a gander at the node on statistical compression.) Our basic situation is this: There are two parties, a sender and a receiver. The sender wishes to transmit a file to the receiver as quickly as possible. Since network transmissions are (typically) much slower than calculations that occur inside the machine, the sender will want to compress the file first. When the receiver gets the compressed file, he will then decompress it on his end. To make this exchange, the sender needs a compression program (an encoder) and the receiver needs a decompression program (a decoder). Typically the encoder and decoder are written together, and they are designed to understand the same format — when the decoder is run on an encoded file, the original file is the result. You have doubtless heard of some compression formats: ZIP files are perhaps the most common in general, while in the Unix world gzip and bzip2 see wide use. Compression formats like these involve many steps; what we will describe here is but one step in a general compression format.

The easiest way to send a file is one byte at a time; when we do this, no encoding or decoding is even required! The downside is that for many files, much of the bandwidth used in sending the file will be wasted. Suppose our files contain typical English text (in ASCII, not in any word processor format). A byte can represent 256 possible values (since it contains eight bits), but we do not need that many to represent text! We only need 52 to represent letters (uppercase and lowercase) and a few more to represent punctuation, spaces, etc. It turns out that 128 values will more than suffice to hold our text. So, when we are going to send our files, we can say, at the beginning of our transmission, something like "these files are all text". Then when we send them, we can use seven bits for each character instead of eight. (Typically the most significant bit is dropped, because that's easiest; but there are many other ways it could be done.) The decoder will understand what is going on, and it will know that it is actually receiving seven-bit characters. Before it saves the file, it will append the missing bit to each character. For a large file, this will reduce the transmission time by about one-eighth (12.5%), since we are sending seven bits every time instead of eight. If you have used FTP a lot, you may recognize this from the ASCII/BINARY distinction when transmitting files.

Although we talk about senders, receivers and saving bandwidth, compression is often used just to save disk space. For instance, imagine a simple image format where the color for each pixel is stored as an integer: 0=white, 1=black, 2=red, 3=blue, etc. If we only allow, say, 16 colors, then using one byte for each pixel would be quite wasteful: four bits in the byte could store the color, and the other four bits would not even be needed. To avoid wasting disk space, we could cram two pixels into each byte, in effect halving the space that we need to use to store the image. Perhaps when the file is loaded into an image editor, the editor's internal representation will use a whole byte for each color (this can make some computations easier). Even if it does, there is no need for this wasted space to be written to disk: the save function can compress it, and the load function can decompress it.

There is another factor that we have not considered: if some values occur more frequently than others, we ought to be able to use a shorthand for them. Suppose we used our image editor to make a picture that has one black pixel followed by all white pixels. Our above format represents it as `0001000000000000...`

, where the initial `0001`

represents the black pixel, and every subsequent block `0000`

represents a white pixel. Since we are only using two colors anyway, why can't we just write `1000000...`

?

We can, of course. And if we were using four colors (say 0, 1, 2, 3, representing white, black, red, blue), we could represent them in the file as pairs of bits: `00, 01, 10, 11`

. But if one color is used much more often than the others, we can do even better: suppose most of the image is white, while black/red/blue are used only sparingly. We could use the following code when we save the file:

white: 0
black: 10
red: 110
blue: 111

This way, each white pixel can be represented in a single bit. The trade-off is that the other colors can be more expensive to represent (here, red and blue require three bits now). But red and blue occur infrequently, so that for the most part we will be using close to one bit per pixel, on average!

This type of code is called a prefix code. How can we decode something that has been encoded this way? It is not as simple as it was before, since now the values in the file are not stored as blocks of the same size. If we read it from the beginning, though, we can decode it. The defining quality of a prefix code is that no codeword (here 0, 10, 110, 111) begins with another codeword. Whenever this is true, there is an unambiguous way to perform the decoding. For example, given the encoded file 0110101110, we can tell that it was encoded as 0 110 10 111 0; there is no other way to decode it that makes sense.

Prefix codes are easily generated by Huffman coding; Huffman's algorithm is simple and elegant, but the particulars of his algorithm do not concern us here. We will only point out two weaknesses of Huffman codes. The first is that each value in the input must be encoded to at least one bit in the output. Before we get to the nitty-gritty, let us define our scenario more precisely. Instead of using text files or images as our examples, from now on we will talk about sending *messages*, where each message consists of a sequence of *letters* from an alphabet `a, b, c,`

etc. To make things simpler we will use small alphabets; for now, let us use a three-letter alphabet {`a`

, `b`

, `c`

). We will now talk of encoding and decoding messages. Each letter in the alphabet occurs with a certain probability: we can think of this probability as the letter's frequency (or anticipated frequency) in the message, relative to the frequencies of the other letters. Suppose for now that `a`

occurs with probability 0.95 — that is, 95% of the letters in our messages are `a`

, or we expect about 95% of the letters in a typical message to be `a`

— suppose `b`

occurs with probability 0.03, and suppose `c`

occurs with probability 0.02. We can build a Huffman code for this alphabet. Read the node on Huffman coding and try it yourself! If you don't want to, I'll give it here:

a 0
b 10
c 11

Thus the message

`bacca`

will be encoded to the

bitstring (sequence of bits) 10011110. A typical message (according to our probabilities) will have lots of

`a`

's, so that an encoded message should use approximately 1 bit per letter. (The actual value is 1.05 bits per letter, given by a weighted average of the

codeword lengths: 0.95 * 1 + 0.03 * 2 + 0.02 * 2.) Not bad... but we would like to do even better! Intuitively, we should be able to have a symbol that means "this part is all

`a`

's, nothing to see here", while another symbol would say "we have some

`b`

's or

`c`

's here, so pay attention!" That way, a message consisting almost entirely of

`a`

's would be really short. We could try making another Huffman code:

aaaaa 0
a 10
b 110
c 111

and use the codeword 0 to encode five

`a`

's at a time. This gets harder if we have a larger alphabet with more disparate probabilities. Besides, we only guessed that it would work out nicely to group

`a`

's into blocks of five; maybe it will not be effective, or possibly even

counter-productive.

A related weakness can emerge even with less extreme probabilities. Let's consider the *least* extreme situation we can have: suppose `a, b,`

and `c`

occur with equal probability (each probability is one-third). Then our Huffman code would again be 0, 10, 11, and the average number of bits used per letter will be about 1.67 (the average of the codeword lengths 1, 2, 2). It is unfortunate that we had to make two of the codewords (10 and 11) so much
longer than the other one (0). Isn't there a way to even them out? For equal probabilities, there is a neat trick. In our example, we can treat the letters `a, b, c`

as the digits 0, 1, 2, and we can treat the message as a number in base 3 (also called "ternary"). Then our message `bacca`

becomes the ternary number 10220, which in decimal is the number 105. In binary this becomes 1101001, and this is bitstring that we will send. For a longer message, say containing *N* letters, we obtain a number that is at most 3^{N}. The number of bits needed to represent a number is given by its base-2 logarithm, written log_{2}; here, the number of bits we will need for our encoded message is log_{2} 3^{N}, approximately 1.58*N. This factor 1.58 is much better than the factor of 1.67 we obtained earlier!

This is a nice way to compress when all the letters in the alphabet occur with equal probability. Typically, though, the probabilities aren't equal. We'd like a way to apply this technique to more general probability distributions on the letters.

* * * * * * * * * * * * * * * * * * * * * * * * * * * *

Enter arithmetic coding. Arithmetic coding provides another way to encode a message, made up of a sequence of letters from an alphabet, into a string of bits. Recall that each character occurs with a particular probability; we want shorter bitstrings to correspond to messages with higher probability. Now, for simplicity, our example will just use a two-character alphabet, `a`

and `b`

, where `a`

occurs with probability 0.8 and `b`

occurs with probability 0.2. (There is no problem with using more letters, but the arithmetic becomes harder to follow. To simplify things as much as possible we will stick to two letters.) We assume that the letters in each position in the message are determined independently; under this assumption, the probability of a message is the product of the probabilities of each letter. For example, out of all the messages of length 3, the message `a`

`a`

`b`

occurs with probability 0.8 * 0.8 * 0.2 = 0.128, and the message `b`

`b`

`b`

occurs with probability 0.2 * 0.2 * 0.2 = 0.008. Therefore, the message `a`

`a`

`b`

should map to a shorter bitstring than `b`

`b`

`b`

. Does it? We'll find out.

Each letter in the alphabet is assigned to a half-open subinterval of the interval [0, 1) — that is, all the real numbers from 0 to 1, including 0 but excluding 1. The letter `a`

is assigned to an interval of width 0.8, while the letter `b`

is assigned to an interval of width 0.2. Let's say `a`

is assigned to [0, 0.8) and `b`

is assigned to [0.8, 1). Another way to think about this is that `a`

is assigned the first 80%, while `b`

is assigned the last 20%. To encode a message, we start with the interval [0, 1) and take subintervals of it based on the letters in the message. Let's see what happens with the message `a`

`a`

`b`

. Our initial interval is [0, 1). We first read in `a`

; this tells us to choose the first 80% of our current interval, so that our new interval is [0, 0.8). Our next letter is also `a`

; we choose the first 80% of our current interval, giving us [0, 0.64). Our final letter is `b`

, telling us to choose the last 20% of the interval; this gives us our final interval of [0.512, 0.64). That wasn't too bad. Look below for a breathtakingly beautiful diagram that explains the steps we just took.

[0, 1) *****************************************************************************************************************************
a: [0, 0.8) ****************************************************************************************************_________________________
a: [0, 0.64) ********************************************************************************_____________________________________________
b: [0.512, 0.64) ________________________________________________________________****************_____________________________________________

Now we need to transmit this interval. While we could send these values exactly, in general this is not a wise thing to do. Our frequencies might not be written easily as decimals, and if we try to submit the fraction in numerator-denominator form, the denominators may be too large to send. Imagine briefly that our probabilities had been 63/64 and 1/64; then on a message with *n* letters, the denominator would always be 64^{n}, which would defeat the purpose!

Instead, we send a bitstring that, when interpreted as a binary fraction, unambiguously decodes to the interval we obtained — in our case, [0.512, 0.64). Our bitstring should have the property that no matter what bits we place after it, the number that results always falls within that interval.^{*} So the bitstring 101 would not work, even though the binary fraction **.101** equals 0.625 (we will write binary numbers in bold), because when we add a 1, we get **.1011** = 0.6875, which falls outside our interval. A bitstring that works is 1001: we have **.1001** = 0.5625, and even if we put infinitely many 1's after it, we get **.1001 111...** = 1/2 + 1/8 = 0.625, which is still in the interval. Thus we encode our message as the bitstring 1001.

^{*} *Note: Technically, we only have to worry about putting a ***finite** number of bits after the string. For instance, if our interval were [0.375, 0.5), then we could use the bitstring 011 for our message, since any finite number of bits we place after **.011** will still give a number less than 0.5. If it easier to check, however, with an infinite sequence of 1's (**.011 111...** = 0.5), since this has the effect of adding a 1 to the final position in the number (**.011** + **.001**). So in our scheme, we will instead use 0110, since adding an infinite sequence of 1's gives us **.0110 111...** = 0.4375 < 0.5. We end up using one extra bit that we didn't need to; but for long messages, the time needed to handle one extra bit will hardly be noticed.

Now we are ready to send our compressed message! There are some initial control data that we must send first: the length of our message (3), the length of our bitstring (4), and the probabilities for each letter (`a`

-> 0.8, `b`

-> 0.2). Finally, we write out the bits 1001. The receiver gets this information and knows that the bitstring 1001 was used to represent a message of 3 letters. How will he decode it? From the list of probabilities, he knows that we used the intervals [0, 0.8) and [0.8, 1) for `a`

and `b`

, respectively. The bitstring 1001 gives us the number **.1001** = 0.5625. Just as the sender did, the receiver starts with the interval [0, 1). What is the next letter? Since 0.5625 falls within the first 80% of our interval, the first letter must have been `a`

. Our current interval now becomes [0, 0.8). What is the next letter? Since 0.5625 again falls within the first 80%, the second letter is `a`

as well. The current interval is updated to [0, 0.64). Now 0.5625 falls within the last 20% of the interval, so that the next character is `b`

. The receiver has determined that the original message was `a`

`a`

`b`

. Wow!

You may have noticed that our final interval [0.512, 0.64) has width 0.128, equal to the probability of our message `a`

`a`

`b`

. This is no coincidence. By design, a message with probability *p* is mapped to an interval of width *p*. For a bitstring *s* to encode the interval [*x*, *y*), we need two things to be true: {1} the binary fraction represented by *s* should fall in the interval, and {2} the binary fraction *.s*111... — *s* followed by infinitely many 1's — should also fall in the interval. Thus we need the interval [**.***s*, **.***s*111...) to be a subinterval of [*x*, *y*). The width *w* of this interval is equal to (1/2)^{L(s)}, where *L*(*s*) denotes the length of *s* (the number of bits). Subject to these conditions, we want *w* to be as large as possible, since this gives us the shortest bitstring possible; but it must also be a power of 1/2. So let us look at the numbers 0, *w*, 2**w*, 3**w*, ..., 1, but when we write out each number, we use binary and give each one the same number of bits after the binary point. (For *w* = 0.125, we would have **.000**, **.001**, **.010**, **.011**, **.100**, etc.) If 2**w* < *p*, our interval will contain at least two of these numbers, say the first one *u* and the next one *v*. Now, if we add infinitely many 1's to the end of *u*, we obtain a value equal to *v*! Thus we can use *u* as our bitstring. This method requires *w* < *p*/2; let's take the value *p*/2 and push it down to the next largest power of 1/2. This power will be at least half of *p*/2; thus some value of *w* that is at least *p*/4 will work. We choose *w* to be the smallest power of 1/2 that is at least *p*/4. The length of our bitstring comes out to be at most log_{1/2} (*p*/4) = 2 - log_{2} *p*, which is the bound that pjc51 mentioned in his writeup above. (Note that -log_{2} *p* is positive, since *p* < 1.)

Really quickly, let us encode the message `b`

`b`

`b`

and see what happens. We choose the right 20% each time, getting the interval [0.992, 1). What bitstring can we use? It is clear that we should be looking for a string of all 1's; the shortest one that works is the string 1111111 — we need 7 of them! The 1-letter messages `a`

and `b`

are represented as 0 and 111, respectively. The 10-letter message `a`

`a`

`a`

`a`

`a`

`a`

`a`

`a`

`a`

`a`

`a`

gives us the interval [0, 0.1073741824), which can be represented as the bitstring 0000. Impressive, isn't it? We were able to encode 10 letters into 4 bits — this is because those letters all had high probabilities. For a sequence of *n* `a`

's, we will get an interval of width (0.8)^{n}; if we use this as the value of *p* in the formula 2 - log_{2} *p*, we get a bit length of about 2 + 0.33*n. On the other hand, encoding the message `b`

needed 3 bits! This is as it should be: most of the time we expect `a`

's, so our code optimizes for the common case.

I see a few of you in the back with your hands raised. There are some points I skipped, which you probably noticed. I said that we must send control data before we send the bitstring; for long messages, the bitstring will be much larger than the control data, and so the cost of sending the latter is negligible. For short messages, the compression doesn't really pay off. We are better off sending the message uncompressed, or perhaps grouping a bunch of short messages together into a long one and sending that one. Also, in practice, the length of the bitstring is not included in the control data, since it can be deduced from the other control data; the process is simpler if the bit length is sent explicitly, though, and that's why I included it in the scheme.

Our sending and receiving processes do work correctly, but to make them faster, some modifications are usually made. If we use probabilities whose denominators are powers of 2, we can use integer math operations instead of floating-point operations; on many systems, integer math is faster. In our example, instead of using probabilities of 0.8 and 0.2, we could alter these to 0.796875 = 51/64 and 0.203125 = 15/64. This will make the bitstrings a little longer than they would be with the true probabilities, but the trade-off is usually worth it. (For a human being, of course, the shorter decimals are easier to handle.)

A more urgent problem is that for a long message, the process creates numbers that are expensive to store in memory: if we store them as fractions, then the numerator and denominator will have many digits; if we store them as binary numbers, we will need a lot of bits. The solution is to write out the initial bits of our encoding as soon as we know them for certain. For example, when we encode `bbbbb`

using the true probabilities, our first interval is [0.8, 1). It is clear that whatever bitstring will emerge at the end, it will need to start with the bits 11. We can therefore send 11 immediately and update our interval to [1/6, 1).
Yes, that is one-sixth. Where did it come from? Well, when we sent the bits 11, we told the receiver that our final interval will fall within [0.75, 1). Our current interval [0.8, 1) consists of the last five-sixths of [0.75, 1). Therefore, we set our current interval to be the last five-sixths of [0, 1). This process will give the same results as our original process. Moreover, if we use probabilities whose denominators are powers of 2, we avoid having to use ungainly fractions like 1/6; we will always obtain other fractions whose denominators are powers of 2.

The receiver can use a similar trick. Be warned, this gets a little more difficult. Recall that the receiver starts with the interval [0, 1). As soon as he receives the first bit 1, he can shorten the current interval to [0.5, 1). The ranges for each letter are now `a`

-> [0.5, 0.8) and `b`

-> [0.8, 1). To keep the numbers from growing to too many digits, we now stretch the interval [0.5, 1) to cover [0, 1). Note that `a`

's interval is the first 60% of [0.5, 1), and `b`

's interval is the last 40% of [0.5, 1). Therefore, our new situation is this: the current interval is [0, 1), with `a`

having the interval [0, 0.6) and `b`

having the interval [0.6, 1). After the next bit 1 is read, we will have current interval [0, 1), `a`

interval [0, 0.2), `b`

interval [0.2, 1). The third bit, another 1, places us squarely within the `b`

interval. The receiver now knows that the first letter in the message is `b`

. Tentatively we take our new current interval to be [0.2, 1), with `a`

corresponding to the first 80% of it, [0.2, 0.84), and `b`

corresponding to the last 20% of it, [0.84, 1). However, since we just read in 1, the part of the "current" interval less than 0.5 cannot be reached. We therefore truncate the current interval to [0.5, 1) and truncate the `a`

interval to [0.5, 0.84). Now, as we did initially, we stretch the current interval [0.5, 1) to cover [0, 1); the `a`

interval is stretched from [0.5, 0.84) to [0, 0.68), and the `b`

interval is stretched from [0.84, 1) to [0.68, 1).
(As you might imagine, this part is not easy to program!) We keep going like this until the entire message has been reconstructed. Again, if we alter the probabilities to have denominators that are all powers of 2, we avoid having to store numbers like 0.6, 0.2, 0.84, 0.68 that cannot be represented exactly in binary. In exchange for the more complex procedure, we avoid having to store long numbers in memory; though the logic can be confusing, the arithmetic involved is quite simple at heart, and a good programmer can get the decoder to run quite fast.

If the probabilities of the letters are not constant but instead vary over the length of the message (perhaps `b`

's become more likely near the end), we can use an *adaptive* encoding. In an adaptive encoding, whenever a letter occurs, we increase its probability slightly. Thus if we encounter many `b`

's in quick succession, it will widen the interval assigned to `b`

, anticipating that the frequency of `b`

's has increased. Another idea is to make the interval for a letter depend on the *previous* letter encoded. For instance, if we are encoding English text, we could indicate that when the previous letter was `q`

, the next letter has a high probability (nearly 100%) of being `u`

.

Nothing forces us to use an alphabet of two letters. We can use more; for instance, we may have an alphabet {`a`

, `b`

, `c`

} where `a`

occurs with probability 0.4, `b`

with 0.35, `c`

with 0.25. Then `a`

is represented by the interval [0, 0.4); `b`

by [0.4, 0.75); and `c`

by [0.75, 1). In other words, `a`

gets the first 40% of the whole interval, `b`

gets the next 35%, and `c`

gets the last 25%. The encoding and decoding work the same way as before.

In past years, arithmetic coding was covered by several patents; most of these have expired, but formats developed during those years tended to avoid arithmetic coding. You have probably heard of bzip2: the original bzip compressor used arithmetic coding, but bzip2 replaced it with Huffman coding because of concerns about patents. As mentioned in the introduction, Huffman codes can approximate the efficiency of arithmetic coding if we block groups of letters together and encode the blocks instead of the individual letters. For example, with our two character alphabet {`a`

, `b`

}, we might group a message `a`

`a`

`a`

`a`

`b`

`a`

`a`

`a`

`a`

... into blocks of three: (`a`

`a`

`a`

)(`a`

`b`

`a`

)(`a`

`a`

`a`

).... Then the Huffman code will assign `a`

`a`

`a`

a short bitstring, because it has a high probability. There is no quick way, though, to determine an optimal block size; experimentation is probably best.

SOURCES:

The technical details are from my past studies, stored in my brain.

Patent info: http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents

bzip vs bzip2: http://web.archive.org/web/19980704181204/http://www.muraroa.demon.co.uk/ (search for "arithmetic coding")

ScienceQuest 2012