Statistical compression algorithms are a class of lossless compression algorithms based on the probability that certain characters will occur.

## RLE

RLE, or run length encoding, is perhaps the most primitive of these sorts of algorithms. It is based on the assumption that characters in the input stream will occur in clumps. This assumption is rarely true in things like text, but more often is true when considering bitmapped images, especially when using indexed color schemes (as in GIF and BMP. It turns out that RLE compression was one of the few forms of compression that Windows 3.0 knew about, which was fine, because many displays at the time were only capable of having 16 colors (and therefore images tended to have large areas that were mono-chromatic).

## Shannon-Fano Coding

Independently discovered by Claude Shannon and Simon Fano in 1949, Shannon-Fano coding is a way of assigning variable-length bit patterns to encode characters in such a way that the characters being decoded are unique. This encoding is often more efficient than a fixed-length encoding, but it's not too hard to do better.

## Huffman Coding

A method attributed to David Huffman. The basic idea is that you have some sort of probability distribution for how often characters occur in the input stream. From this, you can build a binary tree which tells what bits you should use to encode that character.

If each character has a probability of appearing that's equal to 1/2n for some n, Huffman coding is optimal.

## Arithmetic Coding

Arithmetic coding is a further improvement of these techniques. It turns out that it's possible to think of any stream of binary digits as encoding a number between 0 and 1. (Just imagine sticking a "0." in front of the digits, and you've got a fractional binary number.)

With this in mind, it's possible to divide up the interval between 0 and 1 on the real number line into segments according to which letter is supposed to come next. For example, if 'a' occurred half the time, and 'b' and 'c' both occurred a quarter of the time, you can split up the interval [0,1) into the intervals [0,0.5) for "a", [0.5,0.75) for "b", and [0.75,1) for "c". For the next letter, simply continue subdividing the interval for the given message, and what you're effectively doing is finding the interval that corresponds to your message. The last step is to convert that interval into a binary number somewhere in that range, and there's your message. (Most implementations will do this conversion as they're narrowing down on the interval, and will do it using integer arithmetic, which tends to be much faster than floating point arithmetic.)

One of the strange consequences of this is that you can actually encode a single character in less than one bit (amortized over time, of course.) In order to remove ambiguity, it's might also be necessary to encode an "end-of-file" character as well as all the possible characters you might be encoding. Unfortunately, this advantage comes at the cost of increased computation. Typically, Huffman codes are more common, even though they're not as good at compressing, because they are easier to compute (both from the standpoint of getting the code right and the amount of processor power necessary to encode and decode the streams).

• A `/msg` from kozmund (about using integer math for arithmetic coding).