Lempel Ziv compression - Everything2.com

It's fairly common to hear that some compression scheme is a variant of Lempel-Ziv. Since pretty much any sliding window compression scheme or any dictionary-based compression scheme can be claimed to be a variant of LZ, this leaves an awful lot of difference between the variants.

LZ77

The original algorithm

In 1977, Jakob Ziv and Abraham Lempel published their first algorithm, which is now known as LZ77. One advantage this algorithm had over existing algorithms was that it was adaptive: as the compression process continues, the algorithm gets better and better about compressing the input stream.

Essentially, the way LZ77 works is to encode pointers to earlier in the input stream. As a motivating example, if we wanted to encode the string "Welcome to Everything@Everything2.com", we could encode it as "Welcome to Everything@<11,10>.com", which is three characters shorter. (Hopefully, we can come up with some better way of encoding things than writing the ASCII representations of numbers, but it demonstrates the point.)

Once we've noticed that we might be able to compress data by encoding pointers to earlier strings, we need to figure out how we're going to write the pointers. In LZ77, every time we write some output, we're writing out three pieces of information:

a pointer to where the string begins
the length of the string
the character that comes after the string.

The paper includes various arguments for how many bits should be used to encode the pointer and the length; in the example that follows, I'll assume that we're using 4-bit numbers.

So, let's say we're in the middle of compression, and we're encoding Hamlet.

 To be, or not to be: that is the question
0123456789012345x

The x indicates how far we've gotten in the compression. We look back in the last 16 characters we've encoded, and try and find the longest string that is a prefix of 'o be: that is...'. The longest such string is the 4-character long string that starts at location 2 ('o be'). In this case, we'll write out the values: (2,4,':'). In this case, it only takes 16 bits to encode 5 characters.

It should be pretty obvious now that using such small numbers to encode the pointers and lengths isn't going to buy you very much: pretty often we're not going to find any strings that are prefixes, so we'll have to encode something like (0,0,'q'), meaning that we're encoding a 0-length string followed by the letter q. This takes 2 bytes to describe a single character, which is clearly suboptimal.

It's also important to know how to begin the encoding process: what do we do at the beginning of a bunch of text? LZ77 assumes that the beginning of the text is preceded by null-characters.

It turns out that with appropriate choices for the number of bits used to encode the pointers and lengths, LZ77 does about as well as many other compression schemes that needed a little bit more information about what they were encoding.

Algorithms based on LZ77

LZSS

The LZSS algorithm was published in 1982 by James Storer and Thomas Szymanski. They noticed that in LZ77, it was common to encode 0-length string prefixes as described above. Rather than always encoding a pointer, only encode a pointer if we found a string that will be longer than the pointer itself. This also means that we'll need some way to distinguish between pointers and regular characters, so we can send a bit before each symbol to tell whether it's a pointer or a character. In the case above, we'd be outputting 17 bits every time we output a pointer, but only 9 every time we output a character.

This generally achieves a better compression ratio than LZ77, and is about the same difficulty to encode or decode.

Modifications of LZSS

Variants of LZSS typically apply additional compression to the output of the LZSS compressor.

LZB: a simple variable-length code.
LZH: dynamic Huffman coding
Version 1.x of ZIP uses Shannon-Fano coding.

LZO

LZO was designed as a real-time compression library. The O stands for Oberhumer, after the author (Markus Franz Xaver Johannes Oberhumer). It appears that LZO actually implements many different variations, and there's limited documentation on the web site about what these variations are, but it's open source, so if you're really interested, use the source.

LZ78

Typically, when people refer to "Lempel Ziv," they mean this algorithm.

The original algorithm

LZ78 abandons the idea of the sliding window. Instead, the algorithm keeps track of a dictionary of previously seen phrases. When an entry in the dictionary is seen, it outputs the dictionary index instead. This eliminates the need to output the length of the encoded string, since it's part of the dictionary.

In LZ78, we find the largest sequence of characters that's already in the dictionary. We then add that sequence of characters, plus the next character in the input stream, into the dictionary. So, to encode the word "tattle" we'd do the following steps:

Start out with the empty string "" in the dictionary.
The longest prefix of "tattle" we've got in the dictionary is "". Output the dictionary index of "", followed by the code for the character "t", and add "t" to the dictionary.
The longest prefix of "attle" we've got is "". Output the dictionary index of "", followed by the code for the character "a". Add "a" to the dictionary.
Now we've got a different prefix... "ttle" starts with "t". Output its dictionary index, followed by the code for the character "t". Add "tt" to the dictionary.
...

LZW

The most common modification of LZ78 is LZW, first described by Terry Welch in 1984. LZW initializes the first 256 entries of the dictionary with all 256 possible 8-bit values. The LZW algorithm is used in the TIFF and GIF graphic formats, as well as in the v.42bis modem specification. The patent for LZW is owned by Unisys, who didn't assert their ownership of the patent until after Compuserve had made the GIF file format standard.

LZFG

LZFG is an algorithm first described by Edward R. Fiala and Daniel H. Greene in 1989. Rather than storing strings in a dictionary, they store the strings in a suffix tree. Each branch of the tree at each level corresponds to a unique letter (note that suffix trees are not binary trees: if we were encoding a suffix tree of 8-bit characters, we could have as many as 256 branches at each node.) Leaves are collapsed to form suffixes. An example of a suffix tree follows:

      ^
    T/ \A
    ^   (APPLE) 
  H/ \O
(HE) (OY)

A description of how to make such a data structure efficient is beyond the scope of this write-up. The main advantage of this algorithm is that it uses suffix trees and a specialized encoding scheme to indicate strings in the tree to achieve better compression. The biggest advantage of this scheme over LZ78 is that only a single copy of each sequence is inserted into the tree, thereby saving storage space.

References

http://wombat.doc.ic.ac.uk/foldoc/ - free online dictionary of computing
http://compression.graphicon.ru/download/articles/lz/ziv_lempel_1977_universal_algorithm.pdf - LZ77
http://compression.graphicon.ru/download/articles/lz/ziv_lempel_1978_variable-rate.pdf - LZ78
ce.sharif.ac.ir/~ghodsi/archive/Course-Materials/Data%20Structures%20and%20Algorithms/ p490-fiala.pdf - LZFG
http://www.oberhumer.com/opensource/lzo/ - the lzo homepage
The Data Compression Book, by Mark Nelson and Jean-Loup Gailly.
Various other sites found via Google

self-extracting executable	An Introduction to Data Compression, MPEG Video, and DVD Technology	lossless compression	saint
LZW	Indeo	information theory	Lempel Ziv
WinRAR	File Compression	Huffman coding	compression
Abraham Lempel	LZH	suffix tree	arithmetic coding
paronomasiac	Huffman	RAR	DCT