It can be proven that under any lossless compression method, some strings (those which are nearly completely random) must undergo zero or negative compression.

Under a lossless compression method, each compressed pattern must decompress to exactly one output pattern. Without this, decompression would be ambiguous, and the algorithm would therefore be lossy. There are (2^8)^8 = 2^64 (about 1.84 × 10^19) different eight-character uncompressed strings built from one-byte characters. Assume that one of these strings is compressed from an eight-character code to a seven-character code, and all others remain identical. There is now no longer a one-to-one relationship between compressed and uncompressed codes: one seven-character code has to stand both for its original string and for the longer eight-character string, which makes the algorithm lossy. If the seven- and eight-character codes instead swap places, that ambiguity does not arise, but the original seven-character string now takes eight characters to write. Either way, at least one input produces a longer output.
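A minimal Python sketch of this counting (pigeonhole) argument, assuming one-byte characters: there are simply more eight-character strings than there are strings of seven characters or fewer, so no one-to-one scheme can shorten them all.

    ALPHABET = 256  # assume one-byte characters

    inputs = ALPHABET ** 8  # all eight-character strings: 2**64
    shorter_outputs = sum(ALPHABET ** n for n in range(8))  # lengths 0..7

    print(f"eight-character inputs:      {inputs:,}")
    print(f"outputs of 7 chars or fewer: {shorter_outputs:,}")
    # Since inputs > shorter_outputs, at least one input must map to an
    # output of eight characters or more.
    assert inputs > shorter_outputs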
However, most data is not completely random. This text, for example, contains many repetitions of words such as 'compress' and 'character', along with other features of the English language, and uses only seven-bit characters. Compression programs can exploit these regularities to reduce the size of nearly all likely inputs. In the remaining cases the file either grows or is stored uncompressed, and even storing it uncompressed requires additional metadata to say that it is uncompressed and should not be extracted, so the total size still rises.
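A rough illustration of both outcomes, sketched with Python's standard zlib module (not any particular archiver): English-like text shrinks considerably, while random bytes typically grow by a few bytes of container overhead.

    import os
    import zlib

    text = b"compress character compress character " * 100
    random_bytes = os.urandom(len(text))

    for label, data in (("text", text), ("random", random_bytes)):
        out = zlib.compress(data, 9)
        print(f"{label}: {len(data)} -> {len(out)} bytes")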


Example: see compression for run-length encoding, which compresses aaaaa and aaa to 5a and 3a respectively. But how would the literal strings 5a and 3a themselves be represented? As 151a and 131a? You could declare that digits will never appear in the input data, but then instead of increasing the size of some information, you increase the domain of the output.
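A minimal run-length encoder in the spirit of that example shows the expansion directly (a sketch only; a real implementation would also need an unambiguous way to delimit multi-digit counts):

    import itertools

    def rle_encode(s: str) -> str:
        # Each run becomes "<count><char>".
        return "".join(f"{len(list(group))}{char}"
                       for char, group in itertools.groupby(s))

    print(rle_encode("aaaaa"))  # "5a"   -- 5 chars shrink to 2
    print(rle_encode("aaa"))    # "3a"   -- 3 chars shrink to 2
    print(rle_encode("5a"))     # "151a" -- 2 chars expand to 4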

The writeup above shows that under every lossless compression scheme, some strings compress and others necessarily expand.

Random data doesn't compress well because compression algorithms work by removing redundancy. In information theory terms, the compressed output has a higher entropy per symbol than the input. Random data, however, already has maximum entropy per symbol (or close to it), so there is no redundancy left to remove.

That is, 1000 bits of random data might have an entropy of 999 bits, while 1000 bits of ordinary text might have an entropy of only 400 bits (judging by character frequencies). Huffman coding would compress the text to little more than 400 bits, while the random data would stay at around 1000 bits (depending on the distribution of its characters).
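These figures can be approximated directly. The sketch below computes the order-0 (character-frequency) entropy, which is the quantity a Huffman coder built on single-character frequencies is limited by; it ignores correlations between characters, so it overstates the true information content of text.

    import math
    import os
    from collections import Counter

    def entropy_bits(data: bytes) -> float:
        # Order-0 Shannon entropy of the whole byte string, in bits.
        counts = Counter(data)
        total = len(data)
        per_symbol = -sum(c / total * math.log2(c / total)
                          for c in counts.values())
        return per_symbol * total

    text = b"most data is not completely random, like this text " * 20
    random_bytes = os.urandom(len(text))

    print(f"text:   {entropy_bits(text):.0f} of {len(text) * 8} raw bits")
    print(f"random: {entropy_bits(random_bytes):.0f} of {len(random_bytes) * 8} raw bits")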

The extra file size needed to record how to decompress the file is not a serious objection: it is a fixed overhead, so as the original file grows it becomes more and more insignificant.

Anyway, it would be interesting to discuss what is random. For example, suppose we encode text as 7-bit ASCII packed into a continuous bitstream, and then interpret that data in chunks of 8 bits. It would usually appear to be random data, and it would not compress much under byte-oriented Huffman encoding. But it is very likely that Huffman encoding considering 7 bits at a time would compress it.
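A sketch of that observation: pack 7-bit ASCII characters into one continuous bitstream, then reread the stream in 8-bit chunks. The byte-level alphabet becomes far larger and flatter than the character-level one, which is exactly what defeats a byte-oriented Huffman coder.

    from collections import Counter

    text = "randomness is relative to some supposed organization " * 50
    bits = "".join(format(ord(c), "07b") for c in text)  # 7 bits per char
    rebytes = [int(bits[i:i + 8], 2) for i in range(0, len(bits) - 7, 8)]

    print("distinct 7-bit symbols:", len(Counter(text)))
    print("distinct 8-bit symbols:", len(Counter(rebytes)))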

Therefore, randomness in data is a relative concept: data is random only with respect to some supposed organization.
