The above is a logically flawed scheme that people frequently invent when they don't know the basics of information theory.
Any compression algorithm is a function that maps all the possible 2^N input bitstrings of length N onto corresponding compressed strings of some length. Most of those output strings are longer than the input. A very small fraction are shorter.
A successful algorithm is one that manages to shrink "interesting" strings (like "mango" for e.g.) and expands useless ones (like "zxbvqw" for e.g.)
There is no universal compression algorithm that can compress any input string. It follows from a simple argument called the counting theorem or pigeonhole principle ( However you hole a bunch of pigeons, each pigeon needs one hole ).

In essence :

  • For an input string of N bits, there are 2^N possible strings. Each one has to be mapped one to one to the output compressed strings.
  • There are fewer than 2^(N-1) strings which are shorter than N ( half of them are just 1 bit shorter than N ).
  • Therefore a huge number of them have to be longer than the input string.

Lets take the case of the King James Bible.

We use :

  • 128 bits for MD5 hash
  • 160 bits for SHA1 hash
  • 128 bits for "crib text"
  • 1472 bits for the compressed stream

Giving a total of 1888 bits

  • Actual text : 35235560 bits
  • Possible combinations of text possible (Pigeons) = 2^35235560
  • Possible combinations of compressed data (Holes) = 2^1888

Thats a huge number of pigeons with hole deficiency.

Whatever hash algorithm is used, all it can guarantee is that it will map a large input set into a small output set evenly. Using the first 32 characters ( crib text ) is nothing more than a crude hashing scheme.For an input length of 35235560 bits, both MD5 and SHA1 should generate identical hashes for 2^(35235560 - 128) and 2^(35235560 - 160) inputs respectively. A combination of the two will simply collide 2^(35235560 - (128+160)) times. Adding crib text will only reduce the collisions to 2^(35235560 - 1888).

What that means is there are 2^(35235560 - 1888) strings that will hash to exactly the same SHA1, MD5 signatures, and will also have the first 32 characters matching the King James Bible.

In fact since no hashing algorithm is perfect, the number of such "rogue" versions of the holy text will be even more than the aforementioned number and might very likely contain all manner of blasphemy, heresy and profanity condemning the compressor and its author to eternal damnation!

Log in or register to write something here or to contact authors.