Over lunch today, I was talking to Gwyn and Laurie, two of the Ruby coders from New Bamboo (http://new-bamboo.co.uk/) - I work with them on a Top Secret Web Project for work.

We started talking about geeky things, and somehow ended up discussing a compression algorithm, where you would alphabetise each character of a message, and then replacing the number of instances of a character with a number, thereby drastically compressing how much space it takes up.

The first paragraph of Lorem Ipsum, which looks a little something like this:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nulla at neque. Maecenas id velit id lacus volutpat dictum. Sed tellus erat, semper eget, euismod ac, feugiat vitae, nisi. Integer eget purus vel nulla adipiscing commodo. Ut id ante. Nullam lobortis mollis lacus. Quisque ac sem eu lectus scelerisque molestie. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Fusce urna. Ut a felis et est hendrerit hendrerit.

Could be expressed as:

a24 b4 c20 d12 e51 f4 g6 h2 i38 l27 m16 n16 o15 p9 q4 r19 s32 t32 u35 v5

Which is a reduction of 457 to 60 characters - or a 86.9% reduction in space.

Of course, this version is quite crude, and puts all text into lower case, and destroys spacing, which means that you lose all readability. We can easily add 3 more characters, for space, comma and full stop, which adds a bit of length to the compressed data, but makes the post-uncompressed data easier to read:

a24 b4 c20 d12 e51 f4 g6 h2 i38 l27 m16 n16 o15 p9 q4 r19 s32 t32 u35 v5 A70 B5 C10

If you wanted to, you could also allow for more characters and capital letters, but this would immediately double the length of the compressed string, which ruins the point a bit.

Some compression examples

100,000 words of Lorem Ipsum:

a4530 b660 c2406 d1675 e6456 f377 g706 h312 i5462 j71 l3332 m2594 n3478 o2417 p1466 q772 r3045 s4848 t4437 u4983 v863 A9890 B1328 C1630

MD5 hash is 4a1a63d4b2b17733cc64422d77eaa496
SHA1 hash is d063ea401b508e2a8b0467bdb2be67f6e5c6d693

Encoding took about 13 milliseconds.

That's a compression rate from 67643 to 114 characters, or a reduction of 99.8%

The entire King James V bible:

a282602 b50007 c58750 d159527 e422235 f83096 g57191 h286915 i196033 j13754 k25476 l132226 m85246 n227142 o247290 p46617 q953 r175732 s197573 t318954 u86542 v32428 w65213 x2663 y58249 z4784 A789645 B70684 C26147

MD5 hash is 6336aff90949d987ae7c36ee4c868399
SHA1 hash is 13b0391b632f4300eed7d77a6fc6719fba1832b0

Encoding took just over a second.

Compressing the King James bible like this reduces it from 4,404,445 characters down to 184 characters... In comparison, TAR/GZ, which is one of the industry standards of file compression, gets the same text file down to 1,316,216 characters...

Uncompressing this algorithm

Of course, it's no good only being able to compress something: you need to be able to decompress it as well. By its very nature, the above compression is lossy (you lose all capitalisation plus special characters, for example), and by the time you alphabetise a string by character, all hope is lost... Right?

The discussion around the lunch table continued. I came up with the suggestion of using a hash of the original file, taken before the file is sorted, but after it has had all its lossy conversions applied. In theory, if you had a good enough Crypographic hash function, you could tell a computer the hash and the number of times each character needs to appear, and the computer could brute-force the message, eventually finding the correct message. The process would be like so:

The computer...

  1. takes the compressed message and decompresses it by replacing 'a3b2' with aaabb
  2. generates the first possibility of the order of these letters
  3. confirms whether this particular possibility is correct.

If the hash matches, you have, in theory, uncompressed and decoded the message.

But how can you be sure that you have the right message, and not a freak hash collision?

Gwyn came up with the suggestions of using two hashes - if you hash the final file with, say, MD5 and SHA1, it becomes extremely unlikely that a solution which matches with both hashes would be a false positive.

Laurie suggested to, instead, add the first, say, 30 characters of the message in plaintext - this would be about as secure as using two hashes, but a lot less expensive in terms of processing power. In addition, it'd be a cool nod to what Alan Turing's team did when they were breaking the Enigma codes of World War 2, as we're essentially using a crib.

So, in effect, the computer...

  1. takes the compressed message and decompresses it by replacing 'a3 b2' with aaabb
  2. generates the first possibility of the order of these letters
  3. checks the possibility against the hash
  4. if the hash matches, checks the second hash
  5. if the second hash matches, checks the crib text
  6. if all three match up, we've successfully decompressed the original text

Wow, that's amazing! Have you guys revolutionised the world of text compression?

Why thank you. To be honest, probably not.

The problem is that brute-forcing through large swathes of text is bloody difficult: The string like the Lorem Ipsum example above has 457 characters. To put them in the right order, there are 45729 different combinations - that's a 1 with 77 zeroes behind it. (and this is also the reason why capital letters is a bad idea. If we were to allow capitals, the number would grow to 45755, which is an even more ludicrously large number of possibilities)

Under ideal circumstances, we can check about 5 million hashes per second (see http://www.16software.com/md5crackfast/), which sounds pretty good, but to re-organise the simple 457 character item, it would still take one of today's computers 8*1056 million years.

So the great news is that we've found a way of drastically reducing the size of plaintext. The second item of great news is that it's relatively easy to decompress this compression algorithm. The massive red flag of bad news is that, while decompression is easy, it takes an unfathomable amount of time to do so, which makes the whole exercise prohibitively difficult.

So is all of this a waste of time?

Well, yes and no. There's a competition out there (see http://prize.hutter1.net/) which aims to compress a 100MB plain text version of Wikipedia. The current record stands at about 16MB, which is about twice as efficient as the commonly-used ZIP compression format. The GLH compression algorithm outlined here trounces the record - instead of taking up 16,481,655 bytes, we can describe all of the Hutter Prize data in only 225 bytes:

a6051355 b1212851 c2624828 d2569957 e8178365 f1509695 g1611390 h2988658 i5479606 j159395 k458992 l3169420 m2129600 n5028882 o5266127 p1724317 q270967 r4648610 s4643363 t6463244 u2095880 v739642 w1032926 x234215 y1088981 z124570 A13519824 B787826 C794548

Now if only we could find a way of decompressing this again in under 10 hours (hahahahahah), we could claim the 50,000 euro bounty!

(of course, none of this is new (see http://www.hutter1.net/ait.htm)... But it does make me happy that we managed to come up with this stuff on our own!)

Fancy having a go yourself?

No problem, head to the GLH compression booth (http://kamps.org/hidden/glh)!

I want to see your code!

Sure thing. It cleans up a string called $toencode, and passes it to this function:

	$encoding = strtolower($toencode); // all to lower case

	$avoid   = array(" ", ",", "."); // Replace whitespace w/ A,B,C
	$replace = array("A", "B", "C");
	$encoding = str_replace($avoid, $replace, $encoding);

// strip all non-alphanumerics
	$encoding = preg_replace("/^ABCa-z/", "", $encoding); 
	
	$checkchars = array("a", "b", "c", "d", "e", "f", "g", "h", 
	"i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", 
	"v", "w", "x", "y", "z", "A", "B", "C");
	
	foreach ($checkchars as &$character) {
	    $numberfound = substr_count($encoding, $character);
		if ($numberfound) { $outputstring .= $character . $numberfound . " "; }		
		unset ($number);
	}

Now, $outputstring contains the compressed data, and all you need to do is to md5($encoding) and sha1($encoding) to get the hashes for both of them.

This piece of code is all you need to compress stuff. The bible and the 100MB of Wikipedia were done with this code too, taking about 1 second and 27 seconds, respectively, but this needed quite a serious amount of memory and processing power, so in the interest of keeping slicehost happy, maximum upload is a lot less than those two files :)

Could this be improved any further?

Well, yes, you could, in fact, in theory, remove the original text altogether. Take the original text and hash it against a series of different hashing mechanisms. To help it out, tell it a) how many characters there were in the original text, b) what the first 50 characters are, and c) what the character set is (i.e. only lower-case letters). You could then brute-force the whole message, which means that you could compress a message even further, down to just a series of hashes.

The problem here is that your reconstitution of the message goes from being silly to ludicrous: You're now, in effect, taking blind guesses at the message, and checking it against a series of hashes. The bigger the message, the longer it will take. I predict it's going to take a very long time before we can start using hashes-only to compress our data.

... Unless we get those quantum computers, of course.

Originally posted on (and with clarifying comments and further musings) my website: http://kamps.org/haje/glh-compression/

-30-

The above is a logically flawed scheme that people frequently invent when they don't know the basics of information theory.
Any compression algorithm is a function that maps all the possible 2^N input bitstrings of length N onto corresponding compressed strings of some length. Most of those output strings are longer than the input. A very small fraction are shorter.
A successful algorithm is one that manages to shrink "interesting" strings (like "mango" for e.g.) and expands useless ones (like "zxbvqw" for e.g.)
There is no universal compression algorithm that can compress any input string. It follows from a simple argument called the counting theorem or pigeonhole principle ( However you hole a bunch of pigeons, each pigeon needs one hole ).

In essence :

  • For an input string of N bits, there are 2^N possible strings. Each one has to be mapped one to one to the output compressed strings.
  • There are fewer than 2^(N-1) strings which are shorter than N ( half of them are just 1 bit shorter than N ).
  • Therefore a huge number of them have to be longer than the input string.

Lets take the case of the King James Bible.

We use :

  • 128 bits for MD5 hash
  • 160 bits for SHA1 hash
  • 128 bits for "crib text"
  • 1472 bits for the compressed stream

Giving a total of 1888 bits

  • Actual text : 35235560 bits
  • Possible combinations of text possible (Pigeons) = 2^35235560
  • Possible combinations of compressed data (Holes) = 2^1888

Thats a huge number of pigeons with hole deficiency.

Whatever hash algorithm is used, all it can guarantee is that it will map a large input set into a small output set evenly. Using the first 32 characters ( crib text ) is nothing more than a crude hashing scheme.For an input length of 35235560 bits, both MD5 and SHA1 should generate identical hashes for 2^(35235560 - 128) and 2^(35235560 - 160) inputs respectively. A combination of the two will simply collide 2^(35235560 - (128+160)) times. Adding crib text will only reduce the collisions to 2^(35235560 - 1888).

What that means is there are 2^(35235560 - 1888) strings that will hash to exactly the same SHA1, MD5 signatures, and will also have the first 32 characters matching the King James Bible.

In fact since no hashing algorithm is perfect, the number of such "rogue" versions of the holy text will be even more than the aforementioned number and might very likely contain all manner of blasphemy, heresy and profanity condemning the compressor and its author to eternal damnation!

Log in or register to write something here or to contact authors.