Rsync is a utility written by the Samba team that can recover almost any broken download.

Instead of binary-searching progressively smaller sections of the file, it simply checksums small blocks from the outset. Each checksum is so much smaller than the block it describes that there's not much point in interactively searching for errors (the checksums for an entire CD image come to only 600k or so, a thousand times smaller than the image itself). A fixed block size also means that the server can precompute and cache the checksums, greatly reducing its processor load.
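The server side's cacheable work can be sketched in a few lines of shell. This is only an illustration under invented filenames: real rsync uses a rolling weak checksum plus a stronger MD4 sum, not md5sum, and never writes blocks to temporary files.

```shell
# Checksum a file in fixed-size blocks, one sum per line.
# Temp paths are hypothetical; this stands in for rsync's real checksums.
BLOCK=8192
block_sums() {                # print one checksum per BLOCK-byte block of $1
  split -b "$BLOCK" -d "$1" /tmp/blk_
  md5sum /tmp/blk_* | awk '{print $1}'
  rm -f /tmp/blk_*
}
```

Because the block size is fixed, these sums can be computed once and cached; a client with a damaged copy runs the same computation locally and requests only the blocks whose sums differ.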

Rsync has its own protocol, but it can also transfer files over a Unix pipe, so it can be tunnelled through almost any other protocol. For example, rsync can be tunnelled over the internet through SSH to provide secure off-site backups without the need for a leased line: because only the changed blocks are sent, a slower (cheaper) internet connection can be used without compromising backup time.
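The SSH case is a one-liner; the host and paths below are hypothetical, but the flags are rsync's own (-a preserves the tree, -z compresses, and -e names the command that provides the pipe):

```shell
# Push /srv/data to an off-site machine, tunnelling rsync's protocol
# through SSH instead of a leased line (hypothetical host and paths).
rsync -az -e ssh /srv/data/ backup@offsite.example.com:/backups/data/
```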

Most mirror sites on the internet are kept up to date using rsync. Transmitting the same data again and again costs real money, both for the mirror and for the host, so anything that can reduce bandwidth costs is very welcome.

Apart from fixing failed downloads and rapidly mirroring incremental backups, it has some unforeseen uses. The bulk of a Debian CD image can be downloaded from the faster (and more numerous) package mirrors, using rsync and some shell scripts, by a method dubbed 'jigsaw downloading'. The files needed for the CD image are downloaded, padded to a multiple of 8k (a whole number of 2k CD sectors), and concatenated into one big file in the order they appear in the ISO image. When this skeleton is rsynced against the copy on the CD mirror, it becomes a perfect, bootable copy of the CD image, while only 10-20Mb is downloaded from the slow CD mirror. Rsync automatically fixes any files that downloaded incorrectly or were missing from the package mirrors.
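The padding-and-concatenation step can be sketched as a small shell helper. This is a minimal sketch assuming GNU coreutils; the skeleton.iso filename is invented, and the real scripts also take care of discovering the file order inside the ISO.

```shell
# Append one downloaded file to the growing skeleton image, then pad the
# image to the next 8 KiB boundary (truncate extends with zero bytes).
BLOCK=8192
pad_and_append() {
  cat "$1" >> skeleton.iso
  size=$(stat -c %s skeleton.iso)
  truncate -s $(( (size + BLOCK - 1) / BLOCK * BLOCK )) skeleton.iso
}
```

Run once per file, in the order the files appear in the ISO image; the result is a skeleton in which most blocks already match the real image, so rsync only has to fetch the rest.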


Unfortunately, there's no way to get an infinitely fast download by generating random data until it matches the checksum. For every checksum value there are 2^(size of file in bits)/2^(size of checksum in bits) possible bit combinations that match it. So if we went through every combination, we'd get every possible match, the vast majority of which are incorrect and meaningless. This follows from the pigeonhole principle: 'if there are n pigeons and n-1 holes, at least one hole has two pigeons in it'.

For there to be only one possible match, we'd need a checksum the same size as the file. So we might as well download it.