It is quite easy to prove the sampling theorem, or at least to make it plausible, using the convolution theorem. In brief, that theorem states that the Fourier transform of the convolution of two functions equals the product of their individual Fourier transforms, and vice versa: the transform of a product is the convolution of the transforms.
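For reference, here is that statement in symbols; I'm using the ordinary-frequency convention (so no stray factors of 2π), with $\mathcal{F}$ for the Fourier transform:

```latex
% Convolution theorem (ordinary-frequency convention):
% convolution in time <-> multiplication in frequency, and vice versa.
\mathcal{F}\{f * g\}(\nu) \;=\; \mathcal{F}\{f\}(\nu)\,\mathcal{F}\{g\}(\nu),
\qquad
\mathcal{F}\{f \cdot g\}(\nu) \;=\; \bigl(\mathcal{F}\{f\} * \mathcal{F}\{g\}\bigr)(\nu).
```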

Now sampling a function (in time or space or whatever) means multiplying it by a comb function. By the convolution theorem, we could just as well convolve the spectrum of the function in question (i.e. its Fourier transform) with the Fourier transform of the comb. That transform happens to be another comb, but with inverse spacing - the finer the comb in the time domain, the larger the distances between its teeth in the frequency domain.
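Written out (again in the ordinary-frequency convention), a comb with spacing T in time transforms into a comb with spacing 1/T in frequency:

```latex
% Dirac comb with spacing T and its Fourier transform:
\operatorname{comb}_T(t) \;=\; \sum_{n=-\infty}^{\infty} \delta(t - nT)
\quad\xrightarrow{\ \mathcal{F}\ }\quad
\frac{1}{T} \sum_{k=-\infty}^{\infty} \delta\!\left(\nu - \frac{k}{T}\right).
```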

The comb is essentially a collection of delta functions arranged on a regular grid. Delta functions are very easy to convolve with: just imagine putting a copy of the other function around every delta peak. And here we are: obviously, if we want to retrieve our sampled function perfectly, the different copies of the spectrum must not overlap - which is to say that the spacing of the delta functions in the frequency domain, i.e. the sampling rate, must be at least twice the highest frequency in the function to be sampled!
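If you want to see the overlap (aliasing) happen numerically, here is a tiny sketch in plain NumPy - the 10 Hz sampling rate and the 7 Hz / 3 Hz pair are just numbers I made up for illustration:

```python
import numpy as np

fs = 10.0          # sampling rate in Hz, so the Nyquist limit is fs / 2 = 5 Hz
n = np.arange(32)  # sample indices
t = n / fs         # sampling instants

# A 7 Hz cosine violates the criterion (7 > fs / 2), so the copies of its
# spectrum overlap and the samples become indistinguishable from those of
# a 3 Hz cosine: 7 Hz gets "folded" down to |7 - fs| = 3 Hz.
too_fast = np.cos(2 * np.pi * 7.0 * t)
alias    = np.cos(2 * np.pi * 3.0 * t)

print(np.allclose(too_fast, alias))  # True: the two sample sets are identical
```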

Ok, now did that make sense to anyone who doesn't already know what I'm talking about?