You can optimise bis's solution even further.
The first two lines are the same:

i = (i & 0x55555555) + (i>>1 & 0x55555555);
i = (i & 0x33333333) + (i>>2 & 0x33333333);

Now at this point we have 8 4-bit counters and wish to add them in pairs to get 4 8-bit counters. The maximum number in each counter will be 8, because it's the sum of 8 bits. The trick is that the number 8 fits into 4 bits, which means we can do the mask operation after the add. This is the same as call's optimization, except we're doing it earlier.

i = i + (i >> 4) & 0x0F0F0F0F

We now have our 4 8-bit counters. Here's the clever part: We can add these together in a single operation.

i = i * 0x01010101

How does this work? Imagine the multiplication done by hand.

01020304
x 01010101
----------------
01020304
0102030400
010203040000
01020304000000
----------------
000103060a090704

Notice that the A is the sum of the four digits. Actually, I lied a bit, we still have to shift.

i = i >> 24

the whole thing:

i = (i & 0x55555555) + (i>>1 & 0x55555555);
i = (i & 0x33333333) + (i>>2 & 0x33333333);
i = i + (i >> 4) & 0x0F0F0F0F
i = (i * 0x01010101) >> 24

64 bit version:

i = (i & 0x5555555555555555) + (i>>1 & 0x5555555555555555);
i = (i & 0x3333333333333333) + (i>>2 & 0x3333333333333333);
i = (i & 0x0F0F0F0F0F0F0F0F) + (i>>4 & 0x0F0F0F0F0F0F0F0F);
i = (i * 0x0101010101010101) >> 56

I didn't invent this - search for bit population count on Google.