The algorithm in bis' writeup above can be further optimised as the result of a couple of observations
- The size of the sum is proportional to the log2 of the number of bits being counted. This sum is continuously stored in an n-bit space within the 32-bit integer range.
- On most architectures simple 8-bit constants are easier to construct than 32- or 16-bit ones.
On this basis, the final stage in bis
i = (i&MASK16) + (i>>16&MASK16);
Can be reduced to:
i = (i + (i>>16))&MASK6;
0x1F, and eliminates the need for MASK16.
Typically this will save two operations: one to set up the larger constant, and one AND operation.
A small optimsation, but useful if you're relying on your compiler to generate your code, in tight timing constraints, and with no hardware population count instruction.