The algorithm in bis' writeup above can be further optimised as the result of a couple of observations

- The size of the sum is proportional to the log
_{2} of the number of bits being counted. This sum is continuously stored in an n-bit space within the 32-bit integer range.
- On most architectures simple 8-bit constants are easier to construct than 32- or 16-bit ones.

On this basis, the final stage in

bis's code:

`i = (i&MASK16) + (i>>16&MASK16);`

Can be reduced to:

`i = (i + (i>>16))&MASK6;`

Where `MASK6`

is `(1<<6)-1`

or `0x1F`

, and eliminates the need for MASK16.

Typically this will save two operations: one to set up the larger constant, and one AND operation.

A small optimsation, but useful if you're relying on your compiler to generate your code, in tight timing constraints, and with no hardware population count instruction.