Not covered by The Jargon File, the usage of the term 'bit twiddling'
with which I (and everyone I've mentioned this to) am most familiar
does not refer to specific performance optimisations, but merely to the
activity of writing code which is aware of, and in some way exploits,
the nature of integers stored as two's complement numbers, or which
deals with a number by treating it as a vector of bits rather than an
integer.
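For a concrete illustration of that first flavour, two classic tricks lean directly on the two's complement representation; this is my own hedged sketch, and the helper names are mine, not any established API:

```c
#include <stdint.h>

/* Isolate the lowest set bit of x. In two's complement, -x is ~x + 1,
   which agrees with x in exactly the lowest set bit (and the zeros
   below it), so x & -x leaves only that bit. */
static uint32_t lowest_set_bit(uint32_t x)
{
    return x & -x;
}

/* Clear the lowest set bit: subtracting 1 borrows through the trailing
   zeros and the lowest one, so ANDing removes just that bit. */
static uint32_t clear_lowest_set_bit(uint32_t x)
{
    return x & (x - 1);
}
```

Neither function makes any sense if you think of x as a plain magnitude; both fall out directly from the representation.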

It just so happens that a lot of applications of the first form of
bit-twiddling are done for performance reasons. The classic examples
are probably implementations of find first set and population count:
simple bit-related functions that now appear as single instructions
in modern computer architectures.
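As a hedged sketch of what those two functions look like without hardware support (the names popcount8 and ffs8 are mine; many compilers also expose builtins such as GCC's __builtin_popcount and __builtin_ffs):

```c
/* Kernighan-style population count: each pass of the loop clears the
   lowest set bit (see x & (x - 1) above), so it iterates once per
   set bit rather than once per bit position. */
static int popcount8(unsigned char v)
{
    int count = 0;
    while (v) {
        v &= (unsigned char)(v - 1);
        count++;
    }
    return count;
}

/* Find first set: the 0-based index of the lowest set bit,
   or -1 if no bit is set. */
static int ffs8(unsigned char v)
{
    if (v == 0)
        return -1;
    int i = 0;
    while (!(v & 1)) {
        v >>= 1;
        i++;
    }
    return i;
}
```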

Code which simply treats numbers as bits need not be written for
performance reasons, though it frequently turns up in
performance-critical code, and is common when interfacing with
hardware, or dealing with networking protocols. Calculating checksums
is probably a good example, as is the arbitrary bit-remapping function
provided above by sydnius.
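As a minimal sketch of the checksum case, treating the data purely as bit vectors (the function name and interface are mine, not from any particular protocol):

```c
#include <stddef.h>

/* Simple longitudinal-parity checksum: XOR all bytes together.
   Each bit of the result is the parity of the corresponding bit
   column across the whole buffer; the bytes are treated purely as
   vectors of bits, with no arithmetic meaning at all. */
static unsigned char xor_checksum(const unsigned char *buf, size_t len)
{
    unsigned char sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= buf[i];
    return sum;
}
```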

To provide an example of the two different meanings of the phrase
'bit twiddling', here's my (more efficient) re-write of sydnius'
bit-remapping code. This is bit-twiddling not only in the sense that
it deals with integer quantities as vectors of bits, but also bit
twiddling in the Jargon File's sense of being written for performance
reasons (sydnius' code implies a lot of branches, which are A Bad
Thing^{TM} on modern computer architectures). The fact that it's
a lot easier to read and maintain reliably is merely a coincidence.

```c
unsigned char BitTwiddle2 (unsigned char input)
{
    /* Remap bits:
       "2->5, 0->7, 1->3, 3->4, 4->0, 5->2, 6->1, 7->6"
       Or, to express the result in terms of the input:
         result[0] = input[4];
         result[1] = input[6];
         result[2] = input[5];
         result[3] = input[1];
         result[4] = input[3];
         result[5] = input[2];
         result[6] = input[7];
         result[7] = input[0];

       The obvious basis of the remapping would be to form a single-bit
       value (0 or 1) corresponding to each bit of the input by shifting
       such that the input bit is in position 0, and masking with 0x1,
       then shifting up into position.
       For each bit, this involves the calculation of:
         ((input >> input_bit) & 0x1) << output_bit
       then combining by addition (+) or bitwise OR (|) with each other
       bit's value. This implies 4 operations per bit:
         Shift right to position the bit,
         AND to mask all but the lowest bit,
         Shift left to position the new value into the output bit position,
         OR (or addition) to combine with the other bits' values.

       However, we can do one better than this by making the following
       observations:
         1) 8-bit constants are cheap (free) to construct on almost any
            architecture.
         2) The first shift can be removed by masking with a pre-shifted
            constant, reducing the two shifts to a single, smaller one.
       Using these, we can calculate each bit as:
         ((input_bit > output_bit) ?
           // right shift required
           ((input & (1 << input_bit)) >> (input_bit - output_bit)) :
           // left shift required
           ((input & (1 << input_bit)) << (output_bit - input_bit)))
       The two different cases are required since the output bit position
       may lie to the right or to the left of the input bit position, and
       there's no generalised shift-left-or-right construct.

       Naïvely, this appears more complex to calculate. However, since
       input_bit and output_bit will be constant, the comparison will
       have a constant outcome, and the shift indices will be constant
       (also the 1 << input_bit term is constant).
       Assuming the compiler is smart enough to remove the dead code for
       the unused condition outcome, there are only three operations per
       bit required:
         AND with an 8-bit constant to get a single bit in the input bit
           position,
         Shift left or right by the appropriate constant to place the bit
           in the output bit position,
         OR to combine with the other bits' contributions.

       The calculations for each bit are dependent on each other, with
       no opportunity for parallelisation within each bit. However, since
       no branches are involved, a compiler should be able to schedule
       the operations for successive bits so that a superscalar processor
       can compute different bits in parallel.

       Since it's fairly complex, we'll define the element of our function
       as a macro: */
#define MAP_BIT(input, output_bit, input_bit) \
    ((input_bit > output_bit) ? \
      /* right shift required */ \
      ((input & (1 << input_bit)) >> (input_bit - output_bit)) : \
      /* left shift required */ \
      ((input & (1 << input_bit)) << (output_bit - input_bit)))

    /* Now, again define the output in terms of the input... */
    return MAP_BIT(input, 0, 4) |
           MAP_BIT(input, 1, 6) |
           MAP_BIT(input, 2, 5) |
           MAP_BIT(input, 3, 1) |
           MAP_BIT(input, 4, 3) |
           MAP_BIT(input, 5, 2) |
           MAP_BIT(input, 6, 7) |
           MAP_BIT(input, 7, 0);

    /* It should be noted that this method is only efficient because the
       indices are constant; if they were variable, there'd be a lot of
       extra calculation required (which the compiler does at compile
       time for constants). The 'obvious' method mentioned above would
       be more effective in those situations.
       Also, on ARM, the implementation given by sydnius above is likely
       to be more efficient, since each bit's operations will probably
       reduce to simply:
         TST rInput, #(1<<input_bit)
         ORRNE rOutput, rOutput, #(1<<output_bit) */
}
```
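For comparison, here's the 'obvious' shift-mask-shift method described in the comments above, written out as a function. This is my own sketch of that approach (the names are mine), useful mainly when the bit indices are variables rather than constants:

```c
/* The 'obvious' method: shift the chosen input bit down to position 0,
   mask it with 0x1, then shift it up into its output position. Four
   operations per bit, but it works with variable indices too. */
static unsigned char map_bit_obvious(unsigned char input,
                                     int output_bit, int input_bit)
{
    return (unsigned char)(((input >> input_bit) & 0x1) << output_bit);
}

/* The same "2->5, 0->7, 1->3, 3->4, 4->0, 5->2, 6->1, 7->6" remap,
   built from the obvious per-bit operation. */
static unsigned char BitTwiddleObvious(unsigned char input)
{
    return map_bit_obvious(input, 0, 4) |
           map_bit_obvious(input, 1, 6) |
           map_bit_obvious(input, 2, 5) |
           map_bit_obvious(input, 3, 1) |
           map_bit_obvious(input, 4, 3) |
           map_bit_obvious(input, 5, 2) |
           map_bit_obvious(input, 6, 7) |
           map_bit_obvious(input, 7, 0);
}
```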

A quick timing analysis on a 475MHz AMD K6-II, using gcc -O4, running
each algorithm over an array of 512 random 8-bit values (to ensure most
memory operations hit the cache), repeated 3 times, gives:

- Baseline comparison (identity function): 0m0.220s (0s vs. baseline)
- BitTwiddle: 0m5.630s (5.410 vs. baseline)
- BitTwiddle2: 0m3.180s (2.960 vs. baseline)

That gives figures for BitTwiddle of 5.410s and BitTwiddle2 of 2.960s: a ratio of 1.827 in times, making BitTwiddle2 82.7% faster than BitTwiddle. If this were a performance-critical function, that's a pretty significant saving.
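For the curious, a harness along these lines can reproduce the shape of the measurement. The iteration count, seed, and use of clock() here are my assumptions, since the original figures came from the shell's time over whole runs:

```c
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 512
#define ITERATIONS 100000   /* illustrative; enough passes to measure */

/* Baseline comparison function, as in the timings above. */
static unsigned char identity(unsigned char input) { return input; }

static void fill_random(unsigned char *data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] = (unsigned char)(rand() & 0xFF);
}

/* Run fn over the whole array many times and return elapsed seconds.
   The accumulated sink is returned through *sink so the compiler
   cannot discard the calls as dead code. */
static double time_fn(unsigned char (*fn)(unsigned char),
                      const unsigned char *data, unsigned long *sink)
{
    clock_t t0 = clock();
    unsigned long acc = 0;
    for (int r = 0; r < ITERATIONS; r++)
        for (int i = 0; i < ARRAY_SIZE; i++)
            acc += fn(data[i]);
    *sink = acc;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Subtracting the identity-function time, as above, factors out the loop and memory overhead common to every run.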

An Athlon XP 2400+ gets times of 0.111s, 2.521s and 1.641s, giving baseline-relative times of 0s, 2.410s and 1.530s, making BitTwiddle2 57.5% faster.

A PowerPC G4/1GHz gives figures of 0.33s, 4.01s and 1.46s, giving baseline-relative times of 0s, 3.68s and 1.13s, making BitTwiddle2 3.2 times as fast.

A look at the code generated for x86 and for PowerPC shows the reason for the comparatively lower performance increase of the Athlon and K6 when compared to the PowerPC. The x86 and PowerPC code for the original function are very similar: comparisons and branches. However, the x86 code for BitTwiddle2 contains almost as many instructions as that for BitTwiddle, due to the requirement that shift indices be stored in the CX register. This necessitates a lot of register shuffling, and also forces execution to be serialised on dependencies around that register, narrowing the amount of instruction-level parallelism that the CPU can exploit effectively.