In C:

unsigned char BitTwiddle(unsigned char input)
{
     //this function scrambles the bits of the input
     //change the hard-coded bitwise ors and ands for desired output
     //this example takes 2->5, 0->7, 1->3, 3->4, 4->0, 5->2, 6->1, 7->6
     return ( ((input&0x04)?0x20:0x00)|
              ((input&0x01)?0x80:0x00)|
              ((input&0x02)?0x08:0x00)|
              ((input&0x08)?0x10:0x00)|
              ((input&0x10)?0x01:0x00)|
              ((input&0x20)?0x04:0x00)|
              ((input&0x40)?0x02:0x00)|
              ((input&0x80)?0x40:0x00));
}


bit twiddling n.

[very common] 1. (pejorative) An exercise in tuning (see tune) in which incredible amounts of time and effort go to produce little noticeable improvement, often with the result that the code becomes incomprehensible. 2. Aimless small modification to a program, esp. for some pointless goal. 3. Approx. syn. for bit bashing; esp. used for the act of frobbing the device control register of a peripheral in an attempt to get it back to a known state.

--The Jargon File version 4.3.1, ed. ESR, autonoded by rescdsk.

Not covered by The Jargon File, the usage of the term 'bit twiddling' with which I (and everyone I've mentioned this to) am most familiar does not refer to specific performance optimisations, but merely to the activity of writing code which is aware of, and exploits in some way, the nature of integers stored as two's complement numbers, or which deals with a number by treating it as a vector of bits rather than an integer.
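As an illustration of that first form, here are two classic tricks that rely directly on the two's complement representation (a sketch; the function names are my own choosing):

unsigned lowest_set_bit(unsigned x)
{
    /* In two's complement, -x is ~x + 1: all bits below the lowest set
       bit stay zero, the lowest set bit survives, and everything above
       it is inverted, so ANDing isolates that one bit. */
    return x & -x;
}

unsigned clear_lowest_set_bit(unsigned x)
{
    /* Subtracting 1 borrows through the trailing zeros and clears the
       lowest set bit, so ANDing removes exactly that bit. */
    return x & (x - 1);
}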

It just so happens that a lot of applications of the first form of bit-twiddling are done for performance reasons. The classic examples are probably implementations of find first set, and population count, simple bit-related functions that are appearing as single instructions in modern computer architectures.
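For instance, portable fallbacks for those two functions might look something like this for 8-bit values (the names and the SWAR-style reduction are my own choices; real implementations would use the hardware instructions where available):

/* Population count via pairwise addition of bit fields:
   first sum adjacent 1-bit fields, then 2-bit fields, then 4-bit. */
unsigned popcount8(unsigned char v)
{
    unsigned x = v;
    x = (x & 0x55) + ((x >> 1) & 0x55);
    x = (x & 0x33) + ((x >> 2) & 0x33);
    x = (x & 0x0F) + ((x >> 4) & 0x0F);
    return x;
}

/* Find first set: index of the lowest set bit, or -1 if none. */
int ffs8(unsigned char v)
{
    int i;
    for (i = 0; i < 8; i++)
        if (v & (1u << i))
            return i;
    return -1;
}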

Code which simply treats numbers as bits need not be written for performance reasons, though it frequently appears in performance-critical code, and is common when interfacing with hardware or dealing with networking protocols. Calculating checksums is probably a good example, as is the arbitrary bit-remapping function provided above by sydnius.
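As a sketch of the checksum case, here's the core of a one's-complement sum of the kind used by the Internet checksum (RFC 1071), ignoring byte-order and odd-length details:

#include <stdint.h>
#include <stddef.h>

uint16_t inet_checksum(const uint16_t *words, size_t count)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < count; i++)
        sum += words[i];

    /* Fold the carries back into the low 16 bits (end-around carry),
       which is what makes this a one's-complement sum. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}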

To provide an example of the two different meanings of the phrase 'bit twiddling', here's my (more efficient) re-write of sydnius' bit-remapping code. This is bit-twiddling not only in the sense that it deals with integer quantities as vectors of bits, but also bit twiddling in the Jargon File's sense of being written for performance reasons (sydnius' code implies a lot of branches, which are A Bad Thing™ on modern computer architectures). The fact that it's a lot easier to read and maintain reliably is merely a coincidence.


unsigned char BitTwiddle2 (unsigned char input)
{
  /* Remap bits:
     "2->5, 0->7, 1->3, 3->4, 4->0, 5->2, 6->1, 7->6"
     Or, to express the result in terms of the input:

     result[0] = input[4];
     result[1] = input[6];
     result[2] = input[5];
     result[3] = input[1];
     result[4] = input[3];
     result[5] = input[2];
     result[6] = input[7];
     result[7] = input[0];

     The obvious basis of the remapping would be to form a single-bit
     value (0 or 1) corresponding with each bit of the input by shifting
     such that the input bit is in position 0, masking with 0x1,
     then shifting up into position.

     For each bit, this involves the calculation of:

        ((input >> input_bit) & 0x1) << output_bit

     then combining by addition (+) or bitwise OR (|) with each other
     bit's value. This implies 4 operations per bit:

        Shift right to position the bits,
        AND to mask all but the lowest bit
        Shift left to position the new value into the output bit position.
        OR (or addition) to combine with the other bit's values.

     However, we can do one better than this by making the following observations:

        1) 8-bit constants are cheap (free) to construct in almost any
           architecture.
        2) The first shift can be removed by using a constant value
           shifted, reducing the second shift to a smaller one.

     Using these, we can calculate each bit as:

        (( input_bit > output_bit)?
                // right shift required
                ((input & (1<<input_bit)) >> (input_bit - output_bit) ) :
                // left shift required
                ((input & (1<<input_bit)) << (output_bit - input_bit) ) )

     The two different cases are required since the output bit position
     may lie to the right or to the left of the input bit position, and
     there's no generalised shift-left-or-right construct.

     Naïvely, this appears more complex to calculate. However, since
     input_bit and output_bit will be constant, the comparison will
     have a constant outcome, and the shift indices will be constant
     (also the 1<<input_bit term is constant).

     Assuming the compiler is smart enough to remove the dead code for
     the unused condition outcome, there are only three operations per
     bit required:

        AND with 8-bit constant to get single-bit in input bit position.

        Shift left or right by appropriate constant to place bit in
        output bit position.

        OR to combine with other bit contributions.

    The calculations for each bit are dependent on each other, with
    no opportunity for parallelisation within each bit. However, since
    no branches are involved, a compiler should be able to schedule
    the operations for successive bits so that a superscalar processor
    can compute different bits in parallel.

    Since it's fairly complex, we'll define the per-bit element of our
    function as a macro:
   */
  #define MAP_BIT(input, output_bit, input_bit) \
        ( ( input_bit > output_bit)? \
                /* right shift required */ \
                ((input & (1<<input_bit)) >> (input_bit - output_bit) ) : \
                /* left shift required */ \
                ((input & (1<<input_bit)) << (output_bit - input_bit) ) )



  /* Now, again define the output in terms of the input... */
  return
     MAP_BIT(input, 0, 4) |
     MAP_BIT(input, 1, 6) |
     MAP_BIT(input, 2, 5) |
     MAP_BIT(input, 3, 1) |
     MAP_BIT(input, 4, 3) |
     MAP_BIT(input, 5, 2) |
     MAP_BIT(input, 6, 7) |
     MAP_BIT(input, 7, 0);

  /* It should be noted that this method is only efficient because the
     indices are constant; if these were variable, there'd be a lot of
     extra calculation required (which the compiler will do at compile
     time for constants). The 'obvious' method mentioned above will
     be more effective in those situations.

     Also, on ARM, the implementation given by sydnius above is likely
     to be more efficient, since each bit's operations will probably be
     reduced to simply:

        TST        rInput, #(1<<input_bit)
        ORRNE      rOutput, rOutput, #(1<<output_bit)

   */

}
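For comparison, the 'obvious' method described in the comment above could be written out as a loop over a table of input-bit indices (the function name and table-driven framing are mine; the unrolled constant-index form the comment discusses would be faster, since here the shift counts are variable):

unsigned char BitTwiddleObvious(unsigned char input)
{
    /* in_bit[output_bit] gives the input bit feeding each output bit,
       matching the mapping 2->5, 0->7, 1->3, 3->4, 4->0, 5->2, 6->1, 7->6. */
    static const int in_bit[8] = {4, 6, 5, 1, 3, 2, 7, 0};
    unsigned char result = 0;
    int out_bit;

    for (out_bit = 0; out_bit < 8; out_bit++)
        result |= (unsigned char)(((input >> in_bit[out_bit]) & 0x1) << out_bit);

    return result;
}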

A quick timing analysis on a 475MHz AMD K6-II, using gcc -O4, running each algorithm over an array of 512 random 8-bit values (to ensure most memory operations hit cache), repeated 3 times, gives:

  • Baseline comparison (identity function): 0m0.220s (0s vs. baseline)
  • BitTwiddle: 0m5.630s (5.410 vs. baseline)
  • BitTwiddle2: 0m3.180s (2.960 vs. baseline)

Giving figures for BitTwiddle of 5.410, and BitTwiddle2 of 2.960; a ratio of 1.827 in times, making BitTwiddle2 82.7% faster than BitTwiddle. If this were a performance-critical function, that's a pretty significant saving.

An Athlon XP 2400+ gets times of 0.111s, 2.521s and 1.641s; subtracting the baseline gives 2.410s and 1.530s, making BitTwiddle2 57.5% faster.

A PowerPC G4/1GHz gives figures of 0.33s, 4.01s and 1.46s; subtracting the baseline gives 3.68s and 1.13s, making BitTwiddle2 3.2 times as fast.
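The harness behind timings like these might look something like the following sketch; the repetition counts and the use of clock() are my assumptions, not necessarily the original setup:

#include <time.h>

/* Time a byte-remapping function over a small array (small enough to
   stay in cache), accumulating results into a volatile sink so the
   calls cannot be optimised away. Returns elapsed CPU seconds. */
static double time_function(unsigned char (*fn)(unsigned char),
                            const unsigned char *data, int n, int reps)
{
    volatile unsigned char sink = 0;
    clock_t start = clock();
    int i, j;

    for (j = 0; j < reps; j++)
        for (i = 0; i < n; i++)
            sink ^= fn(data[i]);
    (void)sink;

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Identity function, providing the baseline figure subtracted above. */
static unsigned char identity(unsigned char input)
{
    return input;
}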

A look at the code generated for x86 and for PowerPC shows the reason for the comparatively lower performance increase of the Athlon and K6 when compared to the PowerPC. The x86 and PowerPC code for the original function are very similar: comparisons and branches. However, the x86 code for BitTwiddle2 contains almost as many instructions as for BitTwiddle, due to the requirement that shift counts be stored in the CX register. This necessitates a lot of register shuffling, and also forces execution to be serialised on dependencies around that register, narrowing the amount of instruction-level parallelism that the CPU can exploit effectively.
