Very fast adder design for n-bit binary numbers. Whereas the naïve design (chaining full adders) have an O(n) component count and takes time O(n) to add, the carry lookahead adder requires O(n log n) components (which is a lot more), but adds in time O(log n).

The design is based on a fast method of calculating where the carry bits will be added. Suppose we're adding the 2 binary numbers xn-1...x0 and yn-1...y0. If we knew to which bits we need to add carry (define cj to be 1 if a 1 is carried out from adding bit j, else 0), we could just XOR each of these columns to get the result:

cn-1cn-2...c0
    xn-1...x1x0
    yn-1...y1y0
which takes constant time (and O(n) components).

So we need a fast way of calculating the cj's. To do this, we define 3 possible carry states, and assign one such to each of the n bits:

  • If both xj and yj are 1, then a 1 will always be carried out of that bit; this is carry state S ("set").
  • If both xj and yj are 0, then 1 will never be carried out of that bit; this is carry state C ("clear").
  • If exactly one of xj and yj is 1, then 1 will be carried out of that bit iff a 1 was carried into that bit; this is carry state P ("propogate").
Note that we may define the same 3 states for any contiguous block of bits, specifying that a 1 will always be carried out of the block (S), that 1 will never be carried out of the block (C), or that 1 will be carried out if a 1 was carried in (P). We define "multiplication" on the carry states so that the carry state of the concatenation of two blocks is the multiplication of their carry states (read the state on the right on the column, and the state on the left on the row).
. S C P
S S C S
C S C C
P S C P

This operation certainly isn't commutative (nor does it define a group), but it turns out to be associative! We may calculate the carry state of each bit using the above rules in constant time and O(n) components; let kj be the carry state for bit j. Then cj is 1 iff kj...k1k0 is S. So using the parallel prefix algorithm, we can calculate these block values, and therefore the c values, using O(n log n) components and in time O(log n).

The really amazing thing about this adder design is not that it works; it's that today similar designs are used in real CPUs!