On paper
Comba's method is an optimization for performing multiplication on a computer using
the same "long multiplication" method that would be performed with
pencil and paper. In a paper in the IBM Systems Journal in
1990, Paul G. Comba observed that most programmers would
implement multiple precision arithmetic following the established
patterns of manual calculation, but these were not necessarily
optimal for a computer. He suggested a rearrangement of the
calculations which, since they apply to the pencil and paper method,
we can demonstrate in the same way.
Suppose we are multiplying 23 by 89. Then the standard method gives
2 3
x 8 9
=======
2_0_7
1_8_4
=======
2_0 4 7
where the
underscore character denotes a
carry from one digit
position to the next. Comba made the observation that this order of
calculation produced far too many carries from one digit position to
the next, and it was computationally expensive, requiring extra
instructions and
registers. In this example, it makes little sense
for there to be 5 carries, when the final result only has 4 digits.
Reversing the order of the operands does not help:
8 9
x 2 3
=======
2_6_7
1_7_8
=======
2_0_4 7
 in fact, the situation appears worse, requiring 6 carry operations.
This may appear something of a
paradox, that the order of inputs
should affect the execution time. A good method should perform equally
well, no matter which order the
inputs are supplied.
Comba observed that the carries in the intermediate results were
effectively wastes of computation, since carries could be produced in
the final addition in any case. The first improvement then is to
ignore the carries in these intermediate results, and deal with them
at the end. Ignoring the carries though comes at a cost; the
intermediate results may now have two digits in each column:
2 3
x 8 9
===========
18 27
16 24
===========
2 _0 _4 _7
However, the apparent paradox is resolved. It now does not matter
which order the question is asked, the intermediate results are the
same, each a
product of single digits. Furthermore, all the carries
from column to column are performed in the end result  although it
costs extra width in the columns (which now may conceal some 'internal'
carries. These should only be a bother for us humans, not for a computer).
In the computer
This subheading is offered as an invitation to start
skimming.
The next trick that Comba applied was to minimize the number of
memory writes when the computer performed this calculation. If the
calculation was performed in row order, as is typical on pencil and
paper, some storage needed to be allocated to store the
intermediate results. Instead, the computer should be instructed to
compute the products, column by column, and keep the results in
registers. The only time a digit is written to memory is at the
bottom of each column, and it is a final result in the answer. The
carry into the next column remains in registers.
While the theory behind this approach appeared good, there were a
couple of technical issues to resolve. One was, with all this carry
propagation left to the last moment, the carry itself could get
quite large. This may not show in this 2digit example, but what about
when multiplying two 13digit numbers? In that case, each column could
be as deep as 13 lines, and each number in the column could be as large
as 81, so the sum could be as large as 1053  the carry could be as
large as three digits wide. Comba's approach was to allow more
registers for the carry, and only consider those extra registers in
the later digits summed over each column.
Getting the additional registers came as a sideeffect of the
solution to the second technical issue. This method requires some
subtlety on a computer, because not all the columns have the same
number of partial results. The conventional method used two nested
loops and did not have to consider this, every row in the partial
results was the same length. Comba's resolution to this issue was a
common optimization for loops; simply unroll them. Instead of
looping over the same code again and again, make enough duplicate
copies of the code, one after the other. A routine that was originally
only about a dozen bytes in length could suddenly unroll to be
several KB in size, and there may even have to be a different
routine for every length of input number. However, it could all be
programmed quite quickly with assembler macros, and all the
difficulties could be addressed  each digit combination could in
effect have its own code if needed, to squeeze every last drop of optimization
out of the routine. There were no longer any loop counters because
there were no loops  the registers saved could address the carry issue.
How good is it?
Comba's measurements on an 8086 processor showed his
optimizations were a significant 30% faster than the "obvious"
approach. However, much of this gain had a lot to do with the
processor architecture of the time. Newer processors have caching architectures that mean loop unrolling makes less
significant improvements, or in some cases could even be detrimental.
Improvements such as pipelining and hyperthreading mean the
specific issues that Comba addressed may be no longer a concern.
The same experiments today, on a modern processor, may show only
marginal effects, but any performance improvement is worthwhile, particularly in routines that execute many thousands of times. OpenSSL and other libraries make heavy use of the Comba method, which wins over the straightforward approach or more complex, alternative algorithms for small sizes (between 4 and 16 words).
Comba multiplication is a good example of optimization as an
implementation detail. The underlying algorithm still remained the
same, and thus still remained O(n^{2}), rather than being an
improvement in time complexity such as offered by Karatsuba
multiplication or some other algorithm. Comba's refinements merely
reduced the hidden constant that the bigoh notation conceals.
Can it be better?
Nevertheless, Comba multiplication does have some potential. For
example, it is one of very few computer multiplication algorithms that
can readily be explained using pencil and paper, or that can be
performed manually. One way do do this is to write the two numbers on
the corners of pieces of paper, and either reverse one, or turn the
paper upsidedown. Now slide the two numbers past each other. As the
digits align, they determine the order of the Comba multiplication
products. In the 23 x 89 example:

3 2
`
. => 3*9 = 27 => 2047
8 9 / 
 / 
_________/ 
 / carry 2 
3 2 / 
` / 
. => 2 + 3*8 + 2*9 = 44
8 9 / 
 / 
_________/ 
 / carry 4 
3 2 / 
` / 
. => 4 + 2*8 = 20
8 9

With a little practice, the two numbers sliding past each other can be
visualized in the
mind's eye, and the answer to a multiplication can
be written down, right to left, in an impressive feat of
mental arithmetic!
Or so I'm
told.
Even on paper, Comba's method has potential for improvement. One of
the issues mentioned earlier was that the size of the products in the
individual columns, and their sums, can get quite large. One way of
minimizing this problem is to drop some of the limitations of our
decimal number system. If digits are allowed to be negative, then
the products can be made smaller (at the expense of being signed).
For example, 89 = 8*10 + 9 = 9*10  1 = 1*100  1*10  1 could be
written (1, 1, 1) in this relaxed system, and the multiplication
now becomes:
2 3
x 1 1 1
===========
2 3
2 3
2 3
===========
2 1 5 3
===========
2 0 4 7
It takes a bit of thought to keep everything straight, but the
individual products are much smaller.
You
need never multiply by 6, 7, 8 or 9 again!. As in this example, the
inputs may grow by an extra digit, but such considerations may not
matter on a computer, where an appropriate choice of
limb size
allows this modified Comba method to be quite efficient. There is
still value in deferring the carry propagation to the last moment.
Who invented it?
Interestingly, there is some controversy over Comba's
attribution to the method. Similar techniques were mentioned in
1986 in the paper that proposed Barrett reduction, so many claim
Barrett should be credited with the columnbycolumn approach. However,
it turns out the method had been known about for quite some time. An
unknown author detailed the method in Treviso Arithmetic,
published in Treviso, Italy as early as 1478.
Sources
Comba, P. G. Exponentiation cryptosystems on the IBM PC,
IBM Systems Journal, Vol. 29, No. 4, December 1990.
http://www.research.ibm.com/journal/sj/294/ibmsj2904E.pdf, viewed on
July 13, 2005
Harley, Robert. Re: Comba attributed to Barrett?. Posting on
sci.crypt, May 20 2003, archived at
http://www.derkeiler.com/Newsgroups/sci.crypt/200305/1447.html,
viewed on July 13, 2005.
Granlund, T. The GNU Multiple Precision Arithmetic
Library, http://www.swox.com/gmp/gmpman4.1.4.pdf, as viewed on
July 13, 2005.
St Denis, T. LibTom Projects,
http://libtomcrypt.org/talk_lsm.pdf, as viewed on July 13, 2005.
Thanks to randombit for the OpenSSL reference and empirical performance data.