## On paper

Comba's method is an optimization for performing multiplication on a computer using the same "long multiplication" method that would be performed with pencil and paper. In a paper in the IBM Systems Journal in 1990, Paul G. Comba observed that most programmers would implement multiple precision arithmetic following the established patterns of manual calculation, but these were not necessarily optimal for a computer. He suggested a rearrangement of the calculations which, since they apply to the pencil and paper method, we can demonstrate in the same way.

Suppose we are multiplying 23 by 89. Then the standard method gives

```    2 3
x   8 9
=======
2_0_7
1_8_4
=======
2_0 4 7
```
where the underscore character denotes a carry from one digit position to the next. Comba made the observation that this order of calculation produced far too many carries from one digit position to the next, and it was computationally expensive, requiring extra instructions and registers. In this example, it makes little sense for there to be 5 carries, when the final result only has 4 digits.

Reversing the order of the operands does not help:

```    8 9
x   2 3
=======
2_6_7
1_7_8
=======
2_0_4 7
```
- in fact, the situation appears worse, requiring 6 carry operations. This may appear something of a paradox, that the order of inputs should affect the execution time. A good method should perform equally well, no matter which order the inputs are supplied.

Comba observed that the carries in the intermediate results were effectively wastes of computation, since carries could be produced in the final addition in any case. The first improvement then is to ignore the carries in these intermediate results, and deal with them at the end. Ignoring the carries though comes at a cost; the intermediate results may now have two digits in each column:

```         2  3
x        8  9
===========
18 27
16 24
===========
2 _0 _4 _7
```
However, the apparent paradox is resolved. It now does not matter which order the question is asked, the intermediate results are the same, each a product of single digits. Furthermore, all the carries from column to column are performed in the end result - although it costs extra width in the columns (which now may conceal some 'internal' carries. These should only be a bother for us humans, not for a computer).

## In the computer

This subheading is offered as an invitation to start skimming.

The next trick that Comba applied was to minimize the number of memory writes when the computer performed this calculation. If the calculation was performed in row order, as is typical on pencil and paper, some storage needed to be allocated to store the intermediate results. Instead, the computer should be instructed to compute the products, column by column, and keep the results in registers. The only time a digit is written to memory is at the bottom of each column, and it is a final result in the answer. The carry into the next column remains in registers.

While the theory behind this approach appeared good, there were a couple of technical issues to resolve. One was, with all this carry propagation left to the last moment, the carry itself could get quite large. This may not show in this 2-digit example, but what about when multiplying two 13-digit numbers? In that case, each column could be as deep as 13 lines, and each number in the column could be as large as 81, so the sum could be as large as 1053 - the carry could be as large as three digits wide. Comba's approach was to allow more registers for the carry, and only consider those extra registers in the later digits summed over each column.

Getting the additional registers came as a side-effect of the solution to the second technical issue. This method requires some subtlety on a computer, because not all the columns have the same number of partial results. The conventional method used two nested loops and did not have to consider this, every row in the partial results was the same length. Comba's resolution to this issue was a common optimization for loops; simply unroll them. Instead of looping over the same code again and again, make enough duplicate copies of the code, one after the other. A routine that was originally only about a dozen bytes in length could suddenly unroll to be several KB in size, and there may even have to be a different routine for every length of input number. However, it could all be programmed quite quickly with assembler macros, and all the difficulties could be addressed - each digit combination could in effect have its own code if needed, to squeeze every last drop of optimization out of the routine. There were no longer any loop counters because there were no loops - the registers saved could address the carry issue.

## How good is it?

Comba's measurements on an 8086 processor showed his optimizations were a significant 30% faster than the "obvious" approach. However, much of this gain had a lot to do with the processor architecture of the time. Newer processors have caching architectures that mean loop unrolling makes less significant improvements, or in some cases could even be detrimental. Improvements such as pipelining and hyperthreading mean the specific issues that Comba addressed may be no longer a concern. The same experiments today, on a modern processor, may show only marginal effects, but any performance improvement is worthwhile, particularly in routines that execute many thousands of times. OpenSSL and other libraries make heavy use of the Comba method, which wins over the straightforward approach or more complex, alternative algorithms for small sizes (between 4 and 16 words).

Comba multiplication is a good example of optimization as an implementation detail. The underlying algorithm still remained the same, and thus still remained O(n2), rather than being an improvement in time complexity such as offered by Karatsuba multiplication or some other algorithm. Comba's refinements merely reduced the hidden constant that the big-oh notation conceals.

## Can it be better?

Nevertheless, Comba multiplication does have some potential. For example, it is one of very few computer multiplication algorithms that can readily be explained using pencil and paper, or that can be performed manually. One way do do this is to write the two numbers on the corners of pieces of paper, and either reverse one, or turn the paper upside-down. Now slide the two numbers past each other. As the digits align, they determine the order of the Comba multiplication products. In the 23 x 89 example:

```  |
|3 2
`----
----.      => 3*9           = 27   => 2047
8 9|                        / |
|                       /  |
_________/   |
|              / carry 2     |
|3 2          /              |
`----        /               |
----.    => 2 + 3*8 + 2*9 = 44
8 9|                      / |
|                     /  |
_________/   |
|              / carry 4     |
|3 2          /              |
`----        /               |
----.  => 4 + 2*8       = 20
8 9|
|
```
With a little practice, the two numbers sliding past each other can be visualized in the mind's eye, and the answer to a multiplication can be written down, right to left, in an impressive feat of mental arithmetic! Or so I'm told.

Even on paper, Comba's method has potential for improvement. One of the issues mentioned earlier was that the size of the products in the individual columns, and their sums, can get quite large. One way of minimizing this problem is to drop some of the limitations of our decimal number system. If digits are allowed to be negative, then the products can be made smaller (at the expense of being signed). For example, 89 = 8*10 + 9 = 9*10 - 1 = 1*100 - 1*10 - 1 could be written (1, -1, -1) in this relaxed system, and the multiplication now becomes:

```         2  3
x     1 -1 -1
===========
-2 -3
-2 -3
2  3
===========
2  1 -5 -3
===========
2  0  4  7
```
It takes a bit of thought to keep everything straight, but the individual products are much smaller. You need never multiply by 6, 7, 8 or 9 again!. As in this example, the inputs may grow by an extra digit, but such considerations may not matter on a computer, where an appropriate choice of limb size allows this modified Comba method to be quite efficient. There is still value in deferring the carry propagation to the last moment.

## Who invented it?

Interestingly, there is some controversy over Comba's attribution to the method. Similar techniques were mentioned in 1986 in the paper that proposed Barrett reduction, so many claim Barrett should be credited with the column-by-column approach. However, it turns out the method had been known about for quite some time. An unknown author detailed the method in Treviso Arithmetic, published in Treviso, Italy as early as 1478.

### Sources

Comba, P. G. Exponentiation cryptosystems on the IBM PC, IBM Systems Journal, Vol. 29, No. 4, December 1990. http://www.research.ibm.com/journal/sj/294/ibmsj2904E.pdf, viewed on July 13, 2005
Harley, Robert. Re: Comba attributed to Barrett?. Posting on sci.crypt, May 20 2003, archived at http://www.derkeiler.com/Newsgroups/sci.crypt/2003-05/1447.html, viewed on July 13, 2005.
Granlund, T. The GNU Multiple Precision Arithmetic Library, http://www.swox.com/gmp/gmp-man-4.1.4.pdf, as viewed on July 13, 2005.
St Denis, T. LibTom Projects, http://libtomcrypt.org/talk_lsm.pdf, as viewed on July 13, 2005.

Thanks to randombit for the OpenSSL reference and empirical performance data.