/*

While SIMD is primarily thought of as a hardware concept, the same techniques as common SIMD implementations can be used purely in software for increased performance in critical code, particularly when using a system which has no hardware support for SIMD. As always YMMV. In fact, your mileage will almost certainly vary wildly.

Hardware SIMD implementations generally work on a constant sized block of data. Integer SIMD hardware like MMX, AltiVec, MDMX, MAX, MVI and VIS all treat a block of data (typically 64 or 128 bits) as a series of smaller sized units. The data units are treated as vectors of smaller-sized quantities: 128 bits of data can be a vector of 16 8-bit bytes, 8 16-bit half-words, 4 32-bit words or 2 64-bit long longs. Typical SIMD operations operate on corresponding elements of vectors; a good example is a byte addition operation. As an example, a two 32-bit vectors of 4 8-bit bytes can be added together like this:

    { 0xdd 0xcc 0xbb 0xaa } +
    { 0x11 0x22 0x33 0x44 }
    =
    { 0xee 0xee 0xee 0xee }

Each component of the result is the result of applying the operation to the corresponding elements of the two parameters. Conventionally, in C or C++, this would be written similarly to:

unsigned char *a, *b, result;
for (int i=0; i<4; i++) *result = *a + *b;
 

Terminology

First thing's first. Everybody and his uncle seems to use different terminology to describe different sizes of integer datum. So here's what I happen to call 'em:
  • 8 bits: "byte". No pussy-footing around with this octet nonsense
  • 16 bits: "half-word". The only true definition of the size of a 'word' is entirely dependent on the native machine word size of the target architecture. Which is fine if you now what you're architecture is. I'm going to assume that we're dealing with a 32-bit machine because, frankly, its what I were brung up on.
  • 32 bits: "word". See above.
  • 64 bits: "long" or "long long" because the latter happens to be the ANSI (or is that still just gcc?) definition.

SIMD Using Conventional Binary Arithmetic

A conventional 32-bit RISC has only a 32-bit ALU. When performing addition or subtraction on smaller quantities, such as bytes, the higher 24 bits of the ALU go unused. In addition, data elements also have to be moved to and from memory separately, or moved as a single unit then rearranged in a CPU register. An addition of two 32-bit quantities is almost the same function as a vector addition of two vectors of 4 8-bit quantities:

    0xddccbbaa +
    0x11223344
    =
    0xeeeeeeee

The difference between the two functions is in the carry propagation behaviour. In a plain 32-bit addition, if the sum of the lower 8 bits of the arguments is larger than 255, bit 9 of the result is set. In a SIMD operation, this would correspond to the lowest bit of the next 8-bit quantity.

In order to compute a smaller unit vector sum from a larger unit's addition, we need to ensure that the operation performed on each unit within the 32-bit data will not generate a carry output which may be incorporated into the result of the next unit. At least two techniques can be employed to achieve this.
 

Field separation to avoid carry

If a 'dead' zone can be allocated within the primary data format, that can be zero'd in the two arguments before addition: The corresponding 'dead' zone in the result data will be the carry result from the unit below it, and will generate no carry output of it's own.

For regular sized data elements (eg. bytes within a 32-bit word), arranging the data into this format will almost certainly eliminate any performance gain made by performing the additions as a single operation. In either case this approach only works for addition and subtraction operations (not multiplication).

Independent even/odd vectors

A temporary data format such as that described above can be constructed rapidly by separating a packed vector of smaller units into 2 separate sets consisting of alternating elements from the data set, and zeros.
0xddccbbaa
separates into
0xdd00bb00 (even elements)
0x00cc00aa (odd elements),

while
0x11223344
separates into
0x11003300 (even elements)
0x00220044 (odd elements)

The two sets of even/odd elements can be operated on to form the even/odd components of the result, with additional carry information:
0xdd00bb00 + 0x11003300 = 0xee00ee00 (even sum)
0x00cc00aa + 0x00220044 = 0x00ee00ee (odd sum)


To merge these to form the result vector, the carry information is discarded and the two sets recombined.

All the necessary data manipulation can be performed with simple masking operations. We can start by defining a constant EVEN_BYTES which will extract the even elements of the vector:

 */ #define EVEN_BYTES 0xff00ff00
 /*

The even components can be extracted by taking the logical AND of a data value and this mask; the odd components by the AND of the data and the logical complement of the mask.

 */ #define GET_EVEN_BYTES(x) ((x) & EVEN_BYTES)
    #define GET_ODD_BYTES(x) ((x) &~ EVEN_BYTES)
 /*

The even and odd components of the sum of vectors A and B are therefore
GET_EVEN_BYTES(A) + GET_EVEN_BYTES(B)
and
GET_ODD_BYTES(A) + GET_ODD_BYTES(B)

The odd components of the even sum (and the even components of the odd sum) contain the carry output of each component's sum. To assemble the simple vector sum, this carry information can be discarded by masking, and the even/odd sets recombined with a logical OR.

 */ #define COMBINE_BYTES(even, odd) \
    (((even) & EVEN_BYTES) | ((odd) & ODD_BYTES))
 /*

The vector sum is therefore:

COMBINE_BYTES(GET_EVEN_BYTES(a) + GET_EVEN_BYTES(b),
              GET_ODD_BYTES(a) + GET_ODD_BYTES(b))

This can be optimised by masking only the even or odd inputs: to get the correct sum in each component, we only need to ensure that no carry is generated by the odd parts of the even input (or vice versa). This requires only that one of the terms be masked off:

 */
    #define ADD_BYTES(a, b) \
    COMBINE_BYTES(GET_EVEN_BYTES((a)) + (b), \
                  GET_ODD_BYTES((a)) + (b))
 /*
 

Execution Costs

Calculating this sum requires a minimum of 7 operations:
  • 2 AND operations (construct the even, odd components of a)
  • 2 ADD operations (even, odd sums)
  • 2 AND operations (mask carry results from the even, odd sums)
  • 1 OR operation (combine even, odd sums)
This neglects any operations necessary to set up the mask constants: it is assumed that these can be amortized. This doesn't, initially, compare too favourably with the 4 operations which would be necessary if the elements were handled purely as bytes. However, when the load and store operations necessary to move the data betwen memory and registers, this method becomes more favourable: 8 byte loads and and 4 byte stores compared with 2 word loads and 1 word store. Where byte load stores cost the same as their word equivalents (any conventional RISC), brings the operation count to 10 vs. 14.

The cost of this method is proportional only to the amount of data processed: for data elements smaller than 8 bits (though there are very limited applications for this!), or for machines with a a basic arithmetic unit of 64 bits, this method gets a linear increase in performance, ie. performance is O(machine_word_width*data_size) or O(machine_word_width * n_data_items * sizeof(element)).  Performing the operation on each unit separately gives an effective performance of O(n_data_items), or O(data_size/sizeof(element)).
 

Unsigned Multiplication

The same approach can be applied to unsigned multiplication. Multiplying each element of a vector by a constant can be achieved by separating the factors into their n-bit even/odd sets and multiplying the sets by a a simple scalar (of the same size as the elements of the array). The resulting even/odd sets consist of the 2n-bit product of the original element, and the scalar. This can be shifted to select an appropriate n-bit product before combining the sets back to the n-bit product vector.

 */ #define SCALAR_MUL_BYTES(v, s) \
    COMBINE_BYTES(GET_EVEN_BYTES((v))*(s), \
                  GET_ODD_BYTES((v))*(s))
 /*

This has the potential to be much faster where multiplication is more time-consuming than other operations: only two multiplications are required rather than four.

Calculating the convolution of two vectors is similar to addition, but with the twist that the product will be twice the size of the terms. To multiply a 32-bit vector of bytes, a multiply which makes all 64 product bits visible is necessary. With a 64-bit multiply and a 64-bit variant of the COMBINE_BYTES defined above, we could define the multiply:

#define VECTOR_MUL_BYTES(a, b) \
COMBINE_BYTES_64 (GET_EVEN_BYTES((a)) * GET_EVEN_BYTES((b)),
                  GET_ODD_BYTES((a)) * GET_ODD_BYTES((b)))
 
 

Instruction-Level Parallelism

If the compiler is capable of scheduling the code correctly, this even/odd SIMD method should be able to make good use of up to a 2-way superscalar for a single vector operation: the even and odd components of the final result can be computed entirely independently, and combined only at the last. The critical path for the 7 kernel operations is 4 deep (mask, add, mask, combine) with a sustained issue rate of 2 (other than the final OR operation). For the multiplication operations, where both components must be masked, the critical path is still 4 deep, but would require a 4-way superscalar machine (even/odd separation of each of the 2 components can be performed simultaneously).

While a byte-independent approach would theoretically allow each element to be processed at the same time on a 4-way superscalar machine, it's unlikely that this could be realised since the conventional balance of no more than 2 load units would be the limiting factor unless more parallelism could be exposed by the compiler (and thus exploited by the even/odd SIMD method too).
 

Conclusions

As always, your mileage may vary. Your profiler is your best friend. On the plus side, this method is entirely expressed in C and most compilers (even gcc) seem to be able to generate sensible code for it. This is just another useful trick to keep in the ol' tool chest.

*/