/*
While SIMD is primarily thought of as a hardware concept, the same
techniques used by common SIMD implementations can be applied purely in
software for increased performance in critical code, particularly on a
system which has no hardware support for SIMD. As always, YMMV. In fact,
your mileage will almost certainly vary wildly.
Hardware SIMD implementations generally work on a constant-sized block
of data. Integer SIMD hardware like MMX, AltiVec, MDMX, MAX,
MVI and VIS all treat a block of data (typically 64 or 128 bits) as
a series of smaller-sized units. The data units are treated as vectors
of smaller-sized quantities: 128 bits of data can be a vector of 16 8-bit
bytes, 8 16-bit halfwords, 4 32-bit words or 2 64-bit long longs. Typical
SIMD operations operate on corresponding elements of vectors; a good
example is a byte addition operation. As an example, two 32-bit vectors
of 4 8-bit bytes can be added together like this:
{ 0xdd 0xcc 0xbb 0xaa } +
{ 0x11 0x22 0x33 0x44 }
=
{ 0xee 0xee 0xee 0xee }
Each component of the result is the result of applying the operation
to the corresponding elements of the two parameters. Conventionally, in
C or C++, this would be written similarly to:
unsigned char *a, *b, *result;
for (int i = 0; i < 4; i++) result[i] = a[i] + b[i];
Terminology
First things first.
Everybody and his uncle seems to use different terminology to describe
different sizes of integer datum. So here's what I happen to call 'em:

8 bits: "byte". No pussyfooting around with this octet nonsense.

16 bits: "halfword". The only true definition of the size of a 'word'
is entirely dependent on the native machine word size of the target architecture.
Which is fine if you know what your architecture is. I'm going to assume
that we're dealing with a 32-bit machine because, frankly, it's what
I were brung up on.

32 bits: "word". See above.

64 bits: "long" or "long long", because the latter happens to be the
C99 (or is that still just gcc?) definition.
A conventional 32-bit RISC has only a 32-bit ALU. When performing
addition or subtraction on smaller quantities, such as bytes, the higher
24 bits of the ALU go unused. In addition, data elements also have to be
moved to and from memory separately, or moved as a single unit then
rearranged in a CPU register. An addition of two 32-bit quantities is
almost the same function as a vector addition of two vectors of 4 8-bit
quantities:
0xddccbbaa +
0x11223344
=
0xeeeeeeee
The difference between the two functions is in the carry propagation
behaviour. In a plain 32-bit addition, if the sum of the lower 8 bits of
the arguments is larger than 255, bit 8 (counting from 0) of the result
is set. In a SIMD operation, this would correspond to the lowest bit of
the next 8-bit quantity. In order to compute a smaller unit vector sum
from a larger unit's addition, we need to ensure that the operation
performed on each unit within the 32-bit data will not generate a carry
output which may be incorporated into the result of the next unit. At
least two techniques can be employed to achieve this.
Field separation to avoid carry
If a 'dead' zone can be allocated within the primary data format, that
can be zeroed in the two arguments before addition: the corresponding
'dead' zone in the result data will hold the carry result from the unit
below it, and will generate no carry output of its own.
For regular-sized data elements (e.g. bytes within a 32-bit word),
arranging the data into this format will almost certainly eliminate any
performance gain made by performing the additions as a single operation.
In either case this approach only works for addition and subtraction
operations (not multiplication).
Independent even/odd vectors
A temporary data format such as that described above can be constructed
rapidly by separating a packed vector of smaller units into 2 separate
sets consisting of alternating elements from the data set, and zeros.
0xddccbbaa
separates into
0xdd00bb00 (even elements)
0x00cc00aa (odd elements),
while
0x11223344
separates into
0x11003300 (even elements)
0x00220044 (odd elements)
The two sets of even/odd elements can be operated on to form the even/odd
components of the result, with additional carry information:
0xdd00bb00 + 0x11003300 = 0xee00ee00 (even sum)
0x00cc00aa + 0x00220044 = 0x00ee00ee (odd sum)
To merge these to form the result vector, the carry information
is discarded and the two sets recombined.
All the necessary data manipulation can be performed with simple masking
operations. We can start by defining a constant EVEN_BYTES which
will extract the even elements of the vector:
*/ #define EVEN_BYTES 0xff00ff00
/*
The even components can be extracted by taking the logical AND of
a data value and this mask; the odd components by the AND of the data and
the logical complement of the mask.
*/ #define GET_EVEN_BYTES(x) ((x) & EVEN_BYTES)
#define GET_ODD_BYTES(x) ((x) & ~EVEN_BYTES)
/*
The even and odd components of the sum of vectors A and B are therefore
GET_EVEN_BYTES(A) + GET_EVEN_BYTES(B)
and
GET_ODD_BYTES(A) + GET_ODD_BYTES(B)
The odd components of the even sum (and the even components of the odd
sum) contain the carry output of each component's sum. To assemble the
simple vector sum, this carry information can be discarded by masking,
and the even/odd sets recombined with a logical OR.
*/ #define COMBINE_BYTES(even, odd) \
(((even) & EVEN_BYTES) | ((odd) & ~EVEN_BYTES))
/*
The vector sum is therefore:
COMBINE_BYTES(GET_EVEN_BYTES(a) + GET_EVEN_BYTES(b),
GET_ODD_BYTES(a) + GET_ODD_BYTES(b))
This can be optimised by masking only the even or odd inputs: to get
the correct sum in each component, we only need to ensure that no carry
is generated by the odd parts of the even input (or vice versa). This requires
only that one of the terms be masked off:
*/
#define ADD_BYTES(a, b) \
COMBINE_BYTES(GET_EVEN_BYTES((a)) + (b), \
GET_ODD_BYTES((a)) + (b))
/*
Execution Costs
Calculating this sum requires a minimum of 7 operations:

2 AND operations (construct the even, odd components of a)

2 ADD operations (even, odd sums)

2 AND operations (mask carry results from the even, odd sums)

1 OR operation (combine even, odd sums)
This neglects any operations necessary to set up the mask constants: it
is assumed that these can be amortized. This doesn't, initially, compare
too favourably with the 4 operations which would be necessary if the
elements were handled purely as bytes. However, when the load and store
operations necessary to move the data between memory and registers are
counted, this method becomes more favourable: 8 byte loads and 4 byte
stores compared with 2 word loads and 1 word store. Where byte loads and
stores cost the same as their word equivalents (any conventional RISC),
this brings the operation count to 10 vs. 16.
The cost of this method is proportional only to the amount of data processed:
for data elements smaller than 8 bits (though there are very limited applications
for this!), or for machines with a basic arithmetic unit of 64 bits,
this method gains a linear increase in performance, i.e. its cost is
O(data_size / machine_word_width), or
O(n_data_items * sizeof(element) / machine_word_width).
Performing the operation on each unit separately gives a cost
of O(n_data_items), or O(data_size / sizeof(element)).
The same approach can be applied to unsigned multiplication. Multiplying
each element of a vector by a constant can be achieved by separating
the factors into their n-bit even/odd sets and multiplying the
sets by a simple scalar (of the same size as the elements of the array).
The resulting even/odd sets consist of the 2n-bit products of the
original elements and the scalar. These can be shifted to select an
appropriate n-bit product before combining the sets back into the
n-bit product vector.
*/ #define SCALAR_MUL_BYTES(v, s) \
COMBINE_BYTES(GET_EVEN_BYTES((v))*(s), \
GET_ODD_BYTES((v))*(s))
/*
This has the potential to be much faster where multiplication is more
time-consuming than other operations: only two multiplications are required
rather than four.
Calculating the convolution of two vectors is similar to addition,
but with the twist that the product will be twice the size of the terms.
To multiply a 32-bit vector of bytes, a multiply which makes all 64 product
bits visible is necessary. With a 64-bit multiply and a 64-bit variant
of the COMBINE_BYTES defined above, we could define the multiply:
#define VECTOR_MUL_BYTES(a, b) \
COMBINE_BYTES_64(GET_EVEN_BYTES((a)) * GET_EVEN_BYTES((b)), \
GET_ODD_BYTES((a)) * GET_ODD_BYTES((b)))
If the compiler is capable of scheduling the code correctly, this even/odd
SIMD method should be able to make good use of up to a 2-way superscalar
machine for a single vector operation: the even and odd components of the
final result can be computed entirely independently, and combined only at
the last step. The critical path for the 7 kernel operations is 4 deep
(mask, add, mask, combine) with a sustained issue rate of 2 (other than
the final OR operation). For the multiplication operations, where both
components must be masked, the critical path is still 4 deep, but would
require a 4-way superscalar machine (even/odd separation of each of the
2 components can be performed simultaneously).
While a byte-independent approach would theoretically allow each element
to be processed at the same time on a 4-way superscalar machine, it's
unlikely that this could be realised, since the conventional balance of
no more than 2 load units would be the limiting factor unless more
parallelism could be exposed by the compiler (and thus exploited by the
even/odd SIMD method too).
Conclusions
As always, your mileage may vary. Your profiler is your best friend.
On the plus side, this method is entirely expressed in C and most compilers
(even gcc) seem to be able to generate sensible code for it. This is just
another useful trick to keep in the ol' tool chest.
*/