SIMD - Everything2.com

The general principle of 'Single Instruction, Multiple Data' more or less sums up the idea behind SIMD: the processor executes a single instruction stream (one program), but each instruction operates on multiple data items concurrently.

The most critical part of conventional SIMD architectures is data organisation. 'Multiple Data' could be interpreted many ways, but implementations universally use this to refer to treating a single large unit as a collection of smaller units. SIMD machines generally have large registers, of 64, 128 or 256 bits. These form the basic, large unit of data. This is simply raw binary data which will be interpreted by instructions as a binary number, or an array of smaller numbers.

This is most easily thought of in C as a union of arrays:

#include <stdint.h>

union SIMD64 {
  /* Assume a little endian memory.
     Fixed-width types are used because 'long' is not
     32 bits on every platform. */

  /* 8-bit bytes */
  uint8_t b [8];
  /* +-------------------------------+
     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
     +-------------------------------+ */

  /* 16-bit half-words */
  uint16_t h [4];
  /* +-------------------------------+
     |     0 |     1 |     2 |     3 |
     +-------------------------------+ */

  /* 32-bit words */
  uint32_t w [2];
  /* +-------------------------------+
     |             0 |             1 |
     +-------------------------------+ */

  /* 64-bit double words */
  uint64_t d [1];
  /* +-------------------------------+
     |                             0 |
     +-------------------------------+ */

};

Since the natural use of SIMD is to operate on arrays of similar data types, data can be loaded from memory into the register in a large chunk equivalent to the register size, regardless of the size of the individual operand units.
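As a rough illustration of that point, the sketch below models a 64-bit register as a uint64_t and fills it with a single 8-byte copy. The helper name simd64_load is hypothetical and not part of any real instruction set; the point is only that the same load works whether the source array holds bytes, half-words or words.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill one 64-bit "register" from memory.
   The copy is always 8 bytes wide; how those bytes are later
   split into elements is decided by the instruction that uses them. */
static uint64_t simd64_load(const void *src)
{
    uint64_t reg;
    assert(src != NULL);
    memcpy(&reg, src, sizeof reg);
    return reg;
}

/* Hypothetical helper: write the register back to memory unchanged. */
static void simd64_store(void *dst, uint64_t reg)
{
    assert(dst != NULL);
    memcpy(dst, &reg, sizeof reg);
}
```

Calling simd64_load on a uint8_t[8] and on a uint16_t[4] uses exactly the same code path; only the later interpretation of the 64 bits differs.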

Typical SIMD operations operate between two sets of data such as the above, treating each element of the array as independent data, effectively similar to simple vector addition and subtraction.

For example, a SIMD 'add bytes' instruction using the above data type would calculate the vector sum Z of A and B as:

A = { A.b[0], A.b[1], A.b[2], A.b[3], A.b[4], A.b[5], A.b[6], A.b[7] }
B = { B.b[0], B.b[1], B.b[2], B.b[3], B.b[4], B.b[5], B.b[6], B.b[7] }
Z = { A.b[0]+B.b[0], A.b[1]+B.b[1], A.b[2]+B.b[2], A.b[3]+B.b[3],
      A.b[4]+B.b[4], A.b[5]+B.b[5], A.b[6]+B.b[6], A.b[7]+B.b[7] }
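The same calculation can be written as a scalar model in C. This is a sketch of what an 'add bytes' instruction computes, not of how the hardware computes it: the loop stands in for eight additions that a real SIMD unit would perform in one step. The type simd64_bytes and the function add_bytes are illustrative names, and wrap-around (modulo 256) arithmetic is an assumption; some instruction sets also offer saturating variants.

```c
#include <stdint.h>

/* Illustrative type: the byte view of one 64-bit SIMD unit. */
typedef struct { uint8_t b[8]; } simd64_bytes;

/* Scalar model of 'add bytes': Z.b[i] = A.b[i] + B.b[i] for each i. */
static simd64_bytes add_bytes(simd64_bytes a, simd64_bytes b)
{
    simd64_bytes z;
    for (int i = 0; i < 8; i++) {
        /* Each element wraps modulo 256 on its own;
           no carry passes from element i into element i+1. */
        z.b[i] = (uint8_t)(a.b[i] + b.b[i]);
    }
    return z;
}
```

Note that an overflow in one element (for instance 255 + 1) leaves its neighbours untouched, which is exactly what distinguishes this from a single 64-bit addition.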

Similar instructions will exist for the equivalent operation on different data lengths, and for other operations. It's important to note that SIMD operations are not the same as the operations of vector algebra. The simple SIMD product of two data items is the vector of products of corresponding elements, and not the dot product or the vector (cross) product!

Since SIMD operations rely heavily on data being available in the correct format, packing instructions are an important part of SIMD instruction sets. They convert between larger and smaller data formats by truncating or extending elements, so that, for example, an array of bytes can be operated on as 16-bit half-words for extended precision during a calculation.

The array of bytes:

+-------------------------------+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-------------------------------+

would be converted (by sign extension or zero extension) to two data units, each holding four 16-bit half-words:

+-------------------------------+
|     0 |     1 |     2 |     3 |
+-------------------------------+

+-------------------------------+
|     4 |     5 |     6 |     7 |
+-------------------------------+

This is merely an exercise in bit shuffling, which is trivially performed in hardware.
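The conversion pictured above can be sketched in C as follows. The type and function names are hypothetical, and the 'half' parameter (0 for bytes 0 to 3, 1 for bytes 4 to 7) is an assumption made for the example; real instruction sets name and split these operations in their own ways.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative types: byte and half-word views of a 64-bit unit. */
typedef struct { uint8_t  b[8]; } simd64_bytes;
typedef struct { uint16_t h[4]; } simd64_halves;

/* Zero extension: the new upper 8 bits of each half-word are 0. */
static simd64_halves unpack_zero_extend(simd64_bytes v, int half)
{
    simd64_halves out;
    assert(half == 0 || half == 1);
    for (int i = 0; i < 4; i++)
        out.h[i] = v.b[4 * half + i];
    return out;
}

/* Sign extension: the new upper 8 bits copy the byte's top bit,
   so a two's-complement byte keeps its value at 16 bits. */
static simd64_halves unpack_sign_extend(simd64_bytes v, int half)
{
    simd64_halves out;
    assert(half == 0 || half == 1);
    for (int i = 0; i < 4; i++) {
        uint16_t x = v.b[4 * half + i];
        if (x & 0x80)
            x |= 0xFF00;
        out.h[i] = x;
    }
    return out;
}
```

For the byte 0xFF, zero extension yields 0x00FF (255) while sign extension yields 0xFFFF (-1 as a signed half-word), which is why both forms are needed.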

Many SIMD operations such as addition do not need truly independent execution units for each data quantity. Naively, the set of addition operations implied for a 64-bit SIMD machine requires 8 byte-adders, 4 16-bit adders, 2 32-bit adders and a single 64-bit adder, since these are logically distinct operations. However, since the operations are very similar (the only difference between a 64-bit addition and any other is in the way the carry bits are propagated) they can be performed by a single unit which can calculate the SIMD values using part of the logic for the large values, giving a unit that's only slightly larger than the largest required.
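A software analogue makes the role of the carry bits concrete. The sketch below adds eight packed bytes using one ordinary 64-bit addition, with the carry chain broken at each byte boundary by masking. It is not a description of any particular hardware adder, which would gate the carries directly; the function name swar_add_bytes is made up for the example.

```c
#include <stdint.h>

/* Add eight packed bytes with a single 64-bit addition. */
static uint64_t swar_add_bytes(uint64_t a, uint64_t b)
{
    /* The top bit of each of the eight byte lanes. */
    const uint64_t high = 0x8080808080808080ULL;

    /* Add only the low 7 bits of each lane. The largest lane sum is
       0x7F + 0x7F = 0xFE, so no carry can leave its lane. Bit 7 of
       each lane now holds the carry into that lane's top bit. */
    uint64_t low_sum = (a & ~high) + (b & ~high);

    /* Top bit of each lane = a7 XOR b7 XOR carry-in; the carry out
       of the lane is simply discarded. */
    return low_sum ^ ((a ^ b) & high);
}
```

Changing the mask to 0x8000800080008000 or 0x8000000080000000 gives the 16-bit and 32-bit variants from the same 64-bit adder, which mirrors the argument that only the carry propagation differs between element sizes.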
