The name 'Single Instruction, Multiple Data' more or less sums up the idea behind SIMD: the processor executes a single instruction stream, one program, whose instructions operate on multiple data items concurrently.

The most critical part of conventional SIMD architectures is data organisation. 'Multiple Data' could be interpreted many ways, but implementations universally take it to mean treating a single large unit of data as a collection of smaller units. SIMD machines generally have large registers, of 64, 128 or 256 bits. These form the basic, large unit of data: raw binary data which instructions interpret either as a single binary number or as an array of smaller numbers.

This is most easily thought of in C as a union of arrays:

union SIMD64 {
    /* Assume a little-endian memory layout. */

    /* 8-bit bytes */
    char b[8];
    /* +-------------------------------+
       | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
       +-------------------------------+ */

    /* 16-bit half-words */
    short h[4];
    /* +-------------------------------+
       |   0   |   1   |   2   |   3   |
       +-------------------------------+ */

    /* 32-bit words (int, rather than long, is 32 bits on common 64-bit ABIs) */
    int w[2];
    /* +-------------------------------+
       |       0       |       1       |
       +-------------------------------+ */

    /* 64-bit double words */
    long long d[1];
    /* +-------------------------------+
       |               0               |
       +-------------------------------+ */
};

Since the natural use of SIMD is to operate on arrays of similar data types, data can be loaded from memory into the register in a large chunk equivalent to the register size, regardless of the size of the individual operand units.

Typical SIMD operations take two operands organised as above, treating each element of the array as independent data, in much the same way as simple element-wise vector addition and subtraction.

For example, a SIMD 'add bytes' instruction using the above data type would calculate the vector
sum Z of A and B as:

A = { A.b[0], A.b[1], A.b[2], A.b[3], A.b[4], A.b[5], A.b[6], A.b[7] }
B = { B.b[0], B.b[1], B.b[2], B.b[3], B.b[4], B.b[5], B.b[6], B.b[7] }
Z = { A.b[0]+B.b[0], A.b[1]+B.b[1], A.b[2]+B.b[2], A.b[3]+B.b[3],
A.b[4]+B.b[4], A.b[5]+B.b[5], A.b[6]+B.b[6], A.b[7]+B.b[7] }

Similar instructions will exist for the equivalent operation on different data lengths, and for
other operations. It's important to note that SIMD operations are not simply vector operations.
The simple SIMD product of two data items is the vector of products of corresponding elements, and
not the vector product!

Since SIMD operations rely heavily on data being available in the correct format, packing instructions are an important part of SIMD instruction sets. They convert between larger and smaller data formats by truncating or extending elements, so that, for example, an array of bytes can be operated on as 16-bit half-words for extended precision during a calculation.

The array of bytes:

+-------------------------------+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-------------------------------+

would be converted (by sign extension or zero extension) to two data units of 16-bit half-words.

+-------------------------------+
|   0   |   1   |   2   |   3   |
+-------------------------------+

+-------------------------------+
|   4   |   5   |   6   |   7   |
+-------------------------------+

This is merely an exercise in bit shuffling, which is trivially performed in hardware.

Many SIMD operations, such as addition, do not need truly independent execution units for each data
quantity. Naively, the set of addition operations implied for a 64-bit SIMD machine requires eight
8-bit adders, four 16-bit adders, two 32-bit adders and a single 64-bit adder, since these are logically distinct operations. However, since the operations
are very similar (the only difference between a 64-bit addition and any other is in the way the
carry bits are propagated between the lanes), they can all be performed by a single unit which calculates
the SIMD values using part of the logic for the large values, giving a unit that is only slightly
larger than the largest required.