The Intel XScale micro-architecture is a high-speed (700 MHz and upwards), full-custom implementation of the ARMv5 instruction set, including the Thumb compressed instruction set. It's an evolution of the DEC StrongARM, originally slated as 'StrongARM II'.

The micro-architecture is in-order, single-issue, and quite deeply superpipelined. Each cache access is pipelined across 2 cycles rather than taking the single cycle of the StrongARM and others, which gives a longer load-use penalty and a more painful worst-case branch penalty. A branch target buffer helps with this (so long as branches are predictable) and allows zero-cost branches in tight loops, a big plus for DSP.

Basic pipeline structure

From the textual descriptions in the Intel technical summary, the pipeline structure looks something like the following:

Branch target buffer
    ↓
Icache 1
    ↓
Icache 2
    ↓
Register/shift
    +----------------------------+
    ↓                           ↓
Integer ALU                     MAC1
    +---------------+            |
    ↓              ↓            ↓
State/buffer    Dcache 1        MAC2
    ↓              ↓            ↓
Writeback       Dcache 2        MAC3
                    |            |
                    |            ↓
                    |           MAC4
                    +------------+
                    ↓
                DC Writeback

After the common instruction fetch/decode stages, the pipeline splits into three separate pipes: integer arithmetic, load/store, and a multiply-accumulate pipe that implements the extension MAC instructions and the coprocessor interface.

The load/store pipe allows for hit-under-miss operation: accesses that hit in the cache can proceed while an earlier cache miss is still being serviced by external memory.
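To see why hit-under-miss matters, here is a toy timing model comparing a blocking cache with one that lets hits proceed under a single outstanding miss. Everything in it is an assumption made for illustration: the function name `finish_cycle`, the one-cycle hit cost, the limit of one outstanding miss, and the miss latency are not taken from the Intel summary, and data dependencies on the missing load are ignored.

```python
def finish_cycle(accesses, miss_latency, hit_under_miss):
    """Cycle by which every access in the sequence has completed.

    accesses: list of "hit" / "miss" strings, in program order.
    Assumptions (illustrative only): a hit takes 1 cycle, a miss takes
    miss_latency cycles, and at most one miss can be outstanding.
    """
    next_start = 0   # earliest cycle the next access may start
    miss_done = 0    # cycle in which the outstanding miss completes
    finish = 0       # latest completion time seen so far
    for access in accesses:
        if access == "miss":
            # A new miss has to wait for any earlier miss to finish.
            start = max(next_start, miss_done)
            miss_done = start + miss_latency
            finish = max(finish, miss_done)
            # Blocking cache: nothing else starts until the miss is done.
            next_start = start + 1 if hit_under_miss else miss_done
        else:
            start = next_start
            finish = max(finish, start + 1)
            next_start = start + 1
    return finish


if __name__ == "__main__":
    seq = ["miss", "hit", "hit"]
    print("blocking:      ", finish_cycle(seq, 10, hit_under_miss=False))
    print("hit-under-miss:", finish_cycle(seq, 10, hit_under_miss=True))
```

In this model the two hits that follow the miss are hidden entirely behind the miss latency when hit-under-miss is enabled, whereas the blocking cache serialises them after it. A second miss gains nothing, since it has to wait for the first in either case.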

The pre-ALU shift, which has become a painful throwback in recent ARM implementations, is subsumed into the register fetch pipe stage.

There shall now follow some educated guesswork/wild speculation about implementation details.

The effect on instruction timing of folding the shift into the register fetch stage isn't mentioned in the Intel technical summary, but it seems fairly safe to assume that the traditional extra stall cycle for shift-by-register will still be in effect; shifting by non-trivial constants may or may not require an extra cycle.

The 'State/buffer' stage above is referred to by Intel as 'State Execute', which is probably equivalent to the 'Buffer' stage in the StrongARM. This sort of implies that the machine runs in-order up to the boundary between State Execute/Dcache 1/MAC2 and Writeback/Dcache 2/MAC3; any memory faults will be determined by the end of Dcache 1, allowing integer/MAC instructions to be safely aborted before writeback. Beyond this boundary, the mismatched lengths of the MAC and Dcache pipes imply that results complete out-of-order.
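The out-of-order completion falls straight out of the pipe lengths. Here is a minimal sketch that counts stages after Register/shift, as read off the diagram above (so it inherits all of that diagram's guesswork). The names `PIPE_STAGES`, `completion_cycles` and `completion_order` are made up for this illustration, and the model assumes one instruction issues per cycle with no stalls, cache misses or dependencies.

```python
# Stages after Register/shift, per pipe, taken from the diagram above.
# This is a guess at the structure, not Intel's published timing.
PIPE_STAGES = {
    "int": ["Integer ALU", "State/buffer", "Writeback"],
    "load": ["Integer ALU", "Dcache 1", "Dcache 2", "DC Writeback"],
    "mac": ["MAC1", "MAC2", "MAC3", "MAC4", "DC Writeback"],
}


def completion_cycles(program):
    """Return (issue_cycle, kind, writeback_cycle) for each instruction.

    An instruction issued in cycle t is assumed to occupy stage k of its
    pipe in cycle t + k, so it writes back in cycle t + (pipe length).
    """
    timed = []
    for issue_cycle, kind in enumerate(program):
        done = issue_cycle + len(PIPE_STAGES[kind])
        timed.append((issue_cycle, kind, done))
    return timed


def completion_order(program):
    """Instruction indices sorted by writeback cycle (ties keep issue order)."""
    timed = completion_cycles(program)
    return sorted(range(len(timed)), key=lambda i: (timed[i][2], i))


if __name__ == "__main__":
    prog = ["mac", "int", "load", "int"]
    for issue, kind, done in completion_cycles(prog):
        print(f"issued {issue}: {kind:4s} writes back in cycle {done}")
    print("completion order:", completion_order(prog))
```

For the sequence MAC, integer, load, integer, this model has the first integer instruction writing back a cycle before the MAC that was issued ahead of it, which is exactly the reordering described above.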

The DC Writeback is presumably a second (set of) GP register write port(s), there to absorb the differences in length between the integer, MAC and load/store pipes. The alternative, extending the StrongARM-esque buffer-stage scheme to match the length of the MAC pipe, would need an excessive increase in the number of bypass points.

Since the DC Writeback stage is shared between the MAC and load/store pipes, this implies potential contention between these pipes when both want to return an integer register result; this shouldn't be a performance problem since the MAC pipe will return integer results comparatively infrequently, but the arbitration logic may be complex. If not for hit-under-miss cache operation, it would be possible to perform the arbitration by simply stalling a load operation at issue if a MAR operation was issued in the previous cycle.
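The simple stall-at-issue rule can be checked against the pipe lengths in the diagram. The sketch below is a toy model under loud assumptions: every load hits in the cache (so no hit-under-miss reordering), every MAC-pipe instruction returns an integer register result, and the depths to DC Writeback (4 stages for a load, 5 for the MAC pipe) are read off the speculative diagram above. The function names are hypothetical.

```python
from collections import Counter

# Stages from issue to DC Writeback, per the diagram (an assumption):
# load: Integer ALU, Dcache 1, Dcache 2, DC Writeback
# mac:  MAC1, MAC2, MAC3, MAC4, DC Writeback
DEPTH_TO_DC_WRITEBACK = {"load": 4, "mac": 5}


def dc_writeback_conflicts(program, issue_cycles):
    """Cycles in which more than one instruction reaches DC Writeback."""
    arrivals = Counter(
        issue + DEPTH_TO_DC_WRITEBACK[kind]
        for kind, issue in zip(program, issue_cycles)
        if kind in DEPTH_TO_DC_WRITEBACK  # integer ops use the other port
    )
    return sorted(cycle for cycle, count in arrivals.items() if count > 1)


def schedule_with_stall(program):
    """In-order, single-issue schedule that delays a load by one cycle
    whenever a MAC-pipe instruction was issued in the previous cycle."""
    issue_cycles = []
    cycle = 0
    last_mac_issue = None
    for kind in program:
        if kind == "load" and last_mac_issue is not None \
                and cycle - last_mac_issue == 1:
            cycle += 1  # stall the load at issue for one cycle
        if kind == "mac":
            last_mac_issue = cycle
        issue_cycles.append(cycle)
        cycle += 1
    return issue_cycles


if __name__ == "__main__":
    prog = ["mac", "load", "int", "mac", "load"]
    naive = list(range(len(prog)))
    print("naive conflicts:  ", dc_writeback_conflicts(prog, naive))
    stalled = schedule_with_stall(prog)
    print("stalled schedule: ", stalled)
    print("stalled conflicts:", dc_writeback_conflicts(prog, stalled))
```

With these depths, a load issued one cycle after a MAC-pipe instruction arrives at DC Writeback in the same cycle, and the one-cycle stall is enough to separate them. Once loads can miss and complete late, the arrival cycle is no longer known at issue, which is why the rule stops being sufficient with hit-under-miss.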

...and I shall stop right there, as this has already gotten to be too dull and rambling for anyone other than another comp.arch geek to read.

References:

Intel Xscale Microarchitecture Technical Summary, http://developer.intel.com/design/intelxscale/ixm.htm
