XScale is a high-speed
(700 MHz and upwards), full-custom
implementation of the ARM V5
instruction set, including the Thumb
compressed instruction set. It's an
evolution of the DEC StrongARM,
originally slated as 'StrongARM II'.
The micro-architecture is in-order, single-issue, and quite deeply
superpipelined. Cache accesses are each pipelined across 2 cycles
rather than the single-cycle access of the StrongARM and others,
giving a longer load-use penalty and a more painful worst-case branch
penalty. A branch target buffer helps with this (so long as branches
are predictable) and allows zero-cost branches in tight loops; a big
plus for DSP.
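To put a rough number on why the branch target buffer matters for tight loops, here is a minimal sketch of a toy BTB. Everything specific in it is an assumption made for illustration rather than something taken from the Intel summary: the 2-bit saturating counters, the 128-entry direct-mapped table, the 4-cycle misprediction penalty and the names `BranchTargetBuffer` and `loop_branch_penalty` are all hypothetical.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB with a 2-bit saturating counter per entry.

    Size and counter scheme are assumptions, not documented XScale behaviour.
    """

    def __init__(self, entries=128):
        self.entries = entries
        self.table = {}  # index -> (tag, counter)

    def predict_taken(self, pc):
        entry = self.table.get(pc % self.entries)
        # A miss in the BTB is treated as "predict not taken".
        return entry is not None and entry[0] == pc and entry[1] >= 2

    def update(self, pc, taken):
        index = pc % self.entries
        entry = self.table.get(index)
        if entry is None or entry[0] != pc:
            if taken:
                self.table[index] = (pc, 2)  # allocate as weakly taken
            return
        counter = entry[1]
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.table[index] = (pc, counter)


def loop_branch_penalty(iterations, mispredict_penalty=4, use_btb=True):
    """Total branch penalty cycles for one loop-closing branch.

    The branch is taken on every iteration except the last.
    Without a BTB, every taken branch pays the (assumed) penalty.
    """
    btb = BranchTargetBuffer()
    pc = 0x8000  # hypothetical address of the loop-closing branch
    mispredicts = 0
    for i in range(iterations):
        taken = i < iterations - 1
        predicted = btb.predict_taken(pc) if use_btb else False
        if predicted != taken:
            mispredicts += 1
        if use_btb:
            btb.update(pc, taken)
    return mispredicts * mispredict_penalty


print(loop_branch_penalty(100))                 # 8: first and last iteration only
print(loop_branch_penalty(100, use_btb=False))  # 396: every taken branch pays
```

Under these assumptions a 100-iteration loop pays for two mispredictions instead of 99, which is the sense in which the loop branch becomes close to free.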
Basic pipeline structure
From the textual descriptions in the Intel technical summary, the pipeline
structure looks something like the following:
Branch target buffer
Integer ALU                      MAC1
     ↓              ↓             ↓
State/buffer     Dcache 1        MAC2
     ↓              ↓             ↓
Writeback        Dcache 2        MAC3
After the common instruction fetch stages, the pipeline splits
into 3 separate pipes: an integer arithmetic pipe, a load/store pipe,
and a MAC pipe to implement the extension MAC
instructions and coprocessor
The load/store pipe allows for hit under miss operation while cache misses
are serviced by external memory.
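As a rough illustration of what hit-under-miss buys, the sketch below compares a blocking load/store pipe with one that keeps servicing hits while a single miss is outstanding. The 20-cycle miss latency, the one-outstanding-miss limit and the function name `load_pipe_cycles` are assumptions for the example, and data dependencies on the missing load are ignored.

```python
def load_pipe_cycles(accesses, miss_latency=20, hit_under_miss=True):
    """Cycle at which the last access in `accesses` completes.

    accesses: sequence of "hit" / "miss", issued in order, at most one
    per cycle. Only one miss may be outstanding at a time (assumed).
    """
    cycle = 0       # earliest cycle the next access can start
    miss_done = 0   # cycle at which the outstanding miss finishes
    last_done = 0
    for kind in accesses:
        if kind == "miss":
            # A second miss always waits for the first to finish.
            start = max(cycle, miss_done)
            miss_done = start + miss_latency
            done = miss_done
        else:
            # A blocking cache makes hits wait behind the miss too.
            start = cycle if hit_under_miss else max(cycle, miss_done)
            done = start + 1
        cycle = start + 1
        last_done = max(last_done, done)
    return last_done


pattern = ["miss", "hit", "hit", "hit"]
print(load_pipe_cycles(pattern, hit_under_miss=True))   # 20
print(load_pipe_cycles(pattern, hit_under_miss=False))  # 23
```

In this toy model the three hits complete in the shadow of the miss, so the total time is just the miss latency rather than the miss plus the queued hits.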
The pre-ALU shift, which has become a painful throwback in recent ARM
implementations, is subsumed into the register fetch pipe stage.
There shall now follow some educated guesswork/wild speculation about
The effect on instruction timing isn't mentioned by the Intel
technical summary, but it seems fairly safe to assume that the traditional
extra stall cycle for shift-by-register will still be in effect;
shifting by non-trivial constants may or may not require an extra cycle.
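The guess above can be written down as a tiny cost model. This is only a sketch of the speculation in the text, not documented timing: the function name `data_processing_cycles`, the flag `constant_shift_costs_extra` and the reading of "non-trivial" as "non-zero" are all assumptions.

```python
def data_processing_cycles(shift=None, constant_shift_costs_extra=False):
    """Guessed issue cycles for one ARM data-processing instruction.

    shift: None, ("imm", amount) or ("reg", register_name).
    constant_shift_costs_extra: the open question from the text, whether a
    non-trivial (here: non-zero) constant shift needs an extra cycle.
    """
    cycles = 1
    if shift is None:
        return cycles
    kind, operand = shift
    if kind == "reg":
        # Traditional extra stall cycle for shift-by-register.
        cycles += 1
    elif kind == "imm" and operand != 0 and constant_shift_costs_extra:
        cycles += 1
    return cycles


print(data_processing_cycles())                      # ADD r0, r1, r2
print(data_processing_cycles(("reg", "r3")))         # ADD r0, r1, r2, LSL r3
print(data_processing_cycles(("imm", 4)))            # ADD r0, r1, r2, LSL #4
print(data_processing_cycles(("imm", 4), True))      # same, pessimistic guess
```

Flipping the flag shows the two possible answers for constant shifts side by side; which one matches the hardware is exactly what the technical summary leaves open.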
The 'State/buffer' stage above is referred to by Intel as 'State Execute'
which is probably equivalent to the 'Buffer' stage in the StrongARM. This
sort of implies that the machine runs in-order up till the boundary
between State execute/Dcache 1/MAC2 and Writeback/Dcache 2/MAC3;
any memory faults will be determined by the end of Dcache 1, allowing
integer/MAC instructions to be safely aborted before writeback. Beyond
this, the mismatched length of the MAC and Dcache pipes implies result
completion is out-of-order.
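A small sketch makes the out-of-order completion concrete. Instructions are issued in order, one per cycle, and each finishes after a per-pipe latency. The latencies in `PIPE_LATENCY` (3, 4 and 5 cycles) and the name `completion_order` are hypothetical; only the idea that the pipes differ in length comes from the text.

```python
# Assumed cycles from issue to result for each pipe (illustrative only).
PIPE_LATENCY = {"int": 3, "load": 4, "mac": 5}


def completion_order(program):
    """Return instruction names in the order their results complete.

    program: list of (name, pipe) tuples, issued in order, one per cycle.
    Ties are broken by issue order.
    """
    finished = []
    for issue_cycle, (name, pipe) in enumerate(program):
        done_cycle = issue_cycle + PIPE_LATENCY[pipe]
        finished.append((done_cycle, issue_cycle, name))
    finished.sort()
    return [name for _, _, name in finished]


program = [("mul", "mac"), ("add", "int"), ("ldr", "load")]
print(completion_order(program))  # ['add', 'mul', 'ldr']
```

The `add` is issued after the `mul` but completes a cycle earlier, so any precise-exception guarantee has to be settled before the pipes diverge, as the text argues.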
The DC Writeback is presumably a second (set of) general-purpose register write port(s)
to match the differences in the lengths of integer, MAC and load/store pipes
without the excessive increase in the number of bypass points that would be
needed if the StrongARM-esque buffer-stage scheme were to be extended to
match the length of the MAC pipe.
Since the DC Writeback stage is shared between the MAC and load/store pipes,
this implies potential contention between these pipes when both want to
return an integer register result; this shouldn't be a performance problem
since the MAC pipe will return integer results comparatively infrequently,
but the arbitration logic may be complex. If not for hit-under-miss
cache operation, it would be possible to perform the arbitration by simply
stalling a load operation at issue if a MAR operation was issued in the
...and I shall stop right there, as this has already gotten to be too
dull and rambling for anyone other than another comp.arch geek to read.
Intel XScale Microarchitecture Technical Summary, http://developer.intel.com/design/intelxscale/ixm.htm