A low-power, high-performance RISC processor from ARM Limited.

The much-anticipated successor to the ARM9 was formally announced in October 1998, and first put to silicon by Lucent Technologies in April 2000. ARM10's purpose was to again double the performance of its predecessor on the same fabrication process, while allowing for further improvements with smaller processes.

As with the move from ARM7 to ARM8, effective performance was enhanced using two methods: altering the pipeline to enable a higher clock speed, and reducing the number of cycles taken to execute each instruction. Since the ARM9's pipeline was already almost optimal, and its CPI very low, some relatively complicated techniques were required if the target was to be met. A fine balance between core complexity and power consumption had to be struck for ARM's reputation as a low-power architecture to be upheld.

Pipeline Optimisation

Maximum clock speed is always determined by the slowest pipeline stage. One common approach to increasing clock speed, favoured most famously by Intel, is superpipelining. Intel's XScale modifications to the StrongARM processor used this technique, dividing each pipeline stage into two sections. This allows dramatically higher clock speeds, but has the obvious disadvantage that a pipeline flush is disastrous to performance. Therefore, complicated branch prediction techniques are required to keep CPI at a reasonable level.
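The relationship between stage delay and clock speed can be sketched in a few lines of Python. This is a minimal illustration only: the stage delays are made-up numbers, not figures for any ARM or Intel core, and the split is idealised (each stage divides into two equal halves with no latch overhead).

```python
# Hypothetical stage delays in nanoseconds; illustrative only, not real ARM figures.
FIVE_STAGE_NS = [2.0, 2.5, 4.0, 2.5, 2.0]


def max_clock_mhz(stage_delays_ns):
    """The clock period must cover the slowest stage, so f_max = 1 / max(delay)."""
    return 1000.0 / max(stage_delays_ns)


def superpipeline(stage_delays_ns):
    """Split every stage into two equal halves (assumes zero latch overhead)."""
    halves = []
    for delay in stage_delays_ns:
        halves.extend([delay / 2, delay / 2])
    return halves


base_mhz = max_clock_mhz(FIVE_STAGE_NS)                 # limited by the 4.0 ns stage
deep_mhz = max_clock_mhz(superpipeline(FIVE_STAGE_NS))  # limited by a 2.0 ns half-stage
flush_depth = len(superpipeline(FIVE_STAGE_NS))         # stages lost on a full flush

print(base_mhz, deep_mhz, flush_depth)
```

Under these assumptions the clock doubles, but so does the number of stages that must be refilled after a mispredicted branch, which is the trade-off described above.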

The ARM10 team's initial approach was to simply optimise each stage as much as possible to allow better clock scaling than the ARM9. For example, the ARM's fast multiplication unit uses the ALU to add its partial product terms: this is efficient in terms of gates, but stalls the pipeline at the Execute stage. ARM10 has a dedicated adder for multiplication, which runs in the Memory stage, speeding up multiplication throughput.

Another optimisation used was to add specific address calculation hardware, and provide memory addresses to the Fetch and Memory stages half a cycle early. This effectively permits these stages to operate for one-and-a-half cycles, allowing them more time to access the relatively slow memory without limiting the maximum clock speed.

Unfortunately, the Decode stage proved to be too difficult to optimise: ARM's instruction set is far less regular than many RISC architectures, and decoding instructions is a complex task. Adjustments to the pipeline were then made to split the Decode stage into Issue and Decode, where the former effectively partially decodes the instruction, to allow the next stage to perform register read operations in parallel with the rest of the decode.

6-stage Pipeline

Therefore, ARM10's pipeline has six stages, as shown below: Fetch, Issue, Decode, Execute, Memory, and Write. The first stage includes the static branch prediction unit, and the instruction fetcher. The next stage translates Thumb instructions into their appropriate ARM counterpart, issues coprocessor instructions where appropriate, and begins to decode the instruction. Next, the registers are read in parallel with the rest of the instruction decoding.

The fourth stage is the most complicated: the address calculation unit determines the address of a branch or memory access, while the hardware multiplier creates partial products, and the ALU and barrel shifter perform arithmetic operations. In the fifth stage the multiplication operation is completed, or data memory is read (instructions never use both multiplication and data memory, so there is no conflict here). Finally, the ALU, multiply, or data memory stages write-back to the register file as appropriate.

 ---------- ---------- ---------- ---------- ---------- ---------- 
|          |  ARM &   |          | AddrCalc | Multiply |          |
|  Branch  |  Thumb   | Register |----------|  Adder   |          |
| Predict  |  Decode  |   Read   | Multiply |__________| Register |
|----------|----------|----------|----------|          |  Write   |
|          | Co-proc  |  Final   |   ALU/   |  Memory  |          |
|  Fetch   |  Issue   |  Decode  |  Shift   |  Access  |          |
 ---------- ---------- ---------- ---------- ---------- ---------- 
   Fetch      Issue      Decode    Execute     Memory     Write    

The new 6-stage pipeline is better balanced than ARM9's, and scales much further. When branch prediction is taken into account, the average CPI is comparable to its predecessor, and so a 50% clock speed improvement on the same fabrication process gives around half the required performance improvement.
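The arithmetic behind this claim can be written out as a rough sketch. It uses the usual first-order model in which performance is proportional to clock speed divided by CPI; the function name is hypothetical, and the figures are the ones quoted in the text.

```python
def relative_performance(clock_ratio, cpi_ratio):
    """First-order model: performance scales with clock speed / CPI."""
    return clock_ratio / cpi_ratio


# 50% faster clock, CPI roughly unchanged from ARM9 (as stated in the text).
pipeline_gain = relative_performance(clock_ratio=1.5, cpi_ratio=1.0)

# The design target was twice the performance of ARM9.
target_gain = 2.0

# Factor still to be found from improvements to the core itself.
remaining_gain = target_gain / pipeline_gain

print(pipeline_gain, round(remaining_gain, 3))
```

The pipeline changes therefore supply a 1.5x gain, leaving roughly a further 1.33x to come from reducing CPI.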

Core Improvements

Branch Prediction

Two concepts first introduced in the ill-fated ARM8 make a reappearance in the ARM10. A 64-bit wide split instruction and data memory is used to allow multiple instructions to be fetched in one clock cycle. This is exploited using the first two pipeline stages, which perform static branch prediction and instruction issue. As in the ARM8, backward branches are predicted "taken", and forward branches are predicted "not taken", for the two normal branch cases of loops and function calls respectively.
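The static prediction rule is simple enough to sketch directly. The following Python is an illustration of the rule as described, not of ARM's hardware; the addresses and the trace format are hypothetical.

```python
def predict_taken(branch_pc, target_pc):
    """Static rule: backward branches are predicted taken, forward not taken."""
    return target_pc < branch_pc


def prediction_accuracy(trace):
    """trace: sequence of (branch_pc, target_pc, actually_taken) tuples."""
    trace = list(trace)
    hits = sum(predict_taken(pc, target) == taken for pc, target, taken in trace)
    return hits / len(trace)


# Hypothetical loop-closing branch at 0x8010 jumping back to 0x8000:
# taken on nine iterations, then falls through once when the loop exits.
loop_trace = [(0x8010, 0x8000, True)] * 9 + [(0x8010, 0x8000, False)]
accuracy = prediction_accuracy(loop_trace)

print(accuracy)
```

For a loop like this the rule mispredicts only the final iteration, which is why such a cheap scheme works well for loop-heavy code.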

Memory Access

In addition to the increased instruction issuing capability, the wider memory bus allows for faster load and store operations. ARM10 allows two register transfers to or from memory in one cycle, which vastly improves performance of the commonly-used Load Multiple and Store Multiple instructions.

Further to this, the newly introduced instruction issue pipeline stage allows for non-blocking memory access. For example, consider the following code, assuming ideal memory conditions:

LDMIA   r0!, {r1-r4} ; load four words from address r0 into r1-r4
SUBS    r5, r6, r7   ; r5 = r6 - r7, and set flags
SUBMI   r5, r7, r6   ; if r5 < 0, r5 = r7 - r6
CMP     r5, r8       ; compare r5 and r8
MOVEQ   r6, r4       ; if r5 == r8, r6 = r4

All ARM cores before ARM10 would stall the pipeline while executing the LDMIA instruction, causing the code segment to take nine cycles in total -- five for loading four words of memory, and one for each following instruction. Since the three instructions after the LDMIA do not depend on the loaded registers, the ARM10 allows them to execute while it reads from memory. This gives a total execution time of five cycles -- one cycle per instruction -- so CPI falls from 1.8 to 1.0, an 80% increase in throughput over previous ARM cores. Also note that, since two registers are fetched from memory every cycle, r4 has already been loaded by the time the MOVEQ instruction needs it as a source operand, only four instructions later.
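The cycle counts for this example can be reproduced with a small model. This is a sketch under stated assumptions, not a description of the real hardware: the blocking cost (one cycle plus one per word) is taken from the figures in the text, while the non-blocking transfer time (one cycle plus one per pair of registers) is an assumption made for illustration, as are the function names.

```python
def blocking_cycles(ldm_words, following_instructions):
    """Earlier cores: the LDM stalls the pipeline for (words + 1) cycles."""
    return (ldm_words + 1) + following_instructions


def nonblocking_cycles(ldm_words, following_instructions):
    """ARM10-style model, assuming no following instruction needs loaded data early.

    Assumption: the LDM issues in one cycle and then transfers two registers
    per cycle in the background while later instructions execute.
    """
    issue_time = 1 + following_instructions
    transfer_time = 1 + (ldm_words + 1) // 2  # ceil(words / 2) background cycles
    return max(issue_time, transfer_time)


instructions = 5  # LDMIA plus the four instructions after it
old_cycles = blocking_cycles(ldm_words=4, following_instructions=4)
new_cycles = nonblocking_cycles(ldm_words=4, following_instructions=4)

old_cpi = old_cycles / instructions
new_cpi = new_cycles / instructions
speedup = old_cycles / new_cycles

print(old_cycles, new_cycles, old_cpi, new_cpi, speedup)
```

In this model the transfer completes after three cycles, well before the MOVEQ issues in the fifth, so the load is entirely hidden behind the other instructions.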

Multiplication

A new fast 16x32 hardware multiplier, combined with the two-stage multiplication pipeline, allows ARM10 to complete a 32-bit multiply-accumulate operation every clock cycle. This is a huge increase in performance from previous cores, which even with the TDMI extensions required between 3 and 5 cycles for a multiplication.

New Features

ARM10 is the first ARM core to support architecture version 5T. This is a superset of version 4T, adding BLX (branch-with-link and toggle Thumb/ARM mode), CLZ (count leading zeroes, useful for DSP operations), and BKPT (software breakpoint). Production ARM10 processors actually support v5TE, which adds signal processing (saturate-on-overflow) instructions. Somewhat related to this is ARM10's support for an on-chip vector floating point coprocessor, the VFP10.
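To make the behaviour of two of these additions concrete, here is a minimal Python model of a 32-bit count-leading-zeroes and a signed saturating add of the kind provided by the v5TE signal processing instructions. The function names are hypothetical, and this models the arithmetic only, not flags or instruction encodings.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def clz32(value):
    """Count leading zero bits of a 32-bit word; returns 32 for zero."""
    value &= 0xFFFFFFFF
    return 32 - value.bit_length()


def saturating_add32(a, b):
    """Signed 32-bit add that clamps on overflow instead of wrapping."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


print(clz32(1), clz32(0x80000000), clz32(0))
print(saturating_add32(INT32_MAX, 1), saturating_add32(INT32_MIN, -1))
```

Counting leading zeroes is the core step in normalising a fixed-point value, and saturation avoids the wrap-around artefacts that plain two's complement addition produces in audio and similar signal data.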

Performance Implications

The goal of doubling the performance of ARM9 on the same process was effectively reached: at release, ARM10 achieved around 375 Dhrystone 2.1 MIPS at 300MHz, when fabricated on a 0.25µm process. The ARM1020E was later fabricated on 0.13µm, running at up to 400MHz and achieving 500 MIPS. Astonishingly, it does this at 1.1V and only 240mW, giving a MIPS/W ratio of nearly 2100.
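The efficiency figures quoted here follow directly from the stated numbers, as this short sketch shows. The helper names are hypothetical; the inputs are taken from the paragraph above.

```python
def mips_per_mhz(mips, clock_mhz):
    """Work done per clock: Dhrystone MIPS divided by clock speed in MHz."""
    return mips / clock_mhz


def mips_per_watt(mips, power_mw):
    """Power efficiency: Dhrystone MIPS divided by power in watts."""
    return mips / (power_mw / 1000.0)


launch_efficiency = mips_per_mhz(375, 300)   # 0.25 micron part at release
later_efficiency = mips_per_mhz(500, 400)    # ARM1020E on 0.13 micron
power_efficiency = mips_per_watt(500, 240)   # ARM1020E at 240 mW

print(launch_efficiency, later_efficiency, round(power_efficiency))
```

Both parts deliver 1.25 MIPS per MHz, so the later gain comes entirely from clock speed, and 500 MIPS at 240mW works out to roughly 2083 MIPS/W.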

In 2002, Samsung announced their implementation of the ARM1020E, codenamed Halla. They used the expertise gained from producing the DEC Alpha 21264 to fabricate the ARM10 on a 0.13µm process at an even lower core voltage than previously used. Their core scales from 400MHz (260mW, 0.7V) to an incredible 1200MHz (1.8W, 1.1V). While its power usage is higher, and the MIPS/W ratio worse, this demonstrates the scalability introduced by the latest ARM core.
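As a rough sanity check on these figures, the standard first-order CMOS model says that dynamic power scales with clock frequency times the square of the supply voltage. The sketch below applies that model to the 400MHz operating point; it ignores leakage and any other static power, so it is an estimate only, and the function name is hypothetical.

```python
def scaled_power_mw(base_power_mw, base_mhz, base_volts, new_mhz, new_volts):
    """First-order dynamic power scaling: P is proportional to f * V^2."""
    frequency_factor = new_mhz / base_mhz
    voltage_factor = (new_volts / base_volts) ** 2
    return base_power_mw * frequency_factor * voltage_factor


# Scale the quoted 400MHz / 0.7V / 260mW point up to 1200MHz at 1.1V.
estimate_mw = scaled_power_mw(260, 400, 0.7, 1200, 1.1)

print(round(estimate_mw))
```

The estimate comes out a little above 1.9W, close to the 1.8W quoted, which suggests the two operating points are consistent with ordinary voltage and frequency scaling.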

Fact Sheet

  • Processors available:
    • ARM1020E (cell with TDMI extensions, MMU, dual 32KB instruction and data caches, optional VFP10 vector floating-point unit)
    • ARM1022E (as ARM1020E with 16KB caches)
    • ARM1026EJ-S (as ARM1020E, but a fully synthesizable processor with Jazelle Java acceleration and configurable cache sizes)
  • Fabrication: 0.25µm, 0.18µm, 0.13µm
  • Clock: 300MHz, 400MHz, 600MHz, 700MHz, 800MHz, 1GHz, 1.2GHz
  • Cache: 64-bit split instruction/data
  • Addressing: 32-bit
  • Architecture: ARMv5TE
  • Notable features: incredibly scalable, highest MIPS/Watt yet, introduces support for vector floating point coprocessor

References:

"ARM System-on-Chip Architecture", Furber, Addison-Wesley, 2000
"Exploring the ARM1026EJ-S Pipeline", Levy, Microprocessor Report 2002-04-30
"Samsung Twists ARM Past 1GHz", Levy, Microprocessor Report 2002-10-16
"ARM10 Data Sheet", ARM Limited, www.arm.com
"ARM10 Thumb Family Product Overview", Advanced RISC Machines Ltd, www.arm.com