A power-efficient, high-performance RISC microprocessor core from ARM Limited, the ARM8 was the 1996 follow-up to the immensely successful ARM7. The primary goal of the project was to double the performance of the ARM7 while retaining the low power consumption and simplicity of implementation for which ARM processors had become renowned.

Increasing Performance

There are two ways of increasing the effective performance of a microprocessor: reducing the number of clock cycles per instruction (CPI), or increasing the clock rate of the processor. The former requires altering the datapath so that instructions occupy fewer pipeline slots, so that pipeline stalls caused by inter-instruction dependencies are reduced, or both. The latter requires extending the pipeline, simplifying each stage so that it can complete its work in less time.
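
Both options fall out of the standard performance relation, time = instructions x CPI / clock rate: lowering the CPI or raising the clock each cuts execution time directly. A minimal sketch in C, using made-up figures rather than published ARM numbers:

    #include <stdio.h>

    /* The textbook performance relation:
     *     time = instructions * CPI / clock_rate
     * The figures below are made up for illustration; they are not
     * ARM7 or ARM8 measurements. */
    static double run_time(double instructions, double cpi, double clock_hz)
    {
        return instructions * cpi / clock_hz;
    }

    int main(void)
    {
        double insns = 1e6;
        printf("baseline:     %.3f ms\n", 1e3 * run_time(insns, 2.0, 40e6));
        printf("lower CPI:    %.3f ms\n", 1e3 * run_time(insns, 1.5, 40e6));
        printf("faster clock: %.3f ms\n", 1e3 * run_time(insns, 2.0, 60e6));
        return 0;
    }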

The von Neumann Bottleneck

The fundamental difficulty in reducing CPI relative to the ARM7 is that the three-stage pipeline is primarily restricted by the ARM's use of the von Neumann architecture, which allows at most one memory access per clock cycle. In fact, the ARM7's pipeline already makes optimal use of the memory sub-system: memory is accessed on every cycle.

Clearly, then, the only way to significantly improve CPI is to alter the memory architecture so that multiple 32-bit accesses are possible within one cycle. There are two ways of doing this: use separate instruction and data memories (a Harvard architecture), or use a double-bandwidth unified memory. The DEC StrongARM design took the former approach, while ARM's design team chose the latter for the ARM8.

The ARM8 exploits spatial locality to its advantage: almost all instructions are executed sequentially, and the load/store multiple operations read and write contiguous blocks of memory. This allows a double-bandwidth, 32-bit wide memory to deliver the first word of a sequential access in one clock cycle and the next word half a cycle later. Double-bandwidth memory requires only a little extra hardware for a large boost in performance, and a 32-bit bus with a half-cycle offset needs far less routing than a 64-bit bus.
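
A rough way to see the gain on sequential transfers such as LDM (an illustrative model only, not a figure from the ARM8 data sheet): count one word per cycle for a conventional memory against two words per cycle for a double-bandwidth one.

    #include <stdio.h>

    /* Illustrative cycle counts for an N-word sequential transfer.
     * A conventional 32-bit memory supplies one word per cycle; a
     * double-bandwidth memory overlaps a second sequential word half
     * a cycle later, i.e. two words per cycle on a burst.  This is a
     * simplified model, not the ARM8's actual memory timing. */
    static int cycles_single(int words) { return words; }
    static int cycles_double(int words) { return (words + 1) / 2; }

    int main(void)
    {
        int words = 8;   /* e.g. an LDM of eight registers */
        printf("conventional memory:     %d cycles\n", cycles_single(words));
        printf("double-bandwidth memory: %d cycles\n", cycles_double(words));
        return 0;
    }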

Altering the Pipeline

In order to achieve an increased clock rate, the ARM's 3-stage fetch-decode-execute pipeline had to be changed significantly. The ARM8 uses a more "standard" RISC 5-stage pipeline, very similar to that of the MIPS processor. The five stages are: instruction prefetch, decode/register read, ALU operation/shift, data memory access/ALU write, and register write-back:

      ---------- -------- ----------------------------------
     |  Fetch   | Decode |             Execute              |
      ---------- -------- ----------------------------------
Standard ARM pipeline (above) compared with ARM8 pipeline (below). Not to scale: each stage takes one clock cycle.
      ---------- -------------- -------- -------- ----------
     | Prefetch | Decode / Reg | ALU Op | Memory | Register |
     |          |         Read | Shift  | Access | Write    |
      ---------- -------------- -------- -------- ----------

The new prefetch stage selects and buffers instructions, thereby exploiting the double-bandwidth memory. For the first time in an ARM core, branch prediction is employed; however, to keep core complexity low, the prefetch unit uses static prediction based on the direction of the branch: branches backwards in the code are predicted taken, and branches forwards are predicted not taken. The prefetcher passes instructions, in order, to the ALU and shift stage at a rate of one per cycle, and the ALU notifies the prefetch unit when a branch mis-prediction occurs, allowing it to fetch the correct instruction stream.
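
A minimal sketch of the backward-taken/forward-not-taken rule described above, assuming the predictor can compare a branch's target address with the branch's own address (a simplification of what the real prefetch unit decodes):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Static prediction rule from the text: backward branches (loops)
     * are predicted taken, forward branches are predicted not taken.
     * Simplified illustration; the real prefetch unit works on the
     * signed offset field of decoded ARM branch instructions. */
    static bool predict_taken(uint32_t branch_addr, uint32_t target_addr)
    {
        return target_addr <= branch_addr;
    }

    int main(void)
    {
        printf("loop back to 0x8000:  %s\n",
               predict_taken(0x8020, 0x8000) ? "predict taken" : "predict not taken");
        printf("skip ahead to 0x8040: %s\n",
               predict_taken(0x8020, 0x8040) ? "predict taken" : "predict not taken");
        return 0;
    }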

One drawback of the extended pipeline is that instruction scheduling becomes more important. Loading a register with a value from memory (LDR) and immediately using that register as a source operand causes a single-cycle pipeline bubble. Instructions must therefore be re-ordered to avoid such data dependencies wherever possible, and ARM's own C compiler was updated accordingly.
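
A hedged illustration of why re-ordering helps (an invented fragment, not compiler output): using a loaded register in the very next instruction costs a bubble, whereas hoisting an independent instruction into the load's shadow hides it.

    #include <stdio.h>

    /* Toy load-use hazard model (illustration only, not ARM8 internals):
     * an instruction whose source register was loaded from memory by the
     * immediately preceding instruction costs one bubble cycle. */
    typedef struct {
        const char *text;   /* pseudo-assembly, for reference only */
        int is_load;        /* 1 if this is an LDR                 */
        int dest;           /* destination register number         */
        int src1, src2;     /* source registers, -1 if unused      */
    } Insn;

    static int count_bubbles(const Insn *code, int n)
    {
        int bubbles = 0;
        for (int i = 1; i < n; i++) {
            const Insn *prev = &code[i - 1];
            if (prev->is_load &&
                (code[i].src1 == prev->dest || code[i].src2 == prev->dest))
                bubbles++;   /* loaded value not ready yet: one-cycle stall */
        }
        return bubbles;
    }

    int main(void)
    {
        /* Naive order: ADD uses r0 straight after the LDR that produces it. */
        Insn naive[] = {
            { "LDR r0, [r2]",   1, 0, 2, -1 },
            { "ADD r1, r0, r3", 0, 1, 0,  3 },
            { "SUB r4, r5, r6", 0, 4, 5,  6 },
        };
        /* Re-ordered: the independent SUB fills the load delay slot. */
        Insn scheduled[] = {
            { "LDR r0, [r2]",   1, 0, 2, -1 },
            { "SUB r4, r5, r6", 0, 4, 5,  6 },
            { "ADD r1, r0, r3", 0, 1, 0,  3 },
        };
        printf("naive order:     %d bubble(s)\n", count_bubbles(naive, 3));
        printf("scheduled order: %d bubble(s)\n", count_bubbles(scheduled, 3));
        return 0;
    }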

Performance Gains

When used with a double-bandwidth memory, the ARM8 outperforms its predecessor by a factor of 2 to 3, while increasing core size by a similar proportion. The average CPI drops from the ARM7's 1.9 to 1.4, with a full 10% of that improvement coming from static branch prediction. The average CPI of Load Multiple (LDM) instructions improves by a factor of 1.5 for most applications, and single-register loads and stores (LDR, STR) become single-cycle instructions in the common case.
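
Putting the quoted figures together as a rough check (the clock-rate ratio below is an assumed example value, not a published ARM8 number): the CPI reduction alone is worth about 1.9 / 1.4 ≈ 1.36x, so most of the remaining gain towards the overall factor of 2 to 3 comes from the higher clock rate that the deeper pipeline allows.

    #include <stdio.h>

    /* Back-of-the-envelope decomposition of the overall speed-up.
     * The CPI figures (1.9 and 1.4) are quoted in the text; the clock
     * ratio is an assumed example, not a published ARM8 number. */
    int main(void)
    {
        double cpi_gain   = 1.9 / 1.4;   /* ~1.36x from the reduced CPI */
        double clock_gain = 1.8;         /* assumed clock-rate increase */
        printf("CPI contribution:  %.2fx\n", cpi_gain);
        printf("combined speed-up: %.2fx\n", cpi_gain * clock_gain);
        return 0;
    }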

An obvious application of the ARM8 is to pair it with a double-bandwidth cache to produce a high-performance CPU. This was realised in the ARM810, which used an 8KB, 64-way set-associative cache and also introduced support for the more power- and bandwidth-efficient copy-back (write-back) cache write strategy.
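
A minimal sketch of the copy-back idea, assuming a toy single-line cache with a dirty bit; the ARM810's real cache is 8KB and 64-way associative, so this only shows why the policy saves memory bandwidth.

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified copy-back (write-back) cache sketch: stores are absorbed
     * by the cache and set a dirty bit; main memory is only updated when
     * a dirty line is evicted.  Illustration only, not the ARM810 design. */
    enum { LINE_WORDS = 4, MEM_LINES = 16 };

    static uint32_t main_memory[MEM_LINES][LINE_WORDS];   /* toy backing store */
    static int      memory_writes;                        /* traffic counter   */

    typedef struct {
        uint32_t line_no;             /* which memory line is cached */
        uint32_t data[LINE_WORDS];
        int valid, dirty;
    } CacheLine;

    static void store(CacheLine *c, uint32_t line_no, int word, uint32_t value)
    {
        if (!c->valid || c->line_no != line_no) {
            if (c->valid && c->dirty) {           /* copy back only on eviction */
                for (int i = 0; i < LINE_WORDS; i++)
                    main_memory[c->line_no][i] = c->data[i];
                memory_writes++;
            }
            for (int i = 0; i < LINE_WORDS; i++)  /* fetch the new line */
                c->data[i] = main_memory[line_no][i];
            c->line_no = line_no;
            c->valid = 1;
            c->dirty = 0;
        }
        c->data[word] = value;   /* the write stays in the cache...      */
        c->dirty = 1;            /* ...so repeated stores cost no memory */
    }

    int main(void)
    {
        CacheLine c = {0};
        for (int i = 0; i < 100; i++)
            store(&c, 3, i % LINE_WORDS, i);      /* 100 stores to one line  */
        store(&c, 5, 0, 42);                      /* eviction forces 1 write */
        printf("memory line writes: %d\n", memory_writes);
        return 0;
    }

A write-through cache would instead have generated one external write per store in the loop above, which is the bandwidth and power cost the copy-back policy avoids.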

Products Using ARM8

The ARM8 was not widely used: some mobile phones were produced with the design, and a prototype CPU card was produced for the Acorn Risc PC. However, by the time it was available at 72MHz, the similarly designed DEC StrongARM had been released at 200MHz, offering higher performance for similar power consumption and cost. The lower end of the market was already sewn up by the ARM7, and so the ARM8 struggled to find a niche. It was later superseded by the ARM9 core, which also supported the ARM7's popular TDMI extensions while further increasing performance.

Fact Sheet

  • Used in: cell phones, Acorn Risc PC prototypes, network computers
  • Processors available: ARM810 (cell, MMU, 8KB cache, write buffer)
  • Fabrication: 0.6µm, 0.5µm, 0.35µm
  • Clock: 0--72MHz
  • Cache: double-bandwidth unified
  • Addressing: 26-bit, 32-bit
  • Architecture: ARMv4
  • Notable features: improved performance from double-bandwidth memory, 5-stage pipeline, copy-back cache

Sources:

"ARM System-on-Chip Architecture", Furber, Addison-Wesley, 2000
"ARM810: Dancing to the Beat of a Different Drum" presentation, Larri, 1996
"ARM8 Data Sheet", Advanced RISC Machines Ltd (ARM), www.arm.com