ARM Pipelining

7/28/2019 ARM Pipelining


Lecture 5 - ARM Organization and Implementation - ICE 1222/2342

Fall, 2008

Daeyoung Kim

kimd@icu.ac.kr

http://resl.icu.ac.kr/~kimd



Contents

3-stage pipeline ARM organization & implementation

5-stage pipeline ARM organization & implementation



3-stage pipeline ARM Organization

[Figure: 3-stage pipeline ARM organization, as used in ARM processors up to ARM7. Blocks: address register, incrementer, register bank (including PC), multiply register, barrel shifter, ALU, data out register, data in register, instruction decode & control. Buses and ports: A[31:0], D[31:0], A bus, B bus, ALU bus.]



3-stage pipeline

Fetch
  - The instruction is fetched and placed in the instruction pipeline

Decode
  - The instruction is decoded and the datapath control signals are prepared for the next cycle
  - The instruction owns the decode logic but not the datapath

Execute
  - The instruction owns the datapath
  - The register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register



ARM single-cycle instruction 3-stage pipeline operation

[Figure: instructions 1, 2 and 3 plotted against time; each passes through fetch, decode and execute.]



ARM multi-cycle instruction 3-stage pipeline operation

[Figure: five instructions plotted against time.
  1  ADD: fetch, decode, execute
  2  STR: fetch, decode, calc. addr., data xfer
  3  ADD: fetch, decode, execute
  4  ADD: fetch, decode, execute
  5  ADD: fetch, decode, execute]
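
The effect of a multi-cycle instruction on a 3-stage pipeline can be sketched with a small timing model. This is only an illustrative simplification, not the real ARM control logic: it assumes in-order issue, that an instruction can enter execute no earlier than two cycles after its ideal fetch slot, and that the datapath is busy for the whole execute time of the previous instruction. It ignores contention between data transfers and instruction fetches. The name pipeline_schedule and the cycle numbering (starting at 1) are assumptions made for this example.

```python
def pipeline_schedule(exec_cycles):
    """Return (first, last) execute cycle for each instruction.

    exec_cycles[i] is the number of cycles instruction i occupies the
    datapath (1 for ADD, 2 for STR in the slide's example).
    """
    schedule = []
    datapath_free = 1  # first cycle in which the datapath is free
    for i, n in enumerate(exec_cycles):
        earliest = i + 3  # fetch in cycle i+1, decode in cycle i+2, if never stalled
        start = max(earliest, datapath_free)
        end = start + n - 1
        schedule.append((start, end))
        datapath_free = end + 1
    return schedule


if __name__ == "__main__":
    # Three single-cycle instructions: one completes per cycle.
    print(pipeline_schedule([1, 1, 1]))
    # ADD, STR, ADD, ADD, ADD: the STR holds the datapath for two cycles.
    print(pipeline_schedule([1, 2, 1, 1, 1]))
```

In this model every instruction after the STR starts executing one cycle later than it would in the single-cycle case.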



To achieve higher performance

$T_{prog} = \frac{N_{inst} \times CPI}{f_{clk}}$

Increase the clock rate, $f_{clk}$
  - The logic in each pipeline stage has to be simplified and, therefore, the number of pipeline stages increased

Reduce the average number of clock cycles per instruction, CPI
  - Instructions which occupy more than one pipeline slot are re-implemented to occupy fewer slots
  - Pipeline stalls caused by dependencies between instructions are reduced

Memory bottleneck (Von Neumann bottleneck)
  - Deliver more than 32 bits per access
  - Separate instruction and data memory
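
As a quick worked example of the execution-time formula, the sketch below evaluates it for made-up numbers. The function name t_prog and all figures (instruction count, CPI values, clock rates) are hypothetical and chosen only to show how the two levers, clock rate and CPI, act on the result.

```python
def t_prog(n_inst, cpi, f_clk):
    """Program execution time in seconds: Ninst x CPI / fclk."""
    return n_inst * cpi / f_clk


if __name__ == "__main__":
    base = t_prog(1_000_000, 1.5, 50e6)        # hypothetical baseline
    faster_clock = t_prog(1_000_000, 1.5, 100e6)  # doubling fclk halves the time
    lower_cpi = t_prog(1_000_000, 1.2, 50e6)   # lowering CPI scales the time by 1.2/1.5
    print(base, faster_clock, lower_cpi)
```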



ARM9TDMI 5-stage pipeline organization

Fetch
  - The instruction is fetched and placed in the instruction pipeline

Decode
  - The instruction is decoded and register operands are read

Execute
  - An operand is shifted and the ALU result generated
  - Load/Store: the memory address is calculated in the ALU

Buffer/Data
  - Data memory is accessed if required
  - Otherwise the ALU result is simply buffered

Write-back
  - The result is written back to the register file

[Figure: ARM9TDMI 5-stage pipeline organization. Stages: fetch, instruction decode, execute, buffer/data, write-back. Blocks: I-cache, +4 incrementers, I decode, register read, immediate fields, shift, mul, ALU, mux, forwarding paths, D-cache, byte repl., rot/sgn ex, register write. Labelled paths: next pc, pc + 4, pc + 8, r15, reg shift, load/store address, pre-index, post-index, LDM/STM, B, BL, MOV pc, SUBS pc, LDR pc.]



Data Forwarding

A major source of complexity in the 5-stage pipeline
  - Instruction execution is spread across the stages
  - Forwarding paths resolve data dependencies without stalling the pipeline

Even with forwarding we cannot always avoid a stall
  - LDR rN, [..]
    ADD r2, r1, rN
  - A one-cycle stall is required: rN is available only at the end of the buffer/data stage
  - Use instruction-level scheduling: do not put a dependent instruction immediately after a load instruction
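
The load-use rule above can be checked mechanically. The following is a minimal sketch that counts one-cycle stalls in a straight-line instruction sequence, assuming the only hazard is a load immediately followed by an instruction that reads the loaded register. The tuple encoding (opcode, destination, sources), the function name load_use_stalls and the register choices are hypothetical, introduced only for this example.

```python
def load_use_stalls(program):
    """Count stalls caused by using a loaded register in the very next instruction.

    Each instruction is a tuple (opcode, dest, sources).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "LDR" and dest in cur[2]:
            stalls += 1
    return stalls


if __name__ == "__main__":
    naive = [
        ("LDR", "r3", ("r0",)),
        ("ADD", "r2", ("r1", "r3")),   # needs r3 right after the load
        ("SUB", "r5", ("r6", "r7")),   # independent of the load
    ]
    # Scheduling: move the independent SUB between the load and its use.
    scheduled = [naive[0], naive[2], naive[1]]
    print(load_use_stalls(naive), load_use_stalls(scheduled))
```

Reordering is only legal when the moved instruction is independent of its new neighbours, which this sketch does not verify.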


Data Processing Instructions

[Figure: datapath activity for data processing instructions. Labels: address register, increment, registers (Rd, Rn, Rm, PC), mult, "as ins.", "as instruction", data out, data in, i. pipe. Panels: (a) register - register operations; (b) register - immediate operations, with the immediate field [7:0].]



Data Transfer Instructions (STR)

[Figure: datapath activity for STR. Panels: (a) 1st cycle - compute address (labels: address register, increment, registers, Rn, PC, mult, lsl #0, = A / A + B / A - B, [11:0]); (b) 2nd cycle - store data & auto-index (labels: address register, increment, registers, Rn, Rd, PC, mult, shifter, = A + B / A - B, byte?, data in, i. pipe).]

Immediate offset. If a byte is stored, it is replicated four times; the lowest two bits are used for proper by



Branch Instructions

[Figure: datapath activity for branch instructions. Panels: (a) 1st cycle - compute branch target (labels: address register, increment, registers, PC, mult, lsl #2, = A + B, [23:0]); (b) 2nd cycle - save return address (labels: address register, increment, registers, R14, PC, mult, shifter, = A).]
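
The first branch cycle (offset field [23:0], lsl #2, ALU = A + B) can be sketched as follows. This is a hedged illustration, not the hardware description: it assumes the 24-bit field is sign-extended before the shift and that the PC value seen by the datapath is the branch instruction's address plus 8, as in the ARM programmer's model. The function name branch_target is hypothetical.

```python
def branch_target(instr_addr, imm24):
    """Compute the target of an ARM B/BL from its 24-bit offset field."""
    offset = imm24 & 0xFFFFFF
    if offset & 0x800000:          # sign-extend the 24-bit field
        offset -= 1 << 24
    pc = instr_addr + 8            # PC as read during execute (assumption stated above)
    return (pc + (offset << 2)) & 0xFFFFFFFF   # lsl #2, then A + B


if __name__ == "__main__":
    # Offset field 0xFFFFFE is -2 words, so the branch targets itself.
    print(hex(branch_target(0x8000, 0xFFFFFE)))
    # Offset field 0 targets the instruction two words ahead.
    print(hex(branch_target(0x8000, 0x000000)))
```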



ARM Implementation - 1

Clocking Scheme
  - Most ARMs do not operate with edge-sensitive registers
  - Based around 2-phase non-overlapping clocks generated internally from a single input clock signal
  - Allows level-sensitive transparent latches
  - Data movement is controlled by passing the data alternately through latches open during phase 1 and latches open during phase 2
  - The non-overlapping property ensures no race condition

[Figure: 2-phase non-overlapping clock waveforms, phase 1 and phase 2, over 1 clock cycle.]




Datapath Timing (1)

[Figure: datapath timing over phase 1 and phase 2. Labels: register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, precharge invalidates buses.]




Datapath Timing (2)

The minimum datapath cycle time is the sum of
  - Register read time
  - Shifter delay
  - ALU delay (dominates cycle time)
      Logical operations are relatively faster than arithmetic operations. Why?
  - Register write set-up time
  - Phase 2 to phase 1 non-overlap time




Adder Design - 1

http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Comb/adder.html

32-bit addition time has a significant effect on the datapath cycle time
  - Influences the maximum clock rate and the processor's performance

The first ARM processor prototype
  - Ripple-carry adder circuit
  - Worst-case carry path is 32 gates long

[Figure: one bit of the ripple-carry adder, with inputs A, B, Cin and outputs sum, Cout.]
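
A ripple-carry adder can be modelled bit by bit to make the long carry path visible. The sketch below is a software illustration under the usual full-adder equations, not the gate-level circuit of the ARM prototype; the function names full_adder and ripple_carry_add are chosen for this example.

```python
def full_adder(a, b, cin):
    """One adder bit: returns (sum, cout) for single-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, cin=0, width=32):
    """Add two width-bit numbers; the carry ripples through every bit in turn."""
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    # Worst case: the carry generated at bit 0 propagates through all 32 bits.
    print(ripple_carry_add(0xFFFFFFFF, 1))
```

The loop makes the dependency explicit: bit i cannot produce its carry until bit i-1 has, which is why the worst-case delay grows with the word length.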




Adder Design - 2

ARM2 4-bit look-ahead scheme
  - To reduce the worst-case carry path length

[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0], outputs sum[3:0], Cout[3], and look-ahead signals P and G.]



Carry-Look-Ahead (CLA) Adder - 1

Calculating the carry signals in advance
  - A carry signal will be generated
      when both bits $A_i$ and $B_i$ are 1, or
      when one of the two bits is 1 and the carry-in (carry of the previous stage) is 1

$C_{OUT} = C_{i+1} = A_i \cdot B_i + (A_i \oplus B_i) \cdot C_i$   (1)
$C_{i+1} = G_i + P_i \cdot C_i$   (2)
$G_i = A_i \cdot B_i$   (3) Generate
$P_i = A_i \oplus B_i$   (4) Propagate

The Propagate and Generate terms depend only on the input bits and will be valid after one gate delay.

If one uses the above expression to calculate the carry signals, one does not need to wait for the carry to ripple through all the previous stages to find its proper value.




Let's apply this to a 4-bit adder

$C_1 = G_0 + P_0 \cdot C_0$   (5)
$C_2 = G_1 + P_1 \cdot C_1 = G_1 + P_1 \cdot G_0 + P_1 \cdot P_0 \cdot C_0$   (6)
$C_3 = G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0 + P_2 \cdot P_1 \cdot P_0 \cdot C_0$   (7)
$C_4 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0 + P_3 \cdot P_2 \cdot P_1 \cdot P_0 \cdot C_0$   (8)

The carry-out bit, $C_{i+1}$, of the last stage will be available after three delays (one delay to calculate the Propagate signal and two delays as a result of the AND and OR gates).

The Sum signal can be calculated as follows

$S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i$   (9)
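
Equations (3) to (9) translate directly into code. The following is a minimal sketch of a 4-bit carry-look-ahead adder that evaluates each carry from the generate and propagate terms and the carry-in only, without rippling. The function name cla4 and the list-based bit representation are choices made for this example.

```python
def cla4(a, b, c0=0):
    """4-bit carry-look-ahead add; returns (sum, carry_out)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [x & y for x, y in zip(A, B)]   # generate, eq. (3)
    P = [x ^ y for x, y in zip(A, B)]   # propagate, eq. (4)

    # Carries written out as in eqs. (5)-(8): each depends only on G, P and c0.
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))

    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (P[i] ^ carries[i]) << i   # sum bit, eq. (9)
    return total, c4


if __name__ == "__main__":
    print(cla4(0b1011, 0b0110))   # 11 + 6 = 17, i.e. sum 1 with carry-out 1
```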




4-bit adder




16-bit adder (Group)

$PG = P_3 \cdot P_2 \cdot P_1 \cdot P_0$   (10)
$GG = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0$   (11)
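
The group signals of equations (10) and (11) let a 4-bit block report its carry behaviour without exposing its internal carries. The sketch below builds a 16-bit adder from four such blocks. It is an illustration under simplifying assumptions: the sum bits inside each block are computed with ordinary integer addition as a stand-in for the 4-bit adder logic, and the carries between blocks are chained in a loop rather than computed by a second-level look-ahead unit. The names group_pg and add16 are hypothetical.

```python
def group_pg(a4, b4):
    """Return (PG, GG) for one 4-bit group, as in eqs. (10) and (11)."""
    G = [((a4 >> i) & 1) & ((b4 >> i) & 1) for i in range(4)]
    P = [((a4 >> i) & 1) ^ ((b4 >> i) & 1) for i in range(4)]
    pg = P[3] & P[2] & P[1] & P[0]
    gg = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1]) | (P[3] & P[2] & P[1] & G[0])
    return pg, gg


def add16(a, b, cin=0):
    """16-bit add built from four 4-bit groups; returns (sum, carry_out)."""
    result = 0
    carry = cin
    for k in range(4):
        a4 = (a >> (4 * k)) & 0xF
        b4 = (b >> (4 * k)) & 0xF
        # Stand-in for the 4-bit adder: only the low 4 sum bits are kept.
        result |= ((a4 + b4 + carry) & 0xF) << (4 * k)
        pg, gg = group_pg(a4, b4)
        carry = gg | (pg & carry)   # group carry-out from GG, PG and carry-in
    return result, carry


if __name__ == "__main__":
    print(add16(0xFFFF, 0x0001))
    print(hex(add16(0x1234, 0x0FCC)[0]))
```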




ALU functions

Adder, address computations for memory transfer, branch calculations, bit-wise logical functions, and so on

fs5  fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    0    1    0    0    A and B
 0    0    1    0    0    0    A and not B
 0    0    1    0    0    1    A xor B
 0    1    1    0    0    1    A plus not B plus carry
 0    1    0    1    1    0    A plus B plus carry
 1    1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    0    A
 0    0    0    0    0    1    A or B
 0    0    0    1    0    1    B
 0    0    1    0    1    0    not B
 0    0    1    1    0    0    zero
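
The function-select table can be exercised with a small behavioural model. This sketch maps each fs code listed in the table to the stated operation on 32-bit values. It models only the outputs named in the table, not the gate-level logic, and it assumes fs5 is the most significant bit of the code. The names ALU_FUNCTIONS and alu are hypothetical.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath

# Keys are fs5..fs0 written as a binary number, fs5 first (assumed bit order).
ALU_FUNCTIONS = {
    0b000100: lambda a, b, c: a & b,                  # A and B
    0b001000: lambda a, b, c: a & ~b,                 # A and not B
    0b001001: lambda a, b, c: a ^ b,                  # A xor B
    0b011001: lambda a, b, c: a + (~b & MASK) + c,    # A plus not B plus carry
    0b010110: lambda a, b, c: a + b + c,              # A plus B plus carry
    0b110110: lambda a, b, c: (~a & MASK) + b + c,    # not A plus B plus carry
    0b000000: lambda a, b, c: a,                      # A
    0b000001: lambda a, b, c: a | b,                  # A or B
    0b000101: lambda a, b, c: b,                      # B
    0b001010: lambda a, b, c: ~b,                     # not B
    0b001100: lambda a, b, c: 0,                      # zero
}


def alu(fs, a, b, carry=0):
    """Return the 32-bit ALU output for function-select code fs."""
    return ALU_FUNCTIONS[fs](a, b, carry) & MASK


if __name__ == "__main__":
    # "A plus not B plus carry" with carry = 1 is a two's complement subtract.
    print(alu(0b011001, 10, 3, carry=1))
```

With carry set to 1, the "A plus not B plus carry" row gives A - B and the "not A plus B plus carry" row gives B - A, which is how one adder can serve both subtraction directions.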




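ALU functions

The ARM2 ALU logic for one result bit

[Figure: ARM2 ALU logic for one result bit. Inputs: NA bus, NB bus, function select lines fs: 5, 4, 3, 2, 1, 0. Internal signals: G, P, carry logic. Output: ALU bus.]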




ARM6 Carry-Select Adder

  - Computes the sums of various fields of the word for a carry-in of both zero and one
  - The final result is selected by using the correct carry-in bit

[Figure: carry-select adder. Inputs a,b[3:0] ... a,b[31:28] feed adders (+ and +, +1) producing s and s+1; muxes controlled by the carry c select sum[3:0], sum[7:4], sum[15:8], sum[31:16].]
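
The carry-select idea can be sketched in a few lines: each field computes both candidate sums up front, and the incoming carry only drives a selection. The field boundaries below follow the sum[3:0], sum[7:4], sum[15:8], sum[31:16] labels from the slide; treating each of these as a single adder field is an assumption of this sketch, and the function name carry_select_add is hypothetical.

```python
def carry_select_add(a, b, cin=0,
                     fields=((0, 4), (4, 4), (8, 8), (16, 16))):
    """32-bit carry-select add; fields are (low_bit, width) pairs."""
    result = 0
    carry = cin
    for low, width in fields:
        mask = (1 << width) - 1
        fa = (a >> low) & mask
        fb = (b >> low) & mask
        s0 = fa + fb          # candidate sum for carry-in 0
        s1 = fa + fb + 1      # candidate sum for carry-in 1
        chosen = s1 if carry else s0   # the mux: pick using the real carry-in
        result |= (chosen & mask) << low
        carry = chosen >> width        # carry-out of this field
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0xFFFFFFFF, 1))
    print(hex(carry_select_add(0x0000FFFF, 0x00000001)[0]))
```

In hardware both candidate sums for every field are produced in parallel, so the critical path is one field addition plus a chain of mux selections rather than a full-width carry ripple.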




ARM6 ALU Organization

[Figure: ARM6 ALU organization. Blocks: A operand latch, B operand latch, XOR gates (invert A, invert B), logic functions, adder (with C in), result mux (controlled by logic/arithmetic and function), zero detect. Outputs: result and the flags N, Z, C, V.]




Barrel Shifter

  - The shifter performance is critical
  - Shifter time contributes to the datapath cycle time

[Figure: 4 x 4 barrel shifter switch matrix with inputs in[0]..in[3], outputs out[0]..out[3], and control lines: no shift, right 1, right 2, right 3, left 1, left 2, left 3.]
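
The switch-matrix idea is that each output bit is connected to exactly one input bit, chosen by the active control line, so any shift amount costs the same single switch delay. The sketch below models that for a small bit list. It assumes index 0 is the least significant bit, that a left shift moves in[i] to out[i + k], and that vacated positions are filled with zero; the function name crossbar_shift and the sign convention for the shift argument are hypothetical.

```python
def crossbar_shift(bits, shift):
    """Shift a list of bits (index 0 = LSB) through a switch matrix.

    shift > 0 selects "left shift", shift < 0 selects "right -shift",
    shift == 0 selects "no shift". Unconnected outputs read as 0.
    """
    n = len(bits)
    out = [0] * n
    for j in range(n):
        i = j - shift          # the one input connected to output j
        if 0 <= i < n:
            out[j] = bits[i]
    return out


if __name__ == "__main__":
    word = [1, 0, 1, 1]        # in[0]..in[3]
    print(crossbar_shift(word, 0))    # no shift
    print(crossbar_shift(word, 1))    # left 1
    print(crossbar_shift(word, -2))   # right 2
```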




The ARM register bank

[Figure: register bank organization. Blocks: A bus read decoders, B bus read decoders, write decoders, register cells, PC. Connections: ALU bus, A bus, B bus, PC bus, INC bus, Vdd, Vss.]




Control Structures

[Figure: control logic structure. Blocks: decode PLA, cycle count, multiply control, load/store multiple, address control, register control, ALU control, shifter control. Signals: instruction, coprocessor.]



ARM Coprocessor Interface - 1

A general-purpose extension of the instruction set through the addition of hardware coprocessors
  - Also supports software emulation of coprocessors through the undefined instruction trap

Coprocessor Architecture
  - 16 logical coprocessors
  - Each coprocessor can have up to 16 private registers of any reasonable size
  - Load-store architecture
      Internal operations on registers
      Load and store from and to the memory
      Move data to or from an ARM register

Implementation
  - Board-level coprocessor: slow speed
  - On-chip coprocessor: high clock speed, cache and memory management, etc.




ARM7TDMI Coprocessor interface

Bus watching
  - The coprocessor is attached to a bus where the ARM instruction stream flows into the ARM
  - The coprocessor copies the instructions into an internal pipeline

Handshake between ARM and coprocessor
  - cpi* (from ARM to all coprocessors): coprocessor instruction
  - cpa (from the coprocessors to ARM): coprocessor absent
  - cpb (from the coprocessors to ARM): coprocessor busy




Handshake outcomes

ARM may decide not to execute it
  - It falls in a branch shadow or fails the condition code test: cpi* high

ARM may decide to execute it (cpi* low), but cpa high
  - Undefined instruction trap

ARM decides to execute it and a coprocessor accepts it, but cannot execute it yet
  - cpa low but cpb high
  - Busy-wait while stalling the instruction stream
  - Enabled interrupt request arrives? Handle it and retry the coprocessor instruction later

ARM decides to execute it and a coprocessor accepts it and executes it immediately
  - cpi* low, cpa low, cpb low
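
The four outcomes above amount to a priority decode of the three handshake signals. The sketch below expresses that decode as a function. It is a behavioural summary only, with signal levels passed as booleans (True = high); the function name handshake_outcome and the returned strings are hypothetical labels for this example.

```python
def handshake_outcome(cpi_n, cpa, cpb):
    """Classify a coprocessor handshake from the signal levels (True = high).

    cpi_n is the active-low cpi* signal driven by the ARM;
    cpa and cpb are driven by the coprocessors.
    """
    if cpi_n:
        return "not executed"                 # branch shadow or failed condition
    if cpa:
        return "undefined instruction trap"   # no coprocessor accepted it
    if cpb:
        return "busy-wait"                    # accepted, but cannot execute yet
    return "execute"                          # accepted and executed immediately


if __name__ == "__main__":
    for signals in [(True, True, True), (False, True, True),
                    (False, False, True), (False, False, False)]:
        print(signals, "->", handshake_outcome(*signals))
```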
