7/28/2019 ARM Pipelining
Lecture 5 - ARM Organization and Implementation - ICE 1222/2342
Fall, 2008
Daeyoung Kim
kimd@icu.ac.kr
http://resl.icu.ac.kr/~kimd
Contents
- 3-stage pipeline ARM organization & implementation
- 5-stage pipeline ARM organization & implementation
3-stage pipeline ARM Organization
[Figure: datapath of the 3-stage pipeline organization (ARM processors up to ARM7): address register with incrementer, register bank (including the PC), multiplier, barrel shifter, ALU, data in and data out registers, and instruction decode & control, connected by the A bus, B bus and ALU bus, with external buses A[31:0] and D[31:0].]
3-stage pipeline
- Fetch
  - Instruction is fetched and placed in the instruction pipeline
- Decode
  - The instruction is decoded and the datapath control signals are prepared for the next cycle
  - The instruction owns the decode logic but not the datapath
- Execute
  - The instruction owns the datapath
  - Register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register
ARM single-cycle instruction 3-stage pipeline operation
[Figure: instructions 1, 2 and 3 each pass through fetch, decode and execute, with each instruction starting one cycle after the previous one along the time axis.]
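The overlap shown in the figure can be sketched in a few lines of Python. This is only an illustrative model, assuming every instruction spends exactly one cycle in each stage; the names `STAGES`, `schedule` and `total_cycles` are made up for this sketch.

```python
STAGES = ["fetch", "decode", "execute"]

def schedule(n_instructions):
    """Map each cycle number to the (instruction, stage) pairs active in it.

    Assumes an ideal 3-stage pipeline where instruction i enters fetch
    in cycle i and advances one stage per cycle.
    """
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s + 1, []).append((i + 1, stage))
    return table

def total_cycles(n_instructions):
    """Cycles needed to complete n single-cycle instructions."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1

if __name__ == "__main__":
    for cycle, active in sorted(schedule(3).items()):
        print(cycle, active)
```

Once the pipeline is full (cycle 3 in this model), one instruction completes every cycle.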
ARM multi-cycle instruction 3-stage pipeline operation
[Figure: instruction sequence ADD, STR, ADD, ADD, ADD along the time axis. Each ADD passes through fetch, decode and execute. The STR passes through fetch, decode, calc. addr. and data xfer, so it occupies the datapath for two cycles and delays the instructions behind it.]
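A rough way to see the effect of the two-cycle STR is to track when each instruction occupies the datapath. The sketch below assumes the datapath serves one instruction at a time and that the first instruction reaches execute in cycle 3; `EXECUTE_CYCLES` and `execute_slots` are hypothetical names, and the cycle counts cover only the two instructions in the figure.

```python
# Datapath (execute-stage) cycles per instruction, as drawn in the figure:
# ADD uses one cycle, STR uses two (address calculation, then data transfer).
EXECUTE_CYCLES = {"ADD": 1, "STR": 2}

def execute_slots(program):
    """Return (mnemonic, first_cycle, last_cycle) of datapath use per instruction."""
    slots = []
    cycle = 3  # cycles 1 and 2 are the fetch and decode of the first instruction
    for op in program:
        n = EXECUTE_CYCLES[op]
        slots.append((op, cycle, cycle + n - 1))
        cycle += n
    return slots

if __name__ == "__main__":
    for slot in execute_slots(["ADD", "STR", "ADD", "ADD", "ADD"]):
        print(slot)
```

In this model the five-instruction sequence finishes in cycle 8 rather than cycle 7, because the STR holds the datapath for one extra cycle.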
To achieve higher performance
- Tprog = (Ninst x CPI) / fclk
- Increase the clock rate, fclk
  - The logic in each pipeline stage must be simplified and, therefore, the number of pipeline stages increased
- Reduce the average number of clock cycles per instruction, CPI
  - Instructions which occupy more than one pipeline slot are re-implemented to occupy fewer slots
  - Pipeline stalls caused by dependencies between instructions are reduced
- Memory bottleneck (Von Neumann bottleneck)
  - Deliver more than 32 bits per access
  - Separate instruction and data memory
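As a quick worked example of the Tprog formula, the sketch below evaluates it for made-up numbers. The instruction count, CPI values and clock rates are purely hypothetical and are not measurements of any ARM core.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Tprog = Ninst x CPI / fclk, in seconds."""
    return n_inst * cpi / f_clk_hz

if __name__ == "__main__":
    # Hypothetical workload: one million instructions.
    base = program_time(1_000_000, cpi=1.5, f_clk_hz=50e6)
    # Hypothetical improvement: lower CPI and a faster clock.
    improved = program_time(1_000_000, cpi=1.2, f_clk_hz=100e6)
    print(base, improved, base / improved)
```

Both levers from the slide appear directly in the formula: raising fclk and lowering CPI each shorten Tprog for a fixed instruction count.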
ARM9TDMI 5-stage pipeline organization
- Fetch
  - Instruction is fetched and placed in the instruction pipeline
- Decode
  - The instruction is decoded and register operands are read
- Execute
  - An operand is shifted and the ALU result is generated
  - Load/Store: the memory address is calculated in the ALU
- Buffer/Data
  - Data memory is accessed if required
  - Otherwise the ALU result is simply buffered
- Write-back
  - Result is written back to the register file
[Figure: ARM9TDMI 5-stage pipeline organization, showing the fetch, instruction decode, execute, buffer/data and write-back stages with the I-cache, instruction decode and register read, shift, multiplier, ALU, D-cache and register write blocks, the next-pc multiplexer, and the forwarding paths.]
Data Forwarding
- A major source of complexity in the 5-stage pipeline
  - Instruction execution is spread across the stages
- To resolve data dependencies without stalling the pipeline
  - Forwarding paths
- Even with forwarding we cannot avoid every stall
  - LDR rN, [..]
    ADD r2, r1, rN
  - One cycle stall required
  - rN is available at the end of the buffer/data stage
- Use instruction-level scheduling
  - Do not put a dependent instruction immediately after a load instruction
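The load-use case above can be checked mechanically. The following is a minimal sketch, assuming that the only stall source is an instruction that reads the destination of the immediately preceding LDR; the `Instr` type and `count_load_use_stalls` function are hypothetical helpers, not part of any ARM tool.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "LDR" or "ADD"
    dest: Optional[str]     # destination register, if any
    srcs: Tuple[str, ...]   # source registers

def count_load_use_stalls(program):
    """Count one-cycle stalls caused by using a loaded register immediately."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "LDR" and prev.dest in cur.srcs:
            stalls += 1
    return stalls

if __name__ == "__main__":
    dependent = [
        Instr("LDR", "r3", ("r0",)),
        Instr("ADD", "r2", ("r1", "r3")),   # needs r3 right after the load
        Instr("SUB", "r5", ("r6", "r7")),
    ]
    # Same instructions, reordered so an independent one follows the load.
    scheduled = [dependent[0], dependent[2], dependent[1]]
    print(count_load_use_stalls(dependent), count_load_use_stalls(scheduled))
```

Moving an independent instruction into the slot after the load is the instruction-level scheduling the slide recommends.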
[Figure: the ARM9TDMI 5-stage pipeline organization diagram, repeated to show the forwarding paths between the execute, buffer/data and write-back stages.]
Data Processing Instructions
[Figure: datapath activity for data processing instructions. (a) register - register operations: Rn and Rm are read from the register bank, Rm passes through the shifter, the ALU operates as specified by the instruction and the result is written to Rd. (b) register - immediate operations: the second operand is taken from bits [7:0] of the instruction instead of Rm.]
Data Transfer Instructions (STR)
[Figure: datapath activity for STR. (a) 1st cycle - compute address: Rn and the [11:0] immediate field (lsl #0) feed the ALU, which produces A, A + B or A - B for the address register. (b) 2nd cycle - store data & auto-index: Rd is driven to data out (byte?), and the ALU produces A + B or A - B.]
- Immediate offset
- If store byte, replicates it four times; lowest two bits are used for proper by…
Branch Instructions
[Figure: datapath activity for branch instructions. (a) 1st cycle - compute branch target: the [23:0] offset field is shifted lsl #2 and added to the PC (ALU = A + B). (b) 2nd cycle - save return address: the PC value is passed through the ALU (ALU = A) and written to R14.]
ARM Implementation - 1
- Clocking Scheme
  - Most ARMs do not operate with edge-sensitive registers
  - Based around 2-phase non-overlapping clocks generated internally from a single input clock signal
  - Allows level-sensitive transparent latches
  - Data movement is controlled by passing the data alternately through latches open during phase 1 and latches open during phase 2
  - Non-overlapping property ensures no race condition
[Figure: one clock cycle, showing the phase 1 and phase 2 waveforms.]
ARM Implementation - 2
Datapath Timing (1)
[Figure: datapath timing against phase 1 and phase 2: register read time until the read bus is valid, shift time until shift out is valid, ALU operands latched, ALU time until ALU out is valid, and register write time; precharge invalidates the buses.]
ARM Implementation - 3
Datapath Timing (2)
- The minimum datapath cycle time is the sum of
  - Register read time
  - Shifter delay
  - ALU delay
    - Dominates cycle time
    - Logical operations are relatively faster than arithmetic operations. Why?
  - Register write set-up time
  - Phase 2 to phase 1 non-overlap time
ARM Implementation - 4
Adder Design - 1 (http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Comb/adder.html)
- 32-bit addition time has a significant effect on the datapath cycle time
  - It influences the maximum clock rate and the processor's performance
- The first ARM processor prototype
  - Ripple-carry adder circuit
  - Worst-case carry path is 32 gates long
[Figure: one-bit ripple-carry adder cell with inputs A, B and Cin and outputs sum and Cout.]
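To make the long carry path concrete, here is a small behavioural sketch of a ripple-carry adder in Python. It models the logic only, not gate delays; `full_adder` and `ripple_carry_add` are illustrative names chosen for this example.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a, b, cin=0, width=32):
    """Add two unsigned integers bit by bit; the carry ripples from bit 0 upward."""
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry

if __name__ == "__main__":
    # Worst case: the carry generated at bit 0 must travel through all 32 stages.
    print(ripple_carry_add(0xFFFFFFFF, 1))
```

The loop makes the dependency visible: bit i cannot be finalised until the carry from bit i - 1 is known, which is why the worst-case path grows with the word length.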
ARM Implementation - 5
Adder Design - 2
- ARM2 4-bit look-ahead scheme
  - To reduce the worst-case carry path length
[Figure: 4-bit adder logic with inputs A[3:0], B[3:0] and Cin[0], outputs sum[3:0] and Cout[3], and group signals P and G.]
Carry-Look-Ahead (CLA) Adder - 1
- Calculates the carry signals in advance
- A carry signal will be generated
  - when both bits $A_i$ and $B_i$ are 1
  - when one of the two bits is 1 and the carry-in (carry of the previous stage) is 1
- $C_{out} = C_{i+1} = A_i \cdot B_i + (A_i \oplus B_i) \cdot C_i$ (1)
- $C_{i+1} = G_i + P_i \cdot C_i$ (2)
- $G_i = A_i \cdot B_i$ (3), Generate
- $P_i = A_i \oplus B_i$ (4), Propagate
- Propagate and Generate terms depend only on the input bits and will be valid after one gate delay
- If one uses the above expression to calculate the carry signals, one does not need to wait for the carry to ripple through all the previous stages to find its proper value
- Let's apply this to a 4-bit adder
Carry-Look-Ahead (CLA) Adder - 2
- Let's apply this to a 4-bit adder
  - $C_1 = G_0 + P_0 \cdot C_0$ (5)
  - $C_2 = G_1 + P_1 \cdot C_1 = G_1 + P_1 \cdot G_0 + P_1 \cdot P_0 \cdot C_0$ (6)
  - $C_3 = G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0 + P_2 \cdot P_1 \cdot P_0 \cdot C_0$ (7)
  - $C_4 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0 + P_3 \cdot P_2 \cdot P_1 \cdot P_0 \cdot C_0$ (8)
- The carry-out bit, $C_{i+1}$, of the last stage will be available after three delays (one delay to calculate the Propagate signal and two delays as a result of the AND and OR gates)
- The sum signal can be calculated as follows
  - $S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i$ (9)
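Equations (3) to (9) translate almost directly into code. The sketch below is a behavioural model of a 4-bit carry-look-ahead adder written for illustration; the function name `cla4` and its argument layout are assumptions of this example.

```python
def cla4(a, b, c0=0):
    """4-bit carry-look-ahead adder; returns (sum[3:0], C4)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [A[i] & B[i] for i in range(4)]   # generate, eq. (3)
    P = [A[i] ^ B[i] for i in range(4)]   # propagate, eq. (4)

    # Every carry is expressed in terms of G, P and C0 only, eqs. (5)-(8).
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))

    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (P[i] ^ carries[i]) << i   # sum bit, eq. (9)
    return total, c4

if __name__ == "__main__":
    print(cla4(0b1011, 0b0110))   # 11 + 6 = 17 -> sum 0b0001, carry-out 1
```

No carry expression refers to another carry, which is what removes the ripple from the critical path.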
Carry-Look-Ahead (CLA) Adder - 3
[Figure: 4-bit adder.]
Carry-Look-Ahead (CLA) Adder - 4
- 16-bit adder (Group)
  - $PG = P_3 \cdot P_2 \cdot P_1 \cdot P_0$ (10)
  - $GG = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0$ (11)
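The group terms let four 4-bit blocks be chained into a 16-bit adder. The sketch below is a simplified behavioural model: each block's carry-out is taken as GG + PG·Cin, and the sum bits inside a block are computed with ordinary integer addition rather than gate by gate. In hardware the block carries would come from a second-level look-ahead unit in parallel; the loop here is only a functional stand-in. `block_pg` and `cla16` are illustrative names.

```python
def block_pg(a4, b4):
    """Group propagate and generate of one 4-bit block, eqs. (10) and (11)."""
    G = [((a4 >> i) & 1) & ((b4 >> i) & 1) for i in range(4)]
    P = [((a4 >> i) & 1) ^ ((b4 >> i) & 1) for i in range(4)]
    pg = P[3] & P[2] & P[1] & P[0]
    gg = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1]) | (P[3] & P[2] & P[1] & G[0])
    return pg, gg

def cla16(a, b, c0=0):
    """16-bit adder built from four 4-bit blocks; returns (sum[15:0], carry_out)."""
    result, carry = 0, c0
    for k in range(4):
        a4 = (a >> (4 * k)) & 0xF
        b4 = (b >> (4 * k)) & 0xF
        pg, gg = block_pg(a4, b4)
        # Block sum bits (internal look-ahead abstracted as integer addition).
        result |= ((a4 + b4 + carry) & 0xF) << (4 * k)
        # Carry into the next block uses only the group terms.
        carry = gg | (pg & carry)
    return result, carry

if __name__ == "__main__":
    print(cla16(0xFFFF, 0x0001))
```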
ARM Implementation - 6
- ALU functions
  - Adder, address computations for memory transfer, branch calculations, bit-wise logical functions, and so on

fs5  fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    0    1    0    0    A and B
 0    0    1    0    0    0    A and not B
 0    0    1    0    0    1    A xor B
 0    1    1    0    0    1    A plus not B plus carry
 0    1    0    1    1    0    A plus B plus carry
 1    1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    0    A
 0    0    0    0    0    1    A or B
 0    0    0    1    0    1    B
 0    0    1    0    1    0    not B
 0    0    1    1    0    0    zero
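The function-select table can be read as a lookup from the six fs bits to an operation. The sketch below is a behavioural model only, assuming 32-bit operands and treating the code as the binary number fs5..fs0; it does not model the gate-level ALU circuit, and `ALU_FUNCTIONS` and `alu` are names invented for this example.

```python
MASK = 0xFFFFFFFF  # 32-bit results

# Keys are the fs5..fs0 bits from the table, written as binary literals.
ALU_FUNCTIONS = {
    0b000100: lambda a, b, c: a & b,                 # A and B
    0b001000: lambda a, b, c: a & ~b,                # A and not B
    0b001001: lambda a, b, c: a ^ b,                 # A xor B
    0b011001: lambda a, b, c: a + (~b & MASK) + c,   # A plus not B plus carry
    0b010110: lambda a, b, c: a + b + c,             # A plus B plus carry
    0b110110: lambda a, b, c: (~a & MASK) + b + c,   # not A plus B plus carry
    0b000000: lambda a, b, c: a,                     # A
    0b000001: lambda a, b, c: a | b,                 # A or B
    0b000101: lambda a, b, c: b,                     # B
    0b001010: lambda a, b, c: ~b,                    # not B
    0b001100: lambda a, b, c: 0,                     # zero
}

def alu(fs, a, b, carry=0):
    """Return the 32-bit ALU output for function code fs."""
    return ALU_FUNCTIONS[fs](a, b, carry) & MASK

if __name__ == "__main__":
    # With carry = 1, "A plus not B plus carry" is two's-complement A - B.
    print(alu(0b011001, 10, 3, carry=1))
```

Subtraction falls out of the adder rows: inverting one operand and setting the carry-in to 1 yields the two's-complement difference.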
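ARM Implementation - 7
- ALU functions
  - The ARM2 ALU logic for one result bit
[Figure: ARM2 ALU logic for one result bit, with inputs from the NA bus and NB bus, function select lines fs5 to fs0, carry logic with G and P signals, and output to the ALU bus.]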
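ARM Implementation - 8
- ARM6 Carry-Select Adder
  - Computes the sums of various fields of the word for a carry-in of both zero and one
  - The final result is selected by using the correct carry-in bit
[Figure: carry-select adder. Fields a,b[3:0] up to a,b[31:28] feed adders that produce both s and s+1 (+, +1); multiplexers controlled by the carry c select the results that form sum[3:0], sum[7:4], sum[15:8] and sum[31:16].]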
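The select step can be sketched behaviourally as follows. The field boundaries follow the sum[3:0], sum[7:4], sum[15:8] and sum[31:16] labels in the figure; everything else (the names `FIELDS` and `carry_select_add`, and the use of integer addition inside each field) is an assumption made for this illustration.

```python
# (lowest bit, width) of each field, taken from the sum labels in the figure.
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]

def carry_select_add(a, b, cin=0):
    """32-bit carry-select addition; returns (sum[31:0], carry_out)."""
    result, carry = 0, cin
    for low, width in FIELDS:
        mask = (1 << width) - 1
        fa = (a >> low) & mask
        fb = (b >> low) & mask
        s0 = fa + fb        # field sum assuming carry-in = 0
        s1 = fa + fb + 1    # field sum assuming carry-in = 1
        chosen = s1 if carry else s0   # the mux, steered by the real carry-in
        result |= (chosen & mask) << low
        carry = chosen >> width        # carry-out selects the next field's sum
    return result, carry

if __name__ == "__main__":
    print(carry_select_add(0x0000FFFF, 0x00000001))
```

Both candidate sums of a field can be computed before its carry-in is known, so only the multiplexer selection remains on the carry path.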
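ARM Implementation - 9
- ARM6 ALU Organization
[Figure: ARM6 ALU organization. The A and B operand latches feed XOR gates controlled by invert A and invert B; the outputs go to the logic functions block and to the adder (with C in); a result mux selects logic or arithmetic according to the function; zero detect and the adder produce the N, Z, C and V flags.]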
ARM Implementation - 10
- Barrel Shifter
  - The shifter performance is critical
  - Shifter time contributes to the datapath cycle time
[Figure: 4 x 4 cross-bar switch matrix connecting in[0] to in[3] with out[0] to out[3]; the diagonals correspond to no shift, right 1, right 2, right 3, left 1, left 2 and left 3.]
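The cross-bar idea can be modelled by enabling one diagonal of switches at a time. The sketch below is a behavioural illustration, assuming in[0] is the least significant bit and that vacated positions are filled with 0 (a logical shift); `barrel_shift` is a hypothetical helper name.

```python
def barrel_shift(inputs, direction, amount):
    """Shift a list of bits (inputs[0] = least significant) in a single step.

    Enabling one diagonal of the switch matrix connects out[j] to
    in[j + amount] for a right shift or in[j - amount] for a left shift.
    """
    n = len(inputs)
    out = [0] * n
    for j in range(n):
        i = j + amount if direction == "right" else j - amount
        if 0 <= i < n:          # a switch exists on the selected diagonal
            out[j] = inputs[i]
    return out

if __name__ == "__main__":
    bits = [1, 0, 1, 1]         # in[0]..in[3], i.e. the value 0b1101
    print(barrel_shift(bits, "right", 1))
    print(barrel_shift(bits, "left", 1))
```

Every shift amount costs the same single pass through the matrix, which keeps the shifter's contribution to the cycle time independent of the shift distance.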
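ARM Implementation - 10
- The ARM register bank
[Figure: register bank floorplan with the A bus read decoders, B bus read decoders and write decoders above the register cells and the PC; Vdd and Vss rails; connections to the ALU bus, PC bus, INC bus, A bus and B bus.]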
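ARM Implementation - 11
- Control Structures
[Figure: control logic structure. The instruction and coprocessor inputs feed the decode PLA, which works with the cycle count, multiply control and load/store multiple blocks to drive address control, register control, ALU control and shifter control.]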
ARM Coprocessor Interface - 1
- A general-purpose extension of the ARM instruction set through the addition of hardware coprocessors
  - Also supports software emulation of coprocessors through the undefined instruction trap
- Coprocessor Architecture
  - 16 logical coprocessors
  - Each coprocessor can have up to 16 private registers of any reasonable size
  - Load-store architecture
    - Internal operations on registers
    - Load and store from and to the memory
    - Move data to or from an ARM register
- Implementation
  - Board-level coprocessor: slow speed
  - On-chip coprocessor: high clock speed, cache and memory management, etc.
ARM Coprocessor Interface - 2
- ARM7TDMI Coprocessor interface
  - Bus watching
    - Coprocessor is attached to a bus where the ARM instruction stream flows into the ARM
    - Coprocessor copies the instructions into an internal pipeline
  - Handshake between ARM and coprocessor
    - cpi* (from ARM to all coprocessors): Coprocessor instruction
    - cpa (from the coprocessors to ARM): Coprocessor absent
    - cpb (from the coprocessors to ARM): Coprocessor busy
ARM Coprocessor Interface - 3
- Handshake outcomes
  - ARM may decide not to execute it
    - It falls in a branch shadow or fails the condition code test (cpi* high)
  - ARM may decide to execute it (cpi* low), but cpa is high
    - Undefined instruction trap
  - ARM decides to execute it and a coprocessor accepts it, but cannot execute it yet
    - cpa low but cpb high
    - Busy-wait while stalling the instruction stream
    - Enabled interrupt request arrives? Handle it and retry the coprocessor instruction later
  - ARM decides to execute it and a coprocessor accepts it and executes it immediately
    - cpi* low, cpa low, cpb low
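The four outcomes amount to a small decision table over the three handshake signals. The sketch below encodes that table; the function name, the argument name `cpi_n` (standing for the active-low cpi*) and the returned strings are illustrative choices, not taken from ARM documentation.

```python
def handshake_outcome(cpi_n, cpa, cpb):
    """Classify one coprocessor handshake; every argument is 0 (low) or 1 (high)."""
    if cpi_n:
        # cpi* high: ARM is not executing the instruction
        # (branch shadow or failed condition code test).
        return "not executed"
    if cpa:
        # No coprocessor claimed the instruction.
        return "undefined instruction trap"
    if cpb:
        # Accepted, but the coprocessor cannot start yet.
        return "busy-wait"
    return "execute"

if __name__ == "__main__":
    for signals in [(1, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0)]:
        print(signals, handshake_outcome(*signals))
```

The checks are ordered by priority: cpa and cpb only matter once cpi* is low, and cpb only matters once some coprocessor has pulled cpa low.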