Advanced Microarchitecture
Lecture 10: ALUs and Bypass
This Lecture: Execution Datapath
• ALUs
• Scheduler to Execution Unit interface
• Execution unit organization
• Bypass networks
• Clustering
Lecture 10: ALUs and Bypassing
ALUs
• ALU: Arithmetic Logic Units
• FU: Functional Units
• EU: Execution Units
What’s the difference?
[Figure: an Adder next to an “ALU” block. The “ALU” takes Opcode, Operand1, and Operand2, produces Result, and contains Adder, Shift, Logic, Mult, and Div.]
Implementation details, algorithms, etc. of adders, multipliers, dividers not covered in this course
Interfacing ALUs to the Scheduler
• Issue N instructions
• Read N sets of operands, immediates, opcodes, destination tags
• Route to correct functional units
[Figure: pipeline overview showing Fetch & Dispatch, ARF, PRF/ROB, Data-Capture Scheduler, Functional Units, Bypass, and Physical register update.]
Data-Capture Payload RAM
[Figure: Select Logic passes select decisions, port bindings, etc. to the Payload RAM, which delivers opcode, ValL, and ValR on Issue Ports 0-3 to Issue Lanes 0-3.]
Effectively one nasty crossbar
“Register File” Organization
[Figure: the Payload RAM drawn as a Register File. Select 0-3 drive the read port inputs (“R1”, “R7”, “R3”, “R4”), and the read port outputs val(R1), val(R7), val(R3), val(R4) go to Issue 0-3.]
Each RF read port input has a 1-to-1 correspondence with one and only one RF read port output
No MUXing of outputs is required
“Register File” Is An Overkill
[Figure: four Select blocks beside the RS entries and the Payload RAM, labeled SRAM Row Decoders.]
But how do you assign which set of data gets routed to which set of read port outputs?
Execution Lane ↔ Select Binding
[Figure: four Select blocks beside the RS entries and the Payload RAM.]
Payload RAM read port outputs are in the same order as the Select Blocks
Single Entry Close-Up
[Figure: a single RS entry (Opcode, Src L, Src R) sends bid 0-3 to Select Ports 0-3 and receives grant 0-3. Each grant enables a Tri-State Driver onto output buses connected to all payload RAM entries.]
One RS entry can only bid on one select port, so its payload is never driven to more than one port
Each select port only gives the grant to a single RS entry, so more than one payload entry can never drive the same payload output port
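The bid/grant discipline on this slide can be sketched in a few lines. This is a minimal software model, not the circuit: the grant policy (lowest RS entry index wins) and the names `select`, `drive_payload`, and `bids` are assumptions made for illustration only.

```python
def select(bids, num_ports=4):
    """bids maps RS entry index -> the single select port it bids on.

    Returns a dict port -> granted RS entry. Each port grants at most
    one entry (assumed policy: lowest entry index wins).
    """
    grants = {}
    for entry in sorted(bids):
        port = bids[entry]
        if not 0 <= port < num_ports:
            raise ValueError(f"entry {entry} bids on invalid port {port}")
        # A port that has already granted ignores later bidders.
        grants.setdefault(port, entry)
    return grants


def drive_payload(grants, payload):
    """Model the tri-state drivers: only granted entries drive a bus."""
    return {port: payload[entry] for port, entry in grants.items()}


bids = {0: 2, 1: 0, 2: 2, 3: 1}   # entries 0 and 2 both bid on port 2
payload = {0: "add", 1: "shift", 2: "mul", 3: "load"}
grants = select(bids)
buses = drive_payload(grants, payload)
print(grants, buses)
```

Because `bids` holds one port per entry and `grants` holds one entry per port, no payload is driven twice and no bus has two drivers, which is the property the slide relies on.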
Need to “Swizzle” at the End
[Figure: the payload is laid out as an Opcode Silo, a Src L Silo, and a Src R Silo.]
Nasty tangle of wires (Src’s are 64-128 bits each!)
Non-Data-Capture Scheduler
[Figure: four Select blocks, RS entries, and Payload RAM. The payload supplies Src L tags and Src R tags to the Register File Row Decoders, which read the Register File SRAM Array.]
Immediate Values
• Data-capture can store immediate values in payload bay
• Non-DC needs separate storage
  – Could add extra field to payload
  – Could allocate a physical register and store the immediate there
  – Could store in a separate “immediate file”
Distributed Scheduler
• Grant/Payload read lines may have to travel further horizontally (multiple RS widths)
• Schedule→Execute latency less critical than Schedule→Schedule (wakeup-select) loop latency
[Figure: Select 0-3 and the Payload RAM feed distributed FUs: FAdd, FM/D, ALU1, ALU2, M/D, Store, Shift, Load, FP-Ld, FP-St.]
Naive ALU Organization
• Besides making scheduling hard to scale, allowing any issue slot to feed any ALU makes operand routing a horrible mess (needs full crossbar)
[Figure: add, shift, mult, div, load, store, Fadd, FMul, FDiv all fed From Payload/RF Read Ports.]
Execution-Port-Based Layout
• Just need to fan-out data to FUs within the same execution lane; no crossbar needed
• Each FU needs a “valid” input to know that the incoming data is meant for it and not another FU in the same lane
  – Or just let them all compute in parallel and use only the output that you want → wasted power
[Figure: FUs add, add, shift, mult, div, store, load, FP ld, FPCvt grouped into Lane 0, Lane 1, Lane 2, Lane 3.]
Bypass Network Organization
[Figure: add, shift, mult, div fed From Payload RAM/Register File. There are N × 2 sets of inputs, and the bypass buses carry f × 64 bits.]
N = Issue Width, f = Num FUs
O(f²N) area just for the bypass wiring!!!
… which is cubic since f = Ω(N)
Previous slide had f=9 FUs, and that didn’t even include all of the FP units
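To see why the wiring area ends up cubic, it can help to plug in numbers. The sketch below uses the slide's f²N form with all constant factors dropped. The helper names and the assumption of a fixed number of FUs per issue lane (`fus_per_lane`) are illustrative, not from the lecture.

```python
def bypass_wiring_area(f, n):
    """Relative bypass wiring area, using the slide's O(f^2 * N) form.

    Constant factors (bit width, wire pitch) are dropped.
    """
    return f * f * n


def area_with_proportional_fus(n, fus_per_lane=2):
    """Assume f grows linearly with N (f = fus_per_lane * N)."""
    return bypass_wiring_area(fus_per_lane * n, n)


for n in (2, 4, 8):
    print(n, area_with_proportional_fus(n))
```

Under that assumption, each doubling of the issue width multiplies the relative area by eight.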
ALU Stacks
[Figure: FUs (add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv) arranged in ALU stacks fed From Payload/RF, with Integer Bypass, Floating Point Bypass, and Bypass FU Fan-Out.]
Bypass MUXes reduced to one pair per ALU stack (as opposed to one per FU)
Bypass Sharing
[Figure: the same ALU stacks fed From Payload/RF, with Integer Bypass, Floating Point Bypass, Bypass FU Fan-Out, and Local FU Output.]
Bypass wiring reduced to one output per execution lane/ALU stack
Bypass Sharing (2)
• If all FU’s in a stack have the same latency, writeback conflicts are impossible
  – because only one instruction can issue to each lane per cycle
• But not all FU’s have the same latency:
[Figure: a lane containing add, shift, and load. Timing for a 2-cycle shift to Lane 1 (S X X E1 E2) and a 1-cycle add to Lane 1 (S X X E) whose last execute cycles coincide.]
Two instructions want to writeback using same bypass path!
Bypass Sharing (3)
• How to resolve this structural hazard?
  – Obvious solution: stall
    • Creates scheduling headaches
  – Treat bypass/WB as another structural resource
    • Separate select logic* for bypass allocation
[Figure: timing for the 2-cycle shift (S X X E1 E2) and the 1-cycle add (S X X E), both to Lane 1, over cycles 0-5. A Writeback Scoreboard indexed by cycle (0-6) records which cycle's To Bypass slot is already taken.]
*Not same as regular select logic, just a table read/write
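A writeback scoreboard of this kind can be modelled as a per-lane set of reserved cycles. The sketch below is a minimal illustration under assumed timing: select (S) takes one cycle, followed by two X stages, so a result is driven onto the bypass in cycle `select_cycle + 3 + latency - 1`. The class and method names are hypothetical.

```python
class WritebackScoreboard:
    """Per-lane table of cycles in which the shared bypass/WB path is taken."""

    def __init__(self):
        self.reserved = set()  # cycles already claimed for writeback

    def try_issue(self, select_cycle, latency, stages_before_exec=3):
        """Reserve the writeback slot, or return None to refuse issue."""
        wb_cycle = select_cycle + stages_before_exec + latency - 1
        if wb_cycle in self.reserved:
            return None              # structural hazard: slot already taken
        self.reserved.add(wb_cycle)  # just a table write, no arbitration
        return wb_cycle


lane1 = WritebackScoreboard()
shift_wb = lane1.try_issue(select_cycle=0, latency=2)  # S X X E1 E2
add_wb = lane1.try_issue(select_cycle=1, latency=1)    # would collide
retry_wb = lane1.try_issue(select_cycle=2, latency=1)  # one cycle later
print(shift_wb, add_wb, retry_wb)
```

The shift reserves cycle 4, the add selected one cycle later is refused because it would also write back in cycle 4, and the same add succeeds when retried a cycle later.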
Bypass Sharing (4)
[Figure: timing over cycles 0-8 for A: 2-cycle shift, B: 1-cycle add, and C: 3-cycle load, all to Lane 1, showing which instruction Select picks each cycle.]
Wasted issue opportunity: B picked by select, but cannot issue due to WB conflict
C could have issued, but is stalled by one cycle
Bypass Critical Path
[Figure: ALU stacks containing add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv, with the longest bypass path highlighted.]
Total wire length is about twice the total width plus twice the total height
Bypass Critical Path (2)
[Figure: ALU stacks containing add, add, shift, mult, div, store, load, FP ld, FPCvt, with the longest bypass path highlighted.]
Each execution lane/ALU stack is self-contained
Longest path only crosses total width once
Bypass Control Problem
• We now have the datapaths to forward values between ALUs/FUs
• How do we orchestrate what goes where and when?
• In particular, how do we set the controls of each of the bypass MUXes on a cycle-by-cycle basis?
Scoreboarding
• For each value produced, make note (in the scoreboard) of where it will be available
• For each source, consult scoreboard to find out how to rendezvous
[Figure: timing over cycles 0-7 for Port 1: ADD P21 = … (S X X E), Port 0: ADD P17 = P21 + P4 (S X X E), and Port 2: MUL P30 = P21 * P17 (S X X E E), with scoreboard entries for the registers involved and the add and mul units they feed.]
Scoreboarding (2)
• Setting bypass controls is easy
  – Read where the value will come from and feed to bypass MUXes in the operand read stage
• May add schedule→execute stages for data-capture scheduler
  – why not for non-data-capture?
[Figure: Payload (src tags) P21 and P4 index the WB Scoreboard, whose outputs set the bypass MUXes in front of the add unit.]
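The scoreboard read and update can be sketched as a small table keyed by physical register tag. This is a simplified model that ignores cycle timing. The class name, the `RF` marker, and the method names are assumptions made for illustration, and the register tags are taken from the example on the previous slide.

```python
RF = "RF"  # marker meaning "read the value from the register file"


class BypassScoreboard:
    """Maps a physical register tag to where its value can be found."""

    def __init__(self):
        self.location = {}  # tag -> lane number, or RF

    def note_producer(self, dest_tag, lane):
        # At schedule: the value will first appear on this lane's bypass.
        self.location[dest_tag] = lane

    def note_writeback(self, dest_tag):
        # Later: the value has been written back to the register file.
        self.location[dest_tag] = RF

    def mux_select(self, src_tag):
        # Unknown tags are assumed to have been written back long ago.
        return self.location.get(src_tag, RF)


sb = BypassScoreboard()
sb.note_producer("P21", lane=1)                        # ADD P21 = ...
controls = [sb.mux_select(t) for t in ("P21", "P4")]   # ADD P17 = P21 + P4
sb.note_producer("P17", lane=0)
sb.note_writeback("P21")          # assume P21 has reached the RF by now
later = [sb.mux_select(t) for t in ("P21", "P17")]     # MUL P30 = P21 * P17
print(controls, later)
```

Each source operand costs one table read and each destination costs two table writes, which is where the port pressure on the scoreboard comes from.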
Scoreboarding (3)
• Updating can be more complicated
• Depends on when SB read occurs w.r.t. operand reading
  – earlier reads cause more disconnect
[Figure: timing for A (S X X E1 E2 E3), B (S X X E), and C (S X X E). The value is bypassed and written back to the RF, and later consumers read the value from the RF. Assume SB read in 1st cycle after schedule.]
A needs to update SB this cycle for C to correctly source its operand
Scoreboarding (4)
• Scoreboard can become a critical timing bottleneck
  – All sources must read from scoreboard
  – All destinations must update scoreboard
    • Once at schedule to indicate bypass location
    • Once later to indicate value has written back to RF
  – ~ 4×N ports for the scoreboard!
• If scoreboard becomes multi-cycle, things can get really crazy
  – need to bypass scoreboard reads/writes like inter-group rename bypassing
CAM-based Bypass
• Extend data-capture concept to bypass network
[Figure: the Register Tag is compared (=) against the Result Tag from each of Lane 0-3. A match asserts Use Lane 0-3 to select that lane's Result Value, otherwise Use PL/RF selects the Register Value from Payload/RF.]
CAM-based Bypass (2)
• Must carry destination tag to execution and broadcast along with result
  – But you have to do this anyway; need the destination tag for RF writeback
• A lot of CAM logic
  – Costs power and area
  – Control is simple: it’s basically control-less
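The tag-match selection of a CAM-based bypass can be sketched as follows. Hardware performs all comparisons in parallel; the loop here is only a sequential model of that. The function name and the `(tag, value)` pair representation are assumptions for illustration.

```python
def cam_bypass(src_tag, rf_value, lanes):
    """Pick an operand value by tag match.

    lanes is a list of (result_tag, result_value) pairs broadcast this
    cycle, one per execution lane; result_tag is None for an idle lane.
    Tags are assumed unique, so at most one lane can match.
    """
    for result_tag, result_value in lanes:
        if result_tag is not None and result_tag == src_tag:
            return result_value      # "Use Lane i"
    return rf_value                  # "Use PL/RF"


lanes = [(None, 0), ("P21", 42), ("P9", 7), (None, 0)]
left = cam_bypass("P21", rf_value=-1, lanes=lanes)
right = cam_bypass("P4", rf_value=5, lanes=lanes)
print(left, right)
```

No separate control state is consulted: the broadcast destination tags alone decide the MUX setting.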
Writeback to Data-Capture
• Looks very similar to bypass CAM
[Figure: in the Payload of the DC Scheduler, the SrcL and SrcR tags of each entry are compared (=) against the results from Exec Lanes 0-3 to capture ValL and ValR.]
PRF Writeback Latency
[Figure: A: ADD P21 = …, B: ADD P17 = P21 + …, C: MUL P30 = P21 × P17 pass through the Bypass Network toward a Physical Register File with 3-cycle write latency.]
Problem: How does C pick up the value of P21?
Multi-Level Bypass
• Bypass network must cover the latency of the writeback operation
  – If WB requires N cycles, then bypass must be able to source all N cycles worth of results
[Figure: a 3-level Bypass between one ALU and the Physical Register File. The results of A, B, and C occupy successive levels and can be selected alongside the value From PL/RF.]
But this is only for one ALU (or ALU stack)
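One way to picture a multi-level bypass is as a short shift register holding the results still in flight to the register file. The sketch below models one ALU stack. The class name, the `(tag, value)` representation, and the example values are assumptions for illustration.

```python
from collections import deque


class MultiLevelBypass:
    """Holds the last `depth` cycles of results from one ALU stack."""

    def __init__(self, depth):
        # levels[0] is the newest result; older ones shift toward the end.
        self.levels = deque([None] * depth, maxlen=depth)

    def cycle(self, result=None):
        """Advance one cycle; result is a (tag, value) pair or None.

        Returns the entry leaving the last level (assumed readable from
        the register file from now on), or None.
        """
        retiring = self.levels[-1]
        self.levels.appendleft(result)
        return retiring

    def lookup(self, tag):
        for entry in self.levels:
            if entry is not None and entry[0] == tag:
                return entry[1]
        return None  # not in flight: read the register file instead


bp = MultiLevelBypass(depth=3)
bp.cycle(("P21", 10))           # A: ADD P21 = ...
bp.cycle(("P17", 15))           # B: ADD P17 = P21 + ...
p21_for_c = bp.lookup("P21")    # C still finds P21 one level down
bp.cycle(("P30", 150))          # C: MUL P30 = P21 * P17
left_network = bp.cycle()       # P21 leaves the bypass after 3 cycles
print(p21_for_c, left_network)
```

With `depth` equal to the writeback latency, a value is always available from either the bypass levels or the register file.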
Superscalar, Multi-Level Bypass
ALU Stack 0 ALU Stack 1 ALU Stack 2 AL
3-cycle PRF WB latency
A Bit More Hierarchical
ALU Stack 0 ALU Stack 1 ALU Stack 2 ALU Stack 3
To Physical Register Writeback
Bypass Network Complexity
• Parameters
  – N = Issue width
  – f = Number of functional units
  – b = bit width of data* (e.g., 32 bits, 64 bits)
  – D = Network depth (RF write latency)
• Metrics
  – Area
  – Latency
  … Both contribute directly to power
*For CAM-based bypass logic, should include tag width as well
Bypass Network Complexity (Area)
• Width
  – 2×(N+D) + 1 inputs at b bits each
  – Replicated N times
  – Total 2N²b + Nb(D+1)
• Height
  – N values at b bits each, times D levels
  – MUXes: O((D-1)×(lg N) + lg(N+D))
  – Assume FUs per ALU stack is constant: f/N = O(1)
  – Total O(NDb)
• Total Area
  – O(N³b²D + N²b²D²)
  – Cubic in N, Quadratic in D and b
[Figure: ALU Stack 0 of N stacks. Levels of N values feed O(lg N) MUXes and a final O(lg(N+D)) MUX with N+D inputs. The 1 value per stack comes from an O(f/N)-to-1 MUX for outputs, of O(lg(f/N)) height.]
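The total-area expression is just the product of the dominant width and height terms. The sketch below multiplies them out numerically, with all constant factors and lower-order terms dropped. The function name and the sample parameter values are assumptions for illustration.

```python
def bypass_area(n, d, b):
    """Dominant terms of width x height for the layout on this slide.

    Width  ~ N * b * (N + D)   (N stacks, each with O(N + D) b-bit inputs)
    Height ~ N * D * b         (N values of b bits, D levels)
    Constant factors and lower-order terms are dropped.
    """
    width = n * b * (n + d)
    height = n * d * b
    return width * height


base = bypass_area(n=4, d=3, b=64)
wider = bypass_area(n=8, d=3, b=64)
print(base, wider, wider / base)
```

The product expands to N³b²D + N²b²D², matching the slide. For these sample values, doubling N increases the area by a factor between 4 and 8, because the N²D² term is still significant when N is close to D.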
Bypass Network Complexity (Delay)
• ALU output to 1st latch
  – O(lg(f/N)) gates for the MUX
  – O(N+D) wire delay horizontally
  – O(f/N + lg(N+D)) wire delay vertically
• Last latch to ALU input
  – O(N+D) wire horizontally
  – O(lg N) gate delay for 1st MUX
  – O(N + lg N) wire delay vertically
  – O(lg(N+D)) gate delay
• Gate Delay (worse of the two)
  – O(lg(N+D)) or O(lg(f/N))
• Wire Length (ditto)
  – O(N + D + f/N)
  – Unbuffered wire has quadratic delay
[Figure: the same ALU Stack 0 bypass layout as in Bypass Network Complexity (Area).]
Bypass Network Complexity*
* Complexity analysis is entirely dependent on the layout assumptions.
For example, hierarchical vs. non-hierarchical bypass organizations lead to different areas, wire lengths and gate delays
When someone says “this circuit’s area scales quadratically with respect to X”, this really means that “this circuit’s area scales quadratically with respect to X assuming a layout style of Z”
ALU Clustering
• The exact distribution of FUs to ALU stacks and/or select binding groups can affect layout
• Already saw how separation of INT and FP stacks reduces unnecessary datapaths
  – Has additional benefits when bits(INT) != bits(FP)
  – Ex. x86 uses 32/64-bit integers, but internally uses 80-bit FP
  – SSE introduces 128-bit packed SIMD values, but normal GPRs are still only 64 bits wide
• Certain instructions do not generate outputs (branches)
• Memory instructions treated differently (outputs go to LSQ), and stores don’t generate a register result
Clustered Microarchitectures
• Bypass network delays scale poorly
• Scheduling delays scale poorly
• RF delays scale poorly
• Partition into smaller control and data domains
Clustered Scheduling
[Figure: four clusters (Execution Cluster 0 is labeled), each with its own RS Entries (Cluster 0-3), Payload0-3, and FUs, joined by a Cross-Cluster Wakeup Interconnection Network.]
Cross-cluster wakeup may take > 1 cycle
Cross-Cluster Wakeup Delay
[Figure: the same dependence graph of instructions A-E drawn three times, with cross-cluster wakeups marked.]
Normally takes 3 cycles (assume all 1-cycle latencies)
2 cluster, round-robin cluster assignment: now it takes 5 cycles
But a different clustering algorithm only needs 3!
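The effect of cluster assignment can be reproduced with a tiny timing model. The slide's dependence graph is not recoverable from the text, so the graph below (chain A→B→C plus chain D→E) is a hypothetical one chosen to give the same 3/5/3 cycle counts. The function name and the one-cycle cross-cluster wakeup penalty are also assumptions.

```python
def finish_cycle(deps, cluster_of, cross_delay=1):
    """Cycle in which the last instruction executes.

    deps maps each instruction to the instructions it depends on, listed
    in program order. All latencies are 1 cycle; a consumer in another
    cluster sees its producer cross_delay cycles late. Issue width is
    assumed not to be a limit.
    """
    exec_cycle = {}
    for inst, producers in deps.items():
        ready = 1
        for p in producers:
            penalty = 0 if cluster_of[p] == cluster_of[inst] else cross_delay
            ready = max(ready, exec_cycle[p] + 1 + penalty)
        exec_cycle[inst] = ready
    return max(exec_cycle.values())


# Hypothetical dependence graph: chain A -> B -> C plus chain D -> E.
deps = {"A": [], "B": ["A"], "C": ["B"], "D": [], "E": ["D"]}

one_cluster = dict.fromkeys(deps, 0)
round_robin = {inst: i % 2 for i, inst in enumerate(deps)}
chain_aware = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}

print(finish_cycle(deps, one_cluster))   # 3
print(finish_cycle(deps, round_robin))   # 5
print(finish_cycle(deps, chain_aware))   # 3
```

Round-robin assignment puts every producer/consumer pair in different clusters, so each dependence edge pays the wakeup penalty. Keeping each chain inside one cluster avoids it entirely.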
Cross-Cluster Bypass
[Figure: four clusters, each with a payload RAM and FUs, connected by a Cross-Cluster Bypass Network.]
Similar delay issues as in the case of scheduling
Values may take > 1 cycle to get from cluster to cluster
Cross-Cluster Bypass (2)
• So do we have to pay X-cluster penalties once at schedule and again at bypass?
[Figure: timing for A (S X X X X E) and B (S X X X X E) in different clusters.]
B schedules 2 cycles after A due to extra cycle of wakeup delay
Penalties are not additive!
This assumes that the Wakeup Delay (Ci→Cj) is equal to the Bypass Delay (Ci→Cj)
If true for all i and j, then bypass and wakeup delays are always overlapped
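A short timing calculation shows why the two penalties overlap rather than add. The sketch assumes the producer is selected in cycle 0 with a 1-cycle latency and a fixed number of pipeline stages between select and execute. The function name and the default of three stages are illustrative assumptions, not the slide's exact pipeline.

```python
def consumer_stall(wakeup_delay, bypass_delay, stages_before_exec=3):
    """Extra cycles the consumer waits at execute for its operand.

    Producer is selected in cycle 0 and has a 1-cycle latency. The
    pipeline depth between select and execute is an assumed parameter.
    """
    producer_exec = stages_before_exec
    # Back-to-back wakeup lets the consumer be selected 1 cycle later;
    # cross-cluster wakeup adds wakeup_delay on top of that.
    consumer_select = 1 + wakeup_delay
    consumer_exec = consumer_select + stages_before_exec
    # The value reaches the consumer's cluster after bypass_delay cycles.
    value_ready = producer_exec + 1 + bypass_delay
    return max(0, value_ready - consumer_exec)


print(consumer_stall(wakeup_delay=1, bypass_delay=1))  # equal delays
print(consumer_stall(wakeup_delay=1, bypass_delay=2))  # bypass is slower
```

When the two delays are equal, the late wakeup already hides the bypass latency and the consumer never waits at execute. A stall appears only when the bypass delay exceeds the wakeup delay.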
Clustered RFs
• Place 1/nth of the physical registers in each cluster
  – How to partition?
  – ARF/PRF: read at dispatch, extra latency may require more levels of bypassing
  – Unified PRF: latency may make sched→exec delay intolerable (replay penalty too expensive), plus all of the bypassing
• Replicate PRF
  – Keep a full copy of the register file in each cluster
  – Reduces per cluster read port requirements
  – Still need to write to all clusters (each cluster needs full set of write ports)
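The port trade-off of a replicated PRF can be counted directly. The sketch below assumes the issue width is split evenly across clusters, two source operands per instruction, and one result per instruction. The function name and these assumptions are illustrative only.

```python
def ports_per_cluster(issue_width, clusters):
    """Register-file ports needed in each cluster with a replicated PRF.

    Assumes issue_width is split evenly across clusters, two source
    operands per instruction, and one result per instruction.
    """
    local_width = issue_width // clusters
    return {
        "read": 2 * local_width,   # only local instructions read this copy
        "write": issue_width,      # every result is written to every copy
    }


monolithic = ports_per_cluster(issue_width=8, clusters=1)
replicated = ports_per_cluster(issue_width=8, clusters=4)
print(monolithic, replicated)
```

Replication divides the read ports per copy by the number of clusters but leaves the write port count unchanged.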