
Advanced Microarchitecture
Lecture 10: ALUs and Bypass

2

This Lecture: Execution Datapath
• ALUs
• Scheduler to Execution Unit interface
• Execution unit organization
• Bypass networks
• Clustering

Lecture 10: ALUs and Bypassing

3

ALUs
• ALU: Arithmetic Logic Units
• FU: Functional Units
• EU: Execution Units


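[Figure: a plain Adder next to an “ALU” block that groups Adder, Shift, Logic, Mult, and Div units, with Opcode, Operand1, and Operand2 inputs and a Result output. What’s the difference?]

Implementation details, algorithms, etc. of adders, multipliers, and dividers are not covered in this course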

4

Interfacing ALUs to the Scheduler
• Issue N instructions
• Read N sets of operands, immediates, opcodes, destination tags
• Route to correct functional units


[Figure: block diagram with Fetch & Dispatch, ARF, PRF/ROB, Data-Capture Scheduler, Functional Units, Bypass, and Physical register update.]

5

Data-Capture Payload RAM

[Figure: Select Logic sends select decisions, port bindings, etc. to the Payload RAM; the opcode, ValL, and ValR of each selected entry are read out on the Issue Ports and routed to Issue Lanes 0 to 3.]

Effectively one nasty crossbar

6

“Register File” Organization

[Figure: the Payload RAM drawn as a Register File. Select 0 to 3 supply the read addresses (“R1”, “R7”, “R3”, “R4”) and the read ports return val(R1), val(R7), val(R3), val(R4) to Issue 0 to 3.]

Each RF read port input has a 1-to-1 correspondence with one and only one RF read port output

No MUXing of outputs is required

7

“Register File” Is Overkill

[Figure: four Select blocks drive SRAM row decoders for the RS entries and the Payload RAM.]

But how do you assign which set of data gets routed to which set of read port outputs?

8

Execution Lane ↔ Select Binding

[Figure: four Select blocks alongside the Payload RAM.]

Payload RAM read port outputs are in the same order as the Select Blocks


9

Single Entry Close-Up

[Figure: a single RS entry (Opcode, Src L, Src R) sends bid 0 to 3 to Select Ports 0 to 3 and receives grant 0 to 3. Each grant enables a tri-state driver onto the output buses, which are connected to all payload RAM entries.]

One RS entry can only bid on one select port, so its payload is never driven to more than one port

Each select port only gives the grant to a single RS entry, so more than one payload entry can never drive the same payload output port
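The two invariants above can be checked with a minimal sketch. The function names, the example payloads, and the lowest-index-first priority are assumptions made for illustration; the slides do not specify the select policy.

```python
def select(bids, num_ports):
    """bids[i] is the select port RS entry i bids on, or None if not ready.

    Returns grants[p] = index of the RS entry granted port p (or None).
    Lowest-index-first priority is an assumption of this sketch.
    """
    grants = [None] * num_ports
    for entry, port in enumerate(bids):
        # Each entry bids on at most one port; each port grants at most once.
        if port is not None and grants[port] is None:
            grants[port] = entry
    return grants


def drive_payload(payload, grants):
    """Model the tri-state buses: only the granted entry drives each port."""
    return [payload[e] if e is not None else None for e in grants]


# Hypothetical RS contents: (opcode, ValL, ValR) per entry.
payload = [("add", 1, 2), ("sub", 3, 4), ("shl", 5, 1),
           ("mul", 2, 2), ("ld", 8, 0), ("or", 6, 7)]
bids = [0, 0, 2, None, 1, 2]   # entries 0 and 1 both want port 0, etc.

grants = select(bids, num_ports=4)
buses = drive_payload(payload, grants)
```

Because an entry appears in at most one grant slot and each slot holds at most one entry, no bus ever sees two drivers, which is why no output MUX is needed.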

10

Need to “Swizzle” at the End

[Figure: separate Opcode, Src L, and Src R silos whose outputs must be regrouped at the end.]

Nasty tangle of wires (Src’s are 64-128 bits each!)

11

Non-Data-Capture Scheduler

[Figure: four Select blocks read out the Src L tags and Src R tags, which feed the Register File row decoders and the Register File SRAM array.]

12

Immediate Values
• Data-capture can store immediate values in the payload bay
• Non-DC needs separate storage
– Could add extra field to payload
– Could allocate a physical register and store the immediate there
– Could store in a separate “immediate file”


13

Distributed Scheduler

[Figure: Select 0 to 3 and the Payload RAM distributed over groups of FUs: FAdd, FM/D, ALU1, ALU2, M/D, Store, Shift, Load, FP-Ld, FP-St.]

• Grant/Payload read lines may have to travel further horizontally (multiple RS widths)
• Schedule→Execute latency is less critical than Schedule→Schedule (wakeup-select) loop latency

14

Naive ALU Organization

• Besides making scheduling hard to scale, allowing any issue slot to use any ALU makes operand routing a horrible mess (needs a full crossbar)

[Figure: add, shift, mult, div, load, store, Fadd, FMul, FDiv units, all fed from the Payload/RF read ports.]

15

Execution-Port-Based Layout

• Just need to fan-out data to FUs within the same execution lane; no crossbar needed
• Each FU needs a “valid” input to know that the incoming data is meant for it and not another FU in the same lane
– Or just let them all compute in parallel and use only the output that you want → wasted power

[Figure: FUs grouped under Lane 0 to Lane 3: add, add, shift, mult, div, store, load, FP ld, FPCvt.]

16

Bypass Network Organization

[Figure: add, shift, mult, div units fed from the Payload RAM/Register File; f × 64 bits of results are bypassed back to N × 2 sets of inputs.]

N = Issue Width, f = Num FUs
O(f²N) area just for the bypass wiring!!!
… which is cubic since f = Ω(N)
Previous slide had f = 9 FUs, and that didn’t even include all of the FP units

17

ALU Stacks

[Figure: FUs arranged in ALU stacks (add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv) fed from Payload/RF, with an Integer Bypass, a Floating Point Bypass, and bypass FU fan-out.]

Bypass MUXes reduced to one pair per ALU stack (as opposed to one per FU)

18

Bypass Sharing

[Figure: the same ALU stacks (add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv) fed from Payload/RF, with Integer Bypass, Floating Point Bypass, bypass FU fan-out, and a local FU output per stack.]

Bypass wiring reduced to one output per execution lane/ALU stack

19

Bypass Sharing (2)
• If all FU’s in a stack have the same latency, writeback conflicts are impossible
– because only one instruction can issue to each lane per cycle
• But not all FU’s have the same latency:

[Timing diagram, Lane 1 (add, shift, load stack): a 2-cycle shift (S X X E1 E2) issues one cycle before a 1-cycle add (S X X E), so both finish executing in the same cycle.]

Two instructions want to writeback using same bypass path!

20

Bypass Sharing (3)
• How to resolve this structural hazard?
– Obvious solution: stall
• Creates scheduling headaches
– Treat bypass/WB as another structural resource
• Separate select logic* for bypass allocation

[Timing diagram, Lane 1, cycles 0 to 5: the 2-cycle shift (S X X E1 E2) and the 1-cycle add (S X X E) both go to the bypass; a Writeback Scoreboard indexed by cycle (0 to 6) marks the conflicting slot (X).]

*Not same as regular select logic, just a table read/write
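A minimal sketch of the writeback scoreboard as "just a table read/write". The class name, the `sched_to_exec=3` parameter (modeling the S X X E timing in the diagrams), and the choice to reserve the bypass in the last execute cycle are assumptions of this sketch, not details given on the slide.

```python
class WritebackScoreboard:
    """Per-lane table: which future cycles already have the bypass/WB slot taken."""

    def __init__(self):
        self.reserved = set()

    def try_issue(self, select_cycle, latency, sched_to_exec=3):
        # An op selected in cycle t with S X X E timing enters execute in
        # cycle t + sched_to_exec and finishes in t + sched_to_exec + latency - 1.
        wb_cycle = select_cycle + sched_to_exec + latency - 1
        if wb_cycle in self.reserved:
            return None              # structural hazard: slot already claimed
        self.reserved.add(wb_cycle)  # table write: claim the bypass slot
        return wb_cycle


lane1 = WritebackScoreboard()
shift_wb = lane1.try_issue(select_cycle=0, latency=2)  # 2-cycle shift
add_wb = lane1.try_issue(select_cycle=1, latency=1)    # 1-cycle add: conflict
retry_wb = lane1.try_issue(select_cycle=2, latency=1)  # add retried next cycle
```

The shift claims cycle 4, the add selected one cycle later would also need cycle 4 and is refused, and the retry a cycle later claims cycle 5.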

21

Bypass Sharing (4)

[Timing diagram, Lane 1, cycles 0 to 8: A is a 2-cycle shift (S X X E1 E2), B is a 1-cycle add (S X X E), C is a 3-cycle load (S X X E1 E2 E3). Select picks B.]

Wasted issue opportunity: B picked by select, but cannot issue due to WB conflict

C could have issued, but is stalled by one cycle

22

Bypass Critical Path

[Figure: the ALU stacks (add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv) with the longest bypass path marked.]

Total wire length is about twice the total width plus twice the total height

23

Bypass Critical Path (2)

[Figure: ALU stacks (add, add, shift, mult, div, store, load, FP ld, FPCvt) with the longest bypass path marked.]

Each execution lane/ALU stack is self-contained

Longest path only crosses total width once

24

Bypass Control Problem
• We now have the datapaths to forward values between ALUs/FUs
• How do we orchestrate what goes where and when?
• In particular, how do we set the controls of each of the bypass MUXes on a cycle-by-cycle basis?


25

Scoreboarding
• For each value produced, make note (in the scoreboard) of where it will be available
• For each source, consult scoreboard to find out how to rendezvous

[Timing diagram, cycles 0 to 7: Port 1: ADD P21 = … (S X X E); Port 0: ADD P17 = P21 + P4 (S X X E); Port 2: MUL P30 = P21 * P17 (S X X E E). Scoreboard entries note where each source will come from, e.g. P21 from port 1, P17 from port 0, P4 from the register file (R).]

mul

26

Scoreboarding (2)
• Setting bypass controls is easy
– Read where the value will come from and feed to bypass MUXes in the operand read stage
• May add schedule→execute stages for data-capture scheduler
– why not for non-data-capture?

[Figure: the payload source tags (P21, P4) index the WB Scoreboard, whose outputs set the bypass MUX controls in front of the add unit.]
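The lookup described above can be sketched as a small table keyed by physical register tag. The class and method names are hypothetical, and cycle-level timing is deliberately ignored here; the sketch only shows what is recorded and what is read back as a MUX select.

```python
RF = "RF"  # marker meaning "read this operand from the register file"


class BypassScoreboard:
    """Maps a physical register tag to where its value can be found."""

    def __init__(self):
        self.location = {}

    def produce(self, dest_tag, port):
        # At schedule: note which execution port will produce the value.
        self.location[dest_tag] = port

    def written_back(self, dest_tag):
        # Later: the value has reached the RF, so stop pointing at the bypass.
        self.location[dest_tag] = RF

    def mux_select(self, src_tag):
        # Operand read stage: unknown tags are assumed to live in the RF.
        return self.location.get(src_tag, RF)


sb = BypassScoreboard()
sb.produce("P21", port=1)                                    # Port 1: ADD P21 = ...
controls_add = (sb.mux_select("P21"), sb.mux_select("P4"))   # ADD P17 = P21 + P4
sb.produce("P17", port=0)
controls_mul = (sb.mux_select("P21"), sb.mux_select("P17"))  # MUL P30 = P21 * P17
sb.written_back("P21")
after_wb = sb.mux_select("P21")
```

The pair of selects per instruction corresponds to the left and right bypass MUX controls.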

27

Scoreboarding (3)
• Updating can be more complicated
• Depends on when SB read occurs w.r.t. operand reading
– earlier reads cause more disconnect

[Timing diagram: instructions A, B, and C with pipelines S X X E1 E2 E3, S X X E, and S X X E; one value is bypassed while it writes back to the RF, a later one is read from the RF. Assume SB read in 1st cycle after schedule.]

A needs to update SB this cycle for C to correctly source its operand

28

Scoreboarding (4)
• Scoreboard can become a critical timing bottleneck
– All sources must read from scoreboard
– All destinations must update scoreboard
• Once at schedule to indicate bypass location
• Once later to indicate value has written back to RF
– ~ 4×N ports for the scoreboard!
• If scoreboard becomes multi-cycle, things can get really crazy
– need to bypass scoreboard reads/writes like inter-group rename bypassing


29

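CAM-based Bypass
• Extend data-capture concept to bypass network

[Figure: an operand’s Register Tag is compared (=) against the Result Tag broadcast by each of Lane 0 to Lane 3. A match asserts “Use Lane 0” … “Use Lane 3” and forwards that lane’s Result Value; otherwise “Use PL/RF” selects the Register Value from Payload/RF.]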

30

CAM-based Bypass (2)
• Must carry destination tag to execution and broadcast along with result
– But you have to do this anyway; need the destination tag for RF writeback
• A lot of CAM logic
– Costs power and area
– Control is simple: it’s basically control-less
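A minimal sketch of the tag match that makes CAM-based bypass "control-less". The function name and data layout are assumptions; in hardware all comparators operate in parallel, whereas this loop checks lanes one at a time.

```python
def cam_bypass(src_tag, rf_value, lane_results):
    """Pick the operand value for one source.

    lane_results[i] is (result_tag, result_value) broadcast by lane i this
    cycle, or None if the lane produced nothing.
    Returns (value, source_name).
    """
    for lane, result in enumerate(lane_results):
        if result is not None and result[0] == src_tag:
            return result[1], f"lane {lane}"   # "Use Lane i"
    return rf_value, "PL/RF"                    # no match: "Use PL/RF"


# Hypothetical cycle: lane 1 broadcasts P21 = 42, lane 3 broadcasts P9 = 7.
lane_results = [None, ("P21", 42), None, ("P9", 7)]
hit = cam_bypass("P21", rf_value=0, lane_results=lane_results)
miss = cam_bypass("P4", rf_value=13, lane_results=lane_results)
```

No scoreboard is consulted: the broadcast destination tags alone decide which input the MUX takes.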


31

Writeback to Data-Capture
• Looks very similar to bypass CAM

[Figure: payload of the DC scheduler. The SrcL and SrcR tags of each entry are compared (=) against the results coming from Exec Lane 0 to Exec Lane 3; a match captures the value into ValL or ValR.]

32

PRF Writeback Latency

A: ADD P21 = …
B: ADD P17 = P21 + …
C: MUL P30 = P21 × P17

[Figure: the results of A and B moving through the Bypass Network toward the Physical Register File (3-cycle write latency).]

Problem: How does C pick up the value of P21?

33

Multi-Level Bypass
• Bypass network must cover the latency of the writeback operation
– If WB requires N cycles, then bypass must be able to source all N cycles worth of results

[Figure: 3-level bypass in front of the Physical Register File; operands come from PL/RF or from any bypass level holding a recent result (A, B, C).]

But this is only for one ALU (or ALU stack)
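One way to picture a multi-level bypass for a single ALU is a shift register whose depth equals the writeback latency. This is a sketch under that assumption; the class name, the `tick`/`read` interface, and the example values are hypothetical.

```python
from collections import deque


class MultiLevelBypass:
    """One ALU's bypass levels: a result stays here until it lands in the PRF."""

    def __init__(self, wb_latency):
        self.levels = deque([None] * wb_latency, maxlen=wb_latency)
        self.prf = {}

    def tick(self, result=None):
        """Advance one cycle; result is (tag, value) produced this cycle, or None."""
        oldest = self.levels[-1]
        if oldest is not None:
            tag, value = oldest
            self.prf[tag] = value        # writeback completes for the oldest level
        self.levels.appendleft(result)   # maxlen drops the oldest level

    def read(self, tag):
        for entry in self.levels:        # youngest level first
            if entry is not None and entry[0] == tag:
                return entry[1]
        return self.prf[tag]


bypass = MultiLevelBypass(wb_latency=3)
bypass.tick(("P21", 7))                  # A: ADD P21 = ...
bypass.tick(("P17", 9))                  # B: ADD P17 = P21 + ...
early_value = bypass.read("P21")         # C reads P21 from the bypass
in_prf_early = "P21" in bypass.prf
bypass.tick()
bypass.tick()
in_prf_late = "P21" in bypass.prf
late_value = bypass.read("P21")          # now served by the PRF
```

With fewer levels than the writeback latency, there would be cycles in which P21 is in neither the bypass nor the PRF, which is the problem posed on the previous slide.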

34

Superscalar, Multi-Level Bypass

[Figure: ALU Stack 0, ALU Stack 1, ALU Stack 2, … each with multi-level bypass; 3-cycle PRF WB latency.]

35

A Bit More Hierarchical


ALU Stack 0 ALU Stack 1 ALU Stack 2 ALU Stack 3

To Physical Register Writeback

36

Bypass Network Complexity
• Parameters
– N = Issue width
– f = Number of functional units
– b = bit width of data* (e.g., 32 bits, 64 bits)
– D = Network depth (RF write latency)
• Metrics
– Area
– Latency
… Both contribute directly to power

*For CAM-based bypass logic, should include tag width as well

37

Bypass Network Complexity (Area)
• Width
– 2×(N+D) + 1 inputs at b bits each
– Replicated N times
– Total 2N²b + Nb(2D+1)
• Height
– N values at b bits each, times D levels
– MUXes: O((D-1)×(lg N) + lg(N+D))
– Assume FUs per ALU stack is constant: f/N = O(1)
– Total O(NDb)
• Total Area
– O(N³b²D + N²b²D²)
– Cubic in N, Quadratic in D and b

[Figure: N ALU stacks; within ALU Stack 0, an O(f/N)-to-1 MUX for outputs (O(lg(f/N)) height) produces 1 value, each bypass level carries N values through an O(lg N) MUX, and the final input MUX has N+D inputs (O(lg(N+D))).]
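A quick numeric check of the scaling, using only the leading wiring terms from the bullets: 2×(N+D)+1 inputs of b bits per stack, N stacks, and a height of N×D×b. The function names are made up for this sketch, MUX heights are ignored, and the units are arbitrary wire tracks rather than real layout area.

```python
def bypass_width(n, d, b):
    """Width in wire tracks: N stacks, each with 2*(N+D)+1 inputs of b bits."""
    inputs_per_stack = 2 * (n + d) + 1
    return n * inputs_per_stack * b


def bypass_height(n, d, b):
    """Height in wire tracks: N values of b bits at each of D levels."""
    return n * d * b


def bypass_area(n, d, b):
    return bypass_width(n, d, b) * bypass_height(n, d, b)


area_4 = bypass_area(4, 3, 64)   # 4-wide, 3-cycle RF write, 64-bit data
area_8 = bypass_area(8, 3, 64)   # same, but 8-wide
growth = area_8 / area_4         # approaches 8x (cubic) as N dominates D
```

Doubling N at these small sizes grows the area by somewhat less than 8×, because the D-dependent terms are still significant; the cubic term takes over as N grows.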

38

Bypass Network Complexity (Delay)
• ALU output to 1st latch
– O(lg(f/N)) gates for the MUX
– O(N+D) wire delay horizontally
– O(f/N + lg(N+D)) wire delay vertically
• Last latch to ALU input
– O(N+D) wire horizontally
– O(lg N) gate delay for 1st MUX
– O(N + lg N) wire delay vertically
– O(lg(N+D)) gate delay
• Gate Delay (worse of the two)
– O(lg(N+D)) or O(lg(f/N))
• Wire Length (ditto)
– O(N + D + f/N)
– Unbuffered wire has quadratic delay

[Figure: same ALU stack diagram as on the area slide: N stacks, O(f/N)-to-1 output MUX, N values per level, N+D inputs at the final MUX.]

39

Bypass Network Complexity*

* Complexity analysis is entirely dependent on the layout assumptions.

For example, hierarchical vs. non-hierarchical bypass organizations lead to different areas, wire lengths and gate delays

When someone says “this circuit’s area scales quadratically with respect to X”, this really means that “this circuit’s area scales quadratically with respect to X assuming a layout style of Z”


40

ALU Clustering
• The exact distribution of FUs to ALU stacks and/or select binding groups can affect layout
• Already saw how separation of INT and FP stacks reduces unnecessary datapaths
– Has additional benefits when bits(INT) != bits(FP)
– Ex. x86 uses 32/64-bit integers, but internally uses 80-bit FP
– SSE3 introduces 128-bit packed SIMD values, but normal GPRs are still only 64 bits wide
• Certain instructions do not generate outputs (branches)
• Memory instructions treated differently (outputs go to LSQ), and stores don’t generate a register result

41

Clustered Microarchitectures
• Bypass network delays scale poorly
• Scheduling delays scale poorly
• RF delays scale poorly
• Partition into smaller control and data domains


42

Clustered Scheduling

[Figure: four clusters, each with its own RS entries, payload RAM (Payload0 to Payload3), and FUs, e.g. RS Entries (Cluster 0) feeding Execution Cluster 0. The clusters are linked by a Cross-Cluster Wakeup Interconnection Network.]

Cross-cluster wakeup may take > 1 cycle

43

Cross-Cluster Wakeup Delay

[Figure: a dependence graph of instructions A, B, C, D, E, shown three times: unclustered, with round-robin cluster assignment, and with a different cluster assignment.]

Normally takes 3 cycles (assume all 1-cycle latencies)

With 2 clusters and round-robin cluster assignment, it now takes 5 cycles

But a different clustering algorithm only needs 3!
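The effect can be reproduced with a small sketch. The dependence graph below (A→B→C and D→E) is a hypothetical stand-in, since the actual graph from the figure is not recoverable here; it is chosen so that the cycle counts match the slide. The function names, the one-cycle cross-cluster penalty, and unlimited issue width per cluster are likewise assumptions.

```python
def schedule(deps, cluster_of, cross_delay=1):
    """Issue cycle of each instruction, all with 1-cycle latency.

    deps maps instruction -> list of producers, listed in program order.
    A consumer in a different cluster than its producer waits cross_delay
    extra cycles for the wakeup.
    """
    cycle = {}
    for inst, srcs in deps.items():
        ready = 0
        for s in srcs:
            t = cycle[s] + 1
            if cluster_of[s] != cluster_of[inst]:
                t += cross_delay
            ready = max(ready, t)
        cycle[inst] = ready
    return cycle


def total_cycles(cycle):
    return max(cycle.values()) + 1


# Hypothetical graph: chain A -> B -> C plus chain D -> E.
deps = {"A": [], "B": ["A"], "C": ["B"], "D": [], "E": ["D"]}

one_cluster = {i: 0 for i in deps}
round_robin = {i: k % 2 for k, i in enumerate(deps)}           # A0 B1 C0 D1 E0
by_chain = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}            # keep chains together

base = total_cycles(schedule(deps, one_cluster))
rr = total_cycles(schedule(deps, round_robin))
smart = total_cycles(schedule(deps, by_chain))
```

Round-robin splits every producer/consumer pair across clusters, so each link on the critical path pays the penalty; keeping each chain in one cluster avoids it entirely.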

44

Cross-Cluster Bypass

[Figure: four clusters, each with a payload RAM and FUs, connected by a Cross-Cluster Bypass Network.]

Delay issues are similar to those for scheduling

Values may take > 1 cycle to get from cluster to cluster

45

Cross-Cluster Bypass (2)
• So do we have to pay X-cluster penalties once at schedule and again at bypass?

[Timing diagram: A and B each flow through S X X X X E.]

B schedules 2 cycles after A due to extra cycle of wakeup delay

Penalties are not additive!

This assumes that the Wakeup Delay (Ci→Cj) is equal to the Bypass Delay (Ci→Cj)

If true for all i and j, then bypass and wakeup delays are always overlapped
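The overlap can be checked with simple arithmetic. This is a sketch with hypothetical names; it assumes a 1-cycle producer, a fixed schedule-to-execute distance (5 for the S X X X X E pipeline above), and that a same-cluster consumer can schedule one cycle after its producer.

```python
def consumer_exec_cycle(producer_sched, sched_to_exec, wakeup_delay, bypass_delay):
    """Earliest cycle in which a dependent instruction can execute."""
    producer_exec = producer_sched + sched_to_exec
    # Constraint 1: the consumer cannot be scheduled until the wakeup arrives.
    consumer_sched = producer_sched + 1 + wakeup_delay
    earliest_by_schedule = consumer_sched + sched_to_exec
    # Constraint 2: the value itself must have crossed the bypass network.
    earliest_by_data = producer_exec + 1 + bypass_delay
    return max(earliest_by_schedule, earliest_by_data)


same_cluster = consumer_exec_cycle(0, 5, wakeup_delay=0, bypass_delay=0)
cross_cluster = consumer_exec_cycle(0, 5, wakeup_delay=1, bypass_delay=1)
penalty = cross_cluster - same_cluster
```

With equal wakeup and bypass delays, both constraints land on the same cycle, so the consumer loses one cycle in total rather than two.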

46

Clustered RFs
• Place 1/nth of the physical registers in each cluster
– How to partition?
– ARF/PRF: read at dispatch, extra latency may require more levels of bypassing
– Unified PRF: latency may make sched→exec delay intolerable (replay penalty too expensive), plus all of the bypassing
• Replicate PRF
– Keep a full copy of the register file in each cluster
– Reduces per cluster read port requirements
– Still need to write to all clusters (each cluster needs full set of write ports)
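A rough port count for the replicated PRF, as a sketch. It assumes two source reads and one result write per instruction and an issue width that divides evenly across clusters; the function name and these numbers are illustrative, not taken from the slides.

```python
def replicated_prf_ports(issue_width, num_clusters):
    """Per-copy port counts when every cluster holds a full PRF copy.

    Assumes 2 source reads and 1 result write per instruction.
    """
    lanes_per_cluster = issue_width // num_clusters
    read_ports = 2 * lanes_per_cluster   # only local lanes read this copy
    write_ports = issue_width            # every result is written to every copy
    return read_ports, write_ports


monolithic = replicated_prf_ports(issue_width=8, num_clusters=1)
four_clusters = replicated_prf_ports(issue_width=8, num_clusters=4)
```

Read ports per copy shrink with the number of clusters, while the write port count stays at the full issue width.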