
Page 1: ICCD’03 1 Distributed Reorder Buffer Schemes for Low Power * *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Oguz Ergin, Dmitry

ICCD’03

1

Distributed Reorder Buffer Schemes for Low Power *

*supported in part by DARPA through the PAC-C program and NSF

Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000

http://www.cs.binghamton.edu/~lowpower

21st International Conference on Computer Design (ICCD’03), October 14th 2003

Page 2:

Outline

– Reorder Buffer (ROB) complexities
– Motivation for the low-complexity ROB
– Low-complexity ROB designs
    – Fully Distributed ROB
    – Retention Latches (RLs) revisited (ICS’02)
    – Combined Scheme
– Results
– Concluding remarks

Page 3:

P6-style Superscalar Datapath

[Figure: pipeline with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the Issue Queue (IQ), instruction issue to the function units FU1, FU2, ... FUm (EX), result/status forwarding buses, the ROB, and the Architectural Register File (ARF).]

Page 4:

PPC 620-style Superscalar Datapath

[Figure: the same pipeline as on Page 3, with Rename Buffers (RB) shown alongside the ROB.]

Page 5:

ROB Port Requirements for a W-way CPU

– Decode/Dispatch: W write ports to set up entries
– Dispatch/Issue: 2W read ports to read the source operands
– Writeback: W write ports to write results
– Commit: W read ports for instruction commitment
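
The port counts on this slide can be tallied with a few lines of Python. This is only an illustrative sketch: the function name `centralized_rob_ports` and the dictionary keys are hypothetical, and the counts are taken directly from the four items listed on the slide.

```python
def centralized_rob_ports(width: int) -> dict:
    """Port counts for a centralized ROB in a W-way machine, as listed on the slide."""
    ports = {
        "dispatch_write": width,    # W write ports to set up entries
        "source_read": 2 * width,   # 2W read ports for source operands
        "writeback_write": width,   # W write ports to write results
        "commit_read": width,       # W read ports for commitment
    }
    # Sum the four port groups before adding the total to the dictionary.
    ports["total"] = sum(ports.values())
    return ports


four_way = centralized_rob_ports(4)
print(four_way)
```

For the 4-way machine simulated later in the deck, this gives 8 write ports and 12 read ports under the slide's accounting.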

Page 6:

What This Work is All About

– ROB complexity reduction is important for reducing power and improving performance
    – ROB dissipates a non-trivial fraction of the total chip power
    – ROB accesses stretch over several cycles
– Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance

Page 7:

Comparison of ROB Bitcells (0.18µ, TSMC)

[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]

– Area Reduction: 71%
– Shorter bit and wordlines

Page 8:

P6-style Superscalar Datapath

[Figure: same datapath as on Page 3.]

Page 9:

Reorder Buffer Distribution

[Figure: the P6-style datapath with the ROB storage split into ROB Components (ROBCs): ROBC 1, ROBC 2, ... ROBC m, one per function unit FU1, FU2, ... FUm. The centralized ROB holds pointers to entries within the ROBCs.]

Page 10:

Impact of Distributing the ROB

– Each ROBC is effectively a small Rename Buffer
    – Smaller read/write access energy
    – Faster access time
– Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs
    – Lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires)
– Fits in naturally with a multi-clustered datapath design

Page 11:

Problems with the earlier Multi-banked RF Schemes

– Port conflicts result in performance penalty
– Interconnection network is more complex

Pages 12-14:

Problems with the earlier Multi-banked RF Schemes, and some good news!

– Port conflicts result in performance penalty
    – Totally avoid write port conflicts
    – Minimize read port conflicts at commitment
    – Totally avoid source read port conflicts
– Interconnection network is more complex
    – Completely remove source read ports

Page 15:

ROBCs Assigned to Each Function Unit

[Figure: a centralized ROB with entries 1 ... n, each holding an (FU_id, offset) pair, next to the distributed ROBCs. FU #1 writes ROBC #1, FU #2 writes ROBC #2, ... FU #m writes ROBC #m. Example ROB entries shown: (1, 1), (m, 1), (2, 1).]

Page 16:

Good News: Write port conflicts are avoided

[Figure: same layout as on Page 15. Each ROBC has 1 write port, used only by its own function unit.]

Pages 17-26:

Round Robin Scheduling at Dispatch Time

[Animation: a centralized ROB with entries 1 ... n, each holding an (FU_id, offset) pair, next to four distributed Int ADD ROBCs (#1 to #4), each drawn with entries 1 and 2. Build steps:]

– An ADD instruction arrives: an entry is reserved in Int ADD ROBC #1 and the ROB records (FU_id 1, offset 1)
– A SUB instruction arrives: an entry is reserved in Int ADD ROBC #2 and the ROB records (FU_id 2, offset 1)
– An AND instruction arrives: an entry is reserved in Int ADD ROBC #3 and the ROB records (FU_id 3, offset 1)
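
The dispatch-time allocation shown in this animation can be sketched in Python. This is a minimal illustration, not the authors' hardware: the class names `ROBC` and `DistributedROB`, the group names, and the policy of trying the next ROBC in the group when one is full are all assumptions made for the example. Only the (FU_id, offset) pointer format and the round-robin order come from the slides.

```python
class ROBC:
    """One ROB component: a small buffer owned by a single function unit."""

    def __init__(self, capacity):
        self.slots = [None] * capacity

    def reserve(self, op):
        """Reserve the first free slot; return its 1-based offset, or None if full."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = op
                return i + 1
        return None


class DistributedROB:
    """Centralized FIFO of (FU_id, offset) pointers plus per-FU ROBCs."""

    def __init__(self, groups):
        # groups maps an FU type (hypothetical names) to a list of (fu_id, capacity).
        self.robcs = {}
        self.group_ids = {}
        self.next_rr = {}
        for fu_type, members in groups.items():
            self.group_ids[fu_type] = [fu_id for fu_id, _ in members]
            self.next_rr[fu_type] = 0
            for fu_id, capacity in members:
                self.robcs[fu_id] = ROBC(capacity)
        self.rob = []  # program-order list of (fu_id, offset, op)

    def dispatch(self, op, fu_type):
        """Pick a ROBC of the right type round-robin; None means dispatch blocks."""
        ids = self.group_ids[fu_type]
        start = self.next_rr[fu_type]
        for k in range(len(ids)):
            idx = (start + k) % len(ids)
            offset = self.robcs[ids[idx]].reserve(op)
            if offset is not None:
                self.next_rr[fu_type] = (idx + 1) % len(ids)
                self.rob.append((ids[idx], offset, op))
                return ids[idx], offset
        return None


rob = DistributedROB({
    "int_add": [(1, 2), (2, 2), (3, 2), (4, 2)],
    "int_muldiv": [(5, 2)],
})
pointers = [rob.dispatch(op, "int_add") for op in ("ADD", "SUB", "AND")]
print(pointers)
```

With the two-entry ROBCs drawn on the slides, the three integer instructions land in ROBCs #1, #2, and #3, which reproduces the pointers shown in the animation.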

Page 27:

Good News: Avoiding Read Port Conflicts

[Figure: the centralized ROB holds ADD (1, 1), SUB (2, 1), and AND (3, 1). Because the three results sit in different ROBCs, each ROBC needs only 1 read port to deliver its value to commitment.]

Pages 28-33:

Round Robin Scheduling at Dispatch Time

[Animation: the centralized ROB already holds ADD (1, 1), SUB (2, 1), and AND (3, 1). There is a single IntMUL/DIV ROBC, #5. Build steps:]

– A MUL instruction arrives: an entry is reserved in IntMUL/DIV ROBC #5 and the ROB records (FU_id 5, offset 1)
– A DIV instruction arrives: a second entry is reserved in the same ROBC #5 and the ROB records (FU_id 5, offset 2)

Page 34:

Read Port Conflicts at Commitment

[Figure: the centralized ROB holds ADD (1, 1), SUB (2, 1), AND (3, 1), MUL (5, 1), and DIV (5, 2). IntMUL/DIV ROBC #5 has 1 read port to commitment.]

CONFLICT: if MUL and DIV want to commit in the same cycle
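
A short Python sketch of how such a commit-time conflict could be detected. The function `commit_cycle`, its arguments, and the rule that in-order commit simply stops at the first conflicting instruction are assumptions for illustration; the slides only state that each ROBC has one read port for commitment and that MUL and DIV conflict when they try to commit together.

```python
def commit_cycle(rob_head, width, read_ports_per_robc=1):
    """Return the (fu_id, offset) pointers that can commit this cycle.

    rob_head: list of (fu_id, offset, done) tuples in program order.
    Commit is in order, so it stops at the first instruction that is not
    finished or whose ROBC has no free read port left this cycle.
    """
    used = {}        # fu_id -> read ports already used this cycle
    committed = []
    for fu_id, offset, done in rob_head[:width]:
        if not done:
            break
        if used.get(fu_id, 0) >= read_ports_per_robc:
            break    # read port conflict on this ROBC
        used[fu_id] = used.get(fu_id, 0) + 1
        committed.append((fu_id, offset))
    return committed


no_conflict = commit_cycle([(1, 1, True), (2, 1, True), (3, 1, True)], width=4)
conflict = commit_cycle([(5, 1, True), (5, 2, True)], width=4)
print(no_conflict, conflict)
```

In the first call the three results come from different ROBCs and all commit together; in the second, DIV has to wait a cycle because MUL already used the single read port of ROBC #5.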

Pages 35-37:

Distributed ROB Design 1: with source read ports

Ports on each ROBC:

– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment
– Dispatch/Issue: 1 read port to read the source operands

Page 38:

Experimental Setup: the AccuPower (DATE’02)

[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats as well as transition counts and context information. VLSI layout data is turned into a SPICE deck, and SPICE yields measures of energy per transition. The energy/power estimator combines both and produces power/energy stats.]

Page 39:

Configuration of the Simulated System

Machine width        4-way
Issue Queue          32 entries
Reorder Buffer       96 entries
Load/Store Queue     32 entries

Simulated the execution of SPEC2000 benchmarks

Pages 40-42:

Peak/Average demands on the number of ROBC entries

ROBC type                  IntADD #1, #2, #3, #4   IntMUL/DIV     FPADD #1, #2, #3, #4   FPMUL/DIV      Load
                           peak    avg.            peak    avg.   peak    avg.           peak    avg.   peak    avg.
SPEC 2000 Integer Average  16.9    4.4             4.1     0.1    1.6     0.04           3.8     0.04   28.6    9.3
SPEC 2000 FP Average       14.2    4.9             3.2     0.8    3.8     0.6            6.7     1.1    23.5    7.5
SPEC 2000 Average          15.7    4.6             3.7     0.4    2.6     0.3            5.0     0.5    26.4    8.5

Number of entries assigned to each ROBC (8_4_4_4_16 configuration):

8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries

Pages 43-44:

Percentage of cycles when dispatch blocks for 8_4_4_4_16

ROBC type                  IntADD #1, #2, #3, #4   IntMUL/DIV   FPADD #1, #2, #3, #4   FPMUL/DIV   Load
SPEC 2000 Integer Average  0.9                     0.1          0                      0           5.2
SPEC 2000 FP Average       1.5                     1.0          0.1                    0.8         1.9
SPEC 2000 Average          1.2                     0.5          0                      0.4         3.8

Number of entries assigned to each ROBC:

8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries

Average IPC drop with 8_4_4_4_16 configuration = 4.8%

Page 45:

Reducing performance penalty: 12_6_4_6_20 Configuration

ROBC type                  IntADD #1, #2, #3, #4   IntMUL/DIV   FPADD #1, #2, #3, #4   FPMUL/DIV   Load
SPEC 2000 Integer Average  0.9                     0.1          0                      0           5.2
SPEC 2000 FP Average       1.5                     1.0          0.1                    0.8         1.9
SPEC 2000 Average          1.2                     0.5          0                      0.4         3.8

Number of entries assigned to each ROBC (12_6_4_6_20 configuration):

12 + 12 + 12 + 12 + 6 + 4 + 4 + 4 + 4 + 6 + 20 = 96 entries
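
The entry totals for the two configurations can be checked with a small helper. The function name `total_entries` is hypothetical; the field order (IntADD, IntMUL/DIV, FPADD, FPMUL/DIV, Load) and the counts of four IntADD and four FPADD ROBCs follow the table columns on these slides.

```python
def total_entries(config: str) -> int:
    """Total ROBC entries for a configuration string such as '8_4_4_4_16'.

    Fields, in order: entries per IntADD ROBC (four of them), IntMUL/DIV,
    entries per FPADD ROBC (four of them), FPMUL/DIV, Load.
    """
    int_add, int_muldiv, fp_add, fp_muldiv, load = (int(x) for x in config.split("_"))
    return 4 * int_add + int_muldiv + 4 * fp_add + fp_muldiv + load


for name in ("8_4_4_4_16", "12_6_4_6_20"):
    print(name, total_entries(name))
```

The larger configuration uses the same total number of entries as the 96-entry centralized ROB of the simulated baseline.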

Page 46:

Performance Results for 12_6_4_6_20 Configuration

[Figure: bar charts of IPC (axis 0 to 3) per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Series: "Base, 2-cycle ROB access and full bypass" and "2 read ports, 12_6_4_6_20".]

Average IPC drop with 12_6_4_6_20 configuration = 2.4%

Page 47:

Distributed ROB Design 1: with source read ports

Ports on each ROBC:

– Writeback: 1 write port to write results
– Dispatch/Issue: 1 read port to read the source operands
– Commit: 1 read port for instruction commitment

Pages 48-49:

Eliminating All Source Read Ports

Ports on each ROBC after the Dispatch/Issue read port is removed:

– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment

Page 50:

Where are the Source Values Coming From?

[Figure: the P6-style datapath with three source-operand paths into the issue queue, labeled 1, 2, and 3.]

Page 51:

Where are the Source Values Coming From?

[Figure: stacked bar chart (0% to 100%) for a 96-entry ROB, 4-way processor, SPEC2K benchmarks.]

– Forwarding: 62%
– ARF: 32%
– ROB: 6%

Page 52:

How Efficiently are the Ports Used?

– Decode/Dispatch: W write ports to set up entries
– Dispatch/Issue: 2W read ports to read the source operands (annotated on the slide: 6%)
– Writeback: W write ports to write results
– Commit: W read ports for instruction commitment

Pages 53-55:

Our Solution: Elimination of Read Ports

[Animation: the P6-style datapath from Page 50 with the source-operand paths labeled 1, 2, and 3. In the final build only paths 1 and 3 remain.]

Page 56:

Distributed Reorder Buffer Scheme

[Figure: the datapath with ROBC 1, ROBC 2, ... ROBC m attached to FU1, FU2, ... FUm. The centralized ROB holds pointers to entries within the ROBCs.]

Pages 57-58:

Elimination of Source Read Ports

[Animation: the same distributed datapath as on Page 56, shown in two builds.]

Page 59:

Completely Eliminating the Source Read Ports on the ROBCs

– The Problem: Issue of instructions that require a value stored in a ROBC will stall
– Solution: Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING
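
Late forwarding can be pictured as a second broadcast of the result, issued when the value commits. The Python sketch below is only a behavioral illustration under stated assumptions: the names `IssueQueueEntry`, `broadcast`, and `commit_with_late_forwarding`, the tag format, and the dictionary used as the ARF are all hypothetical and not taken from the slides.

```python
class IssueQueueEntry:
    """A waiting instruction that captures operands broadcast by tag."""

    def __init__(self, op, src_tags):
        self.op = op
        self.operands = {tag: None for tag in src_tags}

    def capture(self, tag, value):
        # Latch the value only if this entry is still waiting for that tag.
        if tag in self.operands and self.operands[tag] is None:
            self.operands[tag] = value

    def ready(self):
        return all(v is not None for v in self.operands.values())


def broadcast(issue_queue, tag, value):
    """Drive (tag, value) on the forwarding buses; waiting entries capture it."""
    for entry in issue_queue:
        entry.capture(tag, value)


def commit_with_late_forwarding(issue_queue, arf, dest_reg, tag, value):
    """At commit, write the ARF and re-broadcast the value (late forwarding)."""
    arf[dest_reg] = value
    broadcast(issue_queue, tag, value)


# A consumer dispatched after the producer's normal forwarding has missed the
# value. With no source read ports on the ROBC it waits until commit.
arf = {}
consumer = IssueQueueEntry("ADD", src_tags=[(1, 1)])   # waits on ROBC #1, offset 1
issue_queue = [consumer]
was_ready_before = consumer.ready()
commit_with_late_forwarding(issue_queue, arf, dest_reg="r3", tag=(1, 1), value=42)
print(was_ready_before, consumer.ready(), arf)
```

The wait between dispatch and the producer's commit is where an IPC penalty for this simplified design would come from.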

Pages 60-61:

Late Forwarding: Use the Normal Forwarding Buses!

[Animation: the distributed datapath from Page 56. In the second build a "Late Forwarding" path is added, carrying committed values onto the result/status forwarding buses.]

Page 62:

Performance Drop of Simplified ROBC Design

[Figure: bar charts of performance drop % per benchmark for "No ROBC source read ports with Late Forwarding". Integer (axis 0 to 24): bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point (axis 0 to 48): applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Two bars are labeled 37% and 17%.]

Average IPC Drop: 9.6%

Page 63:

IPC Penalty: Source Value Not Accessible within the ROBC

[Figure: timeline "Lifetime of a Result Value". Result generation and forwarding come first, then the value resides within a ROBC, then late forwarding/commitment occurs, after which the value resides within the ARF.]

Page 64:

Improving IPC with No Read Ports

– Cache recently generated values in a set of RETENTION LATCHES (RL)
– Retention Latches are SMALL and FAST
    – Only 8 to 16 latches needed in the set
    – Entire set has 1 or 2 read ports

Pages 65-66:

Adding Retention Latches into the Picture

[Animation: the distributed datapath with the Late Forwarding path. In the second build a block of RETENTION LATCHES is added to the datapath.]

Page 67:

Eliminating All Source Read Ports

Ports on each ROBC:

– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment

Page 68:

Distributed ROB Design 2: with Retention Latches

Ports on each ROBC:

– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment

Plus eight 2-ported FIFO RLs
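
A set of FIFO retention latches behaves like a tiny cache of recent results with oldest-first replacement. The sketch below is a behavioral model only, assuming results are identified by a tag such as the (FU_id, offset) pointer; the class name `RetentionLatches` and its methods are hypothetical, and port limits are not modeled.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small FIFO set of latches holding the most recently produced results."""

    def __init__(self, size=8):
        self.size = size
        self.latches = OrderedDict()   # tag -> value, oldest entry first

    def insert(self, tag, value):
        """Record a newly written-back result, evicting the oldest if full."""
        if tag in self.latches:
            del self.latches[tag]              # re-inserted tag becomes newest
        elif len(self.latches) >= self.size:
            self.latches.popitem(last=False)   # FIFO replacement
        self.latches[tag] = value

    def lookup(self, tag):
        """Return the cached value, or None if it has already been evicted."""
        return self.latches.get(tag)


rls = RetentionLatches(size=8)
for i in range(9):                 # nine results through eight latches
    rls.insert(("robc", i), i * 10)
print(rls.lookup(("robc", 0)), rls.lookup(("robc", 8)))
```

A source operand that misses in the latches would still be supplied later through late forwarding at commit time.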

Page 69:

Performance Results for 12_6_4_6_20 Configuration

[Figure: same IPC charts as on Page 46. Series: "Base, 2-cycle ROB access and full bypass" and "2 read ports, 12_6_4_6_20".]

Average IPC drop with 12_6_4_6_20 configuration = 2.4%

Page 70:

Performance Results for 12_6_4_6_20 Configuration

[Figure: bar charts of IPC (axis 0 to 3) per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Series: "Base, 2-cycle ROB access and full bypass"; "Design 1: 2 read ports, 12_6_4_6_20"; "Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port".]

Average IPC drop with 12_6_4_6_20 configuration = 1.7%

Page 71:

Performance Results for 12_6_4_6_20 Configuration

[Figure: same charts as on Page 70, with the baseline series changed to "Base, 1-cycle ROB access and full bypass".]

Average IPC drop with 12_6_4_6_20 configuration = 3.8%

Page 72: ICCD’03 1 Distributed Reorder Buffer Schemes for Low Power * *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Oguz Ergin, Dmitry

ICCD’03

72

0

10

20

30

40

50

60

Eight 2-ported FIFO latchesDesign 1: 2 read ports, 12_6_4_6_20Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port

Power Results for 12_6_4_6_20 Configuration

0

10

20

30

40

50

60

gap gcc gzip parser perl twolf Int Avg.vortex vpr

applu art mesa mgrid swim wupwise FP Avg.

Pow

er S

avin

gs %

Power savings%: 49% 47%23%


Power Results for 12_6_4_6_20 Configuration
(Compared to Baseline case with 64 entry Rename Buffers)

[Figure: power savings % per benchmark (y-axis 0 to 60). Integer benchmarks: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. FP benchmarks: applu, art, mesa, mgrid, swim, wupwise, FP Avg.]

Legend:
– Eight 2-ported FIFO latches
– Design 1: 2 read ports, 12_6_4_6_20
– Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port

Power savings: 39%, 37%, 20%


Summary of Results

– Low performance degradation:
    1.7% IPC drop on average (compared to 2-cycle ROB)
    3.8% IPC drop on average (compared to 1-cycle ROB)

– ROB power savings:
    as high as 49% (compared to P6-style datapath: 96-entry ROB)
    as high as 39% (compared to Rename Buffer design: 96-entry ROB, 64-entry RB)
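
The percentages above are relative changes against a baseline. A minimal sketch of that arithmetic follows, assuming the drop is taken as (baseline - design) / baseline. The helper name `percent_drop` and the input numbers are hypothetical and are not measurements from the slides. The slides also do not say whether the average is a mean of per-benchmark drops or a drop of the mean IPC.

```python
def percent_drop(baseline: float, design: float) -> float:
    """Relative reduction of `design` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - design) / baseline


# Hypothetical inputs chosen only to illustrate the calculation.
ipc_base, ipc_design = 2.00, 1.96        # IPC of baseline and of a reduced design
power_base, power_design = 100.0, 51.0   # ROB power in arbitrary units

print(f"IPC drop: {percent_drop(ipc_base, ipc_design):.1f}%")
print(f"ROB power savings: {percent_drop(power_base, power_design):.1f}%")
```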


Conclusions

– We introduced a conflict-free distributed Reorder Buffer design

– ROB power savings as high as 49% are realized with only a small (1.7%) performance penalty

– ROB complexity is drastically reduced by
    Distributing the ROB into multiple banks
    Reducing the port requirements to no more than 2 ports for each ROB component


~ Thank You ~


Distributed Reorder Buffer Schemes for Low Power *

*supported in part by DARPA through the PAC-C program and NSF

Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000

http://www.cs.binghamton.edu/~lowpower

21st International Conference on Computer Design (ICCD’03), October 14th 2003


Related Work

– Replicated (Kessler, IEEE Micro) and distributed (Canal et al., HPCA’00 and Farkas et al., MICRO’97) RFs in a clustered organization

– Multiple Register Banks (Cruz et al., ISCA’00 & Balasubramonian et al., MICRO’01)

– Multiple Register Banks with additional pipeline stage to avoid complex arbitration logic (Tseng et al., ISCA’03)

– Multiple Register Banks without write port conflicts (Wallace et al., PACT’96)


ROB Port Requirements for a W-way CPU

Ports on the ROB, by pipeline stage:

– Writeback: W write ports to write results
– Dispatch/Issue: 2W read ports to read the source operands
– Decode/Dispatch: W write ports to set up entries
– Commit: W read ports for instruction commitment


ROB Port Requirements for a W-way CPU

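Ports on the ROB, by pipeline stage, when entry setup and commitment each use a single W-wide port:

– Writeback: W write ports to write results
– Dispatch/Issue: 2W read ports to read the source operands
– Decode/Dispatch: 1 W-wide write port to set up entries
– Commit: 1 W-wide read port for instruction commitment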
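
To make the port counts concrete, here is a small sketch that tallies the ports listed on the two slides above for a given width W. The function name `rob_ports` and the `wide_ports` flag are hypothetical. A W-wide port is counted as one port here even though it moves W entries at once, so the totals only compare port counts and say nothing about area or power.

```python
def rob_ports(width: int, wide_ports: bool = False) -> dict:
    """Tally ROB ports for a W-way machine using the per-stage breakdown above.

    wide_ports=False: W write ports for entry setup, W read ports for commit.
    wide_ports=True:  one W-wide write port and one W-wide read port instead.
    """
    ports = {
        "writeback_write": width,                      # W write ports for results
        "issue_read": 2 * width,                       # 2W read ports for source operands
        "dispatch_write": 1 if wide_ports else width,  # entry setup
        "commit_read": 1 if wide_ports else width,     # instruction commitment
    }
    ports["total"] = sum(ports.values())
    return ports


# Example for a 4-way machine (W = 4 is an illustrative choice).
print(rob_ports(4))                   # total of 20 ports
print(rob_ports(4, wide_ports=True))  # total of 14 ports
```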


Fully Distributed Reorder Buffer Scheme


– A distributed ROB Component (ROBC) is assigned to each Function Unit
    No write port conflicts at the writeback stage and minimal read port conflicts at commitment: negligible performance penalty
    Each ROBC can be tailored to the needs of its FU: no overcommitment of resources, less complexity

– The FIFO structure that maintains pointers to the ROBCs remains centralized
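
The organization described above can be sketched as a toy behavioral model: one ROBC per function unit, plus a centralized FIFO that keeps (FU_id, offset) pointers in program order. This is an illustration under stated assumptions and not the authors' implementation. The class name `DistributedROB`, its method names, and the ROBC sizes are hypothetical. Port limits, read port conflicts at commit, and misprediction recovery are not modeled.

```python
from collections import deque


class DistributedROB:
    """Toy model: per-FU ROBCs plus a centralized FIFO of (fu_id, offset)."""

    def __init__(self, robc_sizes):
        # One ROBC per function unit; each can have its own size.
        self.robcs = [[None] * size for size in robc_sizes]
        self.free = [deque(range(size)) for size in robc_sizes]
        self.fifo = deque()  # centralized, program-order pointers into the ROBCs

    def dispatch(self, fu_id):
        """Allocate an entry in the ROBC of the target FU; None means stall."""
        if not self.free[fu_id]:
            return None
        offset = self.free[fu_id].popleft()
        self.robcs[fu_id][offset] = {"done": False, "value": None}
        self.fifo.append((fu_id, offset))
        return offset

    def writeback(self, fu_id, offset, value):
        """An FU writes only into its own ROBC, so FUs never share a write port."""
        entry = self.robcs[fu_id][offset]
        entry["value"] = value
        entry["done"] = True

    def commit(self, width):
        """Retire up to `width` completed instructions in program order."""
        committed = []
        while self.fifo and len(committed) < width:
            fu_id, offset = self.fifo[0]
            entry = self.robcs[fu_id][offset]
            if not entry["done"]:
                break  # the oldest instruction has not finished yet
            self.fifo.popleft()
            committed.append(entry["value"])
            self.robcs[fu_id][offset] = None
            self.free[fu_id].append(offset)
        return committed


rob = DistributedROB([2, 2])   # two FUs with two ROBC entries each (hypothetical)
a = rob.dispatch(0)            # older instruction, goes to FU 0
b = rob.dispatch(1)            # younger instruction, goes to FU 1
rob.writeback(1, b, "B")       # the younger instruction finishes first
print(rob.commit(4))           # nothing retires: the head is still pending
rob.writeback(0, a, "A")
print(rob.commit(4))           # both retire, in program order
```

The centralized FIFO is what preserves program order in this sketch: results can arrive in any ROBC in any order, but commitment only ever follows the pointer at the head of the FIFO.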


[Figure: Fully Distributed Reorder Buffer Scheme. A centralized ROB with entries 1 to n, each holding an (FU_id, offset) pair that points to an entry in one of the distributed ROBCs (ROBC #1, ROBC #2, ..., ROBC #m).]



Results for the Scheme with Retention Latches

[Figure: power savings % per benchmark (y-axis 0 to 60). Integer benchmarks: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. FP benchmarks: applu, art, mesa, mgrid, swim, wupwise, FP Avg.]

Legend: Centralized ROB, Eight 2-ported FIFO Retention Latches

Power savings: 23%