1 FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template Niket K. Choudhary, Salil V. Wadhavkar, Tanmay

1

FabScalar: Composing Synthesizable RTL Designs of

Arbitrary Cores within a Canonical Superscalar Template

Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel,

Sandeep Navada, Hashem H. Najaf-abadi, Eric Rotenberg

Center for Efficient, Scalable, and Reliable ComputingDepartment of Electrical & Computer Engineering

North Carolina State University

© Niket K. Choudhary 38th Int'l Symp. on Computer Architecture, 2011 2

Single Core to Multiple Cores

o Generic microarchitecture o One-size-fits-all approacho Sub-optimal performance

for individual applications o Power inefficient

o Exciting opportunity to exploit application diversity

o Employ many microarchitecturally diverse core designs

o Higher performance on individual applications

o Power efficient

corecore

core

core

core

core A

core C core D

Core B


App. 1

ILP Characteristics

App. 2

App. 3

Application Diversity

Cor

e A

Core B

Core C

Core D

Heterogeneous Multi-core

ILP Characteristics

ILP Characteristics

Str

uctu

re S

izes

Superscalar Width

Pipeli

ne D

epth

Customize each core to an application,class of application, or class of application behavior


Benefits of Employing Diverse Cores

o R. Kumar et al. Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction (MICRO 2003)

o R. Kumar et al. Core Architecture Optimization for Heterogeneous Chip Multiprocessors (PACT 2006)

o B. C. Lee et al. Efficiency Trends and Limits from Comprehensive Microarchitectural Adaptivity (ASPLOS 2008)

o M. A. Suleman et al. Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures (ASPLOS 2009)

o H. H. Najaf-abadi et al. Core-Selectability in Chip Multiprocessors (PACT 2009)

Prior works have shown significantperformance and power advantages


“Achilles’ Heel” of Employing Diverse Cores

o Designing and verifying a core is expensive

o Designing and verifying many different core types is prohibitively expensive and impractical

Cor

e A

Core B

Core C

Core D

o No prior research in heterogeneous multi-core has addressed this challenge


FabScalar

o Automate the generation of superscalar processors

o Our approach: Frame superscalar processors in a canonical formo All superscalar processors have same set of canonical pipeline stages and

interfaces among them, expressed by a Canonical Superscalar Templateo A Canonical Pipeline Stage Library (CPSL) provides many different designs

for each canonical pipeline stage, that differ in major superscalar dimensions

o Automation is enabled because ofo Invariant interfaces among canonical pipeline stageso Confinement of microarchitectural diversity within the canonical pipeline

stages


Fetch Decode Rename Dispatch Issue Reg. Read Execute Writeback Retire

Canonical Superscalar Template


Canonical Pipeline Stage Library (CPSL)

Microarchitectural diversity is focused along key dimensions:

1) Superscalar Complexity• Superscalar width, i.e., number of superscalar “ways”• Sizes of stage-specific structures for extracting instruction-level parallelism (ILP)

2) Sub-pipelining• Pipeline depth of a canonical stage

3) Stage-specific design choices• E.g., different speculation alternatives, recovery alternatives, etc.

Issue Stage

1-Wide 2-Wide 3-Wideiq-cam:8 iq-cam:32 iq-cam:641-deep 2-deep 3-deep

• issuing policy: oldest-first, critical-first• squash policy: complete or selective


Fetch

Rename

Issue

CoreGenerator

core configuration

CPSL

Decode

Dispatch

Register Read

Execute

Writeback

Retire


Fetch

Rename

Issue

synthesizable RTLof customized core

App. 1


Fetch

Rename

Issue

CoreGenerator

core configuration

CPSL

Decode

Dispatch

Register Read

Execute

Writeback

Retire


Fetch

Rename

Issue

App. 2

synthesizable RTLof customized core


Addressing Design-Effort Problem

o FabScalar boosts designer productivity by generating RTL designs of whole cores

o Quality RTL is an essential starting point of chip design cycleo Starting point for design tuning, verification, and physical design

o Highly-ported RAMs and CAMs are pervasive in superscalarso FabMem: generates layouts of highly-ported RAMs and CAMso See paper for more about FabMem


Outline

o Quality assessment of FabScalar-generated coreso Functional and IPC validationo Timing validationo Suitability for standard ASIC and FPGA flows

o Extensibility of CPSL

o G21: a workload-agnostic heterogeneous multi-core

o Future work and conclusion


Validation Results

o Evaluate the quality of the register-transfer-level (RTL) designs produced by FabScalar along three fronts.o Functional and IPC validationo Timing validationo Suitability for physical design

EDA tool(s)/ Library used

Functional verification Cadence NC-Verilog, vers. 06.20-s006

Logic synthesis Synopsys Design Compiler, vers. X-2005.09-SP3

Place & route Cadence SoC Encounter, vers. 7.1

Standard cell library FreePDK 45nm

SPICE model BSIM4 PredictiveTechnology Model


Functional & IPC Validationo Unit testing on isolated canonical pipeline stages of different widths/depthso Generate multiple arbitrary cores and run CPU2000 benchmarks on them

Core-1 Core-2 Core-3 Core-4 Core-5 Core-6 Core-7 Core-8 Core-9 Core-10 Core-11 Core-12Fetch/Decode/Rename/Dispatch width

4 4 5 6 8 2 4 4 6 6 4 4

Issue/RR/Execute/WB/Retire width

4 6 5 6 8 4 4 4 6 6 4 6

function unit mix (simple, complex, branch, load/store)

1,1,1,1 3,1,1,1 2,1,1,1 3,1,1,1 5,1,1,1 1,1,1,1 1,1,1,1 1,1,1,1 3,1,1,1 3,1,1,1 1,1,1,1 3,1,1,1

fetch queue 16 16 32 32 64 8 16 16 32 32 16 16active list (ROB) 128 128 128 256 512 64 128 128 256 256 128 128physical register file (PRF) 96 128 128 192 512 64 96 96 192 192 96 128issue queue (IQ) 32 32 32 64 128 16 16 32 64 64 32 32load queue / store queue (LQ/SQ)

32 / 32 32 / 32 32 / 32 32 / 32 32 / 32 16 / 16 32 / 32 32 / 32 32 / 32 32 / 32 32 / 32 32 / 32

branch predictor bimodalbimodal with block-ahead

gshare

branch history table (BHT) (# entries)

64K 64K 64K 64K 64K 64K 64K 64K 64K 64K 64K 64K

branch target buffer (BTB) (# entries)

4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K

return address stack (RAS) 16 16 16 32 64 8 16 16 32 32 16 16branch order buffer (BOB) 16 16 32 32 32 8 16 16 32 32 16 16Fetch depth 2 2 2 2 2 2 2 2 3 3 2 2Rename depth 2 2 2 2 2 2 2 2 2 2 2 2Issue depth: total / wakeup-select loop

2 / 2 2 / 2 2 / 2 2 / 2 2 / 2 1 / 1 1 / 1 3 / 2 2 / 2 3 / 2 2 / 2 2 / 2

Register Read (and Writeback) depth

1 1 1 1 1 1 1 4 2 4 1 1

fetch-to-execute pipeline depth

10 10 10 10 10 9 9 14 12 15 10 10


Functional & IPC Validationo Unit testing on isolated canonical pipe-stage of different widths/depthso Generate multiple arbitrary cores and run CPU2000 benchmarks on them


4 4 5 6 8 2 4 4 6 6 4 4


4 6 5 6 8 4 4 4 6 6 4 6


1,1,1,1 3,1,1,1 2,1,1,1 3,1,1,1 5,1,1,1 1,1,1,1 1,1,1,1 1,1,1,1 3,1,1,1 3,1,1,1 1,1,1,1 3,1,1,1


32 / 32 32 / 32 32 / 32 32 / 32 32 / 32 16 / 16 32 / 32 32 / 32 32 / 32 32 / 32 32 / 32 32 / 32


gshare


64K 64K 64K 64K 64K 64K 64K 64K 64K 64K 64K 64K


4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K


2 / 2 2 / 2 2 / 2 2 / 2 2 / 2 1 / 1 1 / 1 3 / 2 2 / 2 3 / 2 2 / 2 2 / 2


1 1 1 1 1 1 1 4 2 4 1 1


10 10 10 10 10 9 9 14 12 15 10 10


Functional & IPC Validationo Unit testing on isolated canonical pipe-stage of different widths/depthso Generate multiple arbitrary cores and run CPU2000 benchmarks on them


4 4 5 6 8 2 4 4 6 6 4 4


4 6 5 6 8 4 4 4 6 6 4 6


1,1,1,1 3,1,1,1 2,1,1,1 3,1,1,1 5,1,1,1 1,1,1,1 1,1,1,1 1,1,1,1 3,1,1,1 3,1,1,1 1,1,1,1 3,1,1,1


32 / 32 32 / 32 32 / 32 32 / 32 32 / 32 16 / 16 32 / 32 32 / 32 32 / 32 32 / 32 32 / 32 32 / 32


gshare


64K 64K 64K 64K 64K 64K 64K 64K 64K 64K 64K 64K


4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K


2 / 2 2 / 2 2 / 2 2 / 2 2 / 2 1 / 1 1 / 1 3 / 2 2 / 2 3 / 2 2 / 2 2 / 2


1 1 1 1 1 1 1 4 2 4 1 1


10 10 10 10 10 9 9 14 12 15 10 10


Functional & IPC Validationo RTL successfully simulates 100M instr. SimPoints from different benchmarkso IPC from RTL closely tracks with IPC from C++ simulatoro IPC differences among cores correlate with microarchitecture differences


Timing Validationo Compare cycle times and raw fetch-to-execute delays of FabScalar-generated

cores with three different commercial processors:o 90nm POWER5o 180nm Alpha 21364o 65nm MIPS32 74K

o All processors implement RISC ISAso Represent extremes from highly custom designed to fully synthesized o Convert all delays to fanout-of-4 (FO4)

POWER5 Pipeline stages [B. Sinharoy et al., IBM Journal of R&D, 2005]

Integer Pipeline: fetch-to-execute


Timing Validation

Power5 Alpha-21364 MIPS 74K0

50

100

150

200

250

300

350

400

450

Total Exe Latency (FO4)

FabScalar Exe Latency (FO4)

FabScalar Raw Exe La-tency (FO4)

Power5 Alpha-21364

MIPS 74K

Fetch Width 8 4 4Dispatch Width 5 4 2

Issue Width 8 6 1Fetch Queue 24 24 12

Issue Queue(s)Int+Ld/St: 36, FP: 24, Br.: 12,

CR: 10

Int:20, FP:15

Int:8, Agen:8

Physical Reg ister File(s) Int:120, FP:120 Int:80, FP:72

64

Load Queue / Store Queue 32 / 32 32 / 32 8 / 8L1 I$ / L1 D$ (KB) 64 / 32 64 / 64 32 / 32

fetch-to-execute pipeline depth 12 6 12Cycle Time of commercial core 23 FO4 25 FO4 33 FO4Cycle Time of FabScalar core 29 FO4 37 FO4 32 FO4

Cycle Time of deeper FabScalar core

25 FO4 (depth=15)

26 FO4 (depth=11)

N/A

raw fetch-to-execute delay of FabScalar core

291 FO4 188 FO4 384 FO4

Cycle Time of FabScalar core with ideal latch-based design

24 FO4 32 FO4 N/A

with ideal latch-based design

17%

14%


Physical Design Validationo Synthesized and place-and-routed a 4-way superscalar processoro Also synthesized the same core to a Virtex-5 FPGA in a BEE3 system

ASIC Flow FPGA Flow

bzip gzip mcf parserFPGA & verilog retired instr. 10M 10M 10M 10MFPGA & verilog cycles 1.13M 1.7M 0.84M 1.16MFPGA & verilog IPC 0.89 0.59 1.20 0.86FPGA simulation time (s) 0.75 1.22 1.21 0.87verilog simulation time (s) 4,018 5,536 2,870 3,748simulation speedup 5,357 4,538 2,372 4,308FPGA speed (MHz) 50 50 50 50FPGA effective speed (MHz) 15 14 7 13

Technology 45nm

Die Area (excluding L1 caches)

2.6 mm2

Clock frequency 500MHz

Timing-critical path Next-PC logic


Extensibility

o Extensibility of CPSL is important for proliferating microarchitectural diversity

o Two examples:o LMP: Load Misspeculation Predictor (fix IPC bottleneck)o DEAP: Decoupled Effective Ahead Pipelining for conditional branch

predictors (fix cycle-time)

Design Canonical Pipestages Modified

Signals Added to Interface

Implementation Effort

LMP Dispatch, Execute (LSU), Retire

Dispatch/LSU, Retire/Dispatch

2days-1author

DEAP Fetch - 14days-1author


Providing Diverse Cores

o FabScalar framework provides a design space of almost 38,000 different coreso Fit the most complex configuration for a given clock-period and pipeline

depth (to maximize single-thread performance)

o Prior approacheso Customize a core to a specific applicationo Co-customize multiple cores to a specific multiprogrammed workload

o Customizing cores to specific workloads has two drawbacks:o Computationally intensive design-space explorationo Not robust for general performance (or for an arbitrary workload)

With FabScalar, a chip with many different superscalar core types is conceivable


A Workload-Agnostic Heterogeneous Multi-Core

o G21: Generic heterogeneous multi-core o Not trained for specific workloadso 21 core types provide a wide range of microarchitectural diversityo Maximizes single-thread performance for arbitrary instruction-level

behavior

G21 Core Selection

larger structures

higher frequency

cycle time (ns)

superscalar width

2 or 3 0.5 0.6 0.74 or 5 0.6 0.7 0.86 or 7 0.7 0.8 0.9

8 0.8 0.9 1.0


Analysis of G21

o Consider two multi-core designso Best-1:

o Homogeneous multi-core with a single core type o The best harmonic-mean of BIPS across benchmarkso I.e., single core type customized to workload as a whole

o G21: o Proposed heterogeneous multi-coreo Assume ideal benchmark-to-core mapping for G21

o Peak BIPS of a benchmarko Highest BIPS possible for benchmark, considering entire design spaceo I.e., customize core to individual benchmark o Used only as upper-bound on performance


Analysis of G21

0.0

0.2

0.4

0.6

0.8

1.0

bzip

.187

3bz

ip.3

089

bzip

.341

bzip

.927

7cr

afty.

2151

3cr

afty.

1216

6cr

afty.

5647

craft

y.23

4ga

p.32

29ga

p.15

783

gap.

1691

3ga

p.21

010

gcc.

264

gcc.

873

gcc.

473

gcc.

628

gzip

.136

1gz

ip.7

79gz

ip.5

030

mcf

.201

8m

cf.3

091

pars

er.5

201

pars

er.1

9972

pars

er.2

4856

pars

er.1

0758

perl.

18pe

rl.13

09pe

rl.78

1pe

rl.41

5tw

olf.2

482

twol

f.277

55tw

olf.1

5968

twol

f.815

5vo

rtex

.743

2vo

rtex

.902

5vo

rtex

.587

7vp

r.34

63vp

r.13

50vp

r.61

20am

mp.

3783

amm

p.29

45am

mp.

3018

amm

p.28

91am

mp.

2423

appl

u.10

appl

u.12

equa

ke.1

093

equa

ke.8

69eq

uake

.104

6eq

uake

.279

6eq

uake

.463

mgr

id.3

657

mgr

id.2

977

mgr

id.3

283

mgr

id.3

490

swim

.158

3sw

im.1

226

swim

.158

2sw

im.1

312

AVER

AGEBI

PS /

pea

k BI

PS

Best-1 G21

Best-1 is within 10% of peak performance

Worst sub-optimality: 70% of peak performance

Core

cycle time (ns)

fetch/issue width

fetch-to-execute depth

L1 I$ L1 D$ IQ LQ/SQ

Phys. Reg. File

Best-1 0.6 3/4 15 64 64 32 32/32 128

Performance of Best-1 and G21


Analysis of G21

0.0

0.2

0.4

0.6

0.8

1.0

bzip

.187

3bz

ip.3

089

bzip

.341

bzip

.927

7cr

afty.

2151

3cr

afty.

1216

6cr

afty.

5647

craft

y.23

4ga

p.32

29ga

p.15

783

gap.

1691

3ga

p.21

010

gcc.

264

gcc.

873

gcc.

473

gcc.

628

gzip

.136

1gz

ip.7

79gz

ip.5

030

mcf

.201

8m

cf.3

091

pars

er.5

201

pars

er.1

9972

pars

er.2

4856

pars

er.1

0758

perl.

18pe

rl.13

09pe

rl.78

1pe

rl.41

5tw

olf.2

482

twol

f.277

55tw

olf.1

5968

twol

f.815

5vo

rtex

.743

2vo

rtex

.902

5vo

rtex

.587

7vp

r.34

63vp

r.13

50vp

r.61

20am

mp.

3783

amm

p.29

45am

mp.

3018

amm

p.28

91am

mp.

2423

appl

u.10

appl

u.12

equa

ke.1

093

equa

ke.8

69eq

uake

.104

6eq

uake

.279

6eq

uake

.463

mgr

id.3

657

mgr

id.2

977

mgr

id.3

283

mgr

id.3

490

swim

.158

3sw

im.1

226

swim

.158

2sw

im.1

312

AVER

AGEBI

PS /

pea

k BI

PS

Best-1 G21

severe sub-optimality (note: sub-optimality can become more severe for unknown workload)

Core

cycle time (ns)

fetch/issue width



Phys. Reg. File

Best-1 0.6 3/4 15 64 64 32 32/32 128



Analysis of G21

0.0

0.2

0.4

0.6

0.8

1.0

bzip

.187

3bz

ip.3

089

bzip

.341

bzip

.927

7cr

afty.

2151

3cr

afty.

1216

6cr

afty.

5647

craft

y.23

4ga

p.32

29ga

p.15

783

gap.

1691

3ga

p.21

010

gcc.

264

gcc.

873

gcc.

473

gcc.

628

gzip

.136

1gz

ip.7

79gz

ip.5

030

mcf

.201

8m

cf.3

091

pars

er.5

201

pars

er.1

9972

pars

er.2

4856

pars

er.1

0758

perl.

18pe

rl.13

09pe

rl.78

1pe

rl.41

5tw

olf.2

482

twol

f.277

55tw

olf.1

5968

twol

f.815

5vo

rtex

.743

2vo

rtex

.902

5vo

rtex

.587

7vp

r.34

63vp

r.13

50vp

r.61

20am

mp.

3783

amm

p.29

45am

mp.

3018

amm

p.28

91am

mp.

2423

appl

u.10

appl

u.12

equa

ke.1

093

equa

ke.8

69eq

uake

.104

6eq

uake

.279

6eq

uake

.463

mgr

id.3

657

mgr

id.2

977

mgr

id.3

283

mgr

id.3

490

swim

.158

3sw

im.1

226

swim

.158

2sw

im.1

312

AVER

AGEBI

PS /

pea

k BI

PS

Best-1 G21


Core

cycle time (ns)

fetch/issue width



Phys. Reg. File

Best-1 0.6 3/4 15 64 64 32 32/32 128

G21 is within 3% of peak performance

Worst sub-optimality: 88% of peak performance


Analysis of G21

o G21 highlights the merits of workload-agnostic designo Low computational complexityo Robust performance for unknown workloads

o G21 outperforms Best-1 (even though Best-1 is customized to workload) o Diversity not only delivers better efficiency (BIPS/Watt) but also delivers

better raw-performanceo E.g. microarchitecture-loops and instruction-level behavior

o G21 is highly representative of the entire design space w.r.t. single-thread performance o Can be used to distill fewer cores


Future Work

FabScalar

Correct-by-Construction

• Application of formal verification & tools

Expanding CPSL

• Floating-point and multimedia instr. support

• More features …..

Selection of Cores

• G21 is a preliminary study • N-of-G21: factor-in other metrics

e.g. power, area

FabFPGA• Automate the mapping of FabScalar-

generated cores to FPGAs• Accelerate verification• Design space exploration


Conclusion

o Providing microarchitecturally diverse cores has significant benefits but need multiple core designs

o FabScalar addresses the practical issue of designing and verifying multiple cores

o FabScalar is a novel toolset for automatically composing the RTL designs of arbitrary superscalar cores

o FabScalar toolset is available as open-source gateware o http://people.engr.ncsu.edu/ericro/research/fabscalar/pre-release.htm

31

FabScalar: Composing Synthesizable RTL Designs of

Arbitrary Cores within a Canonical Superscalar Template

Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel,

Sandeep Navada, Hashem H. Najaf-abadi, Eric Rotenberg

Center for Efficient, Scalable, and Reliable ComputingDepartment of Electrical & Computer Engineering

North Carolina State University

Documents

1 FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template Niket K. Choudhary, Salil V. Wadhavkar, Tanmay