

AnyCore-1 Chip Details

AnyCore-1: A Comprehensively Adaptive 4-Way Superscalar Processor

Rangeen Basu Roy Chowdhury, Anil K. Kannepalli, Eric Rotenberg

Department of Electrical and Computer Engineering, North Carolina State University

References

AnyCore-1

Hot Chips 28, 21 – 23 August, 2016, Cupertino, California, USA

Funding: NSF grant CCF-1018517 (AnyCore: A Universal Superscalar Core)

AnyCore RTL Design

• A synthesizable, parameterized RTL design of an adaptive superscalar core

• Can generate both adaptive and fixed cores of arbitrary maximum size

• The RTL is paired with a UPF-based power domain description for power gating of lanes and partitions

AnyCore CAD Flow

AnyCore Toolset

[1] R. Basu Roy Chowdhury, A. K. Kannepalli, S. Ku, and E. Rotenberg. “AnyCore: A Synthesizable RTL Model for Exploring and Fabricating Adaptive Superscalar Cores.” ISPASS'16, pp. 214-224, April 2016.

[2] N. K. Choudhary, S. V. Wadhavkar, et al. “FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores Within a Canonical Superscalar Template.” ISCA-38, pp. 11-22, June 2011.

Parameter             Max Size   Legal Configs
Fetch Width           4          1, 2, 3, 4
Issue Width           5          3, 4, 5
Issue Queue           64         16, 32, 48, 64
Load/Store Queues     32         16, 32
Phys. Register File   128        64, 96, 128
Reorder Buffer        128        64, 96, 128
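The configuration table can be read as a small design space. The sketch below copies the legal values from the table and enumerates combinations; it assumes the six parameters can be set independently of one another, which the poster does not state, and the parameter names are our own labels.

```python
from itertools import product

# Legal values per parameter, copied from the configuration table.
# The dictionary keys are illustrative names, not identifiers from the RTL.
LEGAL_CONFIGS = {
    "fetch_width": (1, 2, 3, 4),
    "issue_width": (3, 4, 5),
    "issue_queue": (16, 32, 48, 64),
    "load_store_queues": (16, 32),
    "phys_register_file": (64, 96, 128),
    "reorder_buffer": (64, 96, 128),
}


def is_legal(config):
    """Return True if every parameter is present and set to a legal value."""
    return (config.keys() == LEGAL_CONFIGS.keys()
            and all(config[k] in LEGAL_CONFIGS[k] for k in LEGAL_CONFIGS))


def all_configs():
    """Yield every combination, ASSUMING the parameters are independent."""
    names = list(LEGAL_CONFIGS)
    for values in product(*(LEGAL_CONFIGS[n] for n in names)):
        yield dict(zip(names, values))


smallest = {name: min(values) for name, values in LEGAL_CONFIGS.items()}
biggest = {name: max(values) for name, values in LEGAL_CONFIGS.items()}
num_configs = sum(1 for _ in all_configs())
print(num_configs)  # 4 * 3 * 4 * 2 * 3 * 3 = 864
```

Under that independence assumption there are 864 combinations, which gives a sense of how large a search space a scheduler or exploration study would face.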

Benchmark                        Dynamic Instr. (million)   Avg. IPC   Avg. Power (mW)
Reduce an array to a sum         235.46                     3.05       36.25
Bubble Sort an array             36.92                      0.32       27.50
Prime Number Generator           88.28                      1.52       30.63
Linear Feedback Shift Register   67.11                      1.57       30.63
Sum first N natural numbers      67.11                      4.00       36.25
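Two derived quantities make the benchmark numbers easier to compare. The sketch below computes the cycle count (instructions divided by IPC) and power per unit IPC for each row. The poster does not give a clock frequency, so absolute energy cannot be computed here; at a fixed clock, energy per instruction is proportional to power divided by IPC, which is the only assumption used.

```python
# (dynamic instructions in millions, average IPC, average power in mW),
# copied from the benchmark table.
BENCHMARKS = {
    "Reduce an array to a sum": (235.46, 3.05, 36.25),
    "Bubble Sort an array": (36.92, 0.32, 27.50),
    "Prime Number Generator": (88.28, 1.52, 30.63),
    "Linear Feedback Shift Register": (67.11, 1.57, 30.63),
    "Sum first N natural numbers": (67.11, 4.00, 36.25),
}


def derived_metrics(instr_millions, ipc, power_mw):
    """Return (cycles in millions, power per unit IPC in mW)."""
    cycles_millions = instr_millions / ipc
    # At a fixed clock, energy per instruction is proportional to power / IPC.
    power_per_ipc = power_mw / ipc
    return cycles_millions, power_per_ipc


for name, row in BENCHMARKS.items():
    cycles, ppi = derived_metrics(*row)
    print(f"{name:32s} {cycles:8.2f} Mcycles  {ppi:6.2f} mW per IPC")
```

Read this way, the low-IPC bubble sort draws the least power but has by far the highest power per unit IPC, which is the kind of phase where a smaller configuration is worth considering.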

Power           Configuration        Power (mW)
Leakage Power   Smallest / Biggest   0.01
Idle Power      Smallest             21
Idle Power      Biggest              25

• A comprehensively adaptive out-of-order superscalar core that adapts to the available ILP in programs

• Dynamically changes superscalar width and the sizes of ILP-extracting structures

• Orthogonal to DVFS and can improve energy efficiency further

• To the best of our knowledge, AnyCore-1 is the first adaptive processor chip

Key Results

• Power consumption and IPC scale with configured core size

• Idle power is large (21 to 25 mW) due to the clock tree and fully synthesized caches

• Fabricated in 130nm technology

• Clock gating at the level of lanes and structure partitions

• Input gating of de-configured ports

• CQFP-100 package (79 signal pads, 21 power pads)

Test Infrastructure

• Xilinx ML605 board used as a testbench

• Custom mezzanine card interfaces the chip to the ML605

• L2 controller in the FPGA (currently uses block RAMs)

• FPGA talks to AnyCore-1 through a Management Bus

• A console program running on a PC communicates with the FPGA to load benchmarks, write AnyCore-1 configurations, and read performance counters

Adaptive Core vs. Fixed Cores

• Scheduler adjusts the configuration for each program phase to minimize energy while maintaining close to maximum performance for the phase

• Dynamic reconfiguration at phase boundaries places AnyCore-1 at the Pareto frontier

• AnyCore-1 enables the scheduler to trade performance for lower energy
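The scheduling policy described above can be sketched in a few lines. The code computes the Pareto frontier of (performance, energy) points and then picks the lowest-energy configuration within a tolerance of the best performance. The configuration names, the numbers, and the 5% `slack` are hypothetical placeholders for illustration only; they are not measurements from AnyCore-1, and the poster does not describe the scheduler's actual algorithm.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    Each point is (name, performance, energy); higher performance and
    lower energy are better.
    """
    frontier = []
    for name, perf, energy in points:
        dominated = any(
            p >= perf and e <= energy and (p > perf or e < energy)
            for _, p, e in points
        )
        if not dominated:
            frontier.append((name, perf, energy))
    return frontier


def pick_config(points, slack=0.05):
    """Pick the lowest-energy point within `slack` of the best performance."""
    best_perf = max(perf for _, perf, _ in points)
    candidates = [pt for pt in points if pt[1] >= (1.0 - slack) * best_perf]
    return min(candidates, key=lambda pt: pt[2])


# Hypothetical (IPC, relative energy) values for one program phase.
PHASE = [
    ("small", 1.00, 1.00),
    ("medium", 1.50, 1.30),
    ("large", 1.55, 1.80),
    ("wasteful", 1.20, 1.60),
]
print(pareto_frontier(PHASE))
print(pick_config(PHASE))
```

In this made-up phase, "wasteful" is dominated by "medium", and the policy selects "medium" because it gives up about 3% of the peak IPC for a much lower energy.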

AnyCore-1 Microarchitecture

Experiments with AnyCore-1

• AnyCore-1 successfully runs many different microbenchmarks, with dynamic reconfiguration enabled and disabled

• Currently we are trying to run SPEC 2000 SimPoints on AnyCore-1

Physical Design Information

Technology             IBM 8RF (130nm)
Dimensions             5 mm x 5 mm
Pads (Signal, Power)   100 (79, 21)
Transistors            3.4 million
Cells                  1.5 million
Nets                   7.6 million

Reconfiguring AnyCore-1

• Reconfiguration penalty is ~20 cycles

• Reconfiguration can take up to 150 cycles

• Three ways to trigger reconfiguration in our test setup:

1) By the program, by issuing a store with the configuration to a reserved memory location

2) By the testbench hardware, based on performance-counter-driven bottleneck analysis

3) By the console program, based on scheduling constraints
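For the first trigger, the program has to express a configuration as a value it can store. The poster gives neither the encoding nor the reserved address, so the bit layout below is entirely hypothetical: it packs the index of each legal value into a 2-bit field, using the legal values from the poster's configuration table. It is meant only to show what such a configuration word could look like.

```python
# Legal values per parameter, from the poster's configuration table.
# Field names and field order are illustrative assumptions.
FIELDS = [
    ("fetch_width", (1, 2, 3, 4)),
    ("issue_width", (3, 4, 5)),
    ("issue_queue", (16, 32, 48, 64)),
    ("load_store_queues", (16, 32)),
    ("phys_register_file", (64, 96, 128)),
    ("reorder_buffer", (64, 96, 128)),
]
BITS_PER_FIELD = 2  # hypothetical: no parameter has more than 4 legal values
FIELD_MASK = (1 << BITS_PER_FIELD) - 1


def encode(config):
    """Pack a configuration into an integer (hypothetical format)."""
    word = 0
    for position, (name, legal) in enumerate(FIELDS):
        if config[name] not in legal:
            raise ValueError(f"illegal {name}: {config[name]}")
        word |= legal.index(config[name]) << (position * BITS_PER_FIELD)
    return word


def decode(word):
    """Inverse of encode(); rejects indices that map to no legal value."""
    config = {}
    for position, (name, legal) in enumerate(FIELDS):
        index = (word >> (position * BITS_PER_FIELD)) & FIELD_MASK
        if index >= len(legal):
            raise ValueError(f"bad index {index} for {name}")
        config[name] = legal[index]
    return config


biggest = {name: legal[-1] for name, legal in FIELDS}
word = encode(biggest)
print(hex(word), decode(word) == biggest)
```

Rejecting illegal values on both the encode and decode side mirrors the idea that only the listed configurations are legal; how the real chip validates a written configuration is not described in the poster.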

AnyCore PAT Tool

• A Power-Area-Timing database generation tool that can be coupled with C++-based microarchitecture simulators for broad exploration studies
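One way to picture the coupling between a performance simulator and a Power-Area-Timing database is a lookup keyed by configuration. The sketch below is written in Python for brevity and is not the AnyCore tool's interface: the `PatEntry` fields, the `(fetch width, issue width)` key, and every number are placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatEntry:
    power_mw: float       # average power for this configuration
    area_mm2: float       # core area
    cycle_time_ns: float  # achievable clock period


# Placeholder entries keyed by (fetch width, issue width); NOT measured data.
PAT_DB = {
    (2, 3): PatEntry(power_mw=20.0, area_mm2=4.0, cycle_time_ns=9.0),
    (4, 5): PatEntry(power_mw=30.0, area_mm2=6.0, cycle_time_ns=10.0),
}


def estimate(config, instructions, ipc):
    """Combine a simulator's IPC with PAT data; returns (time_ms, energy_uj)."""
    entry = PAT_DB[config]
    cycles = instructions / ipc
    time_s = cycles * entry.cycle_time_ns * 1e-9
    energy_j = entry.power_mw * 1e-3 * time_s
    return time_s * 1e3, energy_j * 1e6


time_ms, energy_uj = estimate((4, 5), instructions=1_000_000, ipc=2.0)
print(f"{time_ms:.3f} ms, {energy_uj:.3f} uJ")
```

The point of the split is that the simulator supplies only cycle-level behavior (IPC), while cycle time and power come from synthesized data, so a wide core's higher IPC can be weighed against its slower clock and higher power.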

Enables Adaptive Core Research

• Understand circuit-level overheads of adaptivity

• Compare comprehensively adaptive cores with other adaptive architectures such as heterogeneous multicore

• Fabricate adaptive superscalar cores like AnyCore-1

• Automatically synthesize, analyze, and populate the PAT database; easy to use for computer architects who are not familiar with low-level flows

• Uses UPF-based per-domain power analysis to populate the power database

• Industry-standard synthesis and analysis flow to easily quantify circuit-level overheads

• Clock gating and power gating methodology to maximize power savings from disabled resources

[Figure: AnyCore CAD flow. Inputs: RTL, UPF, Constraints, Cell Library (.lib, .db, .v). Tools: Synopsys Design Compiler, Cadence NC Sim, Synopsys PrimeTime. Intermediate files: UPF, SDF, Netlist, VCD. Outputs: Area Report, Timing Report, Power Report.]

[Figure: Reconfiguration flowchart. Steps: Start; New configuration written? (No / Yes); Stall instruction fetch; Drain backend of the pipeline; Is pipeline empty? (No / Yes); Consolidate architectural registers to the first PRF partition; Latch in the new configuration; Unstall instruction fetch and refill pipeline. Timing annotations: 10 – 128 cycles (CPU is doing useful work); 7 cycles; 10 cycles.]
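The cycle counts annotated on the reconfiguration flowchart (10 – 128, 7, and 10 cycles) suggest one way to relate the two figures quoted for reconfiguration. The sketch below assumes the 10 – 128 cycle span is the pipeline drain, during which the CPU still does useful work, and that the 7- and 10-cycle steps are the part in which no useful work is done. That reading is our assumption; the poster does not spell out which steps the annotations belong to.

```python
DRAIN_MIN, DRAIN_MAX = 10, 128  # cycles; the CPU is still doing useful work
FIXED_STEP_CYCLES = (7, 10)     # the two fixed-latency annotations


def reconfiguration_cycles(drain_cycles):
    """Return (total cycles, cycles after the drain) for one reconfiguration."""
    if not DRAIN_MIN <= drain_cycles <= DRAIN_MAX:
        raise ValueError("drain time outside the 10-128 cycle range")
    penalty = sum(FIXED_STEP_CYCLES)
    return drain_cycles + penalty, penalty


print(reconfiguration_cycles(DRAIN_MIN))  # (27, 17)
print(reconfiguration_cycles(DRAIN_MAX))  # (145, 17)
```

Under this reading, the 17 cycles after the drain are consistent with the quoted penalty of ~20 cycles, and the 145-cycle worst case is consistent with "up to 150 cycles".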

[Figure: AnyCore-1 microarchitecture block diagram. Pipeline stages: Fetch1, Fetch2, Decode, Inst Buffer, Rename, Dispatch, Issue Queue, Register Read / Execute / Writeback, Retire. Labeled blocks: Next PC; BTB, BP (4 lanes); Predecode (4 lanes); Decode (4 lanes); Inst RAM; Rename (4 lanes); RMT; Free List; Ctrl Queue; IQ; Select / RSR (5 lanes); PRF; AL; LQ/SQ; Execute lanes Mem, Ctrl, ALU (S+C), ALU (S), ALU (S); Commit Logic; AMT. I-Cache: Hit Logic; Miss Logic, MSHR; Inst/Tag RAM (64 lines, 8 instr. per line). D-Cache: Hit Logic; Miss Logic, MSHR; Data/Tag RAM (128 lines, 16 bytes per line). Partition labels: IQ 16 +16 +16 +16; PRF 64 +32 +32; AL 64 +32 +32; LQ/SQ 16/16 +16/+16. Numbered callouts 1 to 11 and TOP mark regions of the diagram.]

[Figure: AnyCore PAT tool flow. Blocks: RTL, Config Space, UPF, PAT Generation Tool, PAT Database, C++ Performance Simulator, Counters, PAT Estimation Tool. Outputs: power, area, and cycle time estimates; performance estimates (IPC, miss rates, etc.).]