AnyCore-1 Chip Details

AnyCore-1: A Comprehensively Adaptive 4-Way Superscalar Processor

Rangeen Basu Roy Chowdhury, Anil K. Kannepalli, Eric Rotenberg

Department of Electrical and Computer Engineering, North Carolina State University

References

AnyCore-1

Hot Chips 28, 21 – 23 August, 2016, Cupertino, California, USA

Funding: NSF grant CCF-1018517 (AnyCore: A Universal Superscalar Core)

AnyCore RTL Design

• A synthesizable parameterized RTL design of an adaptive superscalar core
• Can generate both adaptive and fixed cores of arbitrary maximum size
• The RTL is paired with a UPF-based power domain description for power gating of lanes and partitions

AnyCore CAD Flow

AnyCore Toolset

[1] R. Basu Roy Chowdhury, A. K. Kannepalli, S. Ku, and E. Rotenberg. “AnyCore: A Synthesizable RTL Model for Exploring and Fabricating Adaptive Superscalar Cores.” ISPASS'16, pp. 214-224, April 2016.
[2] N. K. Choudhary, S. V. Wadhavkar, et al. “FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores Within a Canonical Superscalar Template.” ISCA-38, pp. 11-22, June 2011.

Parameter            Max Size  Legal Configs
Fetch Width          4         1, 2, 3, 4
Issue Width          5         3, 4, 5
Issue Queue          64        16, 32, 48, 64
Load/Store Queues    32        16, 32
Phys. Register File  128       64, 96, 128
Reorder Buffer       128       64, 96, 128
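
As a rough illustration of how large this configuration space is, the sketch below enumerates every combination of the legal values in the table. It assumes that each parameter can be chosen independently of the others, which the poster does not state. The dictionary keys and the helpers `is_legal` and `all_configs` are names made up for this example and are not part of the AnyCore toolset.

```python
from itertools import product

# Legal values per parameter, copied from the table above.
# The key names are hypothetical labels chosen for this sketch.
LEGAL_CONFIGS = {
    "fetch_width": (1, 2, 3, 4),
    "issue_width": (3, 4, 5),
    "issue_queue": (16, 32, 48, 64),
    "load_store_queues": (16, 32),
    "phys_register_file": (64, 96, 128),
    "reorder_buffer": (64, 96, 128),
}


def is_legal(config):
    """Return True if config names every parameter and uses only legal values."""
    return (config.keys() == LEGAL_CONFIGS.keys()
            and all(config[name] in LEGAL_CONFIGS[name] for name in LEGAL_CONFIGS))


def all_configs():
    """Yield every combination, assuming the parameters are independent."""
    names = list(LEGAL_CONFIGS)
    for values in product(*(LEGAL_CONFIGS[name] for name in names)):
        yield dict(zip(names, values))


configs = list(all_configs())
print(len(configs))  # 4 * 3 * 4 * 2 * 3 * 3 = 864
```

If some combinations turn out to be disallowed in the real design, the count above is only an upper bound.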

Benchmark                       Dynamic Instr. (million)  Avg. IPC  Avg. Power (mW)
Reduce an array to a sum        235.46                    3.05      36.25
Bubble Sort an array            36.92                     0.32      27.50
Prime Number Generator          88.28                     1.52      30.63
Linear Feedback Shift Register  67.11                     1.57      30.63
Sum first N natural numbers     67.11                     4.00      36.25
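
One way to read these measurements together is to divide average power by average IPC. Energy per instruction is P / (IPC × f), so at a fixed clock frequency f this ratio is proportional to energy per instruction. The sketch below does that arithmetic on the table values. It assumes all benchmarks ran at the same clock frequency, which the poster does not give, so the result is only a relative figure. The helper name `power_per_ipc` is made up for this example.

```python
# (benchmark, average IPC, average power in mW), copied from the table above.
RESULTS = [
    ("Reduce an array to a sum", 3.05, 36.25),
    ("Bubble Sort an array", 0.32, 27.50),
    ("Prime Number Generator", 1.52, 30.63),
    ("Linear Feedback Shift Register", 1.57, 30.63),
    ("Sum first N natural numbers", 4.00, 36.25),
]


def power_per_ipc(ipc, power_mw):
    """Relative energy per instruction: mW per unit of IPC.

    Assumes a common, fixed clock frequency across benchmarks.
    """
    return power_mw / ipc


# Rank benchmarks from least to most energy per instruction (relative).
ranked = sorted(RESULTS, key=lambda row: power_per_ipc(row[1], row[2]))
for name, ipc, power_mw in ranked:
    print(f"{name:32s} {power_per_ipc(ipc, power_mw):7.2f} mW per unit IPC")
```

Under that assumption, the high-IPC benchmarks spend the least energy per instruction even though they draw the most power, while the low-IPC bubble sort spends the most.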

Power Type     Configuration       Power (mW)
Leakage Power  Smallest / Biggest  0.01
Idle Power     Smallest            21
Idle Power     Biggest             25

• A comprehensively adaptive out-of-order superscalar core which adapts to the available ILP in programs
• Dynamically changes superscalar width and sizes of ILP-extracting structures
• Orthogonal to DVFS and can improve energy efficiency further
• To the best of our knowledge, AnyCore-1 is the first adaptive processor chip

Key Results

• Power consumption and IPC scale with configured core size
• Idle power is large due to clock tree and fully synthesized caches

• Fabricated in 130nm technology

• Clock gating at level of lanes and structure partitions

• Input gating of de-configured ports

• CQFP-100 package (79 signal pads, 21 power pads)

Test Infrastructure

• Xilinx ML605 board used as a testbench
• Custom mezzanine card interfaces the chip to the ML605
• L2 controller in the FPGA (currently uses block RAMs)
• FPGA talks to AnyCore-1 through a Management Bus
• A console program running on a PC communicates with the FPGA to load benchmarks, write AnyCore-1 configurations, and read performance counters

Adaptive Core vs. Fixed Cores

• Scheduler adjusts configuration for each program phase to minimize energy while maintaining close to maximum performance for the phase
• Dynamic reconfiguration at phase boundaries places AnyCore-1 at the Pareto frontier
• AnyCore-1 enables the scheduler to trade performance for lower energy
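
The per-phase policy described above can be sketched as a small selection routine: among the configurations whose IPC is within a chosen slack of the best IPC for the phase, pick the one with the lowest energy per instruction. This is a minimal sketch and not the scheduler used with AnyCore-1. The `Measurement` type, the `pick_config` function, the `slack` parameter, and all numbers in the example phases are hypothetical. Power divided by IPC stands in for energy per instruction, which assumes a fixed clock frequency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical per-phase measurement of one core configuration."""
    config: str
    ipc: float
    power_mw: float


def energy_per_instruction(m):
    # Relative energy per instruction, assuming a fixed clock frequency.
    return m.power_mw / m.ipc


def pick_config(measurements, slack=0.05):
    """Pick the lowest-energy configuration within `slack` of the best IPC."""
    best_ipc = max(m.ipc for m in measurements)
    candidates = [m for m in measurements if m.ipc >= (1.0 - slack) * best_ipc]
    return min(candidates, key=energy_per_instruction)


# Made-up numbers for a low-ILP phase and a high-ILP phase.
low_ilp_phase = [
    Measurement("small", 0.90, 26.0),
    Measurement("medium", 1.00, 30.0),
    Measurement("large", 1.02, 36.0),
]
high_ilp_phase = [
    Measurement("small", 1.50, 27.0),
    Measurement("medium", 2.60, 31.0),
    Measurement("large", 3.90, 36.0),
]

print(pick_config(low_ilp_phase).config)              # medium
print(pick_config(high_ilp_phase).config)             # large
print(pick_config(low_ilp_phase, slack=0.15).config)  # small
```

Widening `slack` is how this sketch expresses trading performance for lower energy: with a 15% slack the low-ILP phase drops to the smallest configuration.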

AnyCore-1 Microarchitecture

Experiments with AnyCore-1

• AnyCore-1 successfully runs many different microbenchmarks, with dynamic reconfiguration enabled and disabled
• Currently we are trying to run SPEC 2000 SimPoints on AnyCore-1

Physical Design Information

Technology            IBM 8RF (130nm)
Dimensions            5 mm x 5 mm
Pads (Signal, Power)  100 (79, 21)
Transistors           3.4 million
Cells                 1.5 million
Nets                  7.6 million

Reconfiguring AnyCore-1

• Reconfiguration penalty is ~20 cycles
• Reconfiguration can take up to 150 cycles
• Three ways to trigger reconfiguration in our test setup:
1) By the program, by issuing a store with the configuration to a reserved memory location
2) By the testbench hardware, based on performance-counter-driven bottleneck analysis
3) By the console program, based on scheduling constraints

AnyCore PAT Tool

• A Power-Area-Timing database generation tool that can be coupled with C++ based microarchitecture simulators for broad exploration studies

Enables Adaptive Core Research

• Understand circuit-level overheads of adaptivity
• Compare comprehensively adaptive cores with other adaptive architectures such as heterogeneous multicore
• Fabricate adaptive superscalar cores like AnyCore-1

• Automatically synthesize, analyze, and populate PAT database – easy to use for computer architects who are not familiar with low-level flows
• Uses UPF-based per-domain power analysis to populate power database

• Industry-standard synthesis and analysis flow to easily quantify circuit-level overheads
• Clock gating and power gating methodology to maximize power savings from disabled resources

[Figure: AnyCore CAD flow. Inputs: RTL, UPF, constraints, cell library (.lib, .db, .v). Tools: Synopsys Design Compiler, Cadence NC Sim, Synopsys PrimeTime. Intermediate files: UPF, SDF, netlist, VCD. Outputs: area report, timing report, power report.]

[Figure: reconfiguration flowchart]

Start
1) New configuration written? If no, keep checking; if yes, continue.
2) Stall instruction fetch
3) Drain backend of the pipeline
4) Is pipeline empty? If no, keep draining; if yes, continue.
5) Consolidate architectural registers to the first PRF partition
6) Latch in the new configuration
7) Unstall instruction fetch and refill pipeline

Timing annotations in the figure: 10 to 128 cycles (CPU is doing useful work), 7 cycles, 10 cycles
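
The cycle annotations in the reconfiguration flowchart (10 to 128, 7, and 10) can be combined into a simple cost model. The sketch below takes one possible reading: the 10 to 128 cycle phase is the pipeline drain, during which the CPU still does useful work, and the 7 and 10 cycle phases are pure overhead. How the 7 and 10 cycle figures map onto individual flowchart steps is not recoverable from the poster text, so the code treats them only as two fixed phases. The names `DRAIN_CYCLES_RANGE`, `FIXED_PHASE_CYCLES`, and `reconfiguration_cycles` are made up for this example.

```python
# Cycle counts copied from the flowchart annotations.
DRAIN_CYCLES_RANGE = (10, 128)  # drain phase; the CPU is still doing useful work
FIXED_PHASE_CYCLES = (7, 10)    # two remaining phases, treated as pure overhead


def reconfiguration_cycles(drain_cycles):
    """Return (total_cycles, penalty_cycles) for one reconfiguration.

    Assumes the drain is useful work and only the fixed phases are penalty.
    """
    low, high = DRAIN_CYCLES_RANGE
    if not low <= drain_cycles <= high:
        raise ValueError(f"drain_cycles must be between {low} and {high}")
    penalty = sum(FIXED_PHASE_CYCLES)
    return drain_cycles + penalty, penalty


best_total, penalty = reconfiguration_cycles(DRAIN_CYCLES_RANGE[0])
worst_total, _ = reconfiguration_cycles(DRAIN_CYCLES_RANGE[1])
print(best_total, worst_total, penalty)  # 27 145 17
```

Under this reading, the 17-cycle overhead is in line with the stated penalty of ~20 cycles, and the 145-cycle worst case is in line with the statement that reconfiguration can take up to 150 cycles.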

[Figure: AnyCore-1 microarchitecture. Pipeline stages: Fetch1, Fetch2, Decode, Inst Buffer, Rename, Dispatch, Issue Queue, Register Read / Execute / Writeback, Retire. Front-end blocks: Next PC; BTB, BP; Predecode; Decode; Rename (four instances each of BTB/BP, Predecode, Decode, and Rename). I-Cache: Inst/Tag RAM (64 lines, 8 instr. per line), Hit Logic, Miss Logic with MSHR. Other labeled blocks: RMT, Ctrl Queue, Free List, IQ, Select / RSR (five instances), PRF, AL, AMT, LQ/SQ, Commit Logic. Execute lanes: Mem, Ctrl, ALU (S+C), ALU (S), ALU (S). D-Cache: Data/Tag RAM (128 lines, 16 bytes per line), Hit Logic, Miss Logic with MSHR. Partition size labels in the figure include 64 +32 +32, 16 +16 +16 +16, and 16/16 +16/+16.]

[Figure: AnyCore toolset. Labeled blocks: RTL, Config Space, UPF, PAT Generation Tool, PAT Database, PAT Estimation Tool, C++ Performance Simulator, Counters. Labeled outputs: power, area, and cycle time estimates; performance estimates (IPC, miss rates, etc.).]