Upload
phungtuyen
View
213
Download
0
Embed Size (px)
Citation preview
AnyCore-1 Chip Details
AnyCore-1: A Comprehensively Adaptive 4-Way Superscalar ProcessorRangeen Basu Roy Chowdhury, Anil K. Kannepalli, Eric Rotenberg
Department of Electrical and Computer Engineering, North Carolina State University
References
AnyCore-1
Hot Chips 28, 21 – 23 August, 2016, Cupertino, California, USA
Funding: NSF grant CCF-1018517 (AnyCore: A Universal Superscalar Core)
AnyCore RTL Design
• A synthesizable parameterized RTL design of
an adaptive superscalar core
• Can generate both adaptive and fixed cores of
arbitrary maximum size
• The RTL is paired with UPF based power
domain description for power gating of lanes
and partitions
AnyCore CAD Flow
AnyCore Toolset
[1] R. Basu Roy Chowdhury, A. K. Kannepalli, S. Ku, and E. Rotenberg. “AnyCore: A
Synthesizable RTL Model for Exploring and Fabricating Adaptive Superscalar Cores.”
ISPASS'16, pp. 214-224, April 2016.
[2] Niket K. Choudhary, Salil V. Wadhavkar, et al. “FabScalar: Composing Synthesizable
RTL Designs of Arbitrary Cores Within a Canonical Superscalar Template.” ISCA-38, pp.
11-22, June 2011.
Parameter Max
Size
Legal
Configs
Fetch Width 4 1, 2, 3, 4
Issue Width 5 3, 4, 5
Issue Queue 64 16, 32, 48, 64
Load/Store Queues 32 16, 32
Phys. Register File 128 64, 96, 128
Reorder Buffer 128 64, 96, 128
Benchmark Dynamic Instr. (million) Avg. IPC Avg. Power (mW)
Reduce an array to a sum 235.46 3.05 36.25
Bubble Sort an array 36.92 0.32 27.50
Prime Number Generator 88.28 1.52 30.63
Linear Feedback Shift Register 67.11 1.57 30.63
Sum first N natural numbers 67.11 4.00 36.25
Power Configuration Power (mW)
Leakage PowerSmallest /
Biggest0.01
Idle Power Smallest 21
Idle Power Biggest 25
• A comprehensively adaptive out-of-order
superscalar core which adapts to the
available ILP in programs
• Dynamically changes superscalar width
and sizes of ILP extracting structures
• Orthogonal to DVFS and can improve
energy efficiency further
• To the best of our knowledge, AnyCore-1
is the first adaptive processor chip
Key Results• Power consumption and IPC scale with
configured core size
• Idle power is large due to clock tree and
fully synthesized caches
• Fabricated in 130nm technology
• Clock gating at level of lanes and structure partitions
• Input gating of de-configured ports
• CQFP-100 package (79 signal pads, 21 power pads)
Test Infrastructure• Xilinx ML-605 board used as a testbench
• Custom mezzanine card interfaces chip to ML605
• L2 Controller in FPGA (currently uses block RAMs)
• FPGA talks to AnyCore-1 through a Management Bus
• A console program running on a PC communicates
with the FPGA to load benchmarks, write AnyCore-1
configurations, and read performance counters
Adaptive Core vs. Fixed Cores
• Scheduler adjusts configuration for each
program phase to minimize energy while
maintaining close to maximum
performance for the phase
• Dynamic reconfiguration at phase
boundaries places AnyCore-1 at the
pareto frontier
• AnyCore-1 enables the scheduler to
trade performance for lower energy
AnyCore-1 Microarchitecture
Experiments with AnyCore-1• AnyCore-1 successfully runs many different microbenchmarks, with dynamic
reconfiguration enabled and disabled
• Currently we are trying to run SPEC 2000 SimPoints on AnyCore-1
Physical Design Information
Technology IBM 8RF (130nm)
Dimensions 5 mm x 5 mm
Pads (Signal, Power) 100 (79, 21)
Transistors 3.4 million
Cells 1.5 million
Nets 7.6 million
Reconfiguring AnyCore-1• Reconfiguration penalty is ~20 cycles
• Reconfiguration can take up to 150 cycles
• Three ways to trigger reconfiguration in our
test setup1) By the program, by issuing a store with the
configuration to a reserved memory location
2) By the testbench hardware based on
performance counter driven bottleneck
analysis
3) By the console program based on
scheduling constraints
AnyCore PAT Tool
• A Power-Area-
Timing database
generation tool that
can be coupled
with C++ based
microarchitecture
simulators for
broad exploration
studies
Enables Adaptive Core Research
• Understand circuit-level overheads of
adaptivity
• Compare comprehensively adaptive cores
with other adaptive architectures such as
heterogeneous multicore
• Fabricate adaptive superscalar cores like
AnyCore-1
• Automatically synthesize, analyze, and
populate PAT database – easy to use for
computer architects who are not familiar with
low-level flows
• Uses UPF-based per-domain power
analysis to populate power database
• Industry-standard
synthesis and
analysis flow to
easily quantify
circuit-level
overheads
• Clock gating and
power gating
methodology to
maximize power
savings from
disabled
resources
RTL
UPF Constraints
Synopsys
Design
Compiler
Cell
Library
(.lib , .db,
.v)
Cadence
NC Sim
Synopsys
Prime
Time
UPF SDFNetlist
VCD
Area
Report
Timing
Report
Power
Report
Stall instruction fetch
Is pipeline empty?
New configuration
written?
Consolidate architectural registers to the first PRF
partition
Latch in the new configuration
Unstall instruction fetch and refill pipeline
Drain backend of the pipeline
Start
No Yes
No
Yes
10 – 128 cycles (CPU is doing useful work)
7 cycles
10 cycles
Fetch1 Fetch2 Decode Inst Buffer
Predecode
Next
PC
BTB, BP
BTB, BP
BTB, BP
BTB, BP
Predecode
Predecode
Predecode
Decode
Decode
Decode
Decode
Inst
RAM
Rename Dispatch Issue Queue
Rename
Rename
Rename
Rename
RMTCtrl
Queue
+32
+32
30
Free List
Select /
RSR
Select /
RSR
Select /
RSR
Select /
RSR
Select /
RSR
64
+32
+32
AL
Register Read /
Execute / Writeback
Execute
Mem
Execute
Ctrl
Execute
ALU (S+C)
Execute
ALU (S)
Execute
ALU (S)
Commit
Logic
AMT
Retire
4
5
IQ
+16
+16
16
+16
TOP
11
10
9
1
2
3 4
5
Inst/Tag
RAM
(64 lines, 8
instr. per
line)
I-Cache64
+32
+32
PRF
67
+16/+16
16/16
LQ/SQ
8
Hit
Logic
Data/Tag RAM
(128 lines, 16
bytes per line)
D-Cache
Miss Logic,
MSHR
Hit Logic
Miss
Logic,
MSHR
67
67
RTL
Config
Space
UPF
C++
Performance
Simulator
PAT
Estimation
Tool
PAT
Database
Power, area,
and cycle
time
estimates
Performance
estimates
(IPC, miss
rates, etc.)
PAT
Generation
Tool
Counters