The Design of the KiloCore Chip
Aaron Stillmaker*, Brent Bohnenstiehl, Bevan Baas
DAC 2017: Design Challenges of New Processor Architectures
VLSI Computation Laboratory, University of California, Davis
June 20, 2017
*Currently with California State University, Fresno
Outline
• Introduction and Motivation
  – Core Scaling Trend
• KiloCore Architectural Design
• Physical Chip Design
  – Chip Flow
  – Design Challenges
• Final Die
• Application Measurements
[Figure: core scaling trend. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2015 by K. Rupp; new data added by B. Baas.]
Optimal Computational Tile Size
• The most efficient implementations (energy, throughput, circuit area) have:
  – Processor sizes that capture computational kernels with few excess circuits
[Figure: energy efficiency, clock rate, and area efficiency versus processor tile size, with regions marked "unused or low benefit-per-cost circuits" and "inter-processor interconnect".]
Key Properties of KiloCore
• Very small processor tiles
  – Vast numbers of processors per chip
  – Processors can be used for non-traditional purposes in highly efficient implementations
• Processors dissipate very little energy per workload when active
• Processors dissipate very low power when idle
• Essentially no algorithm-specific hardware in the programmable processors
Outline
• Introduction and Motivation
  – Core Scaling Trend
• KiloCore Architectural Design
• Physical Chip Design
  – Chip Flow
  – Design Challenges
• Final Die
• Application Measurements
KiloCore Design
• Contains exactly 1,000 processors on one chip
• One of the first fabricated chips to contain 1,000 processors
• Fastest clock rate of any processor designed at a university
• Did not receive all libraries until 34 days before tapeout
• 12 memories of 64 KB each, for 768 KB of shared memory
  – Each memory is accessible by the two processors directly above it
[Figure: KiloCore block diagram, a single processor tile, and one 64 KB memory.]
GALS Clocking
• KiloCore contains a fully independent Globally Asynchronous Locally Synchronous (GALS) clock domain in each of its 1,000 processors, 1,000 packet routers, and 12 independent memories
  – Processor programmable clock oscillators occupy ~1% of tile area
  – Router oscillators are simplified and very small
• Each of the 2,012 clock oscillators is placed inside its own clock domain; there are no global clock signals (except three for configuration and testing)
• Each clock oscillator is fully unconstrained; oscillators may change frequency (below their fmax), halt, and restart arbitrarily to minimize power consumption
• Data transfer across clock domains is handled by dual-clock FIFOs (see the sketch below)
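To make the clock-domain crossing concrete, below is a minimal Verilog sketch of a generic dual-clock (asynchronous) FIFO using Gray-coded pointers and two-flop synchronizers. It is an illustrative assumption, not the KiloCore circuit: the module name, depth, and port names are invented for this example; only the 16-bit word width comes from the chip.

// Generic dual-clock FIFO sketch (illustrative only, not the KiloCore design).
module dc_fifo #(
  parameter WIDTH = 16,   // KiloCore links carry 16-bit words
  parameter ADDR  = 4     // 2**ADDR entries; depth chosen arbitrarily here
)(
  // write side (source clock domain)
  input                  wr_clk, wr_rst_n, wr_en,
  input      [WIDTH-1:0] wr_data,
  output                 full,
  // read side (destination clock domain)
  input                  rd_clk, rd_rst_n, rd_en,
  output     [WIDTH-1:0] rd_data,
  output                 empty
);
  reg [WIDTH-1:0] mem [0:(1<<ADDR)-1];

  // Binary pointers index the memory; Gray-coded copies cross the clock
  // boundary through two-flop synchronizers, so at most one bit changes
  // per transfer and the synchronized value is always a valid pointer.
  reg [ADDR:0] wr_bin, wr_gray, rd_bin, rd_gray;
  reg [ADDR:0] rd_gray_s1, rd_gray_s2;  // read pointer as seen by the write domain
  reg [ADDR:0] wr_gray_s1, wr_gray_s2;  // write pointer as seen by the read domain

  wire [ADDR:0] wr_bin_nxt  = wr_bin + (wr_en & ~full);
  wire [ADDR:0] wr_gray_nxt = (wr_bin_nxt >> 1) ^ wr_bin_nxt;
  wire [ADDR:0] rd_bin_nxt  = rd_bin + (rd_en & ~empty);
  wire [ADDR:0] rd_gray_nxt = (rd_bin_nxt >> 1) ^ rd_bin_nxt;

  // Write-domain logic: store data, advance pointer, synchronize the read pointer.
  always @(posedge wr_clk or negedge wr_rst_n)
    if (!wr_rst_n) begin
      wr_bin <= 0; wr_gray <= 0; rd_gray_s1 <= 0; rd_gray_s2 <= 0;
    end else begin
      if (wr_en & ~full) mem[wr_bin[ADDR-1:0]] <= wr_data;
      wr_bin     <= wr_bin_nxt;
      wr_gray    <= wr_gray_nxt;
      rd_gray_s1 <= rd_gray;
      rd_gray_s2 <= rd_gray_s1;
    end

  // Read-domain logic: advance pointer, synchronize the write pointer.
  always @(posedge rd_clk or negedge rd_rst_n)
    if (!rd_rst_n) begin
      rd_bin <= 0; rd_gray <= 0; wr_gray_s1 <= 0; wr_gray_s2 <= 0;
    end else begin
      rd_bin     <= rd_bin_nxt;
      rd_gray    <= rd_gray_nxt;
      wr_gray_s1 <= wr_gray;
      wr_gray_s2 <= wr_gray_s1;
    end

  assign rd_data = mem[rd_bin[ADDR-1:0]];
  // Full: write pointer has wrapped relative to the synchronized read pointer
  // (top two Gray bits inverted, rest equal). Empty: pointers match.
  assign full  = (wr_gray == {~rd_gray_s2[ADDR:ADDR-1], rd_gray_s2[ADDR-2:0]});
  assign empty = (rd_gray == wr_gray_s2);
endmodule

In a source-synchronous link of this style, the write port would be driven by the clock that accompanies the data from the upstream tile and the read port by the local oscillator, so either side may halt or change frequency without corrupting data.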
Inter-Processor Communication
• Source-synchronous 16-bit communication
• Two-layer source-synchronous circuit-switched network
  – Up to 28 Gbps per link, 456 Gbps total tile I/O (see the check below)
• Single-layer dynamic packet routing network
  – 9.1 Gbps maximum per port
  – 45.5 Gbps maximum per router
[Figure: tile communication block diagram showing the processor core, the two circuit-switch layers, and the packet router.]
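These figures are consistent with one 16-bit word per clock cycle (a back-of-the-envelope check; the link and port counts are inferred, not stated on the slide):

  16 bits × 1.78 GHz ≈ 28.5 Gbps per circuit-switched link
  456 Gbps / 28.5 Gbps ≈ 16 circuit-switched links per tile
  5 ports × 9.1 Gbps = 45.5 Gbps aggregate packet-router throughput (four neighbors plus the local processor)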
Outline
• Introduction and Motivation
  – Core Scaling Trend
• KiloCore Architectural Design
• Physical Chip Design
  – Chip Flow
  – Design Challenges
• Final Die
• Application Measurements
Bottom-Up Physical Design Flow
[Figure: the per-block design flow. Inputs: HDL, timing constraints, and macro block(s). Stages include synthesis, floorplanning/power gridding, placement, clock tree synthesis, routing, optimization, DRC & LVS, and timing analysis, ending with export of the macro block.]
• Bottom-up design flow
• Design aspects undecided when the physical design was started:
  – Routers
  – Inter-processor I/O
  – Chip I/O
  – Memories (both local and global)
  – Oscillators
  – Configuration
  – Number of processors
Bottom-Up Physical Design Flow
[Figure: the chip-level bottom-up flow. The per-block design flow is applied in turn to the oscillator HDL, the processor and accelerator HDL, a 3x3 test chip, and the full 32x32 chip, producing macro blocks (oscillators; processors and memories; test chip) that feed the next level. A "meets timing?" check loops back on failure; chip finishing then produces the final GDS, which goes to the fab.]
Processor Tile Implementation
[Figure: single processor tile with highlighted gates (percentage of tile area): IMem 33%, DMem 29%, core 20%, router 9%, FIFO1 4%, FIFO2 4%, oscillator 1%, circuit-switched communication 1%.]
• Relatively simple processor design
  – No hard macros
  – Oscillator was pre-placed and routed
  – I/O pin locations specified for optimal abutment
  – Power rails specified
• Quick design iterations
  – Easily changed design aspects such as memory size or tile size
  – Verilog was changing up until 5 days before tapeout
• ~580k transistors per processor tile
• 239 μm "wide" × 232.3 μm "tall"
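As a sanity check on the array area (a derived estimate, not stated on the slide):

  239 μm × 232.3 μm ≈ 0.0555 mm² per processor tile
  1,000 tiles × 0.0555 mm² ≈ 55.5 mm²

which is consistent with the 60 mm² array area reported later once the 12 shared memories and the extra rows of fabrication cells are added.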
Metal Plan
[Figure: tile and global metal plans across layers m1-m11, showing standard cells, inter-standard-cell signals, the local power grid, global signals, and I/O pads. Signal percentages are length per layer as a fraction of total signal length.]
Metal Plan
[Figure: the same tile and global metal plans, annotated with power grid usage per layer; power grid percentages are the fraction of the maximum allowable.]
Processor Level Power Rails
• Used 7 metal layers for the local power grid
  – 39.6 μm horizontal pitch
  – 20.0 μm vertical pitch
• Connected the local grid between all processors
  – Including the standard-cell-level power rails
• Manually drew in the grid of metal at the array level
[Figure: view focused on a single core, showing the array-painted power grid.]
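These pitches fit the tile dimensions neatly (an inference, not from the slide):

  239 μm / 39.6 μm ≈ 6 grid lines across the tile width
  232.3 μm / 20.0 μm ≈ 11-12 grid lines across the tile height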
Global Power Rails
• Metal layers 10 and 11 are used for the global power grid
  – Only connected to the metal layer 9 power rails (the top level of the local power grid)
[Figure: global power rails at the chip level.]
Processor Abutment
• With 1,000 homogeneous processors, very tight abutment is important
  – Communication wires were constrained with I/O pin placement
  – 0.4 μm horizontal spacing and 4.5 μm vertical spacing
  – Required fabrication cells, such as alignment cells, cause interruptions
• Removing a processor is non-ideal due to timing
• Specific rows required added vertical space to fit these cells
[Figure: tile I/O pin placement along the tile edges: circuit-switched communication (north, east, south, west), router communication (north, east, south, west), configuration, and the processor address.]
Processor Timing Paths
• 196 independently timed datapaths for direct communication up to only two tile hops
• Possible timing paths
  – Circuit switched (x2 layers):
    • Each core has 2 origins:
      – One output buffer (Obuf)
      – One long-distance buffer (LD)
    • Each core has 3 destinations:
      – One long-distance buffer (LD)
      – Input buffer 0 (Ibuf0)
      – Input buffer 1 (Ibuf1)
    • Signals can pass through cores
  – Packet switched:
    • Single-hop routers on each processor
[Figure: all possible inter-processor communication paths up to two tiles away, through the Obuf, LD, Ibuf0, and Ibuf1 buffers and the per-tile routers.]
Inter-Processor Communication Wires
• Circuit-switch circuitry: 1% of tile area (9% including FIFOs)
• Packet router circuitry: 9% of tile area
C4 Routing and Chip I/O
• A premade flip-chip BGA package was used
  – 564 C4 bumps
    • 162 I/O (63 LVDS pairs)
    • 402 power (Vdd, Gnd, and VddIO)
  – Non-ideal C4 placement and quantity
• All drivers and ESD clamps were on the periphery
• When all cores are at full speed and 100% activity, there is only sufficient current for the center processors
  – Most applications average 50% activity
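Tallying the stated bump counts (the split of the I/O bumps beyond the LVDS pairs is an inference):

  63 LVDS pairs × 2 = 126 bumps, leaving 162 - 126 = 36 other I/O bumps
  162 I/O + 402 power = 564 C4 bumps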
Oscillator and Processor Clock Tree
• Inverter ring oscillator
  – Inverter chain cells were hand selected
  – Placed and routed before the processor to maintain timing
• Clock tree with 5,522 leaves to drive the flip-flop memories
[Figure: oscillator location and processor clock tree skew (70 ps max.).]
• Average measured processor oscillator frequencies:
  – 1.1 V: 1.78 GHz
  – 900 mV: 1.24 GHz
  – 560 mV: 115 MHz
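For scale (a derived comparison, not stated on the slide): at the 1.78 GHz maximum measured frequency the clock period is 1 / 1.78 GHz ≈ 562 ps, so the 70 ps worst-case skew is roughly 12% of a cycle at full speed.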
Array Placement with Memories
• Embedding memories in the array placement
  – Hard SRAM memory blocks forced the shared memory to take a specific shape
• Memories placed on the bottom to eliminate:
  – Configuration signal pass-through
  – Non-regular timing for signals over memory
  – Wasted space
[Figure: shared memory and SRAM cell.]
KiloCore Chip
Technology: 32 nm IBM PD-SOI CMOS
Number of processors: 1,000
Number of memories: 12
Die area: 64 mm²
Array area: 60 mm²
Transistors: 621 million
C4 bumps: 564 (162 I/O)
Package: 676-pad flip-chip BGA
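A rough accounting of the transistor total (derived, not from the slide):

  1,000 processor tiles × ~580k transistors ≈ 580 million
  768 KB of shared SRAM × 8 bits/byte × ~6 transistors/bit ≈ 38 million

which together account for most of the 621 million; the remainder covers memory periphery, chip I/O, and other top-level circuits.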
Outline
• Introduction and Motivation
  – Core Scaling Trend
• KiloCore Architectural Design
• Physical Chip Design
  – Chip Flow
  – Design Challenges
• Final Die
• Application Measurements
Applications
• Several applications have been implemented for KiloCore:
  – Fast Fourier Transform
    • 4096-point, 16-bit fixed-point complex data
  – Advanced Encryption Standard
    • 128-bit keys
  – Low Density Parity Check
    • 4095 code length
  – Sorting
    • 100-byte records with 10-byte keys
    • 1,850 records per sorted block
Application Comparison
• KiloCore operating at 1.1 V
Data: B. Bohnenstiehl et al., JSSC 2017.
[Figure: bar charts of normalized throughput per area and normalized throughput per watt for AES, FFT, LDPC, and Sort on a CPU, a GPU, and KiloCore, each normalized to the CPU at 1.0. KiloCore achieves the highest values in every case, reaching up to 74.1× in throughput per area and 207.5× in throughput per watt.]
Acknowledgments
• Funding and Support
  – DoD and ARL/ARO Grant W911NF-13-1-0090
  – TAPO
  – NSF CAREER award 546907; CCF Grants 430090, 903549, 1018972, and 1321163
  – SRC GRC Grants 1598, 1971, and 2321; CSR Grant 1659
  – ST Microelectronics
  – C2S2
  – Intel Corporation
  – UCD Faculty Research Grant
  – MOSIS
  – Artisan