The Design of the KiloCore Chip
Aaron Stillmaker*, Brent Bohnenstiehl, Bevan Baas
DAC 2017: Design Challenges of New Processor Architectures
VLSI Computation Laboratory, University of California, Davis
June 20, 2017
*Currently with California State University, Fresno
Outline
• Introduction and Motivation
  – Core Scaling Trend
• KiloCore Architectural Design
• Physical Chip Design
  – Chip Flow
  – Design Challenges
• Final Die
• Application Measurements
[Figure: core scaling trend. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2015 by K. Rupp; new data added by B. Baas.]
Optimal Computational Tile Size
• The most efficient implementations (energy, throughput, circuit area) have:
  – Processor sizes that capture computational kernels with few excess circuits
[Figure: energy efficiency, clock rate, and area efficiency versus processor tile size, with regions marked "unused or low benefit-per-cost circuits" and "inter-processor interconnect".]
Key Properties of KiloCore
• Very small processor tiles
  – Vast numbers of processors per chip
  – Processors can be used for non-traditional purposes in highly efficient implementations
• Processors dissipate very little energy per workload when active
• Processors dissipate very low power when idle
• Essentially no algorithm-specific hardware in the programmable processors
Outline
• Introduction and Motivation
  – Core Scaling Trend
• KiloCore Architectural Design
• Physical Chip Design
  – Chip Flow
  – Design Challenges
• Final Die
• Application Measurements
KiloCore Design
• Contains exactly 1,000 processors on one chip
• One of the first fabricated chips to contain 1,000 processors
• Fastest clock rate of any processor designed at a university
• Did not receive all libraries until 34 days before tapeout
• 12 memories of 64 KB each, for 768 KB of shared memory
  – Each memory is accessible by the two processors directly above it
[Figure: KiloCore block diagram, a single processor tile, and one 64 KB memory.]
GALS Clocking
• KiloCore contains a fully independent Globally Asynchronous Locally Synchronous (GALS) clock domain in each of its 1,000 processors, 1,000 packet routers, and 12 independent memories
  – Processor programmable clock oscillators occupy ~1% of tile area
  – Router oscillators are simplified and very small
• Each of the 2,012 clock oscillators is placed inside its own clock domain; there are no global clock signals (except three for configuration and testing)
• Each clock oscillator is fully unconstrained; oscillators may change frequency (below their fmax), halt, and restart arbitrarily to minimize power consumption
• Data transfer across clock domains is handled by dual-clock FIFOs (see the sketch below)
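To make the clock-domain crossing concrete, below is a minimal Verilog sketch of a generic dual-clock (asynchronous) FIFO using Gray-coded pointers and two-flop synchronizers. It is an illustrative assumption, not the KiloCore circuit: the module name, depth, and port names are invented for this example; only the 16-bit word width comes from the chip.

// Generic dual-clock FIFO sketch (illustrative only, not the KiloCore design).
module dc_fifo #(
  parameter WIDTH = 16,   // KiloCore links carry 16-bit words
  parameter ADDR  = 4     // 2**ADDR entries; depth chosen arbitrarily here
)(
  // write side (source clock domain)
  input                  wr_clk, wr_rst_n, wr_en,
  input      [WIDTH-1:0] wr_data,
  output                 full,
  // read side (destination clock domain)
  input                  rd_clk, rd_rst_n, rd_en,
  output     [WIDTH-1:0] rd_data,
  output                 empty
);
  reg [WIDTH-1:0] mem [0:(1<<ADDR)-1];

  // Binary pointers index the memory; Gray-coded copies cross the clock
  // boundary through two-flop synchronizers, so at most one bit changes
  // per transfer and the synchronized value is always a valid pointer.
  reg [ADDR:0] wr_bin, wr_gray, rd_bin, rd_gray;
  reg [ADDR:0] rd_gray_s1, rd_gray_s2;  // read pointer as seen by the write domain
  reg [ADDR:0] wr_gray_s1, wr_gray_s2;  // write pointer as seen by the read domain

  wire [ADDR:0] wr_bin_nxt  = wr_bin + (wr_en & ~full);
  wire [ADDR:0] wr_gray_nxt = (wr_bin_nxt >> 1) ^ wr_bin_nxt;
  wire [ADDR:0] rd_bin_nxt  = rd_bin + (rd_en & ~empty);
  wire [ADDR:0] rd_gray_nxt = (rd_bin_nxt >> 1) ^ rd_bin_nxt;

  // Write-domain logic: store data, advance pointer, synchronize the read pointer.
  always @(posedge wr_clk or negedge wr_rst_n)
    if (!wr_rst_n) begin
      wr_bin <= 0; wr_gray <= 0; rd_gray_s1 <= 0; rd_gray_s2 <= 0;
    end else begin
      if (wr_en & ~full) mem[wr_bin[ADDR-1:0]] <= wr_data;
      wr_bin     <= wr_bin_nxt;
      wr_gray    <= wr_gray_nxt;
      rd_gray_s1 <= rd_gray;
      rd_gray_s2 <= rd_gray_s1;
    end

  // Read-domain logic: advance pointer, synchronize the write pointer.
  always @(posedge rd_clk or negedge rd_rst_n)
    if (!rd_rst_n) begin
      rd_bin <= 0; rd_gray <= 0; wr_gray_s1 <= 0; wr_gray_s2 <= 0;
    end else begin
      rd_bin     <= rd_bin_nxt;
      rd_gray    <= rd_gray_nxt;
      wr_gray_s1 <= wr_gray;
      wr_gray_s2 <= wr_gray_s1;
    end

  assign rd_data = mem[rd_bin[ADDR-1:0]];
  // Full: write pointer has wrapped relative to the synchronized read pointer
  // (top two Gray bits inverted, rest equal). Empty: pointers match.
  assign full  = (wr_gray == {~rd_gray_s2[ADDR:ADDR-1], rd_gray_s2[ADDR-2:0]});
  assign empty = (rd_gray == wr_gray_s2);
endmodule

In a source-synchronous link of this style, the write port would be driven by the clock that accompanies the data from the upstream tile and the read port by the local oscillator, so either side may halt or change frequency without corrupting data.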
Inter-Processor Communication
• Source-synchronous 16-bit communication
• Two-layer source-synchronous circuit-switched network
  – Up to 28 Gbps per link, 456 Gbps total tile I/O (see the check below)
• Single-layer dynamic packet routing network
  – 9.1 Gbps maximum per port
  – 45.5 Gbps maximum per router
[Figure: tile communication block diagram showing the processor core, the two circuit-switch layers, and the packet router.]
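These figures are consistent with one 16-bit word per clock cycle (a back-of-the-envelope check; the link and port counts are inferred, not stated on the slide):

  16 bits × 1.78 GHz ≈ 28.5 Gbps per circuit-switched link
  456 Gbps / 28.5 Gbps ≈ 16 circuit-switched links per tile
  5 ports × 9.1 Gbps = 45.5 Gbps aggregate packet-router throughput (four neighbors plus the local processor)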
Outline
• Introduction and Motivation
  – Core Scaling Trend
• KiloCore Architectural Design
• Physical Chip Design
  – Chip Flow
  – Design Challenges
• Final Die
• Application Measurements
Bottom-Up Physical Design Flow
[Figure: the per-block design flow. Inputs: HDL, timing constraints, and macro block(s). Stages include synthesis, floorplanning/power gridding, placement, clock tree synthesis, routing, optimization, DRC & LVS, and timing analysis, ending with export of the macro block.]
• Bottom-up design flow
• Design aspects undecided when the physical design was started:
  – Routers
  – Inter-processor I/O
  – Chip I/O
  – Memories (both local and global)
  – Oscillators
  – Configuration
  – Number of processors
Bottom-Up Physical Design Flow
[Figure: the chip-level bottom-up flow. The per-block design flow is applied in turn to the oscillator HDL, the processor and accelerator HDL, a 3x3 test chip, and the full 32x32 chip, producing macro blocks (oscillators; processors and memories; test chip) that feed the next level. A "meets timing?" check loops back on failure; chip finishing then produces the final GDS, which goes to the fab.]
Processor Tile Implementation
[Figure: single processor tile with highlighted gates (percentage of tile area): IMem 33%, DMem 29%, core 20%, router 9%, FIFO1 4%, FIFO2 4%, oscillator 1%, circuit-switched communication 1%.]
• Relatively simple processor design
  – No hard macros
  – Oscillator was pre-placed and routed
  – I/O pin locations specified for optimal abutment
  – Power rails specified
• Quick design iterations
  – Easily changed design aspects such as memory size or tile size
  – Verilog was changing up until 5 days before tapeout
• ~580k transistors per processor tile
• 239 μm "wide" × 232.3 μm "tall"
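As a sanity check on the array area (a derived estimate, not stated on the slide):

  239 μm × 232.3 μm ≈ 0.0555 mm² per processor tile
  1,000 tiles × 0.0555 mm² ≈ 55.5 mm²

which is consistent with the 60 mm² array area reported later once the 12 shared memories and the extra rows of fabrication cells are added.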
Metal Plan
[Figure: tile and global metal plans across layers m1-m11, showing standard cells, inter-standard-cell signals, the local power grid, global signals, and I/O pads. Signal percentages are length per layer as a fraction of total signal length.]
Metal Plan
[Figure: the same tile and global metal plans, annotated with power grid usage per layer; power grid percentages are the fraction of the maximum allowable.]
Processor Level Power Rails
• Used 7 metal layers for the local power grid
  – 39.6 μm horizontal pitch
  – 20.0 μm vertical pitch
• Connected the local grid between all processors
  – Including the standard-cell-level power rails
• Manually drew in the grid of metal at the array level
[Figure: view focused on a single core, showing the array-painted power grid.]
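These pitches fit the tile dimensions neatly (an inference, not from the slide):

  239 μm / 39.6 μm ≈ 6 grid lines across the tile width
  232.3 μm / 20.0 μm ≈ 11-12 grid lines across the tile height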
Global Power Rails
• Metal layers 10 and 11 are used for the global power grid
  – Only connected to the metal layer 9 power rails (the top level of the local power grid)
[Figure: global power rails at the chip level.]
Processor Abutment
• With 1,000 homogeneous processors, very tight abutment is important
  – Communication wires were constrained with I/O pin placement
  – 0.4 μm horizontal spacing and 4.5 μm vertical spacing
  – Required fabrication cells, such as alignment cells, cause interruptions
• Removing a processor is non-ideal due to timing
• Specific rows required added vertical space to fit these cells
[Figure: tile I/O pin placement along the tile edges: circuit-switched communication (north, east, south, west), router communication (north, east, south, west), configuration, and the processor address.]
Processor Timing Paths
• 196 independently timed datapaths for direct communication up to only two tile hops
• Possible timing paths
  – Circuit switched (x2 layers):
    • Each core has 2 origins:
      – One output buffer (Obuf)
      – One long-distance buffer (LD)
    • Each core has 3 destinations:
      – One long-distance buffer (LD)
      – Input buffer 0 (Ibuf0)
      – Input buffer 1 (Ibuf1)
    • Signals can pass through cores
  – Packet switched:
    • Single-hop routers on each processor
[Figure: all possible inter-processor communication paths up to two tiles away, through the Obuf, LD, Ibuf0, and Ibuf1 buffers and the per-tile routers.]
Inter-Processor Communication Wires
• Circuit-switch circuitry: 1% of tile area (9% including FIFOs)
• Packet router circuitry: 9% of tile area
C4 Routing and Chip I/O
• A premade flip-chip BGA package was used
  – 564 C4 bumps
    • 162 I/O (63 LVDS pairs)
    • 402 power (Vdd, Gnd, and VddIO)
  – Non-ideal C4 placement and quantity
• All drivers and ESD clamps were on the periphery
• When all cores are at full speed and 100% activity, there is only sufficient current for the center processors
  – Most applications average 50% activity
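Tallying the stated bump counts (the split of the I/O bumps beyond the LVDS pairs is an inference):

  63 LVDS pairs × 2 = 126 bumps, leaving 162 - 126 = 36 other I/O bumps
  162 I/O + 402 power = 564 C4 bumps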
Oscillator and Processor Clock Tree
• Inverter ring oscillator
  – Inverter chain cells were hand selected
  – Placed and routed before the processor to maintain timing
• Clock tree with 5,522 leaves to drive the flip-flop memories
[Figure: oscillator location and processor clock tree skew (70 ps max.).]
• Average measured processor oscillator frequencies:
  – 1.1 V: 1.78 GHz
  – 900 mV: 1.24 GHz
  – 560 mV: 115 MHz
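For scale (a derived comparison, not stated on the slide): at the 1.78 GHz maximum measured frequency the clock period is 1 / 1.78 GHz ≈ 562 ps, so the 70 ps worst-case skew is roughly 12% of a cycle at full speed.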
Array Placement with Memories
• Embedding memories in the array placement
  – Hard SRAM memory blocks forced the shared memory to take a specific shape
• Memories placed on the bottom to eliminate:
  – Configuration signal pass-through
  – Non-regular timing for signals over memory
  – Wasted space
[Figure: shared memory and SRAM cell.]
KiloCore Chip
Technology: 32 nm IBM PD-SOI CMOS
Number of processors: 1,000
Number of memories: 12
Die area: 64 mm²
Array area: 60 mm²
Transistors: 621 million
C4 bumps: 564 (162 I/O)
Package: 676-pad flip-chip BGA
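A rough accounting of the transistor total (derived, not from the slide):

  1,000 processor tiles × ~580k transistors ≈ 580 million
  768 KB of shared SRAM × 8 bits/byte × ~6 transistors/bit ≈ 38 million

which together account for most of the 621 million; the remainder covers memory periphery, chip I/O, and other top-level circuits.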
Outline
• Introduction and Motivation
  – Core Scaling Trend
• KiloCore Architectural Design
• Physical Chip Design
  – Chip Flow
  – Design Challenges
• Final Die
• Application Measurements
Applications
• Several applications have been implemented for KiloCore:
  – Fast Fourier Transform
    • 4096-point, 16-bit fixed-point complex data
  – Advanced Encryption Standard
    • 128-bit keys
  – Low Density Parity Check
    • 4095 code length
  – Sorting
    • 100-byte records with 10-byte keys
    • 1,850 records per sorted block
Application Comparison
• KiloCore operating at 1.1 V
Data: B. Bohnenstiehl et al., JSSC 2017.
[Figure: bar charts of normalized throughput per area and normalized throughput per watt for AES, FFT, LDPC, and Sort on a CPU, a GPU, and KiloCore, each normalized to the CPU at 1.0. KiloCore achieves the highest values in every case, reaching up to 74.1× in throughput per area and 207.5× in throughput per watt.]
Acknowledgments
• Funding and Support
  – DoD and ARL/ARO Grant W911NF-13-1-0090
  – TAPO
  – NSF CAREER award 546907; CCF Grants 430090, 903549, 1018972, and 1321163
  – SRC GRC Grants 1598, 1971, and 2321; CSR Grant 1659
  – ST Microelectronics
  – C2S2
  – Intel Corporation
  – UCD Faculty Research Grant
  – MOSIS
  – Artisan