A building block computing system for AI applicationsmcsoc-forum.org/2017/wp-content/uploads/2017/09/keynote-Amano-Material-1.pdfA building block computing system for AI applications

A building block computing

system for AI applications

Keio University

Hideharu Amano

Background

• Rapidly Increasing

Requirements of

Embedded Systems

– Smart Phones,

Tablets, Car

electronics, Robotics,

IoTs, etc.

– High Performance,

Low Energy

Consumption, Low

Cost….

• Rapidly Increasing

NRE (Non-Recurring

Engineering) cost of

advanced LSIs.

– Complicated masks

– Design/Verification

• SoC (System-on-

Chip) is difficult to be

implemented for each

target.

Building Block Computing Systems

Building Block Computing

CPU Memory Accelerator1

Memory

CPU

MemoryMemory

CPU

MemoryMemory

Accelerator1Accelerator2

Accelerator2

CPU

Accelerator2Accelerator2Accelerator1

Accelerator1

CPU

1 step: Various Combination can be done after chip-fabrication

2 step: Chips can be replaced by users → Field Stackable

Key Technique: Inductive Coupling Through-Chip Interface（TCI)

Today’s talk

• What is TCI？

• The first prototype: Cube-1

• Intellectual Property of TCI

• Plug-and-play inter-chip network

• A heterogeneous AI system: Cube-2

3D IC technology for going verticalTwo

chips

(face

-to

-fa

ce)

Microbump

Through silicon via

Capacitive coupling

Inductive coupling

Wired Wireless

Scalability

Flexibility

Mor

e t

han

thre

e c

hips

Block diagram of scalable 3D NoC using inductive-coupling ThruChip Interface (TCI).

Rx TxTx RxTx

Network

Interface

TCI

3D Interconnect TCI

Packet En/DecodeNetworkInterface Packet En/Decode

Router

Accelerator Core

Rx TxTx RxTx

CPU Core

Rx TxTx

Network Interface

Data Link

Clock Link

Uplink Downlink

Accelerator 1

Accelerator 2

Host CPU

Header Payload

1323335

35bit Packet Structure

Transfer Type:Command/AddressSingle Packet DataBurst Multi-Packet Data

Benefits of Inductive Coupling TCI

• More than 8Gbps speed is achieved within 10mW power dissipation.

• The error rate is less than 10^-9.– No error correction circuits are required.

• Coils are built by metal wires of common CMOS.– No additional process is required.

• No ESD (Electro-Static Discharge) protection devices are required unlike TSV.

• Stacking cost is much lower than that for TSV.– Just grind chips to 40-80um thickness and stack

them as in a common interposer inside the package.

• First commercial products with TCI will be shipped this year from PEZY.

Problems• The footprint of a coil is much larger than a via

for TSV. – Chip does not scale in the vertical direction.

– Footprint of both coils and TSVs is increased in the advanced technology.

– Digital circuits can be implemented in coils.

→ The real footprint is not larger than that of TSV.

• Power and system clocks are supplied by bonding wires.– Now, the power supply by inductive coupling is tried.

– It may take some time….

• Heat dissipation cannot be done by TCI.– The bonding agent used in the current prototype well

transfers heat.

– But heat dissipation more than 1W chips is a challenge.

Today’s talk

• What is TCI？





The first prototype Ｃｕｂｅ-1Built in 2013. To demonstrate that the TCI can be used in a real system.

Geyser-Cube

CMA-Cube

InductiveCoupling

Ring based packet switching network is formedThe number of accelerators can be

changed.

Microphotograph of stacked test chips.

Host CPU

Accelerator 1

Accelerator 2

Accelerator 3

Host CPU + Accelerator x3 Chip Stack

Fabricated in 65nm CMOS

Host CPU Chip

Accelerator Chip

TEG

Netw

ork

IF

Netw

ork

IF

m-C

on

tro

lle

r

MIPS CPU

Core

8x8 PE Array

TCI Tx

TCI Rx

Rx

Rx

Tx

Tx

TCI

Cube-1 Cube-2

Process Fujitsu e-shuttle 65nm

12metal

Renesas SOTB 65nm

7metal

Area 2.1mm X 4.2mm 3mm X 6mm

TCI 3Gb/s 240um 2.5Gb/s 240um

CPU

Cache

TLB

MIPS R3000

4KW 2way sep.

16-entry shared

MIPS R3000

4KW 2way sep.

16-entry shared

CMA CMA-Cube 8x8 CC-SOTB 12x8

Supply Voltage 0.6-1.2V 0.3-1.2V

Target Frequency 50-100MHz 50-100MHz

Chip Thickness 40-80um 80um

Cube-1 vs. Cube-2

0

100,000

200,000

300,000

400,000

500,000

600,000

Alpha Gray Sepia

Ex

ec

uti

on

Tim

e [

cyc

les

]

0

20

40

60

80

100

120

140

Alpha Gray SepiaAlpha Gray Sepia

Ener

gy (

μ J

)

Exec

uti

on

Tim

e (m

sec)

200

400

600

800

1000

1200

2-chipstack

GeyserOnly

GeyserOnly

2-chipstack

CMA

Geyser

Execution Time Energy Reduction

Evaluation Results from a 2-chip stack system

評価結果• Cube-1 Quad-CoreにおけるJPEGデコーダの実行• 128x96pixelの画像をデコード

0

100000

200000

300000

400000

500000

600000

700000

800000

実行サイクル数

(cyc

le)

other

convert yuv to rgb

store intermediate

inverse dct

inverse quantization

huffman decode

collect result

dma trans

processing

image data trans

Cube-1 Quad-Core No

3.15倍の性能

Evaluation Results

•Using three accelerators.

•The target image block: 128 x 96 pixel

3.15 times speed up

Ex

ecu

tio

n c

ycle

s

Measured PER and TCI power dissipation dependence on supply voltage.

10-9

10-8

10-7

10-6

10-5

10-4

5

6

7

8

9

10

0.8 0.9 1 1.1 1.2 1.3Supply Voltage [V]

TC

I P

ow

er

Dis

sip

ati

on

[m

W/c

h]

Pack

et

Err

or

Rate

(P

ER

)

35bit Burst Packet Transmission @ 50MHz System

Clock

Continuous >1 Hour

Error-Free Operation

@ Nominal Supply Voltage

5.8mW/c

h @

0.92V

Demonstration system.

DisplayOutput

Chip Stack

Mother Board

Daughter Board

FPGA

Cube-1 was a good starting point

Please check Micro

2013.11/12http://www.am.ics.keio.ac.jp/kaken_s

Demonstration in HotChips2014,

Coolchips2015, and FPL2015.

http://www.am.ics.keio.ac.jp/kaken_s

The problems on Cube-1• Not stable with three chips

– Interference between vertical inductors which

must be placed for the ring network.

• OK. It’s an interesting system. But what for?

– Only two chip types were provided.

• For development of various types of chips:

– Easy design.

– Automatic performance tuning.

– Thermal problem on stacking.

– Practical power supply.

– Comprehensive Researches are needed.

What is the next step?

• Intellectual Property of the TCI

– It must be embedded in the LSI design flow.

– Higher level layers are needed.

• A new plug-on network based on the IP

• Automatic performance tuning

– Software approach

– Hardware approach

• Thermal dissipation problems

The new target systems Cube-2 for the AI application

Today’s talk

• What is TCI？





TCI-IP Physical layer

Txclk

TXDATA

ClockCounter

35

TXWrite

LA

TC

HTxBusy

MU

X

TX

TX

RX

DE

MU

X

RX

LA

TC

H

S/E BitDetector

Pari

tyC

hec

ker

RXDATA

OverW Detector

RxReady

RxReadOverWrite

35 42 42 44

7

TxSleepRxSleep

RxLatch

StartBit ->3bit

Payload ->35bit

Parity ->2bit

Endbit ->4bitPhysical layer = Inductors + Tx/Rx + SERDES

TCI physical layer

CLK

TxRx

receiver

SERDES

transceiver

SERDES

Data

TxRx

TXWRITE

TXBUSYRXREADY

RXREAD35bit data

35bit data

PERR, OVW,

Flame error

2GHz clk

Can be used for

the both direction

Timing chart

TxWrite

TxData0

TxLATCH0

TxBusy

Txclk

Rxclk

RxLatch

RxLATCH0

RxReady

RxRead

Serial transfer synchronized with Txclk

Fully Depleted Silicon on

Insulator(FD-SOI):

SOTB MOSFET

P-sub

(a) (b)

N-well P-wellSTI

Box layer

Back Gate

Deep n-well

☆Dopant less Structure (low variability)

☆Formed on an ultra thin BOX layer (about 10nm)

High efficiency of leakage reduction with body bias control

Body bias

24

Body Bias Control of SOTB

VBN > VS (Forward body bias : FBB)

VBP < VS

• High speed

• Large leakage current

leakagedelay

Body bias (+:forward -:reverse)[V]

It is important to decide optimal voltages of body bias

VBN < VS (Reverse body bias : RBB)

VBP > VS

• Low leakage current

• Low speed

Delay

[s]Pow

er[W

]

VBN

VBP

25See [Okuhara IEEE TVLSI 2017]

TCI IP

80μm

thickness

36bit/50MHz

with normal

bias.

Including

SER/DES

circuits.

Half-duplex

data transfer

Clock Data

CC SOTB Layout

TCI IP

×4sets

（Up/Down)

Renesas Electronics 65nm SOTB 7Metal layers

６mm X ３mm

Three layers for the TCI• Full duplex bi-directional channel is formed

with two physical IPs.

• The flow control is embedded into the link

controller.

• Various networks can be easily formed.

TCI

transceiverTCI receiver

Link Controller

Router

Physical layout. SERDES is embedded.

Verilog Soft core. 8-Virtual Channels.

Flow control. 17flits packet buffer

Verilog Soft core. Parameterized router.

3-stage pipeline.

Open in the VDEC web site

Today’s talk

• What is TCI？





Bonding wire

Bonding wire

Bonding wire

Bonding wire

The ring network

on Cube-1

TX RXTXRX

TX RXTXRX

TX RXTXRX

TX RXTXRX

Netw

ork

IF

m-C

on

tro

lle

r

8x8 PE Array

Rx

Rx

Tx

Tx

TCI

The escalator network• Just by stacking chips with three layers IP, any numbers

of chips can be stacked.

• The simplest 3-port routers are used.

• Vertical coils never interfere with each other.

RXRX

TXTX

Router

TCIRX

RX

TXTX

Router

TCIRX RXTXTX

IP Core

Router

TCI

RX RXTXTX

IP Core

Router

TCI

Chip 0

Chip 1

Chip 2

Chip 3

The spacer chips

Drops

Spacer chips

Ring vs. Escalator

Esc. Esc. (no VC.)

The overhead of flow-control2 or 17-flit packets

Uniform Bit reverse

34The overhead can be ignored→

0

40

80

120

160

0 0.2 0.4 0.6

Avera

ge l

ate

ncy [

cycle

]

Injection rate [flit/node/cycle]

Piggyback

Ideal

0

40

80

120

160

200

0 0.2 0.4 0.6

Avera

ge l

ate

ncy [

cycle

]

Injection rate [flit/node/cycle]

Piggyback

Ideal

Today’s talk

• What is TCI？





A heterogeneous multicore

system: Cube-2Image recognition for robotics, cars or other

edge systems.

• More than a single chip-stacking.

– Embedded CPU : GeyserTT

– Shared memory chip: SMTT

• Accelerator chips

– CC-SOTB2

– SNACC

– KVS

• Various combination of chips.

Cube-1 Cube-2

Process Fujitsu e-shuttle 65nm

12metal

Renesas SOTB 65nm

7metal

Area 2.1mm X 4.2mm 3mm X 6mm or

6mm X 6mm

TCI 3Gb/s 240um 2.5Gb/s 240um

Supply Voltage 0.6-1.2V 0.3-1.2V

Target Frequency 50-100MHz 50-100MHz

Chip Thickness 40-80um 80um

Cube-1 vs. Cube-2

Geyser-TT (Twin Tower)

TCI for a single tower

TCI for the twin tower

MIPS R3000 Compatible CPU

with three TCI IPs

Basic CPU

Core

TLB

I Cache

DMAC

D

Cache Router

I/F

Router

TCI

External Bus

Controller

Geyser CPU Core (MIPS

ISA)Geyser Internal Bus

Router

I/F

Router

TCI

Geyser External Bus: Main Memory, I/Os

Geyser CPU

TCI for Twin

Tower

Geyser-TT

TCI

TCI for Single

Tower

KVS

SNACC

CC-SOTB2

GeyserTT

A simple stacking

An escalator network is

formed.

Twin-tower with a single core

KVS

SNACC

GeyserTT

SNACC

CC

KVS

SNACC

Geyser TT is used as

a bridge of two chip

stacks

Shared Memory for

twin-tower (SMTT)• A large memory system is needed in

various configuration.

– A chip with DRAM module is needed.

– From the limitation of money and power

bedget, only 250KB static RAM is provided.

– Ultramemory’s DRAM with TCI will hopefully

take place in the future.

• Multiple hosts in multiple towers can

exchange the data with the shared

memory.

• Synchronization memory is also provided.

SMTT chipTCI for the twin tower

Twin-tower with multi cores

Shared Memory

KVS

SNACC

GeyserTT

SNACC

CC

GeyserTT

Accelerator family

• CC-SOTB2

– Coarse Grained Reconfigurable Array (CGRA)

– Low power image processing

• SNACC

– Convolutional Neural Network Accelerator

• KVS

– Key-value store accelerator

• All chips have the same TCI IP.

CC-SOTB2

12x8 PE array

PE array

with 12x8 PEVariable

Pipeline

structure

12x2

interleaved/banked

memory

Data manipulator

for flexible

load/store

Energy optimization with body

bias control

• The previous presentation by Anh-Vu-

san[McSoC2017]

• FPL2017 etc.

• More than 800MOPS/mW was achieved.

PE row

TCI IP

SNACC: a convolutional neural

network accelerator

TCI IPSIMD

cores

• SIMD cores

• Dedicated instruction set and ALUs

• Yesterday presentation by Sakamoto-

san[McSoC2017]

KVS Chip

• Memory search

• Set/Get operations

[HotInterconnect

2016]

Streaming processing in Cube-2

• Shared memory using cache/DMA transfer

– All memory modules of accelerators are

mapped on the same address space.

– Memory modules of accelerators are treated

as the 2nd-level cache.

– DMA transfer between accelerators is

triggered automatically.

• Streaming processing between chips can

be efficiently done.

Streaming processing between

chipsGeyserTT

CC

SOTB2

SNACC

KVS

Data

Write

Exec.

Data

DMA

Data

DMA

Data

Read

Exec.

Exec.

End

End

Read

Req.

Data

Write

Data

DMA

Execution of accelerators and communication between chips can

be overlapped.

Data

DMA

The current status

• 3-chip stack is now operational.

• Chips are available on this December.

Special thanks to:

• Prof. Kuroda (Keio Univ.)

• Prof. Kondo & Dr. Sakamoto (Univ. of Tokyo)

• Prof. Matsutani (Keio Univ.)

• Prof. Nakamura (Univ. of Tokyo)

• Prof. Namiki (Tokyo Agriculture & Tech.)

• Prof. Usami (Shibaura Inst. Tech.)

This work has been supported by JSPS

Kakenhi S Grant # 25220002

Thank you !

Please visit our web-site:

– http://www.am.ics.keio.ac.jp/kaken_s

For other projects:

– http://www.am.ics.keio.ac.jp

http://www.am.ics.keio.ac.jp/kaken_s

http://www.am.ics.keio.ac.jp/

Documents

A building block computing system for AI applicationsmcsoc-forum.org/2017/wp-content/uploads/2017/09/keynote-Amano-Material-1.pdfA building block computing system for AI applications