
Big shifts in EDA, IP and HPC

Dr. Chris Rowen

CTO – IP Group

Cadence Design Systems, Inc.

June 7, 2015

© 2014 Cadence Design Systems, Inc. All rights reserved.

Agenda

• A mainstream view of big compute and advanced EDA/IP

− System Design Enablement

− Design and Verification across digital and high-speed analog

− Universal adoption of IP-rich development

− New roles for processors

• But the world is changing…

• Example: vision super-computing


Big trends creating opportunities


Cadence focus: System Design Enablement

From end product down to chip level

Partnerships with Ecosystem Leaders

Mobile, Consumer, Cloud Datacenter, Auto, Medical, IoT

CHIP (Core EDA)

• Design and implementation
• IP/SoC verification
• On-chip protocol IP
• Dataplane processor IP
• Software drivers

SYSTEM INTEGRATION

• System analysis
• Hardware-Software verification
• System-level IP protocols
• Software applications
• Software development

PACKAGE and BOARD

• PCB design
• Package design
• PCB and package analysis
• Chip-to-chip protocol IP

Allegro® package and PCB design, OrCAD® PCB design

Encounter® digital design platform

Processor IP, Design IP and Verification IP

Virtuoso® analog, custom, RF, mixed-signal design platform

Incisive® Verification Platform

System Design

Protium, Palladium® hybrid, VSP, Jasper


HPC: Compute + memory bandwidth + high-speed IO

Digitally-rich but high-speed analog also crucial

[Figure: platform technologies for design and verification across digital and analog, including the Spectre® platform]


New Integrated Digital Flow: Genus and Innovus

• Digital implementation tools share a common infrastructure

• Analysis
o Tempus Timer: same SDC and view setup files for the whole flow
o Quantus Extractor: same R/C rules for the whole flow
o Voltus IR, Power and EM: same EM rules for the whole flow

• User Interface
o Consistent Tcl commands run the whole flow
o Easy to learn and reuse across different tools
o Powerful reporting utilities

• 10-20% better PPA

• Up to 10x TAT & capacity

[Flow diagram, RTL to GDS: Genus Synthesis Solution and Innovus Implementation System]

• Inputs and outputs shown: RTL, GDS, SDC & Views, Tcl Script, R/C Rules, EM Rules, HTML/graphs
• Common engines: Common Tempus Timer, Common Quantus Extraction, Common Voltus IR, Power & EM
• Common UI, Common GUI, Common Reporting Utilities
• Related products shown: Tempus Timing Signoff Solution, Voltus Power Integrity Solution, RTL Power Solution, Quantus QRC Extraction Solution, Physical Verification System


Integration is not just about the silicon design

• Allegro platform: system-in-package

o Multi-radio gateway board & package design

o Sigrity for gateway board & package signal integrity analysis

o Mixed-signal design & verification across fabrics

o OrCAD platform: system-on-board

• Virtuoso & Encounter platforms: system-on-chip

• Incisive & Virtual System platforms: system-on-chip HW/SW verification

• Palladium & Rapid Prototyping platforms: full system development


IP – A Path for Getting to Market Faster

Cadence is a trusted partner enabling customer innovation

Design IP
• Fastest-growing segment share
• First to advanced nodes with DDR, PCIe®
• Highest-performance analog IP

Tensilica® IP
• #1 in DSP IP licensing, #2 in royalty-bearing shipments
• Shipping over 2 billion cores/year
• 225+ licensees worldwide

Verification IP
• #1 in Verification IP segment share
• 100% adoption by top 30 semiconductor companies
• 500+ customers worldwide


Cadence IP Portfolio

Broad and growing solution

Interfaces (Controllers and PHYs)
• Ethernet: 10/100M MAC; 1/2.5G MAC; 10/40G MAC; 25/50/100G MAC; 10/100 PHY; 1G PHY; SGMII/QSGMII PHY; XAUI/RXAUI PHY; 10G-KR PHY; GPON, EPON PHY
• USB: USB 3.1 Host; USB 3.1 Device; USB 3.1 PHY; USB 3.0/2.0 Host; USB 3.0/2.0/1.1 Device; USB 3.0/2.0 OTG; USB 3.0/2.0/1.1 PHY; USB 3.0 SSIC; USB 2.0 HSIC
• MIPI: M-PHY Gear1/2/3; D-PHY 1.1; CSI-2/DSI; SoundWire/SLIMbus; DigRF v4; UniPro 1.6
• PCIe®: PCIe 4.0; PCIe 3.0; PCIe 2/1; PCIe 4.0 PHY; PCIe 3.0 PHY; PCIe 2/1 PHY; M-PCIe™
• Display: HDMI 1.4/2.0; MHL 3.x; DP 1.2/eDP 1.4; HDCP 2.2

Memory (Controllers and PHYs)
• DRAM: DDR4; DDR3; DDR1/2; LPDDR4; LPDDR3; LPDDR1/2; DDR/LPDDR3/4 Combo PHY
• Storage: eMMC; Toggle 1/2; SD/SDIO; UFS 2.0; ONFi 1/2/3; ONFi 4; SATA PHY Gen 1,2,3 3.1
• Specialty Memory: HMC SR PHY; Wide I/O; HBM

Analog Mixed Signal
• 16Gbps Multi-protocol SerDes; 10Gbps Multi-protocol SerDes; 6Gbps Multi-protocol SerDes
• 3.2GHz 7bit ADC/DAC; 320MHz 12bit ADC
• PLL/DLL; VT Monitors

© 2015 Cadence Design Systems, Inc. All rights reserved.

Cadence VIP Catalog product line

Supports all major simulators and verification languages

Legend: Recent Addition; * beta

Simulation VIP
ARM AMBA 5 CHI; ARM AMBA 4 ACE; ARM AMBA AXI 3/4; ARM AMBA AHB; ARM AMBA 4 Stream; CAN; Display Port; Ethernet 10/100 1G/10G; Ethernet 25G/50G; Ethernet 40G/100G; HDMI 1.4; HDMI 2.0; I2C; JTAG cJTAG; LIN; MHL 3.0; MIPI CSI-2; MIPI CSI-3; MIPI C-PHY; MIPI DigRF; MIPI D-PHY; MIPI DSI incl. DBI, DPI; MIPI DSI2 incl. DBI, DPI; MIPI LLI 2.0; MIPI M-PHY; MIPI SLIMbus; MIPI SoundWire; MIPI UniPro; NVM Express; OCP 2.2; OCP 3.0; PCI; PCIe Gen2; PCIe Gen3; PCIe Gen4; PCIe SR-IOV; PCIe MR-IOV; M-PCIe; PLB 6; SAS 6G; SAS 12G; SATA 6G; SRIO 2.1; SRIO 3.0; UART; USB 2.0 w/ OTG; USB 3.0 w/ OTG; USB 3.1 w/ OTG; USB SSIC; Wireless 802.11 MAC

Memory Models
Cellular SRAM; Compact FLASH; DDR DIMM; DDR SDRAM; DDR Sync GFX RAM; DDR Sync RAM; DDR2; DDR3; DDR4 Incl. 3DS; DDR4 LRDIMM; DDR4 SDRAM; Delay line; DFI; Embed. SSRAM; Embed. SSRAM TI; eMMC 4.4; eMMC 4.5; eMMC 5.0; Enhanced SDRAM; FCRAM; FIFO; FLASH (basic); FLASH ONFi; Flash ONFi 3/4; FLASH PPN DDR; FLASH Toggle NAND; FLASH Toggle NAND 2; GDDR2; GDDR3; GDDR4; HBM; HMC; LBA NAND; LL DRAM; LPDDR; LPDDR2; LPDDR3; LPDDR4; LR DIMM; Memory Stick; Memory Stick Pro; NAND FLASH; NOR FLASH Spansion; One NAND FLASH; PROM; Pseudo Burst SRAM; QDR SRAM; Rambus DRAM; Rambus Turbo Mode; Register File; RL DRAM; Scratch pad; SD Card; SD Card 3.0; SD Card 4.0; SDIO; Synch DRAM; Synch Mask ROM; Synch RAM NEC; UFS 1.0; UFS 2.0; Wide I/O; Wide I/O 2

Assertion-Based VIP
ARM AMBA ACE; ARM AMBA AXI; ARM AMBA AHB; DFI; OCP

Accelerated VIP
ARM AMBA 5 CHI*; ARM AMBA 4 ACE; ARM AMBA AXI 3/4; ARM AMBA AHB; ARM AMBA APB; Ethernet 10/100 1G/10G; Ethernet 40G/100G; Ethernet 25G/50G*; HDMI 1.4; HDMI 2.0*; I2C; I2S; Keypad; MIPI CSI-2; MIPI DBI; MIPI DSI; MIPI UniPro*; NVM Express*; PCIe Gen2/3; PCIe SR-IOV*; SATA 3G/6G Device; SIM Card; USB 2.0 w/ OTG*; USB 3.0 Host*

Productivity Tools
PureView; Indago Protocol Debug App; Interconnect Validator Basic; Interconnect Validator Coherent; Interconnect Workbench; TripleCheck Ethernet 40G/100G; TripleCheck MIPI UniPro; TripleCheck PCI Express


Xtensa® Platform for Innovation

Common Architecture, Development Tools, 3rd Party Ecosystem

Control: Embedded
• Advanced compiler and development tools support
• Energy & area efficient

HiFi: Audio / Voice / Speech
• Encode + Decode
• Voice Trigger
• Noise Reduction
• Post-Processing

ConnX: Communications
• Narrow to wide band wireless
• LTE/LTE-A, WiFi, SmartGrid
• Infrastructure & Terminals

Vision: Vision / Imaging
• Image Processing & Analytics
• Video Pre-Post Processing

Fusion: General Purpose
• Scalable DSP
• Always-alert
• Sensor processing
• Audio / Voice / Speech
• Comms

Custom ISA: Application-specific
• Low Energy
• High Performance
• Application datatypes

Wide range of DSPs

Thousands of designs, >2B cores per year

Used across most electronics markets: IoT, Mobile Phones, Storage/SSD, Networking, Video, Security, Cameras, Watches, Printers, others...

Tensilica IP is #2 overall in shipments of royalty-bearing, licensable processors (CPU or DSP)


Tensilica automated hardware/software synthesis

Xtensa Processor Generator

Processor configuration
1. Select from menu
2. Explicit instruction description (TIE)

Processor Extensions

Full hardware design
Source pre-verified RTL, EDA scripts, test suite

Customized software tools
C/C++ compiler, Debuggers, Simulators, RTOSes

Architects’ needs:
• Higher compute throughput
• More memory bandwidth
• Higher efficiency control structure update
• Continuous software upgrade of EVERYTHING

No manual intervention needed in hardware or software completeness


Xtensa LX6

Block Diagram - System

[Block diagram; labels listed by area]

• Key: Base ISA Feature; Configurable Function; Optional Function; Optional & Configurable Function; Designer-Defined Features (TIE); External RTL & Peripherals
• Execution: Instruction Fetch / Decode; Base ISA Execution Pipeline; Base Register File; Base ALU; Optional Functional Units; Designer-Defined Functional Units; VLIW (FLIX) Parallel Execution pipelines; Register Files; Processor State
• Load/store: Data Load/Store Unit; Dual Load/Store Unit
• Local memories and caches: Instruction RAM x2; Instruction ROM; Instruction Cache; Data RAM x2; Data ROM; Data Cache; Prefetch; XLMI Local Memory Interface
• Memory management: Inst. Memory Management, Protection & Error Recovery; Data Memory Management, Protection & Error Recovery
• Processor Controls: Trace Port; JTAG Tap Control; On-Chip Debug; Data Address Watch Registers; Instruction Address Watch Registers; Timers; Interrupt Control; Exception Support; Exception Registers
• External Interface: Processor Interface Control; Write Buffer; PIF Bridge; Bus Bridge AHB-Lite/AXI; System Bus; RAM; DMA; Device
• Designer-Defined Queues, Ports & Lookups: QIF32; GPIO32; RTL, FIFO, Memory, Xtensa


Example: Flash-Based Memory Systems with Smart Storage Processors

[Block diagram]

• Host Interface: PCIe PHY and PCIe Controller, PCIe 4 @ x 4 lanes: 16Gbps * 4 lanes = 8GB/s peak
• NVMe Processor 0; software-managed SRAM block buffer (~256KB)
• Storage Processors 1 . . . 8; AES Encrypt, LZ77 Compress and Mux blocks
• NAND Flash CTRL 1 . . . 16 with Flash PHYs, CE(7:0) and D(7:0) per channel: 400MB/s per channel x 16 channels = 6.4GB/s peak
• NAND FLASH capacity: 4-32 TB Flash
• DRAM: DDR Controller and DDR PHY to DDR SDRAM: 3,200MT/s * 64 b = 25.6GB/s peak
• AXI Interconnect Fabric; AXI-APB to APB Interconnect Bus with UART, GPIO, Timers, I2C
• Boot ROM I/F; SPI NOR FLASH
• Other I/O
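The peak-bandwidth figures on this slide follow from simple unit conversions. The sketch below reproduces them, assuming raw line rates with no protocol or encoding overhead; the function names are illustrative and do not come from the slide.

```python
def ddr_peak_gb_s(mega_transfers_per_s: float, bus_bits: int) -> float:
    """Peak DRAM bandwidth in GB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * 1e6 * (bus_bits / 8) / 1e9


def pcie_peak_gb_s(gbps_per_lane: float, lanes: int) -> float:
    """Raw PCIe bandwidth in GB/s (assumption: encoding overhead ignored)."""
    return gbps_per_lane * lanes / 8


def flash_peak_gb_s(mb_s_per_channel: float, channels: int) -> float:
    """Aggregate NAND channel bandwidth in GB/s."""
    return mb_s_per_channel * channels / 1000


ddr = ddr_peak_gb_s(3200, 64)     # 3,200 MT/s on a 64-bit bus
pcie = pcie_peak_gb_s(16, 4)      # 16 Gbps per lane, 4 lanes
flash = flash_peak_gb_s(400, 16)  # 400 MB/s per channel, 16 channels

print(f"DDR {ddr:.1f} GB/s, PCIe {pcie:.1f} GB/s, flash {flash:.1f} GB/s")
```

Under these assumptions the host link (8 GB/s) is slightly faster than the aggregate flash channels (6.4 GB/s), and the DRAM interface (25.6 GB/s) has headroom over both.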


Agenda

• A mainstream view of big compute and advanced EDA/IP


− The new data imperative

− Energy and layered cognition

− Scaling to ultra-low power



The New Data Imperative: Compute locally, share globally!

[Chart: energy per 64b (Joules, 1E-15 to 1E-03) versus physical scale (meters, 1E-06 to 1E+06), 16nm. Points: 16FF NAND gate, ALU operation, register file, L1 cache (8KB), L2 cache (32KB), L3 cache (1MB), DRAM access, disk access, cloud access]

Sources:

• Rick Zarr, TI, 2008, The True Cost of an Internet “Click”: estimate of transfer cost for 30KB page from server: http://energyzarr.typepad.com/energyzarrnationalcom/2008/08/the-true-cost-o.html

• J. Kunkel et al., University of Hamburg, 2010, Collecting Energy Consumption of Scientific Data

• Horowitz, ISSCC 2014: 1300-2600 pJ per 64b access


Energy and layered cognition

[Diagram: processing layers by Active Power per Processor. Pyramid labels: Always Alert: Real-time base compute; Application Filter; Cloud server farm]

Quiescent system: nW

Always Alert: 1-1000 uW

Core real-time vision

Subsystem processing: 1mW - 50 mW

Outer-loop feature reduction

System application processing: 50mW-1000mW

Local filesystem access

Cloud application processing: >> 1Watt

Database search


Scaling to ultra-low-energy

Xtensa core scales from ~1uW/MHz to 1.5GHz

Near-threshold core development

• SPICE (Spectre APS) simulation of full Tensilica base 32b processor core

• Verifies correctness, speed and power independent of library characterization

• Full parasitics-extracted circuit gives accurate power and functionality

• Running basic power and functionality diagnostics

• Optimally-pipelined core runs >1.5GHz in 16FF, but the same RTL also achieves micro-power levels. Sustains high MHz at lower voltage

[Plots, TT corner, for 28nm 9T, 28nm 7T and 16FF: Processor Active Power (uW/MHz, axis 0 to 6) and Processor Leakage Power (uW, axis 0 to 20), each versus Supply Voltage (0.4 to 0.9)]


Agenda

• A mainstream view of big compute and advanced EDA/IP



− Vision DSPs

− Convolutional Neural Network compute

− Hybrid architectures for high-density computation


• A family of high-performance DSPs for imaging, video and vision

• Rich SIMD/VLIW architecture

o 4-way instruction issue

o Up to 200 separate ALU operations per cycle

• Huge pixel bandwidth

o Integrated DMA for data streaming

o >2000b per cycle data memory bandwidth

• Rich software environment

o World’s best DSP C compilers: 0 assembly code

o Full OpenCV and OpenVX support with 800 optimized functions

o Wide third-party program and support

Tensilica IVP Vision DSPs

[Block diagram: IVP-EP core with ICache, IRam 0, IRam 1, and DRam 0 and DRam 1 (Bank 0 and Bank 1 each) behind a Banked Memory Interface Mux with 512-bit paths; DMA; Interrupt; PIF Master and PIF Slave (128-bit) through a PIF Mux to PIF - AXI; some blocks marked optional. High-bandwidth Network on Chip to DDR and CPUs, with DDR Ctrl, DDR PHY and DDR devices]


IVP-EP

A new category of application performance

[Chart, log scale 1 to 1000, RISC = 1. Series: Speedup: Raw Performance; Speedup: Autovector C; Speedup: App Major Function]

Example: Convolutional Neural Network (CNN)

Sorting through the hype

Input -> Convolution -> Non-linearity -> Sub-sampling -> Repeat … -> Classifier -> Result: face identified

• Convolution: convolutions model locally receptive visual cortex cells by sampling a small region and generating features

• Non-linearity: a non-linearity like the tanh function models the on-off behavior of neurons

• Sub-sampling: subsampling models cells with larger receptive fields (provides local invariance)

• Repeat: repeat previous steps for neural network layers

• Classifier: final classifier stage

Convolution, non-linearity and sub-sampling together form one layer.
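The three stages that make up one layer (convolution, tanh non-linearity, sub-sampling) can be sketched in a few lines of plain Python. This is an illustrative sketch only: average pooling over 2x2 blocks is an assumed form of sub-sampling, and the image and kernel values are hypothetical.

```python
import math


def convolve2d(image, kernel):
    """Valid 2-D convolution (correlation form): one output per kernel offset."""
    n = len(kernel)
    rows = len(image) - n + 1
    cols = len(image[0]) - n + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(n) for j in range(n))
             for c in range(cols)]
            for r in range(rows)]


def nonlinearity(fmap):
    """Element-wise tanh, squashing each feature into (-1, 1)."""
    return [[math.tanh(v) for v in row] for row in fmap]


def subsample(fmap, step=2):
    """Average non-overlapping step x step blocks (assumed pooling choice)."""
    rows = len(fmap) // step
    cols = len(fmap[0]) // step
    return [[sum(fmap[r * step + i][c * step + j]
                 for i in range(step) for j in range(step)) / (step * step)
             for c in range(cols)]
            for r in range(rows)]


def cnn_layer(image, kernel):
    """One layer: convolution, then non-linearity, then sub-sampling."""
    return subsample(nonlinearity(convolve2d(image, kernel)))


# Hypothetical 6x6 input and 3x3 edge-like weights.
image = [[float((r + c) % 3) for c in range(6)] for r in range(6)]
kernel = [[0.1, 0.0, -0.1] for _ in range(3)]
features = cnn_layer(image, kernel)
print(len(features), "x", len(features[0]), "feature map")
```

A 6x6 input with a 3x3 kernel gives a 4x4 convolution output, which 2x2 sub-sampling reduces to a 2x2 feature map. Repeating `cnn_layer` on its own output corresponds to the "Repeat" stage.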

• At every offset in image at multiple scales: K

• Do recognition by applying convolution at every offset in patch: L

• Each convolution spans M results of prior layer

• Sum of products of pixels x weights in NxN sub-patch

Compute for layer: O(K * L * M * N * N)

Up to 30 x 10^12 MACs

Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, What is the Best Multi-Stage Architecture for Object Recognition?
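The per-layer cost O(K * L * M * N * N) can be made concrete with a small calculation. The sizes below are hypothetical and chosen only to show how the count scales; they are not taken from the slide.

```python
def layer_macs(k: int, l: int, m: int, n: int) -> int:
    """MACs for one layer: K image offsets x L patch offsets x M inputs x N*N weights."""
    return k * l * m * n * n


# Hypothetical sizes: 1000 image offsets, 1024 patch offsets,
# 16 prior-layer results, 5x5 sub-patch.
macs = layer_macs(k=1000, l=1024, m=16, n=5)
print(f"{macs:,} MACs")  # 409,600,000 MACs

# Doubling the kernel edge N quadruples the work; doubling K, L or M doubles it.
quadrupled = layer_macs(1000, 1024, 16, 10)
doubled = layer_macs(2000, 1024, 16, 5)
```

Because N enters squared, kernel size is the most expensive knob, while the number of offsets and prior-layer results scale the cost linearly.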


Vision supercomputing

• 80-90% of internet traffic is video/image

• Expect image recognition, video analytics, augmented reality, visual AI to grow to similar % of compute

• By 2020, world’s image sensors will theoretically capture 10^28 B/year

• Key characteristics of visual HPC:

1. Applications in flux

2. High-bandwidth data * very large parameter sets – e.g. for CNN weights

3. Need extreme compute efficiency, yet high flexibility in algorithms, data reference patterns and data types.

• Expect “cognitive layering” in computing architectures – 95% of ops in 5% of code

[Diagram: arrays of Cognitive Engines and Data processors alongside Application CPUs, with DDR Resources, Persistent Storage Resources and Streaming IO Resources, attached to a Conventional Server Network]

Conventional Server Network


CNN Engine Example

• Demonstration of a specialized DSP architecture using TIE on Xtensa – source available

• Very high throughput (256 16x8 MAC/cycle) on filter and convolution algorithms for vision and object recognition

• Architecture:
o 512 x 8 data vectors (3r/1w)
o 2048 x 2 accumulators (1r/1w)
o 512 x 2 align regfile (1r/1w)
o 2 slot, 40b instruction width, 47 added ops
o 48GB/s local data BW/core

• High compute density:
o 4x higher efficiency than IVP-EP
o 9T full layout
o With 32KB I / 64KB D
o 0.95 mm2 @ 750MHz
o 200 GMACS/mm2
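The headline numbers for the CNN engine are mutually consistent, which a few lines of arithmetic can confirm. The sketch assumes that the 48GB/s local bandwidth figure corresponds to one 512-bit vector access per cycle at 750MHz; that reading is an assumption, not something the slide states.

```python
MACS_PER_CYCLE = 256   # 16x8 MACs issued per cycle
CLOCK_HZ = 750e6       # 750 MHz
AREA_MM2 = 0.95        # 9T full layout with 32KB I / 64KB D
VECTOR_BITS = 512      # data vector width

# Throughput in giga-MACs per second.
gmacs = MACS_PER_CYCLE * CLOCK_HZ / 1e9

# Compute density in GMACS per square millimetre.
density = gmacs / AREA_MM2

# Local data bandwidth in GB/s, assuming one 512-bit access per cycle.
local_bw_gb_s = (VECTOR_BITS / 8) * CLOCK_HZ / 1e9

print(f"{gmacs:.0f} GMACS, {density:.0f} GMACS/mm2, {local_bw_gb_s:.0f} GB/s")
```

This gives 192 GMACS and roughly 202 GMACS/mm2, in line with the rounded 200 GMACS/mm2 on the slide, and 48 GB/s under the stated bandwidth assumption.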

[Chart: CNN Engine Area vs. MHz (WC). X axis: MHz, 0 to 1000. Y axis: Area, 0 to 800,000 (axis label reads mm2). Series: 28hpm 12t synth cell area; 28hpc 7t synth cell area; 28hpc 9t synth cell area; 28hpc 9t layout core area]


Huge upside in vision platform performance: >60 Trillion ops per 100mm2

[Charts: Leading ADAS Platform Performance, Peak Processor Rate (GOPS) versus date (2013 to 2016) for platforms A, B, C and D, drawn at three scales (up to 2,000, 16,000 and 70,000 GOPS), with IVP-EP MP (100 mm2) and Xtensa CNN MP (100 mm2) shown. Core + memory, 28nm]


A closing thought

Specialization proliferation of designs

[Diagram: All SOC Designs]

• General-purpose all-in-one system platforms: cellphone, PC, server, gateway – extreme scale SoCs

• Specialized low-cost, low-energy platforms: wearables, smart-home, medical, industrial, mil-aero, sensor-rich extreme fit SoCs

[Chart: Number of Current Species, log scale 1,000 to 1,000,000. Values shown: 1,000,000; 102,248; 85,000; 68,827; 47,000; 31,300; 9,998; 9,084; 6,433; 5,490]


Agenda

• A mainstream view of big compute and advanced EDA/IP

− System Design Enablement

− Design and Verification across digital and high-speed analog

− Universal adoption of IP-rich development

− New roles for processors


− The new data imperative

− Energy and layered cognition

− Scaling to ultra-low power


− Vision DSPs

− Convolutional Neural Network compute

− Hybrid architectures for high-density computation

