Upload
ngotuyen
View
213
Download
0
Embed Size (px)
Citation preview
Big shifts in EDA, IP and HPC
Dr. Chris Rowen
CTO – IP Group
Cadence Design Systems, Inc.
June 7, 2015
2 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Agenda
• A mainstream view of big compute and advanced EDA/IP− System Design Enablement
− Design and Verification across digital and high-speed analog
− Universal adoption of IP-rich development
− New roles for processors
• But the world is changing…
• Example: vision super-computing
3 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Big trends creating opportunities
4 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Cadence focus: System Design EnablementFrom end product down to chip level
Partnerships with Ecosystem Leaders
Mobile Consumer Cloud
Datacenter
Auto Medical IoT
CHIP(Core EDA)
Design and implementation
IP/SoC verification
On-chip protocol IP
Dataplane processor IP
Software drivers
SYSTEM
INTEGRATION
System analysis
Hardware-Software verification
System-level IP protocols
Software applications
Software development
PACKAGE and
BOARD
PCB design
Package design
PCB and package analysis
Chip-to-chip protocol IP
Allegro®
package
and PCB
design,
OrCad®
PCB design
Encounter®
digital
design
platform
Processor
IP, Design
IP and
Verification
IP
Virtuoso®
analog,
custom, RF,
mixed-
signal
design
platform
Incisive®
Verification
Platform
System
Design
Protium
Palladium®
hybrid, VSP,
Jasper
5 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
HPC: Compute + memory bandwidth + high-speed IODigitally-rich but high-speed analog also crucial
Platform
Platform
Spectre®
Platform
Platform technologies
Verification
DesignDigital Analog
6 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
• Digital implementation tools share a common infrastructure
• Analysis
• Tempus Timer - Same SDC and View setup files for whole flow
• Quantus Extractor - Same R/C rules for whole flow
• Voltus IR, Power and EM - Same EM for whole flow
• User Interface
• Consistent tcl commands run whole flow
• Easy to learn and reuse across different tools
• Powerful reporting utilities
• 10-20% better PPA
• Up to 10x TAT & capacity
New Integrated Digital Flow: Genus and Innovus
Com
mon T
em
pus T
imer
Com
mon Q
uantu
s E
xtr
action
Com
mon V
oltu
s I
R, P
ow
er
& E
M
Com
mon U
I
Com
mon R
eport
ing U
tilit
ies
RTL
GDS
SDC &
Views
Tcl
Script
Com
mon G
UI
HTML/
graphs
Genus Synthesis
Solution
Innovus
Implementation
System
Tempus Timing
Signoff Solution
Voltus Power
Integrity Solution
RTL Power
Solution
Quantus QRC
Extraction Solution
Physical
Verification
System
R/C
RulesEM
Rules
7 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Integration is not just about the silicon design
• Allegro platform: system-in-package
o Multi-radio gateway board & package design
o Sigrity for gateway board & package signal integrity analysis
o Mixed-signal design & verification across fabrics
o OrCAD platform: system-on-board
• Virtuoso & Encounter platforms: system-on-chip
• Incisive & Virtual System platforms: system-on-chip HW/SW verification
• Palladium & Rapid Prototyping platforms: full system development
8 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
IP – A Path for Getting to Market FasterCadence is a trusted partner enabling customer innovation
• Fastest Growing segment share
• 1st to adv. nodes with DDR, PCIe®
• Highest performance Analog IPDesign IP
• #1 in DSP IP licensing, #2 royalty bearing shipments
• Shipping over 2 Billion cores/year
• 225+ licensees world wideTensilica® IP
• #1 in Verification IP segment share
• 100% adoption by top 30 semiconductor companies
• 500+ customers world wideVerification IP
9 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Cadence IP PortfolioBroad and growing solution
Interfaces
Controllers and PHYs
EthernetUSB
MIPI
PCIe®Display
AnalogMemory
Controllers and PHYs
DRAM
DDR4
DDR3
DDR1/2
LPDDR1/2
DDR/LPDDR3/4
Combo PHY
10/100M MAC
10/40G MAC
1G PHY
10/100 PHY
SGMII/QSGMII
PHY
XAUI/RXAUI
PHY
Storage
eMMC
Toggle 1/2
SD/SDIO
UFS 2.0
Analog Mixed Signal
10Gbps Multi-
protocol SerDes
320MHz 12bit
ADC
PLL/DLL
6Gbps Multi-
protocol SerDes
VT Monitors
M-PHY Gear1/2/3
D-PHY 1.1
CSI-2/DSI
SoundWire/
SLIMbus
DigRF v4
UniPro 1.6
16Gbps Multi-
protocol SerDes
3.2GHz 7bit
ADC/DAC
Specialty Memory
HMC SR PHY
ONFi 1/2/3
Wide I/O
LPDDR4
LPDDR3
PCIe 4.0
PCIe 3.0
PCIe 2/1
PCIe 4.0 PHY
PCIe 2/1 PHY
M-PCIe™
PCIe 3.0 PHY
10G-KR PHY
USB 3.0/2.0/1.1
PHY
USB 3.0/2.0/1.1
Device
USB 3.0/2.0 Host
USB 3.0 SSIC
USB 2.0 HSIC
USB 3.0/2.0 OTG
SATA PHY
Gen 1,2,3 3.1
HDCP 2.2ONFi 4
HBM
USB 3.1 Host
USB 3.1 Device
USB 3.1 PHY
25/50/100G MAC
HDMI 1.4/2.0
MHL 3.x
DP 1.2/eDP 1.4
GPON, EPON PHY
1/2.5G MAC
Analog
Storage
© 2015 Cadence Design Systems, Inc. All rights reserved.
Cadence VIP Catalog product lineSupports all major simulators and verification languages
Recent
Addition
Assertion-Based VIP
Simulation VIP Memory Models Accelerated VIP
UFS 2.0 Wide I/OWide I/O
2
OCP
USB 3.0
w/ OTG
USB
SSIC
SD Card
4.0SDIO
Synch
DRAM
Synch
Mask
ROM
Synch
RAM NECUFS 1.0
ARM
AMBA
ACE
ARM
AMBA
AXI
ARM
AMBA
AHB
DFI
SATA 6G SRIO 2.1 SRIO 3.0 UART
USB 2.0
w/ OTG
Rambus
Turbo
Mode
Register
FileRL DRAM
Scratch
padSD Card
SD Card
3.0
PCIe
MR-IOVM-PCIe PLB 6 SAS 6G
SAS 12G
NOR
FLASH
Spansion
One
NAND
FLASH
PROM
Pseudo
Burst
SRAM
QDR
SRAM
Rambus
DRAM
PCIPCIe©
Gen2
PCIe
Gen3
PCIe
Gen4
PCIe
SR-IOV
LPDDR3 LPDDR4 LR DIMMMemory
Stick
Memory
Stick Pro
NAND
FLASH
MIPI
Sound
Wire
MIPI
UniPro
NVM
ExpressOCP 2.2
OCP 3.0
HBM HMCLBA
NANDLL DRAM LPDDR LPDDR2
MIPI
D-PHY
MIPI DSIincl.
DBI, DPI
MIPI LLI
2.0
MIPI
M-PHY
MIPI
SLIMbus
FLASH
PPN DDR
FLASH
Toggle
NAND
FLASH
Toggle
NAND 2
GDDR2 GDDR3 GDDR4
MHL 3.0MIPI CSI-
2
MIPI CSI-
3
MIPI
C-PHY
MIPI
DigRF
Enhance
d SDRAMFCRAM FIFO
FLASH
(basic)
FLASH
ONFi
Flash
ONFi 3/4
HDMI 1.4 HDMI 2.0 I2CJTAG
cJTAGLIN DFI
Embed.
SSRAM
Embed.
SSRAM
TI
eMMC 4.4 eMMC 4.5 eMMC 5.0
CANDisplay
Port
Ethernet
10/100
1G/10G
Ethernet
25G/50G
Ethernet
40G/100GDDR2 DDR3
DDR4
Incl. 3DS
DDR4
LRDIMM
DDR4
SDRAMDelay line
ARM
AMBA 5
CHI
ARM
AMBA 4
ACE
ARM
AMBA
AXI 3/4
ARM
AMBA
AHB
ARM
AMBA 4
Stream
Cellular
SRAM
Compact
FLASH
DDR
DIMM
DDR
SDRAM
DDR
Sync
GFX RAM
DDR
Sync
RAM
*beta
Wireless
802.11
MAC
ARM
AMBA 5
CHI*
ARM
AMBA 4
ACE
ARM
AMBA
AXI 3/4
ARM
AMBA
AHB
ARM
AMBA
APB
Ethernet
10/100
1G/10G
Ethernet
40G/100G
Ethernet
25G/50G*
HDMI 2.0* I2C I2SHDMI 1.4
MIPI CSI-
2MIPI DBI MIPI DSIKeypad
NVM
Express*
PCIe
Gen2/3
PCIe
SR-IOV*
MIPI
UniPro*
SIM CardUSB 2.0
w/ OTG*
USB 3.0
Host*
SATA
3G/6G
Device
MIPI DSI2
incl.
DBI, DPI
PureView
Indago
Protocol
Debug App
Interconnect
Validator
Basic
Interconnect
Validator
Coherent
Interconnect
Workbench
TripleCheck
Ethernet
40G/100G
TripleCheck
MIPI UniProTripleCheck
PCI Express
Productivity Tools
USB 3.1
w/ OTG
10
11 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
ControlEmbedded
• Advanced compiler and development tools support
• Energy & area efficient
HiFiAudio / Voice /
Speech
• Encode + Decode
• Voice Trigger
• Noise Reduction
• Post-Processing
ConnXCommunications
• Narrow to wide band wireless
• LTE/LTE-A, WiFi, SmartGrid
• Infrastructure & Terminals
VisionVision / Imaging
• Image Processing & Analytics
• Video Pre-Post Processing
FusionGeneral Purpose
• Scalable DSP
• Always-alert
• Sensor processing
• Audio / Voice / Speech
• Comm’s
Custom ISA
Application-specific
• Low Energy
• High Performance
• Application datatypes
Wide range of DSPs
Xtensa® Platform for InnovationCommon Architecture, Development Tools, 3rd Party Ecosystem
Thousands
of designs
>2B cores
per year
CustomControl
Used across most electronics marketsIoT, Mobile Phones, Storage/SSD, Networking, Video, Security, Cameras, Watches, Printers, others...
Tensilica IP is #2 overall in shipments of royalty-bearing,
licensable processors (CPU or DSP)
12 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Tensilica automated hardware/software synthesis
Processor
Extensions
Xtensa ProcessorGenerator
Full hardware designSource pre-verified RTL, EDA scripts, test suite
Customized software toolsC/C++ compiler Debuggers,
Simulators, RTOSes
1. Select from menu
2. Explicit instruction
description (TIE)
Processor configuration
Architects’ needs:
• Higher compute throughput
• More memory bandwidth
• Higher efficiency control
structure update
• Continuous software
upgrade of EVERYTHING
No manual intervention needed in
hardware or software completeness
13 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Xtensa LX6Block Diagram - System
VLIW (FLIX)
Parallel Execution
pipelines
Inst. Memory
Management,
Protection & Error
Recovery
Data Memory
Management,
Protection & Error
Recovery
Instruction
RAM x2Instruction
ROM
Data
RAM x2Data
ROM
External Interface
Processor
Interface
Control
Write
Buffer
PIF
Bri
dge
XLMI Local Memory
Interface
Base ISA Feature
Designer-Defined Features (TIE)
External RTL & Peripherals
Configurable Function
Optional Function
Optional & Configurable Function
QIF32
RTL, FIFO,
Memory,
Xtensa
GPIO32
Designer-Defined
Queues, Ports &
Lookups
KEY
Trace Port
JTAG Tap Control
Data Address
Watch Registers
Instruction Address
Watch Registers
Timers
Interrupt Control
On-Chip Debug
Processor Controls
Exception Support
Exception Registers
Base
Register
File
Data
Load/Store
Unit
Instruction Fetch / Decode
Base ISA
Execution
Pipeline
Base ALU
Optional Functional Units
Register Files
Processor State DeviceDevice
Bus
Bri
dge
AH
B-L
ite/AXI
RAM
DMA
Device
System
Bus
Designer-Defined
Dual Load/Store
Unit
Designer-Defined Functional Units
Register Files
Processor State
Register Files
Processor State
Instruction
Cache
Data
Cache
Prefetch
14 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Other I/O
Example: Flash-Based Memory Systems with Smart Storage Processors
NAND
Flash
CTRL 1
Boot
ROM I/F
NAND
Flash
CTRL 2
AXI Interconnect Fabric
CE(7:0)D(7:0)
CE(7:0)D(7:0)
AXI-APB APB Interconnect Bus
UART GPIO Timers I2C
PCIe PHY
Host
Interface
PCIe Controller
PCIe 4 @ x 4 lanes
Flash PHY Flash PHYDDR PHY
DDR SDRAM
SPI
NOR FLASH
Storage Processor
1
DDR Controller
DRAM
3,200MT/s * 64 b=
25.6GB/s peak
16Gbps * 4 lanes =
8GB/s peak
NVMe Processor
Software-managed SRAM block buffer (~256KB)
AES Encrypt
LZ77 Compress
Mux
NAND
Flash
CTRL 15
NAND
Flash
CTRL 16
CE(7:0)D(7:0)
CE(7:0)D(7:0)
Flash PHY Flash PHY
Storage Processor
8
AES Encrypt
LZ77 Compress
Mux
. . .
400MB/s per channel x 16 channels = 6.4GB/s peak
NAND FLASH NAND FLASH NAND FLASH NAND FLASH
Capacity
4-32 TB
Flash
NVMe Processor
0
NAND FLASH NAND FLASHNAND FLASH NAND FLASH
15 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Agenda
• A mainstream view of big compute and advanced EDA/IP
• But the world is changing…
− The new data imperative
− Energy and layered cognition
− Scaling to ultra-low power
• Example: vision super-computing
16 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
The New Data Imperative:Compute locally, share globally!
1E-15
1E-12
1E-09
1E-06
1E-03
1E-06 1E-03 1E+00 1E+03 1E+06
Energ
y p
er
64b (
Joule
s)
Physical Scale (meters)
Cloud access
Disk access
DRAM access
L3 cache (1MB)
L2 cache (32KB)
L1 cache (8KB)
Register file
Register file
ALU Operation
16FF NAND gate
Sources:
• Rick Zarr, TI, 2008- The True Cost of an Internet “Click” - estimate of transfer cost for 30KB page from server: http://energyzarr.typepad.com/energyzarrnationalcom/2008/08/the-true-cost-o.html
• J Kunkel et al, University of Hamburg 2010,Collecting Energy Consumption of Scientific Data
• Horowitz ISSCC 2014, 1300-2600 pJ per 64b access
16nm
17 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Energy and layered cognition
Quiescent system
nW
Active Power per
Processor
Always Alert:
Real-time base
compute
Application Filter
Cloud server farm
Always Alert
1-1000 uW
Core real-time vision
Subsystem processing
1mW - 50 mW
Outer-loop feature reduction
System application processing
50mW-1000mW
Local filesystem access
Cloud application processing
>> 1Watt
Database search
18 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Scaling to ultra-low-energyXtensa core scales from ~1uW/MHz to 1.5GHz
Near-threshold core development
• SPICE (SPECTRE APS) simulation of full
Tensilica base 32b processor core
• Verifies correctness, speed and power
independent of library characterization.
• Full parasitics extracted circuit gives
accurate power and functionality
• Running basic power and functionality
diagnostics
• Optimally-pipelined core runs >1.5GHz
in16FF, but same RTL also achieves
micro-power level. Sustains high MHz at
lower voltage
0
2
4
6
0.4 0.5 0.6 0.7 0.8 0.9
Pro
cesor
Active P
ow
er
(uW
/MH
z)
Processor Active Power
0
5
10
15
20
0.4 0.5 0.6 0.7 0.8 0.9
Pro
cesorL
eakage P
ow
er
(uW
)
Supply Voltage
Processor Leakage Power
28nm 9T
28nm 7T
16FF
TT corner
19 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Agenda
• A mainstream view of big compute and advanced EDA/IP
• But the world is changing…
• Example: vision super-computing
− Vision DSPs
− Convolutional Neural Network compute
− Hybrid architectures for high-density computation
20 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
• A family of high-performance DSPs for imaging, video and vision
• Rich SIMD/VLIW architecture
o 4-way instruction issue
o Up to 200 separate ALU operations per cycle
• Huge pixel bandwidth
o Integrated DMA for data streaming
o >2000b per cycle data memory bandwidth
• Rich software environment
o World’s best DSP C compilers: 0 assembly code
o Full OpenCV and OpenVX support with 800 optimized functions
o Wide third-party program and support
Tensilica IVP Vision DSPs
PIF - AXI
128
S
S
M
M
128
DM
A
-
Bank 0 Bank 1 Bank 0 Bank 1
IVP-EPcore
WER
RER
DR1
PIF Master
PIF Slave
512
BInterrupt
128128128
128
DRam 0 DRam 1
512512512512
128
Banked Memory Interface Mux
512512512512
IRam 0
IRam 1
DR0
ICache
PIF Mux
128
128
optional
optional
optional
High-bandwidth Network on
Chip to DDR and CPUs
DDR
Ctrl
DDR
PHY
DDRDDR
DDRDDR
21 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
IVP-EPA new category of application performance
1
10
100
1000
RIS
C =
1
Speedup:Raw Performance
Speedup:Autovector C
Speedup:App Major Function
22 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Input Convolution Non-linearity Sub-sampling Repeat … ClassifierResult – face
identified
Example: Convolutional Neural Network (CNN)Sorting through the hype
Convolutions model locally
receptive visual cortex cells
by sampling a small region
and generating features
Non-linearity like
tanh function
models on-off
behavior of
neurons
Subsampling
models cells with
larger receptive
fields (provide local
invariance)
Repeat previous
steps for neural
network layers
Final classifier
stage
one layer
At every offset in
image at multiple
scales: K
Do recognition by
applying convolution
at every offset in
patch: L
Each convolution spans M
results of prior layer
Sum of products of pixels x
weights in NxN sub-patch
Compute for layer:
O (K * L * M * N * N)
Up to 30 x 1012 MACsKevin Jarrett , Koray Kavukcuoglu , Yann Lecun
What is the Best Multi-Stage Architecture for
Object Recognition
23 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Vision supercomputing
• 80-90% of internet traffic is video/image
• Expect image recognition, video analytics, augmented reality, visual AI to grow to similar % of compute
• By 2020, world’s image sensors will theoretically capture 1028B/year
• Key characteristics of visual HPC: 1. Applications in flux
2. High-bandwidth data * very large parameter sets – e.g. for CNN weights
3. Need extreme compute efficiency, yet high flexibility in algorithms, data reference patterns and data types.
• Expect “cognitive layering” in computing architectures – 95% of ops in 5% of code
Cognitive
EnginesCognitive
EnginesCognitive
EnginesCognitive
Engines
Cognitive
EnginesCognitive
EnginesCognitive
EnginesCognitive
Engines
Data
processors
Data
processors
Application CPUs
DDR Resources
Persistent Storage
Resources
Streaming IO
Resources
Conventional Server Network
24 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
CNN Engine Example
• Demonstration of a specialized DSP architecture using TIE on Xtensa –source available
• Very high throughput (256 16x8 MAC/cycle) on filter and convolution algorithms for vision and object recognition
• Architecture:• 512 x 8 data vectors (3r/1w)
• 2048 x 2 accumulators (1r/1w)
• 512 x 2 align regfile 1r/1w)
• 2 slot, 40b instruction width, 47 added ops
• 48GB/s local data BW/core
• High compute density:• 4x higher efficiency than IVP-EP
• 9T full layout
• With 32KB I/ 64 KB D
• 0.95mm2 @ 750MHz
• 200 GMACS/mm2
-
100,000
200,000
300,000
400,000
500,000
600,000
700,000
800,000
0 200 400 600 800 1000
Are
a (m
m2)
MHz
CNN Engine Area vs. MHz (WC)
28hpm 12t synth cell area
28hpc 7t synth cell area
28hpc 9t synth cell area
28hpc 9t layout core area
25 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Huge upside in vision platform performance: >60 Trillion ops per 100mm2
A
A
B
CD
0
2000
Nov-13 Apr-15 Aug-16
Peak P
rocessor
Rate
(GO
PS
)
Leading ADAS Platform Performance
A
A
B
CD0
2,000
4,000
6,000
8,000
10,000
12,000
14,000
16,000
Aug-13 Dec-14 May-16
AA
B CD02,0004,0006,0008,000
10,00012,00014,00016,00018,00020,00022,00024,00026,00028,00030,00032,00034,00036,00038,00040,00042,00044,00046,00048,00050,00052,00054,00056,00058,00060,00062,00064,00066,00068,00070,000
Nov-13 Apr-15 Aug-16
IVP-EP MP
(100 mm2)Xtensa CNN
MP (100 mm2)
Core + memory 28nm
26 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
A closing thoughtSpecialization proliferation of designs
All S
OC
Desig
ns
General-purpose all-in-one
system platforms:
cellphone, PC, server,
gateway – extreme scale
SoCs
Specialized low-cost, low-energy
platforms: wearables, smart-home,
medical, industrial, mil-aero,
sensor-rich extreme fit SoCs
1,000,000
102,248 85,000 68,82747,000
31,300
9,998 9,0846,433 5,490
1,000
10,000
100,000
1,000,000
Nu
mb
er
of C
urr
en
t S
pe
cie
s
27 ••© 2014 Cadence Design Systems, Inc. All rights reserved.
Agenda
• A mainstream view of big compute and advanced EDA/IP− System Design Enablement
− Design and Verification across digital and high-speed analog
− Universal adoption of IP-rich development
− New roles for processors
• But the world is changing…
− The new data imperative
− Energy and layered cognition
− Scaling to ultra-low power
• Example: vision super-computing
− Vision DSPs
− Convolutional Neural Network compute
− Hybrid architectures for high-density computation
28 ••© 2014 Cadence Design Systems, Inc. All rights reserved.