Novel NanoSystems to Enable AI
Department of EE & Department of CS
Stanford University
Subhasish Mitra
Thanks: Students, Sponsors, Collaborators
Edge to Cloud
Abundant data
Military, Science, Health Care, Government, Genomics, Smart Cities, Security, Finance
World Relies on Computing
[Figure: design techniques (vertical axis) vs. device performance (horizontal axis)]
Few experimental demos
Device ≠ system
Option 1: Better Devices
[Figure: design techniques (vertical axis) vs. device performance (horizontal axis)]
Few “tricks”
Design complexity
Multi-core
Power / thermal
Option 2: Design Tricks
[Figure: design techniques (vertical axis) vs. device performance (horizontal axis)]
Multi-core
Power, thermal
Target: 1,000× performance
Improve Computing Performance
New innovations required
NanoSystems: new nanotech, new systems, new applications
Devices, fabrication, sensors
Imperfections? Large-scale fabrication? Variability?
New architectures
Abundant-Data Applications
Processors, accelerators
Compute: 5%
Memory: 95%
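The 5% / 95% split can be plugged into Amdahl's law to see why faster processors alone help little when memory dominates. This is a minimal sketch that assumes the percentages are fractions of execution time (the slide does not say whether they refer to time or energy). The function name and the 1,000× factors are illustrative.

```python
def overall_speedup(compute_fraction, compute_speedup, memory_speedup=1.0):
    """Amdahl's law for a workload split into a compute part and a memory part."""
    memory_fraction = 1.0 - compute_fraction
    new_time = compute_fraction / compute_speedup + memory_fraction / memory_speedup
    return 1.0 / new_time

# Assumption: 5% of the time is compute, 95% is spent on memory accesses.
only_compute = overall_speedup(0.05, compute_speedup=1000.0)
both = overall_speedup(0.05, compute_speedup=1000.0, memory_speedup=1000.0)

print(f"1,000x faster compute only:       {only_compute:.3f}x overall")
print(f"1,000x faster compute and memory: {both:.0f}x overall")
```

Under this assumption, speeding up compute alone gives barely more than 1.05× overall, which is one way to read the "memory wall".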
Chip realization?
Memory Wall
Brain-inspired ⊃ Neural Nets
• Compute + memory
• Dense connectivity
• Energy efficiency
• Footprint
Computation immersed in memory
Memory
Increased functionality
Ultra-dense 3D
Computing logic
Impossible with business as usual
N3XT NanoSystems
Nano-Engineered Computing Systems Technology
[Aly IEEE Computer 15, Proc. IEEE 19] Stanford + CMU + MIT + NTU Singapore + UC Berkeley + U. Michigan
DARPA 3DSoC Program
Max Shulaker
Anantha Chandrakasan
Subhasish Mitra, H.-S. Philip Wong, Simon S. Wong
Brad Ferguson
Mark Nelson, Jefford Humes
Carbon Nanotube FET (CNFET)
[Figure: CNFET images. Labels: CNT (d = 1.2 nm), gate, gated CNFET, sub-litho; scale bars: 2 µm]
Energy-delay product: ~10× benefit
Full-design level
Example: OpenSPARC T2 Processor Core
[Figure: total energy per cycle (nJ) vs. clock frequency (GHz) for FinFET, nanowire FET, and CNFET]
[Hills IEEE TNANO 18] Stanford + IMEC + TSMC
preferred
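The energy-delay product (EDP) combines the two axes of the plot: energy per cycle multiplied by cycle time (the inverse of clock frequency). The sketch below shows the arithmetic only. The operating points are hypothetical values chosen for illustration, not numbers read from the figure.

```python
def energy_delay_product(energy_per_cycle_nj, clock_ghz):
    """EDP per cycle in nJ*ns: energy per cycle times cycle time (1/f)."""
    cycle_time_ns = 1.0 / clock_ghz
    return energy_per_cycle_nj * cycle_time_ns

# Hypothetical operating points, used only to illustrate the calculation.
baseline = energy_delay_product(energy_per_cycle_nj=0.30, clock_ghz=1.0)
candidate = energy_delay_product(energy_per_cycle_nj=0.10, clock_ghz=3.0)

edp_benefit = baseline / candidate
print(f"EDP benefit: {edp_benefit:.1f}x")
```

With these made-up inputs, 3× lower energy per cycle combined with a 3× higher clock frequency multiplies out to a 9× EDP benefit, which is how gains on both axes compound.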
Big Benefits, Major (Past) Obstacles
[Zhang IEEE TCAD 12]
Mis-positioned CNTs, metallic CNTs
circa 2005
Imperfection-immune: Process + Design
Process alone inadequate
Stanford (Ph.D. student): First CNT computer (Stanford) [Nature 2013]
• 178 CNFETs: PMOS logic
• Single instruction (Turing complete)
• 1-bit data
MIT (Professor): CNT RISC-V (MIT, Analog Devices) [Nature 2019]
• 14,702 CNFETs: CMOS logic
• All RV32E instructions
• 16-bit data
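A single instruction can be Turing complete. One well-known scheme is SUBNEG (subtract and branch if negative), and the sketch below is a small interpreter for that style of machine. It is an illustration only: it uses Python integers rather than 1-bit data, and the memory layout, `run_subneg`, and the example program are assumptions made for this example.

```python
def run_subneg(memory, code_len, max_steps=10_000):
    """Run a SUBNEG program.

    Each instruction is three addresses (a, b, c):
        mem[b] = mem[b] - mem[a]
        if the result is negative, jump to c; otherwise go to the next instruction.
    Execution stops when the program counter leaves the code region.
    """
    mem = list(memory)
    pc = 0
    steps = 0
    while 0 <= pc < code_len and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] < 0 else pc + 3
        steps += 1
    return mem

# Data addresses (placed right after the 9 words of code).
X, Y, Z = 9, 10, 11

# Program: Y = Y + X, built from three SUBNEG instructions and a scratch cell Z.
program = [
    X, Z, 3,   # Z = Z - X   (Z becomes -X)
    Z, Y, 6,   # Y = Y - Z   (Y becomes Y + X)
    Z, Z, 9,   # Z = Z - Z   (clear scratch; next address 9 ends the program)
    3, 4, 0,   # data: X = 3, Y = 4, Z = 0
]

result = run_subneg(program, code_len=9)
print(result[Y])  # prints 7
```

Addition is synthesized from subtraction and a scratch cell. Other operations and control flow can be built the same way, which is why one instruction is enough.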
3D Integration
Massive ILV density >> TSV density
Dense, e.g., monolithic: conventional BEOL nano-scale inter-layer vias (ILVs)
TSV (chip stacking): through-silicon via (TSV)
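Via density scales with the inverse square of via pitch, so a modest-looking pitch reduction produces a very large density gap. The sketch below shows that relationship for a square grid. Both pitch values are hypothetical placeholders, since real pitches depend on the process.

```python
def vias_per_mm2(pitch_um):
    """Vias per square millimetre for a square grid at the given pitch (in µm)."""
    vias_per_mm = 1000.0 / pitch_um
    return vias_per_mm ** 2

# Hypothetical pitches for illustration only.
tsv_density = vias_per_mm2(pitch_um=10.0)   # micrometre-scale TSV pitch (assumed)
ilv_density = vias_per_mm2(pitch_um=0.1)    # nano-scale ILV pitch (assumed)

ratio = ilv_density / tsv_density
print(f"TSV: {tsv_density:,.0f} per mm^2")
print(f"ILV: {ilv_density:,.0f} per mm^2")
print(f"ILV / TSV density ratio: {ratio:,.0f}x")
```

Under these assumed pitches, a 100× smaller pitch gives a 10,000× higher connection density.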
Realizing Monolithic 3D
Emerging logic + emerging memory + monolithic 3D
Naturally enabled: < 400 °C
Combine device + architecture benefits
3D NanoSystem
Millions of sensors
Memory: 1 Megabit RRAM
CNT computing logic
Ultra-dense vertical connections
CNTs (×100,000)
Abundant data: Terabytes / second
In-situ classification: extensive, accurate
Classification accelerator
HD: Brain-Inspired ⊃ Neural Nets
[Wu ISSCC 18, IEEE JSSC 18] Stanford + UC Berkeley
HD = Hyperdimensional
CNT logic (1,952 CNFETs)
RRAM TCAM (224 RRAM cells)
Monolithic 3D: dense ILVs
Exploit: inherent variations, RRAM gradual Reset
Live ISSCC demo: one-shot learning, language classification
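The idea behind hyperdimensional (HD) one-shot language classification can be sketched in software. The example below is a generic textbook-style HD encoder: random bipolar hypervectors per letter, trigram binding by rotation and element-wise multiplication, and cosine similarity against one prototype per language. It is not the hardware implementation from the cited work. The dimensionality, training sentences, and function names are all assumptions for illustration.

```python
import random

DIM = 2000               # hypervector dimensionality (assumed for this sketch)
rng = random.Random(0)   # fixed seed so the run is repeatable
item_memory = {}         # one random bipolar hypervector per character

def hv_for(ch):
    """Return (and lazily create) the random +1/-1 hypervector for a character."""
    if ch not in item_memory:
        item_memory[ch] = [rng.choice((-1, 1)) for _ in range(DIM)]
    return item_memory[ch]

def rotate(v, k):
    """Cyclic shift by k positions; encodes a letter's position in the n-gram."""
    k %= len(v)
    return v[-k:] + v[:-k] if k else v

def encode(text, n=3):
    """Sum of n-gram hypervectors; each n-gram binds its rotated letter vectors."""
    total = [0] * DIM
    for i in range(len(text) - n + 1):
        gram = [1] * DIM
        for j, ch in enumerate(text[i:i + n]):
            shifted = rotate(hv_for(ch), n - 1 - j)
            gram = [g * s for g, s in zip(gram, shifted)]
        total = [t + g for t, g in zip(total, gram)]
    return total

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def classify(text, prototypes):
    """Pick the label whose prototype is most similar to the encoded query."""
    query = encode(text)
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))

# One-shot learning: a single toy training sentence per class.
prototypes = {
    "english": encode("the quick brown fox jumps over the lazy dog and then the dog sleeps"),
    "spanish": encode("el rapido zorro marron salta sobre el perro perezoso y luego duerme"),
}

label = classify("the dog and the fox", prototypes)
print(label)
```

Because random high-dimensional vectors are nearly orthogonal, trigrams shared between the query and a prototype dominate the similarity, so one training example per class is enough for this toy case.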
N3XT Simulation Framework
Explore architectures
Energy, execution time
Thermal, lifetime
Physical design, yield, reliability
Heterogeneous nanotechnologies
System analysis
Abundant-data apps
[Aly Proc. IEEE 19]
Massive Benefits
Deep Learning, Graph Analytics, …
[Figure: energy and execution-time benefits per application, log scale from 1× to 100×]
Benefits:
PageRank: 851×
Connected Components: 400×
Breadth-First Search: 510×
Linear Regression: 970×
Language model (LSTM neural network): 1,950×
AlexNet (neural network): 210×
~1,000× benefits, existing software
Cross-layer: device + circuit + architecture
Dense compute + thermal
New software optimizations
Many NanoSystems Opportunities
RRAM, Cross-Layer Solutions, Monolithic 3D
Endurance: ENDURER
Non-volatile
Multiple bits per cell
Low Resistance State
High Resistance State
Set
Reset
RRAM + Industry Silicon CMOS
CEA LETI
RRAM
Silicon CMOS compute
[Wu ISSCC 19] Stanford + CEA LETI + NTU Singapore
First Multiple bits-per-cell RRAM System
                             Bits per cell   Cells measured
Our work (new algorithms)    3               Full arrays
Prior work (ad hoc)          2-6.5           Single cell, few hand-picked cells
Neural nets
Optimized weight encoding
On-chip RRAM: multiple bits per cell
Cross-layer
2.3× more accurate inference (measured)
Same hardware, bigger neural net
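"Same hardware, bigger neural net" follows from simple capacity arithmetic: storing more bits in each cell lets the same array hold more weight bits. The sketch below shows that arithmetic and a plain uniform mapping from a weight to one of the resistance levels. The uniform mapping stands in for the optimized weight encoding mentioned above, which it does not reproduce. The array size and function names are hypothetical.

```python
def levels_per_cell(bits_per_cell):
    """Number of distinct resistance levels needed to store the given bits."""
    return 2 ** bits_per_cell

def storable_bits(num_cells, bits_per_cell):
    """Total bits an array of cells can hold."""
    return num_cells * bits_per_cell

def quantize_weight(w, bits_per_cell, w_min=-1.0, w_max=1.0):
    """Map a real-valued weight to a level index (uniform spacing assumed)."""
    levels = levels_per_cell(bits_per_cell)
    w = min(max(w, w_min), w_max)             # clamp into the representable range
    step = (w_max - w_min) / (levels - 1)
    return round((w - w_min) / step)

NUM_CELLS = 1_000_000  # hypothetical array size

one_bit = storable_bits(NUM_CELLS, bits_per_cell=1)
three_bit = storable_bits(NUM_CELLS, bits_per_cell=3)
print(f"Capacity gain at 3 bits per cell: {three_bit / one_bit:.0f}x")
print(f"Levels per cell at 3 bits: {levels_per_cell(3)}")
print(f"Level index for the largest weight: {quantize_weight(1.0, 3)}")
```

At 3 bits per cell each cell must hold one of 8 distinguishable levels, which is why measuring full arrays rather than a few hand-picked cells matters.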