Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Distribution A. Approved for public release; distribution is unlimitedDistribution A. Approved for public release; distribution is unlimited
NARESH
UNIVERSITY OF ILLINOIS AT
SHANBHAG
URBANA-CHAMPAIGN
Distribution A. Approved for public release; distribution is unlimited
MRAM-BASED DEEP IN-MEMORY ARCHITECTURES
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views,opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views orpolicies of the Department of Defense or the U.S. Government.
Distribution A. Approved for public release; distribution is unlimited
THE TEAM
Mike Burkland Naveen Verma(Princeton)
Pavan Hanumolu(UIUC)
Naresh Shanbhag(UIUC)
[systems & ckts][mixed-signal ICs][systems & ckts]
• L.-Y. Chen• P. Deaville• B. Zhang
• A. Patil, H. Hua, S. Gonugondla, K.-H. Kim
Ajey Jacob
[devices, foundry]
• S. Soss, D. Brown
• N. Gaul, B. Paul, B. Lanza
• J. Broussard• L. Suantak• J. Goulding• S. Cole• T. Gangwer• R. Kelley• C. Contreras
[DoD systems]
Raytheon MS GLOBALFOUNDRIES Princeton University of Illinois at Urbana-Champaign
Distribution A. Approved for public release; distribution is unlimited
REALIZING ARTIFICIAL INTELLIGENCE @ THE EDGE
AI at the Edge
model
data
AI in the Cloud(autonomous cognitive agent)
• client-server model• efficiency challenged• security + privacy issues
• energy-latency efficient, securebut……
• volume + resource constraineddecisions
[source: pexels.com][source: pexels.com]
Distribution A. Approved for public release; distribution is unlimited
THE DATA MOVEMENT PROBLEM𝑬𝑬𝒎𝒎𝒎𝒎𝒎𝒎𝑬𝑬𝒎𝒎𝒎𝒎𝒎𝒎
≈ ~𝟏𝟏𝟏𝟏𝟏𝟏 × (SRAM) → ~𝟓𝟓𝟏𝟏𝟏𝟏 × (DRAM) → ~𝟏𝟏𝟏𝟏𝟏𝟏𝟏𝟏 × (Flash)
[Canziani, Arxiv’16]
The Memory Wall in von Neumann Architectures
Distribution A. Approved for public release; distribution is unlimited
fundamental question
how do we design intelligent autonomous machinesthat operate at the limits of energy-delay-accuracy?
Distribution A. Approved for public release; distribution is unlimited
Proceedings of IEEE, Special Issue on non von Neumann Computing, January 2019.
Systems on Nanoscale Information fabriCswww.sonic-center.org
[2013-’17](STARnet Program by Semiconductor Research Corporation & DARPA)
Distribution A. Approved for public release; distribution is unlimited
SHANNON-INSPIRED MODEL OF COMPUTING
use information-based metrics e.g., mutual information 𝐼𝐼(𝑌𝑌𝑜𝑜; �𝑌𝑌)1design low SNR fabrics, e.g., deep in-memory architecture (DIMA)2
develop statistical error-compensation (SEC) techniques3
(noisy channel)
DIMA
[ISSCC’18]
Distribution A. Approved for public release; distribution is unlimited
integratethe MRAM device
within a
deep in-memory architecture (DIMA)
using
Shannon-inspired compute models
FRANC OBJECTIVEGLOBALFOUNDRIES PRINCETON & UIUC RAYTHEON MISSILE SYSTEMS
thermally limited
MRAM device
deep in-memory arch.
Shannon-inspired compute models
autonomous platforms
Distribution A. Approved for public release; distribution is unlimited
[ICASSP 2014 UIUC, JSSC 2017 Princeton, JSSC 2018 UIUC]
read functions of datastandard bit-cell array(preserves storage density)
THE DEEP IN-MEMORY ARCHITECTURE (DIMA)
Distribution A. Approved for public release; distribution is unlimited
SRAM DIMA PROTOTYPES
Multi-functionalinference processor
(65nm CMOS)
Random forest processor
(65nm CMOS)
On-chip training processor
(65nm CMOS)
Fully (128) row-parallel compute
(130nm CMOS)
100X EDP reduction over von Neumann equivalent* @ iso-accuracy* 8b fixed-function digital architecture with identical SRAM size
[Feb. JSSC’18] [ESSCIRC’17, JSSC July’18] [ISSCC’18, JSSC Nov.’18] [VLSI’16, JSSC’17]
Distribution A. Approved for public release; distribution is unlimited
MRAM OPPORTUNITIES & CHALLENGES
high conductance → large array currentssmall TMR → high sensitivity readouthuge 𝑅𝑅𝑀𝑀𝑀𝑀𝑀𝑀 − 𝑅𝑅𝑀𝑀𝑀𝑀𝑀𝑀 limits cell compute modelshigh cell density → severe pitch-matching const.high MTJ process variation restricts bit-line SNR
MTJ resistance distribution
Free Layer
Ref. Layer
Tunnel Barrier
Select Transistor
Top Electrode
BottomElectrode
WL
Source Line
pMTJ stack 40Mb MRAM demo
Challenges:Opportunities:high density → high on-chip compute densitynon-volatility → ultra low-power duty-cycled operationin production → enables system prototyping &
reduces time to DoD availability
Distribution A. Approved for public release; distribution is unlimited
MRAM-BASED DEEP IN-MEMORY ARCHITECTURES
four DIMAs proposed – two selected for prototypingBit-line Sensing (DIMA-BLS) Current Drive Time-based
Sensing (DIMA-CDTS)
𝑥𝑥[𝑖]
𝑀𝑀 × 𝑁𝑁 single-shot matrix vector multiply Functional read: 5-bit × 4-bit dot products on SLs Compute cell: 1T-1MTJ ( 4 × 1 multiplication) SL current sensing via time-based 5-bit ADCs SNR vs. energy trade-off due to sensing circuits
𝑀𝑀 × 𝑁𝑁 single-shot matrix vector multiply Functional read: binary dot products on BLs Compute cell: 2T-2MTJ (1b XNOR/AND) BL voltage sensing via 4-bit SAR ADCs SNR vs. energy trade-off due to sensing circuits
Distribution A. Approved for public release; distribution is unlimited
EDP GAINS OVER VON NEUMANN EQUIVALENT
(includes all peripherals)
43× EDP reduction (256-dimensional dot products)
250×-to-1000× EDP reduction (128-dimensional dot products)
DIMA-BLS
DIMA-CDTS
MVM: matrix-vector multiply
Distribution A. Approved for public release; distribution is unlimited
SHANNON-INSPIRED COMPUTE MODELS
• relaxes MRAM-based DIMA’s compute SNR requirements → enables substantial energy savings
• two methods: stochastic data-driven hardware resilience & coded DIMA
MRAM-based DIMA
use information-based metrics develop error-compensation techniques
Distribution A. Approved for public release; distribution is unlimited
STOCHASTIC DATA-DRIVEN HARDWARE RESILIENCE
• noise-tolerance enhanced well beyond MTJ variations
• can be exploited for energy-SNR tradeoff
MRAM-based DIMA InferenceModel Parameters𝜃𝜃𝑀𝑀𝑀𝑀𝑀𝑀𝑆𝑆𝑆𝑆(𝑥𝑥,𝐺𝐺)G (MRAM variation, ADC noise) gi
DIMA-aware Stochastic Training
[B. Zhang, ICASSP’19]
Simulated MRAM-based BNN (CIFAR-10):
Distribution A. Approved for public release; distribution is unlimited
CODED DIMA - ENHANCING DIMA’S COMPUTE SNRMTJ variation aware coding enables HIGH system SNR in spite of LOW circuit SNR
part
ial d
ecim
atio
n-in
-tim
eA
rchi
tect
ure
for m
appi
ng
FFT
on D
IMA
N-point Input Sequence
N/ M-pointInput
Sequence
N/ M-pointInput
Sequence
Input shuffling
N/ M-pointDFT
N/ M-pointDFT
Merge N/ M-point DFTs
N-point DFTCoefficients
N- p
oint
DFT
DFT, DHT, QFT
Coefficient precision (𝐵𝐵𝑊𝑊)
Out
put S
NR(d
B)
Limited byinput SNR
Limited byMRAM device
variation
Limited byquantization
noise
Input SNR: 40 dB; = 8; = 128
18dB
61 dB
41 dB
MRAM-DI MAWrite data
Functional read
Calculate error
Calibrate
write-time compensation (WTC)of MRAM variations
WTC
MTJ
𝐵𝐵𝑥𝑥 𝑁𝑁
Distribution A. Approved for public release; distribution is unlimited
CURRENT STATUS
Technology transition via SRAM-based DIMA DIMA prototype designs taped out
(2019) [single-bank] validate EDPgains & calibrate models
(2020) [multi-bank] enable scaling and system integration
DIMA technology transition to RMS foralgorithm prototyping & evaluation
facilitates future MRAM-based DIMAintegration by RMS
verified & quantified RMS’s mission capability enabled by MRAM-based DIMA MRAM-based DIMA demonstrates > 2 orders-of-magnitude EDP reduction
[H. Jia, arXiv:1811.04047, Princeton]
Distribution A. Approved for public release; distribution is unlimited
NEXT STEPS & SUMMARY
Mission: To realize > 200X in EDP gains in DoD workloads by integrating MRAM devicewithin Deep In-memory Architectures using Shannon-inspired compute models
FRANC Program: “to provide the foundation for new materials technology and newintegration approaches to be exploited in pursuit of novel compute architectures”
Multi-functioned DIMAs for Diverse Applications Parameterized DIMAs for Platform-design Tools
Summary
[Srivastava, et al., ISCA’18]