naresh - ERI SummitNARESH UNIVERSITY OF ILLINOIS AT SHANBHAG. URBANA-CHAMPAIGN. Distribution A. Approved for public release; distribution is unlimited. MRAM-BASED DEEP IN-MEMORY ARCHITECTURES

Distribution A. Approved for public release; distribution is unlimitedDistribution A. Approved for public release; distribution is unlimited

NARESH

UNIVERSITY OF ILLINOIS AT

SHANBHAG

URBANA-CHAMPAIGN

Distribution A. Approved for public release; distribution is unlimited

MRAM-BASED DEEP IN-MEMORY ARCHITECTURES

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views,opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views orpolicies of the Department of Defense or the U.S. Government.


THE TEAM

Mike Burkland Naveen Verma(Princeton)

Pavan Hanumolu(UIUC)

Naresh Shanbhag(UIUC)

[systems & ckts][mixed-signal ICs][systems & ckts]

• L.-Y. Chen• P. Deaville• B. Zhang

• A. Patil, H. Hua, S. Gonugondla, K.-H. Kim

Ajey Jacob

[devices, foundry]

• S. Soss, D. Brown

• N. Gaul, B. Paul, B. Lanza

• J. Broussard• L. Suantak• J. Goulding• S. Cole• T. Gangwer• R. Kelley• C. Contreras

[DoD systems]

Raytheon MS GLOBALFOUNDRIES Princeton University of Illinois at Urbana-Champaign


REALIZING ARTIFICIAL INTELLIGENCE @ THE EDGE

AI at the Edge

model

data

AI in the Cloud(autonomous cognitive agent)

• client-server model• efficiency challenged• security + privacy issues

• energy-latency efficient, securebut……

• volume + resource constraineddecisions

[source: pexels.com][source: pexels.com]


THE DATA MOVEMENT PROBLEM𝑬𝑬𝒎𝒎𝒎𝒎𝒎𝒎𝑬𝑬𝒎𝒎𝒎𝒎𝒎𝒎

≈ ~𝟏𝟏𝟏𝟏𝟏𝟏 × (SRAM) → ~𝟓𝟓𝟏𝟏𝟏𝟏 × (DRAM) → ~𝟏𝟏𝟏𝟏𝟏𝟏𝟏𝟏 × (Flash)

[Canziani, Arxiv’16]

The Memory Wall in von Neumann Architectures


fundamental question

how do we design intelligent autonomous machinesthat operate at the limits of energy-delay-accuracy?


Proceedings of IEEE, Special Issue on non von Neumann Computing, January 2019.

Systems on Nanoscale Information fabriCswww.sonic-center.org

[2013-’17](STARnet Program by Semiconductor Research Corporation & DARPA)

http://www.sonic-center.org/


SHANNON-INSPIRED MODEL OF COMPUTING

use information-based metrics e.g., mutual information 𝐼𝐼(𝑌𝑌𝑜𝑜; �𝑌𝑌)1design low SNR fabrics, e.g., deep in-memory architecture (DIMA)2

develop statistical error-compensation (SEC) techniques3

(noisy channel)

DIMA

[ISSCC’18]


integratethe MRAM device

within a

deep in-memory architecture (DIMA)

using

Shannon-inspired compute models

FRANC OBJECTIVEGLOBALFOUNDRIES PRINCETON & UIUC RAYTHEON MISSILE SYSTEMS

thermally limited

MRAM device

deep in-memory arch.

Shannon-inspired compute models

autonomous platforms


[ICASSP 2014 UIUC, JSSC 2017 Princeton, JSSC 2018 UIUC]

read functions of datastandard bit-cell array(preserves storage density)

THE DEEP IN-MEMORY ARCHITECTURE (DIMA)


SRAM DIMA PROTOTYPES

Multi-functionalinference processor

(65nm CMOS)

Random forest processor

(65nm CMOS)

On-chip training processor

(65nm CMOS)

Fully (128) row-parallel compute

(130nm CMOS)

100X EDP reduction over von Neumann equivalent* @ iso-accuracy* 8b fixed-function digital architecture with identical SRAM size

[Feb. JSSC’18] [ESSCIRC’17, JSSC July’18] [ISSCC’18, JSSC Nov.’18] [VLSI’16, JSSC’17]


MRAM OPPORTUNITIES & CHALLENGES

high conductance → large array currentssmall TMR → high sensitivity readouthuge 𝑅𝑅𝑀𝑀𝑀𝑀𝑀𝑀 − 𝑅𝑅𝑀𝑀𝑀𝑀𝑀𝑀 limits cell compute modelshigh cell density → severe pitch-matching const.high MTJ process variation restricts bit-line SNR

MTJ resistance distribution

Free Layer

Ref. Layer

Tunnel Barrier

Select Transistor

Top Electrode

BottomElectrode

WL

Source Line

pMTJ stack 40Mb MRAM demo

Challenges:Opportunities:high density → high on-chip compute densitynon-volatility → ultra low-power duty-cycled operationin production → enables system prototyping &

reduces time to DoD availability


MRAM-BASED DEEP IN-MEMORY ARCHITECTURES

four DIMAs proposed – two selected for prototypingBit-line Sensing (DIMA-BLS) Current Drive Time-based

Sensing (DIMA-CDTS)

𝑥𝑥[𝑖]

𝑀𝑀 × 𝑁𝑁 single-shot matrix vector multiply Functional read: 5-bit × 4-bit dot products on SLs Compute cell: 1T-1MTJ ( 4 × 1 multiplication) SL current sensing via time-based 5-bit ADCs SNR vs. energy trade-off due to sensing circuits

𝑀𝑀 × 𝑁𝑁 single-shot matrix vector multiply Functional read: binary dot products on BLs Compute cell: 2T-2MTJ (1b XNOR/AND) BL voltage sensing via 4-bit SAR ADCs SNR vs. energy trade-off due to sensing circuits


EDP GAINS OVER VON NEUMANN EQUIVALENT

(includes all peripherals)

43× EDP reduction (256-dimensional dot products)

250×-to-1000× EDP reduction (128-dimensional dot products)

DIMA-BLS

DIMA-CDTS

MVM: matrix-vector multiply


SHANNON-INSPIRED COMPUTE MODELS

• relaxes MRAM-based DIMA’s compute SNR requirements → enables substantial energy savings

• two methods: stochastic data-driven hardware resilience & coded DIMA

MRAM-based DIMA

use information-based metrics develop error-compensation techniques


STOCHASTIC DATA-DRIVEN HARDWARE RESILIENCE

• noise-tolerance enhanced well beyond MTJ variations

• can be exploited for energy-SNR tradeoff

MRAM-based DIMA InferenceModel Parameters𝜃𝜃𝑀𝑀𝑀𝑀𝑀𝑀𝑆𝑆𝑆𝑆(𝑥𝑥,𝐺𝐺)G (MRAM variation, ADC noise) gi

DIMA-aware Stochastic Training

[B. Zhang, ICASSP’19]

Simulated MRAM-based BNN (CIFAR-10):


CODED DIMA - ENHANCING DIMA’S COMPUTE SNRMTJ variation aware coding enables HIGH system SNR in spite of LOW circuit SNR

part

ial d

ecim

atio

n-in

-tim

eA

rchi

tect

ure

for m

appi

ng

FFT

on D

IMA

N-point Input Sequence

N/ M-pointInput

Sequence

N/ M-pointInput

Sequence

Input shuffling

N/ M-pointDFT

N/ M-pointDFT

Merge N/ M-point DFTs

N-point DFTCoefficients

N- p

oint

DFT

DFT, DHT, QFT

Coefficient precision (𝐵𝐵𝑊𝑊)

Out

put S

NR(d

B)

Limited byinput SNR

Limited byMRAM device

variation

Limited byquantization

noise

Input SNR: 40 dB; = 8; = 128

18dB

61 dB

41 dB

MRAM-DI MAWrite data

Functional read

Calculate error

Calibrate

write-time compensation (WTC)of MRAM variations

WTC

MTJ

𝐵𝐵𝑥𝑥 𝑁𝑁


CURRENT STATUS

Technology transition via SRAM-based DIMA DIMA prototype designs taped out

(2019) [single-bank] validate EDPgains & calibrate models

(2020) [multi-bank] enable scaling and system integration

DIMA technology transition to RMS foralgorithm prototyping & evaluation

facilitates future MRAM-based DIMAintegration by RMS

verified & quantified RMS’s mission capability enabled by MRAM-based DIMA MRAM-based DIMA demonstrates > 2 orders-of-magnitude EDP reduction

[H. Jia, arXiv:1811.04047, Princeton]


NEXT STEPS & SUMMARY

Mission: To realize > 200X in EDP gains in DoD workloads by integrating MRAM devicewithin Deep In-memory Architectures using Shannon-inspired compute models

FRANC Program: “to provide the foundation for new materials technology and newintegration approaches to be exploited in pursuit of novel compute architectures”

Multi-functioned DIMAs for Diverse Applications Parameterized DIMAs for Platform-design Tools

Summary

[Srivastava, et al., ISCA’18]

Documents

naresh - ERI SummitNARESH UNIVERSITY OF ILLINOIS AT SHANBHAG. URBANA-CHAMPAIGN. Distribution A. Approved for public release; distribution is unlimited. MRAM-BASED DEEP IN-MEMORY ARCHITECTURES