

Mixed-Signal Techniques for Embedded Machine Learning Systems

Boris Murmann

June 16, 2019

Applications

Cloud vs. edge trade-offs: speed of response, bandwidth utilized, privacy, power consumed.

Conversational interfaces: …natural? …seamless? …real-time?

Task Complexity, Memory and Classification Energy

[Chart: classification energy per task, spanning pJ to J]
› Face Detection: SVM
› Digit Recognition / Image Classification (10 classes): Binary CNN
› Image Classification (1000 classes): MobileNet v1
› Self-Driving: ResNet-50 HD

Task Complexity, Memory and Classification Energy

[Same chart, with an annotation marking my main interest]

Edge Inference System

[Block diagram: Sensors → A/D → Wake-up Detector, then A/D → Pre-Processing → Deep Neural Network. Inside the DNN accelerator, the memory hierarchy runs DRAM → SRAM buffer → PE array → register file → MAC compute.]

Opportunities for Analog/Mixed-Signal Design

[Same block diagram, annotated with the mixed-signal opportunities: analog feature extraction at the sensor/wake-up front end, analog MAC in the compute unit, and in-memory computing in the memory hierarchy.]

Outline


Data-Compressive Imager for Object Detection

› Omid-Zohoor & Young, TCSVT 2018 & ISSCC 2019

Mixed-Signal ConvNet

› Bankman, ISSCC 2018 & JSSC 2019

RRAM-based ConvNet with In-Memory Compute

› Ongoing work

Wake-Up Detector with Hand-Crafted Features

[Signal chain: Sensor → A/D → Feature Extractor → Simple Classifier → Label. A training algorithm uses labeled training data to configure the classifier. The feature extractor reduces the high-dimensional data (the "data deluge") to a low-dimensional representation.]

Analog Feature Extractor

Low-rate and/or low-resolution ADC

Low data-rate digital I/O

Reduced memory requirements

[Signal chain with the feature extractor moved into the analog domain, ahead of the A/D: Sensor → Feature Extractor → A/D → Simple Classifier → Label.]

Histogram of Oriented Gradients

Analog Gradient Computation

Pixel neighborhood: $H_1$ and $H_2$ are the horizontal neighbors of pixel $P$; $V_1$ and $V_2$ are the vertical neighbors.

$G_H = H_1 - H_2$ (horizontal), $\quad G_V = V_1 - V_2$ (vertical)

Bright patch: $G_H = 400\,\mathrm{mV} - 100\,\mathrm{mV} = \mathbf{300\,mV}$

Dark patch (same scene at 1/4 the light): $G_H = \tfrac{1}{4}\cdot 400\,\mathrm{mV} - \tfrac{1}{4}\cdot 100\,\mathrm{mV} = \mathbf{75\,mV}$

High Dynamic Range Images

Gradient magnitude varies significantly across the image

Would require high-resolution ADCs (≥ 9b) to digitize the computed gradients
› But we want to produce as little data as possible

Ratio-Based (“Log”) Gradients

$G_H = \dfrac{H_1}{H_2}, \qquad G_V = \dfrac{V_1}{V_2}$

Under a uniform illumination change, both pixels scale by the same factor $\alpha$:

$G_H = \dfrac{\alpha\,H_1}{\alpha\,H_2} = \dfrac{H_1}{H_2}$ → Illumination Invariant
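To make the invariance concrete, here is a minimal Python sketch (illustrative values, not from the talk) contrasting linear and ratio gradients when the scene illumination scales by a factor alpha:

```python
def linear_gradient(p1, p2):
    """Linear gradient: difference of adjacent pixel values."""
    return p1 - p2

def ratio_gradient(p1, p2):
    """Ratio ('log') gradient: quotient of adjacent pixel values."""
    return p1 / p2

h1, h2 = 400, 100   # bright patch, pixel values in mV
alpha = 0.25        # illumination drops to 1/4 (dark patch)

print(linear_gradient(h1, h2))                  # 300
print(linear_gradient(alpha * h1, alpha * h2))  # 75.0 -> tracks illumination
print(ratio_gradient(h1, h2))                   # 4.0
print(ratio_gradient(alpha * h1, alpha * h2))   # 4.0  -> invariant to alpha
```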

Ratio Quantization

The pixel ratio $V_{pix1}/V_{pix2}$ is quantized directly:

$G_H = Q\!\left(\dfrac{H_1}{H_2}\right), \qquad G_V = Q\!\left(\dfrac{V_1}{V_2}\right)$

Ratios near 1 map to "no edge"; ratios beyond the thresholds map to "positive edge" or "negative edge" (*empirically determined thresholds). The slide shows a 1.5b (3-level) quantizer with thresholds at 1/1.3 and 1.3, and a finer 2.75b quantizer whose outermost thresholds sit at 1/4.2 and 4.2.
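A minimal sketch of the three-level (1.5b) ratio quantizer described above; the threshold value 1.3 follows the slide, but treat the exact numbers as illustrative:

```python
def quantize_ratio_1p5b(v_pix1, v_pix2, t=1.3):
    """Three-level (1.5b) quantizer for the pixel ratio v_pix1/v_pix2.

    Returns +1 (positive edge), -1 (negative edge), or 0 (no edge).
    Thresholds at t and 1/t are empirically determined (t = 1.3 per the slide).
    """
    r = v_pix1 / v_pix2
    if r > t:
        return +1      # positive edge
    if r < 1.0 / t:
        return -1      # negative edge
    return 0           # no edge: ratio close to 1
```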

HOG Feature Compression with 1.5b Gradients

Fewer angle bins

Quantizing histogram magnitudes

25× less data in the HOG features compared to the 8-bit image
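As a back-of-the-envelope check of how such compression arises, the sketch below compares the bit count of an 8-bit QVGA image against quantized HOG features. The cell size, bin count, and magnitude resolution are assumed for illustration and do not reproduce the design's exact 25× figure:

```python
# Data reduction of quantized HOG features vs. an 8-bit image.
# Cell size, angle bins, and magnitude bits are illustrative assumptions.
H, W = 240, 320          # QVGA frame
raw_bits = H * W * 8     # 8-bit image

cell = 8                 # pixels per HOG cell side (assumed)
bins = 4                 # angle bins (reduced, per the slide)
mag_bits = 4             # quantized histogram magnitude (assumed)
hog_bits = (H // cell) * (W // cell) * bins * mag_bits

print(raw_bits / hog_bits)   # ~32x with these assumptions (slide reports 25x)
```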

Log vs. Linear Gradients

[Image sequence repeated over three slides, comparing log and linear gradients under progressively less illumination; the log gradients remain stable as illumination drops.]

Prototype Chip

Row Buffers with Pixel Binning (Image Pyramid)

Ratio-to-Digital Converter (RDC)

Data-Driven Spec Derivation

Circuit errors enter the computed ratio as $\dfrac{H_1 + \epsilon_1}{H_2 + \epsilon_2}$; the error budget is derived from its effect on end-to-end detection accuracy.
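Data-driven spec derivation can be read as: inject the error terms into the ratio computation and find the largest error that leaves end-to-end accuracy essentially unchanged. A hedged sketch, where pixel_pairs, evaluate_accuracy, and the Gaussian error model are all placeholder assumptions:

```python
import numpy as np

def noisy_ratio(h1, h2, sigma, rng):
    """Ratio gradient with additive circuit error terms eps1, eps2."""
    eps1, eps2 = rng.normal(0.0, sigma, size=2)
    return (h1 + eps1) / (h2 + eps2)

def derive_spec(pixel_pairs, evaluate_accuracy, sigmas, trials=100):
    """Sweep the error magnitude (sigmas, ascending) and return the largest
    sigma whose mean accuracy stays within 1% of the error-free baseline.
    evaluate_accuracy() is an application-level hook, e.g. one that runs
    the downstream object detector on the gradient features."""
    rng = np.random.default_rng(0)
    baseline = evaluate_accuracy([h1 / h2 for h1, h2 in pixel_pairs])
    spec = 0.0
    for sigma in sigmas:
        accs = [evaluate_accuracy(
                    [noisy_ratio(h1, h2, sigma, rng) for h1, h2 in pixel_pairs])
                for _ in range(trials)]
        if baseline - np.mean(accs) > 0.01:
            break          # accuracy degraded; previous sigma is the spec
        spec = sigma
    return spec
```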

Chip Summary

• 0.13 µm CIS 1P4M
• 5 µm 4T pixels
• QVGA 320(V) x 240(H)
• 229 mW @ 30 FPS

Supply Voltages
Pixel: 2.5V
Analog: 1.5V, 2.5V
Digital: 0.9V

Sample Images

Results using Deformable Parts Model detection & custom database (PascalRAW)

Comparison to State of the Art

Information Preservation

Use Log Gradients as ConvNet Input?

Ongoing work; comparable performance using ResNet-10 (PascalRAW dataset)

Digital ConvNet Fabric

Can We Play Mixed-Signal Tricks in a ConvNet?

BinaryNet

Courbariaux et al., NIPS 2016

Weights and activations constrained to +1 and −1; multiplication becomes XNOR

Minimizes D/A and A/D overhead

Nice option for small/medium-size problems and mixed-signal exploration
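Why XNOR: with w, x ∈ {−1, +1} encoded as bits (+1 → 1, −1 → 0), the product w·x is +1 exactly when the bits agree, which is XNOR. A minimal sketch of a packed binary dot product:

```python
def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed as bits
    (bit i = 1 encodes +1, bit i = 0 encodes -1).

    XNOR marks agreements; dot = agreements - disagreements
                                = 2 * popcount(xnor) - n.
    """
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)  # 1 where bits agree
    return 2 * bin(xnor).count("1") - n

# Example: w = [+1, -1, +1], x = [+1, +1, -1] -> dot = 1 - 1 - 1 = -1
print(binary_dot(0b101, 0b011, 3))  # -1
```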

Mixed-Signal Binary CNN Processor

Bankman et al., ISSCC 2018

1. Binary CNN with "CMOS-inspired" topology, engineered for minimal circuit-level path loading
2. Hardware architecture amortizes memory access across many computations, with all memory on chip (328 KB)
3. Energy-efficient switched-capacitor neuron for wide vector summation, replacing the digital adder tree

Original BinaryNet Topology

Zhao et al., FPGA 2017

88.54% accuracy on CIFAR-10
1.67 MB weight memory (68% in FC layers)
27.9 mJ/classification on FPGA

Mixed-Signal BinaryNet Topology

Sacrificed accuracy for regularity and energy efficiency

86.05% accuracy on CIFAR-10
328 KB weight memory
3.8 µJ per classification

Neuron

Naïve Sequential Computation

Weight-Stationary
[Array annotations: 256, ×2 (north/south), ×4 (neuron mux)]

Weight-Stationary and Data-Parallel
[Annotation: parallel broadcast]

Complete Architecture

Neuron Function

$z = \operatorname{sgn}\!\left(\sum_{i=0}^{1023} w_i x_i + b\right)$

$w$: filter weights; $x$: image patch (filter and patch are each 2×2×256, i.e., 1024 elements); $b$: filter bias in 9b sign-magnitude form:

$b = (-1)^s \sum_{i=0}^{7} 2^i m_i$
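A behavioral Python sketch of this neuron function (the chip realizes the sum with capacitors rather than arithmetic; the example vectors are random):

```python
import numpy as np

def sign_magnitude_bias(s, m):
    """9b sign-magnitude bias: b = (-1)^s * sum_{i=0}^{7} 2^i * m_i."""
    magnitude = sum((1 << i) * bit for i, bit in enumerate(m))  # m: 8 bits, LSB first
    return -magnitude if s else magnitude

def neuron(w, x, b):
    """z = sgn(sum_i w_i * x_i + b) with w_i, x_i in {-1, +1}.
    w and x are the flattened 2x2x256 filter and input patch (1024 elements).
    sgn(0) is mapped to +1 here (an arbitrary tie-break)."""
    return 1 if int(np.dot(w, x)) + b >= 0 else -1

rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=1024)
x = rng.choice([-1, 1], size=1024)
b = sign_magnitude_bias(s=1, m=[1, 0, 1, 0, 0, 0, 0, 0])  # b = -5
print(neuron(w, x, b))
```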

Switched-Capacitor Implementation

$\dfrac{v_{\mathrm{diff}}}{V_{DD}} = \dfrac{C_u}{C_{tot}}\left(\sum_{i=0}^{1023} w_i x_i + b\right), \qquad b = (-1)^s \sum_{i=0}^{7} 2^i m_i$

[Annotations mark the weights × inputs section and the bias & offset calibration section. Batch normalization is folded into the weight signs and the bias.]

Behavioral Simulations

Significant margin in noise, offset, and mismatch ($V_{FS}$ = 460 mV):
$C_u$ = 1 fF, $v_{os}$ = 4.6 mV RMS, $v_n$ = 460 µV RMS
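A behavioral Monte Carlo sketch in the spirit of these simulations, using the slide's noise and offset numbers; mapping the ±1 sum to volts via VFS/N and using uniformly random test vectors are simplifying assumptions:

```python
import numpy as np

# Behavioral Monte Carlo of the switched-capacitor neuron using the
# slide's numbers: VFS = 460 mV, v_os = 4.6 mV RMS, v_n = 460 uV RMS.
VFS, N = 0.460, 1024
V_OS, V_N = 4.6e-3, 460e-6

rng = np.random.default_rng(0)
trials, flips = 10_000, 0
for _ in range(trials):
    w = rng.choice([-1, 1], size=N)
    x = rng.choice([-1, 1], size=N)
    pre = int(np.dot(w, x))                 # ideal pre-activation (even integer)
    v_ideal = VFS * pre / N                 # ideal differential neuron voltage
    v_real = v_ideal + rng.normal(0, V_OS) + rng.normal(0, V_N)
    if pre != 0 and np.sign(v_real) != np.sign(v_ideal):
        flips += 1                          # output sign flipped by noise/offset
print(f"sign-flip rate with random vectors: {flips / trials:.2%}")
```

Note that random ±1 vectors concentrate the pre-activation near zero and thus stress the decision threshold; the measured chip (next slides) shows the margins suffice in practice, with mean accuracy matching the perfect digital model.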

“Memory-Cell-Like” Processing Element

1 fF metal-oxide-metal fringe capacitor
Standard-cell-based, 42 transistors
24107 F²

Die Photo

TSMC 28nm HPL 1P8M
6 mm² area
328 KB SRAM
10 MHz clock

Supply Voltages
VDD (digital logic): 0.6V – 1.0V
VMEM (SRAM): 0.53V – 1.0V
VNEU (neuron array): 0.6V
VCOMP (comparators): 0.8V

Measured Classification Accuracy

10 chips, 180 runs each through 10,000 CIFAR-10 test images: μ = 86.05%, σ = 0.40%

VDD = 0.8V, VMEM = 0.8V
3.8 μJ/classification
237 FPS, 899 μW
0.43 μJ in 1.8V I/O

Mean accuracy μ = 86.05%, the same as the perfect digital model

Comparison to Synthesized Digital

[Chart comparing the synthesized-digital BinarEye (Moons et al., CICC 2018) against this mixed-signal design]

Digital vs. Mixed-Signal Binary CNN Processor

[Chart: energy at 86.05% CIFAR-10 accuracy for three designs: synthesized digital (Moons et al., CICC 2018), hand-designed digital (projected), and mixed-signal (Bankman et al., ISSCC 2018)]

CIFAR-10 Energy vs. Accuracy*

Neuromorphic › [1] TrueNorth, Esser PNAS 2016
GPU › [2] Zhao FPGA 2017
FPGA › [2] Zhao FPGA 2017; [3] Umuroglu FPGA 2017
MCU › [4] CMSIS-NN, Lai arXiv 2018
Memory-like, mixed-signal › [5] Bankman ISSCC 2018
BinarEye, digital › [6] Moons CICC 2018
In-memory, mixed-signal › [7] Jia arXiv 2018

*energy excludes off-chip DRAM

Limitations of Mixed-Signal BinaryNet


Poor programmability

Relatively limited accuracy (even on CIFAR-10) due to 1b arithmetic

Energy advantage over customized digital is not revolutionary

› Same SRAM, essentially same dataflow

Need a more “analog” memory system to unleash larger gains

› In-memory computing

BinaryNet Synapse versus Resistive RAM

Switched-capacitor BinaryNet synapse:
• 0.93 fJ per 1b-MAC in 28 nm
• 24107 F²
• Single-bit

Resistive RAM (Tsai, 2018):
• Energy: TBD
• 25 F² (~960× smaller cell area)
• Multi-bit (?)

Matrix-Vector Multiplication with Resistive Cells

Typically use two cells per weight to achieve pos/neg weights (other schemes possible)
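A sketch of the two-cells-per-weight scheme: each signed weight is split into a pair of conductances (G+, G−), each column sums currents, and the difference of the two column currents recovers the signed dot product. Conductance normalization and names are illustrative, assuming weights in [−1, 1]:

```python
import numpy as np

def program_differential(W, g_on=1.0, g_off=0.0):
    """Map signed weights (in [-1, 1]) to conductance pairs: positive part
    on the G+ array, negative part on the G- array (two cells per weight)."""
    Gp = np.where(W > 0,  W, 0.0) * (g_on - g_off) + g_off
    Gn = np.where(W < 0, -W, 0.0) * (g_on - g_off) + g_off
    return Gp, Gn

def mvm(Gp, Gn, v_in):
    """Column currents I = G^T v; the signed result is their difference."""
    return Gp.T @ v_in - Gn.T @ v_in

W = np.array([[ 1.0, -0.5 ],
              [-1.0,  0.25]])      # rows: inputs, cols: outputs
v = np.array([0.3, 0.7])
print(mvm(*program_differential(W), v))   # [-0.4, 0.025]
print(W.T @ v)                            # matches the ideal product
```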

RRAM Density – BinaryNet Example

Array: 1024 rows × 256 columns, two cells per weight (2×)

25 F² cell @ F = 90 nm → cell side length s = 0.45 µm

2 × 1024 × 0.45 µm × 256 × 0.45 µm ≈ 0.106 mm² per layer

≈ 0.84 mm² for the complete 8-layer network
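The same arithmetic, spelled out (reproduces the slide's numbers):

```python
# Area arithmetic for the 8-layer BinaryNet mapped onto RRAM.
F = 90e-9                    # feature size [m]
cell_area = 25 * F**2        # 25 F^2 RRAM cell
s = cell_area ** 0.5         # cell side length -> 0.45 um
rows, cols = 1024, 256
per_layer = 2 * rows * s * cols * s   # two cells per weight (pos/neg)
print(s * 1e6)               # 0.45 (um)
print(per_layer * 1e6)       # ~0.106 (mm^2 per layer)
print(8 * per_layer * 1e6)   # ~0.84 (mm^2 for the 8-layer network)
```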

Ongoing Research

What is the best architecture?

How many levels can be stored in each cell?

What is the most efficient readout?

Can we cope with nonidealities using training techniques?

(Content deleted for online distribution)

VGG-7 Experiment (4.8 Million Parameters)

[Network: 3×3 conv layers with filter stacks 3×3×3, 3×3×128, 3×3×128, 3×3×256, 3×3×256, 3×3×512, followed by an 8192 → 10 fully connected classifier]

2-bit weights, 2-bit activations

Accuracy on CIFAR-10:
› 2b quantization only: 93%
› 2b quantization + RRAM/ADC model: 92%

Work in progress!
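A minimal sketch of a symmetric 2-bit uniform quantizer of the kind used in such experiments; the four-level codebook here is an illustrative choice, not necessarily the one used in this work:

```python
import numpy as np

def quantize_2b(x):
    """Symmetric 2-bit quantizer: nearest of 4 levels in [-1, 1]
    (an illustrative codebook; the experiment's exact scheme may differ)."""
    levels = np.array([-1.0, -1/3, 1/3, 1.0])
    x = np.clip(x, -1.0, 1.0)
    idx = np.argmin(np.abs(x[..., None] - levels), axis=-1)
    return levels[idx]

w = np.random.default_rng(0).normal(0, 0.5, size=(3, 3, 128))
w_q = quantize_2b(w / np.max(np.abs(w)))   # normalize, then quantize
print(np.unique(w_q))                      # the four codebook levels
```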

Energy Model for Column in Conv6 Layer

Optimal energy/MAC = 0.38 fJ

Summary


Analog feature extraction is attractive for wake-up detectors

Adding analog compute in ConvNets can be beneficial when it simultaneously lets us reduce data movement
› In-memory analog compute looks most promising
› Can consider SRAM or emerging memories (e.g., RRAM)

Expect significant progress as more application drivers for "machine learning at the edge" emerge
