Neuromorphic Computing based Processors
Hao Jiang
A collaborative research effort among San Francisco State University, EI-Lab at the University of Pittsburgh, HP Labs, and AFRL
2/12/2015

Outline
• Why Neuromorphic Computing?
• Challenges and New Opportunity
• Spiking Neuromorphic Design
• A Framework of Heterogeneous Computing Systems
• Conclusion

Why Neuromorphic Computing?
• The von Neumann architecture is facing severe challenges
– Von Neumann bottleneck
– Inefficient in cognitive computations
• Human brain: high efficiency
– 100 TFLOPS vs. 20 W
– Highly connected: 50B neurons and $10^{14}$ synapses
– Very light: 3 lbs
[Figure: machine with a head, a tape, and computation & control blocks]
Neuromorphic Design by Leveraging Memristor Technology

Brain – The Most Efficient Computing Machine
• Brain: 15–30B neurons, extremely complex, 4 km/mm³, 35 W
• Neocortex: 6 layers; signals travel within and between layers
• Gray matter and white matter
• Neuron: processes signals from other neurons
• Synapse: memory; weights signals
• Neural network

Brain-like Neuromorphic Circuits
• Slow progress in neuromorphic hardware implementation
– Lack of efficient synapse design
– Not supportive of mass connection
• Highly parallel, ultra power efficient, flexible, extremely robust
• Real-world input, human-friendly output, data friendly

Challenges in Traditional Approach
Developing and implementing neural network models on large-scale computer clusters or supercomputers.
• Performance challenge: 100M MIPS
• Energy challenge: 10–20 megawatts, equivalent to 11,000 U.S. households

Traditional Analog Approach
Weight matrix W maps input X to output Y: $Y = f(WX)$
• Implementation
– Weight: floating gates, capacitors, etc.
– Compute: op-amps, analog voltage multipliers and differentiators
• Difficulties (intrinsic, hard to overcome)
– Weight: volatile data, low precision, control signal
– Compute: voltage offset, noise generation, voltage saturation
• Scaling (design complexity, power, and area grow very fast)
– Weight: $O(N^2)$ for weight carrier
– Compute: $O(N^2)$ for voltage multiplier
• Successful in small-scale systems
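The scaling argument can be made concrete with a small software model of $Y = f(WX)$ that counts one multiplication per weight. This is only an illustrative sketch: the function name `analog_layer` and the choice of `tanh` for f are assumptions, since the slide does not name an activation function.

```python
import math


def analog_layer(weights, x, f=math.tanh):
    """Compute Y = f(W X) and count the multiplications it needs.

    weights: list of rows, one row per output (an N x N matrix here).
    x:       list of N inputs.
    f:       activation function (tanh is an assumption, not from the slide).
    """
    multiplies = 0
    y = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi        # one analog multiplier per weight
            multiplies += 1
        y.append(f(acc))
    return y, multiplies
```

For an N x N weight matrix the counter always ends at N², which is the number of weight carriers and voltage multipliers the analog implementation has to provide.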

Latest Progress
• Numenta HTM
• Micron Automata
• IBM TrueNorth

Memristor – Rebirth of Analog Approach
Memristor: natural weight carrier
• Non-volatility, high density
• Analog resistance states
• Two-terminal programming
Memristor crossbar
• Natural weight summation
• MIMO: avoid sneak path
• Cost ~$O(N)$, not $O(N^2)$
$M = R_L\,\alpha + R_H\,(1 - \alpha)$
$I = \dfrac{V_{M_1}}{M_1} + \dfrac{V_{M_2}}{M_2} + \cdots + \dfrac{V_{M_n}}{M_n}$
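The two formulas above can be checked numerically with a short sketch. It reads α as a normalized state variable between 0 and 1, which is an assumption (the slide does not define it), and the resistance values used in the usage note are hypothetical.

```python
def memristance(alpha, r_low, r_high):
    """M = R_L * alpha + R_H * (1 - alpha).

    alpha is assumed to be a normalized device state in [0, 1]:
    alpha = 1 gives the low-resistance state, alpha = 0 the high one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return r_low * alpha + r_high * (1.0 - alpha)


def column_current(voltages, memristances):
    """I = V_M1/M_1 + V_M2/M_2 + ... + V_Mn/M_n for one crossbar column."""
    return sum(v / m for v, m in zip(voltages, memristances))
```

For example, with hypothetical values R_L = 100 Ω and R_H = 1000 Ω, `memristance(0.5, 100.0, 1000.0)` gives 550 Ω, and the column current is simply the sum of the per-device currents, which is the "natural weight summation" the crossbar provides.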

Memristor – Rebirth of Neuromorphic Circuits
Memristor ↔ Synapse
• Two-terminal, high density
• Non-volatility
• Analog/multi-level states
Crossbar ↔ Network
• Natural matrix function
• A MIMO system
• Good combination with memristor
[Figure: TaN1+x memristor device and crossbar schematic with word lines WLi, bit lines BLj, cells mi,j, neurons N1 … Nn, input VI, and output VO. Credits: HP Labs, 2012; EI-Lab, DAC'12; EI-Lab, APL'13]
[Figure: resistance (300–700) and voltage (−4 V to 4 V) versus pulse number (0–70) for a TiN-TaOx device, with pulses growing linearly in amplitude. Credit: EI-Lab & HP Labs]

Spiking-based Neuromorphic Computing
• Why spiking?
– Inspired by human brains
– Minimized transition electrical charge
– Reduced data communication distance
– High parallelism in processing
• Approaches in hardware system
– Analog and digital circuit blocks, capacitors
– Crossbar array based on SRAM, PCM, or memristor cells
– In this work: memristor-based crossbar array as synapse
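
Spiking Neuromorphic System
• Spiking-based computation system
– Closer to biological system
– Power efficiency
– High reliability
• Matrix computation transformation
– Mathematical matrix computation
– Memristor-based crossbar array for matrix computation
$$
\begin{bmatrix} I_1 & I_2 & \cdots & I_n \end{bmatrix}
=
\begin{bmatrix} V_1 & V_2 & \cdots & V_m \end{bmatrix}
\begin{bmatrix}
g_{11} & g_{12} & \cdots & g_{1n} \\
g_{21} & g_{22} & \cdots & g_{2n} \\
\vdots & \vdots & & \vdots \\
g_{m1} & g_{m2} & \cdots & g_{mn}
\end{bmatrix},
\qquad
I_j = \sum_{i=1}^{m} g_{ij} V_i
$$
[Figure: spike trains (010, 101, 100) driving the crossbar, with integrate-and-fire circuits and counters at the outputs]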
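
Computation Methodology
• Traditional integrate-and-fire model with rate coding
• Ideal: $n_j \propto \sum_{i=1}^{N} g_{ij}\, m_i$, where $m_i$ is the number of input spikes on row i within the window T ($m_i = 2$ in the illustrated example) and $n_j$ is the number of output spikes of column j
• Working mechanism:
– Vm < Vth: Cm integrates
– Vm ≥ Vth: a spike is fired, then Cm is reset to 0
– Spike occurs: weighted current flows to the integrator
– No spike: no current to or from the integrator
[Figure: integrate-and-fire model with membrane capacitor Cm, membrane voltage Vm, threshold Vth, output Vout, spike width ts, and window T]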
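The working mechanism above can be sketched as a small time-stepped simulation of one crossbar column. This is an illustrative model only: it assumes a constant input amplitude `v_in` while a spike is present, ideal selectors, and an instantaneous reset, and all parameter values in the usage note are hypothetical.

```python
def integrate_and_fire(spike_trains, conductances, v_in, dt, c_m, v_th):
    """Count output spikes of one column under rate-coded inputs.

    spike_trains[i][k] is 1 if row i carries a spike during time step k.
    conductances[i] is g_ij for the column being simulated.
    """
    n_steps = len(spike_trains[0])
    v_m = 0.0      # voltage on the integrating capacitor Cm
    fired = 0
    for k in range(n_steps):
        # Only rows that are spiking inject weighted current.
        current = sum(g * v_in
                      for g, train in zip(conductances, spike_trains)
                      if train[k])
        v_m += current * dt / c_m          # Vm < Vth: Cm integrates
        if v_m >= v_th:                    # Vm >= Vth: fire and reset
            fired += 1
            v_m = 0.0
    return fired
```

With hypothetical values g = [1e-4, 2e-4] S, v_in = 1 V, dt = 1 ns, Cm = 1 pF, and Vth = 0.25 V, six spikes on the first row alone yield 2 output spikes, while six spikes on both rows yield 6, so the output count grows with the weighted input sum as the ideal relation predicts.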

Memristor Crossbar Array Structure
• 1S1M memristor-based cell
– Alleviates the impact of sneak-path leakage
– A thin-film-based selector after each memristor
– Minimal unit cell area of $4F^2$
• Selector property and operation
– Spike occurs: selector ON, and Rs_on << RM
– No spike: selector OFF, and Rs_off >> RM
– Effective cell conductance when the selector is ON: $g_{ij}' = \dfrac{g_{ij}\, g_s}{g_{ij} + g_s} \approx g_{ij}$
[Figure: I–V curves of the memristor and the selector; voltage from −3 V to 3 V, current (A) on a log scale from $10^{-14}$ to $10^{2}$]
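The series-conductance relation for the 1S1M cell is easy to check numerically. The sketch below uses hypothetical conductance values chosen only to satisfy Rs_on << RM and Rs_off >> RM; they are not taken from the slides.

```python
def series_conductance(g_mem, g_sel):
    """Effective conductance of a memristor in series with its selector."""
    return g_mem * g_sel / (g_mem + g_sel)


# Hypothetical values: memristor at 100 uS, selector ON at 100 mS, OFF at 1 nS.
G_MEM = 1e-4
g_on = series_conductance(G_MEM, 1e-1)    # close to G_MEM
g_off = series_conductance(G_MEM, 1e-9)   # close to the OFF selector value
```

With the selector ON the cell conducts at about 99.9% of the memristor conductance, so the stored weight is read almost unchanged. With the selector OFF the cell conductance collapses to roughly the selector's own 1 nS, which is what suppresses sneak-path leakage through unselected cells.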

High-speed Integrate and Fire Circuit (H-IFC)
VREF = 33% of Vin

Integrate and Fire Circuit
• Structure and property
– Integration capacitor, reset transistor, comparator with positive feedback
– High speed
– Vth is much smaller than Vdd
• Power and area
– ~100 μW
– 28 μm × 12 μm (180 nm technology)
[Figure: circuit schematic with input Iin/Vin, memristor Rmem, capacitor Cm with voltage VCm, comparator with threshold VTH, and output Vout]

System Verification
[Figure: spiking inputs drive an n × n memristor crossbar with conductances g11 … gnn; the column currents I1 … In feed integrate-and-fire (I&F) circuits, each built from Cm and a fire circuit]

Output Pulse Number
[Equation figure: the output pulse number expressed in terms of the sensing capacitance, pulse duration, input voltage, comparator triggering voltage, memristor conductance, and selected rows]
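The slide's own expression did not survive extraction. As a first-order sketch only, one way to relate the listed quantities is a charge balance: assume each input spike of duration $t_s$ and amplitude $V_{in}$ delivers charge $g_{ij} V_{in} t_s$ to the sensing capacitor $C_m$, and that one output pulse fires for every $C_m V_{th}$ of accumulated charge, ignoring reset time and comparator delay. The symbols $V_{in}$ and $t_s$ and the form of the result are assumptions under that model, not the author's formula.

```latex
n_j \;\approx\; \frac{V_{in}\, t_s}{C_m\, V_{th}} \sum_{i \,\in\, \text{selected rows}} m_i\, g_{ij}
```

Under these assumptions the estimate agrees with the ideal proportionality $n_j \propto \sum_i g_{ij} m_i$ stated for the rate-coded integrate-and-fire model.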

Theoretical Computation vs. Simulation Results
[Figure: output spike number (0–100) versus $\sum_{i=1}^{N} m_i g_{ij}$ (0 to $3.5 \times 10^{-4}$), comparing the ideal linear curve with the simulation result]
Parameters: Vth = 0.5 V, Cm = 50 fF, Vdd = 1.8 V
• Real behavior: nonlinear, and dependent on the sum of weighted signals
• Reasons: reset time and IFC delay
• Optimization: larger Cm and a higher-speed IFC
• Can be used in neural networks
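The nonlinearity and the two proposed optimizations can be illustrated with a simple dead-time model, in which every output spike costs a fixed extra time for reset and IFC delay. This is a sketch under stated assumptions: Cm = 50 fF and Vth = 0.5 V come from the slide, while the constant-current drive, the 250 ns window, and the 1 ns dead time are hypothetical.

```python
def spike_counts(current, window, c_m, v_th, t_dead):
    """Return (ideal, real) output spike counts for a constant input current.

    ideal: every C_m * V_th of charge produces one spike (linear in current).
    real:  each spike additionally costs a fixed dead time t_dead.
    Both are continuous estimates, not rounded to whole spikes.
    """
    t_charge = c_m * v_th / current     # time to charge Cm from 0 to Vth
    ideal = window / t_charge
    real = window / (t_charge + t_dead)
    return ideal, real


def linearity(current, window, c_m, v_th, t_dead):
    """Ratio real/ideal; 1.0 would be a perfectly linear response."""
    ideal, real = spike_counts(current, window, c_m, v_th, t_dead)
    return real / ideal
```

In this model the ratio equals t_charge / (t_charge + t_dead), so it falls as the weighted input current grows, which reproduces the bend away from the ideal line. Raising Cm lengthens t_charge and shortening the dead time shrinks t_dead; both move the ratio back toward 1, matching the two optimizations listed.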

Adaptability in Neural Network
[Figure: waveforms on word lines WL1 … WL32 and Vout over 0–250 ns for three input patterns (a), (b), and (c)]
Output spike number (nj): (a) 19, (b) 20, (c) 19
• Good adaptability in neural network

Our Approach
• A framework of heterogeneous computing systems enhanced with neuromorphic computing accelerators (NCAs).
• Purpose: to combine the flexibility of conventional architecture in logic and scientific computation with the efficiency of neuromorphic architecture for ANN applications.

Frontend: Prepare Data & Instructions
Instruction | Type | Description
setp reg | Configuration | Place the routing information stored at register reg to the central router
movd #(reg) | I/O | Load the data from memory to the NCA
launch | Configuration | Notify the central router to start transmitting
deq reg | I/O | Dequeue the head data of the Out-queue and write it to register reg

Original code:
    bool RecallBSB(float *vec, float *wm)
    {
        /* simulate the synapse network */
        for (i = 0; i < BsbSize; ++i)
            for (j = 0; j < BsbSize; ++j)
                wx[i] += wm[i*BsbSize + j] * vec[j];
        ……
        /* activation function */
        for (i = 0; i < BsbSize; ++i)
            wx[i] = ALPHA*wx[i] + LAMDA*vec[i];
        ……
        /* check convergence */
        for (i = 0; i < BsbSize; ++i)
            if (fabsf(vec[i]) != 1.0)
                return false;
        return true;
    }

After source-to-source translation and training:
    bool RecallBSB(float *vec)
    {
        /* inputs to NCA */
        Send(NCA.id, vec);
        /* outputs from NCA */
        return Receive(NCA.id);
    }

After NCA-aware compilation:
    ; send each input from register to input
    ; buffer associated with specific NCA
    LW R1, $(vec)
    MOVD NCA.id, R1
    ……
    ; launch the NCA
    SET NCA.id, #VAL
    LAUNCH
    ; put the output from output buffer of
    ; NCA to register, here is only one output
    DEQ R1, NCA.id
    RET
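For readers who want to see what the NCA is standing in for, the recall loop of the original RecallBSB routine can be written as a short software reference model. This is a sketch in Python rather than the slide's C, and it fills the elided steps with assumptions: the saturating clip to [-1, 1] is the usual BSB squashing step and is assumed here, as are the iteration limit and the name `bsb_recall`.

```python
def bsb_recall(vec, wm, alpha, lam, max_iters=100):
    """Iterate a BSB-style recall until every element saturates at +/-1.

    vec: input vector of length n.
    wm:  row-major weight matrix stored as a flat list of n*n values,
         matching the wm[i*BsbSize + j] indexing on the slide.
    Returns (converged, final_vector).
    """
    n = len(vec)
    x = list(vec)
    for _ in range(max_iters):
        # Synapse network: wx = W x
        wx = [sum(wm[i * n + j] * x[j] for j in range(n)) for i in range(n)]
        # Activation: alpha * wx + lambda * x, then clip (clip is assumed)
        x = [max(-1.0, min(1.0, alpha * wx[i] + lam * x[i])) for i in range(n)]
        # Convergence: all elements sit at a corner of the box
        if all(abs(v) == 1.0 for v in x):
            return True, x
    return False, x
```

With a 2 × 2 identity weight matrix, alpha = 0.5, and lam = 1.0, the input [0.2, -0.4] grows by a factor of 1.5 per iteration and settles at [1.0, -1.0]. After translation, this whole loop is what a single Send/Receive pair hands off to the accelerator.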

Backend: System Design
• Tightly coupled design
• Invoked by special instructions
[Figure: a general-purpose processor (conventional processing: pipeline with fetch, decode, issue, execute, memory, and write-back stages, register file, cache, SRAM, and I/O) coupled to neuromorphic computing accelerators; each NCA has an arbiter, I/O, configuration, and buffers, and bridges with ADCs connect the two sides]

NCA Architecture
• A hierarchical structure of memristor-based crossbar (MBC) arrays
• Mixed-signal NoC
– Analog computation
– Digital control and routing signal transition
– MBC arrays connected in a metamorphous centralized mesh (MCMesh) manner
[Figure: group routers arranged around a central router, with SUM and AMP blocks]

System Level Evaluation
• Two implementations representing tradeoffs between computation performance and accuracy: multi-layer perceptron (MLP) and auto-associative memory (AAM)
• 7 classification benchmarks
• Classification rate is used as the reliability metric
Benchmark | Description
cancer | breast cancer diagnosis
connect-4 | connect-4 game
gene | nucleotide sequence detection
lymphography | lymph diagnosis
MNIST | digit recognition
mushroom | poisonous mushroom discrimination
thyroid | thyroid diagnosis

Experimental Setup
[Table: The Design Parameters of NCA Components]
[Table: The Benchmark Implementation Details]

Impact of Deficient Hardware
• Programming precision due to limited device resolution
• Device variations and signal fluctuations
• AAM is more robust than MLP
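The first item, limited programming precision, can be illustrated by snapping a target conductance to the nearest of a fixed number of programmable levels. This is a minimal sketch assuming evenly spaced levels between `g_min` and `g_max`; the slides do not specify how device states are spaced, and the function names are hypothetical.

```python
def quantize(g, g_min, g_max, levels):
    """Snap a target conductance to the nearest of `levels` evenly spaced states."""
    if levels < 2:
        raise ValueError("need at least two levels")
    step = (g_max - g_min) / (levels - 1)
    index = round((g - g_min) / step)
    index = max(0, min(levels - 1, index))   # clamp to the device range
    return g_min + index * step


def worst_case_error(g_min, g_max, levels):
    """Largest programming error for an in-range target: half a step."""
    return (g_max - g_min) / (levels - 1) / 2
```

On a normalized range of 0 to 1 with 5 levels, a target of 0.3 is stored as 0.25 and the worst-case error is 0.125. Going to 9 levels halves that error, which shows how device resolution bounds the accuracy of every stored weight.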

MBC Size Exploration
• A larger array size is preferable from a performance perspective
• However, as the array size increases, the classification rate degrades because of the aggravated variations

Comparison w/ Other Designs
• Baseline: CPU as general-purpose processor
• D-NPU: a popular digital neuromorphic accelerator (MICRO'12)
• MBCs+D-Net: MBC arrays w/ digital NoC, in order to evaluate the efficiency of the mixed-signal NoC
• NCA: our design w/ MBC arrays and mixed-signal NoC
[Figure: an array of processing elements (PEs), each with an input buffer, weight buffer, multiply-add unit, accumulator register, sigmoid unit, and output buffer]

Comparison w/ Other Designs
• D-NPU is limited by the computational bandwidth.
• MBCs+D-Net is limited by the costly AD/DA conversions.
NCA improvement:
Metric | MLP | AAM
Speedup | 177.67 | 27.20
Energy saving | 184.71 | 25.18

Conclusion & Perspective
• The invention of new devices inspires the study of next-generation neuromorphic computing systems.
• A spiking neuromorphic computing system that leverages the memristor crossbar array is demonstrated.
• We propose a heterogeneous system that combines the flexibility of conventional architecture with the efficiency of neuromorphic architecture.
• In future research, we plan to extend the investigation to larger-scale ANN applications.
• Techniques to enhance run-time robustness in the training and testing procedures will be studied.

Thank You and Questions?