Introduction to Neural Networks
Ritchie Zhao, Zhiru ZhangSchool of Electrical and Computer Engineering
ECE 5775 (Fall’17)High-Level Digital Design Automation
▸ Neural networks have revolutionized the world– Self-driving vehicles– Advanced image recognition– Game AI
1
Rise of the Machines
2
Rise of the Machines
▸ Neural networks have revolutionized research
▸ 1950’s: First artificial neural networks created based on biological structures in the human visual cortex
▸ 1980’s – 2000’s: NNs considered inferior to other, simpler algorithms (e.g. SVM, logistic regression)
▸ Mid 2000’s: NN research considered “dead”, machine learning conferences outright reject most NN papers
▸ 2010 – 2012: NNs begin winning large-scale image and document classification contests, beating other methods
▸ 2012 – Now: NNs prove themselves in many industrial applications (web search, translation, image analysis)
3
A Brief History
▸ We’ll discuss neural networks for solving supervised classification problems
4
Classification Problems
Cat
Bird
Person
Inputs Labels
Training Set
???
Inputs Predictions
Test Set
• Given a training set consisting of labeled inputs
• Learn a function which maps inputs to labels
• This function is used to predict the labels of new inputs
▸ Inputs: Pairs of numbers (𝒙𝟏, 𝒙𝟐)▸ Labels: 0 or 1
▸ This is a binary decision problem
▸ Real-life analogue:– Label = Raining or Not Raining– 𝒙𝟏 = relative humidity– 𝒙𝟐 = cloud coverage
5
A Simple Classification Problem
x1 x2 Label-0.7 -0.6 0-0.6 0.4 0-0.5 0.7 1-0.5 -0.2 0-0.1 0.7 10.1 -0.2 00.3 0.5 10.4 0.1 10.4 -0.7 00.7 0.2 1
Training Set
6
Visualizing the Data
Plot of the data points
𝒙𝟏𝒙𝟐
x1 x2 Label-0.7 -0.6 0-0.6 0.4 0-0.5 0.7 1-0.5 -0.2 0-0.1 0.7 10.1 -0.2 00.3 0.5 10.4 0.1 10.4 -0.7 00.7 0.2 1
Training Set
▸ In this case, the data points can be classified with a linear decision boundary
7
Decision Function
Decision boundary
▸ The perceptron is the simplest possible neural network, containing only one neuron
▸ Described by the following equation:
𝑦 = 𝜎(*𝑤,𝑥,
.
,/0
+ 𝑏)
8
The Perceptron
𝑤, = weights𝑏 = bias𝜎 = activation function
▸ A 2-input, 1-output perceptron:
9
Visualizing the Perceptron
𝑤0𝜎
𝑤3+
𝑥0
𝑥3𝑦
𝑏
Linear function of �⃗� =𝑥0𝑥3
𝑊�⃗� + 𝑏
Non-linear function“squashes” output to
the interval (0,1)
OutputProbability that the label should be “1”
▸ The activation function σ is non-linear:
10
Activation Function
Unit Step Sigmoid ReLU
Discontinuous at origin, zero derivative
everywhere
Continuous anddifferentiable everywhere
Continuous but non-differentiable
at origin
11 + 𝑒9: max(0, 𝑥)
▸ First perceptrons used Unit Step. Early neural networks used Sigmoid. Modern deep networks use ReLU.
▸ Let’s plot the 2-input perceptron (sigmoid activation)𝑦 = 𝜎(𝑤0𝑥0 + 𝑤3𝑥3 + 𝑏)
11
The Perceptron Decision Boundary
𝑤0 = 10𝑤3 = 10
𝑤0 = 5𝑤3 = 20
𝑤0 = 20𝑤3 = 5
Ratio of weights change the direction of the decision boundary
▸ Let’s plot the 2-input perceptron (sigmoid activation)𝑦 = 𝜎(𝑤0𝑥0 + 𝑤3𝑥3 + 𝑏)
12
The Perceptron Decision Boundary
𝑤0 = 10𝑤3 = 10
𝑤0 = 5𝑤3 = 5
𝑤0 = 2𝑤3 = 2
Magnitude of weights change the steepness of the decision boundary
▸ Let’s plot the 2-input perceptron (sigmoid activation)𝑦 = 𝜎(𝑤0𝑥0 + 𝑤3𝑥3 + 𝑏)
13
The Perceptron Decision Boundary
𝑏 = 0 𝑏 = 5 𝑏 = −5
Bias moves boundary away the from origin
▸ With the right weights and bias we can create any (linear) decision boundary we want!
▸ But how do we compute the weights and bias?
▸ Solution: backpropagation training
14
Finding the Weights and Bias
▸ Let’s take the 2-input perceptron and initialize the weights and bias randomly
15
Backpropagation
𝑤0
𝜎
𝑤3
+
𝑥0
𝑥3
𝑦
𝑏1.4
5.2
−8.7
Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University
▸ Send one input pair through and calculate the output
16
Backpropagation
1.4
x1 x2 Label-0.7 -0.6 0
−0.7
−0.6
1.582.98 0.95 Output: 0.95
Target: 0.0
𝑤0
𝜎
𝑤3
+
𝑥0
𝑥3
𝑦
𝑏
5.2
−8.7
Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University
▸ How to change {𝑤0, 𝑤3, 𝑏} to move 𝑦 towards the target?
▸ Answer: partial derivatives JKJLM
, JKJLN
, JKJO
17
Backpropagation Training
1.4
−0.7
−0.6
1.582.98 0.95 Output: 0.95
Target: 0.0
𝑤0
𝜎
𝑤3
+
𝑥0
𝑥3
𝑦
𝑏
5.2
−8.7
Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University
▸ We can calculate the derivatives function-by-function▸ Green = partial derivative
18
Backpropagation
1.4
−0.7
−0.6
2.98 0.95
1.0𝜕𝑦𝜕𝑦 = 1.00.046
𝑠 𝑦𝑤0
𝜎
𝑤3
+
𝑥0
𝑥3
𝑦
𝑏
5.2
−8.7
Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University
𝑦 = σ(𝑠) =1
1 + 𝑒9S
𝜕𝑦𝜕𝑠 =
𝑒9S
(1 + 𝑒9S)3 = 0.046
▸ We can calculate the derivatives function-by-function ▸ Propagate partial derivatives backwards
19
Backpropagation
5.2
−8.7
1.4
−0.7
−0.6
2.98 0.95
1.0
𝑠 = 𝑤0𝑥0 + 𝑤3𝑥3 + 𝑏
𝜕𝑠𝜕𝑏 = 1
𝜕𝑠𝜕𝑤0
= 𝑥0
𝜕𝑠𝜕𝑤3
= 𝑥3
0.046
𝑠 𝑦
0.046
−0.032
−0.028
𝑤0
𝜎
𝑤3
+
𝑥0
𝑥3
𝑦
𝑏
Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University
▸ What we’ve done is apply the chain rule of derivatives
20
Backpropagation
5.2
−8.7
1.4
−0.7
−0.6
2.98 0.95
1.0
𝜕𝑦𝜕𝑤0
=𝜕𝑦𝜕𝑠
𝜕𝑠𝜕𝑤0
𝜕𝑦𝜕𝑠 = 0.046
𝜕𝑠𝜕𝑤0
= −0.7
0.046
𝑠 𝑦
0.046
−0.032
−0.028
𝑤0
𝜎
𝑤3
+
𝑥0
𝑥3
𝑦
𝑏
Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University
▸ In general, any operator in a neural network defines a forwards pass and a backwards pass
21
Backpropagation
𝑓
𝑥
𝑦
𝑓(𝑥, 𝑦)
𝜕𝐸𝜕𝑓
𝜕𝐸𝜕𝑥 =
𝜕𝐸𝜕𝑓
𝜕𝑓𝜕𝑥
𝜕𝐸𝜕𝑦 =
𝜕𝐸𝜕𝑓
𝜕𝑓𝜕𝑦
Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University
▸ Backpropagation allows us to build complex networks and still be able to update the weights
▸ We now know how to compute JKJLM
, JKJLN
, JKJO
▸ Parameter update rule (for minimizing 𝑦):
𝑤 = 𝑤 − η JKJL
η = learning rate or step size
▸ This is optimization by gradient descent, as we move in the direction of the steepest gradient
22
Weight Update
▸ For simplicity we minimized 𝑦, since label 𝑡 = 0▸ In practice we use a loss function 𝐿(𝑦, 𝑡)
– Square Loss: 𝐿 𝑦, 𝑡 = (𝑦 − 𝑡)3
– Cross Entropy: 𝐿 𝑦, 𝑡 = −𝑡 ln 𝑦 − 1 − 𝑡 ln(1 − 𝑦)
▸ For simplicity we dealt with 1 training sample▸ In practice we process multiple training samples
(a minibatch) and sum the gradients before updating the weights
23
Practical Differences
▸ The perceptron’s linear decision boundary is not very useful for complex problems▸ A deep neural network (DNN) consists of many
layers, each containing multiple neurons
24
Deep Neural Network
Image credit: http://www.opennn.net/
This network is fully connected
Input Layer Hidden LayersOutput Layer
▸ Each hidden unit learns a hidden feature▸ Later features depends on earlier features in the
network, so they will get more complex
25
What Happens in a DNN?
Image credit: http://www.opennn.net/
Simple FeaturesMore Complex Features
▸ A deep neural network creates highly non-linear decision boundaries
26
Deep Neural Network
2 units 3 units 4 units
Image credit: https://www.carl-olsson.com/fall-semester-2013/
▸ Convolutional neural network (CNN) is a neural network specifically built to analyze images▸ State-of-the-art in image classification▸ Key difference: the layers at the front are
convolutional layers27
Convolutional Neural Network
Convolutional Layers Fully-connected Layers
Image credit: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Fully-connected Layer 1. An output neuron is connected
to each input neuron
Convolutional Layer1. Inputs and outputs are vectors
of 2D feature maps2. An output pixel is connected
to its neighboring region oneach input feature map
3. All pixels on an output feature map use the same weights
28
The Convolutional Layer
Image credit: https://inspirehep.net/record/1501944/plots
29
The Convolutional Layer
Image credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Input Feature Map
Output Feature Map
▸ The conv layer moves a weight filter across the input map, doing a dot product at each location
▸ The weight filter detects local features in the input map
▸ Convolutional layers detect visual features▸ This shows features in the input image which cause
high activity, across 3 different conv layers in a CNN
30
Intuition on CNNs
Image credit: https://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/; H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, CACM Oct 2011
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural NetworksCheng Zhang1, Peng Li2, Guangyu Sun1,3, Yijin Guan1, Bingjun Xiao2, Jason Cong2,3,1
1Center for Energy-Efficient Computing and Applications, Peking University2Computer Science Department, University of California, Los Angeles3PKU/UCLA Joint Research Institute in Science and EngineeringFPGA’15, Feb 2016
31
Accelerating CNNs on FPGA
32
Convolutional Layer Code
1 for(row=0; row<R; row++) {2 for(col=0; col<C; col++) {3 for(to=0; to<M; to++) {4 for(ti=0; ti<N; ti++) {5 for(i=0; i<K; i++) {6 for(j=0; j<K; j++) {
output_fm[to][row][col] += weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
Loop over output pixelsLoop over output feature mapsLoop over input feature mapsLoop over filter pixels
Filters
Output Feature MapsInput Feature Maps
N MC
R
K
33
Challenges for FPGA
...Computation Resource
PE-1 PE-2 PE-n
Inter-connect
buffer1On-chip memory
buffer2
External Memory
Off-chip Bus
Limited on-chip storage
Limited off-chip bandwidth
Limited compute and routing resources
▸ We can’t just unroll all the loops due to limited FPGA resources▸ Must choose the right pragmas and code transformations
Solution: Loop pipelining
Solution: Loop tiling
Solution: Data reuse, compute-memory balance
34
Loop Tiling
1 for(row=0; row<R; row++) {2 for(col=0; col<C; col++) {3 for(to=0; to<M; to++) {4 for(ti=0; ti<N; ti++) {5 for(i=0; i<K; i++) {6 for(j=0; j<K; j++) {
output_fm[to][row][col] += weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
Loop over pixels in an output map
Filters
Output Feature MapsInput Feature Maps
N MC
R
K
1 for(row=0; row<R; row+=Tr) {2 for(col=0; col<C; col+=Tc) {3 for(to=0; to<M; to++) {4 for(ti=0; ti<N; ti++) {5 for(trr=row; trr<min(row+Tr, R); trr++) {6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {7 for(i=0; i<K; i++) {8 for(j=0; j<K; j++) {
output_fm[too][trr][tcc] += weights[too][tii][i][j]*input_fm[tii][S*trr+i][S*tcc+j];
}}}}}}}}}} 35
Loop Tiling
Filters
Output Feature MapsInput Feature Maps
N MC
R
K
Loops over pixelsin each tile
Tc
Tr
Loop over different tiles in output map
1 for(row=0; row<R; row+=Tr) {2 for(col=0; col<C; col+=Tc) {3 for(to=0; to<M; to+=Tm) {4 for(ti=0; ti<N; ti+=Tn) {
// software: write output feature map// software: write input feature map + filters
5 for(trr=row; trr<min(row+Tr, R); trr++) {6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {7 for(too=to; too<min(to+Tm, M); too++) {8 for(tii=ti; tii<(ti+Tn, N); tii++) {9 for(i=0; i<K; i++) {10 for(j=0; j<K; j++) {
output_fm[too][trr][tcc] += weights[too][tii][i][j]*input_fm[tii][S*trr+i][S*tcc+j];
}}}}}}
// software: read output feature map}}}}
36
Code with Loop Tiling
Software Portion
FPGA Portion
Maximize memory reuse
Maximize computational performance
5 for(trr=row; trr<min(row+Tr, R); trr++) {6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {7 for(too=to; too<min(to+Tm, M); too++) {8 for(tii=ti; tii<(ti+Tn, N); tii++) {9 for(i=0; i<K; i++) {10 for(j=0; j<K; j++) {
output_fm[too][trr][tcc] += weights[too][tii][i][j]*input_fm[ti][S*trr+i][S*tcc+j];
}}}}}}
5 for(trr=row; trr<min(row+Tr, R); trr++) {6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {7 for(too=to; too<min(to+Tm, M); too++) {8 for(tii=ti; tii<(ti+Tn, N); tii++) {9 for(i=0; i<K; i++) {10 for(j=0; j<K; j++) {
output_fm[too][trr][tcc] += weights[too][tii][i][j]*input_fm[ti][S*trr+i][S*tcc+j];
}}}}}}
9 for(i=0; i<K; i++) {10 for(j=0; j<K; j++) {5 for(trr=row; trr<min(row+Tr, R); trr++) {6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {7 for(too=to; too<min(to+Tm, M); too++) {8 for(tii=ti; tii<(ti+Tn, N); tii++) {
output_fm[too][trr][tcc] += weights[too][tii][i][j]*input_fm[tii][S*trr+i][S*tcc+j];
}}}}}}
Reorder these two loops to the top level9 for(i=0; i<K; i++) {10 for(j=0; j<K; j++) {5 for(trr=row; trr<min(row+Tr, R); trr++) {6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {
#pragma HLS pipeline7 for(too=to; too<min(to+Tm, M); too++) {8 for(tii=ti; tii<(ti+Tn, N); tii++) {
output_fm[too][trr][tcc] += weights[too][tii][i][j]*input_fm[tii][S*trr+i][S*tcc+j];
}}}}}}
9 for(i=0; i<K; i++) {10 for(j=0; j<K; j++) {5 for(trr=row; trr<min(row+Tr, R); trr++) {6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {
#pragma HLS pipeline7 for(too=to; too<min(to+Tm, M); too++) {
#pragma HLS unroll8 for(tii=ti; tii<(ti+Tn, N); tii++) {
output_fm[too][trr][tcc] += weights[too][tii][i][j]*input_fm[ti][S*trr+i][S*tcc+j];
}}}}}}
9 for(i=0; i<K; i++) {10 for(j=0; j<K; j++) {5 for(trr=row; trr<min(row+Tr, R); trr++) {6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {
#pragma HLS pipeline7 for(too=to; too<min(to+Tm, M); too++) {
#pragma HLS unroll8 for(tii=ti; tii<(ti+Tn, N); tii++) {
#pragma HLS unrolloutput_fm[too][trr][tcc] += weights[too][tii][i][j]*input_fm[ti][S*trr+i][S*tcc+j];
}}}}}}
Generated HardwarePerformance determined by tile factors Tn and TmAmount of data transfer determined by Tr and Tc
Tm
Tn
37
Optimizing for On-Chip Performance
▸ Challenge: Number of available optimizations present a huge space of possible designs– Optimal loop order? Tile size for each loop?
▸ Implementing and testing each design by hand will be slow and error-prone– Some designs exceed the on-chip resources
▸ Solution: Performance modeling + automated design space exploration
38
Design Space Complexity
▸ We calculate the following design metrics:– Total number of operations (FLOP)
• Depends on the CNN model parameters
– Total external memory access (Byte)• Depends on the CNN weight and activation size
– Total execution time (S)• Depends on the hardware architecture (e.g. tile factors Tm and Tn)• Ignore resource constrains for now
▸ Execution Time = Number of Cycles×Clock Period
▸ Number of Cycles ≈ ^_̀
× a_b
×𝑅×𝐶×𝐾3
39
Performance Modeling
40
Performance Modeling
...Computation Resource
PE-1 PE-2 PE-n
Inter-connect
buffer1On-chip memory
buffer2
External Memory
Off-chip Bus
Computational Throughput
Required Bandwidth
Computation To Communication (CTC) Ratio
GFLOP/S
FLOP / (Byte Accessed)
GByte/S
=Totalnumberofoperations
Totalexecutiontime
=TotalnumberofoperationsTotalexternalmemoryaccess
𝐺𝐵𝑦𝑡𝑒𝑆
= 𝐺𝐹𝐿𝑂𝑃/𝑠
𝐹𝐿𝑂𝑃/(𝐵𝑦𝑡𝑒𝐴𝑐𝑐𝑒𝑠𝑠𝑒𝑑)
=ComputationalThroughput
CTCRatio
41
Roofline MethodC
ompu
tatio
nal T
hrou
ghpu
t
Computation To Communication (CTC) Ratio
Required Bandwidth
GFLOP/S
FLOP / Byte
Gbyte/S
Design
42
Roofline MethodC
ompu
tatio
nal T
hrou
ghpu
t
Computation To Communication (CTC) Ratio
GFLOP/S
FLOP / Byte
Bandwidth Roof(Device memory
bandwidth)Computational Roof(Max device throughput)
“The Roofline”Exceeds On-Chip Resources
Bandwidth-BottleneckedTheoretical Throughput
Real Throughput
0
20
40
60
80
100
120
0 10 20 30 40 50 60
43
Design Space Exploration
Computational RoofB
Bandwidth Roof
A
Com
puta
tiona
l Thr
ough
put
Computation To Communication (CTC) Ratio
▸ The accelerator design is implemented with Vivado HLS on the VC707 board containing a Virtex7 FPGA
▸ Hardware resource utilization
▸ Comparison of theoretical and real performance
44
Experimental Evaluation
Resource DSP BRAM LUT FFUsed 2240 1024 186251 205704Available 2800 2060 303600 607200Utilization 80% 50% 61% 34%
Layer 2 Theoretical Measured Difference5.11ms 5.25ms ~5%
45
Experimental Results
0
10
20
30
40
50
60
70
80
90
100
1 thread -O3 16 threads -O3 FPGA
Energy
0
2
4
6
8
10
12
14
16
18
20
1 thread -O3 16 threads -O3 FPGA
Speedup
90x
20x
18x
5x
CPU Xeon E5-2430 (32nm) 16 cores 2.2 GHzgcc 4.7.2 –O3OpenMP 3.0
FPGA Virtex7-485t (28nm) 448 PEs 100MHzVivado 2013.4Vivado HLS 2013.4
FIN
46