
Introduction to Neural Networks

Ritchie Zhao, Zhiru Zhang
School of Electrical and Computer Engineering, Cornell University

ECE 5775 (Fall ’17): High-Level Digital Design Automation

▸ Neural networks have revolutionized the world:
– Self-driving vehicles
– Advanced image recognition
– Game AI

Rise of the Machines

▸ Neural networks have revolutionized research

▸ 1950’s: First artificial neural networks created based on biological structures in the human visual cortex

▸ 1980’s – 2000’s: NNs considered inferior to other, simpler algorithms (e.g. SVM, logistic regression)

▸ Mid 2000’s: NN research considered “dead”, machine learning conferences outright reject most NN papers

▸ 2010 – 2012: NNs begin winning large-scale image and document classification contests, beating other methods

▸ 2012 – Now: NNs prove themselves in many industrial applications (web search, translation, image analysis)

A Brief History

▸ We’ll discuss neural networks for solving supervised classification problems

Classification Problems

[Figure: a training set of labeled inputs (images labeled Cat, Bird, Person) and a test set of new inputs whose labels ("???") must be predicted]

• Given a training set consisting of labeled inputs

• Learn a function which maps inputs to labels

• This function is used to predict the labels of new inputs

▸ Inputs: pairs of numbers $(x_1, x_2)$
▸ Labels: 0 or 1

▸ This is a binary decision problem

▸ Real-life analogue:
– Label = Raining or Not Raining
– $x_1$ = relative humidity
– $x_2$ = cloud coverage

A Simple Classification Problem

x1     x2     Label
-0.7   -0.6   0
-0.6    0.4   0
-0.5    0.7   1
-0.5   -0.2   0
-0.1    0.7   1
 0.1   -0.2   0
 0.3    0.5   1
 0.4    0.1   1
 0.4   -0.7   0
 0.7    0.2   1

Training Set

Visualizing the Data

[Plot of the data points in the $(x_1, x_2)$ plane]


▸ In this case, the data points can be classified with a linear decision boundary

Decision Function

[Plot: the same data with the linear decision boundary drawn between the two classes]

▸ The perceptron is the simplest possible neural network, containing only one neuron

▸ Described by the following equation:

$$y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

The Perceptron

$w_i$ = weights, $b$ = bias, $\sigma$ = activation function

▸ A 2-input, 1-output perceptron:

Visualizing the Perceptron

[Diagram: inputs $x_1, x_2$ are scaled by weights $w_1, w_2$ and summed with the bias $b$; this linear function $W\vec{x} + b$ of $\vec{x} = (x_1, x_2)$ feeds the non-linear $\sigma$, which "squashes" the result into the interval (0, 1); the output $y$ is the probability that the label should be "1"]
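To make the data flow concrete, here is a minimal C sketch of this forward pass (not from the slides), assuming the sigmoid activation introduced on the next slide; the weights and inputs are illustrative only:

#include <math.h>
#include <stdio.h>

/* Sigmoid activation: squashes any real number into (0, 1). */
static double sigmoid(double s) { return 1.0 / (1.0 + exp(-s)); }

/* 2-input, 1-output perceptron: y = sigmoid(w1*x1 + w2*x2 + b). */
static double perceptron(double w1, double w2, double b,
                         double x1, double x2) {
    return sigmoid(w1 * x1 + w2 * x2 + b);
}

int main(void) {
    /* Illustrative parameters, not values from the slides. */
    double y = perceptron(10.0, 10.0, 0.0, 0.3, 0.5);
    printf("y = %f\n", y);  /* near 1.0: the model predicts label "1" */
    return 0;               /* compile with: cc percep.c -lm */
}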

▸ The activation function σ is non-linear:

Activation Function

– Unit Step: discontinuous at the origin, zero derivative everywhere
– Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, continuous and differentiable everywhere
– ReLU: $\max(0, x)$, continuous but non-differentiable at the origin

▸ First perceptrons used Unit Step. Early neural networks used Sigmoid. Modern deep networks use ReLU.
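For reference, a small C sketch of the three activations as usually defined (the value of the unit step exactly at 0 is a matter of convention):

#include <math.h>

/* Unit step: discontinuous at the origin, zero derivative elsewhere. */
static double unit_step(double x) { return x >= 0.0 ? 1.0 : 0.0; }

/* Sigmoid: 1 / (1 + e^{-x}), continuous and differentiable everywhere. */
static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* ReLU: max(0, x), continuous but non-differentiable at the origin. */
static double relu(double x) { return x > 0.0 ? x : 0.0; }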

▸ Let's plot the 2-input perceptron (sigmoid activation): $y = \sigma(w_1 x_1 + w_2 x_2 + b)$

The Perceptron Decision Boundary

[Plots for $(w_1, w_2) = (10, 10)$, $(5, 20)$, and $(20, 5)$]

The ratio of the weights changes the direction of the decision boundary.

▸ Let's plot the 2-input perceptron (sigmoid activation): $y = \sigma(w_1 x_1 + w_2 x_2 + b)$

The Perceptron Decision Boundary

[Plots for $(w_1, w_2) = (10, 10)$, $(5, 5)$, and $(2, 2)$]

The magnitude of the weights changes the steepness of the decision boundary.

▸ Let's plot the 2-input perceptron (sigmoid activation): $y = \sigma(w_1 x_1 + w_2 x_2 + b)$

The Perceptron Decision Boundary

[Plots for $b = 0$, $b = 5$, and $b = -5$]

The bias moves the boundary away from the origin.
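One way to see all three effects at once (my summary, not on the slides): the boundary is the set of inputs where the output is exactly 0.5, i.e. where the pre-activation is zero:

$$\sigma(w_1 x_1 + w_2 x_2 + b) = 0.5 \iff w_1 x_1 + w_2 x_2 + b = 0$$

This is a line whose direction is set by the ratio $w_1 : w_2$, whose distance from the origin is $|b| / \sqrt{w_1^2 + w_2^2}$; the overall magnitude of the weights controls how sharply $\sigma$ transitions across it.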

▸ With the right weights and bias we can create any (linear) decision boundary we want!

▸ But how do we compute the weights and bias?

▸ Solution: backpropagation training

Finding the Weights and Bias

▸ Let’s take the 2-input perceptron and initialize the weights and bias randomly

Backpropagation

[Diagram: the 2-input perceptron with randomly initialized weights $w_1 = 5.2$, $w_2 = -8.7$ and bias $b = 1.4$]

Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University

▸ Send one input pair through and calculate the output

Backpropagation

[Diagram: forward pass on the first training sample $(x_1, x_2) = (-0.7, -0.6)$, label 0. The weighted sum of the inputs is $1.58$; adding $b = 1.4$ gives $s = 2.98$, and $y = \sigma(2.98) = 0.95$. Output: 0.95, Target: 0.0]

▸ How do we change $\{w_1, w_2, b\}$ to move $y$ towards the target?

▸ Answer: the partial derivatives $\dfrac{\partial y}{\partial w_1}$, $\dfrac{\partial y}{\partial w_2}$, $\dfrac{\partial y}{\partial b}$

Backpropagation Training


▸ We can calculate the derivatives function-by-function
▸ (In the slide figures, green marks the partial derivatives)

Backpropagation

[Diagram: starting from the output, the gradient $\frac{\partial y}{\partial y} = 1.0$ is propagated backwards through $\sigma$:]

$$y = \sigma(s) = \frac{1}{1 + e^{-s}}, \qquad \frac{\partial y}{\partial s} = \frac{e^{-s}}{(1 + e^{-s})^2} = 0.046 \quad \text{at } s = 2.98$$

▸ We can calculate the derivatives function-by-function
▸ Propagate partial derivatives backwards

Backpropagation

[Diagram: $\frac{\partial y}{\partial s} = 0.046$ is propagated back through the sum node:]

$$s = w_1 x_1 + w_2 x_2 + b \;\Rightarrow\; \frac{\partial s}{\partial b} = 1, \quad \frac{\partial s}{\partial w_1} = x_1, \quad \frac{\partial s}{\partial w_2} = x_2$$

This gives $\frac{\partial y}{\partial b} = 0.046$, $\frac{\partial y}{\partial w_1} = -0.032$, and $\frac{\partial y}{\partial w_2} = -0.028$.

▸ What we’ve done is apply the chain rule of derivatives

Backpropagation

$$\frac{\partial y}{\partial w_1} = \frac{\partial y}{\partial s} \cdot \frac{\partial s}{\partial w_1} = 0.046 \times (-0.7) = -0.032$$

(and likewise for $w_2$ and $b$)
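The whole computation on these slides fits in a few lines of C. This sketch reproduces the numbers above, using the identity $\sigma'(s) = \sigma(s)\,(1 - \sigma(s))$, which is equivalent to the $e^{-s}/(1+e^{-s})^2$ form:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Initial parameters and the training sample from the slides. */
    double w1 = 5.2, w2 = -8.7, b = 1.4;
    double x1 = -0.7, x2 = -0.6;

    /* Forward pass. */
    double s = w1 * x1 + w2 * x2 + b;   /* 2.98 */
    double y = 1.0 / (1.0 + exp(-s));   /* 0.95 */

    /* Backward pass (chain rule). */
    double dy_ds  = y * (1.0 - y);      /* sigma'(s) = 0.046 */
    double dy_dw1 = dy_ds * x1;         /* -0.032 */
    double dy_dw2 = dy_ds * x2;         /* -0.028 */
    double dy_db  = dy_ds;              /*  0.046 */

    printf("s=%.2f y=%.2f dy/dw1=%.3f dy/dw2=%.3f dy/db=%.3f\n",
           s, y, dy_dw1, dy_dw2, dy_db);
    return 0;
}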

▸ In general, any operator in a neural network defines a forward pass and a backward pass

Backpropagation

[Diagram: on the forward pass an operator computes $f(x, y)$; on the backward pass it receives the upstream gradient $\frac{\partial E}{\partial f}$ and emits:]

$$\frac{\partial E}{\partial x} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial x}, \qquad \frac{\partial E}{\partial y} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial y}$$

▸ Backpropagation allows us to build complex networks and still be able to update the weights

▸ We now know how to compute $\dfrac{\partial y}{\partial w_1}$, $\dfrac{\partial y}{\partial w_2}$, $\dfrac{\partial y}{\partial b}$

▸ Parameter update rule (for minimizing $y$):

$$w \leftarrow w - \eta\,\frac{\partial y}{\partial w}$$

$\eta$ = learning rate or step size

▸ This is optimization by gradient descent: we step against the gradient, in the direction of steepest descent

Weight Update

▸ For simplicity we minimized $y$, since the label $t = 0$
▸ In practice we use a loss function $L(y, t)$:
– Square Loss: $L(y, t) = (y - t)^2$
– Cross Entropy: $L(y, t) = -t \ln y - (1 - t) \ln(1 - y)$

▸ For simplicity we dealt with 1 training sample
▸ In practice we process multiple training samples (a minibatch) and sum the gradients before updating the weights, as in the sketch below
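Putting the loss and the minibatch together, here is a sketch of one update step in C, assuming square loss and the 2-input perceptron from earlier; the function name and batch size are mine, not from the slides:

#include <math.h>

#define BATCH 4

static double sigmoid(double s) { return 1.0 / (1.0 + exp(-s)); }

/* One minibatch step: sum dL/dw over the batch, then take a single
   gradient-descent step w = w - eta * dL/dw. */
void minibatch_step(double *w1, double *w2, double *b,
                    const double x1[BATCH], const double x2[BATCH],
                    const double t[BATCH], double eta) {
    double g1 = 0.0, g2 = 0.0, gb = 0.0;
    for (int n = 0; n < BATCH; n++) {
        double y = sigmoid(*w1 * x1[n] + *w2 * x2[n] + *b);
        /* Square loss: dL/ds = dL/dy * dy/ds = 2*(y - t) * y*(1 - y). */
        double dL_ds = 2.0 * (y - t[n]) * y * (1.0 - y);
        g1 += dL_ds * x1[n];
        g2 += dL_ds * x2[n];
        gb += dL_ds;
    }
    *w1 -= eta * g1;
    *w2 -= eta * g2;
    *b  -= eta * gb;
}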

Practical Differences

▸ The perceptron's linear decision boundary is not very useful for complex problems
▸ A deep neural network (DNN) consists of many layers, each containing multiple neurons

Deep Neural Network

[Figure: a fully connected network with an input layer, several hidden layers, and an output layer. Image credit: http://www.opennn.net/]

▸ Each hidden unit learns a hidden feature
▸ Later features depend on earlier features in the network, so they get more complex

What Happens in a DNN?

[Figure: features progress from simple in the early layers to more complex in the later layers. Image credit: http://www.opennn.net/]

▸ A deep neural network creates highly non-linear decision boundaries

Deep Neural Network

[Figure: decision boundaries for hidden layers of 2, 3, and 4 units. Image credit: https://www.carl-olsson.com/fall-semester-2013/]

▸ A convolutional neural network (CNN) is a neural network specifically built to analyze images
▸ State-of-the-art in image classification
▸ Key difference: the layers at the front are convolutional layers

Convolutional Neural Network

[Figure: a CNN with convolutional layers at the front followed by fully-connected layers. Image credit: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]

Fully-connected layer (sketched below):
1. An output neuron is connected to each input neuron

Convolutional layer:
1. Inputs and outputs are vectors of 2D feature maps
2. An output pixel is connected to its neighboring region on each input feature map
3. All pixels on an output feature map use the same weights
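For contrast with the convolutional layer, a fully-connected layer is just a dense matrix-vector product; a minimal sketch (activation omitted, names mine):

/* Fully-connected layer: every output neuron is connected to every
   input neuron, so the layer computes y = W*x + b. */
void fc_layer(int n_in, int n_out, const float *x,
              const float *W,   /* n_out x n_in, row-major */
              const float *b, float *y) {
    for (int o = 0; o < n_out; o++) {
        float acc = b[o];
        for (int i = 0; i < n_in; i++)
            acc += W[o * n_in + i] * x[i];
        y[o] = acc;  /* an activation (e.g. ReLU) would be applied here */
    }
}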

The Convolutional Layer

Image credit: https://inspirehep.net/record/1501944/plots

The Convolutional Layer

[Figure: a filter slides across the input feature map to produce the output feature map. Image credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution]

▸ The conv layer moves a weight filter across the input map, doing a dot product at each location

▸ The weight filter detects local features in the input map
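A minimal single-map version of this operation in C (my sketch, assuming stride 1 and no padding; the full multi-map, strided version appears in the code a few slides below):

#define R 8   /* input rows */
#define C 8   /* input cols */
#define K 3   /* filter size */

/* Slide a K x K weight filter over the input and take a dot product
   at each location. Output size is (R-K+1) x (C-K+1). */
void conv2d(const float in[R][C], const float w[K][K],
            float out[R - K + 1][C - K + 1]) {
    for (int row = 0; row <= R - K; row++)
        for (int col = 0; col <= C - K; col++) {
            float acc = 0.0f;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    acc += w[i][j] * in[row + i][col + j];
            out[row][col] = acc;
        }
}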

▸ Convolutional layers detect visual features
▸ This shows features in the input image which cause high activity, across 3 different conv layers in a CNN

Intuition on CNNs

Image credit: https://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/; H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, CACM Oct 2011

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
Chen Zhang¹, Peng Li², Guangyu Sun¹,³, Yijin Guan¹, Bingjun Xiao², Jason Cong²,³,¹
¹Center for Energy-Efficient Computing and Applications, Peking University
²Computer Science Department, University of California, Los Angeles
³PKU/UCLA Joint Research Institute in Science and Engineering
FPGA '15, Feb 2015

Accelerating CNNs on FPGA

Convolutional Layer Code

for (row = 0; row < R; row++) {           // loop over output pixels
  for (col = 0; col < C; col++) {
    for (to = 0; to < M; to++) {          // loop over output feature maps
      for (ti = 0; ti < N; ti++) {        // loop over input feature maps
        for (i = 0; i < K; i++) {         // loop over filter pixels
          for (j = 0; j < K; j++) {
            output_fm[to][row][col] +=
                weights[to][ti][i][j] *
                input_fm[ti][S * row + i][S * col + j];
}}}}}}

[Figure: N input feature maps, M output feature maps of size R × C, and K × K filters]

Challenges for FPGA

[Diagram: processing elements (PE-1 … PE-n) connected through an interconnect to on-chip buffers (buffer1, buffer2), which talk to external memory over the off-chip bus]

▸ We can't just unroll all the loops, due to limited FPGA resources
▸ We must choose the right pragmas and code transformations:
– Limited compute and routing resources → solution: loop pipelining
– Limited on-chip storage → solution: loop tiling
– Limited off-chip bandwidth → solution: data reuse, compute-memory balance

Loop Tiling

The row and col loops of the original nest cover every pixel in an output map. Tiling splits them: the outer loops step between Tr × Tc tiles, and the new trr/tcc loops cover the pixels within one tile:

for (row = 0; row < R; row += Tr) {       // loop over tiles in the output map
  for (col = 0; col < C; col += Tc) {
    for (to = 0; to < M; to++) {
      for (ti = 0; ti < N; ti++) {
        for (trr = row; trr < min(row + Tr, R); trr++) {    // loops over pixels
          for (tcc = col; tcc < min(col + Tc, C); tcc++) {  // in one tile
            for (i = 0; i < K; i++) {
              for (j = 0; j < K; j++) {
                output_fm[to][trr][tcc] +=
                    weights[to][ti][i][j] *
                    input_fm[ti][S * trr + i][S * tcc + j];
}}}}}}}}

Loop Tiling

[Figure: the R × C output map is divided into Tr × Tc tiles; the outer loops step over tiles while the inner loops cover the pixels in each tile]

Tiling all four outer loops (row, col, to, ti) splits the computation into a software portion and an FPGA portion:

for (row = 0; row < R; row += Tr) {
  for (col = 0; col < C; col += Tc) {
    for (to = 0; to < M; to += Tm) {
      for (ti = 0; ti < N; ti += Tn) {
        // software: write output feature map
        // software: write input feature map + filters
        for (trr = row; trr < min(row + Tr, R); trr++) {
          for (tcc = col; tcc < min(col + Tc, C); tcc++) {
            for (too = to; too < min(to + Tm, M); too++) {
              for (tii = ti; tii < min(ti + Tn, N); tii++) {
                for (i = 0; i < K; i++) {
                  for (j = 0; j < K; j++) {
                    output_fm[too][trr][tcc] +=
                        weights[too][tii][i][j] *
                        input_fm[tii][S * trr + i][S * tcc + j];
        }}}}}}
        // software: read output feature map
}}}}

Code with Loop Tiling

▸ The outer tile loops are the software portion: they orchestrate data movement, maximizing memory reuse
▸ The inner loops are the FPGA portion, synthesized to hardware to maximize computational performance
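A note on why the tile factors matter for storage (my gloss, following the FPGA'15 paper these slides are based on): only one tile of each array is resident on chip at a time, so with stride S and K × K filters the buffers hold roughly

$$\underbrace{T_n\,(S\,T_r + K - 1)(S\,T_c + K - 1)}_{\text{input tile}} \;+\; \underbrace{T_m T_n K^2}_{\text{weights}} \;+\; \underbrace{T_m T_r T_c}_{\text{output tile}} \quad \text{words,}$$

which is what lets a large layer fit within the limited on-chip memory.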

Within the FPGA portion, the code is transformed in steps: first the two K × K filter loops (i and j) are reordered to the top level; then the tcc loop is pipelined and the too and tii loops are fully unrolled:

for (i = 0; i < K; i++) {
  for (j = 0; j < K; j++) {
    for (trr = row; trr < min(row + Tr, R); trr++) {
      for (tcc = col; tcc < min(col + Tc, C); tcc++) {
#pragma HLS pipeline
        for (too = to; too < min(to + Tm, M); too++) {
#pragma HLS unroll
          for (tii = ti; tii < min(ti + Tn, N); tii++) {
#pragma HLS unroll
            output_fm[too][trr][tcc] +=
                weights[too][tii][i][j] *
                input_fm[tii][S * trr + i][S * tcc + j];
}}}}}}

Generated hardware: the compute performance is determined by the tile factors Tn and Tm, while the amount of data transfer is determined by Tr and Tc.

Optimizing for On-Chip Performance

▸ Challenge: the available optimizations span a huge space of possible designs
– Optimal loop order? Tile size for each loop?

▸ Implementing and testing each design by hand would be slow and error-prone
– Some designs exceed the on-chip resources

▸ Solution: performance modeling + automated design space exploration

Design Space Complexity

▸ We calculate the following design metrics:
– Total number of operations (FLOP)
• Depends on the CNN model parameters
– Total external memory access (Bytes)
• Depends on the CNN weight and activation sizes
– Total execution time (s)
• Depends on the hardware architecture (e.g. tile factors Tm and Tn)
• Ignore resource constraints for now

▸ Execution Time = Number of Cycles × Clock Period

$$\text{Number of Cycles} \approx \frac{M}{T_m} \times \frac{N}{T_n} \times R \times C \times K^2$$
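As a sketch (my code, under the slide's simplifying assumptions: tile factors divide M and N evenly, memory stalls ignored), the model is directly computable:

/* Compute-cycle model: the engine retires Tm x Tn multiply-accumulates
   per cycle, so cycles ~= (M/Tm) * (N/Tn) * R * C * K * K. */
double exec_time_s(int M, int N, int R, int C, int K,
                   int Tm, int Tn, double clock_hz) {
    double cycles = ((double)M / Tm) * ((double)N / Tn)
                  * (double)R * C * K * K;
    return cycles / clock_hz;  /* time = cycles * clock period */
}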

Performance Modeling


▸ Computational Throughput (GFLOP/s) = Total number of operations / Total execution time

▸ Computation-to-Communication (CTC) Ratio (FLOP per byte accessed) = Total number of operations / Total external memory access

▸ Required Bandwidth (GByte/s) = Computational Throughput / CTC Ratio
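These three metrics are one-liners to compute for any candidate design; a small C sketch (struct and names are mine):

typedef struct {
    double total_flop;   /* total number of operations */
    double total_bytes;  /* total external memory access */
    double exec_time_s;  /* total execution time */
} Design;

double throughput_gflops(const Design *d) {   /* GFLOP/s */
    return d->total_flop / d->exec_time_s / 1e9;
}
double ctc_ratio(const Design *d) {           /* FLOP per byte accessed */
    return d->total_flop / d->total_bytes;
}
double required_bw_gbs(const Design *d) {     /* GByte/s */
    return throughput_gflops(d) / ctc_ratio(d);
}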

Roofline Method

[Roofline plot: computational throughput (GFLOP/s) on the y-axis versus the computation-to-communication (CTC) ratio (FLOP/Byte) on the x-axis; each design is a point, and the slope of the line from the origin through it is the design's required bandwidth (GByte/s)]

Roofline Method

[Roofline plot: the bandwidth roof (device memory bandwidth) and the computational roof (maximum device throughput) together form "the roofline". Designs above the roofline exceed on-chip resources; bandwidth-bottlenecked designs achieve a real throughput below their theoretical throughput]

Design Space Exploration

[Plot: the space of candidate designs under the roofline, with design A near the bandwidth roof and design B near the computational roof]

▸ The accelerator design is implemented with Vivado HLS on the VC707 board containing a Virtex7 FPGA

▸ Hardware resource utilization

▸ Comparison of theoretical and real performance

Experimental Evaluation

Resource      DSP    BRAM   LUT      FF
Used          2240   1024   186251   205704
Available     2800   2060   303600   607200
Utilization   80%    50%    61%      34%

          Theoretical   Measured   Difference
Layer 2   5.11 ms       5.25 ms    ~5%

Experimental Results

[Bar charts comparing 1 thread -O3, 16 threads -O3, and the FPGA: the FPGA achieves roughly 18x / 5x speedup over the 1-thread / 16-thread CPU, and roughly 90x / 20x better energy]

CPU: Xeon E5-2430 (32nm), 16 cores, 2.2 GHz; gcc 4.7.2 -O3; OpenMP 3.0
FPGA: Virtex7-485t (28nm), 448 PEs, 100 MHz; Vivado 2013.4; Vivado HLS 2013.4

FIN
