
Introduction to Neural Networks

Ritchie Zhao, Zhiru Zhang
School of Electrical and Computer Engineering, Cornell University

ECE 5775 (Fall ’17): High-Level Digital Design Automation

▸ Neural networks have revolutionized the world:
– Self-driving vehicles
– Advanced image recognition
– Game AI

Rise of the Machines

▸ Neural networks have revolutionized research

▸ 1950’s: First artificial neural networks created based on biological structures in the human visual cortex

▸ 1980’s – 2000’s: NNs considered inferior to other, simpler algorithms (e.g. SVM, logistic regression)

▸ Mid 2000’s: NN research considered “dead”, machine learning conferences outright reject most NN papers

▸ 2010 – 2012: NNs begin winning large-scale image and document classification contests, beating other methods

▸ 2012 – Now: NNs prove themselves in many industrial applications (web search, translation, image analysis)

A Brief History

▸ We’ll discuss neural networks for solving supervised classification problems

Classification Problems

[Figure: a training set of labeled inputs (images labeled Cat, Bird, Person) and a test set of new inputs whose labels ("???") must be predicted]

• Given a training set consisting of labeled inputs

• Learn a function which maps inputs to labels

• This function is used to predict the labels of new inputs

▸ Inputs: pairs of numbers $(x_1, x_2)$
▸ Labels: 0 or 1

▸ This is a binary decision problem

▸ Real-life analogue:
– Label = Raining or Not Raining
– $x_1$ = relative humidity
– $x_2$ = cloud coverage

A Simple Classification Problem

x1     x2     Label
-0.7   -0.6   0
-0.6    0.4   0
-0.5    0.7   1
-0.5   -0.2   0
-0.1    0.7   1
 0.1   -0.2   0
 0.3    0.5   1
 0.4    0.1   1
 0.4   -0.7   0
 0.7    0.2   1

Training Set

Visualizing the Data

[Plot of the data points in the $(x_1, x_2)$ plane]


▸ In this case, the data points can be classified with a linear decision boundary

Decision Function

[Plot: the same data with the linear decision boundary drawn between the two classes]

▸ The perceptron is the simplest possible neural network, containing only one neuron

▸ Described by the following equation:

$$y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

The Perceptron

$w_i$ = weights, $b$ = bias, $\sigma$ = activation function

▸ A 2-input, 1-output perceptron:

Visualizing the Perceptron

[Diagram: inputs $x_1, x_2$ are scaled by weights $w_1, w_2$ and summed with the bias $b$; this linear function $W\vec{x} + b$ of $\vec{x} = (x_1, x_2)$ feeds the non-linear $\sigma$, which "squashes" the result into the interval (0, 1); the output $y$ is the probability that the label should be "1"]
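To make the data flow concrete, here is a minimal C sketch of this forward pass (not from the slides), assuming the sigmoid activation introduced on the next slide; the weights and inputs are illustrative only:

#include <math.h>
#include <stdio.h>

/* Sigmoid activation: squashes any real number into (0, 1). */
static double sigmoid(double s) { return 1.0 / (1.0 + exp(-s)); }

/* 2-input, 1-output perceptron: y = sigmoid(w1*x1 + w2*x2 + b). */
static double perceptron(double w1, double w2, double b,
                         double x1, double x2) {
    return sigmoid(w1 * x1 + w2 * x2 + b);
}

int main(void) {
    /* Illustrative parameters, not values from the slides. */
    double y = perceptron(10.0, 10.0, 0.0, 0.3, 0.5);
    printf("y = %f\n", y);  /* near 1.0: the model predicts label "1" */
    return 0;               /* compile with: cc percep.c -lm */
}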

▸ The activation function σ is non-linear:

Activation Function

– Unit Step: discontinuous at the origin, zero derivative everywhere
– Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, continuous and differentiable everywhere
– ReLU: $\max(0, x)$, continuous but non-differentiable at the origin

▸ First perceptrons used Unit Step. Early neural networks used Sigmoid. Modern deep networks use ReLU.
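For reference, a small C sketch of the three activations as usually defined (the value of the unit step exactly at 0 is a matter of convention):

#include <math.h>

/* Unit step: discontinuous at the origin, zero derivative elsewhere. */
static double unit_step(double x) { return x >= 0.0 ? 1.0 : 0.0; }

/* Sigmoid: 1 / (1 + e^{-x}), continuous and differentiable everywhere. */
static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* ReLU: max(0, x), continuous but non-differentiable at the origin. */
static double relu(double x) { return x > 0.0 ? x : 0.0; }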

▸ Let's plot the 2-input perceptron (sigmoid activation): $y = \sigma(w_1 x_1 + w_2 x_2 + b)$

The Perceptron Decision Boundary

[Plots for $(w_1, w_2) = (10, 10)$, $(5, 20)$, and $(20, 5)$]

The ratio of the weights changes the direction of the decision boundary.

▸ Let's plot the 2-input perceptron (sigmoid activation): $y = \sigma(w_1 x_1 + w_2 x_2 + b)$

The Perceptron Decision Boundary

[Plots for $(w_1, w_2) = (10, 10)$, $(5, 5)$, and $(2, 2)$]

The magnitude of the weights changes the steepness of the decision boundary.

▸ Let's plot the 2-input perceptron (sigmoid activation): $y = \sigma(w_1 x_1 + w_2 x_2 + b)$

The Perceptron Decision Boundary

[Plots for $b = 0$, $b = 5$, and $b = -5$]

The bias moves the boundary away from the origin.
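One way to see all three effects at once (my summary, not on the slides): the boundary is the set of inputs where the output is exactly 0.5, i.e. where the pre-activation is zero:

$$\sigma(w_1 x_1 + w_2 x_2 + b) = 0.5 \iff w_1 x_1 + w_2 x_2 + b = 0$$

This is a line whose direction is set by the ratio $w_1 : w_2$, whose distance from the origin is $|b| / \sqrt{w_1^2 + w_2^2}$; the overall magnitude of the weights controls how sharply $\sigma$ transitions across it.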

▸ With the right weights and bias we can create any (linear) decision boundary we want!

▸ But how do we compute the weights and bias?

▸ Solution: backpropagation training

Finding the Weights and Bias

▸ Let’s take the 2-input perceptron and initialize the weights and bias randomly

Backpropagation

[Diagram: the 2-input perceptron with randomly initialized weights $w_1 = 5.2$, $w_2 = -8.7$ and bias $b = 1.4$]

Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University

▸ Send one input pair through and calculate the output

Backpropagation

[Diagram: forward pass on the first training sample $(x_1, x_2) = (-0.7, -0.6)$, label 0. The weighted sum of the inputs is $1.58$; adding $b = 1.4$ gives $s = 2.98$, and $y = \sigma(2.98) = 0.95$. Output: 0.95, Target: 0.0]

▸ How do we change $\{w_1, w_2, b\}$ to move $y$ towards the target?

▸ Answer: the partial derivatives $\dfrac{\partial y}{\partial w_1}$, $\dfrac{\partial y}{\partial w_2}$, $\dfrac{\partial y}{\partial b}$

Backpropagation Training


▸ We can calculate the derivatives function-by-function
▸ (In the slide figures, green marks the partial derivatives)

Backpropagation

[Diagram: starting from the output, the gradient $\frac{\partial y}{\partial y} = 1.0$ is propagated backwards through $\sigma$:]

$$y = \sigma(s) = \frac{1}{1 + e^{-s}}, \qquad \frac{\partial y}{\partial s} = \frac{e^{-s}}{(1 + e^{-s})^2} = 0.046 \quad \text{at } s = 2.98$$

▸ We can calculate the derivatives function-by-function
▸ Propagate partial derivatives backwards

Backpropagation

[Diagram: $\frac{\partial y}{\partial s} = 0.046$ is propagated back through the sum node:]

$$s = w_1 x_1 + w_2 x_2 + b \;\Rightarrow\; \frac{\partial s}{\partial b} = 1, \quad \frac{\partial s}{\partial w_1} = x_1, \quad \frac{\partial s}{\partial w_2} = x_2$$

This gives $\frac{\partial y}{\partial b} = 0.046$, $\frac{\partial y}{\partial w_1} = -0.032$, and $\frac{\partial y}{\partial w_2} = -0.028$.

▸ What we’ve done is apply the chain rule of derivatives

Backpropagation

$$\frac{\partial y}{\partial w_1} = \frac{\partial y}{\partial s} \cdot \frac{\partial s}{\partial w_1} = 0.046 \times (-0.7) = -0.032$$

(and likewise for $w_2$ and $b$)
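The whole computation on these slides fits in a few lines of C. This sketch reproduces the numbers above, using the identity $\sigma'(s) = \sigma(s)\,(1 - \sigma(s))$, which is equivalent to the $e^{-s}/(1+e^{-s})^2$ form:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Initial parameters and the training sample from the slides. */
    double w1 = 5.2, w2 = -8.7, b = 1.4;
    double x1 = -0.7, x2 = -0.6;

    /* Forward pass. */
    double s = w1 * x1 + w2 * x2 + b;   /* 2.98 */
    double y = 1.0 / (1.0 + exp(-s));   /* 0.95 */

    /* Backward pass (chain rule). */
    double dy_ds  = y * (1.0 - y);      /* sigma'(s) = 0.046 */
    double dy_dw1 = dy_ds * x1;         /* -0.032 */
    double dy_dw2 = dy_ds * x2;         /* -0.028 */
    double dy_db  = dy_ds;              /*  0.046 */

    printf("s=%.2f y=%.2f dy/dw1=%.3f dy/dw2=%.3f dy/db=%.3f\n",
           s, y, dy_dw1, dy_dw2, dy_db);
    return 0;
}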

▸ In general, any operator in a neural network defines a forward pass and a backward pass

Backpropagation

[Diagram: on the forward pass an operator computes $f(x, y)$; on the backward pass it receives the upstream gradient $\frac{\partial E}{\partial f}$ and emits:]

$$\frac{\partial E}{\partial x} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial x}, \qquad \frac{\partial E}{\partial y} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial y}$$

▸ Backpropagation allows us to build complex networks and still be able to update the weights

▸ We now know how to compute $\dfrac{\partial y}{\partial w_1}$, $\dfrac{\partial y}{\partial w_2}$, $\dfrac{\partial y}{\partial b}$

▸ Parameter update rule (for minimizing $y$):

$$w \leftarrow w - \eta\,\frac{\partial y}{\partial w}$$

$\eta$ = learning rate or step size

▸ This is optimization by gradient descent: we step against the gradient, in the direction of steepest descent

Weight Update

▸ For simplicity we minimized $y$, since the label $t = 0$
▸ In practice we use a loss function $L(y, t)$:
– Square Loss: $L(y, t) = (y - t)^2$
– Cross Entropy: $L(y, t) = -t \ln y - (1 - t) \ln(1 - y)$

▸ For simplicity we dealt with 1 training sample
▸ In practice we process multiple training samples (a minibatch) and sum the gradients before updating the weights, as in the sketch below
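Putting the loss and the minibatch together, here is a sketch of one update step in C, assuming square loss and the 2-input perceptron from earlier; the function name and batch size are mine, not from the slides:

#include <math.h>

#define BATCH 4

static double sigmoid(double s) { return 1.0 / (1.0 + exp(-s)); }

/* One minibatch step: sum dL/dw over the batch, then take a single
   gradient-descent step w = w - eta * dL/dw. */
void minibatch_step(double *w1, double *w2, double *b,
                    const double x1[BATCH], const double x2[BATCH],
                    const double t[BATCH], double eta) {
    double g1 = 0.0, g2 = 0.0, gb = 0.0;
    for (int n = 0; n < BATCH; n++) {
        double y = sigmoid(*w1 * x1[n] + *w2 * x2[n] + *b);
        /* Square loss: dL/ds = dL/dy * dy/ds = 2*(y - t) * y*(1 - y). */
        double dL_ds = 2.0 * (y - t[n]) * y * (1.0 - y);
        g1 += dL_ds * x1[n];
        g2 += dL_ds * x2[n];
        gb += dL_ds;
    }
    *w1 -= eta * g1;
    *w2 -= eta * g2;
    *b  -= eta * gb;
}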

Practical Differences

▸ The perceptron's linear decision boundary is not very useful for complex problems
▸ A deep neural network (DNN) consists of many layers, each containing multiple neurons

Deep Neural Network

[Figure: a fully connected network with an input layer, several hidden layers, and an output layer. Image credit: http://www.opennn.net/]

▸ Each hidden unit learns a hidden feature
▸ Later features depend on earlier features in the network, so they get more complex

What Happens in a DNN?

[Figure: features progress from simple in the early layers to more complex in the later layers. Image credit: http://www.opennn.net/]

▸ A deep neural network creates highly non-linear decision boundaries

Deep Neural Network

[Figure: decision boundaries for hidden layers of 2, 3, and 4 units. Image credit: https://www.carl-olsson.com/fall-semester-2013/]

▸ A convolutional neural network (CNN) is a neural network specifically built to analyze images
▸ State-of-the-art in image classification
▸ Key difference: the layers at the front are convolutional layers

Convolutional Neural Network

[Figure: a CNN with convolutional layers at the front followed by fully-connected layers. Image credit: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]

Fully-connected layer (sketched below):
1. An output neuron is connected to each input neuron

Convolutional layer:
1. Inputs and outputs are vectors of 2D feature maps
2. An output pixel is connected to its neighboring region on each input feature map
3. All pixels on an output feature map use the same weights
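For contrast with the convolutional layer, a fully-connected layer is just a dense matrix-vector product; a minimal sketch (activation omitted, names mine):

/* Fully-connected layer: every output neuron is connected to every
   input neuron, so the layer computes y = W*x + b. */
void fc_layer(int n_in, int n_out, const float *x,
              const float *W,   /* n_out x n_in, row-major */
              const float *b, float *y) {
    for (int o = 0; o < n_out; o++) {
        float acc = b[o];
        for (int i = 0; i < n_in; i++)
            acc += W[o * n_in + i] * x[i];
        y[o] = acc;  /* an activation (e.g. ReLU) would be applied here */
    }
}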

The Convolutional Layer

Image credit: https://inspirehep.net/record/1501944/plots

The Convolutional Layer

[Figure: a filter slides across the input feature map to produce the output feature map. Image credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution]

▸ The conv layer moves a weight filter across the input map, doing a dot product at each location

▸ The weight filter detects local features in the input map
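A minimal single-map version of this operation in C (my sketch, assuming stride 1 and no padding; the full multi-map, strided version appears in the code a few slides below):

#define R 8   /* input rows */
#define C 8   /* input cols */
#define K 3   /* filter size */

/* Slide a K x K weight filter over the input and take a dot product
   at each location. Output size is (R-K+1) x (C-K+1). */
void conv2d(const float in[R][C], const float w[K][K],
            float out[R - K + 1][C - K + 1]) {
    for (int row = 0; row <= R - K; row++)
        for (int col = 0; col <= C - K; col++) {
            float acc = 0.0f;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    acc += w[i][j] * in[row + i][col + j];
            out[row][col] = acc;
        }
}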

▸ Convolutional layers detect visual features
▸ This shows features in the input image which cause high activity, across 3 different conv layers in a CNN

Intuition on CNNs

Image credit: https://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/; H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, CACM Oct 2011

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
Chen Zhang¹, Peng Li², Guangyu Sun¹,³, Yijin Guan¹, Bingjun Xiao², Jason Cong²,³,¹
¹Center for Energy-Efficient Computing and Applications, Peking University
²Computer Science Department, University of California, Los Angeles
³PKU/UCLA Joint Research Institute in Science and Engineering
FPGA '15, Feb 2015

Accelerating CNNs on FPGA

Convolutional Layer Code

for (row = 0; row < R; row++) {           // loop over output pixels
  for (col = 0; col < C; col++) {
    for (to = 0; to < M; to++) {          // loop over output feature maps
      for (ti = 0; ti < N; ti++) {        // loop over input feature maps
        for (i = 0; i < K; i++) {         // loop over filter pixels
          for (j = 0; j < K; j++) {
            output_fm[to][row][col] +=
                weights[to][ti][i][j] *
                input_fm[ti][S * row + i][S * col + j];
}}}}}}

[Figure: N input feature maps, M output feature maps of size R × C, and K × K filters]

Challenges for FPGA

[Diagram: processing elements (PE-1 … PE-n) connected through an interconnect to on-chip buffers (buffer1, buffer2), which talk to external memory over the off-chip bus]

▸ We can't just unroll all the loops, due to limited FPGA resources
▸ We must choose the right pragmas and code transformations:
– Limited compute and routing resources → solution: loop pipelining
– Limited on-chip storage → solution: loop tiling
– Limited off-chip bandwidth → solution: data reuse, compute-memory balance

Loop Tiling

The row and col loops of the original nest cover every pixel in an output map. Tiling splits them: the outer loops step between Tr × Tc tiles, and the new trr/tcc loops cover the pixels within one tile:

for (row = 0; row < R; row += Tr) {       // loop over tiles in the output map
  for (col = 0; col < C; col += Tc) {
    for (to = 0; to < M; to++) {
      for (ti = 0; ti < N; ti++) {
        for (trr = row; trr < min(row + Tr, R); trr++) {    // loops over pixels
          for (tcc = col; tcc < min(col + Tc, C); tcc++) {  // in one tile
            for (i = 0; i < K; i++) {
              for (j = 0; j < K; j++) {
                output_fm[to][trr][tcc] +=
                    weights[to][ti][i][j] *
                    input_fm[ti][S * trr + i][S * tcc + j];
}}}}}}}}

Loop Tiling

[Figure: the R × C output map is divided into Tr × Tc tiles; the outer loops step over tiles while the inner loops cover the pixels in each tile]

Tiling all four outer loops (row, col, to, ti) splits the computation into a software portion and an FPGA portion:

for (row = 0; row < R; row += Tr) {
  for (col = 0; col < C; col += Tc) {
    for (to = 0; to < M; to += Tm) {
      for (ti = 0; ti < N; ti += Tn) {
        // software: write output feature map
        // software: write input feature map + filters
        for (trr = row; trr < min(row + Tr, R); trr++) {
          for (tcc = col; tcc < min(col + Tc, C); tcc++) {
            for (too = to; too < min(to + Tm, M); too++) {
              for (tii = ti; tii < min(ti + Tn, N); tii++) {
                for (i = 0; i < K; i++) {
                  for (j = 0; j < K; j++) {
                    output_fm[too][trr][tcc] +=
                        weights[too][tii][i][j] *
                        input_fm[tii][S * trr + i][S * tcc + j];
        }}}}}}
        // software: read output feature map
}}}}

Code with Loop Tiling

▸ The outer tile loops are the software portion: they orchestrate data movement, maximizing memory reuse
▸ The inner loops are the FPGA portion, synthesized to hardware to maximize computational performance
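A note on why the tile factors matter for storage (my gloss, following the FPGA'15 paper these slides are based on): only one tile of each array is resident on chip at a time, so with stride S and K × K filters the buffers hold roughly

$$\underbrace{T_n\,(S\,T_r + K - 1)(S\,T_c + K - 1)}_{\text{input tile}} \;+\; \underbrace{T_m T_n K^2}_{\text{weights}} \;+\; \underbrace{T_m T_r T_c}_{\text{output tile}} \quad \text{words,}$$

which is what lets a large layer fit within the limited on-chip memory.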

Within the FPGA portion, the code is transformed in steps: first the two K × K filter loops (i and j) are reordered to the top level; then the tcc loop is pipelined and the too and tii loops are fully unrolled:

for (i = 0; i < K; i++) {
  for (j = 0; j < K; j++) {
    for (trr = row; trr < min(row + Tr, R); trr++) {
      for (tcc = col; tcc < min(col + Tc, C); tcc++) {
#pragma HLS pipeline
        for (too = to; too < min(to + Tm, M); too++) {
#pragma HLS unroll
          for (tii = ti; tii < min(ti + Tn, N); tii++) {
#pragma HLS unroll
            output_fm[too][trr][tcc] +=
                weights[too][tii][i][j] *
                input_fm[tii][S * trr + i][S * tcc + j];
}}}}}}

Generated hardware: the compute performance is determined by the tile factors Tn and Tm, while the amount of data transfer is determined by Tr and Tc.

Optimizing for On-Chip Performance

▸ Challenge: the available optimizations span a huge space of possible designs
– Optimal loop order? Tile size for each loop?

▸ Implementing and testing each design by hand would be slow and error-prone
– Some designs exceed the on-chip resources

▸ Solution: performance modeling + automated design space exploration

Design Space Complexity

▸ We calculate the following design metrics:
– Total number of operations (FLOP)
• Depends on the CNN model parameters
– Total external memory access (Bytes)
• Depends on the CNN weight and activation sizes
– Total execution time (s)
• Depends on the hardware architecture (e.g. tile factors Tm and Tn)
• Ignore resource constraints for now

▸ Execution Time = Number of Cycles × Clock Period

$$\text{Number of Cycles} \approx \frac{M}{T_m} \times \frac{N}{T_n} \times R \times C \times K^2$$
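As a sketch (my code, under the slide's simplifying assumptions: tile factors divide M and N evenly, memory stalls ignored), the model is directly computable:

/* Compute-cycle model: the engine retires Tm x Tn multiply-accumulates
   per cycle, so cycles ~= (M/Tm) * (N/Tn) * R * C * K * K. */
double exec_time_s(int M, int N, int R, int C, int K,
                   int Tm, int Tn, double clock_hz) {
    double cycles = ((double)M / Tm) * ((double)N / Tn)
                  * (double)R * C * K * K;
    return cycles / clock_hz;  /* time = cycles * clock period */
}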

Performance Modeling


▸ Computational Throughput (GFLOP/s) = Total number of operations / Total execution time

▸ Computation-to-Communication (CTC) Ratio (FLOP per byte accessed) = Total number of operations / Total external memory access

▸ Required Bandwidth (GByte/s) = Computational Throughput / CTC Ratio
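These three metrics are one-liners to compute for any candidate design; a small C sketch (struct and names are mine):

typedef struct {
    double total_flop;   /* total number of operations */
    double total_bytes;  /* total external memory access */
    double exec_time_s;  /* total execution time */
} Design;

double throughput_gflops(const Design *d) {   /* GFLOP/s */
    return d->total_flop / d->exec_time_s / 1e9;
}
double ctc_ratio(const Design *d) {           /* FLOP per byte accessed */
    return d->total_flop / d->total_bytes;
}
double required_bw_gbs(const Design *d) {     /* GByte/s */
    return throughput_gflops(d) / ctc_ratio(d);
}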

Roofline Method

[Roofline plot: computational throughput (GFLOP/s) on the y-axis versus the computation-to-communication (CTC) ratio (FLOP/Byte) on the x-axis; each design is a point, and the slope of the line from the origin through it is the design's required bandwidth (GByte/s)]

Roofline Method

[Roofline plot: the bandwidth roof (device memory bandwidth) and the computational roof (maximum device throughput) together form "the roofline". Designs above the roofline exceed on-chip resources; bandwidth-bottlenecked designs achieve a real throughput below their theoretical throughput]

Design Space Exploration

[Plot: the space of candidate designs under the roofline, with design A near the bandwidth roof and design B near the computational roof]

▸ The accelerator design is implemented with Vivado HLS on the VC707 board containing a Virtex7 FPGA

▸ Hardware resource utilization

▸ Comparison of theoretical and real performance

Experimental Evaluation

Resource      DSP    BRAM   LUT      FF
Used          2240   1024   186251   205704
Available     2800   2060   303600   607200
Utilization   80%    50%    61%      34%

          Theoretical   Measured   Difference
Layer 2   5.11 ms       5.25 ms    ~5%

Experimental Results

[Bar charts comparing 1 thread -O3, 16 threads -O3, and the FPGA: the FPGA achieves roughly 18x / 5x speedup over the 1-thread / 16-thread CPU, and roughly 90x / 20x better energy]

CPU: Xeon E5-2430 (32nm), 16 cores, 2.2 GHz; gcc 4.7.2 -O3; OpenMP 3.0
FPGA: Virtex7-485t (28nm), 448 PEs, 100 MHz; Vivado 2013.4; Vivado HLS 2013.4

FIN
