
CSE446: Neural Networks, Spring 2017

Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer

Human Neurons

• Switching time: ~0.001 second
• Number of neurons: ~10^10
• Connections per neuron: ~10^4 to 10^5
• Scene recognition time: ~0.1 seconds
• Number of cycles per scene recognition? ~100 → much parallel computation!

Perceptron as a Neural Network

This is one neuron:

– Input edges x1 ... xn, along with a bias term

– The weighted sum is represented graphically

– The sum is passed through an activation function g
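As a concrete illustration (not from the slides; the names step, neuron, weights, and bias are my own), a minimal Python sketch of this single neuron, which sums the weighted inputs plus a bias and passes the result through an activation g:

```python
def step(z):
    """Threshold activation g: output 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def neuron(x, weights, bias, g=step):
    """One neuron: weighted sum of inputs x1..xn plus a bias, passed through g."""
    z = bias + sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return g(z)

# Example: with weights 1, 1 and bias -0.5 this neuron computes x1 OR x2
# (see the Boolean-function slide later in the lecture).
print(neuron([1, 0], weights=[1, 1], bias=-0.5))  # prints 1
```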

Sigmoid Neuron

[Plot: the sigmoid activation g, g(x) = 1/(1 + exp(-x)), for x from -6 to 6; the output ranges from 0 to 1.]

Just change g!

• Why would we want to do this?

• Notice new output range [0,1]. What was it before?

• Look familiar?
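Swapping in the sigmoid is literally a one-line change of g in the sketch above (again an illustration, not course code):

```python
import math

def sigmoid(z):
    """Sigmoid activation g: squashes any real z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Same weighted sum as the OR neuron above, but with g = sigmoid:
# the output is now a soft value in (0, 1) instead of a hard 0/1 decision.
print(sigmoid(-0.5 + 1 + 0))  # about 0.62 for x1 = 1, x2 = 0
```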

Optimizing a neuron

We train to minimize sum-squared error

Solution just depends on g’: derivative of activation function!

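The loss and gradient on this slide did not survive extraction; a standard form consistent with the discussion (an assumption on my part, writing out(x) = g(w0 + w·x) and summing over training examples j) is:

  error(w) = 1/2 Σj [ y(j) - g(w0 + w·x(j)) ]^2

  ∂error/∂wi = - Σj [ y(j) - g(w0 + w·x(j)) ] · g'(w0 + w·x(j)) · xi(j)

so the gradient, and hence the update rule, depends on g', the derivative of the activation function.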

Re-deriving the perceptron update

For a specific, incorrectly classified example:

• w = w + y*x (our familiar update!)

Sigmoid units: have to differentiate g
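A hedged sketch of both updates for a single unit (variable names are mine): the perceptron update w = w + y*x for a misclassified example, and the squared-error gradient step for a sigmoid unit, which differs only by the extra g' factor (for the sigmoid, g'(z) = g(z)(1 - g(z))). A bias can be folded in as a constant-1 input feature.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_update(w, x, y):
    """Perceptron rule: if the example (y in {-1, +1}) is misclassified, w = w + y*x."""
    if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def sigmoid_unit_update(w, x, y, eta=0.1):
    """One gradient-descent step on squared error for a sigmoid unit (y in {0, 1}):
    w_i = w_i + eta * (y - out) * out * (1 - out) * x_i."""
    out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [wi + eta * (y - out) * out * (1 - out) * xi for wi, xi in zip(w, x)]

print(perceptron_update([0.0, 0.0, 0.0], [1, 1, 1], +1))  # [1.0, 1.0, 1.0]
```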

Aside: Comparison to logistic regression

• P(Y|X) represented by:

• Learning rule – MLE:
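The two expressions on this slide are missing from the extracted text; the standard logistic-regression versions (stated here as an assumption about what the slide showed) are:

  P(Y = 1 | x, w) = 1 / (1 + exp(-(w0 + Σi wi xi)))

  MLE learning rule (gradient ascent on the conditional log-likelihood):
    wi ← wi + η Σj xi(j) [ y(j) - P(Y = 1 | x(j), w) ]

That is, the same sigmoid as the neuron above, but trained on the conditional likelihood rather than squared error.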

[Plot: the same sigmoid curve as shown above.]

Perceptron, linear classification, Boolean functions: xi∈{0,1}

• Can learn x1 ∨ x2?
  • -0.5 + x1 + x2
• Can learn x1 ∧ x2?
  • -1.5 + x1 + x2
• Can learn any conjunction or disjunction?
  • Disjunction: -0.5 + x1 + … + xn
  • Conjunction: (-n + 0.5) + x1 + … + xn
• Can learn majority?
  • (-0.5*n) + x1 + … + xn
• What are we missing? The dreaded XOR!, etc. (the thresholds above are verified in the sketch below)
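As referenced above, a quick Python check of these thresholds (an illustrative sketch; the helper fires is my own):

```python
from itertools import product

def fires(bias, x):
    """Perceptron with unit weights: fire iff bias + x1 + ... + xn > 0."""
    return int(bias + sum(x) > 0)

n = 3
for x in product([0, 1], repeat=n):
    assert fires(-0.5, x) == int(any(x))              # disjunction: x1 OR ... OR xn
    assert fires(-n + 0.5, x) == int(all(x))          # conjunction: x1 AND ... AND xn
    assert fires(-0.5 * n, x) == int(sum(x) > n / 2)  # majority of the inputs
# XOR, by contrast, has no such setting of weights -- hence the next slides.
```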

Going beyond linear classification

Solving the XOR problem: y = x1 XOR x2 = (x1 ∧ ¬x2) ∨ (x2 ∧ ¬x1)

v1 = (x1 ∧ ¬x2):  -1.5 + 2x1 - x2
v2 = (x2 ∧ ¬x1):  -1.5 + 2x2 - x1
y  = v1 ∨ v2:     -0.5 + v1 + v2

[Network diagram: x1, x2 and a bias unit feed hidden units v1 and v2 with the weights above; v1, v2 and a bias of -0.5 feed the output y with weights 1 and 1.]
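A small Python sketch (names are mine) verifying that this hand-built two-layer network computes XOR:

```python
from itertools import product

def threshold(z):
    """Step activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """Hidden layer: v1 = x1 AND NOT x2, v2 = x2 AND NOT x1; output: v1 OR v2."""
    v1 = threshold(-1.5 + 2 * x1 - x2)
    v2 = threshold(-1.5 + 2 * x2 - x1)
    return threshold(-0.5 + v1 + v2)

for x1, x2 in product([0, 1], repeat=2):
    assert xor_net(x1, x2) == (x1 ^ x2)   # matches XOR on all four inputs
```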

Hidden layer

• Single unit: (see below)

• 1-hidden layer: (see below)

• No longer a convex function!
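The formulas for the first two bullets were lost in extraction; a standard form, consistent with the notation used elsewhere in the lecture (an assumption on my part), is:

  Single unit:      out(x) = g( w0 + Σi wi xi )

  1-hidden layer:   out(x) = g( w0 + Σk wk · g( w0(k) + Σi wi(k) xi ) )

Because the hidden-layer weights sit inside a nonlinearity that is then re-weighted, the squared-error objective is no longer convex in the weights.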


Example: data for NN with hidden layer

Learned weights for hidden layer

Forward propagation (1-hidden layer):

Compute values left to right

1. Inputs: x1, …, xn
2. Hidden: v1, …, vn
3. Output: y

[Network diagram: inputs x1, x2 and a bias unit feed hidden units v1 and v2; v1, v2 and a bias feed the output y.]


Gradient descent for 1-hidden layer

Dropped w0 to make the derivation simpler.

Gradient for the last layer is the same as in the single-node case, but with the hidden nodes v as input!

Gradient descent for 1-hidden layer

Dropped w0 to make the derivation simpler.

For the hidden layer, the gradient has two parts (both spelled out below):

• the normal update for a single neuron
• a recursive computation of the gradient on the output layer
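For concreteness, here are the two gradients these slides refer to, written in the 1-hidden-layer notation above (my reconstruction of the dropped equations, with w0 omitted as stated). Writing net = Σk wk vk and net(k) = Σi wi(k) xi for a given example:

  Last layer:    ∂error/∂wk    = - Σj [ y(j) - out(x(j)) ] · g'(net) · vk

  Hidden layer:  ∂error/∂wi(k) = - Σj [ y(j) - out(x(j)) ] · g'(net) · wk · g'(net(k)) · xi(j)

In the hidden-layer gradient, the first three factors are exactly the gradient propagated back from the output layer, and the remaining two are the usual single-neuron factors.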

Multilayer neural networks

Inference and Learning:

• Forward pass: left to right, each hidden layer in turn
• Gradient computation: right to left, propagating the gradient for each node

[Diagram: a multilayer network, with "Forward" running left to right and "Gradient" running right to left.]

Forward propagation – prediction

• Recursive algorithm

• Start from input layer

• Output of node Vk with parents U1,U2,…:
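The output rule at the end of this slide is cut off; presumably it is Vk = g( w0,k + Σj wj,k Uj ) over the parents Uj of Vk. A small layer-by-layer sketch in Python (illustrative names, not the course's code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers, g=sigmoid):
    """Forward propagation: compute node values layer by layer, left to right.
    `layers` is a list of layers; each layer is a list of (bias, weights) pairs,
    one per node, where `weights` connects that node to every node in the
    previous layer."""
    values = list(x)                      # 1. inputs x1..xn
    for layer in layers:                  # 2. hidden layer(s), 3. output layer
        values = [g(bias + sum(w * u for w, u in zip(weights, values)))
                  for bias, weights in layer]
    return values

# Reusing the hand-set XOR weights from earlier; with a sigmoid g the output is
# soft (about 0.55 for input [1, 0]) rather than a hard 1.
hidden = [(-1.5, [2.0, -1.0]), (-1.5, [-1.0, 2.0])]
output = [(-0.5, [1.0, 1.0])]
print(forward([1, 0], [hidden, output]))
```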

Back-propagation – learning

• Just gradient descent!!!

• Recursive algorithm for computing gradient

• For each example

– Perform forward propagation

– Start from output layer

• Compute gradient of node Vk with parents U1,U2,…

• Update weight wik

• Repeat (move to preceding layer)

Back-propagation – pseudocode

Initialize all weights to small random numbers.

Until convergence, do:
  For each training example (x, y):
    1. Forward propagation: compute node values Vk
    2. For each output unit o (with labeled output y):
         δo = Vo (1 - Vo) (y - Vo)
    3. For each hidden unit h:
         δh = Vh (1 - Vh) Σk in output(h) wh,k δk
    4. Update each network weight wi,j from node i to node j:
         wi,j = wi,j + η δj xi,j
       (where xi,j, the input node i feeds into node j, is just Vi)
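A compact, hedged translation of this pseudocode into runnable Python (NumPy) for a 1-hidden-layer sigmoid network; the XOR training data and all names are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward propagation for one example: hidden values Vh and output Vo."""
    Vh = sigmoid(np.append(x, 1.0) @ W1)    # hidden units (bias folded in as a 1)
    Vo = sigmoid(np.append(Vh, 1.0) @ W2)   # single output unit
    return Vh, Vo

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs
Y = np.array([0.0, 1.0, 1.0, 0.0])                           # XOR labels
n_hidden, eta = 2, 0.5

# Initialize all weights to small random numbers.
W1 = rng.normal(scale=0.5, size=(3, n_hidden))    # input (+bias) -> hidden
W2 = rng.normal(scale=0.5, size=(n_hidden + 1,))  # hidden (+bias) -> output

for _ in range(20000):                  # "until convergence" (fixed epochs here)
    for x, y in zip(X, Y):
        Vh, Vo = forward(x, W1, W2)                        # 1. forward propagation
        delta_o = Vo * (1 - Vo) * (y - Vo)                 # 2. output-unit delta
        delta_h = Vh * (1 - Vh) * W2[:n_hidden] * delta_o  # 3. hidden-unit deltas
        W2 += eta * delta_o * np.append(Vh, 1.0)           # 4. weight updates
        W1 += eta * np.outer(np.append(x, 1.0), delta_h)

# Usually prints [0, 1, 1, 0]; as the next slide notes, the objective is
# non-convex, so a different initialization can get stuck in a local minimum.
print([int(round(forward(x, W1, W2)[1])) for x in X])
```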

Convergence of backprop

• Perceptron leads to convex optimization

– Gradient descent reaches a global minimum

• Multilayer neural nets not convex

– Gradient descent gets stuck in local minima

– Selecting number of hidden units and layers = fuzzy process

– NNs have made a HUGE comeback in the last few years!!!

• Neural nets are back with a new name!!!

– Deep belief networks

– Huge error reduction when trained with lots of data on GPUs

Overfitting in NNs

• Are NNs likely to overfit?
  – Yes, they can represent arbitrary functions!!!

• Avoiding overfitting?
  – More training data
  – Fewer hidden nodes / better topology
  – Regularization
  – Early stopping (see the sketch below)
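A minimal sketch of early stopping, assuming a held-out validation set and hypothetical train_one_epoch / validation_loss / get_weights / set_weights helpers (none of these names come from the slides):

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=1000, patience=10):
    """Stop once validation loss has not improved for `patience` epochs,
    and keep the weights from the best epoch seen so far."""
    best_loss, best_weights, epochs_since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)            # one pass of backprop over training data
        loss = validation_loss(model)     # error on the held-out data
        if loss < best_loss:
            best_loss, best_weights = loss, model.get_weights()
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                     # validation error has stopped improving
    model.set_weights(best_weights)
    return model
```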

Image Models: Object Recognition

Slides from Jeff Dean at Google

Number Detection


What are these numbers?


Acoustic Modeling for Speech Recognition

Trained in <5 days on cluster of 800 machines

30% reduction in Word Error Rate for English (“biggest single improvement in 20 years of speech research”)

Launched in 2012 at the time of the Jelly Bean release of Android

Close collaboration with Google Speech team


2012-era Convolutional Model for Object Recognition

[Architecture diagram: Input → Layer 1 → … → Layer 7 → softmax to predict the object class, with convolutional layers (the same weights are used at all spatial locations in a layer) and fully-connected layers.]

Convolutional networks developed by Yann LeCun (NYU).

Basic architecture developed by Krizhevsky, Sutskever & Hinton (all now at Google). Won the 2012 ImageNet challenge with a 16.4% top-5 error rate.

2014-era Model for Object Recognition

24 layers deep!

Developed by a team of Google researchers. Won the 2014 ImageNet challenge with a 6.66% top-5 error rate.

[Diagram: a module with 6 separate convolutional layers]

Good Fine-grained Classification

“hibiscus” “dahlia”


Good Generalization

Both recognized as a “meal”

Sensible Errors

“snake” “dog”


Works in practice for real users.


Object Detection

YOLO

DEMO

What you need to know about neural networks

• Perceptron:

– Relationship to general neurons

• Multilayer neural nets

– Representation

– Derivation of backprop

– Learning rule

• Overfitting
