Deep Learning Primer Nishith Khandwala



Page 1:

Deep Learning Primer

Nishith Khandwala

Page 2:

Neural Networks

Page 3:

Overview

● Neural Network Basics

● Activation Functions

● Stochastic Gradient Descent (SGD)

● Regularization (Dropout)

● Training Tips and Tricks

Page 4:

Neural Network (NN) Basics

Dataset: (x, y) where x: inputs, y: labels

Steps to train a 1-hidden layer NN:

● Do a forward pass: ŷ = f(xW + b)

● Compute loss: loss(y, ŷ)

● Compute gradients using backprop

● Update weights using an optimization algorithm, like SGD

● Do hyperparameter tuning on the dev set

● Evaluate the NN on the test set
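A minimal NumPy sketch of the training-step portion of this recipe for a 1-hidden-layer network. The layer sizes, sigmoid hidden activation, and squared-error loss are illustrative assumptions, not choices taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 4 input features, 8 hidden units, 1 output.
W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, lr=0.1):
    global W1, b1, W2, b2
    # Forward pass: y_hat = f(xW + b), applied layer by layer.
    h = sigmoid(x @ W1 + b1)
    y_hat = h @ W2 + b2
    loss = 0.5 * np.mean((y_hat - y) ** 2)   # squared-error loss
    # Backprop: combine local gradients via the chain rule.
    d_yhat = (y_hat - y) / len(x)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(0)
    d_h = d_yhat @ W2.T * h * (1 - h)        # sigmoid derivative
    dW1, db1 = x.T @ d_h, d_h.sum(0)
    # SGD update: theta <- theta - lr * grad.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss

x = rng.normal(size=(16, 4))                 # a toy minibatch
y = rng.normal(size=(16, 1))
print(train_step(x, y))
```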

Page 5:

Activation Functions: Sigmoid

Properties:

● Squashes input between 0 and 1.

Problems:

● Saturation of neurons kills gradients.

● Output is not centered at 0.

Page 6:

Activation Functions: Tanh

Properties:

● Squashes input between -1 and 1.

● Output centered at 0.

Problems:

● Saturation of neurons kills gradients.
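A quick NumPy check of both squashing activations and their gradients, illustrating the saturation problem noted on this slide and the previous one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
s, t = sigmoid(z), np.tanh(z)

print(s)            # squashed into (0, 1), not centered at 0
print(t)            # squashed into (-1, 1), centered at 0
print(s * (1 - s))  # sigmoid gradient: ~0 at |z| = 10 (saturated neurons kill gradients)
print(1 - t ** 2)   # tanh gradient: also ~0 at |z| = 10
```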

Page 7:

Activation Functions: ReLU

Properties:

● No saturation

● Computationally cheap

● Empirically known to converge faster

Problems:

● Output not centered at 0

● When input < 0, the ReLU gradient is 0 and never changes, so the neuron stops updating (a "dead" ReLU)
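The same kind of NumPy check for ReLU makes the zero gradient visible:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu = np.maximum(0.0, z)       # computationally cheap: one elementwise max
grad = (z > 0).astype(float)    # subgradient: 1 where z > 0, else 0

print(relu)  # [0.  0.  0.  0.5 2. ] -> outputs are never negative (not centered at 0)
print(grad)  # [0. 0. 0. 1. 1.]      -> gradient is exactly 0 for z <= 0, so a neuron
             #                          that only sees negative inputs stops learning
```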

Page 8:

Stochastic Gradient Descent (SGD)

● Stochastic Gradient Descent (SGD): θ ← θ − α∇J(θ)

○ θ : weights/parameters

○ α : learning rate

○ J : loss function

● SGD update happens after every training example.

● Minibatch SGD (often itself just called SGD) considers a small batch of training examples at once, averages their loss, and updates θ, as sketched below.
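A minimal sketch of the two variants; the random gradients here are stand-ins for ∇J computed by backprop:

```python
import numpy as np

def sgd_step(theta, grad, alpha=0.01):
    """One update: theta <- theta - alpha * grad."""
    return theta - alpha * grad

rng = np.random.default_rng(0)
theta = np.zeros(10)

# Plain SGD: one update per training example.
theta = sgd_step(theta, rng.normal(size=10))

# Minibatch SGD: average the per-example gradients, then update once.
batch_grads = rng.normal(size=(32, 10))   # toy batch of 32 examples
theta = sgd_step(theta, batch_grads.mean(axis=0))
```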

Page 9:

Regularization: Dropout

● Randomly drop neurons during the forward pass at training time.

● At test time, turn dropout off.

● Prevents overfitting by forcing the network to learn redundancies.

Think about dropout as training an ensemble of networks.
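A minimal sketch at the level of a single activation matrix, using the common "inverted dropout" scaling, an implementation detail the slides don't specify, so that nothing needs rescaling at test time:

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True):
    if not train:
        return h                         # at test time, dropout is off
    # Zero each unit with probability p_drop; rescale the survivors so the
    # expected activation matches test time ("inverted dropout").
    mask = (np.random.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

h = np.ones((2, 8))
print(dropout(h, train=True))   # roughly half the units zeroed, the rest scaled by 2
print(dropout(h, train=False))  # unchanged
```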

Page 10:

Training Tips and Tricks

● Learning rate:

○ If the loss curve seems unstable (a jagged line), decrease the learning rate.

○ If the loss curve appears "linear", increase the learning rate.

[Figure: loss vs. epoch curves for very high, high, good, and low learning rates]

Page 11:

Training Tips and Tricks

● Regularization (Dropout, L2 norm, …):

○ If the gap between train and dev accuracies is large (overfitting), increase the regularization constant.

DO NOT test your model on the test set until overfitting is no longer an issue.

Page 12:

Backpropagation and Gradients

Slides courtesy of Barak Oshri

Page 13:

Problem Statement

Given a function f with respect to inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to θ.

Page 14:

Backpropagation

An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients

Page 15:

Backpropagation

1. Identify intermediate functions (forward prop)

2. Compute local gradients

3. Combine with the upstream error signal to get the full gradient
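A tiny worked instance of these three steps. The slide's own example isn't recoverable from this transcript, so f(x, y, z) = (x + y) * z is an illustrative stand-in:

```python
x, y, z = -2.0, 5.0, -4.0

# 1. Identify intermediate functions (forward prop)
q = x + y            # intermediate variable, q = 3
f = q * z            # f = -12

# 2. Compute local gradients
dq_dx, dq_dy = 1.0, 1.0
df_dq, df_dz = z, q

# 3. Combine with the upstream error signal to get the full gradient
upstream = 1.0                         # df/df at the output
df_dx = upstream * df_dq * dq_dx       # -4.0
df_dy = upstream * df_dq * dq_dy       # -4.0
print(df_dx, df_dy, upstream * df_dz)  # -4.0 -4.0 3.0
```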

Page 16:

Modularity - Simple Example

Compound function

Intermediate Variables (forward propagation)

Page 17:

Modularity - Neural Network Example

Compound function

Intermediate Variables (forward propagation)

Page 18:

Intermediate Variables (forward propagation)

Intermediate Gradients (backward propagation)

Page 19:

Chain Rule Behavior

Key chain rule intuition: Slopes multiply
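For example, if z = g(y) and y = h(x), then dz/dx = (dz/dy) · (dy/dx): a local slope of 3 through g times a local slope of 2 through h gives a combined slope of 6.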

Page 20:

Circuit Intuition

Page 21:

Backprop Menu for Success

1. Write down variable graph

2. Compute derivative of cost function

3. Keep track of error signals

4. Enforce shape rule on error signals

5. Use matrix balancing when deriving over a linear transformation

Page 22:


Convolutional Neural Networks

Slides courtesy of Justin Johnson, Serena Yeung, and Fei-Fei Li

Page 23:

Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

input: 3072 x 1; weights W: 10 x 3072; activation Wx: 10 x 1

Page 24:

Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

input: 3072 x 1; weights W: 10 x 3072; activation Wx: 10 x 1

Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)
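A minimal NumPy sketch of this layer; random values stand in for a real image and trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))           # 32x32x3 input
x = image.reshape(3072, 1)                # stretch to 3072 x 1
W = rng.normal(size=(10, 3072)) * 0.01    # 10 x 3072 weights

activation = W @ x                        # 10 x 1 output
# Each of the 10 entries is the dot product of one row of W with the input.
print(activation.shape)                   # (10, 1)
```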

Page 25:

Convolution Layer

32x32x3 image -> preserve spatial structure (width 32, height 32, depth 3)

Page 26:

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"

Page 27:

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"

Filters always extend the full depth of the input volume

Page 28:

Convolution Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)

Page 29:

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations, producing a 28x28x1 activation map

Page 30:

Convolution Layer

32x32x3 image, 5x5x3 filter

Consider a second (green) filter: sliding it over all spatial locations gives a second 28x28x1 activation map

Page 31:

Convolution Layer

For example, if we had 6 5x5 filters, we'll get 6 separate 28x28 activation maps.

We stack these up to get a "new image" of size 28x28x6!
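A naive NumPy sketch of this layer, with explicit loops instead of an optimized convolution routine; the random filters, stride 1, and no padding are assumptions consistent with the sizes on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # 32x32x3 input
filters = rng.normal(size=(6, 5, 5, 3))    # six 5x5x3 filters
biases = np.zeros(6)

H = W = 32
F = 5
out = np.zeros((H - F + 1, W - F + 1, 6))  # 28x28x6 output volume

for k in range(6):                         # one activation map per filter
    for i in range(H - F + 1):
        for j in range(W - F + 1):
            chunk = image[i:i + F, j:j + F, :]   # a 5x5x3 chunk of the image
            # 75-dimensional dot product + bias
            out[i, j, k] = np.sum(chunk * filters[k]) + biases[k]

print(out.shape)  # (28, 28, 6) -- the stacked "new image"
```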

Page 32:

Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions

32x32x3 image -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6

Page 33:

Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions

32x32x3 image -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ….
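The spatial sizes follow the usual no-padding, stride-1 rule, output = N − F + 1: 32 − 5 + 1 = 28 after the first CONV, and 28 − 5 + 1 = 24 after the second.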

Page 34:

RNNs, Language Models, LSTMs, GRUs

Slides courtesy of Lisa Wang and Juhi Naik

Page 35:

RNNs

● Review of RNNs

● RNN Language Models

● Vanishing Gradient Problem

● GRUs

● LSTMs

Page 36:

RNN Review

Key points:

● Weights are shared (tied) across timesteps (Wxh, Whh, Why)

● Hidden state at time t depends on previous hidden state and new input

● Backpropagation across timesteps (use unrolled network)
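A minimal NumPy sketch of one unrolled RNN. The sizes are illustrative; the weight names mirror the slide's Wxh, Whh, Why, and tanh is a common but assumed choice of activation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V = 4, 8, 10                    # illustrative input, hidden, and output sizes
Wxh = rng.normal(size=(D, H)) * 0.1   # shared (tied) across all timesteps
Whh = rng.normal(size=(H, H)) * 0.1
Why = rng.normal(size=(H, V)) * 0.1

def rnn_step(x_t, h_prev):
    # Hidden state at time t depends on the previous hidden state and the new input.
    h_t = np.tanh(x_t @ Wxh + h_prev @ Whh)
    y_t = h_t @ Why                   # a prediction can be made at every timestep
    return h_t, y_t

h = np.zeros(H)
for x_t in rng.normal(size=(5, D)):   # unroll over a length-5 input sequence
    h, y = rnn_step(x_t, h)
```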

Page 37:

RNN Review

RNNs are good for:

● Learning representations for sequential data with temporal relationships

● Predictions can be made at every timestep, or at the end of a sequence

Page 38:

RNN Language Model

● Language Modeling (LM): the task of computing probability distributions over sequences of words

● Plays an important role in speech recognition, text summarization, etc.

● RNN Language Model:

Page 39:

RNN Language Model for Machine Translation

● Encoder for source language

● Decoder for target language

● Different weights in the encoder and decoder sections of the RNN (could see them as two chained RNNs)

Page 40:

Vanishing Gradient Problem

● Backprop in RNNs involves a recursive gradient call for the hidden layer

● Magnitudes of the gradients of typical activation functions lie between 0 and 1

● When terms are less than 1, their product can get small very quickly

● Vanishing gradients → RNNs fail to learn, since parameters barely update

● GRUs and LSTMs to the rescue!
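The "product of small terms" point is easy to see numerically:

```python
# Backprop through T timesteps multiplies roughly T per-step gradient terms.
# If each term's magnitude is below 1 (typical for saturating activations),
# the product shrinks exponentially with T:
per_step = 0.9
for T in (10, 50, 100):
    print(T, per_step ** T)   # 0.35, 0.0052, 2.7e-05 -> the gradient vanishes
```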

Page 41:

Gated Recurrent Units (GRUs)

● Reset gate, rt

● Update gate, zt

● rt and zt control long-term and short-term dependencies (mitigating the vanishing gradient problem)
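A sketch of one GRU step under one common formulation; the gate equations vary slightly across papers, and the weight names here are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H = 4, 8                                            # illustrative sizes
Wz, Uz = rng.normal(size=(D, H)), rng.normal(size=(H, H))
Wr, Ur = rng.normal(size=(D, H)), rng.normal(size=(H, H))
Wh, Uh = rng.normal(size=(D, H)), rng.normal(size=(H, H))

def gru_step(x_t, h_prev):
    z_t = sigmoid(x_t @ Wz + h_prev @ Uz)              # update gate
    r_t = sigmoid(x_t @ Wr + h_prev @ Ur)              # reset gate
    h_tilde = np.tanh(x_t @ Wh + (r_t * h_prev) @ Uh)  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde          # gates blend old and new
```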

Page 42:

Gated Recurrent Units (GRUs)

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 43:

LSTMs

● it: Input gate - How much does the current input matter

● ft: Forget gate - How much does the past matter

● ot: Output gate - How much should the current cell be exposed

● ct: New memory - Memory from the current cell
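A sketch of one LSTM step matching these four quantities, under a common formulation; the weight names and the concatenated-input layout are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H = 4, 8                                      # illustrative sizes
Wi, Wf, Wo, Wc = (rng.normal(size=(D + H, H)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    xh = np.concatenate([x_t, h_prev])
    i_t = sigmoid(xh @ Wi)      # input gate: how much the current input matters
    f_t = sigmoid(xh @ Wf)      # forget gate: how much the past matters
    o_t = sigmoid(xh @ Wo)      # output gate: how much of the cell to expose
    c_tilde = np.tanh(xh @ Wc)  # new memory from the current cell
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c)
```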

Page 44:

LSTMs

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 45:

Acknowledgements

● Slides adapted from:

○ Barak Oshri, Lisa Wang, and Juhi Naik (CS224N, Winter 2017)

○ Justin Johnson, Serena Yeung, and Fei-Fei Li (CS231N, Spring 2018)

● Andrej Karpathy, Research Scientist, OpenAI

● Christopher Olah, Research Scientist, Google Brain