
Page 1:

CS 1699: Deep Learning

Neural Network Basics

Prof. Adriana Kovashka, University of Pittsburgh

January 16, 2020

Page 2:

Plan for this lecture (next few classes)

• Definition
  – Architecture
  – Basic operations
  – Biological inspiration
• Goals
  – Loss functions
• Training
  – Gradient descent
  – Backpropagation
• Tricks of the trade
  – Dealing with sparse data and overfitting

Page 3:

Definition

Page 4:

Neural network definition

• Activations: each unit computes a weighted sum of its inputs, aj = Σi wji xi + wj0, which is then passed through h
• Nonlinear activation function h (e.g. sigmoid, tanh, ReLU): e.g. z = ReLU(a) = max(0, a)

Figure from Christopher Bishop

Page 5:

Neural network definition

• Layer 2
• Layer 3 (final)
• Outputs (binary or multiclass)
• Finally: the full network function (binary case)
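To make the layer-by-layer definition above concrete, here is a minimal NumPy sketch of a two-layer forward pass. The layer sizes, the ReLU hidden nonlinearity, and the sigmoid binary output are illustrative assumptions, not values from the slides.

import numpy as np

def relu(a):
    return np.maximum(0.0, a)            # z = ReLU(a) = max(0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Two-layer network: hidden activations, then a single binary output."""
    a1 = W1 @ x + b1                     # layer 1: linear combination of inputs
    z1 = relu(a1)                        # nonlinear activation function h
    a2 = W2 @ z1 + b2                    # layer 2 (final): linear combination of hidden units
    return sigmoid(a2)                   # binary output in [0, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                   # 4 inputs (illustrative)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)            # 5 hidden units
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)            # 1 output
print(forward(x, W1, b1, W2, b2))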

Page 6:

Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 7:

A multi-layer neural network…

• Is a non-linear classifier
• Can approximate any continuous function to arbitrary accuracy given sufficiently many hidden units

Lana Lazebnik

Page 8:

Inspiration: Neuron cells

• Neurons
  – accept information from multiple inputs
  – transmit information to other neurons

• Multiply inputs by weights along edges

• Apply some function to the set of inputs at each node

• If output of function over threshold, neuron “fires”

Text: HKUST, figures: Andrej Karpathy

Page 9:

Biological analog

A biological neuron An artificial neuron

Jia-bin Huang

Page 10:

Biological analog

Hubel and Wiesel’s architecture vs. a multi-layer neural network

Adapted from Jia-bin Huang

Page 11:

Feed-forward networks

• Cascade neurons together
• Output from one layer is the input to the next
• Each layer has its own set of weights

HKUST

Page 12:

Feed-forward networks

• Predictions are fed forward through the

network to classify

HKUST


Page 18:

Deep neural networks

• Lots of hidden layers
• Depth = power (usually)

Figure from http://neuralnetworksanddeeplearning.com/chap5.html
(Figure annotation, repeated at each layer: “Weights to learn!”)

Page 19:

Goals

Page 20:

How do we train deep networks?

• No closed-form solution for the weights (i.e. we cannot set up a system A*w = b)
• We will iteratively find a set of weights that allows the outputs to match the desired outputs
• We want to minimize a loss function (a function of the weights in the network)
• For now let’s simplify and assume there’s a single layer of weights in the network, and no activation function (i.e. the output is a linear combination of the inputs)

Page 21:

Classification goal

Example dataset: CIFAR-10

10 labels

50,000 training images

each image is 32x32x3

10,000 test images.

Andrej Karpathy

Page 22:

Classification scores

f(x, W): the image x ([32x32x3], an array of numbers 0...1, 3072 numbers total) and the parameters W map to 10 numbers indicating class scores.

Andrej Karpathy

Page 23:

Linear classifier

f(x, W) = Wx + b: x is the image stretched into a 3072x1 column ([32x32x3], array of numbers 0...1), W is 10x3072 (the parameters, or “weights”), b is 10x1, and the output is 10x1 class scores.

Andrej Karpathy
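A minimal NumPy sketch of the score function with exactly these shapes; the random values are placeholders for a real image and learned weights.

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(3072)                      # a 32x32x3 image stretched into a 3072x1 column, values 0...1
W = 0.01 * rng.normal(size=(10, 3072))    # 10x3072 parameters, or "weights"
b = np.zeros(10)                          # 10x1 bias

scores = W @ x + b                        # f(x, W) = Wx + b -> 10 class scores
print(scores.shape)                       # (10,)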

Page 24:

Linear classifier

Example with an image with 4 pixels, and 3 classes (cat/dog/ship)

Andrej Karpathy

Page 25:

Linear classifier

Going forward: Loss function / Optimization

TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

Scores for three training examples (columns) and three classes (rows):

       ex. 1   ex. 2   ex. 3
cat      3.2     1.3     2.2
car      5.1     4.9     2.5
frog    -1.7     2.0    -3.1

Adapted from Andrej Karpathy

Page 26:

Linear classifier

Suppose: 3 training examples, 3 classes. With some W the scores are:

       ex. 1   ex. 2   ex. 3
cat      3.2     1.3     2.2
car      5.1     4.9     2.5
frog    -1.7     2.0    -3.1

Adapted from Andrej Karpathy

Page 27:

Linear classifier: Hinge loss

Suppose: 3 training examples, 3 classes. With some W the scores are as in the table above.

Hinge loss: given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

Want: s_{y_i} >= s_j + 1, i.e. s_j − s_{y_i} + 1 <= 0. If true, the loss is 0; if false, the loss is the magnitude of the violation.

Adapted from Andrej Karpathy

Page 28:

Linear classifier: Hinge loss

Same scores as above. Loss for the first example (cat):

= max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
= max(0, 2.9) + max(0, -3.9)
= 2.9 + 0
= 2.9

Losses: 2.9

Adapted from Andrej Karpathy

Page 29:

Linear classifier: Hinge loss

Same scores as above. Loss for the second example (car):

= max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
= max(0, -2.6) + max(0, -1.9)
= 0 + 0
= 0

Losses: 2.9, 0

Adapted from Andrej Karpathy

Page 30:

Linear classifier: Hinge loss

Same scores as above. Loss for the third example (frog):

= max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
= max(0, 6.3) + max(0, 6.6)
= 6.3 + 6.6
= 12.9

Losses: 2.9, 0, 12.9

Adapted from Andrej Karpathy

Page 31:

Linear classifier: Hinge loss

Same scores as above; per-example losses 2.9, 0, 12.9. The full training loss is the mean over all examples in the training data:

L = (1/N) Σ_i L_i = (2.9 + 0 + 12.9) / 3 = 15.8 / 3 ≈ 5.3

Adapted from Andrej Karpathy
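A short NumPy sketch that reproduces the hinge-loss numbers above; the assumption is that each column of the scores matrix is one training example and that the correct classes are cat, car, frog in that order, as in the slides.

import numpy as np

def hinge_loss(scores, y, margin=1.0):
    """L_i = sum over j != y of max(0, s_j - s_y + margin)."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                      # the correct class does not contribute
    return margins.sum()

scores = np.array([[ 3.2, 1.3,  2.2],     # cat
                   [ 5.1, 4.9,  2.5],     # car
                   [-1.7, 2.0, -3.1]])    # frog
labels = [0, 1, 2]                        # correct class of each example

losses = [hinge_loss(scores[:, i], labels[i]) for i in range(3)]
print(losses)                             # approximately [2.9, 0.0, 12.9]
print(sum(losses) / 3)                    # approximately 5.3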

Page 32:

Linear classifier: Hinge loss

Slide from Fei-Fei, Johnson, Yeung

E.g. Suppose that we found a W such that L = 0. Is this W unique?

No! 2W also has L = 0!

How do we choose between W and 2W?

Page 33:

Weight Regularization

Data loss: model predictions should match the training data.
Regularization: prevent the model from doing too well on the training data.
λ = regularization strength (hyperparameter)

Simple examples:
- L2 regularization: R(W) = Σ_k Σ_l W_{k,l}²
- L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
- Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)

More complex: Dropout, Batch normalization, Stochastic depth, fractional pooling, etc.

Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data

Adapted from Fei-Fei, Johnson, Yeung
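A small sketch of how the regularized objective is assembled, and of the “spread out the weights” preference on the next slide; the two example weight vectors are illustrative, not taken from the transcribed text.

import numpy as np

def regularized_loss(data_loss, W, lam):
    """Full objective: data loss + lam * R(W), with an L2 penalty R(W) = sum of squared weights."""
    return data_loss + lam * np.sum(W * W)

w_peaky  = np.array([1.0, 0.0, 0.0, 0.0])
w_spread = np.array([0.25, 0.25, 0.25, 0.25])
print(regularized_loss(0.0, w_peaky, lam=0.1))    # 0.1
print(regularized_loss(0.0, w_spread, lam=0.1))   # 0.025 -> L2 prefers the spread-out weights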

Page 34:

Weight Regularization

Expressing preferences

L2 Regularization

L2 regularization likes to “spread out” the weights

Slide from Fei-Fei, Johnson, Yeung

Page 35:

Weight Regularization

Preferring simple models

(Figure: data points (x, y) fit by two functions f1 and f2; the simpler fit generalizes better.)

Regularization pushes against fitting the data too well, so we don’t fit noise in the data.

Page 36:

Another loss: Cross-entropy

Want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class:

L_i = −log P(Y = y_i | X = x_i)

The scores (e.g. cat: 3.2, car: 5.1, frog: -1.7) are unnormalized log probabilities of the classes, where

P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j}

Andrej Karpathy

Page 37:

Another loss: Cross-entropy

       unnormalized log probs   → exp →   unnormalized probs   → normalize →   probabilities
cat             3.2                              24.5                               0.13
car             5.1                             164.0                               0.87
frog           -1.7                               0.18                              0.00

Probabilities must be >= 0 and must sum to 1.

L_i = -log(0.13) = 0.89

Aside:
- This is multinomial logistic regression
- Choose weights to maximize the likelihood of the observed x/y data (Maximum Likelihood Estimation; more discussion in CS 1675)

Adapted from Fei-Fei, Johnson, Yeung
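A NumPy sketch of the softmax pipeline above. Note that np.log is the natural log, which gives about 2.04 for -log(0.13); the 0.89 printed on the slide corresponds to a base-10 logarithm.

import numpy as np

def softmax_cross_entropy(scores, y):
    """Cross-entropy loss of one example: L_i = -log softmax(scores)[y]."""
    scores = scores - scores.max()                 # shift for numerical stability (does not change the result)
    probs = np.exp(scores) / np.exp(scores).sum()  # exp, then normalize
    return probs, -np.log(probs[y])

probs, loss = softmax_cross_entropy(np.array([3.2, 5.1, -1.7]), y=0)   # cat, car, frog; correct class: cat
print(probs.round(2))                          # [0.13 0.87 0.  ]
print(loss)                                    # about 2.04 (natural log)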

Page 38:

Another loss: Cross-entropy

Same pipeline as above: logits (unnormalized log-probabilities) 3.2, 5.1, -1.7 → exp → unnormalized probabilities 24.5, 164.0, 0.18 → normalize → probabilities 0.13, 0.87, 0.00. Probabilities must be >= 0 and must sum to 1.

Compare (via the Kullback–Leibler divergence) against the correct probabilities 1.00, 0.00, 0.00.

Adapted from Fei-Fei, Johnson, Yeung

Page 39:

Other losses

• Triplet loss (Schroff et al., FaceNet, CVPR 2015):
  Σ_i max(0, ||f(x_i^a) − f(x_i^p)||² − ||f(x_i^a) − f(x_i^n)||² + α),
  where a denotes the anchor, p the positive, and n the negative example
• Anything you want! (almost)
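A minimal sketch of the triplet loss on precomputed embeddings; the margin value alpha = 0.2 and the embedding size are illustrative assumptions.

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(0, ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + alpha): pull the anchor toward the
    positive and push it away from the negative by at least the margin alpha."""
    d_pos = np.sum((f_a - f_p) ** 2)
    d_neg = np.sum((f_a - f_n) ** 2)
    return max(0.0, d_pos - d_neg + alpha)

rng = np.random.default_rng(0)
f_a, f_p, f_n = rng.normal(size=(3, 128))      # 128-d embeddings (illustrative)
print(triplet_loss(f_a, f_p, f_n))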

Page 40:

Training

Page 41:

To minimize loss, use gradient descent

Andrej Karpathy

Page 42:

How to minimize the loss function?

In 1 dimension, the derivative of a function:

df(x)/dx = lim_{h→0} [f(x + h) − f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives.

Andrej Karpathy

Page 43:

Loss gradients

• Denoted as (different notations): ∇_W L, dL/dW, ∂L/∂W
• i.e. how does the loss change as a function of the weights
• We want to change the weights in such a way that the loss decreases as fast as possible

Page 44:

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25347
gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …]

Andrej Karpathy

Page 45:

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25347
W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25322
gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …]

Andrej Karpathy

Page 46:

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25347
W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25322
gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, …]

(1.25322 - 1.25347)/0.0001 = -2.5

Andrej Karpathy

Page 47:

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25347
W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25353
gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, …]

Andrej Karpathy

Page 48:

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25347
W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25353
gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, …]

(1.25353 - 1.25347)/0.0001 = 0.6

Andrej Karpathy

Page 49:

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25347
W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25347
gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, …]

Andrej Karpathy
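The per-dimension procedure of pages 44–49 as a short NumPy sketch; loss_fn stands for whatever scalar loss is being evaluated, and the toy quadratic usage is an assumption.

import numpy as np

def numerical_gradient(loss_fn, W, h=0.0001):
    """Finite differences: nudge one weight at a time by h and measure the change in the loss."""
    grad = np.zeros_like(W)
    loss_0 = loss_fn(W)
    for i in range(W.size):
        W_plus = W.copy()
        W_plus.flat[i] += h                          # W + h (one dimension)
        grad.flat[i] = (loss_fn(W_plus) - loss_0) / h
    return grad

W = np.array([0.34, -1.11, 0.78])
print(numerical_gradient(lambda w: np.sum(w * w), W))   # about [0.68, -2.22, 1.56], i.e. 2W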

Page 50:

This is silly. The loss is just a function of W: we want ∇_W L.

Use calculus to write down an analytic expression for the gradient:

∇_W L = ...

Andrej Karpathy

Page 51:

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]    loss 1.25347
gradient dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, …]

dW = ... (some function of the data and W)

Andrej Karpathy

Page 52:

Gradient descent

• We’ll update the weights
• Move in the direction opposite to the gradient: w ← w − η ∇_w L, where η is the learning rate

(Figure: the loss L decreasing over time, and a weight space (W_1, W_2) where the original W takes a step in the negative gradient direction.)

Figure from Andrej Karpathy

Page 53:

Gradient descent

• Iteratively subtract the gradient with respect to the model parameters (w)
• I.e. we’re moving in a direction opposite to the gradient of the loss
• I.e. we’re moving towards smaller loss
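A minimal gradient-descent loop, with a toy quadratic loss standing in for the network loss; the learning rate and step count are illustrative.

import numpy as np

def gradient_descent(loss_and_grad, w, lr=0.1, steps=100):
    """Repeatedly step opposite to the gradient: w <- w - lr * dL/dw."""
    for _ in range(steps):
        loss, grad = loss_and_grad(w)
        w = w - lr * grad
    return w

w0 = np.array([0.34, -1.11, 0.78])
print(gradient_descent(lambda w: (np.sum(w * w), 2 * w), w0))   # converges toward the minimum at 0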

Page 54:

Learning rate selection

The effects of step size (or “learning rate”)

Andrej Karpathy

Page 55:

Gradient descent in multi-layer nets

• We’ll update the weights
• Move in the direction opposite to the gradient
• How to update the weights at all layers?
• Answer: backpropagation of error from higher layers to lower layers

Page 56:

Comments on training algorithm

• Not guaranteed to converge to zero training error; may converge to local optima or oscillate indefinitely.
• However, in practice it does converge to low error for many large networks on real data.
• Local minima are not a huge problem in practice for deep networks.
• Thousands of epochs (epoch = the network sees all training data once) may be required; hours or days to train.
• May be hard to set the learning rate and to select the number of hidden units and layers.
• When in doubt, use a validation set to decide on design/hyperparameters.
• Neural networks had fallen out of fashion in the 90s and early 2000s; now they achieve significantly improved performance (deep networks trained with dropout and lots of data).

Adapted from Ray Mooney, Carlos Guestrin, Dhruv Batra

Page 57:

Gradient descent in multi-layer nets

• How to update the weights at all layers?
• Answer: backpropagation of error from higher layers to lower layers

Figure from Andrej Karpathy

Page 58:

Backpropagation: Graphic example

First calculate the error of the output units and use this to change the top layer of weights (“update weights into j”).

(Figure: network with output units k, hidden units j, and input units i; weight layers w(2) and w(1).)

Adapted from Ray Mooney, equations from Chris Bishop

Page 59:

Backpropagation: Graphic example

Next calculate the error for the hidden units based on the errors of the output units they feed into.

(Figure: same network, with output units k, hidden units j, and input units i.)

Adapted from Ray Mooney, equations from Chris Bishop

Page 60:

Backpropagation: Graphic example

Finally update the bottom layer of weights based on the errors calculated for the hidden units (“update weights into i”).

(Figure: same network, with output units k, hidden units j, and input units i.)

Adapted from Ray Mooney, equations from Chris Bishop

Page 61:

Computing gradient for each weight

• We need to move weights in the direction opposite to the gradient of the loss wrt that weight:
  wkj = wkj – η dE/dwkj (output layer)
  wji = wji – η dE/dwji (hidden layer)
• The loss depends on the weights in an indirect way, so we’ll use the chain rule and compute:
  dE/dwkj = dE/dyk dyk/dak dak/dwkj
  dE/dwji = dE/dzj dzj/daj daj/dwji

Page 62:

Gradient for output layer weights

• The loss depends on the weights in an indirect way, so we’ll use the chain rule and compute:
  dE/dwkj = dE/dyk dyk/dak dak/dwkj
• How to compute each of these?
• dE/dyk: need to know the form of the error function
  • Example: if E = (yk – yk’)^2, where yk’ is the ground-truth label, then dE/dyk = 2(yk – yk’)
• dyk/dak: need to know the output layer activation
  • If h(ak) = σ(ak), then dh(ak)/dak = σ(ak)(1 – σ(ak))
• dak/dwkj: this is zj, since ak is a linear combination: ak = wk^T z = Σj wkj zj

Page 63:

Gradient for hidden layer weights

• We’ll use the chain rule again and compute:
  dE/dwji = dE/dzj dzj/daj daj/dwji
• Unlike the previous case (weights for the output layer), the error term dE/dzj is hard to compute (indirect; need the chain rule again)
• We’ll simplify the computation by doing it step by step via backpropagation of error
• You could directly compute this term; you will get the same result as with backprop (do as an exercise!)

Page 64:

Gradients – slightly different notation

• The following is a framework, slightly imprecise
• Let us denote the inputs at a layer i by ini, the linear combination of inputs computed at that layer as rawi, and the activation as acti
• We define a new quantity that will roughly correspond to accumulated error, erri
• Then we can write the updates as: w = w – η * erri * ini
• We can compute the error as: erri = dE/dacti * dacti/drawi

Page 65:

Gradients – slightly different approach

• We’ll write the weight updates as follows:
  ➢ wkj = wkj - η δk zj for output units
  ➢ wji = wji - η δj xi for hidden units
• What are δk, δj?
  • They store the error: the gradient wrt the raw activations (i.e. dE/da)
  • They’re of the form dE/dzj dzj/daj
  • The latter is easy to compute: just use the derivative of the activation function
  • The former is easy for the output, e.g. (yk – yk’)
  • It is harder to compute for hidden layers
  • dE/dzj = ∑k wkj δk (where did this come from?)

Figure from Chris Bishop

Page 66:

Deriving backprop (Bishop Eq. 5.56)

• In a neural network: aj = Σi wji zi, zj = h(aj)
• The gradient is (using the chain rule): ∂En/∂wji = ∂En/∂aj · ∂aj/∂wji
• Denote the “errors” as: δj ≡ ∂En/∂aj
• Also: ∂aj/∂wji = zi, so ∂En/∂wji = δj zi

Equations from Bishop

Page 67:

Deriving backprop (Bishop Eq. 5.56)

• For the output units (identity output, L2 loss): δk = yk – tk
• For the hidden units (using the chain rule again): δj = Σk (∂En/∂ak)(∂ak/∂aj)
• Backprop formula: δj = h′(aj) Σk wkj δk

Equations from Bishop

Page 68:

Putting it all together

• Example: use a sigmoid at the hidden layer and the output layer; the loss is the L2 distance between true and predicted labels

Page 69:

Example algorithm for sigmoid, L2 error

• Initialize all weights to small random values
• Until convergence (e.g. all training examples’ error is small, or the error stops decreasing), repeat:
  • For each (x, y’ = class(x)) in the training set:
    – Calculate the network outputs: yk
    – Compute the errors (gradients wrt activations) for each unit:
      » δk = yk (1-yk) (yk – yk’) for output units
      » δj = zj (1-zj) ∑k wkj δk for hidden units
    – Update the weights:
      » wkj = wkj - η δk zj for output units
      » wji = wji - η δj xi for hidden units

Adapted from R. Hwa, R. Mooney

Recall: wji = wji – η dE/dzj dzj/daj daj/dwji
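A NumPy sketch of one update of this algorithm (sigmoid hidden and output units, L2 error). Bias terms are omitted, matching the simplified update rules above; the toy shapes in the usage lines are assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, y_true, W1, W2, lr=0.1):
    """One example (x, y'): forward pass, errors delta_k / delta_j, then weight updates."""
    z = sigmoid(W1 @ x)                          # hidden activations z_j
    y = sigmoid(W2 @ z)                          # outputs y_k
    delta_k = y * (1 - y) * (y - y_true)         # output units: y_k (1 - y_k)(y_k - y_k')
    delta_j = z * (1 - z) * (W2.T @ delta_k)     # hidden units: z_j (1 - z_j) sum_k w_kj delta_k
    W2 -= lr * np.outer(delta_k, z)              # w_kj <- w_kj - eta * delta_k * z_j
    W1 -= lr * np.outer(delta_j, x)              # w_ji <- w_ji - eta * delta_j * x_i
    return W1, W2

rng = np.random.default_rng(0)
W1, W2 = 0.01 * rng.normal(size=(3, 2)), 0.01 * rng.normal(size=(1, 3))
train_step(np.array([0.5, -0.2]), np.array([1.0]), W1, W2)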

Page 70:

Another example

• Two-layer network with tanh at the hidden layer: zj = tanh(aj)
• Derivative: d tanh(a)/da = 1 – tanh²(a)
• Minimize the sum-of-squares error: E = ½ Σk (yk – tk)²
• Forward propagation: aj = Σi wji xi, zj = tanh(aj), yk = Σj wkj zj

Page 71:

Another example

• Errors at the output (identity function at the output): δk = yk – tk
• Errors at the hidden units: δj = (1 – zj²) Σk wkj δk
• Derivatives wrt the weights: ∂En/∂wji = δj xi, ∂En/∂wkj = δk zj

Page 72:

Same example with graphic and math

(Pages 72–74 repeat the three backprop steps with the equations filled in: first calculate the error of the output units and update the top layer of weights; next calculate the error for the hidden units based on the errors of the output units they feed into; finally update the bottom layer of weights based on the errors calculated for the hidden units.)

Adapted from Ray Mooney, equations from Chris Bishop

Page 75:

Another way of keeping track of error: Computation graphs

• Accumulate upstream/downstream gradients at each node
• One set flows from inputs to outputs and can be computed without evaluating the loss
• The other flows from outputs (loss) to inputs

Page 76:

Generic example

(Figure, shown step by step on pages 76–81: a node f in a computation graph. The forward pass computes activations flowing into and out of f; the backward pass multiplies the gradient arriving from above by f’s “local gradient” to obtain the gradients passed back to its inputs.)

Andrej Karpathy

Page 82:

Another generic example

(Pages 82–93 step backward through a small computation graph with inputs x = -2, y = 5, z = -4. Want: the gradients of the output with respect to x, y, and z. Each step applies the chain rule at one node, multiplying the upstream gradient by the node’s local gradient.)

Andrej Karpathy
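A worked version of this example in plain Python. The function f(x, y, z) = (x + y) * z is an assumption: it is the standard graph used with these input values in the lecture this slide is adapted from.

# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y              # add gate: q = 3
f = q * z              # multiply gate: f = -12

# Backward pass (chain rule, node by node)
df_dq = z              # d(q*z)/dq = z  -> -4
df_dz = q              # d(q*z)/dz = q  ->  3
df_dx = 1.0 * df_dq    # the add gate passes the upstream gradient through -> -4
df_dy = 1.0 * df_dq    #                                                    -> -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0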

Page 94:

Tricks of the trade

Page 95:

Practical matters

• Getting started: preprocessing, initialization, choosing activation functions, normalization
• Improving performance and dealing with sparse data: regularization, augmentation, transfer learning
• Hardware and software

Page 96:

Preprocessing the Data

(Assume X [NxD] is the data matrix, with each example in a row.)

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung
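A minimal sketch of the zero-center / normalize preprocessing this slide refers to; the toy data matrix and the epsilon guard are assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 5))   # [NxD] data matrix, one example per row

X = X - np.mean(X, axis=0)                 # zero-center each dimension
X = X / (np.std(X, axis=0) + 1e-8)         # scale to unit variance (epsilon avoids division by zero)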

Page 97:

Preprocessing the Data

In practice, you may also see PCA (the data then has a diagonal covariance matrix) and whitening (the covariance matrix is the identity matrix) of the data.

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 98:

Weight Initialization

• Q: what happens when W = constant init is used?

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 99:

Weight Initialization

- Another idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
- Works ~okay for small networks, but problems with deeper networks.

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 100:

“Xavier initialization” [Glorot et al., 2010]

Reasonable initialization. (The mathematical derivation assumes linear activations.)

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung
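A sketch of the initialization, under the common reading of “Xavier”: scale Gaussian weights by 1/sqrt(fan_in) so the activation variance stays roughly constant across layers.

import numpy as np

def xavier_init(fan_in, fan_out, rng=None):
    rng = rng or np.random.default_rng(0)
    return rng.normal(size=(fan_out, fan_in)) / np.sqrt(fan_in)   # std ~ 1/sqrt(fan_in)

W1 = xavier_init(fan_in=3072, fan_out=100)   # e.g. a first layer for 32x32x3 inputs
print(W1.std())                              # about 1/sqrt(3072) ~ 0.018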

Page 101:

Activation Functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 102:

Activation Functions

Sigmoid: σ(x) = 1 / (1 + e^{-x})

- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:
1. Saturated neurons “kill” the gradients
   (Consider a sigmoid gate with input x: what happens to the gradient when x = -10? When x = 0? When x = 10?)
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 107:

Activation Functions

tanh(x) [LeCun et al., 1991]
- Squashes numbers to range [-1,1]
- Zero-centered (nice)
- Still kills gradients when saturated :(

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 108:

Activation Functions

ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]
- Computes f(x) = max(0, x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Not zero-centered output
- An annoyance: what is the gradient when x < 0?
  (Consider a ReLU gate with input x: what happens when x = -10? When x = 0? When x = 10?)

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 112:

Activation Functions

Leaky ReLU [Maas et al., 2013] [He et al., 2015]
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Will not “die”

Parametric Rectifier (PReLU): f(x) = max(αx, x), where we backprop into α (a parameter).

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 114:

Activation Functions

Exponential Linear Units (ELU) [Clevert et al., 2015]
- All benefits of ReLU
- Closer to zero-mean outputs
- Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
- Computation requires exp()

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 115:

Maxout “Neuron” [Goodfellow et al., 2013]
- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear regime! Does not saturate! Does not die!

Problem: doubles the number of parameters/neuron :(

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 116:

TLDR: In practice:

- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung
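For reference, the activation functions discussed above as one-line NumPy definitions; the alpha defaults are illustrative.

import numpy as np

def sigmoid(x):                return 1.0 / (1.0 + np.exp(-x))          # squashes to [0, 1]
def tanh(x):                   return np.tanh(x)                        # squashes to [-1, 1], zero-centered
def relu(x):                   return np.maximum(0.0, x)                # max(0, x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)     # small negative slope, does not die
def elu(x, alpha=1.0):         return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))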

Page 117:

Batch Normalization [Ioffe and Szegedy, 2015]

“You want zero-mean unit-variance activations? Just make them so.”

Consider a batch of activations at some layer. To make each dimension zero-mean and unit-variance, apply:

x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(Var[x^(k)])

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 118:

Batch Normalization [Ioffe and Szegedy, 2015]

(Input: a batch of activations of shape N x D.)

1. Compute the empirical mean and variance independently for each dimension.
2. Normalize.

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 119:

Batch Normalization [Ioffe and Szegedy, 2015]

Normalize, and then allow the network to squash the range if it wants to:

y^(k) = γ^(k) x̂^(k) + β^(k)

Note: the network can learn γ^(k) = sqrt(Var[x^(k)]) and β^(k) = E[x^(k)] to recover the identity mapping.

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung
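A training-time sketch of the two steps above for a batch of shape [N, D]; the epsilon inside the square root is a standard implementation detail assumed here.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                        # empirical mean, per dimension
    var = x.var(axis=0)                          # empirical variance, per dimension
    x_hat = (x - mean) / np.sqrt(var + eps)      # zero-mean, unit-variance
    return gamma * x_hat + beta                  # learned squash/shift

x = np.random.default_rng(0).normal(size=(8, 3))
out = batchnorm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per dimension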

Page 120:

Batch Normalization [Ioffe and Szegedy, 2015]

- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 121:

Batch Normalization [Ioffe and Szegedy, 2015]

Note: at test time the BatchNorm layer functions differently. The mean/std are not computed based on the batch; instead, a single fixed empirical mean of the activations from training is used (e.g. it can be estimated during training with running averages).

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 122:

Optimization

(Figure: a loss surface over two weights, W_1 and W_2.)

Next lecture: Problems and better strategies

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 123:

Babysitting the Learning Process

• Preprocess the data
• Choose the architecture
• Initialize and check the initial loss with no regularization
• Increase regularization; the loss should increase
• Then train: try a small portion of the data, check that you can overfit
• Add regularization, and find a learning rate that can make the loss go down
• Check learning rates in the range [1e-3 … 1e-5]
• Coarse-to-fine search for hyperparameters (e.g. learning rate, regularization)

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 124: CS 1699: Deep Learning Neural Network Basicskovashka/cs1699_sp20/dl_02_basics.pdfNeural Network Basics Prof. Adriana Kovashka University of Pittsburgh January 16, 2020. Plan for this

[Figure: training and validation accuracy plotted over training epochs]

• Big gap between training and validation accuracy = overfitting => increase regularization strength?
• No gap => increase model capacity?

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 125: Dealing with sparse data

• Deep neural networks require lots of data, and can overfit easily
• The more weights you need to learn, the more data you need
• That's why with a deeper network, you need more data for training than with a shallower network
• Ways to prevent overfitting include:
  • Using a validation set to stop training or pick parameters
  • Regularization
  • Data augmentation
  • Transfer learning

Page 126: Over-training prevention

• Running too many epochs can result in over-fitting.
• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error (see the sketch below).

[Figure: error vs. number of training epochs, showing the error on training data and on test data]

Adapted from Ray Mooney
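A minimal sketch of the early-stopping recipe described above, assuming hypothetical train_one_epoch and validation_error helpers (these names are placeholders, not from the lecture):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=5):
    """Stop training once validation error stops improving (early stopping)."""
    best_err = float("inf")
    best_model = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # one pass over the training data
        err = validation_error(model)          # error on the hold-out validation set
        if err < best_err:
            best_err = err
            best_model = copy.deepcopy(model)  # snapshot of the best model so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # further epochs only increase validation error
    return best_model
```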

Page 127: Determining best number of hidden units

• Too few hidden units prevents the network from adequately fitting the data.
• Too many hidden units can result in over-fitting.
• Use internal cross-validation to empirically determine an optimal number of hidden units.

[Figure: error vs. number of hidden units, showing the error on training data and on test data]

Ray Mooney

Page 128: Effect of number of neurons

[Figure: decision boundaries on a 2D toy dataset for networks with increasing numbers of hidden neurons]

more neurons = more capacity

Andrej Karpathy

Page 129: Effect of regularization

Do not use the size of the neural network as a regularizer. Use stronger regularization instead:

[Figure: decision boundaries for the same network trained with increasing regularization strength]

(You can play with this demo over at ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

Andrej Karpathy
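For reference, the usual way L2 regularization (weight decay) enters the objective is as an added penalty on the weights, with lambda controlling its strength; this is the standard formulation, not a transcription of the slide:

$$L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i\big(f(x_i; W),\, y_i\big) + \lambda \sum_{k} W_k^{2}$$

Larger $\lambda$ means stronger regularization and smoother decision boundaries.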

Page 130: Regularization

• L1, L2 regularization (weight decay)
• Dropout (see the sketch below)
  • Randomly turn off some neurons
  • Allows individual neurons to independently be responsible for performance

Dropout: A simple way to prevent neural networks from overfitting [Srivastava JMLR 2014]

Adapted from Jia-bin Huang
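A minimal NumPy sketch of dropout in its common "inverted" form, not code from the lecture; here p is the probability of keeping a neuron:

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True):
    """Inverted dropout: randomly zero activations during training.

    Scaling by 1/p at train time means no rescaling is needed at test time.
    """
    if not training:
        return x                                  # at test time, use all neurons
    mask = (np.random.rand(*x.shape) < p) / p     # randomly turn off some neurons
    return x * mask
```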

Page 131: Data Augmentation

[Figure: standard training pipeline: load an image and its label ("cat"), pass the image through the CNN, compute the loss]

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 132: Data Augmentation

[Figure: training pipeline with augmentation: load an image and its label ("cat"), transform the image, pass it through the CNN, compute the loss]

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 133: Data Augmentation

Horizontal Flips

[Figure: a training image and its horizontally flipped copy]

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 134: Data Augmentation

Get creative for your problem!

Random mixes/combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions
- …

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung; Image: https://github.com/aleju/imgaug

Page 135: Data Augmentation

Random crops and scales

Training: sample random crops / scales. ResNet:
1. Pick a random L in the range [256, 480]
2. Resize the training image so its short side = L
3. Sample a random 224 x 224 patch

Testing: average over a fixed set of crops. ResNet:
1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 crops of 224 x 224: 4 corners + center, plus flips

(see the sketch below)

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung
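A hedged NumPy sketch of the training-time recipe above (random resize to a short side in [256, 480], then a random 224 x 224 crop, plus a random horizontal flip); resize_short_side is a hypothetical helper standing in for any image-resizing routine:

```python
import numpy as np

def augment(image, resize_short_side, crop=224, flip_prob=0.5):
    """Random scale + random crop + random horizontal flip (ResNet-style training augmentation)."""
    # 1. Pick a random target size for the short side and resize
    L = np.random.randint(256, 481)
    image = resize_short_side(image, L)           # hypothetical resize helper
    # 2. Sample a random crop x crop patch (image assumed to be H x W x C)
    h, w = image.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    # 3. Random horizontal flip
    if np.random.rand() < flip_prob:
        patch = patch[:, ::-1]
    return patch
```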

Page 136: Transfer learning

• If you have sparse data in your domain of interest (the target), but have rich data in a disjoint yet related domain (the source),
• you can train the early layers on the source domain, and only the last few layers on the target domain:

[Figure: network diagram: set the early layers to the already-learned weights from another network, and learn the last layers on your own task]

Page 137: Transfer learning

1. Train on the source domain (a large dataset).

2. Small target dataset: freeze the pretrained layers and train only the last layer.

3. Medium target dataset: finetuning. More data = retrain more of the network (or all of it); freeze fewer layers and train the rest.

Another option: use the network as a feature extractor, and train an SVM / logistic regression on the extracted features for the target task.

Source: e.g. classification of animals. Target: e.g. classification of cars.

Adapted from Andrej Karpathy
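A hedged PyTorch sketch of the small-dataset case above (freeze the pretrained layers and train only a new final layer); the choice of torchvision's ResNet-18 and the 10 output classes are illustrative assumptions, not part of the lecture:

```python
import torch
import torch.nn as nn
import torchvision

# Load a network pretrained on the source domain (e.g. ImageNet)
model = torchvision.models.resnet18(pretrained=True)

# Freeze the already-learned weights
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a fresh one for the target task (e.g. 10 car classes)
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new layer's parameters are trained
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```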

Page 138: Training: Best practices

• Center (subtract the mean from) your data
• To initialize weights, use "Xavier initialization" (see the sketch below)
• Use ReLU or leaky ReLU or ELU; don't use sigmoid
• Use mini-batches
• Use data augmentation
• Use regularization
• Use batch normalization
• Use cross-validation for your hyperparameters
• Learning rate: is it too high? Too low?
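A minimal NumPy sketch of "Xavier initialization" in its common fan-in form (weights scaled so that activation variance stays roughly constant across layers); the layer sizes are illustrative:

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Scale Gaussian weights by 1/sqrt(fan_in) so activations keep roughly unit variance."""
    return np.random.randn(n_in, n_out) / np.sqrt(n_in)

# Example: a 784-256-10 network
W1 = xavier_init(784, 256)
W2 = xavier_init(256, 10)
```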

Page 139: Mini-batch gradient descent

• In classic gradient descent, we compute the gradient from the loss over all training examples
• We could also use only some of the data for each gradient update
• We cycle through all the training examples multiple times
• Each time we've cycled through all of them once is called an 'epoch'
• Allows faster training (e.g. on GPUs) and parallelization (see the sketch below)
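A minimal sketch of the mini-batch loop described above; compute_gradient is a hypothetical function returning the gradient of the loss on one mini-batch:

```python
import numpy as np

def minibatch_sgd(W, X, y, compute_gradient, lr=1e-3,
                  batch_size=64, num_epochs=10):
    """Stochastic (mini-batch) gradient descent over the training set."""
    N = X.shape[0]
    for epoch in range(num_epochs):               # one epoch = one pass over all examples
        perm = np.random.permutation(N)           # shuffle the examples each epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]  # indices of the current mini-batch
            grad = compute_gradient(W, X[idx], y[idx])
            W = W - lr * grad                     # gradient step using only this mini-batch
    return W
```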

Page 140: Spot the CPU! (central processing unit)

[Figure: photo of a computer's internals; find the CPU]

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 141: Spot the GPUs! (graphics processing unit)

[Figure: photo of a machine containing multiple GPUs]

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 142: CPU vs GPU

Cores / clock speed / memory / price / approximate speed:

• CPU (Intel Core i7-7700k): 4 cores (8 threads with hyperthreading), 4.2 GHz, system RAM, $385, ~540 GFLOPs FP32
• GPU (NVIDIA RTX 2080 Ti): 3584 cores, 1.6 GHz, 11 GB GDDR6, $1199, ~13.4 TFLOPs FP32
• TPU (NVIDIA TITAN V): 5120 CUDA cores + 640 Tensor cores, 1.5 GHz, 12 GB HBM2, $2999, ~14 TFLOPs FP32 and ~112 TFLOPs FP16
• TPU (Google Cloud TPU): ? cores, ? clock, 64 GB HBM, $4.50 per hour, ~180 TFLOPs

CPU: fewer cores, but each core is much faster and much more capable; great at sequential tasks

GPU: more cores, but each core is much slower and "dumber"; great for parallel tasks

TPU: specialized hardware for deep learning

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 143: CPU vs GPU in practice

[Figure: benchmark bar chart showing GPU speedups over CPU of roughly 66x, 67x, 71x, 64x, and 76x across several CNN architectures]

(CPU performance not well-optimized, so the comparison is a little unfair)

Data from https://github.com/jcjohnson/cnn-benchmarks

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 144: CPU / GPU Communication

[Figure: the model lives on the GPU, while the data lives on disk / in system RAM]

If you aren't careful, training can bottleneck on reading data and transferring it to the GPU!

Solutions:
- Read all the data into RAM
- Use an SSD instead of an HDD
- Use multiple CPU threads to prefetch data (see the sketch below)

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung
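As a concrete example of the prefetching point above, here is a hedged PyTorch sketch; the random tensor dataset is a stand-in for real data on disk, while DataLoader's num_workers and pin_memory arguments are the standard way to overlap data loading with GPU work:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in practice this would read images from disk
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

loader = DataLoader(dataset,
                    batch_size=64,
                    shuffle=True,
                    num_workers=4,    # worker processes prefetch batches in the background
                    pin_memory=True)  # page-locked memory speeds up host-to-GPU copies

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for images, labels in loader:
    # non_blocking=True lets the copy overlap with GPU compute
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward pass, loss, backward pass, optimizer step ...
```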

Page 145: Software: A zoo of frameworks!

Caffe (UC Berkeley)
Torch (NYU / Facebook)
Theano (U Montreal)
TensorFlow (Google)
Caffe2 (Facebook)
PyTorch (Facebook)
CNTK (Microsoft)
PaddlePaddle (Baidu)
MXNet (Amazon)
Chainer
JAX (Google)
And others...

Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung

Page 146: Summary

• Feed-forward network architecture
• Training deep neural nets
  • We need an objective function that measures and guides us towards good performance
  • We need a way to minimize the loss function: (stochastic, mini-batch) gradient descent
  • We need backpropagation to propagate the error towards all layers and change the weights at those layers
• Practices for preventing overfitting and training with little data