CS 4803 / 7643: Deep Learning
Zsolt Kira, Georgia Tech
Topics: Linear Classifiers, Loss Functions


Page 1:

CS 4803 / 7643: Deep Learning

Zsolt Kira, Georgia Tech

Topics:
– Linear Classifiers
– Loss Functions

Page 2:

Administrativia
• Notes and readings on class webpage

– https://www.cc.gatech.edu/classes/AY2020/cs7643_spring

• PS0 due today!
– Make sure to follow instructions!

(C) Dhruv Batra & Zsolt Kira 2

Page 3:

Office Hours
• Mondays 11:30am - 12:30pm (Patrick)
• Mondays 3pm - 4pm (Manas)
• Tuesdays 11:30am - 12:30pm (Sameer)
• Tuesdays 4pm - 5pm (Jiachen)
• Wednesdays 11:00am - 12:00pm (Yihao)
• Wednesdays 2:30pm - 3:30pm (Anishi)
• Thursdays 11:30am - 12:30pm (Rahul)
• Thursdays 3:00pm - 4:00pm (Harish)
• Thursdays 4pm - 5pm (Yinquan)
• Fridays 11:30am - 12:30pm (Zhuoran)

(C) Dhruv Batra & Zsolt Kira 3

Page 4:

What is the collaboration policy?
• Collaboration
– Only on HWs and project (not allowed in HW0).
– You may discuss the questions
– Each student writes their own answers & code
– Write on your homework anyone with whom you collaborate

• Zero tolerance on plagiarism
– Neither ethical nor in your best interest
– Always credit your sources
– Don't cheat. We will find out.

(C) Dhruv Batra & Zsolt Kira 4

Page 5:

Recap from last time

(C) Dhruv Batra & Zsolt Kira 5

Page 6:

Image Classification: A core task in Computer Vision

cat

(assume given set of discrete labels)
{dog, cat, truck, plane, ...}

This image by Nikita is licensed under CC-BY 2.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 7:

An image classifier

7

Unlike e.g. sorting a list of numbers, there is no obvious way to hard-code the algorithm for recognizing a cat, or other classes.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 8:

Supervised Learning
• Input: x (images, text, emails…)
• Output: y (spam or non-spam…)

• (Unknown) Target Function
– f: X → Y (the "true" mapping / reality)

• Data
– {(x1, y1), (x2, y2), …, (xN, yN)}

• Model / Hypothesis Class
– H = {h: X → Y}
– e.g. y = h(x) = sign(w^T x)

• Loss Function
– How good is a model w.r.t. my data D?

• Learning = Search in hypothesis space
– Find best h in model class.

(C) Dhruv Batra & Zsolt Kira 8

Page 9:

Error Decomposition

(C) Dhruv Batra & Zsolt Kira 9

Reality

[Figure: AlexNet architecture: Input → 11x11 conv, 96 → Pool → 5x5 conv, 256 → Pool → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256 → Pool → FC 4096 → FC 4096 → FC 1000 → Softmax]

Page 10:

First classifier: Nearest Neighbor

10

Memorize all data and labels

Predict the label of the most similar training image

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
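A minimal sketch of this classifier in numpy (illustrative; the class name and the choice of L1 distance are assumptions in the spirit of the CS231n example):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # Memorize all data and labels: X is N x D, y holds the N labels
        self.Xtr, self.ytr = X, y

    def predict(self, X):
        # Predict the label of the most similar (here: L1-closest) training image
        y_pred = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            y_pred[i] = self.ytr[np.argmin(distances)]
        return y_pred
```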

Page 11:

Nearest Neighbours

Page 12:

Instance/Memory-based Learning
Four things make a memory-based learner:
• A distance metric

• How many nearby neighbors to look at?

• A weighting function (optional)

• How to fit with the local points?

(C) Dhruv Batra & Zsolt Kira 12
Slide Credit: Carlos Guestrin

Page 13:

Hyperparameters

13

Your Dataset

Idea #4: Cross-Validation: Split data into folds, try each fold as validation and average the results

[Figure: dataset split into fold 1 / fold 2 / fold 3 / fold 4 / fold 5 / test; each fold takes a turn as the validation set]

Useful for small datasets, but not used too frequently in deep learning

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
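A minimal cross-validation sketch in numpy (not from the slides; assumes a classifier with the train/predict interface of the NearestNeighbor sketch above):

```python
import numpy as np

def cross_validate(make_model, X, y, num_folds=5):
    # Split the data into folds; each fold takes a turn as the validation set
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    accs = []
    for i in range(num_folds):
        X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
        y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
        model = make_model()
        model.train(X_tr, y_tr)
        accs.append(np.mean(model.predict(X_folds[i]) == y_folds[i]))
    return np.mean(accs)  # average the results across folds
```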

Page 14:

Problems with Instance-Based Learning
• Expensive
– No learning: most real work done during testing
– For every test sample, must search through the whole dataset: very slow!
– Must use tricks like approximate nearest neighbour search

• Doesn't work well with a large number of irrelevant features
– Distances overwhelmed by noisy features

• Curse of Dimensionality
– Distances become meaningless in high dimensions
– (See proof in next lecture)

(C) Dhruv Batra & Zsolt Kira 14

Page 15:

This image is CC0 1.0 public domain

Neural Network

Linear classifiers

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 16:

Parametric Approach: Linear Classifier

Image (array of 32x32x3 numbers, 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights

f(x,W) = Wx

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 17:

Parametric Approach: Linear Classifier

Image (array of 32x32x3 numbers, 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights

f(x,W) = Wx + b
(10x1)   (10x3072)(3072x1)   (10x1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
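To make the shapes concrete, a small numpy sketch (sizes from the slide; the random values are placeholders, not trained weights):

```python
import numpy as np

# 10 classes, 3072 = 32*32*3 pixel values
x = np.random.randn(3072, 1)   # image stretched into a column vector
W = np.random.randn(10, 3072)  # one row of weights per class
b = np.random.randn(10, 1)     # one bias per class

scores = W @ x + b             # f(x,W) = Wx + b
print(scores.shape)            # (10, 1): one score per class
```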

Page 18:

18

Example with an image with 4 pixels, and 3 classes (cat/dog/ship)

Input image (2x2 pixels): 56, 231 / 24, 2. Stretch pixels into column: x = [56, 231, 24, 2]^T

W = [ 0.2  -0.5   0.1   2.0 ]      b = [  1.1 ]
    [ 1.5   1.3   2.1   0.0 ]          [  3.2 ]
    [ 0.0   0.25  0.2  -0.3 ]          [ -1.2 ]

s = Wx + b = [ -96.8 ]  Cat score
             [ 437.9 ]  Dog score
             [ 61.95 ]  Ship score

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 19:

(C) Dhruv Batra & Zsolt Kira 19
Image Credit: Andrej Karpathy, CS231n

Page 20:

Linear Classifier: Three Viewpoints

20

Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: One template per class
Geometric Viewpoint: Hyperplanes cutting up space

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 21:

So far: Defined a (linear) score function

TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 22:

Plan for Today
• Linear Classifiers
– Linear scoring functions
• Loss Functions
– Multi-class hinge loss
– Softmax cross-entropy loss

(C) Dhruv Batra & Zsolt Kira 22

Page 23:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 24:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

A loss function tells how good our current classifier is.

Given a dataset of examples {(x_i, y_i)}_{i=1}^N, where x_i is an image and y_i is its (integer) label, the loss over the dataset is a sum of loss over examples:

L = (1/N) Σ_i L_i(f(x_i, W), y_i)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 25:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 26:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 27:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 28:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

“Hinge loss”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 29:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

“Hinge loss”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 30:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 31:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Loss for the cat image:
L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
    = max(0, 2.9) + max(0, -3.9)
    = 2.9 + 0
    = 2.9

Losses: 2.9

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 32:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Loss for the car image:
L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0
    = 0

Losses: 2.9, 0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 33:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Loss for the frog image:
L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6
    = 12.9

Losses: 2.9, 0, 12.9

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 34:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Loss over the full dataset is the average:

L = (1/N) Σ_i L_i = (2.9 + 0 + 12.9)/3 = 5.27

Losses: 2.9, 0, 12.9

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 35:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Losses: 2.9, 0, 12.9

Q: What happens to the loss if the car image's scores change a bit?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 36:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Losses: 2.9, 0, 12.9

Q2: What is the min/max possible loss?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 37:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Losses: 2.9, 0, 12.9

Q3: At initialization W is small so all s ≈ 0. What is the loss?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 38:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Losses: 2.9, 0, 12.9

Q4: What if the sum was over all classes (including j = y_i)?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 39:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Losses: 2.9, 0, 12.9

Q5: What if we used mean instead of sum?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 40:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Losses: 2.9, 0, 12.9

Q6: What if we used the square, L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)², instead?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 41:

E.g. Suppose that we found a W such that L = 0. Q7: Is this W unique?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 42:

E.g. Suppose that we found a W such that L = 0. Q7: Is this W unique?

No! 2W also has L = 0!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 43:

Suppose: 3 training examples, 3 classes. With some W the scores are:

          cat image   car image   frog image
cat:         3.2          1.3          2.2
car:         5.1          4.9          2.5
frog:       -1.7          2.0         -3.1

Before:
L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0
    = 0

With W twice as large:
L_2 = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1)
    = max(0, -6.2) + max(0, -4.8)
    = 0 + 0
    = 0

Losses: 2.9, 0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 44:

Multiclass SVM Loss: Example code

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
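The example code on this slide was an image in the original; a numpy sketch of the per-example loss in the CS231n half-vectorized style:

```python
import numpy as np

def L_i_vectorized(x, y, W):
    # Multiclass SVM loss for one example:
    # x is a column of pixel data, y is the integer label, W is the weight matrix
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0  # the correct class contributes no margin
    return np.sum(margins)
```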

Page 45:

46

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 46:

47

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Softmax function

Page 47:

48

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

exp → unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 48:

49

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

exp → unnormalized probabilities: 24.5, 164.0, 0.18 (must be >= 0)
normalize → probabilities: 0.13, 0.87, 0.00 (must sum to 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 49:

50

Softmax Classifier (Multinomial Logistic Regression)

Scores (unnormalized log-probabilities / logits): cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

exp → unnormalized probabilities: 24.5, 164.0, 0.18 (must be >= 0)
normalize → probabilities: 0.13, 0.87, 0.00 (must sum to 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 50:

51

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

In summary:

L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

Maximize log-prob of the correct class = Maximize the log likelihood = Minimize the negative log likelihood

Page 51:

52

Softmax Classifier (Multinomial Logistic Regression)

Scores (unnormalized log-probabilities / logits): cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

exp → unnormalized probabilities: 24.5, 164.0, 0.18 (must be >= 0)
normalize → probabilities: 0.13, 0.87, 0.00 (must sum to 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 52:

53

Softmax Classifier (Multinomial Logistic Regression)

Scores (unnormalized log-probabilities / logits): cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

exp → unnormalized probabilities: 24.5, 164.0, 0.18 (must be >= 0)
normalize → probabilities: 0.13, 0.87, 0.00 (must sum to 1)

L_i = -log(0.13) = 2.04

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
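A numpy sketch of this computation (the max-shift is a standard numerical-stability trick, an addition not shown on the slide):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    shifted = scores - np.max(scores)             # stabilize; probabilities unchanged
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y])                      # -log P(correct class)

scores = np.array([3.2, 5.1, -1.7])               # cat, car, frog
print(softmax_cross_entropy(scores, y=0))         # ~2.04 for the cat class
```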

Page 53:

Log-Likelihood / KL-Divergence / Cross-Entropy

(C) Dhruv Batra & Zsolt Kira 54

Page 54:

Log-Likelihood / KL-Divergence / Cross-Entropy

(C) Dhruv Batra & Zsolt Kira 55

Page 55:

Log-Likelihood / KL-Divergence / Cross-Entropy

(C) Dhruv Batra & Zsolt Kira 56

Page 56:

60

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

Putting it all together (maximize the probability of the correct class):

L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 57:

61

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

Putting it all together (maximize the probability of the correct class):

L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

Q: What is the min/max possible loss L_i?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 58:

62

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

Putting it all together (maximize the probability of the correct class):

L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

Q: What is the min/max possible loss L_i?
A: min 0, max infinity

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 59:

63

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

Putting it all together (maximize the probability of the correct class):

L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

Q2: At initialization all s will be approximately equal; what is the loss?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 60:

64

Softmax Classifier (Multinomial Logistic Regression)

Scores: cat 3.2, car 5.1, frog -1.7

Want to interpret raw classifier scores as probabilities: Softmax Function

Putting it all together (maximize the probability of the correct class):

L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

Q2: At initialization all s will be approximately equal; what is the loss?
A: log(C), e.g. log(10) ≈ 2.3

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 61:

65

Softmax vs. SVM

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 62:

Softmax vs. SVM

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 63:

Recap

- We have some dataset of (x, y)
- We have a score function: s = f(x; W) = Wx
- We have a loss function, e.g.:

Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) Σ_i L_i

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 64:

Recap

- We have some dataset of (x, y)
- We have a score function: s = f(x; W) = Wx
- We have a loss function, e.g.:

Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) Σ_i L_i

How do we find the best W?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 65:

Regularization

70

Data loss: Model predictions should match training data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 66:

Regularization

71

Data loss: Model predictions should match training data

Regularization: Prevent the model from doing too well on training data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 67:

Regularization

72

Data loss: Model predictions should match training data
Regularization: Prevent the model from doing too well on training data

L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)

λ = regularization strength (hyperparameter)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 68:

Regularization Intuition in Polynomial Regression

73

[Figure: training data points in the (x, y) plane]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 69:

Polynomial Regression

74

[Figure: a polynomial fit f to the data points in the (x, y) plane]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 70:

Polynomial Regression

(C) Dhruv Batra & Zsolt Kira 75

Page 71:

Polynomial Regression

(C) Dhruv Batra & Zsolt Kira 76

Page 72:

Error Decomposition

(C) Dhruv Batra & Zsolt Kira 77

Reality

[Figure: Multi-class Logistic Regression as a network: Input → FC HxWx3 → Softmax]

Page 73:

Regularization

79

Data loss: Model predictions should match training data
Regularization: Prevent the model from doing too well on training data

L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)

λ = regularization strength (hyperparameter)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 74:

Regularization

80

Data loss: Model predictions should match training data
Regularization: Prevent the model from doing too well on training data

L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)

λ = regularization strength (hyperparameter)

Simple examples:
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}²
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
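A minimal sketch of the full loss with L2 regularization (function names and interface are assumptions):

```python
import numpy as np

def l2_regularization(W):
    # R(W) = sum of all squared entries of W
    return np.sum(W * W)

def full_loss(W, X, y, lam, per_example_loss):
    # L = (1/N) sum_i L_i + lambda * R(W)
    data_loss = np.mean([per_example_loss(x_i, y_i, W) for x_i, y_i in zip(X, y)])
    return data_loss + lam * l2_regularization(W)
```

With per_example_loss = L_i_vectorized from the SVM example above, this is the regularized multiclass SVM objective.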

Page 75:

Regularization

81

Data loss: Model predictions should match training data
Regularization: Prevent the model from doing too well on training data

L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)

λ = regularization strength (hyperparameter)

Simple examples:
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}²
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)

More complex: Dropout, Batch normalization, Stochastic depth, fractional pooling, etc.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 76:

Regularization

82

Data loss: Model predictions should match training data

Regularization: Prevent the model from doing too well on training data

λ = regularization strength (hyperparameter)

Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 77:

Regularization: Expressing Preferences

84

L2 Regularization

L2 regularization likes to "spread out" the weights: for x = [1, 1, 1, 1], the weights w1 = [1, 0, 0, 0] and w2 = [0.25, 0.25, 0.25, 0.25] produce the same score w·x = 1, but L2 regularization prefers the spread-out w2.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 78:

Recap

- We have some dataset of (x, y)
- We have a score function: s = f(x; W) = Wx
- We have a loss function, e.g.:

Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) Σ_i L_i

How do we find the best W?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 79:

Next: Neural Networks

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 80:

So far: Linear Classifiers

88

f(x) = Wx → class scores

Page 81:

Hard cases for a linear classifier

89

Class 1: First and third quadrants

Class 2: Second and fourth quadrants

Class 1: 1 <= L2 norm <= 2

Class 2: Everything else

Class 1: Three modes

Class 2: Everything else

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 82:

Aside: Image Features

90

Feature Representation → f(x) = Wx → class scores

Page 83:

Image Features: Motivation

91

[Figure: red and blue points in the (x, y) plane]

Cannot separate red and blue points with linear classifier

Page 84:

Image Features: Motivation

92

f(x, y) = (r(x, y), θ(x, y))

[Figure: feature transform from Cartesian coordinates (x, y) to polar coordinates (r, θ)]

Cannot separate red and blue points with linear classifier

After applying feature transform, points can be separated by linear classifier
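A sketch of this transform in numpy (the function name is an assumption):

```python
import numpy as np

def polar_features(points):
    # f(x, y) = (r(x, y), theta(x, y)): map Cartesian points to polar coordinates
    x, y = points[:, 0], points[:, 1]
    r = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    return np.stack([r, theta], axis=1)  # N x 2 array of (r, theta) features
```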

Page 85:

Example: Color Histogram

93


Page 86:

Example: Histogram of Oriented Gradients (HoG)

94

Divide image into 8x8 pixel regions. Within each region, quantize edge direction into 9 bins.

Example: 320x240 image gets divided into 40x30 bins; in each bin there are 9 numbers so feature vector has 30*40*9 = 10,800 numbers

Lowe, "Object recognition from local scale-invariant features", ICCV 1999
Dalal and Triggs, "Histograms of oriented gradients for human detection," CVPR 2005

Page 87:

Image features vs Neural Nets

[Figure: two pipelines. Top: Feature Extraction → f → 10 numbers giving scores for classes, with training applied only to f. Bottom: a single end-to-end network trained jointly → 10 numbers giving scores for classes.]

Krizhevsky, Sutskever, and Hinton, “Imagenet classification with deep convolutional neural networks”, NIPS 2012.Figure copyright Krizhevsky, Sutskever, and Hinton, 2012. Reproduced with permission.

Page 88:

Error Decomposition

(C) Dhruv Batra & Zsolt Kira 97

Reality

[Figure: Multi-class Logistic Regression as a network: Input → FC HxWx3 → Softmax]

Page 89:

(Before) Linear score function: f = Wx

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Neural networks: without the brain stuff

Page 90:

99

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W_2 max(0, W_1 x)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Neural networks: without the brain stuff

Page 91:

100

x (3072) → W_1 → h (100) → W_2 → s (10)

Neural networks: without the brain stuff

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W_2 max(0, W_1 x)

Page 92:

101

Neural networks: without the brain stuff

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

x (3072) → W_1 → h (100) → W_2 → s (10)

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W_2 max(0, W_1 x)
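A numpy sketch of the forward pass with these sizes (random weights as placeholders):

```python
import numpy as np

# Sizes from the slide: 3072-dim input, 100 hidden units, 10 class scores
x = np.random.randn(3072)
W1 = np.random.randn(100, 3072)
W2 = np.random.randn(10, 100)

h = np.maximum(0, W1 @ x)  # hidden layer: elementwise max(0, W1 x)
s = W2 @ h                 # class scores: f = W2 max(0, W1 x)
```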

Page 93:

102

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W_2 max(0, W_1 x)
or 3-layer Neural Network: f = W_3 max(0, W_2 max(0, W_1 x))

Neural networks: without the brain stuff

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 94:

Multilayer Networks
• Cascaded "neurons"
• The output from one layer is the input to the next
• Each layer has its own set of weights

(C) Dhruv Batra & Zsolt Kira 103
Image Credit: Andrej Karpathy, CS231n

Page 95:

"Fully-connected" layers

"2-layer Neural Net", or "1-hidden-layer Neural Net"
"3-layer Neural Net", or "2-hidden-layer Neural Net"

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Neural networks: Architectures

Page 96:

105

Full implementation of training a 2-layer Neural Network needs ~20 lines:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
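The ~20 lines on the slide were an image in the original; a sketch in the same spirit (sizes, sigmoid nonlinearity, squared-error loss, and learning rate are assumptions), with the backward pass coded by hand:

```python
import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    h = 1 / (1 + np.exp(-x.dot(w1)))         # forward: sigmoid hidden layer
    y_pred = h.dot(w2)                       # forward: linear output layer
    loss = np.square(y_pred - y).sum()       # squared-error loss

    grad_y_pred = 2.0 * (y_pred - y)         # backward: dloss/dy_pred
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid'(z) = h(1 - h)

    w1 -= 1e-4 * grad_w1                     # gradient descent update
    w2 -= 1e-4 * grad_w2
```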

Page 97:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Summary
• ML:
– Data
– Model (w/ parameters)
– Loss function (w/ regularization)

• For a classifier that outputs a probability distribution, use cross-entropy loss

• We will take simple linear classifiers, add non-linearity, and compose them arbitrarily into as many layers as we want
– Not restricted to any particular function (linear, activation function). Can be anything differentiable.

• Key question: How do we find good weights/parameters for such a complex model?

Page 98:

107

This image by Fotis Bobolas is licensed under CC-BY 2.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 99:

108

[Figure: biological neuron: dendrites carry impulses toward the cell body; the axon carries impulses away from the cell body to the presynaptic terminal. This image by Felipe Perucho is licensed under CC-BY 3.0]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 100:

109

[Figure: biological neuron: dendrites carry impulses toward the cell body; the axon carries impulses away from the cell body to the presynaptic terminal. This image by Felipe Perucho is licensed under CC-BY 3.0]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 101:

110

sigmoid activation function: σ(x) = 1 / (1 + e^{-x})

[Figure: the same neuron diagram, now with the sigmoid activation function applied to the cell body's output. Image by Felipe Perucho, CC-BY 3.0]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 102:

Be very careful with your brain analogies!

Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system
● Rate code may not be adequate

[Dendritic Computation. London and Hausser]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 103:

Activation functions

Sigmoid: σ(x) = 1 / (1 + e^{-x})
tanh: tanh(x)
ReLU: max(0, x)
Leaky ReLU: max(0.1x, x)
Maxout: max(w_1^T x + b_1, w_2^T x + b_2)
ELU: x if x > 0, α(e^x - 1) if x ≤ 0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 104:

Activation Functions
• sigmoid vs tanh

(C) Dhruv Batra & Zsolt Kira 113

Page 105:

A quick note

(C) Dhruv Batra & Zsolt Kira 114
Image Credit: LeCun et al. '98

Page 106:

Rectified Linear Units (ReLU)

(C) Dhruv Batra & Zsolt Kira 115

[Krizhevsky et al., NIPS12]

Page 107:

Demo Time
• https://playground.tensorflow.org

Page 108:

117

Example feed-forward computation of a neural network

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
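The computation on this slide was shown as an image; a numpy sketch of a 3-layer feed-forward pass in the CS231n style (layer sizes assumed):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigmoid activation function
x = np.random.randn(3, 1)                # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(W1 @ x + b1)    # first hidden layer activations (4x1)
h2 = f(W2 @ h1 + b2)   # second hidden layer activations (4x1)
out = W3 @ h2 + b3     # output neuron (1x1)
```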