CS 4803 / 7643: Deep Learning
Zsolt Kira, Georgia Tech
Topics:
– Linear Classifiers
– Loss Functions
Administrivia
• Notes and readings on class webpage
– https://www.cc.gatech.edu/classes/AY2020/cs7643_spring
• PS0 due today!
– Make sure to follow instructions!
(C) Dhruv Batra & Zsolt Kira 2
Office Hours
• Mondays 11:30am - 12:30pm (Patrick)
• Mondays 3pm - 4pm (Manas)
• Tuesdays 11:30am - 12:30pm (Sameer)
• Tuesdays 4pm - 5pm (Jiachen)
• Wednesdays 11:00am - 12:00pm (Yihao)
• Wednesdays 2:30pm - 3:30pm (Anishi)
• Thursdays 11:30am - 12:30pm (Rahul)
• Thursdays 3:00pm - 4:00pm (Harish)
• Thursdays 4pm - 5pm (Yinquan)
• Fridays 11:30am - 12:30pm (Zhuoran)
(C) Dhruv Batra & Zsolt Kira 3
What is the collaboration policy?
• Collaboration
– Only on HWs and project (not allowed in HW0)
– You may discuss the questions
– Each student writes their own answers & code
– List on your homework anyone with whom you collaborate
• Zero tolerance on plagiarism
– Neither ethical nor in your best interest
– Always credit your sources
– Don’t cheat. We will find out.
(C) Dhruv Batra & Zsolt Kira 4
Recap from last time
(C) Dhruv Batra & Zsolt Kira 5
Image Classification: A core task in Computer Vision
cat
(assume given set of discrete labels) {dog, cat, truck, plane, ...}
This image by Nikita is licensed under CC-BY 2.0
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
An image classifier
7
Unlike e.g. sorting a list of numbers, there is no obvious way to hard-code the algorithm for recognizing a cat or other classes.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Supervised Learning
• Input: x (images, text, emails…)
• Output: y (spam or non-spam…)
• (Unknown) Target Function
– f: X → Y (the “true” mapping / reality)
• Data
– { (x1,y1), (x2,y2), …, (xN,yN) }
• Model / Hypothesis Class
– H = {h: X → Y}
– e.g. y = h(x) = sign(wᵀx)
• Loss Function
– How good is a model w.r.t. my data D?
• Learning = Search in hypothesis space
– Find best h in model class.
(C) Dhruv Batra & Zsolt Kira 8
Error Decomposition
(C) Dhruv Batra & Zsolt Kira 9
[Figure: error decomposition illustrated with AlexNet: Input → 11x11 conv, 96 → Pool → 5x5 conv, 256 → Pool → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256 → Pool → FC 4096 → FC 4096 → FC 1000 → Softmax, compared against “Reality”]
First classifier: Nearest Neighbor
10
Memorize all data and labels
Predict the label of the most similar training image
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Nearest Neighbours
Instance/Memory-based Learning
Four things make a memory-based learner:
• A distance metric
• How many nearby neighbors to look at?
• A weighting function (optional)
• How to fit with the local points?
(C) Dhruv Batra & Zsolt Kira 12Slide Credit: Carlos Guestrin
Hyperparameters
13
Your Dataset
Idea #4: Cross-Validation: Split data into folds, try each fold as validation and average the results
[Figure: the dataset split into fold 1 – fold 5 plus a held-out test set; each fold takes a turn as the validation set]
Useful for small datasets, but not used too frequently in deep learning
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Problems with Instance-Based Learning
• Expensive
– No learning: most real work done during testing
– For every test sample, must search through all of the dataset: very slow!
– Must use tricks like approximate nearest neighbour search
• Doesn’t work well when there are a large number of irrelevant features
– Distances overwhelmed by noisy features
• Curse of Dimensionality
– Distances become meaningless in high dimensions
– (See proof in next lecture)
(C) Dhruv Batra & Zsolt Kira 14
This image is CC0 1.0 public domain
Neural Network
Linear classifiers
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Parametric Approach: Linear Classifier
Image: an array of 32x32x3 numbers (3072 numbers total) → f(x,W) → 10 numbers giving class scores, where W holds the parameters or weights.
f(x,W) = Wx + b, with x: 3072x1, W: 10x3072, b: 10x1, and output: 10x1
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
18
Example with an image with 4 pixels and 3 classes (cat/dog/ship).
Stretch pixels into a column: x = [56, 231, 24, 2]ᵀ

W:
 0.2  -0.5   0.1   2.0
 1.5   1.3   2.1   0.0
 0.0   0.25  0.2  -0.3

b = [1.1, 3.2, -1.2]ᵀ

f(x,W) = Wx + b = [-96.8, 437.9, 60.75]ᵀ (cat score, dog score, ship score)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
(C) Dhruv Batra & Zsolt Kira 19Image Credit: Andrej Karpathy, CS231n
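To make the Wx + b arithmetic concrete, here is a minimal numpy sketch of the 4-pixel example above (the array values are taken from the slide; nothing else is assumed):

import numpy as np

# one row of W per class (cat, dog, ship), one column per pixel
W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])            # one bias per class
x = np.array([56.0, 231.0, 24.0, 2.0])    # image stretched into a column

scores = W.dot(x) + b                     # f(x, W) = Wx + b
print(scores)                             # [-96.8, 437.9, 60.75]: cat, dog, ship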
Linear Classifier: Three Viewpoints
20
Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up space
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
So far: Defined a (linear) score function
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)
Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Plan for Today
• Linear Classifiers
– Linear scoring functions
• Loss Functions
– Multi-class hinge loss
– Softmax cross-entropy loss
(C) Dhruv Batra & Zsolt Kira 22
Suppose: 3 training examples, 3 classes. With some W the scores f(x,W) = Wx are:

        cat image   car image   frog image
cat        3.2         1.3         2.2
car        5.1         4.9         2.5
frog      -1.7         2.0        -3.1
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
A loss function tells how good our current classifier is.
Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is an (integer) label, the loss over the dataset is the average of the per-example losses:
L = (1/N) Σ_i L_i(f(x_i, W), y_i)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Multiclass SVM loss:
Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:
L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
This is the “hinge loss”.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
For the cat image (correct class cat, s_cat = 3.2):
L_cat = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
      = max(0, 2.9) + max(0, -3.9)
      = 2.9 + 0
      = 2.9
Losses: 2.9
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
For the car image (correct class car, s_car = 4.9):
L_car = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
      = max(0, -2.6) + max(0, -1.9)
      = 0 + 0
      = 0
Losses: 2.9, 0
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
For the frog image (correct class frog, s_frog = -3.1):
L_frog = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
       = max(0, 6.3) + max(0, 6.6)
       = 6.3 + 6.6
       = 12.9
Losses: 2.9, 0, 12.9
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Loss over the full dataset is the average:
L = (2.9 + 0 + 12.9) / 3 = 5.27
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Questions about the multiclass SVM loss:
Q: What happens to the loss if the car image’s scores change a bit?
Q2: What is the min/max possible loss?
Q3: At initialization W is small, so all s ≈ 0. What is the loss?
Q4: What if the sum was over all classes (including j = y_i)?
Q5: What if we used a mean instead of a sum?
Q6: What if we used the square: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)²?
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
E.g. suppose that we found a W such that L = 0. Q7: Is this W unique?
No! 2W also has L = 0!
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Check with the car image:
Before: L = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0
With W twice as large: L = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1) = max(0, -6.2) + max(0, -4.8) = 0 + 0 = 0
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Multiclass SVM Loss: Example code
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
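The code itself was an image and did not survive extraction; a minimal numpy sketch of the per-example multiclass SVM loss in the spirit of the slide (the half-vectorized form; the function name is illustrative):

import numpy as np

def L_i_vectorized(x, y, W):
    # x: column of pixel values, y: integer label, W: weight matrix
    scores = W.dot(x)                                # class scores s = Wx
    margins = np.maximum(0, scores - scores[y] + 1)  # hinge on every class
    margins[y] = 0                                   # correct class contributes nothing
    return np.sum(margins)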
46
Softmax Classifier (Multinomial Logistic Regression)
We want to interpret the raw classifier scores as probabilities (e.g. the cat image’s scores: cat 3.2, car 5.1, frog -1.7).
Softmax function: P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j), where s = f(x_i, W)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Unnormalized log-probabilities / logits: [3.2, 5.1, -1.7]
→ exp → unnormalized probabilities: [24.5, 164.0, 0.18] (probabilities must be >= 0)
→ normalize → probabilities: [0.13, 0.87, 0.00] (probabilities must sum to 1)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
In summary: L_i = -log P(Y = y_i | X = x_i)
Maximizing the log-probability of the correct class = maximizing the log likelihood = minimizing the negative log likelihood.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
For the cat image: L_i = -log(0.13) = 2.04
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
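A minimal numpy sketch reproducing the numbers above (an illustration of the standard softmax formula, not code from the slide):

import numpy as np

scores = np.array([3.2, 5.1, -1.7])        # cat, car, frog scores for the cat image
unnormalized = np.exp(scores)              # [24.5, 164.0, 0.18]
probs = unnormalized / unnormalized.sum()  # [0.13, 0.87, 0.00]
loss = -np.log(probs[0])                   # correct class is cat: L_i = 2.04
# in practice, subtract scores.max() before exp for numerical stability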
Log-Likelihood / KL-Divergence / Cross-Entropy
(C) Dhruv Batra & Zsolt Kira 54
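The equations on these slides were images and did not survive extraction; for reference, the standard relationship tying the three quantities together (an assumption about what the slides showed, consistent with the surrounding discussion):

H(p, q) = -\sum_k p_k \log q_k = H(p) + D_{KL}(p \| q)

For a one-hot target p (all mass on the correct class y_i), H(p) = 0 and H(p, q) = -\log q_{y_i}: minimizing the cross-entropy is the same as minimizing the KL divergence to the target distribution, which is the same as maximizing the log-likelihood of the correct class.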
Softmax Classifier, putting it all together: maximize the probability of the correct class, i.e. minimize
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Q: What is the min/max possible loss L_i?
A: min 0, max infinity
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Q2: At initialization all s will be approximately equal; what is the loss?
A: log(C), e.g. log(10) ≈ 2.3
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Softmax vs. SVM
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Recap
- We have some dataset of (x, y)
- We have a score function: s = f(x; W) = Wx
- We have a loss function, e.g.
  Softmax: L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
  SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
  Full loss: L = (1/N) Σ_i L_i
How do we find the best W?
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Regularization
L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λR(W)
Data loss: Model predictions should match training data.
Regularization: Prevent the model from doing too well on training data.
λ = regularization strength (hyperparameter)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Regularization Intuition in Polynomial Regression
[Figure: data points in the (x, y) plane fit by polynomials f of increasing degree]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
(C) Dhruv Batra & Zsolt Kira 75
Error Decomposition
(C) Dhruv Batra & Zsolt Kira 77
[Figure: error decomposition for multi-class logistic regression (Input → FC over the HxWx3 input → Softmax), compared against “Reality”]
Regularization: L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λR(W), with λ = regularization strength (hyperparameter)
Simple examples:
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}²
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)
More complex: Dropout, Batch normalization, Stochastic depth, fractional pooling, etc.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
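A quick numpy illustration of the three penalties (W is a hypothetical weight matrix; beta is the elastic-net mixing hyperparameter, an illustrative value):

import numpy as np

W = np.random.randn(10, 3072)   # hypothetical weight matrix
beta = 0.5                      # elastic-net mixing hyperparameter (illustrative)

l2 = np.sum(W ** 2)                          # L2: sum of squared weights
l1 = np.sum(np.abs(W))                       # L1: sum of absolute weights
elastic = np.sum(beta * W ** 2 + np.abs(W))  # elastic net combines both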
Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Regularization: Expressing Preferences
84
L2 Regularization
L2 regularization likes to “spread out” the weights
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
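The worked example on this slide was an image; the usual illustration (included here as an assumption about the slide’s content) is:

x = [1, 1, 1, 1], w1 = [1, 0, 0, 0], w2 = [0.25, 0.25, 0.25, 0.25]

Both weight vectors give the same score, w1ᵀx = w2ᵀx = 1, but L2 regularization prefers w2: R(w1) = 1 while R(w2) = 4 × 0.25² = 0.25. The penalty is lower for the weight vector that spreads its influence across all input dimensions.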
Recap: we have a dataset of (x, y), a score function, and a loss function (Softmax, SVM, full loss with regularization). How do we find the best W?
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Next: Neural Networks
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
So far: Linear Classifiers
88
f(x) = Wx → class scores
Hard cases for a linear classifier:
1. Class 1: first and third quadrants. Class 2: second and fourth quadrants.
2. Class 1: 1 <= L2 norm <= 2. Class 2: everything else.
3. Class 1: three modes. Class 2: everything else.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Aside: Image Features
90
Image → Feature Representation → f(x) = Wx → class scores
Image Features: Motivation
[Figure: red and blue points in the (x, y) plane]
Cannot separate red and blue points with a linear classifier.
f(x, y) = (r(x, y), θ(x, y))
[Figure: the same points mapped from (x, y) coordinates to polar coordinates (r, θ)]
After applying the feature transform, the points can be separated by a linear classifier.
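A minimal numpy sketch of this feature transform (an illustration, not the slide’s code):

import numpy as np

def polar_features(x, y):
    # map Cartesian coordinates to (radius, angle)
    r = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    return r, theta

# points on an inner ring vs. an outer ring are not linearly separable
# in (x, y), but become separable by a simple threshold on r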
Example: Color Histogram
93
Example: Histogram of Oriented Gradients (HoG)
94
Divide the image into 8x8 pixel regions. Within each region, quantize the edge direction into 9 bins.
Example: 320x240 image gets divided into 40x30 bins; in each bin there are 9 numbers so feature vector has 30*40*9 = 10,800 numbers
Lowe, “Object recognition from local scale-invariant features”, ICCV 1999Dalal and Triggs, "Histograms of oriented gradients for human detection," CVPR 2005
Feature Extraction
Image features vs Neural Nets
96
[Figure: hand-designed feature extraction followed by a trainable classifier f producing 10 class scores, vs. a network where feature extractor and classifier are trained end-to-end to produce the 10 class scores]
Krizhevsky, Sutskever, and Hinton, “Imagenet classification with deep convolutional neural networks”, NIPS 2012.Figure copyright Krizhevsky, Sutskever, and Hinton, 2012. Reproduced with permission.
Error Decomposition
(C) Dhruv Batra & Zsolt Kira 97
[Figure: error decomposition for multi-class logistic regression (Input → FC over the HxWx3 input → Softmax), compared against “Reality”]
Neural networks: without the brain stuff
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))
[Figure: x (3072) → W1 → h (100) → W2 → s (10)]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
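A minimal numpy sketch of the 2-layer score function (shapes follow the diagram above; the random weights are placeholders):

import numpy as np

x = np.random.randn(3072)        # input image stretched into a column
W1 = np.random.randn(100, 3072)  # first-layer weights
W2 = np.random.randn(10, 100)    # second-layer weights

h = np.maximum(0, W1.dot(x))     # hidden layer: ReLU non-linearity
s = W2.dot(h)                    # 10 class scores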
Multilayer Networks
• Cascaded “neurons”
• The output from one layer is the input to the next
• Each layer has its own set of weights
(C) Dhruv Batra & Zsolt Kira 103Image Credit: Andrej Karpathy, CS231n
Neural networks: Architectures
“Fully-connected” layers
“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
105
Full implementation of training a 2-layer Neural Network needs ~20 lines:
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
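The ~20 lines themselves were an image and did not survive extraction; a sketch in the same spirit (a 2-layer network with sigmoid hidden units, L2 loss, and plain gradient descent on random toy data; all sizes and the learning rate are illustrative assumptions):

import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10    # batch, input, hidden, output sizes
x, y = randn(N, D_in), randn(N, D_out)   # random toy data
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    h = 1 / (1 + np.exp(-x.dot(w1)))     # forward: sigmoid hidden layer
    y_pred = h.dot(w2)                   # forward: linear output layer
    loss = np.square(y_pred - y).sum()   # L2 loss

    grad_y_pred = 2.0 * (y_pred - y)     # backward pass, layer by layer
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))  # through the sigmoid

    w1 -= 1e-4 * grad_w1                 # gradient descent update
    w2 -= 1e-4 * grad_w2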
Summary
• ML:
– Data
– Model (w/ parameters)
– Loss function (w/ regularization)
• For a classifier that outputs a probability distribution, use cross-entropy loss
• We will take simple linear classifiers, add non-linearity, and compose them arbitrarily into as many layers as we want
– Not restricted to any particular function (linear, activation function); it can be anything differentiable
• Key question: How do we find good weights/parameters for such a complex model?
107
This image by Fotis Bobolas is licensed under CC-BY 2.0
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
[Figure: a biological neuron: impulses are carried toward the cell body by dendrites and carried away from the cell body along the axon to the presynaptic terminal; shown alongside an artificial neuron computing a sigmoid activation function]
This image by Felipe Perucho is licensed under CC-BY 3.0
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Be very careful with your brain analogies!
Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system
● Rate code may not be adequate
[Dendritic Computation. London and Hausser]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
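The plots and formulas on this slide were images; as a reference, minimal numpy definitions of the elementwise activations listed above (the alpha defaults are conventional choices, assumed here):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))               # squashes to (0, 1)

def relu(x):
    return np.maximum(0, x)                   # max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)      # small slope for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# tanh is np.tanh; Maxout takes the max over k linear functions,
# e.g. max(w1.dot(x) + b1, w2.dot(x) + b2), so it has no fixed formula in x alone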
Activation Functions
• sigmoid vs tanh
(C) Dhruv Batra & Zsolt Kira 113
A quick note
(C) Dhruv Batra & Zsolt Kira 114Image Credit: LeCun et al. ‘98
Rectified Linear Units (ReLU)
(C) Dhruv Batra & Zsolt Kira 115
[Krizhevsky et al., NIPS12]
Demo Time
• https://playground.tensorflow.org
117
Example feed-forward computation of a neural network
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
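The code on this slide was an image; a minimal sketch of a feed-forward pass through a 3-layer fully-connected network with sigmoid activations (the layer sizes are illustrative assumptions):

import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigmoid activation
x = np.random.randn(3, 1)                # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(W1.dot(x) + b1)    # first hidden layer (4x1)
h2 = f(W2.dot(h1) + b2)   # second hidden layer (4x1)
out = W3.dot(h2) + b3     # output neuron (1x1)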