
Apprentissage, réseaux de neurones et modèles graphiques (RCP209)

Neural Networks and Deep Learning

Nicolas Thome ([email protected])

http://cedric.cnam.fr/vertigo/Cours/ml2/

Département Informatique, Conservatoire National des Arts et Métiers (Cnam)


8 Weeks on Deep Learning and Structured Prediction

• Weeks 1-5: Deep Learning
• Weeks 6-8: Structured Prediction and Applications


Outline

1 Context

2 Neural Networks

3 Training Deep Neural Networks


Context

Big Data
• Superabundance of data: images, videos, audio, text, usage traces, etc.

  BBC: 2.4M videos; Facebook: 350B images, 1B each day; 100M monitoring cameras

• Obvious need to access, search, or classify these data: Recognition
• Huge number of applications: mobile visual search, robotics, autonomous driving, augmented reality, medical imaging, etc.

• Leading track in major ML/CV conferences during the last decade


Recognition and classification

• Classification: assign a given input to one of a set of pre-defined classes
• Recognition is much more general than classification, e.g.
  Object localization in images
  Ranking for document indexing
  Sequence prediction for text, speech, audio, etc.
• Many tasks can be cast as classification problems ⇒ importance of classification


Focus on Visual Recognition: Perceiving Visual World

• Visual Recognition: archetype of low-level signal understanding
• Supposed to be a master class problem in the early 80's
• Certainly the topic most impacted by deep learning

• Scene categorization
• Object localization
• Context & attribute recognition
• Rough 3D layout, depth ordering
• Rich description of scene, e.g. sentences


Recognition of low-level signals

Challenge: filling the semantic gap

What we perceive vs what a computer sees

• Illumination variations
• View-point variations
• Deformable objects
• Intra-class variance
• etc.

⇒ How to design "good" intermediate representations?


Deep Learning (DL) & Recognition of low-level signals

• DL: breakthrough for the recognition of low-level signal data
• Before DL: handcrafted intermediate representations for each task
  ⊖ Needs expertise (PhD level) in each field
  ⊖ Weak level of semantics in the representation


Deep Learning (DL) & Recognition of low-level signals

• DL: breakthrough for the recognition of low-level signal data
• Since DL: automatically learning intermediate representations
  ⊕ Outstanding experimental performances >> handcrafted features
  ⊕ Able to learn high-level intermediate representations
  ⊕ Common learning methodology ⇒ field independent, no expertise


Deep Learning (DL) & Representation Learning

• DL: breakthrough for representation learning
  Automatically learning intermediate levels of representation
• Ex: Natural Language Processing (NLP)


Outline

1 Context

2 Neural Networks

3 Training Deep Neural Networks


The Formal Neuron: 1943 [MP43]

• Basis of Neural Networks
• Input: vector x ∈ R^m, i.e. x = {x_i}, i ∈ {1, 2, ..., m}
• Neuron output y ∈ R: scalar


The Formal Neuron: 1943 [MP43]

• Mapping from x to y:
  1. Linear (affine) mapping: s = w⊺x + b
  2. Non-linear activation function f: y = f(s)
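These two steps can be written directly in code. A minimal sketch in NumPy (not from the slides; the input, weights and bias below are arbitrary illustrative values), using the step activation introduced on the next slides:

```python
import numpy as np

def formal_neuron(x, w, b, f):
    """Formal neuron: affine mapping s = w.x + b followed by activation y = f(s)."""
    s = np.dot(w, x) + b       # linear (affine) mapping
    return f(s)                # non-linear activation

# Heaviside step activation: H(z) = 1 if z >= 0, 0 otherwise
heaviside = lambda z: np.where(z >= 0, 1.0, 0.0)

x = np.array([0.5, -1.0, 2.0])     # input vector, m = 3
w = np.array([1.0, 0.3, -0.2])     # arbitrary weights (illustration only)
b = -0.1                           # bias
print(formal_neuron(x, w, b, heaviside))   # s = -0.3 -> 0.0
```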


The Formal Neuron: Linear Mapping

• Linear (affine) mapping: s = w⊺x + b = Σ_{i=1}^{m} w_i x_i + b
  w: normal vector to a hyperplane in R^m ⇒ linear boundary
  b: bias, shifts the hyperplane position

In 2D the hyperplane is a line; in 3D it is a plane.


The Formal Neuron: Activation Function

• y = f(w⊺x + b), with f the activation function
  Popular choices for f: step, sigmoid, tanh

• Step (Heaviside) function: H(z) = 1 if z ≥ 0, 0 otherwise


Step function: Connection to Biological Neurons

• Formal neuron with step activation H: y = H(w⊺x + b)
  y = 1 (activated) ⇔ w⊺x ≥ −b
  y = 0 (not activated) ⇔ w⊺x < −b

• Biological neurons: output activated ⇔ inputs weighted by synaptic weights ≥ threshold


Sigmoid Activation Function

• Neuron output y = f(w⊺x + b), with f the activation function
• Sigmoid: σ(z) = (1 + e^(−az))^(−1)

• Larger a: closer to the step function (step = limit a → ∞)
• Sigmoid: linear and saturating regimes
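A quick sketch (illustrative, not from the slides) of how the slope parameter a changes the sigmoid shape:

```python
import numpy as np

def sigmoid(z, a=1.0):
    """Sigmoid activation with slope parameter a: (1 + exp(-a z))^(-1)."""
    return 1.0 / (1.0 + np.exp(-a * z))

z = np.linspace(-5, 5, 11)
for a in (10.0, 1.0, 0.1):
    # large a -> nearly a step function; small a -> nearly linear on this range
    print(f"a = {a:>4}: ", np.round(sigmoid(z, a), 3))
```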


The Formal neuron: Application to Binary Classification

• Binary classification: label input x as belonging to class 1 or class 0
• Neuron output with sigmoid: y = 1 / (1 + e^(−a(w⊺x+b)))

• Sigmoid: probabilistic interpretation ⇒ y ∼ P(1|x)
  Input x classified as 1 if P(1|x) > 0.5 ⇔ w⊺x + b > 0
  Input x classified as 0 if P(1|x) < 0.5 ⇔ w⊺x + b < 0
  ⇒ sign(w⊺x + b): linear decision boundary in input space!


The Formal neuron: Toy Example for Binary Classification

• 2D example: m = 2, x = {x1, x2} ∈ [−5;5] × [−5;5]
• Linear mapping: w = [1;1] and b = −2
• Result of the linear mapping: s = w⊺x + b


• Sigmoid activation function: y = (1 + e^(−a(w⊺x+b)))^(−1), shown in the slides for a = 10, a = 1, and a = 0.1
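As a rough illustration of those figures (the code itself is not part of the slides), the toy neuron can be evaluated on a grid of inputs to see how a controls the sharpness of the transition, while the 0.5 decision boundary x1 + x2 = 2 stays the same:

```python
import numpy as np

def neuron_output(x1, x2, w=(1.0, 1.0), b=-2.0, a=1.0):
    """Toy neuron from the slides: sigmoid of a * (w.x + b), with w = [1;1], b = -2."""
    s = w[0] * x1 + w[1] * x2 + b
    return 1.0 / (1.0 + np.exp(-a * s))

# Evaluate on a coarse grid over [-5, 5] x [-5, 5]
xs = np.linspace(-5, 5, 5)
x1, x2 = np.meshgrid(xs, xs)
for a in (10.0, 1.0, 0.1):
    y = neuron_output(x1, x2, a=a)
    # The 0.5 level set (decision boundary) is the line x1 + x2 = 2 for every a;
    # only the sharpness of the transition changes with a.
    print(f"a = {a:>4}, y range on the grid: [{y.min():.3f}, {y.max():.3f}]")
```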


From Formal Neuron to Neural Networks

• Formal neuron:
  1. A single scalar output
  2. Linear decision boundary for binary classification
• Single scalar output: limited for several tasks
  Ex: multi-class classification, e.g. MNIST or CIFAR


Perceptron and Multi-Class Classification

• Formal neuron: limited to binary classification
• Multi-class classification: use several output neurons instead of a single one! ⇒ Perceptron

• Input x in R^m
• Output neuron y1 is a formal neuron:
  Linear (affine) mapping: s1 = w1⊺x + b1
  Non-linear activation function f: y1 = f(s1)
• Linear mapping parameters:
  w1 = {w11, ..., wm1} ∈ R^m
  b1 ∈ R


Perceptron and Multi-Class Classification

• Input x in R^m
• Output neuron yk is a formal neuron:
  Linear (affine) mapping: sk = wk⊺x + bk
  Non-linear activation function f: yk = f(sk)
• Linear mapping parameters:
  wk = {w1k, ..., wmk} ∈ R^m
  bk ∈ R


Perceptron and Multi-Class Classification

• Input x in R^m (1 × m), output y: concatenation of K formal neurons
• Linear (affine) mapping ∼ matrix multiplication: s = xW + b
  W: matrix of size m × K, whose columns are the wk
  b: bias vector of size 1 × K

• Element-wise non-linear activation: y = f(s)


Perceptron and Multi-Class Classification

• Soft-max activation:

  yk = f(sk) = e^(sk) / Σ_{k'=1}^{K} e^(sk')

• Probabilistic interpretation for multi-class classification:
  Each output neuron ⇔ a class
  yk ∼ P(k|x, w)

⇒ Logistic Regression (LR) model!
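A minimal sketch of this forward pass in NumPy (dimensions and parameter values are arbitrary illustrations, not from the slides):

```python
import numpy as np

def softmax(s):
    """Soft-max over the last axis, with max-subtraction for numerical stability."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def perceptron_forward(x, W, b):
    """Perceptron (logistic regression) forward pass: s = xW + b, y = softmax(s)."""
    s = x @ W + b              # affine mapping, shape (1, K)
    return softmax(s)          # class probabilities P(k|x, W)

m, K = 4, 3                            # input dimension, number of classes
rng = np.random.default_rng(0)
x = rng.normal(size=(1, m))            # one input example (1 x m)
W = rng.normal(size=(m, K))            # weight matrix, columns are w_k
b = np.zeros((1, K))                   # bias vector (1 x K)

y = perceptron_forward(x, W, b)
print(y, y.sum())                      # probabilities summing to 1
```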


2d Toy Example for Multi-Class Classification

• x = {x1, x2} ∈ [−5;5] × [−5;5], y: 3 outputs (classes)
  Linear mapping for each class: sk = wk⊺x + bk

  w1 = [1;1], b1 = −2; w2 = [0;−1], b2 = 1; w3 = [1;−0.5], b3 = 10

  Soft-max output: P(k|x, W)


2d Toy Example for Multi-Class Classification

• x = {x1, x2} ∈ [−5;5] × [−5;5], y: 3 outputs (classes)
  Soft-max output: P(k|x, W)

  w1 = [1;1], b1 = −2; w2 = [0;−1], b2 = 1; w3 = [1;−0.5], b3 = 10

  Class prediction: k* = arg max_k P(k|x, W)


Beyond Linear Classification

X-OR Problem

• Logistic Regression (LR): NN with 1 input layer & 1 output layer
• LR: limited to linear decision boundaries
• X-OR: (NOT x1 AND x2) OR (NOT x2 AND x1)

X-OR: non-linear decision function


Beyond Linear Classification

• LR: limited to linear boundaries
• Solution: add a layer!

• Input x in R^m, e.g. m = 4
• Output y in R^K (K = number of classes), e.g. K = 2
• Hidden layer h in R^L


Multi-Layer Perceptron

• Hidden layer h: projection of x to a new space R^L
• Neural net with ≥ 1 hidden layer: Multi-Layer Perceptron (MLP)
• h: intermediate representation of x for the classification y: h = f(xW + b)
• Mapping from x to y: non-linear boundary! ⇒ the activation f is crucial!
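A small sketch (not from the slides) of a one-hidden-layer MLP forward pass; the hand-picked weights below are one possible illustrative setting that separates the X-OR cases mentioned earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, Wh, bh, Wy, by):
    """One-hidden-layer MLP: h = f(x Wh + bh), y = f(h Wy + by)."""
    h = sigmoid(x @ Wh + bh)   # hidden representation of x
    y = sigmoid(h @ Wy + by)   # output (here a single sigmoid unit)
    return y

# Hand-picked weights (illustration only): hidden unit 1 ~ OR, hidden unit 2 ~ AND,
# output ~ OR AND (NOT AND), i.e. X-OR.
Wh = np.array([[20.0, 20.0],
               [20.0, 20.0]])
bh = np.array([-10.0, -30.0])
Wy = np.array([[20.0], [-20.0]])
by = np.array([-10.0])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(np.round(mlp_forward(X, Wh, bh, Wy, by), 3))   # ~ [0, 1, 1, 0]
```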


Deep Neural Networks

• Adding more hidden layers: Deep Neural Networks (DNN) ⇒ basis of Deep Learning
• Each layer h_l projects layer h_(l−1) into a new space
• Gradually learning intermediate representations useful for the task


Conclusion

• Deep Neural Networks: applicable to classification problems with non-linear decision boundaries
• Visualize predictions from fixed model parameters
• Reverse problem: Supervised Learning


Outline

1 Context

2 Neural Networks

3 Training Deep Neural Networks


Training Multi-Layer Perceptron (MLP)

• Input x, output y
• A parametrized model x ⇒ y: f_w(x_i) = y_i

• Supervised context:
  Training set A = {(x_i, y*_i)}, i ∈ {1, 2, ..., N}
  A loss function ℓ(y_i, y*_i) for each annotated pair (x_i, y*_i)

• Assumptions: parameters w ∈ R^d continuous, loss L differentiable
• Gradient ∇_w = ∂L/∂w: steepest direction to decrease the loss L


MLP Training

• Gradient descent algorithm:
  Initialize parameters w
  Update: w^(t+1) = w^(t) − η ∂L/∂w
  Until convergence, e.g. ||∇_w||² ≈ 0
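A generic sketch of this loop, assuming some loss_and_grad function returning L(w) and ∂L/∂w is available (the function name and the toy quadratic loss are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(loss_and_grad, w0, eta=0.1, tol=1e-6, max_iter=10_000):
    """Plain gradient descent: w <- w - eta * dL/dw until the gradient norm is ~ 0."""
    w = np.array(w0, dtype=float)
    for t in range(max_iter):
        loss, grad = loss_and_grad(w)
        if np.linalg.norm(grad) ** 2 <= tol:   # convergence test ||grad||^2 ~ 0
            break
        w = w - eta * grad                     # update rule
    return w

# Toy convex example: L(w) = ||w - 3||^2, minimum at w = [3, 3]
quadratic = lambda w: (np.sum((w - 3.0) ** 2), 2.0 * (w - 3.0))
print(gradient_descent(quadratic, w0=[0.0, 0.0]))   # ~ [3., 3.]
```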


Gradient Descent

Update rule: w^(t+1) = w^(t) − η ∂L/∂w, with η the learning rate

• Convergence ensured? ⇒ provided a "well-chosen" learning rate η


Gradient Descent

Update rule: w^(t+1) = w^(t) − η ∂L/∂w

• Global minimum? ⇒ convex a) vs non-convex b) loss L(w)

a) Convex function  b) Non-convex function


Supervised Learning: Multi-Class Classification

• Logistic Regression for multi-class classification
• s_i = x_i W + b
• Soft-Max (SM): y_k ∼ P(k|x_i, W, b) = e^(s_k) / Σ_{k'=1}^{K} e^(s_k')

• Supervised loss function: L(W, b) = (1/N) Σ_{i=1}^{N} ℓ(y_i, y*_i)
  1. y ∈ {1; 2; ...; K}
  2. y_i = arg max_k P(k|x_i, W, b)
  3. ℓ_{0/1}(y_i, y*_i) = 1 if y_i ≠ y*_i, 0 otherwise: the 0/1 loss


Logistic Regression Training Formulation

• Input x_i, ground truth output supervision y*_i
• One-hot encoding for y*_i:

  y*_{c,i} = 1 if c is the ground truth class for x_i, 0 otherwise


Logistic Regression Training Formulation

• Loss function: multi-class Cross-Entropy (CE) ℓ_CE

• ℓ_CE: Kullback-Leibler divergence between y*_i and y_i

  ℓ_CE(y_i, y*_i) = KL(y*_i, y_i) = − Σ_{c=1}^{K} y*_{c,i} log(y_{c,i}) = −log(y_{c*,i})

• Warning: KL is asymmetric: KL(y_i, y*_i) ≠ KL(y*_i, y_i)
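A small sketch (illustrative values only, not from the slides) showing that the cross-entropy of a soft-max output against a one-hot target reduces to −log of the probability assigned to the ground-truth class:

```python
import numpy as np

def cross_entropy(y_pred, y_true_onehot, eps=1e-12):
    """Multi-class cross-entropy: -sum_c y*_c * log(y_c) = -log(y_{c*})."""
    return -np.sum(y_true_onehot * np.log(y_pred + eps), axis=-1)

y_pred = np.array([0.7, 0.2, 0.1])     # soft-max output for one example (K = 3)
y_true = np.array([1.0, 0.0, 0.0])     # one-hot target: ground-truth class c* = 0
print(cross_entropy(y_pred, y_true))   # = -log(0.7) ~ 0.357
```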


Logistic Regression Training

• L_CE(W, b) = (1/N) Σ_{i=1}^{N} ℓ_CE(y_i, y*_i) = −(1/N) Σ_{i=1}^{N} log(y_{c*,i})

• ℓ_CE: smooth convex upper bound of ℓ_{0/1} ⇒ gradient descent optimization

• Gradient descent: W^(t+1) = W^(t) − η ∂L_CE/∂W (and b^(t+1) = b^(t) − η ∂L_CE/∂b)

• MAIN CHALLENGE: computing ∂L_CE/∂W = (1/N) Σ_{i=1}^{N} ∂ℓ_CE/∂W ?

⇒ Key property: the chain rule ∂x/∂z = (∂x/∂y)(∂y/∂z)

⇒ Backpropagation of the gradient of the error!


Chain Rule

∂ℓ/∂x = (∂ℓ/∂y)(∂y/∂x)

• Logistic regression: ∂ℓ_CE/∂W = (∂ℓ_CE/∂y_i)(∂y_i/∂s_i)(∂s_i/∂W)


Logistic Regression Training: Backpropagation

∂ℓ_CE/∂W = (∂ℓ_CE/∂y_i)(∂y_i/∂s_i)(∂s_i/∂W), with ℓ_CE(y_i, y*_i) = −log(y_{c*,i}) ⇒ update for 1 example:

• ∂ℓ_CE/∂y_i = −1/y_{c*,i} = −1/y_i ⊙ δ_{c,c*}
• ∂ℓ_CE/∂s_i = y_i − y*_i = δ_{y_i}
• ∂ℓ_CE/∂W = x_i⊺ δ_{y_i}


Logistic Regression Training: Backpropagation

• Whole dataset: data matrix X (N × m), label matrices Y, Y* (N × K)

• L_CE(W, b) = −(1/N) Σ_{i=1}^{N} log(y_{c*,i}),   ∂L_CE/∂W = (∂L_CE/∂Y)(∂Y/∂S)(∂S/∂W)

• ∂L_CE/∂S = Y − Y* = Δ_y
• ∂L_CE/∂W = X⊺ Δ_y
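A sketch of these vectorized formulas with random illustrative data (shapes and values are not from the slides), computing ∂L_CE/∂W = X⊺(Y − Y*):

```python
import numpy as np

def softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

N, m, K = 8, 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, m))                     # data matrix (N x m)
labels = rng.integers(0, K, size=N)
Y_star = np.eye(K)[labels]                      # one-hot label matrix Y* (N x K)
W, b = rng.normal(size=(m, K)), np.zeros(K)

Y = softmax(X @ W + b)                          # soft-max outputs (N x K)
delta_y = Y - Y_star                            # dL_CE/dS = Y - Y*
grad_W = X.T @ delta_y                          # dL_CE/dW = X^T (Y - Y*)
grad_b = delta_y.sum(axis=0)                    # same rule applied to the bias
print(grad_W.shape, grad_b.shape)               # (4, 3) (3,)
```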


Perceptron Training: Backpropagation

• Perceptron vs Logistic Regression: adding a hidden layer (sigmoid)
• Goal: train parameters W^y and W^h (+ biases) with backpropagation
  ⇒ computing ∂L_CE/∂W^y = (1/N) Σ_{i=1}^{N} ∂ℓ_CE/∂W^y and ∂L_CE/∂W^h = (1/N) Σ_{i=1}^{N} ∂ℓ_CE/∂W^h

• Last hidden layer ∼ Logistic Regression
• First hidden layer: ∂ℓ_CE/∂W^h = x_i⊺ ∂ℓ_CE/∂u_i ⇒ computing ∂ℓ_CE/∂u_i = δ^h_i


Perceptron Training: Backpropagation

• Computing ∂ℓ_CE/∂u_i = δ^h_i ⇒ use the chain rule: ∂ℓ_CE/∂u_i = (∂ℓ_CE/∂v_i)(∂v_i/∂h_i)(∂h_i/∂u_i)

• ... leading to: ∂ℓ_CE/∂u_i = δ^h_i = δ^{y⊺}_i W^y ⊙ σ′(h_i) = δ^{y⊺}_i W^y ⊙ (h_i ⊙ (1 − h_i))


Deep Neural Network Training: Backpropagation

• Multi-Layer Perceptron (MLP): adding more hidden layers

• Backpropagation update ∼ Perceptron case, assuming ∂L/∂U_{l+1} = Δ_{l+1} is known:
  ∂L/∂W_{l+1} = H_l⊺ Δ_{l+1}
  Computing ∂L/∂U_l = Δ_l (= Δ_{l+1}⊺ W_{l+1} ⊙ H_l ⊙ (1 − H_l) for the sigmoid)
  ∂L/∂W_l = H_{l−1}⊺ Δ_l
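A compact sketch of these backpropagation formulas for one hidden sigmoid layer and a soft-max output, on random illustrative data (shapes and names are not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Shapes: X (N x m), Wh (m x L), Wy (L x K); Y* is one-hot (N x K)
N, m, L, K = 16, 4, 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, m))
Y_star = np.eye(K)[rng.integers(0, K, size=N)]
Wh, bh = 0.1 * rng.normal(size=(m, L)), np.zeros(L)
Wy, by = 0.1 * rng.normal(size=(L, K)), np.zeros(K)

# Forward pass
H = sigmoid(X @ Wh + bh)                     # hidden layer (N x L)
Y = softmax(H @ Wy + by)                     # output probabilities (N x K)

# Backward pass (cross-entropy loss)
delta_y = Y - Y_star                         # gradient at the output pre-activation
grad_Wy = H.T @ delta_y                      # dL/dWy = H^T delta_y
delta_h = (delta_y @ Wy.T) * H * (1.0 - H)   # backpropagated through the sigmoid
grad_Wh = X.T @ delta_h                      # dL/dWh = X^T delta_h
print(grad_Wy.shape, grad_Wh.shape)          # (5, 3) (4, 5)
```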


Neural Network Training: Optimization Issues

• Classification loss over the training set (vectorized w, bias b ignored):

  L_CE(w) = (1/N) Σ_{i=1}^{N} ℓ_CE(y_i, y*_i) = −(1/N) Σ_{i=1}^{N} log(y_{c*,i})

• Gradient descent optimization:

  w^(t+1) = w^(t) − η ∂L_CE/∂w (w^(t)) = w^(t) − η ∇_w^(t)

• The gradient ∇_w^(t) = (1/N) Σ_{i=1}^{N} ∂ℓ_CE(y_i, y*_i)/∂w (w^(t)) scales linearly with:
  the dimension of w
  the training set size

⇒ Too slow even for moderate dimensionality & dataset size!


Stochastic Gradient Descent

• Solution: approximate ∇_w^(t) = (1/N) Σ_{i=1}^{N} ∂ℓ_CE(y_i, y*_i)/∂w (w^(t)) with a subset of examples
  ⇒ Stochastic Gradient Descent (SGD)

• Use a single example (online):

  ∇_w^(t) ≈ ∂ℓ_CE(y_i, y*_i)/∂w (w^(t))

• Mini-batch: use B < N examples:

  ∇_w^(t) ≈ (1/B) Σ_{i=1}^{B} ∂ℓ_CE(y_i, y*_i)/∂w (w^(t))

Figures: full gradient / SGD (online) / SGD (mini-batch)
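A generic mini-batch SGD loop sketch (not from the slides); the grad function is assumed to return the gradient averaged over the mini-batch it receives, and the other names and the least-squares toy usage are illustrative:

```python
import numpy as np

def sgd(grad, w0, X, Y, eta=0.1, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: at each step, estimate the gradient on B < N examples only."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    N = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(N)                 # reshuffle examples each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]  # one mini-batch of indices
            w = w - eta * grad(w, X[idx], Y[idx])  # noisy but cheap update
    return w

# Toy usage on least squares: L(w) = mean((X w - y)^2), gradient 2 X^T (X w - y) / B
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true
lsq_grad = lambda w, Xb, Yb: 2.0 * Xb.T @ (Xb @ w - Yb) / len(Yb)
print(sgd(lsq_grad, w0=np.zeros(3), X=X, Y=Y))   # ~ [1.0, -2.0, 0.5]
```

With batch_size = 1 this reduces to online SGD; with batch_size = N it falls back to the full gradient.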


Stochastic Gradient Descent

• SGD: an approximation of the true gradient ∇_w!
  A noisy gradient can lead to a bad direction and increase the loss
  BUT: many more parameter updates per epoch: ×N online, ×N/B with mini-batches
  ⇒ Faster convergence, at the core of Deep Learning for large-scale datasets

Figures: full gradient / SGD (online) / SGD (mini-batch)


Optimization: Learning Rate Decay

• Gradient descent optimization: w^(t+1) = w^(t) − η ∇_w^(t)

• How to set η? ⇒ open question
• Learning rate decay: decrease η as training progresses
  Inverse (time-based) decay: η_t = η_0 / (1 + r·t), with r the decay rate
  Exponential decay: η_t = η_0 · e^(−λt)
  Step decay: η_t = η_0 · r^⌊t/t_u⌋, ...

Figures: exponential decay (η_0 = 0.1, λ = 0.1) and step decay (η_0 = 0.1, r = 0.5, t_u = 10)
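A sketch of the three decay schedules (the helper names are illustrative; the parameter values mirror those quoted for the figures):

```python
import numpy as np

def inverse_decay(t, eta0=0.1, r=0.1):
    """Inverse (time-based) decay: eta_t = eta0 / (1 + r * t)."""
    return eta0 / (1.0 + r * t)

def exponential_decay(t, eta0=0.1, lam=0.1):
    """Exponential decay: eta_t = eta0 * exp(-lam * t)."""
    return eta0 * np.exp(-lam * t)

def step_decay(t, eta0=0.1, r=0.5, t_u=10):
    """Step decay: multiply eta by r every t_u epochs."""
    return eta0 * r ** np.floor(t / t_u)

epochs = np.arange(0, 31, 10)
for name, f in [("inverse", inverse_decay), ("exponential", exponential_decay), ("step", step_decay)]:
    print(name, np.round(f(epochs), 4))
```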


Generalization and Overfitting

• Learning: minimizing the classification loss L_CE over the training set
  Training set: a sample representing the joint data / label distribution
  Ultimate goal: train a prediction function with low prediction error on the true (unknown) data distribution

Figure: two models with L_train = 4 vs L_train = 9, but L_test = 15 vs L_test = 13

⇒ Optimization ≠ Machine Learning!
⇒ Generalization / Overfitting!


Regularization

• Regularization: improving generalization, i.e. test (≠ train) performance
• Structural regularization: add a prior R(w) to the training objective:

  L(w) = L_CE(w) + α R(w)

• L2 regularization: weight decay, R(w) = ||w||²
  Commonly used in neural networks
  Theoretical justifications, generalization bounds (SVM)
• Other possible R(w): L1 regularization, dropout, etc.
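A one-line sketch (not from the slides; names and values are illustrative) of how L2 regularization, i.e. weight decay, enters the update, since ∂(α||w||²)/∂w = 2αw:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_data, eta=0.1, alpha=1e-4):
    """One update of L(w) = L_CE(w) + alpha * ||w||^2: the extra gradient term is 2*alpha*w."""
    return w - eta * (grad_data + 2.0 * alpha * w)

w = np.array([1.0, -2.0, 0.5])
grad_data = np.array([0.3, -0.1, 0.2])     # illustrative gradient of L_CE at w
print(sgd_step_with_weight_decay(w, grad_data))
```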


Regularization and hyper-parameters

• Neural networks: hyper-parameters to tune:
  Training parameters: learning rate, weight decay, learning rate decay, # epochs, etc.
  Architectural parameters: number of layers, number of neurons, non-linearity type, etc.

• Hyper-parameter tuning ⇒ to improve generalization: estimate performance on a validation set


Neural networks: Conclusion

• Training issues at several levels: optimization, generalization, cross-validation
• Limits of fully connected layers, and Convolutional Neural Nets? ⇒ next course!


References I

[MP43] Warren S. McCulloch and Walter Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics 5 (1943), no. 4, 115–133.
