
Rapid Introduction to Machine Learning/Deep Learning

Hyeong In Choi

Seoul National University


Lecture 1b: Logistic regression & neural network

October 2, 2015


Table of contents

1 Bird’s-eye view of Lecture 1b
  1.1 Objectives
  1.2 Quick Summary

2. GLM: Generalized linear model
  2.1 Exponential family of distributions
  2.2 Generalized linear model (GLM)
  2.3 Parameter estimation

3. XOR problem and neural network with hidden layer

4. Universal approximation
  4.1 Further construction
  4.2 Universal approximation theorem
  4.3 Deep vs Shallow learning


1 Bird’s-eye view of Lecture 1b

1.1 Objectives

Objective 1

Understand logistic regression (binary classification) and its multiclass generalization (softmax regression)

Objective 2

Recast logistic and softmax regression in a neural network (perceptron) formalism


Objective 3

Learn the limitations of the perceptron by looking at the XOR problem

Learn how to fix it by adding a hidden layer

Objective 4

Introduce the Universal Approximation Theorem

Learn about the clash of the "Deep vs Shallow" paradigms in machine learning


1.2 Quick Summary

Logistic regression

Data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^N$

input $x^{(t)} \in \mathbb{R}^d$

label $y^{(t)} \in \{0, 1\}$

Given $x = (x_1, \cdots, x_d) \in \mathbb{R}^d$, logistic regression outputs the probability of the output label $y$ being equal to 1 by

$$P[y = 1 \mid x] = \mathrm{sigm}(b + w_1 x_1 + \cdots + w_d x_d),$$

where

$$\mathrm{sigm}(t) = \frac{e^t}{1 + e^t}.$$


Thus

$$P[y = 1 \mid x] = \frac{e^{b + \sum_j w_j x_j}}{1 + e^{b + \sum_j w_j x_j}}$$

$$P[y = 0 \mid x] = \frac{1}{1 + e^{b + \sum_j w_j x_j}}$$

Decision

Given $x$, decide the output label is $y$, where

$$\begin{cases} y = 1 & \text{if } b + \sum_j w_j x_j \geq 0 \\ y = 0 & \text{if } b + \sum_j w_j x_j < 0 \end{cases}$$

[Thus the decision boundary is the hyperplane $b + \sum_j w_j x_j = 0$ in $\mathbb{R}^d$]
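As a minimal numerical sketch of this rule (not from the slides; the parameter values b and w below are made up for illustration):

```python
import math

def sigm(t):
    # logistic sigmoid: e^t / (1 + e^t)
    return 1.0 / (1.0 + math.exp(-t))

def logistic_predict(x, w, b):
    """Return (P[y = 1 | x], decided label) for logistic regression."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    p = sigm(z)
    label = 1 if z >= 0 else 0  # decision boundary: b + sum_j w_j x_j = 0
    return p, label

# made-up parameters for d = 2
b, w = -1.0, [2.0, 0.5]
print(logistic_predict([1.0, 0.2], w, b))  # z = 1.1 >= 0, so label 1
```

Note that $p \geq 1/2$ exactly when $z \geq 0$, so thresholding $z$ and thresholding the probability give the same decision.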


Neural network formulation

Figure: Neural Network


$z$: input to the output neuron.

$$z = b + w_1 x_1 + \cdots + w_d x_d$$

$h$: output of the output neuron

$$h = \mathrm{sigm}(z) = \mathrm{sigm}(b + w_1 x_1 + \cdots + w_d x_d)$$


Symmetric (redundant) form of logistic regression

The probabilities $P[y = 1 \mid x]$ and $P[y = 0 \mid x]$ have different forms in logistic regression.

We can put them in symmetric form by rewriting them in the following (redundant) form:

$$P[y = 1 \mid x] = \frac{\exp\big(b_1 + \sum_j w_{1j} x_j\big)}{\exp\big(b_1 + \sum_j w_{1j} x_j\big) + \exp\big(b_2 + \sum_j w_{2j} x_j\big)}$$

$$P[y = 0 \mid x] = \frac{\exp\big(b_2 + \sum_j w_{2j} x_j\big)}{\exp\big(b_1 + \sum_j w_{1j} x_j\big) + \exp\big(b_2 + \sum_j w_{2j} x_j\big)}$$


Decision

Given $x$, decide the output label is $y$, where

$$\begin{cases} y = 1 & \text{if } b_1 + \sum_j w_{1j} x_j \geq b_2 + \sum_j w_{2j} x_j \\ y = 0 & \text{if } b_1 + \sum_j w_{1j} x_j < b_2 + \sum_j w_{2j} x_j \end{cases}$$

The decision boundary is the hyperplane

$$b_1 + \sum_j w_{1j} x_j = b_2 + \sum_j w_{2j} x_j$$

in $\mathbb{R}^d$


Neural network formulation

Figure: Neural Network


$z_i$: input to the $i$th neuron in the output layer.

$$z_i = b_i + \sum_j w_{ij} x_j, \quad i = 1, 2$$

$h_i$: output of the $i$th neuron in the output layer.

$$h_i = \frac{e^{z_i}}{e^{z_1} + e^{z_2}}, \quad i = 1, 2$$


Softmax regression: multiclass classification

There are $K$ output labels, i.e., $y \in \{1, \cdots, K\}$

Probability

$$P[y = i \mid x] = \frac{\exp\big(b_i + \sum_j w_{ij} x_j\big)}{\exp\big(b_1 + \sum_j w_{1j} x_j\big) + \cdots + \exp\big(b_K + \sum_j w_{Kj} x_j\big)}$$

for $i = 1, \cdots, K$.

Decision

Given $x$, decide the output label is $y$, where

$$y = \operatorname{argmax}_i P[y = i \mid x]$$
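A corresponding sketch for the multiclass case (again with made-up parameters; labels are taken as 1, ..., K):

```python
import numpy as np

def softmax_predict(x, W, b):
    """Return (P[y = i | x] for i = 1..K, argmax label)."""
    z = b + W @ x                  # z_i = b_i + sum_j w_ij x_j
    e = np.exp(z - z.max())        # subtracting max(z) leaves the ratios unchanged
    p = e / e.sum()
    return p, int(p.argmax()) + 1  # +1 because labels run 1..K

# made-up parameters: K = 3 classes, d = 2 inputs
W = np.array([[1.0, -1.0],
              [0.0,  2.0],
              [-1.0, 0.5]])
b = np.array([0.1, -0.2, 0.0])
print(softmax_predict(np.array([0.5, 1.5]), W, b))
```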


Decision boundary

Figure: example of decision boundary

Decision regions are partitioned by linear hyperplanes in $\mathbb{R}^d$


Neural network formalism

Figure: Neural network


$z_i$: input to the $i$th neuron in the output layer.

$$z_i = b_i + \sum_j w_{ij} x_j, \quad i = 1, \cdots, K$$

$h_i$: output of the $i$th neuron in the output layer.

$$h_i = \frac{e^{z_i}}{\sum_k e^{z_k}} = P[y = i \mid x], \quad i = 1, \cdots, K$$

In vector notation, we write:

$$(h_1, \cdots, h_K) = \mathrm{softmax}(z_1, \cdots, z_K),$$

or $h = \mathrm{softmax}(z)$


XOR problem

Given a data set $D$ consisting of 4 points in $\mathbb{R}^2$ in 2 classes as shown in the following:

Figure: XOR

Note that there is no line that separates these two classes


But if we add one more (hidden) layer to the neural network, then this network can separate the two classes

Figure: hidden layer


Cybenko-Hornik-Funahashi Theorem

Let $\Sigma = [0, 1]^d$: the $d$-dimensional hypercube. Then sums of the form

$$f(x) = \sum_i c_i \, \mathrm{sigm}\Big(b_i + \sum_{j=1}^d w_{ij} x_j\Big)$$

can approximate any continuous function on $\Sigma$ to any degree of accuracy


Universal Approximation

This theorem implies that a neural network with one hidden layer is good enough to do any classification job with small error, at least in principle.

In fact, Lecture 2 should be viewed in this spirit

Then, why deep learning?

2. GLM: Generalized linear model

2.1 Exponential family of distributions

Exponential family of distributions

An exponential family of distributions in canonical form is a probability distribution of the form:

$$P_\theta(y) = \frac{1}{Z(\theta)} h(y) \exp\Big(\sum_i \theta_i T_i(y)\Big),$$

where

$y = (y_1, \cdots, y_K) \in \mathbb{R}^K$,

$\theta = (\theta_1, \cdots, \theta_m) \in \mathbb{R}^m$,

$T : \mathbb{R}^K \to \mathbb{R}^m$


Rewrite it in the form

$$P_\theta(y) = \exp\Big[\sum_i \theta_i T_i(y) - A(\theta) + C(y)\Big],$$

where

$A(\theta) = \log Z(\theta)$ : log partition (cumulant) function

$C(y) = \log h(y)$

[Remark: Here, we assume the dispersion parameter is 1]


Bernoulli distribution

Random variable Y with value y ∈ {0, 1}. Let

p = P[y = 1]

Then

$$P(y) = p^y (1-p)^{1-y} = \exp\Big[y \log \frac{p}{1-p} + \log(1-p)\Big]$$

In exponential family form:

$T(y) = y$

$$\theta = \log \frac{p}{1-p} = \mathrm{logit}(p)$$

$$p = \mathrm{sigm}(\theta) = \frac{e^\theta}{1 + e^\theta}$$
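A quick numerical check of these identities (a sketch, not part of the slides): the two forms of the pmf agree, and logit and sigm are mutually inverse.

```python
import math

def sigm(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

p = 0.3
assert abs(sigm(logit(p)) - p) < 1e-12  # p = sigm(theta) with theta = logit(p)

for y in (0, 1):
    pmf = p**y * (1 - p)**(1 - y)                      # p^y (1 - p)^(1 - y)
    expfam = math.exp(y * logit(p) + math.log(1 - p))  # exp[y log(p/(1-p)) + log(1-p)]
    assert abs(pmf - expfam) < 1e-12
print("both forms agree for p =", p)
```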


Multivariate Bernoulli (Multinoulli) distribution

Random variable $Y$ with value $y \in \{1, \cdots, K\}$. Let

$$p_i = P[y = i]$$

Define $y_i = I(y = i) \in \{0, 1\}$. Thus $y_1 + \cdots + y_K = 1$, and we have

$$P(y) = p_1^{y_1} \cdots p_K^{y_K} = p_1^{y_1} \cdots p_{K-1}^{y_{K-1}} \, p_K^{1 - \sum_{i=1}^{K-1} y_i} = \exp\Big[\sum_{i=1}^{K-1} y_i \log \frac{p_i}{p_K} + \log p_K\Big]$$

[Note: when K = 2, this is exactly the Bernoulli distribution.]


In the exponential family form:

$T_i(y) = y_i$

$$\theta_i = \log \frac{p_i}{p_K}, \quad i = 1, \cdots, K-1$$


Solving for $p_i$, we get the generalized sigmoid (softmax) function

$$p_i = \frac{e^{\theta_i}}{1 + \sum_{k=1}^{K-1} e^{\theta_k}} = P[y = i], \quad i = 1, \cdots, K-1$$

$$p_K = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\theta_k}} = P[y = K]$$

The generalized logit function

$$\theta_i = \log \frac{p_i}{1 - \sum_{k=1}^{K-1} p_k}, \quad i = 1, \cdots, K-1$$

The above expressions show how $p_1, \cdots, p_{K-1}$ and $\theta_1, \cdots, \theta_{K-1}$ are related;

$p_K$ is obtained by setting $p_K = 1 - (p_1 + \cdots + p_{K-1})$


2.2 Generalized linear model (GLM)

GLM

The GLM mechanism is a way to relate the input vector $x = (x_1, \cdots, x_d)$ to the parameters $\theta_i$ of the GLM by setting

$$\theta_i = b_i + \sum_{j=1}^d w_{ij} x_j,$$

where $b_i$ and $w_{ij}$ are the GLM parameters to be determined by the data

Thus we get

$$p_i = \frac{\exp\big(b_i + \sum_{j=1}^d w_{ij} x_j\big)}{1 + \sum_{k=1}^{K-1} \exp\big(b_k + \sum_{j=1}^d w_{kj} x_j\big)},$$


i.e., $p_i = P[y = i \mid x]$ for $i = 1, \cdots, K-1$, and

$$p_K = P[y = K \mid x] = \frac{1}{1 + \sum_{k=1}^{K-1} \exp\big(b_k + \sum_{j=1}^d w_{kj} x_j\big)}$$


Note: when $K = 2$, this is exactly logistic regression:

$$P[y = 1 \mid x] = p_1 = \frac{\exp\big(b + \sum_j w_j x_j\big)}{1 + \exp\big(b + \sum_j w_j x_j\big)}$$

$$P[y = 0 \mid x] = p_2 = \frac{1}{1 + \exp\big(b + \sum_j w_j x_j\big)}.$$

Here, we set $b = b_1$, $w_j = w_{1j}$


Symmetric (redundant) form

The expression for $p_K$ is different from those for the other $p_i$.

To put $p_1, \cdots, p_K$ in symmetric form, multiply

$$\exp\Big(a + \sum_{j=1}^d \alpha_j x_j\Big)$$

on the numerator and the denominator of $p_i$ and $p_K$. Then

$$p_i = \frac{\exp\big(a + b_i + \sum_{j=1}^d (w_{ij} + \alpha_j) x_j\big)}{\exp\big(a + \sum_{j=1}^d \alpha_j x_j\big) + \sum_{k=1}^{K-1} \exp\big(a + b_k + \sum_{j=1}^d (w_{kj} + \alpha_j) x_j\big)}, \quad i = 1, \cdots, K-1$$


and

$$p_K = \frac{\exp\big(a + \sum_{j=1}^d \alpha_j x_j\big)}{\exp\big(a + \sum_{j=1}^d \alpha_j x_j\big) + \sum_{k=1}^{K-1} \exp\big(a + b_k + \sum_{j=1}^d (w_{kj} + \alpha_j) x_j\big)}$$

Set

$b_i \leftarrow b_i + a$

$w_{ij} \leftarrow w_{ij} + \alpha_j, \quad j = 1, \cdots, d,$

for $i = 1, \cdots, K-1$ and set

$b_K = a$

$w_{Kj} = \alpha_j, \quad j = 1, \cdots, d$


Then we have

$$p_i = \frac{\exp\big(b_i + \sum_{j=1}^d w_{ij} x_j\big)}{\sum_{k=1}^K \exp\big(b_k + \sum_{j=1}^d w_{kj} x_j\big)} = P[y = i \mid x], \quad i = 1, \cdots, K.$$

In vector notation

$$p = (p_1, \cdots, p_K) = \mathrm{softmax}(z_1, \cdots, z_K) = \mathrm{softmax}(z),$$

where

$$z_i = b_i + \sum_{j=1}^d w_{ij} x_j, \quad i = 1, \cdots, K$$
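The redundancy can be seen numerically: adding the same affine function $a + \sum_j \alpha_j x_j$ to every $z_i$ leaves the softmax probabilities unchanged. A sketch with made-up numbers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=3)                              # input, d = 3
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)  # K = 4 classes
alpha, a = rng.normal(size=3), rng.normal()

z = b + W @ x                          # original z_i
z_shifted = (b + a) + (W + alpha) @ x  # b_i <- b_i + a, w_ij <- w_ij + alpha_j

print(np.allclose(softmax(z), softmax(z_shifted)))  # True
```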


Neural network formalism

Figure: Neural network


$z_i$: input to the $i$th neuron in the output layer

$$z_i = b_i + \sum_j w_{ij} x_j, \quad i = 1, \cdots, K$$

$h_i$: output of the $i$th neuron in the output layer

$$h_i = \frac{e^{z_i}}{\sum_k e^{z_k}} = P[y = i \mid x], \quad i = 1, \cdots, K$$

In vector notation, we write:

$$(h_1, \cdots, h_K) = \mathrm{softmax}(z_1, \cdots, z_K),$$

or $h = \mathrm{softmax}(z)$


2.3 Parameter estimation

Determining W and b

So far the parameters, the $K \times 1$ vector $b = [b_1, \cdots, b_K]^T$ and the $K \times d$ matrix $W = [w_{ij}]$, are regarded as given

But we need to determine $b$ and $W$ using the given data

Use MLE (maximum likelihood estimation)

MLE

Data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^N$

Probability

$$P(y \mid x) = p_1^{y_1} \cdots p_K^{y_K},$$


where

$$p_i = \frac{\exp\big(b_i + \sum_{j=1}^d w_{ij} x_j\big)}{\sum_{k=1}^K \exp\big(b_k + \sum_{j=1}^d w_{kj} x_j\big)}$$

Likelihood function

$$L(W, b) = \prod_{t=1}^N P[y^{(t)} \mid x^{(t)}]$$

Log likelihood function

$$\ell(W, b) = \log L(W, b) = \sum_{t=1}^N \log P[y^{(t)} \mid x^{(t)}]$$


Recall

$$P(y \mid x) = p_1^{y_1} \cdots p_K^{y_K}$$

Thus

$$\begin{aligned} \log P[y \mid x] &= y_1 \log p_1 + \cdots + y_K \log p_K \\ &= \sum_{k=1}^K I(y = k) \log p_k \\ &= \sum_{k=1}^K I(y = k) \log P[y = k \mid x] \\ &= \sum_{k=1}^K I(y = k) \log \frac{e^{z_k}}{\sum_{i=1}^K e^{z_i}}, \end{aligned}$$


where $z_i = b_i + \sum_{j=1}^d w_{ij} x_j$

Rewrite the log likelihood function:

$$\begin{aligned} \ell(W, b) &= \sum_{t=1}^N \log P[y^{(t)} \mid x^{(t)}] \\ &= \sum_{t=1}^N \sum_{k=1}^K I(y^{(t)} = k) \log P[y^{(t)} = k \mid x^{(t)}] \\ &= \sum_{t=1}^N \sum_{k=1}^K I(y^{(t)} = k) \log \frac{e^{z_k^{(t)}}}{\sum_{i=1}^K e^{z_i^{(t)}}}, \end{aligned}$$


where $z_i^{(t)} = b_i + \sum_{j=1}^d w_{ij} x_j^{(t)}$

MLE is to find $W$ and $b$ that maximize $\ell(W, b)$

[Note: for softmax regression it turns out that $\ell(W, b)$ is a concave (for generic data sets, strictly concave) function of $W$ and $b$.]
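As a sketch, $\ell(W, b)$ can be computed directly from this formula (the toy data below is made up; labels are coded 0, ..., K-1 for array indexing):

```python
import numpy as np

def log_likelihood(W, b, X, y):
    """ell(W, b) = sum_t log P[y^(t) | x^(t)] for softmax regression.
    X: (N, d) inputs; y: (N,) labels in {0, ..., K-1}."""
    Z = X @ W.T + b                       # Z[t, i] = b_i + sum_j w_ij x_j^(t)
    Z = Z - Z.max(axis=1, keepdims=True)  # for numerical stability
    logp = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))  # log softmax
    return logp[np.arange(len(y)), y].sum()

# made-up data: N = 4, d = 2, K = 3
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0, 1, 2, 0])
W, b = np.zeros((3, 2)), np.zeros(3)
print(log_likelihood(W, b, X, y))  # 4 * log(1/3) for all-zero parameters
```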


Neural network formalism

Recall

Figure: Neural network


For each input $x^{(t)}$

$z_i^{(t)}$: input to the $i$th neuron in the output layer.

$$z_i^{(t)} = b_i + \sum_j w_{ij} x_j^{(t)}, \quad i = 1, \cdots, K$$

$h_i^{(t)}$: output of the $i$th neuron in the output layer.

$$h_i^{(t)} = \frac{e^{z_i^{(t)}}}{\sum_k e^{z_k^{(t)}}}, \quad i = 1, \cdots, K$$


For neural networks, the error function is set to be $-\ell(W, b)$ and the training is to minimize this error.

[Note: This neural network training is exactly the same as the MLE estimation in softmax regression]

Training (learning) of a neural network in the case of a single-layer (no hidden layer) network

Training (learning) is a convex optimization problem, so it is a relatively easy problem

Three kinds of training (learning) strategies (a sketch of mini-batch training follows this list):

Full-batch learning: train using all data in D at once

Mini-batch learning: train using a small portion of D successively, and cycle through them

On-line learning: train using one data point at a time and cycle through them
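A minimal sketch of the mini-batch strategy for softmax regression (gradient ascent on $\ell$, equivalently descent on $-\ell$; the learning rate, batch size, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def train_minibatch(X, y, K, lr=0.1, epochs=100, batch=2):
    N, d = X.shape
    W, b = np.zeros((K, d)), np.zeros(K)
    Y = np.eye(K)[y]                        # one-hot rows: y_i = I(y = i)
    for _ in range(epochs):
        for s in range(0, N, batch):        # cycle through small portions of D
            Xb, Yb = X[s:s + batch], Y[s:s + batch]
            P = softmax_rows(Xb @ W.T + b)  # h_i = P[y = i | x]
            G = Yb - P                      # d ell / d z for each sample
            W += lr * G.T @ Xb              # ascend the log likelihood
            b += lr * G.sum(axis=0)
    return W, b
```

Setting batch = N recovers full-batch learning, and batch = 1 recovers on-line learning.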


3. XOR problem and neural network with hidden layer

XOR Problem

Separate X’s from O’s


$$\mathrm{XOR}(x_1, x_2) = x_1 \bar{x}_2 + \bar{x}_1 x_2$$

$x_1 \bar{x}_2$ :


$$z_1 = a\Big(x_1 - x_2 - \frac{1}{2}\Big), \quad a : \text{large}$$

$$h_1 = \mathrm{sigm}(z_1)$$


$\bar{x}_1 x_2$ :


$$z_2 = a\Big(-x_1 + x_2 - \frac{1}{2}\Big), \quad a : \text{large}$$

$$h_2 = \mathrm{sigm}(z_2)$$


$$z_3 = b\Big(h_1 + h_2 - \frac{1}{2}\Big), \quad b : \text{large}$$

$$h_3 = \mathrm{sigm}(z_3)$$


This neural network achieves the separation
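This can be checked numerically; a sketch with the specific values a = b = 10 (any sufficiently large values behave the same):

```python
import math

def sigm(t):
    return 1.0 / (1.0 + math.exp(-t))

def xor_net(x1, x2, a=10.0, b=10.0):
    h1 = sigm(a * (x1 - x2 - 0.5))    # detects x1 = 1, x2 = 0
    h2 = sigm(a * (-x1 + x2 - 0.5))   # detects x1 = 0, x2 = 1
    return sigm(b * (h1 + h2 - 0.5))  # fires when h1 or h2 fires

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xor_net(x1, x2), 3))
# prints ~1 exactly when x1 != x2, and ~0 otherwise
```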


4. Universal approximation

4.1 Further construction

Further construction

The NN constructed above has values


Can also construct another NN


The region where $h_1 \approx h_2 \approx h_3 \approx h_4 \approx 0$ is

The neural network


One can easily find a hyperplane in $\mathbb{R}^4$ that separates $(0, 0, 0, 0)$ from the rest; this hyperplane defines $h_5$, which defines a function with value $\approx 0$ in the center and $\approx 1$ in the rest


Continuing this way, one can construct any approximate bump function as an output of a neural network with one hidden layer

Combining these bump functions, one can approximate any continuous function

Namely, a neural network with one hidden layer can do any task, at least in principle
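A one-dimensional sketch of this idea (illustrative only; the target function sin(2πx), the number of bumps, and the slope are all arbitrary choices): a difference of two steep sigmoids is an approximate bump, and a weighted sum of such bumps, which is exactly a sum of the form $\sum_i c_i \, \mathrm{sigm}(b_i + w_i x)$, approximates a continuous function.

```python
import numpy as np

def sigm(t):
    # clipped to avoid overflow warnings for large |t|
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60.0, 60.0)))

def bump(x, left, right, a=1000.0):
    # approximate indicator of [left, right]: difference of two steep sigmoids
    return sigm(a * (x - left)) - sigm(a * (x - right))

x = np.linspace(0.0, 1.0, 1001)
n = 50                               # number of bumps
edges = np.linspace(0.0, 1.0, n + 1)
f_hat = sum(np.sin(2 * np.pi * (l + r) / 2) * bump(x, l, r)
            for l, r in zip(edges[:-1], edges[1:]))

err = np.max(np.abs(f_hat - np.sin(2 * np.pi * x)))
print(err)  # small, and shrinks as n grows (with a steep enough)
```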


4.2 Universal approximation theorem

This heuristic argument can be made rigorous using a Stone-Weierstrass-type argument to get the Cybenko-Hornik-Funahashi Theorem

Cybenko-Hornik-Funahashi Theorem

Let $\Sigma = [0, 1]^d$: the $d$-dimensional hypercube. Then sums of the form

$$f(x) = \sum_i c_i \, \mathrm{sigm}\Big(b_i + \sum_{j=1}^d w_{ij} x_j\Big)$$

can approximate any continuous function on $\Sigma$ to any degree of accuracy


There are many similar results to this effect


4.3 Deep vs Shallow learning

This theorem says that, at least in principle, one can do any classification with a neural network with one hidden layer

Deep learning utilizes neural networks with many hidden layers, typically up to 40 or more layers.

Question:

If the Universal Approximation Theorem says one can do the job with only one hidden layer, why does one use so many hidden layers? What is the advantage in doing so? This is one big question we would like to address in the rest of this lecture series.


To achieve high accuracy, the number of terms has to be huge, and the training (learning) is a big problem: a typical problem of shallow networks (shallow learning)

In contrast, a deep NN arranges neurons in depth for more efficiency and better training, but training is a very subtle issue [which will be dealt with later in this lecture series]