
Rapid Introduction to Machine Learning/Deep Learning

Hyeong In Choi

Seoul National University


Lecture 1b: Logistic regression & neural network

October 2, 2015


Table of contents

1 Bird’s-eye view of Lecture 1b
  1.1 Objectives
  1.2 Quick Summary

2. GLM: Generalized linear model
  2.1 Exponential family of distributions
  2.2 Generalized linear model (GLM)
  2.3 Parameter estimation

3. XOR problem and neural network with hidden layer

4. Universal approximation
  4.1 Further construction
  4.2 Universal approximation theorem
  4.3 Deep vs Shallow learning


1 Bird’s-eye view of Lecture 1b

1.1 Objectives

Objective 1

Understand logistic regression (binary classification) and its multiclass generalization (softmax regression)

Objective 2

Recast logistic and softmax regression in a neural network (perceptron) formalism


Objective 3

Learn the limitations of the perceptron by looking at the XOR problem

Learn how to fix it by adding a hidden layer

Objective 4

Introduce the Universal Approximation Theorem

Learn about the clash of the "Deep vs Shallow" paradigms in machine learning


1.2 Quick Summary

Logistic regression

Data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^N$

input $x^{(t)} \in \mathbb{R}^d$

label $y^{(t)} \in \{0, 1\}$

Given $x = (x_1, \cdots, x_d) \in \mathbb{R}^d$, logistic regression outputs the probability of the output label $y$ being equal to 1 by

$$P[y = 1 \mid x] = \mathrm{sigm}(b + w_1 x_1 + \cdots + w_d x_d),$$

where

$$\mathrm{sigm}(t) = \frac{e^t}{1 + e^t}.$$


Thus

$$P[y = 1 \mid x] = \frac{e^{b + \sum_j w_j x_j}}{1 + e^{b + \sum_j w_j x_j}}$$

$$P[y = 0 \mid x] = \frac{1}{1 + e^{b + \sum_j w_j x_j}}$$

Decision

Given $x$, decide the output label is $y$, where

$$\begin{cases} y = 1 & \text{if } b + \sum_j w_j x_j \geq 0 \\ y = 0 & \text{if } b + \sum_j w_j x_j < 0 \end{cases}$$

[Thus the decision boundary is the hyperplane $b + \sum_j w_j x_j = 0$ in $\mathbb{R}^d$]
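As a minimal numerical sketch of this rule (not from the slides; the parameter values b and w below are made up for illustration):

```python
import math

def sigm(t):
    # logistic sigmoid: e^t / (1 + e^t)
    return 1.0 / (1.0 + math.exp(-t))

def logistic_predict(x, w, b):
    """Return (P[y = 1 | x], decided label) for logistic regression."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    p = sigm(z)
    label = 1 if z >= 0 else 0  # decision boundary: b + sum_j w_j x_j = 0
    return p, label

# made-up parameters for d = 2
b, w = -1.0, [2.0, 0.5]
print(logistic_predict([1.0, 0.2], w, b))  # z = 1.1 >= 0, so label 1
```

Note that $p \geq 1/2$ exactly when $z \geq 0$, so thresholding $z$ and thresholding the probability give the same decision.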


Neural network formulation

Figure: Neural Network


$z$: input to the output neuron.

$$z = b + w_1 x_1 + \cdots + w_d x_d$$

$h$: output of the output neuron

$$h = \mathrm{sigm}(z) = \mathrm{sigm}(b + w_1 x_1 + \cdots + w_d x_d)$$


Symmetric (redundant) form of logistic regression

The probabilities $P[y = 1 \mid x]$ and $P[y = 0 \mid x]$ have different forms in logistic regression.

We can put them in symmetric form by rewriting them in the following (redundant) form:

$$P[y = 1 \mid x] = \frac{\exp\big(b_1 + \sum_j w_{1j} x_j\big)}{\exp\big(b_1 + \sum_j w_{1j} x_j\big) + \exp\big(b_2 + \sum_j w_{2j} x_j\big)}$$

$$P[y = 0 \mid x] = \frac{\exp\big(b_2 + \sum_j w_{2j} x_j\big)}{\exp\big(b_1 + \sum_j w_{1j} x_j\big) + \exp\big(b_2 + \sum_j w_{2j} x_j\big)}$$


Decision

Given $x$, decide the output label is $y$, where

$$\begin{cases} y = 1 & \text{if } b_1 + \sum_j w_{1j} x_j \geq b_2 + \sum_j w_{2j} x_j \\ y = 0 & \text{if } b_1 + \sum_j w_{1j} x_j < b_2 + \sum_j w_{2j} x_j \end{cases}$$

The decision boundary is the hyperplane

$$b_1 + \sum_j w_{1j} x_j = b_2 + \sum_j w_{2j} x_j$$

in $\mathbb{R}^d$


Neural network formulation

Figure: Neural Network


$z_i$: input to the $i$th neuron in the output layer.

$$z_i = b_i + \sum_j w_{ij} x_j, \quad i = 1, 2$$

$h_i$: output of the $i$th neuron in the output layer.

$$h_i = \frac{e^{z_i}}{e^{z_1} + e^{z_2}}, \quad i = 1, 2$$


Softmax regression: multiclass classification

There are $K$ output labels, i.e., $y \in \{1, \cdots, K\}$

Probability

$$P[y = i \mid x] = \frac{\exp\big(b_i + \sum_j w_{ij} x_j\big)}{\exp\big(b_1 + \sum_j w_{1j} x_j\big) + \cdots + \exp\big(b_K + \sum_j w_{Kj} x_j\big)}$$

for $i = 1, \cdots, K$.

Decision

Given $x$, decide the output label is $y$, where

$$y = \operatorname{argmax}_i P[y = i \mid x]$$
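A corresponding sketch for the multiclass case (again with made-up parameters; labels are taken as 1, ..., K):

```python
import numpy as np

def softmax_predict(x, W, b):
    """Return (P[y = i | x] for i = 1..K, argmax label)."""
    z = b + W @ x                  # z_i = b_i + sum_j w_ij x_j
    e = np.exp(z - z.max())        # subtracting max(z) leaves the ratios unchanged
    p = e / e.sum()
    return p, int(p.argmax()) + 1  # +1 because labels run 1..K

# made-up parameters: K = 3 classes, d = 2 inputs
W = np.array([[1.0, -1.0],
              [0.0,  2.0],
              [-1.0, 0.5]])
b = np.array([0.1, -0.2, 0.0])
print(softmax_predict(np.array([0.5, 1.5]), W, b))
```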


Decision boundary

Figure: example of decision boundary

Decision regions are partitioned by linear hyperplanes in $\mathbb{R}^d$


Neural network formalism

Figure: Neural network


$z_i$: input to the $i$th neuron in the output layer.

$$z_i = b_i + \sum_j w_{ij} x_j, \quad i = 1, \cdots, K$$

$h_i$: output of the $i$th neuron in the output layer.

$$h_i = \frac{e^{z_i}}{\sum_k e^{z_k}} = P[y = i \mid x], \quad i = 1, \cdots, K$$

In vector notation, we write:

$$(h_1, \cdots, h_K) = \mathrm{softmax}(z_1, \cdots, z_K),$$

or $h = \mathrm{softmax}(z)$


XOR problem

Given a data set $D$ consisting of 4 points in $\mathbb{R}^2$ in 2 classes as shown in the following:

Figure: XOR

Note that there is no line that separates these two classes


But if we add one more (hidden) layer to the neural network, then this network can separate the two classes

Figure: hidden layer


Cybenko-Hornik-Funahashi Theorem

Let $\Sigma = [0, 1]^d$: the $d$-dimensional hypercube. Then sums of the form

$$f(x) = \sum_i c_i \, \mathrm{sigm}\Big(b_i + \sum_{j=1}^d w_{ij} x_j\Big)$$

can approximate any continuous function on $\Sigma$ to any degree of accuracy


Universal Approximation

This theorem implies that a neural network with one hidden layer is good enough to do any classification job with small error, at least in principle.

In fact, Lecture 2 should be viewed in this spirit

Then, why deep learning?

2. GLM: Generalized linear model

2.1 Exponential family of distributions

Exponential family of distributions

An exponential family of distributions in canonical form is a probability distribution of the form:

$$P_\theta(y) = \frac{1}{Z(\theta)} h(y) \exp\Big(\sum_i \theta_i T_i(y)\Big),$$

where

$y = (y_1, \cdots, y_K) \in \mathbb{R}^K$,

$\theta = (\theta_1, \cdots, \theta_m) \in \mathbb{R}^m$,

$T : \mathbb{R}^K \to \mathbb{R}^m$


Rewrite it in the form

$$P_\theta(y) = \exp\Big[\sum_i \theta_i T_i(y) - A(\theta) + C(y)\Big],$$

where

$A(\theta) = \log Z(\theta)$ : log partition (cumulant) function

$C(y) = \log h(y)$

[Remark: Here, we assume the dispersion parameter is 1]


Bernoulli distribution

Random variable Y with value y ∈ {0, 1}. Let

p = P[y = 1]

Then

$$P(y) = p^y (1-p)^{1-y} = \exp\Big[y \log \frac{p}{1-p} + \log(1-p)\Big]$$

In exponential family form:

$T(y) = y$

$$\theta = \log \frac{p}{1-p} = \mathrm{logit}(p)$$

$$p = \mathrm{sigm}(\theta) = \frac{e^\theta}{1 + e^\theta}$$
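A quick numerical check of these identities (a sketch, not part of the slides): the two forms of the pmf agree, and logit and sigm are mutually inverse.

```python
import math

def sigm(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

p = 0.3
assert abs(sigm(logit(p)) - p) < 1e-12  # p = sigm(theta) with theta = logit(p)

for y in (0, 1):
    pmf = p**y * (1 - p)**(1 - y)                      # p^y (1 - p)^(1 - y)
    expfam = math.exp(y * logit(p) + math.log(1 - p))  # exp[y log(p/(1-p)) + log(1-p)]
    assert abs(pmf - expfam) < 1e-12
print("both forms agree for p =", p)
```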


Multivariate Bernoulli (Multinoulli) distribution

Random variable $Y$ with value $y \in \{1, \cdots, K\}$. Let

$$p_i = P[y = i]$$

Define $y_i = I(y = i) \in \{0, 1\}$. Thus $y_1 + \cdots + y_K = 1$, and we have

$$P(y) = p_1^{y_1} \cdots p_K^{y_K} = p_1^{y_1} \cdots p_{K-1}^{y_{K-1}} \, p_K^{1 - \sum_{i=1}^{K-1} y_i} = \exp\Big[\sum_{i=1}^{K-1} y_i \log \frac{p_i}{p_K} + \log p_K\Big]$$

[Note: when K = 2, this is exactly the Bernoulli distribution.]


In the exponential family form:

$T_i(y) = y_i$

$$\theta_i = \log \frac{p_i}{p_K}, \quad i = 1, \cdots, K-1$$


Solving for $p_i$, we get the generalized sigmoid (softmax) function

$$p_i = \frac{e^{\theta_i}}{1 + \sum_{k=1}^{K-1} e^{\theta_k}} = P[y = i], \quad i = 1, \cdots, K-1$$

$$p_K = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\theta_k}} = P[y = K]$$

The generalized logit function

$$\theta_i = \log \frac{p_i}{1 - \sum_{k=1}^{K-1} p_k}, \quad i = 1, \cdots, K-1$$

The above expressions show how $p_1, \cdots, p_{K-1}$ and $\theta_1, \cdots, \theta_{K-1}$ are related;

$p_K$ is obtained by setting $p_K = 1 - (p_1 + \cdots + p_{K-1})$


2.2 Generalized linear model (GLM)

GLM

The GLM mechanism is a way to relate the input vector $x = (x_1, \cdots, x_d)$ to the parameters $\theta_i$ of the GLM by setting

$$\theta_i = b_i + \sum_{j=1}^d w_{ij} x_j,$$

where $b_i$ and $w_{ij}$ are the GLM parameters to be determined by the data

Thus we get

$$p_i = \frac{\exp\big(b_i + \sum_{j=1}^d w_{ij} x_j\big)}{1 + \sum_{k=1}^{K-1} \exp\big(b_k + \sum_{j=1}^d w_{kj} x_j\big)},$$


i.e., $p_i = P[y = i \mid x]$ for $i = 1, \cdots, K-1$, and

$$p_K = P[y = K \mid x] = \frac{1}{1 + \sum_{k=1}^{K-1} \exp\big(b_k + \sum_{j=1}^d w_{kj} x_j\big)}$$


Note: when $K = 2$, this is exactly logistic regression:

$$P[y = 1 \mid x] = p_1 = \frac{\exp\big(b + \sum_j w_j x_j\big)}{1 + \exp\big(b + \sum_j w_j x_j\big)}$$

$$P[y = 0 \mid x] = p_2 = \frac{1}{1 + \exp\big(b + \sum_j w_j x_j\big)}.$$

Here, we set $b = b_1$, $w_j = w_{1j}$


Symmetric (redundant) form

The expression for $p_K$ is different from those for the other $p_i$.

To put $p_1, \cdots, p_K$ in symmetric form, multiply

$$\exp\Big(a + \sum_{j=1}^d \alpha_j x_j\Big)$$

on the numerator and the denominator of $p_i$ and $p_K$. Then

$$p_i = \frac{\exp\big(a + b_i + \sum_{j=1}^d (w_{ij} + \alpha_j) x_j\big)}{\exp\big(a + \sum_{j=1}^d \alpha_j x_j\big) + \sum_{k=1}^{K-1} \exp\big(a + b_k + \sum_{j=1}^d (w_{kj} + \alpha_j) x_j\big)}, \quad i = 1, \cdots, K-1$$


and

$$p_K = \frac{\exp\big(a + \sum_{j=1}^d \alpha_j x_j\big)}{\exp\big(a + \sum_{j=1}^d \alpha_j x_j\big) + \sum_{k=1}^{K-1} \exp\big(a + b_k + \sum_{j=1}^d (w_{kj} + \alpha_j) x_j\big)}$$

Set

$b_i \leftarrow b_i + a$

$w_{ij} \leftarrow w_{ij} + \alpha_j, \quad j = 1, \cdots, d,$

for $i = 1, \cdots, K-1$ and set

$b_K = a$

$w_{Kj} = \alpha_j, \quad j = 1, \cdots, d$


Then we have

$$p_i = \frac{\exp\big(b_i + \sum_{j=1}^d w_{ij} x_j\big)}{\sum_{k=1}^K \exp\big(b_k + \sum_{j=1}^d w_{kj} x_j\big)} = P[y = i \mid x], \quad i = 1, \cdots, K.$$

In vector notation

$$p = (p_1, \cdots, p_K) = \mathrm{softmax}(z_1, \cdots, z_K) = \mathrm{softmax}(z),$$

where

$$z_i = b_i + \sum_{j=1}^d w_{ij} x_j, \quad i = 1, \cdots, K$$
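The redundancy can be seen numerically: adding the same affine function $a + \sum_j \alpha_j x_j$ to every $z_i$ leaves the softmax probabilities unchanged. A sketch with made-up numbers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=3)                              # input, d = 3
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)  # K = 4 classes
alpha, a = rng.normal(size=3), rng.normal()

z = b + W @ x                          # original z_i
z_shifted = (b + a) + (W + alpha) @ x  # b_i <- b_i + a, w_ij <- w_ij + alpha_j

print(np.allclose(softmax(z), softmax(z_shifted)))  # True
```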


Neural network formalism

Figure: Neural network


$z_i$: input to the $i$th neuron in the output layer

$$z_i = b_i + \sum_j w_{ij} x_j, \quad i = 1, \cdots, K$$

$h_i$: output of the $i$th neuron in the output layer

$$h_i = \frac{e^{z_i}}{\sum_k e^{z_k}} = P[y = i \mid x], \quad i = 1, \cdots, K$$

In vector notation, we write:

$$(h_1, \cdots, h_K) = \mathrm{softmax}(z_1, \cdots, z_K),$$

or $h = \mathrm{softmax}(z)$


2.3 Parameter estimation

Determining W and b

So far the parameters, the $K \times 1$ vector $b = [b_1, \cdots, b_K]^T$ and the $K \times d$ matrix $W = [w_{ij}]$, are regarded as given

But we need to determine $b$ and $W$ using the given data

Use MLE (maximum likelihood estimation)

MLE

Data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^N$

Probability

$$P(y \mid x) = p_1^{y_1} \cdots p_K^{y_K},$$


where

$$p_i = \frac{\exp\big(b_i + \sum_{j=1}^d w_{ij} x_j\big)}{\sum_{k=1}^K \exp\big(b_k + \sum_{j=1}^d w_{kj} x_j\big)}$$

Likelihood function

$$L(W, b) = \prod_{t=1}^N P[y^{(t)} \mid x^{(t)}]$$

Log likelihood function

$$\ell(W, b) = \log L(W, b) = \sum_{t=1}^N \log P[y^{(t)} \mid x^{(t)}]$$


Recall

$$P(y \mid x) = p_1^{y_1} \cdots p_K^{y_K}$$

Thus

$$\begin{aligned} \log P[y \mid x] &= y_1 \log p_1 + \cdots + y_K \log p_K \\ &= \sum_{k=1}^K I(y = k) \log p_k \\ &= \sum_{k=1}^K I(y = k) \log P[y = k \mid x] \\ &= \sum_{k=1}^K I(y = k) \log \frac{e^{z_k}}{\sum_{i=1}^K e^{z_i}}, \end{aligned}$$


where $z_i = b_i + \sum_{j=1}^d w_{ij} x_j$

Rewrite the log likelihood function:

$$\begin{aligned} \ell(W, b) &= \sum_{t=1}^N \log P[y^{(t)} \mid x^{(t)}] \\ &= \sum_{t=1}^N \sum_{k=1}^K I(y^{(t)} = k) \log P[y^{(t)} = k \mid x^{(t)}] \\ &= \sum_{t=1}^N \sum_{k=1}^K I(y^{(t)} = k) \log \frac{e^{z_k^{(t)}}}{\sum_{i=1}^K e^{z_i^{(t)}}}, \end{aligned}$$


where $z_i^{(t)} = b_i + \sum_{j=1}^d w_{ij} x_j^{(t)}$

MLE is to find $W$ and $b$ that maximize $\ell(W, b)$

[Note: for softmax regression it turns out that $\ell(W, b)$ is a concave (for generic data sets, strictly concave) function of $W$ and $b$.]
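As a sketch, $\ell(W, b)$ can be computed directly from this formula (the toy data below is made up; labels are coded 0, ..., K-1 for array indexing):

```python
import numpy as np

def log_likelihood(W, b, X, y):
    """ell(W, b) = sum_t log P[y^(t) | x^(t)] for softmax regression.
    X: (N, d) inputs; y: (N,) labels in {0, ..., K-1}."""
    Z = X @ W.T + b                       # Z[t, i] = b_i + sum_j w_ij x_j^(t)
    Z = Z - Z.max(axis=1, keepdims=True)  # for numerical stability
    logp = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))  # log softmax
    return logp[np.arange(len(y)), y].sum()

# made-up data: N = 4, d = 2, K = 3
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0, 1, 2, 0])
W, b = np.zeros((3, 2)), np.zeros(3)
print(log_likelihood(W, b, X, y))  # 4 * log(1/3) for all-zero parameters
```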


Neural network formalism

Recall

Figure: Neural network


For each input $x^{(t)}$

$z_i^{(t)}$: input to the $i$th neuron in the output layer.

$$z_i^{(t)} = b_i + \sum_j w_{ij} x_j^{(t)}, \quad i = 1, \cdots, K$$

$h_i^{(t)}$: output of the $i$th neuron in the output layer.

$$h_i^{(t)} = \frac{e^{z_i^{(t)}}}{\sum_k e^{z_k^{(t)}}}, \quad i = 1, \cdots, K$$


For neural networks, the error function is set to be $-\ell(W, b)$ and the training is to minimize this error.

[Note: This neural network training is exactly the same as the MLE estimation in softmax regression]

Training (learning) of a neural network in the case of a single-layer (no hidden layer) network

Training (learning) is a convex optimization problem, so it is a relatively easy problem

Three kinds of training (learning) strategies (a sketch of mini-batch training follows this list):

Full-batch learning: train using all data in D at once

Mini-batch learning: train using a small portion of D successively, and cycle through them

On-line learning: train using one data point at a time and cycle through them
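A minimal sketch of the mini-batch strategy for softmax regression (gradient ascent on $\ell$, equivalently descent on $-\ell$; the learning rate, batch size, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def train_minibatch(X, y, K, lr=0.1, epochs=100, batch=2):
    N, d = X.shape
    W, b = np.zeros((K, d)), np.zeros(K)
    Y = np.eye(K)[y]                        # one-hot rows: y_i = I(y = i)
    for _ in range(epochs):
        for s in range(0, N, batch):        # cycle through small portions of D
            Xb, Yb = X[s:s + batch], Y[s:s + batch]
            P = softmax_rows(Xb @ W.T + b)  # h_i = P[y = i | x]
            G = Yb - P                      # d ell / d z for each sample
            W += lr * G.T @ Xb              # ascend the log likelihood
            b += lr * G.sum(axis=0)
    return W, b
```

Setting batch = N recovers full-batch learning, and batch = 1 recovers on-line learning.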


3. XOR problem and neural network with hidden layer

XOR Problem

Separate X’s from O’s


$$\mathrm{XOR}(x_1, x_2) = x_1 \bar{x}_2 + \bar{x}_1 x_2$$

$x_1 \bar{x}_2$ :


$$z_1 = a\Big(x_1 - x_2 - \frac{1}{2}\Big), \quad a : \text{large}$$

$$h_1 = \mathrm{sigm}(z_1)$$


$\bar{x}_1 x_2$ :


$$z_2 = a\Big(-x_1 + x_2 - \frac{1}{2}\Big), \quad a : \text{large}$$

$$h_2 = \mathrm{sigm}(z_2)$$


$$z_3 = b\Big(h_1 + h_2 - \frac{1}{2}\Big), \quad b : \text{large}$$

$$h_3 = \mathrm{sigm}(z_3)$$


This neural network achieves the separation
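This can be checked numerically; a sketch with the specific values a = b = 10 (any sufficiently large values behave the same):

```python
import math

def sigm(t):
    return 1.0 / (1.0 + math.exp(-t))

def xor_net(x1, x2, a=10.0, b=10.0):
    h1 = sigm(a * (x1 - x2 - 0.5))    # detects x1 = 1, x2 = 0
    h2 = sigm(a * (-x1 + x2 - 0.5))   # detects x1 = 0, x2 = 1
    return sigm(b * (h1 + h2 - 0.5))  # fires when h1 or h2 fires

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xor_net(x1, x2), 3))
# prints ~1 exactly when x1 != x2, and ~0 otherwise
```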


4. Universal approximation

4.1 Further construction

Further construction

The NN constructed above has values


Can also construct another NN


The region where $h_1 \approx h_2 \approx h_3 \approx h_4 \approx 0$ is

The neural network


One can easily find a hyperplane in $\mathbb{R}^4$ that separates $(0, 0, 0, 0)$ from the rest; this hyperplane defines $h_5$, which defines a function with value $\approx 0$ in the center and $\approx 1$ in the rest


Continuing this way, one can construct any approximate bump function as an output of a neural network with one hidden layer

Combining these bump functions, one can approximate any continuous function

Namely, a neural network with one hidden layer can do any task, at least in principle
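A one-dimensional sketch of this idea (illustrative only; the target function sin(2πx), the number of bumps, and the slope are all arbitrary choices): a difference of two steep sigmoids is an approximate bump, and a weighted sum of such bumps, which is exactly a sum of the form $\sum_i c_i \, \mathrm{sigm}(b_i + w_i x)$, approximates a continuous function.

```python
import numpy as np

def sigm(t):
    # clipped to avoid overflow warnings for large |t|
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60.0, 60.0)))

def bump(x, left, right, a=1000.0):
    # approximate indicator of [left, right]: difference of two steep sigmoids
    return sigm(a * (x - left)) - sigm(a * (x - right))

x = np.linspace(0.0, 1.0, 1001)
n = 50                               # number of bumps
edges = np.linspace(0.0, 1.0, n + 1)
f_hat = sum(np.sin(2 * np.pi * (l + r) / 2) * bump(x, l, r)
            for l, r in zip(edges[:-1], edges[1:]))

err = np.max(np.abs(f_hat - np.sin(2 * np.pi * x)))
print(err)  # small, and shrinks as n grows (with a steep enough)
```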


4.2 Universal approximation theorem

This heuristic argument can be made rigorous using a Stone-Weierstrass-type argument to get the Cybenko-Hornik-Funahashi Theorem

Cybenko-Hornik-Funahashi Theorem

Let $\Sigma = [0, 1]^d$: the $d$-dimensional hypercube. Then sums of the form

$$f(x) = \sum_i c_i \, \mathrm{sigm}\Big(b_i + \sum_{j=1}^d w_{ij} x_j\Big)$$

can approximate any continuous function on $\Sigma$ to any degree of accuracy


There are many similar results to this effect


4.3 Deep vs Shallow learning

This theorem says that, at least in principle, one can do any classification with a neural network with one hidden layer

Deep learning utilizes neural networks with many hidden layers, typically up to 40 or more layers.

Question:

If the Universal Approximation Theorem says one can do the job with only one hidden layer, why does one use so many hidden layers? What is the advantage in doing so? This is one big question we would like to address in the rest of this lecture series.


To achieve high accuracy, the number of terms has to be huge, and the training (learning) is a big problem: a typical problem of shallow networks (shallow learning)

In contrast, a deep NN arranges neurons in depth for more efficiency and better training, but training is a very subtle issue [which will be dealt with later in this lecture series]