Rapid Introduction to Machine Learning/Deep Learning
Hyeong In Choi
Seoul National University
Lecture 1b: Logistic regression & neural network
October 2, 2015
Table of contents
1. Bird’s-eye view of Lecture 1b
   1.1 Objectives
   1.2 Quick Summary
2. GLM: Generalized linear model
   2.1 Exponential family of distributions
   2.2 Generalized linear model (GLM)
   2.3 Parameter estimation
3. XOR problem and neural network with hidden layer
4. Universal approximation
   4.1 Further construction
   4.2 Universal approximation theorem
   4.3 Deep vs Shallow learning
1. Bird’s-eye view of Lecture 1b
1.1 Objectives
Objective 1
Understand logistic regression (binary classification) and its multiclass generalization (softmax regression)
Objective 2
Recast logistic and softmax regression in a neural network (perceptron) formalism
Objective 3
Learn the limitations of the perceptron by looking at the XOR problem
Learn how to fix it by adding a hidden layer
Objective 4
Introduce the Universal Approximation Theorem
Learn about the clash of "Deep vs Shallow" paradigms in machine learning
1.2 Quick Summary
Logistic regression
Data D = {(x^(t), y^(t))}_{t=1}^N
  input x^(t) ∈ R^d
  label y^(t) ∈ {0, 1}
Given x = (x_1, ..., x_d) ∈ R^d, logistic regression outputs the probability of the output label y being equal to 1 by

  P[y = 1 | x] = sigm(b + w_1 x_1 + ... + w_d x_d),

where

  sigm(t) = e^t / (1 + e^t).
Thus

  P[y = 1 | x] = e^{b + ∑_j w_j x_j} / (1 + e^{b + ∑_j w_j x_j})

  P[y = 0 | x] = 1 / (1 + e^{b + ∑_j w_j x_j})

Decision
Given x, decide the output label is y, where
  y = 1 if b + ∑_j w_j x_j ≥ 0
  y = 0 if b + ∑_j w_j x_j < 0

[Thus the decision boundary is the hyperplane b + ∑_j w_j x_j = 0 in R^d.]
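To make the rule above concrete, here is a minimal sketch (not part of the original slides) of the probability and decision computation, with made-up weights w and bias b:

```python
import numpy as np

def sigm(t):
    # logistic sigmoid: e^t / (1 + e^t)
    return 1.0 / (1.0 + np.exp(-t))

def logistic_predict(x, w, b):
    """Return P[y = 1 | x] and the decided label (1 iff b + w.x >= 0)."""
    z = b + np.dot(w, x)
    p1 = sigm(z)
    y = 1 if z >= 0 else 0
    return p1, y

# hypothetical parameters and input, just to exercise the rule
w = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 0.3])
print(logistic_predict(x, w, b))   # p1 > 0.5 exactly when the decision is y = 1
```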
Neural network formulation
Figure: Neural Network
z: input to the output neuron
  z = b + w_1 x_1 + ... + w_d x_d
h: output of the output neuron
  h = sigm(z) = sigm(b + w_1 x_1 + ... + w_d x_d)
Symmetric (redundant) form of logistic regression
The probabilities P[y = 1 | x] and P[y = 0 | x] have different forms in logistic regression.
We can put them in symmetric form by rewriting them in the following (redundant) form:
  P[y = 1 | x] = exp(b_1 + ∑_j w_{1j} x_j) / [ exp(b_1 + ∑_j w_{1j} x_j) + exp(b_2 + ∑_j w_{2j} x_j) ]

  P[y = 0 | x] = exp(b_2 + ∑_j w_{2j} x_j) / [ exp(b_1 + ∑_j w_{1j} x_j) + exp(b_2 + ∑_j w_{2j} x_j) ]
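A small numeric check (a sketch with arbitrary parameters) that the redundant form above collapses back to the earlier sigmoid form once we set b = b_1 − b_2 and w_j = w_{1j} − w_{2j}:

```python
import numpy as np

def sigm(t):
    return 1.0 / (1.0 + np.exp(-t))

# arbitrary parameters for the redundant two-score form
b1, b2 = 0.3, -0.8
w1 = np.array([1.0, -2.0, 0.5])
w2 = np.array([-0.4, 0.7, 1.1])
x = np.array([0.2, 1.5, -0.3])

s1, s2 = b1 + w1 @ x, b2 + w2 @ x
p1_symmetric = np.exp(s1) / (np.exp(s1) + np.exp(s2))

# collapsing the redundancy: only the difference of the two scores matters
p1_sigmoid = sigm((b1 - b2) + (w1 - w2) @ x)

print(np.isclose(p1_symmetric, p1_sigmoid))   # True
```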
Decision
Given x, decide the output label is y, where
  y = 1 if b_1 + ∑_j w_{1j} x_j ≥ b_2 + ∑_j w_{2j} x_j
  y = 0 if b_1 + ∑_j w_{1j} x_j < b_2 + ∑_j w_{2j} x_j

The decision boundary is the hyperplane
  b_1 + ∑_j w_{1j} x_j = b_2 + ∑_j w_{2j} x_j
in R^d.
Neural network formulation
Figure: Neural Network
z_i: input to the ith neuron in the output layer
  z_i = b_i + ∑_j w_{ij} x_j,  i = 1, 2
h_i: output of the ith neuron in the output layer
  h_i = e^{z_i} / (e^{z_1} + e^{z_2}),  i = 1, 2
Softmax regression: multiclass classification
There are K output labels, i.e., y ∈ {1, ..., K}.
Probability
  P[y = i | x] = exp(b_i + ∑_j w_{ij} x_j) / [ exp(b_1 + ∑_j w_{1j} x_j) + ... + exp(b_K + ∑_j w_{Kj} x_j) ]
for i = 1, ..., K.
Decision
Given x, decide the output label is y, where
  y = argmax_i P[y = i | x]
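A minimal sketch of the softmax probability and the argmax decision, with a hypothetical 3-class W and b (labels are 0-indexed in the code; subtracting the maximum of z is only for numerical stability and does not change the probabilities):

```python
import numpy as np

def softmax_predict(x, W, b):
    """W: K x d weight matrix, b: length-K bias. Returns (probabilities, predicted label)."""
    z = b + W @ x                      # z_i = b_i + sum_j w_ij x_j
    z = z - z.max()                    # stabilize; softmax is invariant to a common shift
    p = np.exp(z) / np.exp(z).sum()    # P[y = i | x]
    return p, int(np.argmax(p))        # decision: y = argmax_i P[y = i | x]

# hypothetical 3-class problem in R^2
W = np.array([[ 1.0, -0.5],
              [-0.2,  0.8],
              [ 0.3,  0.3]])
b = np.array([0.1, 0.0, -0.1])
p, y = softmax_predict(np.array([0.5, 1.2]), W, b)
print(p, p.sum(), y)   # probabilities sum to 1
```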
Decision boundary
Figure: example of decision boundary
Decision regions are partitioned by hyperplanes in R^d.
Neural network formalism
Figure: Neural network
z_i: input to the ith neuron in the output layer
  z_i = b_i + ∑_j w_{ij} x_j,  i = 1, ..., K
h_i: output of the ith neuron in the output layer
  h_i = e^{z_i} / ∑_k e^{z_k} = P[y = i | x],  i = 1, ..., K
In vector notation, we write
  (h_1, ..., h_K) = softmax(z_1, ..., z_K),
or h = softmax(z).
XOR problem
Given a data set D consisting of 4 points in R^2 in 2 classes as shown in the following:
Figure: XOR
Note that there is no line that separates these two classes.
But if we add one more (hidden) layer to the neural network, then this network can separate the two classes.
Figure: hidden layer
Cybenko-Hornik-Funahashi Theorem
Let Σ = [0, 1]^d be the d-dimensional hypercube. Then sums of the form
  f(x) = ∑_i c_i sigm(b_i + ∑_{j=1}^d w_{ij} x_j)
can approximate any continuous function on Σ to any degree of accuracy.
Universal Approximation
This theorem implies that a neural network with one hidden layer is good enough to do any classification job with small error, at least in principle.
In fact, Lecture 2 should be viewed in this spirit.
Then, why deep learning?
2. GLM: Generalized linear model
2.1 Exponential family of distributions
Exponential family of distributions
An exponential family of distributions in canonical form is a probability distribution of the form

  P_θ(y) = (1 / Z(θ)) h(y) exp( ∑_i θ_i T_i(y) ),

where

  y = (y_1, ..., y_K) ∈ R^K,
  θ = (θ_1, ..., θ_m) ∈ R^m,
  T : R^K → R^m
Rewrite it in the form

  P_θ(y) = exp[ ∑_i θ_i T_i(y) − A(θ) + C(y) ],

where

  A(θ) = log Z(θ): log partition (cumulant) function
  C(y) = log h(y)

[Remark: Here, we assume the dispersion parameter is 1.]
Bernoulli distribution
Random variable Y with value y ∈ {0, 1}. Let
  p = P[y = 1]
Then
  P(y) = p^y (1 − p)^{1−y} = exp[ y log(p / (1 − p)) + log(1 − p) ]
In exponential family form:
  T(y) = y
  θ = log(p / (1 − p)) = logit(p)
  p = sigm(θ) = e^θ / (1 + e^θ)
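A small sketch (with an arbitrary p) verifying that logit and sigm are inverses and that the exponential-family rewriting reproduces p^y (1 − p)^{1−y}:

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def sigm(t):
    return 1.0 / (1.0 + np.exp(-t))

p = 0.3                                   # arbitrary Bernoulli parameter
theta = logit(p)                          # natural parameter
print(np.isclose(sigm(theta), p))         # sigm inverts logit

for y in (0, 1):
    direct = p**y * (1 - p)**(1 - y)
    expfam = np.exp(y * theta + np.log(1 - p))   # exp[ y log(p/(1-p)) + log(1-p) ]
    print(np.isclose(direct, expfam))            # True for both values of y
```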
Multivariate Bernoulli (Multinoulli) distribution
Random variable Y with value y ∈ {1, ..., K}. Let
  p_i = P[y = i]
Define y_i = I(y = i) ∈ {0, 1}. Thus y_1 + ... + y_K = 1, and we have
  P(y) = p_1^{y_1} ··· p_K^{y_K} = p_1^{y_1} ··· p_{K−1}^{y_{K−1}} p_K^{1 − ∑_{i=1}^{K−1} y_i}
       = exp[ ∑_{i=1}^{K−1} y_i log(p_i / p_K) + log p_K ]

[Note: when K = 2, this is exactly the Bernoulli distribution.]
In the exponential family form:
  T_i(y) = y_i
  θ_i = log(p_i / p_K),  i = 1, ..., K − 1
Solving for p_i, we get the generalized sigmoid (softmax) function
  p_i = e^{θ_i} / (1 + ∑_{k=1}^{K−1} e^{θ_k}) = P[y = i],  i = 1, ..., K − 1
  p_K = 1 / (1 + ∑_{k=1}^{K−1} e^{θ_k}) = P[y = K]
The generalized logit function
  θ_i = log( p_i / (1 − ∑_{k=1}^{K−1} p_k) ),  i = 1, ..., K − 1
The above expressions show how p_1, ..., p_{K−1} and θ_1, ..., θ_{K−1} are related;
p_K is obtained by setting p_K = 1 − (p_1 + ... + p_{K−1}).
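A sketch (with arbitrary θ values, K = 4) of the round trip between θ_1, ..., θ_{K−1} and p_1, ..., p_K described above:

```python
import numpy as np

theta = np.array([0.5, -1.0, 0.2])          # K - 1 = 3 natural parameters (K = 4)
denom = 1.0 + np.exp(theta).sum()
p = np.exp(theta) / denom                   # generalized sigmoid: p_1, ..., p_{K-1}
pK = 1.0 / denom                            # p_K
print(np.isclose(p.sum() + pK, 1.0))        # probabilities sum to 1

# generalized logit recovers theta: theta_i = log( p_i / (1 - sum_k p_k) ) = log(p_i / p_K)
theta_back = np.log(p / (1.0 - p.sum()))
print(np.allclose(theta_back, theta))       # True
```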
2.2 Generalized linear model (GLM)
GLM
The GLM mechanism is a way to relate the input vector x = (x_1, ..., x_d) to the parameters θ_i of the exponential family by setting
  θ_i = b_i + ∑_{j=1}^d w_{ij} x_j,
where b_i and w_{ij} are the GLM parameters to be determined by the data.
Thus we get
  p_i = exp(b_i + ∑_{j=1}^d w_{ij} x_j) / [ 1 + ∑_{k=1}^{K−1} exp(b_k + ∑_{j=1}^d w_{kj} x_j) ],
i.e., p_i = P[y = i | x] for i = 1, ..., K − 1, and

  p_K = P[y = K | x] = 1 / [ 1 + ∑_{k=1}^{K−1} exp(b_k + ∑_{j=1}^d w_{kj} x_j) ]
Note: when K = 2, this is exactly logistic regression:

  P[y = 1 | x] = p_1 = exp(b + ∑_j w_j x_j) / (1 + exp(b + ∑_j w_j x_j))

  P[y = 0 | x] = p_2 = 1 / (1 + exp(b + ∑_j w_j x_j)).

Here, we set b = b_1, w_j = w_{1j}.
Symmetric (redundant) form
The expression for p_K is different from those for p_i.
To put p_1, ..., p_K in symmetric form, multiply
  exp(a + ∑_{j=1}^d α_j x_j)
on the numerator and the denominator of p_i and p_K. Then
  p_i = exp(a + b_i + ∑_{j=1}^d (w_{ij} + α_j) x_j) / [ exp(a + ∑_{j=1}^d α_j x_j) + ∑_{k=1}^{K−1} exp(a + b_k + ∑_{j=1}^d (w_{kj} + α_j) x_j) ],
  i = 1, ..., K − 1
and
  p_K = exp(a + ∑_{j=1}^d α_j x_j) / [ exp(a + ∑_{j=1}^d α_j x_j) + ∑_{k=1}^{K−1} exp(a + b_k + ∑_{j=1}^d (w_{kj} + α_j) x_j) ]
Set
  b_i ← b_i + a
  w_{ij} ← w_{ij} + α_j,  j = 1, ..., d,
for i = 1, ..., K − 1, and set
  b_K = a
  w_{Kj} = α_j,  j = 1, ..., d
Then we have
  p_i = exp(b_i + ∑_{j=1}^d w_{ij} x_j) / ∑_{k=1}^K exp(b_k + ∑_{j=1}^d w_{kj} x_j) = P[y = i | x],
  i = 1, ..., K.
In vector notation
  p = (p_1, ..., p_K) = softmax(z_1, ..., z_K) = softmax(z),
where
  z_i = b_i + ∑_{j=1}^d w_{ij} x_j,  i = 1, ..., K
Neural network formalism
Figure: Neural network
z_i: input to the ith neuron in the output layer
  z_i = b_i + ∑_j w_{ij} x_j,  i = 1, ..., K
h_i: output of the ith neuron in the output layer
  h_i = e^{z_i} / ∑_k e^{z_k} = P[y = i | x],  i = 1, ..., K
In vector notation, we write
  (h_1, ..., h_K) = softmax(z_1, ..., z_K),
or h = softmax(z).
2.3 Parameter estimation
Determining W and b
So far the parameters, the K × 1 vector b = [b_1, ..., b_K]^T and the K × d matrix W = [w_{ij}], have been regarded as given.
But we need to determine b and W using the given data.
Use MLE (maximum likelihood estimation).

MLE
Data D = {(x^(t), y^(t))}_{t=1}^N
Probability
  P(y | x) = p_1^{y_1} ··· p_K^{y_K},
where
  p_i = exp(b_i + ∑_{j=1}^d w_{ij} x_j) / ∑_{k=1}^K exp(b_k + ∑_{j=1}^d w_{kj} x_j)
Likelihood function
  L(W, b) = ∏_{t=1}^N P[y^(t) | x^(t)]
Log likelihood function
  ℓ(W, b) = log L(W, b) = ∑_{t=1}^N log P[y^(t) | x^(t)]
Recall
  P(y | x) = p_1^{y_1} ··· p_K^{y_K}
Thus
  log P[y | x] = y_1 log p_1 + ... + y_K log p_K
              = ∑_{k=1}^K I(y = k) log p_k
              = ∑_{k=1}^K I(y = k) log P[y = k | x]
              = ∑_{k=1}^K I(y = k) log( e^{z_k} / ∑_{i=1}^K e^{z_i} ),
where z_i = b_i + ∑_{j=1}^d w_{ij} x_j.
Rewrite the log likelihood function:
  ℓ(W, b) = ∑_{t=1}^N log P[y^(t) | x^(t)]
          = ∑_{t=1}^N ∑_{k=1}^K I(y^(t) = k) log P[y^(t) = k | x^(t)]
          = ∑_{t=1}^N ∑_{k=1}^K I(y^(t) = k) log( e^{z_k^(t)} / ∑_{i=1}^K e^{z_i^(t)} ),
where z_i^(t) = b_i + ∑_{j=1}^d w_{ij} x_j^(t).
MLE is to find W and b that maximize ℓ(W, b).
[Note: for softmax regression it turns out that ℓ(W, b) is a concave (for generic data sets, strictly concave) function of W and b.]
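The following sketch (with made-up data and 0-indexed labels) evaluates ℓ(W, b) exactly as in the last display: for each example, the indicator picks out log P[y^(t) = k | x^(t)] for the observed class.

```python
import numpy as np

def log_softmax(z):
    # log of e^{z_k} / sum_i e^{z_i}, computed stably
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def log_likelihood(W, b, X, y):
    """X: N x d inputs, y: length-N labels in {0, ..., K-1}. Returns ell(W, b)."""
    total = 0.0
    for x_t, y_t in zip(X, y):
        z = b + W @ x_t                  # z_i^(t) = b_i + sum_j w_ij x_j^(t)
        total += log_softmax(z)[y_t]     # the indicator I(y^(t) = k) picks out one term
    return total

# made-up data: N = 5 points in R^2, K = 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.integers(0, 3, size=5)
W = rng.normal(size=(3, 2))
b = np.zeros(3)
print(log_likelihood(W, b, X, y))   # MLE maximizes this (or minimizes its negative)
```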
Neural network formalism
Recall
Figure: Neural network
For each input x^(t):
z_i^(t): input to the ith neuron in the output layer
  z_i^(t) = b_i + ∑_j w_{ij} x_j^(t),  i = 1, ..., K
h_i^(t): output of the ith neuron in the output layer
  h_i^(t) = e^{z_i^(t)} / ∑_k e^{z_k^(t)},  i = 1, ..., K
For neural networks, the error function is set to be −ℓ(W, b), and training is to minimize this error.
[Note: This neural network training is exactly the same as the MLE estimation in softmax regression.]
Training (learning) of a neural network in the case of a single-layer (no hidden layer) network:
Training (learning) is a convex optimization problem, so it is a relatively easy problem.
Three kinds of training (learning) strategies (a sketch of mini-batch training follows below):
  Full-batch learning: train using all data in D at once
  Mini-batch learning: train using a small portion of D successively, and cycle through them
  On-line learning: train using one data point at a time and cycle through them
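The sketch below (made-up data, arbitrary learning rate and batch size) illustrates the mini-batch strategy for softmax regression; it uses the standard gradient of −ℓ(W, b) with respect to z, namely h − onehot(y), which is not derived on these slides. Setting batch = N gives full-batch learning and batch = 1 gives on-line learning.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_minibatch(X, y, K, lr=0.1, batch=16, epochs=50, seed=0):
    """Minimize -ell(W, b) by cycling through small portions of the data."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W, b = np.zeros((K, d)), np.zeros(K)
    Y = np.eye(K)[y]                                   # one-hot labels
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch):
            idx = order[start:start + batch]
            H = softmax(X[idx] @ W.T + b)              # h^(t) for the mini-batch
            G = H - Y[idx]                             # gradient of -log-likelihood w.r.t. z
            W -= lr * G.T @ X[idx] / len(idx)
            b -= lr * G.mean(axis=0)
    return W, b

# made-up 3-class data in R^2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)
W, b = train_minibatch(X, y, K=3)
print((np.argmax(X @ W.T + b, axis=1) == y).mean())   # training accuracy, near 1.0
```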
3. XOR problem and neural network with hidden layer
XOR Problem
Separate X’s from O’s
XOR(x_1, x_2) = x_1 x̄_2 + x̄_1 x_2

x_1 x̄_2:
  z_1 = a(x_1 − x_2 − 1/2),  a: large
  h_1 = sigm(z_1)
x̄_1 x_2:
  z_2 = a(−x_1 + x_2 − 1/2),  a: large
  h_2 = sigm(z_2)
  z_3 = b(h_1 + h_2 − 1/2),  b: large
  h_3 = sigm(z_3)
This neural network achieves the separation
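A sketch of the hand-built network above with a = b = 10 (any sufficiently large values work); h_3 comes out close to 1 exactly on the two XOR points:

```python
import numpy as np

def sigm(t):
    return 1.0 / (1.0 + np.exp(-t))

def xor_net(x1, x2, a=10.0, c=10.0):
    # hidden layer: one unit per term of the construction above (c plays the role of b)
    h1 = sigm(a * ( x1 - x2 - 0.5))   # approx x1 AND (NOT x2)
    h2 = sigm(a * (-x1 + x2 - 0.5))   # approx (NOT x1) AND x2
    # output layer combines the two hidden units
    h3 = sigm(c * (h1 + h2 - 0.5))
    return h3

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(xor_net(x1, x2), 3))   # near 1 for (0,1) and (1,0), near 0 otherwise
```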
4. Universal approximation
4.1 Further construction
Further construction
The NN constructed above has values
Can also construct another NN
The region where h_1 ≈ h_2 ≈ h_3 ≈ h_4 ≈ 0 is shown in the figure, together with the corresponding neural network.
One can easily find a hyperplane in R^4 that separates (0, 0, 0, 0) from the rest; this hyperplane defines h_5, which defines a function with value ≈ 0 in the center and ≈ 1 in the rest.
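A sketch of this step (with the unit square as the bump's support and arbitrary slopes, not the slides' exact numbers): four hidden units are each ≈ 0 inside the square, and a fifth unit separating (0, 0, 0, 0) from the rest yields an approximate bump:

```python
import numpy as np

def sigm(t):
    return 1.0 / (1.0 + np.exp(-t))

def bump(x1, x2, a=20.0, c=20.0):
    # four hidden units, each ~0 inside the unit square and ~1 beyond one of its sides
    h = np.array([sigm(a * (x1 - 1.0)),    # beyond the right side
                  sigm(a * (-x1)),         # beyond the left side
                  sigm(a * (x2 - 1.0)),    # beyond the top side
                  sigm(a * (-x2))])        # beyond the bottom side
    # a hyperplane in R^4 separating (0, 0, 0, 0) from the rest
    h5 = sigm(c * (h.sum() - 0.5))         # ~0 at the center, ~1 elsewhere
    return 1.0 - h5                        # approximate bump: ~1 inside the square, ~0 outside

for pt in [(0.5, 0.5), (1.5, 0.5), (-0.3, 2.0)]:
    print(pt, round(bump(*pt), 3))   # close to 1 only for the point inside [0, 1]^2
```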
Continuing this way, one can construct any approximate bump function as the output of a neural network with one hidden layer.
Combining these bump functions, one can approximate any continuous function.
Namely, a neural network with one hidden layer can do any task, at least in principle.
4.2 Universal approximation theorem
Universal approximation theorem
This heuristic argument can be made rigorous using a Stone-Weierstrass-type argument, yielding the Cybenko-Hornik-Funahashi Theorem.
Cybenko-Hornik-Funahashi Theorem
Let Σ = [0, 1]^d be the d-dimensional hypercube. Then sums of the form
  f(x) = ∑_i c_i sigm(b_i + ∑_{j=1}^d w_{ij} x_j)
can approximate any continuous function on Σ to any degree of accuracy.
Universal approximation theorem
There are many similar results to this effect
4.3 Deep vs Shallow learning
Deep vs Shallow learning
This theorem says that, at least in principle, one can do any classification with a neural network with one hidden layer.
Deep learning utilizes neural networks with many hidden layers, typically up to 40 or more layers.
Question:
If the Universal Approximation Theorem says one can do the job with only one hidden layer, why does one use so many hidden layers? What is the advantage in doing so? This is one big question we would like to address in the rest of this lecture series.
To achieve high accuracy, the number of terms has to be huge, and the training (learning) becomes a big problem: this is the typical problem of shallow networks (shallow learning).
In contrast, a deep NN arranges neurons in depth for more efficiency and better training; but training deep networks is a very subtle issue [which will be dealt with later in this lecture series].