Learning by neural networks
Learning is generally defined as adapting the weights of a neural network in order to implement a specific mapping, based on a training set. In this section we unfold the statistical theory of learning and give a precise meaning to the somewhat naive definition of learning given above. At first we view learning as an approximation task; then we pursue a statistical investigation into the problem.
Learning as an approximation
Given a function $F(x) \in L_2$ and a network architecture

$$\mathrm{Net}(x, w) = \sum_{i=1}^{N_{L-1}} w_i^{(L)} \varphi\left( \sum_j w_{ij}^{(L-1)} \varphi\left( \cdots \varphi\left( \sum_n w_{jn}^{(1)} x_n + w_{j0}^{(1)} \right) \cdots \right) \right),$$

find

$$w_{\mathrm{opt}} := \arg\min_w \left\| F(x) - \mathrm{Net}(x, w) \right\|_{L_2}^2 .$$
This problem is analogous to the problem of finding the optimal Fourier coefficients of a function $F(x)$. There are two approaches to solving this problem, depending on the construction of the network.
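A small numerical sketch may make this setup concrete. The Python fragment below is illustrative only: the layer sizes, the tanh nonlinearity, the interval $X = [0, 1]$, and the target $F(x) = \sin x$ are assumptions made for the example, not part of the text. It evaluates a multilayer $\mathrm{Net}(x, w)$ and measures the squared $L_2$ distance to $F$ by numerical integration:

import numpy as np

rng = np.random.default_rng(0)

def net(x, weights):
    # Forward pass of a feedforward network: each hidden layer computes
    # phi(W a + b); the output layer is linear, matching Net(x, w).
    a = x
    for W, b in weights[:-1]:
        a = np.tanh(W @ a + b)
    W, b = weights[-1]
    return (W @ a + b).item()

def make_weights(sizes):
    # Random initial weights for the given layer sizes, e.g. [1, 8, 1].
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

F = np.sin                      # an arbitrary target function F(x)
w = make_weights([1, 8, 1])

# Squared L2 approximation error ||F - Net(., w)||^2 on X = [0, 1],
# approximated by a Riemann sum.
xs = np.linspace(0.0, 1.0, 1001)
err = np.trapz([(F(x) - net(np.array([x]), w)) ** 2 for x in xs], xs)
print(err)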
Learning by one-layered network
In this case one can draw from the results of Fourier analysis, namely any
Learning from examples:
Instead of the function $F(x)$ to be approximated, we are given a training set of size $K$,

$$\tau_K := \left\{ (x_k, d_k) : k = 1, \ldots, K \right\},$$

where $x_k \in X$ and $d_k = F(x_k)$ is the so-called desired output. One can then define an empirical error over the training set, e.g.

$$J(w) = \frac{1}{K} \sum_{k=1}^{K} \left( d_k - \mathrm{Net}(x_k, w) \right)^2 .$$
The weight optimization or learning then reduces to finding

$$w_{\mathrm{opt}}^{(K)} := \arg\min_w \frac{1}{K} \sum_{k=1}^{K} \left( d_k - \mathrm{Net}(x_k, w) \right)^2 .$$
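As a hedged illustration of this minimization (a sketch only: the quadratic model class, the target $F$, the learning rate, and plain gradient descent are assumptions made for brevity), one can search for $w_{\mathrm{opt}}^{(K)}$ numerically:

import numpy as np

rng = np.random.default_rng(1)

# Training set of size K: x_k drawn from X = [0, 1], d_k = F(x_k).
F = lambda x: np.sin(2 * np.pi * x)
K = 50
x = rng.uniform(0.0, 1.0, K)
d = F(x)

# A deliberately simple model Net(x, w) = w0 + w1*x + w2*x^2.
def net(x, w):
    return w[0] + w[1] * x + w[2] * x ** 2

def J(w):
    # Empirical quadratic error over the training set.
    return np.mean((d - net(x, w)) ** 2)

# Plain gradient descent on J(w), using the analytic gradient
# of the quadratic loss with respect to (w0, w1, w2).
w = np.zeros(3)
lr = 0.3
for _ in range(5000):
    r = d - net(x, w)                 # residuals d_k - Net(x_k, w)
    grad = -2 * np.array([np.mean(r),
                          np.mean(r * x),
                          np.mean(r * x ** 2)])
    w -= lr * grad

print("w_opt^(K) ~", w, " J =", J(w))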
If the sequence $x_k \in X$ is drawn randomly from $X$ subject to a uniform distribution, then

$$\frac{1}{K} \sum_{k=1}^{K} \left( d_k - \mathrm{Net}(x_k, w) \right)^2 \xrightarrow{K \to \infty} E\left[ \left( d - \mathrm{Net}(x, w) \right)^2 \right] = \int \left( d(x) - \mathrm{Net}(x, w) \right)^2 p(x)\, dx .$$

The function

$$J(w) = \frac{1}{K} \sum_{k=1}^{K} l\left( d_k, \mathrm{Net}(x_k, w) \right)$$

is called the empirical error function.
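A quick simulation illustrates this convergence of the empirical error to the integral as $K$ grows (a sketch under assumed choices: the target $F$, a fixed candidate $\mathrm{Net}$, and $X = [0, 1]$ are not from the text):

import numpy as np

rng = np.random.default_rng(2)

F = lambda x: np.sin(2 * np.pi * x)     # assumed target function
net = lambda x: 0.5 - x                 # some fixed Net(., w)

# R(w): the expected squared error under the uniform density on [0, 1],
# computed here by a fine Riemann sum.
xs = np.linspace(0.0, 1.0, 100001)
R = np.trapz((F(xs) - net(xs)) ** 2, xs)

for K in (10, 100, 1000, 100000):
    x = rng.uniform(0.0, 1.0, K)
    J = np.mean((F(x) - net(x)) ** 2)   # empirical error over K samples
    print(f"K={K:6d}  J(w)={J:.4f}  R(w)={R:.4f}")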
There are two fundamental questions which arise with regard to $w_{\mathrm{opt}}^{(K)}$:

- What is the connection between $\min_{w \in W} \frac{1}{K} \sum_{k=1}^{K} \left( d_k - \mathrm{Net}(x_k, w) \right)^2$ and $\min_w \int \left( d(x) - \mathrm{Net}(x, w) \right)^2 p(x)\, dx$?
- If this relationship is specified, how to minimize $J(w)$ in order to find $w_{\mathrm{opt}}^{(K)}$?
As one can see, the first question is a statistical one, whereas the second is purely an optimization task. Thus, we treat learning in two subsections:
- statistical learning theory
- learning as an optimization problem
Statistical learning theory
Let us assume that the samples $x_k$ in the learning set are drawn from the input space $X$ independently and subject to a uniform distribution. It is also assumed that the desired response is a function of $x$ given as $d = F(x)$. Here $F$ can be a deterministic function or a stochastic one. In the latter case

$$d = g(x) + \nu,$$

where $\nu$ is noise.
First, without dealing with the training set, let us analyze the expression $E\left[ (d - \mathrm{Net}(x, w))^2 \right]$ when $d = F(x)$ is a deterministic function:

$$E\left[ \left( d - \mathrm{Net}(x, w) \right)^2 \right] = \int_X \left( d - \mathrm{Net}(x, w) \right)^2 p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n .$$

Since $p(x_1, \ldots, x_n) = \frac{1}{|X|}$ due to the uniformity,

$$E\left[ \left( d - \mathrm{Net}(x, w) \right)^2 \right] = \frac{1}{|X|} \int_X \left( F(x) - \mathrm{Net}(x, w) \right)^2 dx = \frac{1}{|X|} \left\| F(x) - \mathrm{Net}(x, w) \right\|_{L_2}^2 .$$
Thus, finding $w_{\mathrm{opt}} := \arg\min_w E\left[ (d - \mathrm{Net}(x, w))^2 \right]$ will minimize the approximation error in the corresponding $L_2$ space. Allowing stochastic mappings between $d$ and $x$ (e.g. $d = g(x) + \nu$), finding $w_{\mathrm{opt}} := \arg\min_w E\left[ (d - \mathrm{Net}(x, w))^2 \right]$ amounts to a nonlinear regression problem. In order to solve this, let

$$E[d \mid x] := \int d\, p(d \mid x)\, dd .$$

It is useful to recall that

$$E\left[ E[d \mid x] \right] = \int_X \left( \int d\, p(d \mid x)\, dd \right) p(x)\, dx = \int_X \int d\, p(d, x)\, dd\, dx = E[d] .$$
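This tower rule $E[E[d \mid x]] = E[d]$ can be checked numerically. In the short sketch below, the model is an assumption for illustration: $x$ uniform on $[0, 1]$ and $d = g(x) + \nu$ with zero-mean Gaussian noise, so that $E[d \mid x] = g(x)$:

import numpy as np

rng = np.random.default_rng(3)

g = lambda x: x ** 2                  # assumed regression function
n = 1_000_000
x = rng.uniform(0.0, 1.0, n)
d = g(x) + rng.normal(0.0, 0.3, n)    # d = g(x) + noise

# For additive zero-mean noise, E[d | x] = g(x); the tower rule says
# averaging the conditional mean reproduces the overall mean E[d].
print(np.mean(g(x)))  # E[E[d | x]]
print(np.mean(d))     # E[d]  -- agrees up to Monte Carlo error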
Now (xxx) can be rewritten as
$$E\left[ \left( d - \mathrm{Net}(x, w) \right)^2 \right] = E\left[ \left( d - E[d \mid x] \right)^2 \right] + 2 E\left[ \left( d - E[d \mid x] \right)\left( E[d \mid x] - \mathrm{Net}(x, w) \right) \right] + E\left[ \left( E[d \mid x] - \mathrm{Net}(x, w) \right)^2 \right] .$$

In the cross term one can use the fact that, conditioned on $x$, the factor $d - E[d \mid x]$ has zero mean while $E[d \mid x] - \mathrm{Net}(x, w)$ is fixed. Therefore, by the rule $E[\cdot] = E[E[\cdot \mid x]]$ recalled above,

$$E\left[ \left( d - E[d \mid x] \right)\left( E[d \mid x] - \mathrm{Net}(x, w) \right) \right] = E\left[ E\left[ \left( d - E[d \mid x] \right)\left( E[d \mid x] - \mathrm{Net}(x, w) \right) \,\middle|\, x \right] \right] = 0 ,$$

hence

$$E\left[ \left( d - \mathrm{Net}(x, w) \right)^2 \right] = E\left[ \left( d - E[d \mid x] \right)^2 \right] + E\left[ \left( E[d \mid x] - \mathrm{Net}(x, w) \right)^2 \right] . \qquad (2)$$
As a result, when $E\left[ (d - \mathrm{Net}(x, w))^2 \right]$ is minimized, namely when $w_{\mathrm{opt}} := \arg\min_w E\left[ (d - \mathrm{Net}(x, w))^2 \right]$ is searched for, then $\mathrm{Net}(x, w_{\mathrm{opt}}) = E[d \mid x]$, making the second term in (2) zero. Nevertheless, instead of minimizing $E\left[ (d - \mathrm{Net}(x, w))^2 \right]$ we can only minimize the empirical quadratic error

$$J(w) = \frac{1}{K} \sum_{k=1}^{K} \left( d_k - \mathrm{Net}(x_k, w) \right)^2$$

over a training set of size $K$ given as

$$\tau_K := \left\{ (x_k, d_k) : k = 1, \ldots, K \right\} .$$
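Decomposition (2) can likewise be verified by simulation. In the sketch below (the same assumed noisy model as before; the candidate $\mathrm{Net}$ is an arbitrary linear function chosen for illustration), the two sides of (2) agree, and replacing $\mathrm{Net}$ by $E[d \mid x] = g(x)$ leaves only the noise term:

import numpy as np

rng = np.random.default_rng(4)

g = lambda x: x ** 2            # E[d | x] for this assumed model
net = lambda x: 0.8 * x         # an arbitrary candidate Net(., w)

n = 1_000_000
x = rng.uniform(0.0, 1.0, n)
d = g(x) + rng.normal(0.0, 0.3, n)

lhs = np.mean((d - net(x)) ** 2)          # E[(d - Net)^2]
noise = np.mean((d - g(x)) ** 2)          # E[(d - E[d|x])^2]
approx = np.mean((g(x) - net(x)) ** 2)    # E[(E[d|x] - Net)^2]
print(lhs, noise + approx)                # the two sides of (2) agree
print(noise)   # with Net = E[d|x], only the noise term remains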
Denoting $R(w) := E\left[ (d - \mathrm{Net}(x, w))^2 \right]$, we want to ascertain the relationship between $\min_w J(w)$ and $\min_w R(w)$, or between $R(w_{\mathrm{opt}})$ and $J(w_{\mathrm{opt}}^{(K)})$. Here the notation $w_{\mathrm{opt}}^{(K)}$ is used to indicate that $w_{\mathrm{opt}}^{(K)}$ is obtained by minimizing $J(w)$ over a finite training set of length $K$.

In other words, we wish to find out what size of the training set provides enough information to approximate $R(w_{\mathrm{opt}})$.
The bias-variance dilemma
Let us investigate the difference $E\left[ \left( d - \mathrm{Net}(x, w_{\mathrm{opt}}^{(K)}) \right)^2 \right]$, where $w_{\mathrm{opt}}^{(K)}$ is obtained by minimizing the empirical error $J(w)$. One can then write

$$E\left[ \left( d - \mathrm{Net}(x, w_{\mathrm{opt}}^{(K)}) \right)^2 \right] = E\left[ \left( d - \mathrm{Net}(x, w_{\mathrm{opt}}) + \mathrm{Net}(x, w_{\mathrm{opt}}) - \mathrm{Net}(x, w_{\mathrm{opt}}^{(K)}) \right)^2 \right]$$
$$= E\left[ \left( d - \mathrm{Net}(x, w_{\mathrm{opt}}) \right)^2 \right] + E\left[ \left( \mathrm{Net}(x, w_{\mathrm{opt}}) - \mathrm{Net}(x, w_{\mathrm{opt}}^{(K)}) \right)^2 \right] .$$
Remark: the other terms in the expression above become zero in a similar manner as was demonstrated in (xxx).
The first term in the expression above is the approximation error between $F(x)$ and $\mathrm{Net}(x, w)$, whereas the second one is the error resulting from the finite training set. Thus, one can choose between two options:
- either minimizing the first term (which is referred to as the bias) with a relatively large network, but in this case, with a limited-size training set, the weights cannot be trained correctly, leaving the second term large;
- or minimizing the second term (called the variance), which calls for a small network if the size of the training set is finite, but this leaves the first term large.

In conclusion, there is a dilemma between bias and variance. This gives rise to the question of how to set the size of the training set so as to strike a good balance between the bias and the variance.
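The dilemma can be made tangible with a small experiment. Everything below (polynomial model classes of growing degree, the target, the noise level, and the sizes) is an illustrative assumption: for a fixed training-set size $K$, repeated training runs show the small model dominated by the bias term and the large model by the variance term:

import numpy as np

rng = np.random.default_rng(5)

F = lambda x: np.sin(2 * np.pi * x)   # assumed target
K, runs = 15, 200
xs = np.linspace(0.0, 1.0, 500)       # evaluation grid

for degree in (1, 3, 9):              # small vs large model class
    preds = np.empty((runs, xs.size))
    for r in range(runs):
        x = rng.uniform(0.0, 1.0, K)  # fresh finite training set
        d = F(x) + rng.normal(0.0, 0.1, K)
        coeffs = np.polyfit(x, d, degree)   # least-squares fit
        preds[r] = np.polyval(coeffs, xs)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - F(xs)) ** 2)   # squared bias term
    var = np.mean(preds.var(axis=0))            # variance term
    print(f"degree={degree}  bias^2={bias2:.4f}  variance={var:.4f}")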
The optimal size of the training set: the VC dimension
From the ergodic hypothesis it follows that for all $w \in W$ and for all $\varepsilon > 0$, $\eta > 0$ there exists a $K_0(\varepsilon, \eta)$ for which

$$P\left( \left| R(w) - J^{(K)}(w) \right| > \varepsilon \right) < \eta \quad \text{if } K > K_0 .$$

Unfortunately this $K_0$ depends on $w$, thus we cannot ascertain a $K_0(\varepsilon, \eta)$ for which it is known that

$$P\left( \left| \min_w R(w) - \min_w J^{(K)}(w) \right| > \varepsilon \right) = P\left( \left| R(w_{\mathrm{opt}}) - J^{(K)}(w_{\mathrm{opt}}^{(K)}) \right| > \varepsilon \right) < \eta .$$
If such a $K_0(\varepsilon, \eta)$ existed, it would yield the necessary size of the training set. To obtain such a result, we have to introduce a more stringent notion of convergence, called uniform convergence, namely: for all $\varepsilon > 0$, $\eta > 0$ there exists a $K_0(\varepsilon, \eta)$ for which

$$P\left( \sup_{w \in W} \left| R(w) - J^{(K)}(w) \right| > \varepsilon \right) < \eta \quad \text{if } K > K_0 ,$$

which enforces that for every individual $w$

$$P\left( \left| R(w) - J^{(K)}(w) \right| > \varepsilon \right) < \eta \quad \text{if } K > K_0 .$$
If this uniform convergence holds, then the necessary size of the learning set can be estimated. Vapnik and Chervonenkis pioneered the work in revealing such bounds, and the basic parameter of these bounds is called the VC dimension in honour of their achievements.
The VC dimension
Let us assume that we are given a $\mathrm{Net}(x, w)$ which we use for binary classification. The VC dimension is related to the classification power of $\mathrm{Net}(x, w)$. More precisely, consider the set of dichotomies spanned by $\mathrm{Net}(x, w)$,

$$F := \left\{ \mathrm{Net}(\cdot, w) : w \in W,\ \mathrm{Net}(x, w) = 0 \text{ if } x \in X_0,\ \mathrm{Net}(x, w) = 1 \text{ if } x \in X_1 \right\},$$

where $X_0 \cup X_1 = X$ and $X_0 \cap X_1 = \emptyset$, and a set of input points

$$\mathcal{L} := \left\{ x_i : i = 1, \ldots, N \right\} .$$

If all $2^N$ possible dichotomies of $\mathcal{L}$ can be expressed by $\mathrm{Net}(x, w)$, then we say $\mathcal{L}$ is shattered by $F$. The VC dimension of $\mathrm{Net}(x, w)$ is defined as the size of the largest set $\mathcal{L}$ that is shattered by $F$.
E.g. let us consider the following elementary mapping, generated by a single neuron:

$$\mathrm{Net}(x, w) = \mathrm{sgn}\left( w^T x + b \right) .$$

Its VC dimension is $N + 1$: if $N = 2$, only $3 = 2 + 1$ points (in general position) can be shattered in the 2D plane.
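A brute-force check confirms this for concrete point sets. The sketch below is illustrative: the point sets are arbitrary, and the perceptron rule with an iteration cap is used as a heuristic stand-in for an exact linear-separability test (the perceptron converges precisely when the dichotomy is separable):

import itertools
import numpy as np

def separable(points, labels, epochs=10000):
    # Heuristic test: run the perceptron rule; convergence within the
    # iteration cap is taken as evidence of linear separability.
    X = np.hstack([points, np.ones((len(points), 1))])  # absorb bias b
    y = np.where(labels, 1.0, -1.0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered(points):
    # A set is shattered if every one of the 2^N dichotomies is separable.
    pts = np.asarray(points, dtype=float)
    return all(separable(pts, np.array(lab))
               for lab in itertools.product([False, True], repeat=len(pts)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # 3 points: True
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 4 points (XOR): False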
Distribution-free bounds on the convergence rate over the training set
Let us choose the error function to be an indicator function:

$$l\left( d, \mathrm{Net}(x, w) \right) = \begin{cases} 0 & \text{if } \mathrm{Net}(x, w) = d, \\ 1 & \text{otherwise.} \end{cases}$$

Then $R(w) = E\left[ l\left( d, \mathrm{Net}(x, w) \right) \right] = P(\text{error} \mid w)$,
whereas the empirical error

$$J^{(K)}(w) = \frac{1}{K} \sum_{k=1}^{K} l\left( d_k, \mathrm{Net}(x_k, w) \right)$$

is the relative frequency (the error rate) of $w$. Then (Vapnik, 1982) the following bound holds:
$$P\left( \sup_{w} \left| R(w) - J^{(K)}(w) \right| > \varepsilon \right) < \left( \frac{2eK}{VC} \right)^{VC} e^{-2\varepsilon^2 K},$$

where $VC$ is the VC dimension of $\mathrm{Net}(x, w)$.
Since this guarantees uniform convergence, by setting

$$\left( \frac{2eK}{VC} \right)^{VC} e^{-2\varepsilon^2 K} = \eta,$$

$K_0(\varepsilon, \eta)$ is implicitly given. One must note that $K_0$ depends on the VC dimension. If the VC dimension is finite, then the bound tends to zero as $K$ grows, so such a finite $K_0$ exists.