Learning by Neural Networks


Learning is generally defined as adapting the weights of the neural network in order to implement a specific mapping, based on a training set. In this section we will unfold the statistical theory of learning and give a precise meaning to the somewhat naive definition of learning given above. At first, we view learning as an approximation task; then we pursue a statistical investigation into the problem.

    Learning as an approximation

Given a function $F(x)\in L_2$ and a network architecture

$$\mathrm{Net}(x,w)=\sum_{i=1}^{N_{L-1}} w^{(L)}_{i}\,\varphi\!\left(\sum_{j} w^{(L-1)}_{ij}\,\varphi\!\left(\cdots\,\varphi\!\left(\sum_{n=1}^{N_0} w^{(1)}_{ln} x_n + w^{(1)}_{l0}\right)\cdots\right)\right),$$

find

$$w_{\mathrm{opt}}:=\arg\min_{w}\ \big\|F(x)-\mathrm{Net}(x,w)\big\|^2_{L_2}.$$

This problem is analogous to the problem of finding the optimal Fourier coefficients for a function $F(x)$. There are two approaches to solving this problem, depending on the construction of the network.
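As a quick illustration, the following minimal sketch evaluates such a $\mathrm{Net}(x,w)$ and a Monte Carlo estimate of the $L_2$ approximation error. The one-hidden-layer architecture, the tanh activation, the ten hidden units and the target $F(x)=\sin(2\pi x)$ are arbitrary assumptions made only for the example, not the architecture fixed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def net(x, w):
    """A one-hidden-layer Net(x, w) with tanh units; w = (W1, b1, w2, b2) is assumed."""
    W1, b1, w2, b2 = w
    h = np.tanh(np.outer(x, W1) + b1)    # hidden-layer activations
    return h @ w2 + b2                   # linear output layer

def F(x):
    return np.sin(2 * np.pi * x)         # the function to be approximated (illustrative)

# random weights for a network with 10 hidden units
w = (rng.normal(size=10), rng.normal(size=10), rng.normal(size=10), 0.0)

# Monte Carlo estimate of ||F - Net||^2_{L2} under uniform x on X = [0, 1]
x = rng.uniform(0.0, 1.0, size=10_000)
print("estimated ||F - Net(., w)||^2_L2:", round(float(np.mean((F(x) - net(x, w)) ** 2)), 4))
```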

    Learning by one-layered network

In this case one can draw on the results of Fourier analysis, namely that any

    Learning from examples:

Instead of the function $F(x)$ to be approximated, a training set of size $K$ is given, $\big\{(x_k,d_k),\ k=1,\dots,K\big\}$, where $x_k\in X$ and $d_k=F(x_k)$ is the so-called desired output. One can then define an empirical error over the training set, e.g.

$$\frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2.$$

The weight optimization, or learning, then reduces to finding

$$w^{(K)}_{\mathrm{opt}}:=\arg\min_{w}\ \frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2.$$

If the sequence $x_k\in X$ is drawn randomly from $X$ subject to a uniform distribution, then

$$\frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2\ \xrightarrow[K\to\infty]{}\ E\big(d-\mathrm{Net}(x,w)\big)^2=\int\big(d-\mathrm{Net}(x,w)\big)^2 p(x)\,dx.$$

The function

$$J(w):=\frac{1}{K}\sum_{k=1}^{K} l\big(d_k,\mathrm{Net}(x_k,w)\big)$$

is called the empirical error function.
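The sketch below computes the empirical quadratic error for a fixed weight vector and shows it settling towards the expected error as $K$ grows. The deliberately simple polynomial "network", the target $F$ and the sample sizes are assumptions made for the illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def net(x, w):
    # a deliberately simple "network": a cubic polynomial with weight vector w
    return w[0] + w[1] * x + w[2] * x**2 + w[3] * x**3

def F(x):
    return np.sin(2 * np.pi * x)         # desired mapping, d = F(x)

def J(w, x, d):
    """Empirical error J(w) = (1/K) * sum_k (d_k - Net(x_k, w))^2."""
    return float(np.mean((d - net(x, w)) ** 2))

w = np.array([0.0, 1.0, -2.0, 0.5])      # some fixed weight vector

for K in (10, 100, 10_000):
    x = rng.uniform(0.0, 1.0, size=K)    # x_k drawn uniformly from X = [0, 1]
    d = F(x)                             # desired outputs d_k = F(x_k)
    print(f"K = {K:6d}   J(w) = {J(w, x, d):.4f}")
# As K grows, J(w) approaches E(d - Net(x, w))^2, the averaged L2 error over X.
```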


There are two fundamental questions which arise with regard to $w^{(K)}_{\mathrm{opt}}$:

- What is the connection between

$$\min_{w\in W}\ \frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2 \quad\text{and}\quad \min_{w}\int\big(d-\mathrm{Net}(x,w)\big)^2 p(x)\,dx\ ?$$

- If this relationship is specified, how do we minimize $J(w)$ in order to find $w^{(K)}_{\mathrm{opt}}$?

As one can see, the first question is a statistical one and the second is purely an optimization task. Thus, we treat learning in two subsections:

- statistical learning theory
- learning as an optimization problem

    Statistical learning theory

Let us assume that the samples $x_k$ in the learning set are drawn from the input space $X$ independently and subject to a uniform distribution. It is also assumed that the desired response is a function of $x$, given as $d=F(x)$. Here $F$ can be a deterministic function or a stochastic one. In the latter case

$$d=g(x)+\nu,$$

where $\nu$ is noise.

First, without dealing with the training set, let us analyze the expression $E\big(d-\mathrm{Net}(x,w)\big)^2$ when $d=F(x)$ is a deterministic function:

$$E\big(d-\mathrm{Net}(x,w)\big)^2=\int_X\big(d-\mathrm{Net}(x,w)\big)^2 p(x_1,\dots,x_n)\,dx_1\cdots dx_n.$$

Since $p(x_1,\dots,x_n)=\frac{1}{|X|}$ due to the uniformity,

$$E\big(d-\mathrm{Net}(x,w)\big)^2=\frac{1}{|X|}\int_X\big(F(x)-\mathrm{Net}(x,w)\big)^2\,dx_1\cdots dx_n=\frac{1}{|X|}\,\big\|F(x)-\mathrm{Net}(x,w)\big\|^2_{L_2}.$$

Thus, finding $w_{\mathrm{opt}}:=\arg\min_w E\big(d-\mathrm{Net}(x,w)\big)^2$ will minimize the approximation error in the corresponding $L_2$ space. Allowing stochastic mappings between $d$ and $x$ (e.g. $d=f(x)+\nu$), finding $w_{\mathrm{opt}}:=\arg\min_w E\big(d-\mathrm{Net}(x,w)\big)^2$ poses a nonlinear regression problem. In order to solve this, let $E(d\mid x):=\int d\,p(d\mid x)\,dd$. It is useful to recall that

$$E\big(E(d\mid x)\big)=\int_X\Big(\int d\,p(d\mid x)\,dd\Big)p(x)\,dx_1\cdots dx_n=\int d\,p(d)\,dd=E(d).$$

Now $E\big(d-\mathrm{Net}(x,w)\big)^2$ can be rewritten as


$$\begin{aligned}E\big(d-\mathrm{Net}(x,w)\big)^2&=E\big(d-E(d\mid x)+E(d\mid x)-\mathrm{Net}(x,w)\big)^2\\&=E\big(d-E(d\mid x)\big)^2+2\,E\Big(\big(d-E(d\mid x)\big)\big(E(d\mid x)-\mathrm{Net}(x,w)\big)\Big)+E\big(E(d\mid x)-\mathrm{Net}(x,w)\big)^2.\end{aligned}$$

In the cross term one can use the fact that $d-E(d\mid x)$ and $E(d\mid x)-\mathrm{Net}(x,w)$ are uncorrelated, therefore

$$2\,E\Big(\big(d-E(d\mid x)\big)\big(E(d\mid x)-\mathrm{Net}(x,w)\big)\Big)=2\,E\big(d-E(d\mid x)\big)\,E\big(E(d\mid x)-\mathrm{Net}(x,w)\big)=0,$$

as $E\big(d-E(d\mid x)\big)=E(d)-E\big(E(d\mid x)\big)=E(d)-E(d)=0$. Hence

$$E\big(d-\mathrm{Net}(x,w)\big)^2=E\big(d-E(d\mid x)\big)^2+E\big(E(d\mid x)-\mathrm{Net}(x,w)\big)^2.\qquad(2)$$

As a result, when $E\big(d-\mathrm{Net}(x,w)\big)^2$ is minimized, namely $w_{\mathrm{opt}}:=\arg\min_w E\big(d-\mathrm{Net}(x,w)\big)^2$ is searched for, then $\mathrm{Net}(x,w_{\mathrm{opt}})=E(d\mid x)$, making the second term in (2) zero. Nevertheless, instead of minimizing $E\big(d-\mathrm{Net}(x,w)\big)^2$, we can only minimize the empirical quadratic error

$$J(w)=\frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2$$

over a training set of size $K$ given as $\big\{(x_k,d_k),\ k=1,\dots,K\big\}$.
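A small numerical check of decomposition (2) is sketched below: with a stochastic mapping $d=g(x)+\nu$, the predictor equal to $E(d\mid x)$ attains only the noise term, while any other predictor pays the noise term plus its squared deviation from $E(d\mid x)$. The regression function $g$, the noise level and the offset of the second predictor are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x):                        # the regression function, g(x) = E(d | x)
    return np.cos(3 * x)

sigma = 0.3                      # standard deviation of the additive noise
x = rng.uniform(0.0, 1.0, size=200_000)
d = g(x) + sigma * rng.normal(size=x.size)

def mse(pred):                   # Monte Carlo estimate of E(d - pred)^2
    return float(np.mean((d - pred) ** 2))

net_opt = g(x)                   # Net(x, w_opt) = E(d | x): second term of (2) vanishes
net_off = g(x) + 0.2             # some other (biased) predictor

print("E(d - E(d|x))^2            ~=", round(mse(net_opt), 4))         # ~ sigma^2 = 0.09
print("E(d - Net_off)^2           ~=", round(mse(net_off), 4))         # ~ sigma^2 + 0.2^2
print("first + second term of (2) =", round(sigma**2 + 0.2**2, 4))     # 0.13
```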

Denoting $R(w):=E\big(d-\mathrm{Net}(x,w)\big)^2$, we want to ascertain the relationship between $\min_w J(w)$ and $\min_w R(w)$, or between $R(w_{\mathrm{opt}})$ and $J\big(w^{(K)}_{\mathrm{opt}}\big)$. Here the notation $w^{(K)}_{\mathrm{opt}}$ is used to indicate that $w^{(K)}_{\mathrm{opt}}$ is obtained by minimizing $J(w)$ over a finite training set of size $K$.

In other words, we wish to find out what size of training set provides enough information to approximate $R(w_{\mathrm{opt}})$.

    The bias-variance dilemma

Let us investigate the difference $E\big(d-\mathrm{Net}(x,w^{(K)}_{\mathrm{opt}})\big)^2$, where $w^{(K)}_{\mathrm{opt}}$ is obtained by minimizing the empirical error $J(w)$. One can then write

$$\begin{aligned}E\big(d-\mathrm{Net}(x,w^{(K)}_{\mathrm{opt}})\big)^2&=E\big(d-\mathrm{Net}(x,w_{\mathrm{opt}})+\mathrm{Net}(x,w_{\mathrm{opt}})-\mathrm{Net}(x,w^{(K)}_{\mathrm{opt}})\big)^2\\&=E\big(d-\mathrm{Net}(x,w_{\mathrm{opt}})\big)^2+E\big(\mathrm{Net}(x,w_{\mathrm{opt}})-\mathrm{Net}(x,w^{(K)}_{\mathrm{opt}})\big)^2.\end{aligned}$$

Remark:

The cross terms in the expression above become zero in a manner similar to that demonstrated in the derivation of (2).

The first term in the expression above is the approximation error between $F(x)$ and $\mathrm{Net}(x,w)$, whereas the second one is the error resulting from the finite training set. Thus, one can choose between two options:


- either minimize the first term (referred to as the bias) with a relatively large network; but in this case, with a limited-size training set, the weights cannot be trained correctly, leaving the second term large;

- or minimize the second term (called the variance), which requires a small network if the size of the training set is finite, but which leaves the first term large.

As a conclusion, there is a dilemma between bias and variance. This gives rise to the question of how to set the size of the training set so as to strike a good balance between bias and variance.
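The dilemma can be made tangible with a small experiment, sketched below: models of different complexity (here polynomials standing in for small and large networks) are fitted to many independent training sets of fixed size, and the squared bias and the variance of the fitted predictions are estimated. The training-set size, noise level, degrees and number of repetitions are all assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
F = lambda x: np.sin(np.pi * x)              # target function (illustrative)
K, sigma, trials = 20, 0.2, 500              # training-set size, noise level, repetitions
x_test = np.linspace(-1.0, 1.0, 200)

for degree in (1, 3, 9):                     # "small" vs "large" model
    preds = []
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=K)   # training inputs drawn uniformly from X
        d = F(x) + sigma * rng.normal(size=K)
        coeffs = np.polyfit(x, d, degree)    # minimize the empirical quadratic error
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - F(x_test)) ** 2)   # squared bias term
    var = np.mean(preds.var(axis=0))                          # variance over training sets
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

Typically the low-degree model shows a large bias and small variance, while the high-degree model shows the opposite, mirroring the two options above.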

The optimal size of the training set: the VC dimension

From the ergodic hypothesis it follows that for every $w\in W$, $\varepsilon>0$ and $\delta>0$ there exists a $K_0(\varepsilon,\delta,w)$ for which

$$P\Big(\big|R(w)-J(w)\big|>\varepsilon\Big)<\delta\quad\text{if }K>K_0.$$

Unfortunately, this $K_0$ depends on $w$; thus we cannot ascertain a $K_0(\varepsilon,\delta)$ for which it is known that

$$P\Big(\big|R(w_{\mathrm{opt}})-J\big(w^{(K)}_{\mathrm{opt}}\big)\big|>\varepsilon\Big)=P\Big(\big|\min_{w}R(w)-\min_{w}J(w)\big|>\varepsilon\Big)<\delta.$$

If such a $K_0(\varepsilon,\delta)$ existed, it would yield the necessary size of the training set. To have such a result, we have to introduce a more stringent bound on the convergence, called uniform convergence, namely: for every $\varepsilon>0$ and $\delta>0$ there exists a $K_0(\varepsilon,\delta)$ for which

$$P\Big(\sup_{w\in W}\big|R(w)-J(w)\big|>\varepsilon\Big)<\delta\quad\text{if }K>K_0,$$

which enforces that for all $w\in W$

$$P\Big(\big|R(w)-J(w)\big|>\varepsilon\Big)<\delta\quad\text{if }K>K_0.$$

If this uniform convergence holds, then the necessary size of the learning set can be estimated. Vapnik and Chervonenkis pioneered the work revealing such bounds, and the basic parameter of these bounds is called the VC dimension in honour of their achievements.

    The VC dimension

Let us assume that we are given a $\mathrm{Net}(x,w)$ which we use for binary classification. The VC dimension is related to the classification power of $\mathrm{Net}(x,w)$. More precisely, consider the set of dichotomies spanned by $\mathrm{Net}(x,w)$, given as

$$F:=\Big\{\mathrm{Net}(x,w),\ w\in W:\ \mathrm{Net}(x,w)=1\ \text{if}\ x\in X_1,\ \mathrm{Net}(x,w)=0\ \text{if}\ x\in X_0,\ X_0\cup X_1=X,\ X_0\cap X_1=\emptyset\Big\},$$

and a set of input points

$$\Gamma:=\big\{x_i,\ i=1,\dots,N\big\}.$$

The VC dimension of $\mathrm{Net}(x,w)$ is determined by the number of possible dichotomies expressed by $\mathrm{Net}(x,w)$ on $\Gamma$: if all $2^N$ dichotomies can be expressed by $\mathrm{Net}(x,w)$, then we say $\Gamma$ is shattered by $F$, and the VC dimension is the size of the largest set $\Gamma$ that can be shattered.


E.g., let us consider the elementary mapping $\mathrm{Net}(x,w)=\mathrm{sgn}\big(w^T x+b\big)$ generated by a single neuron. Its VC dimension is $N+1$, where $N$ is the input dimension: if $N=2$, only $3=2+1$ points can be shattered on a 2D plane.
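This shattering property can be checked mechanically. The sketch below decides, for every dichotomy of a small point set, whether a separating $(w,b)$ exists by solving a linear-programming feasibility problem; three points in general position are shattered, while the four XOR corners are not. The particular point sets and the use of SciPy's `linprog` are assumptions made for the illustration, not part of the notes.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some (w, b) satisfies labels_i * (w . x_i + b) >= 1 for all i."""
    A_ub = [[-y * p[0], -y * p[1], -y] for p, y in zip(points, labels)]
    b_ub = [-1.0] * len(points)
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.success

def shattered(points):
    """True if every dichotomy (+1/-1 labelling) of the points is separable."""
    return all(separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]                # 3 points in general position
four  = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]    # 4 points (XOR corners)

print("3 points shattered:", shattered(three))   # True: sgn(w^T x + b) shatters 3 points in 2D
print("4 points shattered:", shattered(four))    # False: the XOR dichotomy is not separable
```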

Distribution-free bounds on the convergence rate over the training set

Let us choose the error function to be an indicator function:

$$l\big(d,\mathrm{Net}(x,w)\big)=\begin{cases}0 & \text{if }\mathrm{Net}(x,w)=d,\\ 1 & \text{otherwise.}\end{cases}$$

Then

$$E\,l\big(d,\mathrm{Net}(x,w)\big)=P(\mathrm{error}\mid w),$$

whereas the empirical error

$$J(w)=\frac{1}{K}\sum_{k=1}^{K} l\big(d_k,\mathrm{Net}(x_k,w)\big)$$

is the relative frequency of errors (the error rate) given $w$. Then (Vapnik, 1982) the following bound holds:

$$P\Big(\sup_{w\in W}\big|P(\mathrm{error}\mid w)-J(w)\big|>\varepsilon\Big)\le k\left(\frac{2eK}{\mathrm{vc}}\right)^{\mathrm{vc}} e^{-\varepsilon^2 K},$$

where $\mathrm{vc}$ is the VC dimension of $\mathrm{Net}(x,w)$.
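To get a feel for this bound, the minimal sketch below evaluates its right-hand side and scans for the smallest $K$ that pushes it below a prescribed $\delta$. The leading constant $k$ and the sample values of $\varepsilon$, $\delta$ and the VC dimension are assumptions made only for the example.

```python
import math

def bound(K, vc, eps, k=1.0):
    """Right-hand side  k * (2eK/vc)^vc * exp(-eps^2 K)  of the bound above."""
    return k * (2.0 * math.e * K / vc) ** vc * math.exp(-(eps ** 2) * K)

def K0(vc, eps, delta, k=1.0, K_max=10_000_000):
    """Smallest K pushing the bound below delta (brute-force scan)."""
    K = vc
    while bound(K, vc, eps, k) >= delta:
        K += 1
        if K > K_max:
            raise ValueError("no K <= K_max satisfies the bound")
    return K

for vc in (3, 10, 50):
    print(f"vc = {vc:3d}  ->  K_0 = {K0(vc, eps=0.1, delta=0.05)}")
```

The required training-set size grows with the VC dimension, which is exactly the role $K_0(\varepsilon,\delta)$ plays below.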

Since this guarantees uniform convergence, by setting

$$k\left(\frac{2eK}{\mathrm{vc}}\right)^{\mathrm{vc}} e^{-\varepsilon^2 K}<\delta,$$

$K_0(\varepsilon,\delta)$ is implicitly given. One must note that $K_0$ depends on the VC dimension. If the VC dimension is finite, then