Learning by Neural Networks


Learning is generally defined as adapting the weights of the neural network in order to implement a specific mapping, based on a training set. In this section we will unfold the statistical theory of learning and give a precise meaning to the somewhat naive definition of learning given above. At first, we view learning as an approximation task; then we pursue a statistical investigation into the problem.

    Learning as an approximation

Given a function $F(x)\in L_2$ and a network architecture

$$\mathrm{Net}(x,w)=\sum_{i=1}^{N_{L-1}} w^{(L)}_{i}\,\varphi\!\left(\sum_{j} w^{(L-1)}_{ij}\,\varphi\!\left(\cdots\,\varphi\!\left(\sum_{n=1}^{N_0} w^{(1)}_{ln} x_n + w^{(1)}_{l0}\right)\cdots\right)\right),$$

find

$$w_{\mathrm{opt}}:=\arg\min_{w}\ \big\|F(x)-\mathrm{Net}(x,w)\big\|^2_{L_2}.$$

This problem is analogous to the problem of finding the optimal Fourier coefficients for a function $F(x)$. There are two approaches to solving this problem, depending on the construction of the network.
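As a quick illustration, the following minimal sketch evaluates such a $\mathrm{Net}(x,w)$ and a Monte Carlo estimate of the $L_2$ approximation error. The one-hidden-layer architecture, the tanh activation, the ten hidden units and the target $F(x)=\sin(2\pi x)$ are arbitrary assumptions made only for the example, not the architecture fixed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def net(x, w):
    """A one-hidden-layer Net(x, w) with tanh units; w = (W1, b1, w2, b2) is assumed."""
    W1, b1, w2, b2 = w
    h = np.tanh(np.outer(x, W1) + b1)    # hidden-layer activations
    return h @ w2 + b2                   # linear output layer

def F(x):
    return np.sin(2 * np.pi * x)         # the function to be approximated (illustrative)

# random weights for a network with 10 hidden units
w = (rng.normal(size=10), rng.normal(size=10), rng.normal(size=10), 0.0)

# Monte Carlo estimate of ||F - Net||^2_{L2} under uniform x on X = [0, 1]
x = rng.uniform(0.0, 1.0, size=10_000)
print("estimated ||F - Net(., w)||^2_L2:", round(float(np.mean((F(x) - net(x, w)) ** 2)), 4))
```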

    Learning by one-layered network

In this case one can draw on the results of Fourier analysis, namely that any

    Learning from examples:

Instead of the function $F(x)$ to be approximated, a training set of size $K$ is given, $\big\{(x_k,d_k),\ k=1,\dots,K\big\}$, where $x_k\in X$ and $d_k=F(x_k)$ is the so-called desired output. One can then define an empirical error over the training set, e.g.

$$\frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2.$$

The weight optimization, or learning, then reduces to finding

$$w^{(K)}_{\mathrm{opt}}:=\arg\min_{w}\ \frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2.$$

If the sequence $x_k\in X$ is drawn randomly from $X$ subject to a uniform distribution, then

$$\frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2\ \xrightarrow[K\to\infty]{}\ E\big(d-\mathrm{Net}(x,w)\big)^2=\int\big(d-\mathrm{Net}(x,w)\big)^2 p(x)\,dx.$$

The function

$$J(w):=\frac{1}{K}\sum_{k=1}^{K} l\big(d_k,\mathrm{Net}(x_k,w)\big)$$

is called the empirical error function.
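The sketch below computes the empirical quadratic error for a fixed weight vector and shows it settling towards the expected error as $K$ grows. The deliberately simple polynomial "network", the target $F$ and the sample sizes are assumptions made for the illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def net(x, w):
    # a deliberately simple "network": a cubic polynomial with weight vector w
    return w[0] + w[1] * x + w[2] * x**2 + w[3] * x**3

def F(x):
    return np.sin(2 * np.pi * x)         # desired mapping, d = F(x)

def J(w, x, d):
    """Empirical error J(w) = (1/K) * sum_k (d_k - Net(x_k, w))^2."""
    return float(np.mean((d - net(x, w)) ** 2))

w = np.array([0.0, 1.0, -2.0, 0.5])      # some fixed weight vector

for K in (10, 100, 10_000):
    x = rng.uniform(0.0, 1.0, size=K)    # x_k drawn uniformly from X = [0, 1]
    d = F(x)                             # desired outputs d_k = F(x_k)
    print(f"K = {K:6d}   J(w) = {J(w, x, d):.4f}")
# As K grows, J(w) approaches E(d - Net(x, w))^2, the averaged L2 error over X.
```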


There are two fundamental questions which arise with regard to $w^{(K)}_{\mathrm{opt}}$:

- What is the connection between

$$\min_{w\in W}\ \frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2 \quad\text{and}\quad \min_{w}\int\big(d-\mathrm{Net}(x,w)\big)^2 p(x)\,dx\ ?$$

- If this relationship is specified, how do we minimize $J(w)$ in order to find $w^{(K)}_{\mathrm{opt}}$?

As one can see, the first question is a statistical one and the second is purely an optimization task. Thus, we treat learning in two subsections:

- statistical learning theory
- learning as an optimization problem

    Statistical learning theory

Let us assume that the samples $x_k$ in the learning set are drawn from the input space $X$ independently and subject to a uniform distribution. It is also assumed that the desired response is a function of $x$, given as $d=F(x)$. Here $F$ can be a deterministic function or a stochastic one. In the latter case

$$d=g(x)+\nu,$$

where $\nu$ is noise.

First, without dealing with the training set, let us analyze the expression $E\big(d-\mathrm{Net}(x,w)\big)^2$ when $d=F(x)$ is a deterministic function:

$$E\big(d-\mathrm{Net}(x,w)\big)^2=\int_X\big(d-\mathrm{Net}(x,w)\big)^2 p(x_1,\dots,x_n)\,dx_1\cdots dx_n.$$

Since $p(x_1,\dots,x_n)=\frac{1}{|X|}$ due to the uniformity,

$$E\big(d-\mathrm{Net}(x,w)\big)^2=\frac{1}{|X|}\int_X\big(F(x)-\mathrm{Net}(x,w)\big)^2\,dx_1\cdots dx_n=\frac{1}{|X|}\,\big\|F(x)-\mathrm{Net}(x,w)\big\|^2_{L_2}.$$

Thus, finding $w_{\mathrm{opt}}:=\arg\min_w E\big(d-\mathrm{Net}(x,w)\big)^2$ will minimize the approximation error in the corresponding $L_2$ space. Allowing stochastic mappings between $d$ and $x$ (e.g. $d=f(x)+\nu$), finding $w_{\mathrm{opt}}:=\arg\min_w E\big(d-\mathrm{Net}(x,w)\big)^2$ poses a nonlinear regression problem. In order to solve this, let $E(d\mid x):=\int d\,p(d\mid x)\,dd$. It is useful to recall that

$$E\big(E(d\mid x)\big)=\int_X\Big(\int d\,p(d\mid x)\,dd\Big)p(x)\,dx_1\cdots dx_n=\int d\,p(d)\,dd=E(d).$$

Now $E\big(d-\mathrm{Net}(x,w)\big)^2$ can be rewritten as


$$\begin{aligned}E\big(d-\mathrm{Net}(x,w)\big)^2&=E\big(d-E(d\mid x)+E(d\mid x)-\mathrm{Net}(x,w)\big)^2\\&=E\big(d-E(d\mid x)\big)^2+2\,E\Big(\big(d-E(d\mid x)\big)\big(E(d\mid x)-\mathrm{Net}(x,w)\big)\Big)+E\big(E(d\mid x)-\mathrm{Net}(x,w)\big)^2.\end{aligned}$$

In the cross term one can use the fact that $d-E(d\mid x)$ and $E(d\mid x)-\mathrm{Net}(x,w)$ are uncorrelated, therefore

$$2\,E\Big(\big(d-E(d\mid x)\big)\big(E(d\mid x)-\mathrm{Net}(x,w)\big)\Big)=2\,E\big(d-E(d\mid x)\big)\,E\big(E(d\mid x)-\mathrm{Net}(x,w)\big)=0,$$

as $E\big(d-E(d\mid x)\big)=E(d)-E\big(E(d\mid x)\big)=E(d)-E(d)=0$. Hence

$$E\big(d-\mathrm{Net}(x,w)\big)^2=E\big(d-E(d\mid x)\big)^2+E\big(E(d\mid x)-\mathrm{Net}(x,w)\big)^2.\qquad(2)$$

As a result, when $E\big(d-\mathrm{Net}(x,w)\big)^2$ is minimized, namely $w_{\mathrm{opt}}:=\arg\min_w E\big(d-\mathrm{Net}(x,w)\big)^2$ is searched for, then $\mathrm{Net}(x,w_{\mathrm{opt}})=E(d\mid x)$, making the second term in (2) zero. Nevertheless, instead of minimizing $E\big(d-\mathrm{Net}(x,w)\big)^2$, we can only minimize the empirical quadratic error

$$J(w)=\frac{1}{K}\sum_{k=1}^{K}\big(d_k-\mathrm{Net}(x_k,w)\big)^2$$

over a training set of size $K$ given as $\big\{(x_k,d_k),\ k=1,\dots,K\big\}$.
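A small numerical check of decomposition (2) is sketched below: with a stochastic mapping $d=g(x)+\nu$, the predictor equal to $E(d\mid x)$ attains only the noise term, while any other predictor pays the noise term plus its squared deviation from $E(d\mid x)$. The regression function $g$, the noise level and the offset of the second predictor are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x):                        # the regression function, g(x) = E(d | x)
    return np.cos(3 * x)

sigma = 0.3                      # standard deviation of the additive noise
x = rng.uniform(0.0, 1.0, size=200_000)
d = g(x) + sigma * rng.normal(size=x.size)

def mse(pred):                   # Monte Carlo estimate of E(d - pred)^2
    return float(np.mean((d - pred) ** 2))

net_opt = g(x)                   # Net(x, w_opt) = E(d | x): second term of (2) vanishes
net_off = g(x) + 0.2             # some other (biased) predictor

print("E(d - E(d|x))^2            ~=", round(mse(net_opt), 4))         # ~ sigma^2 = 0.09
print("E(d - Net_off)^2           ~=", round(mse(net_off), 4))         # ~ sigma^2 + 0.2^2
print("first + second term of (2) =", round(sigma**2 + 0.2**2, 4))     # 0.13
```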

Denoting $R(w):=E\big(d-\mathrm{Net}(x,w)\big)^2$, we want to ascertain the relationship between $\min_w J(w)$ and $\min_w R(w)$, or between $R(w_{\mathrm{opt}})$ and $J\big(w^{(K)}_{\mathrm{opt}}\big)$. Here the notation $w^{(K)}_{\mathrm{opt}}$ is used to indicate that $w^{(K)}_{\mathrm{opt}}$ is obtained by minimizing $J(w)$ over a finite training set of size $K$.

In other words, we wish to find out what size of training set provides enough information to approximate $R(w_{\mathrm{opt}})$.

    The bias-variance dilemma

Let us investigate the difference $E\big(d-\mathrm{Net}(x,w^{(K)}_{\mathrm{opt}})\big)^2$, where $w^{(K)}_{\mathrm{opt}}$ is obtained by minimizing the empirical error $J(w)$. One can then write

$$\begin{aligned}E\big(d-\mathrm{Net}(x,w^{(K)}_{\mathrm{opt}})\big)^2&=E\big(d-\mathrm{Net}(x,w_{\mathrm{opt}})+\mathrm{Net}(x,w_{\mathrm{opt}})-\mathrm{Net}(x,w^{(K)}_{\mathrm{opt}})\big)^2\\&=E\big(d-\mathrm{Net}(x,w_{\mathrm{opt}})\big)^2+E\big(\mathrm{Net}(x,w_{\mathrm{opt}})-\mathrm{Net}(x,w^{(K)}_{\mathrm{opt}})\big)^2.\end{aligned}$$

Remark:

The cross terms in the expression above become zero in a manner similar to that demonstrated in the derivation of (2).

The first term in the expression above is the approximation error between $F(x)$ and $\mathrm{Net}(x,w)$, whereas the second one is the error resulting from the finite training set. Thus, one can choose between two options:


- either minimize the first term (referred to as the bias) with a relatively large network; but in this case, with a limited-size training set, the weights cannot be trained correctly, leaving the second term large;

- or minimize the second term (called the variance), which requires a small network if the size of the training set is finite, but which leaves the first term large.

As a conclusion, there is a dilemma between bias and variance. This gives rise to the question of how to set the size of the training set so as to strike a good balance between bias and variance.
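The dilemma can be made tangible with a small experiment, sketched below: models of different complexity (here polynomials standing in for small and large networks) are fitted to many independent training sets of fixed size, and the squared bias and the variance of the fitted predictions are estimated. The training-set size, noise level, degrees and number of repetitions are all assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
F = lambda x: np.sin(np.pi * x)              # target function (illustrative)
K, sigma, trials = 20, 0.2, 500              # training-set size, noise level, repetitions
x_test = np.linspace(-1.0, 1.0, 200)

for degree in (1, 3, 9):                     # "small" vs "large" model
    preds = []
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=K)   # training inputs drawn uniformly from X
        d = F(x) + sigma * rng.normal(size=K)
        coeffs = np.polyfit(x, d, degree)    # minimize the empirical quadratic error
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - F(x_test)) ** 2)   # squared bias term
    var = np.mean(preds.var(axis=0))                          # variance over training sets
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

Typically the low-degree model shows a large bias and small variance, while the high-degree model shows the opposite, mirroring the two options above.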

The optimal size of the training set: the VC dimension

From the ergodic hypothesis it follows that for every $w\in W$, $\varepsilon>0$ and $\delta>0$ there exists a $K_0(\varepsilon,\delta,w)$ for which

$$P\Big(\big|R(w)-J(w)\big|>\varepsilon\Big)<\delta\quad\text{if }K>K_0.$$

Unfortunately, this $K_0$ depends on $w$; thus we cannot ascertain a $K_0(\varepsilon,\delta)$ for which it is known that

$$P\Big(\big|R(w_{\mathrm{opt}})-J\big(w^{(K)}_{\mathrm{opt}}\big)\big|>\varepsilon\Big)=P\Big(\big|\min_{w}R(w)-\min_{w}J(w)\big|>\varepsilon\Big)<\delta.$$

If such a $K_0(\varepsilon,\delta)$ existed, it would yield the necessary size of the training set. To have such a result, we have to introduce a more stringent bound on the convergence, called uniform convergence, namely: for every $\varepsilon>0$ and $\delta>0$ there exists a $K_0(\varepsilon,\delta)$ for which

$$P\Big(\sup_{w\in W}\big|R(w)-J(w)\big|>\varepsilon\Big)<\delta\quad\text{if }K>K_0,$$

which enforces that for all $w\in W$

$$P\Big(\big|R(w)-J(w)\big|>\varepsilon\Big)<\delta\quad\text{if }K>K_0.$$

If this uniform convergence holds, then the necessary size of the learning set can be estimated. Vapnik and Chervonenkis pioneered the work revealing such bounds, and the basic parameter of these bounds is called the VC dimension in honour of their achievements.

    The VC dimension

Let us assume that we are given a $\mathrm{Net}(x,w)$ which we use for binary classification. The VC dimension is related to the classification power of $\mathrm{Net}(x,w)$. More precisely, consider the set of dichotomies spanned by $\mathrm{Net}(x,w)$, given as

$$F:=\Big\{\mathrm{Net}(x,w),\ w\in W:\ \mathrm{Net}(x,w)=1\ \text{if}\ x\in X_1,\ \mathrm{Net}(x,w)=0\ \text{if}\ x\in X_0,\ X_0\cup X_1=X,\ X_0\cap X_1=\emptyset\Big\},$$

and a set of input points

$$\Gamma:=\big\{x_i,\ i=1,\dots,N\big\}.$$

The VC dimension of $\mathrm{Net}(x,w)$ is determined by the number of possible dichotomies expressed by $\mathrm{Net}(x,w)$ on $\Gamma$: if all $2^N$ dichotomies can be expressed by $\mathrm{Net}(x,w)$, then we say $\Gamma$ is shattered by $F$, and the VC dimension is the size of the largest set $\Gamma$ that can be shattered.


E.g., let us consider the elementary mapping $\mathrm{Net}(x,w)=\mathrm{sgn}\big(w^T x+b\big)$ generated by a single neuron. Its VC dimension is $N+1$, where $N$ is the input dimension: if $N=2$, only $3=2+1$ points can be shattered on a 2D plane.
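This shattering property can be checked mechanically. The sketch below decides, for every dichotomy of a small point set, whether a separating $(w,b)$ exists by solving a linear-programming feasibility problem; three points in general position are shattered, while the four XOR corners are not. The particular point sets and the use of SciPy's `linprog` are assumptions made for the illustration, not part of the notes.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some (w, b) satisfies labels_i * (w . x_i + b) >= 1 for all i."""
    A_ub = [[-y * p[0], -y * p[1], -y] for p, y in zip(points, labels)]
    b_ub = [-1.0] * len(points)
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.success

def shattered(points):
    """True if every dichotomy (+1/-1 labelling) of the points is separable."""
    return all(separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]                # 3 points in general position
four  = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]    # 4 points (XOR corners)

print("3 points shattered:", shattered(three))   # True: sgn(w^T x + b) shatters 3 points in 2D
print("4 points shattered:", shattered(four))    # False: the XOR dichotomy is not separable
```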

Distribution-free bounds on the convergence rate over the training set

Let us choose the error function to be an indicator function:

$$l\big(d,\mathrm{Net}(x,w)\big)=\begin{cases}0 & \text{if }\mathrm{Net}(x,w)=d,\\ 1 & \text{otherwise.}\end{cases}$$

Then

$$E\,l\big(d,\mathrm{Net}(x,w)\big)=P(\mathrm{error}\mid w),$$

whereas the empirical error

$$J(w)=\frac{1}{K}\sum_{k=1}^{K} l\big(d_k,\mathrm{Net}(x_k,w)\big)$$

is the relative frequency of errors (the error rate) given $w$. Then (Vapnik, 1982) the following bound holds:

$$P\Big(\sup_{w\in W}\big|P(\mathrm{error}\mid w)-J(w)\big|>\varepsilon\Big)\le k\left(\frac{2eK}{\mathrm{vc}}\right)^{\mathrm{vc}} e^{-\varepsilon^2 K},$$

where $\mathrm{vc}$ is the VC dimension of $\mathrm{Net}(x,w)$.
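To get a feel for this bound, the minimal sketch below evaluates its right-hand side and scans for the smallest $K$ that pushes it below a prescribed $\delta$. The leading constant $k$ and the sample values of $\varepsilon$, $\delta$ and the VC dimension are assumptions made only for the example.

```python
import math

def bound(K, vc, eps, k=1.0):
    """Right-hand side  k * (2eK/vc)^vc * exp(-eps^2 K)  of the bound above."""
    return k * (2.0 * math.e * K / vc) ** vc * math.exp(-(eps ** 2) * K)

def K0(vc, eps, delta, k=1.0, K_max=10_000_000):
    """Smallest K pushing the bound below delta (brute-force scan)."""
    K = vc
    while bound(K, vc, eps, k) >= delta:
        K += 1
        if K > K_max:
            raise ValueError("no K <= K_max satisfies the bound")
    return K

for vc in (3, 10, 50):
    print(f"vc = {vc:3d}  ->  K_0 = {K0(vc, eps=0.1, delta=0.05)}")
```

The required training-set size grows with the VC dimension, which is exactly the role $K_0(\varepsilon,\delta)$ plays below.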

Since this guarantees uniform convergence, by setting

$$k\left(\frac{2eK}{\mathrm{vc}}\right)^{\mathrm{vc}} e^{-\varepsilon^2 K}<\delta,$$

$K_0(\varepsilon,\delta)$ is implicitly given. One must note that $K_0$ depends on the VC dimension. If the VC dimension is finite, then