Machine Learning Lecture 3: ADLINE and Delta Rule (G53MLE, Dr Guoping Qiu)


Page 1: Machine Learning

Machine Learning

Lecture 3

ADLINE and Delta Rule


Page 2: Machine Learning

The ADLINE and Delta Rule

• Adaptive Linear Element (ADLINE) vs Perceptron


(Figure: the Perceptron and the ADLINE both form a weighted sum of the inputs $x_1, x_2, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$; they differ in how the output $o$ is produced.)

Perceptron:

$$R = w_0 + \sum_{i=1}^{n} w_i x_i, \qquad o = \operatorname{sign}(R) = \begin{cases} 1 & \text{if } R \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

ADLINE:

$$o = w_0 + \sum_{i=1}^{n} w_i x_i$$

Page 3: Machine Learning

The ADLINE and Delta Rule

• Adaptive Linear Element (ADLINE) vs Perceptron

– When the problem is not linearly separable, the perceptron will fail to converge

– ADLINE can overcome this difficulty by finding a best-fit approximation to the target.


Page 4: Machine Learning

The ADLINE Error Function

• We have training pairs (X(k), d(k)), k = 1, 2, …, K, where K is the number of training samples. The training error measures the difference between the output of the ADLINE and the desired target.

• The error is defined as


(Figure: an ADLINE with inputs $x_1, x_2, \ldots, x_n$, weights $w_0, w_1, \ldots, w_n$ and output $o = w_0 + \sum_{i=1}^{n} w_i x_i$.)

$$E(W) = \frac{1}{2}\sum_{k=1}^{K}\big(d(k) - o(k)\big)^2$$

where $o(k) = W^T X(k)$ is the output of presenting the training input X(k).
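As an illustration only (the slides contain no code), the error E(W) can be computed directly from this definition. The convention that each input X(k) carries a leading constant 1, so that the bias weight w0 is the first component of W, is an assumption of this sketch.

```python
import numpy as np

def adline_error(W, X, d):
    """ADLINE training error E(W) = 1/2 * sum_k (d(k) - o(k))^2.

    X : (K, n+1) array, each row is X(k) with a leading 1 for the bias.
    d : (K,) array of desired targets d(k).
    W : (n+1,) weight vector, W[0] being the bias weight w0.
    """
    o = X @ W                          # o(k) = W^T X(k) for all K samples at once
    return 0.5 * np.sum((d - o) ** 2)  # sum of squared errors, halved
```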

Page 5: Machine Learning

The ADLINE Error Function

• The error is defined as

• The smaller E(W) is, the closer the approximation is to the target

• We need to find W, based on the given training set, that minimizes the error E(W)


$$E(W) = \frac{1}{2}\sum_{k=1}^{K}\big(d(k) - o(k)\big)^2$$

Page 6: Machine Learning

The ADLINE Error Function


(Figure: the error surface E(W))

Page 7: Machine Learning

The Gradient Descent Rule

• An intuition

– Before we formally derive the gradient descent rule, here is an intuition of what we should be doing


(Figure: the error surface E(w) over a single weight w. W(0) is a randomly chosen initial weight with error E(w(0)). We want to move w(0) to a new value such that E(w(new)) < E(w(0)).)

Page 8: Machine Learning

The Gradient Descent Rule

• An intuition

– Before we formally derive the gradient descent rule, here is an intuition of what we should be doing


(Figure: the same error surface and initial weight W(0). We want to move w(0) to a new value such that E(w(new)) < E(w(0)). Which direction should we move w(0)?)

Page 9: Machine Learning

The Gradient Descent Rule

• An intuition

– Before we formally derive the gradient descent rule, here is an intuition of what we should be doing


(Figure: the same error surface, now also marking a candidate new weight w(new). Which direction should we move w(0)?)

Page 10: Machine Learning

The Gradient Descent Rule

• An intuition

– Before we formally derive the gradient descent rule, here is an intuition of what we should be doing


(Figure: the same error surface with w(new). How do we know which direction to move?)

Page 11: Machine Learning

The Gradient Descent Rule

• An intuition

– Before we formally derive the gradient descent rule, here is an intuition of what we should be doing


(Figure: the same error surface with w(new). The direction is given by the sign of the gradient at w(0), $\left.\frac{\partial E}{\partial w}\right|_{w = w(0)}$.)

Page 12: Machine Learning

The Gradient Descent Rule

• An intuition

– Before we formally derive the gradient descent rule, here is an intuition of what we should be doing


(Figure: as on the previous slide, the move from w(0) towards w(new) is determined by the sign of the gradient at w(0), $\left.\frac{\partial E}{\partial w}\right|_{w = w(0)}$.)

Page 13: Machine Learning

The Gradient Descent Rule

• An intuition

– The intuition leads to


(Figure: two panels of the error surface E(w), one where the gradient at W(0) is positive and one where it is negative; in both cases w(new) moves downhill.)

$$w(\text{new}) = w(\text{old}) - \eta \cdot \operatorname{sign}\!\left(\left.\frac{\partial E}{\partial w}\right|_{w = w(\text{old})}\right)$$

where $\eta$ is a small positive step size.
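As a small check (not on the slide): if the gradient at w(0) happens to be positive, say $\left.\frac{\partial E}{\partial w}\right|_{w=w(0)} = 2 > 0$, then the sign is $+1$ and

$$w(\text{new}) = w(0) - \eta \cdot (+1) < w(0),$$

so the weight decreases, which is the downhill direction when E is increasing in w; a negative gradient moves the weight the opposite way.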

Page 14: Machine Learning

The Gradient Descent Rule

• Formal Derivation of Gradient Descent

– The gradient of E is a vector whose components are the partial derivatives of E with respect to each of the wi

– The gradient specifies the direction that produces the steepest increase in E.

– The negative of this vector gives the direction of steepest decrease.


$$\nabla E(W) = \left[\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_n}\right]$$

Page 15: Machine Learning

The Gradient Descent Rule

• The gradient training rule is


$$W \leftarrow W - \eta\,\nabla E(W), \qquad w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i}$$

where $\eta$ is the training rate.

Page 16: Machine Learning

The Gradient Descent Rule

• Gradient of the ADLINE Error Function


$$E(W) = \frac{1}{2}\sum_{k=1}^{K}\big(d(k)-o(k)\big)^2$$

$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{k=1}^{K}\big(d(k)-o(k)\big)^2 \\
&= \frac{1}{2}\sum_{k=1}^{K}\frac{\partial}{\partial w_i}\big(d(k)-o(k)\big)^2 \\
&= \frac{1}{2}\sum_{k=1}^{K}2\big(d(k)-o(k)\big)\,\frac{\partial}{\partial w_i}\big(d(k)-o(k)\big) \\
&= \sum_{k=1}^{K}\big(d(k)-o(k)\big)\,\frac{\partial}{\partial w_i}\left(d(k)-w_0-\sum_{j=1}^{n}w_j x_j(k)\right) \\
&= -\sum_{k=1}^{K}\big(d(k)-o(k)\big)\,x_i(k)
\end{aligned}$$

(with $x_0(k) \equiv 1$ so that the same expression covers the bias weight $w_0$).

Page 17: Machine Learning

The Gradient Descent Rule

• ADLINE weight updating using the gradient descent rule


$$w_i \leftarrow w_i + \eta \sum_{k=1}^{K}\big(d(k)-o(k)\big)\,x_i(k)$$

Page 18: Machine Learning

The Gradient Descent Rule

• Gradient descent training procedure

– Initialise wi to small values, e.g., in the range (-1, 1); choose a learning rate, e.g., η = 0.2

– Until the termination condition is met, Do

• For all training sample pairs (X(k), d(k)), input the instance X(k) and compute the output o(k)

• For each weight wi, Do


$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\sum_{k=1}^{K}\big(d(k)-o(k)\big)\,x_i(k)$$

Batch mode: the gradients are accumulated over ALL samples first, then the weights are updated (see the sketch below).
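A minimal sketch of this batch-mode procedure in Python (the slides give no code; the function name, the fixed-epoch termination condition, and the convention that each input row carries a leading 1 for the bias are assumptions of this sketch):

```python
import numpy as np

def train_adline_batch(X, d, eta=0.2, epochs=100):
    """Batch-mode gradient descent for an ADLINE.

    X : (K, n+1) inputs, each row X(k) with a leading 1 for the bias.
    d : (K,) desired outputs d(k).
    """
    rng = np.random.default_rng(0)
    W = rng.uniform(-1.0, 1.0, size=X.shape[1])   # initialise wi in (-1, 1)
    for _ in range(epochs):                       # termination: pre-set number of epochs
        o = X @ W                                 # outputs o(k) for ALL K samples
        W = W + eta * (X.T @ (d - o))             # wi += eta * sum_k (d(k) - o(k)) * x_i(k)
    return W
```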

Page 19: Machine Learning

Stochastic (Incremental) Gradient Descent

• Also called online mode, Least Mean Square (LMS), Widrow-Hoff, and Delta Rule

– Initialise wi to small values, e.g., in the range (-1, 1); choose a learning rate, e.g., η = 0.01 (should be smaller than in batch mode)

– Until the termination condition is met, Do

• For EACH training sample pair (X(k), d(k)), compute the output o(k)

• For each weight wi, Do


$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,\big(d(k)-o(k)\big)\,x_i(k)$$

Online mode: the gradient is calculated for EACH sample, then the weights are updated immediately (see the sketch below).
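For comparison, a sketch of the online (LMS / delta rule) variant under the same assumptions as the batch sketch above; the weights now change inside the loop over samples, and the presentation order is shuffled each epoch, as the next slide recommends:

```python
import numpy as np

def train_adline_online(X, d, eta=0.01, epochs=100):
    """Stochastic (incremental) gradient descent: update after EACH sample."""
    rng = np.random.default_rng(0)
    W = rng.uniform(-1.0, 1.0, size=X.shape[1])   # initialise wi in (-1, 1)
    for _ in range(epochs):
        for k in rng.permutation(len(X)):         # different sample order each epoch
            o_k = X[k] @ W                        # output for this single sample
            W = W + eta * (d[k] - o_k) * X[k]     # immediate delta-rule update
    return W
```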

Page 20: Machine Learning

Training Iterations, Epochs

• Training is an iterative process; training samples will have to be used repeatedly for training

• Assuming we have K training samples [(X(k), d(k)), k = 1, 2, …, K], an epoch is the presentation of all K samples for training once

– First epoch: present training samples (X(1), d(1)), (X(2), d(2)), …, (X(K), d(K))

– Second epoch: present training samples (X(K), d(K)), (X(K-1), d(K-1)), …, (X(1), d(1))

– Note that the order of training sample presentation can (and normally should) differ between epochs.

• Normally, training will take many epochs to complete


Page 21: Machine Learning

Termination of Training

• To terminate training, there are normally two ways

– When a pre-set number of training epochs is reached

– When the error is smaller than a pre-set value


$$E(W) = \frac{1}{2}\sum_{k=1}^{K}\big(d(k)-o(k)\big)^2$$

Page 22: Machine Learning

Gradient Descent Training

• A worked example


(Figure: a two-input ADLINE with weights W0, W1, W2 and inputs x1, x2.)

Training set (D is the desired output):

x1   x2   D
-1   -1   -1
-1   +1   +1
+1   -1   +1
+1   +1   +1

Initialization: W0(0) = 0.1; W1(0) = 0.2; W2(0) = 0.3; η = 0.5
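As a check on the procedure (this arithmetic is not shown on the slide; it assumes one batch-mode update with the initialization above), the outputs with the initial weights, $o(k) = W_0 + W_1 x_1(k) + W_2 x_2(k)$, are

$$o(1) = 0.1 - 0.2 - 0.3 = -0.4, \quad o(2) = 0.2, \quad o(3) = 0.0, \quad o(4) = 0.6$$

so the errors $D(k) - o(k)$ are $-0.6,\; 0.8,\; 1.0,\; 0.4$, and the first batch update gives

$$\begin{aligned}
W_0 &\leftarrow 0.1 + 0.5(-0.6 + 0.8 + 1.0 + 0.4) = 0.9 \\
W_1 &\leftarrow 0.2 + 0.5(0.6 - 0.8 + 1.0 + 0.4) = 0.8 \\
W_2 &\leftarrow 0.3 + 0.5(0.6 + 0.8 - 1.0 + 0.4) = 0.7
\end{aligned}$$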

Page 23: Machine Learning

Further Readings

• T. M. Mitchell, Machine Learning, McGraw-Hill International Edition, 1997, Chapter 4

• Any other relevant books/papers


Page 24: Machine Learning

Tutorial/Exercise Questions

1. Derive a gradient descent training rule for a single unit with output y

2. A network consisting of two ADLINE units N1 and N2 is shown as follows. Derive a delta training rule for all the weights


(Figure for question 2: inputs x1 and x2 feed unit N1 through weights w1 and w2, with a bias weight w0 on a constant +1 input; the output of N1 feeds unit N2 through weight w3, with a bias weight w4 on a constant +1 input; N2 produces the output y.)

For question 1, the unit's output is

$$y = w_0 + w_1 x_1 + w_1 x_1^2 + w_2 x_2 + w_2 x_2^2 + \cdots + w_n x_n + w_n x_n^2$$

Page 25: Machine Learning

Tutorial/Exercise Questions

3. The connection weights of a two-input ADLINE at time n have the following values:


w0(n) = -0.5; w1(n) = 0.1; w2(n) = -0.3

The training sample at time n is:

x1(n) = 0.6; x2(n) = 0.8

The corresponding desired output is d(n) = 1

a) Based on the Least-Mean-Square (LMS) algorithm, derive the learning equations for each weight at time n

b) Assuming a learning rate of 0.1, compute the weights at time (n+1):

w0 (n+1), w1 (n+1), and w2 (n+1).