
The back-propagation algorithm

January 8, 2012

Ryan

The neuron

- The sigmoid equation is what is typically used as a transfer function between neurons. It is similar to the step function, but is continuous and differentiable.

\[
\sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}
\]

Figure: The Sigmoid Function (σ(x) plotted for x from −5 to 5; y ranges from 0 to 1)

- One useful property of this transfer function is the simplicity of computing its derivative. Let's do that now...
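As a quick illustration (not part of the original slides), here is a minimal Python sketch of the sigmoid transfer function from equation (1), evaluated over the range shown in the figure:

```python
import math

def sigmoid(x):
    """Sigmoid transfer function: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Evaluate over the plotted range (-5 to 5); values stay between 0 and 1.
for x in range(-5, 6):
    print(f"sigma({x:+d}) = {sigmoid(x):.4f}")
```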

The derivative of the sigmoid transfer function

\[
\begin{aligned}
\frac{d}{dx}\sigma(x)
&= \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right) \\
&= \frac{e^{-x}}{(1 + e^{-x})^{2}} \\
&= \frac{(1 + e^{-x}) - 1}{(1 + e^{-x})^{2}} \\
&= \frac{1 + e^{-x}}{(1 + e^{-x})^{2}} - \left(\frac{1}{1 + e^{-x}}\right)^{2} \\
&= \sigma(x) - \sigma(x)^{2}
\end{aligned}
\]

\[
\sigma' = \sigma(1 - \sigma)
\]
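A quick numerical check of the identity σ′(x) = σ(x)(1 − σ(x)) derived above, comparing the closed form against a central finite difference (an illustrative sketch, not from the slides):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Closed form from the derivation: sigma' = sigma * (1 - sigma)
    s = sigmoid(x)
    return s * (1.0 - s)

h = 1e-6  # step size for the central finite difference
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    print(x, sigmoid_prime(x), numeric)  # the two values should agree closely
```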


Single input neuron

Figure: A Single-Input Neuron (input ξ with weight ω, bias θ, transfer function σ, and output O)

In the above figure (2) you can see a diagram representing a single neuron with only a single input. The equation defining the figure is:

\[
O = \sigma(\xi\omega + \theta)
\]

Multiple input neuron

Figure: A Multiple Input Neuron (inputs ξ_1, ξ_2, ξ_3 with weights ω_1, ω_2, ω_3, bias θ, a summing junction Σ, transfer function σ, and output O)

Figure 3 is the diagram representing the following equation:

\[
O = \sigma(\omega_1\xi_1 + \omega_2\xi_2 + \omega_3\xi_3 + \theta)
\]
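As a concrete sketch of this equation (illustrative only; the function and variable names are my own), the multiple-input neuron can be computed as:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    """O = sigma(sum_i w_i * xi_i + theta) for a single neuron."""
    x = sum(w * xi for w, xi in zip(weights, inputs)) + bias
    return sigmoid(x)

# Example with three inputs, matching Figure 3 (values chosen arbitrarily).
print(neuron_output([0.5, -1.0, 2.0], [0.1, 0.4, -0.3], bias=0.2))
```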

A neural network

Figure: A neural network built from these neurons, with an input layer I, a hidden layer J, and an output layer K

The back propagation algorithm

Notation

- x_j^ℓ : Input to node j of layer ℓ
- W_ij^ℓ : Weight from layer ℓ−1 node i to layer ℓ node j
- σ(x) = 1/(1 + e^{−x}) : Sigmoid Transfer Function
- θ_j^ℓ : Bias of node j of layer ℓ
- O_j^ℓ : Output of node j in layer ℓ
- t_j : Target value of node j of the output layer


The error calculation

Given a set of training target values t_k and the output layer outputs O_k, we can write the error as

\[
E = \frac{1}{2}\sum_{k \in K}(O_k - t_k)^2
\]

We let the error of the network for a single training iteration be denoted by E. We want to calculate ∂E/∂W_jk^ℓ, the rate of change of the error with respect to the given connective weight, so we can minimize it.

Now we consider two cases: the node is an output node, or it is in a hidden layer...
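For example, the error of one training iteration can be computed directly from the outputs and targets (a sketch with made-up values; the function name is my own):

```python
def network_error(outputs, targets):
    """E = 1/2 * sum_k (O_k - t_k)^2 over the output layer."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

print(network_error([0.73, 0.12], [1.0, 0.0]))  # e.g. two output nodes
```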

Output layer node

\[
\begin{aligned}
\frac{\partial E}{\partial W_{jk}}
&= \frac{\partial}{\partial W_{jk}}\,\frac{1}{2}\sum_{k \in K}(O_k - t_k)^2 \\
&= (O_k - t_k)\,\frac{\partial}{\partial W_{jk}} O_k \\
&= (O_k - t_k)\,\frac{\partial}{\partial W_{jk}} \sigma(x_k) \\
&= (O_k - t_k)\,\sigma(x_k)\bigl(1 - \sigma(x_k)\bigr)\,\frac{\partial}{\partial W_{jk}} x_k \\
&= (O_k - t_k)\,O_k(1 - O_k)\,O_j
\end{aligned}
\]

For notation purposes I will define δ_k to be the expression (O_k − t_k) O_k (1 − O_k), so we can rewrite the equation above as

\[
\frac{\partial E}{\partial W_{jk}} = O_j \delta_k
\qquad\text{where}\qquad
\delta_k = O_k(1 - O_k)(O_k - t_k)
\]
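In code, the output-layer result ∂E/∂W_jk = O_j δ_k looks like the following sketch (the names and the sample values are illustrative, not from the slides):

```python
def output_delta(o_k, t_k):
    """delta_k = O_k * (1 - O_k) * (O_k - t_k) for an output node."""
    return o_k * (1.0 - o_k) * (o_k - t_k)

def output_weight_gradient(o_j, o_k, t_k):
    """dE/dW_jk = O_j * delta_k."""
    return o_j * output_delta(o_k, t_k)

# Hidden output O_j = 0.6 feeding an output node with O_k = 0.73, target t_k = 1.0
print(output_weight_gradient(0.6, 0.73, 1.0))
```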

Hidden layer node

\[
\begin{aligned}
\frac{\partial E}{\partial W_{ij}}
&= \frac{\partial}{\partial W_{ij}}\,\frac{1}{2}\sum_{k \in K}(O_k - t_k)^2 \\
&= \sum_{k \in K}(O_k - t_k)\,\frac{\partial}{\partial W_{ij}} O_k \\
&= \sum_{k \in K}(O_k - t_k)\,\frac{\partial}{\partial W_{ij}} \sigma(x_k) \\
&= \sum_{k \in K}(O_k - t_k)\,\sigma(x_k)\bigl(1 - \sigma(x_k)\bigr)\,\frac{\partial x_k}{\partial W_{ij}} \\
&= \sum_{k \in K}(O_k - t_k)\,O_k(1 - O_k)\,\frac{\partial x_k}{\partial O_j}\cdot\frac{\partial O_j}{\partial W_{ij}} \\
&= \sum_{k \in K}(O_k - t_k)\,O_k(1 - O_k)\,W_{jk}\,\frac{\partial O_j}{\partial W_{ij}} \\
&= \frac{\partial O_j}{\partial W_{ij}}\sum_{k \in K}(O_k - t_k)\,O_k(1 - O_k)\,W_{jk} \\
&= O_j(1 - O_j)\,\frac{\partial x_j}{\partial W_{ij}}\sum_{k \in K}(O_k - t_k)\,O_k(1 - O_k)\,W_{jk} \\
&= O_j(1 - O_j)\,O_i\sum_{k \in K}(O_k - t_k)\,O_k(1 - O_k)\,W_{jk}
\end{aligned}
\]

But, recalling our definition of δ_k, we can write this as

\[
\frac{\partial E}{\partial W_{ij}} = O_i\,O_j(1 - O_j)\sum_{k \in K}\delta_k W_{jk}
\]

Similar to before, we will now define all terms besides the O_i to be δ_j, so we have

\[
\frac{\partial E}{\partial W_{ij}} = O_i \delta_j
\]
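The hidden-layer result translates into a sketch like this, where w_jk holds the weights W_jk from hidden node j to each output node k (names and sample numbers are illustrative, not from the slides):

```python
def hidden_delta(o_j, deltas_k, w_jk):
    """delta_j = O_j * (1 - O_j) * sum_k delta_k * W_jk."""
    return o_j * (1.0 - o_j) * sum(d * w for d, w in zip(deltas_k, w_jk))

def hidden_weight_gradient(o_i, o_j, deltas_k, w_jk):
    """dE/dW_ij = O_i * delta_j."""
    return o_i * hidden_delta(o_j, deltas_k, w_jk)

# Two output nodes: their deltas and the weights from hidden node j to them.
print(hidden_weight_gradient(0.9, 0.6, [-0.05, 0.02], [0.3, -0.7]))
```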

How weights affect errors

For an output layer node k ∈ K

\[
\frac{\partial E}{\partial W_{jk}} = O_j \delta_k
\qquad\text{where}\qquad
\delta_k = O_k(1 - O_k)(O_k - t_k)
\]

For a hidden layer node j ∈ J

\[
\frac{\partial E}{\partial W_{ij}} = O_i \delta_j
\qquad\text{where}\qquad
\delta_j = O_j(1 - O_j)\sum_{k \in K}\delta_k W_{jk}
\]

What about the bias?

If we incorporate the bias term θ into the equation, you will find that

\[
\frac{\partial O}{\partial \theta} = O(1 - O)\,\frac{\partial \theta}{\partial \theta}
\]

and because ∂θ/∂θ = 1, we can view the bias term as the weight on an input from a node whose output is always one.

This holds for any layer ℓ we are concerned with; a substitution into the previous equations gives us

\[
\frac{\partial E}{\partial \theta} = \delta^{\ell}
\]

(because a constant 1 takes the place of O^{ℓ−1}, the output from the "previous layer")
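One common way to realize this in code is to do exactly what the slide describes: append a constant input of 1 and fold θ in as one more weight, so its gradient is simply δ (a sketch under that assumption; names are my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output_bias_as_weight(inputs, weights_and_bias):
    """Treat theta as a weight attached to a constant input of 1."""
    extended = list(inputs) + [1.0]          # the node "which is always one"
    x = sum(w * xi for w, xi in zip(weights_and_bias, extended))
    return sigmoid(x)

# [w1, w2, w3, theta]: the last entry is the bias; its gradient is just delta.
print(neuron_output_bias_as_weight([0.5, -1.0, 2.0], [0.1, 0.4, -0.3, 0.2]))
```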

The back propagation algorithm

1. Run the network forward with your input data to get the network output.

2. For each output node compute

\[
\delta_k = O_k(1 - O_k)(O_k - t_k)
\]

3. For each hidden node calculate

\[
\delta_j = O_j(1 - O_j)\sum_{k \in K}\delta_k W_{jk}
\]

4. Update the weights and biases as follows. Given

\[
\Delta W = -\eta\,\delta^{\ell} O^{\ell-1},
\qquad
\Delta\theta = -\eta\,\delta^{\ell},
\]

apply

\[
W + \Delta W \to W,
\qquad
\theta + \Delta\theta \to \theta.
\]
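Putting steps 1-4 together, here is a compact Python sketch of one training iteration for a network with a single hidden layer. It illustrates the algorithm above rather than reproducing the author's code; the layer sizes, learning rate η, initialization, and all variable names are assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(prev_outputs, W, theta):
    """O_j = sigma(sum_i W[i][j] * O_i + theta_j) for each node j of the layer."""
    return [sigmoid(sum(W[i][j] * prev_outputs[i] for i in range(len(prev_outputs))) + theta[j])
            for j in range(len(theta))]

def train_step(x, t, W_ij, theta_j, W_jk, theta_k, eta=0.5):
    # 1. Forward pass: input layer I -> hidden layer J -> output layer K.
    O_j = layer_forward(x, W_ij, theta_j)
    O_k = layer_forward(O_j, W_jk, theta_k)

    # 2. Output-layer deltas: delta_k = O_k (1 - O_k) (O_k - t_k).
    delta_k = [o * (1 - o) * (o - tk) for o, tk in zip(O_k, t)]

    # 3. Hidden-layer deltas: delta_j = O_j (1 - O_j) * sum_k delta_k W_jk.
    delta_j = [O_j[j] * (1 - O_j[j]) * sum(delta_k[k] * W_jk[j][k] for k in range(len(delta_k)))
               for j in range(len(O_j))]

    # 4. Updates: Delta W = -eta * delta * O_prev, Delta theta = -eta * delta.
    for j in range(len(O_j)):
        for k in range(len(O_k)):
            W_jk[j][k] -= eta * delta_k[k] * O_j[j]
    for i in range(len(x)):
        for j in range(len(O_j)):
            W_ij[i][j] -= eta * delta_j[j] * x[i]
    for k in range(len(O_k)):
        theta_k[k] -= eta * delta_k[k]
    for j in range(len(O_j)):
        theta_j[j] -= eta * delta_j[j]
    return O_k

# Tiny 2-3-1 network with random weights, trained toward the target [1.0].
random.seed(0)
W_ij = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W_jk = [[random.uniform(-1, 1) for _ in range(1)] for _ in range(3)]
theta_j, theta_k = [0.0] * 3, [0.0] * 1
for _ in range(1000):
    out = train_step([0.5, -0.2], [1.0], W_ij, theta_j, W_jk, theta_k)
print(out)  # the output should move toward 1.0 over the iterations
```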
