MLP
The Multi-layer Perceptron

Dr. Syed Imtiyaz Hassan
Assistant Professor, Department of CSE, Jamia Hamdard (Deemed to be University), New Delhi, India.
https://Syedimtiyazhassan.org
s.imtiyaz@jamiahamdard.ac.in
http://www.jamiahamdard.edu
XOR Revisited
SOLUTION USING MLP
• A single-layer perceptron cannot represent XOR, because the two classes are not linearly separable; adding one hidden layer makes the problem solvable.
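For concreteness, a minimal numpy sketch of the MLP solution with hand-picked step-unit weights (the specific weight values are my own illustrative choices, not taken from the slides): one hidden unit computes OR, the other AND, and the output unit fires for "OR and not AND", which is exactly XOR.

```python
import numpy as np

def step(x):
    return (x >= 0).astype(int)

# Hand-picked weights for a 2-2-1 network (illustrative values).
W_hidden = np.array([[1.0, 1.0],    # hidden unit 1: OR
                     [1.0, 1.0]])   # hidden unit 2: AND
b_hidden = np.array([-0.5, -1.5])   # OR threshold 0.5, AND threshold 1.5
w_out = np.array([1.0, -1.0])       # output = OR AND NOT(AND)  ->  XOR
b_out = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W_hidden @ np.array(x) + b_hidden)
    y = step(w_out @ h + b_out)
    print(x, "->", y)
```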
The Sigmoid Threshold Unit
• A unit that computes o = σ(w · x), where σ(y) = 1 / (1 + e^(−y)) is the sigmoid (logistic) function.
• Unlike the step threshold, σ is differentiable, with the convenient derivative dσ(y)/dy = σ(y)(1 − σ(y)).
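A minimal numpy sketch of the unit and of the derivative identity that backpropagation relies on later (the function names are mine):

```python
import numpy as np

def sigmoid(net):
    # sigma(net) = 1 / (1 + e^(-net))
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_output(w, x):
    # A sigmoid threshold unit: squash the weighted sum of the inputs.
    return sigmoid(np.dot(w, x))

def sigmoid_derivative(net):
    # d sigma / d net = sigma(net) * (1 - sigma(net))
    s = sigmoid(net)
    return s * (1.0 - s)
```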
Adaline
• Adaptive Linear Element
• Proposed by Widrow & Hoff, 1960
Adaline
• x is the vector of input voltages; w is the conductance of controllable resistors.
• Madaline (Many Adalines): Adalines connected through AND logic.
• Adaline and Madaline are single-layer networks.
Adaline
DELTA RULE
• Also known as the LMS rule or the Widrow-Hoff rule.
• Update formula: w ← w + η (t − o) x, where o = w · x is the linear output, t the target, and η the learning rate.
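A sketch of a single LMS update, assuming a linear unit o = w · x (the helper name and learning rate are illustrative):

```python
import numpy as np

def lms_update(w, x, t, eta=0.1):
    """One Widrow-Hoff (LMS / delta rule) step: w <- w + eta * (t - o) * x,
    where o = w . x is the linear unit's output and eta the learning rate."""
    o = np.dot(w, x)
    return w + eta * (t - o) * x
```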
MLP Architecture
The 3-3-2 Network (3 inputs, 3 hidden units, 2 output units)
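A sketch of the forward pass through such a 3-3-2 network, assuming sigmoid units and omitting bias terms for brevity (random weights stand in for trained ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A 3-3-2 network: 3 inputs, a hidden layer of 3 sigmoid units,
# and 2 sigmoid output units. Weights are random placeholders.
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(3, 3))
W_output = rng.normal(scale=0.1, size=(2, 3))

def forward(x):
    h = sigmoid(W_hidden @ x)     # 3 hidden activations
    return sigmoid(W_output @ h)  # 2 outputs

print(forward(np.array([0.5, -0.2, 0.8])))
```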
Gradient Descent
BASIS FOR THE BACKPROPAGATION ALGORITHM
Notation:
• k = number of output units
• d = a training example
• t_d = target output for training example d
• o_d = output of the unit for training example d
• D = the set of training examples
• Error = half of the summed squared difference between targets and outputs:
E(w) ≡ ½ Σ_{d∈D} (t_d − o_d)²
• E is a function of the weight vector w, because the linear unit's output o depends on the weights.
Gradient Descent
BASIS FOR THE BACKPROPAGATION ALGORITHM
• Gradient of E w.r.t. w:
∇E(w) = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
• Training rule: w ← w + Δw, where Δw = −η ∇E(w) and η is the learning rate.
Gradient Descent
BASIS FOR THE BACKPROPAGATION ALGORITHM
• Training rule (in component form):
w_i ← w_i + Δw_i, where Δw_i = −η ∂E/∂w_i
Gradient Descent
BASIS FOR THE BACKPROPAGATION ALGORITHM
• Gradient of the squared error for a linear unit:
∂E/∂w_i = ∂/∂w_i ½ Σ_{d∈D} (t_d − o_d)² = Σ_{d∈D} (t_d − o_d)(−x_id)
Gradient Descent
BASIS FOR THE BACKPROPAGATION ALGORITHM
• Substituting this gradient into the training rule gives the batch weight update:
Δw_i = η Σ_{d∈D} (t_d − o_d) x_id
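Putting the pieces together, a sketch of the batch gradient-descent loop for a single linear unit (parameter names and defaults are mine):

```python
import numpy as np

def gradient_descent(X, T, eta=0.05, epochs=100):
    """Batch gradient descent for a single linear unit.
    X: (N, n) matrix of training inputs; T: (N,) vector of targets."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        O = X @ w                 # o_d for every training example d
        grad_E = -(T - O) @ X     # dE/dw_i = -sum_d (t_d - o_d) x_id
        w -= eta * grad_E         # w <- w - eta * grad E(w)
    return w
```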
Gradient Descent
BASIS FOR THE BACKPROPAGATION ALGORITHM
• A differentiable threshold unit: the sigmoid unit, o = σ(w · x). Because σ′(y) = σ(y)(1 − σ(y)), gradient descent can be applied through the nonlinearity.
Multi-Layer Perceptron
FEEDFORWARD BACKPROPAGATION
• Networks with multiple output units rather than a single unit
Backpropagation Algorithm
• The stochastic gradient descent version of the backpropagation algorithm, for feedforward networks containing two layers of sigmoid units.
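A minimal sketch of that stochastic version for a two-layer sigmoid network (the array shapes, names, and the omission of bias terms are my simplifications):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W_h, W_o, eta=0.05):
    """One epoch of stochastic-gradient backpropagation for a network with
    one layer of hidden sigmoid units (W_h) and one layer of output sigmoid
    units (W_o). Bias terms are omitted for brevity."""
    for x, t in zip(X, T):
        # Forward pass.
        h = sigmoid(W_h @ x)                    # hidden activations
        o = sigmoid(W_o @ h)                    # output activations
        # Backward pass: error terms for output and hidden units.
        delta_o = o * (1 - o) * (t - o)
        delta_h = h * (1 - h) * (W_o.T @ delta_o)
        # Update each weight after this one example (stochastic version).
        W_o += eta * np.outer(delta_o, h)
        W_h += eta * np.outer(delta_h, x)
    return W_h, W_o
```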
Mini-batches
CHANCE TO ESCAPE FROM LOCAL MINIMA
• The batch algorithm converges to a local minimum faster than the sequential algorithm.
• Mini-batch training splits the training set into random batches:
• estimate the gradient on one subset of the training set,
• perform a weight update, and then
• use the next subset to estimate a new gradient and use that for the next weight update,
• repeating until all of the training set has been used.
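A sketch of that loop, assuming a hypothetical grad_fn that returns the gradient estimate for a batch (all names and defaults are illustrative):

```python
import numpy as np

def minibatch_train(X, T, w, grad_fn, eta=0.05, batch_size=32, epochs=10):
    """Mini-batch loop: shuffle, split the training set into batches, and
    perform one weight update per batch. grad_fn(X_b, T_b, w) is assumed
    to return the gradient estimate for that batch."""
    rng = np.random.default_rng(0)
    N = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(N)                 # random batches each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]  # next subset
            w -= eta * grad_fn(X[idx], T[idx], w)  # update, then move on
    return w
```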
Stochastic Gradient Descent
FOR LARGE TRAINING SETS
• The extreme version of the mini-batch idea: use just one training example to estimate the gradient at each iteration, picking that example uniformly at random from the training set.
• Often used when the training set is very large.
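The same loop taken to its extreme, again using the hypothetical grad_fn from the mini-batch sketch:

```python
import numpy as np

def sgd_epoch(X, T, w, grad_fn, eta=0.05):
    """Pure stochastic gradient descent: estimate the gradient from one
    training example picked uniformly at random (a mini-batch of size 1)."""
    rng = np.random.default_rng(0)
    N = X.shape[0]
    for _ in range(N):
        i = rng.integers(N)                        # one example, at random
        w -= eta * grad_fn(X[i:i + 1], T[i:i + 1], w)
    return w
```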
Adding Momentum
• The weight update on the nth iteration depends partially on the update that occurred during the (n − 1)th iteration:
Δw_i(n) = −η ∂E/∂w_i + α Δw_i(n − 1), where 0 ≤ α < 1 is the momentum constant.
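A sketch of a single momentum step (the names and the value of alpha are illustrative):

```python
import numpy as np

def momentum_step(w, grad_E, prev_delta, eta=0.05, alpha=0.9):
    """One gradient step with momentum: the new update blends the current
    gradient with the previous update (alpha is the momentum constant)."""
    delta_w = -eta * grad_E + alpha * prev_delta  # depends on step n-1
    return w + delta_w, delta_w
```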
RBFN
• Radial Basis Function Network: an ANN that uses radial basis functions as activation functions.
• The output of the network is a linear combination of RBFs of the inputs and neuron parameters.
• An RBF is a real-valued function whose value depends only on the distance from the origin (or from a chosen centre).
RBFN
• Radial basis function (RBF) networks typically have three layers: an input layer, a hidden layer with a non-linear RBF activation function, and a linear output layer.
RBFN
Common radial basis functions:
• Euclidean
• Gaussian
• Multiquadric
• …
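A sketch of the forward pass using the Gaussian basis, φ(r) = exp(−r² / 2σ²); the centres and output weights are assumed to be given, and training them is not shown:

```python
import numpy as np

def gaussian_rbf(x, center, sigma=1.0):
    """Gaussian RBF: its value depends only on the distance ||x - center||."""
    r = np.linalg.norm(x - center)
    return np.exp(-(r ** 2) / (2.0 * sigma ** 2))

def rbfn_forward(x, centers, weights, sigma=1.0):
    """RBFN forward pass: non-linear RBF hidden layer followed by a
    linear output layer (a linear combination of the RBF activations)."""
    phi = np.array([gaussian_rbf(x, c, sigma) for c in centers])
    return weights @ phi
```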
ART
• Adaptive Resonance Theory
• Developed by Stephen Grossberg and Gail Carpenter in 1987.
• The basic ART system is an unsupervised learning model.
• Always open to new learning (adaptive) without losing the old patterns (resonance).
ART Operating Principle
• Recognition phase: the input vector is compared with the classification presented at every node in the output layer. The output of a neuron becomes "1" if it best matches the classification applied; otherwise it becomes "0".
• Comparison phase: the input vector is compared with the comparison-layer vector. The condition for reset is that the degree of similarity is less than the vigilance parameter.
ART Operating Principle
• Search phase: the network searches for a reset as well as for the match found in the phases above. If there is no reset and the match is quite good, the classification is over. Otherwise, the process is repeated and other stored patterns are tried until the correct match is found.
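A much-simplified, ART-1-flavoured sketch of the recognition / comparison / search loop for binary inputs (this compresses the F1/F2 layer dynamics of real ART into a few lines; rho is the vigilance parameter, and the choice function is the standard |x ∧ w| / (β + |w|) form):

```python
import numpy as np

def art1_classify(inputs, rho=0.7, beta=1.0):
    """Much-simplified ART-1-style clustering of binary vectors.
    Illustrative only; real ART 1 has explicit F1/F2 layer dynamics."""
    prototypes = []                      # stored patterns (categories)
    labels = []
    for x in inputs:
        x = np.asarray(x)
        # Recognition: rank categories by the choice function.
        order = sorted(range(len(prototypes)),
                       key=lambda j: -np.sum(np.minimum(x, prototypes[j]))
                                      / (beta + np.sum(prototypes[j])))
        for j in order:
            # Comparison: vigilance test against the best remaining match.
            match = np.sum(np.minimum(x, prototypes[j])) / max(np.sum(x), 1)
            if match >= rho:
                prototypes[j] = np.minimum(x, prototypes[j])  # resonance
                labels.append(j)
                break
        else:
            # Search exhausted without resonance: create a new category.
            prototypes.append(x.copy())
            labels.append(len(prototypes) - 1)
    return labels, prototypes
```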
ART Types
• ART 1
• ART 2
• ARTMAP (Predictive ART)
• Fuzzy ART
• Fuzzy ARTMAP
• Gaussian ART
• Gaussian ARTMAP
Summary
• Adaline
• Delta Rule
• Gradient Descent
• Backpropagation
• RBFN
• ART
Thank You