Slide 1

11. 10. 2001, NIMIA, Crema, Italy

Identification and Neural Networks

I S R G

G. Horváth

Department of Measurement and Information Systems

Slide 2

Modular networks: why a modular approach

Motivations

Biological

Learning

Computational

Implementation

Slide 3

Motivations: biological

Biological systems are not homogeneous

Functional specialization

Fault tolerance

Cooperation, competition

Scalability

Extendibility

Slide 4

Motivations: complexity of learning (divide and conquer)

Training of complex networks (many layers): layer-by-layer learning

Speed of learning

Catastrophic interference, incremental learning

Mixing supervised and unsupervised learning

Hierarchical knowledge structure

Slide 5

Motivations: computational

The capacity of a network

The size of the network

Catastrophic interference

Generalization capability vs network complexity

Slide 6

Motivations: implementation (hardware)

The degree of parallelism

Number of connections

The length of physical connections

Fan out

Slide 7

Modular networks: what kind of modules

Every module solves the same, whole problem; the modules disagree on some inputs, giving different ways of solution (different modules)

Every module solves a different task (sub-task): task decomposition (input space, output space)

Slide 8

Modular networks: how to combine modules

Cooperative modules: simple average; weighted average (fixed weights); optimal linear combination (OLC) of networks

Competitive modules: majority vote; winner-takes-all

Competitive/cooperative modules: weighted average (input-dependent weights); mixture of experts (MOE)
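The combination rules listed above are simple enough to state in a few lines. The following is a minimal sketch (my own illustration, not taken from the lecture), assuming the module outputs are already available as NumPy arrays; all numbers are made up.

```python
import numpy as np

# Outputs of M = 3 modules for the same input (regression case)
y = np.array([0.9, 1.1, 1.4])

# Cooperative combination
simple_average = y.mean()
alpha = np.array([0.5, 0.3, 0.2])          # fixed weights, summing to 1
weighted_average = alpha @ y

# Competitive combination (classification case): class labels from 3 modules
votes = np.array([2, 1, 2])
majority = np.bincount(votes).argmax()      # majority vote
winner_takes_all = y.argmax()               # index of the "strongest" module

print(simple_average, weighted_average, majority, winner_takes_all)
```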

Slide 9

Modular networks: construction of modular networks

Task decomposition, subtask definition

Training modules for solving subtasks

Integration of the results

(cooperation and/or competition)

Slide 11

Cooperative networks: ensemble of cooperating networks (classification/regression)

Motivation, heuristic explanation:

Different experts together can solve a problem better

Complementary knowledge

Mathematical justification: accurate and diverse modules

Slide 12

Ensemble of networks: mathematical justification

Ensemble output: $y(\mathbf{x}) = \sum_{j} \alpha_j y_j(\mathbf{x})$

Individual error: $\varepsilon_j(\mathbf{x}) = \big(d(\mathbf{x}) - y_j(\mathbf{x})\big)^2$

Ambiguity (diversity): $a_j(\mathbf{x}) = \big(y_j(\mathbf{x}) - y(\mathbf{x})\big)^2$

Ensemble error: $e(\mathbf{x}) = \big(d(\mathbf{x}) - y(\mathbf{x})\big)^2$

Constraint: $\sum_{j} \alpha_j = 1, \quad \alpha_j \ge 0$

Slide 13

Ensemble of networks: mathematical justification (cont'd)

Weighted error: $\bar{\varepsilon}(\mathbf{x}) = \sum_{j} \alpha_j\, \varepsilon_j(\mathbf{x})$

Weighted diversity: $\bar{a}(\mathbf{x}) = \sum_{j} \alpha_j\, a_j(\mathbf{x})$

Ensemble error: $e(\mathbf{x}) = \bar{\varepsilon}(\mathbf{x}) - \bar{a}(\mathbf{x})$

Averaging over the input distribution $f(\mathbf{x})$:
$E = \int e(\mathbf{x}) f(\mathbf{x})\, d\mathbf{x}, \qquad \bar{E} = \int \bar{\varepsilon}(\mathbf{x}) f(\mathbf{x})\, d\mathbf{x}, \qquad \bar{A} = \int \bar{a}(\mathbf{x}) f(\mathbf{x})\, d\mathbf{x}$

$E = \bar{E} - \bar{A}$

Solution: an ensemble of accurate (small $\bar{E}$) and diverse (large $\bar{A}$) networks
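The relation $E = \bar{E} - \bar{A}$ holds pointwise and after averaging, and it is easy to check numerically. A small sketch (my own, with randomly generated member outputs and weights summing to one):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 1000                                   # ensemble members, input samples
d = rng.normal(size=N)                           # desired output d(x)
y_j = d + rng.normal(scale=0.5, size=(M, N))     # member outputs y_j(x)
alpha = rng.random(M); alpha /= alpha.sum()      # weights, sum to 1

y = alpha @ y_j                                  # ensemble output
eps = alpha @ (d - y_j) ** 2                     # weighted member error
amb = alpha @ (y_j - y) ** 2                     # weighted ambiguity
ens = (d - y) ** 2                               # ensemble error

E_bar, A_bar, E = eps.mean(), amb.mean(), ens.mean()
print(E, E_bar - A_bar)                          # the two numbers coincide: E = E_bar - A_bar
```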

Slide 14

Ensemble of networks: how to get accurate and diverse networks

different structures: more than one network structure (e.g. MLP, RBF, CCN, etc.)

networks of different size and complexity (number of hidden units, number of layers, nonlinear function, etc.)

different learning strategies (BP, CG, random search, etc.); batch learning or sequential learning

different training algorithms, sample order, learning samples

different training parameters

different starting parameter values

different stopping criteria
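Several of the items above (different sizes, different starting parameter values) can be realized simply by retraining the same architecture with different settings. A hedged illustration using scikit-learn's MLPRegressor (my own choice of tool, not the one used in the lecture):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

# Diversity from different sizes and different starting parameter values (seeds)
members = [MLPRegressor(hidden_layer_sizes=(h,), random_state=s,
                        max_iter=2000).fit(X, y)
           for h, s in [(5, 0), (10, 1), (20, 2), (10, 3)]]

preds = np.stack([m.predict(X) for m in members])
print(((preds.mean(axis=0) - y) ** 2).mean())   # error of the simple-average ensemble
```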

Slide 15

Linear combination of networks

[Figure: M parallel networks NN1 ... NNM receive the common input x; their outputs y1 ... yM, together with a constant y0 = 1, are weighted by the coefficients α0 ... αM and summed (Σ) to give the combined output.]

$y(\mathbf{x}) = \sum_{j=0}^{M} \alpha_j y_j(\mathbf{x}), \qquad y_0 = 1$

Slide 16

Linear combination of networks

Computation of the optimal coefficients:

simple average: $\alpha_k = \frac{1}{M}, \quad k = 1, \ldots, M$

input-dependent selection: $\alpha_k = 1$, $\alpha_j = 0$ for $j \ne k$, where $k$ depends on the input (for different input domains a different network alone gives the output)

optimal values using the constraint $\sum_k \alpha_k = 1$

optimal values without any constraint: Wiener-Hopf equation
$\boldsymbol{\alpha}^{*} = \mathbf{R}^{-1}\mathbf{P}, \qquad \mathbf{R} = E\{\mathbf{y}(\mathbf{x})\,\mathbf{y}(\mathbf{x})^{T}\}, \qquad \mathbf{P} = E\{d(\mathbf{x})\,\mathbf{y}(\mathbf{x})\}$
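With R and P replaced by sample averages over a validation set, the unconstrained optimum reduces to solving one linear system. A minimal sketch (my own, with synthetic member outputs):

```python
import numpy as np

rng = np.random.default_rng(2)
P_samples, M = 500, 3
d = rng.normal(size=P_samples)                         # desired output d(x)
Y = d + rng.normal(scale=0.3, size=(P_samples, M))     # member outputs, one column per network

R = Y.T @ Y / P_samples            # sample estimate of R = E{y y^T}
p = Y.T @ d / P_samples            # sample estimate of P = E{d y}
alpha_opt = np.linalg.solve(R, p)  # Wiener-Hopf: alpha* = R^{-1} P

y_comb = Y @ alpha_opt
print(alpha_opt, ((d - y_comb) ** 2).mean())
```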

Slide 17

Task decomposition: decomposition related to learning

before learning (subtask definition)

during learning (automatic task decomposition)

Problem-space decomposition:

input space (input-space clustering, definition of different input regions)

output space (desired response)

Slide 18

Task decomposition: decomposition into separate subproblems

K-class classification decomposed into K two-class problems (coarse decomposition)

Complex two-class problems decomposed into smaller two-class problems (fine decomposition)

Integration (module combination)

Slide 19

Task decomposition: a 3-class problem

Slide 20

Task decomposition: 3 classes

[Figure: decomposition into two groups of 2 small classes each.]

Slide 21

Task decomposition: 3 classes

[Figure: 2 classes, each further decomposed into 2 small classes.]

Slide 22

Task decomposition: 3 classes

[Figure: the resulting small two-class subproblems (2 small classes against 2 small classes).]

Slide 23

Task decomposition

[Figure: integration network for the 3-class problem. The pairwise modules M12, M13, M23 share the common input; each class output C1, C2, C3 is obtained by a MIN unit over the relevant module outputs, with inverter (INV) units where a module's output has to be negated.]

Slide 24

Task decomposition: a two-class problem decomposed into subtasks

Slide 25

Task decomposition

[Figure: the subtask modules are combined as (M11 AND M12) OR (M21 AND M22).]

Slide 26

Task decomposition

[Figure: the same combination with continuous module outputs: C1 = MAX( MIN(M11, M12), MIN(M21, M22) ), all modules receiving the common input.]

Slide 27

Task decomposition: training set decomposition

Original training set: $T = \{\mathbf{x}^{(l)}, y^{(l)}\}_{l=1}^{L}$

Training set for each of the K two-class problems: $T_i = \{\mathbf{x}^{(l)}, y_i^{(l)}\}_{l=1}^{L}, \quad i = 1, \ldots, K$, where
$y_i^{(l)} = \begin{cases} +1 & \text{if } \mathbf{x}^{(l)} \in \text{class } C_i \\ -1 & \text{if } \mathbf{x}^{(l)} \in \text{all classes except } C_i \end{cases}$

Each of the two-class problems is divided into K-1 smaller two-class problems [using an inverter module, (K-1)/2 of them are really enough]:
$T_{ij} = \{\mathbf{x}_i^{(l)}, +1\}_{l=1}^{L_i} \cup \{\mathbf{x}_j^{(l)}, -1\}_{l=1}^{L_j}, \quad i, j = 1, \ldots, K, \ j \ne i$
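As an illustration of this decomposition (a sketch of mine, using scikit-learn logistic-regression modules rather than the lecture's networks), the 3-class example can be handled with one module per class pair and a MIN-type integration, inverting a module's output where needed:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=0)   # a 3-class toy problem
K = 3

# One module M_ij per class pair, trained only on the samples of classes i and j
modules = {}
for i, j in combinations(range(K), 2):
    mask = np.isin(y, [i, j])
    modules[(i, j)] = LogisticRegression().fit(X[mask], (y[mask] == i).astype(int))

def classify(x):
    x = x.reshape(1, -1)
    scores = []
    for c in range(K):
        outs = []
        for (i, j), m in modules.items():
            p = m.predict_proba(x)[0, 1]        # probability of class i
            if c == i:
                outs.append(p)
            elif c == j:
                outs.append(1 - p)              # inverter (INV) module
        scores.append(min(outs))                # MIN integration per class
    return int(np.argmax(scores))

print(sum(classify(x) == t for x, t in zip(X, y)) / len(y))    # training accuracy
```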

Slide 28

Task decomposition: a practical example, ZIP code recognition

[Figure: preprocessing pipeline. The input digit (16 x 16 pixels) is normalized, then edge detection with Kirsch masks produces four 16 x 16 feature maps (horizontal, vertical, diagonal \ and diagonal / edges), which are subsampled to four 8 x 8 matrices.]

Slide 29

Task decomposition: ZIP code recognition (handwritten character recognition), modular solution

45 = K(K-1)/2 pairwise neurons (K = 10 classes)

10 AND gates (MIN operators)

256 + 1 inputs

Slide 30

Mixture of Experts (MOE)

[Figure: M expert networks (Expert 1 ... Expert M) and a gating network all receive the input x. The gating network produces the weights g1 ... gM; the expert outputs μ1 ... μM are weighted by them and summed (Σ) to give the overall output μ.]

Slide 31

Mixture of Experts (MOE)

The output is the weighted sum of the outputs of the experts:
$\boldsymbol{\mu} = \sum_{i=1}^{M} g_i \boldsymbol{\mu}_i, \qquad \boldsymbol{\mu}_i = f_i(\mathbf{x}, \boldsymbol{\Theta}_i)$
where $\boldsymbol{\Theta}_i$ is the parameter of the i-th expert.

The output of the gating network is the "softmax" function:
$g_i = \frac{e^{u_i}}{\sum_{j=1}^{M} e^{u_j}}, \qquad u_i = \mathbf{v}_i^T \mathbf{x}, \qquad 0 \le g_i \le 1, \quad \sum_{i=1}^{M} g_i = 1$
where $\mathbf{v}_i$ is the parameter of the gating network.
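A direct transcription of these two formulas, for linear experts with made-up parameters (my own sketch, not the trained system from the lecture):

```python
import numpy as np

def moe_output(x, expert_params, gating_params):
    """Mixture-of-experts forward pass with linear experts and softmax gating."""
    mu_i = np.array([W @ x for W in expert_params])   # expert outputs mu_i
    u = gating_params @ x                             # u_i = v_i^T x
    g = np.exp(u - u.max()); g /= g.sum()             # softmax gating, sums to 1
    return g @ mu_i, g                                # mu = sum_i g_i mu_i

rng = np.random.default_rng(3)
M, n_in = 3, 4
experts = [rng.normal(size=n_in) for _ in range(M)]   # one weight vector per scalar-output expert
gating = rng.normal(size=(M, n_in))                   # one gating weight vector v_i per expert
mu, g = moe_output(rng.normal(size=n_in), experts, gating)
print(mu, g)
```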

Slide 32

Mixture of Experts (MOE): probabilistic interpretation

$\boldsymbol{\mu}_i = E[\mathbf{y} \,|\, \mathbf{x}, \boldsymbol{\Theta}_i], \qquad g_i = P(i \,|\, \mathbf{x}, \mathbf{v}_i)$

The probabilistic model with true parameters $\boldsymbol{\Theta}^0$:
$P(\mathbf{y} \,|\, \mathbf{x}, \boldsymbol{\Theta}^0) = \sum_{i} g_i(\mathbf{x}, \mathbf{v}_i^0)\, P(\mathbf{y} \,|\, \mathbf{x}, \boldsymbol{\Theta}_i^0)$

a priori probability: $g_i(\mathbf{x}, \mathbf{v}_i^0) = P(i \,|\, \mathbf{x}, \mathbf{v}_i^0)$

Slide 33

Mixture of Experts (MOE): training

Training data: $X = \{\mathbf{x}^{(l)}, \mathbf{y}^{(l)}\}_{l=1}^{L}$

Probability of generating the output from the input:
$P(\mathbf{y}^{(l)} \,|\, \mathbf{x}^{(l)}, \boldsymbol{\Theta}) = \sum_{i} P(i \,|\, \mathbf{x}^{(l)}, \mathbf{v}_i)\, P(\mathbf{y}^{(l)} \,|\, \mathbf{x}^{(l)}, \boldsymbol{\Theta}_i)$

$P(\mathbf{y} \,|\, \mathbf{x}, \boldsymbol{\Theta}) = \prod_{l=1}^{L} P(\mathbf{y}^{(l)} \,|\, \mathbf{x}^{(l)}, \boldsymbol{\Theta}) = \prod_{l=1}^{L} \sum_{i} P(i \,|\, \mathbf{x}^{(l)}, \mathbf{v}_i)\, P(\mathbf{y}^{(l)} \,|\, \mathbf{x}^{(l)}, \boldsymbol{\Theta}_i)$

The log-likelihood function (maximum-likelihood estimation):
$L(\boldsymbol{\Theta}, X) = \sum_{l} \log \sum_{i} P(i \,|\, \mathbf{x}^{(l)}, \mathbf{v}_i)\, P(\mathbf{y}^{(l)} \,|\, \mathbf{x}^{(l)}, \boldsymbol{\Theta}_i)$

Slide 34

Mixture of Experts (MOE): training (cont'd)

Gradient method: the log-likelihood $L(\boldsymbol{\Theta}, X)$ is maximized by gradient ascent with respect to the expert parameters $\boldsymbol{\Theta}_i$ and the gating parameters $\mathbf{v}_i$.

The parameters of the expert network:
$\boldsymbol{\Theta}_i(k+1) = \boldsymbol{\Theta}_i(k) + \eta \sum_{l=1}^{L} h_i^{(l)} \big(\mathbf{y}^{(l)} - \boldsymbol{\mu}_i^{(l)}\big)$

The parameters of the gating network:
$\mathbf{v}_i(k+1) = \mathbf{v}_i(k) + \eta \sum_{l=1}^{L} \big(h_i^{(l)} - g_i^{(l)}\big)\, \mathbf{x}^{(l)}$

Slide 35

Mixture of Experts (MOE): training (cont'd)

A priori probability: $g_i^{(l)} = g_i(\mathbf{x}^{(l)}, \mathbf{v}_i) = P(i \,|\, \mathbf{x}^{(l)}, \mathbf{v}_i)$

A posteriori probability:
$h_i^{(l)} = \frac{g_i^{(l)}\, P(\mathbf{y}^{(l)} \,|\, \mathbf{x}^{(l)}, \boldsymbol{\Theta}_i)}{\sum_{j} g_j^{(l)}\, P(\mathbf{y}^{(l)} \,|\, \mathbf{x}^{(l)}, \boldsymbol{\Theta}_j)}$

Slide 36

Mixture of Experts (MOE): training (cont'd)

EM (Expectation-Maximization) algorithm

A general iterative technique for maximum-likelihood estimation: introduce hidden variables and define a log-likelihood function over them

Two steps: expectation of the hidden variables; maximization of the log-likelihood function

Slide 37

EM (Expectation-Maximization) algorithm: a simple example, estimating the means of k (= 2) Gaussians

[Figure: the two component densities f(y | μ1) and f(y | μ2) drawn over the measurements.]

Slide 38

EM (Expectation-Maximization) algorithm: a simple example, estimating the means of k (= 2) Gaussians

Hidden variables for every observation: $(x^{(l)}, z_1^{(l)}, z_2^{(l)})$, where
$z_1^{(l)} = 1,\ z_2^{(l)} = 0$ if $x^{(l)} \in X^{(1)}$ (generated by the first Gaussian), and
$z_1^{(l)} = 0,\ z_2^{(l)} = 1$ if $x^{(l)} \in X^{(2)}$ (generated by the second Gaussian)

Likelihood function: $f(x^{(l)}, \boldsymbol{\mu}) = \prod_{i=1}^{k} \big[f_i(x^{(l)} \,|\, \mu_i)\big]^{z_i^{(l)}}$

Log-likelihood function: $\log f(x^{(l)}, \boldsymbol{\mu}) = \sum_{i=1}^{k} z_i^{(l)} \log f_i(x^{(l)} \,|\, \mu_i)$

Expected value of $z_i^{(l)}$ with $\mu_1$ and $\mu_2$ given:
$E[z_1^{(l)}] = \frac{f(x^{(l)} \,|\, \mu_1)}{\sum_{j=1}^{2} f(x^{(l)} \,|\, \mu_j)}, \qquad E[z_2^{(l)}] = \frac{f(x^{(l)} \,|\, \mu_2)}{\sum_{j=1}^{2} f(x^{(l)} \,|\, \mu_j)}$

Slide 39

Mixture of Experts (MOE): a simple example, estimating the means of k (= 2) Gaussians

Expected log-likelihood function:
$E[\mathcal{L}] = E\Big[\sum_{i=1}^{k} z_i^{(l)} \log f_i(x^{(l)} \,|\, \mu_i)\Big] = \sum_{i=1}^{k} \frac{f(x^{(l)} \,|\, \mu_i)}{\sum_{j=1}^{2} f(x^{(l)} \,|\, \mu_j)} \log f_i(x^{(l)} \,|\, \mu_i)$

where
$f(x^{(l)} \,|\, \mu_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{1}{2\sigma^2}\big(x^{(l)} - \mu_i\big)^2\Big]$

$\log f(x^{(l)} \,|\, \mu_i) = \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\big(x^{(l)} - \mu_i\big)^2$

The estimate of the means:
$\hat{\mu}_i = \frac{\sum_{l=1}^{L} E[z_i^{(l)}]\, x^{(l)}}{\sum_{l=1}^{L} E[z_i^{(l)}]}$
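Iterating the expectation of the hidden variables and the re-estimation of the means gives the EM loop. A small sketch (my own code following the equations above, with equal and known variances):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])  # L samples
sigma = 1.0
mu = np.array([-1.0, 1.0])                       # initial guesses for mu_1, mu_2

for _ in range(50):
    # E-step: expected values of the hidden indicators z_i
    f = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))   # shape (L, 2); common factor cancels
    Ez = f / f.sum(axis=1, keepdims=True)
    # M-step: re-estimate the means from the responsibilities
    mu = (Ez * x[:, None]).sum(axis=0) / Ez.sum(axis=0)

print(mu)        # close to the true means -2 and 3
```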

Slide 40

Mixture of Experts (MOE): applications

Simple experts: linear experts

ECG diagnostics

Mixture of Kalman filters

Discussion: comparison to non-modular architecture

Slide 41

Support vector machines: a new approach

Gives answers to questions not solved by the classical approach:

The size of the network

The generalization capability

Slide 42

Support vector machines: classification, optimal hyperplane

[Figure: comparison of classical neural learning and Support Vector Machine learning.]

Slide 43

$\mathbf{w} = [w_0, w_1, \ldots, w_M]^T, \qquad \mathbf{x} = [x_0, x_1, \ldots, x_M]^T$

$y = \mathbf{w}^T \mathbf{x} = \sum_{j=0}^{M} w_j x_j$

VC dimension

Slide 44

$v_{\mathrm{guaranteed}} = v_{\mathrm{teach}} + f(VC, \ldots)$

(the guaranteed generalization error is the training error plus a confidence term that depends on the VC dimension)

Structural risk minimization

Slide 45

Support vector machines: linearly separable two-class problem

Training set: $\{\mathbf{x}_i, y_i\}_{i=1}^{P}, \qquad y_i = +1 \text{ if } \mathbf{x}_i \in X_1, \quad y_i = -1 \text{ if } \mathbf{x}_i \in X_2$

Separating hyperplane: $\mathbf{w}^T \mathbf{x} + b = 0$

$\mathbf{w}^T \mathbf{x}_i + b \ge +1 \text{ if } \mathbf{x}_i \in X_1, \qquad \mathbf{w}^T \mathbf{x}_i + b \le -1 \text{ if } \mathbf{x}_i \in X_2$

in compact form: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \quad \forall i$

[Figure: the optimal (maximum-margin) hyperplane.]

Slide 46

Support vector machines: geometric interpretation

Distance of a point from the hyperplane: $d(\mathbf{x}, \mathbf{w}, b) = \frac{\mathbf{w}^T \mathbf{x} + b}{\|\mathbf{w}\|}$

Margin:
$\rho(\mathbf{w}, b) = \min_{\{\mathbf{x}_i;\, y_i=+1\}} d(\mathbf{x}_i, \mathbf{w}, b) + \min_{\{\mathbf{x}_i;\, y_i=-1\}} \big(-d(\mathbf{x}_i, \mathbf{w}, b)\big) = \frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$

[Figure: geometric interpretation; d(x) is the distance of a point x from the separating hyperplane, and the margin is measured between the closest points x1 and x2 of the two classes.]

Slide 47

Support vector machines: criterion function, Lagrange function

A constrained optimization problem: minimize $\Phi(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$

Lagrange function:
$J(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{P} \alpha_i \big[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\big], \qquad \max_{\boldsymbol{\alpha}} \min_{\mathbf{w}, b} J(\mathbf{w}, b, \boldsymbol{\alpha})$

Conditions:
$\frac{\partial J}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{P} \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial J}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{P} \alpha_i y_i = 0$

Dual problem:
$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \max_{\boldsymbol{\alpha}} \Big\{ \sum_{i=1}^{P} \alpha_i - \frac{1}{2} \sum_{i=1}^{P} \sum_{j=1}^{P} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j \Big\}, \qquad \alpha_i \ge 0, \quad \sum_{i=1}^{P} \alpha_i y_i = 0$

Support vectors: the $\mathbf{x}_i$ with $\alpha_i > 0$

Optimal hyperplane: $\mathbf{w}_0 = \sum_{i=1}^{P} \alpha_i y_i \mathbf{x}_i$
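In practice the dual quadratic program is solved by a library; the fragment below (an illustrative sketch assuming scikit-learn, not the lecture's own software) fits a linear maximum-margin classifier on separable toy data and reads off the support vectors and the margin 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=5)
svm = SVC(kernel="linear", C=1e6).fit(X, y)      # very large C approximates the hard margin

w = svm.coef_[0]
b = svm.intercept_[0]
margin = 2 / np.linalg.norm(w)                   # rho = 2 / ||w||
print("support vectors:", svm.support_vectors_)
print("w =", w, "b =", b, "margin =", margin)
```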

Slide 48

Support vector machines: linearly nonseparable case

Separating hyperplane (with slack variables): $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, P$

Criterion function: $\Phi(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{P}\xi_i$

Lagrange function:
$J(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{P}\xi_i - \sum_{i=1}^{P}\alpha_i\big[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\big] - \sum_{i=1}^{P}\mu_i\xi_i$

with $0 \le \alpha_i \le C$

Support vectors: the $\mathbf{x}_i$ with $\alpha_i > 0$

Optimal hyperplane: $\mathbf{w}_0 = \sum_{i=1}^{P} \alpha_i y_i \mathbf{x}_i$

Slide 49

Support vector machines: nonlinear separation

Separating hyperplane (in the space of the basis functions, with $\varphi_0(\mathbf{x}) = 1$): $\mathbf{w}^T\boldsymbol{\varphi}(\mathbf{x}) = \sum_{j=0}^{M} w_j \varphi_j(\mathbf{x}) = 0$

Decision surface: $\sum_{i=1}^{P} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) = \sum_{i=1}^{P} \alpha_i y_i \sum_{j=0}^{M} \varphi_j(\mathbf{x})\varphi_j(\mathbf{x}_i) = 0$

Kernel function: $K(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\varphi}^T(\mathbf{x}_i)\, \boldsymbol{\varphi}(\mathbf{x}_j)$

Criterion function (dual):
$W(\boldsymbol{\alpha}) = \sum_{i=1}^{P} \alpha_i - \frac{1}{2} \sum_{i=1}^{P} \sum_{j=1}^{P} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

Slide 50

Support vector machines: examples of SVM kernels

Polynomial: $K(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x}^T \mathbf{x}_i + 1)^d, \quad d = 1, 2, \ldots$

RBF: $K(\mathbf{x}, \mathbf{x}_i) = \exp\Big(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{x}_i\|^2\Big)$

MLP (sigmoid): $K(\mathbf{x}, \mathbf{x}_i) = \tanh\big(\beta_0\, \mathbf{x}^T \mathbf{x}_i + \beta_1\big)$
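These three kernels map directly onto the kernel options of common SVM libraries. A short sketch (assuming scikit-learn; the parameter values are arbitrary):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=6)

models = {
    "polynomial": SVC(kernel="poly", degree=3, coef0=1.0),     # (x^T x_i + 1)^d
    "RBF":        SVC(kernel="rbf", gamma=0.5),                # exp(-gamma ||x - x_i||^2)
    "MLP-like":   SVC(kernel="sigmoid", gamma=0.1, coef0=-1),  # tanh(gamma x^T x_i + coef0)
}
for name, m in models.items():
    print(name, m.fit(X, y).score(X, y))
```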

Slide 51

Support vector machines: example, polynomial kernel (d = 2, two-dimensional input)

Kernel function:
$K(\mathbf{x}, \mathbf{x}_i) = (1 + \mathbf{x}^T \mathbf{x}_i)^2 = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$

Basis functions:
$\boldsymbol{\varphi}(\mathbf{x}_i) = \big[1,\; x_{i1}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^2,\; \sqrt{2}\, x_{i1},\; \sqrt{2}\, x_{i2}\big]^T$
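The identity $K(\mathbf{x}, \mathbf{x}_i) = \boldsymbol{\varphi}^T(\mathbf{x})\,\boldsymbol{\varphi}(\mathbf{x}_i)$ is easy to verify numerically; a short check of my own with arbitrary vectors:

```python
import numpy as np

def phi(x):
    # Basis functions of the d = 2 polynomial kernel in two dimensions
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

x, xi = np.array([0.7, -1.2]), np.array([2.0, 0.5])
print((1 + x @ xi) ** 2, phi(x) @ phi(xi))   # the two values agree
```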

Slide 52

SVM (classification): summary

Separable samples:
Minimize: $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$
Constraint: $d_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, P$

Not separable samples:
Minimize: $\Phi(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{P}\xi_i$
Constraint: $d_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, P$

By minimizing $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$ we maximize the distance between the classes, whilst we also control the VC dimension.

Slide 53

SVR (regression)

The $\varepsilon$-insensitive cost function $C_\varepsilon(\cdot)$:
$C_\varepsilon\big(y, f(\mathbf{x}, \mathbf{w})\big) = \begin{cases} \big|y - f(\mathbf{x}, \mathbf{w})\big| - \varepsilon & \text{if } \big|y - f(\mathbf{x}, \mathbf{w})\big| \ge \varepsilon \\ 0 & \text{otherwise} \end{cases}$

Slide 54

SVR (regression)

Regression function: $y = \sum_{j=0}^{M} w_j \varphi_j(\mathbf{x})$

Minimize: $F(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}') = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{P}\big(\xi_i + \xi_i'\big)$

Constraints:
$d_i - \mathbf{w}^T\mathbf{x}_i \le \varepsilon + \xi_i, \qquad \mathbf{w}^T\mathbf{x}_i - d_i \le \varepsilon + \xi_i', \qquad \xi_i \ge 0, \quad \xi_i' \ge 0, \quad i = 1, 2, \ldots, P$
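This is the formulation that standard SVR implementations solve; the sketch below (my own, assuming scikit-learn) fits an RBF-kernel SVR, where epsilon is the width of the insensitive tube and C penalizes the slack variables:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sinc(X[:, 0]) + 0.1 * rng.normal(size=80)

# epsilon: width of the insensitive tube; C: penalty on the slack variables
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("number of support vectors:", len(svr.support_))
print("fit error:", ((svr.predict(X) - y) ** 2).mean())
```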

Slide 55

SVR (regression): Lagrange function, dual problem

Lagrange function:
$J(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}', \boldsymbol{\alpha}, \boldsymbol{\alpha}', \boldsymbol{\gamma}, \boldsymbol{\gamma}') = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{P}(\xi_i + \xi_i') - \sum_{i=1}^{P}\alpha_i\big(\mathbf{w}^T\mathbf{x}_i - y_i + \varepsilon + \xi_i\big) - \sum_{i=1}^{P}\alpha_i'\big(y_i - \mathbf{w}^T\mathbf{x}_i + \varepsilon + \xi_i'\big) - \sum_{i=1}^{P}\big(\gamma_i\xi_i + \gamma_i'\xi_i'\big)$

Dual problem:
$W(\boldsymbol{\alpha}, \boldsymbol{\alpha}') = \sum_{i=1}^{P} y_i(\alpha_i - \alpha_i') - \varepsilon\sum_{i=1}^{P}(\alpha_i + \alpha_i') - \frac{1}{2}\sum_{i=1}^{P}\sum_{j=1}^{P}(\alpha_i - \alpha_i')(\alpha_j - \alpha_j')\, K(\mathbf{x}_i, \mathbf{x}_j)$

Constraints: $\sum_{i=1}^{P}(\alpha_i - \alpha_i') = 0, \qquad 0 \le \alpha_i \le C, \qquad 0 \le \alpha_i' \le C$

Support vectors: the $\mathbf{x}_i$ for which $\alpha_i - \alpha_i' \ne 0$

Solution: $\mathbf{w} = \sum_{i=1}^{P}(\alpha_i - \alpha_i')\,\mathbf{x}_i$

Slide 56

SVR (regression)

Slide 57

SVR (regression)

Slide 58

SVR (regression)

Slide 59

SVR (regression)

Slide 60

Support vector machines: main advantages

generalization

size of the network

centre parameters for RBF

linear-in-the-parameter structure

noise immunity

Slide 61

Support vector machines: main disadvantages

computation intensive (quadratic optimization)

hyper-parameter selection

VC dimension (classification)

batch processing

Slide 62

Support vector machines: variants

LS-SVM (least-squares SVM)

basic criterion function

Advantages: easier to compute, adaptivity
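The slide does not spell the LS-SVM criterion out; in the usual least-squares formulation the inequality constraints become equalities, the squared slack variables are penalized, and training reduces to a single linear system instead of a quadratic program. A sketch of that standard regression formulation with an RBF kernel (an assumption of mine, not necessarily the exact variant meant here):

```python
import numpy as np

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the standard LS-SVM regression system [[0, 1^T], [1, K + I/gamma]] [b; a] = [0; y]."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))                        # RBF kernel matrix
    n = len(y)
    A = np.block([[np.zeros((1, 1)), np.ones((1, n))],
                  [np.ones((n, 1)), K + np.eye(n) / gamma]])
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:], X, sigma                          # bias, alpha, training data

def lssvm_predict(model, Xq):
    b, a, Xtr, sigma = model
    sq = ((Xq[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2)) @ a + b

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=100)
model = lssvm_fit(X, y)
print(((lssvm_predict(model, X) - y) ** 2).mean())
```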

Slide 63

Mixture of SVMs

Problem of hyper-parameter selection for SVMs

Different SVMs, with different hyper-parameters

Soft separation of the input space

Slide 64

Mixture of SVMs

Slide 65

Boosting techniques

Boosting by filtering

Boosting by subsampling

Boosting by reweighting

Slide 66

Boosting techniques: boosting by filtering

Slide 67

Boosting techniques: boosting by subsampling

Slide 68

Boosting techniques: boosting by reweighting
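The slide itself only shows a diagram. As a concrete illustration of reweighting, the sketch below follows the standard AdaBoost recipe (my own code, not necessarily the lecture's exact variant): each round refits a weak learner with sample weights that emphasize the previously misclassified points.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.25, random_state=9)
y_pm = 2 * y - 1                                  # labels in {-1, +1}
w = np.full(len(y), 1 / len(y))                   # initial sample weights
learners, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y_pm].sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    w *= np.exp(-alpha * y_pm * pred)             # reweighting: misclassified samples gain weight
    w /= w.sum()
    learners.append(stump); alphas.append(alpha)

ensemble = np.sign(sum(a * m.predict(X) for a, m in zip(alphas, learners)))
print("training accuracy:", (ensemble == y_pm).mean())
```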

Slide 69

Other modular architectures

Slide 70

Other modular architectures

Slide 71

Other modular architectures: modular classifiers

Decoupled modules

Hierarchical modules

Network ensemble (linear combination)

Network ensemble (decision, voting)

Slide 72

Modular architectures