
Page 1

Sergios Theodoridis

Konstantinos Koutroumbas

Version 2

Page 2

PATTERN RECOGNITION

Typical application areas: machine vision, character recognition (OCR), computer-aided diagnosis, speech recognition, face recognition, biometrics, image database retrieval, data mining, bioinformatics.

The task: Assign unknown objects – patterns – into the correct class. This is known as classification.

Page 3

Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.

Feature vectors: A number of features $x_1, \ldots, x_l$ constitute the feature vector

$x = [x_1, \ldots, x_l]^T \in R^l$

Feature vectors are treated as random vectors.

Page 4

An example:

Page 5

The classifier consists of a set of functions whose values, computed at $x$, determine the class to which the corresponding pattern belongs.

Classification system overview:

patterns → sensor → feature generation → feature selection → classifier design → system evaluation

Page 6

Supervised – unsupervised pattern recognition:

The two major directions:

• Supervised: Patterns whose class is known a priori are used for training.
• Unsupervised: The number of classes is (in general) unknown and no training patterns are available.

Page 7

CLASSIFIERS BASED ON BAYES DECISION THEORY

Statistical nature of feature vectors: $x = [x_1, x_2, \ldots, x_l]^T$

Assign the pattern represented by the feature vector $x$ to the most probable of the available classes $\omega_1, \omega_2, \ldots, \omega_M$.

That is, assign $x \to \omega_i$ for which $P(\omega_i \mid x)$ is maximum.

Page 8

Computation of a-posteriori probabilities. Assume as known:

• the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$

• the class-conditional densities $p(x \mid \omega_i),\ i = 1, 2, \ldots, M$; each is also known as the likelihood of $\omega_i$ with respect to $x$.

Page 9

The Bayes rule (M = 2):

$p(x)\,P(\omega_i \mid x) = p(x \mid \omega_i)\,P(\omega_i)$

$P(\omega_i \mid x) = \dfrac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$

where

$p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$

Page 10

The Bayes classification rule (for two classes, M = 2). Given $x$, classify it according to the rule:

If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, then $x \in \omega_1$
If $P(\omega_2 \mid x) > P(\omega_1 \mid x)$, then $x \in \omega_2$

Equivalently, classify $x$ according to the rule:

$p(x \mid \omega_1)\,P(\omega_1) \gtrless p(x \mid \omega_2)\,P(\omega_2)$

For equiprobable classes the test becomes:

$p(x \mid \omega_1) \gtrless p(x \mid \omega_2)$
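A minimal sketch of this two-class rule in Python (NumPy assumed; the Gaussian class-conditional densities, means and priors are illustrative choices, not taken from the slides):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # 1-D Gaussian density N(mean, var)
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def bayes_classify(x, priors=(0.5, 0.5), means=(0.0, 1.0), variances=(1.0, 1.0)):
    # Decide omega_1 vs omega_2 by comparing p(x|w_i) P(w_i); ties go to omega_1.
    score1 = gaussian_pdf(x, means[0], variances[0]) * priors[0]
    score2 = gaussian_pdf(x, means[1], variances[1]) * priors[1]
    return 1 if score1 >= score2 else 2

print(bayes_classify(0.3))  # closer to the omega_1 mean -> class 1
print(bayes_classify(0.9))  # closer to the omega_2 mean -> class 2
```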

Page 11

(Figure: the two class pdfs and the decision regions $R_1 (\to \omega_1)$ and $R_2 (\to \omega_2)$.)

Page 12

Equivalently, in words: divide the space into two regions:

If $x \in R_1$, decide $x \in \omega_1$
If $x \in R_2$, decide $x \in \omega_2$

Probability of error (the total shaded area):

$P_e = \int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$

The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability!

Page 13

Indeed: moving the threshold, the total shaded area INCREASES by the extra “grey” area.

Page 14

The Bayes classification rule for many (M > 2) classes: given $x$, classify it to $\omega_i$ if

$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i$

Such a choice also minimizes the classification error probability.

Minimizing the average risk: for each wrong decision a penalty term is assigned, since some decisions are more sensitive than others.

Page 15

For M = 2:

• Define the loss matrix

$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$

• $\lambda_{12}$ is the penalty term for deciding class $\omega_2$, although the pattern belongs to $\omega_1$, etc.

Risk with respect to $\omega_1$:

$r_1 = \lambda_{11} \int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12} \int_{R_2} p(x \mid \omega_1)\,dx$

Page 16

Risk with respect to $\omega_2$:

$r_2 = \lambda_{21} \int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22} \int_{R_2} p(x \mid \omega_2)\,dx$

Average risk:

$r = r_1 P(\omega_1) + r_2 P(\omega_2)$

Probabilities of wrong decisions, weighted by the penalty terms

Page 17

Choose $R_1$ and $R_2$ so that $r$ is minimized. Then assign $x$ to $\omega_i$ according to:

$\ell_1 \equiv \lambda_{11}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{21}\,p(x \mid \omega_2)P(\omega_2)$

$\ell_2 \equiv \lambda_{12}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{22}\,p(x \mid \omega_2)P(\omega_2)$

$x \to \omega_1$ if $\ell_1 < \ell_2$, and $x \to \omega_2$ if $\ell_2 < \ell_1$.

Equivalently: assign $x$ to $\omega_1$ ($\omega_2$) if

$\ell_{12} \equiv \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > (<) \ \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}} \dfrac{P(\omega_2)}{P(\omega_1)}$

$\ell_{12}$: the likelihood ratio.

Page 18

If $P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:

$x \to \omega_1$ if $p(x \mid \omega_1) > \dfrac{\lambda_{21}}{\lambda_{12}}\, p(x \mid \omega_2)$

$x \to \omega_2$ if $p(x \mid \omega_2) > \dfrac{\lambda_{12}}{\lambda_{21}}\, p(x \mid \omega_1)$

If $\lambda_{21} = \lambda_{12}$: minimum classification error probability.

Page 19

An example:

$p(x \mid \omega_1) = \dfrac{1}{\sqrt{\pi}} \exp(-x^2), \qquad p(x \mid \omega_2) = \dfrac{1}{\sqrt{\pi}} \exp(-(x-1)^2)$

$P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$

$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$

Page 20

Then the threshold value is:

$x_0$ for minimum $P_e$: $\exp(-x_0^2) = \exp(-(x_0 - 1)^2) \ \Rightarrow\ x_0 = \dfrac{1}{2}$

Threshold $\hat{x}_0$ for minimum $r$: $\exp(-\hat{x}_0^2) = 2\exp(-(\hat{x}_0 - 1)^2) \ \Rightarrow\ \hat{x}_0 = \dfrac{1 - \ln 2}{2}$
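A quick numeric check of the two thresholds (a sketch assuming NumPy; it only verifies the equalities above):

```python
import numpy as np

# Minimum-error threshold: exp(-x0^2) = exp(-(x0-1)^2)  =>  x0 = 1/2
x0 = 0.5

# Minimum-risk threshold: exp(-x^2) = 2 exp(-(x-1)^2)  =>  x_hat = (1 - ln 2) / 2
x_hat = (1.0 - np.log(2.0)) / 2.0

print(x0, x_hat)   # 0.5  0.1534...
# Verify that x_hat indeed satisfies its defining equality.
print(np.isclose(np.exp(-x_hat ** 2), 2.0 * np.exp(-(x_hat - 1.0) ** 2)))  # True
```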

Page 21

Thus $\hat{x}_0$ moves to the left of $x_0 = \dfrac{1}{2}$ (WHY?)

Page 22

DISCRIMINANT FUNCTIONS AND DECISION SURFACES

If $R_i$, $R_j$ are contiguous:

$R_i: P(\omega_i \mid x) > P(\omega_j \mid x)$
$R_j: P(\omega_j \mid x) > P(\omega_i \mid x)$

$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$

is the surface separating the regions. On one side it is positive (+), on the other negative (-). It is known as the Decision Surface.

Page 23

If $f(\cdot)$ is monotonic, the rule remains the same if we use:

$x \to \omega_i \ \text{if} \ f\big(P(\omega_i \mid x)\big) > f\big(P(\omega_j \mid x)\big) \quad \forall j \neq i$

$g_i(x) \equiv f\big(P(\omega_i \mid x)\big)$ is a discriminant function.

In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.

Page 24

BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Multivariate Gaussian pdf:

$p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{\ell/2} |\Sigma_i|^{1/2}} \exp\left(-\dfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$

$\mu_i = E[x]$: the mean value of the $\omega_i$ class

$\Sigma_i = E\left[(x - \mu_i)(x - \mu_i)^T\right]$: called the covariance matrix
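A direct transcription of the pdf above into Python (a sketch assuming NumPy; scipy.stats.multivariate_normal would serve equally well):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mean, cov):
    # p(x | omega_i) for a Gaussian with mean vector `mean` and covariance `cov`
    x, mean = np.asarray(x, float), np.asarray(mean, float)
    l = mean.size
    diff = x - mean
    norm = 1.0 / ((2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Example: standard 2-D Gaussian evaluated at the origin -> 1 / (2*pi)
print(multivariate_gaussian_pdf([0, 0], [0, 0], np.eye(2)))
```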

Page 25

$\ln(\cdot)$ is monotonic. Define:

$g_i(x) = \ln\big(p(x \mid \omega_i)\,P(\omega_i)\big) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$

$g_i(x) = -\dfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln P(\omega_i) + C_i$

$C_i = -\dfrac{\ell}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i|$

Example: $\Sigma_i = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$

Page 26

That is, $g_i(x)$ is quadratic and the surfaces $g_i(x) - g_j(x) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.

For example:

$g_i(x) = -\dfrac{1}{2\sigma^2}(x_1^2 + x_2^2) + \dfrac{1}{\sigma^2}(\mu_{i1} x_1 + \mu_{i2} x_2) - \dfrac{1}{2\sigma^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$

Page 27

Decision Hyperplanes

Quadratic terms: $x^T \Sigma_i^{-1} x$

If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest; they are not involved in the comparisons. Then, equivalently, we can write:

$g_i(x) = w_i^T x + w_{i0}$

$w_i = \Sigma^{-1} \mu_i, \qquad w_{i0} = \ln P(\omega_i) - \dfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i$

Discriminant functions are LINEAR.
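A minimal sketch computing $w_i$, $w_{i0}$ and evaluating the linear discriminant (NumPy assumed; for concreteness it borrows the common covariance, means and test vector of the worked example a few slides below):

```python
import numpy as np

def linear_discriminant(x, mean, cov, prior):
    # g_i(x) = w_i^T x + w_i0, with w_i = Sigma^{-1} mu_i and
    # w_i0 = ln P(omega_i) - 0.5 * mu_i^T Sigma^{-1} mu_i
    cov_inv = np.linalg.inv(cov)
    w = cov_inv @ mean
    w0 = np.log(prior) - 0.5 * mean @ cov_inv @ mean
    return w @ x + w0

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
cov = np.array([[1.1, 0.3], [0.3, 1.9]])   # common covariance for both classes
x = np.array([1.0, 2.2])
g1 = linear_discriminant(x, mu1, cov, 0.5)
g2 = linear_discriminant(x, mu2, cov, 0.5)
print("assign to omega_1" if g1 > g2 else "assign to omega_2")
```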

Page 28

Let, in addition, $\Sigma = \sigma^2 I$. Then:

• $g_i(x) = \dfrac{1}{\sigma^2}\mu_i^T x + w_{i0}$

• $g_{ij}(x) \equiv g_i(x) - g_j(x) = w^T(x - x_0) = 0$

$w = \mu_i - \mu_j$

$x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\dfrac{P(\omega_i)}{P(\omega_j)} \cdot \dfrac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$

Page 29

Nondiagonal $\Sigma$:

Decision hyperplane:

$g_{ij}(x) = w^T(x - x_0) = 0$

$w = \Sigma^{-1}(\mu_i - \mu_j)$

$x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \ln\dfrac{P(\omega_i)}{P(\omega_j)} \cdot \dfrac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2_{\Sigma^{-1}}}$

where $\|x\|_{\Sigma^{-1}} \equiv (x^T \Sigma^{-1} x)^{1/2}$.

The hyperplane is not normal to $\mu_i - \mu_j$; it is normal to $\Sigma^{-1}(\mu_i - \mu_j)$.

Page 30

Minimum Distance Classifiers

For equiprobable classes, $P(\omega_i) = \dfrac{1}{M}$, the discriminant reduces to

$g_i(x) = -\dfrac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$

For $\Sigma = \sigma^2 I$: assign $x \to \omega_i$ with the smaller Euclidean distance

$d_E = \|x - \mu_i\|$

For nondiagonal $\Sigma$: assign $x \to \omega_i$ with the smaller Mahalanobis distance

$d_m = \left((x - \mu_i)^T \Sigma^{-1} (x - \mu_i)\right)^{1/2}$

Page 31

Page 32

Example: Given $\omega_1$, $\omega_2$ with $P(\omega_1) = P(\omega_2)$ and

$p(x \mid \omega_1) = N(\mu_1, \Sigma)$, $p(x \mid \omega_2) = N(\mu_2, \Sigma)$,

$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$

classify the vector $x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.

$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$

Compute the Mahalanobis distances from $\mu_1$, $\mu_2$:

$d_{m,1}^2 = (1.0,\ 2.2)\,\Sigma^{-1}\begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952, \qquad d_{m,2}^2 = (-2.0,\ -0.8)\,\Sigma^{-1}\begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$

Classify $x \to \omega_1$. Observe that $d_{E,2} < d_{E,1}$.
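A short NumPy check of the numbers in this example (a sketch; scipy.spatial.distance.mahalanobis would give the same values):

```python
import numpy as np

sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
sigma_inv = np.linalg.inv(sigma)            # [[0.95, -0.15], [-0.15, 0.55]]
x = np.array([1.0, 2.2])
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])

d2_m1 = (x - mu1) @ sigma_inv @ (x - mu1)   # squared Mahalanobis distance to mu1
d2_m2 = (x - mu2) @ sigma_inv @ (x - mu2)   # squared Mahalanobis distance to mu2
d_e1, d_e2 = np.linalg.norm(x - mu1), np.linalg.norm(x - mu2)

print(round(d2_m1, 3), round(d2_m2, 3))     # 2.952 3.672 -> assign x to omega_1
print(d_e2 < d_e1)                          # True: the Euclidean ordering is reversed
```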

Page 33

ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

Maximum Likelihood: the method.

Let $x_1, x_2, \ldots, x_N$ be known and independent. Let $p(x)$ be known within an unknown vector parameter $\theta$:

$p(x) \equiv p(x; \theta)$

$X = \{x_1, x_2, \ldots, x_N\}$

$p(X; \theta) \equiv p(x_1, x_2, \ldots, x_N; \theta) = \prod_{k=1}^{N} p(x_k; \theta)$

which is known as the Likelihood of $\theta$ w.r. to $X$.

Support Slide

Page 34

$\hat{\theta}_{ML} = \arg\max_{\theta} p(X; \theta)$

$L(\theta) \equiv \ln p(X; \theta) = \sum_{k=1}^{N} \ln p(x_k; \theta)$

$\hat{\theta}_{ML}: \quad \dfrac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N} \dfrac{1}{p(x_k; \theta)} \dfrac{\partial p(x_k; \theta)}{\partial \theta} = 0$

Support Slide

Page 35

Support Slide

Page 36

Asymptotically unbiased and consistent:

If, indeed, there is a $\theta_0$ such that $p(x) = p(x; \theta_0)$, then

$\lim_{N \to \infty} E[\hat{\theta}_{ML}] = \theta_0$

$\lim_{N \to \infty} E\left[\|\hat{\theta}_{ML} - \theta_0\|^2\right] = 0$

Support Slide

Page 37

Example: $p(x): N(\mu, \Sigma)$, with $\mu$ unknown and $X = \{x_1, x_2, \ldots, x_N\}$:

$p(x_k; \mu) = \dfrac{1}{(2\pi)^{\ell/2} |\Sigma|^{1/2}} \exp\left(-\dfrac{1}{2}(x_k - \mu)^T \Sigma^{-1}(x_k - \mu)\right)$

$L(\mu) \equiv \ln \prod_{k=1}^{N} p(x_k; \mu) = C - \dfrac{1}{2}\sum_{k=1}^{N}(x_k - \mu)^T \Sigma^{-1}(x_k - \mu)$

$\dfrac{\partial L(\mu)}{\partial \mu} = \sum_{k=1}^{N} \Sigma^{-1}(x_k - \mu) = 0 \ \Rightarrow\ \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$

Remember: if $A = A^T$, then $\dfrac{\partial(x^T A x)}{\partial x} = 2Ax$.

Support Slide
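The closed-form result above ($\hat{\mu}_{ML}$ is just the sample mean) can be checked numerically; a minimal sketch assuming NumPy, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
X = rng.multivariate_normal(mu_true, sigma, size=5000)   # samples x_1, ..., x_N

mu_ml = X.mean(axis=0)   # the ML estimate of the Gaussian mean is the sample mean
print(mu_ml)             # close to [1.0, -2.0] for large N
```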

Page 38

Maximum A-Posteriori (MAP) Probability Estimation

In the ML method, $\theta$ was considered a parameter. Here we shall look at $\theta$ as a random vector described by a pdf $p(\theta)$, assumed to be known.

Given $X = \{x_1, x_2, \ldots, x_N\}$, compute the maximum of $p(\theta \mid X)$.

From the Bayes theorem:

$p(\theta)\,p(X \mid \theta) = p(X)\,p(\theta \mid X)$, or $p(\theta \mid X) = \dfrac{p(\theta)\,p(X \mid \theta)}{p(X)}$

Support Slide

Page 39

The method:

$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid X)$, or equivalently

$\hat{\theta}_{MAP}: \quad \dfrac{\partial}{\partial \theta}\big(p(\theta)\,p(X \mid \theta)\big) = 0$

If $p(\theta)$ is uniform or broad enough, $\hat{\theta}_{MAP} \simeq \hat{\theta}_{ML}$.

Support Slide

Page 40

Support Slide

Page 41

Example: $p(x): N(\mu, \sigma^2 I)$, with $\mu$ unknown, $X = \{x_1, \ldots, x_N\}$, and prior

$p(\mu) = \dfrac{1}{(2\pi)^{\ell/2} \sigma_{\mu}^{\ell}} \exp\left(-\dfrac{\|\mu - \mu_0\|^2}{2\sigma_{\mu}^2}\right)$

MAP: $\dfrac{\partial}{\partial \mu}\ln\left(p(\mu)\prod_{k=1}^{N} p(x_k \mid \mu)\right) = 0$, or

$\dfrac{1}{\sigma^2}\sum_{k=1}^{N}(x_k - \hat{\mu}) - \dfrac{1}{\sigma_{\mu}^2}(\hat{\mu} - \mu_0) = 0 \ \Rightarrow\ \hat{\mu}_{MAP} = \dfrac{\mu_0 + \dfrac{\sigma_{\mu}^2}{\sigma^2}\sum_{k=1}^{N} x_k}{1 + \dfrac{\sigma_{\mu}^2}{\sigma^2} N}$

For $\dfrac{\sigma_{\mu}^2}{\sigma^2} \gg 1$, or for $N \to \infty$:

$\hat{\mu}_{MAP} \simeq \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$

Support Slide
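A numeric illustration of the MAP estimate above for scalar data (NumPy assumed; the true mean, variances and prior parameters are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma2 = 2.0, 1.0          # data: N(mu_true, sigma2), mu unknown
mu0, sigma2_mu = 0.0, 0.5           # prior on mu: N(mu0, sigma2_mu)

for N in (5, 50, 5000):
    x = rng.normal(mu_true, np.sqrt(sigma2), size=N)
    mu_ml = x.mean()
    r = sigma2_mu / sigma2
    mu_map = (mu0 + r * x.sum()) / (1.0 + r * N)
    print(N, round(mu_ml, 3), round(mu_map, 3))   # MAP approaches ML as N grows
```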

Page 42

Bayesian Inference

ML and MAP compute a single estimate for $\theta$. Here a different route is followed.

Given: $X = \{x_1, \ldots, x_N\}$, $p(x \mid \theta)$ and $p(\theta)$.

The goal: estimate $p(x \mid X)$.

How??

Support Slide

Page 43

$p(x \mid X) = \int p(x \mid \theta)\,p(\theta \mid X)\,d\theta$

$p(\theta \mid X) = \dfrac{p(X \mid \theta)\,p(\theta)}{p(X)} = \dfrac{p(X \mid \theta)\,p(\theta)}{\int p(X \mid \theta)\,p(\theta)\,d\theta}$

$p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)$

A bit more insight via an example. Let

$p(x \mid \mu) = N(\mu, \sigma^2), \qquad p(\mu) = N(\mu_0, \sigma_0^2)$

It turns out that:

$p(\mu \mid X) = N(\mu_N, \sigma_N^2)$

where

$\mu_N = \dfrac{N\sigma_0^2\,\bar{x} + \sigma^2\mu_0}{N\sigma_0^2 + \sigma^2}, \qquad \sigma_N^2 = \dfrac{\sigma_0^2\sigma^2}{N\sigma_0^2 + \sigma^2}, \qquad \bar{x} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$

Support Slide

Page 44

The above is a sequence of Gaussians as $N \to \infty$.

Maximum Entropy. Entropy:

$H = -\int_x p(x)\ln p(x)\,dx$

$\hat{p}(x)$: maximize $H$ subject to the available constraints.

Support Slide

Page 45

Example: $x$ is nonzero in the interval $x_1 \le x \le x_2$ and zero otherwise. Compute the ME pdf.

• The constraint: $\int_{x_1}^{x_2} p(x)\,dx = 1$

• Lagrange multipliers: $H_L = H + \lambda\left(\int_{x_1}^{x_2} p(x)\,dx - 1\right)$

$\hat{p}(x) = \exp(\lambda - 1)$

$\hat{p}(x) = \begin{cases} \dfrac{1}{x_2 - x_1}, & x_1 \le x \le x_2 \\ 0, & \text{otherwise} \end{cases}$

Support Slide

Page 46

Mixture Models:

$p(x) = \sum_{j=1}^{J} p(x \mid j)\,P_j, \qquad \sum_{j=1}^{J} P_j = 1, \qquad \int_x p(x \mid j)\,dx = 1$

Assume parametric modeling, i.e., $p(x \mid j; \theta)$.

The goal is to estimate $\theta$ and $P_1, P_2, \ldots, P_J$, given a set $X = \{x_1, x_2, \ldots, x_N\}$.

Why not ML, as before, i.e.

$\max_{\theta, P_1, \ldots, P_J} \prod_{k=1}^{N} p(x_k; \theta, P_1, \ldots, P_J)$ ?

Support Slide

Page 47

This is a nonlinear problem due to the missing label information. This is a typical problem with an incomplete data set.

The Expectation-Maximisation (EM) algorithm.

• General formulation: the complete data set $y \in Y \subseteq R^m$, with $p_y(y; \theta)$, which are not observed directly.

We observe

$x = g(y) \in X_{ob} \subseteq R^{\ell}, \ \ell < m$, with $p_x(x; \theta)$,

a many-to-one transformation.

Support Slide

Page 48

• Let $Y(x) \subseteq Y$ be the subset of all $y$'s corresponding to a specific $x$. Then

$p_x(x; \theta) = \int_{Y(x)} p_y(y; \theta)\,dy$

• What we need is to compute

$\hat{\theta}_{ML}: \quad \sum_{k} \dfrac{\partial \ln\big(p_y(y_k; \theta)\big)}{\partial \theta} = 0$

• But the $y_k$'s are not observed. Here comes the EM: maximize the expectation of the log-likelihood, conditioned on the observed samples and the current iteration estimate of $\theta$.

Support Slide

Page 49

The algorithm:

• E-step: $Q(\theta; \theta(t)) = E\big[\ln\big(p_y(y; \theta)\big) \mid X; \theta(t)\big]$

• M-step: $\theta(t+1): \quad \dfrac{\partial Q(\theta; \theta(t))}{\partial \theta} = 0$

Application to the mixture modeling problem:

• Complete data: $(x_k, j_k),\ k = 1, 2, \ldots, N$

• Observed data: $x_k,\ k = 1, 2, \ldots, N$

$p(x_k, j_k; \theta) = p(x_k \mid j_k; \theta)\,P_{j_k}$

• Assuming mutual independence: $L(\theta) = \sum_{k=1}^{N} \ln\big(p(x_k \mid j_k; \theta)\,P_{j_k}\big)$

Support Slide

Page 50

• Unknown parameters: $\Theta^T = [\theta^T, P^T]$, with $P^T = [P_1, P_2, \ldots, P_J]$

• E-step:

$Q(\Theta; \Theta(t)) = E\left[\sum_{k=1}^{N} \ln\big(p(x_k \mid j_k; \theta)\,P_{j_k}\big)\right] = \sum_{k=1}^{N}\sum_{j=1}^{J} P(j \mid x_k; \Theta(t)) \ln\big(p(x_k \mid j; \theta)\,P_j\big)$

$P(j \mid x_k; \Theta(t)) = \dfrac{p(x_k \mid j; \theta(t))\,P_j(t)}{p(x_k; \Theta(t))}, \qquad p(x_k; \Theta(t)) = \sum_{j=1}^{J} p(x_k \mid j; \theta(t))\,P_j(t)$

• M-step:

$\dfrac{\partial Q}{\partial \theta} = 0, \qquad \dfrac{\partial Q}{\partial P_j} = 0, \quad j = 1, 2, \ldots, J$

Support Slide
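The E- and M-steps above take a simple closed form for Gaussian components; the following is a minimal sketch for a two-component, one-dimensional Gaussian mixture (NumPy assumed; data and initial values are illustrative):

```python
import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Initial guesses for the mixture parameters Theta = (mu_j, var_j, P_j)
mu, var, P = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # E-step: responsibilities P(j | x_k; Theta(t))
    dens = np.stack([P[j] * gaussian(x, mu[j], var[j]) for j in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate means, variances and mixing probabilities
    Nj = resp.sum(axis=1)
    mu = (resp * x).sum(axis=1) / Nj
    var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nj
    P = Nj / x.size

print(np.round(mu, 2), np.round(P, 2))   # roughly [-2., 3.] and [0.3, 0.7]
```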

Page 51

Nonparametric Estimation

$\hat{p}(x) \approx \hat{p}(\hat{x}) \equiv \dfrac{1}{h}\dfrac{k_N}{N}, \qquad \hat{x} - \dfrac{h}{2} \le x \le \hat{x} + \dfrac{h}{2}$

where $k_N$ is the number of points inside the bin and $N$ is the total number of points.

If $p(x)$ is continuous, $\hat{p}(x) \to p(x)$ as $N \to \infty$, provided

$h_N \to 0, \qquad k_N \to \infty, \qquad \dfrac{k_N}{N} \to 0$

Page 52

Parzen Windows: divide the multidimensional space into hypercubes.

Page 53

Define

$\varphi(x_i) = \begin{cases} 1, & |x_{ij}| \le \dfrac{1}{2} \\ 0, & \text{otherwise} \end{cases}$

• That is, it is 1 inside a unit-side hypercube centered at 0.

$\hat{p}(x) = \dfrac{1}{h^{\ell}}\dfrac{1}{N}\sum_{i=1}^{N}\varphi\left(\dfrac{x_i - x}{h}\right) = \dfrac{1}{\text{volume}}\cdot\dfrac{1}{N}\cdot\big(\text{number of points inside an } h\text{-side hypercube centered at } x\big)$

• The problem: $p(x)$ is continuous, while $\varphi(\cdot)$ is discontinuous.

• Parzen windows - kernels - potential functions: choose $\varphi(x)$ smooth, with

$\varphi(x) \ge 0, \qquad \int_x \varphi(x)\,dx = 1$
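A minimal sketch of a Parzen estimate with a smooth (Gaussian) kernel in one dimension, assuming NumPy; the kernel choice and bandwidth values are illustrative:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    # p_hat(x) = (1 / (N h)) * sum_i phi((x_i - x) / h), with a Gaussian kernel phi
    u = (samples - x) / h
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return phi.sum() / (samples.size * h)

rng = np.random.default_rng(3)
samples = rng.normal(0.0, 1.0, size=1000)
for h in (0.1, 0.8):
    print(h, round(parzen_estimate(0.0, samples, h), 3))  # both near 1/sqrt(2*pi) ~ 0.399
```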

Page 54

Mean value:

$E[\hat{p}(x)] = \dfrac{1}{h^{\ell}}\dfrac{1}{N}\sum_{i=1}^{N} E\left[\varphi\left(\dfrac{x_i - x}{h}\right)\right] = \dfrac{1}{h^{\ell}}\int_{x'}\varphi\left(\dfrac{x' - x}{h}\right) p(x')\,dx'$

As $h \to 0$, the width of $\varphi\left(\dfrac{x' - x}{h}\right) \to 0$ while $\int_{x'}\dfrac{1}{h^{\ell}}\varphi\left(\dfrac{x' - x}{h}\right)dx' = 1$, so $\dfrac{1}{h^{\ell}}\varphi\left(\dfrac{x' - x}{h}\right) \to \delta(x - x')$ and

$E[\hat{p}(x)] \to \int_{x'}\delta(x - x')\,p(x')\,dx' = p(x)$

Hence unbiased in the limit $h \to 0$.

Support Slide

Page 55

Variance: the smaller the $h$, the higher the variance.

(Figures: Parzen estimates with h = 0.1, N = 1000 and h = 0.8, N = 1000.)

Page 56

(Figure: Parzen estimate with h = 0.1, N = 10000.)

The higher the $N$, the better the accuracy.

Page 57

If

• $h \to 0$
• $N \to \infty$
• $hN \to \infty$

the estimate is asymptotically unbiased.

The method. Remember the likelihood-ratio test:

$\ell_{12} \equiv \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \ > (<) \ \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}\dfrac{P(\omega_2)}{P(\omega_1)}$

With Parzen estimates of the two class-conditional pdfs ($N_1$, $N_2$ training points per class):

$\ell_{12} \approx \dfrac{\dfrac{1}{N_1 h^{\ell}}\sum_{i=1}^{N_1}\varphi\left(\dfrac{x_i - x}{h}\right)}{\dfrac{1}{N_2 h^{\ell}}\sum_{i=1}^{N_2}\varphi\left(\dfrac{x_i - x}{h}\right)}$

Page 58

CURSE OF DIMENSIONALITY

In all the methods so far, we saw that the higher the number of points, N, the better the resulting estimate.

If in the one-dimensional space an interval is adequately filled with N points (for good estimation), in the two-dimensional space the corresponding square will require $N^2$ points and in the ℓ-dimensional space the ℓ-dimensional cube will require $N^{\ell}$ points.

The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.

Page 59

NAIVE – BAYES CLASSIFIER

Let $x \in R^{\ell}$ and the goal is to estimate $p(x \mid \omega_i)$, i = 1, 2, …, M. For a “good” estimate of the pdf one would need, say, $N^{\ell}$ points.

Assume $x_1, x_2, \ldots, x_{\ell}$ mutually independent. Then:

$p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$

In this case, one would require, roughly, N points for each one-dimensional pdf. Thus, a number of points of the order N·ℓ would suffice.

It turns out that the Naïve – Bayes classifier works reasonably well even in cases that violate the independence assumption.
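A minimal Gaussian Naive-Bayes sketch (NumPy assumed; modeling each feature with a one-dimensional Gaussian per class is an illustrative choice):

```python
import numpy as np

def fit_naive_bayes(X, y):
    # Estimate, per class: prior, per-feature mean and variance.
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (Xc.shape[0] / X.shape[0], Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return model

def predict(model, x):
    # Pick the class maximizing ln P(w_i) + sum_j ln p(x_j | w_i).
    def log_post(params):
        prior, mean, var = params
        return np.log(prior) + np.sum(-0.5 * np.log(2 * np.pi * var)
                                      - (x - mean) ** 2 / (2 * var))
    return max(model, key=lambda c: log_post(model[c]))

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
print(predict(fit_naive_bayes(X, y), np.array([1.8, 2.1, 2.0])))   # -> 1
```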

Page 60

K Nearest Neighbor Density Estimation

In Parzen:
• The volume is constant
• The number of points in the volume is varying

Now:
• Keep the number of points $k$ constant
• Leave the volume to be varying

$\hat{p}(x) = \dfrac{k}{N\,V(x)}$

Page 61

For the two-class case, the k-NN density estimates give the ratio

$\dfrac{\hat{p}(x \mid \omega_1)}{\hat{p}(x \mid \omega_2)} \approx \dfrac{k/(N_1 V_1)}{k/(N_2 V_2)} = \dfrac{N_2 V_2}{N_1 V_1}$

Page 62

The Nearest Neighbor Rule:
• Choose k out of the N training vectors and identify the k nearest ones to x.
• Out of these k, identify k_i that belong to class ω_i.
• Assign $x \to \omega_i : k_i > k_j \ \forall i \neq j$

The simplest version: k = 1 !!!

For large N this is not bad. It can be shown that if $P_B$ is the optimal Bayesian error probability, then:

$P_B \le P_{NN} \le P_B\left(2 - \dfrac{M}{M-1}P_B\right) \le 2P_B$
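A minimal sketch of the k-nearest-neighbor rule (NumPy assumed; the training data are synthetic):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    # Find the k training vectors nearest to x and take a majority vote.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(5)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 3.2]), X_train, y_train, k=5))   # -> 1
```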

Page 63

$P_B \le P_{kNN} \le P_B + \sqrt{\dfrac{2P_{NN}}{k}}$

$k \to \infty \ \Rightarrow\ P_{kNN} \to P_B$

For small $P_B$:

$P_{NN} \simeq 2P_B, \qquad P_{3NN} \simeq P_B + 3(P_B)^2$

Page 64

Voronoi tessellation:

$R_i = \{x : d(x, x_i) < d(x, x_j),\ \forall j \neq i\}$

Page 65

BAYESIAN NETWORKS

Bayes Probability Chain Rule:

$p(x_1, x_2, \ldots, x_{\ell}) = p(x_{\ell} \mid x_{\ell-1}, \ldots, x_1)\,p(x_{\ell-1} \mid x_{\ell-2}, \ldots, x_1)\cdots p(x_2 \mid x_1)\,p(x_1)$

Assume now that the conditional dependence for each $x_i$ is limited to a subset of the features appearing in each of the product terms. That is:

$p(x_1, x_2, \ldots, x_{\ell}) = p(x_1)\prod_{i=2}^{\ell} p(x_i \mid A_i)$

where

$A_i \subseteq \{x_{i-1}, x_{i-2}, \ldots, x_1\}$

Support Slide

Page 66

For example, if ℓ = 6, then we could assume:

$p(x_6 \mid x_5, \ldots, x_1) = p(x_6 \mid x_5, x_4)$

Then:

$A_6 = \{x_5, x_4\} \subset \{x_5, \ldots, x_1\}$

The above is a generalization of the Naïve – Bayes. For the Naïve – Bayes the assumption is:

Ai = Ø, for i = 1, 2, …, ℓ

Support Slide

Page 67

A graphical way to portray conditional dependencies is given below

According to this figure we have that:

• x6 is conditionally dependent on x4, x5.

• x5 on x4

• x4 on x1, x2

• x3 on x2

• x1, x2 are conditionally independent of other variables.

For this case:

$p(x_1, x_2, \ldots, x_6) = p(x_6 \mid x_4, x_5)\,p(x_5 \mid x_4)\,p(x_4 \mid x_1, x_2)\,p(x_3 \mid x_2)\,p(x_1)\,p(x_2)$

Support Slide
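A small sketch that evaluates the factorized joint probability above for binary variables; the conditional probability tables below are hypothetical, chosen only to make the sketch runnable:

```python
# Hypothetical CPTs for binary x1..x6, following the factorization
# p(x1..x6) = p(x6|x4,x5) p(x5|x4) p(x4|x1,x2) p(x3|x2) p(x1) p(x2)
p_x1 = {1: 0.4, 0: 0.6}
p_x2 = {1: 0.7, 0: 0.3}
p_x3_given_x2 = {(1, 1): 0.8, (1, 0): 0.2}            # P(x3=1 | x2)
p_x4_given_x1x2 = {(1, 1, 1): 0.9, (1, 1, 0): 0.5, (1, 0, 1): 0.6, (1, 0, 0): 0.1}
p_x5_given_x4 = {(1, 1): 0.7, (1, 0): 0.2}
p_x6_given_x4x5 = {(1, 1, 1): 0.95, (1, 1, 0): 0.5, (1, 0, 1): 0.4, (1, 0, 0): 0.05}

def bern(table, value, *parents):
    # table stores P(variable = 1 | parents); derive P(variable = 0 | parents) from it
    p_one = table[(1, *parents)]
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5, x6):
    return (bern(p_x6_given_x4x5, x6, x4, x5) * bern(p_x5_given_x4, x5, x4)
            * bern(p_x4_given_x1x2, x4, x1, x2) * bern(p_x3_given_x2, x3, x2)
            * p_x1[x1] * p_x2[x2])

print(joint(1, 1, 1, 1, 1, 1))   # 0.95 * 0.7 * 0.9 * 0.8 * 0.4 * 0.7 = 0.134...
```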

Page 68

Bayesian Networks

Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), p(xi|Ai), where xi is the variable associated with the node and Ai is the set of its parents in the graph.

A Bayesian Network is specified by:
• The marginal probabilities of its root nodes.
• The conditional probabilities of the non-root nodes, given their parents, for ALL possible combinations.

Support Slide

Page 69

The figure below is an example of a Bayesian Network corresponding to a paradigm from the medical applications field.

This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.

Support Slide

Page 70

Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities.

Training: Once a topology is given, probabilities are estimated via the training data set. There are also methods that learn the topology.

Probability Inference: This is the most common task that Bayesian networks help us to solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities for some of the other variables, given the evidence.

Support Slide

Page 71

Example: Consider the Bayesian network of the figure:

a) If x is measured to be x=1 (x1), compute P(w=0|x=1) [P(w0|x1)].

b) If w is measured to be w=1 (w1) compute P(x=0|w=1) [ P(x0|w1)].

Support Slide

Page 72

For a), a set of calculations is required, propagating from node x to node w. It turns out that P(w0|x1) = 0.63.

For b), the propagation is reversed in direction. It turns out that P(x0|w1) = 0.4.

In general, the required inference information is computed via a combined process of “message passing” among the nodes of the DAG.

Complexity: for singly connected graphs, message-passing algorithms amount to a complexity linear in the number of nodes.

Support Slide