
Page 1:

©2005-2007 Carlos Guestrin

Bayesian Networks – (Structure) Learning
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

April 2nd, 2007

Page 2:

Review

• Bayesian networks
  • Compact representation for probability distributions
  • Exponential reduction in number of parameters
• Fast probabilistic inference using variable elimination
  • Compute P(X|e)
  • Time exponential in tree-width, not number of variables
• Today: learn BN structure

[Example BN figure: Flu and Allergy are parents of Sinus; Sinus is parent of Headache and Nose]

Page 3:

Learning Bayes nets

[Figure: the learning tasks arranged along two axes: known vs. unknown structure, and fully observable vs. missing data]

Data: x(1), …, x(m)  →  structure and parameters (the CPTs P(Xi | PaXi))

Page 4:

Learning the CPTs

Data: x(1), …, x(m)

For each discrete variable Xi, the maximum likelihood CPT is given by counts:

P̂(Xi = xi | PaXi = u) = Count(xi, u) / Count(u)
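A minimal sketch of this counting estimator (assuming discrete data as an integer NumPy array; the function and variable names are mine, not from the slides):

```python
import numpy as np
from collections import Counter

def mle_cpt(data, i, parents):
    """ML estimate of P(X_i | Pa_i) from counts.

    data: (M, n) array of discrete samples; parents: list of column indices.
    Returns a dict mapping (x_i, parent_assignment) -> estimated probability."""
    joint = Counter((row[i], tuple(row[parents])) for row in data)
    pa = Counter(tuple(row[parents]) for row in data)
    return {(x, u): c / pa[u] for (x, u), c in joint.items()}
```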

Page 5:

Information-theoretic interpretation of maximum likelihood

Given structure, log likelihood of data:

[Example BN figure: Flu, Allergy → Sinus → Headache, Nose]

Page 6:

Information-theoretic interpretation of maximum likelihood

Given structure, log likelihood of data:

[Example BN figure: Flu, Allergy → Sinus → Headache, Nose]

Page 7:

Information-theoretic interpretation of maximum likelihood 2

Given structure, the log likelihood of the data decomposes (a standard result) as:

log P̂(D | θ̂, G) = M ∑i Î(Xi ; PaXi) − M ∑i Ĥ(Xi)

so maximizing likelihood is equivalent to maximizing the total empirical mutual information between each variable and its parents (the entropy term does not depend on the structure).

[Example BN figure: Flu, Allergy → Sinus → Headache, Nose]

Page 8:

Decomposable score

The log data likelihood is a decomposable score: it decomposes over families in the BN (a node and its parents), which will lead to significant computational efficiency!!!

Score(G : D) = ∑i FamScore(Xi | PaXi : D)
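A minimal sketch of computing such a decomposable ML score family-by-family, using the information-theoretic form FamScore(Xi | PaXi : D) = M(Î(Xi; PaXi) − Ĥ(Xi)) from the previous slides (NumPy assumed; all names are mine):

```python
import numpy as np
from collections import Counter

def family_score(data, i, parents):
    """ML family score: M * ( I_hat(X_i; Pa_i) - H_hat(X_i) ).

    data: (M, n) integer array of discrete samples."""
    M = data.shape[0]
    child = [tuple(r) for r in data[:, [i]]]
    pa = [tuple(r) for r in data[:, parents]]
    joint = Counter(zip(child, pa))
    pc, pp = Counter(child), Counter(pa)
    # empirical mutual information between X_i and its parent set
    mi = sum(c / M * np.log((c / M) / ((pc[x] / M) * (pp[u] / M)))
             for (x, u), c in joint.items())
    # empirical entropy of X_i
    ent = -sum(c / M * np.log(c / M) for c in pc.values())
    return M * (mi - ent)

def score(data, structure):
    """structure: dict node -> list of parent indices; sums over families."""
    return sum(family_score(data, i, pa) for i, pa in structure.items())
```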

Page 9:

How many trees are there? Super-exponentially many (n^(n−2) labeled trees on n nodes, by Cayley's formula). Nonetheless, an efficient optimal algorithm finds the best tree.

Page 10:

Scoring a tree 1: equivalent trees. Trees with the same undirected skeleton get the same ML score, since the score depends only on which edges are present, not on their directions.

Page 11:

Scoring a tree 2: similar trees. For trees the score decomposes into a sum of per-edge terms, Score = M ∑(i,j)∈T Î(Xi; Xj) − M ∑i Ĥ(Xi), so trees that differ in one edge differ only in the corresponding edge terms.

Page 12:

Chow-Liu tree learning algorithm 1

• For each pair of variables Xi, Xj:
  • Compute the empirical distribution: P̂(xi, xj) = Count(xi, xj) / M
  • Compute mutual information: Î(Xi; Xj) = ∑xi,xj P̂(xi, xj) log [ P̂(xi, xj) / (P̂(xi) P̂(xj)) ]
• Define a graph: nodes X1, …, Xn, where edge (i, j) gets weight Î(Xi; Xj)

Page 13:

Chow-Liu tree learning algorithm 2

• Optimal tree BN:
  • Compute the maximum-weight spanning tree
  • Directions in the BN: pick any node as root; breadth-first search defines the directions
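A compact sketch of the whole algorithm, putting the two slides together (this assumes networkx is available for the spanning tree; function names are mine):

```python
import numpy as np
import networkx as nx  # assumed available for the spanning tree
from collections import Counter

def chow_liu_tree(data):
    """Chow-Liu: weight each pair by empirical mutual information, take a
    maximum-weight spanning tree, orient edges away from a root by BFS.

    data: (M, n) array of discrete values; returns (parent, child) pairs."""
    M, n = data.shape
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            pij = {ab: c / M for ab, c in Counter(zip(data[:, i], data[:, j])).items()}
            pi = {a: c / M for a, c in Counter(data[:, i]).items()}
            pj = {b: c / M for b, c in Counter(data[:, j]).items()}
            mi = sum(p * np.log(p / (pi[a] * pj[b])) for (a, b), p in pij.items())
            G.add_edge(i, j, weight=mi)
    T = nx.maximum_spanning_tree(G)
    return list(nx.bfs_tree(T, source=0).edges())  # BFS from node 0 sets directions
```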

Page 14:

Can we extend Chow-Liu? 1

• Tree-augmented naïve Bayes (TAN) [Friedman et al. ’97]
  • Naïve Bayes overcounts, because correlation between features is not considered
  • Same as Chow-Liu, but score edges with the class-conditional mutual information, Î(Xi; Xj | C)

Page 15:

Can we extend Chow-Liu? 2

• (Approximately) learning models with tree-width up to k [Narasimhan & Bilmes ’04]
  • But O(n^(k+1))… and more subtleties

Page 16:

What you need to know about learning BN structures so far

• Decomposable scores
  • Maximum likelihood
  • Information-theoretic interpretation
• Best tree (Chow-Liu)
• Best TAN
• Nearly best k-treewidth (in O(n^(k+1)))

Page 17:

Scoring general graphical models – model selection problem

Data: <x_1^{(1)}, …, x_n^{(1)}>, …, <x_1^{(m)}, …, x_n^{(m)}>

[Example BN figure: Flu, Allergy → Sinus → Headache, Nose]

What’s the best structure?

The more edges, the fewer independence assumptions, and the higher the likelihood of the data, but it will overfit…

Page 18:

Maximum likelihood overfits!

Information never hurts: Î(Xi; PaXi ∪ {Z}) ≥ Î(Xi; PaXi), so adding a parent always increases the (maximum likelihood) score!!!

Page 19:

Bayesian score avoids overfitting

• Given a structure, average over the distribution of parameters:
  P(D | G) = ∫ P(D | θG, G) P(θG | G) dθG
• Difficult integral: use the Bayesian information criterion (BIC) approximation (equivalent as M → ∞):
  log P(D | G) ≈ log P̂(D | θ̂G, G) − (log M / 2) · Dim(G)
• Note: this regularizes like the MDL score
• Finding the best BN under BIC is still NP-hard

Page 20:

How many graphs are there? At least 2^(n(n−1)/2), since every undirected graph can be oriented acyclically: super-exponentially many.

Page 21:

Structure learning for general graphs

• In a tree, a node only has one parent

• Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2

• Most structure learning approaches use heuristics that exploit score decomposition
• (Quickly) describe two heuristics that exploit decomposition in different ways

Page 22:

Learn BN structure using local search

• Starting from a Chow-Liu tree
• Local search, possible moves:
  • Add edge
  • Delete edge
  • Invert edge
• Score using BIC
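A greedy hill-climbing sketch of this loop. The scoring function can be any decomposable score such as BIC (the signature below matches the decomposable-score sketch earlier); is_acyclic and all names are my own illustration, not the lecture's code. Score decomposition means only the families touched by a move actually need rescoring; this sketch rescores everything for brevity:

```python
import itertools

def is_acyclic(parents):
    """Kahn's algorithm on a {node: set(parents)} map."""
    indeg = {i: len(ps) for i, ps in parents.items()}
    children = {i: [c for c, ps in parents.items() if i in ps] for i in parents}
    queue = [i for i, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for c in children[u]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return seen == len(parents)

def local_search(data, score, parents):
    """Greedy hill-climbing from an initial structure (e.g. a Chow-Liu tree).

    parents: dict node -> set of parents; score: decomposable score function."""
    as_lists = lambda P: {k: sorted(v) for k, v in P.items()}
    best = score(data, as_lists(parents))
    improved = True
    while improved:
        improved = False
        for i, j in itertools.permutations(parents, 2):
            for move in ("add", "delete", "invert"):
                cand = {k: set(v) for k, v in parents.items()}
                if move == "add" and j not in cand[i]:
                    cand[i].add(j)
                elif move == "delete" and j in cand[i]:
                    cand[i].remove(j)
                elif move == "invert" and j in cand[i]:
                    cand[i].remove(j)
                    cand[j].add(i)
                else:
                    continue
                if not is_acyclic(cand):
                    continue
                s = score(data, as_lists(cand))
                if s > best:
                    parents, best, improved = cand, s, True
    return parents
```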

Page 23:

What you need to know about learning BNs

• Maximum likelihood or MAP learns parameters
• Decomposable score
• Best tree (Chow-Liu)
• Best TAN
• Other BNs: usually local search with BIC score

Page 24:

Unsupervised learning or Clustering –
K-means; Gaussian mixture models
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

April 2nd, 2007

Page 25:

Some Data

Page 26:

K-means

1. Ask user how many clusters they’d like. (e.g. k=5)

Page 27:

K-means

1. Ask user how many clusters they’d like. (e.g. k=5)

2. Randomly guess k cluster Center locations

Page 28:

K-means

1. Ask user how many clusters they’d like. (e.g. k=5)

2. Randomly guess k cluster Center locations

3. Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)

Page 29:

K-means

1. Ask user how many clusters they’d like. (e.g. k=5)

2. Randomly guess k cluster Center locations

3. Each datapoint finds out which Center it’s closest to.

4. Each Center finds the centroid of the points it owns

Page 30:

K-means

1. Ask user how many clusters they’d like. (e.g. k=5)

2. Randomly guess k cluster Center locations

3. Each datapoint finds out which Center it’s closest to.

4. Each Center finds the centroid of the points it owns…

5. …and jumps there

6. …Repeat until terminated!
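The recipe above as a minimal NumPy sketch (initialization and the termination test are my choices, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """K-means following the slides' recipe. X: (N, d) datapoints."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 2: random centers
    for _ in range(n_iters):
        # step 3: each datapoint finds the center it's closest to
        owner = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # steps 4-5: each center jumps to the centroid of the points it owns
        new = np.array([X[owner == j].mean(axis=0) if np.any(owner == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):                             # step 6: terminate
            break
        centers = new
    return centers, owner
```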

Page 31:

Unsupervised Learning

You walk into a bar. A stranger approaches and tells you:

“I’ve got data from k classes. Each class produces observations with a normal distribution and variance σ²·I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(wi)’s.”

So far, looks straightforward.
“I need a maximum likelihood estimate of the µi’s.”

No problem:
“There’s just one thing. None of the data are labeled. I have datapoints, but I don’t know what class they’re from (any of them!)”

Uh oh!!

Page 32:

Gaussian Bayes Classifier Reminder

P(y = i | x) = p(x | y = i) P(y = i) / p(x)

where, for m-dimensional inputs,

p(x | y = i) = 1 / ( (2π)^(m/2) ||Σi||^(1/2) ) · exp( −½ (x − µi)ᵀ Σi⁻¹ (x − µi) )

How do we deal with that?

Page 33:

Predicting wealth from age

Page 34:

Predicting wealth from age

Page 35:

Learning: modelyear, mpg → maker

Σ = [ σ1²   σ12   …   σ1m
      σ12   σ2²   …   σ2m
       ⋮     ⋮    ⋱    ⋮
      σ1m   σ2m   …   σm² ]

Page 36:

General: O(m²) parameters

Σ = [ σ1²   σ12   …   σ1m
      σ12   σ2²   …   σ2m
       ⋮     ⋮    ⋱    ⋮
      σ1m   σ2m   …   σm² ]

Page 37:

Aligned: O(m) parameters

Σ = [ σ1²   0    0   …    0       0
       0   σ2²   0   …    0       0
       0    0   σ3²  …    0       0
       ⋮    ⋮    ⋮   ⋱    ⋮       ⋮
       0    0    0   …  σ(m−1)²   0
       0    0    0   …    0      σm² ]

Page 38:

Aligned: O(m) parameters

Σ = [ σ1²   0    0   …    0       0
       0   σ2²   0   …    0       0
       0    0   σ3²  …    0       0
       ⋮    ⋮    ⋮   ⋱    ⋮       ⋮
       0    0    0   …  σ(m−1)²   0
       0    0    0   …    0      σm² ]

Page 39:

Spherical: O(1) cov parameters

Σ = [ σ²   0    0   …   0
      0    σ²   0   …   0
      0    0    σ²  …   0
      ⋮    ⋮    ⋮   ⋱   ⋮
      0    0    0   …   σ² ]  = σ² I

Page 40:

Spherical: O(1) cov parameters

Σ = [ σ²   0    0   …   0
      0    σ²   0   …   0
      0    0    σ²  …   0
      ⋮    ⋮    ⋮   ⋱   ⋮
      0    0    0   …   σ² ]  = σ² I

Page 41:

Next… back to Density Estimation

What if we want to do density estimation with multimodal or clumpy data?

Page 42:

The GMM assumption

• There are k components. The i’th component is called ωi
• Component ωi has an associated mean vector µi

[Figure: means µ1, µ2, µ3 in 2-d space]

Page 43:

The GMM assumption

• There are k components. Thei’th component is called ωi

• Component ωi has anassociated mean vector µi

• Each component generates datafrom a Gaussian with mean µiand covariance matrix σ2I

Assume that each datapoint isgenerated according to thefollowing recipe:

µ1

µ2

µ3

Page 44:

The GMM assumption

• There are k components. The i’th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability P(yi).

Page 45:

The GMM assumption

• There are k components. The i’th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability P(yi).
2. Datapoint ~ N(µi, σ²I)

Page 46:

The General GMM assumption

• There are k components. The i’th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix Σi

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability P(yi).
2. Datapoint ~ N(µi, Σi)

Page 47:

Unsupervised Learning: not as hard as it looks

Sometimes easy, sometimes impossible, and sometimes in between.

[In case you’re wondering what these diagrams are, they show 2-d unlabeled data (x vectors) distributed in 2-d space. The top one has three very clear Gaussian centers.]

Page 48:

Computing likelihoods in supervised learning case

We have (y1, x1), (y2, x2), …, (yN, xN)
Learn P(y1), P(y2), …, P(yk)
Learn σ, µ1, …, µk

By MLE: maximize P(y1, x1, y2, x2, …, yN, xN | µ1, …, µk, σ)

Page 49:

Computing likelihoods in unsupervised case

We have x1, x2, …, xN
We know P(y1), P(y2), …, P(yk)
We know σ

P(x | yi, µ1, …, µk) = prob that an observation from class yi would have value x, given class means µ1, …, µk

Can we write an expression for that?

Page 50:

Likelihoods in unsupervised case

We have x1, x2, …, xn
We have P(y1), …, P(yk). We have σ.
We can define, for any x: P(x | yi, µ1, µ2, …, µk)

Can we define P(x | µ1, µ2, …, µk)?

Can we define P(x1, x2, …, xn | µ1, µ2, …, µk)?

[YES, IF WE ASSUME THE xi’S WERE DRAWN INDEPENDENTLY]
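Both answers are a one-liner in code: marginalize over classes for each point, then multiply across (independent) points. A sketch under the lecture's assumptions (spherical covariance σ²I and known priors; all names are mine):

```python
import numpy as np

def log_lik(X, mus, priors, sigma):
    """log P(x_1, ..., x_n | mu_1, ..., mu_k) for independent points, each a
    mixture of spherical Gaussians N(mu_j, sigma^2 I) with known priors P(y_j).

    X: (n, d) points; mus: (k, d) means; priors: length-k vector."""
    n, d = X.shape
    sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)          # (n, k) distances
    px_given_y = np.exp(-0.5 * sq / sigma**2) / (2 * np.pi * sigma**2) ** (d / 2)
    return np.log(px_given_y @ np.asarray(priors)).sum()           # sum over points
```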

Page 51:

Unsupervised Learning: Mediumly Good News

We now have a procedure s.t. if you give me a guess at µ1, µ2, …, µk, I can tell you the prob of the unlabeled data given those µ’s.

Suppose x’s are 1-dimensional. There are two classes: w1 and w2.
P(y1) = 1/3, P(y2) = 2/3, σ = 1.

There are 25 unlabeled datapoints:
x1 = 0.608
x2 = -1.590
x3 = 0.235
x4 = 3.949
 ⋮
x25 = -0.712

(From Duda and Hart)

Page 52:

Duda & Hart’s Example

We can graph the prob. dist. function of data given our µ1 and µ2 estimates. We can also graph the true function from which the data was randomly generated.

• They are close. Good.
• The 2nd solution tries to put the “2/3” hump where the “1/3” hump should go, and vice versa.
• In this example, unsupervised is almost as good as supervised. If the x1 … x25 are given the class which was used to learn them, then the results are (µ1 = −2.176, µ2 = 1.684). Unsupervised got (µ1 = −2.13, µ2 = 1.668).

Page 53:

Duda & Hart’s Example

Graph of log P(x1, x2, …, x25 | µ1, µ2) against µ1 (→) and µ2 (↑)

Max likelihood = (µ1 = −2.13, µ2 = 1.668)

Local minimum, but very close to global at (µ1 = 2.085, µ2 = −1.257)*

* corresponds to switching y1 with y2.

Page 54:

Finding the max likelihood µ1, µ2, …, µk

We can compute P(data | µ1, µ2, …, µk). How do we find the µi’s which give max likelihood?

• The normal max likelihood trick: set ∂/∂µi log Prob(…) = 0 and solve for the µi’s. Here you get non-linear, non-analytically-solvable equations.
• Use gradient descent: slow but doable.
• Use a much faster, cuter, and recently very popular method…

Page 55:

Expectation Maximalization

Page 56:

The E.M. Algorithm

DETOUR

• We’ll get back to unsupervised learning soon. But now we’ll look at an even simpler case with hidden information.
• The EM algorithm
  • Can do trivial things, such as the contents of the next few slides.
  • An excellent way of doing our unsupervised learning problem, as we’ll see.
  • Many, many other uses, including inference of Hidden Markov Models (future lecture).

Page 57:

Silly Example

Let events be “grades in a class”:
w1 = Gets an A   P(A) = ½
w2 = Gets a B    P(B) = µ
w3 = Gets a C    P(C) = 2µ
w4 = Gets a D    P(D) = ½ − 3µ
(Note 0 ≤ µ ≤ 1/6)

Assume we want to estimate µ from data. In a given class there were
a A’s, b B’s, c C’s, d D’s.

What’s the maximum likelihood estimate of µ given a, b, c, d?

Page 58:

Silly Example

Let events be “grades in a class”:
w1 = Gets an A   P(A) = ½
w2 = Gets a B    P(B) = µ
w3 = Gets a C    P(C) = 2µ
w4 = Gets a D    P(D) = ½ − 3µ
(Note 0 ≤ µ ≤ 1/6)

Assume we want to estimate µ from data. In a given class there were
a A’s, b B’s, c C’s, d D’s.

What’s the maximum likelihood estimate of µ given a, b, c, d?

Page 59:

Trivial Statistics

P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ

P(a, b, c, d | µ) = K (½)^a (µ)^b (2µ)^c (½ − 3µ)^d

log P(a, b, c, d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ − 3µ)

FOR MAX LIKE µ, SET ∂LogP/∂µ = 0:

∂LogP/∂µ = b/µ + 2c/2µ − 3d/(½ − 3µ) = 0

Gives max like µ = (b + c) / (6(b + c + d))

So if class got:  A = 14, B = 6, C = 9, D = 10,  max like µ = 1/10

Boring, but true!

Page 60:

Same Problem with Hidden Information

Someone tells us that:
Number of High grades (A’s + B’s) = h
Number of C’s = c
Number of D’s = d

What is the max. like estimate of µ now?

REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ

Page 61:

Same Problem with Hidden Information

Someone tells us that:
Number of High grades (A’s + B’s) = h
Number of C’s = c
Number of D’s = d
What is the max. like estimate of µ now?

We can answer this question circularly:

EXPECTATION: If we know the value of µ, we could compute the expected values of a and b. Since the ratio a : b should be the same as the ratio ½ : µ:

a = h · ½ / (½ + µ),    b = h · µ / (½ + µ)

MAXIMIZATION: If we know the expected values of a and b, we could compute the maximum likelihood value of µ:

µ = (b + c) / (6(b + c + d))

REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ

Page 62:

E.M. for our Trivial Problem

We begin with a guess for µ. We iterate between EXPECTATION and MAXIMALIZATION to improve our estimates of µ and of a and b.

Define µ(t) = the estimate of µ on the t’th iteration, and b(t) = the estimate of b on the t’th iteration.

µ(0) = initial guess

E-step:   b(t) = µ(t) h / (½ + µ(t)) = expected value of b, given µ(t)

M-step:   µ(t+1) = (b(t) + c) / (6(b(t) + c + d)) = max like estimate of µ, given b(t)

REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ

Continue iterating until converged.
Good news: convergence to a local optimum is assured.
Bad news: I said “local” optimum.

Page 63:

E.M. Convergence

• Convergence proof based on fact that Prob(data | µ) must increase or remain the same between each iteration [NOT OBVIOUS]
• But it can never exceed 1 [OBVIOUS]
• So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0:

t   µ(t)     b(t)
0   0        0
1   0.0833   2.857
2   0.0937   3.158
3   0.0947   3.185
4   0.0948   3.187
5   0.0948   3.187
6   0.0948   3.187

Convergence is generally linear: error decreases by a constant factor each time step.
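The two update rules are a three-line loop. A sketch that reproduces the table above (the per-row b(t) is recomputed from µ(t), matching the slide's tabulation; names are mine):

```python
def em_grades(h, c, d, mu=0.0, iters=6):
    """EM for the grades problem: the E-step fills in the hidden split of the
    h high grades into a A's and b B's; the M-step re-fits mu."""
    for t in range(1, iters + 1):
        b = mu * h / (0.5 + mu)              # E-step: expected b given mu(t-1)
        mu = (b + c) / (6 * (b + c + d))     # M-step: mu(t)
        b_t = mu * h / (0.5 + mu)            # b(t) as tabulated on the slide
        print(t, round(mu, 4), round(b_t, 3))
    return mu

em_grades(h=20, c=10, d=10)   # converges to mu = 0.0948, b = 3.187
```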

Page 64:

Back to Unsupervised Learning of GMMs

Remember:
We have unlabeled data x1, x2, …, xR
We know there are k classes
We know P(y1), P(y2), P(y3), …, P(yk)
We don’t know µ1, µ2, …, µk

We can write P(data | µ1, …, µk):

p(x1 … xR | µ1 … µk) = ∏_{i=1..R} p(xi | µ1 … µk)
                     = ∏_{i=1..R} ∑_{j=1..k} p(xi | yj, µ1 … µk) P(yj)
                     = ∏_{i=1..R} ∑_{j=1..k} K exp( −(1/2σ²) ‖xi − µj‖² ) P(yj)

Page 65:

E.M. for GMMs

For max likelihood, we know we need:

∂/∂µj log Prob(data | µ1 … µk) = 0

Some wild’n’crazy algebra turns this into: “For max likelihood, for each j,

µj = ∑_{i=1..R} P(yj | xi, µ1 … µk) xi  /  ∑_{i=1..R} P(yj | xi, µ1 … µk)”

This is n nonlinear equations in the µj’s.

If, for each xi, we knew the prob that it was in class yj, i.e. P(yj | xi, µ1 … µk), then… we would easily compute µj.

If we knew each µj, then we could easily compute P(yj | xi, µ1 … µk) for each yj and xi.

…I feel an EM experience coming on!!

See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf

Page 66:

E.M. for GMMs

Iterate. On the t’th iteration let our estimates be λt = { µ1(t), µ2(t), …, µc(t) }

E-step: compute “expected” classes of all datapoints for each class:

P(yi | xk, λt) = p(xk | yi, λt) P(yi | λt) / p(xk | λt)
             = p(xk | yi, µi(t), σ²I) pi(t) / ∑_{j=1..c} p(xk | yj, µj(t), σ²I) pj(t)

(Just evaluate a Gaussian at xk.)

M-step: compute max. like µ given our data’s class membership distributions:

µi(t+1) = ∑_k P(yi | xk, λt) xk  /  ∑_k P(yi | xk, λt)
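These two steps in NumPy, for the lecture's simple setting (means unknown; priors and the spherical σ²I covariance known). A sketch; initialization and all names are my choices:

```python
import numpy as np

def em_gmm(X, k, sigma, priors, iters=50, seed=0):
    """EM for the simple GMM: known priors P(y_j), known covariance sigma^2 I;
    only the means are learned. X: (N, d) datapoints."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)]   # initial guess at the means
    for _ in range(iters):
        # E-step: responsibilities P(y_j | x_i, lambda_t); the Gaussian
        # normalizer is the same for every component, so it cancels in the ratio
        sq = ((X[:, None, :] - mus[None]) ** 2).sum(-1)  # (N, k) squared distances
        w = np.exp(-0.5 * sq / sigma**2) * np.asarray(priors)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: each mean becomes a responsibility-weighted average
        mus = (w.T @ X) / w.sum(axis=0)[:, None]
    return mus
```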

Page 67:

E.M. Convergence

This algorithm is REALLY USED. And in high-dimensional state spaces, too. E.g., Vector Quantization for Speech Data.

• Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works.
• As with all EM procedures, convergence to a local optimum is guaranteed.

Page 68:

E.M. for General GMMs

Iterate. On the t’th iteration let our estimates be
λt = { µ1(t), µ2(t), …, µc(t), Σ1(t), Σ2(t), …, Σc(t), p1(t), p2(t), …, pc(t) }

(pi(t) is shorthand for the estimate of P(yi) on the t’th iteration.)

E-step: compute “expected” classes of all datapoints for each class:

P(yi | xk, λt) = p(xk | yi, µi(t), Σi(t)) pi(t) / ∑_{j=1..c} p(xk | yj, µj(t), Σj(t)) pj(t)

(Just evaluate a Gaussian at xk.)

M-step: compute max. like parameters given our data’s class membership distributions:

µi(t+1) = ∑_k P(yi | xk, λt) xk / ∑_k P(yi | xk, λt)

Σi(t+1) = ∑_k P(yi | xk, λt) [xk − µi(t+1)] [xk − µi(t+1)]ᵀ / ∑_k P(yi | xk, λt)

pi(t+1) = ∑_k P(yi | xk, λt) / R,   where R = #records
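The general M-step, given responsibilities from an E-step like the one sketched two slides back (a NumPy sketch; w[k, i] = P(yi | xk, λt) and all names are mine):

```python
import numpy as np

def general_m_step(X, w):
    """M-step of the general GMM, given responsibilities w of shape (R, c),
    where w[k, i] = P(y_i | x_k, lambda_t). X: (R, d) datapoints."""
    R, c = w.shape
    Nk = w.sum(axis=0)                        # effective count per class
    mus = (w.T @ X) / Nk[:, None]             # mu_i(t+1): weighted means
    Sigmas = np.stack([
        ((w[:, i, None] * (X - mus[i])).T @ (X - mus[i])) / Nk[i]   # Sigma_i(t+1)
        for i in range(c)])
    ps = Nk / R                               # p_i(t+1): fraction of records owned
    return mus, Sigmas, ps
```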

Page 69:

Advance apologies: in black and white this example will be incomprehensible.

Gaussian Mixture Example: Start

Page 70:

After first iteration

Page 71:

After 2nd iteration

Page 72:

After 3rd iteration

Page 73:

After 4th iteration

Page 74:

After 5th iteration

Page 75:

After 6th iteration

Page 76:

After 20th iteration

Page 77:

Some Bio Assay data

Page 78:

GMM clustering of the assay data

Page 79:

Resulting Density Estimator

Page 80:

Three classes of assay (each learned with its own mixture model)

Page 81:

Resulting Bayes Classifier

Page 82:

Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness

Yellow means anomalous

Cyan means ambiguous

Page 83:

Final Comments

• Remember, E.M. can get stuck in local minima, and empirically it DOES.

• Our unsupervised learning example assumed P(yi)’s known, and variances fixed and known. Easy to relax this.

• It’s possible to do Bayesian unsupervised learning instead of max. likelihood.

Page 84:

What you should know

• How to “learn” maximum likelihood parameters (locally max. like.) in the case of unlabeled data.

• Be happy with this kind of probabilistic analysis.

• Understand the two examples of E.M. given in these notes.

Page 85:

Acknowledgements

• K-means & Gaussian mixture models presentation derived from an excellent tutorial by Andrew Moore: http://www.autonlab.org/tutorials/

• K-means Applet: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html

• Gaussian mixture models Applet: http://www.neurosci.aist.go.jp/%7Eakaho/MixtureEM.html