Being Bayesian about Network Structure
Nir Friedman (Hebrew Univ.) and Daphne Koller (Stanford Univ.)
2
Structure Discovery

Current practice: model selection
- Pick a single model (of high score)
- Use that model to represent the domain structure
- With enough data, the “right” model is overwhelmingly likely

But what about the rest of the time?
- Many high-scoring models
- An answer based on one model is often useless
Bayesian model averaging is the Bayesian ideal:

P(f | D) = Σ_G f(G) P(G | D)

where f is a feature of G, e.g., “X → Y is an edge”.
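To make the averaging concrete, here is a brute-force sketch that enumerates every DAG on three nodes and computes P(f | D) exactly. The `toy_score` is a made-up stand-in for a real marginal-likelihood score; the point is that even this tiny space already holds 25 structures, which is why exact averaging does not scale.

```python
from itertools import combinations

NODES = (0, 1, 2)
EDGES = [(i, j) for i in NODES for j in NODES if i != j]  # 6 possible directed edges

def is_acyclic(edges):
    """Peel off nodes with no incoming edges; a cycle leaves nothing to peel."""
    remaining = set(NODES)
    while remaining:
        free = [n for n in remaining
                if not any(e[1] == n and e[0] in remaining for e in edges)]
        if not free:
            return False
        remaining -= set(free)
    return True

def toy_score(edges):
    """Made-up unnormalized posterior P(D | G)P(G): rewards the edge 0 -> 1
    and penalizes density (a stand-in for a real marginal likelihood)."""
    return (2.0 if (0, 1) in edges else 1.0) * 0.5 ** len(edges)

# All DAGs on 3 labeled nodes (25 of them; the count is superexponential in n)
dags = [g for r in range(len(EDGES) + 1)
        for g in combinations(EDGES, r) if is_acyclic(g)]

# P(f | D) = sum_G f(G) P(G | D), with f = "the edge 0 -> 1 is present"
z = sum(toy_score(g) for g in dags)
p_feature = sum(toy_score(g) for g in dags if (0, 1) in g) / z
print(len(dags), round(p_feature, 3))
```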
3
Model Averaging
Unfortunately, full averaging is intractable: the number of possible structures is superexponential, 2^{O(n² log n)}, so no one really does it.*

Our contribution:
- a closed-form solution for a fixed ordering over the nodes
- MCMC over orderings for the general case
- faster convergence, robust results

* Exceptions: Madigan & Raftery, Madigan & York; see below
4
Fixed Ordering
Suppose that we know the ordering of the variables:
- say, X1 > X2 > X3 > X4 > … > Xn, so the parents of Xi must come from {X1, …, Xi-1}
- limit the number of parents per node to k

Intuition: the order decouples the choices of parents
- the choice of parents for X7 does not restrict the choice of parents for X12
- we can exploit this to simplify the form of P(D)

Only 2^{k·n·log n} networks are consistent with the ordering
5
Ordering: Computing P(D)
The set U_{i,≺} of possible parent sets U for Xi consistent with ≺ has size at most O(n^k).

P(D | ≺) = Σ_{G consistent with ≺} P(D | G)
         = Σ_G Π_i Score(Xi, Pa_G(Xi) | D)
         = Π_i Σ_{U ∈ U_{i,≺}} Score(Xi, U | D)

- Small number of potential families per node
- Independence of families given the ordering
⇒ Efficient closed-form summation over an exponential number of structures
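The per-node summation can be sketched directly. Here `local_log_score` is an assumed callback standing in for a real decomposable score (e.g. log BDe), and the log-sum-exp is there only for numerical stability; this is a sketch, not the authors' implementation.

```python
import itertools
import math

def log_marginal_given_ordering(order, local_log_score, max_parents):
    """log P(D | ordering): for each node, sum the scores of all candidate
    parent sets consistent with the ordering, then multiply across nodes
    (add in log space). `local_log_score(child, parents)` is an assumed
    callback standing in for a real decomposable score such as log BDe."""
    total = 0.0
    for pos, child in enumerate(order):
        candidates = order[:pos]  # parents must precede the child in the ordering
        terms = [local_log_score(child, parents)
                 for k in range(min(max_parents, len(candidates)) + 1)
                 for parents in itertools.combinations(candidates, k)]
        m = max(terms)  # log-sum-exp over the O(n^k) candidate parent sets
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total

# Toy score favoring fewer parents (purely illustrative)
toy_score = lambda child, parents: -float(len(parents))
val = log_marginal_given_ordering(["A", "B", "C"], toy_score, max_parents=2)
print(val)
```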
6
MCMC over Models
Cannot enumerate structures, so sample structures.

MCMC sampling:
- define a Markov chain over BN models
- run the chain to obtain samples from the posterior P(G | D)

Possible pitfalls:
- huge number of models
- mixing rate (and required burn-in) unknown
- islands of high posterior, connected by low bridges
P(f | D) ≈ (1/N) Σ_{i=1}^{N} f(G_i),  with G_i ~ P(G | D)
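The sampling estimate itself is a one-liner. The `samples` below are hypothetical structures represented as edge sets, not the output of a real chain:

```python
def feature_posterior(samples, feature):
    """Monte Carlo estimate: P(f | D) ~ (1/N) * sum_i f(G_i),
    where each G_i is a structure sampled from P(G | D)."""
    return sum(feature(g) for g in samples) / len(samples)

# Hypothetical sampled structures, each represented as a set of directed edges
samples = [{("A", "B")}, {("A", "B"), ("B", "C")}, {("B", "C")}, {("A", "B")}]
p = feature_posterior(samples, lambda g: ("A", "B") in g)
print(p)  # fraction of sampled structures containing the edge A -> B
```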
7
ICU Alarm BN: No Mixing
With 500 instances, however, the runs clearly do not mix.
[Figure: score of the current sample vs. MCMC iteration (0 to 600,000) for runs initialized from the empty network and from a greedy-search network; scores span roughly -9400 to -8400, and the runs settle at different levels.]
8
Effects of Non-Mixing
- Two MCMC runs over the same 500 instances
- Probability estimates for Markov features, based on 50 networks sampled from the MCMC process
- The probability estimates are highly variable and non-robust
[Figure: scatter plots of feature-probability estimates from pairs of runs (both axes 0 to 1); left: true BN vs. random initialization; right: true BN vs. true BN.]
9
Our Approach: Sample Orderings
We can write

P(G | D) = Σ_≺ P(G | ≺, D) P(≺ | D)

Comment: the structure prior P(G) changes: a uniform prior over structures becomes a uniform prior over orderings, and over the structures consistent with a given ordering.

Sample orderings and approximate

E[f(G) | D] ≈ (1/N) Σ_{i=1}^{N} E[f(G) | ≺_i, D],  with ≺_i ~ P(≺ | D)
10
MCMC Over Orderings
Use the Metropolis-Hastings algorithm: specify a proposal distribution q(≺′ | ≺)
- flip: (i1 … ij … ik … in) → (i1 … ik … ij … in)
- “cut”: (i1 … ij ij+1 … in) → (ij+1 … in i1 … ij)

Each iteration: sample ≺′ from q(≺′ | ≺) and move to ≺′ with probability

min(1, [P(≺′ | D) q(≺ | ≺′)] / [P(≺ | D) q(≺′ | ≺)])

Since the priors are uniform and the proposals are symmetric, this reduces to

P(≺′ | D) / P(≺ | D) = P(D | ≺′) / P(D | ≺)

Efficient computation!
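A minimal sketch of the sampler, using only the symmetric flip proposal. `log_p_data_given_order` is an assumed callback (in the real algorithm it would be the closed-form per-node summation from the fixed-ordering case), and the toy target here is invented purely to exercise the chain:

```python
import math
import random

def mh_over_orderings(nodes, log_p_data_given_order, n_steps, seed=0):
    """Metropolis-Hastings over orderings using the symmetric 'flip' proposal.
    With a uniform prior over orderings and symmetric q, the acceptance
    probability reduces to min(1, P(D | order') / P(D | order)).
    `log_p_data_given_order` is an assumed callback; in the real algorithm it
    would be the closed-form per-node summation over parent sets."""
    rng = random.Random(seed)
    order = list(nodes)
    cur = log_p_data_given_order(order)
    samples = []
    for _ in range(n_steps):
        i, j = rng.sample(range(len(order)), 2)  # pick two positions to flip
        proposal = list(order)
        proposal[i], proposal[j] = proposal[j], proposal[i]
        new = log_p_data_given_order(proposal)
        if math.log(rng.random()) < new - cur:  # accept with min(1, ratio)
            order, cur = proposal, new
        samples.append(list(order))
    return samples

# Invented toy target: orderings that place "A" early score higher
toy_target = lambda order: -float(order.index("A"))
chain = mh_over_orderings(["A", "B", "C", "D"], toy_target, n_steps=2000)
frac_a_first = sum(o[0] == "A" for o in chain[500:]) / len(chain[500:])
print(round(frac_a_first, 2))
```

With this target the chain should place "A" first roughly 1/(1 + e⁻¹ + e⁻² + e⁻³) of the time after burn-in.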
11
Why Ordering Helps
Smaller space: a significant reduction in the size of the sample space.

Better-structured space: we can get from one ordering to another in a (relatively) small number of steps.

Smoother posterior “landscape”: the score of an ordering is a sum over many networks, so no ordering is “horrendous”, and there are no “islands” of high posterior separated by a deep blue sea.
12
Mixing with MCMC-Orderings
4 runs on ICU-Alarm with 500 instances:
- fewer iterations than MCMC over networks
- approximately the same amount of computation
The process is clearly mixing!
[Figure: score of the current sample vs. MCMC iteration (0 to 60,000) for runs with random and greedy initialization; all scores lie in the narrow range -8450 to -8400.]
13
Mixing of MCMC Runs
- Two MCMC runs over the same 500 instances
- Probability estimates for Markov features, based on 50 networks sampled from the MCMC process
- The probability estimates are very robust

[Figure: scatter plots of feature-probability estimates from the two runs (both axes 0 to 1); left panel: 100 instances; right panel: 1000 instances.]
14
Computing Feature Posteriors: P(f | ≺, D)
Edges: for an edge Y → Xi,

P(f_{Y→Xi} | ≺, D) = [ Σ_{U ∈ U_{i,≺}: Y ∈ U} Score(Xi, U | D) ] / [ Σ_{U ∈ U_{i,≺}} Score(Xi, U | D) ]

Markov blanket: Y is in the Markov blanket of Z if Y and Z are connected by an edge, or if both are parents of some X. Given the ordering, the posteriors of these features are independent, so

P(f_{Y ∈ MB(Z)} | ≺, D) = 1 − (1 − P(f_{YZ} | ≺, D)) · Π_{X ∉ {Y,Z}} (1 − P(f_{Y→X, Z→X} | ≺, D))

Other features (e.g., existence of a causal path):
- sample networks from the ordering
- estimate the features from the sampled networks
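The edge-feature ratio can be sketched as below. Again `local_log_score` is an assumed stand-in for a real decomposable score, and a production version would keep the sums in log space:

```python
import itertools
import math

def edge_posterior(child, parent, order, local_log_score, max_parents):
    """P(parent -> child | ordering, D): the summed score of parent sets
    containing `parent`, divided by the summed score of all parent sets.
    `local_log_score` is an assumed stand-in for a real decomposable score."""
    candidates = order[:order.index(child)]  # legal parents precede the child
    num = den = 0.0
    for k in range(min(max_parents, len(candidates)) + 1):
        for parents in itertools.combinations(candidates, k):
            w = math.exp(local_log_score(child, parents))  # real code: log space
            den += w
            if parent in parents:
                num += w
    return num / den

toy_score = lambda child, parents: -float(len(parents))  # toy, favors sparsity
p = edge_posterior("C", "A", ["A", "B", "C"], toy_score, max_parents=2)
print(round(p, 3))
```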
15
Feature Reconstruction (ICU-Alarm) Markov Features
[Figure: false negatives (0 to 50) vs. false positives (0 to 30) for Markov-feature reconstruction, three panels, comparing Bootstrap, Order, and Structure.]
Reconstruct “true” features of generating network
16
Feature Reconstruction (ICU-Alarm): Path Features

[Figure: false negatives (0 to 200) vs. false positives (0 to 600) for path-feature reconstruction, three panels, comparing Bootstrap, Order, and Structure.]
17
Conclusion
- Full Bayesian model averaging is tractable for a known ordering.
- MCMC over orderings allows a robust approximation to full Bayesian averaging over Bayesian networks:
  - rapid and reliable mixing
  - robust, reliable estimates for the probabilities of structural features
- Crucial for structure discovery in domains with limited data, e.g., biological discovery.