Finding Optimal Bayesian Networks with Greedy Search Max Chickering


Outline

• Bayesian-Network Definitions

• Learning

• Greedy Equivalence Search (GES)

• Optimality of GES

Bayesian Networks

$p(X_1, \ldots, X_n \mid S, \Theta) = \prod_{i=1}^{n} p(X_i \mid \mathrm{Par}(X_i), \Theta)$

Use B = (S, Θ) to represent p(X1, …, Xn)
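The factorization above can be sketched in a few lines. The network X → Y → Z and all probability values below are made up for illustration; only the product-of-conditionals form comes from the slide.

```python
from itertools import product

# Hypothetical three-node network X -> Y -> Z over binary variables.
parents = {"X": (), "Y": ("X",), "Z": ("Y",)}

# Made-up parameters: cpt[node][parent_values] = P(node = 1 | parents).
cpt = {
    "X": {(): 0.3},
    "Y": {(0,): 0.2, (1,): 0.9},
    "Z": {(0,): 0.5, (1,): 0.1},
}

def joint_probability(assignment):
    """Multiply the local conditionals p(X_i | Par(X_i)) for one full assignment."""
    prob = 1.0
    for node, pars in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pars)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

# The factorized joint sums to one over all 2^3 assignments.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
```

Here `parents` plays the role of S and `cpt` the role of Θ.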

Markov Conditions

[Figure: node X with its parents (Par), descendants (Desc), and non-descendants (ND)]

From factorization: I(X, ND | Par(X))

Markov Conditions + Graphoid Axioms characterize all independencies
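A minimal sketch of reading the Markov conditions off a DAG. The edge-set representation and the function names are assumptions for illustration, and the non-descendant set is reported with the parents removed, since conditioning on them makes their inclusion redundant.

```python
def descendants(edges, node):
    """All nodes reachable from node along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def markov_conditions(nodes, edges):
    """Map each X to (ND, Par(X)), read as I(X, ND | Par(X))."""
    conditions = {}
    for x in nodes:
        par = {p for p, c in edges if c == x}
        nd = set(nodes) - {x} - descendants(edges, x) - par
        conditions[x] = (nd, par)
    return conditions

# Hypothetical chain X -> Y -> Z.
chain = {("X", "Y"), ("Y", "Z")}
conditions = markov_conditions(("X", "Y", "Z"), chain)
```

For the chain, the only non-trivial condition is I(Z, {X} | {Y}).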

Structure/Distribution Inclusion

p is included in S if there exists Θ s.t. B = (S, Θ) defines p

[Figure: structure S over X, Y, Z, drawn as a region S inside the set of all distributions, with p inside S]

Structure/Structure Inclusion T ≤ S

T is included in S if every p included in T is included in S

[Figure: structures S and T over X, Y, Z, drawn as regions S and T inside the set of all distributions]

(S is an I-map of T)

Structure/Structure Equivalence T ≈ S

[Figure: structures S and T over X, Y, Z, drawn as regions inside the set of all distributions]

Reflexive, Symmetric, Transitive

Equivalence

Theorem (Verma and Pearl, 1990): S ≈ T ⇔ same v-structures and skeletons

[Figure: a graph over A, B, C, D illustrating a v-structure, and the same nodes illustrating the skeleton]
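The Verma and Pearl characterization translates directly into a check. This is a sketch under the assumption that both DAGs are given as sets of (parent, child) pairs over the same node set; the function names are illustrative.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected adjacencies of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Triples (a, c, b) with a -> c <- b where a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def equivalent(edges_s, edges_t):
    """Same skeleton and same v-structures."""
    return (skeleton(edges_s) == skeleton(edges_t)
            and v_structures(edges_s) == v_structures(edges_t))

chain = {("X", "Y"), ("Y", "Z")}            # X -> Y -> Z
reversed_chain = {("Z", "Y"), ("Y", "X")}   # X <- Y <- Z
collider = {("X", "Y"), ("Z", "Y")}         # X -> Y <- Z
```

The two chains are equivalent; the collider shares their skeleton but adds a v-structure, so it is not.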

Learning Bayesian Networks

1. Learn the structure

2. Estimate the conditional distributions

[Figure: generative distribution p* → iid samples → observed data → learned model over X, Y, Z]

Observed data:

X Y Z
0 1 1
1 0 1
0 1 0
. . .
1 0 1
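Step 2 (estimating the conditional distributions) can be sketched with simple counting. The data below are only the four rows visible on the slide, the elided rows are left out, and the structure X → Y used in the example is a hypothetical choice.

```python
from collections import Counter

# Rows are (X, Y, Z): the four rows visible on the slide.
data = [(0, 1, 1), (1, 0, 1), (0, 1, 0), (1, 0, 1)]
columns = {"X": 0, "Y": 1, "Z": 2}

def estimate_cpt(data, child, parents):
    """Maximum-likelihood estimate of p(child | parents) from counts."""
    joint = Counter()
    marginal = Counter()
    for row in data:
        pa = tuple(row[columns[p]] for p in parents)
        joint[(pa, row[columns[child]])] += 1
        marginal[pa] += 1
    # Key: (parent values, child value); value: relative frequency.
    return {key: count / marginal[key[0]] for key, count in joint.items()}

cpt_y = estimate_cpt(data, "Y", ["X"])   # p(Y | X) under the assumed edge X -> Y
cpt_z = estimate_cpt(data, "Z", [])      # p(Z) with no parents
```

Only configurations that occur in the data get an entry; a practical estimator would also smooth the counts.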

Learning Structure

• Scoring criterion

F(S, D)

• Search procedure

Identify one or more structures with high values for the scoring function

Properties of Scoring Criteria

• Consistent

• Locally Consistent

• Score Equivalent

Consistent Criterion

S includes p*, T does not include p* ⇒ F(S,D) > F(T,D)

Both include p*, S has fewer parameters ⇒ F(S,D) > F(T,D)

Criterion favors (in the limit) simplest model that includes the generative distribution p*

[Figure: several structures over X, Y, Z compared against the generative distribution p*]

Locally Consistent Criterion

S and T differ by one edge:

[Figure: S has no edge between X and Y; T adds the edge Y → X]

If I(X,Y|Par(X)) in p* then F(S,D) > F(T,D)

Otherwise F(S,D) < F(T,D)

Score-Equivalent Criterion

S ≈ T ⇒ F(S,D) = F(T,D)

[Figure: S and T are the two orientations of the single edge between X and Y]

Bayesian Criterion (consistent, locally consistent and score equivalent)

Sh : generative distribution p* has same independence constraints as S.

FBayes(S,D) = log p(Sh | D) = k + log p(D | Sh) + log p(Sh)

• log p(D | Sh): Marginal Likelihood (closed form w/ assumptions)

• log p(Sh): Structure Prior (e.g. prefer simple)
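The decomposition of the Bayesian criterion follows from Bayes' rule. Reading the constant k as −log p(D) is an inference from that rule rather than something stated on the slide.

```latex
\begin{align*}
\log p(S^h \mid D)
  &= \log \frac{p(D \mid S^h)\, p(S^h)}{p(D)} \\
  &= \underbrace{-\log p(D)}_{k} + \log p(D \mid S^h) + \log p(S^h)
\end{align*}
```

Because k does not depend on S, it can be ignored when comparing structures on the same data.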

Search Procedure

• Set of states

• Representation for the states

• Operators to move between states

• Systematic Search Algorithm

Greedy Equivalence Search

• Set of states: Equivalence classes of DAGs

• Representation for the states: Essential graphs

• Operators to move between states: Forward and Backward Operators

• Systematic Search Algorithm: Two-phase Greedy

Representation: Essential Graphs

[Figure: a DAG over A, B, C, D, E, F and the corresponding essential graph, distinguishing compelled edges from reversible edges]

GES Operators

Forward Direction – single edge additions

Backward Direction – single edge deletions

Two-Phase Greedy Algorithm

Phase 1: Forward Equivalence Search (FES)
• Start with all-independence model
• Run Greedy using forward operators

Phase 2: Backward Equivalence Search (BES)
• Start with local max from FES
• Run Greedy using backward operators
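The control flow of the two phases can be sketched independently of the graph machinery. In this toy version the states are plain item sets rather than equivalence classes, and `forward`, `backward`, and `toy_score` are invented stand-ins for the GES operators and scoring criterion.

```python
def greedy(state, neighbors, score):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current = score(state)
    while True:
        candidates = list(neighbors(state))
        if not candidates:
            return state
        best = max(candidates, key=score)
        if score(best) <= current:
            return state
        state, current = best, score(best)

def two_phase_greedy(start, forward, backward, score):
    """Phase 1 with forward operators, then phase 2 from the resulting local max."""
    local_max = greedy(start, forward, score)
    return greedy(local_max, backward, score)

UNIVERSE = ("a", "b", "c")

def forward(state):
    return [state | {x} for x in UNIVERSE if x not in state]

def backward(state):
    return [state - {x} for x in sorted(state)]

def toy_score(state):
    # Made-up score: "c" looks attractive early but hurts once "a" and "b" are present.
    k = len(state & {"a", "b"})
    return 4 * k + (5 - 3 * k if "c" in state else 0)

after_forward = greedy(frozenset(), forward, toy_score)
result = two_phase_greedy(frozenset(), forward, backward, toy_score)
```

In this toy run the forward phase overshoots by adding "c", and the backward phase removes it again.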

Forward Operators

• Consider all DAGs in the current state

• For each DAG, consider all single-edge additions (acyclic)

• Take the union of the resulting equivalence classes
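A sketch of the middle step for a single DAG: enumerate every single-edge addition that keeps the graph acyclic. The edge-set representation and function names are assumptions, and the sketch neither enumerates the DAGs in the current state nor merges the results into equivalence classes.

```python
from itertools import permutations

def has_path(edges, source, target):
    """Depth-first search for a directed path source -> ... -> target."""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(child for parent, child in edges if parent == node)
    return False

def single_edge_additions(nodes, edges):
    """All DAGs obtained by adding one edge without creating a cycle."""
    adjacent = {frozenset(e) for e in edges}
    results = []
    for x, y in permutations(nodes, 2):
        if frozenset((x, y)) in adjacent:
            continue                      # x and y are already connected
        if has_path(edges, y, x):
            continue                      # adding x -> y would close a cycle
        results.append(edges | {(x, y)})
    return results

nodes = ("A", "B", "C")
one_edge = frozenset({("A", "B")})
chain = frozenset({("A", "B"), ("B", "C")})
additions_one_edge = single_edge_additions(nodes, one_edge)
additions_chain = single_edge_additions(nodes, chain)
```

For the chain A → B → C only A → C can be added, since C → A would create a cycle.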

Forward-Operators Example

[Figure, over nodes A, B, C: the current state; all DAGs in that state; all DAGs resulting from single-edge addition; the union of the corresponding essential graphs]

Backward Operators

• Consider all DAGs in the current state

• For each DAG, consider all single-edge deletions

• Take the union of the resulting equivalence classes
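The three steps can be sketched by brute force on small graphs. Classes are identified here by their skeleton and v-structures (Verma and Pearl) rather than by essential graphs, and every function name is an assumption for illustration; the enumeration is exponential and only meant for toy examples.

```python
from itertools import combinations, product

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes that have no incoming edge."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        roots = {n for n in remaining if all(child != n for _, child in edges)}
        if not roots:
            return False
        remaining -= roots
        edges = {(p, c) for p, c in edges if p not in roots}
    return True

def signature(edges):
    """(skeleton, v-structures): identifies the equivalence class."""
    skel = frozenset(frozenset(e) for e in edges)
    vs = set()
    for (a, c1), (b, c2) in combinations(sorted(edges), 2):
        if c1 == c2 and frozenset((a, b)) not in skel:
            vs.add((frozenset((a, b)), c1))
    return skel, frozenset(vs)

def equivalence_class(nodes, edges):
    """Every acyclic orientation of the skeleton with the same signature."""
    pairs = [tuple(sorted(e)) for e in signature(edges)[0]]
    members = []
    for flips in product((False, True), repeat=len(pairs)):
        cand = frozenset((b, a) if f else (a, b) for (a, b), f in zip(pairs, flips))
        if is_acyclic(nodes, cand) and signature(cand) == signature(edges):
            members.append(cand)
    return members

def backward_neighbors(nodes, edges):
    """Classes reachable by deleting one edge from any DAG in the current state."""
    return {signature(dag - {e}) for dag in equivalence_class(nodes, edges) for e in dag}

nodes = ("A", "B", "C")
chain = frozenset({("A", "B"), ("B", "C")})      # A -> B -> C
collider = frozenset({("A", "B"), ("C", "B")})   # A -> B <- C
```

The chain's class has three member DAGs, but their six possible deletions collapse into two neighboring classes.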

Backward-Operators Example

[Figure, over nodes A, B, C: the current state; all DAGs in that state; all DAGs resulting from single-edge deletion; the union of the corresponding essential graphs]

DAG Perfect

DAG-perfect distribution p

Exists DAG G: I(X,Y|Z) in p ⇔ I(X,Y|Z) in G

Non-DAG-perfect distribution q

[Figure: q over A, B, C, D with I(A,D|B,C) and I(B,C|A,D), followed by two further graphs over the same nodes labeled I(B,C|A,D) and I(A,D|B,C)]

DAG-Perfect Consequence: Composition Axiom Holds in p*

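If $\neg I(X, \mathbf{Y} \mid Z)$ then $\neg I(X, Y \mid Z)$ for some singleton $Y \in \mathbf{Y}$

[Figure: X shown with the set {A, B, C, D}, then with the singleton C]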
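For reference, the composition axiom in its usual form, together with the contrapositive used here. The standard form is not written on the slide and is supplied as background.

```latex
\[
I(X, \mathbf{Y} \mid \mathbf{Z}) \;\wedge\; I(X, \mathbf{W} \mid \mathbf{Z})
  \;\Longrightarrow\; I(X, \mathbf{Y} \cup \mathbf{W} \mid \mathbf{Z})
\]
% Contrapositive, applied repeatedly until the set is split into singletons:
\[
\neg I(X, \mathbf{Y} \mid \mathbf{Z})
  \;\Longrightarrow\; \exists\, Y \in \mathbf{Y} : \neg I(X, Y \mid \mathbf{Z})
\]
```

The second line is what lets a dependence on a set of nodes be traced to a single node.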

Optimality of GES

If p* is DAG-perfect wrt some G*:

[Figure: G* over X, Y, Z defines p*; n iid samples from p* form the observed data over X, Y, Z; GES applied to the data returns S; S* is shown alongside G*]

For large n, S = S*

Optimality of GES

Proof Outline
• After first phase (FES), current state includes S*
• After second phase (BES), the current state = S*

All-independence → (FES) → State includes S* → (BES) → State equals S*

FES Maximum Includes S*

Assume: Local Max does NOT include S*

Any DAG G from S

Markov Conditions characterize independencies: In p*, exists X not indep. of non-desc given parents

[Figure: graph over A, B, C, D, E, X] ¬I(X,{A,B,C,D} | E) in p*

p* is DAG-perfect ⇒ composition axiom holds

[Figure: same graph] ¬I(X,C | E) in p*

Locally consistent: adding the C → X edge improves score, and the resulting EQ class is a neighbor

BES Identifies S*

• Current state always includes S*:

Local consistency of the criterion

• Local Maximum is S*:

Meek’s conjecture

Meek’s Conjecture

For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of

(1) covered edge reversals in G

(2) single-edge additions to G

such that after each change G ≤ H, and after all changes G = H
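The slides do not define a covered edge. A sketch using the standard definition (X → Y is covered when Par(Y) = Par(X) ∪ {X}) is given below; the helper names are illustrative.

```python
def parents(edges, node):
    """Parent set of node in a DAG given as (parent, child) pairs."""
    return {p for p, c in edges if c == node}

def is_covered(edges, x, y):
    """Edge x -> y is covered when Par(y) = Par(x) ∪ {x}."""
    return (x, y) in edges and parents(edges, y) == parents(edges, x) | {x}

def reverse_edge(edges, x, y):
    """Replace x -> y by y -> x."""
    return (edges - {(x, y)}) | {(y, x)}

complete = frozenset({("A", "B"), ("A", "C"), ("B", "C")})   # A -> B -> C, A -> C
collider = frozenset({("A", "B"), ("C", "B")})               # A -> B <- C
reversed_ab = reverse_edge(complete, "A", "B")
```

Reversing a covered edge leaves the skeleton and v-structures unchanged, so the DAG stays in its equivalence class. In the collider, A → B is not covered, and reversing it would destroy the v-structure.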

Meek’s Conjecture

[Figure: DAGs G and H over A, B, C, D, labeled with I(A,B) and I(C,B|A,D), together with the intermediate DAGs of the transformation from G to H]

Meek’s Conjecture and BES

S* ≤ S

Assume: Local Max S Not S*

Any DAG H from S

Any DAG G from S*

[Figure: the sequence of additions (Add) and reversals (Rev) leading from G to H, and the same sequence read from H back to G as deletions (Del) and reversals (Rev); a state between S* and S is marked "Neighbor of S in BES"]

Discussion Points

• In practice, GES is as fast as DAG-based search

Neighborhood of essential graphs can be generated and scored very efficiently

• When DAG-perfect assumption fails, we still get optimality guarantees

As long as composition holds in generative distribution, local maximum is inclusion-minimal

Thanks!

My Home Page: http://research.microsoft.com/~dmax

Relevant Papers:

“Optimal Structure Identification with Greedy Search”
JMLR Submission
Contains detailed proofs of Meek’s conjecture and optimality of GES

“Finding Optimal Bayesian Networks”
UAI02 Paper with Chris Meek
Contains extension of optimality results of GES when not DAG perfect

Bayesian Criterion is Locally Consistent

• Bayesian score approaches BIC + constant

• BIC is decomposable:

$\mathrm{BIC}(S, D) = \sum_{i=1}^{n} F(X_i, \mathrm{Par}(X_i))$

• Difference in score same for any DAGs that differ by Y → X edge if X has same parents

[Figure: two structures over X and Y]

Complete network (always includes p*)
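Decomposability can be checked numerically. The sketch below assumes binary variables, uses the usual BIC form (log-likelihood minus half the parameter count times log N), and runs on invented data; none of these specifics come from the slides.

```python
import math
from collections import Counter

def local_bic(data, child, parents, arity=2):
    """Log-likelihood of child given parents minus the BIC penalty for its table."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marginal = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / marginal[pa]) for (pa, _), c in joint.items())
    num_params = (arity - 1) * arity ** len(parents)
    return loglik - 0.5 * math.log(n) * num_params

def bic(data, structure):
    """structure maps each node to its tuple of parents; the score is a sum of local terms."""
    return sum(local_bic(data, x, pars) for x, pars in structure.items())

# Made-up binary data over X, Y, Z.
rows = [(0, 1, 1), (1, 0, 1), (0, 1, 0), (1, 0, 1),
        (0, 0, 1), (1, 1, 0), (0, 1, 1), (1, 0, 0)]
data = [dict(zip("XYZ", row)) for row in rows]

# Two pairs of DAGs; each pair differs only by the edge Y -> X.
s1 = {"X": (), "Y": (), "Z": ()}
t1 = {"X": ("Y",), "Y": (), "Z": ()}
s2 = {"X": (), "Y": (), "Z": ("Y",)}
t2 = {"X": ("Y",), "Y": (), "Z": ("Y",)}

diff_1 = bic(data, t1) - bic(data, s1)
diff_2 = bic(data, t2) - bic(data, s2)
```

Both differences reduce to the change in the local term for X, so they agree regardless of what the rest of the graph looks like.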

Bayesian Criterion is Consistent

Assume Conditionals:
(1) unconstrained multinomials
(2) linear regressions

Network structures = curved exponential models

Bayesian Criterion is consistent

Geiger, Heckerman, King and Meek (2001)

Haughton (1988)

Bayesian Criterion is Score Equivalent

S ≈ T ⇒ F(S,D) = F(T,D)

[Figure: S and T are the two orientations of the single edge between X and Y]

Sh = Th

Sh : no independence constraints

Th : no independence constraints

Active Paths

Z-active Path between X and Y: (non-standard)
1. Neither X nor Y is in Z
2. Every pair of colliding edges meets at a member of Z
3. No other pair of edges meets at a member of Z

[Figure: path through X, Z, Y]

G ≤ H ⇒ If Z-active path between X and Y in G then Z-active path between X and Y in H
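The three conditions can be checked with a breadth-first search over (node, arrival direction) states. This sketch assumes that a path may traverse the same edge more than once, which is how the non-standard definition is read here; the function name and graph encoding are illustrative.

```python
from collections import deque

def has_active_path(edges, x, y, z):
    """Search for a Z-active path between x and y in the sense defined above."""
    z = set(z)
    if x in z or y in z:
        return False                       # condition 1
    children, parents = {}, {}
    for p, c in edges:
        children.setdefault(p, set()).add(c)
        parents.setdefault(c, set()).add(p)
    # State: (node, True if the edge just traversed points into node).
    start = [(c, True) for c in children.get(x, ())] + \
            [(p, False) for p in parents.get(x, ())]
    queue, seen = deque(start), set(start)
    while queue:
        node, into = queue.popleft()
        if node == y:
            return True
        moves = []
        if node not in z:
            # Condition 3: non-colliding pairs may only meet outside Z.
            moves += [(c, True) for c in children.get(node, ())]
            if not into:
                moves += [(p, False) for p in parents.get(node, ())]
        elif into:
            # Condition 2: a colliding pair must meet at a member of Z.
            moves += [(p, False) for p in parents.get(node, ())]
        for state in moves:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False

chain = {("X", "M"), ("M", "Y")}                     # X -> M -> Y
collider = {("X", "M"), ("Y", "M")}                  # X -> M <- Y
collider_child = {("X", "M"), ("Y", "M"), ("M", "W")}  # collider with child W
```

In the last graph, conditioning on W activates X → M → W ← M ← Y, a path that uses the edge M → W twice.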

Active Paths

[Figure: a path through X, A, Z, W, B, Y, and a path through A, B, C]

• X-Y: Out-of X and In-to Y

• X-W: Out-of both X and W

• Any sub-path between A, B ∉ Z is also active

• A – B, B – C, at least one is out-of B ⇒ Active path between A and C

Simple Active Paths

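[Figure: active path between A and B passing through the edge Y → X]

If an active path between A and B contains Y → X, then there is an active path in which:

(1) Edge appears exactly once

OR

(2) Edge appears exactly twice

Simplify discussion: Assume (1) only – proofs for (2) almost identical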

Typical Argument:Combining Active Paths

[Figure: active paths between A and B through X, Y, and Z, shown in G and in H]

G’ : Suppose AP in G’ (X not in CS) with no corresp. AP in H. Then Z not in CS.

Z sink node adj X,Y

G≤H

Proof Sketch

Two DAGs G, H with G<H

Identify either:

(1) a covered edge X → Y in G that has opposite orientation in H

(2) a new edge X → Y to be added to G such that it remains included in H

The Transformation

Choose any node Y that is a sink in H

Case 1a: Y is a sink in G, ∃ X ∈ ParH(Y), X ∉ ParG(Y)

Case 1b: Y is a sink in G, same parents

Case 2a: ∃ X s.t. Y → X covered

Case 2b: ∃ X s.t. Y → X & W par of Y but not X

Case 2c: Every Y → X, Par(Y) ⊆ Par(X)

[Figure: the change made to G in each case, over nodes X, Y, W]

Preliminaries

(G ≤ H)

• The adjacencies in G are a subset of the adjacencies in H

• If X → Y ← Z is a v-structure in G but not H, then X and Z are adjacent in H

• Any new active path that results from adding X → Y to G includes X → Y

Proof Sketch: Case 1

Y is a sink in G

Case 1a: ∃ X ∈ ParH(Y), X ∉ ParG(Y)

[Figure: X and Y in H and in G]

Suppose there’s some new active path between A and B not in H

[Figure: the new active path between A and B through Z, Y, X]

1. Y is a sink in G, so it must be in CS
2. Neither X nor next node Z is in CS
3. In H, AP(A,Z), AP(X,B), Z → Y ← X

Case 1b: Parents identical

Remove Y from both graphs: proof similar

Proof Sketch: Case 2

Y is not a sink in G

Case 2a: There is a covered edge Y → X: Reverse the edge

Case 2b: There is a non-covered edge Y → X such that W is a parent of Y but not a parent of X

[Figure: G, G’, and H over W, Y, X]

Suppose there’s some new active path between A and B not in H

Y must be in CS, else replace W → X by W → Y → X (not new).

If X not in CS, then in H active: A-W, X-B, W → Y ← X

[Figure: G’ and H with the path between A and B through W, Y, Z, X]

Case 2c: The Difficult Case

All non-covered edges Y → Z have Par(Y) ⊆ Par(Z)

[Figure: G and H over W1, W2, Y, Z1, Z2]

W1 → Y: G no longer < H (Z2-active path between W1 and W2)

W2 → Y: G < H

Choosing Z

[Figure: G and H, showing Y, its child Z, the descendants of Y in G, and the node D]

D is the maximal G-descendant in H

Z is any maximal child of Y such that D is a descendant of Z in G

Choosing Z

[Figure: G and H over W1, W2, Y, Z1, Z2]

Descendants of Y in G: Y, Z1, Z2

Maximal descendant in H: D = Z2

Maximal child of Y in G that has D = Z2 as descendant: Z2

Add W2 → Y

Difficult Case: Proof Intuition

[Figure: G and H over W, Y, Z, D with endpoints A and B; paths run toward "B or CS"]

1. W not in CS
2. Y not in CS, else active in H
3. In G, next edges must be away from Y until B or CS reached
4. In G, neither Z nor desc in CS, else active before addition
5. From (1,2,4), AP (A,D) and (B,D) in H
6. Choice of D: directed path from D to B or CS in H

Optimality of GES

Definition
p is DAG-perfect wrt G: Independence constraints in p are precisely those in G

Assumption
Generative distribution p* is perfect wrt some G* defined over the observable variables

S* = Equivalence class containing G*

Under DAG-perfect assumption, GES results in S*

Important Definitions

• Bayesian Networks

• Markov Conditions

• Distribution/Structure Inclusion

• Structure/Structure Inclusion
