Finding Optimal Bayesian Networks with Greedy Search Max Chickering


Finding Optimal Bayesian Networks with Greedy Search

Max Chickering

Outline

• Bayesian-Network Definitions

• Learning

• Greedy Equivalence Search (GES)

• Optimality of GES

Bayesian Networks

p(X1, …, Xn | S, θ) = ∏i=1..n p(Xi | Par(Xi), θi)

Use B = (S, θ) to represent p(X1, …, Xn)
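To make the factorization concrete, here is a minimal sketch (my own illustration, not from the slides) that evaluates the joint as a product of local conditionals; the variable names and CPT layout are hypothetical.

```python
# Minimal sketch: evaluate p(x1, ..., xn) as the product of the local
# conditionals p(xi | Par(xi)), one factor per node of the structure S.
from typing import Dict, List

# Hypothetical representation of B = (S, theta): for each node, its parent
# list and a CPT mapping (tuple of parent values) -> {value: probability}.
parents: Dict[str, List[str]] = {"X": [], "Y": ["X"], "Z": ["Y"]}
cpts = {
    "X": {(): {0: 0.6, 1: 0.4}},
    "Y": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "Z": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.5, 1: 0.5}},
}

def joint_probability(assignment: Dict[str, int]) -> float:
    """p(x1, ..., xn) = prod_i p(xi | Par(xi))."""
    prob = 1.0
    for node, pars in parents.items():
        parent_values = tuple(assignment[p] for p in pars)
        prob *= cpts[node][parent_values][assignment[node]]
    return prob

print(joint_probability({"X": 1, "Y": 0, "Z": 1}))  # 0.4 * 0.2 * 0.1 = 0.008
```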

Markov Conditions

[Figure: a node X with its parents (Par), descendants (Desc), and non-descendants (ND)]

From factorization: I(X, ND | Par(X))

Markov Conditions + Graphoid Axioms characterize all independencies

Structure/Distribution Inclusion

p is included in S if there exists θ such that B = (S, θ) defines p

[Figure: the space of all distributions, with the region of distributions included in a structure S over X, Y, Z, and a point p inside that region]

Structure/Structure Inclusion: T ≤ S

T is included in S if every p included in T is included in S

[Figure: the distributions included in T (a structure over X, Y, Z) form a subset of those included in S (another structure over X, Y, Z), within the space of all distributions]

(S is an I-map of T)

Structure/Structure Equivalence: T ≈ S

[Figure: the distributions included in S and those included in T are the same region within the space of all distributions]

Reflexive, Symmetric, Transitive

Equivalence

[Figure: a DAG over A, B, C, D with a v-structure highlighted]

Theorem (Verma and Pearl, 1990): S ≈ T if and only if S and T have the same v-structures and the same skeleton

[Figure: the skeleton of the same DAG over A, B, C, D, obtained by dropping edge directions]
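The theorem translates directly into a simple equivalence test; below is a small sketch (my own illustration, not from the slides) that compares the skeletons and v-structures of two DAGs given as parent sets.

```python
# Sketch of the Verma-Pearl characterization: two DAGs are equivalent iff
# they have the same skeleton and the same v-structures. DAGs are given as
# dicts mapping each node to the set of its parents (assumed acyclic).
from itertools import combinations

def skeleton(dag):
    return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    """All triples (a, c, b) with a -> c <- b and a, b non-adjacent."""
    skel = skeleton(dag)
    vs = set()
    for c, ps in dag.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                vs.add((a, c, b))
    return vs

def equivalent(dag1, dag2):
    return skeleton(dag1) == skeleton(dag2) and v_structures(dag1) == v_structures(dag2)

# X -> Y -> Z and X <- Y <- Z are equivalent; X -> Y <- Z is not.
chain = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
rev_chain = {"X": {"Y"}, "Y": {"Z"}, "Z": set()}
collider = {"X": set(), "Y": {"X", "Z"}, "Z": set()}
print(equivalent(chain, rev_chain))   # True
print(equivalent(chain, collider))    # False
```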

Learning Bayesian Networks

1. Learn the structure

2. Estimate the conditional distributions

[Figure: the generative distribution p* over X, Y, Z produces iid samples (the observed data, a table of X, Y, Z values); from the observed data a model over X, Y, Z is learned]

Learning Structure

• Scoring criterion

F(D, S)

• Search procedure

Identify one or more structures with high values for the scoring function

Properties of Scoring Criteria

• Consistent

• Locally Consistent

• Score Equivalent

Consistent Criterion

S includes p*, T does not include p* ⇒ F(S,D) > F(T,D)

Both include p*, S has fewer parameters ⇒ F(S,D) > F(T,D)

Criterion favors (in the limit) simplest model that includes the generative distribution p*

[Figure: several candidate structures over X, Y, Z; the criterion favors the simplest one that includes p*]

Locally Consistent Criterion

[Figure: two structures S and T over X and Y that differ by a single edge]

S and T differ by one edge:

If I(X, Y | Par(X)) in p*, then F(S,D) > F(T,D); otherwise F(S,D) < F(T,D)

Score-Equivalent Criterion

S ≈ T ⇒ F(S,D) = F(T,D)

[Figure: S is X → Y and T is X ← Y]

Bayesian Criterion (consistent, locally consistent, and score equivalent)

Sh : the hypothesis that the generative distribution p* has the same independence constraints as S

FBayes(S,D) = log p(Sh | D) = k + log p(D | Sh) + log p(Sh)

where log p(D | Sh) is the marginal likelihood (closed form with assumptions) and log p(Sh) is the structure prior (e.g. prefer simple structures)
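As an illustration of the closed-form marginal likelihood term, here is a sketch of the standard Dirichlet-multinomial computation (complete data, multinomial conditionals, symmetric Dirichlet priors); the count layout and the function name are my own, not from the slides.

```python
# Sketch of the closed-form marginal likelihood term log p(D | S^h) under the
# usual assumptions: multinomial conditionals with Dirichlet priors and
# complete data. counts[i][j][k] is the number of cases where node i takes
# its k-th value while its parents take their j-th configuration; alpha is a
# symmetric Dirichlet hyperparameter.
from math import lgamma

def log_marginal_likelihood(counts, alpha=1.0):
    """counts: per node, per parent configuration, a list of per-value counts."""
    total = 0.0
    for node_counts in counts:
        for config_counts in node_counts:
            r = len(config_counts)                 # number of values of the node
            n_ij = sum(config_counts)
            total += lgamma(alpha * r) - lgamma(alpha * r + n_ij)
            for n_ijk in config_counts:
                total += lgamma(alpha + n_ijk) - lgamma(alpha)
    return total

# Toy example: one binary node with no parents, observed [3 zeros, 1 one].
print(log_marginal_likelihood([[[3, 1]]]))   # log(0.05)
```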

Search Procedure

• Set of states

• Representation for the states

• Operators to move between states

• Systematic Search Algorithm

Greedy Equivalence Search

• Set of states: equivalence classes of DAGs

• Representation for the states: essential graphs

• Operators to move between states: forward and backward operators

• Systematic search algorithm: two-phase greedy

Representation: Essential Graphs

[Figure: a DAG over A, B, C, D, E, F and its essential graph; compelled edges keep their direction, reversible edges are shown undirected]

GES Operators

Forward Direction – single edge additions

Backward Direction – single edge deletions

Two-Phase Greedy Algorithm

Phase 1: Forward Equivalence Search (FES)
• Start with the all-independence model
• Run greedy search using the forward operators

Phase 2: Backward Equivalence Search (BES)
• Start with the local maximum from FES
• Run greedy search using the backward operators
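A high-level sketch of the two-phase loop is below; it is not the paper's implementation, and `forward_neighbors`, `backward_neighbors`, and `score` are assumed to be supplied by the caller.

```python
# High-level sketch of the two-phase greedy loop. The neighbor functions
# enumerate the equivalence classes reachable by the forward/backward
# operators; `score` is the scoring criterion F(D, S).
def greedy_phase(state, neighbors, score):
    """Repeatedly move to the best-scoring neighbor until no neighbor improves."""
    current_score = score(state)
    while True:
        best_state, best_score = None, current_score
        for candidate in neighbors(state):
            s = score(candidate)
            if s > best_score:
                best_state, best_score = candidate, s
        if best_state is None:          # local maximum reached
            return state
        state, current_score = best_state, best_score

def ges(empty_state, forward_neighbors, backward_neighbors, score):
    # Phase 1 (FES): start from the all-independence model, add edges greedily.
    state = greedy_phase(empty_state, forward_neighbors, score)
    # Phase 2 (BES): start from the FES local maximum, delete edges greedily.
    return greedy_phase(state, backward_neighbors, score)
```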

Forward Operators

• Consider all DAGs in the current state

• For each DAG, consider all single-edge additions (acyclic)

• Take the union of the resulting equivalence classes
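The single-edge-addition step for one DAG can be sketched as follows (the real operators act on essential graphs and take the union over all DAGs in the state, as described above; the dict-of-parent-sets representation here is hypothetical).

```python
# Sketch of the "all single-edge additions" step for one DAG, represented as
# a dict mapping each node to the set of its parents. An addition is allowed
# only if it keeps the graph acyclic.
def has_path(dag, src, dst):
    """True if there is a directed path src -> ... -> dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(child for child, ps in dag.items() if node in ps)
    return False

def single_edge_additions(dag):
    """Yield a new DAG for every acyclic single-edge addition x -> y."""
    for x in dag:
        for y in dag:
            if x == y or x in dag[y]:
                continue
            if has_path(dag, y, x):        # adding x -> y would create a cycle
                continue
            new_dag = {n: set(ps) for n, ps in dag.items()}
            new_dag[y].add(x)
            yield new_dag

empty = {"A": set(), "B": set(), "C": set()}
print(len(list(single_edge_additions(empty))))   # 6 possible single-arc DAGs
```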

Forward-Operators Example

[Figure: the current state over A, B, C and all DAGs it contains; all DAGs resulting from a single-edge addition; and the union of the corresponding essential graphs]

Backward Operators

• Consider all DAGs in the current state

• For each DAG, consider all single-edge deletions

• Take the union of the resulting equivalence classes

Backward-Operators Example

[Figure: the current state over A, B, C and all DAGs it contains; all DAGs resulting from a single-edge deletion; and the union of the corresponding essential graphs]

DAG Perfect

DAG-perfect distribution p: there exists a DAG G such that I(X,Y | Z) in p ⇔ I(X,Y | Z) in G

Non-DAG-perfect distribution q

[Figure: q over A, B, C, D satisfies both I(A,D | B,C) and I(B,C | A,D); each of the two candidate DAGs shown captures one of these constraints but not the other]

DAG-Perfect Consequence: Composition Axiom Holds in p*

If X is not independent of a set Y given Z, then X is not independent of some single Y ∈ Y given Z

[Figure: X dependent on the set {A, B, C, D}; hence X dependent on some single member, e.g. C]

Optimality of GES

If p* is DAG-perfect wrt some G*:

[Figure: p*, perfect wrt G* over X, Y, Z, generates n iid samples; GES applied to the data returns a state S]

For large n, S = S*

Optimality of GES

Proof Outline
• After the first phase (FES), the current state includes S*
• After the second phase (BES), the current state equals S*

All-independence state → (FES) → state includes S* → (BES) → state equals S*

FES Maximum Includes S*

Assume: the local maximum S does NOT include S*. Take any DAG G from S.

Markov conditions characterize independencies: in p*, there exists an X not independent of its non-descendants given its parents

[Figure: a DAG in which X has parent E and non-descendants A, B, C, D]

¬I(X, {A, B, C, D} | E) in p*

p* is DAG-perfect ⇒ composition axiom holds

¬I(X, C | E) in p*

Locally consistent criterion: adding the C → X edge improves the score, and the resulting equivalence class is a neighbor of the current state

BES Identifies S*

• Current state always includes S*:

Local consistency of the criterion

• Local Maximum is S*:

Meek’s conjecture

Meek’s Conjecture

For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of

(1) covered edge reversals in G

(2) single-edge additions to G

such that after each change G ≤ H, and after all changes G = H
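For reference, a covered edge X → Y is one with Par(Y) = Par(X) ∪ {X}; reversing such an edge preserves the equivalence class, which is why the sequence may use these reversals freely. A tiny sketch of the check (illustrative only):

```python
# Sketch: an edge x -> y in a DAG (node -> set of parents) is covered when
# Par(y) = Par(x) ∪ {x}; reversing a covered edge keeps the same skeleton and
# v-structures, hence the same equivalence class.
def is_covered(dag, x, y):
    return x in dag[y] and dag[y] == dag[x] | {x}

dag = {"A": set(), "B": {"A"}, "C": {"B"}}
print(is_covered(dag, "A", "B"))   # True:  Par(B) = {A} = Par(A) ∪ {A}
print(is_covered(dag, "B", "C"))   # False: A is a parent of B but not of C
```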

Meek’s Conjecture

[Figure: a DAG G over A, B, C, D with I(A,B) and I(C,B | A,D) is transformed into H by a sequence of covered edge reversals and single-edge additions, with G ≤ H after each step]

Meek's Conjecture and BES: S* ≤ S

Assume: the local maximum S is not S*. Take any DAG H from S and any DAG G from S*.

[Figure: by Meek's conjecture, G is transformed into H by a sequence of covered edge reversals (Rev) and single-edge additions (Add); read in reverse, H is transformed into G by covered edge reversals and single-edge deletions (Del), so some equivalence class between S* and S is a neighbor of S in BES]

Discussion Points

• In practice, GES is as fast as DAG-based search

The neighborhood of an essential graph can be generated and scored very efficiently

• When DAG-perfect assumption fails, we still get optimality guarantees

As long as composition holds in the generative distribution, the local maximum is inclusion-minimal

Thanks!

My Home Page:

http://research.microsoft.com/~dmax

Relevant Papers:

“Optimal Structure Identification with Greedy Search” (JMLR submission): contains detailed proofs of Meek's conjecture and of the optimality of GES

“Finding Optimal Bayesian Networks” (UAI 2002 paper with Chris Meek): contains an extension of the optimality results of GES to the case where p* is not DAG-perfect

Bayesian Criterion is Locally Consistent

• Bayesian score approaches BIC + constant

• BIC is decomposable:

• The difference in score is the same for any DAGs that differ by the Y → X edge, provided X has the same parents

BIC(S, D) = Σi=1..n FD(Xi, Par(Xi))

[Figure: two DAGs over X and Y that differ by a single edge]

Complete network (always includes p*)
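A sketch of how decomposability is used: the score is a sum of per-family terms, so the change from adding or deleting a single edge Y → X touches only X's term. The family scorer here is assumed to be supplied (e.g. the family's log-likelihood minus a (log N)/2 penalty per free parameter).

```python
# Sketch of decomposability: BIC(S, D) = sum_i F_D(X_i, Par(X_i)), so the
# score change from adding or deleting an edge Y -> X touches only X's term.
# `family_score(data, x, parents)` is an assumed local scorer.
def bic(data, dag, family_score):
    return sum(family_score(data, x, dag[x]) for x in dag)

def delta_add_edge(data, dag, y, x, family_score):
    """Score difference from adding Y -> X; only X's family term changes."""
    old = family_score(data, x, dag[x])
    new = family_score(data, x, dag[x] | {y})
    return new - old
```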

Bayesian Criterion is Consistent

Assume the conditionals are:
(1) unconstrained multinomials
(2) linear regressions

Network structures = curved exponential models (Geiger, Heckerman, King and Meek, 2001)

⇒ Bayesian criterion is consistent (Haughton, 1988)

Bayesian Criterion is Score Equivalent

S ≈ T ⇒ F(S,D) = F(T,D)

[Figure: S is X → Y and T is X ← Y]

Sh = Th

Sh : no independence constraints

Th : no independence constraints

Active Paths

Z-active path between X and Y (non-standard):
1. Neither X nor Y is in Z
2. Every pair of colliding edges meets at a member of Z
3. No other pair of edges meets at a member of Z

[Figure: an example Z-active path between X and Y]

G ≤ H ⇒ if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H
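One way to read the definition operationally is to enumerate simple paths in the skeleton and test the three conditions on each; the sketch below is my reading of the (non-standard) definition above, with a toy collider example.

```python
# Sketch: test for a Z-active path between x and y in a DAG given as a dict
# node -> set of parents, by enumerating simple paths in the skeleton.
def neighbors(dag, node):
    out = set(dag[node])                                   # parents
    out |= {c for c, ps in dag.items() if node in ps}      # children
    return out

def is_z_active(dag, path, z):
    if path[0] in z or path[-1] in z:
        return False                                       # condition 1
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        collider = prev in dag[node] and nxt in dag[node]  # prev -> node <- nxt
        if collider and node not in z:
            return False                                   # condition 2
        if not collider and node in z:
            return False                                   # condition 3
    return True

def has_z_active_path(dag, x, y, z):
    def dfs(path):
        node = path[-1]
        if node == y:
            return is_z_active(dag, path, z)
        return any(dfs(path + [n]) for n in neighbors(dag, node) if n not in path)
    return dfs([x])

# Collider X -> Z <- Y: the path X-Z-Y is active only when conditioning on Z.
dag = {"X": set(), "Y": set(), "Z": {"X", "Y"}}
print(has_z_active_path(dag, "X", "Y", set()))    # False
print(has_z_active_path(dag, "X", "Y", {"Z"}))    # True
```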

Active Paths

[Figure: example active paths over X, A, Z, W, B, Y and over A, B, C]

• X–Y: out of X and into Y

• X–W: out of both X and W

• Any sub-path between A, B ∉ Z is also active

• Active paths A – B and B – C with at least one out of B ⇒ active path between A and C

Simple Active Paths

[Figure: active paths between A and B containing the edge Y → X]

If an active path between A and B contains Y → X, then there is an active path in which

(1) the edge appears exactly once

OR

(2) the edge appears exactly twice

Simplify discussion: Assume (1) only – proofs for (2) almost identical

Typical Argument:Combining Active Paths

[Figure: active paths in G, G′, and H through nodes X, Y, Z, A, B]

G′: suppose there is an active path in G′ (with X not in the conditioning set) that has no corresponding active path in H. Then Z is not in the conditioning set.

Z: sink node adjacent to X and Y

G ≤ H

Proof Sketch

Two DAGs G, H with G<H

Identify either:

(1) a covered edge X → Y in G that has the opposite orientation in H

(2) a new edge X → Y to be added to G such that G remains included in H

The Transformation

[Figure: the transformation cases, shown as local configurations around X, Y, and W]

Choose any node Y that is a sink in H

Case 1a: Y is a sink in G; there is an X ∈ ParH(Y) with X ∉ ParG(Y)

Case 1b: Y is a sink in G with the same parents in both graphs

Case 2a: there is an X such that Y → X is covered

Case 2b: there is an X such that Y → X and some W is a parent of Y but not of X

Case 2c: for every Y → X, Par(Y) ⊆ Par(X)

Preliminaries

• The adjacencies in G are a subset of the adjacencies in H

• If X → Y ← Z is a v-structure in G but not in H, then X and Z are adjacent in H

• Any new active path that results from adding X → Y to G includes X → Y

(G ≤ H)

Proof Sketch: Case 1

Y is a sink in G

Case 1a: X ∈ ParH(Y), X ∉ ParG(Y)

Case 1b: parents identical

Remove Y from both graphs: proof similar

[Figure: the edge X → Y in H and the edge added to G (Case 1a)]

Suppose there’s some new active path between A and B not in H

[Figure: a new active path between A and B passing through Y and X]

1. Y is a sink in G, so it must be in the conditioning set (CS)
2. Neither X nor the next node Z on the path is in the CS
3. In H: AP(A, Z), AP(X, B), and Z → Y ← X

Proof Sketch: Case 2

Case 2a: There is a covered edge Y → X: reverse the edge

Case 2b: There is a non-covered edge Y → X such that W is a parent of Y but not a parent of X

[Figure: G, G′ (G with the edge W → X added), and H around the nodes X, Y, W]

Y must be in the CS, else replace W → X by W → Y → X (not a new path). If X is not in the CS, then in H the following is active: A – W, W → Y ← X, X – B

[Figure: a new active path between A and B through the added edge W → X in G′, and the corresponding configuration in H]

Y is not a sink in G

Suppose there’s some new active path between A and B not in H

Case 2c (the difficult case): all non-covered edges Y → Z have Par(Y) ⊆ Par(Z)

[Figure: G and H over W1, W2, Z1, Z2, Y]

Adding W1 → Y: G no longer < H (Z2-active path between W1 and W2). Adding W2 → Y: G < H

Choosing Z

[Figure: Y, its child Z, and the descendants of Y in G; the same nodes in H, with D marked]

D is the maximal G-descendant of Y in H. Z is any maximal child of Y such that D is a descendant of Z in G.

Choosing Z: Example

[Figure: G and H over W1, W2, Z1, Z2, Y]

Descendants of Y in G: Y, Z1, Z2

Maximal descendant in H: D = Z2

Maximal child of Y in G that has D = Z2 as a descendant: Z2

Add W2 → Y

Difficult Case: Proof Intuition

[Figure: G with the edge W → Y added and H; Z and D are descendants of Y]

1. W is not in the CS
2. Y is not in the CS, else the path is active in H
3. In G, the next edges must be directed away from Y until B or the CS is reached
4. In G, neither Z nor its descendants are in the CS, else the path was active before the addition
5. From (1, 2, 4): AP(A, D) and AP(B, D) in H
6. Choice of D: directed path from D to B or to the CS in H


Optimality of GES

Definition: p is DAG-perfect wrt G if the independence constraints in p are precisely those in G

Assumption: the generative distribution p* is perfect wrt some G* defined over the observable variables

S* = Equivalence class containing G*

Under DAG-perfect assumption, GES results in S*

Important Definitions

• Bayesian Networks

• Markov Conditions

• Distribution/Structure Inclusion

• Structure/Structure Inclusion
