Bayesian Approach to Causal Structure Learning
Jin Tian
Computer Science Department, Iowa State University, Ames, IA, USA
Table of contents
1 Introduction
2 Bayesian Approach
3 Incorporating Experimental Data
4 Structure Discovery
5 Bayesian Model Averaging
6 Curriculum Learning of Bayesian Network Structures
Causal Bayesian Networks
A DAG. Nodes: random variables. Edges: direct causal influence.
[Figure: example causal DAGs: Smoking → Tar in lungs → Cancer, and a second graph over nodes X, Y, Z, U]
Modularity: each parent-child relationship represents an autonomous causal mechanism.
Functional: v_i = f(pa_i, ε). Probabilistic: P(v_i | pa_i).
Causal Bayesian Networks: Applications
Learning Causal Structures from Data
Bayesian Approach
Deal with uncertainty by assigning probabilities to all possibilities.
The posterior probability of a network G:
P(G|D) = P(D|G) P(G) / P(D)
Need to compute the marginal likelihood:
P(D|G) = ∫_Θ P(D|Θ, G) P(Θ|G) dΘ
Bayesian Approach
Assumptions on the parameter prior:
Global Parameter Independence:
P(Θ|G) = ∏_{i=1}^n P(Ψ_i | G)
Local Parameter Independence:
P(Ψ_i | G) = ∏_{pa_i} P(θ_{pa_i}), i = 1, ..., n
Dirichlet distribution:
P(θ_{pa_i}) = Dir(θ_{pa_i} | α_{pa_i})
Bayesian Approach
Assuming complete data, a closed-form expression for the marginal likelihood has been derived:
P(D|G) = ∏_{i=1}^n f_i(V_i, Pa_i : D)    (decomposable)
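The slides leave f_i implicit; with the Dirichlet priors above and complete discrete data, each local term is a Dirichlet-multinomial marginal likelihood (the Bayesian-Dirichlet family; BDeu for a uniform equivalent-sample-size prior). A minimal sketch follows; the function name, the data layout (a list of dicts mapping variable names to values), and the use of observed parent configurations in place of the full configuration count are my own assumptions, not the author's code.

```python
from collections import Counter
from math import lgamma  # log Gamma function

def local_score(data, child, parents, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood f_i(V_i, Pa_i : D)
    for one discrete node, with a BDeu-style prior of total strength alpha."""
    states = sorted({row[child] for row in data})
    r = len(states)
    # Group child counts N_{jk} by parent configuration j.
    counts = {}
    for row in data:
        j = tuple(row[p] for p in parents)
        counts.setdefault(j, Counter())[row[child]] += 1
    # Exact BDeu uses q = product of parent cardinalities; we approximate
    # with the number of observed configurations to keep the sketch short.
    q = max(len(counts), 1)
    a_j, a_jk = alpha / q, alpha / (q * r)
    score = 0.0
    for cnt in counts.values():
        n_j = sum(cnt.values())
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for k in states:
            score += lgamma(a_jk + cnt[k]) - lgamma(a_jk)
    return score
```

The decomposable log marginal likelihood is then log P(D|G) = Σ_i local_score(data, V_i, Pa_i(G)).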
Incorporating Experimental Data
Two data sets, D and D′, generated from a causal structure G but with different parameters, Θ_G and Θ′_G:
P(D, D′|G) = ∫ P(D, D′ | Θ_G, Θ′_G, G) P(Θ_G, Θ′_G | G) dΘ_G dΘ′_G
P(D, D′ | Θ_G, Θ′_G, G) = P(D | Θ_G, G) P(D′ | Θ′_G, G)
Encode the knowledge about the experimental setting in the prior P(Θ_G, Θ′_G | G).
Incorporating Experimental Data
Known intervention target V_l.
Mechanism change (imperfect intervention):
P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i≠l} δ(Ψ_i − Ψ′_i)
⇒ Pool the data for nodes i ≠ l:
P(D, D′|G) = f_l(V_l, Pa_l : D) f_l(V_l, Pa_l : D′) ∏_{i≠l} f_i(V_i, Pa_i : D, D′)
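As a concrete reading of this factorization, here is a sketch reusing the hypothetical local_score above; samples are assumed to be stored in Python lists, so `+` concatenates the two data sets.

```python
def pooled_log_ml(data_obs, data_exp, target, parents_of):
    """log P(D, D'|G) under a mechanism change at `target`:
    score the target on each data set separately, pool the rest."""
    total = 0.0
    for v, pa in parents_of.items():
        if v == target:
            total += local_score(data_obs, v, pa)   # f_l(V_l, Pa_l : D)
            total += local_score(data_exp, v, pa)   # f_l(V_l, Pa_l : D')
        else:
            total += local_score(data_obs + data_exp, v, pa)  # pooled
    return total
```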
Incorporating Experimental Data
Known intervention target V_l.
Ideal intervention do(V_l = v_lj):
P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i≠l} δ(Ψ_i − Ψ′_i)
P(θ′_{pa_l}) = δ(θ′_{v_lj; pa_l} − 1) ∏_{v_l ≠ v_lj} δ(θ′_{v_l; pa_l})
⇒ Pool the data for nodes i ≠ l; drop D′ for node l:
P(D, D′|G) = f_l(V_l, Pa_l : D) ∏_{i≠l} f_i(V_i, Pa_i : D, D′)
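The ideal-intervention case differs from the previous sketch only in dropping the experimental data for the target node; a sketch under the same assumptions as above:

```python
def pooled_log_ml_do(data_obs, data_exp, target, parents_of):
    """log P(D, D'|G) under an ideal intervention do(V_l = v_lj):
    pool the data for nodes i != l; drop D' entirely for the target,
    whose intervened parameters are fixed by the delta priors."""
    total = 0.0
    for v, pa in parents_of.items():
        if v == target:
            total += local_score(data_obs, v, pa)             # D only
        else:
            total += local_score(data_obs + data_exp, v, pa)  # pooled
    return total
```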
Incorporating Experimental Data
Known intervention target V_l.
Mechanism change:
P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i≠l} δ(Ψ_i − Ψ′_i)
Open question: what if Ψ′_l is some parametric function of Ψ_l?
Incorporating Experimental Data
Unknown intervention target.
Assume independent parameters:
P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Θ′_G | G)
⇒ P(D, D′|G) = P(D|G) P(D′|G)
Without knowledge of how they came about, two data sets do not increase our power of structure discrimination, beyond providing more samples.
Incorporating Experimental Data
Unknown intervention target.
Introduce interventional nodes E_1, E_2, ....
Assumptions: the number of states of each E_i; each E_i has one or multiple children.
Learn a structure over the variables V ∪ E, with the E variables as source nodes.
Can we learn the intervention targets more efficiently?
Structure Discovery as Model Selection
Model selection: look for the network that maximizes P(G|D), the maximum a posteriori (MAP) network.
Score-based search: searching in the space of possible DAGs.
NP-hard: the number of possible DAGs is O(n! 2^{n(n−1)/2}).
Existing Work
Exact methods: find an optimal BN given a decomposable score.
Dynamic programming (Silander et al., 2006): exponential time and space.
A* search (Yuan et al., 2011): shortest-path finding.
Integer linear programming (ILP) (Jaakkola et al., 2010; Cussens, 2011, 2014): bounded in-degree.
Existing Work
Heuristic search:
random-restart hill-climbing, simulated annealing, ...
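As one hedged illustration of what such a search loop looks like, here is a single-restart greedy climber over add/delete moves with a decomposable score such as local_score above; edge reversal and random restarts are omitted for brevity, and all names are my own.

```python
def creates_cycle(parents, child, new_parent):
    """Would adding new_parent -> child create a directed cycle?
    True iff child is already an ancestor of new_parent."""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def hill_climb(data, variables, score_fn):
    """Greedy hill-climbing over DAGs: repeatedly apply the single-edge
    addition or deletion that most improves the total decomposable score."""
    parents = {v: set() for v in variables}
    local = {v: score_fn(data, v, parents[v]) for v in variables}
    while True:
        best_gain, best_move = 0.0, None
        for child in variables:
            for p in variables:
                if p == child:
                    continue
                if p in parents[child]:                      # try deleting p -> child
                    new_pa = parents[child] - {p}
                elif not creates_cycle(parents, child, p):   # try adding p -> child
                    new_pa = parents[child] | {p}
                else:
                    continue
                gain = score_fn(data, child, new_pa) - local[child]
                if gain > best_gain:
                    best_gain, best_move = gain, (child, new_pa)
        if best_move is None:
            return parents                                   # local optimum
        child, new_pa = best_move
        parents[child] = new_pa
        local[child] = score_fn(data, child, new_pa)
```

For random restarts one would rerun hill_climb from random initial DAGs and keep the best-scoring result.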
Existing Work
Hybrid: Max-Min Hill-Climbing (MMHC) (Tsamardinos et al., 2006).
First estimates the parents and children of each node from CI tests, then performs a constrained greedy search.
The state-of-the-art heuristic algorithm.
Open Problem
How to efficiently learn good CBNs from high-dimensional data?
Bayesian Model Averaging (BMA)
When the sample size is small:
If we are interested in, say, the probability that an edge A → C is in the true network,
an answer based on a single model is often useless;
we want features common to many models.
Bayesian Model Averaging
The posterior probability of any hypothesis of interest f:
P(f|D) = Σ_G P(f|G) P(G|D)
E.g., the posterior probability of an edge:
P(j → i | D) = Σ_{G: j→i ∈ G} P(G|D)
Difficulty: the super-exponential number of possible DAG structures, O(n! 2^{n(n−1)/2}).
Existing Work
Exact methods: sum over all possible DAGs using dynamic programming → from super-exponential to exponential.
Compute the posteriors of all n(n−1) potential edges:
Koivisto (2006): O(n 2^n) time and space; biased, favoring graphs consistent with more orderings.
Tian and He (2009): O(n 3^n) time and O(n 2^n) space.
Compute the posteriors of all n(n−1) potential ancestor relations:
Parviainen et al. (2011): O(n 3^n) time and O(3^n) space; biased.
Chen et al. (2015): O(n^2 5^{n−1}) time and O(3^n) space.
Existing Work
Approximate computation by using a set 𝒢 of high-scoring networks:
P(f|D) ≈ Σ_{G∈𝒢} P(f|G) P(G|D) / Σ_{G∈𝒢} P(G|D)
Find the k-best DAGs by dynamic programming (Tian et al., 2010): O(n 2^n k log k) time, O(k 2^n) space.
Find the k-best equivalence classes (Chen and Tian, 2014).
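A sketch of how this approximation is evaluated once a set of high-scoring DAGs is in hand; graphs are represented as parent-set dicts with log scores, and all names are hypothetical.

```python
import math

def edge_posterior(scored_graphs, edge):
    """Approximate P(u -> v | D) from pairs (parents_dict, log_score),
    e.g. the k-best DAGs, by normalizing over the set."""
    m = max(s for _, s in scored_graphs)              # stabilize exponentials
    weights = [math.exp(s - m) for _, s in scored_graphs]
    u, v = edge
    hit = sum(w for (pa, _), w in zip(scored_graphs, weights) if u in pa[v])
    return hit / sum(weights)
```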
Existing Work
Approximate computation via sampling.
If we manage to sample graphs G_1, ..., G_K from P(G|D), then
P(f|D) ≈ (1/K) Σ_i f(G_i)
Markov chain Monte Carlo (MCMC) sampling in the space of DAGs [Madigan and York 1995; Grzegorczyk and Husmeier 2008]: the Metropolis-Hastings algorithm.
MCMC in the space of node orderings [Friedman and Koller 2003; Ellis and Wong 2008]; partial order MCMC [Niinimaki et al. 2011].
Kuipers and Moffa 2016; He et al. 2016.
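A minimal sketch of structure MCMC in this spirit, reusing creates_cycle from the hill-climbing sketch; `variables` is a list. Note this simplified version treats the add/delete proposal as symmetric and omits the Hastings correction for unequal neighborhood sizes that a correct structure-MCMC sampler must include.

```python
import math, random

def structure_mcmc(data, variables, score_fn, n_steps=10000):
    """Metropolis-Hastings over DAGs with single-edge add/delete proposals,
    targeting P(G|D) proportional to exp(decomposable score)."""
    parents = {v: set() for v in variables}
    samples = []
    for _ in range(n_steps):
        child, p = random.sample(variables, 2)
        if p in parents[child]:
            new_pa = parents[child] - {p}                # propose deletion
        elif not creates_cycle(parents, child, p):
            new_pa = parents[child] | {p}                # propose addition
        else:                                            # cyclic move: stay put
            samples.append({v: set(pa) for v, pa in parents.items()})
            continue
        delta = score_fn(data, child, new_pa) - score_fn(data, child, parents[child])
        if math.log(random.random()) < delta:            # accept w.p. min(1, e^delta)
            parents[child] = new_pa
        samples.append({v: set(pa) for v, pa in parents.items()})
    return samples
```

After discarding a burn-in prefix, P(f|D) is estimated as the fraction of sampled graphs in which the feature f holds.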
Open Problem
How to efficiently sample the DAG space in high-dimensional settings?
Curriculum Learning of Bayesian Network Structures
We propose a heuristic search algorithm based on the idea of curriculum learning.
Curriculum Learning
Guided learning helps training humans and animals.
Start from simpler examples/easier tasks (Piaget, 1952; Skinner, 1958).
Curriculum Learning in Machine Learning
Bengio et al. (2009): a curriculum is a sequence of weighting schemes of the training data, ⟨W_1, W_2, ..., W_n⟩:
W_1 assigns more weight to easier samples;
each subsequent scheme assigns more weight to harder samples;
W_n assigns uniform weight to all samples.
Advantages: faster convergence to a (local) optimum; convergence to a better local optimum.
Difficulty: how to design a good curriculum strategy?
Curriculum Learning of BN Structures
We define a curriculum as (X(1), ..., X(n)), a sequence of selected variable subsets.
[Figure: the Asia network over {A, S, T, L, B, E, X, D}, shown four times with the successive curriculum subsets highlighted]
Curriculum: {X(1), X(2), X(3)}
X(1) = {S, B, D}, X(2) = {S, B, D, L, E, X}, X(3) = {S, B, D, L, E, X, A, T}
Curriculum Learning of BN Structures
Intermediate learning target G_i: a network over X(i) conditioned on the rest of the variables, X′(i) = X \ X(i).
Bayesian Approach to Learn G_i
Let X′(i) take q values.
D_i = {D_{i,1}, ..., D_{i,q}}: group the samples over X(i) based on the values of X′(i).
Assumption: D_{i,1}, ..., D_{i,q} are generated by the same G_i but with independent parameters:
P(D_i | G_i) = ∏_{j=1}^q P(D_{i,j} | G_i)
Run heuristic search.
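A sketch of this grouped score, reusing the hypothetical local_score; the parent sets in parents_of are assumed to lie within X(i), and rows are dicts as before.

```python
from collections import defaultdict

def conditioned_log_ml(data, rest_vars, parents_of):
    """log P(D_i | G_i): partition the samples by the joint value of the
    conditioning variables X'(i), then score each group independently."""
    groups = defaultdict(list)
    for row in data:
        groups[tuple(row[v] for v in rest_vars)].append(row)
    return sum(local_score(rows, v, pa)
               for rows in groups.values()
               for v, pa in parents_of.items())
```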
Construct a Curriculum
[Figure: the Asia network, highlighting the current subset X(1) and the variables to be added next]
{S, B, D, L, E, X} (= X(2)) = {S, B, D} (= X(1)) + {L, E, X} (to be included)
Q: Which variables shall be included next (added to X(i−1))?
Intuition: the variables that are most likely to have connections with the current set of variables X(i−1).
Heuristic: use the average pairwise mutual information
AveMI(Y, X(i−1)) = Σ_{X ∈ X(i−1)} I(X, Y) / |X(i−1)|
Q: How many variables shall be included next? (the step size)
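A sketch of this selection heuristic, using empirical mutual information from discrete samples; the names are my own.

```python
import math
from collections import Counter

def mutual_information(data, x, y):
    """Empirical I(X, Y) from discrete samples given as dicts."""
    n = len(data)
    pxy = Counter((row[x], row[y]) for row in data)
    px = Counter(row[x] for row in data)
    py = Counter(row[y] for row in data)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def ave_mi(data, candidate, current_set):
    """AveMI(Y, X(i-1)): average pairwise MI between a candidate variable
    and the variables already included in the curriculum."""
    return sum(mutual_information(data, x, candidate)
               for x in current_set) / len(current_set)

# Greedy step: include the candidate(s) with the largest AveMI, e.g.
# next_var = max(remaining, key=lambda y: ave_mi(data, y, current_set))
```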
Theoretical Analysis
Ideally, each intermediate target should be closer to the subsequent targets than any of its predecessors in the sequence.
Theorem
For any i, j, k s.t. 1 ≤ i < j < k ≤ n, we have
d_H(G_i, G_k) ≥ d_H(G_j, G_k),
where d_H(G_i, G_j) is the Structural Hamming Distance (SHD) between the structures of two BNs G_i and G_j.
Theorem
For any i, j, k s.t. 1 ≤ i < j < k ≤ n, we have
d_TV(G_i, G_k) ≥ d_TV(G_j, G_k),
where d_TV(G_i, G_j) is the total variation distance between the two distributions defined by the two BNs G_i and G_j.
Comparison With MMHC
Table: Comparisons under different metrics
Metric  Algorithm       100        500       1000       5000      10000      50000
BDeu    CL             1(0)      1(10)       1(9)       1(8)      1(10)       1(8)
BDeu    MMHC       0.89(10)    1.06(0)    1.02(1)    1.01(2)    1.02(0)    1.01(2)
BIC     CL             1(0)       1(9)       1(9)       1(6)       1(8)       1(8)
BIC     MMHC       0.88(10)    1.07(1)    1.02(1)    1.02(4)    1.02(2)    1.01(2)
KL      CL             1(0)      1(10)       1(9)       1(7)       1(9)       1(9)
KL      MMHC       1.71(10)    0.82(0)    0.96(1)    0.96(2)    0.97(0)    0.97(0)
SHD     CL             1(7)       1(9)       1(7)       1(7)       1(8)       1(6)
SHD     MMHC        1.06(3)    1.26(1)    1.29(3)    1.07(2)    1.21(1)    1.24(3)
Curriculum Learning of BN: Conclusion
We propose a novel heuristic algorithm for Bayesian network structure learning.
It has the desired theoretical properties of curriculum learning.
We empirically show that our algorithm outperforms the state-of-the-art MMHC algorithm.
Future work: other types of curriculum?
Open Problem
How to efficiently learn good CBNs from high-dimensional data?
Curriculum learning? Sparse learning? Parallel algorithms? Deep learning?