

Page 1: Bayesian Approach to Causal Structure Learning · 1 Introduction 2 Bayesian Approach 3 Incorporating Experimental Data 4 Structure Discovery 5 Bayesian Model Averaging 6 Curriculum

Bayesian Approach to Causal Structure Learning

Jin Tian

Computer Science Department, Iowa State University, Ames, IA, USA

Tian CRM 2016 1 / 36

Page 2:

Table of contents

1 Introduction

2 Bayesian Approach

3 Incorporating Experimental Data

4 Structure Discovery

5 Bayesian Model Averaging

6 Curriculum Learning of Bayesian Network Structures

Page 3:

Introduction

Causal Bayesian Networks

A DAG:
- Nodes: random variables.
- Edges: direct causal influence.

[Figure: two example DAGs: Smoking → Tar in lungs → Cancer, and a graph over nodes X, Y, Z, U.]

Modularity: each parent-child relationship represents an autonomous causal mechanism.

- Functional: v_i = f(pa_i, ε)
- Probabilistic: P(v_i | pa_i)

Page 4:

Introduction

Causal Bayesian Networks: Applications

Page 5:

Introduction

Learning Causal Structures from Data

Page 6:

Bayesian Approach

Deal with uncertainty by assigning probability to all possibilities.
The posterior probability of a network G:

    P(G | D) = P(D | G) P(G) / P(D).

Need to compute the marginal likelihood

    P(D | G) = ∫_Θ P(D | Θ, G) P(Θ | G) dΘ

Page 7:

Bayesian Approach

Assumptions on the parameter prior.

Global parameter independence:

    P(Θ | G) = ∏_{i=1}^n P(Ψ_i | G)

Local parameter independence:

    P(Ψ_i | G) = ∏_{pa_i} P(θ_{pa_i}),   i = 1, …, n.

Dirichlet distribution:

    P(θ_{pa_i}) = Dir(θ_{pa_i} | α_{pa_i})

Page 8:

Bayesian Approach

Assume complete data. A closed-form expression for the marginal likelihood has been derived:

    P(D | G) = ∏_{i=1}^n f_i(V_i, Pa_i : D)    (decomposable)
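To make the decomposable score concrete, here is a minimal Python sketch of a BDeu-style family score f_i(V_i, Pa_i : D) and the resulting log marginal likelihood. The data layout (a list of dicts), the cardinality map, and the equivalent sample size `ess` are illustrative assumptions, not specified on the slide.

```python
from itertools import product
from math import lgamma

def family_log_score(data, child, parents, card, ess=1.0):
    """log f_i(V_i, Pa_i : D) under a BDeu prior with equivalent sample size ess."""
    r = card[child]                               # number of child states
    q = 1
    for p in parents:
        q *= card[p]                              # number of parent configurations
    a_ij = ess / q                                # Dirichlet mass per configuration
    a_ijk = ess / (q * r)                         # mass per (configuration, state)
    score = 0.0
    for cfg in product(*[range(card[p]) for p in parents]):
        rows = [d for d in data
                if all(d[p] == s for p, s in zip(parents, cfg))]
        score += lgamma(a_ij) - lgamma(a_ij + len(rows))
        for k in range(r):
            n_ijk = sum(1 for d in rows if d[child] == k)
            score += lgamma(a_ijk + n_ijk) - lgamma(a_ijk)
    return score

def log_marginal_likelihood(data, dag, card):
    """log P(D|G) = sum_i log f_i(V_i, Pa_i : D): the score decomposes by family."""
    return sum(family_log_score(data, v, pa, card) for v, pa in dag.items())
```

Because the score decomposes by family, a local search only needs to rescore the families whose parent sets change.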

Page 9:

Incorporating Experimental Data

Two data sets, D and D′, generated from a causal structure G but with different parameters, Θ_G and Θ′_G:

    P(D, D′ | G) = ∫ P(D, D′ | Θ_G, Θ′_G, G) P(Θ_G, Θ′_G | G) dΘ_G dΘ′_G

    P(D, D′ | Θ_G, Θ′_G, G) = P(D | Θ_G, G) P(D′ | Θ′_G, G)

Encode the knowledge about the experimental setting in the prior P(Θ_G, Θ′_G | G).

Page 10:

Incorporating Experimental Data

Known intervention target V_l.

Mechanism change (imperfect intervention):

    P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i ≠ l} δ(Ψ_i − Ψ′_i)

⟹ Pool the data for nodes i ≠ l:

    P(D, D′ | G) = f_l(V_l, Pa_l : D) f_l(V_l, Pa_l : D′) ∏_{i ≠ l} f_i(V_i, Pa_i : D, D′)

Page 11:

Incorporating Experimental Data

Known intervention target V_l.

Ideal intervention do(V_l = v_lj):

    P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i ≠ l} δ(Ψ_i − Ψ′_i)

    P(θ′_{pa_l}) = δ(θ′_{v_lj ; pa_l} − 1) ∏_{v_l ≠ v_lj} δ(θ′_{v_l ; pa_l})

⟹ Pool the data for nodes i ≠ l, drop D′ for l:

    P(D, D′ | G) = f_l(V_l, Pa_l : D) ∏_{i ≠ l} f_i(V_i, Pa_i : D, D′)
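The two pooling rules above differ only in how the target family is scored. Here is a sketch of that bookkeeping in Python, with `family_score` an assumed black box computing log f_i(V_i, Pa_i : ·) and the data sets represented as lists of samples.

```python
def log_score_with_intervention(dag, D, Dprime, target, family_score, ideal=False):
    """log P(D, D' | G) with known intervention target V_l (= `target`).

    Mechanism change: score the target on D and D' separately.
    Ideal intervention do(V_l = v): D' carries no information about
    the target's mechanism, so drop D' for the target family.
    """
    total = 0.0
    for child, parents in dag.items():
        if child == target:
            total += family_score(child, parents, D)          # observational regime
            if not ideal:
                total += family_score(child, parents, Dprime)  # changed mechanism
        else:
            total += family_score(child, parents, D + Dprime)  # pool: unchanged
    return total
```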

Page 12:

Incorporating Experimental Data

Known intervention target V_l.

Mechanism change:

    P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i ≠ l} δ(Ψ_i − Ψ′_i)

What if Ψ′_l is some parametric function of Ψ_l?

Page 13:

Incorporating Experimental Data

Unknown intervention target. Assume independent parameters:

    P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Θ′_G | G)

⟹

    P(D, D′ | G) = P(D | G) P(D′ | G)

Without knowledge of how they came about, two datasets do not increase our power of structure discrimination, save for providing more samples.

Page 14:

Incorporating Experimental Data

Unknown intervention target. Introduce interventional nodes E_1, E_2, …

- Assumptions: the number of states of each E_i; each E_i has one or multiple children.
- Learn a structure over the V ∪ E variables, with the E variables as source nodes.
- Can we learn the intervention target more efficiently?

Page 15:

Structure Discovery

Structure Discovery as Model Selection

- Model selection: look for the network that maximizes P(G | D), the maximum a posteriori (MAP) network.
- Score-based search: searching in the space of possible DAGs.
- NP-hard: the number of possible DAGs is O(n! 2^(n(n−1)/2)).
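The super-exponential count can be made concrete with Robinson's recurrence for labeled DAGs, a standard result added here for illustration (it is not on the slide):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence:
    a(n) = sum_{k=1}^{n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k), a(0) = 1."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))
```

Already num_dags(10) is on the order of 10^18, which is why exhaustive enumeration is hopeless beyond a handful of variables.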

Page 16:

Structure Discovery

Existing Work

Exact methods find an optimal BN given a decomposable score:
- Dynamic programming (Silander et al., 2006): exponential time and space.
- A* search (Yuan et al., 2011): shortest-path finding.
- Integer linear programming (ILP) (Jaakkola et al., 2010; Cussens, 2011, 2014): bounded in-degree.

Page 17:

Structure Discovery

Existing Work

Heuristic search:
- Random-restart hill-climbing, simulated annealing, …

Page 18:

Structure Discovery

Existing Work

Hybrid: Max-Min Hill-Climbing (MMHC) (Tsamardinos et al., 2006)

- First estimates the parents and children of each node from CI tests, then performs a constrained greedy search.
- The state-of-the-art heuristic algorithm.

Page 19:

Structure Discovery

Open Problem

How to efficiently learn good CBNs from high-dimensional data?

Page 20:

Bayesian Model Averaging

Bayesian Model Averaging (BMA)

When the sample size is small:

- Suppose we are interested in, say, the probability that an edge A → C is in the true network.
- An answer based on one model is often useless.
- We want features common to many models.

Page 21:

Bayesian Model Averaging

Bayesian Model Averaging

The posterior probability of any hypothesis of interest f:

    P(f | D) = ∑_G P(f | G) P(G | D)

E.g., the posterior probability of an edge:

    P(j → i | D) = ∑_{G : j→i ∈ G} P(G | D)

Difficulty: the super-exponential number of possible DAG structures, O(n! 2^(n(n−1)/2)).
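When the sum over all DAGs is intractable, the same formula can still be applied over any explicit collection of graphs. A minimal sketch follows; the candidate set, its names, and the log-score inputs are hypothetical.

```python
from math import exp

def edge_posterior(candidates, log_scores, edge):
    """P(edge | D): total normalized posterior mass of candidates containing the edge.

    candidates: graph name -> set of directed edges (parent, child)
    log_scores: graph name -> unnormalized log posterior, log P(D|G) + log P(G)
    """
    m = max(log_scores.values())                  # log-sum-exp shift for stability
    z = sum(exp(s - m) for s in log_scores.values())
    num = sum(exp(log_scores[g] - m)
              for g, edges in candidates.items() if edge in edges)
    return num / z
```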

Page 22:

Bayesian Model Averaging

Existing Work

Exact methods: sum over all possible DAGs using dynamic programming → from super-exponential to exponential.

Compute the posteriors of all n(n − 1) potential edges:
- (Koivisto 2006): O(n 2^n) time and space; biased, favoring graphs consistent with more orderings.
- (Tian and He 2009): O(n 3^n) time and O(n 2^n) space.

Compute the posteriors of all n(n − 1) potential ancestor relations:
- (Parviainen et al. 2011): O(n 3^n) time and O(3^n) space; biased.
- (Chen et al. 2015): O(n^2 5^(n−1)) time and O(3^n) space.

Page 23:

Bayesian Model Averaging

Existing Work

Approximate computation using a set 𝒢 of high-scoring networks:

    P(f | D) ≈ ∑_{G ∈ 𝒢} P(f | G) P(G | D) / ∑_{G ∈ 𝒢} P(G | D)

- Find the k best DAGs by dynamic programming (Tian et al. 2010): O(n 2^n k log k) time, O(k 2^n) space.
- Find the k best equivalence classes (Chen and Tian 2014).

Page 24:

Bayesian Model Averaging

Existing Work

Approximate computation via sampling. If we manage to sample graphs G_1, …, G_K from P(G | D), then

    P(f | D) ≈ (1/K) ∑_i f(G_i)

- Markov chain Monte Carlo (MCMC) sampling in the space of DAGs [Madigan and York 1995; Grzegorczyk and Husmeier 2008]: the Metropolis-Hastings algorithm.
- MCMC in the space of node orderings [Friedman and Koller 2003; Ellis and Wong 2008]; partial-order MCMC [Niinimäki et al. 2011].
- Kuipers and Moffa 2016; He et al. 2016.
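A bare-bones sketch of structure MCMC with single-edge add/delete proposals. `log_score` (log P(D|G) + log P(G)) is an assumed black box; for brevity the sketch omits edge reversals and the Hastings correction for unequal neighborhood sizes, so it is illustrative rather than a correct-in-general sampler.

```python
import random
from math import exp

def has_cycle(dag):
    """dag maps each node to its parent set; detect a directed cycle by DFS."""
    seen, stack = set(), set()
    def visit(v):
        if v in stack:
            return True
        if v in seen:
            return False
        seen.add(v)
        stack.add(v)
        if any(visit(c) for c, ps in dag.items() if v in ps):
            return True
        stack.remove(v)
        return False
    return any(visit(v) for v in dag)

def mcmc_dags(variables, log_score, steps, seed=0):
    """Metropolis-Hastings over DAG space with single-edge add/delete moves."""
    rng = random.Random(seed)
    dag = {v: set() for v in variables}          # start from the empty graph
    cur = log_score(dag)
    samples = []
    for _ in range(steps):
        u, v = rng.sample(variables, 2)
        prop = {w: set(ps) for w, ps in dag.items()}
        if u in prop[v]:
            prop[v].discard(u)                   # propose deleting u -> v
        else:
            prop[v].add(u)                       # propose adding u -> v
        if not has_cycle(prop):
            new = log_score(prop)
            # simplified acceptance min(1, exp(new - cur)); ignores the
            # Hastings ratio for unequal neighborhood sizes
            if new >= cur or rng.random() < exp(new - cur):
                dag, cur = prop, new
        samples.append({w: frozenset(ps) for w, ps in dag.items()})
    return samples
```

Edge posteriors are then estimated as the fraction of samples containing the edge.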

Page 25:

Bayesian Model Averaging

Open Problem

How to efficiently sample the DAG space in high-dimensional settings?

Page 26:

Curriculum Learning of BNs

Curriculum Learning of Bayesian Network Structures

We propose a heuristic search algorithm based on the idea of curriculum learning.

Page 27:

Curriculum Learning of BNs

Curriculum Learning

Guided learning helps in training humans and animals.

Start from simpler examples/easier tasks (Piaget,1952; Skinner, 1958).

Page 28:

Curriculum Learning of BNs

Curriculum Learning in Machine Learning

Bengio et al. (2009): a curriculum is a sequence of weighting schemes of the training data, ⟨W_1, W_2, …, W_n⟩:

- W_1 assigns more weight to easier samples;
- each subsequent scheme assigns more weight to harder samples;
- W_n assigns uniform weight to all samples.

Advantages:
- faster convergence to a (local) optimum;
- convergence to a better local optimum.

Difficulty: how to design a good curriculum strategy?

Page 29:

Curriculum Learning of BNs

Curriculum Learning of BN Structures

We define the curriculum as (X_(1), …, X_(n)), a sequence of selected subsets.

[Figure: four snapshots of the Asia network over nodes A, S, T, L, B, E, X, D, growing with the curriculum.]

Curriculum: {X_(1), X_(2), X_(3)}

X_(1) = {S, B, D},  X_(2) = {S, B, D, L, E, X},  X_(3) = {S, B, D, L, E, X, A, T}

Page 30:

Curriculum Learning of BNs

Curriculum Learning of BN Structures

Intermediate learning target G_i: a network over X_(i), conditioned on the rest of the variables X′_(i) = X \ X_(i).

[Figure: the same four Asia-network snapshots as on the previous slide.]

Curriculum: {X_(1), X_(2), X_(3)}

X_(1) = {S, B, D},  X_(2) = {S, B, D, L, E, X},  X_(3) = {S, B, D, L, E, X, A, T}

Page 31:

Curriculum Learning of BNs

Bayesian Approach to Learn G_i

Let X′_(i) take q values. Group the samples over X_(i) based on the values of X′_(i): D_i = {D_i,1, …, D_i,q}.

Assumption: D_i,1, …, D_i,q are generated by the same G_i but with independent parameters:

    P(D_i | G_i) = ∏_{j=1}^q P(D_i,j | G_i)

Run heuristic search.
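The grouping step can be sketched as follows; the data layout (a list of dicts) and the black-box `log_score` over projected X_(i)-samples are assumptions for illustration.

```python
from collections import defaultdict

def grouped_log_score(data, subset, rest, log_score):
    """log P(D_i | G_i) = sum_j log P(D_i,j | G_i): one term per observed value
    of the conditioning set X'_(i), assuming independent parameters per group."""
    groups = defaultdict(list)
    for d in data:
        key = tuple(d[v] for v in rest)                 # value of X'_(i)
        groups[key].append({v: d[v] for v in subset})   # sample projected on X_(i)
    return sum(log_score(rows) for rows in groups.values())
```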

Page 32:

Curriculum Learning of BNs

Construct a Curriculum

[Figure: the four Asia-network snapshots again.]

    {S, B, D, L, E, X} = {S, B, D} + {L, E, X}
         X_(2)              X_(1)     (to be included)

Q: Which variables shall be included next (added into X_(i−1))?

Intuition: the variables that are most likely to have connections with the current set of variables X_(i−1).

Heuristic: use the average pairwise mutual information

    AveMI(Y, X_(i−1)) = ∑_{X ∈ X_(i−1)} I(X, Y) / |X_(i−1)|

Q: How many variables shall be included next? (Step size.)
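The AveMI heuristic above can be sketched directly; the data layout (a list of dicts over discrete variables) is an assumption.

```python
from collections import Counter
from math import log

def mutual_information(data, x, y):
    """Empirical I(X, Y) between two discrete variables from samples."""
    n = len(data)
    pxy = Counter((d[x], d[y]) for d in data)
    px = Counter(d[x] for d in data)
    py = Counter(d[y] for d in data)
    return sum(c / n * log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def next_variable(data, current, remaining):
    """Pick argmax_Y AveMI(Y, X_(i-1)) = sum_{X in X_(i-1)} I(X, Y) / |X_(i-1)|."""
    return max(remaining,
               key=lambda y: sum(mutual_information(data, x, y)
                                 for x in current) / len(current))
```

Calling `next_variable` repeatedly, moving the winner from `remaining` to `current`, yields the curriculum ordering.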

Page 35:

Curriculum Learning of BNs

Theoretical Analysis

Ideally, each intermediate target should be closer to the subsequent targets than any of its predecessors in the sequence.

Theorem. For any i, j, k s.t. 1 ≤ i < j < k ≤ n, we have

    d_H(G_i, G_k) ≥ d_H(G_j, G_k)

where d_H(G_i, G_j) is the Structural Hamming Distance (SHD) between the structures of the two BNs G_i and G_j.

Theorem. For any i, j, k s.t. 1 ≤ i < j < k ≤ n, we have

    d_TV(G_i, G_k) ≥ d_TV(G_j, G_k)

where d_TV(G_i, G_j) is the total variation distance between the two distributions defined by the two BNs G_i and G_j.

Page 36:

Curriculum Learning of BNs

Comparison With MMHC

Table: Comparisons under different metrics

Metric  Algorithm  SS=100     SS=500     SS=1000    SS=5000    SS=10000   SS=50000
BDeu    CL         1(0)       1(10)      1(9)       1(8)       1(10)      1(8)
BDeu    MMHC       0.89(10)   1.06(0)    1.02(1)    1.01(2)    1.02(0)    1.01(2)
BIC     CL         1(0)       1(9)       1(9)       1(6)       1(8)       1(8)
BIC     MMHC       0.88(10)   1.07(1)    1.02(1)    1.02(4)    1.02(2)    1.01(2)
KL      CL         1(0)       1(10)      1(9)       1(7)       1(9)       1(9)
KL      MMHC       1.71(10)   0.82(0)    0.96(1)    0.96(2)    0.97(0)    0.97(0)
SHD     CL         1(7)       1(9)       1(7)       1(7)       1(8)       1(6)
SHD     MMHC       1.06(3)    1.26(1)    1.29(3)    1.07(2)    1.21(1)    1.24(3)

Page 37:

Curriculum Learning of BNs

Curriculum Learning of BN: Conclusion

We propose a novel heuristic algorithm for Bayesian network structure learning.

It has the desired theoretical properties of curriculum learning.

We empirically showed that our algorithm outperforms the state-of-the-art MMHC algorithm.

Future work: other types of curriculum?

Page 38:

Curriculum Learning of BNs

Open Problem

How to efficiently learn good CBNs from high-dimensional data?

- Curriculum learning?
- Sparse learning?
- Parallel algorithms?
- Deep learning?
