Bayesian Approach to Causal Structure Learning
Jin Tian
Computer Science Department, Iowa State University, Ames, IA, USA
Table of contents
1 Introduction
2 Bayesian Approach
3 Incorporating Experimental Data
4 Structure Discovery
5 Bayesian Model Averaging
6 Curriculum Learning of Bayesian Network Structures
Causal Bayesian Networks
A DAG. Nodes: random variables. Edges: direct causal influence.
[Figure: example causal DAGs: Smoking → Tar in lungs → Cancer, and a second graph over nodes X, Y, Z, U]
Modularity: each parent-child relationship represents an autonomous causal mechanism.
Functional: v_i = f(pa_i, ε). Probabilistic: P(v_i | pa_i).
Causal Bayesian Networks: Applications
Learning Causal Structures from Data
Bayesian Approach
Deal with uncertainty by assigning probabilities to all possibilities.
The posterior probability of a network G:
P(G|D) = P(D|G) P(G) / P(D)
Need to compute the marginal likelihood:
P(D|G) = ∫_Θ P(D|Θ, G) P(Θ|G) dΘ
Bayesian Approach
Assumptions on the parameter prior:
Global Parameter Independence:
P(Θ|G) = ∏_{i=1}^n P(Ψ_i | G)
Local Parameter Independence:
P(Ψ_i | G) = ∏_{pa_i} P(θ_{pa_i}), i = 1, ..., n
Dirichlet distribution:
P(θ_{pa_i}) = Dir(θ_{pa_i} | α_{pa_i})
Bayesian Approach
Assuming complete data, a closed-form expression for the marginal likelihood has been derived:
P(D|G) = ∏_{i=1}^n f_i(V_i, Pa_i : D)    (decomposable)
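The slides leave f_i implicit; with the Dirichlet priors above and complete discrete data, each local term is a Dirichlet-multinomial marginal likelihood (the Bayesian-Dirichlet family; BDeu for a uniform equivalent-sample-size prior). A minimal sketch follows; the function name, the data layout (a list of dicts mapping variable names to values), and the use of observed parent configurations in place of the full configuration count are my own assumptions, not the author's code.

```python
from collections import Counter
from math import lgamma  # log Gamma function

def local_score(data, child, parents, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood f_i(V_i, Pa_i : D)
    for one discrete node, with a BDeu-style prior of total strength alpha."""
    states = sorted({row[child] for row in data})
    r = len(states)
    # Group child counts N_{jk} by parent configuration j.
    counts = {}
    for row in data:
        j = tuple(row[p] for p in parents)
        counts.setdefault(j, Counter())[row[child]] += 1
    # Exact BDeu uses q = product of parent cardinalities; we approximate
    # with the number of observed configurations to keep the sketch short.
    q = max(len(counts), 1)
    a_j, a_jk = alpha / q, alpha / (q * r)
    score = 0.0
    for cnt in counts.values():
        n_j = sum(cnt.values())
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for k in states:
            score += lgamma(a_jk + cnt[k]) - lgamma(a_jk)
    return score
```

The decomposable log marginal likelihood is then log P(D|G) = Σ_i local_score(data, V_i, Pa_i(G)).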
Incorporating Experimental Data
Two data sets, D and D′, generated from a causal structure G but with different parameters, Θ_G and Θ′_G:
P(D, D′|G) = ∫ P(D, D′ | Θ_G, Θ′_G, G) P(Θ_G, Θ′_G | G) dΘ_G dΘ′_G
P(D, D′ | Θ_G, Θ′_G, G) = P(D | Θ_G, G) P(D′ | Θ′_G, G)
Encode the knowledge about the experimental setting in the prior P(Θ_G, Θ′_G | G).
Incorporating Experimental Data
Known intervention target V_l.
Mechanism change (imperfect intervention):
P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i≠l} δ(Ψ_i − Ψ′_i)
⇒ Pool the data for nodes i ≠ l:
P(D, D′|G) = f_l(V_l, Pa_l : D) f_l(V_l, Pa_l : D′) ∏_{i≠l} f_i(V_i, Pa_i : D, D′)
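As a concrete reading of this factorization, here is a sketch reusing the hypothetical local_score above; samples are assumed to be stored in Python lists, so `+` concatenates the two data sets.

```python
def pooled_log_ml(data_obs, data_exp, target, parents_of):
    """log P(D, D'|G) under a mechanism change at `target`:
    score the target on each data set separately, pool the rest."""
    total = 0.0
    for v, pa in parents_of.items():
        if v == target:
            total += local_score(data_obs, v, pa)   # f_l(V_l, Pa_l : D)
            total += local_score(data_exp, v, pa)   # f_l(V_l, Pa_l : D')
        else:
            total += local_score(data_obs + data_exp, v, pa)  # pooled
    return total
```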
Incorporating Experimental Data
Known intervention target V_l.
Ideal intervention do(V_l = v_lj):
P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i≠l} δ(Ψ_i − Ψ′_i)
P(θ′_{pa_l}) = δ(θ′_{v_lj; pa_l} − 1) ∏_{v_l ≠ v_lj} δ(θ′_{v_l; pa_l})
⇒ Pool the data for nodes i ≠ l; drop D′ for node l:
P(D, D′|G) = f_l(V_l, Pa_l : D) ∏_{i≠l} f_i(V_i, Pa_i : D, D′)
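The ideal-intervention case differs from the previous sketch only in dropping the experimental data for the target node; a sketch under the same assumptions as above:

```python
def pooled_log_ml_do(data_obs, data_exp, target, parents_of):
    """log P(D, D'|G) under an ideal intervention do(V_l = v_lj):
    pool the data for nodes i != l; drop D' entirely for the target,
    whose intervened parameters are fixed by the delta priors."""
    total = 0.0
    for v, pa in parents_of.items():
        if v == target:
            total += local_score(data_obs, v, pa)             # D only
        else:
            total += local_score(data_obs + data_exp, v, pa)  # pooled
    return total
```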
Incorporating Experimental Data
Known intervention target V_l.
Mechanism change:
P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Ψ′_l | G) ∏_{i≠l} δ(Ψ_i − Ψ′_i)
Open question: what if Ψ′_l is some parametric function of Ψ_l?
Incorporating Experimental Data
Unknown intervention target.
Assume independent parameters:
P(Θ_G, Θ′_G | G) = P(Θ_G | G) P(Θ′_G | G)
⇒ P(D, D′|G) = P(D|G) P(D′|G)
Without knowledge of how they came about, two data sets do not increase our power of structure discrimination, beyond providing more samples.
Incorporating Experimental Data
Unknown intervention target.
Introduce interventional nodes E_1, E_2, ....
Assumptions: the number of states of each E_i; each E_i has one or multiple children.
Learn a structure over the variables V ∪ E, with the E variables as source nodes.
Can we learn the intervention targets more efficiently?
Structure Discovery as Model Selection
Model selection: look for the network that maximizes P(G|D), the maximum a posteriori (MAP) network.
Score-based search: searching in the space of possible DAGs.
NP-hard: the number of possible DAGs is O(n! 2^{n(n−1)/2}).
Existing Work
Exact methods: find an optimal BN given a decomposable score.
Dynamic programming (Silander et al., 2006): exponential time and space.
A* search (Yuan et al., 2011): shortest-path finding.
Integer linear programming (ILP) (Jaakkola et al., 2010; Cussens, 2011, 2014): bounded in-degree.
Existing Work
Heuristic search:
random-restart hill-climbing, simulated annealing, ...
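As one hedged illustration of what such a search loop looks like, here is a single-restart greedy climber over add/delete moves with a decomposable score such as local_score above; edge reversal and random restarts are omitted for brevity, and all names are my own.

```python
def creates_cycle(parents, child, new_parent):
    """Would adding new_parent -> child create a directed cycle?
    True iff child is already an ancestor of new_parent."""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def hill_climb(data, variables, score_fn):
    """Greedy hill-climbing over DAGs: repeatedly apply the single-edge
    addition or deletion that most improves the total decomposable score."""
    parents = {v: set() for v in variables}
    local = {v: score_fn(data, v, parents[v]) for v in variables}
    while True:
        best_gain, best_move = 0.0, None
        for child in variables:
            for p in variables:
                if p == child:
                    continue
                if p in parents[child]:                      # try deleting p -> child
                    new_pa = parents[child] - {p}
                elif not creates_cycle(parents, child, p):   # try adding p -> child
                    new_pa = parents[child] | {p}
                else:
                    continue
                gain = score_fn(data, child, new_pa) - local[child]
                if gain > best_gain:
                    best_gain, best_move = gain, (child, new_pa)
        if best_move is None:
            return parents                                   # local optimum
        child, new_pa = best_move
        parents[child] = new_pa
        local[child] = score_fn(data, child, new_pa)
```

For random restarts one would rerun hill_climb from random initial DAGs and keep the best-scoring result.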
Existing Work
Hybrid: Max-Min Hill-Climbing (MMHC) (Tsamardinos et al., 2006).
First estimates the parents and children of each node from CI tests, then performs a constrained greedy search.
The state-of-the-art heuristic algorithm.
Open Problem
How to efficiently learn good CBNs from high-dimensional data?
Bayesian Model Averaging (BMA)
When the sample size is small:
If we are interested in, say, the probability that an edge A → C is in the true network,
an answer based on a single model is often useless;
we want features common to many models.
Bayesian Model Averaging
The posterior probability of any hypothesis of interest f:
P(f|D) = Σ_G P(f|G) P(G|D)
E.g., the posterior probability of an edge:
P(j → i | D) = Σ_{G: j→i ∈ G} P(G|D)
Difficulty: the super-exponential number of possible DAG structures, O(n! 2^{n(n−1)/2}).
Existing Work
Exact methods: sum over all possible DAGs using dynamic programming → from super-exponential to exponential.
Compute the posteriors of all n(n−1) potential edges:
Koivisto (2006): O(n 2^n) time and space; biased, favoring graphs consistent with more orderings.
Tian and He (2009): O(n 3^n) time and O(n 2^n) space.
Compute the posteriors of all n(n−1) potential ancestor relations:
Parviainen et al. (2011): O(n 3^n) time and O(3^n) space; biased.
Chen et al. (2015): O(n^2 5^{n−1}) time and O(3^n) space.
Existing Work
Approximate computation by using a set 𝒢 of high-scoring networks:
P(f|D) ≈ Σ_{G∈𝒢} P(f|G) P(G|D) / Σ_{G∈𝒢} P(G|D)
Find the k-best DAGs by dynamic programming (Tian et al., 2010): O(n 2^n k log k) time, O(k 2^n) space.
Find the k-best equivalence classes (Chen and Tian, 2014).
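A sketch of how this approximation is evaluated once a set of high-scoring DAGs is in hand; graphs are represented as parent-set dicts with log scores, and all names are hypothetical.

```python
import math

def edge_posterior(scored_graphs, edge):
    """Approximate P(u -> v | D) from pairs (parents_dict, log_score),
    e.g. the k-best DAGs, by normalizing over the set."""
    m = max(s for _, s in scored_graphs)              # stabilize exponentials
    weights = [math.exp(s - m) for _, s in scored_graphs]
    u, v = edge
    hit = sum(w for (pa, _), w in zip(scored_graphs, weights) if u in pa[v])
    return hit / sum(weights)
```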
Existing Work
Approximate computation via sampling.
If we manage to sample graphs G_1, ..., G_K from P(G|D), then
P(f|D) ≈ (1/K) Σ_i f(G_i)
Markov chain Monte Carlo (MCMC) sampling in the space of DAGs [Madigan and York 1995; Grzegorczyk and Husmeier 2008]: the Metropolis-Hastings algorithm.
MCMC in the space of node orderings [Friedman and Koller 2003; Ellis and Wong 2008]; partial order MCMC [Niinimaki et al. 2011].
Kuipers and Moffa 2016; He et al. 2016.
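A minimal sketch of structure MCMC in this spirit, reusing creates_cycle from the hill-climbing sketch; `variables` is a list. Note this simplified version treats the add/delete proposal as symmetric and omits the Hastings correction for unequal neighborhood sizes that a correct structure-MCMC sampler must include.

```python
import math, random

def structure_mcmc(data, variables, score_fn, n_steps=10000):
    """Metropolis-Hastings over DAGs with single-edge add/delete proposals,
    targeting P(G|D) proportional to exp(decomposable score)."""
    parents = {v: set() for v in variables}
    samples = []
    for _ in range(n_steps):
        child, p = random.sample(variables, 2)
        if p in parents[child]:
            new_pa = parents[child] - {p}                # propose deletion
        elif not creates_cycle(parents, child, p):
            new_pa = parents[child] | {p}                # propose addition
        else:                                            # cyclic move: stay put
            samples.append({v: set(pa) for v, pa in parents.items()})
            continue
        delta = score_fn(data, child, new_pa) - score_fn(data, child, parents[child])
        if math.log(random.random()) < delta:            # accept w.p. min(1, e^delta)
            parents[child] = new_pa
        samples.append({v: set(pa) for v, pa in parents.items()})
    return samples
```

After discarding a burn-in prefix, P(f|D) is estimated as the fraction of sampled graphs in which the feature f holds.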
Open Problem
How to efficiently sample the DAG space in high-dimensional settings?
Curriculum Learning of Bayesian Network Structures
We propose a heuristic search algorithm based on the idea of curriculum learning.
Curriculum Learning
Guided learning helps training humans and animals.
Start from simpler examples/easier tasks (Piaget, 1952; Skinner, 1958).
Curriculum Learning in Machine Learning
Bengio et al. (2009): a curriculum is a sequence of weighting schemes of the training data, ⟨W_1, W_2, ..., W_n⟩:
W_1 assigns more weight to easier samples;
each subsequent scheme assigns more weight to harder samples;
W_n assigns uniform weight to all samples.
Advantages: faster convergence to a (local) optimum; convergence to a better local optimum.
Difficulty: how to design a good curriculum strategy?
Curriculum Learning of BN Structures
We define a curriculum as (X(1), ..., X(n)), a sequence of selected variable subsets.
[Figure: the Asia network over {A, S, T, L, B, E, X, D}, shown four times with the successive curriculum subsets highlighted]
Curriculum: {X(1), X(2), X(3)}
X(1) = {S, B, D}, X(2) = {S, B, D, L, E, X}, X(3) = {S, B, D, L, E, X, A, T}
Curriculum Learning of BN Structures
Intermediate learning target G_i: a network over X(i) conditioned on the rest of the variables, X′(i) = X \ X(i).
Bayesian Approach to Learn G_i
Let X′(i) take q values.
D_i = {D_{i,1}, ..., D_{i,q}}: group the samples over X(i) based on the values of X′(i).
Assumption: D_{i,1}, ..., D_{i,q} are generated by the same G_i but with independent parameters:
P(D_i | G_i) = ∏_{j=1}^q P(D_{i,j} | G_i)
Run heuristic search.
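A sketch of this grouped score, reusing the hypothetical local_score; the parent sets in parents_of are assumed to lie within X(i), and rows are dicts as before.

```python
from collections import defaultdict

def conditioned_log_ml(data, rest_vars, parents_of):
    """log P(D_i | G_i): partition the samples by the joint value of the
    conditioning variables X'(i), then score each group independently."""
    groups = defaultdict(list)
    for row in data:
        groups[tuple(row[v] for v in rest_vars)].append(row)
    return sum(local_score(rows, v, pa)
               for rows in groups.values()
               for v, pa in parents_of.items())
```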
Construct a Curriculum
[Figure: the Asia network, highlighting the current subset X(1) and the variables to be added next]
{S, B, D, L, E, X} (= X(2)) = {S, B, D} (= X(1)) + {L, E, X} (to be included)
Q: Which variables shall be included next (added to X(i−1))?
Intuition: the variables that are most likely to have connections with the current set of variables X(i−1).
Heuristic: use the average pairwise mutual information
AveMI(Y, X(i−1)) = Σ_{X ∈ X(i−1)} I(X, Y) / |X(i−1)|
Q: How many variables shall be included next? (the step size)
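A sketch of this selection heuristic, using empirical mutual information from discrete samples; the names are my own.

```python
import math
from collections import Counter

def mutual_information(data, x, y):
    """Empirical I(X, Y) from discrete samples given as dicts."""
    n = len(data)
    pxy = Counter((row[x], row[y]) for row in data)
    px = Counter(row[x] for row in data)
    py = Counter(row[y] for row in data)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def ave_mi(data, candidate, current_set):
    """AveMI(Y, X(i-1)): average pairwise MI between a candidate variable
    and the variables already included in the curriculum."""
    return sum(mutual_information(data, x, candidate)
               for x in current_set) / len(current_set)

# Greedy step: include the candidate(s) with the largest AveMI, e.g.
# next_var = max(remaining, key=lambda y: ave_mi(data, y, current_set))
```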
Theoretical Analysis
Ideally, each intermediate target should be closer to the subsequent targets than any of its predecessors in the sequence.
Theorem
For any i, j, k s.t. 1 ≤ i < j < k ≤ n, we have
d_H(G_i, G_k) ≥ d_H(G_j, G_k),
where d_H(G_i, G_j) is the Structural Hamming Distance (SHD) between the structures of two BNs G_i and G_j.
Theorem
For any i, j, k s.t. 1 ≤ i < j < k ≤ n, we have
d_TV(G_i, G_k) ≥ d_TV(G_j, G_k),
where d_TV(G_i, G_j) is the total variation distance between the two distributions defined by the two BNs G_i and G_j.
Comparison With MMHC
Table: Comparisons under different metrics
Metric  Algorithm       100        500       1000       5000      10000      50000
BDeu    CL             1(0)      1(10)       1(9)       1(8)      1(10)       1(8)
BDeu    MMHC       0.89(10)    1.06(0)    1.02(1)    1.01(2)    1.02(0)    1.01(2)
BIC     CL             1(0)       1(9)       1(9)       1(6)       1(8)       1(8)
BIC     MMHC       0.88(10)    1.07(1)    1.02(1)    1.02(4)    1.02(2)    1.01(2)
KL      CL             1(0)      1(10)       1(9)       1(7)       1(9)       1(9)
KL      MMHC       1.71(10)    0.82(0)    0.96(1)    0.96(2)    0.97(0)    0.97(0)
SHD     CL             1(7)       1(9)       1(7)       1(7)       1(8)       1(6)
SHD     MMHC        1.06(3)    1.26(1)    1.29(3)    1.07(2)    1.21(1)    1.24(3)
Curriculum Learning of BN: Conclusion
We propose a novel heuristic algorithm for Bayesian network structure learning.
It has the desired theoretical properties of curriculum learning.
We empirically show that our algorithm outperforms the state-of-the-art MMHC algorithm.
Future work: other types of curriculum?
Open Problem
How to efficiently learn good CBNs from high-dimensional data?
Curriculum learning? Sparse learning? Parallel algorithms? Deep learning?