Structure learning in Bayesian networks
Probabilistic Graphical Models
Sharif University of Technology
Spring 2018
Soleymani
Recall: Learning problem
Target: true distribution 𝑃∗ that may correspond to a model ℳ∗ = ⟨𝒦∗, 𝜽∗⟩
Hypothesis space: specified probabilistic graphical models
Data: set of instances sampled from 𝑃∗
Learning goal: selecting a model ℳ that constructs the best
approximation to ℳ∗ according to a performance metric
We may use Bayesian model averaging instead of selecting a model ℳ
Structure learning of Bayesian networks
For a fixed set of variables (i.e. nodes), we intend to learn
the directed graph structure 𝒢∗:
Which edges are included in the model?
Too few edges miss real dependencies
Too many edges introduce spurious dependencies
We often prefer sparser structures, which tend to generalize better
Learning the structure of DGMs
Approaches
Constraint-based
Find conditional independencies in data using statistical tests
Find an I-map graph for the obtained conditional independencies
Score-based
View BN structure and parameter learning as a density estimation
task and address structure learning as a model selection problem
A score function measures how well the model fits the observed data
Maximum likelihood
Bayesian score
Score-based approach
Input:
Hypothesis space: set of possible structures
Training data
Output: a network that maximizes the score function
Thus, we need a score function (including priors, if needed)
and a search strategy (i.e. optimization procedure) to find a
structure in the hypothesis space
Structural search
Hypothesis space
General graphs: the number of graphs on $M$ nodes is $O(2^{M^2})$ (super-exponential)
Trees: the number of trees on $M$ nodes is $O(M!)$
Combinatorial optimization problem
Many heuristics to solve this problem, but no guarantee of attaining optimality
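To make the super-exponential growth concrete, here is a small sketch (function name mine, not from the slides) that counts labeled DAGs on $M$ nodes using Robinson's recurrence:

```python
from math import comb

def num_dags(m):
    """Number of labeled DAGs on m nodes, via Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k*(n-k)) * a(n-k)."""
    a = [1]  # a(0) = 1
    for n in range(1, m + 1):
        a.append(sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * a[n - k]
                     for k in range(1, n + 1)))
    return a[m]

for m in range(1, 8):
    print(m, num_dags(m))
# 1, 3, 25, 543, 29281, 3781503, 1138779265 -- already astronomical by m = 7
```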
Likelihood score
Find the $\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}$ that maximize the likelihood:

$$\max_{\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) = \max_{\mathcal{G}} \left[ \max_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) \right] = \max_{\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$$

Likelihood score: $\mathrm{Score}_L(\mathcal{G}; \mathcal{D}) = \log P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$, where $\hat{\boldsymbol{\theta}}_{\mathcal{G}}$ is the maximum-likelihood parameter estimate for structure $\mathcal{G}$
Likelihood score: information-theoretic interpretation
Given the structure, the log-likelihood of the data decomposes over variables:

$$\ln P(\mathcal{D} \mid \hat{\theta}_{\mathcal{G}}, \mathcal{G}) = \ln \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \hat{\theta}_{\mathcal{G}}, \mathcal{G}) = \sum_{i} \sum_{n=1}^{N} \ln P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{pa}_{X_i}^{(n)}, \hat{\theta}_{X_i \mid Pa_{X_i}}, \mathcal{G}\right)$$

Grouping terms by value and parent configuration, and using that the ML parameters are the empirical conditionals $\hat{P}(x_i \mid \boldsymbol{u}_i)$, with $\hat{P}(x_i, \boldsymbol{u}_i) = \mathrm{Count}(x_i, \boldsymbol{u}_i)/N$:

$$\ln P(\mathcal{D} \mid \hat{\theta}_{\mathcal{G}}, \mathcal{G}) = N \sum_{i} \sum_{x_i \in Val(X_i)} \sum_{\boldsymbol{u}_i \in Val(Pa_{X_i})} \hat{P}(x_i, \boldsymbol{u}_i) \ln \hat{P}(x_i \mid \boldsymbol{u}_i)$$

Multiplying and dividing by $\hat{P}(x_i)$ inside the logarithm splits this into a mutual-information term and an entropy term:

$$\ln P(\mathcal{D} \mid \hat{\theta}_{\mathcal{G}}, \mathcal{G}) = N \sum_{i} \sum_{x_i} \sum_{\boldsymbol{u}_i} \hat{P}(x_i, \boldsymbol{u}_i) \ln \frac{\hat{P}(x_i \mid \boldsymbol{u}_i)\, \hat{P}(x_i)}{\hat{P}(x_i)} = N \sum_{i} \left[ I_{\hat{P}}(X_i; Pa_{X_i}) - H_{\hat{P}}(X_i) \right]$$

This score measures the strength of the dependencies between variables and their parents. Since the entropy term does not depend on the structure, maximizing the likelihood score amounts to maximizing the total mutual information between each variable and its parents.
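As a concrete illustration of this decomposition, here is a minimal sketch that scores a structure by the empirical mutual-information form above, assuming discrete data in an $N \times M$ integer array and parent sets given per variable (all function names are mine):

```python
import numpy as np
from collections import Counter

def empirical_mi(data, i, parents):
    """Empirical mutual information I_Phat(X_i; Pa_i) from an (N, M) int array."""
    n = data.shape[0]
    joint = Counter((row[i], tuple(row[p] for p in parents)) for row in data)
    px = Counter(row[i] for row in data)
    pu = Counter(tuple(row[p] for p in parents) for row in data)
    # sum_{x,u} Phat(x,u) * ln( Phat(x,u) / (Phat(x) Phat(u)) )
    return sum((c / n) * np.log(c * n / (px[x] * pu[u]))
               for (x, u), c in joint.items())

def entropy(data, i):
    """Empirical entropy H_Phat(X_i)."""
    n = data.shape[0]
    return -sum((c / n) * np.log(c / n) for c in Counter(row[i] for row in data).values())

def likelihood_score(data, parent_sets):
    """Score_L(G; D) = N * sum_i [ I_Phat(X_i; Pa_i) - H_Phat(X_i) ]."""
    n = data.shape[0]
    return n * sum(empirical_mi(data, i, pa) - entropy(data, i)
                   for i, pa in enumerate(parent_sets))
```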
Limitations of likelihood score
We have $I_{\hat{P}}(X; Y) \ge 0$, with $I_{\hat{P}}(X; Y) = 0$ iff $\hat{P}(X, Y) = \hat{P}(X)\,\hat{P}(Y)$
In the training data, almost always $I_{\hat{P}}(X; Y) > 0$:
due to statistical noise in the training data, exact independence
almost never occurs in the empirical distribution
Moreover, $I(X; Y \cup Z) \ge I(X; Y)$, with equality iff $X \perp Z \mid Y$, so adding parents never decreases the score
Thus, the ML score never prefers the simpler network;
the score is maximized by a fully connected network
Therefore the likelihood score fails to generalize well
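A quick demonstration of this effect, reusing `empirical_mi` from the previous sketch: even for two variables that are independent by construction, the empirical mutual information is almost surely positive.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two binary variables sampled independently: I(X; Y) = 0 in the true distribution.
data = rng.integers(0, 2, size=(500, 2))
print(empirical_mi(data, 0, [1]))  # small, but almost surely > 0 due to sampling noise
```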
Avoiding overfitting
Restricting the hypothesis space explicitly:
restrict the number of parents or the number of parameters
Imposing a penalty on the complexity of the structure:
explicitly (the BIC/MDL penalty) or implicitly (the Bayesian score)
Tree structure: Chow-Liu algorithm
Compute $w_{ij} = I_{\hat{P}}(X_i; X_j)$ for each pair of nodes $X_i$ and $X_j$
Find a maximum-weight spanning tree (MWST), i.e. the tree with the greatest total weight:

$$\max_{T} \sum_{(i,j) \in T} w_{ij}$$

Assign directions to the edges of the MWST:
pick an arbitrary node as the root and draw arrows pointing away
from it (e.g., using a BFS algorithm)
In this way we can find an optimal tree in polynomial time
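A minimal sketch of the full pipeline, reusing `empirical_mi` from the earlier sketch and SciPy's spanning-tree routine (function names are mine):

```python
from collections import deque

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu(data):
    """Chow-Liu: pairwise empirical MI -> max-weight spanning tree -> orient by BFS.
    Returns a parent list: parent[i] is the parent of X_i (None for the root)."""
    m = data.shape[1]
    w = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            w[i, j] = w[j, i] = empirical_mi(data, i, [j])
    # SciPy builds a *minimum* spanning tree, so negate the weights; empirical MI is
    # almost surely > 0, so -w has no accidental zero (i.e. absent) edges.
    mst = minimum_spanning_tree(-w).toarray()
    adj = (mst != 0) | (mst.T != 0)
    parent = [None] * m
    seen, queue = {0}, deque([0])  # root the tree at an arbitrary node
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if v not in seen:
                parent[v] = u
                seen.add(v)
                queue.append(v)
    return parent
```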
BIC score
$$\mathrm{Score}_{BIC}(\mathcal{G}; \mathcal{D}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) - \frac{\log N}{2}\,\mathrm{Dim}(\mathcal{G}) = N \sum_{i} I_{\hat{P}}(X_i; Pa_{X_i}) - N \sum_{i} H_{\hat{P}}(X_i) - \frac{\log N}{2}\,\mathrm{Dim}(\mathcal{G})$$

Here $\mathrm{Dim}(\mathcal{G})$ is the model complexity (i.e., the number of independent parameters), and the entropy term can be ignored since it does not depend on $\mathcal{G}$.
The negation of this score is also known as the Minimum Description Length
(MDL)
For larger $N$, more emphasis is given to the fit to the data:
mutual information grows linearly with $N$ while the complexity penalty grows
only logarithmically with $N$
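A minimal sketch of the score, building on `likelihood_score` from the earlier sketch; the input format `cards[i]` $= |Val(X_i)|$ is my assumption:

```python
import numpy as np

def dim_g(cards, parent_sets):
    """Dim(G): number of independent parameters of a discrete BN.
    Each X_i contributes (|Val(X_i)| - 1) parameters per parent configuration."""
    return sum((cards[i] - 1) * int(np.prod([cards[p] for p in pa]))
               for i, pa in enumerate(parent_sets))

def bic_score(data, cards, parent_sets):
    """Score_BIC(G; D) = log-likelihood at the MLE minus (log N / 2) * Dim(G)."""
    n = data.shape[0]
    return likelihood_score(data, parent_sets) - 0.5 * np.log(n) * dim_g(cards, parent_sets)
```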
Asymptotic consistency of BIC score
Definition: a scoring function is called consistent if, as $N \to \infty$ and with probability $\to 1$:
𝒢∗ (or any I-equivalent structure) maximizes the score
All non-I-equivalent structures have strictly lower score
Theorem: As 𝑁 → ∞, the structure that maximizes the BIC
score is the true structure 𝒢∗ (or any I-equivalent structure)
Asymptotically, spurious dependencies will not contribute to likelihood
and will be penalized
Required dependencies will be added, since the likelihood term grows linearly
in $N$ while the model-complexity penalty grows only logarithmically
Bayesian score
Uncertainty over parameters and structure: average over all possible
parameter values and also consider a structure prior

$$P(\mathcal{G} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{G})\, P(\mathcal{G})}{P(\mathcal{D})}$$

where $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood, $P(\mathcal{G})$ is the structure prior, and the denominator $P(\mathcal{D})$ is independent of $\mathcal{G}$.

Bayesian score:

$$\mathrm{Score}_B(\mathcal{G}) = \log P(\mathcal{D} \mid \mathcal{G}) + \log P(\mathcal{G})$$

To find the marginal likelihood, we integrate over the whole distribution of the parameters $\boldsymbol{\theta}_{\mathcal{G}}$ for structure $\mathcal{G}$:

$$P(\mathcal{D} \mid \mathcal{G}) = \int_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}})\, P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})\, d\boldsymbol{\theta}_{\mathcal{G}}$$
Marginal likelihood
By the chain rule, the marginal likelihood is a product of sequential predictions:

$$P(\mathcal{D} \mid \mathcal{G}) = P(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)} \mid \mathcal{G}) = P(\boldsymbol{x}^{(1)} \mid \mathcal{G}) \times P(\boldsymbol{x}^{(2)} \mid \boldsymbol{x}^{(1)}, \mathcal{G}) \times \cdots \times P(\boldsymbol{x}^{(N)} \mid \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N-1)}, \mathcal{G})$$

For a single variable with a Dirichlet prior (hyperparameters $\alpha_1, \ldots, \alpha_K$ and $\alpha = \sum_{k=1}^{K} \alpha_k$):

$$P(x^{(1)}, \ldots, x^{(N)}) = \frac{\prod_{k=1}^{K} \alpha_k (\alpha_k + 1) \cdots (\alpha_k + M[k] - 1)}{\alpha (\alpha + 1) \cdots (\alpha + N - 1)} = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \times \prod_{k=1}^{K} \frac{\Gamma(\alpha_k + M[k])}{\Gamma(\alpha_k)}$$

where $M[k] = \sum_{n=1}^{N} I(x^{(n)} = k)$ counts the occurrences of value $k$, and we used $\alpha (\alpha + 1) \cdots (\alpha + N - 1) = \Gamma(\alpha + N)/\Gamma(\alpha)$.
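A small numerical check of this closed form (function name mine), comparing it against the sequential chain-rule computation on a toy sample:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_multinomial(counts, alphas):
    """log P(x^(1..N)) for one discrete variable with a Dirichlet(alphas) prior:
    log G(a) - log G(a+N) + sum_k [log G(a_k + M[k]) - log G(a_k)]."""
    counts, alphas = np.asarray(counts, float), np.asarray(alphas, float)
    return (gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

# Sequential check for the sample (0, 1, 0) with alphas = (1, 1):
# P(x1=0) P(x2=1 | x1) P(x3=0 | x1, x2) = (1/2)(1/3)(2/4) = 1/12
print(np.exp(log_marginal_multinomial([2, 1], [1.0, 1.0])))  # 0.0833... = 1/12
```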
Marginal likelihood in Bayesian networks: global & local parameter independence
If $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfies global parameter independence, the marginal likelihood factorizes over families:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \int_{\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}} \prod_{n=1}^{N} P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{pa}_{X_i}^{\mathcal{G}\,(n)}, \boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}, \mathcal{G}\right) P\!\left(\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}} \,\middle|\, \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}$$

If $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfies both global and local parameter independence, it further factorizes over parent configurations:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{u}_i \in Val(Pa_{X_i}^{\mathcal{G}})} \int_{\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}} \prod_{n:\, \boldsymbol{pa}_{X_i}^{\mathcal{G}\,(n)} = \boldsymbol{u}_i} P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{u}_i, \boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}, \mathcal{G}\right) P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i} \,\middle|\, \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}$$

If the Dirichlet prior $P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i} \mid \mathcal{G})$ has hyperparameters $\{\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} \mid x_i \in Val(X_i)\}$, each integral has a closed form:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{u}_i \in Val(Pa_{X_i}^{\mathcal{G}})} \frac{\Gamma\!\left(\sum_{x_i \in Val(X_i)} \alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}}\right)}{\Gamma\!\left(\sum_{x_i \in Val(X_i)} \left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} + M_i[x_i, \boldsymbol{u}_i]\right)\right)} \prod_{x_i \in Val(X_i)} \frac{\Gamma\!\left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} + M_i[x_i, \boldsymbol{u}_i]\right)}{\Gamma\!\left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}}\right)}$$

where $M_i[x_i, \boldsymbol{u}_i]$ counts the instances with $X_i = x_i$ and $Pa_{X_i}^{\mathcal{G}} = \boldsymbol{u}_i$.
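A sketch of one factor of this product, the log marginal likelihood of a single family, using uniform hyperparameters $\alpha_{x_i \mid \boldsymbol{u}_i} = \alpha / (|Val(X_i)| \cdot |Val(Pa_{X_i})|)$ (the BDeu special case of the BDe prior with a uniform $P'$; function names mine):

```python
import numpy as np
from scipy.special import gammaln

def log_family_marginal(data, i, parents, cards, alpha=1.0):
    """Log of the i-th factor of P(D|G) under a BDeu-style Dirichlet prior.
    Parent configurations never seen in the data contribute Gamma-ratio
    factors equal to 1, so it suffices to sum over observed configurations."""
    parents = list(parents)
    k = cards[i]
    n_cfg = int(np.prod([cards[p] for p in parents])) if parents else 1
    a_xu = alpha / (k * n_cfg)  # uniform hyperparameter alpha_{x|u}
    counts = {}                 # counts[(x, u)] = M_i[x, u]
    for row in data:
        key = (row[i], tuple(row[p] for p in parents))
        counts[key] = counts.get(key, 0) + 1
    total = 0.0
    for u in {key[1] for key in counts}:
        m_xu = np.array([counts.get((x, u), 0) for x in range(k)], dtype=float)
        total += (gammaln(k * a_xu) - gammaln(k * a_xu + m_xu.sum())
                  + np.sum(gammaln(a_xu + m_xu) - gammaln(a_xu)))
    return total
```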
Asymptotic behavior of Bayesian score
As $N \to \infty$, a network $\mathcal{G}$ with Dirichlet priors satisfies:

$$\log P(\mathcal{D} \mid \mathcal{G}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}) - \frac{\log N}{2}\,\mathrm{Dim}(\mathcal{G}) + O(1)$$

where the $O(1)$ term does not grow with $N$.
Thus, the Bayesian score is asymptotically equivalent to BIC and
asymptotically consistent
Priors
Structure prior 𝑃(𝒢):
Uniform prior: $P(\mathcal{G}) \propto$ constant
Prior penalizing the number of edges: $P(\mathcal{G}) \propto c^{|E(\mathcal{G})|}$ with $0 < c < 1$
Prior penalizing the number of parameters
The normalizing constant is the same for all structures and can be ignored
Parameter priors: 𝑃(𝜽|𝒢) is usually instantiated as the BDe prior:

$$\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} = \alpha \cdot P'(x_i, \boldsymbol{u}_i)$$

where $\alpha$ is the equivalent sample size and $P'$ is represented by a single prior network $B'$, which provides priors for all candidate networks
(the parents $Pa_{X_i}^{\mathcal{G}}$ need not be the same as the parents of $X_i$ in $B'$)
BDe is the unique prior with the property that I-equivalent networks have the same Bayesian score
Score equivalence
A score function satisfies score equivalence when all I-
equivalent networks have the same score
The likelihood score and BIC satisfy score equivalence
General graph structure
Theorem:
The problem of learning a BN structure with at most 𝑑 (𝑑 ≥ 2)
parents is NP-hard.
Search in the space of graph structures using local search algorithms,
such as hill climbing and simulated annealing
Exploit score decomposition during the search to alleviate
computational overhead
Optimization by local searches
Graphs as states
Score function is used as the objective function
Neighbors of a state are found using these modifications:
Edge addition
Edge deletion
Edge reversal
We only consider legal neighbors, i.e. modifications that result in DAGs (a code sketch follows)
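A minimal hill-climbing sketch over these three move types, assuming a graph represented as a list of parent sets and any score with the signature of the earlier `bic_score` sketch (all names are mine, not from the slides):

```python
import itertools

def is_dag(parent_sets):
    """Acyclicity check by repeatedly removing parentless nodes (Kahn's algorithm)."""
    pending = {i: set(pa) for i, pa in enumerate(parent_sets)}
    while pending:
        roots = [i for i, pa in pending.items() if not pa]
        if not roots:
            return False  # every remaining node has a parent: a cycle exists
        for r in roots:
            del pending[r]
        for pa in pending.values():
            pa.difference_update(roots)
    return True

def neighbors(parent_sets):
    """Yield all legal single-edge modifications: addition, deletion, reversal."""
    m = len(parent_sets)
    for i, j in itertools.permutations(range(m), 2):
        cand = [set(pa) for pa in parent_sets]
        if i in cand[j]:
            cand[j].discard(i)            # delete i -> j (always stays a DAG)
            yield cand
            rev = [set(pa) for pa in cand]
            rev[i].add(j)                 # reverse i -> j
            if is_dag(rev):
                yield rev
        else:
            cand[j].add(i)                # add i -> j
            if is_dag(cand):
                yield cand

def hill_climb(data, cards, score):
    """Greedy local search: move to the best-scoring neighbor until no move improves."""
    m = data.shape[1]
    g = [set() for _ in range(m)]         # start from the empty graph
    best = score(data, cards, g)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(g):
            s = score(data, cards, cand)
            if s > best:
                g, best, improved = cand, s, True
    return g, best
```

For example, `hill_climb(data, cards, bic_score)` returns a local optimum of the BIC score; random restarts are a common way to escape poor local optima.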
Decomposable score
Decomposable structure score:

$$\mathrm{Score}(\mathcal{G}; \mathcal{D}) = \sum_{i} \mathrm{FamScore}(X_i \mid Pa_{X_i}^{\mathcal{G}}; \mathcal{D})$$

where $\mathrm{FamScore}(X \mid \boldsymbol{U}; \mathcal{D})$ measures how well the set of
variables $\boldsymbol{U}$ serves as parents of $X$ in the data set $\mathcal{D}$.
Decomposability is useful for reducing the computational
overhead of evaluating different structures during search:
if we have a decomposable score, then a local change in the structure
does not change the score of the remaining parts of the structure.
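For instance, when the local search adds a single edge, only the target variable's family term changes; a tiny sketch, assuming a hypothetical `fam_score(data, i, parents)` that returns one FamScore term (e.g., the per-family factor of the earlier likelihood or Bayesian score sketches):

```python
def delta_add_edge(data, g, i, j, fam_score):
    """Score change for adding edge X_i -> X_j under a decomposable score:
    only X_j's family is affected, so only two FamScore terms are evaluated."""
    return fam_score(data, j, g[j] | {i}) - fam_score(data, j, g[j])
```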
Decomposability of scores
The likelihood score is decomposable
The Bayesian score is decomposable when:
the parameter priors satisfy global parameter independence
and parameter modularity:

$$Pa_{X_i}^{\mathcal{G}} = Pa_{X_i}^{\mathcal{G}'} = \boldsymbol{U} \;\Rightarrow\; P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}) = P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}')$$

and the structure prior satisfies structure modularity:

$$P(\mathcal{G}) \propto \prod_{i} P(Pa_{X_i} = Pa_{X_i}^{\mathcal{G}})$$
Summary
Score functions
Likelihood score
BIC score
Bayesian score
Optimal structure for trees is found using a greedy algorithm
When the hypothesis space includes general graphs, the model
selection problem is NP-hard, and we use local search
strategies to optimize the structure
References
D. Koller and N. Friedman, "Probabilistic Graphical Models:
Principles and Techniques", MIT Press, 2009, Sections 18.1, 18.3,
18.4.