Structure learning in Bayesian networks
Probabilistic Graphical Models
Sharif University of Technology
Spring 2018
Soleymani
Recall: Learning problem
Target: true distribution 𝑃∗ that may correspond to a model ℳ∗ = ⟨𝒦∗, 𝜽∗⟩
Hypothesis space: specified probabilistic graphical models
Data: set of instances sampled from 𝑃∗
Learning goal: selecting a model ℳ that constructs the best
approximation to ℳ∗ according to a performance metric
We may use Bayesian model averaging instead of selecting a model ℳ
Structure learning of Bayesian networks
For a fixed set of variables (i.e. nodes), we intend to learn
the directed graph structure 𝒢∗:
Which edges are included in the model?
Too few edges miss real dependencies
Too many edges introduce spurious dependencies
We often prefer sparser structures, which tend to generalize better
Learning the structure of DGMs
Approaches
Constraint-based
Find conditional independencies in data using statistical tests
Find an I-map graph for the obtained conditional independencies
Score-based
View BN structure and parameter learning as a density estimation
task and address structure learning as a model selection problem
A score function measures how well the model fits the observed data
Maximum likelihood
Bayesian score
Score-based approach
Input:
Hypothesis space: set of possible structures
Training data
Output: a network that maximizes the score function
Thus, we need a score function (including priors, if needed)
and a search strategy (i.e. optimization procedure) to find a
structure in the hypothesis space
Structural search
Hypothesis space
General graphs: the number of graphs on $M$ nodes is $O(2^{M^2})$ (super-exponential)
Trees: the number of trees on $M$ nodes is $O(M!)$
Combinatorial optimization problem
Many heuristics to solve this problem, but no guarantee of attaining optimality
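To make the super-exponential growth concrete, here is a small sketch (function name mine, not from the slides) that counts labeled DAGs on $M$ nodes using Robinson's recurrence:

```python
from math import comb

def num_dags(m):
    """Number of labeled DAGs on m nodes, via Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k*(n-k)) * a(n-k)."""
    a = [1]  # a(0) = 1
    for n in range(1, m + 1):
        a.append(sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * a[n - k]
                     for k in range(1, n + 1)))
    return a[m]

for m in range(1, 8):
    print(m, num_dags(m))
# 1, 3, 25, 543, 29281, 3781503, 1138779265 -- already astronomical by m = 7
```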
Likelihood score
Find the $\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}$ that maximize the likelihood:

$$\max_{\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) = \max_{\mathcal{G}} \left[ \max_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) \right] = \max_{\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$$

Likelihood score: $\mathrm{Score}_L(\mathcal{G}; \mathcal{D}) = \log P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$, where $\hat{\boldsymbol{\theta}}_{\mathcal{G}}$ is the maximum-likelihood parameter estimate for structure $\mathcal{G}$
Likelihood score: information-theoretic interpretation
Given the structure, the log-likelihood of the data decomposes over variables:

$$\ln P(\mathcal{D} \mid \hat{\theta}_{\mathcal{G}}, \mathcal{G}) = \ln \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \hat{\theta}_{\mathcal{G}}, \mathcal{G}) = \sum_{i} \sum_{n=1}^{N} \ln P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{pa}_{X_i}^{(n)}, \hat{\theta}_{X_i \mid Pa_{X_i}}, \mathcal{G}\right)$$

Grouping terms by value and parent configuration, and using that the ML parameters are the empirical conditionals $\hat{P}(x_i \mid \boldsymbol{u}_i)$, with $\hat{P}(x_i, \boldsymbol{u}_i) = \mathrm{Count}(x_i, \boldsymbol{u}_i)/N$:

$$\ln P(\mathcal{D} \mid \hat{\theta}_{\mathcal{G}}, \mathcal{G}) = N \sum_{i} \sum_{x_i \in Val(X_i)} \sum_{\boldsymbol{u}_i \in Val(Pa_{X_i})} \hat{P}(x_i, \boldsymbol{u}_i) \ln \hat{P}(x_i \mid \boldsymbol{u}_i)$$

Multiplying and dividing by $\hat{P}(x_i)$ inside the logarithm splits this into a mutual-information term and an entropy term:

$$\ln P(\mathcal{D} \mid \hat{\theta}_{\mathcal{G}}, \mathcal{G}) = N \sum_{i} \sum_{x_i} \sum_{\boldsymbol{u}_i} \hat{P}(x_i, \boldsymbol{u}_i) \ln \frac{\hat{P}(x_i \mid \boldsymbol{u}_i)\, \hat{P}(x_i)}{\hat{P}(x_i)} = N \sum_{i} \left[ I_{\hat{P}}(X_i; Pa_{X_i}) - H_{\hat{P}}(X_i) \right]$$

This score measures the strength of the dependencies between variables and their parents. Since the entropy term does not depend on the structure, maximizing the likelihood score amounts to maximizing the total mutual information between each variable and its parents.
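As a concrete illustration of this decomposition, here is a minimal sketch that scores a structure by the empirical mutual-information form above, assuming discrete data in an $N \times M$ integer array and parent sets given per variable (all function names are mine):

```python
import numpy as np
from collections import Counter

def empirical_mi(data, i, parents):
    """Empirical mutual information I_Phat(X_i; Pa_i) from an (N, M) int array."""
    n = data.shape[0]
    joint = Counter((row[i], tuple(row[p] for p in parents)) for row in data)
    px = Counter(row[i] for row in data)
    pu = Counter(tuple(row[p] for p in parents) for row in data)
    # sum_{x,u} Phat(x,u) * ln( Phat(x,u) / (Phat(x) Phat(u)) )
    return sum((c / n) * np.log(c * n / (px[x] * pu[u]))
               for (x, u), c in joint.items())

def entropy(data, i):
    """Empirical entropy H_Phat(X_i)."""
    n = data.shape[0]
    return -sum((c / n) * np.log(c / n) for c in Counter(row[i] for row in data).values())

def likelihood_score(data, parent_sets):
    """Score_L(G; D) = N * sum_i [ I_Phat(X_i; Pa_i) - H_Phat(X_i) ]."""
    n = data.shape[0]
    return n * sum(empirical_mi(data, i, pa) - entropy(data, i)
                   for i, pa in enumerate(parent_sets))
```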
Limitations of likelihood score
We have $I_{\hat{P}}(X; Y) \ge 0$, with $I_{\hat{P}}(X; Y) = 0$ iff $\hat{P}(X, Y) = \hat{P}(X)\,\hat{P}(Y)$
In the training data, almost always $I_{\hat{P}}(X; Y) > 0$:
due to statistical noise in the training data, exact independence
almost never occurs in the empirical distribution
Moreover, $I(X; Y \cup Z) \ge I(X; Y)$, with equality iff $X \perp Z \mid Y$, so adding parents never decreases the score
Thus, the ML score never prefers the simpler network;
the score is maximized by a fully connected network
Therefore the likelihood score fails to generalize well
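A quick demonstration of this effect, reusing `empirical_mi` from the previous sketch: even for two variables that are independent by construction, the empirical mutual information is almost surely positive.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two binary variables sampled independently: I(X; Y) = 0 in the true distribution.
data = rng.integers(0, 2, size=(500, 2))
print(empirical_mi(data, 0, [1]))  # small, but almost surely > 0 due to sampling noise
```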
Avoiding overfitting
Restricting the hypothesis space explicitly:
restrict the number of parents or the number of parameters
Imposing a penalty on the complexity of the structure:
explicitly (the BIC/MDL penalty) or implicitly (the Bayesian score)
Tree structure: Chow-Liu algorithm
Compute $w_{ij} = I_{\hat{P}}(X_i; X_j)$ for each pair of nodes $X_i$ and $X_j$
Find a maximum-weight spanning tree (MWST), i.e. the tree with the greatest total weight:

$$\max_{T} \sum_{(i,j) \in T} w_{ij}$$

Assign directions to the edges of the MWST:
pick an arbitrary node as the root and draw arrows pointing away
from it (e.g., using a BFS algorithm)
In this way we can find an optimal tree in polynomial time
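A minimal sketch of the full pipeline, reusing `empirical_mi` from the earlier sketch and SciPy's spanning-tree routine (function names are mine):

```python
from collections import deque

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu(data):
    """Chow-Liu: pairwise empirical MI -> max-weight spanning tree -> orient by BFS.
    Returns a parent list: parent[i] is the parent of X_i (None for the root)."""
    m = data.shape[1]
    w = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            w[i, j] = w[j, i] = empirical_mi(data, i, [j])
    # SciPy builds a *minimum* spanning tree, so negate the weights; empirical MI is
    # almost surely > 0, so -w has no accidental zero (i.e. absent) edges.
    mst = minimum_spanning_tree(-w).toarray()
    adj = (mst != 0) | (mst.T != 0)
    parent = [None] * m
    seen, queue = {0}, deque([0])  # root the tree at an arbitrary node
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if v not in seen:
                parent[v] = u
                seen.add(v)
                queue.append(v)
    return parent
```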
BIC score
$$\mathrm{Score}_{BIC}(\mathcal{G}; \mathcal{D}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) - \frac{\log N}{2}\,\mathrm{Dim}(\mathcal{G}) = N \sum_{i} I_{\hat{P}}(X_i; Pa_{X_i}) - N \sum_{i} H_{\hat{P}}(X_i) - \frac{\log N}{2}\,\mathrm{Dim}(\mathcal{G})$$

Here $\mathrm{Dim}(\mathcal{G})$ is the model complexity (i.e., the number of independent parameters), and the entropy term can be ignored since it does not depend on $\mathcal{G}$.
The negation of this score is also known as the Minimum Description Length
(MDL)
For larger $N$, more emphasis is given to the fit to the data:
mutual information grows linearly with $N$ while the complexity penalty grows
only logarithmically with $N$
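A minimal sketch of the score, building on `likelihood_score` from the earlier sketch; the input format `cards[i]` $= |Val(X_i)|$ is my assumption:

```python
import numpy as np

def dim_g(cards, parent_sets):
    """Dim(G): number of independent parameters of a discrete BN.
    Each X_i contributes (|Val(X_i)| - 1) parameters per parent configuration."""
    return sum((cards[i] - 1) * int(np.prod([cards[p] for p in pa]))
               for i, pa in enumerate(parent_sets))

def bic_score(data, cards, parent_sets):
    """Score_BIC(G; D) = log-likelihood at the MLE minus (log N / 2) * Dim(G)."""
    n = data.shape[0]
    return likelihood_score(data, parent_sets) - 0.5 * np.log(n) * dim_g(cards, parent_sets)
```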
Asymptotic consistency of BIC score
Definition: a scoring function is called consistent if, as $N \to \infty$ and with probability $\to 1$:
𝒢∗ (or any I-equivalent structure) maximizes the score
All non-I-equivalent structures have strictly lower score
Theorem: As 𝑁 → ∞, the structure that maximizes the BIC
score is the true structure 𝒢∗ (or any I-equivalent structure)
Asymptotically, spurious dependencies will not contribute to likelihood
and will be penalized
Required dependencies will be added, since the likelihood term grows linearly
in $N$ while the model-complexity penalty grows only logarithmically
Bayesian score
Uncertainty over parameters and structure: average over all possible
parameter values and also consider a structure prior

$$P(\mathcal{G} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{G})\, P(\mathcal{G})}{P(\mathcal{D})}$$

where $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood, $P(\mathcal{G})$ is the structure prior, and the denominator $P(\mathcal{D})$ is independent of $\mathcal{G}$.

Bayesian score:

$$\mathrm{Score}_B(\mathcal{G}) = \log P(\mathcal{D} \mid \mathcal{G}) + \log P(\mathcal{G})$$

To find the marginal likelihood, we integrate over the whole distribution of the parameters $\boldsymbol{\theta}_{\mathcal{G}}$ for structure $\mathcal{G}$:

$$P(\mathcal{D} \mid \mathcal{G}) = \int_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}})\, P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})\, d\boldsymbol{\theta}_{\mathcal{G}}$$
Marginal likelihood
By the chain rule, the marginal likelihood is a product of sequential predictions:

$$P(\mathcal{D} \mid \mathcal{G}) = P(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)} \mid \mathcal{G}) = P(\boldsymbol{x}^{(1)} \mid \mathcal{G}) \times P(\boldsymbol{x}^{(2)} \mid \boldsymbol{x}^{(1)}, \mathcal{G}) \times \cdots \times P(\boldsymbol{x}^{(N)} \mid \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N-1)}, \mathcal{G})$$

For a single variable with a Dirichlet prior (hyperparameters $\alpha_1, \ldots, \alpha_K$ and $\alpha = \sum_{k=1}^{K} \alpha_k$):

$$P(x^{(1)}, \ldots, x^{(N)}) = \frac{\prod_{k=1}^{K} \alpha_k (\alpha_k + 1) \cdots (\alpha_k + M[k] - 1)}{\alpha (\alpha + 1) \cdots (\alpha + N - 1)} = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \times \prod_{k=1}^{K} \frac{\Gamma(\alpha_k + M[k])}{\Gamma(\alpha_k)}$$

where $M[k] = \sum_{n=1}^{N} I(x^{(n)} = k)$ counts the occurrences of value $k$, and we used $\alpha (\alpha + 1) \cdots (\alpha + N - 1) = \Gamma(\alpha + N)/\Gamma(\alpha)$.
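A small numerical check of this closed form (function name mine), comparing it against the sequential chain-rule computation on a toy sample:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_multinomial(counts, alphas):
    """log P(x^(1..N)) for one discrete variable with a Dirichlet(alphas) prior:
    log G(a) - log G(a+N) + sum_k [log G(a_k + M[k]) - log G(a_k)]."""
    counts, alphas = np.asarray(counts, float), np.asarray(alphas, float)
    return (gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

# Sequential check for the sample (0, 1, 0) with alphas = (1, 1):
# P(x1=0) P(x2=1 | x1) P(x3=0 | x1, x2) = (1/2)(1/3)(2/4) = 1/12
print(np.exp(log_marginal_multinomial([2, 1], [1.0, 1.0])))  # 0.0833... = 1/12
```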
Marginal likelihood in Bayesian networks: global & local parameter independence
If $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfies global parameter independence, the marginal likelihood factorizes over families:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \int_{\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}} \prod_{n=1}^{N} P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{pa}_{X_i}^{\mathcal{G}\,(n)}, \boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}, \mathcal{G}\right) P\!\left(\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}} \,\middle|\, \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}$$

If $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfies both global and local parameter independence, it further factorizes over parent configurations:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{u}_i \in Val(Pa_{X_i}^{\mathcal{G}})} \int_{\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}} \prod_{n:\, \boldsymbol{pa}_{X_i}^{\mathcal{G}\,(n)} = \boldsymbol{u}_i} P\!\left(x_i^{(n)} \,\middle|\, \boldsymbol{u}_i, \boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}, \mathcal{G}\right) P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i} \,\middle|\, \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}$$

If the Dirichlet prior $P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i} \mid \mathcal{G})$ has hyperparameters $\{\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} \mid x_i \in Val(X_i)\}$, each integral has a closed form:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{u}_i \in Val(Pa_{X_i}^{\mathcal{G}})} \frac{\Gamma\!\left(\sum_{x_i \in Val(X_i)} \alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}}\right)}{\Gamma\!\left(\sum_{x_i \in Val(X_i)} \left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} + M_i[x_i, \boldsymbol{u}_i]\right)\right)} \prod_{x_i \in Val(X_i)} \frac{\Gamma\!\left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} + M_i[x_i, \boldsymbol{u}_i]\right)}{\Gamma\!\left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}}\right)}$$

where $M_i[x_i, \boldsymbol{u}_i]$ counts the instances with $X_i = x_i$ and $Pa_{X_i}^{\mathcal{G}} = \boldsymbol{u}_i$.
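A sketch of one factor of this product, the log marginal likelihood of a single family, using uniform hyperparameters $\alpha_{x_i \mid \boldsymbol{u}_i} = \alpha / (|Val(X_i)| \cdot |Val(Pa_{X_i})|)$ (the BDeu special case of the BDe prior with a uniform $P'$; function names mine):

```python
import numpy as np
from scipy.special import gammaln

def log_family_marginal(data, i, parents, cards, alpha=1.0):
    """Log of the i-th factor of P(D|G) under a BDeu-style Dirichlet prior.
    Parent configurations never seen in the data contribute Gamma-ratio
    factors equal to 1, so it suffices to sum over observed configurations."""
    parents = list(parents)
    k = cards[i]
    n_cfg = int(np.prod([cards[p] for p in parents])) if parents else 1
    a_xu = alpha / (k * n_cfg)  # uniform hyperparameter alpha_{x|u}
    counts = {}                 # counts[(x, u)] = M_i[x, u]
    for row in data:
        key = (row[i], tuple(row[p] for p in parents))
        counts[key] = counts.get(key, 0) + 1
    total = 0.0
    for u in {key[1] for key in counts}:
        m_xu = np.array([counts.get((x, u), 0) for x in range(k)], dtype=float)
        total += (gammaln(k * a_xu) - gammaln(k * a_xu + m_xu.sum())
                  + np.sum(gammaln(a_xu + m_xu) - gammaln(a_xu)))
    return total
```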
Asymptotic behavior of Bayesian score
As $N \to \infty$, a network $\mathcal{G}$ with Dirichlet priors satisfies:

$$\log P(\mathcal{D} \mid \mathcal{G}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}) - \frac{\log N}{2}\,\mathrm{Dim}(\mathcal{G}) + O(1)$$

where the $O(1)$ term does not grow with $N$.
Thus, the Bayesian score is asymptotically equivalent to BIC and
asymptotically consistent
Priors
Structure prior 𝑃(𝒢):
Uniform prior: $P(\mathcal{G}) \propto$ constant
Prior penalizing the number of edges: $P(\mathcal{G}) \propto c^{|E(\mathcal{G})|}$ with $0 < c < 1$
Prior penalizing the number of parameters
The normalizing constant is the same for all structures and can be ignored
Parameter priors: 𝑃(𝜽|𝒢) is usually instantiated as the BDe prior:

$$\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} = \alpha \cdot P'(x_i, \boldsymbol{u}_i)$$

where $\alpha$ is the equivalent sample size and $P'$ is represented by a single prior network $B'$, which provides priors for all candidate networks
(the parents $Pa_{X_i}^{\mathcal{G}}$ need not be the same as the parents of $X_i$ in $B'$)
BDe is the unique prior with the property that I-equivalent networks have the same Bayesian score
Score equivalence
A score function satisfies score equivalence when all I-
equivalent networks have the same score
The likelihood score and BIC satisfy score equivalence
General graph structure
Theorem:
The problem of learning a BN structure with at most 𝑑 (𝑑 ≥ 2)
parents is NP-hard.
Search in the space of graph structures using local search algorithms,
such as hill climbing and simulated annealing
Exploit score decomposition during the search to alleviate
computational overhead
Optimization by local searches
Graphs as states
Score function is used as the objective function
Neighbors of a state are found using these modifications:
Edge addition
Edge deletion
Edge reversal
We only consider legal neighbors, i.e. modifications that result in DAGs (a code sketch follows)
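A minimal hill-climbing sketch over these three move types, assuming a graph represented as a list of parent sets and any score with the signature of the earlier `bic_score` sketch (all names are mine, not from the slides):

```python
import itertools

def is_dag(parent_sets):
    """Acyclicity check by repeatedly removing parentless nodes (Kahn's algorithm)."""
    pending = {i: set(pa) for i, pa in enumerate(parent_sets)}
    while pending:
        roots = [i for i, pa in pending.items() if not pa]
        if not roots:
            return False  # every remaining node has a parent: a cycle exists
        for r in roots:
            del pending[r]
        for pa in pending.values():
            pa.difference_update(roots)
    return True

def neighbors(parent_sets):
    """Yield all legal single-edge modifications: addition, deletion, reversal."""
    m = len(parent_sets)
    for i, j in itertools.permutations(range(m), 2):
        cand = [set(pa) for pa in parent_sets]
        if i in cand[j]:
            cand[j].discard(i)            # delete i -> j (always stays a DAG)
            yield cand
            rev = [set(pa) for pa in cand]
            rev[i].add(j)                 # reverse i -> j
            if is_dag(rev):
                yield rev
        else:
            cand[j].add(i)                # add i -> j
            if is_dag(cand):
                yield cand

def hill_climb(data, cards, score):
    """Greedy local search: move to the best-scoring neighbor until no move improves."""
    m = data.shape[1]
    g = [set() for _ in range(m)]         # start from the empty graph
    best = score(data, cards, g)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(g):
            s = score(data, cards, cand)
            if s > best:
                g, best, improved = cand, s, True
    return g, best
```

For example, `hill_climb(data, cards, bic_score)` returns a local optimum of the BIC score; random restarts are a common way to escape poor local optima.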
Decomposable score
Decomposable structure score:

$$\mathrm{Score}(\mathcal{G}; \mathcal{D}) = \sum_{i} \mathrm{FamScore}(X_i \mid Pa_{X_i}^{\mathcal{G}}; \mathcal{D})$$

where $\mathrm{FamScore}(X \mid \boldsymbol{U}; \mathcal{D})$ measures how well the set of
variables $\boldsymbol{U}$ serves as parents of $X$ in the data set $\mathcal{D}$.
Decomposability is useful for reducing the computational
overhead of evaluating different structures during search:
if we have a decomposable score, then a local change in the structure
does not change the score of the remaining parts of the structure.
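For instance, when the local search adds a single edge, only the target variable's family term changes; a tiny sketch, assuming a hypothetical `fam_score(data, i, parents)` that returns one FamScore term (e.g., the per-family factor of the earlier likelihood or Bayesian score sketches):

```python
def delta_add_edge(data, g, i, j, fam_score):
    """Score change for adding edge X_i -> X_j under a decomposable score:
    only X_j's family is affected, so only two FamScore terms are evaluated."""
    return fam_score(data, j, g[j] | {i}) - fam_score(data, j, g[j])
```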
Decomposability of scores
The likelihood score is decomposable
The Bayesian score is decomposable when:
the parameter priors satisfy global parameter independence
and parameter modularity:

$$Pa_{X_i}^{\mathcal{G}} = Pa_{X_i}^{\mathcal{G}'} = \boldsymbol{U} \;\Rightarrow\; P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}) = P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}')$$

and the structure prior satisfies structure modularity:

$$P(\mathcal{G}) \propto \prod_{i} P(Pa_{X_i} = Pa_{X_i}^{\mathcal{G}})$$
Summary
Score functions
Likelihood score
BIC score
Bayesian score
Optimal structure for trees is found using a greedy algorithm
When the hypothesis space includes general graphs, the model
selection problem is NP-hard, and we use local search
strategies to optimize the structure
References
D. Koller and N. Friedman, "Probabilistic Graphical Models:
Principles and Techniques", MIT Press, 2009, Sections 18.1, 18.3,
18.4.