View
32
Download
0
Category
Tags:
Preview:
DESCRIPTION
CS b553: Algorithms for Optimization and Learning. Structure Learning. Agenda. Learning probability distributions from example data To what extent can Bayes net structure be learned? Constraint methods (inferring conditional independence) Scoring methods (learning => optimization). - PowerPoint PPT Presentation
Citation preview
CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNINGStructure Learning
AGENDA
Learning probability distributions from example data
To what extent can Bayes net structure be learned?
Constraint methods (inferring conditional independence)
Scoring methods (learning => optimization)
BASIC QUESTION Given examples drawn from a distribution P*
with independence relations given by the Bayesian structure G*, can we recover G*?
BASIC QUESTION Given examples drawn from a distribution P*
with independence relations given by the Bayesian structure G*, can we recover G*
construct a network that encodes the same independence relations as G*?
G* G1
G2
LEARNING IN THE FACE OF NOISY DATA Ex: flip two independent coins Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
X Y
Model 1
X Y
Model 2
LEARNING IN THE FACE OF NOISY DATA Ex: flip two independent coins Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
X Y
Model 1
X Y
Model 2
ML parameters
P(X=H) = 9/20P(Y=H) = 8/20
P(X=H) = 9/20P(Y=H|X=H) = 3/9P(Y=H|X=T) = 5/11
LEARNING IN THE FACE OF NOISY DATA Ex: flip two independent coins Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
X Y
Model 1
X Y
Model 2
ML parameters
P(X=H) = 9/20P(Y=H) = 8/20
P(X=H) = 9/20P(Y=H|X=H) = 3/9P(Y=H|X=T) = 5/11 Errors are
likely to be larger!
PRINCIPLE Learning structure must trade off fit of data
vs. complexity of network Complex networks
More parameters to learn More data fragmentation = greater sensitivity to
noise
APPROACH #1: CONSTRAINT-BASED LEARNING First, identify an undirected skeleton of edges
in G*
If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent
If X-Y is not in G*, then we can find evidence variables to make X and Y independent
Then, assign directionality to preserve independences
BUILD-SKELETON ALGORITHM Given X={X1,…,Xn}, query Independent?
(X,Y,U) H = complete graph over X For all pairs Xi, Xj, test separation as follows:
Enumerate all possible separating sets U If Independent?(Xi,Xj,U) then remove Xi—Xj from
HIn practice:
• Must restrict to bounded size subsets |U|d (i.e., assume G* has bounded degree). O(n2(n-2)d) tests
• Independence can’t be tested exactly
ASSIGNING DIRECTIONALITY Note that V-structures XYZ introduce a
dependency between X and Z given Y In structures XYZ, XYZ, and XYZ, X and
Z are independent given Y In fact Y must be given for X and Z to be
independent Idea: look at separating sets for all triples X-Y-
Z in the skeleton without edge X-Z
Y
X ZTriangle
Directionality is irrelevant
ASSIGNING DIRECTIONALITY Note that V-structures XYZ introduce a
dependency between X and Z given Y In structures XYZ, XYZ, and XYZ, X and
Z are independent given Y In fact Y must be given for X and Z to be
independent Idea: look at separating sets for all triples X-Y-
Z in the skeleton without edge X-Z
Y
X ZTriangle
Y
X ZY separates X, Z
Not a v-structureDirectionality is irrelevant
ASSIGNING DIRECTIONALITY Note that V-structures XYZ introduce a
dependency between X and Z given Y In structures XYZ, XYZ, and XYZ, X and
Z are independent given Y In fact Y must be given for X and Z to be
independent Idea: look at separating sets for all triples X-Y-
Z in the skeleton without edge X-Z
Y
X ZTriangle
Y
X ZY separates X, Z
Not a v-structure
Y
X ZYU separates X, Z
A v-structureDirectionality is irrelevant
ASSIGNING DIRECTIONALITY Note that V-structures XYZ introduce a
dependency between X and Z given Y In structures XYZ, XYZ, and XYZ, X and
Z are independent given Y In fact Y must be given for X and Z to be
independent Idea: look at separating sets for all triples X-Y-
Z in the skeleton without edge X-Z
Y
X ZTriangle
Y
X ZY separates X, Z
Not a v-structure
Y
X ZYU separates X, Z
A v-structureDirectionality is irrelevant
STATISTICAL INDEPENDENCE TESTING Question: are X and Y independent? Null hypothesis H0: X and Y are independent Alternative hypothesis HA: X and Y are not
independent
STATISTICAL INDEPENDENCE TESTING Question: are X and Y independent? Null hypothesis H0: X and Y are independent Alternative hypothesis HA: X and Y are not
independent 2 test: use the statistic
withthe empirical probability of X Can compute (table lookup) the probability of
getting a value at least this extreme if H0 is true (p-value)
If p < some threshold, e.g. 1-0.95, H0 is rejected
APPROACH #2: SCORE-BASED METHODS Learning => optimization Define scoring function Score(G;D) that
evaluates quality of structure G, and optimize it Combinatorial optimization problem
Issues: Choice of scoring function: maximum likelihood
score, Bayesian score Efficient optimization techniques
MAXIMUM-LIKELIHOOD SCORES ScoreL(G;D) = likelihood of the BN with the
most likely parameter settings under structure G Let L(G,G;D) be the likelihood of data using
parameters G with structure G Let G
* = arg max L(,G;D) as described in last lecture
Then ScoreL(G;D) = L(G*,G;D)
ISSUE WITH ML SCORE
ISSUE WITH ML SCORE Independent coin example
X Y
G1
X Y
G2
ML parameters
P(X=H) = 9/20P(Y=H) = 8/20
P(X=H) = 9/20P(Y=H|X=H) = 3/9P(Y=H|X=T) = 5/11
Likelihood score
log L(G1*,G1;D)= 9 log(9/20) + 11 log(11/20) + 8 log (8/20) + 12 log (12/20)
log L(G2*,G2;D)= 9 log(9/20) + 11 log(11/20) + 3 log (3/9) + 6 log (6/9) + 5 log (5/11) + 6 log(6/11)
ISSUE WITH ML SCORE
X Y
G1
X Y
G2
Likelihood score
log L(G1*,G1;D)-log L(G2*,G2;D) = 8 log (8/20) + 12 log (12/20) – [3 log (3/9) + 6 log (6/9) + 5 log (5/11) + 6 log(6/11)]
ISSUE WITH ML SCORE
X Y
G1
X Y
G2
Likelihood score
log L(G1*,G1;D)-log L(G2*,G2;D) = 8 log (8/20) + 12 log (12/20) – 8 [3/8 log (3/9) + 5/8 log (5/11) ] – 12 [6/12 log (6/9) + 6/12 log(6/11)]
ISSUE WITH ML SCORE
X Y
G1
X Y
G2
Likelihood score
log L(G1*,G1;D)-log L(G2*,G2;D) = 8 log (8/20) + 12 log (12/20) – 8 [3/8 log (3/9) + 5/8 log (5/11) ] – 12 [6/12 log (6/9) + 6/12 log(6/11)] = 8 [log (8/20) - 3/8 log (3/9) + 5/8 log (5/11) ] + 12 [log (12/20) - 6/12 log (6/9) + 6/12 log(6/11)]
ISSUE WITH ML SCORE
X Y
G1
X Y
G2
Likelihood score
log L(G1*,G1;D)-log L(G2*,G2;D) = 8 log (8/20) + 12 log (12/20) – 8 [3/8 log (3/9) + 5/8 log (5/11) ] – 12 [6/12 log (6/9) + 6/12 log(6/11)] = 8 [log (8/20) - 3/8 log (3/9) + 5/8 log (5/11) ] + 12 [log (12/20) - 6/12 log (6/9) + 6/12 log(6/11)]=
ISSUE WITH ML SCORE
X Y
G1
X Y
G2
Likelihood score
log L(G1*,G1;D)-log L(G2*,G2;D) = 8 log (8/20) + 12 log (12/20) – 8 [3/8 log (3/9) + 5/8 log (5/11) ] – 12 [6/12 log (6/9) + 6/12 log(6/11)] = 8 [log (8/20) - 3/8 log (3/9) + 5/8 log (5/11) ] + 12 [log (12/20) - 6/12 log (6/9) + 6/12 log(6/11)]= =
ISSUE WITH ML SCORE
X Y
G1
X Y
G2
Likelihood score
log L(G1*,G1;D)-log L(G2*,G2;D) = 8 log (8/20) + 12 log (12/20) – 8 [3/8 log (3/9) + 5/8 log (5/11) ] – 12 [6/12 log (6/9) + 6/12 log(6/11)] = 8 [log (8/20) - 3/8 log (3/9) + 5/8 log (5/11) ] + 12 [log (12/20) - 6/12 log (6/9) + 6/12 log(6/11)]= =
MUTUAL INFORMATION PROPERTIES (the mutual information between X and Y)
with Q(x,y) = P(x)P(y)
0 by nonnegativity of KL divergence
Implication: ML scores do not decrease for more connected
graphs=> Overfitting to data!
POSSIBLE SOLUTIONS Fix complexity of graphs (e.g., bounded in-
degree) See HW7
Penalize complex graphs Bayesian scores
IDEA OF BAYESIAN SCORING Note that parameters are uncertain Bayesian approach: put a prior on parameter
values and marginalize them out P(D|G) =
For example, use Beta/Dirichlet priors => marginal is manageable to compute E.g., uniform hyperparameter over network Set virtual counts to 2^-|PaXi|
LARGE SAMPLE APPROXIMATION log P(D|G) =
log L(G*;D) – ½ log M Dim[G] + O(1)
With M the number of samples, Dim[G] the number of free parameters of G
Bayesian Information Criterion (BIC) score: ScoreBIC(G;D) = log L (G*;D) – ½ log M Dim[G]
LARGE SAMPLE APPROXIMATION log P(D|G) =
log L(G*;D) – ½ log M Dim[G] + O(1)
With M the number of samples, Dim[G] the number of free parameters of G
Bayesian Information Criterion (BIC) score: ScoreBIC(G;D) = log L (G*;D) – ½ log M Dim[G]
Fit data set Prefer simple models
STRUCTURE OPTIMIZATION, GIVEN A SCORE… The problem is well-defined, but
combinatorially complex! Superexponential in # of variables
Idea: search locally through the space of graphs using graph operators Add edge Delete edge Reverse edge
SEARCH STRATEGIES Greedy
Pick operator that leads to greatest score Local minima? Plateaux?
Overcoming plateaux Search with basin flooding Tabu search Perturbation methods (similar to simulated
annealing, except on data weighting) Implementation details:
Evaluate ’s between structures quickly (local decomposibility)
RECAP Bayes net structure learning: from
equivalence class of networks that encode the same conditional independences
Constraint-based methods Statistical independence tests
Score-based methods Learning => optimization
Recommended