Laplace Maximum Margin Markov Networks

Jun Zhu
Dept. of Comp. Sci. & Tech., Tsinghua University

This work was done when I was a visiting researcher at CMU. Joint work with Eric P. Xing and Bo Zhang.
Outline
• Introduction
  – Structured Prediction
  – Max-margin Markov Networks
• Max-Entropy Discrimination Markov Networks
  – Basic Theorems
  – Laplace Max-margin Markov Networks
  – Experimental Results
• Summary
Classical Classification Models
• Inputs:
  – a set of training samples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, e.g. $\mathcal{Y} = \{+1, -1\}$
• Outputs:
  – a predictive function $h: \mathcal{X} \to \mathcal{Y}$
• Examples:
  – Support Vector Machine (SVM): max-margin learning
  – Logistic Regression: max-likelihood estimation
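For concreteness, the two estimation principles side by side, in a standard textbook formulation that assumes a linear discriminant $F(x; w) = w^\top f(x)$ (our notation, not spelled out on the slide):

```latex
% SVM: max-margin learning with hinge slack
\min_{w,\,\xi \ge 0}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i
\quad \text{s.t.}\quad y_i\, w^\top f(x_i) \ \ge\ 1 - \xi_i,\ \ \forall i

% Logistic regression: max-likelihood estimation
\max_{w}\ \sum_{i=1}^N \log p(y_i \mid x_i; w),
\qquad p(y \mid x; w) \;=\; \frac{1}{1 + \exp\!\big(-y\, w^\top f(x)\big)}
```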
Structured Prediction
• Complicated prediction examples:
  – Part-of-speech (POS) tagging: “Do you want fries with that?” -> <verb pron verb noun prep pron>
  – Image segmentation
• Inputs:
  – a set of training samples $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$, where each label $\mathbf{y}_i = (y_i^1, \dots, y_i^{L})$ is itself structured, with mutually dependent components and a label space that grows exponentially in $L$
• Outputs:
  – a predictive function $h: \mathcal{X} \to \mathcal{Y}$
Structured Prediction Models
• Conditional Random Fields (CRFs) (Lafferty et al., 2001)
  – Based on Logistic Regression
  – Max-likelihood estimation (point estimate)
• Max-margin Markov Networks (M3Ns) (Taskar et al., 2003)
  – Based on SVM
  – Max-margin learning (point estimate)
• In both, Markov properties are encoded in the feature functions (see the sketch below)
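As an illustration, a minimal joint feature map for a linear-chain model; the shapes and names below are our own choices, but they show how pairwise terms over neighbouring labels are what carry the Markov structure:

```python
import numpy as np

def joint_features(x, y, n_tags):
    """Joint feature map f(x, y) for a linear-chain model.

    x : (L, d) array of per-token input features
    y : length-L sequence of tag indices in {0, ..., n_tags - 1}
    """
    L, d = x.shape
    node = np.zeros((n_tags, d))       # emission-style features
    edge = np.zeros((n_tags, n_tags))  # transition features over (y_l, y_{l+1})
    for l in range(L):
        node[y[l]] += x[l]
        if l + 1 < L:
            edge[y[l], y[l + 1]] += 1.0
    return np.concatenate([node.ravel(), edge.ravel()])
```

Because the score $w^\top f(x, y)$ decomposes over nodes and edges, $\arg\max_y w^\top f(x, y)$ can be computed exactly with Viterbi dynamic programming.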
Between MLE and max-margin learning
• Likelihood-based estimation
  – Probabilistic (joint/conditional likelihood model)
  – Easy to adapt to Bayesian learning, prior knowledge, and missing data
• Max-margin learning
  – Non-probabilistic (concentrates on the input-output mapping)
  – Not obvious how to perform Bayesian learning or to handle priors and missing data
  – Sound theoretical guarantees with limited samples
• Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999) bridges the two
  – A Bayesian learning approach to max-margin classification
  – The optimization problem (binary classification) is shown below
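For reference, the MED problem as formulated by Jaakkola et al. (1999): instead of a point estimate, learn a distribution $p(w, \xi)$ closest to a prior, subject to expected margin constraints:

```latex
\min_{p(w,\xi)}\ \mathrm{KL}\big(p(w,\xi)\,\|\,p_0(w,\xi)\big)
\quad \text{s.t.}\quad
\int p(w,\xi)\,\big[\,y_i\,F(x_i; w) - \xi_i\,\big]\;\mathrm{d}w\,\mathrm{d}\xi \ \ge\ 0,
\quad \forall i .
```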
MaxEnt Discrimination Markov networks
• MaxEnt Discrimination Markov Networks (MaxEntNet):
  – Generalized maximum entropy, i.e. a regularized KL-divergence objective
  – A subspace of distributions defined by expected margin constraints
• Bayesian prediction: average the discriminant function over the learned distribution $p(w)$ (problem and predictive rule below)
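Our reconstruction of the optimization problem, following the companion technical report linked at the end; here $\Delta F_i(y; w) = F(x_i, y_i; w) - F(x_i, y; w)$ and $\Delta\ell_i(y)$ is the label loss:

```latex
\text{(P1)}\quad
\min_{p(w),\,\xi}\ \mathrm{KL}\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
\quad \text{s.t.}\quad p(w) \in \mathcal{F}_1,\ \ \xi_i \ge 0\ \ \forall i,

\mathcal{F}_1 = \Big\{\, p(w) \,:\,
\int p(w)\,\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big]\;\mathrm{d}w \ \ge\ -\xi_i,
\ \ \forall i,\ \forall y \ne y_i \,\Big\}

% Bayesian prediction: average the discriminant over p(w)
h_1(x) \;=\; \arg\max_{y \in \mathcal{Y}(x)} \int p(w)\, F(x, y; w)\;\mathrm{d}w
```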
Solution to MaxEntNet
• Theorem 1 (Solution to MaxEntNet):
  – Posterior distribution and dual optimization problem: see the reconstruction below
• Convex conjugate (of a closed proper convex $U$)
  – Def: $U^*(\mu) = \sup_{\xi} \big( \textstyle\sum_i \mu_i \xi_i - U(\xi) \big)$
  – Ex: for $U(\xi) = C \sum_i \xi_i$ with $\xi \ge 0$, $U^*(\mu) = 0$ if $\mu_i \le C$ for all $i$, and $+\infty$ otherwise
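A reconstruction of the theorem's two displays, in the notation of problem P1; the $\alpha_i(y)$ are the dual variables of the expected margin constraints:

```latex
% Posterior distribution
p(w) \;=\; \frac{1}{Z(\alpha)}\, p_0(w)\,
\exp\Big\{ \sum_{i,\; y \ne y_i} \alpha_i(y)\,
\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big] \Big\}

% Dual optimization problem
\max_{\alpha}\ -\log Z(\alpha) \;-\; U^{*}(\alpha)
```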
Reduction to M3Ns
• Theorem 2 (Reduction of MaxEntNet to M3Ns):
  – Assume a linear discriminant $F(x, y; w) = w^\top f(x, y)$, slack penalty $U(\xi) = C \sum_i \xi_i$, and a standard normal prior $p_0(w) = \mathcal{N}(0, I)$
  – Posterior distribution: $p(w) = \mathcal{N}(w \mid \mu, I)$, where $\mu = \sum_{i, y} \alpha_i(y)\, \Delta f_i(y)$
  – Dual optimization: exactly the dual of the M3N quadratic program
  – Predictive rule: $h(x) = \arg\max_y \mu^\top f(x, y)$
• Thus, MaxEntNet subsumes M3Ns and admits all the merits of max-margin learning
• Furthermore, MaxEntNet has at least three advantages ...
The Three Advantages

1. A PAC-Bayes prediction error guarantee
2. Introduces regularization effects, such as a sparsity bias
3. Provides an elegant approach to incorporate latent variables and structures
Generalization Guarantee
• MaxEntNet is an averaging model
  – we also call it a Bayesian Max-Margin Markov Network
• Theorem 3 (PAC-Bayes Bound): under boundedness conditions on the discriminant functions, with high probability the expected structured error of the averaging model is bounded by its empirical margin error plus a complexity term governed by $\mathrm{KL}(p \,\|\, p_0)$ and the sample size (exact statement in the paper; a generic form follows)
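For intuition only, a generic PAC-Bayes bound of the McAllester (1999) type, not the paper's exact structured-prediction statement (constants vary across versions): with probability at least $1 - \delta$ over the draw of $N$ samples, simultaneously for all posteriors $p$,

```latex
\mathbb{E}_{w \sim p}\big[\mathrm{err}(w)\big]
\ \le\
\mathbb{E}_{w \sim p}\big[\widehat{\mathrm{err}}(w)\big]
\ +\ \sqrt{\frac{\mathrm{KL}(p\,\|\,p_0) + \ln(N/\delta)}{2(N-1)}}\, .
```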
Laplace M3Ns (LapM3N)
• The prior in MaxEntNet can be designed to introduce useful regularization effects, such as a sparsity bias
• A normal prior is not well suited to sparse data ...
• Instead, we use the Laplace prior $p_0(w) = \prod_k \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\,|w_k|}$
  – hierarchical representation (scale mixture of normals): $p_0(w_k) = \int \mathcal{N}(w_k \mid 0, \tau_k)\, \frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\, \mathrm{d}\tau_k$ (numerical check below)
• A similar calculation applies to M3Ns (a standard normal prior)
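A quick numerical sanity check of the scale-mixture identity; the value of $\lambda$ and the sample size are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 4.0, 1_000_000

# Hierarchical route: tau_k ~ Exponential(rate lam/2), then w_k | tau_k ~ N(0, tau_k)
tau = rng.exponential(scale=2.0 / lam, size=n)  # numpy parameterizes by scale = 1/rate
w_mix = rng.normal(0.0, np.sqrt(tau))

# Direct route: w_k ~ Laplace with density (sqrt(lam)/2) exp(-sqrt(lam) |w_k|)
w_lap = rng.laplace(0.0, 1.0 / np.sqrt(lam), size=n)

# The two samples should agree in distribution, e.g. compare quantiles
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.round(np.quantile(w_mix, qs), 3))
print(np.round(np.quantile(w_lap, qs), 3))
```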
Posterior shrinkage effect in LapM3N
• Exact integration over the Laplace prior is tractable in LapM3N and yields the posterior mean in closed form; weight components with weak evidence are shrunk toward zero
• Alternatively, the same shrinkage can be viewed as a functional transformation of the corresponding M3N (normal-prior) solution (details in the technical report)
Variational Bayesian Learning
• The exact dual function is hard to optimize
• Using the hierarchical representation, we obtain an upper bound on it; we optimize this bound instead
• Why is it easier?
  – Alternating minimization leads to nicer optimization problems: each step is either an M3N-style QP or a closed-form update (structural sketch below)
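A structural sketch of the alternating scheme; `solve_m3n_dual`, the scale update, and all shapes are assumptions of ours, meant only to show why each step is a familiar problem (the actual updates are derived in the paper and technical report):

```python
import numpy as np

def lap_m3n_variational(solve_m3n_dual, lam, n_feats, n_iters=20):
    """Sketch of alternating minimization of the variational upper bound.

    solve_m3n_dual: hypothetical black-box solver for an M3N-style QP whose
                    quadratic term is reweighted by the supplied covariance;
                    assumed to return eta = sum_{i,y} alpha_i(y) * Dfi(y).
    """
    inv_tau = np.full(n_feats, lam)              # current per-weight precisions
    for _ in range(n_iters):
        # Step 1: with the scale variables fixed, the bound reduces to an
        # M3N-like quadratic program -> reuse any M3N solver.
        eta = solve_m3n_dual(np.diag(1.0 / inv_tau))
        mu = eta / inv_tau                        # posterior mean under the bound
        # Step 2: closed-form refresh of the scale variables; weights with a
        # small posterior mean get high precision, i.e. are shrunk harder.
        inv_tau = np.sqrt(lam) / np.sqrt(mu**2 + 1e-12)
    return mu
```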
Experiments
Compare LapM3N with:
• CRFs: MLE
• L2-CRFs: L2-norm penalized MLE
• L1-CRFs: L1-norm penalized MLE (sparse estimation)
• M3N: max-margin learning
Experimental results on synthetic datasets
• Datasets with 100 i.i.d. features, of which 10, 30, or 50 are relevant
  – For each setting we generate 10 datasets, each with 1,000 samples; true labels are assigned via a Gibbs sampler run for 5,000 iterations (one possible setup is sketched below)
• Datasets with 30 correlated relevant features + 70 i.i.d. irrelevant features
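The slides do not specify the generating model, so the following is one plausible setup matching the description: i.i.d. node features, a sparse true weight vector, and labels Gibbs-sampled from a chain with pairwise coupling. All names and constants are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, K = 10, 100, 10              # chain length, #features, #relevant features

X = rng.normal(size=(L, D))         # i.i.d. features at each node
w_true = np.zeros(D)
w_true[:K] = rng.normal(size=K)     # only the first K features matter

def gibbs_labels(X, w, coupling=0.5, n_iter=5000):
    """Sample a +/-1 label chain by Gibbs sampling an Ising-like model."""
    L = X.shape[0]
    y = rng.choice([-1, 1], size=L)
    for _ in range(n_iter):
        for l in range(L):
            field = X[l] @ w                    # node potential
            if l > 0:
                field += coupling * y[l - 1]    # left neighbour
            if l < L - 1:
                field += coupling * y[l + 1]    # right neighbour
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            y[l] = 1 if rng.random() < p_plus else -1
    return y

y = gibbs_labels(X, w_true)
```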
Experimental results on OCR datasets
We randomly construct the OCR100, OCR150, OCR200, and OCR250 datasets for 10-fold CV.
Sensitivity to Regularization Constants
• L1-CRFs are very sensitive to the regularization constant; the others are more stable
• LapM3N is the most stable of all
• Constants tried for L1-CRF and L2-CRF: 0.001, 0.01, 0.1, 1, 4, 9, 16
• Constants tried for M3N and LapM3N: 1, 4, 9, 16, 25, 36, 49, 64, 81
Summary
• We propose MaxEntNet, a general framework for Bayesian max-margin structured prediction
  – MaxEntNet subsumes the standard M3Ns
  – It admits a PAC-Bayes theoretical error bound
• We propose Laplace max-margin Markov networks (LapM3N)
  – They enjoy a posterior shrinkage effect
  – They perform as well as sparse models on synthetic data, and better on real data sets
  – They are more stable with respect to regularization constants
Thanks!

Detailed proofs:
http://166.111.138.19/junzhu/MaxEntNet_TR.pdf
http://www.sailing.cs.cmu.edu/pdf/2008/zhutr1.pdf