Laplace Maximum Margin Markov Networks

Jun Zhu
Dept. of Comp. Sci. & Tech., Tsinghua University

This work was done when I was a visiting researcher at CMU. Joint work with Eric P. Xing and Bo Zhang.
Outline
• Introduction
  – Structured Prediction
  – Max-margin Markov Networks
• Max-Entropy Discrimination Markov Networks
  – Basic Theorems
  – Laplace Max-margin Markov Networks
  – Experimental Results
• Summary
Classical Classification Models
• Inputs:
  – a set of training samples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, e.g. $\mathcal{Y} = \{+1, -1\}$
• Outputs:
  – a predictive function $h: \mathcal{X} \to \mathcal{Y}$
• Examples:
  – Support Vector Machine (SVM): max-margin learning
  – Logistic Regression: max-likelihood estimation
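For concreteness, the two estimation principles side by side, in a standard textbook formulation that assumes a linear discriminant $F(x; w) = w^\top f(x)$ (our notation, not spelled out on the slide):

```latex
% SVM: max-margin learning with hinge slack
\min_{w,\,\xi \ge 0}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i
\quad \text{s.t.}\quad y_i\, w^\top f(x_i) \ \ge\ 1 - \xi_i,\ \ \forall i

% Logistic regression: max-likelihood estimation
\max_{w}\ \sum_{i=1}^N \log p(y_i \mid x_i; w),
\qquad p(y \mid x; w) \;=\; \frac{1}{1 + \exp\!\big(-y\, w^\top f(x)\big)}
```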
Structured Prediction
• Complicated prediction examples:
  – Part-of-speech (POS) tagging: “Do you want fries with that?” -> <verb pron verb noun prep pron>
  – Image segmentation
• Inputs:
  – a set of training samples $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$, where each label $\mathbf{y}_i = (y_i^1, \dots, y_i^{L})$ is itself structured, with mutually dependent components and a label space that grows exponentially in $L$
• Outputs:
  – a predictive function $h: \mathcal{X} \to \mathcal{Y}$
Structured Prediction Models
• Conditional Random Fields (CRFs) (Lafferty et al., 2001)
  – Based on Logistic Regression
  – Max-likelihood estimation (point estimate)
• Max-margin Markov Networks (M3Ns) (Taskar et al., 2003)
  – Based on SVM
  – Max-margin learning (point estimate)
• In both, Markov properties are encoded in the feature functions (see the sketch below)
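As an illustration, a minimal joint feature map for a linear-chain model; the shapes and names below are our own choices, but they show how pairwise terms over neighbouring labels are what carry the Markov structure:

```python
import numpy as np

def joint_features(x, y, n_tags):
    """Joint feature map f(x, y) for a linear-chain model.

    x : (L, d) array of per-token input features
    y : length-L sequence of tag indices in {0, ..., n_tags - 1}
    """
    L, d = x.shape
    node = np.zeros((n_tags, d))       # emission-style features
    edge = np.zeros((n_tags, n_tags))  # transition features over (y_l, y_{l+1})
    for l in range(L):
        node[y[l]] += x[l]
        if l + 1 < L:
            edge[y[l], y[l + 1]] += 1.0
    return np.concatenate([node.ravel(), edge.ravel()])
```

Because the score $w^\top f(x, y)$ decomposes over nodes and edges, $\arg\max_y w^\top f(x, y)$ can be computed exactly with Viterbi dynamic programming.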
Between MLE and max-margin learning
• Likelihood-based estimation
  – Probabilistic (joint/conditional likelihood model)
  – Easy to adapt to Bayesian learning, prior knowledge, and missing data
• Max-margin learning
  – Non-probabilistic (concentrates on the input-output mapping)
  – Not obvious how to perform Bayesian learning or to handle priors and missing data
  – Sound theoretical guarantees with limited samples
• Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999) bridges the two
  – A Bayesian learning approach to max-margin classification
  – The optimization problem (binary classification) is shown below
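For reference, the MED problem as formulated by Jaakkola et al. (1999): instead of a point estimate, learn a distribution $p(w, \xi)$ closest to a prior, subject to expected margin constraints:

```latex
\min_{p(w,\xi)}\ \mathrm{KL}\big(p(w,\xi)\,\|\,p_0(w,\xi)\big)
\quad \text{s.t.}\quad
\int p(w,\xi)\,\big[\,y_i\,F(x_i; w) - \xi_i\,\big]\;\mathrm{d}w\,\mathrm{d}\xi \ \ge\ 0,
\quad \forall i .
```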
MaxEnt Discrimination Markov networks
• MaxEnt Discrimination Markov Networks (MaxEntNet):
  – Generalized maximum entropy, i.e. a regularized KL-divergence objective
  – A subspace of distributions defined by expected margin constraints
• Bayesian prediction: average the discriminant function over the learned distribution $p(w)$ (problem and predictive rule below)
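Our reconstruction of the optimization problem, following the companion technical report linked at the end; here $\Delta F_i(y; w) = F(x_i, y_i; w) - F(x_i, y; w)$ and $\Delta\ell_i(y)$ is the label loss:

```latex
\text{(P1)}\quad
\min_{p(w),\,\xi}\ \mathrm{KL}\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
\quad \text{s.t.}\quad p(w) \in \mathcal{F}_1,\ \ \xi_i \ge 0\ \ \forall i,

\mathcal{F}_1 = \Big\{\, p(w) \,:\,
\int p(w)\,\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big]\;\mathrm{d}w \ \ge\ -\xi_i,
\ \ \forall i,\ \forall y \ne y_i \,\Big\}

% Bayesian prediction: average the discriminant over p(w)
h_1(x) \;=\; \arg\max_{y \in \mathcal{Y}(x)} \int p(w)\, F(x, y; w)\;\mathrm{d}w
```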
Solution to MaxEntNet
• Theorem 1 (Solution to MaxEntNet):
  – Posterior distribution and dual optimization problem: see the reconstruction below
• Convex conjugate (of a closed proper convex $U$)
  – Def: $U^*(\mu) = \sup_{\xi} \big( \textstyle\sum_i \mu_i \xi_i - U(\xi) \big)$
  – Ex: for $U(\xi) = C \sum_i \xi_i$ with $\xi \ge 0$, $U^*(\mu) = 0$ if $\mu_i \le C$ for all $i$, and $+\infty$ otherwise
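A reconstruction of the theorem's two displays, in the notation of problem P1; the $\alpha_i(y)$ are the dual variables of the expected margin constraints:

```latex
% Posterior distribution
p(w) \;=\; \frac{1}{Z(\alpha)}\, p_0(w)\,
\exp\Big\{ \sum_{i,\; y \ne y_i} \alpha_i(y)\,
\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big] \Big\}

% Dual optimization problem
\max_{\alpha}\ -\log Z(\alpha) \;-\; U^{*}(\alpha)
```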
Reduction to M3Ns
• Theorem 2 (Reduction of MaxEntNet to M3Ns):
  – Assume a linear discriminant $F(x, y; w) = w^\top f(x, y)$, slack penalty $U(\xi) = C \sum_i \xi_i$, and a standard normal prior $p_0(w) = \mathcal{N}(0, I)$
  – Posterior distribution: $p(w) = \mathcal{N}(w \mid \mu, I)$, where $\mu = \sum_{i, y} \alpha_i(y)\, \Delta f_i(y)$
  – Dual optimization: exactly the dual of the M3N quadratic program
  – Predictive rule: $h(x) = \arg\max_y \mu^\top f(x, y)$
• Thus, MaxEntNet subsumes M3Ns and admits all the merits of max-margin learning
• Furthermore, MaxEntNet has at least three advantages ...
The Three Advantages

1. A PAC-Bayes prediction error guarantee
2. Introduces regularization effects, such as a sparsity bias
3. Provides an elegant approach to incorporate latent variables and structures
Generalization Guarantee
• MaxEntNet is an averaging model
  – we also call it a Bayesian Max-Margin Markov Network
• Theorem 3 (PAC-Bayes Bound): under boundedness conditions on the discriminant functions, with high probability the expected structured error of the averaging model is bounded by its empirical margin error plus a complexity term governed by $\mathrm{KL}(p \,\|\, p_0)$ and the sample size (exact statement in the paper; a generic form follows)
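For intuition only, a generic PAC-Bayes bound of the McAllester (1999) type, not the paper's exact structured-prediction statement (constants vary across versions): with probability at least $1 - \delta$ over the draw of $N$ samples, simultaneously for all posteriors $p$,

```latex
\mathbb{E}_{w \sim p}\big[\mathrm{err}(w)\big]
\ \le\
\mathbb{E}_{w \sim p}\big[\widehat{\mathrm{err}}(w)\big]
\ +\ \sqrt{\frac{\mathrm{KL}(p\,\|\,p_0) + \ln(N/\delta)}{2(N-1)}}\, .
```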
Laplace M3Ns (LapM3N)
• The prior in MaxEntNet can be designed to introduce useful regularization effects, such as a sparsity bias
• A normal prior is not well suited to sparse data ...
• Instead, we use the Laplace prior $p_0(w) = \prod_k \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\,|w_k|}$
  – hierarchical representation (scale mixture of normals): $p_0(w_k) = \int \mathcal{N}(w_k \mid 0, \tau_k)\, \frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\, \mathrm{d}\tau_k$ (numerical check below)
• A similar calculation applies to M3Ns (a standard normal prior)
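A quick numerical sanity check of the scale-mixture identity; the value of $\lambda$ and the sample size are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 4.0, 1_000_000

# Hierarchical route: tau_k ~ Exponential(rate lam/2), then w_k | tau_k ~ N(0, tau_k)
tau = rng.exponential(scale=2.0 / lam, size=n)  # numpy parameterizes by scale = 1/rate
w_mix = rng.normal(0.0, np.sqrt(tau))

# Direct route: w_k ~ Laplace with density (sqrt(lam)/2) exp(-sqrt(lam) |w_k|)
w_lap = rng.laplace(0.0, 1.0 / np.sqrt(lam), size=n)

# The two samples should agree in distribution, e.g. compare quantiles
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.round(np.quantile(w_mix, qs), 3))
print(np.round(np.quantile(w_lap, qs), 3))
```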
Posterior shrinkage effect in LapM3N
• Exact integration over the Laplace prior is tractable in LapM3N and yields the posterior mean in closed form; weight components with weak evidence are shrunk toward zero
• Alternatively, the same shrinkage can be viewed as a functional transformation of the corresponding M3N (normal-prior) solution (details in the technical report)
Variational Bayesian Learning
• The exact dual function is hard to optimize
• Using the hierarchical representation, we obtain an upper bound on it; we optimize this bound instead
• Why is it easier?
  – Alternating minimization leads to nicer optimization problems: each step is either an M3N-style QP or a closed-form update (structural sketch below)
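A structural sketch of the alternating scheme; `solve_m3n_dual`, the scale update, and all shapes are assumptions of ours, meant only to show why each step is a familiar problem (the actual updates are derived in the paper and technical report):

```python
import numpy as np

def lap_m3n_variational(solve_m3n_dual, lam, n_feats, n_iters=20):
    """Sketch of alternating minimization of the variational upper bound.

    solve_m3n_dual: hypothetical black-box solver for an M3N-style QP whose
                    quadratic term is reweighted by the supplied covariance;
                    assumed to return eta = sum_{i,y} alpha_i(y) * Dfi(y).
    """
    inv_tau = np.full(n_feats, lam)              # current per-weight precisions
    for _ in range(n_iters):
        # Step 1: with the scale variables fixed, the bound reduces to an
        # M3N-like quadratic program -> reuse any M3N solver.
        eta = solve_m3n_dual(np.diag(1.0 / inv_tau))
        mu = eta / inv_tau                        # posterior mean under the bound
        # Step 2: closed-form refresh of the scale variables; weights with a
        # small posterior mean get high precision, i.e. are shrunk harder.
        inv_tau = np.sqrt(lam) / np.sqrt(mu**2 + 1e-12)
    return mu
```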
Experiments
Compare LapM3N with:
• CRFs: MLE
• L2-CRFs: L2-norm penalized MLE
• L1-CRFs: L1-norm penalized MLE (sparse estimation)
• M3N: max-margin learning
Experimental results on synthetic datasets
• Datasets with 100 i.i.d. features, of which 10, 30, or 50 are relevant
  – For each setting we generate 10 datasets, each with 1,000 samples; true labels are assigned via a Gibbs sampler run for 5,000 iterations (one possible setup is sketched below)
• Datasets with 30 correlated relevant features + 70 i.i.d. irrelevant features
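The slides do not specify the generating model, so the following is one plausible setup matching the description: i.i.d. node features, a sparse true weight vector, and labels Gibbs-sampled from a chain with pairwise coupling. All names and constants are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, K = 10, 100, 10              # chain length, #features, #relevant features

X = rng.normal(size=(L, D))         # i.i.d. features at each node
w_true = np.zeros(D)
w_true[:K] = rng.normal(size=K)     # only the first K features matter

def gibbs_labels(X, w, coupling=0.5, n_iter=5000):
    """Sample a +/-1 label chain by Gibbs sampling an Ising-like model."""
    L = X.shape[0]
    y = rng.choice([-1, 1], size=L)
    for _ in range(n_iter):
        for l in range(L):
            field = X[l] @ w                    # node potential
            if l > 0:
                field += coupling * y[l - 1]    # left neighbour
            if l < L - 1:
                field += coupling * y[l + 1]    # right neighbour
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            y[l] = 1 if rng.random() < p_plus else -1
    return y

y = gibbs_labels(X, w_true)
```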
Experimental results on OCR datasets
We randomly construct the OCR100, OCR150, OCR200, and OCR250 datasets for 10-fold CV.
Sensitivity to Regularization Constants
• L1-CRFs are very sensitive to the regularization constant; the others are more stable
• LapM3N is the most stable of all
• Constants tried for L1-CRF and L2-CRF: 0.001, 0.01, 0.1, 1, 4, 9, 16
• Constants tried for M3N and LapM3N: 1, 4, 9, 16, 25, 36, 49, 64, 81
Summary
• We propose MaxEntNet, a general framework for Bayesian max-margin structured prediction
  – MaxEntNet subsumes the standard M3Ns
  – It admits a PAC-Bayes theoretical error bound
• We propose Laplace max-margin Markov networks (LapM3N)
  – They enjoy a posterior shrinkage effect
  – They perform as well as sparse models on synthetic data, and better on real data sets
  – They are more stable with respect to regularization constants
Thanks!

Detailed proofs:
http://166.111.138.19/junzhu/MaxEntNet_TR.pdf
http://www.sailing.cs.cmu.edu/pdf/2008/zhutr1.pdf