22
Jun Zhu [email protected] Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint work with Eric P. Xing and Bo Zhang. Laplace Maximum Margin Markov Networks

Jun Zhu [email protected] Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Embed Size (px)

Citation preview

Page 1: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Jun Zhu

[email protected]

Dept. of Comp. Sci. & Tech., Tsinghua University

This work was done when I was a visiting researcher at CMU. Joint work with Eric P. Xing

and Bo Zhang.

Laplace Maximum Margin Markov Networks

Page 2: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Outline

04/18/23ICML 2008 @ Helsinki, Finland2

Introduction– Structured Prediction– Max-margin Markov Networks

Max-Entropy Discrimination Markov Networks– Basic Theorems– Laplace Max-margin Markov Networks– Experimental Results

Summary

Page 3: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Classical Classification Models

04/18/23ICML 2008 @ Helsinki, Finland3

• Inputs: – a set of training samples , where and

• Outputs:– a predictive function :

• Examples ( ):

– Support Vector Machine (SVM)

• Max-margin learning

– Logistic Regression• Max-likelihood estimation

Page 4: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Structured Prediction

04/18/23ICML 2008 @ Helsinki, Finland4

• Complicated Prediction Examples:– Part-of-speech (POS) Tagging:

– Image segmentation

• Inputs:– a set of training samples: ,

where

• Outputs:– a predictive function :

“Do you want fries with that?” -> <verb pron verb noun prep pron>

Page 5: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Structured Prediction Models

04/18/23ICML 2008 @ Helsinki, Finland5

Conditional Random Fields (CRFs) (Lafferty et al., 2001)– Based on Logistic Regression– Max-likelihood estimation

(point-estimate)

• Max-margin Markov Networks (M3Ns) (Taskar et al., 2003)

– Based on SVM– Max-margin learning ( point-

estimate)

• Markov properties are encoded in the feature functions

Page 6: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Between MLE and max-margin learning

04/18/23ICML 2008 @ Helsinki, Finland6

Likelihood-based estimation– Probabilistic (joint/conditional

likelihood model)– Easy adapt to perform

Bayesian learning, and consider prior knowledge, missing data

• Max-margin learning– Non-probabilistic (concentrate on

input-output mapping)– Not obvious how to perform

Bayesian learning or consider prior, and missing data

– Sound theoretical guarantee with limited samples

• Maximum Entropy Discrimination (MED) (Jaakkola, et al., 1999)

– A Bayesian learning approach– The optimization problem (binary classification)

Page 7: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

MaxEnt Discrimination Markov networks

04/18/23ICML 2008 @ Helsinki, Finland7

MaxEnt Discrimination Markov Networks (MaxEntNet):

– Generalized maximum entropy or regularized KL-divergence

– Subspace of distributions defined with expected margin constraints

Bayesian Prediction

p

Page 8: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Solution to MaxEntNet

04/18/23ICML 2008 @ Helsinki, Finland8

Theorem 1 (Solution to MaxEntNet):– Posterior Distribution:

– Dual Optimization Problem:

Convex conjugate (closed proper convex )– Def:– Ex:

Page 9: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Reduction to M3Ns

04/18/23ICML 2008 @ Helsinki, Finland9

Theorem 2 (Reduction of MaxEntNet to M3Ns):– Assume

Posterior distribution: Dual optimization:

Predictive rule:

Thus, MaxEntNet subsumes M3Ns and admits all the merits of max-margin learning

Furthermore, MaxEntNet has at least three advantages …

Page 10: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

The Three Advantages

PAC-Bayes prediction error guaranteeIntroduce regularization effects, such as

sparsity biasProvides an elegant approach to

incorporate latent variables and structures

04/18/23ICML 2008 @ Helsinki, Finland10

Page 11: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

The Three Advantages

PAC-Bayes prediction error guaranteeIntroduce regularization effects, such as

sparsity biasProvides an elegant approach to

incorporate latent variables and structures

04/18/23ICML 2008 @ Helsinki, Finland11

Page 12: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Generalization Guarantee

04/18/23ICML 2008 @ Helsinki, Finland12

MaxEntNet is an averaging model– we also call it a Bayesian Max-Margin Markov

NetworkTheorem 3 (PAC-Bayes Bound)

IfThen

Page 13: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Laplace M3Ns (LapM3N)

04/18/23ICML 2008 @ Helsinki, Finland13

The prior in MaxEntNet can be designed to introduce useful regularizatioin effects, such as sparsity bias

Normal prior is not so good for sparse data …

Instead, we use the Laplace prior …

– hierarchical representation (scale mixture)

Page 14: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

• Similar calculation for M3Ns (A standard normal prior)

Posterior shrinkage effect in LapM3N

04/18/23ICML 2008 @ Helsinki, Finland14

Exact integration in LapM3N

Alternatively,

A functional transformation

Page 15: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Variational Bayesian Learning

04/18/23ICML 2008 @ Helsinki, Finland15

Exact dual function is hard to optimize

Use the hierarchical representation, we get:

We optimize an upper bound:

Why is it easier?– Alternating minimization leads to nicer optimization

problems

Page 16: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Variational Bayesian Learning (Cont’)

04/18/23ICML 2008 @ Helsinki, Finland16

Page 17: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Experiments

Compare LapM3N with:CRFs: MLEL2-CRFs: L2-norm penalized MLEL1-CRFs: L1-norm penalized MLE (sparse

estimation)M3N: max-margin learning

04/18/23ICML 2008 @ Helsinki, Finland17

Page 18: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Experimental results on synthetic datasets

04/18/23ICML 2008 @ Helsinki, Finland18

Datasets with 100 i.i.d features of which 10, 30, 50 are relevant– For each setting, we generate 10 datasets, each having 1000

samples. True labels are assigned via a Gibbs sampler with 5000 iterations.

Datasets with 30 correlated relevant features + 70 i.i.d irrelevant features

Page 19: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Experimental results on OCR datasets

04/18/23ICML 2008 @ Helsinki, Finland19

We randomly construct OCR100, OCR150, OCR200, and OCR250 for 10 fold CV.

Page 20: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Sensitivity to Regularization Constants

04/18/23ICML 2008 @ Helsinki, Finland20

• L1-CRFs are much sensitive to regularization constants; the others are more stable

• LapM3N is the most stable one

L1-CRF and L2-CRF:- 0.001, 0.01, 0.1, 1, 4, 9, 16

M3N and LapM3N:- 1, 4, 9, 16, 25, 36, 49, 64, 81

Page 21: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

Summary

04/18/23ICML 2008 @ Helsinki, Finland21

• We propose a general framework MaxEntNet to do Bayesian max-margin structured prediction– MaxEntNet subsumes the standard M3Ns– PAC-Bayes Theoretical Error Bound

• We propose Laplace max-margin Markov networks– Enjoys a posterior shrinkage effect– Can perform as well as sparse models on

synthetic data; better on real data sets– More stable to regularization constants

Page 22: Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint

04/18/23ICML 2008 @ Helsinki, Finland22

Thanks!Detailed Proof:

http://166.111.138.19/junzhu/MaxEntNet_TR.pdf

http://www.sailing.cs.cmu.edu/pdf/2008/zhutr1.pdf