Scalable Algorithms for Structured Prediction Thomas Hofmann [email protected]

Scalable Algorithms for Structured Predictionsmicv2010.kyb.tuebingen.mpg.de/hofmann-scalable-smicv... · 2010-07-13 · Scalable Algorithms for Structured Prediction (applicable to,



(applicable to, but without applications in Computer Vision)


Motivation & Overview


Structured Prediction

Generalize machine learning methods to deal with structured outputs and/or with multiple, interdependent outputs

Structured objects such as sequences, strings, trees, labeled graphs, lattices, etc.

Multiple response variables that are interdependent = collective classification


Natural Language Processing

Syntactic sentence parsing, dependency parsing, PoS tagging, named entity detection, language modeling

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP 2004.

R. McDonald, K. Crammer, F. Pereira, Online Large-Margin Training of Dependency Parsers, ACL 2005.

H. C. Daume III, Practical Structured Learning Techniques for Natural Language Processing, Ph.D. Thesis, Univ. Southern California, 2006

B. Roark, M. Saraclar, M. Collins, Discriminative n-gram Language Modeling, Computer Speech & Language 21(2): 373-392, 2007.

P. Blunsom, Structured Classification for Multilingual Natural Language Processing, Ph.D. Thesis, Univ. Melbourne, 2007.

L. S. Zettlemoyer, Learning to Map Sentences to Logical Form, Ph.D. Thesis, MIT, 2009.

T. Koo, Advances in Discriminative Dependency Parsing, Ph.D. Thesis, MIT, 2010.


Bioinformatics

Protein secondary & tertiary structure, function prediction

Gene structure prediction (splicing), gene finding

Y. Liu, E. P. Xing, and J. Carbonell, Predicting protein folds with structural repeats using a chain graph model, ICML 2005

G. Raetsch and S. Sonnenburg, Large Scale Hidden Semi-Markov SVMs, NIPS 2006.

G. Schweikert et al, mGene: Accurate SVM-based gene finding with an application to nematode genomes, Genome Res. 2009 19: 2133-2143

A. Sokolov and A. Ben-Hur, A Structured-Outputs Method for Prediction of Protein Function, In Proceedings of the 3rd International Workshop on Machine Learning in Systems Biology, 2008.

K. Astikainen et al., Towards Structured Output Prediction of Enzyme Function, BMC Proceedings 2008, 2 (Suppl. 4):S2


Computer Vision

[ but I didn't want to talk about Computer Vision here ]

T. Caetano & R. Hartley, ICCV 2009 Tutorial on Structured Prediction in Computer Vision

e.g. M. P. Kumar, P. Torr, A. Zisserman, Efficient Discriminative Learning of Parts-based Models, ICCV 2009.


Maximum Margin Structured Prediction


Scoring Functions

Linear scoring (or compatibility) functions in joint feature representation of input/output pairs

Score each input/output pair with a scoring function

Consider linear scoring functions

Joint feature map over input/output pairs

Prediction
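A minimal sketch of the scoring and prediction steps listed above, assuming the simplest possible case: a multiclass joint feature map that copies the input into the block of the chosen label. The names `joint_feature`, `score`, and `predict`, and the toy weights, are illustrative assumptions and do not come from the slides.

```python
import numpy as np

def joint_feature(x, y, num_labels):
    """Assumed multiclass joint feature map Psi(x, y): x copied into block y."""
    psi = np.zeros(num_labels * x.shape[0])
    psi[y * x.shape[0]:(y + 1) * x.shape[0]] = x
    return psi

def score(w, x, y, num_labels):
    """Linear compatibility score F(x, y; w) = <w, Psi(x, y)>."""
    return float(w @ joint_feature(x, y, num_labels))

def predict(w, x, num_labels):
    """Prediction: the output with the highest score (brute-force argmax)."""
    return max(range(num_labels), key=lambda y: score(w, x, y, num_labels))

# Toy weights for 3 labels and 2 input features (made-up numbers).
w = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])
x = np.array([0.2, 0.9])
print(predict(w, x, num_labels=3))
```

For genuinely structured outputs the brute-force argmax is replaced by a problem-specific inference routine (e.g. dynamic programming); only the linear form of the score stays the same.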


Minimal Risk Structure Prediction

Goal of learning: find scoring function that minimizes expected prediction loss (= minimal risk)

Loss function for predicting incorrect output

Expected loss of prediction function for randomly generated instances

Empirical risk of prediction function on sample set
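The formulas for this slide did not survive extraction. In one common notation (the symbols \Delta, f, P and the sample (x_i, y_i), i = 1..n are assumptions here), the three quantities named above can be written as:

```latex
\begin{align*}
\Delta &: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}, \qquad \Delta(y, y) = 0
  && \text{(loss for predicting an incorrect output)} \\
R(f) &= \mathbb{E}_{(x, y) \sim P}\big[\Delta(y, f(x))\big]
  && \text{(expected loss, i.e. risk)} \\
R_{\mathrm{emp}}(f) &= \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, f(x_i))
  && \text{(empirical risk on the sample)}
\end{align*}
```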


Hinge-style Upper Bound

Difficult to optimize empirical risk functionals directly

Use upper bound on loss (w/ regularization)

Which one? Choice will determine theoretical properties (e.g. consistency) and type of optimization problem

Hinge loss (for binary classification)

Margin re-scaling

Slack re-scaling
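The three bounds named above are usually written as follows. This follows the standard formulation in Tsochantaridis et al. (JMLR 2005); the notation (joint feature map \Psi, loss \Delta with \Delta(y, y) = 0, weight vector w) is assumed, since the slide's own formulas were lost.

```latex
\begin{align*}
\text{hinge (binary, } y \in \{-1, +1\}\text{):} \quad
  & \max\{0,\; 1 - y\, f(x)\} \\
\text{margin re-scaling:} \quad
  & \max_{y' \in \mathcal{Y}} \big[\Delta(y, y') - \langle w, \Psi(x, y) - \Psi(x, y') \rangle\big] \\
\text{slack re-scaling:} \quad
  & \max_{y' \in \mathcal{Y}} \Delta(y, y')\,\big[1 - \langle w, \Psi(x, y) - \Psi(x, y') \rangle\big]
\end{align*}
```

Both structured variants are non-negative because the term for y' = y equals zero.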


Proving the Upper Bound

It is easy to see that this results in an upper bound.

For the margin re-scaled loss:

A similar relation holds for the slack re-scaled loss (in fact, one can define a one-parametric family of loss functions)
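A short derivation of the margin re-scaled case, written in assumed notation (\hat{y} is the predicted output, \Psi the joint feature map, \Delta the loss); it is a reconstruction of the standard argument, not the slide's original formula:

```latex
\begin{align*}
\hat{y} &= \arg\max_{y' \in \mathcal{Y}} \langle w, \Psi(x, y') \rangle
  \;\;\Longrightarrow\;\;
  \langle w, \Psi(x, \hat{y}) \rangle - \langle w, \Psi(x, y) \rangle \ge 0, \\
\Delta(y, \hat{y})
  &\le \Delta(y, \hat{y}) + \langle w, \Psi(x, \hat{y}) \rangle - \langle w, \Psi(x, y) \rangle \\
  &\le \max_{y' \in \mathcal{Y}} \big[\Delta(y, y') + \langle w, \Psi(x, y') \rangle - \langle w, \Psi(x, y) \rangle\big].
\end{align*}
```

The first inequality adds a non-negative term; the second replaces the particular output \hat{y} by the maximum over all outputs.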

Detail


Hinge + \|w\|_2 = structured SVM

With an L2-norm regularizer one obtains a generalization of the SVM

Structured SVM

Rolling out with slack variables

Shorthand
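For reference, the margin re-scaled structured SVM with slack variables is commonly written as below. The constant C, the 1/n scaling, and the shorthand \delta\Psi_i are assumptions following Tsochantaridis et al. (JMLR 2005), as the slide's formulas are missing.

```latex
\begin{align*}
\min_{w,\; \xi \ge 0} \quad & \frac{1}{2}\|w\|_2^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & \langle w, \delta\Psi_i(y) \rangle \ge \Delta(y_i, y) - \xi_i
  \qquad \forall i,\; \forall y \in \mathcal{Y} \setminus \{y_i\}, \\
\text{where} \quad & \delta\Psi_i(y) := \Psi(x_i, y_i) - \Psi(x_i, y).
\end{align*}
```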


Softmargin Illustration


Representer Theorem Insight

Special case: linear functions

T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008


Structured SVM

Converting one max-constraint per training instance into many linear constraints (number of constraints per instance = cardinality of the output space)

Assuming that the output space is large (combinatorial explosion in # of variables or parts) the problem (convex QP) is intractable as such

Challenge: devise efficient algorithms, exact or approximate

Comment


SVMstruct: Iterative Strengthening

Incrementally add constraints to define a sequence of relaxed QPs. Trade-off between accuracy and computational speed.

Define chain of (sub-)sets of constraints

Alternation of two steps:

(1) Solve relaxed QP with constraint set C_t

(2) Find a set of violated constraints C_t^+ not yet in C_t and add them:

Requires a black box mechanism for generating constraints: "separation oracle"


SVMstruct: Analysis

Method:

Derive dual QP by solving for w and plugging the solution in (use expansion from representer theorem)

Upper bound dual objective by setting weight to zero and solving for slack variables -> B

Find constraints/dual variables s.t. there is a guarantee that the dual objective will increase by at least some non-zero amount -> eta

Result: # iterations is finite


Separation Oracle

How to find (appropriate) violated constraints? Lemma:

What are the most violated constraints per training instance?

Loss-augmented prediction problem

Do not add constraints that are not epsilon-violated (termination)

Final result:

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005
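A minimal sketch of a separation oracle for the margin re-scaled case, assuming an output set small enough to enumerate. The function `most_violated` and the callables `psi` and `delta` are hypothetical names introduced for illustration; a real structured problem would replace the enumeration by loss-augmented inference.

```python
import numpy as np

def most_violated(w, psi, delta, y_true, outputs, xi=0.0, eps=1e-3):
    """Loss-augmented prediction over an enumerable output set.

    psi(y) and delta(y_true, y) are user-supplied (hypothetical) callables.
    Returns the most violated output, or None if no constraint is violated
    by more than eps beyond the current slack xi.
    """
    def violation(y):
        # Delta(y_i, y) - <w, Psi(x_i, y_i) - Psi(x_i, y)>
        return delta(y_true, y) - w @ (psi(y_true) - psi(y))

    y_hat = max(outputs, key=violation)
    return y_hat if violation(y_hat) > xi + eps else None

# Toy usage: three outputs with one-hot features and 0/1 loss (made-up numbers).
psi = lambda y: np.eye(3)[y]
delta = lambda y_true, y: float(y != y_true)
w = np.array([1.0, 0.5, 0.0])
print(most_violated(w, psi, delta, y_true=0, outputs=[0, 1, 2]))
```

Returning `None` when the violation does not exceed `xi + eps` mirrors the termination rule: constraints that are not epsilon-violated are never added to the working set.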


Approximation Quality

Final solution: all violated constraints are violated by less than epsilon (= termination condition)

Increase all slack variables by epsilon => feasible solution, but no longer optimal

Define:

Then: found solution can only be worse by epsilon

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005


Advanced Primal Methods


Cutting Planes via Pooling

1-slack formulation: blowing-up constraints even more!

Problem is equivalent to structured SVM:

But: fewer constraints need to be added -> more sparseness

Bottom line: one pooled constraint is as good as n individual ones

[ Gains largest when separation oracle is fast. ]

T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009.

T. Joachims, Training Linear SVMs in Linear Time, ACM SIGKDD 2006, 217–226.
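The 1-slack formulation (margin re-scaling) from Joachims et al. can be written as follows. The notation is assumed, with \delta\Psi_i(y) := \Psi(x_i, y_i) - \Psi(x_i, y); there is a single slack variable \xi shared by all constraints, and one constraint for every joint assignment of outputs to the n training instances.

```latex
\begin{align*}
\min_{w,\; \xi \ge 0} \quad & \frac{1}{2}\|w\|_2^2 + C\,\xi \\
\text{s.t.} \quad & \frac{1}{n} \Big\langle w, \sum_{i=1}^{n} \delta\Psi_i(\bar{y}_i) \Big\rangle
  \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi
  \qquad \forall (\bar{y}_1, \dots, \bar{y}_n) \in \mathcal{Y}^n .
\end{align*}
```

The number of constraints grows from n|\mathcal{Y}| to |\mathcal{Y}|^n, which is the "blow-up" mentioned above; each pooled constraint is found by running the separation oracle once per training instance.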

Improvement


Cutting Planes via Pooling

T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009

Experiments


Online Subgradient Method for sSVM

Why bother about the constraints in the first place? Optimize non-smooth objective directly - piecewise linear!

Compute subgradient

Perform stochastic subgradient descent (w/ learning rates)

N. Ratliff, J. A. Bagnell, M. Zinkevich, (Online) Subgradient Methods for Structured Prediction, AISTATS 2007.

N. Ratliff, J. A. Bagnell, M. Zinkevich, Subgradient Methods for Maximum Margin Structured Learning, 2007.

N. Z. Shor, Minimization Methods for Non-Differentiable Functions, Springer-Verlag, 1985.
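A minimal sketch of stochastic subgradient descent on the regularized margin re-scaled hinge objective, assuming a toy multiclass problem with 0/1 loss and brute-force loss-augmented prediction. The names `psi` and `subgradient_ssvm`, the step-size schedule eta0 / sqrt(t), and all constants are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def psi(x, y, k):
    """Multiclass joint feature map (an assumed stand-in for a structured Psi)."""
    out = np.zeros(k * x.size)
    out[y * x.size:(y + 1) * x.size] = x
    return out

def subgradient_ssvm(X, Y, k, lam=0.01, eta0=0.5, epochs=20, seed=0):
    """Stochastic subgradient descent on lam/2 ||w||^2 + margin re-scaled hinge."""
    rng = np.random.default_rng(seed)
    w = np.zeros(k * X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            x, y = X[i], Y[i]
            # Loss-augmented prediction with 0/1 loss (brute force over k outputs).
            y_hat = max(range(k), key=lambda c: float(c != y) + w @ psi(x, c, k))
            g = lam * w                      # subgradient of the regularizer
            if y_hat != y:                   # subgradient of the hinge term
                g = g + psi(x, y_hat, k) - psi(x, y, k)
            w = w - (eta0 / np.sqrt(t)) * g  # decaying learning rate
    return w

# Toy data: one orthogonal input per class.
X = np.eye(3)
Y = [0, 1, 2]
w = subgradient_ssvm(X, Y, k=3)
preds = [max(range(3), key=lambda c: w @ psi(x, c, 3)) for x in X]
print(preds)
```

No constraint set is maintained: each update only needs one call to loss-augmented prediction, which is why the method is attractive when that call is cheap.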


Subgradient Methods Background


Subgradient Methods Background

Convergence Proof

Recurrence leads to


Subgradient w/ Projection

PEGASOS Algorithm

Two key improvements:

Project weight vectors back to sphere after each update

Select a subset of k violated constraints in each subgradient update step (between batch and online)

Analysis: very fast convergence to epsilon-good solution

Improvement

S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, MLJ 2007.
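A minimal sketch of a single Pegasos-style update, assuming the usual step size 1/(lambda t) and projection onto the ball of radius 1/sqrt(lambda). The function name `pegasos_step` and the argument `g_loss` (the subgradient of the loss term, averaged over the k selected examples) are hypothetical.

```python
import numpy as np

def pegasos_step(w, g_loss, lam, t):
    """One Pegasos-style update: subgradient step, then projection.

    g_loss is the (mini-batch averaged) subgradient of the loss term only;
    the regularizer lam/2 ||w||^2 is handled by the shrinkage factor.
    """
    eta = 1.0 / (lam * t)
    w = (1.0 - eta * lam) * w - eta * g_loss
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(w)
    if norm > radius:                # project back onto the ball
        w = w * (radius / norm)
    return w

w = pegasos_step(np.zeros(2), np.array([-3.0, -4.0]), lam=1.0, t=1)
print(w)
```

In the structured setting, `g_loss` would be the average of Psi(x_i, y_hat_i) - Psi(x_i, y_i) over the selected violated examples, with y_hat_i obtained by loss-augmented prediction.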


Non-Convex Loss Bounds

Convex upper bounds like the hinge loss become very poor for large losses -> mismatch, sensitivity to outliers

Use non-convex bound instead

Ramp loss:

C. B. Do, Q. Le, C.H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008

R. Collobert, F. Sinz, J. Weston, L. Bottou, Trading Convexity for Scalability, ICML 2006
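A small numerical illustration of the gap between the two bounds, assuming a common form of the structured ramp loss: the loss-augmented maximum minus the plain maximum of the scores, whereas the margin re-scaled hinge subtracts the score of the true output. The function `hinge_and_ramp` and the toy numbers are made up for illustration.

```python
import numpy as np

def hinge_and_ramp(scores, losses, y_true):
    """Margin re-scaled hinge and (assumed form of) ramp loss for one example.

    scores[y] = <w, Psi(x, y)>, losses[y] = Delta(y_true, y).
    """
    scores = np.asarray(scores, dtype=float)
    losses = np.asarray(losses, dtype=float)
    augmented = np.max(losses + scores)   # loss-augmented maximum
    hinge = augmented - scores[y_true]    # convex in w
    ramp = augmented - np.max(scores)     # difference of convex functions
    return float(hinge), float(ramp)

# An "outlier": the true output 0 is scored far below the others.
hinge, ramp = hinge_and_ramp(scores=[-5.0, 2.0, 1.0],
                             losses=[0.0, 1.0, 1.0], y_true=0)
print(hinge, ramp)
```

In this example the prediction is output 1 with task loss 1; the ramp value stays close to that loss while the hinge value grows with the score gap, which is the sensitivity to outliers noted above.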


Concave-Convex Procedure (CCCP)

Slight modification of structured SVM:

Rescaled target margin, linearization of negative max

CCCP method

Taylor expansion - upper bound

Iterate minimization and re-computation of upper bound (convergence guarantee)

A.L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.
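A toy sketch of the CCCP iteration on a one-dimensional function, chosen only because every step has a closed form; it is not the structured objective from the slides. The objective f(w) = w^4 - 2 w^2 is split into a convex part u(w) = w^4 and a concave part v(w) = -2 w^2 (both choices are assumptions for illustration). Each iteration linearizes v at the current point, which gives a convex upper bound, and minimizes that bound exactly.

```python
import numpy as np

def f(w):
    return w**4 - 2.0 * w**2  # u(w) = w^4 (convex) plus v(w) = -2 w^2 (concave)

def cccp(w0, iters=60):
    w, history = w0, [f(w0)]
    for _ in range(iters):
        slope = -4.0 * w      # v'(w_t): first-order Taylor expansion of v
        # argmin_w  w^4 + slope * w  has the closed form w = cbrt(-slope / 4)
        w = float(np.cbrt(-slope / 4.0))
        history.append(f(w))
    return w, history

w_star, history = cccp(0.1)
print(w_star, history[-1])
```

The recorded objective values never increase, which is the convergence guarantee mentioned above; the limit is a local minimum (here w = 1), not necessarily a global one in general.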


Non-convex Loss Optimization

C. B. Do, Q. Le, C.H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008

Results


Saddle Point Formulation (1)

So far: argmax for prediction or loss-adjusted prediction performed in black box (outside of QP optimization)

New idea: incorporate prediction directly into QP

Class of problems for which prediction can be solved exactly by an LP relaxation

Binary MRFs with submodular potentials, matchings

Tree-structured MRFs

B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006


Saddle Point Formulation (2)

Combine into min/max problem

w-space: spherical constraint

z-space: linear constraints (depends on problem)

Extragradient methods

Solution method for saddle point problems (game theory)

Perform gradient step along w, z - then project

Recompute gradient at projected wp, zp and use new gradient from w, z & projection to obtain corrected wc, zc

B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction via the Extragradient Method, NIPS 2005

B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006
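A toy sketch of the extragradient step described above, on the scalar saddle problem min_w max_z w * z with box constraints. The objective, the box projection, and the step size are assumptions chosen for illustration; the structured case uses a spherical constraint on w and problem-specific linear constraints on z.

```python
import numpy as np

def extragradient(w, z, eta=0.5, iters=200, box=2.0):
    """Extragradient for the toy saddle problem min_w max_z  w * z on a box."""
    proj = lambda v: float(np.clip(v, -box, box))
    for _ in range(iters):
        # Prediction step: gradient step from (w, z), then project.
        w_p = proj(w - eta * z)      # dL/dw = z
        z_p = proj(z + eta * w)      # dL/dz = w
        # Correction step: start again from (w, z), but use the gradient
        # evaluated at the predicted point (w_p, z_p).
        w, z = proj(w - eta * z_p), proj(z + eta * w_p)
    return w, z

w, z = extragradient(1.0, 1.0)
print(w, z)
```

On this bilinear problem a plain simultaneous gradient step increases the distance to the saddle point (0, 0) at every iteration, while the corrected step contracts it, which is the motivation for recomputing the gradient at the projected point.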


Relaxed Constraint Generation

LP relaxations can also be used for approximate constraint generation (even if LP relaxations are not exact)!

Idea: Round solution of relaxed LP and treat as constraints (even if not a feasible output) -> constraint overgeneration

Less principled than saddle point approach, but good results in practice for intractable problems

T. Finley, T. Joachims, Training Structural SVMs when Exact Inference is Intractable, ICML 2008


Dual Methods


Dual QP for Structured SVM

Dual QP (margin re-scaling)

Dual variables can be re-scaled such that they define for each training instance a probability mass function over the possible outputs.
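The dual formula itself was lost in extraction. For a primal of the form (1/2)||w||^2 + (C/n) \sum_i \xi_i with margin re-scaled constraints, the standard Lagrangian dual reads as follows (notation assumed, with \delta\Psi_i(y) := \Psi(x_i, y_i) - \Psi(x_i, y)):

```latex
\begin{align*}
\max_{\alpha \ge 0} \quad & \sum_{i=1}^{n} \sum_{y \ne y_i} \alpha_{iy}\, \Delta(y_i, y)
  - \frac{1}{2} \Big\| \sum_{i=1}^{n} \sum_{y \ne y_i} \alpha_{iy}\, \delta\Psi_i(y) \Big\|_2^2 \\
\text{s.t.} \quad & \sum_{y \ne y_i} \alpha_{iy} \le \frac{C}{n} \qquad \forall i, \\
\text{with} \quad & w = \sum_{i=1}^{n} \sum_{y \ne y_i} \alpha_{iy}\, \delta\Psi_i(y).
\end{align*}
```

Multiplying each \alpha_{iy} by n/C and assigning the remaining mass to the correct output y_i gives, per training instance, non-negative numbers that sum to one, which is the probability mass function mentioned above.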


sSVM Algorithm: Dual View

Iterative strengthening of primal corresponds to variable selection in dual

At iteration t: most dual variables are clamped to zero (sparseness)

At iteration t+1: a subset of additional variables is unclamped (those corresponding to the selected constraints)

But: real power of dual view comes from incorporation of decomposition properties of feature map into the optimization problem

Primal methods: decomposition exploited in prediction and/or loss-augmented prediction


Part-based Decomposition & MRFs

Often the feature function decomposes into contributions over parts (or factors, or cliques in an MRF)

Assume similar additive decomposition holds for the loss

Then: one can rewrite the dual QP in terms of marginals over factor (sub-)configurations - much more compact!
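In assumed notation (\mathcal{C} is the set of parts, y_c the restriction of y to part c, and \alpha_i(y) the dual variables of instance i), the decomposition and the resulting marginals can be written as:

```latex
\begin{align*}
\Psi(x, y) &= \sum_{c \in \mathcal{C}} \Psi_c(x, y_c), \qquad
\Delta(y_i, y) = \sum_{c \in \mathcal{C}} \Delta_c(y_{i,c}, y_c), \\
\mu_{i,c}(y_c) &= \sum_{y' \in \mathcal{Y} \,:\, y'_c = y_c} \alpha_i(y'), \\
\sum_{y \in \mathcal{Y}} \alpha_i(y)\, \Psi(x_i, y)
  &= \sum_{c \in \mathcal{C}} \sum_{y_c} \mu_{i,c}(y_c)\, \Psi_c(x_i, y_c).
\end{align*}
```

The last identity shows why the dual only depends on the marginals: the number of variables drops from |\mathcal{Y}| per instance to the total number of part configurations.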



Representer Theorem

Definitions

Representation

Can be directly kernelized by introducing kernels on factor level.


Reparameterizing the Dual

Interpret dual variables as probabilities and introduce "marginals" over factors

New QP with marginal probabilities

P. L. Bartlett, M. Collins, B. Taskar, D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. NIPS 2005.

sorry for the messy notation


Exponentiated Gradient Descent

Essential idea: everything can be formulated in terms of variables defined over factor configurations (instead of global ones)

Simplified sketch:

Exponential parameterization

Perform gradient updates w.r.t. canonical parameters

Compute marginals \mu from dual variables \alpha (assumed to be efficient)

Excellent convergence rate bounds! (here: out of scope)

M. Collins, A. Globerson, T. Koo, X. Carreras, P.L. Bartlett, Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks, JMLR 2008
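A toy sketch of the exponentiated gradient update on a single probability simplex, assuming the exponential parameterization alpha = softmax(theta) and a plain gradient step on the canonical parameters theta. The quadratic objective (1/2)||alpha||^2 and the name `eg_minimize` are stand-ins chosen for illustration; in the structured dual there is one such distribution per training instance and the gradient is computed from factor marginals.

```python
import numpy as np

def eg_minimize(grad, alpha0, eta=1.0, iters=200):
    """Exponentiated gradient descent over the probability simplex."""
    theta = np.log(alpha0)            # canonical parameters
    alpha = alpha0
    for _ in range(iters):
        theta = theta - eta * grad(alpha)
        theta = theta - theta.max()   # stabilize; softmax is shift invariant
        alpha = np.exp(theta) / np.exp(theta).sum()
    return alpha

# Minimize 0.5 * ||alpha||^2 over the simplex; its gradient is alpha itself.
alpha = eg_minimize(lambda a: a, np.array([0.7, 0.2, 0.1]))
print(alpha)
```

The multiplicative form of the update keeps every iterate strictly positive and normalized, so no explicit projection onto the simplex is needed.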


Conclusion


Conclusion

Significant progress on scalable structured prediction problems

Constraint-generation approaches

sSVM, theoretical guarantees

deep cuts from pooled constraints

non-convex upper bounds and CCCP

online learning via stochastic subgradient, Pegasos

over-generating constraints

Other methods

saddle point formulation and extragradient

exponentiated gradient descent on dual

[new work by Meshi et al, ICML 2010]