
Scalable Algorithms for Structured Prediction

(applicable to, but without applications in Computer Vision)

Thomas Hofmann (thofmann@google.com)

Motivation & Overview

Structured Prediction

Generalize machine learning methods to deal with structured outputs and/or with multiple, interdependent outputs

Structured objects such as sequences, strings, trees, labeled graphs, lattices, etc.

Multiple response variables that are interdependent = collective classification

Natural Language Processing: syntactic sentence parsing, dependency parsing, PoS tagging, named entity detection, language modeling

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP 2004.

R. McDonald, K. Crammer, F. Pereira, Online Large-Margin Training of Dependency Parsers, ACL 2005.

H. C. Daume III, Practical Structured Learning Techniques for Natural Language Processing, Ph.D. Thesis, Univ. Southern California, 2006

B. Roark, M. Saraclar, M. Collins, Discriminative n-gram Language Modeling, Computer Speech & Language 21(2): 373-392, 2007.

P. Blunsom, Structured Classification for Multilingual Natural Language Processing, Ph.D. Thesis, Univ. Melbourne, 2007.

L. S. Zettlemoyer, Learning to Map Sentences to Logical Form, Ph.D. Thesis, MIT, 2009.

T. Koo, Advances in Discriminative Dependency Parsing, Ph.D. Thesis, MIT, 2010.

Bioinformatics: protein secondary & tertiary structure, function prediction; gene structure prediction (splicing), gene finding

Y. Liu, E. P. Xing, and J. Carbonell, Predicting protein folds with structural repeats using a chain graph model, ICML 2005

G. Rätsch and S. Sonnenburg, Large Scale Hidden Semi-Markov SVMs, NIPS 2006.

G. Schweikert et al., mGene: Accurate SVM-based Gene Finding with an Application to Nematode Genomes, Genome Res. 19: 2133-2143, 2009.

A. Sokolov and A. Ben-Hur, A Structured-Outputs Method for Prediction of Protein Function, In Proceedings of the 3rd International Workshop on Machine Learning in Systems Biology, 2008.

K. Astikainen et al., Towards Structured Output Prediction of Enzyme Function, BMC Proceedings 2008, 2 (Suppl. 4):S2

Computer Vision

[ but I didn't want to talk about Computer Vision here ]

T. Caetano & R. Hartley, ICCV 2009 Tutorial on Structured Prediction in Computer Vision

e.g. M. P. Kumar, P. Torr, A. Zisserman, Efficient Discriminative Learning of Parts-based Models, ICCV 2009.

Maximum Margin Structured Prediction

Scoring Functions

Linear scoring (or compatibility) functions in joint feature representation of input/output pairs

Score each input/output pair with a scoring function

Consider linear scoring functions

Joint feature map over input/output pairs

Prediction
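The slide's equations are not legible in this export; in standard notation (one common way to write them, not verbatim from the slides), the linear scoring function and the prediction rule are:

```latex
% Linear (compatibility) score of an input/output pair via a joint feature map
s_w(x, y) = \langle w, \Phi(x, y) \rangle
% Prediction: the highest-scoring output
f_w(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \; \langle w, \Phi(x, y) \rangle
```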

Minimal Risk Structured Prediction

Goal of learning: find scoring function that minimizes expected prediction loss (= minimal risk)

Loss function for predicting incorrect output

Expected loss of prediction function for randomly generated instances

Empirical risk of prediction function on sample set
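Written out in the usual notation (a reconstruction, with \Delta denoting the structured loss):

```latex
% Expected risk under the (unknown) data distribution P
R(f) = \mathbb{E}_{(x,y) \sim P}\big[\Delta(y, f(x))\big]
% Empirical risk on a sample (x_1, y_1), \dots, (x_n, y_n)
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \Delta\big(y_i, f(x_i)\big)
```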

Hinge-style Upper Bound

Difficult to optimize empirical risk functionals directly. Use an upper bound on the loss (w/ regularization). Which one? The choice determines theoretical properties (e.g. consistency) and the type of optimization problem.

Hinge loss (for binary classification)

Margin re-scaling

Slack re-scaling
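The corresponding formulas do not survive the export; the commonly used forms (a standard reconstruction) are:

```latex
% Binary hinge loss for y \in \{-1, +1\}
\ell_{\mathrm{hinge}}(x, y; w) = \max\{0,\; 1 - y \langle w, x \rangle\}
% Margin re-scaling: the required margin grows with the structured loss
\ell_{\mathrm{MR}}(x_i, y_i; w) = \max_{y \in \mathcal{Y}}
  \big[\Delta(y_i, y) - \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle\big]
% Slack re-scaling: the margin violation is multiplied by the structured loss
\ell_{\mathrm{SR}}(x_i, y_i; w) = \max_{y \in \mathcal{Y}}
  \Delta(y_i, y)\big[1 - \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle\big]
```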

Proving the Upper Bound

It is easy to see that this results in an upper bound; the argument for the margin re-scaled loss is sketched below. A similar relation holds for the slack-rescaled loss (in fact, one can define a one-parametric family of loss functions).
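The displayed chain of inequalities is missing here; the standard reasoning for margin re-scaling is:

```latex
% Since f_w(x_i) maximizes the score, \langle w, \Phi(x_i, f_w(x_i)) - \Phi(x_i, y_i)\rangle \ge 0, hence
\Delta\big(y_i, f_w(x_i)\big)
  \;\le\; \Delta\big(y_i, f_w(x_i)\big) + \langle w, \Phi(x_i, f_w(x_i)) - \Phi(x_i, y_i)\rangle
  \;\le\; \max_{y \in \mathcal{Y}} \big[\Delta(y_i, y) + \langle w, \Phi(x_i, y) - \Phi(x_i, y_i)\rangle\big]
  \;=\; \ell_{\mathrm{MR}}(x_i, y_i; w).
```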

Detail

Hinge-style loss + \|w\|_2 regularization = structured SVM

With an L2-norm regularizer one obtains a generalization of the SVM

Structured SVM

Rolling out with slack variables

Shorthand
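Rolled out with slack variables, the resulting n-slack QP (margin re-scaling, standard form rather than the slide's exact notation) reads:

```latex
\min_{w,\,\xi \ge 0}\;\; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
\langle w, \Phi(x_i, y_i) - \Phi(x_i, y)\rangle \;\ge\; \Delta(y_i, y) - \xi_i
\quad \forall i,\; \forall y \in \mathcal{Y}\setminus\{y_i\}.
```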

Soft-Margin Illustration

Representer Theorem Insight

Special case: linear functions

T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008

Structured SVM

Converting one max-constraint per training instance into many linear constraints (# = cardinality of the output space)

Assuming that the output space is large (combinatorial explosion in the # of variables or parts), the problem (a convex QP) is intractable as such

Challenge: devise efficient algorithms, exact or approximate

Comment

SVMstruct: Iterative Strengthening

Incrementally add constraints to define a sequence of relaxed QPs. Trade-off between accuracy and computational speed.

Define a chain of (sub-)sets of constraints C_1 ⊆ C_2 ⊆ ... Alternate two steps:

(1) Solve the relaxed QP with constraint set C_t

(2) Find a set of violated constraints C_t+ not yet in C_t and add them: C_{t+1} = C_t ∪ C_t+

Requires a black box mechanism for generating constraints: "separation oracle"
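A minimal sketch of this cutting-plane loop (n-slack variant). The helpers `phi`, `delta`, `separation_oracle`, and `solve_relaxed_qp` are assumptions for illustration, not part of any particular library:

```python
import numpy as np

def cutting_plane_train(data, phi, delta, separation_oracle, solve_relaxed_qp,
                        C=1.0, eps=1e-3, max_iter=100):
    """Sketch of SVMstruct-style iterative strengthening.

    data: list of (x_i, y_i) pairs; phi(x, y): joint feature map (ndarray);
    delta(y, y_hat): structured loss; separation_oracle: loss-augmented argmax;
    solve_relaxed_qp: solves the QP restricted to the current working sets.
    All helper names are illustrative.
    """
    working_sets = [set() for _ in data]          # C_t: one constraint set per example
    w = np.zeros(phi(*data[0]).shape)
    xi = np.zeros(len(data))
    for _ in range(max_iter):
        added = 0
        for i, (x, y) in enumerate(data):
            # (2) most violated constraint for example i under the current w
            y_hat = separation_oracle(w, x, y)
            violation = delta(y, y_hat) - w @ (phi(x, y) - phi(x, y_hat))
            if violation > xi[i] + eps:           # only add eps-violated constraints
                working_sets[i].add(y_hat)
                added += 1
        if added == 0:                            # no eps-violated constraint remains
            break
        # (1) re-solve the relaxed QP over the enlarged working sets
        w, xi = solve_relaxed_qp(data, phi, delta, working_sets, C)
    return w
```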

Method: derive the dual QP by solving for w and plugging the solution back in (use the expansion from the representer theorem). Upper-bound the dual objective by setting the weight vector to zero and solving for the slack variables -> B

Find constraints/dual variables such that the dual objective is guaranteed to increase by at least some non-zero amount per iteration -> eta

Result: # iterations is finite (at most B/eta)

SVMstruct: Analysis

Separation Oracle

How to find (appropriate) violated constraints? Lemma:

What are the most violated constraints per training instance?

Loss-augmented prediction problem (see below). Do not add constraints that are not epsilon-violated (termination).
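For margin re-scaling, the most violated constraint per training instance is found by loss-augmented prediction (reconstructed from the surrounding definitions):

```latex
\hat{y}_i = \operatorname*{argmax}_{y \in \mathcal{Y}} \;\big[\Delta(y_i, y) + \langle w, \Phi(x_i, y)\rangle\big]
```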

Final result:

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005

Approximation Quality

Final solution: all violated constraints are violated by less than epsilon (= termination condition). Increase all slack variables by epsilon => feasible solution, but no longer optimal.

Define:

Then: the solution found can be worse than the optimum by at most epsilon

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005

Advanced Primal Methods

Cutting Planes via Pooling

1-slack formulation: blowing-up constraints even more!

Problem is equivalent to structured SVM:

But: fewer constraints need to be added -> more sparseness. Bottom line: 1 pooled constraint is as good as n individual ones (see the formulation below). [ Gains largest when the separation oracle is fast. ]
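A hedged reconstruction of the 1-slack (pooled) formulation along the lines of Joachims et al. 2009:

```latex
\min_{w,\,\xi \ge 0}\;\; \frac{1}{2}\|w\|^2 + C\,\xi
\quad \text{s.t.} \quad
\frac{1}{n}\sum_{i=1}^{n} \langle w, \Phi(x_i, y_i) - \Phi(x_i, \bar{y}_i)\rangle
  \;\ge\; \frac{1}{n}\sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi
\quad \forall\, (\bar{y}_1, \dots, \bar{y}_n) \in \mathcal{Y}^n.
```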

T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009.

T. Joachims, Training Linear SVMs in Linear Time, ACM SIGKDD 2006, 217-226.

Improvement

Cutting Planes via Pooling

T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009

Experiments

Online Subgradient Method for sSVM

Why bother with the constraints in the first place? Optimize the non-smooth objective directly - it is piecewise linear!

Compute subgradient

Perform stochastic subgradient descent (w/ learning rates); a minimal sketch follows below.

N. Ratliff, J. A. Bagnell, M. Zinkevich, (Online) Subgradient Methods for Structured Prediction, AISTATS 2007.

N. Ratliff, J. A. Bagnell, M. Zinkevich, Subgradient Methods for Maximum Margin Structured Learning, 2007.

N. Z. Shor, Minimization Methods for Non-Differentiable Functions, Springer-Verlag, 1985.
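A minimal sketch of online subgradient descent on the margin-rescaled structured hinge loss, assuming illustrative helpers `phi` and `loss_augmented_argmax`:

```python
import numpy as np

def sgd_ssvm(data, phi, loss_augmented_argmax, lr=lambda t: 1.0 / (t + 1),
             lam=0.01, epochs=10):
    """Sketch of stochastic subgradient descent for the regularized,
    margin-rescaled structured hinge loss; helper names are illustrative."""
    w = np.zeros(phi(*data[0]).shape)
    t = 0
    for _ in range(epochs):
        for x, y in data:
            y_hat = loss_augmented_argmax(w, x, y)   # argmax_y [Delta + <w, phi>]
            # subgradient of the piecewise-linear objective at w
            g = lam * w + (phi(x, y_hat) - phi(x, y))
            w -= lr(t) * g
            t += 1
    return w
```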

Subgradient Methods Background

Subgradient Methods Background

Convergence Proof

Recurrence leads to

Subgradient w/ Projection

PEGASOS Algorithm

Two key improvements: (1) project the weight vector back to the sphere after each update; (2) select a subset of k violated constraints in each subgradient update step (between batch and online)

Analysis: very fast convergence to an epsilon-good solution
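One Pegasos-style update step as a sketch (mini-batch subgradient step plus projection onto the ball of radius 1/sqrt(lambda)); the helpers `phi` and `loss_augmented_argmax` are illustrative assumptions:

```python
import numpy as np

def pegasos_step(w, batch, phi, loss_augmented_argmax, lam, t):
    """Sketch of one Pegasos-style update for the structured SVM objective.
    Helper names are illustrative."""
    eta = 1.0 / (lam * (t + 1))                    # learning rate ~ 1/(lambda t)
    g = lam * w
    for x, y in batch:
        y_hat = loss_augmented_argmax(w, x, y)     # most violated output for (x, y)
        g += (phi(x, y_hat) - phi(x, y)) / len(batch)
    w = w - eta * g
    radius = 1.0 / np.sqrt(lam)                    # project back to the sphere
    norm = np.linalg.norm(w)
    if norm > radius:
        w *= radius / norm
    return w
```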

Improvement

S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, MLJ 2007.

Non-Convex Loss Bounds

Convex upper bounds like the hinge loss become very poor for large losses -> mismatch, sensitivity to outliers. Use a non-convex bound instead.

Ramp loss:
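The ramp-loss formula is not reproduced here; a common structured version, written as a difference of convex functions (a reconstruction consistent with the cited work), is:

```latex
\ell_{\mathrm{ramp}}(x_i, y_i; w) =
  \max_{y \in \mathcal{Y}} \big[\Delta(y_i, y) + \langle w, \Phi(x_i, y)\rangle\big]
  \;-\; \max_{y \in \mathcal{Y}} \langle w, \Phi(x_i, y)\rangle
```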

C. B. Do, Q. Le, C. H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008.

R. Collobert, F. Sinz, J. Weston, L. Bottou, Trading Convexity for Scalability, ICML 2006.

Concave-Convex Procedure (CCCP)

Slight modification of structured SVM:

Rescaled target margin, linearization of negative max

CCCP method: Taylor expansion -> upper bound

Iterate minimization and re-computation of upper bound (convergence guarantee)
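A sketch of the CCCP iteration for such a convex-plus-concave objective, using the ramp loss above as the example (a reconstruction, not the slide's exact derivation):

```latex
% Objective: u(w) - v(w), with convex
%   u(w) = \frac{\lambda}{2}\|w\|^2 + \sum_i \max_{y}\big[\Delta(y_i, y) + \langle w, \Phi(x_i, y)\rangle\big]
% and convex v(w) = \sum_i \max_{y} \langle w, \Phi(x_i, y)\rangle entering with a minus sign.
% CCCP linearizes -v(w) at the current iterate w^{(t)} using
%   y_i^{*} = \operatorname*{argmax}_{y} \langle w^{(t)}, \Phi(x_i, y)\rangle
% and solves the resulting convex (structured-SVM-like) problem:
w^{(t+1)} = \operatorname*{argmin}_{w}\; u(w) - \sum_i \langle w, \Phi(x_i, y_i^{*})\rangle
% Iterating this step decreases the objective monotonically.
```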

A.L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.

Non-convex Loss Optimization

C. B. Do, Q. Le, C.H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008

Results

Saddle Point Formulation (1)

So far: argmax for prediction or loss-adjusted prediction is performed in a black box (outside of the QP optimization)

New idea: incorporate prediction directly into QP

Class of problems for which prediction can be solved exactly by an LP relaxation

Binary MRFs with submodular potentials, matchings, tree-structured MRFs

B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006

Saddle Point Formulation (2)

Combine into a min/max problem

w-space: spherical constraint; z-space: linear constraints (depends on the problem)

Extragradient methods: a solution method for saddle-point problems (game theory). Perform a gradient step along w, z - then project. Recompute the gradient at the projected w_p, z_p and take a corrected step from w, z with this new gradient (plus projection) to obtain w_c, z_c (see the sketch below).
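A minimal sketch of one extragradient (predictor-corrector) step for a saddle-point problem min_w max_z L(w, z); the gradient and projection callbacks are illustrative assumptions:

```python
import numpy as np

def extragradient_step(w, z, grad_w, grad_z, project_w, project_z, eta):
    """Sketch of one extragradient step. grad_w/grad_z return the partial
    gradients of L(w, z); project_w/project_z are the problem-specific
    projections (sphere for w, linear constraints for z). Names illustrative."""
    # predictor: gradient step from (w, z), then project
    w_p = project_w(w - eta * grad_w(w, z))
    z_p = project_z(z + eta * grad_z(w, z))
    # corrector: re-evaluate the gradient at the predicted point,
    # but step from the original (w, z), then project again
    w_c = project_w(w - eta * grad_w(w_p, z_p))
    z_c = project_z(z + eta * grad_z(w_p, z_p))
    return w_c, z_c
```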

B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction via the Extragradient Method, NIPS 2005.

B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006.

Relaxed Constraint Generation

LP relaxations can also be used for approximate constraint generation (even if the LP relaxation is not exact)!

Idea: round the solution of the relaxed LP and treat it as a constraint (even if it is not a feasible output) -> constraint over-generation

Less principled than saddle point approach, but good results in practice for intractable problems

T. Finley, T. Joachims, Training Structural SVMs when Exact Inference is Intractable, ICML 2008

Dual Methods

Dual QP for Structured SVM

Dual QP (margin re-scaling)

Dual variables can be re-scaled such that they define for each training instance a probability mass function over the possible outputs.

sSVM Algorithm: Dual View

Iterative strengthening of the primal corresponds to variable selection in the dual. At iteration t: most dual variables are clamped to zero (sparseness). At iteration t+1: a subset of additional variables is unclamped (those corresponding to the selected constraints).

But: the real power of the dual view comes from incorporating the decomposition properties of the feature map into the optimization problem. Primal methods: decomposition exploited in prediction and/or loss-augmented prediction.

Part-based Decomposition & MRFs

Often the feature function decomposes into contributions over parts (or factors, or cliques in an MRF)

Assume similar additive decomposition holds for the loss

Then: one can rewrite the dual QP in terms of marginals over factor (sub-)configurations - much more compact!
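The assumed decompositions over parts/factors c, written out (a reconstruction in standard notation):

```latex
\Phi(x, y) = \sum_{c} \Phi_c(x, y_c),
\qquad
\Delta(y, \bar{y}) = \sum_{c} \Delta_c(y_c, \bar{y}_c)
```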


Representer Theorem

Definitions

Representation

Can be directly kernelized by introducing kernels at the factor level.

Reparameterizing the Dual

Interpret dual variables as probabilities and introduce "marginals" over factors

New QP with marginal probabilities
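The reparameterization itself is not legible here; the usual definition of the factor marginals (a reconstruction) is:

```latex
% Factor marginals of the per-example dual distribution \alpha_i over outputs
\mu_{i,c}(y_c) = \sum_{y \in \mathcal{Y}:\; y|_c = y_c} \alpha_i(y)
```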

P. L. Bartlett, M. Collins, B. Taskar, D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. NIPS 2005.

sorry for the messy notation

Exponentiated Gradient Descent

Essential idea: everything can be formulated in terms of variables defined over factor configurations (instead of global ones)

Simplified sketch: exponential parameterization

Perform gradient updates w.r.t. the canonical parameters. Compute marginals \mu from dual variables \alpha (assumed to be efficient). A minimal sketch is given below.
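A minimal sketch of one exponentiated-gradient update for a single example's dual distribution; `dual_gradient` is an illustrative callback, and in the full algorithm everything is carried at the level of factor marginals:

```python
import numpy as np

def exponentiated_gradient_step(theta, dual_gradient, eta):
    """Sketch: the dual distribution is parameterized as alpha(y) ~ exp(theta(y)),
    so an additive step on theta is a multiplicative update of alpha, which
    therefore stays in the simplex. `dual_gradient` returns the gradient of the
    (convex) dual objective at alpha(theta); names are illustrative."""
    theta = theta - eta * dual_gradient(theta)   # additive step in log-space
    alpha = np.exp(theta - theta.max())          # exponentiate (stabilized)
    alpha /= alpha.sum()                         # renormalize to a distribution
    return theta, alpha
```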

Excellent convergence rate bounds! (here: out of scope)

M. Collins, A. Globerson, T. Koo, X. Carreras, P.L. Bartlett, Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks, JMLR 2008


Conclusion

Significant progress on scalable structured prediction problems

Constraint-generation approaches

sSVM, theoretical guarantees; deep cuts from pooled constraints; non-convex upper bounds and CCCP; online learning via stochastic subgradient, Pegasos; over-generating constraints

Other methods

saddle point formulation and extragradient; exponentiated gradient descent on the dual; [new work by Meshi et al., ICML 2010]

Recommended