Scalable Algorithms for Structured Prediction
(applicable to, but without applications in, Computer Vision)
Thomas [email protected]
Motivation & Overview
Structured Prediction
Generalize machine learning methods to deal with structured outputs and/or with multiple, interdependent outputs
Structured objects such as sequences, strings, trees, labeled graphs, lattices, etc.
Multiple response variables that are interdependent = collective classification
Natural Language Processing
Syntactic sentence parsing, dependency parsing, PoS tagging, named entity detection, language modeling
B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP, 2004. R. McDonald, K. Crammer, F. Pereira, Online large-margin training of dependency parsers. ACL 2005.
H. C. Daume III, Practical Structured Learning Techniques for Natural Language Processing, Ph.D. Thesis, Univ. Southern California, 2006
B. Roark, M. Saraclar, M. Collins, Discriminative n-gram language modeling. Computer Speech & Language 21(2): 373-392, 2007
P. Blunsom, Structured Classification for Multilingual Natural Language Processing, Ph.D. Thesis, Univ. Melbourne, 2007
L. S. Zettlemoyer, Learning to Map Sentences to Logical Form, Ph.D. Thesis, MIT, 2009
T. Koo, Advances in Discriminative Dependency Parsing, Ph.D. Thesis, MIT, 2010.
Bioinformatics
Protein secondary & tertiary structure, function prediction
Gene structure prediction (splicing), gene finding
Y. Liu, E. P. Xing, and J. Carbonell, Predicting protein folds with structural repeats using a chain graph model, ICML 2005
G. Raetsch and S. Sonnenburg, Large Scale Hidden Semi-Markov SVMs, NIPS 2006.
G. Schweikert et al, mGene: Accurate SVM-based gene finding with an application to nematode genomes, Genome Res. 2009 19: 2133-2143
A. Sokolov and A. Ben-Hur, A Structured-Outputs Method for Prediction of Protein Function, In Proceedings of the 3rd International Workshop on Machine Learning in Systems Biology, 2008.
K. Astikainen et al., Towards Structured Output Prediction of Enzyme Function, BMC Proceedings 2008, 2 (Suppl. 4):S2
Computer Vision
[ but I didn't want to talk about Computer Vision here ]
T. Caetano & R. Hartley, ICCV 2009 Tutorial on Structured Prediction in Computer Vision
e.g. M. P. Kumar, P. Torr, A. Zisserman, Efficient Discriminative Learning of Parts-based Models, ICCV 2009.
Maximum Margin Structured Prediction
Scoring Functions
Linear scoring (or compatibility) functions in joint feature representation of input/output pairs
Score each input/output pair with a scoring function
Consider linear scoring functions
Joint feature map over input/output pairs
Prediction
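The equations on this slide were lost in extraction; in the standard notation of the cited literature (w the weight vector, Φ the joint feature map), the setup is:

```latex
% Linear scoring (compatibility) function on input/output pairs
f(x, y; w) = \langle w, \Phi(x, y) \rangle

% Prediction: maximize the score over the structured output space
\hat{y}(x; w) = \operatorname*{argmax}_{y \in \mathcal{Y}} \; \langle w, \Phi(x, y) \rangle
```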
Minimal Risk Structured Prediction
Goal of learning: find scoring function that minimizes expected prediction loss (= minimal risk)
Loss function for predicting incorrect output
Expected loss of prediction function for randomly generated instances
Empirical risk of prediction function on sample set
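In standard notation (Δ the loss function, P the data distribution, {(x_i, y_i)} the sample), these two quantities read:

```latex
% Expected risk of the prediction function under the data distribution P
R(f) = \mathbb{E}_{(x,y) \sim P}\!\left[ \Delta\big(y, \hat{y}(x)\big) \right]

% Empirical risk on a sample \{(x_i, y_i)\}_{i=1}^{n}
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \Delta\big(y_i, \hat{y}(x_i)\big)
```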
Hinge-style Upper Bound
Difficult to optimize empirical risk functionals directly
Use upper bound on loss (w/ regularization)
Which one? The choice determines theoretical properties (e.g. consistency) and the type of optimization problem
Hinge loss (for binary classification)
Margin re-scaling
Slack re-scaling
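The two structured generalizations of the hinge loss, written out in the standard notation of the tutorial's references:

```latex
% Margin re-scaling: the loss shifts the required margin
\ell_{MR}(w; x, y) = \max_{y' \in \mathcal{Y}}
  \left[ \Delta(y, y') + \langle w, \Phi(x, y') \rangle - \langle w, \Phi(x, y) \rangle \right]

% Slack re-scaling: the loss multiplies the margin violation
\ell_{SR}(w; x, y) = \max_{y' \in \mathcal{Y}} \;
  \Delta(y, y') \left[ 1 + \langle w, \Phi(x, y') \rangle - \langle w, \Phi(x, y) \rangle \right]
```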
Proving the Upper Bound
It is easy to see that this results in an upper bound. For the margin re-scaled loss: a similar relation holds for the slack re-scaled loss (in fact, one can define a one-parametric family of loss functions)
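Written out for the margin re-scaled bound (standard notation): since the prediction ŷ maximizes the score, the inserted score difference is non-negative, and bounding the specific y' = ŷ by the max gives

```latex
\Delta(y, \hat{y})
  \;\le\; \Delta(y, \hat{y}) + \langle w, \Phi(x, \hat{y}) \rangle - \langle w, \Phi(x, y) \rangle
  \;\le\; \max_{y' \in \mathcal{Y}}
  \left[ \Delta(y, y') + \langle w, \Phi(x, y') \rangle - \langle w, \Phi(x, y) \rangle \right]
```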
Detail
Hinge + \|w\|_2 = structured SVM
With an L2-norm regularizer one obtains a generalization of the SVM
Structured SVM
Rolling out with slack variables
Shorthand
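The resulting convex QP (margin re-scaling, one slack variable per training instance), in the standard notation of Tsochantaridis et al.:

```latex
\min_{w,\, \xi \ge 0} \;\; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \;\ge\; \Delta(y_i, y) - \xi_i
\qquad \forall i, \;\; \forall y \in \mathcal{Y}
```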
Softmargin Illustration
Representer Theorem Insight
Special case: linear functions
T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
Structured SVM
Converting one max-constraint per training instance into many linear constraints (# = cardinality of the output space)
Assuming that the output space is large (combinatorial explosion in # of variables or parts) the problem (convex QP) is intractable as such
Challenge: devise efficient algorithms, exact or approximate
Comment
SVMstruct: Iterative Strengthening
Incrementally add constraints to define a sequence of relaxed QPs. Trade-off between accuracy and computational speed.
Define a chain of (sub-)sets of constraints. Alternate two steps:
(1) Solve the relaxed QP with constraint set Ct
(2) Find a set of violated constraints Ct+ not yet in Ct and add them
Requires a black box mechanism for generating constraints: "separation oracle"
Method:
Derive the dual QP by solving for w and plugging the solution in (use expansion from representer theorem)
Upper bound the dual objective by setting the weight to zero and solving for the slack variables -> B
Find constraints/dual variables s.t. there is a guarantee that the dual objective will increase at least by some non-zero amount -> eta
Result: # iterations is finite
SVMstruct: Analysis
Separation Oracle
How to find (appropriate) violated constraints? Lemma:
What are the most violated constraints per training instance?
Loss-augmented prediction problem
Do not add constraints that are not epsilon-violated (termination)
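The separation oracle amounts to loss-augmented prediction, in the standard notation:

```latex
\hat{y}_i = \operatorname*{argmax}_{y \in \mathcal{Y}}
  \left[ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle \right]
```

The corresponding constraint is added only if it is violated by more than epsilon, which guarantees termination.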
Final result:
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005
Approximation Quality
Final solution: all violated constraints are violated by less than epsilon (= termination condition)
Increase all slack variables by epsilon => feasible solution, but no longer optimal
Define:
Then: found solution can only be worse by epsilon
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005
Advanced Primal Methods
Cutting Planes via Pooling
1-slack formulation: blowing-up constraints even more!
Problem is equivalent to structured SVM:
But: fewer constraints need to be added -> more sparseness
Bottom line: 1 pooled constraint is as good as n individual ones
[ Gains largest when separation oracle is fast. ]
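The 1-slack formulation pools one constraint per joint labeling of the whole training set, in the notation of Joachims et al.:

```latex
\min_{w,\, \xi \ge 0} \;\; \frac{\lambda}{2}\|w\|^2 + \xi
\quad \text{s.t.} \quad
\frac{1}{n} \sum_{i=1}^{n} \langle w, \Phi(x_i, y_i) - \Phi(x_i, \bar{y}_i) \rangle
\;\ge\; \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi
\qquad \forall (\bar{y}_1, \dots, \bar{y}_n) \in \mathcal{Y}^n
```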
T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009.
T. Joachims, Training linear SVMs in linear time. ACM SIGKDD 2006, 217-226.
Improvement
Cutting Planes via Pooling
T. Joachims, T. Finley, C.-N. J. Yu, Cutting-Plane Training of Structural SVMs, MLJ 2009
Experiments
Online Subgradient Method for sSVM
Why bother about the constraints in the first place? Optimize the non-smooth objective directly - it is piecewise linear!
Compute subgradient
Perform stochastic subgradient descent (w/ learning rates)
N. Ratliff, J. A. Bagnell, M. Zinkevich, (Online) Subgradient Methods for Structured Prediction, AISTATS 2007.
N. Ratliff, J. A. Bagnell, M. Zinkevich, Subgradient Methods for Maximum Margin Structured Learning, 2007.
N. Z. Shor, Minimization Methods for Non-Differentiable Functions. Springer-Verlag, 1985.
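A minimal runnable sketch of this stochastic subgradient scheme. The toy joint feature map, the 0/1 structured loss, and all function names are illustrative assumptions, not from the slides; for real structured outputs the argmax would be a combinatorial decoder (e.g. Viterbi).

```python
import numpy as np

def joint_feature(x, y, n_classes):
    """Toy joint feature map: place x in the block belonging to class y."""
    phi = np.zeros(len(x) * n_classes)
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi

def loss_augmented_argmax(w, x, y_true, n_classes):
    """Separation oracle: argmax_y [Delta(y_true, y) + <w, phi(x, y)>], 0/1 loss."""
    scores = [(0.0 if y == y_true else 1.0) + w @ joint_feature(x, y, n_classes)
              for y in range(n_classes)]
    return int(np.argmax(scores))

def sgd_structured_svm(data, n_classes, dim, lam=0.1, epochs=50):
    """Stochastic subgradient descent on lam/2 ||w||^2 + structured hinge loss."""
    w = np.zeros(dim * n_classes)
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)               # standard 1/(lambda * t) step size
            y_hat = loss_augmented_argmax(w, x, y, n_classes)
            g = lam * w                         # subgradient of the regularizer
            if y_hat != y:                      # subgradient of the hinge term
                g += joint_feature(x, y_hat, n_classes) - joint_feature(x, y, n_classes)
            w -= eta * g
    return w
```

On a linearly separable toy problem this recovers a separating weight vector; the piecewise-linear objective is handled entirely through the loss-augmented argmax.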
Subgradient Methods Background
Convergence Proof
Recurrence leads to
Subgradient w/ Projection
PEGASOS Algorithm
Two key improvements:
Project the weight vector back to the sphere after each update
Select a subset of k violated constraints in each subgradient update step (between batch and online)
Analysis:
Very fast convergence to an epsilon-good solution
Improvement
S. Shalev-Shwartz Y. Singer N. Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, MLJ 2007.
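The two improvements, sketched for the structured-output variant in standard notation (A_t a mini-batch of size k, A_t+ its margin-violating instances, ŷ_i the loss-augmented prediction):

```latex
% Subgradient step with learning rate \eta_t = 1/(\lambda t)
w_{t+\frac{1}{2}} = (1 - \eta_t \lambda)\, w_t
  + \frac{\eta_t}{k} \sum_{i \in A_t^{+}} \big( \Phi(x_i, y_i) - \Phi(x_i, \hat{y}_i) \big)

% Projection back onto the ball of radius 1/\sqrt{\lambda}
w_{t+1} = \min\!\left\{ 1, \; \frac{1/\sqrt{\lambda}}{\|w_{t+\frac{1}{2}}\|} \right\} w_{t+\frac{1}{2}}
```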
Non-Convex Loss Bounds
Convex upper bounds like the hinge loss become very poor for large losses -> mismatch, sensitivity to outliers
Use a non-convex bound instead
Ramp loss:
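One common structured ramp loss (a difference of two convex max-terms, standard notation assumed; the original slide equation was lost):

```latex
\ell_{ramp}(w; x, y) =
  \max_{y' \in \mathcal{Y}} \left[ \Delta(y, y') + \langle w, \Phi(x, y') \rangle \right]
  \;-\; \max_{y' \in \mathcal{Y}} \; \langle w, \Phi(x, y') \rangle
```

Unlike the hinge bound, this stays bounded as the score of the prediction grows, which reduces sensitivity to outliers.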
C. B. Do, Q. Le, C. H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008.
R. Collobert, F. Sinz, J. Weston, L. Bottou, Trading Convexity for Scalability, ICML 2006.
Convex ConCave Procedure
Slight modification of structured SVM:
Rescaled target margin, linearization of negative max
CCCP method
Taylor expansion - upper bound
Iterate minimization and re-computation of upper bound (convergence guarantee)
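The CCCP iteration sketched in standard notation: split the objective into convex and concave parts, then repeatedly linearize the concave part at the current iterate (a Taylor-expansion upper bound) and minimize:

```latex
% Decomposition of the objective
J(w) = J_{vex}(w) + J_{cav}(w)

% CCCP update: each step decreases J, giving the convergence guarantee
w_{t+1} = \operatorname*{argmin}_{w} \; J_{vex}(w) + \langle \nabla J_{cav}(w_t), \, w \rangle
```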
A.L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.
Non-convex Loss Optimization
C. B. Do, Q. Le, C.H. Teo, O. Chapelle, A. Smola, Tighter Bounds for Structured Estimation, NIPS 2008
Results
Saddle Point Formulation (1)
So far: argmax for prediction or loss-augmented prediction performed in a black box (outside of the QP optimization)
New idea: incorporate prediction directly into QP
Class of problems for which prediction can be solved exactly by an LP relaxation
Binary MRFs with submodular potentials, matchingsTree-structured MRFS
B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006
Saddle Point Formulation (2)
Combine into a min/max problem
w-space: spherical constraint
z-space: linear constraints (depends on problem)
Extragradient methods
Solution method for saddle point problems (game theory)
Perform gradient step along w, z - then project
Recompute gradient at projected wp, zp and use the new gradient from w, z & projection to obtain corrected wc, zc
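The prediction/correction steps in standard notation (L the saddle objective, Π the projections onto the w- and z-constraint sets):

```latex
% Prediction step: ordinary projected gradient step at (w, z)
w^{p} = \Pi_{W}\big( w - \eta \, \nabla_{w} L(w, z) \big), \qquad
z^{p} = \Pi_{Z}\big( z + \eta \, \nabla_{z} L(w, z) \big)

% Correction step: re-evaluate the gradient at (w^p, z^p), step again from (w, z)
w^{c} = \Pi_{W}\big( w - \eta \, \nabla_{w} L(w^{p}, z^{p}) \big), \qquad
z^{c} = \Pi_{Z}\big( z + \eta \, \nabla_{z} L(w^{p}, z^{p}) \big)
```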
B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction via the Extragradient Method, NIPS 2005.
B. Taskar, S. Lacoste-Julien, M. Jordan, Structured Prediction, Dual Extragradient and Bregman Projections, JMLR 2006.
Relaxed Constraint Generation
LP relaxations can also be used for approximate constraint generation (even if the LP relaxation is not exact)!
Idea: round the solution of the relaxed LP and treat it as a constraint (even if it is not a feasible output) -> constraint over-generation
Less principled than the saddle point approach, but good results in practice for intractable problems
T. Finley, T. Joachims, Training Structural SVMs when Exact Inference is Intractable, ICML 2008
Dual Methods
Dual QP for Structured SVM
Dual QP (margin re-scaling)
Dual variables can be re-scaled such that they define for each training instance a probability mass function over the possible outputs.
sSVM Algorithm: Dual View
Iterative strengthening of the primal corresponds to variable selection in the dual
At iteration t: most dual variables are clamped to zero (sparseness)
At iteration t+1: a subset of additional variables is unclamped (those corresponding to the selected constraints)
But: the real power of the dual view comes from incorporating decomposition properties of the feature map into the optimization problem
Primal methods: decomposition exploited in prediction and/or loss-augmented prediction
Part-based Decomposition & MRFs
Often the feature function decomposes into contributions over parts (or factors, or cliques in an MRF)
Assume similar additive decomposition holds for the loss
Then: one can rewrite the dual QP in terms of marginals over factor (sub-)configurations - much more compact!
Representer Theorem
Definitions
Representation
Can be directly kernelized by introducing kernels on factor level.
Reparameterizing the Dual
Interpret dual variables as probabilities and introduce "marginals" over factors
New QP with marginal probabilities
P. L. Bartlett, M. Collins, B. Taskar, D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. NIPS 2005.
sorry for the messy notation
Exponentiated Gradient Descent
Essential idea: everything can be formulated in terms of variables defined over factor configurations (instead of global ones)
Simplified sketch: exponential parameterization
Perform gradient updates w.r.t. canonical parameters
Compute marginals \mu from dual variables \alpha (assumed to be efficient)
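The multiplicative update at the heart of the method, in standard notation (Q the dual objective, each \alpha_i a distribution over outputs):

```latex
% Exponentiated gradient update: the normalization keeps \alpha_i a
% probability distribution over outputs after every step
\alpha_i^{t+1}(y) =
  \frac{\alpha_i^{t}(y) \, \exp\!\big( -\eta \, \nabla_{i,y} Q(\alpha^{t}) \big)}
       {\sum_{y'} \alpha_i^{t}(y') \, \exp\!\big( -\eta \, \nabla_{i,y'} Q(\alpha^{t}) \big)}
```

With a part-decomposable feature map, this update can be carried out on factor marginals rather than on the exponentially many global variables.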
Excellent convergence rate bounds! (here: out of scope)
M. Collins, A. Globerson, T. Koo, X. Carreras, P.L. Bartlett, Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks, JMLR 2008
Conclusion
Significant progress on scalable structured prediction problems
Constraint-generation approaches
sSVM, theoretical guarantees
deep cuts from pooled constraints
non-convex upper bounds and CCCP
online learning via stochastic subgradient, Pegasos
over-generating constraints
Other methods
saddle point formulation and extragradient
exponentiated gradient descent on dual
[new work by Meshi et al., ICML 2010]