Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization
Tyler B. Johnson and Carlos Guestrin, University of Washington
Slide 2
Optimization is very important to machine learning. Our focus is constrained convex optimization, where the number of constraints can be very large!
[Diagram: models of data → optimization → optimal model]
Slide 3
Examples
Sparse regression: many constraints in the dual problem
Classification
Slide 4
Choices for Scaling Optimization
Stochastic methods
Parallelization
Subject of this talk: active sets
Slide 5
Active Set Motivation
Important fact: at the optimal solution, typically only a small fraction of the constraints are active.
Slide 6
Convex Optimization with Active Sets
[Figure: a convex objective f over a feasible set]
Slide 7
Convex Optimization with Active Sets
1. Choose a set of constraints
2. Set x to minimize the objective subject to the chosen constraints
Then repeat...
Slide 8
Convex Optimization with Active Sets
1. Choose a set of constraints
2. Set x to minimize the objective subject to the chosen constraints
The algorithm converges when x is feasible!
Slide 9
Limitations of Active Sets
Until x is feasible do:
  Propose an active set of important constraints
  x ← minimizer of the objective s.t. only the active set
Note that x is infeasible until convergence.
Which constraints are important?
How many constraints to choose?
When to terminate the subproblem?
How many iterations to expect?
Slide 10
Blitz
1. Update y to be the extreme feasible point on the segment [y, x]
(y is a feasible point; x is the minimizer subject to no constraints)
Slide 11
Blitz
2. Select the top k constraints with boundaries closest to y
Slide 12
3. Set x to minimize the objective subject to the selected constraints
And repeat...
Slide 13
Blitz
1. Update y to be the extreme feasible point on the segment [y, x]
Slide 14
Blitz
2. Choose the top k constraints with boundaries closest to y
3. Set x to minimize the objective subject to the selected constraints
When x = y, Blitz converges!
Slide 15
Blitz Intuition
The key to Blitz is its y-update.
Slide 16
Blitz Intuition
The key to Blitz is its y-update.
If the y-update is large, Blitz is near convergence.
Slide 17
Blitz Intuition
If the y-update is large, Blitz is near convergence.
If the y-update is small, then the violated constraint greatly improves x at the next iteration: x must improve significantly.
Slide 18
Main Theorem (Theorem 2.1)
Slide 19
Active Set Size for Linear Convergence (Corollary 2.2)
Slide 20
Constraint Screening (Corollary 2.3)
Slide 21
Tuning Algorithmic Parameters
Theory guides the choice of:
Active set size
Subproblem termination criteria
[Plot: best fixed parameters vs. parameters tuned using theory]
Slide 22
Recap
Blitz is an active set algorithm that:
Selects theoretically justified active sets to maximize guaranteed progress
Applies theoretical analysis to guide the choice of algorithm parameters
Discards constraints proven to be irrelevant during optimization
Slide 23
Empirical Evaluation
Slide 24
Experiment Overview
Apply Blitz to L1-regularized loss minimization.
The dual is a constrained problem (a standard primal-dual pair is shown below).
Optimizing subject to the active set corresponds to solving the primal problem over a subset of variables.
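For concreteness, here is a standard primal-dual pair of this kind, written for the lasso (the squared loss is my choice of example; the talk covers general L1-regularized losses). Each column a_j of A contributes one dual constraint:

```latex
\begin{align*}
\text{primal:} \quad & \min_{w} \; \tfrac{1}{2}\,\|Aw - b\|_2^2 + \lambda\,\|w\|_1 \\
\text{dual:}   \quad & \min_{y} \; \tfrac{1}{2}\,\|y - b\|_2^2
  \quad \text{s.t.} \quad |a_j^\top y| \le \lambda \;\; \text{for all } j
\end{align*}
```

Dropping dual constraint j from the active set is equivalent to fixing w_j = 0, so solving the dual subproblem over the active set solves the primal over the corresponding subset of variables.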
Slide 25
Single Machine, Data in Memory
[Figure: relative suboptimality vs. time (s) on the high-dimensional RCV1 dataset; methods: ProxNewt, CD, L1_LR, LIBLINEAR, GLMNET, Blitz; active sets vs. no prioritization]
Slide 26
Limited Memory Setting
Data cannot always fit in memory.
Active set methods require only a subset of the data at each iteration to solve the subproblem.
Set-up (sketched below):
One pass over the data to load the active set
Solve the subproblem with the active set in memory
Repeat
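Here is a minimal sketch of this loop for the lasso, assuming the design matrix is stored on disk as a .npy file. The memory-mapped I/O, the |a_j^T residual| scoring rule, and scikit-learn's Lasso as the subproblem solver are stand-ins chosen for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def limited_memory_lasso(path, b, lam, k, n_rounds=10, block=1000):
    A = np.load(path, mmap_mode="r")      # n x d design matrix on disk
    n, d = A.shape
    w = np.zeros(d)
    residual = b.copy()                   # residual b - A @ w for w = 0
    for _ in range(n_rounds):
        # One pass over the data: score feature j by |a_j^T residual|,
        # reading one block of columns from disk at a time.
        scores = np.empty(d)
        for start in range(0, d, block):
            cols = np.asarray(A[:, start:start + block])
            scores[start:start + block] = np.abs(cols.T @ residual)
        active = np.sort(np.argsort(scores)[-k:])  # top-k features
        A_active = np.asarray(A[:, active])        # only this is in memory
        # Solve the lasso subproblem over the active features.
        # (scikit-learn scales the L1 term by 1/n, hence alpha = lam / n.)
        sub = Lasso(alpha=lam / n, fit_intercept=False).fit(A_active, b)
        w[:] = 0.0
        w[active] = sub.coef_
        residual = b - A_active @ sub.coef_
    return w
```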
Slide 27
Limited Memory Setting
[Figure: relative suboptimality vs. time (s) with the 12 GB Webspam dataset and 1 GB of memory; methods: AdaGrad_1.0, AdaGrad_10.0, AdaGrad_100.0, CD, Strong Rule, Blitz; prioritized memory usage vs. no prioritization]
Slide 28
Distributed Setting
With more than one machine, communication is costly.
Blitz subproblems require communication for only the active set features.
Set-up:
Solve with synchronous bulk gradient descent
Prioritize communication using active sets
Slide 29
Distributed Setting
[Figure: relative suboptimality vs. time (min) with the Criteo CTR dataset and 16 machines; methods: Gradient Descent, KKT Filter, Blitz; prioritized communication vs. no prioritization]
Slide 30
Takeaways
Active sets are effective at exploiting structure!
We have introduced Blitz, an active set algorithm that:
Provides novel, useful theoretical guarantees
Is very fast in practice
Future work:
Extensions to a larger variety of problems
Modifications such as constraint sampling
Thanks!
Slide 31
References
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
Fan, R. E., Chen, P. H., and Lin, C. J. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918, 2005.
Fercoq, O. and Richtárik, P. Accelerated, parallel and proximal coordinate descent. Technical Report arXiv:1312.5799, 2013.
Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
Ghaoui, L. E., Viallon, V., and Rabbani, T. Safe feature elimination for the lasso and sparse supervised learning problems. Pacific Journal of Optimization, 8(4):667–698, 2012.
Kim, H. and Park, H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008.
Kim, S. J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale L1-regularized least squares. IEEE Journal on Selected Topics in Signal Processing, 1(4):606–617, 2007.
Koh, K., Kim, S. J., and Boyd, S. An interior-point method for large-scale L1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.
Li, M., Smola, A., and Andersen, D. G. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems 27, 2014.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society, Series B, 74(2):245–266, 2012.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
Yuan, G. X., Ho, C. H., and Lin, C. J. An improved GLMNET for L1-regularized logistic regression. Journal of Machine Learning Research, 13:1999–2030, 2012.
Slide 32
Active Set Algorithm
1. Until x is feasible do:
2.   Propose an active set of important constraints
3.   x ← minimizer of the objective s.t. only the active set
Slide 33
Computing the y-Update
Computing the y-update is a 1D optimization problem.
In the worst case, it can be solved with the bisection method (sketched below).
For the linear case, the solution is simpler.
It requires considering all constraints.
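A sketch of the worst-case bisection approach, assuming only a feasibility oracle for the constraint set (function names and the tolerance are my own). Because the feasible set is convex and y is feasible, the feasible step sizes along [y, x] form an interval [0, α*], so bisection applies:

```python
import numpy as np

def y_update_bisection(y, x, feasible, tol=1e-10):
    """Largest step alpha in [0, 1] keeping y + alpha*(x - y) feasible."""
    d = x - y
    if feasible(x):
        return x.copy()   # the whole segment is feasible: alpha = 1
    lo, hi = 0.0, 1.0     # invariant: feasible at lo, infeasible at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(y + mid * d):
            lo = mid
        else:
            hi = mid
    return y + lo * d
```

In the linear case Ax ≤ b the step is closed-form: with d = x − y, α* = min(1, min over {j : a_j^T d > 0} of (b_j − a_j^T y) / (a_j^T d)), which is why computing the update requires considering every constraint once.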
Slide 34
Single Machine, Data in Memory
[Figure: support set recall and support set precision vs. time (s) on the high-dimensional RCV1 dataset; methods: ProxNewt, CD, L1_LR, LIBLINEAR, GLMNET, Blitz; active sets vs. no prioritization]