Accelerated, Parallel and PROXimal coordinate descent IPAM February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)


Accelerated, Parallel and PROXimal coordinate descent

IPAM, February 2014

APPROX

Peter Richtárik

(Joint work with Olivier Fercoq - arXiv:1312.5799)


Contributions


Variants of Randomized Coordinate Descent Methods

• Block
  – can operate on “blocks” of coordinates
  – as opposed to just on individual coordinates

• General
  – applies to “general” (= smooth convex) functions
  – as opposed to special ones such as quadratics

• Proximal
  – admits a “nonsmooth regularizer” that is kept intact in solving subproblems
  – regularizer is neither smoothed nor approximated

• Parallel
  – operates on multiple blocks / coordinates in parallel
  – as opposed to just 1 block / coordinate at a time

• Accelerated
  – achieves O(1/k^2) convergence rate for convex functions
  – as opposed to O(1/k)

• Efficient
  – avoids adding two full feature vectors


Brief History of Randomized Coordinate Descent Methods

+ new long stepsizes


Introduction


I. Block Structure

II. Block Sampling

III. Proximal Setup

IV. Fast or Normal?


I. Block Structure

N = # coordinates (variables)

n = # blocks


II. Block Sampling

Block sampling

Average # blocks selected by the sampling
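A block sampling can be pictured as a random subset of the n blocks, and its average size can be checked numerically. The sketch below is illustrative only: the two samplings shown (a uniformly random subset of fixed size tau, often called tau-nice, and an independent sampling) and the helper names are assumptions, not taken from the slides.

```python
import random

def tau_nice_sampling(n, tau, rng):
    """Return a uniformly random subset of {0, ..., n-1} with exactly tau blocks."""
    return set(rng.sample(range(n), tau))

def independent_sampling(n, p, rng):
    """Include each block independently with probability p (average size n * p)."""
    return {i for i in range(n) if rng.random() < p}

def average_cardinality(draw, trials=20000):
    """Monte Carlo estimate of the average number of blocks selected."""
    return sum(len(draw()) for _ in range(trials)) / trials

rng = random.Random(0)
n, tau = 10, 3
avg_nice = average_cardinality(lambda: tau_nice_sampling(n, tau, rng))
avg_indep = average_cardinality(lambda: independent_sampling(n, tau / n, rng))
```

Both samplings select tau blocks on average, but only the first selects exactly tau blocks in every draw.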


III. Proximal Setup

Loss: Convex & Smooth

Regularizer: Convex & Nonsmooth
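The setup can be written as a composite problem. The following is a sketch under assumed notation (f for the loss, ψ for the regularizer, x^{(i)} for the i-th block of x); the symbols are not recoverable from the extracted slide and should be checked against arXiv:1312.5799.

```latex
\min_{x \in \mathbb{R}^N} \; F(x) := f(x) + \psi(x),
\qquad
\psi(x) = \sum_{i=1}^{n} \psi_i\left(x^{(i)}\right)
```

Here f is convex and smooth, and each ψ_i is convex but may be nonsmooth.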


III. Proximal Setup

Loss Functions: Examples

• Quadratic loss
• L-infinity
• L1 regression
• Exponential loss
• Logistic loss
• Square hinge loss

BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13
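For concreteness, three of the smooth losses listed above can be written as scalar functions of a prediction t = a^T x and a label or target b. This is a minimal sketch using common textbook forms; the exact scaling and sign conventions on the slide are not recoverable, so the formulas and function names here are assumptions.

```python
import math

def quadratic(t, b):
    """Quadratic loss 0.5 * (t - b)^2 and its derivative with respect to t."""
    return 0.5 * (t - b) ** 2, t - b

def logistic(t, b):
    """Logistic loss log(1 + exp(-b * t)) and its derivative with respect to t."""
    m = -b * t
    # log(1 + exp(m)) evaluated in a numerically stable way
    value = max(m, 0.0) + math.log1p(math.exp(-abs(m)))
    if m >= 0:
        sigma = 1.0 / (1.0 + math.exp(-m))
    else:
        sigma = math.exp(m) / (1.0 + math.exp(m))
    return value, -b * sigma

def square_hinge(t, b):
    """Square hinge loss 0.5 * max(0, 1 - b * t)^2 and its derivative with respect to t."""
    slack = max(0.0, 1.0 - b * t)
    return 0.5 * slack ** 2, -b * slack
```

Each function returns the pair (value, derivative), which is all a coordinate descent step needs from the loss.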


III. Proximal Setup

Regularizers: Examples

• No regularizer
• Weighted L1 norm (e.g., LASSO)
• Weighted L2 norm
• Box constraints (e.g., SVM dual)
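Each regularizer above leads to a simple one-dimensional proximal step when blocks have size 1. The sketch below shows the standard closed forms; the function names, the weight `lam`, and the reading of "weighted L2 norm" as a squared (ridge) penalty are assumptions made for illustration.

```python
def prox_none(u, step):
    """No regularizer: the proximal step leaves the point unchanged."""
    return u

def prox_l1(u, step, lam=1.0):
    """Soft-thresholding: prox of lam * |x| with stepsize `step`."""
    threshold = step * lam
    if u > threshold:
        return u - threshold
    if u < -threshold:
        return u + threshold
    return 0.0

def prox_l2_squared(u, step, lam=1.0):
    """Prox of (lam / 2) * x^2 (assumed squared form): shrink toward zero."""
    return u / (1.0 + step * lam)

def prox_box(u, step, lo=0.0, hi=1.0):
    """Box constraint lo <= x <= hi: projection onto the interval."""
    return min(max(u, lo), hi)
```

For example, `prox_l1(3.0, 0.5, 2.0)` shrinks 3.0 by the threshold 0.5 * 2.0, and `prox_box` ignores the stepsize because a projection does not depend on it.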


The Algorithm


APPROX

Olivier Fercoq and P.R. Accelerated, parallel and proximal coordinate descent, arXiv:1312.5799, December 2013
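The algorithm itself did not survive extraction. What follows is a naive sketch of an APPROX-style iteration for the LASSO problem 0.5 * ||Ax - b||^2 + lam * ||x||_1 with blocks of size 1, written from my reading of arXiv:1312.5799; it should be checked against the paper. The function names, the dense list-of-rows matrix, the tau-nice sampling, and the stepsize formula for v are assumptions, every column of A is assumed to be nonzero, and this form deliberately uses full-vector operations that the efficient variant avoids.

```python
import math
import random

def lasso_objective(A, b, lam, x):
    """F(x) = 0.5 * ||Ax - b||^2 + lam * ||x||_1 for a dense list-of-rows A."""
    residual = [sum(a * xi for a, xi in zip(row, x)) - bj for row, bj in zip(A, b)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(xi) for xi in x)

def approx_lasso(A, b, lam, tau, iters, seed=0):
    """Naive APPROX-style sketch; updates tau random coordinates per iteration."""
    m, n = len(A), len(A[0])
    rng = random.Random(seed)
    # Assumed stepsize parameters: omega[j] counts the nonzeros in row j.
    omega = [sum(1 for a in row if a != 0.0) for row in A]
    v = [sum((1.0 + (omega[j] - 1) * (tau - 1) / max(1, n - 1)) * A[j][i] ** 2
             for j in range(m)) for i in range(n)]
    x = [0.0] * n
    z = [0.0] * n
    theta = tau / n
    for _ in range(iters):
        # Extrapolated point y = (1 - theta) * x + theta * z.
        y = [(1.0 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        residual = [sum(A[j][i] * y[i] for i in range(n)) - b[j] for j in range(m)]
        z_new = list(z)
        for i in rng.sample(range(n), tau):
            grad_i = sum(A[j][i] * residual[j] for j in range(m))
            step = tau / (n * theta * v[i])
            u = z[i] - step * grad_i
            # Proximal step for lam * |.| (soft-thresholding).
            z_new[i] = math.copysign(max(abs(u) - step * lam, 0.0), u)
        x = [yi + (n / tau) * theta * (zn - zo) for yi, zn, zo in zip(y, z_new, z)]
        z = z_new
        theta = 0.5 * (math.sqrt(theta ** 4 + 4.0 * theta ** 2) - theta ** 2)
    return x

# Small illustrative problem (made-up data).
A = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]]
b = [0.0, -1.5, 2.5]
x_hat = approx_lasso(A, b, lam=0.1, tau=2, iters=2000)
```

Recomputing the residual from scratch in every iteration keeps the sketch short; it is exactly the full-dimensional work that a practical implementation would avoid.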


Part B: GRADIENT METHODS

B1: GRADIENT DESCENT

B2: PROJECTED GRADIENT DESCENT

B3: PROXIMAL GRADIENT DESCENT (ISTA)

B4: FAST PROXIMAL GRADIENT DESCENT (FISTA)

Part C: RANDOMIZED COORDINATE DESCENT

C1: PROXIMAL COORDINATE DESCENT

C2: PARALLEL COORDINATE DESCENT

C3: DISTRIBUTED COORDINATE DESCENT

C4: FAST PARALLEL COORDINATE DESCENT (new)

Olivier Fercoq and P.R. Accelerated, parallel and proximal coordinate descent, arXiv:1312.5799, Dec 2013


PCDM

P.R. and Martin Takac. Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, December 2012

IMA Fox Prize in Numerical Analysis, 2013


2D Example


Convergence Rate


average # coordinates updated / iteration

# blocks

# iterations

implies

Theorem [Fercoq & R. 12/2013]
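The statement of the theorem was lost in extraction. A sketch of the bound as I recall it from arXiv:1312.5799 is given below, with k the number of iterations, n the number of blocks, and τ the average number of blocks updated per iteration; the constants and the notation (x_0 for the starting point, x^* for a minimizer, v for the stepsize parameters) should be verified against the paper.

```latex
\mathbb{E}\left[F(x_k) - F^*\right]
\le \frac{4 n^2}{\left((k-1)\tau + 2n\right)^2}\, C,
\qquad
C = \left(1 - \frac{\tau}{n}\right)\left(F(x_0) - F^*\right)
  + \frac{1}{2}\,\|x_0 - x^*\|_v^2
```

Read this way, the right-hand side falls below ε once k is of order (n/τ)·sqrt(C/ε), which is the O(1/k^2) behaviour claimed for the accelerated method.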


Special Case: Fully Parallel Variant

all blocks are updated in each iteration

# normalized weights (summing to n)

# iterations

implies


New Stepsizes


Expected Separable Overapproximation (ESO): How to Choose Block Stepsizes?

P.R. and Martin Takac. Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, December 2012

Olivier Fercoq and P.R. Smooth minimization of nonsmooth functions by parallel coordinate descent methods, arXiv:1309.5885, September 2013 (SPCDM)

P.R. and Martin Takac. Distributed coordinate descent methods for learning with big data, arXiv:1310.2059, October 2013
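The ESO inequality itself is missing from the extracted slide. A sketch of its usual form in the papers cited above is given below; the notation is assumed: Ŝ is the block sampling, h_{[Ŝ]} keeps only the blocks of h selected by Ŝ, and v = (v_1, ..., v_n) are the block stepsize parameters being chosen.

```latex
\mathbb{E}\left[ f\left(x + h_{[\hat{S}]}\right) \right]
\le f(x) + \frac{\mathbb{E}\left[|\hat{S}|\right]}{n}
\left( \langle \nabla f(x), h \rangle + \frac{1}{2}\,\|h\|_v^2 \right),
\qquad
\|h\|_v^2 = \sum_{i=1}^{n} v_i \,\|h^{(i)}\|^2
```

Smaller admissible values of v_i correspond to longer steps, which is why the choice of v matters for the speed of the method.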


Assumptions: Function f

Example:

(a)

(b)

(c)


Visualizing Assumption (c)


New ESO

Theorem (Fercoq & R. 12/2013)

(i)

(ii)


Comparison with Other Stepsizes for Parallel Coordinate Descent Methods

Example:


Complexity for New Stepsizes

Average degree of separability

“Average” of the Lipschitz constants

With the new stepsizes, we have:
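The formula on this slide was lost. For the fully parallel case, the abstract of arXiv:1312.5799 states the rate sketched below, where ω̄ is the average degree of separability, L̄ is the average of the Lipschitz constants, and R is the distance of the starting point from the minimizer; whether the slide showed exactly this expression is an assumption.

```latex
F(x_k) - F^* \le \frac{2\,\bar{\omega}\,\bar{L}\,R^2}{(k+1)^2}
```

The bound depends on the average degree of separability rather than the maximum one.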


Work in 1 Iteration


Cost of 1 Iteration of APPROX

Assume N = n (all blocks are of size 1) and that

Sparse matrix

Scalar function: derivative = O(1)

Then the average cost of 1 iteration of APPROX is

arithmetic ops

= average # nonzeros in a column of A


Bottleneck: Computation of Partial Derivatives

maintained
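The general idea of keeping a quantity "maintained" so that a partial derivative costs only as much as one sparse column can be sketched as follows. This illustrates the idea for the quadratic loss 0.5 * ||Ax - b||^2 with a plain coordinate update; it is not the APPROX-specific bookkeeping, and the column storage format and function names are assumptions.

```python
def partial_derivative(columns, residual, i):
    """i-th partial derivative of 0.5 * ||Ax - b||^2 using the maintained residual Ax - b.

    columns[i] is a list of (row index, value) pairs for the nonzeros of column i.
    """
    return sum(a_ji * residual[j] for j, a_ji in columns[i])

def update_coordinate(columns, residual, x, i, delta):
    """Apply x[i] += delta and refresh only the residual entries touched by column i."""
    x[i] += delta
    for j, a_ji in columns[i]:
        residual[j] += a_ji * delta

# Made-up 3 x 3 example: A = [[2, 1, 0], [0, 1, 1], [1, 0, 3]], b = [0, -1.5, 2.5].
columns = [[(0, 2.0), (2, 1.0)], [(0, 1.0), (1, 1.0)], [(1, 1.0), (2, 3.0)]]
x = [0.0, 0.0, 0.0]
residual = [0.0, 1.5, -2.5]  # A x - b at x = 0
g2 = partial_derivative(columns, residual, 2)
update_coordinate(columns, residual, x, 2, 0.5)
```

Both functions touch only the nonzeros of one column, so their cost matches the "average # nonzeros in a column of A" figure rather than the full dimension.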


Preliminary Experiments


L1 Regularized L1 Regression

Dorothea dataset:

Gradient Method

Nesterov’s Accelerated Gradient Method

SPCDM

APPROX



L1 Regularized Least Squares (LASSO)

KDDB dataset:

PCDM

APPROX


Training Linear SVMs

Malicious URL dataset:


Importance Sampling


with Importance Sampling

Zheng Qu and P.R. Accelerated coordinate descent with importance sampling, Manuscript 2014

P.R. and Martin Takac. On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013


Convergence Rate

Theorem [Qu & R. 2014]


Serial Case: Optimal Probabilities

Nonuniform serial sampling:

Optimal Probabilities

Uniform Probabilities


Extra 40 Slides