On Dropping Convexity for Faster Optimization
Sujay Sanghavi (UT Austin)
Srinadh Bhojanapalli (UT Austin → TTI Chicago)
Anastasios Kyrillidis (UT Austin)
Dohyung Park (UT Austin)
Motivation
Sample problem: matrix completion
[Figure: a partially observed ratings matrix (users as rows), factored as X ≈ U V']
Data size: Õ(nr)
Output size: Õ(nr)
Convex optimization variable size: Õ(n²)
A Comparison
[Figure: ratio of success vs. fraction of observations, comparing AltMin and the nuclear norm approach]
Nuclear norm approach: a leading theoretical approach …
… and empirically often statistically worse …
Empirically, AltMin has similar sample complexity and better computational complexity.
(Figure: Praneeth Netrapalli, "Provable Matrix Completion using Alternating Minimization")
Similar stories in phase retrieval, matrix regression, …
Step 1: Semidefinite Optimization
$\min_X \; f(X) \quad \text{s.t.} \quad X \succeq 0$
convex, nice ..
Natural method: projected gradient descent
$X_+ \leftarrow \mathcal{P}_+\big(X - \eta\,\nabla f(X)\big)$
“First-order oracle access to f ”
Projection onto the psd cone: computationally intensive.
η: step size
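A rough sketch of this method in NumPy (not the talk's code); the eigendecomposition inside the PSD projection is what makes each iteration expensive:

```python
import numpy as np

def project_psd(X):
    """Project a symmetric matrix onto the PSD cone (clip negative eigenvalues)."""
    vals, vecs = np.linalg.eigh(0.5 * (X + X.T))
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def projected_gradient_descent(grad_f, X0, eta, num_iters=200):
    """X <- P_+(X - eta * grad f(X)), using only first-order oracle access to f."""
    X = X0.copy()
    for _ in range(num_iters):
        X = project_psd(X - eta * grad_f(X))
    return X
```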
First order oracle access
Access to the function is only as follows: given a query point X, the oracle returns the value f(X) and the gradient ∇f(X).
Oracle access is a standard abstraction in the study of methods in convex optimization
Typical result: if f satisfies <properties> then convergence rate of <method that uses first order oracle> is <…>
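A minimal sketch of what this abstraction looks like in code (the names below are illustrative, not from the talk):

```python
from typing import Callable, Tuple

import numpy as np

# A first-order oracle: given a query point X, it returns (f(X), grad f(X))
# and reveals nothing else about the structure of f.
FirstOrderOracle = Callable[[np.ndarray], Tuple[float, np.ndarray]]

def quadratic_oracle(X_star: np.ndarray) -> FirstOrderOracle:
    """Example oracle for f(X) = 0.5 * ||X - X_star||_F^2."""
    def oracle(X: np.ndarray) -> Tuple[float, np.ndarray]:
        return 0.5 * np.linalg.norm(X - X_star, "fro") ** 2, X - X_star
    return oracle
```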
Classic Result 1: Smoothness
Suppose f is M-smooth, i.e. $\|\nabla f(X) - \nabla f(Y)\|_F \le M \,\|X - Y\|_F$ for all X, Y.
Then, (projected) gradient descent with step size $\eta = 1/M$ converges at rate $O(1/k)$ in the objective.
Classic Result 2: Strong Convexity
Suppose f is strongly convex, i.e. its Hessian satisfies $m I \preceq \nabla^2 f(X) \preceq M I$ for all X.
Then, for gradient descent with step size $\eta = 1/M$,
the error in every step reduces by a factor $(1 - m/M)$.
So: the "best" choice of step size gives reduction by a factor $(1 - 1/\kappa)$,
where $\kappa = M/m$ is the condition number of f.
"linear convergence"
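To make the contraction factor concrete, here is a minimal worked example (not from the slides): a quadratic with Hessian $A$ satisfying $mI \preceq A \preceq MI$, minimized by gradient descent with step size $\eta = 1/M$.

```latex
% f(x) = (1/2) x^T A x, minimizer x^* = 0, with m I <= A <= M I.
x_{+} \;=\; x - \tfrac{1}{M} A x \;=\; \big(I - \tfrac{1}{M}A\big)\, x
\quad\Longrightarrow\quad
\|x_{+}\| \;\le\; \big\|I - \tfrac{1}{M}A\big\|_2 \,\|x\|
\;\le\; \Big(1 - \tfrac{m}{M}\Big)\|x\|
\;=\; \Big(1 - \tfrac{1}{\kappa}\Big)\|x\|.
```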
Effect of Condition Number
[Figure: contour plots over (x1, x2) ∈ [−10, 10]² of a well-conditioned function and a badly conditioned function]
Error decreases by $(1 - 1/\kappa)$ in every iteration (with the best step size).
Low $\kappa$: "well conditioned"
High $\kappa$: "badly conditioned"
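A small illustrative script (not from the talk) that reproduces this effect numerically: gradient descent on a 2-D quadratic with step size 1/M needs on the order of κ·log(1/ε) iterations, so the badly conditioned case is much slower.

```python
import numpy as np

def gd_iters(kappa, tol=1e-6, max_iter=100000):
    """Gradient descent on f(x) = 0.5 * x' A x with A = diag(1, kappa).

    Uses step size 1/M (here M = kappa); returns the number of iterations
    needed to reach ||x|| <= tol.
    """
    A = np.diag([1.0, kappa])
    eta = 1.0 / kappa            # step size 1/M
    x = np.array([10.0, 10.0])
    for k in range(max_iter):
        if np.linalg.norm(x) <= tol:
            return k
        x = x - eta * (A @ x)    # gradient of f is A x
    return max_iter

for kappa in [2, 10, 100]:
    print(f"kappa={kappa:4d}: {gd_iters(kappa)} iterations")
```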
Dropping Convexity
$X \succeq 0 \;\Longleftrightarrow\; \exists\, U \ \text{s.t.}\ X = UU'$
$\min_U \; f(UU')$   (U an n × n matrix)
This problem is "equivalent" to the original problem because of the identity above.
Non-convex, but "only" due to the UU' parameterization.
[Burer & Monteiro]: with linear f and constraints, eventual convergence to the correct answer, but no indication of how fast.
Factored Gradient Descent
= Gradient descent on g(U) := f(UU').
By the chain rule, $\nabla_U f(UU') = \big(\nabla f(UU') + \nabla f(UU')^\top\big)\, U$, so:
(Factored) Gradient descent:
$U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$
Again, first-order oracle access to f
No projection step …
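A minimal sketch of this update in NumPy (illustrative only; the toy objective, initialization, and step size below are assumptions, not the talk's experiments):

```python
import numpy as np

def factored_gradient_descent(grad_f, U0, eta, num_iters=500):
    """Factored gradient descent: U <- U - eta * grad_f(U U') U.

    grad_f : callable returning the (symmetric) gradient of f at X = U U'
    U0     : initial n x r factor
    eta    : step size (supplied by the caller; see the step-size slide)
    """
    U = U0.copy()
    for _ in range(num_iters):
        X = U @ U.T
        U = U - eta * (grad_f(X) @ U)
    return U

# Toy usage: f(X) = 0.5 * ||X - X_star||_F^2, so grad_f(X) = X - X_star.
rng = np.random.default_rng(0)
n, r = 20, 3
U_star = rng.normal(size=(n, r))
X_star = U_star @ U_star.T
U_hat = factored_gradient_descent(lambda X: X - X_star,
                                  U0=U_star + 0.1 * rng.normal(size=(n, r)),
                                  eta=1.0 / (4 * np.linalg.norm(X_star, 2)))
print(np.linalg.norm(U_hat @ U_hat.T - X_star, "fro"))
```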
Non-convexity: Issue 1
[Figure: contour and (log-scale) surface plots of f(UU') over a two-dimensional U, for −2 ≤ U ≤ 2]
For any rotation matrix R, i.e. a matrix such that $R R' = I$, we have that $f(UU') = f\big((UR)(UR)'\big)$.
Idea: new definition of distance: $\mathrm{dist}(U, V) = \min_{R\,:\,RR'=I} \|U - V R\|_F$.
“Only the contour level matters”
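A hedged sketch of such a rotation-invariant distance, computed via the orthogonal Procrustes solution (the exact definition used in the talk may differ in details):

```python
import numpy as np

def rotation_invariant_dist(U, V):
    """min over orthogonal R of ||U - V R||_F (orthogonal Procrustes)."""
    # The optimal R comes from the SVD of V' U.
    W, _, Zt = np.linalg.svd(V.T @ U)
    R = W @ Zt
    return np.linalg.norm(U - V @ R, "fro")

# Example: U and a rotated copy of U are at distance ~0,
# even though they differ entrywise.
rng = np.random.default_rng(1)
U = rng.normal(size=(5, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(rotation_invariant_dist(U, U @ R))   # approximately 0
```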
Non-convexity: Issue 2
g(U) = f(UU') has spurious stationary points, even when the original f is strongly convex
- saddle points, local minima, local maxima …
e.g. U = 0 is always a stationary point: $\nabla_U f(UU')\big|_{U=0} = 2\,\nabla f(0)\cdot 0 = 0$.
More generally, can have $\nabla f(UU')\,U = 0$ but $\nabla f(UU') \neq 0$.
Does it have bad local minima when U is n × n? We don't know …
[Figure: contour plots of the factored objective over U (left: −2 ≤ U ≤ 2; right: zoomed in near U ≈ −1)]
Non-convexity: Issue 2
Idea 1: look for local convergence, i.e. convergence once the iterate is close enough to an optimal factor. Note: still not "locally convex" in U space.
Idea 2: find a way to initialize (using first-order oracle)
Step size
Idea: let us find an upper bound on the Hessian of g(U) = f(UU') with respect to U, and use its inverse as the step size.
Special case (only for intuition): a separable function, i.e. one that decomposes over the entries of X.
… after some algebra, the bound depends on X and on the gradient of f.
And then set the step size to the inverse of this bound.
Step size: evaluated at the initial point.
Effect in this example: comparison with the step sizes of (Sa et al., 2014; Zheng and Lafferty, 2015; Tu et al., 2015).
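As a hedged illustration of a step size with this dependence (the functional form and the constant C below are assumptions, not the talk's exact rule):

```python
import numpy as np

def fgd_step_size(X0, grad_f, M, C=16.0):
    """Step size of the form 1 / (C * (M * ||X0||_2 + ||grad f(X0)||_2)).

    M is the smoothness constant of f; C is an assumed constant.
    The bound depends only on the initial point X0 and the gradient of f there.
    """
    return 1.0 / (C * (M * np.linalg.norm(X0, 2) +
                       np.linalg.norm(grad_f(X0), 2)))
```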
Summary so far …
Given: $\min_X \; f(X) \ \text{s.t.}\ X \succeq 0$
Convert to: $\min_U \; f(UU')$
Do factored gradient descent: $U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$
Idea: use the step size rule above.
Pushing further …
Artificially restrict the size of U to be n x r
$U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$
Reason 1: Computational
Smaller r = fewer variables, faster in every iteration
Reason 2: Statistical
Prevent over-fitting (in cases where f is a data-dependent loss function)
Issue 0: What does it converge to?
Let $X^\star$ be the optimum of the original problem, and consider $X^\star_r$, the matrix of its top r eigen-components.
We will show convergence of $UU'$ to $X^\star_r$.
In the following: let $U^\star_r$ be such that $X^\star_r = U^\star_r (U^\star_r)'$.
Restricted Strong Convexity
(Regular) strong convexity: $f(Y) \ge f(X) + \langle \nabla f(X),\, Y - X\rangle + \tfrac{m}{2}\|Y - X\|_F^2$ for all X, Y.
Restricted strong convexity (RSC): the above holds only for low-rank X, Y [Negahban et al.]
Weaker assumption on f – common in high-dimensional machine learning
E.g. matrix regression: $f(X) = \tfrac{1}{2}\sum_i \big(y_i - \langle A_i, X\rangle\big)^2$.
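A minimal sketch of such a matrix-regression objective and its gradient (the Gaussian measurement model below is an assumption, used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, num_meas = 30, 2, 600

# Ground-truth low-rank PSD matrix and random measurements y_i = <A_i, X_star>.
U_star = rng.normal(size=(n, r))
X_star = U_star @ U_star.T
A = rng.normal(size=(num_meas, n, n))
y = np.einsum("kij,ij->k", A, X_star)

def f(X):
    """f(X) = 0.5 * sum_i (y_i - <A_i, X>)^2"""
    resid = y - np.einsum("kij,ij->k", A, X)
    return 0.5 * np.sum(resid ** 2)

def grad_f(X):
    """grad f(X) = -sum_i (y_i - <A_i, X>) A_i, symmetrized."""
    resid = y - np.einsum("kij,ij->k", A, X)
    G = -np.einsum("k,kij->ij", resid, A)
    return 0.5 * (G + G.T)
```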
Main Result
Theorem (informal): with the step size choice as above and (m, M)-RSC, the distance of the next iterate to $U^\star_r$ shrinks by a constant factor,
provided the current iterate is close enough and r is appropriately chosen.
Linear convergence once close enough.
In practice: increase r in stages.
Initialization for Strongly Convex f
We propose:
1. Find the negative gradient at 0.
2. Keep at most the r most positive eigen-components (i.e. eigenvalues and their corresponding eigenvectors). Remove all negative eigen-components.
Requires one SVD.
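A minimal sketch of this initialization in NumPy (the square-root scaling of the kept eigenvalues is an assumption; the slides only specify which components to keep):

```python
import numpy as np

def initialize_U(grad_f, n, r):
    """Initialize from the negative gradient at 0.

    1. Compute -grad f(0).
    2. Keep at most the r most positive eigen-components, drop negative ones.
    Returns an n x r factor U0 (zero columns if fewer than r positive
    eigenvalues survive).
    """
    G = -grad_f(np.zeros((n, n)))
    G = 0.5 * (G + G.T)                 # symmetrize before eigendecomposition
    vals, vecs = np.linalg.eigh(G)      # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:r]    # indices of the top-r eigenvalues
    U0 = np.zeros((n, r))
    for j, i in enumerate(idx):
        if vals[i] > 0:                 # remove negative eigen-components
            # sqrt scaling is an assumed choice so that U0 U0' matches the kept part
            U0[:, j] = np.sqrt(vals[i]) * vecs[:, i]
    return U0
```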
Initialization
Theorem:
Specializations of this have already been used in matrix completion, phase retrieval, etc.
A strange phenomenon
Different convergence rates for f and for a shifted version of f:
[Figure: relative error $\|\hat X - X^\star\|_F / \|X^\star\|_F$ (log scale, $10^{-6}$ to $10^{0}$) vs. number of iterations (0 to 1000), for $\sigma_1/\sigma_3 = 100,\, 10,\, 5$]
Shift the function, get a different convergence behavior (!)
Smooth Convex Functions
Theorem: Local 1/k convergence rate: once the iterate is close enough, the objective gap decays as O(1/k).
Summary so far …
$\min_X \; f(X) \ \text{s.t.}\ X \succeq 0 \;\;\longrightarrow\;\; \min_U \; f(UU')$
Factored gradient descent: $U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$, with the step size rule above.
Restricted strongly convex f: 1. local linear convergence to $U^\star_r$; 2. an initialization scheme.
Smooth f: Local 1/k convergence
General (unconstrained) FGD
FGD: $U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$
Now, bigger uncertainty sets:
General (unconstrained) FGD
Immediate corollary: 1/k convergence for smooth f
But, cannot use this trick for strongly convex …
Strongly Convex
Smooth, strongly convex, global min at 0.
(Borrowing from [Tu et al., 2016])
Theorem: Local linear convergence to a neighborhood of the optimum.
Open Problems
1. Constraints
2. Acceleration
Summary
This work: factored gradient descent under the first-order oracle model
- a new step size rule
- local convergence rates for smooth, and for restricted strongly convex, functions
- a new initialization scheme
Implication: correctness + convergence rates for
phase recovery, matrix regression, matrix sensing, … and almost, for matrix completion …
Under similar statistical settings as already used in the analyses of convex optimization and Alt-Min.
Claim: convex optimization is a bad idea for statistical inference problems involving low-rank matrix estimation …
All of these already use (special cases of) our initialization …