On Dropping Convexity for Faster Optimization Sujay Sanghavi UT Austin


Page 1:

On Dropping Convexity for Faster Optimization

Sujay Sanghavi, UT Austin

Page 2:

Srinadh Bhojanapalli, UT Austin → TTI Chicago

Anastasios Kyrillidis, UT Austin

Dohyung Park, UT Austin

Page 3:

Motivation

Sample problem: matrix completion

[Figure: a users × items ratings matrix, approximately factored as $U V'$.]

Data size: $\tilde{O}(nr)$. Output size: $\tilde{O}(nr)$. Convex optimization, however, works with a full $n \times n$ matrix: size $\tilde{O}(n^2)$.

[Figure: "A Comparison" (from Praneeth Netrapalli, "Provable Matrix Completion using Alternating Minimization"): ratio of success vs. fraction of observations, for AltMin and the nuclear norm approach. The nuclear norm approach is a leading theoretical approach; empirically, AltMin has similar sample complexity and better computational complexity.]

… and empirically often statistically worse …

Similar stories in phase retrieval, matrix regression, …

Page 4:

Step 1: Semidefinite Optimization

$$\min_{X} \; f(X) \quad \text{s.t. } X \succeq 0$$

($f$ convex, nice ..)

Natural method: projected gradient descent

$$X^{+} \;\leftarrow\; P_{+}\!\left(X - \eta\, \nabla f(X)\right)$$

"First-order oracle access to $f$"

$P_{+}$: projection onto the PSD cone (computationally intensive); $\eta$: step size.
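As an aside on the cost of that projection: below is a minimal sketch (not from the talk; function names are illustrative) of computing $P_{+}$ via a full eigendecomposition, and of one projected gradient step.

```python
import numpy as np

def project_psd(X):
    """Euclidean projection of a symmetric matrix onto the PSD cone.

    Needs a full eigendecomposition, roughly O(n^3) per call, which is what
    makes projected gradient descent expensive for large n.
    """
    X = 0.5 * (X + X.T)                  # symmetrize against numerical round-off
    vals, vecs = np.linalg.eigh(X)
    vals = np.clip(vals, 0.0, None)      # zero out the negative eigenvalues
    return (vecs * vals) @ vecs.T

def projected_gradient_step(X, grad_f, eta):
    """One update X^+ = P_+( X - eta * grad_f(X) )."""
    return project_psd(X - eta * grad_f(X))
```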

Page 5:

First-order oracle access

Access to the function is only as follows: given a query point $X$, the oracle returns $f(X)$ and $\nabla f(X)$.

Oracle access is a standard abstraction in the study of methods in convex optimization.

Typical result: if $f$ satisfies <properties> then the convergence rate of <method that uses the first-order oracle> is <…>.

Page 6:

Classic Result 1: Smoothness

Suppose $f$ is $M$-smooth, i.e. $\|\nabla f(X) - \nabla f(Y)\|_F \le M\, \|X - Y\|_F$ for all $X, Y$.

Then, for (projected) gradient descent with step size $\eta = 1/M$, the suboptimality $f(X_k) - f(X^\star)$ decays at a sublinear $O(1/k)$ rate.

Page 7:

Classic Result 2: Strong Convexity

Suppose $f$ is strongly convex, i.e. the Hessian satisfies $m I \preceq \nabla^2 f(X) \preceq M I$.

Then for gradient descent with step size $\eta$, the error $\|X_k - X^\star\|$ in every step reduces by a factor of roughly $(1 - \eta m)$.

So: the "best" choice of step size gives a reduction by a factor $\left(1 - \tfrac{1}{\kappa}\right)$, where $\kappa = M/m$ is the condition number of $f$: "linear convergence".

Page 8:

Effect of Condition Number

[Figure: contour plots of two quadratic functions over $(x_1, x_2) \in [-10, 10]^2$: one with a low condition number ("well conditioned") and one with a high condition number ("badly conditioned").]

Error decreases by $\left(1 - 1/\kappa\right)$ in every iteration (with the best step size).
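To make the effect concrete, here is a small numerical sketch (not from the slides; the quadratic and the tolerance are illustrative choices) showing how the number of gradient descent iterations grows with the condition number $\kappa$, consistent with a per-step contraction of $(1 - 1/\kappa)$.

```python
import numpy as np

def gd_iterations(kappa, tol=1e-6, max_iter=100_000):
    """Gradient descent on f(x) = 0.5*(x1^2 + kappa*x2^2), whose condition number is kappa."""
    m, M = 1.0, float(kappa)             # smallest / largest Hessian eigenvalues
    eta = 1.0 / M                        # step size 1/M
    x = np.array([10.0, 10.0])           # the minimizer is at the origin
    for k in range(max_iter):
        if np.linalg.norm(x) <= tol:
            return k
        x = x - eta * np.array([m * x[0], M * x[1]])   # gradient step
    return max_iter

for kappa in (2, 10, 100):
    print(f"kappa = {kappa:3d}: {gd_iterations(kappa)} iterations to reach ||x - x*|| <= 1e-6")
```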

Page 9:

Dropping Convexity

$$\min_{U} \; f(UU')$$

($U$ is an $n \times n$ matrix)

This problem is "equivalent" to the original problem because
$$X \succeq 0 \;\Longleftrightarrow\; \exists\, U \ \text{s.t.}\ X = UU'.$$

It is non-convex, but "only" due to the $UU'$ parameterization.

[Burer & Monteiro]: with linear $f$, and constraints, eventual convergence to the correct answer, but no indication of how fast.

Page 10:

Factored Gradient Descent

= gradient descent on $g(U) := f(UU')$.

By the chain rule, $\nabla g(U) = \left(\nabla f(UU') + \nabla f(UU')^\top\right) U$, so (absorbing constants into the step size) the (factored) gradient descent update is

$$U^{+} \;\leftarrow\; U - \eta\, \nabla f(UU')\, U$$

Again, only first-order oracle access to $f$ is needed, and there is no projection step …
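A minimal Python sketch of this update (a sketch under the stated setup, not the authors' reference implementation): the gradient oracle `grad_f`, the rank of `U0`, and the fixed step size are all supplied by the caller, and the least-squares example at the bottom is purely illustrative.

```python
import numpy as np

def factored_gradient_descent(grad_f, U0, eta, num_iters=500):
    """Run the update U^+ <- U - eta * grad_f(U U') U.

    grad_f : callable taking an (n, n) matrix X and returning grad f(X), also (n, n).
    U0     : (n, r) initial factor.
    eta    : fixed step size.
    """
    U = U0.copy()
    for _ in range(num_iters):
        X = U @ U.T                        # current iterate X = U U'
        U = U - eta * (grad_f(X) @ U)      # factored gradient step; note: no PSD projection
    return U

# Illustrative example: f(X) = 0.5 * ||X - M||_F^2 for a fixed rank-r PSD target M,
# so grad f(X) = X - M.
rng = np.random.default_rng(0)
n, r = 50, 3
B = rng.standard_normal((n, r))
M = B @ B.T
U0 = 0.1 * rng.standard_normal((n, r))                  # small random start (illustrative)
eta = 1.0 / (4 * np.linalg.norm(M, 2))                  # conservative step size (illustrative)
U = factored_gradient_descent(lambda X: X - M, U0, eta)
print("relative error:", np.linalg.norm(U @ U.T - M) / np.linalg.norm(M))
```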

Page 11:

Non-convexity: Issue 1

[Figure: contour and surface plots of $f(UU^\top)$ as a function of the factor $U$.]

For any rotation matrix $R$, i.e. a matrix such that $R R' = I$, we have that $f\big((UR)(UR)'\big) = f(UU')$.

Idea: a new definition of distance: $\operatorname{dist}(U, V) = \min_{R:\, RR' = I} \|U - V R\|_F$.

"Only the contour level matters."
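A short sketch (assuming the rotation-invariant distance above; names are illustrative) of computing $\min_{R} \|U - VR\|_F$ via the orthogonal Procrustes problem, which takes one SVD.

```python
import numpy as np

def rotation_invariant_dist(U, V):
    """min over R with R R' = I of ||U - V R||_F, via the orthogonal Procrustes problem."""
    # The minimizing R is W Z' where V' U = W S Z' is a singular value decomposition.
    W, _, Zt = np.linalg.svd(V.T @ U)
    R = W @ Zt
    return np.linalg.norm(U - V @ R)

# Sanity check: a factor and a rotated copy of it are at distance (numerically) zero.
rng = np.random.default_rng(1)
U = rng.standard_normal((20, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random 3x3 orthogonal matrix
print(rotation_invariant_dist(U, U @ Q))           # ~1e-14
```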

Page 12:

Non-convexity: Issue 2

$g(U) = f(UU')$ has spurious stationary points, even for a strongly convex original $f$:
- saddle points, local minima, local maxima …

E.g. $U = 0$ is always a stationary point: $\nabla g(0) = \nabla f(0)\cdot 0 = 0$.

More generally, one can have $\nabla f(UU')\, U = 0$ but $\nabla f(UU') \neq 0$.

Does it have bad local minima when $U$ is $n \times n$? We don't know …

Page 13:

[Figure: contour plots of $f(UU^\top)$: the full view over $U \in [-2, 2]$, and a zoomed view near $U \approx -1$.]

Non-convexity: Issue 2

Idea 1: look for local convergence, i.e. convergence to the target once the iterate is close enough. Note: the problem is still not "locally convex" in $U$ space.

Idea 2: find a way to initialize (using the first-order oracle).

Page 14:

Step size

Idea: let us find (a bound on) the Hessian of $g(U) = f(UU')$ with respect to $U$, and then set the step size $\eta$ accordingly.

Special case (only for intuition): a separable function, such that …

… after some algebra … the bound depends on $X = UU'$ and on the gradient of $f$, and we then set $\eta$ to its inverse.
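To show the flavor only: a hedged sketch of a step size that depends on $X$ and $\nabla f(X)$ at the initial point. The exact rule and constants from the paper are not reproduced here; the form and the constant `c` below are placeholders.

```python
import numpy as np

def fgd_step_size(X0, grad_f_X0, M, c=16.0):
    """A step size of the form 1 / ( c * ( M * ||X0||_2 + ||grad f(X0)||_2 ) ).

    X0        : initial matrix X0 = U0 U0'.
    grad_f_X0 : gradient of f evaluated at X0.
    M         : smoothness constant of f.
    c         : absolute constant; the value here is a placeholder, not the paper's.
    """
    spec = lambda A: np.linalg.norm(A, 2)     # spectral norm, i.e. largest singular value
    return 1.0 / (c * (M * spec(X0) + spec(grad_f_X0)))
```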

Page 15:

Step size

The step size is computed at the initial point.

Effect: in this example, …

Comparison with the step sizes of (Sa et al., 2014; Zheng and Lafferty, 2015; Tu et al., 2015).

Page 16:

Summary so far …

Given
$$\min_{X} \; f(X) \quad \text{s.t. } X \succeq 0,$$
convert to
$$\min_{U} \; f(UU').$$

Do factored gradient descent:
$$U^{+} \;\leftarrow\; U - \eta\, \nabla f(UU')\, U$$

Idea: use the step size rule above.

Page 17:

Pushing further …

Artificially restrict the size of $U$ to be $n \times r$:
$$U^{+} \;\leftarrow\; U - \eta\, \nabla f(UU')\, U$$

Reason 1: Computational. A smaller $r$ means fewer variables, so every iteration is faster.

Reason 2: Statistical. It prevents over-fitting (in cases where $f$ is a data-dependent loss function).

Page 18:

Issue 0: What does it converge to?

In the following: let $X^\star$ be a minimizer of the original problem, and consider the matrix $X^\star_r$ formed from its top $r$ eigen-components.

We will show convergence of $UU'$ toward $X^\star_r$.

Page 19:

Restricted Strong Convexity

(Regular) strong convexity: for all $X, Y$,
$$f(Y) \;\ge\; f(X) + \langle \nabla f(X),\, Y - X\rangle + \tfrac{m}{2}\, \|Y - X\|_F^2.$$

Restricted strong convexity (RSC): the above holds only for low-rank $X, Y$ [Negahban et al.].

A weaker assumption on $f$ – common in high-dimensional machine learning.

E.g. matrix regression: $f(X) = \tfrac12\, \|\mathcal{A}(X) - b\|_2^2$ for a linear measurement operator $\mathcal{A}$.
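For concreteness, a small sketch of such a matrix regression objective (the measurement matrices, their number, and the data are synthetic and illustrative); its gradient can be fed directly to the FGD loop sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, num_meas = 30, 2, 600

# Ground-truth rank-r PSD matrix and random (symmetrized) measurement matrices A_k.
B = rng.standard_normal((n, r))
X_true = B @ B.T
A = rng.standard_normal((num_meas, n, n))
A = 0.5 * (A + A.transpose(0, 2, 1))
b = np.einsum('kij,ij->k', A, X_true)            # b_k = <A_k, X_true>

def f(X):
    """Matrix regression loss f(X) = 0.5 * sum_k ( <A_k, X> - b_k )^2."""
    resid = np.einsum('kij,ij->k', A, X) - b
    return 0.5 * (resid @ resid)

def grad_f(X):
    """Gradient of f: sum_k ( <A_k, X> - b_k ) * A_k."""
    resid = np.einsum('kij,ij->k', A, X) - b
    return np.einsum('k,kij->ij', resid, A)
```

With enough random measurements, objectives of this form are the standard examples satisfying RSC over low-rank matrices.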

Page 20:

Main Result

Theorem: With the step size choice as above, and $(m, M)$-RSC, the error of the next iterate contracts by a constant factor: linear convergence once close enough, provided $r$ is appropriately chosen.

In practice: increase $r$ in stages.
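A hedged sketch of the "increase r in stages" heuristic (the stage schedule and the small-column padding are my illustrative choices, not a prescription from the talk): run FGD at a small rank, widen $U$, and continue.

```python
import numpy as np

def staged_fgd(grad_f, n, ranks=(1, 2, 4), eta=1e-3, iters_per_stage=300, seed=0):
    """Factored gradient descent with the rank of U increased in stages."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n, ranks[0]))          # small start at the first rank
    for r in ranks:
        if U.shape[1] < r:                                # widen U with small fresh columns
            extra = 0.1 * rng.standard_normal((n, r - U.shape[1]))
            U = np.hstack([U, extra])
        for _ in range(iters_per_stage):
            U = U - eta * (grad_f(U @ U.T) @ U)           # same FGD update as before
    return U
```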

Page 21:

Initialization for Strongly Convex f

We propose:
1. Compute the negative gradient at $0$, i.e. $-\nabla f(0)$.
2. Keep at most the $r$ most positive eigen-components (i.e. values and their corresponding vectors); remove all negative eigen-components.

Requires one SVD.
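A minimal sketch of the proposed initialization as described above (for a symmetric matrix the eigendecomposition plays the role of the SVD; any additional scaling constant from the paper is omitted here).

```python
import numpy as np

def initialize_U(grad_f, n, r):
    """U0 from the top positive eigen-components of the negative gradient of f at zero."""
    G = -grad_f(np.zeros((n, n)))                  # 1. negative gradient at 0
    G = 0.5 * (G + G.T)                            # symmetrize before factorizing
    vals, vecs = np.linalg.eigh(G)                 # one eigendecomposition ("one SVD")
    top = np.argsort(vals)[::-1][:r]               # indices of the r largest eigenvalues
    keep = top[vals[top] > 0]                      # 2. keep only the positive ones
    return vecs[:, keep] * np.sqrt(vals[keep])     # U0 with U0 U0' = kept eigen-components

# Hypothetical wiring with the matrix regression grad_f sketched earlier:
# U0 = initialize_U(grad_f, n, r)
```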

Page 22:

Initialization


Theorem:

Specializations of this have already been used in matrix completion, phase retrieval, etc.

Page 23:

A strange phenomenon

Different convergence rates for a function and a shifted version of it:

[Figure: $\|\widehat{X} - X^\star\|_F / \|X^\star\|_F$ versus the number of iterations (0 to 1000), on a log scale down to $10^{-6}$, for $\sigma_1/\sigma_3 = 100$, $10$, and $5$.]

Shift the function, get a different convergence behavior (!)

Page 24:

Smooth Convex Functions

Theorem: Local $1/k$ convergence rate: the objective suboptimality after $k$ iterations is $O(1/k)$.

Page 25:

Summary so far …

$$\min_{X} \; f(X) \quad \text{s.t. } X \succeq 0 \qquad\longrightarrow\qquad \min_{U} \; f(UU')$$

Factored gradient descent, $U^{+} \leftarrow U - \eta\, \nabla f(UU')\, U$, with the step size $\eta$ as above.

Restricted strongly convex $f$: 1. local linear convergence to $X^\star_r$; 2. initialization.

Smooth $f$: local $1/k$ convergence.

Page 26:

General (unconstrained) FGD

Page 27:

General (unconstrained) FGD

FGD: $U^{+} \leftarrow U - \eta\, \nabla f(UU')\, U$

Now, bigger uncertainty sets: …

Page 28:

General (unconstrained) FGD

Immediate corollary: $1/k$ convergence for smooth $f$.

But, we cannot use this trick for strongly convex $f$ …

Page 29:

Strongly Convex

Smooth, strongly convex, global min at $0$.

(Borrowing from [Tu et al., 2016])

Theorem: Local linear convergence to a neighborhood of the target.

Page 30:

Open Problems

1. Constraints

2. Acceleration

Page 31:

Summary

This work: factored gradient descent under the first-order oracle model
- a new step size rule
- local convergence rates for smooth, and for restricted strongly convex, functions
- a new initialization scheme

Implication: correctness + convergence rates for phase recovery, matrix regression, matrix sensing … and almost, for matrix completion …

Under similar statistical settings as those already used in the analysis of convex optimization and Alt-Min.

Claim: convex optimization is a bad idea for statistical inference problems involving low-rank matrix estimation …

All of these already use (special cases of) our initialization …