On Dropping Convexity for Faster Optimization
Sujay Sanghavi (UT Austin)
Srinadh Bhojanapalli (UT Austin → TTI Chicago)
Anastasios Kyrillidis (UT Austin)
Dohyung Park (UT Austin)
Motivation
Sample problem: matrix completion
[Figure: a partially observed ratings matrix (users as rows), factored as X ≈ U V']
Data size: Õ(nr)
Output size: Õ(nr)
Convex optimization variable size: Õ(n²)
A Comparison
[Figure: ratio of success vs. fraction of observations, comparing AltMin and the nuclear norm approach]
Nuclear norm approach: a leading theoretical approach …
… and empirically often statistically worse …
Empirically, AltMin has similar sample complexity and better computational complexity.
(Figure: Praneeth Netrapalli, "Provable Matrix Completion using Alternating Minimization")
Similar stories in phase retrieval, matrix regression, …
Step 1: Semidefinite Optimization
$\min_X \; f(X) \quad \text{s.t.} \quad X \succeq 0$
convex, nice ..
Natural method: projected gradient descent
$X_+ \leftarrow \mathcal{P}_+\big(X - \eta\,\nabla f(X)\big)$
“First-order oracle access to f ”
Projection onto the psd cone: computationally intensive.
η: step size
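A rough sketch of this method in NumPy (not the talk's code); the eigendecomposition inside the PSD projection is what makes each iteration expensive:

```python
import numpy as np

def project_psd(X):
    """Project a symmetric matrix onto the PSD cone (clip negative eigenvalues)."""
    vals, vecs = np.linalg.eigh(0.5 * (X + X.T))
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def projected_gradient_descent(grad_f, X0, eta, num_iters=200):
    """X <- P_+(X - eta * grad f(X)), using only first-order oracle access to f."""
    X = X0.copy()
    for _ in range(num_iters):
        X = project_psd(X - eta * grad_f(X))
    return X
```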
First order oracle access
Access to the function is only as follows: given a query point X, the oracle returns the value f(X) and the gradient ∇f(X).
Oracle access is a standard abstraction in the study of methods in convex optimization
Typical result: if f satisfies <properties> then convergence rate of <method that uses first order oracle> is <…>
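A minimal sketch of what this abstraction looks like in code (the names below are illustrative, not from the talk):

```python
from typing import Callable, Tuple

import numpy as np

# A first-order oracle: given a query point X, it returns (f(X), grad f(X))
# and reveals nothing else about the structure of f.
FirstOrderOracle = Callable[[np.ndarray], Tuple[float, np.ndarray]]

def quadratic_oracle(X_star: np.ndarray) -> FirstOrderOracle:
    """Example oracle for f(X) = 0.5 * ||X - X_star||_F^2."""
    def oracle(X: np.ndarray) -> Tuple[float, np.ndarray]:
        return 0.5 * np.linalg.norm(X - X_star, "fro") ** 2, X - X_star
    return oracle
```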
Classic Result 1: Smoothness
Suppose f is M-smooth, i.e. $\|\nabla f(X) - \nabla f(Y)\|_F \le M \,\|X - Y\|_F$ for all X, Y.
Then, (projected) gradient descent with step size $\eta = 1/M$ converges at rate $O(1/k)$ in the objective.
Classic Result 2: Strong Convexity
Suppose f is strongly convex, i.e. its Hessian satisfies $m I \preceq \nabla^2 f(X) \preceq M I$ for all X.
Then, for gradient descent with step size $\eta = 1/M$,
the error in every step reduces by a factor $(1 - m/M)$.
So: the "best" choice of step size gives reduction by a factor $(1 - 1/\kappa)$,
where $\kappa = M/m$ is the condition number of f.
"linear convergence"
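To make the contraction factor concrete, here is a minimal worked example (not from the slides): a quadratic with Hessian $A$ satisfying $mI \preceq A \preceq MI$, minimized by gradient descent with step size $\eta = 1/M$.

```latex
% f(x) = (1/2) x^T A x, minimizer x^* = 0, with m I <= A <= M I.
x_{+} \;=\; x - \tfrac{1}{M} A x \;=\; \big(I - \tfrac{1}{M}A\big)\, x
\quad\Longrightarrow\quad
\|x_{+}\| \;\le\; \big\|I - \tfrac{1}{M}A\big\|_2 \,\|x\|
\;\le\; \Big(1 - \tfrac{m}{M}\Big)\|x\|
\;=\; \Big(1 - \tfrac{1}{\kappa}\Big)\|x\|.
```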
Effect of Condition Number
[Figure: contour plots over (x1, x2) ∈ [−10, 10]² of a well-conditioned function and a badly conditioned function]
Error decreases by $(1 - 1/\kappa)$ in every iteration (with the best step size).
Low $\kappa$: "well conditioned"
High $\kappa$: "badly conditioned"
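A small illustrative script (not from the talk) that reproduces this effect numerically: gradient descent on a 2-D quadratic with step size 1/M needs on the order of κ·log(1/ε) iterations, so the badly conditioned case is much slower.

```python
import numpy as np

def gd_iters(kappa, tol=1e-6, max_iter=100000):
    """Gradient descent on f(x) = 0.5 * x' A x with A = diag(1, kappa).

    Uses step size 1/M (here M = kappa); returns the number of iterations
    needed to reach ||x|| <= tol.
    """
    A = np.diag([1.0, kappa])
    eta = 1.0 / kappa            # step size 1/M
    x = np.array([10.0, 10.0])
    for k in range(max_iter):
        if np.linalg.norm(x) <= tol:
            return k
        x = x - eta * (A @ x)    # gradient of f is A x
    return max_iter

for kappa in [2, 10, 100]:
    print(f"kappa={kappa:4d}: {gd_iters(kappa)} iterations")
```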
Dropping Convexity
$X \succeq 0 \;\Longleftrightarrow\; \exists\, U \ \text{s.t.}\ X = UU'$
$\min_U \; f(UU')$   (U an n × n matrix)
This problem is "equivalent" to the original problem because of the identity above.
Non-convex, but "only" due to the UU' parameterization.
[Burer & Monteiro]: with linear f and constraints, eventual convergence to the correct answer, but no indication of how fast.
Factored Gradient Descent
= Gradient descent on g(U) := f(UU').
By the chain rule, $\nabla_U f(UU') = \big(\nabla f(UU') + \nabla f(UU')^\top\big)\, U$, so:
(Factored) Gradient descent:
$U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$
Again, first-order oracle access to f
No projection step …
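A minimal sketch of this update in NumPy (illustrative only; the toy objective, initialization, and step size below are assumptions, not the talk's experiments):

```python
import numpy as np

def factored_gradient_descent(grad_f, U0, eta, num_iters=500):
    """Factored gradient descent: U <- U - eta * grad_f(U U') U.

    grad_f : callable returning the (symmetric) gradient of f at X = U U'
    U0     : initial n x r factor
    eta    : step size (supplied by the caller; see the step-size slide)
    """
    U = U0.copy()
    for _ in range(num_iters):
        X = U @ U.T
        U = U - eta * (grad_f(X) @ U)
    return U

# Toy usage: f(X) = 0.5 * ||X - X_star||_F^2, so grad_f(X) = X - X_star.
rng = np.random.default_rng(0)
n, r = 20, 3
U_star = rng.normal(size=(n, r))
X_star = U_star @ U_star.T
U_hat = factored_gradient_descent(lambda X: X - X_star,
                                  U0=U_star + 0.1 * rng.normal(size=(n, r)),
                                  eta=1.0 / (4 * np.linalg.norm(X_star, 2)))
print(np.linalg.norm(U_hat @ U_hat.T - X_star, "fro"))
```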
Non-convexity: Issue 1
[Figure: contour and (log-scale) surface plots of f(UU') over a two-dimensional U, for −2 ≤ U ≤ 2]
For any rotation matrix R, i.e. a matrix such that $R R' = I$, we have that $f(UU') = f\big((UR)(UR)'\big)$.
Idea: new definition of distance: $\mathrm{dist}(U, V) = \min_{R\,:\,RR'=I} \|U - V R\|_F$.
“Only the contour level matters”
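A hedged sketch of such a rotation-invariant distance, computed via the orthogonal Procrustes solution (the exact definition used in the talk may differ in details):

```python
import numpy as np

def rotation_invariant_dist(U, V):
    """min over orthogonal R of ||U - V R||_F (orthogonal Procrustes)."""
    # The optimal R comes from the SVD of V' U.
    W, _, Zt = np.linalg.svd(V.T @ U)
    R = W @ Zt
    return np.linalg.norm(U - V @ R, "fro")

# Example: U and a rotated copy of U are at distance ~0,
# even though they differ entrywise.
rng = np.random.default_rng(1)
U = rng.normal(size=(5, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(rotation_invariant_dist(U, U @ R))   # approximately 0
```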
Non-convexity: Issue 2
g(U) = f(UU') has spurious stationary points, even when the original f is strongly convex
- saddle points, local minima, local maxima …
e.g. U = 0 is always a stationary point: $\nabla_U f(UU')\big|_{U=0} = 2\,\nabla f(0)\cdot 0 = 0$.
More generally, can have $\nabla f(UU')\,U = 0$ but $\nabla f(UU') \neq 0$.
Does it have bad local minima when U is n × n? We don't know …
[Figure: contour plots of the factored objective over U (left: −2 ≤ U ≤ 2; right: zoomed in near U ≈ −1)]
Non-convexity: Issue 2
Idea 1: look for local convergence, i.e. convergence once the iterate is close enough to an optimal factor. Note: still not "locally convex" in U space.
Idea 2: find a way to initialize (using first-order oracle)
Step size
Idea: let us find an upper bound on the Hessian of g(U) = f(UU') with respect to U, and use its inverse as the step size.
Special case (only for intuition): a separable function, i.e. one that decomposes over the entries of X.
… after some algebra, the bound depends on X and on the gradient of f.
And then set the step size to the inverse of this bound.
Step size: evaluated at the initial point.
Effect in this example: comparison with the step sizes of (Sa et al., 2014; Zheng and Lafferty, 2015; Tu et al., 2015).
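As a hedged illustration of a step size with this dependence (the functional form and the constant C below are assumptions, not the talk's exact rule):

```python
import numpy as np

def fgd_step_size(X0, grad_f, M, C=16.0):
    """Step size of the form 1 / (C * (M * ||X0||_2 + ||grad f(X0)||_2)).

    M is the smoothness constant of f; C is an assumed constant.
    The bound depends only on the initial point X0 and the gradient of f there.
    """
    return 1.0 / (C * (M * np.linalg.norm(X0, 2) +
                       np.linalg.norm(grad_f(X0), 2)))
```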
Summary so far …
Given: $\min_X \; f(X) \ \text{s.t.}\ X \succeq 0$
Convert to: $\min_U \; f(UU')$
Do factored gradient descent: $U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$
Idea: use the step size rule above.
Pushing further …
Artificially restrict the size of U to be n x r
$U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$
Reason 1: Computational
Smaller r = fewer variables, faster in every iteration
Reason 2: Statistical
Prevent over-fitting (in cases where f is a data-dependent loss function)
Issue 0: What does it converge to?
Let $X^\star$ be the optimum of the original problem, and consider $X^\star_r$, the matrix of its top r eigen-components.
We will show convergence of $UU'$ to $X^\star_r$.
In the following: let $U^\star_r$ be such that $X^\star_r = U^\star_r (U^\star_r)'$.
Restricted Strong Convexity
(Regular) strong convexity: $f(Y) \ge f(X) + \langle \nabla f(X),\, Y - X\rangle + \tfrac{m}{2}\|Y - X\|_F^2$ for all X, Y.
Restricted strong convexity (RSC): the above holds only for low-rank X, Y [Negahban et al.]
Weaker assumption on f – common in high-dimensional machine learning
E.g. matrix regression: $f(X) = \tfrac{1}{2}\sum_i \big(y_i - \langle A_i, X\rangle\big)^2$.
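A minimal sketch of such a matrix-regression objective and its gradient (the Gaussian measurement model below is an assumption, used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, num_meas = 30, 2, 600

# Ground-truth low-rank PSD matrix and random measurements y_i = <A_i, X_star>.
U_star = rng.normal(size=(n, r))
X_star = U_star @ U_star.T
A = rng.normal(size=(num_meas, n, n))
y = np.einsum("kij,ij->k", A, X_star)

def f(X):
    """f(X) = 0.5 * sum_i (y_i - <A_i, X>)^2"""
    resid = y - np.einsum("kij,ij->k", A, X)
    return 0.5 * np.sum(resid ** 2)

def grad_f(X):
    """grad f(X) = -sum_i (y_i - <A_i, X>) A_i, symmetrized."""
    resid = y - np.einsum("kij,ij->k", A, X)
    G = -np.einsum("k,kij->ij", resid, A)
    return 0.5 * (G + G.T)
```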
Main Result
Theorem (informal): with the step size choice as above and (m, M)-RSC, the distance of the next iterate to $U^\star_r$ shrinks by a constant factor,
provided the current iterate is close enough and r is appropriately chosen.
Linear convergence once close enough.
In practice: increase r in stages.
Initialization for Strongly Convex f
We propose:
1. Find the negative gradient at 0.
2. Keep at most the r most positive eigen-components (i.e. eigenvalues and their corresponding eigenvectors). Remove all negative eigen-components.
Requires one SVD.
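A minimal sketch of this initialization in NumPy (the square-root scaling of the kept eigenvalues is an assumption; the slides only specify which components to keep):

```python
import numpy as np

def initialize_U(grad_f, n, r):
    """Initialize from the negative gradient at 0.

    1. Compute -grad f(0).
    2. Keep at most the r most positive eigen-components, drop negative ones.
    Returns an n x r factor U0 (zero columns if fewer than r positive
    eigenvalues survive).
    """
    G = -grad_f(np.zeros((n, n)))
    G = 0.5 * (G + G.T)                 # symmetrize before eigendecomposition
    vals, vecs = np.linalg.eigh(G)      # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:r]    # indices of the top-r eigenvalues
    U0 = np.zeros((n, r))
    for j, i in enumerate(idx):
        if vals[i] > 0:                 # remove negative eigen-components
            # sqrt scaling is an assumed choice so that U0 U0' matches the kept part
            U0[:, j] = np.sqrt(vals[i]) * vecs[:, i]
    return U0
```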
Initialization
Theorem:
Specializations of this have already been used in matrix completion, phase retrieval, etc.
A strange phenomenon
Different convergence rates for f and for a shifted version of f:
[Figure: relative error $\|\hat X - X^\star\|_F / \|X^\star\|_F$ (log scale, $10^{-6}$ to $10^{0}$) vs. number of iterations (0 to 1000), for $\sigma_1/\sigma_3 = 100,\, 10,\, 5$]
Shift the function, get a different convergence behavior (!)
Smooth Convex Functions
Theorem: Local 1/k convergence rate: once the iterate is close enough, the objective gap decays as O(1/k).
Summary so far …
$\min_X \; f(X) \ \text{s.t.}\ X \succeq 0 \;\;\longrightarrow\;\; \min_U \; f(UU')$
Factored gradient descent: $U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$, with the step size rule above.
Restricted strongly convex f: 1. local linear convergence to $U^\star_r$; 2. an initialization scheme.
Smooth f: Local 1/k convergence
General (unconstrained) FGD
FGD: $U_+ \leftarrow U - \eta\,\nabla f(UU')\,U$
Now, bigger uncertainty sets:
General (unconstrained) FGD
Immediate corollary: 1/k convergence for smooth f
But, cannot use this trick for strongly convex …
Strongly Convex
Smooth, strongly convex, global min at 0.
(Borrowing from [Tu et al., 2016])
Theorem: Local linear convergence to a neighborhood of the optimum.
Open Problems
1. Constraints
2. Acceleration
Summary
This work: factored gradient descent under the first-order oracle model
- a new step size rule
- local convergence rates for smooth, and for restricted strongly convex, functions
- a new initialization scheme
Implication: correctness + convergence rates for
phase recovery, matrix regression, matrix sensing, … and almost, for matrix completion …
Under similar statistical settings as already used in the analyses of convex optimization and Alt-Min.
Claim: convex optimization is a bad idea for statistical inference problems involving low-rank matrix estimation …
All of these already use (special cases of) our initialization …