Geometry and Regularization in Nonconvex Quadratic Inverse Problems
Yuejie Chi
Department of Electrical and Computer Engineering
Princeton Workshop, May 2018
Acknowledgements
Thanks to my collaborators:
Yuxin Chen (Princeton), Cong Ma (Princeton), Kaizheng Wang (Princeton), Yuanxin Li (CMU/OSU)
This research is supported by NSF, ONR and AFOSR.
Nonconvex optimization in theory
There may be exponentially many local optima
e.g. a single neuron model (Auer, Herbster, Warmuth ’96)
Nonconvex optimization in practice
Using simple algorithms such as gradient descent, e.g., “backpropagation” for training deep neural networks...
Statistical context is important

Data/measurements follow certain statistical models and hence are not worst-case instances.

\[ \underset{x}{\text{minimize}} \;\; f(x) = \frac{1}{m}\sum_{i=1}^{m} \ell(y_i; x) \;\overset{m\to\infty}{\Longrightarrow}\; \mathbb{E}\big[\ell(y; x)\big] \]

empirical risk ≈ population risk (often nice!)

[Figure: contour plots over (θ₁, θ₂) ∈ [−3, 3]² comparing the empirical risk, whose minimizer θₙ = [0.816, −0.268] sits near the truth θ₀ = [1, 0], with the smooth population risk minimized at θ₀.]

Figure credit: Mei, Bai and Montanari
Putting together...

[Diagram: statistical models ⇒ benign landscape ⇒ global convergence]
Computational efficiency?

[Diagram: statistical models ⇒ benign landscape ⇒ global convergence]

But how fast?
Overview and question we aim to address

• Statistical: sample complexity
• Computational: critical points (saddle points), smoothness

[Diagram: regularized methods are both statistically and computationally efficient; for unregularized methods, statistical efficiency holds but generic theory predicts computational inefficiency (saddle points, nonsmoothness), marked "?".]

Can we simultaneously achieve statistical and computational efficiency using unregularized methods?
Quadratic inverse problem

[Figure: a sensing matrix A applied to an unknown signal X; each observation records only the squared magnitude of a row of AX.]

Recover X♮ ∈ ℝⁿˣʳ from m "random" quadratic measurements
\[ y_i = \big\| a_i^\top X^\natural \big\|_2^2, \qquad i = 1, \ldots, m. \]

The rank-1 case is the famous phase retrieval problem.
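To make the model concrete, here is a minimal NumPy sketch of the sampling process; the sizes n, r, m and the variable names are illustrative assumptions, and setting r = 1 gives phase retrieval measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 50, 3, 2000                      # illustrative sizes (assumption)

X_star = rng.standard_normal((n, r))       # unknown planted signal X-natural
A = rng.standard_normal((m, n))            # rows a_i: i.i.d. Gaussian design

y = np.sum((A @ X_star) ** 2, axis=1)      # y_i = ||a_i^T X-natural||_2^2
```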
Geometric interpretation

If X♮ is orthonormal...

[Figure: sampling vectors a_i, a_j and the subspace spanned by X♮; each measurement reveals only the energy ‖a_i^⊤X♮‖₂².]

We lose a generalized notion of "phase" in ℝʳ:
\[ \frac{a_i^\top X^\natural}{\big\| a_i^\top X^\natural \big\|_2} \]
Shallow neural network

[Figure: a one-hidden-layer network mapping input a through hidden-layer weights X♮ to output y.]

Set X♮ = [x₁, x₂, ..., x_r]; then
\[ y = \sum_{i=1}^{r} \sigma\big(a^\top x_i\big). \]
Shallow neural network with quadratic activation

[Figure: the same one-hidden-layer network, now with activation σ(z) = z².]

Set X♮ = [x₁, x₂, ..., x_r]; then
\[ y = \sum_{i=1}^{r} \sigma\big(a^\top x_i\big) \;\overset{\sigma(z)=z^2}{=}\; \sum_{i=1}^{r} \big(a^\top x_i\big)^2 = \big\| a^\top X^\natural \big\|_2^2. \]
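A quick numerical check of this identity (the sizes and random draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 4
X = rng.standard_normal((n, r))                      # hidden-layer weights [x_1, ..., x_r]
a = rng.standard_normal(n)                           # network input

sigma = lambda z: z ** 2                             # quadratic activation
y_net = sum(sigma(a @ X[:, i]) for i in range(r))    # sum_i sigma(a^T x_i)
y_quad = np.linalg.norm(a @ X) ** 2                  # ||a^T X||_2^2
assert np.isclose(y_net, y_quad)
```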
A natural least squares formulation

given: \( y_k = \|a_k^\top X^\natural\|_2^2, \; 1 \le k \le m \)

⇓

\[ \underset{X \in \mathbb{R}^{n \times r}}{\text{minimize}} \;\; f(X) = \frac{1}{4m}\sum_{k=1}^{m} \Big( \big\| a_k^\top X \big\|_2^2 - y_k \Big)^2 \]

• Use r = 1 as a running example.
Wirtinger flow (Candes, Li, Soltanolkotabi ’14)

Empirical risk minimization:
\[ \underset{x}{\text{minimize}} \;\; f(x) = \frac{1}{4m}\sum_{k=1}^{m} \Big[ \big(a_k^\top x\big)^2 - y_k \Big]^2 \]

• Initialization by the spectral method
• Gradient iterations: for t = 0, 1, ...
\[ x_{t+1} = x_t - \eta\, \nabla f(x_t) \]
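A minimal NumPy sketch of the procedure for the rank-1 case is given below. The function name, the spectral rescaling (which uses E[(a^⊤x)²] = ‖x‖₂²), and the defaults eta = 0.1 and n_iter = 500 are our illustrative choices, not the constants of any theorem:

```python
import numpy as np

def wirtinger_flow(A, y, eta=0.1, n_iter=500):
    """Vanilla WF for y_k = (a_k^T x)^2: spectral initialization followed by
    gradient descent on f(x) = (1/4m) sum_k [(a_k^T x)^2 - y_k]^2.
    Returns the full list of iterates."""
    m, n = A.shape
    # Spectral initialization: top eigenvector of (1/m) sum_k y_k a_k a_k^T,
    # rescaled so that ||x_0||_2 = sqrt(mean(y)), an estimate of ||x||_2.
    _, V = np.linalg.eigh((A.T * y) @ A / m)
    x = V[:, -1] * np.sqrt(y.mean())
    iterates = [x]
    for _ in range(n_iter):
        z = A @ x
        x = x - eta * (A.T @ ((z ** 2 - y) * z)) / m   # x_{t+1} = x_t - eta * grad f(x_t)
        iterates.append(x)
    return iterates
```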
Geometry of loss surface

[Figure: contour plot of the loss surface, with the global minimizer x♮, a saddle point, and the spectral initialization marked.]
Gradient descent theory

f is said to be α-strongly convex and β-smooth if
\[ 0 \preceq \alpha I \preceq \nabla^2 f(x) \preceq \beta I, \qquad \forall x. \]

ℓ₂ error contraction: GD with η = 1/β obeys
\[ \| x_{t+1} - x^\natural \|_2 \le \Big( 1 - \frac{\alpha}{\beta} \Big) \| x_t - x^\natural \|_2 \]

• The condition number β/α determines the rate of convergence
• Attains ε-accuracy within O((β/α) log(1/ε)) iterations
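As a sanity check of the contraction bound, the sketch below runs GD on a toy quadratic f(x) = ½ x^⊤Hx (so ∇f(x) = Hx and x♮ = 0) with assumed constants α = 1 and β = 10:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 20, 1.0, 10.0
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(np.linspace(alpha, beta, n)) @ Q.T   # alpha*I <= H <= beta*I

x = rng.standard_normal(n)
eta = 1.0 / beta
for _ in range(5):
    x_next = x - eta * (H @ x)                       # GD step with grad f(x) = H x
    # per-step error contraction by a factor of at most 1 - alpha/beta
    assert np.linalg.norm(x_next) <= (1 - alpha / beta) * np.linalg.norm(x) + 1e-12
    x = x_next
```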
What does this optimization theory say about WF?

Gaussian designs: a_k i.i.d. ∼ N(0, I_n), 1 ≤ k ≤ m

Population level (infinite samples):
\[ \mathbb{E}\big[\nabla^2 f(x)\big] = 3\big(\|x\|_2^2 I + 2xx^\top\big) - \big(\|x^\natural\|_2^2 I + 2x^\natural x^{\natural\top}\big), \]
which is locally positive definite and well-conditioned:
\[ I_n \preceq \mathbb{E}\big[\nabla^2 f(x)\big] \preceq 10\, I_n. \]

Consequence: WF converges within O(log(1/ε)) iterations if m → ∞.

Finite-sample level (m ≍ n log n): ∇²f(x) is ill-conditioned even locally, with condition number ≍ n:
\[ \frac{1}{2} I_n \preceq \nabla^2 f(x) \preceq O(n)\, I_n. \]

Consequence (Candes et al ’14): WF attains ε-accuracy within O(n log(1/ε)) iterations with η ≍ 1/n if m ≍ n log n.

Regularization helps, e.g. TWF (Chen and Candes ’15), but is WF really that bad?
Numerical efficiency with η_t = 0.1

[Figure: relative ℓ₂ error vs. iteration count on a semilog scale, decaying from 10⁰ to roughly 10⁻¹⁵ over 500 iterations.]

Vanilla GD (WF) can proceed much more aggressively!
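The experiment behind this plot can be reproduced with the `wirtinger_flow` sketch from the earlier slide; the setup below (n = 100, m = 10n, unit-norm planted signal) is an assumed stand-in for the original configuration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, m = 100, 1000                                   # assumed problem sizes
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)                   # so distances below are relative errors
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

iterates = wirtinger_flow(A, y, eta=0.1, n_iter=500)
dist = lambda x: min(np.linalg.norm(x - x_star),   # error up to the global sign
                     np.linalg.norm(x + x_star))
plt.semilogy([dist(x) for x in iterates])
plt.xlabel("iteration"); plt.ylabel("relative error")
plt.show()
```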
Numerical efficiency with η_t = 0.1

[Figure: the same convergence plot; relative ℓ₂ error decays to roughly 10⁻¹⁵ within 500 iterations.]

Generic optimization theory is too pessimistic!
A second look at gradient descent theory

Which region enjoys both strong convexity and smoothness?
\[ \nabla^2 f(x) = \frac{1}{m}\sum_{k=1}^{m} \Big[ 3\big(a_k^\top x\big)^2 - \big(a_k^\top x^\natural\big)^2 \Big]\, a_k a_k^\top \]

• Not smooth if x and a_k are too close (coherent): λ_max(∇²f(x)) blows up along a_k
• Well-behaved if x is not far away from x♮ and x is incoherent w.r.t. the sampling vectors, i.e. |a_k^⊤(x − x♮)| ≲ √(log n) ‖x − x♮‖₂ for all k (the incoherence region):
\[ (1/2)\cdot I_n \;\preceq\; \nabla^2 f(x) \;\preceq\; O(\log n)\cdot I_n \]
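This dichotomy can be probed numerically in the rank-1 case: evaluate the empirical Hessian at a generic (incoherent) point near x♮ and at a point deliberately aligned with one sampling vector. The sizes and perturbation radius are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 300, 3000                                    # assumed sizes, m = 10 n
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))

def hessian(x):
    w = 3 * (A @ x) ** 2 - (A @ x_star) ** 2        # per-sample weights
    return (A.T * w) @ A / m                        # (1/m) sum_k w_k a_k a_k^T

x_inc = x_star + 0.5 * rng.standard_normal(n) / np.sqrt(n)   # generic perturbation
x_coh = x_star + 0.5 * A[0] / np.linalg.norm(A[0])           # coherent with a_1
for x in (x_inc, x_coh):
    eigs = np.linalg.eigvalsh(hessian(x))
    print(f"lambda_min = {eigs[0]:6.2f},  lambda_max = {eigs[-1]:6.2f}")
# The coherent point inflates lambda_max: smoothness degrades near sampling vectors.
```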
A second look at gradient descent theory

[Figure: the region enjoying local strong convexity + smoothness, versus the larger ℓ₂ ball around x♮.]

• Generic optimization theory only ensures that iterates remain in the ℓ₂ ball, but not in the incoherence region
• Existing algorithms enforce regularization, or apply sample splitting, to promote incoherence
Our findings: GD is implicitly regularized

[Figure: the GD trajectory stays inside the region of local strong convexity + smoothness.]

GD implicitly forces its iterates to remain incoherent, even without regularization.
Theoretical guarantees

Theorem (Phase retrieval). Under i.i.d. Gaussian design, WF achieves
• \( \max_k \big| a_k^\top (x_t - x^\natural) \big| \lesssim \sqrt{\log n}\, \|x^\natural\|_2 \) (incoherence)
• \( \| x_t - x^\natural \|_2 \lesssim (1 - \eta/2)^t\, \|x^\natural\|_2 \) (near-linear convergence)
provided that the step size η ≍ 1/log n and the sample size m ≳ n log n.

Big computational saving: WF attains ε-accuracy within O(log n · log(1/ε)) iterations with η ≍ 1/log n if m ≍ n log n.
Theoretical guarantees for the low-rank case

Theorem (Quadratic sampling). Under i.i.d. Gaussian design, GD achieves
• \( \max_l \big\| a_l^\top (X_t Q_t - X^\natural) \big\|_2 \lesssim \sqrt{\log n}\, \frac{\sigma_r^2(X^\natural)}{\|X^\natural\|_F} \) (incoherence)
• \( \| X_t Q_t - X^\natural \|_F \lesssim \Big( 1 - \frac{\sigma_r^2(X^\natural)\, \eta}{2} \Big)^t \|X^\natural\|_F \) (linear convergence)
provided that \( \eta \asymp \frac{1}{(\log n \vee r)^2\, \sigma_r^2(X^\natural)} \) and m ≳ n r⁴ log n. (Here Q_t denotes the orthonormal matrix best aligning X_t with X♮.)

Big computational saving: GD attains ε-accuracy within O((log n ∨ r)² log(1/ε)) iterations if m ≍ n r⁴ log n.

Prior result (Sanghavi, Ward, White ’15): O(n⁴ r² log⁴ n · log(1/ε)) iterations if m ≍ n r⁶ log² n.
Key ingredient: leave-one-out analysis

How to establish \( \big| a_l^\top (x_t - x^\natural) \big| \lesssim \sqrt{\log n}\, \|x^\natural\|_2 \)?

Technical difficulty: x_t is statistically dependent on a_l.

Leave-one-out trick: for each 1 ≤ l ≤ m, introduce leave-one-out iterates x^{t,(l)} by dropping the l-th sample.

[Figure: the sensing matrix A and observations with the l-th row removed.]
Key ingredient: leave-one-out analysis

[Figure: the true iterate x_t stays close to the leave-one-out iterate x^{t,(l)}, which lies in the incoherence region w.r.t. a_l.]

• Leave-one-out iterates x^{t,(l)} are independent of a_l, and are hence incoherent w.r.t. a_l with high probability
• Leave-one-out iterates x^{t,(l)} ≈ true iterates x_t
• Finish by the triangle inequality:
\[ \big| a_l^\top (x_t - x^\natural) \big| \le \big| a_l^\top (x^{t,(l)} - x^\natural) \big| + \big| a_l^\top (x_t - x^{t,(l)}) \big|. \]
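The mechanism can be watched numerically by rerunning the `wirtinger_flow` sketch from the earlier slide with one sample deleted; the setup is an illustrative assumption, and the global signs of the two runs are aligned since the spectral initializations may differ by a flip:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 100, 1000, 0                             # leave out sample l
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

full = wirtinger_flow(A, y, eta=0.1, n_iter=50)
loo = wirtinger_flow(np.delete(A, l, axis=0), np.delete(y, l), eta=0.1, n_iter=50)
s = np.sign(full[0] @ loo[0])                      # align the two runs' global signs
g = np.sign(full[0] @ x_star) * x_star             # the sign of x-natural being approached

for t in (0, 10, 50):
    print(t,
          np.linalg.norm(full[t] - s * loo[t]),    # proximity: x_t vs x_t^{(l)}
          abs(A[l] @ (full[t] - g)))               # coherence |a_l^T (x_t - x-natural)|
```

In runs like this, both printed quantities stay small along the trajectory, which is exactly the content of the three bullets above.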
Incoherence region in high dimensions

[Figure: the incoherence region in 2 dimensions vs. in high dimensions, where the incoherence region is vanishingly small.]
This recipe is quite general

Low-rank matrix completion

[Figure: a partially observed matrix with most entries missing (marked "?"). Fig. credit: Candes]

Given partial samples of a low-rank matrix M in an index set Ω, fill in the missing entries.

Applications: recommendation systems, ...
Incoherence

\[ \underbrace{\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}}_{\text{hard: } \mu = n} \quad \text{vs.} \quad \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}}_{\text{easy: } \mu = 1} \]

Definition (Incoherence for matrix completion). A rank-r matrix M♮ with eigendecomposition M♮ = U♮Σ♮U♮ᵀ is said to be μ-incoherent if
\[ \big\| U^\natural \big\|_{2,\infty} \le \sqrt{\frac{\mu}{n}}\, \big\| U^\natural \big\|_F = \sqrt{\frac{\mu r}{n}}. \]
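The definition transcribes directly into code; the helper below is ours and assumes M is symmetric with exact rank r, so the top-r eigenvectors of M span its column space:

```python
import numpy as np

def incoherence(M, r):
    """mu = (n/r) * max_j ||j-th row of U||_2^2, with U the top-r eigenbasis."""
    n = M.shape[0]
    _, U = np.linalg.eigh(M)                       # eigenvalues in ascending order
    U = U[:, -r:]                                  # top-r eigenvectors
    return n * np.max(np.sum(U ** 2, axis=1)) / r

n = 100
e1 = np.zeros(n); e1[0] = 1.0
print(incoherence(np.outer(e1, e1), r=1))          # hard example: mu = n
print(incoherence(np.ones((n, n)), r=1))           # easy example: mu = 1
```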
Prior theory

\[ \underset{X \in \mathbb{R}^{n\times r}}{\text{minimize}} \;\; f(X) = \sum_{(j,k)\in\Omega} \Big( e_j^\top X X^\top e_k - M_{j,k} \Big)^2 \]

Existing theory promotes incoherence explicitly:

• regularized loss (solve min_X f(X) + R(X) instead), e.g. Keshavan, Montanari, Oh ’10; Sun, Luo ’14; Ge, Lee, Ma ’16
• projection onto the set of incoherent matrices, e.g. Chen, Wainwright ’15; Zheng, Lafferty ’16

Our theory provides guarantees on vanilla / unregularized gradient descent, as sketched below.
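As a hedged illustration of the claim (not the exact setup of any of the cited papers), the sketch below runs vanilla gradient descent on the loss above for a planted symmetric rank-r matrix; the sampling rate p, the step size, and the spectral initialization are demo assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 100, 3, 0.3                               # assumed size, rank, sampling rate
U_true = rng.standard_normal((n, r))
M = U_true @ U_true.T                               # planted rank-r PSD matrix
mask = rng.random((n, n)) < p
mask = mask | mask.T                                # symmetric index set Omega

# Spectral initialization from the rescaled partial observations
w, V = np.linalg.eigh(np.where(mask, M, 0.0) / p)
X = V[:, -r:] * np.sqrt(np.maximum(w[-r:], 0.0))

eta = 0.1 / (p * np.linalg.norm(M, 2))              # conservative step size
for _ in range(200):
    R = np.where(mask, X @ X.T - M, 0.0)            # residual on observed entries only
    X = X - eta * 2.0 * (R + R.T) @ X               # gradient of the loss above
print(np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))   # relative recovery error
```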
Conclusions

Optimization theory + statistical models: vanilla gradient descent is "implicitly regularized" and runs fast!

• Computational: near dimension-free iteration complexity
• Statistical: near-optimal sample complexity
• The recipe works for a few other problems, such as low-rank matrix completion and blind deconvolution
• "Leave-one-out" arguments are useful for decoupling weak dependency, allowing a finer characterization of GD trajectories
References

1. Implicit Regularization for Nonconvex Statistical Estimation, C. Ma, K. Wang, Y. Chi, and Y. Chen, arXiv:1711.10467.
2. Nonconvex Matrix Factorization from Rank-One Measurements, Y. Li, C. Ma, Y. Chen, and Y. Chi, arXiv:1802.06286.
3. Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval, Y. Chen, Y. Chi, J. Fan, and C. Ma, arXiv:1803.07726.
Thank you!