Geometry and Regularization in Nonconvex Quadratic Inverse Problems
Yuejie Chi
Department of Electrical and Computer Engineering
Princeton Workshop, May 2018
Acknowledgements
Thanks to my collaborators:
Yuxin Chen (Princeton), Cong Ma (Princeton), Kaizheng Wang (Princeton), Yuanxin Li (CMU/OSU)
This research is supported by NSF, ONR and AFOSR.
Nonconvex optimization in theory
There may be exponentially many local optima
e.g. a single neuron model (Auer, Herbster, Warmuth ’96)
Nonconvex optimization in practice
Using simple algorithms such as gradient descent, e.g., “backpropagation” for training deep neural networks...
Statistical context is important

Data/measurements follow certain statistical models and hence are not worst-case instances.

\[ \underset{x}{\text{minimize}} \;\; f(x) = \frac{1}{m}\sum_{i=1}^{m} \ell(y_i; x) \;\overset{m\to\infty}{\Longrightarrow}\; \mathbb{E}\big[\ell(y; x)\big] \]

empirical risk ≈ population risk (often nice!)

[Figure: contour plots over (θ₁, θ₂) ∈ [−3, 3]² comparing the empirical risk, whose minimizer θₙ = [0.816, −0.268] sits near the truth θ₀ = [1, 0], with the smooth population risk minimized at θ₀.]

Figure credit: Mei, Bai and Montanari
Putting together...

[Diagram: statistical models ⇒ benign landscape ⇒ global convergence]
Computational efficiency?

[Diagram: statistical models ⇒ benign landscape ⇒ global convergence]

But how fast?
Overview and question we aim to address

• Statistical: sample complexity
• Computational: critical points (saddle points), smoothness

[Diagram: regularized methods are both statistically and computationally efficient; for unregularized methods, statistical efficiency holds but generic theory predicts computational inefficiency (saddle points, nonsmoothness), marked "?".]

Can we simultaneously achieve statistical and computational efficiency using unregularized methods?
Quadratic inverse problem

[Figure: a sensing matrix A applied to an unknown signal X; each observation records only the squared magnitude of a row of AX.]

Recover X♮ ∈ ℝⁿˣʳ from m "random" quadratic measurements
\[ y_i = \big\| a_i^\top X^\natural \big\|_2^2, \qquad i = 1, \ldots, m. \]

The rank-1 case is the famous phase retrieval problem.
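To make the model concrete, here is a minimal NumPy sketch of the sampling process; the sizes n, r, m and the variable names are illustrative assumptions, and setting r = 1 gives phase retrieval measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 50, 3, 2000                      # illustrative sizes (assumption)

X_star = rng.standard_normal((n, r))       # unknown planted signal X-natural
A = rng.standard_normal((m, n))            # rows a_i: i.i.d. Gaussian design

y = np.sum((A @ X_star) ** 2, axis=1)      # y_i = ||a_i^T X-natural||_2^2
```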
Geometric interpretation

If X♮ is orthonormal...

[Figure: sampling vectors a_i, a_j and the subspace spanned by X♮; each measurement reveals only the energy ‖a_i^⊤X♮‖₂².]

We lose a generalized notion of "phase" in ℝʳ:
\[ \frac{a_i^\top X^\natural}{\big\| a_i^\top X^\natural \big\|_2} \]
Shallow neural network

[Figure: a one-hidden-layer network mapping input a through hidden-layer weights X♮ to output y.]

Set X♮ = [x₁, x₂, ..., x_r]; then
\[ y = \sum_{i=1}^{r} \sigma\big(a^\top x_i\big). \]
Shallow neural network with quadratic activation

[Figure: the same one-hidden-layer network, now with activation σ(z) = z².]

Set X♮ = [x₁, x₂, ..., x_r]; then
\[ y = \sum_{i=1}^{r} \sigma\big(a^\top x_i\big) \;\overset{\sigma(z)=z^2}{=}\; \sum_{i=1}^{r} \big(a^\top x_i\big)^2 = \big\| a^\top X^\natural \big\|_2^2. \]
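A quick numerical check of this identity (the sizes and random draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 4
X = rng.standard_normal((n, r))                      # hidden-layer weights [x_1, ..., x_r]
a = rng.standard_normal(n)                           # network input

sigma = lambda z: z ** 2                             # quadratic activation
y_net = sum(sigma(a @ X[:, i]) for i in range(r))    # sum_i sigma(a^T x_i)
y_quad = np.linalg.norm(a @ X) ** 2                  # ||a^T X||_2^2
assert np.isclose(y_net, y_quad)
```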
A natural least squares formulation

given: \( y_k = \|a_k^\top X^\natural\|_2^2, \; 1 \le k \le m \)

⇓

\[ \underset{X \in \mathbb{R}^{n \times r}}{\text{minimize}} \;\; f(X) = \frac{1}{4m}\sum_{k=1}^{m} \Big( \big\| a_k^\top X \big\|_2^2 - y_k \Big)^2 \]

• Use r = 1 as a running example.
Wirtinger flow (Candes, Li, Soltanolkotabi ’14)

Empirical risk minimization:
\[ \underset{x}{\text{minimize}} \;\; f(x) = \frac{1}{4m}\sum_{k=1}^{m} \Big[ \big(a_k^\top x\big)^2 - y_k \Big]^2 \]

• Initialization by the spectral method
• Gradient iterations: for t = 0, 1, ...
\[ x_{t+1} = x_t - \eta\, \nabla f(x_t) \]
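A minimal NumPy sketch of the procedure for the rank-1 case is given below. The function name, the spectral rescaling (which uses E[(a^⊤x)²] = ‖x‖₂²), and the defaults eta = 0.1 and n_iter = 500 are our illustrative choices, not the constants of any theorem:

```python
import numpy as np

def wirtinger_flow(A, y, eta=0.1, n_iter=500):
    """Vanilla WF for y_k = (a_k^T x)^2: spectral initialization followed by
    gradient descent on f(x) = (1/4m) sum_k [(a_k^T x)^2 - y_k]^2.
    Returns the full list of iterates."""
    m, n = A.shape
    # Spectral initialization: top eigenvector of (1/m) sum_k y_k a_k a_k^T,
    # rescaled so that ||x_0||_2 = sqrt(mean(y)), an estimate of ||x||_2.
    _, V = np.linalg.eigh((A.T * y) @ A / m)
    x = V[:, -1] * np.sqrt(y.mean())
    iterates = [x]
    for _ in range(n_iter):
        z = A @ x
        x = x - eta * (A.T @ ((z ** 2 - y) * z)) / m   # x_{t+1} = x_t - eta * grad f(x_t)
        iterates.append(x)
    return iterates
```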
Geometry of loss surface

[Figure: contour plot of the loss surface, with the global minimizer x♮, a saddle point, and the spectral initialization marked.]
Gradient descent theory

f is said to be α-strongly convex and β-smooth if
\[ 0 \preceq \alpha I \preceq \nabla^2 f(x) \preceq \beta I, \qquad \forall x. \]

ℓ₂ error contraction: GD with η = 1/β obeys
\[ \| x_{t+1} - x^\natural \|_2 \le \Big( 1 - \frac{\alpha}{\beta} \Big) \| x_t - x^\natural \|_2 \]

• The condition number β/α determines the rate of convergence
• Attains ε-accuracy within O((β/α) log(1/ε)) iterations
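As a sanity check of the contraction bound, the sketch below runs GD on a toy quadratic f(x) = ½ x^⊤Hx (so ∇f(x) = Hx and x♮ = 0) with assumed constants α = 1 and β = 10:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 20, 1.0, 10.0
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(np.linspace(alpha, beta, n)) @ Q.T   # alpha*I <= H <= beta*I

x = rng.standard_normal(n)
eta = 1.0 / beta
for _ in range(5):
    x_next = x - eta * (H @ x)                       # GD step with grad f(x) = H x
    # per-step error contraction by a factor of at most 1 - alpha/beta
    assert np.linalg.norm(x_next) <= (1 - alpha / beta) * np.linalg.norm(x) + 1e-12
    x = x_next
```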
What does this optimization theory say about WF?

Gaussian designs: a_k i.i.d. ∼ N(0, I_n), 1 ≤ k ≤ m

Population level (infinite samples):
\[ \mathbb{E}\big[\nabla^2 f(x)\big] = 3\big(\|x\|_2^2 I + 2xx^\top\big) - \big(\|x^\natural\|_2^2 I + 2x^\natural x^{\natural\top}\big), \]
which is locally positive definite and well-conditioned:
\[ I_n \preceq \mathbb{E}\big[\nabla^2 f(x)\big] \preceq 10\, I_n. \]

Consequence: WF converges within O(log(1/ε)) iterations if m → ∞.

Finite-sample level (m ≍ n log n): ∇²f(x) is ill-conditioned even locally, with condition number ≍ n:
\[ \frac{1}{2} I_n \preceq \nabla^2 f(x) \preceq O(n)\, I_n. \]

Consequence (Candes et al ’14): WF attains ε-accuracy within O(n log(1/ε)) iterations with η ≍ 1/n if m ≍ n log n.

Regularization helps, e.g. TWF (Chen and Candes ’15), but is WF really that bad?
Numerical efficiency with η_t = 0.1

[Figure: relative ℓ₂ error vs. iteration count on a semilog scale, decaying from 10⁰ to roughly 10⁻¹⁵ over 500 iterations.]

Vanilla GD (WF) can proceed much more aggressively!
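The experiment behind this plot can be reproduced with the `wirtinger_flow` sketch from the earlier slide; the setup below (n = 100, m = 10n, unit-norm planted signal) is an assumed stand-in for the original configuration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, m = 100, 1000                                   # assumed problem sizes
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)                   # so distances below are relative errors
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

iterates = wirtinger_flow(A, y, eta=0.1, n_iter=500)
dist = lambda x: min(np.linalg.norm(x - x_star),   # error up to the global sign
                     np.linalg.norm(x + x_star))
plt.semilogy([dist(x) for x in iterates])
plt.xlabel("iteration"); plt.ylabel("relative error")
plt.show()
```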
Numerical efficiency with η_t = 0.1

[Figure: the same convergence plot; relative ℓ₂ error decays to roughly 10⁻¹⁵ within 500 iterations.]

Generic optimization theory is too pessimistic!
A second look at gradient descent theory

Which region enjoys both strong convexity and smoothness?
\[ \nabla^2 f(x) = \frac{1}{m}\sum_{k=1}^{m} \Big[ 3\big(a_k^\top x\big)^2 - \big(a_k^\top x^\natural\big)^2 \Big]\, a_k a_k^\top \]

• Not smooth if x and a_k are too close (coherent): λ_max(∇²f(x)) blows up along a_k
• Well-behaved if x is not far away from x♮ and x is incoherent w.r.t. the sampling vectors, i.e. |a_k^⊤(x − x♮)| ≲ √(log n) ‖x − x♮‖₂ for all k (the incoherence region):
\[ (1/2)\cdot I_n \;\preceq\; \nabla^2 f(x) \;\preceq\; O(\log n)\cdot I_n \]
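This dichotomy can be probed numerically in the rank-1 case: evaluate the empirical Hessian at a generic (incoherent) point near x♮ and at a point deliberately aligned with one sampling vector. The sizes and perturbation radius are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 300, 3000                                    # assumed sizes, m = 10 n
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))

def hessian(x):
    w = 3 * (A @ x) ** 2 - (A @ x_star) ** 2        # per-sample weights
    return (A.T * w) @ A / m                        # (1/m) sum_k w_k a_k a_k^T

x_inc = x_star + 0.5 * rng.standard_normal(n) / np.sqrt(n)   # generic perturbation
x_coh = x_star + 0.5 * A[0] / np.linalg.norm(A[0])           # coherent with a_1
for x in (x_inc, x_coh):
    eigs = np.linalg.eigvalsh(hessian(x))
    print(f"lambda_min = {eigs[0]:6.2f},  lambda_max = {eigs[-1]:6.2f}")
# The coherent point inflates lambda_max: smoothness degrades near sampling vectors.
```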
A second look at gradient descent theory

[Figure: the region enjoying local strong convexity + smoothness, versus the larger ℓ₂ ball around x♮.]

• Generic optimization theory only ensures that iterates remain in the ℓ₂ ball, but not in the incoherence region
• Existing algorithms enforce regularization, or apply sample splitting, to promote incoherence
Our findings: GD is implicitly regularized

[Figure: the GD trajectory stays inside the region of local strong convexity + smoothness.]

GD implicitly forces its iterates to remain incoherent, even without regularization.
Theoretical guarantees

Theorem (Phase retrieval). Under i.i.d. Gaussian design, WF achieves
• \( \max_k \big| a_k^\top (x_t - x^\natural) \big| \lesssim \sqrt{\log n}\, \|x^\natural\|_2 \) (incoherence)
• \( \| x_t - x^\natural \|_2 \lesssim (1 - \eta/2)^t\, \|x^\natural\|_2 \) (near-linear convergence)
provided that the step size η ≍ 1/log n and the sample size m ≳ n log n.

Big computational saving: WF attains ε-accuracy within O(log n · log(1/ε)) iterations with η ≍ 1/log n if m ≍ n log n.
Theoretical guarantees for the low-rank case

Theorem (Quadratic sampling). Under i.i.d. Gaussian design, GD achieves
• \( \max_l \big\| a_l^\top (X_t Q_t - X^\natural) \big\|_2 \lesssim \sqrt{\log n}\, \frac{\sigma_r^2(X^\natural)}{\|X^\natural\|_F} \) (incoherence)
• \( \| X_t Q_t - X^\natural \|_F \lesssim \Big( 1 - \frac{\sigma_r^2(X^\natural)\, \eta}{2} \Big)^t \|X^\natural\|_F \) (linear convergence)
provided that \( \eta \asymp \frac{1}{(\log n \vee r)^2\, \sigma_r^2(X^\natural)} \) and m ≳ n r⁴ log n. (Here Q_t denotes the orthonormal matrix best aligning X_t with X♮.)

Big computational saving: GD attains ε-accuracy within O((log n ∨ r)² log(1/ε)) iterations if m ≍ n r⁴ log n.

Prior result (Sanghavi, Ward, White ’15): O(n⁴ r² log⁴ n · log(1/ε)) iterations if m ≍ n r⁶ log² n.
Key ingredient: leave-one-out analysis

How to establish \( \big| a_l^\top (x_t - x^\natural) \big| \lesssim \sqrt{\log n}\, \|x^\natural\|_2 \)?

Technical difficulty: x_t is statistically dependent on a_l.

Leave-one-out trick: for each 1 ≤ l ≤ m, introduce leave-one-out iterates x^{t,(l)} by dropping the l-th sample.

[Figure: the sensing matrix A and observations with the l-th row removed.]
Key ingredient: leave-one-out analysis

[Figure: the true iterate x_t stays close to the leave-one-out iterate x^{t,(l)}, which lies in the incoherence region w.r.t. a_l.]

• Leave-one-out iterates x^{t,(l)} are independent of a_l, and are hence incoherent w.r.t. a_l with high probability
• Leave-one-out iterates x^{t,(l)} ≈ true iterates x_t
• Finish by the triangle inequality:
\[ \big| a_l^\top (x_t - x^\natural) \big| \le \big| a_l^\top (x^{t,(l)} - x^\natural) \big| + \big| a_l^\top (x_t - x^{t,(l)}) \big|. \]
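The mechanism can be watched numerically by rerunning the `wirtinger_flow` sketch from the earlier slide with one sample deleted; the setup is an illustrative assumption, and the global signs of the two runs are aligned since the spectral initializations may differ by a flip:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 100, 1000, 0                             # leave out sample l
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

full = wirtinger_flow(A, y, eta=0.1, n_iter=50)
loo = wirtinger_flow(np.delete(A, l, axis=0), np.delete(y, l), eta=0.1, n_iter=50)
s = np.sign(full[0] @ loo[0])                      # align the two runs' global signs
g = np.sign(full[0] @ x_star) * x_star             # the sign of x-natural being approached

for t in (0, 10, 50):
    print(t,
          np.linalg.norm(full[t] - s * loo[t]),    # proximity: x_t vs x_t^{(l)}
          abs(A[l] @ (full[t] - g)))               # coherence |a_l^T (x_t - x-natural)|
```

In runs like this, both printed quantities stay small along the trajectory, which is exactly the content of the three bullets above.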
Incoherence region in high dimensions

[Figure: the incoherence region in 2 dimensions vs. in high dimensions, where the incoherence region is vanishingly small.]
This recipe is quite general

Low-rank matrix completion

[Figure: a partially observed matrix with most entries missing (marked "?"). Fig. credit: Candes]

Given partial samples of a low-rank matrix M in an index set Ω, fill in the missing entries.

Applications: recommendation systems, ...
Incoherence

\[ \underbrace{\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}}_{\text{hard: } \mu = n} \quad \text{vs.} \quad \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}}_{\text{easy: } \mu = 1} \]

Definition (Incoherence for matrix completion). A rank-r matrix M♮ with eigendecomposition M♮ = U♮Σ♮U♮ᵀ is said to be μ-incoherent if
\[ \big\| U^\natural \big\|_{2,\infty} \le \sqrt{\frac{\mu}{n}}\, \big\| U^\natural \big\|_F = \sqrt{\frac{\mu r}{n}}. \]
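The definition transcribes directly into code; the helper below is ours and assumes M is symmetric with exact rank r, so the top-r eigenvectors of M span its column space:

```python
import numpy as np

def incoherence(M, r):
    """mu = (n/r) * max_j ||j-th row of U||_2^2, with U the top-r eigenbasis."""
    n = M.shape[0]
    _, U = np.linalg.eigh(M)                       # eigenvalues in ascending order
    U = U[:, -r:]                                  # top-r eigenvectors
    return n * np.max(np.sum(U ** 2, axis=1)) / r

n = 100
e1 = np.zeros(n); e1[0] = 1.0
print(incoherence(np.outer(e1, e1), r=1))          # hard example: mu = n
print(incoherence(np.ones((n, n)), r=1))           # easy example: mu = 1
```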
Prior theory

\[ \underset{X \in \mathbb{R}^{n\times r}}{\text{minimize}} \;\; f(X) = \sum_{(j,k)\in\Omega} \Big( e_j^\top X X^\top e_k - M_{j,k} \Big)^2 \]

Existing theory promotes incoherence explicitly:

• regularized loss (solve min_X f(X) + R(X) instead), e.g. Keshavan, Montanari, Oh ’10; Sun, Luo ’14; Ge, Lee, Ma ’16
• projection onto the set of incoherent matrices, e.g. Chen, Wainwright ’15; Zheng, Lafferty ’16

Our theory provides guarantees on vanilla / unregularized gradient descent, as sketched below.
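As a hedged illustration of the claim (not the exact setup of any of the cited papers), the sketch below runs vanilla gradient descent on the loss above for a planted symmetric rank-r matrix; the sampling rate p, the step size, and the spectral initialization are demo assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 100, 3, 0.3                               # assumed size, rank, sampling rate
U_true = rng.standard_normal((n, r))
M = U_true @ U_true.T                               # planted rank-r PSD matrix
mask = rng.random((n, n)) < p
mask = mask | mask.T                                # symmetric index set Omega

# Spectral initialization from the rescaled partial observations
w, V = np.linalg.eigh(np.where(mask, M, 0.0) / p)
X = V[:, -r:] * np.sqrt(np.maximum(w[-r:], 0.0))

eta = 0.1 / (p * np.linalg.norm(M, 2))              # conservative step size
for _ in range(200):
    R = np.where(mask, X @ X.T - M, 0.0)            # residual on observed entries only
    X = X - eta * 2.0 * (R + R.T) @ X               # gradient of the loss above
print(np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))   # relative recovery error
```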
Conclusions

Optimization theory + statistical models: vanilla gradient descent is "implicitly regularized" and runs fast!

• Computational: near dimension-free iteration complexity
• Statistical: near-optimal sample complexity
• The recipe works for a few other problems, such as low-rank matrix completion and blind deconvolution
• "Leave-one-out" arguments are useful for decoupling weak dependency, allowing a finer characterization of GD trajectories
References

1. Implicit Regularization for Nonconvex Statistical Estimation, C. Ma, K. Wang, Y. Chi, and Y. Chen, arXiv:1711.10467.
2. Nonconvex Matrix Factorization from Rank-One Measurements, Y. Li, C. Ma, Y. Chen, and Y. Chi, arXiv:1802.06286.
3. Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval, Y. Chen, Y. Chi, J. Fan, and C. Ma, arXiv:1803.07726.
Thank you!