Random Matrices in Machine Learning
Romain COUILLET
CentraleSupélec, University of Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab, University Grenoble-Alpes, France
June 21, 2018
Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs
Perspectives
Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices
Context

Baseline scenario: y₁, …, y_n ∈ ℂ^p (or ℝ^p) i.i.d. with E[y₁] = 0, E[y₁y₁*] = C_p:
• If y₁ ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)

  Ĉ_p = (1/n) Y_p Y_p* = (1/n) ∑_{i=1}^n y_i y_i*,  Y_p = [y₁, …, y_n] ∈ ℂ^{p×n}.

• If n → ∞, then, by the strong law of large numbers,

  Ĉ_p →a.s. C_p,

or equivalently, in spectral norm, ‖Ĉ_p − C_p‖ →a.s. 0.

Random Matrix Regime
• No longer valid if p, n → ∞ with p/n → c ∈ (0, ∞): ‖Ĉ_p − C_p‖ ↛ 0.
• For practical p, n with p ≃ n, this leads to dramatically wrong conclusions.
The Marcenko–Pastur law

Figure: Histogram of the eigenvalues of Ĉ_p for p = 500, n = 2000, C_p = I_p: empirical eigenvalue distribution versus the Marcenko–Pastur law.
The Marcenko–Pastur law

Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) μ_p of a Hermitian matrix A_p ∈ ℂ^{p×p} is

  μ_p = (1/p) ∑_{i=1}^p δ_{λ_i(A_p)}.

Theorem (Marcenko–Pastur Law [Marcenko, Pastur '67]). Let X_p ∈ ℂ^{p×n} have i.i.d. zero mean, unit variance entries. As p, n → ∞ with p/n → c ∈ (0, ∞), the e.s.d. μ_p of (1/n) X_p X_p* satisfies

  μ_p →a.s. μ_c

weakly, where
• μ_c({0}) = max{0, 1 − c⁻¹};
• on (0, ∞), μ_c has continuous density f_c supported on [(1 − √c)², (1 + √c)²],

  f_c(x) = (1/(2πcx)) √((x − (1 − √c)²)((1 + √c)² − x)).
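The theorem is easy to check numerically. Below is a minimal sketch (assuming standard Gaussian entries, though any zero-mean, unit-variance law works) comparing the empirical spectrum of (1/n) X_p X_p* with the density f_c.

```python
# Minimal sketch: empirical spectrum of (1/n) X X^T versus the
# Marcenko-Pastur density (Gaussian entries assumed; p = 500, n = 2000).
import numpy as np

p, n = 500, 2000
c = p / n
X = np.random.randn(p, n)
eigs = np.linalg.eigvalsh(X @ X.T / n)

# Marcenko-Pastur density on its support [(1 - sqrt(c))^2, (1 + sqrt(c))^2]
lm, lp = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
x = np.linspace(lm, lp, 200)
f_c = np.sqrt(np.maximum((x - lm) * (lp - x), 0)) / (2 * np.pi * c * x)

hist, edges = np.histogram(eigs, bins=50, density=True)
# the histogram values closely track f_c at the bin centers for large p, n
```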
The Marcenko–Pastur law

Figure: Marcenko–Pastur density f_c(x) for different limit ratios c = lim_{p→∞} p/n, here c = 0.1, 0.2, 0.5.
Basics of Random Matrix Theory / Spiked Models
Spiked Models

Small rank perturbation: C_p = I_p + P, P of low rank.

Figure: Eigenvalues {λ_i}_{i=1}^n of (1/n) Y_p Y_p*, C_p = diag(1, …, 1, 2, 2, 3, 3) (p − 4 ones), p = 500, n = 1500; the isolated eigenvalues appear near 1 + ω₁ + c(1 + ω₁)/ω₁ and 1 + ω₂ + c(1 + ω₂)/ω₂.
Spiked Models

Theorem (Eigenvalues [Baik, Silverstein '06]). Let Y_p = C_p^{1/2} X_p, with
• X_p with i.i.d. zero mean, unit variance entries, E[|[X_p]_{ij}|⁴] < ∞;
• C_p = I_p + P, P = UΩU*, where, for K fixed, Ω = diag(ω₁, …, ω_K) ∈ ℝ^{K×K}, with ω₁ ≥ … ≥ ω_K > 0.

Then, as p, n → ∞ with p/n → c ∈ (0, ∞), denoting λ_m = λ_m((1/n) Y_p Y_p*) (λ_m > λ_{m+1}),

  λ_m →a.s. 1 + ω_m + c(1 + ω_m)/ω_m > (1 + √c)²  if ω_m > √c,
  λ_m →a.s. (1 + √c)²                              if ω_m ∈ (0, √c].
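The phase transition is easy to reproduce; here is a minimal single-spike sketch (Gaussian entries assumed, u₁ = e₁ so that C_p^{1/2} simply rescales the first row).

```python
# Sketch: largest eigenvalue of (1/n) Y Y^T versus the Baik-Silverstein limit
# for C_p = I_p + omega * e1 e1^T (Gaussian entries assumed).
import numpy as np

p, n, omega = 500, 1500, 2.0
c = p / n
Y = np.random.randn(p, n)
Y[0, :] *= np.sqrt(1 + omega)          # apply C_p^(1/2) = diag(sqrt(1+omega), 1, ..., 1)
lam_max = np.linalg.eigvalsh(Y @ Y.T / n).max()

limit = 1 + omega + c * (1 + omega) / omega if omega > np.sqrt(c) else (1 + np.sqrt(c))**2
print(lam_max, limit)                  # the two values agree for large p, n
```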
Spiked Models

Theorem (Eigenvectors [Paul '07]). Let Y_p = C_p^{1/2} X_p, with
• X_p with i.i.d. zero mean, unit variance entries, E[|[X_p]_{ij}|⁴] < ∞;
• C_p = I_p + P, P = UΩU* = ∑_{i=1}^K ω_i u_i u_i*, ω₁ > … > ω_K > 0.

Then, as p, n → ∞ with p/n → c ∈ (0, ∞), for a, b ∈ ℂ^p deterministic and û_i the eigenvector of λ_i((1/n) Y_p Y_p*),

  a* û_i û_i* b − ((1 − cω_i⁻²)/(1 + cω_i⁻¹)) a* u_i u_i* b · 1_{ω_i > √c} →a.s. 0.

In particular,

  |û_i* u_i|² →a.s. ((1 − cω_i⁻²)/(1 + cω_i⁻¹)) · 1_{ω_i > √c}.
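The alignment formula can likewise be checked directly; a self-contained sketch with the same single-spike model as above (u₁ = e₁, Gaussian entries assumed):

```python
# Sketch: alignment |u_hat_1^T u_1|^2 versus Paul's limit, single spike u_1 = e_1.
import numpy as np

p, n, omega = 500, 1500, 2.0
c = p / n
Y = np.random.randn(p, n)
Y[0, :] *= np.sqrt(1 + omega)           # Y = C_p^(1/2) X, C_p = I + omega e1 e1^T
_, vecs = np.linalg.eigh(Y @ Y.T / n)
u_hat = vecs[:, -1]                     # eigenvector of the largest eigenvalue
align = u_hat[0]**2                     # |u_hat^T e_1|^2

limit = (1 - c / omega**2) / (1 + c / omega) if omega > np.sqrt(c) else 0.0
print(align, limit)
```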
Spiked Models

Figure: Simulated versus limiting |û₁* u₁|² for Y_p = C_p^{1/2} X_p, C_p = I_p + ω₁u₁u₁*, p/n = 1/3, varying population spike ω₁; empirical curves for p = 100, 200, 400 against the limit (1 − c/ω₁²)/(1 + c/ω₁).
Other Spiked Models

Similar results hold for multiple matrix models:
• Y_p = (1/n) (I + P)^{1/2} X_p X_p* (I + P)^{1/2}
• Y_p = (1/n) X_p X_p* + P
• Y_p = (1/n) X_p* (I + P) X_p
• Y_p = (1/n) (X_p + P)* (X_p + P)
• etc.
Applications / Reminder on Spectral Clustering Methods
Reminder on Spectral Clustering Methods

Context: two-step classification of n objects based on a similarity matrix A ∈ ℝ^{n×n}:

Figure: Spectrum of A (a main bulk around 0 plus isolated spikes).
⇓ Eigenvectors (in practice, shuffled) ⇓
Figure: Entries of eigenvectors 1 and 2, displaying the class structure.
⇓ ℓ-dimensional representation (shuffling no longer matters) ⇓
Figure: 2D scatter plot of (eigenvector 1, eigenvector 2) entries.
⇓ EM or k-means clustering.
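For concreteness, a minimal two-step implementation of this pipeline (a sketch only: the eigenvector extraction uses the symmetric normalized Laplacian, the row normalization follows the Ng–Jordan–Weiss variant, and the last step borrows SciPy's k-means, assuming a recent SciPy):

```python
# Two-step spectral clustering sketch on a similarity matrix A (n x n,
# symmetric, nonnegative); final step delegated to SciPy's k-means.
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(A, k):
    d = A.sum(axis=1)                                      # degrees
    L_sym = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))   # I - D^(-1/2) A D^(-1/2)
    _, vecs = np.linalg.eigh(L_sym)                        # ascending eigenvalues
    U = vecs[:, :k]                                        # k informative eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)       # row-normalize (NJW)
    _, labels = kmeans2(U, k, minit='++', seed=0)
    return labels
```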
Applications / Kernel Spectral Clustering
Kernel Spectral Clustering

Problem Statement
• Dataset x₁, …, x_n ∈ ℝ^p.
• Objective: “cluster” the data in k similarity classes C₁, …, C_k.
• Kernel spectral clustering is based on the kernel matrix

  K = {κ(x_i, x_j)}_{i,j=1}^n.

• Usually, κ(x, y) = f(xᵀy) or κ(x, y) = f(‖x − y‖²).
• Refinements:
  – instead of K, use D − K, I_n − D⁻¹K, I_n − D^{−1/2}KD^{−1/2}, etc.;
  – several-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc.

Intuition (from small dimensions)
• K essentially low rank with class structure in the eigenvectors.
• Ng–Jordan–Weiss key remark: D^{−1/2}KD^{−1/2}(D^{1/2} j_a) ≃ D^{1/2} j_a (j_a the canonical vector of C_a).
Kernel Spectral Clustering

Figure: Leading four eigenvectors of D^{−1/2}KD^{−1/2} for MNIST data, RBF kernel (f(t) = exp(−t²/2)).

• Important remark: the eigenvectors are informative BUT far from D^{1/2} j_a!
Model and Assumptions

Gaussian mixture model:
• x₁, …, x_n ∈ ℝ^p,
• k classes C₁, …, C_k,
• x₁, …, x_{n₁} ∈ C₁, …, x_{n−n_k+1}, …, x_n ∈ C_k,
• x_i ∼ N(μ_{g_i}, C_{g_i}).

Assumption (Growth Rate). As n → ∞,
1. Data scaling: p/n → c₀ ∈ (0, ∞), n_a/n → c_a ∈ (0, 1);
2. Mean scaling: with μ° ≜ ∑_{a=1}^k (n_a/n) μ_a and μ_a° ≜ μ_a − μ°, then ‖μ_a°‖ = O(1);
3. Covariance scaling: with C° ≜ ∑_{a=1}^k (n_a/n) C_a and C_a° ≜ C_a − C°, then ‖C_a‖ = O(1), tr C_a° = O(√p), tr C_a°C_b° = O(p).

For 2 classes, this is

  ‖μ₁ − μ₂‖ = O(1), tr(C₁ − C₂) = O(√p), ‖C_i‖ = O(1), tr([C₁ − C₂]²) = O(p).

Remark [Neyman–Pearson optimality]:
• x ∼ N(±μ, I_p) (known μ) is decidable iff ‖μ‖ ≥ O(1);
• x ∼ N(0, (1 ± ε)I_p) (known ε) is decidable iff |ε| ≥ O(p^{−1/2}).
Model and Assumptions

Kernel Matrix:
• Kernel matrix of interest:

  K = { f((1/p)‖x_i − x_j‖²) }_{i,j=1}^n

for some sufficiently smooth nonnegative f (the inner-product kernel f((1/p)x_iᵀx_j) is simpler).
• We study the normalized Laplacian:

  L = n D^{−1/2} (K − ddᵀ/(dᵀ1_n)) D^{−1/2}

with d = K1_n, D = diag(d) (more stable both theoretically and in practice).
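A direct construction of K and L reads as follows (a sketch: data as columns of X, and f(t) = exp(−t/2) chosen as an example kernel function).

```python
# Sketch: distance kernel K and normalized Laplacian L as defined above
# (X holds one observation per column; f(t) = exp(-t/2) as an example).
import numpy as np

def kernel_laplacian(X, f=lambda t: np.exp(-t / 2)):
    p, n = X.shape
    sq = (X**2).sum(axis=0)
    dist2 = (sq[:, None] + sq[None, :] - 2 * X.T @ X) / p   # (1/p)||x_i - x_j||^2
    K = f(np.maximum(dist2, 0))                             # clip tiny negatives
    d = K.sum(axis=1)                                       # d = K 1_n
    Dinv_sqrt = 1 / np.sqrt(d)
    L = n * Dinv_sqrt[:, None] * (K - np.outer(d, d) / d.sum()) * Dinv_sqrt[None, :]
    return K, L
```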
Random Matrix Equivalent

• Key Remark: under the growth rate assumptions,

  max_{1≤i≠j≤n} | (1/p)‖x_i − x_j‖² − τ | →a.s. 0,  where τ = (2/p) tr C°.

⇒ Suggests that (up to the diagonal) K ≃ f(τ)1_n1_nᵀ!

• In fact, the information is hidden in lower-order fluctuations! From a “matrix-wise” Taylor expansion of K:

  K = f(τ)1_n1_nᵀ [O_{‖·‖}(n)] + √n K₁ [low rank, O_{‖·‖}(√n)] + K₂ [informative terms, O_{‖·‖}(1)].

Clearly not the (small dimension) expected behavior.
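The distance concentration is easy to visualize; a minimal sketch with C = I_p (so that τ = 2 here):

```python
# Sketch: all (1/p)||x_i - x_j||^2, i != j, concentrate around tau
# (C = I_p assumed, hence tau = 2).
import numpy as np

p, n = 4000, 500
X = np.random.randn(p, n)
sq = (X**2).sum(axis=0)
dist2 = (sq[:, None] + sq[None, :] - 2 * X.T @ X) / p
off_diag = dist2[~np.eye(n, dtype=bool)]
print(off_diag.min(), off_diag.max())   # both close to 2 for large p
```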
Random Matrix Equivalent

Theorem (Random Matrix Equivalent [Couillet, Benaych '2015]). As n, p → ∞, ‖L − L̂‖ →a.s. 0, where

  L = n D^{−1/2} (K − ddᵀ/(dᵀ1_n)) D^{−1/2},  with K_ij = f((1/p)‖x_i − x_j‖²),
  L̂ = −2 (f′(τ)/f(τ)) [ (1/p) PWᵀWP + (1/p) JBJᵀ + ∗ ]

and W = [w₁, …, w_n] ∈ ℝ^{p×n} (x_i = μ_{g_i} + w_i), P = I_n − (1/n)1_n1_nᵀ, J = [j₁, …, j_k], j_aᵀ = (0, …, 0, 1_{n_a}ᵀ, 0, …, 0),

  B = MᵀM + (5f′(τ)/(8f(τ)) − f″(τ)/(2f′(τ))) ttᵀ − (f″(τ)/f′(τ)) T + ∗.

Recall M = [μ₁°, …, μ_k°], t = [(1/√p) tr C₁°, …, (1/√p) tr C_k°]ᵀ, T = {(1/p) tr C_a°C_b°}_{a,b=1}^k.

Fundamental conclusions:
• the asymptotic kernel impact is only through f′(τ) and f″(τ), that’s all!
• spectral clustering only reads MᵀM, ttᵀ and T, that’s all!
Isolated eigenvalues: Gaussian inputs

Figure: Eigenvalues of L and of L̂, k = 3, p = 2048, n = 512, c₁ = c₂ = 1/4, c₃ = 1/2, [μ_a]_j = 4δ_{aj}, C_a = (1 + 2(a − 1)/√p) I_p, f(x) = exp(−x/2).
Theoretical Findings versus MNIST

Figure: Eigenvalues of L (red) and of the equivalent Gaussian model L̂ (white), MNIST data, p = 784, n = 192.
Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of D^{−1/2}KD^{−1/2} for MNIST data (red) and theoretical findings (blue).
Theoretical Findings versus MNIST

Figure: 2D representation of the eigenvectors of L for the MNIST dataset (eigenvector 2 vs. eigenvector 1, and eigenvector 3 vs. eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.
The surprising f′(τ) = 0 case

Figure: Classification error as a function of f′(τ), polynomial kernel with f(τ) = 4, f″(τ) = 2, x_i ∼ N(0, C_a), C₁ = I_p, [C₂]_{i,j} = .4^{|i−j|}, c₀ = 1/4; empirical curves for p = 512 and p = 1024 versus theory.

• Trivial classification when t = 0, M = 0 and ‖T‖ = O(1).
Applications / Kernel Spectral Clustering: The case f′(τ) = 0
Position of the Problem

Problem: cluster large data x₁, …, x_n ∈ ℝ^p based on “spanned subspaces”.

Method:
• Still assume x₁, …, x_n belong to k classes C₁, …, C_k.
• Zero-mean Gaussian model for the data: for x_i ∈ C_a,

  x_i ∼ N(0, C_a).

• Performance of L = n D^{−1/2} (K − ddᵀ/(dᵀ1_n)) D^{−1/2}, with

  K = { f(‖x̄_i − x̄_j‖²) }_{i,j=1}^n,  x̄ = x/‖x‖,

in the regime n, p → ∞ (alternatively, we can ask (1/p) tr C_i = 1 for all 1 ≤ i ≤ k).
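A sketch of this kernel construction (norm-normalized data, and f(t) = exp(−(t − 2)²), used in the figures of this section, so that f′(2) = 0):

```python
# Sketch: subspace-clustering kernel on normalized data, with
# f(t) = exp(-(t - 2)^2), which satisfies f'(2) = 0.
import numpy as np

def f2_zero_kernel(X):
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)   # x_bar = x / ||x||
    dist2 = 2.0 - 2.0 * (Xn.T @ Xn)                     # ||x_bar_i - x_bar_j||^2
    return np.exp(-(dist2 - 2.0)**2)                    # f(t) = exp(-(t-2)^2)
```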
Model and Reminders

Assumption 1 [Classes]. Vectors x₁, …, x_n ∈ ℝ^p i.i.d. from a k-class Gaussian mixture, with x_i ∈ C_a ⇔ x_i ∼ N(0, C_a) (sorted by class for simplicity).

Assumption 2a [Growth Rates]. As n → ∞, for each a ∈ {1, …, k},
1. n/p → c₀ ∈ (0, ∞);
2. n_a/n → c_a ∈ (0, ∞);
3. (1/p) tr C_a = 1 and tr C_a°C_b° = O(p), with C_a° = C_a − C°, C° = ∑_{b=1}^k c_b C_b.

Theorem (Corollary of the Previous Section). Let f be smooth with f′(2) ≠ 0. Then, under Assumptions 1–2a,

  L = n D^{−1/2} (K − ddᵀ/(dᵀ1_n)) D^{−1/2},  with K = { f(‖x̄_i − x̄_j‖²) }_{i,j=1}^n  (x̄ = x/‖x‖),

exhibits a phase transition phenomenon, i.e., the leading eigenvectors of L asymptotically contain structural information about C₁, …, C_k if and only if

  T = { (1/p) tr C_a°C_b° }_{a,b=1}^k

has sufficiently large eigenvalues (here M = 0, t = 0).
The case f′(2) = 0

Assumption 2b [Growth Rates]. As n → ∞, for each a ∈ {1, …, k},
1. n/p → c₀ ∈ (0, ∞);
2. n_a/n → c_a ∈ (0, ∞);
3. (1/p) tr C_a = 1 and tr C_a°C_b° = O(√p), with C_a° = C_a − C°, C° = ∑_{b=1}^k c_b C_b.

(In this regime, the previous kernels clearly fail.)

Remark [Neyman–Pearson optimality]:
• if C_i = I_p ± E with ‖E‖ → 0, detectability iff (1/p) tr(C₁ − C₂)² ≥ O(p^{−1/2}).

Theorem (Random Equivalent for f′(2) = 0). Let f be smooth with f′(2) = 0 and

  L̂ ≡ (√p f(2)/(2f″(2))) [ L − ((f(0) − f(2))/f(2)) P ],  P = I_n − (1/n)1_n1_nᵀ.

Then, under Assumptions 1–2b,

  L̂ = PΦP + { (1/√p) tr(C_a°C_b°) (1_{n_a}1_{n_b}ᵀ)/p }_{a,b=1}^k + o_{‖·‖}(1)

where Φ_ij = δ_{i≠j} √p [ (x̄_iᵀx̄_j)² − E[(x̄_iᵀx̄_j)²] ].
The case f′(2) = 0

Figure: Eigenvalues of L̂ (with the two isolated eigenvalues λ₁(L̂), λ₂(L̂)) versus the semi-circle law, p = 1000, n = 2000, k = 3, c₁ = c₂ = 1/4, c₃ = 1/2, C_i ∝ I_p + (p/8)^{−5/4} W_i W_iᵀ, W_i ∈ ℝ^{p×(p/8)} with i.i.d. N(0, 1) entries, f(t) = exp(−(t − 2)²).

⇒ No longer a Marcenko–Pastur-like bulk, but rather a semi-circle bulk!
The case f′(2) = 0

Roadmap. We now need to:
• study the spectrum of Φ;
• study the isolated eigenvalues of L̂ (and the phase transition);
• retrieve information from the eigenvectors.

Theorem (Semi-circle law for Φ). Let μ_n = (1/n) ∑_{i=1}^n δ_{λ_i(L̂)}. Then, under Assumption 2b,

  μ_n →a.s. μ

with μ the semi-circle distribution

  μ(dt) = (1/(2πc₀ω²)) √((4c₀ω² − t²)⁺) dt,  ω = lim_{p→∞} √2 (1/p) tr(C²).
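As a quick illustration of a semi-circle bulk, the sketch below uses a generic Wigner matrix as a stand-in for Φ (not the exact kernel model above, only the limiting shape):

```python
# Sketch: a symmetric matrix with i.i.d. N(0, 1/n) off-diagonal entries
# (Wigner stand-in for Phi) has a semi-circle spectrum on [-2, 2].
import numpy as np

n = 2000
G = np.random.randn(n, n)
W = (G + G.T) / np.sqrt(2 * n)                 # Var(W_ij) = 1/n
eigs = np.linalg.eigvalsh(W)
x = np.linspace(-2, 2, 200)
density = np.sqrt(4 - x**2) / (2 * np.pi)      # semi-circle density
hist, _ = np.histogram(eigs, bins=50, density=True)
# hist matches the semi-circle density for large n
```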
The case f′(2) = 0

Denote now

  T ≡ lim_{p→∞} { (√(c_a c_b)/√p) tr C_a°C_b° }_{a,b=1}^k.

Theorem (Isolated Eigenvalues). Let ν₁ ≥ … ≥ ν_k be the eigenvalues of T. Then, if √c₀ |ν_i| > ω, L̂ has an isolated eigenvalue λ_i satisfying

  λ_i →a.s. ρ_i ≡ c₀ν_i + ω²/ν_i.
The case f′(2) = 0

Theorem (Isolated Eigenvectors). For each isolated eigenpair (λ_i, u_i) of L̂ corresponding to (ν_i, v_i) of T, write

  u_i = ∑_{a=1}^k α_i^a j_a/√n_a + σ_i^a w_i^a

with j_a = [0_{n₁}ᵀ, …, 1_{n_a}ᵀ, …, 0_{n_k}ᵀ]ᵀ, (w_i^a)ᵀ j_a = 0, supp(w_i^a) = supp(j_a), ‖w_i^a‖ = 1. Then, under Assumptions 1–2b,

  α_i^a α_i^b →a.s. (1 − (1/c₀)(ω²/ν_i²)) [v_i v_iᵀ]_{ab},
  (σ_i^a)² →a.s. (c_a/c₀)(ω²/ν_i²),

and the fluctuations of u_i, u_j, i ≠ j, are asymptotically uncorrelated.
The case f′(2) = 0

Figure: Leading two eigenvectors of L (or equivalently of L̂) versus the deterministic approximations α_i^a ± σ_i^a (e.g., (1/√n₁)(α₁¹ − σ₁¹), (1/√n₂)(α₁² + σ₁²), (1/√n₂)α₁²).
The case f′(2) = 0

Figure: 2D scatter plot of the leading two eigenvectors of L (or equivalently of L̂) versus the deterministic approximations α_i^a ± σ_i^a (1- and 2-standard-deviation ellipses).
Application to Massive MIMO UE Clustering
Massive MIMO UE Clustering

Setting. Massive MIMO cell with
• p antenna elements;
• n user equipments (UEs) with channels x₁, …, x_n ∈ ℝ^p;
• UEs belonging to solid-angle groups, i.e., E[x_i] = 0, E[x_i x_iᵀ] = C_a ≡ C(Θ_a);
• T independent channel observations x_i^(1), …, x_i^(T) for UE i.

Objective. Cluster users in the same solid-angle groups (for scheduling reasons, to avoid pilot contamination).

Algorithm.
1. Build the kernel matrix K, then L, based on the nT vectors x₁^(1), …, x_n^(T) (as if nT values to cluster).
2. Extract the dominant isolated eigenvectors u₁, …, u_κ.
3. For each i, create ū_i = (1/T)(I_n ⊗ 1_Tᵀ) u_i, i.e., average the eigenvectors along time (see the sketch below).
4. Perform k-class clustering on the vectors ū₁, …, ū_κ.
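Step 3 is a simple block average; a minimal sketch, assuming the entries of each eigenvector are ordered UE-by-UE with T consecutive entries per UE:

```python
# Sketch of step 3: time-average an eigenvector u of length n*T,
# i.e., compute (1/T) (I_n kron 1_T^T) u.
import numpy as np

def average_over_time(u, n, T):
    return u.reshape(n, T).mean(axis=1)   # one averaged entry per UE
```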
Massive MIMO UE Clustering

Figure: Leading two eigenvectors before (left) and after (right) T-averaging. Setting: p = 400, n = 40, T = 10, k = 3, c₁ = c₃ = 1/4, c₂ = 1/2, angular spread model with angles −π/30 ± π/20, 0 ± π/20, and π/30 ± π/20. Kernel function f(t) = exp(−(t − 2)²).
Massive MIMO UE Clustering

Figure: Correct clustering probability (overlap) for different T, using k-means or EM started from the actual centroid solutions (oracle) or randomly; random-guess baseline shown.
Massive MIMO UE Clustering

Figure: Correct clustering probability for the optimal kernel f(t) (here f(t) = exp(−(t − 2)²)) and the Gaussian kernel f(t) = exp(−t²), for different T, using k-means or EM.
Applications / Kernel Spectral Clustering: The case f′(τ) = α/√p
Optimal growth rates and optimal kernels
Conclusion of the previous analyses:
• kernel f((1/p)‖x_i − x_j‖²) with f′(τ) ≠ 0:
  – optimal in ‖μ_a°‖ = O(1), (1/p) tr C_a° = O(p^{−1/2});
  – suboptimal in (1/p) tr C_a°C_b° = O(1);
  → Model type: Marcenko–Pastur + spikes.
• kernel f((1/p)‖x_i − x_j‖²) with f′(τ) = 0:
  – suboptimal in ‖μ_a°‖ = O(1) (kills the means);
  – optimal in (1/p) tr C_a°C_b° = O(p^{−1/2});
  → Model type: smaller-order semi-circle law + spikes.

Jointly optimal solution:
• evenly weighing the Marcenko–Pastur and semi-circle laws;
• the “α-β” kernel:

  f′(τ) = α/√p,  (1/2) f″(τ) = β.
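Any smooth f with these two derivative values works; a minimal quadratic construction (the constant term is arbitrary and chosen as 1 here, τ being the distance concentration point):

```python
# Sketch: a quadratic "alpha-beta" kernel with f'(tau) = alpha/sqrt(p)
# and (1/2) f''(tau) = beta.
import numpy as np

def alpha_beta_kernel(alpha, beta, tau, p):
    return lambda t: beta * (t - tau)**2 + (alpha / np.sqrt(p)) * (t - tau) + 1.0
```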
New assumption setting

• We now consider a fully optimal growth rate setting.

Assumption (Optimal Growth Rate). As n → ∞,
1. Data scaling: p/n → c₀ ∈ (0, ∞), n_a/n → c_a ∈ (0, 1);
2. Mean scaling: with μ° ≜ ∑_{a=1}^k (n_a/n) μ_a and μ_a° ≜ μ_a − μ°, then ‖μ_a°‖ = O(1);
3. Covariance scaling: with C° ≜ ∑_{a=1}^k (n_a/n) C_a and C_a° ≜ C_a − C°, then ‖C_a‖ = O(1), tr C_a° = O(√p), tr C_a°C_b° = O(√p).

Kernel:
• For technical simplicity, we consider

  K̃ = PKP = P { f((1/p)(x_i°)ᵀ(x_j°)) }_{i,j=1}^n P,  P = I_n − (1/n)1_n1_nᵀ,

i.e., τ replaced by 0.
Main Results

Theorem. As n → ∞,

  ‖ √p (PKP + (f(0) + τf′(0)) P) − K̂ ‖ →a.s. 0

with, for α = √p f′(0) = O(1) and β = (1/2) f″(0) = O(1),

  K̂ = α PWᵀWP + β PΦP + UAUᵀ,
  A = [ αMᵀM + βT   αI_k ;  αI_k   0 ],  U = [ J/√p, PWᵀM ],
  Φ/√p = { ((ω_i°)ᵀω_j°)² δ_{i≠j} }_{i,j=1}^n − { (tr(C_aC_b)/p²) 1_{n_a}1_{n_b}ᵀ }_{a,b=1}^k.

Role of α, β:
• weighs the Marcenko–Pastur versus semi-circle parts.
Limiting eigenvalue distribution

Theorem (Eigenvalues Bulk). As p → ∞,

  ν_n ≜ (1/n) ∑_{i=1}^n δ_{λ_i(K̂)} →a.s. ν

with ν having Stieltjes transform m(z) solution of

  1/m(z) = −z + (α/p) tr C (I_p + (α m(z)/c₀) C)^{−1} − (2β²/c₀) ω² m(z),

where ω = lim_{p→∞} (1/p) tr(C²).
Limiting eigenvalue distribution

Figure: Eigenvalues of K̂ (up to recentering) versus the limiting law, p = 2048, n = 4096, k = 2, n₁ = n₂, [μ_i]_j = 3δ_{ij}, f(x) = (1/(2β))(x + αβ/√p)². (Top left): α = 8, β = 1; (top right): α = 4, β = 3; (bottom left): α = 3, β = 4; (bottom right): α = 1, β = 8.
Asymptotic performances: MNIST

• MNIST is “means-dominant”, but not that much!

  Dataset               ‖μ₁ − μ₂‖²   (1/√p) tr(C₁ − C₂)²   (1/p) tr(C₁ − C₂)²
  MNIST (digits 1, 7)      612.7             71.1                  2.5
  MNIST (digits 3, 6)      441.3             39.9                  1.4
  MNIST (digits 3, 8)      212.3             23.5                  0.8

Figure: Spectral clustering overlap on the MNIST database for varying α/β ∈ [−15, 15], digit pairs (1, 7), (3, 6), (3, 8).
Applications/Kernel Spectral Clustering: The case f′(τ) = α√p
55/113
Asymptotic performances: EEG data
I EEG data are "variance-dominant"

Datasets          ‖µ1 − µ2‖²   (1/√p) tr(C1 − C2)²   (1/p) tr(C1 − C2)²
EEG (sets A, E)       2.4             10.9                  1.1

Figure: Spectral clustering of the EEG database for varying α/β (overlap versus α/β).
55 / 113
Applications/Semi-supervised Learning 56/113
Outline
Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs
Perspectives
56 / 113
Applications/Semi-supervised Learning 57/113
Problem Statement
Context: Similar to clustering:
I Classify $x_1, \ldots, x_n \in \mathbb{R}^p$ in $k$ classes, with $n_l$ labelled and $n_u$ unlabelled data.
I Problem statement: give scores $F_{ia}$ (with $d_i = [K1_n]_i$)
$$F = \operatorname*{argmin}_{F \in \mathbb{R}^{n \times k}} \sum_{a=1}^k \sum_{i,j} K_{ij}\left(F_{ia}\, d_i^{\alpha-1} - F_{ja}\, d_j^{\alpha-1}\right)^2$$
such that $F_{ia} = \delta_{x_i \in \mathcal{C}_a}$ for all labelled $x_i$.
I Solution: for $F^{(u)} \in \mathbb{R}^{n_u \times k}$, $F^{(l)} \in \mathbb{R}^{n_l \times k}$ the scores of unlabelled/labelled data,
$$F^{(u)} = \left(I_{n_u} - D_{(u)}^{-\alpha} K_{(u,u)} D_{(u)}^{\alpha-1}\right)^{-1} D_{(u)}^{-\alpha} K_{(u,l)} D_{(l)}^{\alpha-1} F^{(l)}$$
where we naturally decompose
$$K = \begin{bmatrix} K_{(l,l)} & K_{(l,u)} \\ K_{(u,l)} & K_{(u,u)} \end{bmatrix}, \qquad D = \begin{bmatrix} D_{(l)} & 0 \\ 0 & D_{(u)} \end{bmatrix} = \operatorname{diag}(K1_n),$$
as implemented in the sketch below.
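The closed form translates directly into code; a sketch (the labelled points are assumed to come first, and K is any precomputed kernel matrix):

```python
import numpy as np

def ssl_scores(K, F_l, alpha):
    """Scores F_u of unlabelled data; the first n_l points are assumed labelled."""
    n, n_l = K.shape[0], F_l.shape[0]
    d = K @ np.ones(n)                                  # degrees d_i = [K 1_n]_i
    dl, du = d[:n_l], d[n_l:]
    K_ul, K_uu = K[n_l:, :n_l], K[n_l:, n_l:]
    A = np.eye(n - n_l) - (du ** -alpha)[:, None] * K_uu * (du ** (alpha - 1))[None, :]
    b = (du ** -alpha)[:, None] * (K_ul @ ((dl ** (alpha - 1))[:, None] * F_l))
    return np.linalg.solve(A, b)                        # F^(u), of shape (n - n_l) x k
```

Here F_l is the one-hot label matrix, and prediction assigns each unlabelled point to the column of the returned scores with the largest entry.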
57 / 113
Applications/Semi-supervised Learning 58/113
The finite-dimensional intuition: What we expect
Figure: Typical expected performance output
58 / 113
Applications/Semi-supervised Learning 59/113
MNIST Data Example
Figure: Vectors $[F^{(u)}]_{\cdot,a}$, $a = 1, 2, 3$ (Zeros, Ones, Twos), for 3-class MNIST data, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.
59 / 113
Applications/Semi-supervised Learning 60/113
MNIST Data Example
Figure: Centered vectors $[F^{\circ(u)}]_{\cdot,a} = [F^{(u)} - \frac{1}{k} F^{(u)} 1_k 1_k^T]_{\cdot,a}$, $a = 1, 2, 3$ (Zeros, Ones, Twos), 3-class MNIST data, α = 0, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.
60 / 113
Applications/Semi-supervised Learning 61/113
Theoretical Findings
Method: Assume $n_l/n \to c_l \in (0,1)$.
I We aim at characterizing
$$F^{(u)} = \left(I_{n_u} - D_{(u)}^{-\alpha} K_{(u,u)} D_{(u)}^{\alpha-1}\right)^{-1} D_{(u)}^{-\alpha} K_{(u,l)} D_{(l)}^{\alpha-1} F^{(l)}$$
I Taylor expansion of $K$ as $n, p \to \infty$:
$$K_{(u,u)} = f(\tau)\, 1_{n_u} 1_{n_u}^T + O_{\|\cdot\|}(n^{-\frac{1}{2}}), \qquad D_{(u)} = n f(\tau) I_{n_u} + O(n^{\frac{1}{2}})$$
and similarly for $K_{(u,l)}$, $D_{(l)}$.
I So that
$$\left(I_{n_u} - D_{(u)}^{-\alpha} K_{(u,u)} D_{(u)}^{\alpha-1}\right)^{-1} = \left(I_{n_u} - \frac{1_{n_u} 1_{n_u}^T}{n} + O_{\|\cdot\|}(n^{-\frac{1}{2}})\right)^{-1}$$
is easily Taylor expanded.
61 / 113
Applications/Semi-supervised Learning 62/113
Main Results
Results: Assuming $n_l/n \to c_l \in (0,1)$, by the previous Taylor expansion,
I To first order,
$$F^{(u)}_{\cdot,a} = C\,\frac{n_{l,a}}{n}\Big[\underbrace{v}_{O(1)} + \underbrace{\alpha\, t_a \frac{1_{n_u}}{\sqrt{n}}}_{O(n^{-1/2})}\Big] + \underbrace{O(n^{-1})}_{\text{informative terms}}$$
where $v$ is an $O(1)$ random vector (entry-wise) and $t_a = \frac{1}{\sqrt{p}}\operatorname{tr} C_a^\circ$.
I Consequences:
  I Random non-informative bias $v$
  I Strong impact of $n_{l,a}$ ⇒ $F^{(u)}_{\cdot,a}$ to be scaled by $n_{l,a}$
  I Additional per-class bias $\alpha t_a 1_{n_u}$ ⇒ take $\alpha = 0 + \frac{\beta}{\sqrt{p}}$.
62 / 113
Applications/Semi-supervised Learning 63/113
Main Results
As a consequence of the remarks above, we take
$$\alpha = \frac{\beta}{\sqrt{p}}$$
and define the rescaled scores
$$\hat F^{(u)}_{i,a} = \frac{np}{n_{l,a}}\, F^{(u)}_{ia}.$$
Theorem. For $x_i \in \mathcal{C}_b$ unlabelled,
$$\hat F_{i,\cdot} - G_b \to 0, \qquad G_b \sim \mathcal{N}(m_b, \Sigma_b)$$
where $m_b \in \mathbb{R}^k$, $\Sigma_b \in \mathbb{R}^{k \times k}$ are given by
$$(m_b)_a = -\frac{2f'(\tau)}{f(\tau)} M_{ab} + \frac{f''(\tau)}{f(\tau)}\, t_a^\circ t_b + \frac{2f''(\tau)}{f(\tau)} T_{ab} - \frac{f'(\tau)^2}{f(\tau)^2}\, t_a t_b + \beta \frac{n}{n_l} \frac{f'(\tau)}{f(\tau)}\, t_a^\circ + B_b$$
$$(\Sigma_b)_{a_1 a_2} = \frac{2\operatorname{tr} C_b^2}{p}\left(\frac{f'(\tau)^2}{f(\tau)^2} - \frac{f''(\tau)}{f(\tau)}\right)^2 t_{a_1}^\circ t_{a_2}^\circ + \frac{4f'(\tau)^2}{f(\tau)^2}\left([M^T C_b M]_{a_1 a_2} + \frac{\delta_{a_1}^{a_2}\, p}{n_{l,a_1}}\, T_{b a_1}\right)$$
with $t, T, M$ as before, $X_a^\circ = X_a - \sum_{d=1}^k \frac{n_{l,d}}{n_l} X_d$, and $B_b$ a bias independent of $a$.
63 / 113
Applications/Semi-supervised Learning 64/113
Main Results
Corollary (Asymptotic Classification Error). For $k = 2$ classes and $a \neq b$,
$$P\left(\hat F_{i,a} > \hat F_{i,b} \mid x_i \in \mathcal{C}_b\right) - Q\left(\frac{(m_b)_b - (m_b)_a}{\sqrt{[1,-1]\,\Sigma_b\,[1,-1]^T}}\right) \to 0.$$
Some consequences:
I non-obvious choice of appropriate kernels
I non-obvious choice of the optimal β (it induces a possibly beneficial bias)
I importance of $n_l$ versus $n_u$ (a numerical transcription follows below).
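The corollary reduces performance prediction to a Gaussian tail probability; a sketch, assuming $(m_b, \Sigma_b)$ have already been computed from the theorem above:

```python
import numpy as np
from scipy.stats import norm

def asymptotic_error(m_b, Sigma_b, b, a):
    """Limiting P(Fhat_a > Fhat_b | class b), with Q(x) = 1 - Phi(x) = norm.sf(x)."""
    u = np.zeros(len(m_b))
    u[b], u[a] = 1.0, -1.0
    return norm.sf((m_b[b] - m_b[a]) / np.sqrt(u @ Sigma_b @ u))

# for k = 2 classes: error probability of class-1 points
# asymptotic_error(m_b, Sigma_b, b=1, a=0)
```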
64 / 113
Applications/Semi-supervised Learning 65/113
MNIST Data Example
Figure: Probability of correct classification as a function of α, for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, nl/n = 1/16, Gaussian kernel; simulations versus theory as if Gaussian.
65 / 113
Applications/Semi-supervised Learning 66/113
MNIST Data Example
Figure: Probability of correct classification as a function of α, for 2-class MNIST data (zeros, ones), n = 1568, p = 784, nl/n = 1/16, Gaussian kernel; simulations versus theory as if Gaussian.
66 / 113
Applications/Semi-supervised Learning 67/113
Is semi-supervised learning really semi-supervised?
Reminder: For $x_i \in \mathcal{C}_b$ unlabelled, $\hat F_{i,\cdot} - G_b \to 0$, $G_b \sim \mathcal{N}(m_b, \Sigma_b)$, with $(m_b)_a$ and $(\Sigma_b)_{a_1 a_2}$ as given in the theorem above; both depend on $n_l$ and the $n_{l,a}$, but not on $n_u$.

The problem with unlabelled data:
I The result does not depend on $n_u$! ⇒ increasing $n_u$ is asymptotically not beneficial.
I Even the best Laplacian regularizer reduces SSL to merely supervised learning.
67 / 113
Applications/Semi-supervised Learning improved 68/113
Outline
Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs
Perspectives
68 / 113
Applications/Semi-supervised Learning improved 69/113
Resurrecting SSL by centering
Reminder:
$$F = \operatorname*{argmin}_{F \in \mathbb{R}^{n \times k}} \sum_{a=1}^k \sum_{i,j} K_{ij}\left(F_{ia}\, d_i^{\alpha-1} - F_{ja}\, d_j^{\alpha-1}\right)^2 \quad \text{with } F^{(l)}_{ia} = \delta_{x_i \in \mathcal{C}_a}$$
$$\Leftrightarrow \quad F^{(u)} = \left(I_{n_u} - D_{(u)}^{-\alpha} K_{(u,u)} D_{(u)}^{\alpha-1}\right)^{-1} D_{(u)}^{-\alpha} K_{(u,l)} D_{(l)}^{\alpha-1} F^{(l)}.$$
Domination of score flattening:
I Finite-dimensional intuition imposes $K_{ij}$ decreasing with $\|x_i - x_j\|$ ⇒ the solutions $F_{ia}$ tend to "flatten"
I Consequence: $D_{(u)}^{-\alpha} K_{(u,u)} D_{(u)}^{\alpha-1} \simeq \frac{1}{n} 1_{n_u} 1_{n_u}^T$ and the clustering information vanishes (not so obvious, but it can be shown).
Solution:
I Forgetting finite-dimensional intuition: "recenter" $K$ to kill the flattening, i.e., use
$$\hat K = PKP, \qquad P = I_n - \frac{1}{n} 1_n 1_n^T,$$
as in the sketch below.
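A sketch of the recentred scores; the score formula with $\hat K$ is the one analyzed on the next slide, and α is assumed larger than the top eigenvalue of $\hat K_{uu}$:

```python
import numpy as np

def centered_ssl_scores(K, f_l, alpha):
    """f_u = (alpha I - Khat_uu)^{-1} Khat_ul f_l, Khat = P K P; first n_l points labelled."""
    n, n_l = K.shape[0], f_l.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    K_hat = P @ K @ P
    K_ul, K_uu = K_hat[n_l:, :n_l], K_hat[n_l:, n_l:]
    return np.linalg.solve(alpha * np.eye(n - n_l) - K_uu, K_ul @ f_l)
```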
69 / 113
Applications/Semi-supervised Learning improved 70/113
Theoretical results
Setting
I $k = 2$ classes, $x_i \sim \mathcal{N}(\pm\mu, I_p)$
I scores $f_u = (\alpha I_{n_u} - \hat K_{uu})^{-1} \hat K_{ul} f_l$.

Theorem (Asymptotic mean and variance). As $n \to \infty$, for $i = 1, 2$,
$$\frac{j_i^{(u)T} f_u}{n_{u_i}} - m_i \xrightarrow{a.s.} 0, \qquad \frac{(f_u - m_i 1_{n_u})^T D_i^{(u)} (f_u - m_i 1_{n_u})}{n_{u_i}} - \sigma_i^2 \xrightarrow{a.s.} 0$$
where
$$m_i \equiv -\frac{c_l}{c_u}\, s_i \left(1 - \left[1 + \frac{c_u c_1 c_2 \|\mu\|^2}{c_0} \frac{\delta}{1+\delta}\right]^{-1}\right)$$
$$\sigma_i^2 \equiv \frac{s_i^2 c_l^2 c_i^2 \|\mu\|^2 \delta^2}{c_0^2 (1+\delta)^2 - c_u c_0 \delta^2} \cdot \frac{1 + \frac{c_u c_1 c_2 \|\mu\|^2}{c_0} \frac{\delta^2}{(1+\delta)^2}}{\left(1 + \frac{c_u c_1 c_2 \|\mu\|^2}{c_0} \frac{\delta}{1+\delta}\right)^2} + s_i^2 c_l c_i\, \frac{(1 - c_i)\,\delta^2}{c_0 (1+\delta)^2 - c_u \delta^2}$$
with $\delta$ defined as
$$\delta \equiv -\frac{1}{2} + \frac{c_u - c_0 + \operatorname{sign}(\alpha)\sqrt{(\alpha - \alpha_-)(\alpha - \alpha_+)}}{2\alpha}.$$
70 / 113
Applications/Semi-supervised Learning improved 71/113
Performance as a function of nu, nl
Figure: Correct classification rate, at optimal α, as a function of (i) nu for fixed p/nl = 5 (blue) and (ii) nl for fixed p/nu = 5 (black), i.e., versus cu/cl (blue) and cl/cu (black); c1 = c2 = 1/2; curves for ‖µ‖ = 1, 2, 3. Comparison to the optimal Neyman–Pearson performance for known µ (in red).
71 / 113
Applications/Semi-supervised Learning improved 72/113
The spike case or not (1)
Marcenko–Pastur + spike limit
I the limiting eigenvalue distribution is the Marcenko–Pastur law
I an isolated spike is present iff
$$\|\mu\|^2 > \frac{1}{c_1 c_2} \sqrt{\frac{c_0}{c_u}}$$
I this determines the existence or not of an unsupervised spectral clustering solution.

Figure: Eigenvalue distribution of $\hat K_{uu}$ versus the (scaled) Marcenko–Pastur law with Stieltjes transform δ, for cu = 9/10, c0 = 1/2. The value ‖µ‖ = 2.5 ensures the presence of a leading isolated eigenvalue (spike).
72 / 113
Applications/Semi-supervised Learning improved 73/113
The spike case or not (2)
Figure: Asymptotic correct classification probability Φ(m1/σ1) as a function of α, for cu = 9/10, c0 = 1/2, c1 = 1/2, and two values of ‖µ‖: 1.7 (below the phase transition) and 1.8 (above); the level δ(α) = −(1 + cu c1 c2 c0⁻¹‖µ‖²)⁻¹ is marked.
73 / 113
Applications/Semi-supervised Learning improved 74/113
SSL: the road from supervised to unsupervised
Figure: Theory (solid) versus practice (dashed; from right to left: n = 400, 1000, 4000): correct classification probability as a function of α, for cu = 9/10, c0 = 1/2, c1 = 1/2; left: ‖µ‖ = 1.5 (below the phase transition); right: ‖µ‖ = 2.5 (above the phase transition).
74 / 113
Applications/Semi-supervised Learning improved 75/113
Experimental evidence: MNIST
Digits                      (0,8)      (2,7)      (6,9)
nu = 100
  Centered kernel           89.5±3.6   89.5±3.4   85.3±5.9
  Iterated centered kernel  89.5±3.6   89.5±3.4   85.3±5.9
  Laplacian                 75.5±5.6   74.2±5.8   70.0±5.5
  Iterated Laplacian        87.2±4.7   86.0±5.2   81.4±6.8
  Manifold                  88.0±4.7   88.4±3.9   82.8±6.5
nu = 1000
  Centered kernel           92.2±0.9   92.5±0.8   92.6±1.6
  Iterated centered kernel  92.3±0.9   92.5±0.8   92.9±1.4
  Laplacian                 65.6±4.1   74.4±4.0   69.5±3.7
  Iterated Laplacian        92.2±0.9   92.4±0.9   92.0±1.6
  Manifold                  91.1±1.7   91.4±1.9   91.4±2.0

Table: Comparison of classification accuracy (%) on MNIST datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.
75 / 113
Applications/Semi-supervised Learning improved 76/113
Experimental evidence: Traffic signs (HOG features)
Class ID                    (2,7)       (9,10)      (11,18)
nu = 100
  Centered kernel           79.0±10.4   77.5±9.2    78.5±7.1
  Iterated centered kernel  85.3±5.9    89.2±5.6    90.1±6.7
  Laplacian                 73.8±9.8    77.3±9.5    78.6±7.2
  Iterated Laplacian        83.7±7.2    88.0±6.8    87.1±8.8
  Manifold                  77.6±8.9    81.4±10.4   82.3±10.8
nu = 1000
  Centered kernel           83.6±2.4    84.6±2.4    88.7±9.4
  Iterated centered kernel  84.8±3.8    88.0±5.5    96.4±3.0
  Laplacian                 72.7±4.2    88.9±5.7    95.8±3.2
  Iterated Laplacian        83.0±5.5    88.2±6.0    92.7±6.1
  Manifold                  77.7±5.8    85.0±9.0    90.6±8.1

Table: Comparison of classification accuracy (%) on German Traffic Sign datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.
76 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 77/113
Outline
Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs
Perspectives
77 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 78/113
Random Feature Maps and Extreme Learning Machines
Context: Random Feature Map
I (large) input $x_1, \ldots, x_T \in \mathbb{R}^p$
I random $W = [w_1, \ldots, w_n]^T \in \mathbb{R}^{n \times p}$
I non-linear activation function σ.

Neural Network Model (extreme learning machine): ridge-regression learning
I small output $y_1, \ldots, y_T \in \mathbb{R}^d$
I ridge-regression output $\beta \in \mathbb{R}^{n \times d}$
78 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 79/113
Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as $n, p, T \to \infty$.
I Training MSE:
$$E_{\mathrm{train}} = \frac{1}{T}\sum_{i=1}^T \|y_i - \beta^T\sigma(Wx_i)\|^2 = \frac{1}{T}\|Y - \beta^T\Sigma\|_F^2$$
with
$$\Sigma = \sigma(WX) = \left\{\sigma(w_i^T x_j)\right\}_{\substack{1 \le i \le n \\ 1 \le j \le T}}, \qquad \beta = \frac{1}{T}\Sigma\left(\frac{1}{T}\Sigma^T\Sigma + \gamma I_T\right)^{-1} Y^T.$$
I Testing MSE: upon a new pair $(\hat X, \hat Y)$ of length $\hat T$,
$$E_{\mathrm{test}} = \frac{1}{\hat T}\|\hat Y - \beta^T\hat\Sigma\|_F^2, \qquad \hat\Sigma = \sigma(W\hat X).$$
A runnable sketch of this pipeline follows below.
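A sketch of the random-feature ridge pipeline, with synthetic Gaussian data standing in for MNIST (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, T, gamma = 784, 512, 1024, 1e-2

X, Y = rng.standard_normal((p, T)), rng.standard_normal((1, T))    # train pair
Xt, Yt = rng.standard_normal((p, T)), rng.standard_normal((1, T))  # test pair (Xhat, Yhat)
W = rng.standard_normal((n, p))                                    # random feature matrix
sigma = lambda t: np.maximum(t, 0)                                 # ReLU activation

S = sigma(W @ X)                                                   # Sigma, n x T
Q = np.linalg.inv(S.T @ S / T + gamma * np.eye(T))                 # resolvent Q(gamma)
beta = S @ Q @ Y.T / T                                             # ridge readout

E_train = np.linalg.norm(Y - beta.T @ S) ** 2 / T
E_test = np.linalg.norm(Yt - beta.T @ sigma(W @ Xt)) ** 2 / T
```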
79 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 80/113
Technical Aspects
Preliminary observations:
I Link to the resolvent of $\frac{1}{T}\Sigma^T\Sigma$:
$$E_{\mathrm{train}} = \frac{\gamma^2}{T}\operatorname{tr} Y^T Y\, Q^2 = -\gamma^2 \frac{\partial}{\partial\gamma}\, \frac{1}{T}\operatorname{tr} Y^T Y\, Q$$
where $Q = Q(\gamma)$ is the resolvent
$$Q \equiv \left(\frac{1}{T}\Sigma^T\Sigma + \gamma I_T\right)^{-1}, \qquad \Sigma_{ij} = \sigma(w_i^T x_j).$$
Central object: the expected resolvent $\mathbb{E}[Q]$.
80 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 81/113
Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]. For Lipschitz σ, bounded ‖X‖, ‖Y‖, and $W = f(Z)$ (entry-wise) with $Z$ standard Gaussian, we have, for all $\varepsilon > 0$,
$$\left\|\mathbb{E}[Q] - \bar Q\right\| < C n^{\varepsilon - \frac{1}{2}}$$
for some $C > 0$, where
$$\bar Q = \left(\frac{n}{T}\frac{\Phi}{1+\delta} + \gamma I_T\right)^{-1}, \qquad \Phi \equiv \mathbb{E}\left[\sigma(X^T w)\sigma(w^T X)\right]$$
with $w = f(z)$, $z \sim \mathcal{N}(0, I_p)$, and $\delta > 0$ the unique positive solution to
$$\delta = \frac{1}{T}\operatorname{tr}\Phi\bar Q.$$
Proof arguments:
I $\sigma(WX)$ has independent rows but dependent columns
I this breaks the "trace lemma" argument (i.e., $\frac{1}{p} w^T X A X^T w \simeq \frac{1}{p}\operatorname{tr} XAX^T$)
I Concentration of measure: $P\left(\left|\frac{1}{p}\sigma(w^T X)A\sigma(X^T w) - \frac{1}{p}\operatorname{tr}\Phi A\right| > t\right) \le C e^{-cn\min(t, t^2)}$
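The deterministic equivalent is computable by fixed-point iteration once Φ is available; a sketch, with Φ estimated by Monte Carlo for a generic σ (the Monte Carlo shortcut is ours, not part of the theorem):

```python
import numpy as np

def deterministic_equivalent(Phi, n, T, gamma, iters=200):
    """Iterate delta = (1/T) tr(Phi Qbar), Qbar = (n/T * Phi/(1+delta) + gamma I)^{-1}."""
    delta = 0.0
    for _ in range(iters):
        Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
        delta = np.trace(Phi @ Qbar) / T
    return Qbar, delta

def estimate_Phi(X, sigma, n_mc=20000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of Phi = E[sigma(X^T w) sigma(w^T X)], w ~ N(0, I_p)."""
    S = sigma(rng.standard_normal((n_mc, X.shape[0])) @ X)
    return S.T @ S / n_mc
```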
81 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 82/113
Main Technical Result
I Values of Φ(a, b) for $w \sim \mathcal{N}(0, I_p)$, with $\angle(a,b) \equiv \frac{a^T b}{\|a\|\|b\|}$:

σ(t)        Φ(a, b)
max(t, 0)   (1/2π) ‖a‖‖b‖ ( ∠(a,b) acos(−∠(a,b)) + √(1 − ∠(a,b)²) )
|t|         (2/π) ‖a‖‖b‖ ( ∠(a,b) asin(∠(a,b)) + √(1 − ∠(a,b)²) )
erf(t)      (2/π) asin( 2aᵀb / √((1 + 2‖a‖²)(1 + 2‖b‖²)) )
1_{t>0}     1/2 − (1/2π) acos(∠(a,b))
sign(t)     1 − (2/π) acos(∠(a,b))
cos(t)      exp(−(‖a‖² + ‖b‖²)/2) cosh(aᵀb)

I Value of Φ(a, b) for $w_i$ i.i.d. with $\mathbb{E}[w_i^k] = m_k$ ($m_1 = 0$) and $\sigma(t) = \zeta_2 t^2 + \zeta_1 t + \zeta_0$:
$$\Phi(a,b) = \zeta_2^2\left[m_2^2\left(2(a^T b)^2 + \|a\|^2\|b\|^2\right) + (m_4 - 3m_2^2)(a^2)^T(b^2)\right] + \zeta_1^2 m_2\, a^T b + \zeta_2\zeta_1 m_3\left[(a^2)^T b + a^T(b^2)\right] + \zeta_2\zeta_0 m_2\left[\|a\|^2 + \|b\|^2\right] + \zeta_0^2$$
where $(a^2) \equiv [a_1^2, \ldots, a_p^2]^T$.
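These entries can be sanity-checked numerically; a sketch comparing the closed form for σ(t) = max(t, 0) against a plain Monte Carlo average:

```python
import numpy as np

def phi_relu(a, b):
    """Closed form E[max(w^T a, 0) max(w^T b, 0)], w ~ N(0, I_p), from the table above."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    ang = a @ b / (na * nb)
    return na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang ** 2))

rng = np.random.default_rng(0)
a, b = rng.standard_normal(50), rng.standard_normal(50)
W = rng.standard_normal((100_000, 50))
mc = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))
print(phi_relu(a, b), mc)   # the two values should agree to a few digits
```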
82 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 83/113
Main Results
Theorem [Asymptotic E_train]. For all $\varepsilon > 0$,
$$n^{\frac{1}{2}-\varepsilon}\left(E_{\mathrm{train}} - \bar E_{\mathrm{train}}\right) \to 0$$
almost surely, where
$$E_{\mathrm{train}} = \frac{1}{T}\left\|Y^T - \Sigma^T\beta\right\|_F^2 = \frac{\gamma^2}{T}\operatorname{tr} Y^T Y Q^2$$
$$\bar E_{\mathrm{train}} = \frac{\gamma^2}{T}\operatorname{tr} Y^T Y \bar Q\left[\frac{\frac{1}{n}\operatorname{tr}\Psi\bar Q^2}{1 - \frac{1}{n}\operatorname{tr}(\Psi\bar Q)^2}\,\Psi + I_T\right]\bar Q$$
with $\Psi \equiv \frac{n}{T}\frac{\Phi}{1+\delta}$; a numerical transcription follows below.
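Given Φ, Q̄ and δ from the fixed point sketched earlier, Ē_train is a direct transcription (a sketch reusing those quantities):

```python
import numpy as np

def E_train_bar(Y, Phi, Qbar, delta, n, T, gamma):
    """Deterministic equivalent of the training MSE from the theorem above."""
    Psi = n / T * Phi / (1 + delta)
    PsiQ = Psi @ Qbar
    corr = (np.trace(PsiQ @ Qbar) / n) / (1 - np.trace(PsiQ @ PsiQ) / n)
    inner = Qbar @ (corr * Psi + np.eye(T)) @ Qbar
    return gamma ** 2 / T * np.trace(Y.T @ Y @ inner)
```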
83 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 84/113
Main Results
I Letting $\hat X \in \mathbb{R}^{p \times \hat T}$, $\hat Y \in \mathbb{R}^{d \times \hat T}$ satisfy "similar properties" as $(X, Y)$,
Claim [Asymptotic E_test]. For all $\varepsilon > 0$,
$$n^{\frac{1}{2}-\varepsilon}\left(E_{\mathrm{test}} - \bar E_{\mathrm{test}}\right) \to 0$$
almost surely, where
$$E_{\mathrm{test}} = \frac{1}{\hat T}\left\|\hat Y^T - \hat\Sigma^T\beta\right\|_F^2$$
$$\bar E_{\mathrm{test}} = \frac{1}{\hat T}\left\|\hat Y^T - \Psi_{X\hat X}^T\bar Q Y^T\right\|_F^2 + \frac{\frac{1}{n}\operatorname{tr} Y^T Y\bar Q\Psi\bar Q}{1 - \frac{1}{n}\operatorname{tr}(\Psi\bar Q)^2}\left[\frac{1}{\hat T}\operatorname{tr}\Psi_{\hat X\hat X} - \frac{1}{\hat T}\operatorname{tr}(I_T + \gamma\bar Q)(\Psi_{\hat X X}\Psi_{X\hat X}\bar Q)\right]$$
with $\Psi_{AB} = \frac{n}{T}\frac{\Phi_{AB}}{1+\delta}$, $\Phi_{AB} = \mathbb{E}[\sigma(A^T w)\sigma(w^T B)]$.
84 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 85/113
Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (MSE: Etrain, Etest, and deterministic equivalents Ētrain, Ētest) for Lipschitz continuous σ(·) ∈ {t, |t|, erf(t), max(t, 0)}, as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
85 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 86/113
Simulations on MNIST: non Lipschitz σ(·)
Figure: Neural network performance (MSE: Etrain, Etest, and deterministic equivalents Ētrain, Ētest) for σ(·) either discontinuous or non-Lipschitz (sign(t), 1_{t>0}, 1 − t²/2), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
86 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 87/113
Deeper investigation on Φ
Statistical Assumptions on X
I Gaussian mixture model:
$$x_i \in \mathcal{C}_a \Leftrightarrow x_i \sim \mathcal{N}\left(\tfrac{1}{\sqrt{p}}\mu_a, \tfrac{1}{p}C_a\right).$$
I Growth rate: $\|\mu_a\| = O(1)$, $\frac{1}{\sqrt{p}}\operatorname{tr} C_a^\circ = O(1)$.

Theorem. As $p, T \to \infty$, for all σ(·) given in the next table,
$$\|P\Phi P - P\tilde\Phi P\| \xrightarrow{a.s.} 0$$
with
$$\tilde\Phi \equiv d_1\left(\Omega + M\frac{J^T}{\sqrt{p}}\right)^T\left(\Omega + M\frac{J^T}{\sqrt{p}}\right) + d_2\, UBU^T + d_0 I_T$$
$$U \equiv \left[\frac{J}{\sqrt{p}},\; \phi\right], \qquad B \equiv \begin{bmatrix} tt^T + 2T & t \\ t^T & 1 \end{bmatrix}$$
and $d_0, d_1, d_2$ given in the next table ($\phi_i = \|w_i\|^2 - \mathbb{E}[\|w_i\|^2]$ for $x_i = \frac{1}{\sqrt{p}}\mu_a + w_i$).
87 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 88/113
Deeper investigation on Φ
Table: Coefficients $d_i$ in $\tilde\Phi$ for different σ(·), where ReLU(t) = max(t, 0) and LReLU(t) = ς₊ max(t, 0) + ς₋ max(−t, 0).

σ(t)               d0                                        d1                  d2
t                  0                                         1                   0
ReLU(t)            (1/4 − 1/(2π)) τ                          1/4                 1/(8πτ)
|t|                (1 − 2/π) τ                               0                   1/(2πτ)
LReLU(t)           ((π−2)/(4π)) (ς₊ + ς₋)² τ                 (ς₊ − ς₋)²/4        (ς₊ + ς₋)²/(8πτ)
1_{t>0}            1/4 − 1/(2π)                              1/(2πτ)             0
sign(t)            1 − 2/π                                   2/(πτ)              0
ς₂t² + ς₁t + ς₀    2τ²ς₂²                                    ς₁²                 ς₂²
cos(t)             1/2 + e^{−2τ}/2 − e^{−τ}                  0                   e^{−τ}/4
sin(t)             1/2 − e^{−2τ}/2 − τe^{−τ}                 e^{−τ}              0
erf(t)             (2/π)(acos(2τ/(2τ+1)) − 2τ/(2τ+1))        (4/π)/(2τ+1)        0
exp(−t²/2)         1/√(2τ+1) − 1/(τ+1)                       0                   1/(4(τ+1)³)
88 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 89/113
Deeper investigation on Φ
Three groups of functions σ(·) emerge:
I "means-oriented": d2 = 0
I "covariance-oriented": d1 = 0
I "balanced": d1, d2 ≠ 0

Case of the Leaky-ReLU
I σ(t) = ς₊ max(t, 0) + ς₋ max(−t, 0)

Figure: Eigenvectors 1 and 2 of PΦP for the four classes N(µ1, C1), N(µ1, C2), N(µ2, C1), N(µ2, C2); top: ς₊ = ς₋ = 1, bottom: ς₊ = 1, ς₋ = 0.
89 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 90/113
Deeper investigation on Φ: Simulation results

Table: Clustering accuracies for different σ(t) on the MNIST dataset (n = 32).

                  σ(t)         T = 32    T = 64    T = 128
mean-oriented     t            85.31%    88.94%    87.30%
                  1_{t>0}      86.00%    82.94%    85.56%
                  sign(t)      81.94%    83.34%    85.22%
                  sin(t)       85.31%    87.81%    87.50%
                  erf(t)       86.50%    87.28%    86.59%
cov-oriented      |t|          62.81%    60.41%    57.81%
                  cos(t)       62.50%    59.56%    57.72%
                  exp(−t²/2)   64.00%    60.44%    58.67%
balanced          (t)          82.87%    85.72%    82.27%
90 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 91/113
Deeper investigation on Φ: Simulation results

Table: Clustering accuracies for different σ(t) on the epileptic EEG dataset (n = 32).

                  σ(t)         T = 32    T = 64    T = 128
mean-oriented     t            71.81%    70.31%    69.58%
                  1_{t>0}      65.19%    65.87%    63.47%
                  sign(t)      67.13%    64.63%    63.03%
                  sin(t)       71.94%    70.34%    68.22%
                  erf(t)       69.44%    70.59%    67.70%
cov-oriented      |t|          99.69%    99.69%    99.50%
                  cos(t)       99.00%    99.38%    99.36%
                  exp(−t²/2)   99.81%    99.81%    99.77%
balanced          (t)          84.50%    87.91%    90.97%
91 / 113
Applications/Community Detection on Graphs 92/113
Outline
Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs
Perspectives
92 / 113
Applications/Community Detection on Graphs 93/113
System Setting
Undirected graph with n nodes, m edges:
I "intrinsic" average connectivity $q_1, \ldots, q_n \sim \mu$ i.i.d.
I $k$ classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$, independent of the $q_i$, of (large) sizes $n_1, \ldots, n_k$, with preferential attachment $C_{ab}$ between $\mathcal{C}_a$ and $\mathcal{C}_b$
I edge probability for nodes $i \in \mathcal{C}_{g_i}$, $j \in \mathcal{C}_{g_j}$: $P(i \sim j) = q_i q_j C_{g_i g_j}$
I adjacency matrix $A$ with $A_{ij} \sim \operatorname{Bernoulli}(q_i q_j C_{g_i g_j})$ (see the generator sketch below)
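A sketch of the degree-corrected stochastic block model generator implied by this setting (the parameter values mirror later experiments but are illustrative):

```python
import numpy as np

def dcsbm_adjacency(q, g, C, rng):
    """Symmetric adjacency with P(i ~ j) = q_i q_j C[g_i, g_j], zero diagonal."""
    P = np.outer(q, q) * C[np.ix_(g, g)]
    A = np.triu((rng.random(P.shape) < P).astype(float), 1)   # draw each edge once
    return A + A.T

rng = np.random.default_rng(0)
n, k = 2000, 3
q = rng.choice([0.1, 0.5], size=n, p=[0.75, 0.25])   # bi-modal intrinsic connectivity
g = rng.integers(0, k, size=n)                       # class labels
C = 1 + (100 / np.sqrt(n)) * np.eye(k)               # C_ab = 1 + M_ab / sqrt(n), M = 100 I_3
A = dcsbm_adjacency(q, g, C, rng)
```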
93 / 113
Applications/Community Detection on Graphs 94/113
Limitations of Classical Methods
I 3 classes with µ bi-modal ($\mu = \frac{3}{4}\delta_{0.1} + \frac{1}{4}\delta_{0.5}$)

Figure: Dominant spectral output of the modularity matrix $A - \frac{dd^T}{2m}$ (left) and of the Bethe Hessian $D - rA$ (right).
94 / 113
Applications/Community Detection on Graphs 95/113
Proposed Regularized Modularity Approach
Recall: $P(i \sim j) = q_i q_j C_{g_i g_j}$.

Dense Regime Assumptions: non-trivial regime when, ∀a, b, as $n \to \infty$,
$$C_{ab} = 1 + \frac{M_{ab}}{\sqrt{n}}, \qquad M_{ab} = O(1).$$
Community information is weak but highly redundant.

Considered Matrix:
$$L_\alpha = (2m)^\alpha \frac{1}{\sqrt{n}}\, D^{-\alpha}\left[A - \frac{dd^T}{2m}\right] D^{-\alpha},$$
as implemented in the sketch below.
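Forming L_α is a one-liner from A (a sketch assuming no isolated nodes; spectral clustering then applies k-means to its dominant eigenvectors):

```python
import numpy as np

def L_alpha(A, alpha):
    """L_alpha = (2m)^alpha / sqrt(n) * D^-alpha (A - d d^T / 2m) D^-alpha."""
    n = A.shape[0]
    d = A.sum(axis=1)                              # degrees; assumed strictly positive
    two_m = d.sum()                                # 2m = total degree
    B = A - np.outer(d, d) / two_m                 # modularity matrix
    Dma = d ** (-alpha)                            # diagonal of D^-alpha
    return two_m ** alpha / np.sqrt(n) * (Dma[:, None] * B * Dma[None, :])
```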
95 / 113
Applications/Community Detection on Graphs 96/113
Asymptotic Equivalence
Theorem (Limiting Random Matrix Equivalent). As $n \to \infty$, $\|L_\alpha - \tilde L_\alpha\| \xrightarrow{a.s.} 0$, where
$$L_\alpha = (2m)^\alpha \frac{1}{\sqrt{n}}\, D^{-\alpha}\left[A - \frac{dd^T}{2m}\right] D^{-\alpha}, \qquad \tilde L_\alpha = \frac{1}{\sqrt{n}}\, D_q^{-\alpha} X D_q^{-\alpha} + U\Lambda U^T$$
with $D_q = \operatorname{diag}(q_i)$, $X$ a zero-mean random matrix with variance profile,
$$U = \left[D_q^{1-\alpha}\frac{J}{\sqrt{n}},\; D_q^{-\alpha} X 1_n\right] \text{ (of rank } k+1\text{)}, \qquad \Lambda = \begin{bmatrix} (I_k - 1_k c^T)\, M\, (I_k - c 1_k^T) & -1_k \\ 1_k^T & 0 \end{bmatrix}$$
and $J = [j_1, \ldots, j_k]$, $j_a = [0, \ldots, 0, 1_{n_a}^T, 0, \ldots, 0]^T \in \mathbb{R}^n$.

Consequences:
I isolated eigenvalues beyond the phase transition ⇔ λ(M) > "spectrum edge"
  ⇒ optimal choice $\alpha_{\mathrm{opt}}$ of α from the study of the limiting spectrum.
I eigenvectors correlated to $D_q^{1-\alpha} J$
  ⇒ necessary regularization by $D^{\alpha-1}$.
96 / 113
Applications/Community Detection on Graphs 97/113
Eigenvalue Spectrum
Figure: Eigenvalues of L1 (n = 2000) versus the limiting law, with isolated spikes beyond the support edges ±Sα; 3 classes, c1 = c2 = 0.3, c3 = 0.4, µ = (1/2)δ0.4 + (1/2)δ0.9, $M = 4\begin{bmatrix} 3 & -1 & -1 \\ -1 & 3 & -1 \\ -1 & -1 & 3 \end{bmatrix}$.
97 / 113
Applications/Community Detection on Graphs 98/113
Phase Transition
Theorem (Phase Transition). Isolated eigenvalue $\lambda_i(L_\alpha)$ if $|\lambda_i(\bar M)| > \tau_\alpha$, $\bar M = (\mathcal{D}(c) - cc^T)M$, where
$$\tau_\alpha = \lim_{x \downarrow S_\alpha^+} \frac{-1}{g_\alpha(x)} \quad \text{(phase transition threshold)}$$
with $[S_\alpha^-, S_\alpha^+]$ the limiting eigenvalue support of $L_\alpha$ and $g_\alpha(x)$ ($|x| > S_\alpha^+$) solution, together with $f_\alpha(x)$, of
$$f_\alpha(x) = \int \frac{q^{1-2\alpha}}{-x - q^{1-2\alpha} f_\alpha(x) + q^{2-2\alpha} g_\alpha(x)}\, \mu(dq), \qquad g_\alpha(x) = \int \frac{q^{2-2\alpha}}{-x - q^{1-2\alpha} f_\alpha(x) + q^{2-2\alpha} g_\alpha(x)}\, \mu(dq).$$
In this case, $\lambda_i(L_\alpha) \xrightarrow{a.s.} (g_\alpha)^{-1}\left(-1/\lambda_i(\bar M)\right)$.

Clustering is possible when $\lambda_i(\bar M) > \min_\alpha \tau_\alpha$:
I "Optimal" $\alpha_{\mathrm{opt}} \equiv \operatorname*{argmin}_\alpha \tau_\alpha$.
I From $\hat q_i \equiv \frac{d_i}{\sqrt{d^T 1_n}} \xrightarrow{a.s.} q_i$, $\mu \simeq \hat\mu \equiv \frac{1}{n}\sum_{i=1}^n \delta_{\hat q_i}$, and thus: a consistent estimator $\hat\alpha_{\mathrm{opt}}$ of $\alpha_{\mathrm{opt}}$ (see the numerical sketch below).
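With µ̂ the empirical measure of the q̂_i, the coupled equations can be iterated directly at a given real x, which gives a numerical handle on τ_α and hence on α̂_opt (a sketch; plain fixed-point iteration, whose convergence outside the support is assumed rather than proven here):

```python
import numpy as np

def f_g_alpha(x, q_hat, alpha, iters=500):
    """Fixed point of (f_alpha, g_alpha) at real x; mu is the empirical law of q_hat."""
    f = g = 0.0
    qa, qb = q_hat ** (1 - 2 * alpha), q_hat ** (2 - 2 * alpha)
    for _ in range(iters):
        den = -x - qa * f + qb * g
        f, g = np.mean(qa / den), np.mean(qb / den)
    return f, g

# threshold: tau_alpha is approximately -1 / g_alpha(x) for x just above the edge S_alpha^+
```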
98 / 113
Applications/Community Detection on Graphs 99/113
Simulated Performance Results (2 masses of qi)
Panels: (Modularity $A - \frac{dd^T}{2m}$), (Bethe Hessian $D - rA$), (Proposed, α = 1), (Proposed, $\hat\alpha_{\mathrm{opt}}$).

Figure: 3 classes, µ = (3/4)δ0.1 + (1/4)δ0.5, c1 = c2 = 1/4, c3 = 1/2, M = 100 I3.
99 / 113
Applications/Community Detection on Graphs 100/113
Simulated Performance Results (2 masses for qi)
Figure: Overlap performance as a function of ∆, for n = 3000, k = 3, ci = 1/3, µ = (3/4)δq(1) + (1/4)δq(2) with q(1) = 0.1 and q(2) = 0.5, M = ∆I3, ∆ ∈ [5, 50]; curves for α = 0, α = 1/2, α = 1, α = α̂opt, and the Bethe Hessian, with the phase transition marked. Here αopt = 0.07.
100 / 113
Applications/Community Detection on Graphs 101/113
Simulated Performance Results (2 masses for qi)
Figure: Overlap performance as a function of q(2) (with q(1) = 0.1), for n = 3000, k = 3, µ = (3/4)δq(1) + (1/4)δq(2), q(2) ∈ [0.1, 0.9], M = 10(2I3 − 1_3 1_3ᵀ), ci = 1/3; curves for α = 0, α = 0.5, α = 1, α = α̂opt, and the Bethe Hessian.
101 / 113
Applications/Community Detection on Graphs 102/113
Real Graph Example: PolBlogs (n = 1490, two classes)
Panels: (L0, λmax ≃ 1.75) and (L1, λmax ≃ 483).

Algorithms     Overlap   Modularity
α̂opt (≃ 0)     0.897     0.4246
α = 0.5        0.035     ≃ 0
α = 1          0.040     ≃ 0
BH             0.304     0.2723
102 / 113
Perspectives/ 103/113
Outline
Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs
Perspectives
103 / 113
Perspectives/ 104/113
Summary of Results and Perspectives I
Random Neural Networks.
✓ Extreme learning machines (one-layer random NN)
✓ Linear echo-state networks (ESN)
▷ Logistic regression and classification error in extreme learning machines (ELM)
▷ Further random feature maps characterization
▷ Generalized random NN (multiple layers, multiple activations)
▷ Random convolutional networks for image processing
Non-linear ESN

Deep Neural Networks (DNN).
▷ Backpropagation in NN (σ(WX) for random X, backpropagation on W)
Statistical physics-inspired approaches (spin-glass models, Hamiltonian-based models)
Non-linear ESN
DNN performance of physics-realistic models (4th-order Hamiltonian, locality)
104 / 113
Perspectives/ 105/113
Summary of Results and Perspectives II
References.
H. W. Lin, M. Tegmark, "Why does deep and cheap learning work so well?", arXiv:1608.08225v2, 2016.
C. Williams, "Computation with infinite neural networks", Neural Computation, 10(5), 1203–1216, 1998.
H. Jaeger, "Short term memory in echo state networks", GMD-Forschungszentrum Informationstechnik, 2001.
G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, "Extreme learning machine: theory and applications", Neurocomputing, 70(1):489–501, 2006.
N. El Karoui, "Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond", The Annals of Applied Probability, 19(6), 2362–2405, 2009.
C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks", (submitted to) The Annals of Applied Probability, 2017.
R. Couillet, G. Wainrib, H. Sevi, H. Tiomoko Ali, "The asymptotic performance of linear echo state neural networks", Journal of Machine Learning Research, vol. 17, no. 178, pp. 1–35, 2016.
A. Choromanska et al., "The Loss Surfaces of Multilayer Networks", AISTATS, 2015.
A. Rahimi, B. Recht, "Random Features for Large-Scale Kernel Machines", NIPS, 2007.
105 / 113
Perspectives/ 106/113
Summary of Results and Perspectives I
Kernel methods.
✓ Spectral clustering
✓ Subspace spectral clustering (f′(τ) = 0)
▷ Spectral clustering with outer product kernel f(xᵀy)
✓ Semi-supervised learning, kernel approaches
✓ Least squares support vector machines (LS-SVM)
▷ Support vector machines (SVM)
Kernel matrices based on Kendall τ, Spearman ρ

Applications.
✓ Massive MIMO user subspace clustering (patent proposed)
Kernel correlation matrices for biostats, heterogeneous datasets
Kernel PCA
Kendall τ in biostats

References.
N. El Karoui, "The spectrum of kernel random matrices", The Annals of Statistics, 38(1), 1–50, 2010.
106 / 113
Perspectives/ 107/113
Summary of Results and Perspectives II
R. Couillet, F. Benaych-Georges, "Kernel Spectral Clustering of Large Dimensional Data", Electronic Journal of Statistics, vol. 10, no. 1, pp. 1393–1454, 2016.
R. Couillet, A. Kammoun, "Random Matrix Improved Subspace Clustering", Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2016.
Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines", (submitted to) Journal of Machine Learning Research, 2017.
X. Mai, R. Couillet, "The counterintuitive mechanism of graph-based semi-supervised learning in the big data regime", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'17), New Orleans, USA, 2017.
107 / 113
Perspectives/ 108/113
Summary of Results and Perspectives I
Community detection.
✓ Heterogeneous dense network clustering
▷ Semi-supervised clustering
Sparse network extensions
Beyond community detection (hub detection)

Applications.
✓ Improved methods for community detection
▷ Applications to distributed optimization (network diffusion, graph signal processing)

References.
H. Tiomoko Ali, R. Couillet, "Spectral community detection in heterogeneous large networks", (submitted to) Journal of Multivariate Analysis, 2016.
F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborova, P. Zhang, "Spectral redemption in clustering sparse networks", Proceedings of the National Academy of Sciences, 110(52), 20935–20940, 2013.
C. Bordenave, M. Lelarge, L. Massoulie, "Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs", Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pp. 1347–1357, 2015.
A. Saade, F. Krzakala, L. Zdeborova, "Spectral clustering of graphs with the Bethe Hessian", Advances in Neural Information Processing Systems, pp. 406–414, 2014.
108 / 113
Perspectives/ 109/113
Summary of Results and Perspectives I
Robust statistics.
✓ Tyler, Maronna (and regularized) estimators
✓ Elliptical data setting, deterministic outlier setting
✓ Central limit theorem extensions
Joint mean and covariance robust estimation
Robust regression (preliminary works exist already, using strikingly different approaches)

Applications.
✓ Statistical finance (portfolio estimation)
✓ Localisation in array processing (robust G-MUSIC)
✓ Detectors in space-time array processing
Correlation matrices in biostatistics, human science datasets, etc.

References.
R. Couillet, F. Pascal, J. W. Silverstein, "Robust Estimates of Covariance Matrices in the Large Dimensional Regime", IEEE Transactions on Information Theory, vol. 60, no. 11, pp. 7269–7278, 2014.
109 / 113
Perspectives/ 110/113
Summary of Results and Perspectives II
R. Couillet, F. Pascal, J. W. Silverstein, "The Random Matrix Regime of Maronna's M-estimator with elliptically distributed samples", Elsevier Journal of Multivariate Analysis, vol. 139, pp. 56–78, 2015.
T. Zhang, X. Cheng, A. Singer, "Marchenko–Pastur Law for Tyler's and Maronna's M-estimators", arXiv:1401.3424, 2014.
R. Couillet, M. McKay, "Large Dimensional Analysis and Optimization of Robust Shrinkage Covariance Matrix Estimators", Elsevier Journal of Multivariate Analysis, vol. 131, pp. 99–120, 2014.
D. Morales-Jimenez, R. Couillet, M. McKay, "Large Dimensional Analysis of Robust M-Estimators of Covariance with Outliers", IEEE Transactions on Signal Processing, vol. 63, no. 21, pp. 5784–5797, 2015.
L. Yang, R. Couillet, M. McKay, "A Robust Statistics Approach to Minimum Variance Portfolio Optimization", IEEE Transactions on Signal Processing, vol. 63, no. 24, pp. 6684–6697, 2015.
R. Couillet, "Robust spiked random matrices and a robust G-MUSIC estimator", Elsevier Journal of Multivariate Analysis, vol. 140, pp. 139–161, 2015.
A. Kammoun, R. Couillet, F. Pascal, M.-S. Alouini, "Optimal Design of the Adaptive Normalized Matched Filter Detector", (submitted to) IEEE Transactions on Information Theory, 2016, arXiv:1504.01252.
110 / 113
Perspectives/ 111/113
Summary of Results and Perspectives III
R. Couillet, A. Kammoun, F. Pascal, "Second order statistics of robust estimators of scatter. Application to GLRT detection for elliptical signals", Elsevier Journal of Multivariate Analysis, vol. 143, pp. 249–274, 2016.
D. Donoho, A. Montanari, "High dimensional robust M-estimation: asymptotic variance via approximate message passing", Probability Theory and Related Fields, 1–35, 2013.
N. El Karoui, "Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results", arXiv:1311.2445, 2013.
111 / 113
Perspectives/ 112/113
Summary of Results and Perspectives I
Other works and ideas.
✓ Spike random matrix sparse PCA
▷ Non-linear shrinkage methods
▷ Sparse kernel PCA
▷ Random signal processing on graph methods
▷ Random matrix analysis of diffusion network performance

Applications.
✓ Spike factor models in portfolio optimization
▷ Non-linear shrinkage in portfolio optimization, biostats

References.
R. Couillet, M. McKay, "Optimal block-sparse PCA for high dimensional correlated samples", (submitted to) Journal of Multivariate Analysis, 2016.
J. Bun, J. P. Bouchaud, M. Potters, "On the overlaps between eigenvectors of correlated random matrices", arXiv:1603.04364, 2016.
O. Ledoit, M. Wolf, "Nonlinear shrinkage estimation of large-dimensional covariance matrices", 2011.
112 / 113
Perspectives/ 113/113
The End
Thank you.
113 / 113