
SPARS 2017: Signal Processing with Adaptive Sparse Structured Representations
Lisbon, Portugal, June 5-8, 2017
Submission deadline: December 12, 2016. Notification of acceptance: March 27, 2017.
Summer School: May 31-June 2, 2017 (tbc). Workshop: June 5-8, 2017.
spars2017.lx.it.pt

Random Moments for Compressive Learning

Rémi Gribonval
Inria Rennes - Bretagne Atlantique
remi.gribonval@inria.fr
London Workshop on Sparse Signal Processing, September 2016

Main Contributors & Collaborators

Anthony Bourrier, Nicolas Keriven, Yann Traonmilin, Nicolas Tremblay, Gilles Puy,
Mike Davies, Patrick Pérez, Gilles Blanchard

Agenda

From Compressive Sensing to Compressive Learning?
The Sketch Trick
Compressive K-means
Compressive GMM
Conclusion

From Compressive Sensing to Compressive Learning


Machine Learning

Available data: a training collection of feature vectors = a point cloud.
Goals: infer parameters to achieve a certain task; generalization to future samples with the same probability distribution.

Examples:


PCA: principal subspace
Dictionary learning: dictionary
Clustering: centroids
Classification: classifier parameters (e.g. support vectors)


Challenging dimensions

Point cloud = large matrix of feature vectors $X = [x_1, x_2, \dots, x_N]$.
High feature dimension n; large collection size N.
Challenge: compress before learning?


Compressive Machine Learning?

Point cloud = large matrix of feature vectors $X = [x_1, x_2, \dots, x_N]$.
Reduce the feature dimension [Calderbank & al 2009, Reboredo & al 2013]: (random) feature projection $Y = MX$ (columns $y_1, \dots, y_N$).
This exploits / needs a low-dimensional feature model.


Challenges of large collections

Feature projection $Y = MX$: limited impact.
"Big Data" challenge: compress the collection size.


Compressive Machine Learning?

Point cloud = ... an empirical probability distribution.
Reduce the collection dimension:
(adaptive) column sampling / coresets, see e.g. [Agarwal & al 2003, Feldman 2010];
sketching & hashing, see e.g. [Thaper & al 2002, Cormode & al 2005].

Sketching operator $\mathcal{M}: X \mapsto z \in \mathbb{R}^m$: nonlinear in the feature vectors, linear in their probability distribution.

Example: Compressive K-means

Illustration: the ICASSP 2013 poster [Bourrier, Gribonval & Pérez], reproduced below.

Compressive Gaussian Mixture Estimation
Anthony Bourrier (1,2), Rémi Gribonval (2), Patrick Pérez (1)
(1) Technicolor, 975 Avenue des Champs Blancs, 35576 Cesson-Sévigné, France, firstname.lastname@technicolor.com
(2) INRIA Rennes - Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes, France, firstname.lastname@inria.fr

Motivation
Goal: infer parameters $\theta$ from $n$-dimensional data $\mathcal{X} = \{x_1, \dots, x_N\}$. This typically requires extensive access to the data. Proposed method: infer from a sketch of the data, which yields memory and privacy savings.

Figure 1: Illustration of the proposed sketching framework. The learning set (size $Nn$) is mapped by a sketching operator $\hat{\mathcal{A}}$ to a database sketch $\hat{z}$ (size $m$), from which a learning method $\mathcal{L}$ produces the learned parameters $\theta$ (size $K$).

Model and problem statement
Application to mixtures of isotropic Gaussians in $\mathbb{R}^n$:
$$f_\mu(x) \propto \exp\big(-\|x-\mu\|_2^2/(2\sigma^2)\big). \qquad (1)$$
Data $\mathcal{X} = \{x_j\}_{j=1}^N \sim_{\mathrm{i.i.d.}} p = \sum_{s=1}^k \alpha_s f_{\mu_s}$ with:
- weights $\alpha_1, \dots, \alpha_k$ (positive, summing to one);
- means $\mu_1, \dots, \mu_k \in \mathbb{R}^n$.

Sketch = Fourier samples at different frequencies: $(\mathcal{A}f)_l = \hat{f}(\omega_l)$. Empirical version: $(\hat{\mathcal{A}}(\mathcal{X}))_l = \frac{1}{N}\sum_{j=1}^N \exp(-i\langle \omega_l, x_j\rangle) \approx (\mathcal{A}p)_l$.

We want to infer the mixture parameters from $\hat{z} = \hat{\mathcal{A}}(\mathcal{X})$. The problem is cast as
$$p = \arg\min_{q \in \Sigma_k} \|\hat{z} - \mathcal{A}q\|_2^2, \qquad (2)$$
where $\Sigma_k$ denotes the set of mixtures of $k$ isotropic Gaussians with positive weights.

Analogy with standard compressed sensing:
              Standard CS                        Our problem
Signal        $x \in \mathbb{R}^n$               $f \in L^1(\mathbb{R}^n)$
Dimension     $n$                                infinite
Sparsity      $k$                                $k$
Dictionary    $\{e_1, \dots, e_n\}$              $\mathcal{F} = \{f_\mu,\ \mu \in \mathbb{R}^n\}$
Measurements  $x \mapsto \langle a, x\rangle$    $f \mapsto \int_{\mathbb{R}^n} f(x)\, e^{-i\langle\omega, x\rangle}\, dx$

Algorithm
Current estimate $p$ with weights $\{\alpha_s\}_{s=1}^k$ and support $\hat{\Gamma} = \{\hat{\mu}_s\}_{s=1}^k$; residual $\hat{r} = \hat{z} - \mathcal{A}p$.
1. Searching new support functions: search for "good components to add" to the support, i.e. local minima of $\mu \mapsto -\langle \mathcal{A}f_\mu, \hat{r}\rangle$, which are added to the support $\hat{\Gamma}$, giving a new support $\hat{\Gamma}'$.
2. k-term thresholding: projection of $\hat{z}$ onto $\hat{\Gamma}'$ with positivity constraints on the coefficients:
$$\arg\min_{\beta \in \mathbb{R}^K_+} \|\hat{z} - U\beta\|_2^2, \qquad (3)$$
with $U = [\mathcal{A}f_{\hat{\mu}_1}, \dots, \mathcal{A}f_{\hat{\mu}_K}]$. The $k$ highest coefficients and the corresponding support are kept, yielding the new support $\hat{\Gamma}$ and coefficients $\alpha_1, \dots, \alpha_k$.
3. Final "shift": gradient descent on the objective function, initialized at the current support and coefficients.

Figure 2: Algorithm illustration in dimension n = 1 for k = 3 Gaussians (first, second and third steps; top: iteration 1, bottom: iteration 2). Blue curve = true mixture, red curve = reconstructed mixture, green curve = gradient function; green dots = candidate centroids, red dots = reconstructed centroids.

Experimental results
Data setup: $\sigma = 1$, $(\alpha_1, \dots, \alpha_k)$ drawn uniformly on the simplex; entries of $\mu_1, \dots, \mu_k \sim_{\mathrm{i.i.d.}} \mathcal{N}(0, 1)$.
Algorithm heuristics:
- frequencies drawn i.i.d. from $\mathcal{N}(0, I_d)$;
- the new support function search (step 1) is initialized at $ru$, where $r$ is drawn uniformly in $[0, \max_{x\in\mathcal{X}}\|x\|_2]$ and $u$ is drawn uniformly on $B_2(0,1)$.
Comparison between:
- our method: the sketch is computed on the fly and the data is discarded;
- EM: the data is stored so that the standard optimization steps can be performed.
Quality measures: KL divergence and Hellinger distance.

Table 1: Comparison between our method and an EM algorithm (n = 20, k = 10, m = 1000).
        Compressed                                EM
N       KL div.       Hell.         Mem.    KL div.       Hell.         Mem.
10^3    0.68 ± 0.28   0.06 ± 0.01   0.6     0.68 ± 0.44   0.07 ± 0.03   0.24
10^4    0.24 ± 0.31   0.02 ± 0.02   0.6     0.19 ± 0.21   0.01 ± 0.02   2.4
10^5    0.13 ± 0.15   0.01 ± 0.02   0.6     0.13 ± 0.21   0.01 ± 0.02   24

Figure 3: Left: example of data and sketch for n = 2. Right: reconstruction quality for n = 10 ("Hell. for 80%", as a function of the sketch size m and of k*n/m).

Pipeline on this example: $\mathcal{X} \to \mathcal{M} \to z \in \mathbb{R}^m \to$ recovery algorithm $\to$ estimated centroids (compared to the ground truth), with N = 1000, n = 2, m = 60.
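To make the poster's sketching operator concrete, here is a minimal NumPy sketch (my own function names, not the authors' code) of the empirical sketch $\hat{\mathcal{A}}(\mathcal{X})$ and of its closed-form counterpart $\mathcal{A}p$ for a mixture of isotropic Gaussians, using the Fourier transform of a Gaussian; for large N drawn from p the two should approximately agree, which is what problem (2) exploits.

```python
import numpy as np

def empirical_sketch(X, W):
    """(A_hat(X))_l = (1/N) sum_j exp(-i <w_l, x_j>): empirical Fourier sketch.
    X: (N, n) data matrix, W: (m, n) matrix of frequencies w_l."""
    return np.exp(-1j * X @ W.T).mean(axis=0)

def isotropic_gmm_sketch(alphas, mus, sigma, W):
    """(A p)_l for p = sum_s alpha_s f_{mu_s}, isotropic Gaussians of std sigma,
    using the closed form (A f_mu)_l = exp(-i <w_l, mu> - sigma^2 ||w_l||^2 / 2)."""
    phase = np.exp(-1j * W @ np.asarray(mus).T)               # (m, k)
    envelope = np.exp(-0.5 * sigma**2 * (W**2).sum(axis=1))   # (m,)
    return (phase * envelope[:, None]) @ np.asarray(alphas)
```

With frequencies drawn i.i.d. from $\mathcal{N}(0, I)$ as in the poster's heuristics, empirical_sketch(X, W) approaches isotropic_gmm_sketch(alphas, mus, 1.0, W) as N grows.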

Computational impact of sketching
Ph.D. theses of A. Bourrier & N. Keriven

Plots (not reproduced): computation time and memory versus collection size N = 10^2 ... 10^6, for K = 5 and K = 20, comparing sketching (no distributed computing), CLOMP, CLOMPR, BS-CGMM, CHS and EM; the compressive methods store only the sketch and the frequencies, while EM stores the data.

Figure 7: Time (top) and memory (bottom) usage of all algorithms on synthetic data with dimension n = 10, number of components K = 5 (left) or K = 20 (right), and number of frequencies m = 5(2n+1)K, with respect to the number of items in the database N.

Excerpt from the underlying paper shown on the slide (GMM-UBM speaker verification background):

... distributions for further work. In this configuration, the speaker verification results will indeed be far from state-of-the-art, but as mentioned before our goal is mainly to test our compressive approach on a different type of problem than that of GMM estimation on synthetic data, for which we have already observed excellent results.

In the GMM-UBM model, each speaker S is represented by one GMM $(\Theta_S, \alpha_S)$. The key point is the introduction of a model $(\Theta_{UBM}, \alpha_{UBM})$ that represents a "generic" speaker, referred to as the Universal Background Model (UBM). Given speech data $\mathcal{X}$ and a candidate speaker S, the statistic used for hypothesis testing is a likelihood ratio between the speaker and the generic model:
$$T(\mathcal{X}) = \frac{p_{\Theta_S,\alpha_S}(\mathcal{X})}{p_{\Theta_{UBM},\alpha_{UBM}}(\mathcal{X})}. \qquad (23)$$
If $T(\mathcal{X})$ exceeds a threshold $\tau$, the data $\mathcal{X}$ are considered as being uttered by the speaker S.
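As a hedged illustration of the decision rule (23), not the experimental pipeline of the talk, the ratio is usually evaluated in the log domain; the helper below assumes each GMM is given as a (weights, means, covariances) triple and relies on SciPy for the per-frame log-densities.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_loglik(X, weights, means, covs):
    """Total log-likelihood of the frames X (N, n) under a GMM;
    covs are full (n, n) covariance matrices (or scalars for isotropic)."""
    per_component = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])                                          # shape (K, N)
    return logsumexp(per_component, axis=0).sum()

def accept_speaker(X, speaker_gmm, ubm_gmm, tau):
    """Eq. (23) in log form: accept the claimed speaker if log T(X) > log(tau)."""
    return gmm_loglik(X, *speaker_gmm) - gmm_loglik(X, *ubm_gmm) > np.log(tau)
```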

The GMMs corresponding to each speaker must somehow be "comparable" to each other and to the UBM. Therefore, the UBM is learned prior to the individual speaker models, using a large database of speech data uttered by many speakers. Then, given training data $\mathcal{X}_S$ specific to one speaker, one M-step of the EM algorithm initialized with the UBM is used to adapt the UBM and derive the model $(\Theta_S, \alpha_S)$. We refer the reader to [51] for more details on this procedure.

In our framework, the EM or compressive estimation algorithms are used to learn the UBM.

5.2 Setup
The experiments were performed on the classical NIST05 speaker verification database. Both training and testing fragments are 5-minute conversations between two speakers. The database contains approximately 650 speakers and 30000 trials.


The Sketch Trick

Analogy:
Signal Processing (inverse problems, compressive sensing): the signal space ($x$) is mapped to the observation space ($y$) by a linear operator $\mathcal{M}$.
Machine Learning (method of moments, compressive learning): the probability space ($p$, the data distribution, $X \sim p(x)$) is mapped to the sketch space ($z$) by $\mathcal{M}$.

The sketch is a linear "projection" of the distribution: nonlinear in the feature vectors, linear in the distribution $p(x)$:
$$z_\ell = \int h_\ell(x)\, p(x)\, dx = \mathbb{E}\, h_\ell(X) \approx \frac{1}{N}\sum_{i=1}^N h_\ell(x_i).$$
This is a finite-dimensional Mean Map Embedding, cf. Smola & al 2007, Sriperumbudur & al 2010.

Information preservation?

The Sketch Trick (continued)

With the same finite-dimensional Mean Map Embedding (cf. Smola & al 2007, Sriperumbudur & al 2010), the second question is: dimension reduction?

Compressive Learning (Heuristic) Examples

Compressive Machine Learning
Point cloud = empirical probability distribution. Reduce the collection dimension by sketching:
$$z_\ell = \frac{1}{N}\sum_{i=1}^N h_\ell(x_i), \qquad 1 \le \ell \le m,$$
i.e. a sketching operator $\mathcal{M}: \mathcal{X} \mapsto z \in \mathbb{R}^m$. How to choose an information-preserving sketch?

Example: Compressive K-means

Goal: find k centroids.
Standard approach: K-means.
Sketching approach: p(x) is spatially localized, so "incoherent" sampling is needed; choose Fourier sampling, i.e. sample the characteristic function and choose the sampling frequencies.


The sketch samples the empirical characteristic function:
$$z_\ell = \frac{1}{N}\sum_{i=1}^N e^{j\,\omega_\ell^\top x_i}, \qquad \omega_\ell \in \mathbb{R}^n,$$
which amounts to pooled Random Fourier Features, cf. Rahimi & Recht 2007.
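A minimal NumPy implementation of this sketching step (the function names and the Gaussian scale parameter are my own choices; the Gaussian draw of frequencies follows the heuristics quoted in the poster above):

```python
import numpy as np

def draw_frequencies(m, n, scale=1.0, rng=None):
    """Draw m frequencies w_l in R^n, i.i.d. N(0, scale^2 I)."""
    rng = np.random.default_rng(rng)
    return scale * rng.standard_normal((m, n))

def compute_sketch(X, W):
    """z_l = (1/N) sum_i exp(j w_l^T x_i): pooled random Fourier features,
    averaged over the whole collection X of shape (N, n)."""
    return np.exp(1j * X @ W.T).mean(axis=0)
```

The same frequency matrix W is reused below to evaluate the sketch of candidate centroids.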


Example: Compressive K-means

Goal: find k centroids. $\mathcal{X} \to \mathcal{M} \to z \in \mathbb{R}^m$ (sampled characteristic function), with N = 1000, n = 2, m = 60.


Density model = mixture of K Diracs:
$$p \approx \sum_{k=1}^K \alpha_k\, \delta_{\theta_k}.$$


The sketch is then decoded: the recovery algorithm (the "decoder") is CLOMP (Compressive Learning OMP), similar in spirit to OMP with Replacement, Subspace Pursuit and CoSaMP. Since
$$z = \mathcal{M}p \approx \sum_{k=1}^K \alpha_k\, \mathcal{M}\delta_{\theta_k},$$
the decoder computes
$$\arg\min_{\alpha_k, \theta_k} \Big\| z - \sum_{k=1}^K \alpha_k\, \mathcal{M}\delta_{\theta_k} \Big\|_2.$$
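The following sketch (assuming the compute_sketch / draw_frequencies helpers above; this is not the authors' CLOMP code) illustrates the two ingredients the decoder works with: the sketch of a Dirac mixture, and the nonnegative least-squares fit of the weights for fixed centroids, which plays the role of CLOMP's projection step.

```python
import numpy as np
from scipy.optimize import nnls

def dirac_sketch(Theta, W):
    """Columns M(delta_{theta_k}): sketch of a Dirac at each candidate centroid.
    Theta: (K, n) centroids, W: (m, n) frequencies, as in compute_sketch."""
    return np.exp(1j * W @ Theta.T)                       # (m, K)

def fit_weights(z, Theta, W):
    """Nonnegative least-squares fit of alpha for fixed centroids; real and
    imaginary parts are stacked so that a real NNLS solver can be used."""
    A = dirac_sketch(Theta, W)
    A_ri = np.vstack([A.real, A.imag])
    z_ri = np.concatenate([z.real, z.imag])
    alpha, _ = nnls(A_ri, z_ri)
    return alpha

def residual(z, alpha, Theta, W):
    """Objective ||z - sum_k alpha_k M(delta_{theta_k})||_2 minimized by the decoder."""
    return np.linalg.norm(z - dirac_sketch(Theta, W) @ alpha)
```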

Compressive K-Means: Empirical Results

Setup: N training samples $x_i \in \mathbb{R}^n$, K clusters, sketch vector $z \in \mathbb{R}^m$ (computed by the sketching operator $\mathcal{M}$), matrix of centroids $C = \mathrm{CLOMP}(z) \in \mathbb{R}^{n \times K}$. K-means objective:
$$\mathrm{SSE}(\mathcal{X}, C) = \sum_{i=1}^N \min_k \|x_i - c_k\|_2^2.$$
Plots (not reproduced): relative memory, relative time of estimation and relative SSE of Sketch+CLOMP versus K-means, as functions of $m/(Kn)$, for $N = 10^4, 10^5, 10^6, 10^7$.
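For reference, the K-means objective above is a one-liner (a minimal NumPy version; centroids are passed as rows, i.e. the transpose of the slide's $C \in \mathbb{R}^{n\times K}$ convention):

```python
import numpy as np

def sse(X, C):
    """SSE(X, C) = sum_i min_k ||x_i - c_k||^2.
    X: (N, n) training points, C: (K, n) centroids (one centroid per row)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)   # (N, K) squared distances
    return d2.min(axis=1).sum()
```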


Compressive K-Means: Empirical Results (MNIST)

Spectral features on MNIST / infMNIST; N training samples, K = 10 clusters. SSE/N of Lloyd-Max (KM) versus Sketch+CLOMP (CKM), each with 1 or 5 replicates (random initialization), for N1 = 70 000, N2 = 300 000 and N3 = 1 000 000 (plot not reproduced).

Example: Compressive GMM

Goal: fit k Gaussians. $\mathcal{X} \to \mathcal{M} \to z \in \mathbb{R}^m$ (sampled characteristic function), with N = 60 000 000, n = 12, m = 5 000.
Density model = GMM with diagonal covariances:
$$p \approx \sum_{k=1}^K \alpha_k\, p_{\theta_k}, \qquad z = \mathcal{M}p \approx \sum_{k=1}^K \alpha_k\, \mathcal{M}p_{\theta_k},$$
and the recovery algorithm (the "decoder") computes the estimated GMM parameters $(\Theta, \alpha)$ via
$$\arg\min_{\alpha_k, \theta_k} \Big\| z - \sum_{k=1}^K \alpha_k\, \mathcal{M}p_{\theta_k} \Big\|_2.$$
Compressive Hierarchical Splitting (CHS) = extension of CLOMP to general GMMs.
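The CHS decoder itself is not reproduced here; the snippet below (my own function names) only shows the model side of $z = \mathcal{M}p$ for a diagonal-covariance GMM, i.e. the closed-form characteristic function that such a decoder has to match against the empirical sketch.

```python
import numpy as np

def gaussian_cf_sketch(mu, var_diag, W):
    """Sketch M p_theta of one Gaussian N(mu, diag(var_diag)) under
    z_l = E exp(j w_l^T X) = exp(j w_l^T mu - 0.5 * sum_d var_d * w_{l,d}^2)."""
    return np.exp(1j * W @ mu - 0.5 * W**2 @ var_diag)

def gmm_cf_sketch(alphas, mus, vars_diag, W):
    """Sketch of the mixture: M p ~ sum_k alpha_k M p_{theta_k}."""
    cols = np.stack([gaussian_cf_sketch(m, v, W)
                     for m, v in zip(mus, vars_diag)], axis=1)   # (m, K)
    return cols @ np.asarray(alphas)
```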


Proof of Concept: Speaker Verification Results (DET curves)

MFCC coefficients $x_i \in \mathbb{R}^{12}$; N = 300 000 000 (about 50 Gbytes, about 1000 hours of speech); after silence detection, N = 60 000 000. Maximum size manageable by EM: N = 300 000.
DET curves (not reproduced) compare the GMM-UBM system when the UBM is learned by EM and by CHS.


With m = 5 000 (720 000-fold compression), CHS exploits the whole collection and improves performance; m = 1 000 gives 3 600 000-fold compression and m = 500 gives 7 200 000-fold compression (still about 50 Gbytes, about 1000 hours of speech, on the input side).

Computational Efficiency


Computational Aspects

Sketching the empirical characteristic function
$$z_\ell = \frac{1}{N}\sum_{i=1}^N e^{j\,\omega_\ell^\top x_i}$$
amounts to computing $h(W\mathcal{X})$ with $h(\cdot) = e^{j(\cdot)}$ and averaging: roughly a one-layer random neural net. Decoding = the next layers; DNN ~ hierarchical sketching? See also [Bruna & al 2013, Giryes & al 2015].

Privacy-preserving "sketch and forget": only the averaged sketch z is kept, the data can be discarded (see also [Bruna & al 2013, Giryes & al 2015]).

The same structure suits streaming algorithms: one pass over the data, online update of the running average z.
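A minimal sketch of such a one-pass update (my own class, not code from the talk); since the sketch is just an average of per-sample features, it can be updated online and, as on the next slide, partial sketches from different machines can be merged by a weighted average.

```python
import numpy as np

class StreamingSketch:
    """Online running average z = (1/N) sum_i exp(j W x_i)."""

    def __init__(self, W):
        self.W = W                                   # (m, n) frequency matrix
        self.z = np.zeros(W.shape[0], dtype=complex)
        self.count = 0

    def update(self, x):
        """Fold one sample x (length n) into the running average."""
        self.count += 1
        phi = np.exp(1j * self.W @ x)
        self.z += (phi - self.z) / self.count

    def merge(self, other):
        """Combine with a partial sketch computed elsewhere (distributed setting)."""
        total = self.count + other.count
        self.z = (self.count * self.z + other.count * other.z) / total
        self.count = total
        return self
```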

It also suits distributed computing: decentralized (Hadoop) / parallel (GPU).

Summary: Compressive K-means / GMM

✓ Dimension reduction
✓ Resource efficiency (time and memory versus collection size N, cf. Figure 7 above)


✓ In the pipe: information preservation (generalized RIP, "intrinsic dimension")
✓ Neural-net-like sketching (the sketch z as a one-layer random net)
• Challenge: provably good recovery algorithms?

Conclusion


Projections & Learning

Compressive sensing: random projections of data items (signal space $x$, observation space $y$, linear operator $\mathcal{M}$); reduces the dimension of data items.
Compressive learning with sketches: random projections of collections (probability space $p$, sketch space $z$); nonlinear in the feature vectors, linear in their probability distribution; reduces the size of the collection.

Summary

Challenge: compress before learning?

Compressive GMM:
Bourrier, G., Pérez, Compressive Gaussian Mixture Estimation, ICASSP 2013.
Keriven & al, Sketching for Large-Scale Learning of Mixture Models, ICASSP 2016 & arXiv:1606.02838.

Compressive k-means:
Keriven & al, Compressive K-Means, submitted to ICASSP 2017.

Compressive spectral clustering (with graph signal processing):
Tremblay & al, Accelerated Spectral Clustering using Graph Filtering of Random Signals, ICASSP 2016.
Tremblay & al, Compressive Spectral Clustering, ICML 2016 & arXiv:1602.02018.
Example: on an Amazon graph ($10^6$ edges), 5 times speedup (3 hours instead of 15 hours for k = 500 classes); complexity $O(k^2 \log^2 k + N(\log N + k))$ instead of $O(k^2 N)$.

Embedded illustration (N. Tremblay, "Graph signal processing for clustering", Rennes, 13th of January 2016), "What's the point of using a graph?": N points in d = 2 dimensions; result with k-means (k = 2), and result after creating a graph from the N points' interdistances and running the spectral clustering algorithm (with k = 2) (images not reproduced).

Recent / ongoing work / challenges

When is information preserved with sketches / projections?
Bourrier & al, Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems, IEEE Transactions on Information Theory, 2014.
Notion of instance optimal decoders = uniform guarantees; fundamental role of a general Restricted Isometry Property.

How to reconstruct: which algorithm / decoder, with which guarantees?
Traonmilin & G., Stable recovery of low-dimensional cones in Hilbert spaces - One RIP to rule them all, ACHA 2016.
RIP guarantees for general (convex & nonconvex) regularizers, for decoders of the form
$$\Delta(y) := \arg\min_{x \in \mathcal{H}} f(x) \quad \text{s.t.} \quad \|Mx - y\| \le \epsilon.$$

How to (maximally) reduce dimension? [Dirksen 2014]: given a random sub-gaussian linear form.
Puy & al, Recipes for stable linear embeddings from Hilbert spaces to $\mathbb{R}^m$, arXiv:1509.06947.
Role of the covering dimension / Gaussian width of the normalized secant set.

What is the achievable compression for learning tasks?
Compressive statistical learning, work in progress with G. Blanchard, N. Keriven, Y. Traonmilin.
Number of random moments = "intrinsic dimension" of PCA, k-means, dictionary learning, ...
Statistical learning: risk minimization + generalization to future samples with the same distribution.

THANKS

SPARS 2017: Signal Processing with Adaptive Sparse Structured Representations
Lisbon, Portugal, June 5-8, 2017
Submission deadline: December 12, 2016. Notification of acceptance: March 27, 2017.
Summer School: May 31-June 2, 2017 (tbc). Workshop: June 5-8, 2017.
spars2017.lx.it.pt