DPPs in stats and ML
with real bits of joint work w/ Adrien Hardy, Michalis Titsias, Guillaume Gautier, and Michal Valko.
Remi Bardenet
CNRS & CRIStAL, Univ. Lille, France
Remi Bardenet (CNRS & Univ. Lille) Determinantal point processes in stats and ML: computational issues 1
Summary
Determinantal point processes
A zoo of DPPs
DPPs in stats and ML
Advances on inference and sampling
Point processes
I A point process X on S is a random countable set of points in S.
I In most cases, it is defined by its joint intensities ρ_k:
E[ ∏_{i=1}^k X(D_i) ] = ∫_{D_1 × ··· × D_k} ρ_k(x_1, …, x_k) dµ(x_1) ··· dµ(x_k)

for disjoint D_i's, see [6].
I A point process is determinantal with kernel K if, for all k,

ρ_k(x_1, …, x_k) = det( (K(x_i, x_j))_{i,j=1}^k ).
Determinantal point processes
I Existence is tricky, see e.g. [11]
I A DPP is repulsive: for a real symmetric kernel, ρ_2(x, y) = ρ_1(x)ρ_1(y) − K(x, y)² ≤ ρ_1(x)ρ_1(y).
I Repulsiveness is geometric: joint intensities are determinants, i.e. squared volumes.
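In the finite setting this negative correlation is easy to check numerically: P({i, j} ⊆ X) = det K_{{i,j}} = K_ii K_jj − K_ij², which never exceeds K_ii K_jj. A minimal NumPy sketch (the kernel below is an arbitrary valid choice, not from the talk):

```python
import numpy as np

# Build a valid finite DPP kernel: symmetric with eigenvalues in [0, 1).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T                                   # symmetric PSD
K = M / (np.linalg.eigvalsh(M).max() + 1.0)   # eigenvalues now in [0, 1)

# Pairwise inclusion probabilities never exceed the product of singletons.
for i in range(5):
    for j in range(i + 1, 5):
        pair = np.linalg.det(K[np.ix_([i, j], [i, j])])   # P(i and j both in X)
        assert pair <= K[i, i] * K[j, j] + 1e-12          # negative correlation
```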
Summary
Determinantal point processes
A zoo of DPPs
DPPs in stats and ML
Advances on inference and sampling
Uniform spanning trees
I Let A be the vertex-edge incidence matrix of a connected graph G, and drop the last row.
I Sample a uniform spanning tree T of G; then

{edges of T} ∼ DPP(K),

with K = A^T (AA^T)^{-1} A, see [5].
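A quick numerical check of this kernel (a sketch on an arbitrary small graph, not a UST sampler): K should be an orthogonal projection of rank |V| − 1, the number of edges in any spanning tree.

```python
import numpy as np

# Vertex-edge incidence matrix of a small connected graph:
# vertices 0..3, edges below, each oriented arbitrarily.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
A = np.zeros((4, len(edges)))
for e, (u, v) in enumerate(edges):
    A[u, e], A[v, e] = 1.0, -1.0
A = A[:-1]                                  # drop the last row (rank |V| - 1)

K = A.T @ np.linalg.solve(A @ A.T, A)       # DPP kernel on the edge set
assert np.allclose(K, K @ K)                # K is a projection...
assert np.isclose(np.trace(K), 3.0)         # ...of rank |V| - 1 = spanning tree size
```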
Random matrices
Eigenvalues of some random matrices are DPPs:
I when G is filled in with iid complex Gaussians,
Figure: The Ginibre ensemble with N = 1000.
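These eigenvalues are simple to generate directly (a sketch that draws the matrix and its spectrum; plotting omitted):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1000
# iid standard complex Gaussian entries
G = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
eigs = np.linalg.eigvals(G)

# With this scaling the spectrum fills a disk of radius ~ sqrt(N) (circular law).
assert eigs.shape == (N,)
assert np.abs(eigs).max() < 1.2 * np.sqrt(N)
```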
Random matrices
Eigenvalues of some random matrices are DPPs:
I when H = (G + G*)/2,

Figure: The GUE with N = 50.
A zoo of DPPs: N free fermions at equilibrium
I In statistical quantum physics, a system of one particle is described at equilibrium by the eigenstates of its Hamiltonian,

H ψ_E(q) = E ψ_E(q).

I For N indistinguishable particles, we want a Ψ : S^N → C such that

|Ψ(q_{σ(1)}, …, q_{σ(N)})|² = |Ψ(q_1, …, q_N)|², ∀σ ∈ S_N.
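The standard answer, and the reason free fermions are determinantal, is the Slater determinant of the first N eigenstates; sketching the argument:

```latex
\Psi(q_1,\dots,q_N) = \frac{1}{\sqrt{N!}}\,\det\big(\psi_{E_i}(q_j)\big)_{i,j=1}^N .
```

Permuting the q_j permutes the columns of the matrix, so Ψ picks up a factor sgn(σ) and |Ψ|² is invariant. Expanding |Ψ|² shows the particle positions form a projection DPP with kernel K_N(x, y) = ∑_{i=1}^N ψ_{E_i}(x) ψ_{E_i}(y)*.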
Summary
Determinantal point processes
A zoo of DPPs
DPPs in stats and ML
Advances on inference and sampling
Orthogonal polynomial ensembles
I Let µ be a positive Borel measure on R^d; [2] build a DPP(µ, K_N) such that

√(N^{1+1/d}) ( ∑_{i=1}^N f(x_i)/K_N(x_i, x_i) − ∫ f(x) µ(dx) ) → N(0, Ω²_{f,ω}) in law as N → ∞,

for f essentially C^1, where Ω_{f,ω} measures the decay of the Fourier coefficients of f.
I This is useful for Monte Carlo, provided we know how to sample from that DPP!
Computational issue #1: Sampling from a DPP
I The vanilla algorithm starts from a diagonalized kernel

K(x, y) = ∑_{i=1}^∞ λ_i φ_i(x)φ_i(y).

I This is O(N³), even knowing the diagonalization and neglecting rejection sampling!
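For a finite projection kernel, the algorithm reduces to a chain rule with Schur-complement conditioning; a minimal sketch (not the continuous rejection-sampling version; the rank-2 kernel is an arbitrary illustrative choice):

```python
import numpy as np

def sample_projection_dpp(K, rng):
    """Exact sample from a DPP with finite orthogonal-projection kernel K.

    Draws rank(K) points: at each step, pick a point with probability
    proportional to the diagonal of the current kernel, then condition on
    it with a Schur complement update (the repeated N x N updates show
    where the cubic cost comes from).
    """
    K = K.copy()
    k = int(round(np.trace(K)))              # rank = sample size for projections
    sample = []
    for _ in range(k):
        p = np.clip(np.diag(K), 0, None)
        i = rng.choice(len(K), p=p / p.sum())
        sample.append(i)
        K = K - np.outer(K[:, i], K[i, :]) / K[i, i]   # condition on i in X
    return sample

# Example: a rank-2 projection kernel on 6 items.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))
K = Q @ Q.T
s = sample_projection_dpp(K, rng)
assert len(s) == 2 and len(set(s)) == 2
```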
An example from spatial statistics [10, Section 5.4]
An example from spatial statistics [10, Section 5.4]
I Compare a hardcore-Strauss model

p(x_{1:n} | β, γ, r_1, r_2) ∝ β^n ∏_i 1_{x_i ∈ W} ∏_{i<j} 1_{‖x_i − x_j‖ > r_1} γ^{1_{‖x_i − x_j‖ < r_2}},  (1)

I fitted with ad hoc pseudolikelihood methods.
An example from spatial statistics [10, Section 5.4]
I to a Matérn DPP

ρ_k(x | ρ, ν, α) = det( (K(x_i, x_j)) ) ∏_i 1_{x_i ∈ W},

with K(x, y) = τ K_{ν,α}(‖x − y‖), K_{ν,α}(0) = 1.
I Since ∫ K(x, x) 1_W(x) dx = τ|W|, we have an unbiased estimator τ̂ = n/|W|.
Computational issue #2: Fitting a DPP
The density of a DPP
Let µ be supported on a compact S, let

K(x, y) = ∑_{k>0} λ_k Φ_k(x)Φ_k(y),

and assume λ_k ∈ [0, 1) for all k. Then DPP(K, µ) has a density f w.r.t. the unit rate Poisson process on S, and

f(x_1, …, x_n) ∝ det( (L(x_i, x_j)) ) / det(I + L),

where L = (I − K)^{-1} K has kernel

L(x, y) = ∑_{k>0} [λ_k / (1 − λ_k)] Φ_k(x)Φ_k(y).
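A discrete analogue is easy to verify by brute force: with µ the counting measure on N points, the normalization in the theorem reads ∑_A det L_A = det(I + L). A sketch, with an arbitrary valid K:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
N = 4
B = rng.standard_normal((N, N))
M = B @ B.T
K = M / (np.linalg.eigvalsh(M).max() + 1.0)   # eigenvalues in [0, 1)
L = np.linalg.solve(np.eye(N) - K, K)         # L = (I - K)^{-1} K

# Normalization: the L-ensemble probabilities det(L_A)/det(I+L) sum to 1.
# (numpy returns det = 1 for the empty 0x0 submatrix, as required.)
total = sum(
    np.linalg.det(L[np.ix_(list(A), list(A))])
    for r in range(N + 1)
    for A in combinations(range(N), r)
)
assert np.isclose(total, np.linalg.det(np.eye(N) + L))
```

The same enumeration also recovers the marginal identity P(S ⊆ X) = det K_S.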
Analysis of [10]
Text summarization [9]
Text summarization [9]
I Build a kernel between sentences:

L_ij = √q_i S_ij √q_j,

where S_ij ∝ ∑_w tf_i(w) tf_j(w) idf(w)², and q_i = exp(θ^T u_i),
I and sample I with probability ∝ det(L_I) 1_{|I|=k} (a k-DPP).
I Fitting θ is relatively easy.
I Note this is not a DPP if L is not a projection.
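A toy sketch of this quality-diversity construction (the tf-idf numbers and the scalar feature u_i are made up for illustration, not the authors' setup):

```python
import numpy as np

# Toy tf-idf rows, one per sentence (illustrative numbers).
tfidf = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.2, 0.1],
                  [0.0, 0.1, 0.9]])
S = tfidf @ tfidf.T                              # similarity S_ij
S /= np.sqrt(np.outer(np.diag(S), np.diag(S)))   # normalize so S_ii = 1

theta = 0.5                                      # parameter to fit
u = np.array([1.0, 2.0, 0.5])                    # one feature per sentence
q = np.exp(theta * u)                            # qualities q_i = exp(theta^T u_i)

L = np.sqrt(q)[:, None] * S * np.sqrt(q)[None, :]  # L_ij = sqrt(q_i) S_ij sqrt(q_j)
assert np.all(np.linalg.eigvalsh(L) >= -1e-10)     # L stays PSD
```

Scaling a PSD similarity matrix on both sides by diag(√q) preserves positive semi-definiteness, so L is always a valid k-DPP kernel whatever θ is.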
Text summarization [9]
I Again, this requires a lot of sampling, but we have time.
Summary
Determinantal point processes
A zoo of DPPs
DPPs in stats and ML
Advances on inference and sampling
Inference
I Lots of activity in ML and stats, see [4, 13] and refs therein, but no clear winning strategy.
I If you forget about K and parametrize L = L_θ instead, we show in [3] how to bypass the spectral decomposition; see also [1].
Remember computational issue #2: Fitting a DPP
The density of a DPP
Let µ be supported on a compact S, let

K(x, y) = ∑_{k>0} λ_k Φ_k(x)Φ_k(y),

and assume λ_k ∈ [0, 1) for all k. Then DPP(K, µ) has a density f w.r.t. the unit rate Poisson process on S, and

f(x_1, …, x_n) ∝ det( (L(x_i, x_j)) ) / det(I + L),

where L = (I − K)^{-1} K has kernel

L(x, y) = ∑_{k>0} [λ_k / (1 − λ_k)] Φ_k(x)Φ_k(y).
Bounding Fredholm determinants
Proposition
Let Z = {z_1, …, z_m} ⊂ R^d. Then

[det L_Z / det(L_Z + Ψ)] · exp( −∫ L(x, x) dµ(x) + tr(L_Z^{-1} Ψ) ) ≤ 1/det(I + L) ≤ det L_Z / det(L_Z + Ψ),

where L_Z = ((L(z_i, z_j))) and Ψ_ij = ∫ L(z_i, x)L(x, z_j) dµ(x).
I Now we can optimize over Z and plug this into MCMC routines!
I Empirically, we only suffer from the dimension d .
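The sandwich can be sanity-checked in a discrete analogue, taking µ to be the counting measure on a grid so the integrals become sums (the RBF-type kernel and inducing set below are arbitrary illustrative choices, not the setup of [3]):

```python
import numpy as np

X = np.linspace(0, 1, 30)[:, None]            # "ground set": grid, mu = counting measure
L = 0.5 * np.exp(-((X - X.T) ** 2) / 0.1)     # PSD kernel matrix L(x, y)

Zidx = [0, 10, 20, 29]                        # inducing inputs z_1..z_m (grid subset)
LZ = L[np.ix_(Zidx, Zidx)]
Psi = L[Zidx, :] @ L[:, Zidx]                 # Psi_ij = sum_x L(z_i, x) L(x, z_j)

ratio = np.linalg.det(LZ) / np.linalg.det(LZ + Psi)
lower = ratio * np.exp(-np.trace(L) + np.trace(np.linalg.solve(LZ, Psi)))
target = 1.0 / np.linalg.det(np.eye(len(X)) + L)
assert lower <= target <= ratio               # the sandwich holds
```

With Z equal to the whole grid, Ψ = L² and both bounds collapse to equality, which is a handy way to debug the implementation.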
What do the optimized inducing inputs look like?
Figure: The panels in the top row show the initial inducing input locations for various values of m, while the corresponding panels in the bottom row show the optimized locations.
Sampling finite projection DPPs
I Random projections can help [9].
I Some DPPs are just easy to sample: e.g. USTs of graphs with no bottleneck.
I Assume we know A such that K = A^T (AA^T)^{-1} A.
I Key idea we use from [7] in [8]: with Zon(A) := A[0, 1]^n,

Vol(Zon(A)) = ∑_{B∈B} Vol(Zon(B)) = ∑_{B∈B} |det B|,

where B runs over the bases of A.
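In the plane the identity can be checked directly: Zon(A) is a zonogon whose boundary uses each generator once in each direction, so its area comes from the shoelace formula and can be compared to ∑_B |det B| (the generators below are arbitrary):

```python
import numpy as np
from itertools import combinations

def zonogon_area(gens):
    """Area of the planar zonotope A[0,1]^n, built edge by edge.

    A zonogon's boundary traverses each generator once in each direction,
    in angular order; the shoelace formula then gives the area.
    """
    edges = np.vstack([gens, -gens])
    edges = edges[np.argsort(np.arctan2(edges[:, 1], edges[:, 0]))]
    v = np.cumsum(edges, axis=0)              # polygon vertices (closes at 0)
    x, y = v[:, 0], v[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

gens = np.array([[1.0, 0.0], [0.7, 1.2], [-0.3, 0.8]])
# Sum of |det B| over bases B (here: all 2x2 submatrices, all of full rank).
basis_sum = sum(abs(np.linalg.det(gens[list(p), :]))
                for p in combinations(range(len(gens)), 2))
assert np.isclose(zonogon_area(gens), basis_sum)
```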
Sampling the zonotope
Figure: Relative error of the mass of a triplet for a BA graph with random uniform weights, as a function of the number of iterations (×10³); legend: Basis Exchange vs. Zonotope.
Sampling the zonotope
Figure: Relative error of the mass of a triplet for a BA graph with random uniform weights, as a function of CPU time (s); legend: Basis Exchange vs. Zonotope.
Conclusion
I DPPs are the kernel machine of point processes,
I with applications in stats [10] and ML [9];
I applications in signal processing [12] and Bayesian nonparametrics are coming!
I Fast inference and sampling are available.
I Powerful statistical models and algorithms combine ideas from algebra, combinatorial geometry, and functional analysis.
References I
[1] R. H. Affandi, E. B. Fox, R. P. Adams, and B. Taskar.
Learning the parameters of determinantal point processes.
In Proceedings of the International Conference on Machine Learning (ICML), 2014.
[2] R. Bardenet and A. Hardy.
Monte Carlo with determinantal point processes.
arXiv preprint arXiv:1605.00361, 2016.
[3] R. Bardenet and M. K. Titsias.
Inference for determinantal point processes without spectral knowledge.
In Advances in Neural Information Processing Systems (NIPS), pages 3375–3383, 2015.
References II
[4] V.-E. Brunel, A. Moitra, P. Rigollet, and J. Urschel.
Maximum likelihood estimation of determinantal point processes.
arXiv preprint arXiv:1701.06501, 2017.
[5] R. Burton and R. Pemantle.
Local characteristics, entropy and limit theorems for spanning trees and domino tilings via transfer-impedances.
Annals of Probability, 21(3):1329–1371, 1993.
[6] D. J. Daley and D. Vere-Jones.
An introduction to the theory of point processes.
Springer, 2nd edition, 2003.
M. Dyer and A. Frieze.
Random walks, totally unimodular matrices, and a randomised dual simplex algorithm.
Mathematical Programming, 64(1-3):1–16, 1994.
References III
[8] G. Gautier, R. Bardenet, and M. Valko.
Zonotope hit-and-run for efficient sampling of projection DPPs.
In International Conference on Machine Learning (ICML), 2017.
[9] A. Kulesza and B. Taskar.
Determinantal point processes for machine learning.
Foundations and Trends in Machine Learning, 2012.
[10] F. Lavancier, J. Møller, and E. Rubak.
Determinantal point process models and statistical inference: extended version.
arXiv preprint arXiv:1205.4818, 2014.
[11] O. Macchi.
The coincidence approach to stochastic point processes.
Advances in Applied Probability, 7:83–122, 1975.
References IV
[12] N. Tremblay, P.-O. Amblard, and S. Barthelme.
Graph sampling with determinantal processes.
arXiv preprint arXiv:1703.01594, 2017.
[13] J. Urschel, V.-E. Brunel, A. Moitra, and P. Rigollet.
Learning determinantal point processes with moments and cycles.
arXiv preprint arXiv:1703.00539, 2017.