
Page 1: Slides: A glance at information-geometric signal processing

A glance at information-geometric signal processing

Frank Nielsen

Sony Computer Science Laboratories, Inc.

22nd October 2012 (MAHI)

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 1/1

Page 2: Slides: A glance at information-geometric signal processing

Information geometry in Statistical Signal Processing

Statistical signal processing (SSP) models data with distributions:

◮ parametric (Gaussians, histograms) [model size ∼ D],

◮ semi-parametric (mixtures) [model size ∼ kD],

◮ non-parametric (kernel density estimators [model size ∼ n], Dirichlet/Gaussian processes [model size ∼ D log n])

Data = Pattern (→information) + noise (independent)

Paradigm of computational information geometry provides:

◮ Information (entropy), statistical invariance & geometry,

◮ Language of geometry for intuitive reasoning,

◮ Novel geometric algorithms for signal processing.

→ Intrinsic data analysis

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 2/1

Page 3: Slides: A glance at information-geometric signal processing

Example of information-geometric SSP (I)

Statistical distance: total Bregman divergence (tBD).

Shape Retrieval using Hierarchical Total Bregman Soft Clustering, IEEE

Trans. Pattern Analysis and Machine Intelligence (PAMI), 2012.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 3/1

Page 4: Slides: A glance at information-geometric signal processing

Example of information-geometric SSP (II)

DTI: diffusion ellipsoids interpreted as zero-centered Gaussian distributions; statistical distance: total Bregman divergence (tBD).

(3D rat corpus callosum)

Total Bregman Divergence and its Applications to DTI Analysis, IEEE

Trans. Medical Imaging (TMI), 2010.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 4/1

Page 5: Slides: A glance at information-geometric signal processing

Statistical mixtures: Generative models of data sets

GMM = feature descriptor for information retrieval (IR) → classification [19], matching, etc. Increase dimension using color image patches. Low-frequency information encoded into a compact statistical model.

Generative model → statistical image by GMM sampling.

→ A mixture $\sum_{i=1}^k w_i N(\mu_i, \Sigma_i)$ is interpreted as a weighted point set in a parameter space: $\{(w_i, \theta_i = (\mu_i, \Sigma_i))\}_{i=1}^k$.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 5/1

Page 6: Slides: A glance at information-geometric signal processing

Information-geometric hyperspectral imaging

Image with z-axis = spectral bands (radiance or reflectance).
→ characterize spectral variability, similarity and discrimination.

◮ Normalize the hyperspectral pixel vector (→ histogram):

$p_i = \frac{x_i}{\sum_{i=1}^L x_i}$

◮ Spectral information divergence (single pixel):

$\mathrm{SID}(x, y) = D(x\|y) + D(y\|x), \qquad D(p\|q) = \sum_{i=1}^L p_i \log\frac{p_i}{q_i}$

(aka. the Jeffreys symmetrized Kullback-Leibler divergence [23])
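A minimal NumPy sketch of the spectral information divergence above; the function name sid and the two 5-band example vectors are hypothetical, and a small epsilon guards against zero bands before normalization.

```python
import numpy as np

def sid(x, y, eps=1e-12):
    """Spectral Information Divergence: normalize each spectrum into a
    probability vector, then return the Jeffreys-symmetrized KL divergence."""
    p = np.asarray(x, dtype=float) + eps
    q = np.asarray(y, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

# Two hypothetical 5-band radiance vectors
print(sid([0.2, 0.4, 0.9, 1.3, 0.7], [0.25, 0.35, 1.0, 1.1, 0.8]))
```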

C.-I. Chang, An information-theoretic approach to spectral variability, similarity, and discrimination for hyperspectral image analysis, IEEE Trans. Information Theory, 2000.

Sided and symmetrized Bregman centroids, IEEE Trans. Information Theory, 2009.

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 6/1

Page 7: Slides: A glance at information-geometric signal processing

Fisher-Rao Riemannian geometry (1945)

◮ D-parametric distribution family: $\{p(x;\theta) \mid \theta \in \Theta \subseteq \mathbb{R}^D\}$.

◮ Fisher Information Matrix (FIM):

$I(\theta) = [I_{ij}], \qquad I_{ij} = E_\theta\!\left[\frac{\partial \log p(x;\theta)}{\partial \theta_i}\,\frac{\partial \log p(x;\theta)}{\partial \theta_j}\right], \qquad I(\theta) = \mathrm{Var}\!\left[\frac{\partial}{\partial \theta}\log p(x;\theta)\right] \succeq 0,$

always positive semi-definite: $\forall x,\ x^\top I(\theta)\, x \ge 0$.

◮ Cramér-Rao lower bound (CRLB) for an unbiased estimator $\hat\theta$:

$\mathrm{Var}[\hat\theta] \succeq I^{-1}(\theta)$

Löwner ordering on the cone of positive semi-definite matrices: $A \succeq B \Leftrightarrow A - B \succeq 0$.

→ FIM interpreted as the curvature of the log-likelihood function (derivative of the log-likelihood = score function).
→ Estimation efficiency of $\hat\theta$ depends on the true hidden parameter $\theta$.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 7/1

Page 8: Slides: A glance at information-geometric signal processing

Fisher-Rao Riemannian geometry (1945)

Rao chose the FIM for defining a statistical manifold (M, g)

◮ Infinitesimal length element:

$ds^2 = \sum_{ij} g_{ij}(\theta)\, d\theta_i\, d\theta_j = d\theta^\top I(\theta)\, d\theta$

◮ Geodesics and distances are hard to calculate explicitly:

$\rho(p(x;\theta_1), p(x;\theta_2)) = \min_{\substack{\theta(s)\\ \theta(0)=\theta_1,\ \theta(1)=\theta_2}} \int_0^1 \sqrt{\left(\frac{d\theta}{ds}\right)^{\!\top} I(\theta)\,\frac{d\theta}{ds}}\; ds$

◮ Metric property of $\rho$, log/exp tangent/manifold mappings

→ FR geometry is limited from the viewpoint of computation.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 8/1

Page 9: Slides: A glance at information-geometric signal processing

A particular case of Fisher-Rao Riemannian geometry

For location-scale families (normal, Cauchy, Laplace, uniform, elliptical): $p(x;\mu,\sigma) = \frac{1}{\sigma} f\!\left(\frac{x-\mu}{\sigma}\right)$.

Fisher-Rao geometry amounts to hyperbolic geometry of constant curvature $\kappa = -\frac{1}{(d-1)\beta}$ depending on the density profile:

$\beta = \int \left(\frac{x f'(x)}{f(x)} + 1\right)^{\!2} f(x)\, dx.$

FR statistical Voronoi diagram [25] = hyperbolic Voronoi diagram on the parameter space.

[Klein ball] Hyperbolic Voronoi diagrams made easy, ICCSA, 2010. [4]

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 9/1

Page 10: Slides: A glance at information-geometric signal processing

Statistical invariance

Riemannian structure $(M, g)$ on $\{p(x;\theta) \mid \theta \in \Theta \subset \mathbb{R}^D\}$

◮ θ-Invariance under non-singular parameterization:

$\rho(p(x;\theta), p(x;\theta')) = \rho(p(x;\lambda(\theta)), p(x;\lambda(\theta')))$

The normal parameterizations $(\mu, \sigma)$ and $(\mu, \sigma^2)$ yield the same distance.

◮ x-Invariance under a different x-representation: sufficient statistics (Fisher, 1922):

$\Pr(X = x \mid T(X) = t, \theta) = \Pr(X = x \mid T(X) = t)$

All information for θ is contained in T.
→ Lossless information data reduction (exponential families).
Markov kernel = statistical morphism (Chentsov 1972, [7, 8]).
A particular Markov kernel is a deterministic mapping $T: \mathcal{X} \to \mathcal{Y}$ with $y = T(x)$, $p_y = p_x T^{-1}$.

Invariance if and only if g = Fisher information matrix

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 10/1

Page 11: Slides: A glance at information-geometric signal processing

f-divergences (1960's)

A statistical non-metric distance between two probability measures:

$I_f(p:q) = \int f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, dx$

$f$: continuous convex function with $f(1) = 0$, $f'(1) = 0$, $f''(1) = 1$.
→ asymmetric (not a metric, except TV), modulo affine terms.
→ can always be symmetrized using $s = f + f^*$, with $f^*(x) = x f(1/x)$.
Includes many well-known statistical measures: Kullback-Leibler, α-divergences, Hellinger, chi-squared, total variation (TV), etc.

f-divergences are the only statistical divergences that preserve equivalence wrt. sufficient statistic mappings:

$I_f(p:q) \ge I_f(p_M : q_M)$

with equality if and only if M = T (monotonicity property).

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 11/1

Page 12: Slides: A glance at information-geometric signal processing

Outline: Dually flat spaces

Statistical invariance also obtained using (M, g ,∇,∇∗) where

∇ and ∇∗ are dual affine connections.

The Riemannian structure (M, g) is the particular case ∇ = ∇∗ = ∇(0), the Levi-Civita connection: (M, g) = (M, g, ∇(0), ∇(0)).

Dually flat spaces are algorithmically friendly:

◮ Statistical mixtures of exponential families

◮ Learning & simplifying mixtures (k-MLE)

◮ Bregman Voronoi diagrams & dually ⊥ triangulations

Goal: Algorithmics of Gaussians/histograms wrt. Kullback-Leibler divergence.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 12/1

Page 13: Slides: A glance at information-geometric signal processing

Exponential Family Mixture Models (EFMMs)

Generalize Gaussian & Rayleigh MMs to many usual distributions.

$m(x) = \sum_{i=1}^k w_i\, p_F(x; \lambda_i)$ with $\forall i,\ w_i > 0$ and $\sum_{i=1}^k w_i = 1$,

$p_F(x; \lambda) = e^{\langle t(x), \theta\rangle - F(\theta) + k(x)}$

$F$: log-Laplace transform (partition, cumulant function):

$F(\theta) = \log \int_{x \in \mathcal{X}} e^{\langle t(x), \theta\rangle + k(x)}\, dx, \qquad \theta \in \Theta = \left\{\theta \;\middle|\; \int_{x \in \mathcal{X}} e^{\langle t(x), \theta\rangle + k(x)}\, dx < \infty\right\}$

the natural parameter space.

◮ $d$: dimension of the support $\mathcal{X}$.

◮ $D$: order of the family ($= \dim \Theta$). Sufficient statistic: $t(x): \mathbb{R}^d \to \mathbb{R}^D$.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 13/1

Page 14: Slides: A glance at information-geometric signal processing

Statistical mixtures: Rayleigh MMs [35, 20]

IntraVascular UltraSound (IVUS) imaging.

Rayleigh distribution:

$p(x; \lambda) = \frac{x}{\lambda^2}\, e^{-\frac{x^2}{2\lambda^2}}, \qquad x \in \mathbb{R}^+$

$d = 1$ (univariate), $D = 1$ (order 1)
$\theta = -\frac{1}{2\lambda^2}$, $\Theta = (-\infty, 0)$
$F(\theta) = -\log(-2\theta)$
$t(x) = x^2$
$k(x) = \log x$
(Weibull with shape $k = 2$)

Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): for segmentation and classification tasks.
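A quick numerical check (not from the slides) that the canonical decomposition above reproduces the Rayleigh density; the scale value and evaluation point are arbitrary.

```python
import numpy as np

lam, x = 1.5, 2.0                       # hypothetical scale and evaluation point

theta = -1.0 / (2 * lam**2)             # natural parameter theta = -1/(2 lambda^2)
F = -np.log(-2 * theta)                 # log-normalizer F(theta) = -log(-2 theta)
t, k = x**2, np.log(x)                  # sufficient statistic and carrier measure

p_ef = np.exp(theta * t - F + k)        # exponential-family form
p_rayleigh = (x / lam**2) * np.exp(-x**2 / (2 * lam**2))
print(p_ef, p_rayleigh)                 # identical up to floating-point error
```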

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 14/1

Page 15: Slides: A glance at information-geometric signal processing

Statistical mixtures: Gaussian MMs [9, 20, 10]

Gaussian mixture models (GMMs): model low frequencies.
Color image interpreted as a 5D xyRGB point set.

Gaussian distribution:

$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}}\, e^{-\frac{1}{2} D_{\Sigma^{-1}}(x-\mu,\, x-\mu)}$

Squared Mahalanobis distance: $D_Q(x, y) = (x-y)^\top Q (x-y)$

$x \in \mathbb{R}^d$ (multivariate), $D = \frac{d(d+3)}{2}$ (order)
$\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) = (\theta_v, \theta_M)$, $\Theta = \mathbb{R}^d \times S^d_{++}$
$F(\theta) = \frac{1}{4}\theta_v^\top \theta_M^{-1}\theta_v - \frac{1}{2}\log|\theta_M| + \frac{d}{2}\log\pi$
$t(x) = (x, -x x^\top)$
$k(x) = 0$
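The same kind of sanity check, written as a minimal sketch for the multivariate Gaussian decomposition above (the 2D parameters are arbitrary; SciPy is only used for the reference log-density).

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.7, 0.2])
d = len(mu)

theta_v = np.linalg.solve(Sigma, mu)          # Sigma^{-1} mu
theta_M = 0.5 * np.linalg.inv(Sigma)          # (1/2) Sigma^{-1}
F = (0.25 * theta_v @ np.linalg.solve(theta_M, theta_v)
     - 0.5 * np.log(np.linalg.det(theta_M)) + 0.5 * d * np.log(np.pi))

# <t(x), theta> with t(x) = (x, -x x^T) and k(x) = 0
inner = x @ theta_v - x @ theta_M @ x
print(inner - F, multivariate_normal(mu, Sigma).logpdf(x))   # agree numerically
```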

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 15/1

Page 16: Slides: A glance at information-geometric signal processing

Sampling from a Gaussian Mixture Model

To sample a variate x from a GMM:

◮ Choose a component l according to the weight distribution $w_1, \ldots, w_k$,

◮ Draw a variate x according to $N(\mu_l, \Sigma_l)$.

→ Sampling is a doubly stochastic process:

◮ throw a biased die with k faces to choose the component:

$l \sim \mathrm{Multinomial}(w_1, \ldots, w_k)$

(The multinomial is also an EF: normalized histogram.)

◮ then draw at random a variate x from the l-th component:

$x \sim \mathrm{Normal}(\mu_l, \Sigma_l)$

$x = \mu_l + C z$ with the Cholesky factorization $\Sigma_l = C C^\top$ and $z = [z_1 \ldots z_d]^\top$ a standard normal random variate, e.g. $z_i = \sqrt{-2\log U_1}\cos(2\pi U_2)$ (Box-Muller).
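A minimal sketch of this doubly stochastic sampler; the component weights, means, and covariances below are hypothetical, and NumPy's Gaussian generator is used in place of the Box-Muller formula.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])                                 # hypothetical GMM
means = [np.array([0.0, 0.0]), np.array([4.0, 1.0])]
covs = [np.eye(2), np.array([[1.0, 0.6], [0.6, 2.0]])]

def sample_gmm(n):
    out = np.empty((n, 2))
    for i in range(n):
        l = rng.choice(len(weights), p=weights)     # multinomial component draw
        C = np.linalg.cholesky(covs[l])             # Sigma_l = C C^T
        z = rng.standard_normal(2)                  # standard normal variate z
        out[i] = means[l] + C @ z                   # x = mu_l + C z
    return out

print(sample_gmm(5))
```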

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 16/1

Page 17: Slides: A glance at information-geometric signal processing

Relative entropy for exponential families

◮ Distance between features (e.g., GMMs)
◮ Kullback-Leibler divergence (cross-entropy minus entropy):

$\mathrm{KL}(P:Q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx \ge 0$

$= \underbrace{\int p(x)\log\frac{1}{q(x)}\, dx}_{H^\times(P:Q)} - \underbrace{\int p(x)\log\frac{1}{p(x)}\, dx}_{H(P)=H^\times(P:P)}$

$= F(\theta_Q) - F(\theta_P) - \langle \theta_Q - \theta_P, \nabla F(\theta_P)\rangle = B_F(\theta_Q : \theta_P)$

Bregman divergence $B_F$ defined for a strictly convex and differentiable function, up to affine terms.

◮ The proof of $\mathrm{KL}(P:Q) = B_F(\theta_Q : \theta_P)$ follows from

$X \sim EF(\theta) \implies E[t(X)] = \nabla F(\theta)$
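A small numerical check of $\mathrm{KL}(P:Q) = B_F(\theta_Q : \theta_P)$ for univariate Gaussians, written as a minimal sketch with the standard parameterization $t(x) = (x, x^2)$, $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$ rather than the multivariate convention used earlier; function names and test values are illustrative.

```python
import numpy as np

def F(theta):                    # log-normalizer of the univariate Gaussian EF
    t1, t2 = theta
    return -t1**2 / (4 * t2) + 0.5 * np.log(-np.pi / t2)

def grad_F(theta):               # eta = grad F(theta) = E[t(X)] = (mu, mu^2 + var)
    t1, t2 = theta
    mu, var = -t1 / (2 * t2), -1.0 / (2 * t2)
    return np.array([mu, mu**2 + var])

def bregman(theta_q, theta_p):   # B_F(theta_Q : theta_P)
    return F(theta_q) - F(theta_p) - (theta_q - theta_p) @ grad_F(theta_p)

def natural(mu, s2):
    return np.array([mu / s2, -1.0 / (2 * s2)])

mu_p, s2_p, mu_q, s2_q = 0.0, 1.0, 1.0, 2.0     # P = N(0,1), Q = N(1,2)
kl_closed = 0.5 * (np.log(s2_q / s2_p) + (s2_p + (mu_p - mu_q)**2) / s2_q - 1)
print(kl_closed, bregman(natural(mu_q, s2_q), natural(mu_p, s2_p)))  # both agree
```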

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 17/1

Page 18: Slides: A glance at information-geometric signal processing

Convex duality: Legendre transformation

◮ For a strictly convex and differentiable function $F: \mathcal{X} \to \mathbb{R}$:

$F^*(y) = \sup_{x \in \mathcal{X}} \underbrace{\{\langle y, x\rangle - F(x)\}}_{l_F(y;x)}$

◮ Maximum obtained for $y = \nabla F(x)$:

$\nabla_x\, l_F(y;x) = y - \nabla F(x) = 0 \;\Rightarrow\; y = \nabla F(x)$

◮ Maximum unique from convexity of $F$ ($\nabla^2 F \succ 0$):

$\nabla^2_x\, l_F(y;x) = -\nabla^2 F(x) \prec 0$

◮ Convex conjugates:

$(F, \mathcal{X}) \Leftrightarrow (F^*, \mathcal{Y}), \qquad \mathcal{Y} = \{\nabla F(x) \mid x \in \mathcal{X}\}$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 18/1

Page 19: Slides: A glance at information-geometric signal processing

Legendre duality: Geometric interpretation

Consider the epigraph of F as a convex object:

◮ convex hull (V-representation), versus
◮ half-space (H-representation).

[Figure: the graph plot $\mathcal{F}: (x, F(x))$ with the tangent line $H_P: z = (x - x_P)F'(x_P) + F(x_P)$ at $P = (x_P, z_P = F(x_P))$, intersecting the z-axis at $(0, F(x_P) - x_P F'(x_P)) = (0, -F^*(y_P))$; similarly for $H_Q$ at Q.]

Dual coordinate systems: $P = (x_P,\ y_P = F'(x_P))$.

The Legendre transform is also called the "slope" transform.

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 19/1

Page 20: Slides: A glance at information-geometric signal processing

Legendre duality & Canonical divergence

◮ Convex conjugates have functionally inverse gradients:

$\nabla F^{-1} = \nabla F^*$

$\nabla F^*$ may require numerical approximation (not always available in analytic closed form).

◮ Involution: $(F^*)^* = F$ with $\nabla F^* = (\nabla F)^{-1}$.

◮ Convex conjugate $F^*$ expressed using $(\nabla F)^{-1}$:

$F^*(y) = \langle (\nabla F)^{-1}(y), y\rangle - F\big((\nabla F)^{-1}(y)\big)$

◮ Fenchel-Young inequality at the heart of the canonical divergence:

$F(x) + F^*(y) \ge \langle x, y\rangle$

$A_F(x:y) = A_{F^*}(y:x) = F(x) + F^*(y) - \langle x, y\rangle \ge 0$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 20/1

Page 21: Slides: A glance at information-geometric signal processing

Dual Bregman divergences & canonical divergence [24]

$\mathrm{KL}(P:Q) = E_P\!\left[\log\frac{p(x)}{q(x)}\right] \ge 0$
$= B_F(\theta_Q : \theta_P) = B_{F^*}(\eta_P : \eta_Q)$
$= F(\theta_Q) + F^*(\eta_P) - \langle \theta_Q, \eta_P\rangle$
$= A_F(\theta_Q : \eta_P) = A_{F^*}(\eta_P : \theta_Q)$

with $\theta_Q$ the natural parameterization and $\eta_P = E_P[t(X)] = \nabla F(\theta_P)$ the moment parameterization.

$\mathrm{KL}(P:Q) = \underbrace{\int p(x)\log\frac{1}{q(x)}\, dx}_{H^\times(P:Q)} - \underbrace{\int p(x)\log\frac{1}{p(x)}\, dx}_{H(P)=H^\times(P:P)}$

Shannon cross-entropy and entropy of an EF [24]:

$H^\times(P:Q) = F(\theta_Q) - \langle \theta_Q, \nabla F(\theta_P)\rangle - E_P[k(x)]$
$H(P) = F(\theta_P) - \langle \theta_P, \nabla F(\theta_P)\rangle - E_P[k(x)]$
$H(P) = -F^*(\eta_P) - E_P[k(x)]$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 21/1

Page 22: Slides: A glance at information-geometric signal processing

Bregman divergence: Geometric interpretation (I)

Potential function F , graph plot F : (x ,F (x)).

DF (p : q) = F (p)− F (q)− 〈p − q,∇F (q)〉

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 22/1

Page 23: Slides: A glance at information-geometric signal processing

Bregman divergence: Geometric interpretation (II)

Potential function $f$, graph plot $\mathcal{F}: (x, f(x))$.

$B_f(p\|q) = f(p) - f(q) - (p - q) f'(q)$

[Figure: the graph $\mathcal{F}$ over $\mathcal{X}$, the tangent hyperplane $H_q$ at the lifted point $\hat q$, and the vertical gap $B_f(p\|q)$ above $p$.]

$B_f(\cdot\|q)$: vertical distance between the hyperplane $H_q$ tangent to $\mathcal{F}$ at the lifted point $\hat q$, and the translated hyperplane at $\hat p$.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 23/1

Page 24: Slides: A glance at information-geometric signal processing

Total Bregman divergence (tBD)

By analogy with least squares vs. total least squares: the total Bregman divergence (tBD) [13, 36, 14]

$\delta_f(x, y) = \frac{b_f(x, y)}{\sqrt{1 + \|\nabla f(y)\|^2}}$

Proved statistical robustness of tBD.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 24/1

Page 25: Slides: A glance at information-geometric signal processing

Bregman sided centroids [23, 19]

Bregman centroids = unique minimizers of average Bregman divergences ($B_F$ is convex in its right argument):

$\bar\theta = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n B_F(\theta_i : \theta)$

$\bar\theta' = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n B_F(\theta : \theta_i)$

$\bar\theta = \frac{1}{n}\sum_{i=1}^n \theta_i$, the center of mass, independent of $F$

$\bar\theta' = (\nabla F)^{-1}\!\left(\frac{1}{n}\sum_{i=1}^n \nabla F(\theta_i)\right)$

→ Generalized Kolmogorov-Nagumo f-means.

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 25/1

Page 26: Slides: A glance at information-geometric signal processing

Bregman divergences $B_F$ and ∇F-means

Bijection: quasi-arithmetic means (∇F) ⇔ Bregman divergences $B_F$.

Bregman divergence $B_F$ (entropy/loss function $F$) | $f = F'$ | $f^{-1} = (F')^{-1}$ | f-mean (generalized mean)
Squared Euclidean distance (half squared loss), $F = \frac{1}{2}x^2$ | $x$ | $x$ | arithmetic mean $\frac{1}{n}\sum_{j=1}^n x_j$
Kullback-Leibler divergence (ext. neg. Shannon entropy), $F = x\log x - x$ | $\log x$ | $\exp x$ | geometric mean $(\prod_{j=1}^n x_j)^{1/n}$
Itakura-Saito divergence (Burg entropy), $F = -\log x$ | $-\frac{1}{x}$ | $-\frac{1}{x}$ | harmonic mean $\frac{n}{\sum_{j=1}^n 1/x_j}$

∇F strictly increasing (like cumulative distribution functions)
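The bijection in the table can be exercised directly: a minimal quasi-arithmetic mean helper (names are illustrative) recovers the arithmetic, geometric, and harmonic means from the three generators.

```python
import numpy as np

def f_mean(values, f, f_inv):
    """Kolmogorov-Nagumo quasi-arithmetic mean associated with f = F'."""
    return f_inv(np.mean(f(np.asarray(values, dtype=float))))

x = [1.0, 2.0, 4.0]
print(f_mean(x, lambda t: t, lambda t: t))                  # arithmetic: 2.333...
print(f_mean(x, np.log, np.exp))                            # geometric: 2.0
print(f_mean(x, lambda t: -1.0 / t, lambda t: -1.0 / t))    # harmonic: ~1.714
```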

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 26/1

Page 27: Slides: A glance at information-geometric signal processing

Bregman sided centroids [23]

The two sided centroids $C$ and $C'$ expressed in the two θ/η coordinate systems: 4 equations.

$C:\ \bar\theta = \frac{1}{n}\sum_{i=1}^n \theta_i, \qquad \bar\eta' = \nabla F(\bar\theta)$

$C':\ \bar\eta = \frac{1}{n}\sum_{i=1}^n \eta_i, \qquad \bar\theta' = \nabla F^*(\bar\eta)$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 27/1

Page 28: Slides: A glance at information-geometric signal processing

Bregman information [23]

Bregman information = minimum of the loss function:

$I_F(\mathcal{P}) = \frac{1}{n}\sum_{i=1}^n B_F(\theta_i : \bar\theta)$
$= \frac{1}{n}\sum_{i=1}^n \big(F(\theta_i) - F(\bar\theta) - \langle \theta_i - \bar\theta, \nabla F(\bar\theta)\rangle\big)$
$= \frac{1}{n}\sum_{i=1}^n F(\theta_i) - F(\bar\theta) - \Big\langle \underbrace{\textstyle\frac{1}{n}\sum_{i=1}^n \theta_i - \bar\theta}_{=0},\ \nabla F(\bar\theta)\Big\rangle$
$= J_F(\theta_1, \ldots, \theta_n)$

the Jensen diversity index (e.g., Jensen-Shannon for $F(x) = x\log x$).

◮ For the squared Euclidean distance, the Bregman information is the cluster variance.

◮ For the Kullback-Leibler divergence, the Bregman information is related to mutual information.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 28/1

Page 29: Slides: A glance at information-geometric signal processing

Bregman k-means clustering [5]

Bregman k-means: find k centers $\mathcal{C} = \{C_1, \ldots, C_k\}$ that minimize the loss function:

$L_F(\mathcal{P} : \mathcal{C}) = \sum_{P \in \mathcal{P}} B_F(P : \mathcal{C}), \qquad B_F(P : \mathcal{C}) = \min_{i \in \{1, \ldots, k\}} B_F(P : C_i)$

→ generalizes Lloyd's quadratic error in Vector Quantization (VQ).

$L_F(\mathcal{P} : \mathcal{C}) = I_F(\mathcal{P}) - I_F(\mathcal{C})$

$I_F(\mathcal{P})$ → total Bregman information
$I_F(\mathcal{C})$ → between-cluster Bregman information
$L_F(\mathcal{P} : \mathcal{C})$ → within-cluster Bregman information

total Bregman information = within-cluster Bregman information + between-cluster Bregman information

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 29/1

Page 30: Slides: A glance at information-geometric signal processing

Bregman k-means clustering [5]

$I_F(\mathcal{P}) = L_F(\mathcal{P} : \mathcal{C}) + I_F(\mathcal{C})$

Bregman clustering amounts to finding the partition $\mathcal{C}^*$ that minimizes the information loss:

$L_F^* = L_F(\mathcal{P} : \mathcal{C}^*) = \min_{\mathcal{C}} \big(I_F(\mathcal{P}) - I_F(\mathcal{C})\big)$

Bregman k-means:

◮ Initialize with distinct seeds: $C_1 = P_1, \ldots, C_k = P_k$
◮ Repeat until convergence:
  ◮ Assign each point to its closest centroid: $\mathcal{C}_i = \{P \in \mathcal{P} \mid B_F(P : C_i) \le B_F(P : C_j)\ \forall j \ne i\}$
  ◮ Update the cluster centroids by taking their centers of mass: $C_i = \frac{1}{|\mathcal{C}_i|}\sum_{P \in \mathcal{C}_i} P$.

The loss function monotonically decreases and converges to a local optimum. (Extends to weighted point sets using barycenters.)

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 30/1
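A minimal NumPy sketch of this Bregman hard clustering loop for a separable generator; the function names, the choice F(x) = x log x − x (Kullback-Leibler/I-divergence), and the synthetic positive data are all illustrative.

```python
import numpy as np

def bregman_div(P, c, F, grad_F):
    """B_F(p : c) = F(p) - F(c) - <p - c, grad F(c)>, coordinate-wise generator."""
    return np.sum(F(P) - F(c) - (P - c) * grad_F(c), axis=-1)

def bregman_kmeans(P, k, F, grad_F, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), size=k, replace=False)].copy()
    for _ in range(n_iter):
        D = np.stack([bregman_div(P, c, F, grad_F) for c in centers], axis=1)
        labels = D.argmin(axis=1)                              # assignment step
        for j in range(k):
            if np.any(labels == j):
                centers[j] = P[labels == j].mean(axis=0)       # center-of-mass update
    return centers, labels

F, grad_F = lambda x: x * np.log(x) - x, np.log
pts = np.abs(np.random.default_rng(1).normal(3.0, 1.0, size=(200, 2))) + 0.1
print(bregman_kmeans(pts, k=3, F=F, grad_F=grad_F)[0])
```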

Page 31: Slides: A glance at information-geometric signal processing

Bregman k-means++ [1]: Careful seeding (only?!)

(Also called Bregman k-medians since it minimizes $\sum_i B_F(p_i : x)$.)

Extends the $D^2$-initialization of k-means++.

The seeding stage alone yields a probabilistically guaranteed global approximation factor.

Bregman k-means++:

◮ Choose $\mathcal{C} = \{C_l\}$ for $l$ uniformly random in $\{1, \ldots, n\}$
◮ While $|\mathcal{C}| < k$:
  ◮ Choose $P \in \mathcal{P}$ with probability $\frac{B_F(P : \mathcal{C})}{\sum_{i=1}^n B_F(P_i : \mathcal{C})} = \frac{B_F(P : \mathcal{C})}{L_F(\mathcal{P} : \mathcal{C})}$

→ Yields an O(log k) approximation factor (with high probability). The constant in O(·) depends on the ratio of min/max $\nabla^2 F$.
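A minimal sketch of the seeding stage only, reusing the separable generator interface from the k-means sketch above; all names are illustrative.

```python
import numpy as np

def bregman_kmeanspp_seeds(P, k, F, grad_F, seed=0):
    """Pick the first seed uniformly, then add seeds with probability
    proportional to B_F(p : closest chosen seed)."""
    rng = np.random.default_rng(seed)
    seeds = [P[rng.integers(len(P))]]
    while len(seeds) < k:
        # divergence of every point to its closest current seed
        D = np.min(np.stack([np.sum(F(P) - F(c) - (P - c) * grad_F(c), axis=1)
                             for c in seeds], axis=1), axis=1)
        seeds.append(P[rng.choice(len(P), p=D / D.sum())])
    return np.array(seeds)
```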

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 31/1

Page 32: Slides: A glance at information-geometric signal processing

Exponential family mixtures: Dual parameterizations

A finite weighted point set $\{(w_i, \theta_i)\}_{i=1}^k$ in a statistical manifold.
Many coordinate systems, but two are natural for computing:

◮ the usual λ-parameterization (original parameters $\lambda \in \Lambda$),
◮ the natural θ-parameterization ($\theta \in \Theta$) and the dual η-parameterization ($\eta \in H$, expectation parameters).

They are linked by the Legendre transform $(\Theta, F) \leftrightarrow (H, F^*)$:

$\eta = \nabla_\theta F(\theta), \qquad \theta = \nabla_\eta F^*(\eta)$

(The KL distance is invariant under non-degenerate reparameterization.)

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 32/1

Page 33: Slides: A glance at information-geometric signal processing

Maximum Likelihood Estimator (MLE)

Given n iid. observations $x_1, \ldots, x_n$, the Maximum Likelihood Estimator

$\hat\theta = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_F(x_i; \theta) = \arg\max_{\theta \in \Theta} e^{\sum_{i=1}^n \langle t(x_i), \theta\rangle - F(\theta) + k(x_i)}$

is the unique maximum since $\nabla^2 F \succ 0$. MLE equation:

$\nabla F(\hat\theta) = \frac{1}{n}\sum_{i=1}^n t(x_i)$

The MLE is consistent and efficient, with asymptotic normal distribution:

$\hat\theta \sim N\!\left(\theta, \frac{1}{n} I^{-1}(\theta)\right)$

Fisher information matrix for exponential families:

$I(\theta) = \mathrm{var}[t(X)] = \nabla^2 F(\theta) = (\nabla^2 F^*(\eta))^{-1}$

The MLE may be biased (e.g., normal distributions).
→ called the observed point in information geometry.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 33/1

Page 34: Slides: A glance at information-geometric signal processing

Duality Bregman ↔ Exponential families [5]

[Diagram: exponential family $p_F(x|\theta)$ with cumulant function $F(\theta)$ ↔ (Legendre duality) Bregman generator $F^*(\eta)$ and Bregman divergence $B_{F^*}(x : \eta)$, with $\eta = \nabla F(\theta)$.]

An exponential family

$p_F(x;\theta) = \exp(\langle t(x), \theta\rangle - F(\theta) + k(x))$

has its log-density interpreted as a Bregman divergence:

$\log p_F(x;\theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x)$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 34/1

Page 35: Slides: A glance at information-geometric signal processing

Exponential families ⇔ Bregman divergences: Examples

Identify iso-distance contours as iso-probability contours.
(Bregman divergences always convex on rhs.)

Generator $F(x)$ | Exponential family $p_F(x|\theta)$ | Dual Bregman divergence $B_{F^*}$
$x^2$ | Spherical Gaussian | Squared loss
$x\log x$ | Multinomial | Kullback-Leibler divergence
$x\log x - x$ | Poisson | I-divergence
$-\log x$ | Geometric | Itakura-Saito divergence
$\log|X|$ | Wishart | log-det/Burg matrix divergence [37]

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 35/1

Page 36: Slides: A glance at information-geometric signal processing

Maximum likelihood estimator revisited

$\hat\theta = \arg\max_\theta \prod_{i=1}^n p_F(x_i; \theta)$

$\max_\theta \sum_{i=1}^n \big(\langle t(x_i), \theta\rangle - F(\theta) + k(x_i)\big)$

$\max_\theta \sum_{i=1}^n -B_{F^*}(t(x_i) : \eta) + \underbrace{F^*(t(x_i)) + k(x_i)}_{\text{constant}}$

$\equiv \min_\theta \sum_{i=1}^n B_{F^*}(t(x_i) : \eta)$

Right-sided Bregman centroid = center of mass:

$\hat\eta = \frac{1}{n}\sum_{i=1}^n t(x_i)$

The η-MLE is the center of mass of the sufficient statistics $\{y_i = t(x_i)\}_{i=1}^n$.

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 36/1
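A minimal illustration (values are arbitrary) of the η-MLE for the Rayleigh family of slide 14: average the sufficient statistics t(x) = x², then map back to the source scale parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.0
x = rng.rayleigh(scale=lam_true, size=10_000)

eta_hat = np.mean(x**2)               # eta-MLE: center of mass of t(x) = x^2
theta_hat = -1.0 / eta_hat            # theta = (grad F)^{-1}(eta), since grad F(theta) = -1/theta
lam_hat = np.sqrt(eta_hat / 2.0)      # eta = E[x^2] = 2 lambda^2
print(lam_hat)                        # close to 2.0
```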

Page 37: Slides: A glance at information-geometric signal processing

Learning a mixture using the Expectation-Maximization [5]

◮ EM monotonically increases the expected complete likelihood L (or the log-likelihood function l). (Marginalize the hidden variables $z_i$.)

◮ EM needs an initialization $\Theta_0$, usually by k-means: e.g., for each cluster fit a Gaussian centered at the cluster centroid, with covariance matrix the covariance of the cluster, and weight the relative proportion of points in that cluster.

◮ EM needs a stopping criterion. EM keeps improving the expected log-likelihood; break the loop when the difference of log-likelihood between successive iterations falls below a threshold.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 37/1

Page 38: Slides: A glance at information-geometric signal processing

Learning a mixture using the Expectation-Maximization [5, 13]

EM for an EFMM is equivalent to Bregman soft clustering.
Bregman EM soft clustering algorithm on $\{x_1, \ldots, x_n\}$:

Initialization. Set $\{w_i, \eta_i\}_{i=1}^k$ with $\sum_{i=1}^k w_i = 1$.

Loop until improvement < threshold:

Expectation (compute posterior probabilities). For all observations x and all model components i:

$\Pr(i|x) = \frac{w_i\, e^{-B_{F^*}(x : \eta_i)}}{\sum_{j=1}^k w_j\, e^{-B_{F^*}(x : \eta_j)}}$

Maximization. For all model components i:

$w_i = \frac{1}{n}\sum_{j=1}^n \Pr(i|x_j), \qquad \eta_i = \frac{\sum_{j=1}^n \Pr(i|x_j)\, x_j}{\sum_{j=1}^n \Pr(i|x_j)} \quad \text{(barycenter)}$

Monotonous convergence of the expected complete likelihood.
!!! But sampling variates is a doubly stochastic process... !!!
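A minimal sketch of this soft clustering loop on the sufficient statistics T (an n×D array), again assuming a separable dual generator F*; the function names and the max-shift used for numerical stability are illustrative additions, not part of the original algorithm statement.

```python
import numpy as np

def bregman_soft_clustering(T, k, F_star, grad_F_star, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(T)
    w = np.full(k, 1.0 / k)
    eta = T[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # E-step: Pr(i|x) proportional to w_i exp(-B_{F*}(t(x) : eta_i))
        B = np.stack([np.sum(F_star(T) - F_star(e) - (T - e) * grad_F_star(e), axis=1)
                      for e in eta], axis=1)
        R = w * np.exp(-(B - B.min(axis=1, keepdims=True)))   # shift for stability
        R /= R.sum(axis=1, keepdims=True)
        # M-step: mixture weights and eta-barycenters
        w = R.mean(axis=0)
        eta = (R.T @ T) / R.sum(axis=0)[:, None]
    return w, eta
```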

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 38/1

Page 39: Slides: A glance at information-geometric signal processing

k-MLE for EFMM = Bregman Hard Clustering [18]

Bijection exponential families (distributions) ↔ Bregman distances:

$\log p_F(x;\theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x), \qquad \eta = \nabla F(\theta)$

k-MLE(F) = Bregman hard k-means for $F^*$ + cross-entropy minimization for the weights.
Complete log-likelihood:

$\max_\Theta \sum_{i=1}^n \sum_{j=1}^k \delta_j(z_i)\big(\log p_F(x_i|\theta_j) + \log w_j\big)$

$\min_H \sum_{i=1}^n \sum_{j=1}^k \delta_j(z_i)\Big(\big(B_{F^*}(t(x_i) : \eta_j) - \log w_j\big) \underbrace{-\, k(x_i) - F^*(t(x_i))}_{\text{constant}}\Big)$

$\equiv \min_H \sum_{i=1}^n \min_{j=1}^k \big(B_{F^*}(t(x_i) : \eta_j) - \log w_j\big)$

→ guarantees the (local) convergence of the complete likelihood of k-MLE. (Assign each sample to a unique cluster: hard clustering.)

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 39/1

Page 40: Slides: A glance at information-geometric signal processing

k-MLE for EFMMs [18]

◮ 0. Initialization: $\forall i \in \{1, \ldots, k\}$, let $w_i = \frac{1}{k}$ and $\eta_i = t(x_i)$ (initialization is discussed later on).

◮ 1. Assignment: $\forall i \in \{1, \ldots, n\}$, $z_i = \arg\min_{j=1}^k B_{F^*}(t(x_i) : \eta_j)$.
Let $\mathcal{C}_i = \{x_j \mid z_j = i\}$, $\forall i \in \{1, \ldots, k\}$, be the cluster partition.

◮ 2. Update the η-parameters: $\forall i \in \{1, \ldots, k\}$, $\eta_i = \frac{1}{|\mathcal{C}_i|}\sum_{x \in \mathcal{C}_i} t(x)$.
Go to step 1 unless local convergence of the complete likelihood is reached.

◮ 3. Update the weights: $\forall i \in \{1, \ldots, k\}$, $w_i = \frac{1}{n}|\mathcal{C}_i|$.
Go to step 1 unless local convergence of the complete likelihood is reached.

→ Steps 2 and 3 iterated until convergence: k-MLE = hard EM.
→ Can use other k-means heuristics (like Hartigan's greedy swap).
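A minimal sketch of this hard-assignment loop on the sufficient statistics T (n×D), with the −log w_j term included in the assignment; as before, a separable dual generator and illustrative names are assumed.

```python
import numpy as np

def k_mle(T, k, F_star, grad_F_star, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(T)
    w = np.full(k, 1.0 / k)
    eta = T[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        B = np.stack([np.sum(F_star(T) - F_star(e) - (T - e) * grad_F_star(e), axis=1)
                      for e in eta], axis=1)
        z = (B - np.log(w)).argmin(axis=1)            # step 1: hard assignment
        for j in range(k):                            # step 2: eta-MLE per cluster
            if np.any(z == j):
                eta[j] = T[z == j].mean(axis=0)
        w = np.bincount(z, minlength=k) / n           # step 3: weight update
    return w, eta, z
```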

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 40/1

Page 41: Slides: A glance at information-geometric signal processing

Further generalization of k-MLE

Each mixture component can have its own exponential family.

Infinitely many families of exponential families:

◮ Weibull (incl. Rayleigh or exponential),

◮ generalized Gaussians (incl. normal, Laplace, uniform):

$p(x; \mu, \alpha, \beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}\exp\!\left(-\left(\frac{|x-\mu|}{\alpha}\right)^{\!\beta}\right)$

with $\alpha > 0$ (scale parameter) and $\beta > 0$ (shape parameter).

Apply k-MLE by adding at each round a component family selection (e.g., select the best β for each component).

k-MLE for mixtures of generalized Gaussians, ICPR, 2012. [34]

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 41/1

Page 42: Slides: A glance at information-geometric signal processing

k-MLE++ [18]

◮ k-MLE++ = Bregman $F^*$ k-means++ initialization.
Guaranteed approximation of the best complete average log-likelihood.
→ Single-step mixture learning (fast and good).

◮ Indivisibility: robustness when identifying statistical mixture models? Which k?

$\forall k \in \mathbb{N}, \qquad N(\mu, \sigma^2) = \sum_{i=1}^k N\!\left(\frac{\mu}{k}, \frac{\sigma^2}{k}\right)$

(Add small perturbations → we should cluster MMs to get compact, high-quality equivalent MMs.)

◮ → Choose a large k (like k = n for kernel density estimators), and simplify the MMs [9, 33].

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 42/1

Page 43: Slides: A glance at information-geometric signal processing

Speeding-up k-MLE... Fast assignment

◮ Proximity data structures for Bregman k-means: $\mathcal{C}_i = \{P \in \mathcal{P} \mid B_F(P : C_i) \le B_F(P : C_j)\ \forall j \ne i\}$
◮ Bregman Voronoi diagrams [6]
◮ Bregman nearest neighbors: ball trees [30] or vantage point trees [29].

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 43/1

Page 44: Slides: A glance at information-geometric signal processing

Anisotropic Voronoi diagram (for MVN MMs) [12, 15]

From the source color image (a), we build a 5D GMM with k = 32 components, and color each pixel with the mean color of the anisotropic Voronoi cell it belongs to. (∼ weighted squared Mahalanobis distance per center)

(a) (b)

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 44/1

Page 45: Slides: A glance at information-geometric signal processing

Voronoi diagrams

Voronoi diagram, dual ⊥ Delaunay triangulation (general position)

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 45/1

Page 46: Slides: A glance at information-geometric signal processing

Bregman dual bisectors: Hyperplanes & hypersurfaces [6, 22, 25]

Right-sided bisector → hyperplane (θ-hyperplane):

$H_F(p, q) = \{x \in \mathcal{X} \mid B_F(x:p) = B_F(x:q)\}$
$H_F:\ \langle \nabla F(p) - \nabla F(q), x\rangle + \big(F(p) - F(q) + \langle q, \nabla F(q)\rangle - \langle p, \nabla F(p)\rangle\big) = 0$

Left-sided bisector → hypersurface (η-hyperplane):

$H'_F(p, q) = \{x \in \mathcal{X} \mid B_F(p:x) = B_F(q:x)\}$
$H'_F:\ \langle \nabla F(x), q - p\rangle + F(p) - F(q) = 0$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 46/1

Page 47: Slides: A glance at information-geometric signal processing

Visualizing Bregman bisectors

Primal coordinates θ (natural parameters) and dual coordinates η (expectation parameters).

[Figure: source-space/gradient-space bisectors for two examples, the logistic loss (Bernoulli gradient space) and the Itakura-Saito divergence with its dual, showing points p, q and their gradient-space images p', q' with the divergence values swapped: $D(p,q) = D^*(q',p')$ and $D(q,p) = D^*(p',q')$.]

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 47/1

Page 48: Slides: A glance at information-geometric signal processing

Bregman Voronoi diagrams as minimization diagrams [6]

A subclass of affine diagrams which have all non-empty cells.
Minimization diagram of the n functions $D_i(x) = B_F(x : p_i) = F(x) - F(p_i) - \langle x - p_i, \nabla F(p_i)\rangle$
≡ minimization of n linear functions:

$H_i(x) = \langle p_i - x, \nabla F(p_i)\rangle - F(p_i)$

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 48/1

Page 49: Slides: A glance at information-geometric signal processing

Bregman dual Delaunay triangulations

Delaunay Exponential Hellinger-like

◮ empty Bregman sphere property,

◮ geodesic triangles.

BVDs extends Euclidean Voronoi diagrams with similar complexity/algorithms.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 49/1

Page 50: Slides: A glance at information-geometric signal processing

Non-commutative Bregman Orthogonality

3-point property (generalized law of cosines):

$B_F(p:r) = B_F(p:q) + B_F(q:r) - \langle p - q, \nabla F(r) - \nabla F(q)\rangle$

[Figure: triangle P, Q, R with the divergences $D_F(P\|R)$, $D_F(P\|Q)$, $D_F(Q\|R)$ and the dual geodesics $\Gamma^*_{PQ}$ and $\Gamma_{QR}$.]

$(pq)_\theta$ is Bregman orthogonal to $(qr)_\eta$ iff

$B_F(p:r) = B_F(p:q) + B_F(q:r)$

(equivalent to $\langle \theta_p - \theta_q, \eta_r - \eta_q\rangle = 0$).
Extends the Pythagorean theorem:

$(pq)_\theta \perp_F (qr)_\eta$

→ $\perp_F$ is not commutative... except in the squared Euclidean/Mahalanobis case, $F(x) = \frac{1}{2}\langle x, x\rangle$.

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 50/1

Page 51: Slides: A glance at information-geometric signal processing

Dually orthogonal Bregman Voronoi & Triangulations

The ordinary Voronoi diagram is perpendicular to the Delaunay triangulation.
Dual line segment geodesics:

$(pq)_\theta = \{\theta = \lambda\theta_p + (1-\lambda)\theta_q \mid \lambda \in [0, 1]\}$
$(pq)_\eta = \{\eta = \lambda\eta_p + (1-\lambda)\eta_q \mid \lambda \in [0, 1]\}$

Bisectors:

$B_\theta(p, q):\ \langle x, \theta_q - \theta_p\rangle + F(\theta_p) - F(\theta_q) = 0$
$B_\eta(p, q):\ \langle x, \eta_q - \eta_p\rangle + F^*(\eta_p) - F^*(\eta_q) = 0$

Dual orthogonality:

$B_\eta(p, q) \perp (pq)_\eta, \qquad (pq)_\theta \perp B_\theta(p, q)$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 51/1

Page 52: Slides: A glance at information-geometric signal processing

Dually orthogonal Bregman Voronoi & Triangulations

Bη(p, q) ⊥ (pq)η

(pq)θ ⊥ Bθ(p, q)

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 52/1

Page 53: Slides: A glance at information-geometric signal processing

Simplifying mixtures: Kullback-Leibler projection theorem

An exponential family mixture model $\tilde p = \sum_{i=1}^k w_i\, p_F(x; \theta_i)$.
The right-sided KL barycenter $p^*$ of the components is interpreted as the projection of the mixture model $\tilde p \in P$ onto the exponential family manifold $E_F$ [32]:

$p^* = \arg\min_{p \in E_F} \mathrm{KL}(\tilde p : p)$

[Figure: the manifold of probability distributions, the mixture sub-manifold, the exponential family manifold $E_F$, and the m-geodesic/e-geodesic projection of $\tilde p$ onto $p^*$.]

Right-sided KL centroid = left-sided Bregman centroid.

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 53/1

Page 54: Slides: A glance at information-geometric signal processing

Left-sided or right-sided Kullback-Leibler centroids?

Left/right Bregman centroids = right/left entropic centroids (KL of exp. fam.).
Left-sided and right-sided centroids have different (statistical) properties:

◮ Right-sided entropic centroid: zero-avoiding (covers the support of the pdfs).
◮ Left-sided entropic centroid: zero-forcing (captures the highest mode).

[Figure: $N_1 = N(-4, 4)$ and $N_2 = N(5, 0.64)$ with their left-sided (zero-forcing), right-sided (zero-avoiding), and symmetrized Kullback-Leibler centroids.]

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 54/1

Page 55: Slides: A glance at information-geometric signal processing

Hierarchical clustering of GMMs (Burbea-Rao)

Hierarchical clustering of GMMs wrt. Bhattacharyya distance.Simplify the number of components of an initial GMM.

(a) source

(b) k = 48

(c) k = 16

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 55/1

Page 56: Slides: A glance at information-geometric signal processing

Two symmetrizations of Bregman divergences

◮ Jeffreys-Bregman divergences:

$S_F(p;q) = \frac{B_F(p,q) + B_F(q,p)}{2} = \frac{1}{2}\langle p - q, \nabla F(p) - \nabla F(q)\rangle$

◮ Jensen-Bregman divergences (diversity index):

$J_F(p;q) = \frac{B_F\!\left(p, \frac{p+q}{2}\right) + B_F\!\left(q, \frac{p+q}{2}\right)}{2} = \frac{F(p) + F(q)}{2} - F\!\left(\frac{p+q}{2}\right) = \mathrm{BR}_F(p, q)$

Skew Jensen divergence [19, 27]:

$J_F^{(\alpha)}(p;q) = \alpha F(p) + (1-\alpha)F(q) - F(\alpha p + (1-\alpha)q) = \mathrm{BR}_F^{(\alpha)}(p;q)$

(Jeffreys and Jensen-Shannon symmetrizations of the Kullback-Leibler divergence.)

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 56/1

Page 57: Slides: A glance at information-geometric signal processing

Burbea-Rao centroids (α-skewed Jensen centroids)

Minimum average divergence:

$\mathrm{OPT}:\quad c = \arg\min_x \sum_{i=1}^n w_i\, J_F^{(\alpha)}(x, p_i) = \arg\min_x L(x)$

Equivalent to minimizing:

$E(c) = \Big(\sum_{i=1}^n w_i \alpha\Big) F(c) - \sum_{i=1}^n w_i\, F(\alpha c + (1-\alpha)p_i)$

Sum E = F + G of a convex function F and a concave function G ⇒ Convex-ConCave Procedure (CCCP).
Start from an arbitrary $c_0$, and iteratively update as:

$\nabla F(c_{t+1}) = -\nabla G(c_t)$

⇒ guaranteed convergence to a local minimum.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 57/1

Page 58: Slides: A glance at information-geometric signal processing

ConCave Convex Procedure (CCCP)

$\min_x E(x) = F(x) + G(x)$
$\nabla F(c_{t+1}) = -\nabla G(c_t)$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 58/1

Page 59: Slides: A glance at information-geometric signal processing

Iterative algorithm for Burbea-Rao centroids

Apply the CCCP scheme:

$\nabla F(c_{t+1}) = \sum_{i=1}^n w_i\, \nabla F(\alpha c_t + (1-\alpha)p_i)$

$c_{t+1} = \nabla F^{-1}\!\left(\sum_{i=1}^n w_i\, \nabla F(\alpha c_t + (1-\alpha)p_i)\right)$

Get arbitrarily fine approximations of the (skew) Burbea-Rao centroids and barycenters.

Unique GLOBAL minimum when the divergence is separable [19].

Unique GLOBAL minimum for the matrix mean [21] for the log-det divergence.
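A minimal sketch of this fixed-point iteration for a separable generator (illustrative names and data; F(x) = x log x − x so that ∇F = log and (∇F)⁻¹ = exp).

```python
import numpy as np

def skew_jensen_centroid(P, w, grad_F, grad_F_inv, alpha=0.5, n_iter=100):
    """CCCP update: c_{t+1} = (grad F)^{-1}(sum_i w_i grad F(alpha c_t + (1-alpha) p_i))."""
    P, w = np.asarray(P, dtype=float), np.asarray(w, dtype=float)
    c = np.average(P, axis=0, weights=w)           # start from the arithmetic mean
    for _ in range(n_iter):
        c = grad_F_inv(np.average(grad_F(alpha * c + (1 - alpha) * P),
                                  axis=0, weights=w))
    return c

P = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])
w = np.array([1/3, 1/3, 1/3])
print(skew_jensen_centroid(P, w, grad_F=np.log, grad_F_inv=np.exp))
```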

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 59/1

Page 60: Slides: A glance at information-geometric signal processing

Statistical divergences (Recap.)

◮ Kullback-Leibler is an f-divergence (→ statistical invariance, information monotonicity, curved geometry).

◮ Kullback-Leibler of exponential families = Bregman divergences on the parameters (dually flat geometry).

◮ Skew Jensen divergences (Burbea-Rao, α = ½) include Bregman divergences as limit cases [19].

◮ No known closed form for the Kullback-Leibler divergence of mixtures. But there is a closed form for EFMMs with the Cauchy-Schwarz divergence [17]:

$\mathrm{CS}(P:Q) = -\log \frac{\int p(x)\, q(x)\, dx}{\sqrt{\int p(x)^2\, dx \int q(x)^2\, dx}},$

Closed-Form Information-Theoretic Divergences for Statistical Mixtures,

ICPR, 2012.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 60/1

Page 61: Slides: A glance at information-geometric signal processing

Summary

Computational information-geometric signal processing:

◮ Statistical manifold (M, g): Rao's distance and Fisher-Rao curved Riemannian geometry.

◮ Statistical manifold (M, g, ∇, ∇∗): dually flat spaces, Bregman divergences, geodesics are straight lines in either the θ- or η-parameter space.

◮ Clustering & learning statistical mixtures (EM = soft Bregman clustering, k-MLE, KDE simplification, hierarchical mixtures [11]).

◮ Software libraries: jMEF [9] (Java), pyMEF [31] (Python).

◮ ... but also many other geometries to explore: Hilbertian, Finsler [3], Kähler, Wasserstein, etc. (It is easy to require non-Euclidean geometry, but then the space is wide open!)

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 61/1

Page 62: Slides: A glance at information-geometric signal processing

Acknowledgements THANK YOU!

Andre Ferrari, Cedric Fevotte, Cedric Richard, Caroline Daire & the MAHI organizers.
University of Nice Sophia-Antipolis, Lagrange lab, Albert Bijaoui, OCA, CNRS.

Collaborators: Shun-ichi Amari, Marc Arnaudon, Michel Barlaud, Sylvain Boltz, Jean-Daniel Boissonnat, Eric Debreuve, Vincent Garcia, Meizhu Liu, Richard Nock, Paolo Piro, Olivier Schwander, Baba C. Vemuri.

Sony Computer Science Laboratories Inc: Professors Hiroaki Kitano and Mario Tokoro.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 62/1

Page 63: Slides: A glance at information-geometric signal processing

Exponential families & statistical distances

Universal density estimators [2] generalizing Gaussians/histograms (a single EF density approximates any smooth density).
Explicit formulas for:

◮ Shannon entropy, cross-entropy, and Kullback-Leibler divergence [24]
◮ Rényi/Tsallis entropy and divergence [26]
◮ Sharma-Mittal entropy and divergence [28], a 2-parameter family extending the extensive Rényi (for β → 1) and non-extensive Tsallis entropies (for β → α):

$H_{\alpha,\beta}(p) = \frac{1}{1-\beta}\left(\left(\int p(x)^\alpha\, dx\right)^{\frac{1-\beta}{1-\alpha}} - 1\right),$

with $\alpha > 0$, $\alpha \ne 1$, $\beta \ne 1$.

◮ Skew Jensen and Burbea-Rao divergences [19]
◮ Chernoff information and divergence [16]
◮ Mixtures: total least squares, Jensen-Rényi, Cauchy-Schwarz divergences [17].

© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 63/1

Page 64: Slides: A glance at information-geometric signal processing

Statistical invariance: Markov kernel

Probability family: $p(x;\theta)$.
$(\mathcal{X}, \sigma)$ and $(\mathcal{X}', \sigma')$: two measurable spaces.
$\sigma$: a σ-algebra on $\mathcal{X}$ (non-empty, closed under complementation and countable union).

Markov kernel = transition probability kernel $K: \mathcal{X} \times \sigma' \to [0, 1]$:

◮ $\forall E' \in \sigma'$, $K(\cdot, E')$ is a measurable map,
◮ $\forall x \in \mathcal{X}$, $K(x, \cdot)$ is a probability measure on $(\mathcal{X}', \sigma')$.

A probability measure p on $(\mathcal{X}, \sigma)$ induces a probability measure Kp, with

$Kp(E') = \int_{\mathcal{X}} K(x, E')\, p(dx), \qquad \forall E' \in \sigma'$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 64/1

Page 65: Slides: A glance at information-geometric signal processing

Space of Bregman spheres and Bregman balls [6]

Dual Bregman balls (bounding Bregman spheres):

$\mathrm{Ball}^r_F(c, r) = \{x \in \mathcal{X} \mid B_F(x : c) \le r\}$
and $\mathrm{Ball}^l_F(c, r) = \{x \in \mathcal{X} \mid B_F(c : x) \le r\}$

Legendre duality:

$\mathrm{Ball}^l_F(c, r) = (\nabla F)^{-1}\big(\mathrm{Ball}^r_{F^*}(\nabla F(c), r)\big)$

Illustration for the Itakura-Saito divergence, $F(x) = -\log x$.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 65/1

Page 66: Slides: A glance at information-geometric signal processing

Space of Bregman spheres: Lifting map [6]

$\mathcal{F}: x \mapsto \hat x = (x, F(x))$, a hypersurface in $\mathbb{R}^{d+1}$.
$H_p$: tangent hyperplane at $\hat p$, $z = H_p(x) = \langle x - p, \nabla F(p)\rangle + F(p)$.

◮ Bregman sphere $\sigma \longrightarrow \hat\sigma$ with supporting hyperplane $H_\sigma: z = \langle x - c, \nabla F(c)\rangle + F(c) + r$ (parallel to $H_c$ and shifted vertically by r); $\hat\sigma = \mathcal{F} \cap H_\sigma$.

◮ The intersection of any hyperplane H with $\mathcal{F}$ projects onto $\mathcal{X}$ as a Bregman sphere:

$H: z = \langle x, a\rangle + b \;\rightarrow\; \sigma: \mathrm{Ball}_F\big(c = (\nabla F)^{-1}(a),\ r = \langle a, c\rangle - F(c) + b\big)$

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 66/1

Page 67: Slides: A glance at information-geometric signal processing

Bibliographic references I

Marcel R. Ackermann and Johannes Blomer.

Bregman clustering for separable instances.

In Scandinavian Workshop on Algorithm Theory (SWAT), pages 212–223, 2010.

Yasemin Altun, Alexander J. Smola, and Thomas Hofmann.

Exponential families for conditional random fields.

In Uncertainty in Artificial Intelligence (UAI), pages 2–9, 2004.

Marc Arnaudon and Frank Nielsen.

Medians and means in Finsler geometry.

LMS Journal of Computation and Mathematics, 15, 2012.

Marc Arnaudon and Frank Nielsen.

On approximating the Riemannian 1-center.

Computational Geometry, 46(1):93 – 104, 2013.

Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh.

Clustering with Bregman divergences.

Journal of Machine Learning Research, 6:1705–1749, 2005.

Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.

Bregman Voronoi diagrams.

Discrete and Computational Geometry, 44(2):281–307, April 2010.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 67/1

Page 68: Slides: A glance at information-geometric signal processing

Bibliographic references II

Nikolai Nikolaevich Chentsov.

Statistical Decision Rules and Optimal Inferences.

Transactions of Mathematics Monograph, number 53, 1982.

Published in Russian in 1972.

Jose Manuel Corcuera and Federica Giummole.

A characterization of monotone and regular divergences.

Annals of the Institute of Statistical Mathematics, 50(3):433–450, 1998.

Vincent Garcia and Frank Nielsen.

Simplification and hierarchical representations of mixtures of exponential families.

Signal Processing (Elsevier), 90(12):3197–3212, 2010.

Vincent Garcia, Frank Nielsen, and Richard Nock.

Levels of details for Gaussian mixture models.

In Asian Conference on Computer Vision (ACCV), volume 2, pages 514–525, 2009.

Vincent Garcia, Frank Nielsen, and Richard Nock.

Hierarchical Gaussian mixture model.

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 4070–4073,2010.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 68/1

Page 69: Slides: A glance at information-geometric signal processing

Bibliographic references IIIFrancois Labelle and Jonathan Richard Shewchuk.

Anisotropic Voronoi diagrams and guaranteed-quality anisotropic mesh generation.

In Proceedings of the nineteenth annual symposium on Computational geometry, SCG ’03, pages 191–200,New York, NY, USA, 2003. ACM.

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.

Shape retrieval using hierarchical total Bregman soft clustering.

Transactions on Pattern Analysis and Machine Intelligence, 2012.

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.

Total Bregman divergence and its applications to shape retrieval.

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3463–3468, 2010.

Frank Nielsen.

Visual Computing: Geometry, Graphics, and Vision.

Charles River Media / Thomson Delmar Learning, 2005.

Frank Nielsen.

Chernoff information of exponential families.

arXiv, abs/1102.2684, 2011.

Frank Nielsen.

Closed-form information-theoretic divergences for statistical mixtures.

In International Conference on Pattern Recognition (ICPR), 2012.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 69/1

Page 70: Slides: A glance at information-geometric signal processing

Bibliographic references IVFrank Nielsen.

k-MLE: A fast algorithm for learning statistical mixture models.

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012.

preliminary, technical report on arXiv.

Frank Nielsen and Sylvain Boltz.

The Burbea-Rao and Bhattacharyya centroids.

IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.

Frank Nielsen and Vincent Garcia.

Statistical exponential families: A digest with flash cards, 2009.

arXiv.org:0911.4863.

Frank Nielsen, Meizhu Liu, Xiaojing Ye, and Baba C. Vemuri.

Jensen divergence based SPD matrix means and applications.

In International Conference on Pattern Recognition (ICPR), 2012.

Frank Nielsen and Richard Nock.

The dual Voronoi diagrams with respect to representational Bregman divergences.

In International Symposium on Voronoi Diagrams (ISVD), pages 71–78, 2009.

Frank Nielsen and Richard Nock.

Sided and symmetrized Bregman centroids.

IEEE Transactions on Information Theory, 55(6):2048–2059, June 2009.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 70/1

Page 71: Slides: A glance at information-geometric signal processing

Bibliographic references V

Frank Nielsen and Richard Nock.

Entropies and cross-entropies of exponential families.

In International Conference on Image Processing (ICIP), pages 3621–3624, 2010.

Frank Nielsen and Richard Nock.

Hyperbolic Voronoi diagrams made easy.

In International Conference on Computational Science and its Applications (ICCSA), volume 1, pages74–80, Los Alamitos, CA, USA, march 2010. IEEE Computer Society.

Frank Nielsen and Richard Nock.

On Rényi and Tsallis entropies and divergences for exponential families.

arXiv, abs/1105.3259, 2011.

Frank Nielsen and Richard Nock.

Skew Jensen-Bregman Voronoi diagrams.

Transactions on Computational Science, 14:102–128, 2011.

Frank Nielsen and Richard Nock.

A closed-form expression for the Sharma-Mittal entropy of exponential families.

Journal of Physics A: Mathematical and Theoretical, 45(3), 2012.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 71/1

Page 72: Slides: A glance at information-geometric signal processing

Bibliographic references VIFrank Nielsen, Paolo Piro, and Michel Barlaud.

Bregman vantage point trees for efficient nearest neighbor queries.

In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878–881,2009.

Paolo Piro, Frank Nielsen, and Michel Barlaud.

Tailored Bregman ball trees for effective nearest neighbors.

In European Workshop on Computational Geometry (EuroCG), LORIA, Nancy, France, March 2009. IEEE.

Olivier Schwander and Frank Nielsen.

PyMEF - A framework for exponential families in Python.

In IEEE/SP Workshop on Statistical Signal Processing (SSP), 2011.

Olivier Schwander and Frank Nielsen.

Learning mixtures by simplifying kernel density estimators.

In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 403–426, 2012.

Olivier Schwander and Frank Nielsen.

Model centroids for the simplification of kernel density estimators.

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 737–740,2012.

Olivier Schwander, Frank Nielsen, Aurelien Schutz, and Yannick Berthoumieu.

k-MLE for mixtures of generalized Gaussians.

In International Conference on Pattern Recognition (ICPR), 2012.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 72/1

Page 73: Slides: A glance at information-geometric signal processing

Bibliographic references VII

Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez.

Rayleigh mixture model for plaque characterization in intravascular ultrasound.

IEEE Transaction on Biomedical Engineering, 58(5):1314–1324, 2011.

Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.

Total Bregman divergence and its applications to DTI analysis.

IEEE Transactions on Medical Imaging, 2011.

10.1109/TMI.2010.2086464.

Shijun Wang and Rong Jin.

An information geometry approach for distance metric learning.

Journal of Machine Learning Research, 5:591–598, 2009.

c© 2012 Frank Nielsen, Sony Computer Science Laboratories, Inc. 73/1