
Riemannian Optimization and its Application to Computations on Symmetric Positive Definite Matrices

Wen Huang

Université catholique de Louvain


Joint work with:

Pierre-Antoine Absil, Professor of Mathematical Engineering, Université catholique de Louvain

Kyle A. Gallivan, Professor of Mathematics, Florida State University

Sylvain Chevallier, Assistant Professor, Université de Versailles

Xinru Yuan, Ph.D. candidate in Applied and Computational Mathematics, Florida State University


1 Introduction

2 Motivations

3 Optimization

4 History

5 Geometric Mean and Dictionary Learning

6 Summary


Riemannian Optimization

Problem: Given f(x) : M → R, solve

    min_{x ∈ M} f(x)

where M is a Riemannian manifold.



Examples of Manifolds

Sphere, Ellipsoid

Stiefel manifold: St(p, n) = {X ∈ R^{n×p} | X^T X = I_p}

Grassmann manifold: Set of all p-dimensional subspaces of R^n

Set of fixed rank m-by-n matrices

And many more


Riemannian Manifolds

Roughly, a Riemannian manifold M is a smooth set with a smoothly-varying inner product on the tangent spaces.

[Figure: a manifold M with tangent space T_xM at a point x, tangent vectors η, ξ ∈ T_xM, and the inner product ⟨η, ξ⟩_x ∈ R.]


Applications

Four applications are used to demonstrate the importance of Riemannian optimization:

Independent component analysis [CS93]

Matrix completion problem [Van12]

Geometric mean of symmetric positive definite matrices[ALM04, JVV12]

Dictionary learning of symmetric positive definite matrices [CS15]


Application: Independent Component Analysis

[Figure: cocktail party problem — p speakers produce source signals s(t) ∈ R^p, n microphones record mixed signals x(t) ∈ R^n, and ICA recovers p independent components.]

Observed signal is x(t) = As(t)

One approach:

Assumption: E[s(t) s(t+τ)^T] is diagonal for all τ.

C_τ(x) := E[x(t) x(t+τ)^T] = A E[s(t) s(t+τ)^T] A^T


Application: Independent Component Analysis

Minimize the joint diagonalization cost function on the Stiefel manifold [TI06]:

    f : St(p, n) → R : V ↦ Σ_{i=1}^N ‖V^T C_i V − diag(V^T C_i V)‖_F^2,

where C_1, ..., C_N are covariance matrices and St(p, n) = {X ∈ R^{n×p} | X^T X = I_p}.
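As an illustration, a minimal NumPy sketch of evaluating this cost (the function and variable names are ours, not from the slides):

```python
# A sketch of the joint-diagonalization cost; names are illustrative.
import numpy as np

def joint_diag_cost(V, Cs):
    """V: n-by-p with V^T V = I_p; Cs: list of n-by-n covariance matrices."""
    total = 0.0
    for C in Cs:
        M = V.T @ C @ V                    # p-by-p compressed matrix
        off = M - np.diag(np.diag(M))      # off-diagonal part
        total += np.linalg.norm(off, 'fro') ** 2
    return total

# Example with random covariance-like matrices and a random point on St(p, n).
rng = np.random.default_rng(0)
n, p, N = 8, 3, 5
Cs = []
for _ in range(N):
    A = rng.standard_normal((n, n))
    Cs.append(A @ A.T)
V, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
print(joint_diag_cost(V, Cs))
```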


Application: Matrix Completion Problem

Matrix completion problem

[Figure: a sparse m-by-n rating matrix M — rows are users, columns are movies, and only a few ratings (e.g., 1 to 5) are observed.]

The matrix M is sparse

The goal: complete the matrix M


Application: Matrix Completion Problem

[Figure: the movie rating matrix factored as the product of a tall "meta-user" matrix and a wide "meta-movie" matrix, i.e., a low-rank factorization.]

Minimize the cost function

    f : R_r^{m×n} → R : X ↦ f(X) = ‖P_Ω M − P_Ω X‖_F^2.

R_r^{m×n} is the set of m-by-n matrices of rank r. It is known to be a Riemannian manifold.
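A minimal NumPy sketch of evaluating this cost, assuming the common rank-r factorization X = G H^T (an illustrative choice, not necessarily the parametrization used in [Van12]):

```python
# A sketch of the masked least-squares cost; the factorization X = G H^T is an
# assumption made here for illustration.
import numpy as np

def completion_cost(G, H, M, Omega):
    """G: m-by-r, H: n-by-r, M: m-by-n ratings, Omega: boolean mask of known entries."""
    X = G @ H.T
    return np.sum(((M - X) * Omega) ** 2)   # only observed entries contribute

rng = np.random.default_rng(1)
m, n, r = 6, 5, 2
M = rng.integers(1, 6, size=(m, n)).astype(float)   # ratings in 1..5
Omega = rng.random((m, n)) < 0.4                     # ~40% of entries observed
G = rng.standard_normal((m, r))
H = rng.standard_normal((n, r))
print(completion_cost(G, H, M, Omega))
```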


Application: Geometric Mean of Symmetric Positive Definite (SPD) Matrices

Computing the mean of a population of SPD matrices is important in medical imaging, image processing, radar signal processing, and elasticity. The desired properties are given in the ALM¹ list, some of which are:

if A_1, ..., A_k commute, then G(A_1, ..., A_k) = (A_1 ⋯ A_k)^{1/k};

G(A_{π(1)}, ..., A_{π(k)}) = G(A_1, ..., A_k), with π a permutation of (1, ..., k);

G(A_1, ..., A_k) = G(A_1^{-1}, ..., A_k^{-1})^{-1};

det G(A_1, ..., A_k) = (det A_1 ⋯ det A_k)^{1/k};

where A_1, ..., A_k are SPD matrices, and G(·, ..., ·) denotes the geometric mean of its arguments.

¹ T. Ando, C.-K. Li, and R. Mathias, Geometric means, Linear Algebra and Its Applications, 385:305-334, 2004.


Application: Geometric Mean of Symmetric Positive Definite Matrices

One geometric mean is the Karcher mean of the manifold of SPD matrices with the affine-invariant metric, i.e.,

    G(A_1, ..., A_k) = arg min_{X ∈ S_+^n} (1/2k) Σ_{i=1}^k dist²(X, A_i),

where dist(X, Y) = ‖log(X^{-1/2} Y X^{-1/2})‖_F is the distance under the Riemannian metric

    g(η_X, ξ_X) = trace(η_X X^{-1} ξ_X X^{-1}).
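A minimal NumPy sketch of the affine-invariant distance and this cost, using eigendecompositions for the matrix square root and logarithm (helper names are ours, not from the slides):

```python
# A sketch of the affine-invariant distance and the Karcher-mean cost.
import numpy as np

def spd_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, U = np.linalg.eigh((S + S.T) / 2)
    return U @ np.diag(fun(w)) @ U.T

def spd_dist(X, Y):
    """dist(X, Y) = ||log(X^{-1/2} Y X^{-1/2})||_F for SPD X, Y."""
    Xih = spd_fun(X, lambda w: w ** -0.5)            # X^{-1/2}
    return np.linalg.norm(spd_fun(Xih @ Y @ Xih, np.log), 'fro')

def karcher_cost(X, As):
    """F(X) = 1/(2k) * sum_i dist^2(X, A_i)."""
    return sum(spd_dist(X, A) ** 2 for A in As) / (2 * len(As))

rng = np.random.default_rng(2)
As = []
for _ in range(4):
    B = rng.standard_normal((3, 3))
    As.append(B @ B.T + 3 * np.eye(3))    # random SPD matrices
X0 = sum(As) / len(As)                    # arithmetic mean as a starting guess
print(karcher_cost(X0, As))
```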


Application: Dictionary learning of symmetric positive definite (SPD) matrices

Dictionary learning can be applied for classification and denoising.

Euclidean dictionary learning problem (one formulation):

    min_{‖d_i‖_2 ≤ 1, r_i ∈ R^n} Σ_{i=1}^N ( ‖x_i − [d_1, d_2, ..., d_n] r_i‖_2^2 + λ‖r_i‖_1 ),   (1)

where x_i ∈ R^s, i = 1, ..., N are given data points, and d_i ∈ R^s, i = 1, ..., n and r_i ∈ R^n, i = 1, ..., N are the dictionary and the sparse codes, respectively.

Problem (1) is usually solved by alternately optimizing over D := [d_1, ..., d_n] and R := [r_1, ..., r_N].
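For concreteness, a minimal NumPy sketch of evaluating the objective in (1) (illustrative names only):

```python
# A sketch of the objective in (1); names are illustrative.
import numpy as np

def dictionary_learning_objective(X, D, R, lam):
    """X: s-by-N data, D: s-by-n dictionary, R: n-by-N sparse codes."""
    residual = X - D @ R                       # column i is x_i - D r_i
    return np.sum(residual ** 2) + lam * np.sum(np.abs(R))

rng = np.random.default_rng(3)
s, n, N = 10, 6, 50
X = rng.standard_normal((s, N))
D = rng.standard_normal((s, n))
D /= np.linalg.norm(D, axis=0)                 # enforce ||d_i||_2 <= 1
R = rng.standard_normal((n, N))
print(dictionary_learning_objective(X, D, R, lam=0.1))
```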


Application: Dictionary learning of symmetric positive definite (SPD) matrices

Dictionary learning problem of SPD matrices (one formulation):

    min_{B ∈ M_n^d, R ∈ R_+^{n×N}} (1/2) Σ_{i=1}^N ( dist²(X_i, B r_i) + ‖r_i‖_1 ) + trace(B),

where M_n^d denotes the product of n manifolds of SPD matrices S_+^d, i.e., M_n^d := (S_+^d)^n.

This problem can also be solved by alternately optimizing over B and R.

Optimizing over B is a Riemannian optimization problem.


More Applications

Large-scale Generalized Symmetric Eigenvalue Problem and SVD

Blind source separation on both the Orthogonal group and the Oblique manifold

Low-rank approximate solution of the symmetric positive definite Lyapunov equation AXM + MXA = C

Best low-rank approximation to a tensor

Rotation synchronization

Graph similarity and community detection

Low rank approximation to role model problem

Shape analysis


Comparison with Constrained Optimization

All iterates on the manifold

Convergence properties of unconstrained optimization algorithms

No need to consider Lagrange multipliers or penalty functions

Exploit the structure of the constrained set



Iterations on the Manifold

Consider the following generic update for an iterative Euclidean optimization algorithm:

    x_{k+1} = x_k + Δx_k = x_k + α_k s_k.

This iteration is implemented in numerous ways, e.g.:

Steepest descent: x_{k+1} = x_k − α_k ∇f(x_k)

Newton's method: x_{k+1} = x_k − [∇²f(x_k)]^{-1} ∇f(x_k)

Trust region method: Δx_k is set by optimizing a local model.

Objects

Direction/movement: s_k / Δx_k

Gradient: ∇f(x_k)

Hessian: ∇²f(x_k)

Addition: +



Riemannian gradient and Riemannian Hessian

Definition

The Riemannian gradient of f at x is the unique tangent vector in T_xM satisfying, for all η ∈ T_xM, the directional derivative

    D f(x)[η] = ⟨grad f(x), η⟩_x,

and grad f(x) is the direction of steepest ascent.

Definition

The Riemannian Hessian of f at x is a symmetric linear operator from T_xM to T_xM defined as

    Hess f(x) : T_xM → T_xM : η ↦ ∇_η grad f,

where ∇ is the affine connection.
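A concrete instance (our example, not from the slides): for the unit sphere S^{n-1} ⊂ R^n with the induced metric, the Riemannian gradient is the projection of the Euclidean gradient onto the tangent space.

```python
# A sketch for the sphere: grad f(x) is the tangent-space projection of the
# Euclidean gradient; names are illustrative.
import numpy as np

def sphere_riemannian_grad(x, egrad):
    """Project the Euclidean gradient onto T_x S^{n-1} = {v : x^T v = 0}."""
    return egrad - (x @ egrad) * x

# Example: f(x) = x^T A x with Euclidean gradient 2 A x.
rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2
x = rng.standard_normal(5); x /= np.linalg.norm(x)
g = sphere_riemannian_grad(x, 2 * A @ x)
print(abs(x @ g))       # ~0: the Riemannian gradient is tangent at x
```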


Retractions

Euclidean: x_{k+1} = x_k + α_k d_k        Riemannian: x_{k+1} = R_{x_k}(α_k η_k)

Definition

A retraction is a mapping R from TM to M satisfying the following:

R is continuously differentiable

R_x(0) = x

D R_x(0)[η] = η

maps tangent vectors back to the manifold

defines curves in a direction

[Figure: a tangent vector η ∈ T_xM and the retracted curve t ↦ R_x(tη) on M.]
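A concrete example (ours, not from the slides) is the QR retraction on the Stiefel manifold, R_X(η) = qf(X + η), the Q factor of the thin QR decomposition:

```python
# A sketch of the QR retraction on St(p, n); names are illustrative.
import numpy as np

def stiefel_qr_retraction(X, eta):
    """X: n-by-p with X^T X = I_p; eta: a tangent vector at X."""
    Q, R = np.linalg.qr(X + eta)
    # Fix the sign ambiguity so that R has a positive diagonal.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(5)
n, p = 7, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
W = rng.standard_normal((n, p))
eta = W - X @ (X.T @ W + W.T @ X) / 2          # project W onto T_X St(p, n)
Y = stiefel_qr_retraction(X, 0.5 * eta)
print(np.linalg.norm(Y.T @ Y - np.eye(p)))     # ~0: the new point stays on St(p, n)
```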


Generic Riemannian Optimization Algorithm

1. At iterate x ∈ M.

2. Find η ∈ T_xM which satisfies a certain condition.

3. Choose the new iterate x_+ = R_x(η).

4. Go to step 1.

A suitable setting

This paradigm is sufficient for describing many optimization methods.

[Figure: iterates x_0, x_1 = R_{x_0}(η_0), x_2, x_3 generated by retracting tangent directions η_k ∈ T_{x_k}M.]
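A minimal sketch of this loop for Riemannian steepest descent on the sphere, with the normalization retraction R_x(η) = (x + η)/‖x + η‖ and a fixed step size (illustrative only; practical methods choose the step by line search):

```python
# A sketch of the generic loop: Riemannian steepest descent on the unit sphere.
import numpy as np

def sphere_retraction(x, eta):
    y = x + eta
    return y / np.linalg.norm(y)

def riemannian_steepest_descent(euclidean_grad, x0, step=0.02, iters=500):
    x = x0
    for _ in range(iters):
        egrad = euclidean_grad(x)
        rgrad = egrad - (x @ egrad) * x           # step 2: tangent direction at x
        x = sphere_retraction(x, -step * rgrad)   # step 3: retract to the manifold
    return x

# Example: minimize f(x) = x^T A x over the sphere (smallest eigenvector of A).
rng = np.random.default_rng(6)
A = rng.standard_normal((6, 6)); A = (A + A.T) / 2
x0 = rng.standard_normal(6); x0 /= np.linalg.norm(x0)
x = riemannian_steepest_descent(lambda x: 2 * A @ x, x0)
print(x @ A @ x, np.linalg.eigvalsh(A)[0])   # Rayleigh quotient approaches the smallest eigenvalue
```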


Categories of Riemannian optimization methods

Retraction-based: local information only

Line search-based: use a local tangent vector and R_x(tη) to define the line

Steepest descent

Newton

Local model-based: series of flat space problems

Riemannian trust region Newton (RTR)

Riemannian adaptive cubic overestimation (RACO)


Categories of Riemannian optimization methods

Elements required for optimizing a cost function on (M, g):

a representation for points x on M, for tangent spaces T_xM, and for the inner products g_x(·, ·) on T_xM;

choice of a retraction R_x : T_xM → M;

formulas for f(x), grad f(x) and Hess f(x) (or its action);

computational and storage efficiency.


Categories of Riemannian optimization methods

Retraction and transport-based: information from multiple tangent spaces

Conjugate gradient: multiple tangent vectors

Quasi-Newton, e.g. Riemannian BFGS: transport operators between tangent spaces

Additional element required for optimizing a cost function on (M, g):


Vector Transports

Vector Transport

Vector transport: transport a tangent vector from one tangent space to another.

T_{η_x} ξ_x denotes the transport of ξ_x to the tangent space of R_x(η_x); R is a retraction associated with T.

An isometric vector transport T_S preserves the length of the tangent vector.

Figure: Vector transport — ξ_x ∈ T_xM is transported along η_x to the tangent space at R_x(η_x).


Retraction/Transport-based Riemannian Optimization

Benefits

Increased generality does not compromise the important theory

Less expensive than or similar to previous approaches

May provide theory to explain behavior of algorithms specifically developed for a particular application – or closely related ones

Possible Problems

May be inefficient compared to algorithms that exploit application details


Some History of Optimization On Manifolds (I)

Luenberger (1973), Introduction to linear and nonlinear programming. Luenberger mentions the idea of performing line search along geodesics, "which we would use if it were computationally feasible (which it definitely is not)". Rosen (1961) essentially anticipated this but was not explicit in his Gradient Projection Algorithm.

Gabay (1982), Minimizing a differentiable function over a differential manifold. Steepest descent along geodesics; Newton's method along geodesics; quasi-Newton methods along geodesics. On Riemannian submanifolds of R^n.

Smith (1993-94), Optimization techniques on Riemannian manifolds. Levi-Civita connection ∇; Riemannian exponential mapping; parallel translation.


Some History of Optimization On Manifolds (II)

The “pragmatic era” begins:

Manton (2002), Optimization algorithms exploiting unitary constraints. "The present paper breaks with tradition by not moving along geodesics." The geodesic update Exp_x η is replaced by a projective update π(x + η), the projection of the point x + η onto the manifold.

Adler, Dedieu, Shub, et al. (2002), Newton's method on Riemannian manifolds and a geometric model for the human spine. The exponential update is relaxed to the general notion of retraction. The geodesic can be replaced by any (smoothly prescribed) curve tangent to the search direction.

Absil, Mahony, Sepulchre (2007), Nonlinear conjugate gradient using retractions.


Some History of Optimization On Manifolds (III)

Theory, efficiency, and library design improve dramatically:

Absil, Baker, Gallivan (2004-07), Theory and implementations of the Riemannian Trust Region method. Retraction-based approach. Matrix manifold problems, software repository

http://www.math.fsu.edu/~cbaker/GenRTR

Anasazi Eigenproblem package in Trilinos Library at Sandia National Laboratory

Absil, Gallivan, Qi (2007-10), Basic theory and implementations of Riemannian BFGS and Riemannian Adaptive Cubic Overestimation. Parallel translation and exponential map theory; retraction and vector transport empirical evidence.


Some History of Optimization On Manifolds (IV)

Ring and Wirth (2012), combination of differentiated retraction and isometric vector transport for convergence analysis of RBFGS

Absil, Gallivan, Huang (2009-2015), Complete theory of Riemannian quasi-Newton and related transport/retraction conditions, Riemannian SR1 with trust region, RBFGS on partly smooth problems, a C++ library: http://www.math.fsu.edu/~whuang2/ROPTLIB

Sato, Iwai (2013-2015), Global convergence analysis using the differentiated retraction for Riemannian conjugate gradient methods

Many people: application interest starts to increase noticeably


Current UCL/FSU Methods

Riemannian Steepest Descent

Riemannian Trust Region Newton: global, quadratic convergence

Riemannian Broyden Family: global (convex), superlinear convergence

Riemannian Trust Region SR1: global, (d+1)-superlinear convergence

For large problems

Limited memory RTRSR1

Limited memory RBFGS

Riemannian conjugate gradient (much more work to do on local analysis)

A library is available at www.math.fsu.edu/~whuang2/ROPTLIB


Current/Future Work on Riemannian methods

Manifold and inequality constraints

Discretization of infinite-dimensional manifolds and the convergence/accuracy of the approximate minimizers – specific to a problem and extracting general conclusions

Partly smooth cost functions on Riemannian manifolds


Geometric Mean and Dictionary Learning

Computations of SPD matrices are used to show the performance of Riemannian methods.

Geometric mean of SPD matrices [ALM04]

    min_{X ∈ S_+^n} (1/2k) Σ_{i=1}^k dist²(X, A_i) = (1/2k) Σ_{i=1}^k ‖log(A_i^{-1/2} X A_i^{-1/2})‖_F^2.

Dictionary learning for SPD matrices [CS15]

    min_{B ∈ M_n^d, R ∈ R_+^{n×N}} (1/2) Σ_{i=1}^N ( dist²(X_i, B r_i) + ‖r_i‖_1 ) + trace(B).


Geometric Mean of SPD matrices

Hemstitching phenomenon

Condition Number at the minimizer [YHAG15]

For the cost function F(X) = (1/2k) Σ_{i=1}^k dist²(A_i, X), we have

    1 ≤ Hess F(X)[ΔX, ΔX] / ‖ΔX‖² ≤ 1 + log(max κ_i)/2.

If max κ_i = 10^10, then 1 + log(max κ_i)/2 ≈ 12.51.


Algorithms

    grad F(X) = −(1/k) Σ_{i=1}^k Log(A_i X^{-1}) X.

First order approaches

Riemannian steepest descent [RA11]

Riemannian conjugate gradient [JVV12]

Richardson-like iteration [BI13]

Limited-memory Riemannian BFGS method [YHAG15]


Implementations

Function: (1/2k) Σ_{i=1}^k ‖log(A_i^{-1/2} X A_i^{-1/2})‖_F^2;

Gradient:

    −(1/k) Σ_{i=1}^k Log(A_i X^{-1}) X = (1/k) Σ_{i=1}^k A_i^{1/2} Log(A_i^{-1/2} X A_i^{-1/2}) A_i^{-1/2} X

A_i^{-1/2} can be computed in advance.

The dominant computational cost is the function evaluation.
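A minimal NumPy sketch of the function and gradient evaluation with A_i^{-1/2} precomputed, using eigendecompositions for the matrix square root and logarithm (helper names are ours, not from the slides or ROPTLIB):

```python
# A sketch of the Karcher-mean cost and gradient with A_i^{-1/2} precomputed.
import numpy as np

def spd_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, U = np.linalg.eigh((S + S.T) / 2)
    return U @ np.diag(fun(w)) @ U.T

def precompute_inv_sqrt(As):
    return [spd_fun(A, lambda w: w ** -0.5) for A in As]   # A_i^{-1/2}

def karcher_cost_grad(X, As, Ais):
    """Return F(X) and grad F(X) = -(1/k) sum_i Log(A_i X^{-1}) X."""
    k = len(As)
    F = 0.0
    G = np.zeros_like(X)
    for A, Aih in zip(As, Ais):
        L = spd_fun(Aih @ X @ Aih, np.log)      # log(A_i^{-1/2} X A_i^{-1/2})
        F += np.linalg.norm(L, 'fro') ** 2 / (2 * k)
        Ah = spd_fun(A, np.sqrt)                # A_i^{1/2}
        G += (Ah @ L @ Aih @ X) / k             # one term of the gradient sum
    return F, G
```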


Implementations

Retraction [JVV12]

Exponential mapping: R_X(ξ_X) = X^{1/2} exp(X^{-1/2} ξ_X X^{-1/2}) X^{1/2}

Second-order retraction: R_X(ξ_X) = X + ξ_X + ξ_X X^{-1} ξ_X / 2

Vector transports:

Parallel translation: T_{η_X} ξ_X = Q(X, η_X) ξ_X Q(X, η_X)^T, where Q(X, η_X) = X^{1/2} exp(X^{-1/2} η_X X^{-1/2} / 2) X^{-1/2}

Vector transport by parallelization: essentially an identity

The dominant computational cost is the function evaluation.
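A minimal NumPy sketch of these retractions and the parallel translation (helper names are ours; matrix functions use eigendecompositions):

```python
# A sketch of the SPD retractions and parallel translation listed above.
import numpy as np

def spd_fun(S, fun):
    w, U = np.linalg.eigh((S + S.T) / 2)
    return U @ np.diag(fun(w)) @ U.T

def retraction_exp(X, xi):
    """R_X(xi) = X^{1/2} exp(X^{-1/2} xi X^{-1/2}) X^{1/2} (exponential mapping)."""
    Xh = spd_fun(X, np.sqrt)
    Xih = spd_fun(X, lambda w: w ** -0.5)
    return Xh @ spd_fun(Xih @ xi @ Xih, np.exp) @ Xh

def retraction_second_order(X, xi):
    """R_X(xi) = X + xi + xi X^{-1} xi / 2."""
    return X + xi + xi @ np.linalg.solve(X, xi) / 2

def parallel_translation(X, eta, xi):
    """T_eta(xi) = Q xi Q^T with Q = X^{1/2} exp(X^{-1/2} eta X^{-1/2} / 2) X^{-1/2}."""
    Xh = spd_fun(X, np.sqrt)
    Xih = spd_fun(X, lambda w: w ** -0.5)
    Q = Xh @ spd_fun(Xih @ eta @ Xih / 2, np.exp) @ Xih
    return Q @ xi @ Q.T
```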


Numerical Results

[Plots: dist(μ, X_k) versus iteration for RSD, RCG, LRBFGS, and the Richardson-like (RL) iteration.]

Figure: Evolution of the averaged distance between the current iterate and the exact Karcher mean with respect to time and iterations, with k = 100 (the number of matrices) and n = 3 (the size of matrices); Left: 1 ≤ κ(A_i) ≤ 200; Right: 10^3 ≤ κ(A_i) ≤ 2·10^6.


Numerical Results

[Plots: dist(μ, X_k) versus iteration for RSD, RCG, LRBFGS, and the Richardson-like (RL) iteration.]

Figure: Evolution of the averaged distance between the current iterate and the exact Karcher mean with respect to time and iterations, with k = 30 (the number of matrices) and n = 100 (the size of matrices); Left: 1 ≤ κ(A_i) ≤ 20; Right: 10^4 ≤ κ(A_i) ≤ 2·10^6.


Dictionary Learning for SPD matrices

The subproblem: given R, find B.

    min_{B ∈ M_n^d} (1/2) Σ_{i=1}^N dist²(X_i, B r_i) + trace(B).

Similar techniques, i.e., implementations for vector transport, retraction, and function and gradient evaluation, can be applied;

The dominant cost is the function evaluations;

The cost function is nonconvex;

We set the initial iterate to X_0 = (X(R^†))_+, where X is a tensor whose i-th slice is X_i, † denotes the pseudo-inverse, and M_+ denotes the matrix formed by the positive entries of M.


Numerical Results

Artificial tests:

N: number of training points in X

n: number of atoms in the dictionary B

d: size of the SPD matrices

R: the representation matrix

[Plots: computational time of LRBFGS and RCG versus the size of the SPD matrices (N: 100, n: 20, sparsity of R: 70%) and versus the size of the dictionary (N: 500, d: 4, sparsity of R: 70%).]

Figure: An average of 50 random runs for various parameter settings.


Dictionary Learning for SPD matrices

The subproblem: given B, find R.

    min_{R = [r_1, ..., r_N] ∈ R_+^{n×N}} (1/2) Σ_{i=1}^N ( dist²(X_i, B r_i) + ‖r_i‖_1 ).

The domain R_+^{n×N} is NOT a manifold. A Riemannian optimization-like idea can be applied.
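For example, a projected-gradient step on this domain simply follows the negative gradient of the smooth part and clips at zero (a sketch with illustrative names; the solvers in the figure below add LBFGS or spectral step-size choices on top of this idea):

```python
# A sketch of one projected-gradient step on the nonnegative orthant.
import numpy as np

def projected_gradient_step(R, grad_R, step):
    """R, grad_R: n-by-N arrays; returns the next nonnegative iterate."""
    return np.maximum(R - step * grad_R, 0.0)
```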

[Plot: function value versus iteration for projected gradient LBFGS, projected gradient SD, and spectral projected gradient, with annotated run times of 5.1s, 0.11s, and 0.10s.]

Figure: A representative result. N = 100, d = 5, n = 20.


Summary

Introduced the framework of Riemannian optimization and the state-of-the-art Riemannian algorithms

Used applications to show the importance of Riemannian optimization

Showed the performance of Riemannian optimization on the geometric mean and dictionary learning of SPD matrices


Thanks!


References I

T. Ando, C. K. Li, and R. Mathias.
Geometric means. Linear Algebra and Its Applications, 385:305–334, 2004.

D. A. Bini and B. Iannazzo.
Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700–1710, February 2013. doi:10.1016/j.laa.2011.08.052.

J. F. Cardoso and A. Souloumiac.
Blind beamforming for non-Gaussian signals. IEE Proceedings F Radar and Signal Processing, 140(6):362, 1993.

A. Cherian and S. Sra.
Riemannian dictionary learning and sparse coding for positive definite matrices. CoRR, abs/1507.02772, 2015.

B. Jeuris, R. Vandebril, and B. Vandereycken.
A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.

Q. Rentmeesters and P.-A. Absil.
Algorithm comparison for Karcher mean computation of rotation matrices and diffusion tensors. 19th European Signal Processing Conference (EUSIPCO 2011), pages 2229–2233, 2011.

F. J. Theis and Y. Inouye.
On the use of joint diagonalization in blind signal processing. 2006 IEEE International Symposium on Circuits and Systems, pages 7–10, 2006.


References II

B. Vandereycken.
Low-rank matrix completion by Riemannian optimization—extended version. SIAM Journal on Optimization, 23(2):1214–1236, 2012.

X. Yuan, W. Huang, P.-A. Absil, and K. A. Gallivan.
A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. Technical Report UCL-INMA-2015.12, U.C.Louvain, 2015.
