

Coupling Matrix Manifolds and Their Applications in Optimal Transport

Dai Shi, Junbin Gao, Xia Hong, S.T. Boris Choy and Zhiyong Wang

Abstract—Optimal transport (OT) is a powerful tool for measuring the distance between two probability distributions. In this paper, we develop a new manifold named the coupling matrix manifold (CMM), where each point on the CMM can be regarded as a transportation plan of an OT problem. We first explore the Riemannian geometry of the CMM with the metric expressed by the Fisher information. These geometrical features of the CMM pave the way for developing numerical Riemannian optimization algorithms, such as Riemannian gradient descent and Riemannian trust region algorithms, forming a uniform optimization method for all types of OT problems. The proposed method is then applied to several OT problems studied in the previous literature. The results of the numerical experiments illustrate that the optimization algorithms based on the method proposed in this paper are comparable to classic ones, for example the Sinkhorn algorithm, while outperforming other state-of-the-art algorithms that do not consider the geometry information, especially in the case of non-entropy optimal transport.

Index Terms—Optimal Transport, Doubly Stochastic Matrices, Coupling Matrix Manifold, Sinkhorn Algorithm, Wasserstein Distance, Entropy Regularized Optimal Transport

I. INTRODUCTION

An Optimal Transport (OT) problem can be briefly described as finding the optimal transport plan (an element of the transportation polytope) between two or more sets of subjects under certain constraints [36]. It was first formalized by the French mathematician Gaspard Monge in 1781 [32], and was generalized by Kantorovich, who provided a solution to Monge's problem in 1942 [26] and established its importance to logistics and economics.

As the solution of the OT problem provides the optimal transportation plan between probability distributions, and advances in computer science allow us to perform a large amount of computation in high dimensional spaces, the resulting optimal transport distance, known as the Wasserstein distance [35], the Monge-Kantorovich distance [7] or the Earth Mover's distance [40], has been analyzed in various areas such as image processing [16], [39], pattern analysis [10], [31], [49] and domain adaptation [9], [30], [47].

OT-based methods for comparing probability densities and generative models are vital in machine learning research, where data are often presented in the form of point clouds, histograms, bags-of-features, or more generally, even manifold-valued data sets.

Dai Shi, Junbin Gao and S.T. Boris Choy are with the Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, NSW 2006, Australia. E-mail: {junbin.gao, dai.shi, boris.choy}@sydney.edu.au.

Xia Hong is with the Department of Computer Science, University of Reading, Reading, RG6 6AY, UK. E-mail: [email protected].

Zhiyong Wang is with the School of Computer Science, The University of Sydney, NSW 2006, Australia. E-mail: [email protected].

In recent years, there has been an increase in the applications of OT-based methods in machine learning. The authors of [6] approached OT-based generative modeling, triggering fruitful research under the variational Bayesian concepts, such as the Wasserstein GAN [4], [22], Wasserstein auto-encoders [45], [48], Wasserstein variational inference [3] and their computationally efficient sliced versions [28]. Another reason that OT has gained popularity is convexity. The classic Kantorovich OT problem is a constrained linear programming problem, i.e., a convex minimization problem, where the minimal value of the transport cost objective function is usually defined as the divergence/distance between two distributions of loads [36], or the cost associated with the transportation between the source subjects and the targets. Therefore, convex optimization plays an essential role in finding the solutions of OT. The computation of the OT distance can in principle be approached by interior-point methods, and one of the best is from [29].

Although methods for finding the solutions of OT have been widely investigated in the literature, one of the major problems is that these algorithms are excessively slow in handling large scale OT problems. Another issue with the classic Kantorovich OT formulation is that its solution plan merely relies on a few routes, as a result of the sparsity of optimal couplings, and therefore fails to reflect practical traffic conditions. These issues limited the wider applicability of OT-based distances for large-scale data within the field of machine learning until a regularized transportation plan was introduced by Cuturi [10] in 2013. By applying this new method (regularized OT), we are not only able to reduce the sparsity of the transportation plan, but also to compute it efficiently by the Sinkhorn algorithm, which enjoys linear convergence [27].

By offering a unique solution and better computational stability compared with previous algorithms, and being underpinned by the Sinkhorn algorithm, the entropy regularization method has successfully delivered OT approaches into modern machine learning applications [46], such as unsupervised learning using restricted Boltzmann machines [33], the Wasserstein loss function [18], computer graphics [41] and discriminant analysis [17]. Other algorithms that aim for high calculation speed on big data have also been explored, such as stochastic gradient-based algorithms [20] and fast methods to compute Wasserstein barycenters [11]. Altschuler et al. [2] proposed the Greenkhorn algorithm, a greedy variant of the Sinkhorn algorithm that updates the rows and columns which violate the constraints most.

In order to meet the requirements of various practical situations, much work has been done to define suitable regularizations. Dessein et al. [13] extended the regularization in terms of convex functions. To apply OT with power functions, the Tsallis Regularized Optimal Transport (trot) distance was introduced in [34]. Furthermore, in order to apply OT to sequential data, the order-preserving Wasserstein distance with its regularizer was developed in [42]. In addition, to maintain locality in OT-assisted domain adaptation, a Laplacian regularization was proposed in [9]. While entropy-based regularizations have achieved great success in terms of calculation efficiency, problems without such regularization remain challenging. For example, to solve a Laplacian regularized OT problem, Courty et al. [9] proposed a generalized conditional gradient algorithm, which is a variant of the classic conditional gradient algorithm [5]. In this paper, we compare the experimental results of several entropy and non-entropy regularized OT problems from previous studies with the new manifold optimization algorithms in Section IV.

Non-entropy regularized OT problems raise the question of how to develop a uniform and generalized method that is capable of efficiently and accurately solving all sorts of regularized OT problems. To answer this question, we first observe that all OT problems are constrained optimization problems on the transport plan space, namely the transportation polytope [36]. Such a constrained problem can be regarded as an unconstrained problem on a specific manifold. A well-defined Riemannian optimization can provide better performance than solving the original constrained problem, with the advantage of treating the lower dimensional manifold as the new search space. Consequently, fundamental numerical iterative algorithms, such as Riemannian gradient descent (RGD) and the Riemannian trust region (RTR) method, can naturally solve the OT problems, achieving convergence under mild conditions.

The main purpose and contribution of this paper is to propose a manifold based framework for optimizing over the transportation polytope, for which the related Riemannian geometry is explored. The "Coupling Matrix Manifold" provides an innovative way of solving OT problems under the framework of manifold optimization. The research on the coupling matrix manifold is rooted in our earlier paper [44], in which the so-called multinomial manifold was explored in the context of tensor clustering. Optimization on multinomial manifolds has successfully been applied to several density learning tasks [23]–[25]. More recently, Douik and Hassibi [14] explored the manifold geometrical structure and the related convex optimization algorithms on three types of manifolds constructed from three types of matrices, namely doubly stochastic matrices, symmetric stochastic matrices and positive stochastic matrices. The CMM introduced in this paper can be regarded as a generalization of their doubly stochastic manifold. According to the mathematical and experimental results, the CMM framework unifies all types of OT solutions, providing closed form expressions with higher efficiency compared with the previous literature, thus opening the door to solving OT problems under the manifold optimization framework.

The remainder of the paper is organized as follows. Section II introduces the CMM and its Riemannian geometry, including the tangent space, Riemannian gradient, Riemannian Hessian and retraction operator, i.e., all the ingredients for Riemannian optimization algorithms. In Section III, we review several OT problems with different regularizations from other studies. These regularized problems are then converted into optimization problems on the CMM so that the Riemannian versions of the optimization algorithms (RGD and RTR) can be applied. In Section IV, we conduct several numerical experiments to demonstrate the performance of the new Riemannian algorithms and compare the results with classic algorithms (e.g., the Sinkhorn algorithm). Finally, Section V concludes the paper with several recommendations for future research and applications.

II. COUPLING MATRIX MANIFOLDS (CMM)

In this section, we introduce the CMM and the Riemannian geometry of this manifold in order to solve generic OT problems [36] under the framework of manifold optimization [1].

Throughout this paper, we use a bold lower case letter for a vector $\mathbf{x}\in\mathbb{R}^d$, a bold upper case letter for a matrix $\mathbf{X}\in\mathbb{R}^{n\times m}$, and a calligraphic letter for a manifold $\mathcal{M}$. An embedded matrix manifold $\mathcal{M}$ is a smooth subset of a vector space $\mathcal{E}$ embedded in the matrix space $\mathbb{R}^{n\times m}$. For any $\mathbf{X}\in\mathcal{M}$, $T_{\mathbf{X}}\mathcal{M}$ denotes the tangent space of the manifold $\mathcal{M}$ at $\mathbf{X}$ [1]. $\mathbf{0}_d$ and $\mathbf{1}_d\in\mathbb{R}^d$ are the $d$-dimensional vectors of zeros and ones, respectively, and $\mathbb{R}^{n\times m}_+$ is the set of all $n\times m$ matrices with real and positive elements.

A. The Definition of a Manifold

Definition 1. Two vectors $\mathbf{p}\in\mathbb{R}^n_+$ and $\mathbf{q}\in\mathbb{R}^m_+$ are coupled if $\mathbf{p}^T\mathbf{1}_n = \mathbf{q}^T\mathbf{1}_m$. A matrix $\mathbf{X}\in\mathbb{R}^{n\times m}_+$ is called a coupling matrix for the coupled vectors $\mathbf{p}$ and $\mathbf{q}$ if $\mathbf{X}\mathbf{1}_m=\mathbf{p}$ and $\mathbf{X}^T\mathbf{1}_n=\mathbf{q}$. The set of all the coupling matrices for the given coupled $\mathbf{p}$ and $\mathbf{q}$ is denoted by
$$\mathcal{C}_n^m(\mathbf{p},\mathbf{q}) = \{\mathbf{X}\in\mathbb{R}^{n\times m}_+ : \mathbf{X}\mathbf{1}_m=\mathbf{p} \text{ and } \mathbf{X}^T\mathbf{1}_n=\mathbf{q}\}. \quad (1)$$

Remark 1. The coupling condition
$$\mathbf{p}^T\mathbf{1}_n = \mathbf{q}^T\mathbf{1}_m \quad (2)$$
is vital in this paper, as this condition ensures a non-empty transportation polytope so that the manifold optimization process can be naturally employed. This condition is checked in Lemma 2.2 of [12], and the proof of that lemma is based on the north-west corner rule algorithm described in [38].

Remark 2. The defined space $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ is a subset of the classic transport plan space (or polytope)
$$\mathcal{P}_n^m(\mathbf{p},\mathbf{q}) = \{\mathbf{X}\in\mathbb{R}^{n\times m} : \mathbf{X}\mathbf{1}_m=\mathbf{p} \text{ and } \mathbf{X}^T\mathbf{1}_n=\mathbf{q}\},$$
where each entry of a plan $\mathbf{X}$ is nonnegative. In practice, the strict positivity constraint on $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ prevents the solution plan from being sparse.
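For readers who want a concrete feasible plan for coupled marginals, the following is a minimal Python sketch (our own illustration, not from the paper) of the north-west corner rule mentioned in Remark 1. It returns a plan in the polytope, possibly with zero entries, so a small smoothing may be needed before it can serve as a strictly positive starting point on the CMM; the function name northwest_corner is assumed for exposition.

```python
import numpy as np

def northwest_corner(p, q):
    """Construct one feasible coupling matrix for coupled marginals p and q
    using the north-west corner rule (illustrative sketch only)."""
    p, q = p.astype(float).copy(), q.astype(float).copy()
    assert np.isclose(p.sum(), q.sum()), "p and q must be coupled (equal total mass)"
    n, m = len(p), len(q)
    X = np.zeros((n, m))
    i = j = 0
    while i < n and j < m:
        t = min(p[i], q[j])      # ship as much mass as possible on route (i, j)
        X[i, j] = t
        p[i] -= t
        q[j] -= t
        if p[i] <= 1e-15:        # row i exhausted, move down
            i += 1
        if q[j] <= 1e-15:        # column j exhausted, move right
            j += 1
    return X

# Example: X = northwest_corner(np.array([3., 3., 2.]), np.array([4., 2., 2.]))
# The row and column sums of X then recover p and q (up to rounding).
```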

Proposition 1. The subset $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ forms a smooth manifold of dimension $(n-1)(m-1)$ in its embedding space $\mathbb{R}^{n\times m}_+$, named the Coupling Matrix Manifold.


Proof. Define a mapping $F:\mathbb{R}^{n\times m}_+\to\mathbb{R}^{n+m}$ by
$$F(\mathbf{X}) = \begin{bmatrix}\mathbf{X}\mathbf{1}_m-\mathbf{p}\\ \mathbf{X}^T\mathbf{1}_n-\mathbf{q}\end{bmatrix}.$$
Hence $\mathcal{C}_n^m(\mathbf{p},\mathbf{q}) = F^{-1}(\mathbf{0}_{n+m})$. Clearly $DF(\mathbf{X})$ is a linear mapping from $\mathbb{R}^{n\times m}_+$ to $\mathbb{R}^{n+m}$ with
$$DF(\mathbf{X})[\Delta\mathbf{X}] = \begin{bmatrix}\Delta\mathbf{X}\mathbf{1}_m\\ \Delta\mathbf{X}^T\mathbf{1}_n\end{bmatrix}.$$
Hence the null space of $DF(\mathbf{X})$ is
$$\mathcal{K} = \{\Delta\mathbf{X} : \Delta\mathbf{X}\mathbf{1}_m=\mathbf{0}_n,\ \Delta\mathbf{X}^T\mathbf{1}_n=\mathbf{0}_m\}.$$
As there are only $n+m-1$ linearly independent constraints among $\Delta\mathbf{X}\mathbf{1}_m=\mathbf{0}_n$ and $\Delta\mathbf{X}^T\mathbf{1}_n=\mathbf{0}_m$, the dimension of the null space is $nm-n-m+1=(n-1)(m-1)$ and the dimension of the range is $n+m-1$. According to the sub-immersion theorem (Proposition 3.3.4 in [1]), the dimension of the manifold $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ is $(n-1)(m-1)$.

Several special cases of the coupling matrix manifold that have been explored recently are as follows.

Remark 3. When both $\mathbf{p}$ and $\mathbf{q}$ are discrete probability distributions, i.e., $\mathbf{p}^T\mathbf{1}_n=\mathbf{q}^T\mathbf{1}_m=1$, they are naturally coupled. In this case, we call $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ the double probabilistic manifold, denoted by
$$\mathcal{P}_n^m(\mathbf{p},\mathbf{q}) = \{\mathbf{X}\in\mathbb{R}^{n\times m}_+ : \mathbf{X}\mathbf{1}_m=\mathbf{p},\ \mathbf{X}^T\mathbf{1}_n=\mathbf{q},\ \text{and } \mathbf{p}^T\mathbf{1}_n=\mathbf{q}^T\mathbf{1}_m=1\}.$$

Remark 4. The doubly stochastic multinomial manifold [14]: this manifold is the special case of $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ with $n=m$ and $\mathbf{p}=\mathbf{q}=\mathbf{1}_n$, i.e.,
$$\mathcal{DP}_n = \{\mathbf{X}\in\mathbb{R}^{n\times n}_+ : \mathbf{X}\mathbf{1}_n=\mathbf{1}_n,\ \mathbf{X}^T\mathbf{1}_n=\mathbf{1}_n\}.$$
$\mathcal{DP}_n$ can be regarded as the two-dimensional extension of the multinomial manifold introduced in [44], defined as
$$\mathcal{P}_n^m = \{\mathbf{X}\in\mathbb{R}^{n\times m}_+ : \mathbf{X}\mathbf{1}_m=\mathbf{1}_n\}.$$

B. The Tangent Space and Its Metric

From now on, we only consider the coupling matrix manifold $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ where $\mathbf{p}$ and $\mathbf{q}$ are a pair of coupled vectors. For any coupling matrix $\mathbf{X}\in\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$, the tangent space $T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ is given by the following proposition.

Proposition 2. The tangent space $T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ can be calculated as
$$T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q}) = \{\mathbf{Y}\in\mathbb{R}^{n\times m} : \mathbf{Y}\mathbf{1}_m=\mathbf{0}_n,\ \mathbf{Y}^T\mathbf{1}_n=\mathbf{0}_m\} \quad (3)$$
and its dimension is $(n-1)(m-1)$.

Proof. Proposition 2 follows easily by differentiating the constraint conditions; we omit the details. It is also clear that $\mathbf{Y}\mathbf{1}_m=\mathbf{0}_n$ and $\mathbf{Y}^T\mathbf{1}_n=\mathbf{0}_m$ consist of $m+n$ equations of which only $m+n-1$ are in general independent, because $\sum_{ij}Y_{ij}=\mathbf{1}_n^T\mathbf{Y}\mathbf{1}_m=0$. Hence the dimension of the tangent space is $nm-n-m+1=(n-1)(m-1)$.

Following [14], [44], we use the Fisher information as the Riemannian metric $g$ on the tangent space $T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$. For any two tangent vectors $\xi_{\mathbf{X}},\eta_{\mathbf{X}}\in T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$, the metric is defined as
$$g(\xi_{\mathbf{X}},\eta_{\mathbf{X}}) = \sum_{ij}\frac{(\xi_{\mathbf{X}})_{ij}(\eta_{\mathbf{X}})_{ij}}{X_{ij}} = \mathrm{Tr}\big((\xi_{\mathbf{X}}\oslash\mathbf{X})(\eta_{\mathbf{X}})^T\big), \quad (4)$$
where the operator $\oslash$ denotes the element-wise division of two matrices of the same size.

Remark 5. Equivalently, one may use the normalized Riemannian metric
$$g(\xi_{\mathbf{X}},\eta_{\mathbf{X}}) = (\mathbf{p}^T\mathbf{1}_n)\sum_{ij}\frac{(\xi_{\mathbf{X}})_{ij}(\eta_{\mathbf{X}})_{ij}}{X_{ij}}.$$

As one of the building blocks for optimization algorithms on manifolds, we consider how a matrix of size $n\times m$ can be orthogonally projected onto the tangent space $T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ under the Riemannian metric $g$.

Theorem 3. The orthogonal projection from $\mathbb{R}^{n\times m}$ onto $T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ takes the following form:
$$\Pi_{\mathbf{X}}(\mathbf{Y}) = \mathbf{Y} - (\boldsymbol{\alpha}\mathbf{1}_m^T + \mathbf{1}_n\boldsymbol{\beta}^T)\odot\mathbf{X}, \quad (5)$$
where the symbol $\odot$ denotes the Hadamard (element-wise) product, and $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are given by
$$\boldsymbol{\alpha} = (\mathbf{P}-\mathbf{X}\mathbf{Q}^{-1}\mathbf{X}^T)^{+}(\mathbf{Y}\mathbf{1}_m - \mathbf{X}\mathbf{Q}^{-1}\mathbf{Y}^T\mathbf{1}_n) \in \mathbb{R}^n, \quad (6)$$
$$\boldsymbol{\beta} = \mathbf{Q}^{-1}(\mathbf{Y}^T\mathbf{1}_n - \mathbf{X}^T\boldsymbol{\alpha}) \in \mathbb{R}^m, \quad (7)$$
where $\mathbf{Z}^{+}$ denotes the pseudo-inverse of $\mathbf{Z}$, $\mathbf{P}=\mathrm{diag}(\mathbf{p})$ and $\mathbf{Q}=\mathrm{diag}(\mathbf{q})$.

Proof. We only present a simple sketch of the proof here. First, it is easy to verify that for any vectors $\boldsymbol{\alpha}\in\mathbb{R}^n$ and $\boldsymbol{\beta}\in\mathbb{R}^m$, the matrix $\mathbf{N} = (\boldsymbol{\alpha}\mathbf{1}_m^T + \mathbf{1}_n\boldsymbol{\beta}^T)\odot\mathbf{X}$ is orthogonal to the tangent space $T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$. This is because, for any $\mathbf{S}\in T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$, the inner product induced by $g$ satisfies
$$\langle\mathbf{N},\mathbf{S}\rangle_{\mathbf{X}} = \mathrm{Tr}\big((\mathbf{N}\oslash\mathbf{X})\mathbf{S}^T\big) = \mathrm{Tr}\big((\boldsymbol{\alpha}\mathbf{1}_m^T + \mathbf{1}_n\boldsymbol{\beta}^T)\mathbf{S}^T\big) = \boldsymbol{\alpha}^T\mathbf{S}\mathbf{1}_m + \boldsymbol{\beta}^T\mathbf{S}^T\mathbf{1}_n = 0.$$
For any $\mathbf{Y}\in\mathbb{R}^{n\times m}$ and $\mathbf{X}\in\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$, there exist $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ such that the following orthogonal decomposition is valid:
$$\mathbf{Y} = \Pi_{\mathbf{X}}(\mathbf{Y}) + (\boldsymbol{\alpha}\mathbf{1}_m^T + \mathbf{1}_n\boldsymbol{\beta}^T)\odot\mathbf{X}.$$
Multiplying by $\mathbf{1}_m$ and using $\Pi_{\mathbf{X}}(\mathbf{Y})\mathbf{1}_m=\mathbf{0}_n$, direct element-wise manipulation gives
$$\mathbf{Y}\mathbf{1}_m = \mathbf{P}\boldsymbol{\alpha} + \mathbf{X}\boldsymbol{\beta},$$
and similarly
$$\mathbf{Y}^T\mathbf{1}_n = \mathbf{X}^T\boldsymbol{\alpha} + \mathbf{Q}\boldsymbol{\beta}.$$
From the second equation we can express $\boldsymbol{\beta}$ in terms of $\boldsymbol{\alpha}$ as
$$\boldsymbol{\beta} = \mathbf{Q}^{-1}(\mathbf{Y}^T\mathbf{1}_n - \mathbf{X}^T\boldsymbol{\alpha}).$$
Substituting this into the first equation gives
$$\mathbf{Y}\mathbf{1}_m = (\mathbf{P} - \mathbf{X}\mathbf{Q}^{-1}\mathbf{X}^T)\boldsymbol{\alpha} + \mathbf{X}\mathbf{Q}^{-1}\mathbf{Y}^T\mathbf{1}_n.$$
This gives both (6) and (7).
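As an illustration of formulas (5)-(7), the following is a minimal NumPy sketch (our own, not the authors' code) of the orthogonal projection onto the tangent space; the name tangent_projection is assumed for exposition.

```python
import numpy as np

def tangent_projection(Y, X, p, q):
    """Orthogonally project an arbitrary n-by-m matrix Y onto the tangent
    space of the coupling matrix manifold at X, following Eqs. (5)-(7).
    Illustrative sketch; variable names are ours."""
    n, m = X.shape
    P = np.diag(p)
    Qinv = np.diag(1.0 / q)
    # alpha from Eq. (6), using the pseudo-inverse of (P - X Q^{-1} X^T)
    A = P - X @ Qinv @ X.T
    alpha = np.linalg.pinv(A) @ (Y @ np.ones(m) - X @ Qinv @ Y.T @ np.ones(n))
    # beta from Eq. (7)
    beta = Qinv @ (Y.T @ np.ones(n) - X.T @ alpha)
    # Eq. (5): subtract the normal component (alpha 1^T + 1 beta^T) ⊙ X
    return Y - (np.outer(alpha, np.ones(m)) + np.outer(np.ones(n), beta)) * X
```

One can sanity-check the output by verifying that its row sums and column sums are numerically zero, as required by Proposition 2.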


C. Riemannian Gradient and Retraction

The classical gradient descent method can be extended to optimization on a manifold with the aid of the so-called Riemannian gradient. As the coupling matrix manifold is embedded in a Euclidean space, the Riemannian gradient can be calculated by projecting the Euclidean gradient onto the tangent space. Given the Riemannian metric defined in (4), we immediately have the following lemma; see [14], [44].

Lemma 4. Suppose that $f(\mathbf{X})$ is a real-valued smooth function defined on $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ with Euclidean gradient $\mathrm{Grad}f(\mathbf{X})$. Then the Riemannian gradient $\mathrm{grad}f(\mathbf{X})$ can be calculated as
$$\mathrm{grad}f(\mathbf{X}) = \Pi_{\mathbf{X}}(\mathrm{Grad}f(\mathbf{X})\odot\mathbf{X}). \quad (8)$$

Proof. Let $Df(\mathbf{X})[\xi_{\mathbf{X}}]$ denote the directional derivative of $f$ along a tangent vector $\xi_{\mathbf{X}}$. By the definition of the Riemannian gradient with respect to the metric $g(\cdot,\cdot)$ in (4), we have
$$g(\mathrm{grad}f(\mathbf{X}),\xi_{\mathbf{X}}) = Df(\mathbf{X})[\xi_{\mathbf{X}}] = \langle\mathrm{Grad}f(\mathbf{X}),\xi_{\mathbf{X}}\rangle, \quad (9)$$
where the second equality comes from the definition of the Euclidean gradient $\mathrm{Grad}f(\mathbf{X})$ with the classic Euclidean metric $\langle\cdot,\cdot\rangle$. Clearly,
$$\langle\mathrm{Grad}f(\mathbf{X}),\xi_{\mathbf{X}}\rangle = g(\mathrm{Grad}f(\mathbf{X})\odot\mathbf{X},\xi_{\mathbf{X}}), \quad (10)$$
where $g(\mathrm{Grad}f(\mathbf{X})\odot\mathbf{X},\xi_{\mathbf{X}})$ can be computed directly by the formula in (4), although $\mathrm{Grad}f(\mathbf{X})\odot\mathbf{X}$ is not in the tangent space $T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$. Considering its orthogonal decomposition with respect to the tangent space, we have
$$\mathrm{Grad}f(\mathbf{X})\odot\mathbf{X} = \Pi_{\mathbf{X}}(\mathrm{Grad}f(\mathbf{X})\odot\mathbf{X}) + \mathbf{Q}_\perp, \quad (11)$$
where $\mathbf{Q}_\perp$ is the orthogonal complement component satisfying $g(\mathbf{Q}_\perp,\xi_{\mathbf{X}})=0$ for any tangent vector $\xi_{\mathbf{X}}$. Substituting (11) into (10) and combining it with (9) gives
$$Df(\mathbf{X})[\xi_{\mathbf{X}}] = g(\Pi_{\mathbf{X}}(\mathrm{Grad}f(\mathbf{X})\odot\mathbf{X}),\xi_{\mathbf{X}}).$$
Hence $\mathrm{grad}f(\mathbf{X}) = \Pi_{\mathbf{X}}(\mathrm{Grad}f(\mathbf{X})\odot\mathbf{X})$, which completes the proof.

As an important part of the manifold gradient descent process, the retraction maps a tangent vector back to the manifold [1]. For Euclidean submanifolds, the simplest retraction is
$$R_{\mathbf{X}}(\xi_{\mathbf{X}}) = \mathbf{X} + \xi_{\mathbf{X}}.$$
In our case, to ensure $R_{\mathbf{X}}(\xi_{\mathbf{X}})\in\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$, $\xi_{\mathbf{X}}$ must lie in a small neighbourhood of $\mathbf{0}$, particularly when $\mathbf{X}$ has small entries; this results in an inefficient descent optimization process. To obtain a more efficient retraction, following [14], [44], we define $P$ as the projection from the set of element-wise positive matrices $\mathbb{R}^{n\times m}_+$ onto the manifold $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ under the Euclidean metric. Then we have the following lemma.

Lemma 5. For any matrix $\mathbf{M}\in\mathbb{R}^{n\times m}_+$, there exist two diagonal matrices $\mathbf{D}_1\in\mathbb{R}^{n\times n}_+$ and $\mathbf{D}_2\in\mathbb{R}^{m\times m}_+$ such that
$$P(\mathbf{M}) = \mathbf{D}_1\mathbf{M}\mathbf{D}_2 \in \mathcal{C}_n^m(\mathbf{p},\mathbf{q}),$$
where both $\mathbf{D}_1$ and $\mathbf{D}_2$ can be determined by the extended Sinkhorn-Knopp algorithm [36].

The Sinkhorn-Knopp algorithm is specified in Algorithm 1 below, which implements the projection $P$ in Lemma 5.

Algorithm 1 The Sinkhorn-Knopp Algorithm

Input: $\mathbf{M}\in\mathbb{R}^{n\times m}_+$, $\mathbf{p}\in\mathbb{R}^n_+$ and $\mathbf{q}\in\mathbb{R}^m_+$, a tolerance $\varepsilon = 10^{-10}$ and the maximal number of iterations $T$
Output: $\mathbf{D}_1$ and $\mathbf{D}_2$
1: Initialize $\mathbf{d}_2 = \mathbf{q}\oslash(\mathbf{M}^T\mathbf{1}_n)$; $\mathbf{d}_1 = \mathbf{p}\oslash(\mathbf{M}\mathbf{d}_2)$;
2: while the iteration count is less than $T$ do
3:   $\mathbf{d}_2 = \mathbf{q}\oslash(\mathbf{M}^T\mathbf{d}_1)$; $\mathbf{d}_1 = \mathbf{p}\oslash(\mathbf{M}\mathbf{d}_2)$;
4:   $\mathbf{D}_1 = \mathrm{diag}(\mathbf{d}_1)$ and $\mathbf{D}_2 = \mathrm{diag}(\mathbf{d}_2)$;
5:   if $\|\mathbf{D}_1\mathbf{M}\mathbf{d}_2 - \mathbf{p}\| < \varepsilon$ and $\|\mathbf{D}_2\mathbf{M}^T\mathbf{d}_1 - \mathbf{q}\| < \varepsilon$ then
6:     break;
7:   end if
8: end while
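For concreteness, here is a minimal NumPy sketch of the extended Sinkhorn-Knopp projection of Lemma 5 and Algorithm 1 (our own illustrative code, not the authors' implementation); for convenience it returns the scaled matrix $\mathbf{D}_1\mathbf{M}\mathbf{D}_2$ directly, and the name sinkhorn_knopp_projection is assumed.

```python
import numpy as np

def sinkhorn_knopp_projection(M, p, q, tol=1e-10, max_iter=1000):
    """Project a positive matrix M onto the coupling matrix manifold C(p, q)
    by alternately rescaling rows and columns (extended Sinkhorn-Knopp).
    Returns the rescaled matrix D1 @ M @ D2. Illustrative sketch only."""
    d1 = np.ones(M.shape[0])
    d2 = q / (M.T @ d1)              # initial column scaling
    for _ in range(max_iter):
        d1 = p / (M @ d2)            # row scaling to match marginal p
        d2 = q / (M.T @ d1)          # column scaling to match marginal q
        X = (d1[:, None] * M) * d2[None, :]
        if (np.abs(X.sum(axis=1) - p).max() < tol and
                np.abs(X.sum(axis=0) - q).max() < tol):
            break
    return X

# Usage note: X = sinkhorn_knopp_projection(np.exp(-C / lam), p, q) recovers
# the entropy-regularized OT plan discussed in Section III, for a hypothetical
# cost matrix C and regularizer lam.
```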

Based on the projection $P$, we define the following retraction mapping for $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$.

Lemma 6. Let $P$ be the projection defined in Lemma 5. The mapping $R_{\mathbf{X}}: T_{\mathbf{X}}\mathcal{C}_n^m(\mathbf{p},\mathbf{q})\to\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$ given by
$$R_{\mathbf{X}}(\xi_{\mathbf{X}}) = P\big(\mathbf{X}\odot\exp(\xi_{\mathbf{X}}\oslash\mathbf{X})\big)$$
is a valid retraction on $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$. Here $\exp(\cdot)$ is the element-wise exponential function and $\xi_{\mathbf{X}}$ is any tangent vector at $\mathbf{X}$.

Proof. We need to prove that (i) $R_{\mathbf{X}}(\mathbf{0}) = \mathbf{X}$ and (ii) $\gamma_{\xi_{\mathbf{X}}}(\tau) = R_{\mathbf{X}}(\tau\xi_{\mathbf{X}})$ satisfies $\frac{d\gamma_{\xi_{\mathbf{X}}}(\tau)}{d\tau}\big|_{\tau=0} = \xi_{\mathbf{X}}$.

For (i), it is obvious that $R_{\mathbf{X}}(\mathbf{0}) = \mathbf{X}$ as $P(\mathbf{X}) = \mathbf{X}$ for any $\mathbf{X}\in\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$.

For (ii),
$$\frac{d\gamma_{\xi_{\mathbf{X}}}(\tau)}{d\tau}\bigg|_{\tau=0} = \lim_{\tau\to 0}\frac{\gamma_{\xi_{\mathbf{X}}}(\tau) - \gamma_{\xi_{\mathbf{X}}}(0)}{\tau} = \lim_{\tau\to 0}\frac{P\big(\mathbf{X}\odot\exp(\tau\xi_{\mathbf{X}}\oslash\mathbf{X})\big) - \mathbf{X}}{\tau}.$$
As $\exp(\cdot)$, $\odot$ and $\oslash$ are all element-wise operations, the first order approximation of the exponential function gives
$$P\big(\mathbf{X}\odot\exp(\tau\xi_{\mathbf{X}}\oslash\mathbf{X})\big) = P(\mathbf{X} + \tau\xi_{\mathbf{X}}) + o(\tau),$$
where $\lim_{\tau\to 0}\frac{o(\tau)}{\tau} = 0$. The next step is to show that $P(\mathbf{X} + \tau\xi_{\mathbf{X}}) \approx \mathbf{X} + \tau\xi_{\mathbf{X}}$ when $\tau$ is very small. For this purpose, consider a small tangent vector $\Delta\mathbf{X}$ such that $\mathbf{X} + \Delta\mathbf{X}\in\mathbb{R}^{n\times m}_+$. There exist two diagonal matrices $\Delta\mathbf{D}_1\in\mathbb{R}^{n\times n}$ and $\Delta\mathbf{D}_2\in\mathbb{R}^{m\times m}$ with small entries that satisfy
$$P(\mathbf{X} + \Delta\mathbf{X}) = (\mathbf{I}_n + \Delta\mathbf{D}_1)(\mathbf{X} + \Delta\mathbf{X})(\mathbf{I}_m + \Delta\mathbf{D}_2),$$
where $\mathbf{I}_n$ and $\mathbf{I}_m$ are identity matrices. Ignoring higher order small quantities, we have
$$P(\mathbf{X} + \Delta\mathbf{X}) \approx \mathbf{X} + \Delta\mathbf{X} + \Delta\mathbf{D}_1\mathbf{X} + \mathbf{X}\Delta\mathbf{D}_2.$$


As both $P(\mathbf{X} + \Delta\mathbf{X})$ and $\mathbf{X}$ are on the coupling matrix manifold and $\Delta\mathbf{X}$ is a tangent vector, we have
$$\mathbf{p} = P(\mathbf{X} + \Delta\mathbf{X})\mathbf{1}_m \approx (\mathbf{X} + \Delta\mathbf{X} + \Delta\mathbf{D}_1\mathbf{X} + \mathbf{X}\Delta\mathbf{D}_2)\mathbf{1}_m \approx \mathbf{p} + \mathbf{0} + \Delta\mathbf{D}_1\mathbf{p} + \mathbf{X}\Delta\mathbf{D}_2\mathbf{1}_m = \mathbf{p} + \mathbf{P}\boldsymbol{\delta}_{D_1} + \mathbf{X}\boldsymbol{\delta}_{D_2},$$
where $\boldsymbol{\delta}_{D_i}$ denotes the vector of diagonal entries of $\Delta\mathbf{D}_i$ and $\mathbf{P} = \mathrm{diag}(\mathbf{p})$. Hence,
$$\mathbf{P}\boldsymbol{\delta}_{D_1} + \mathbf{X}\boldsymbol{\delta}_{D_2} \approx \mathbf{0}.$$
Similarly,
$$\mathbf{X}^T\boldsymbol{\delta}_{D_1} + \mathbf{Q}\boldsymbol{\delta}_{D_2} \approx \mathbf{0}.$$
That is,
$$\begin{bmatrix}\mathbf{P} & \mathbf{X}\\ \mathbf{X}^T & \mathbf{Q}\end{bmatrix}\begin{bmatrix}\boldsymbol{\delta}_{D_1}\\ \boldsymbol{\delta}_{D_2}\end{bmatrix} \approx \mathbf{0}.$$
Hence $[\boldsymbol{\delta}_{D_1}^T, \boldsymbol{\delta}_{D_2}^T]^T$ lies in the null space of the above matrix, which contains $[\mathbf{1}_n^T, -\mathbf{1}_m^T]^T$. In general, there exists a constant $c$ such that $\boldsymbol{\delta}_{D_1} = c\mathbf{1}_n$ and $\boldsymbol{\delta}_{D_2} = -c\mathbf{1}_m$, and this gives
$$\Delta\mathbf{D}_1\mathbf{X} + \mathbf{X}\Delta\mathbf{D}_2 = \mathbf{0}.$$
Combining all the results obtained above, we have $P(\mathbf{X} + \tau\xi_{\mathbf{X}}) \approx \mathbf{X} + \tau\xi_{\mathbf{X}}$ as $\tau$ is sufficiently small, which completes the proof.
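Combining Lemma 6 with the Sinkhorn-Knopp projection, a single retraction step can be written compactly. The following is an illustrative Python sketch under the assumption that a projection routine such as the sinkhorn_knopp_projection function sketched after Algorithm 1 is available; it is not the authors' implementation.

```python
import numpy as np

def retract(X, xi, p, q, project):
    """Retraction of Lemma 6 on the coupling matrix manifold: scale X
    element-wise by exp(xi / X), then project back onto C(p, q).
    `project` is a Sinkhorn-Knopp style projection, e.g. the
    sinkhorn_knopp_projection sketch given earlier (illustrative only)."""
    M = X * np.exp(xi / X)      # element-wise positive perturbation of X
    return project(M, p, q)     # rescaled to the required marginals
```

For a small tangent vector, the projection only slightly rescales the rows and columns, which is exactly the first-order behaviour established in the proof above.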

D. The Riemannian Hessian

Theorem 7. Let $\mathrm{Grad}f(\mathbf{X})$ and $\mathrm{Hess}f(\mathbf{X})[\xi_{\mathbf{X}}]$ be the Euclidean gradient and Euclidean Hessian, respectively. The Riemannian Hessian $\mathrm{hess}f(\mathbf{X})[\xi_{\mathbf{X}}]$ can be expressed as
$$\mathrm{hess}f(\mathbf{X})[\xi_{\mathbf{X}}] = \Pi_{\mathbf{X}}\Big(\dot{\gamma} - \frac{1}{2}(\gamma\odot\xi_{\mathbf{X}})\oslash\mathbf{X}\Big)$$
with
$$\mu = (\mathbf{P} - \mathbf{X}\mathbf{Q}^{-1}\mathbf{X}^T)^{+},$$
$$\eta = \mathrm{Grad}f(\mathbf{X})\odot\mathbf{X},$$
$$\boldsymbol{\alpha} = \mu(\eta\mathbf{1}_m - \mathbf{X}\mathbf{Q}^{-1}\eta^T\mathbf{1}_n),$$
$$\boldsymbol{\beta} = \mathbf{Q}^{-1}(\eta^T\mathbf{1}_n - \mathbf{X}^T\boldsymbol{\alpha}),$$
$$\gamma = \eta - (\boldsymbol{\alpha}\mathbf{1}_m^T + \mathbf{1}_n\boldsymbol{\beta}^T)\odot\mathbf{X},$$
$$\dot{\mu} = \mu(\mathbf{X}\mathbf{Q}^{-1}\xi_{\mathbf{X}}^T + \xi_{\mathbf{X}}\mathbf{Q}^{-1}\mathbf{X}^T)\mu,$$
$$\dot{\eta} = \mathrm{Hess}f(\mathbf{X})[\xi_{\mathbf{X}}]\odot\mathbf{X} + \mathrm{Grad}f(\mathbf{X})\odot\xi_{\mathbf{X}},$$
$$\dot{\boldsymbol{\alpha}} = \dot{\mu}(\eta\mathbf{1}_m - \mathbf{X}\mathbf{Q}^{-1}\eta^T\mathbf{1}_n) + \mu(\dot{\eta}\mathbf{1}_m - \xi_{\mathbf{X}}\mathbf{Q}^{-1}\eta^T\mathbf{1}_n - \mathbf{X}\mathbf{Q}^{-1}\dot{\eta}^T\mathbf{1}_n),$$
$$\dot{\boldsymbol{\beta}} = \mathbf{Q}^{-1}(\dot{\eta}^T\mathbf{1}_n - \xi_{\mathbf{X}}^T\boldsymbol{\alpha} - \mathbf{X}^T\dot{\boldsymbol{\alpha}}),$$
$$\dot{\gamma} = \dot{\eta} - (\dot{\boldsymbol{\alpha}}\mathbf{1}_m^T + \mathbf{1}_n\dot{\boldsymbol{\beta}}^T)\odot\mathbf{X} - (\boldsymbol{\alpha}\mathbf{1}_m^T + \mathbf{1}_n\boldsymbol{\beta}^T)\odot\xi_{\mathbf{X}},$$
where each dotted quantity denotes the directional derivative of the corresponding quantity along $\xi_{\mathbf{X}}$.

Proof. It is well known [1] that the Riemannian Hessian can be calculated from the Riemannian connection $\nabla$ and the Riemannian gradient via
$$\mathrm{hess}f(\mathbf{X})[\xi_{\mathbf{X}}] = \nabla_{\xi_{\mathbf{X}}}\mathrm{grad}f(\mathbf{X}).$$
Furthermore, the connection $\nabla_{\xi_{\mathbf{X}}}\eta_{\mathbf{X}}$ on the submanifold is given by the projection of the ambient Levi-Civita connection $\overline{\nabla}_{\xi_{\mathbf{X}}}\eta_{\mathbf{X}}$, i.e., $\nabla_{\xi_{\mathbf{X}}}\eta_{\mathbf{X}} = \Pi_{\mathbf{X}}(\overline{\nabla}_{\xi_{\mathbf{X}}}\eta_{\mathbf{X}})$. For the Euclidean space $\mathbb{R}^{n\times m}$ endowed with the Fisher information metric, the same approach as in [44] shows that the Levi-Civita connection is given by
$$\overline{\nabla}_{\xi_{\mathbf{X}}}\eta_{\mathbf{X}} = D(\eta_{\mathbf{X}})[\xi_{\mathbf{X}}] - \frac{1}{2}(\xi_{\mathbf{X}}\odot\eta_{\mathbf{X}})\oslash\mathbf{X}.$$
Hence,
$$\mathrm{hess}f(\mathbf{X})[\xi_{\mathbf{X}}] = \Pi_{\mathbf{X}}\big(\overline{\nabla}_{\xi_{\mathbf{X}}}\mathrm{grad}f(\mathbf{X})\big) = \Pi_{\mathbf{X}}\Big(D(\mathrm{grad}f(\mathbf{X}))[\xi_{\mathbf{X}}] - \frac{1}{2}(\xi_{\mathbf{X}}\odot\mathrm{grad}f(\mathbf{X}))\oslash\mathbf{X}\Big).$$
According to Lemma 4, the directional derivative can be expressed as
$$D(\mathrm{grad}f(\mathbf{X}))[\xi_{\mathbf{X}}] = D(\Pi_{\mathbf{X}}(\eta))[\xi_{\mathbf{X}}] = D\big(\eta - (\boldsymbol{\alpha}\mathbf{1}_m^T + \mathbf{1}_n\boldsymbol{\beta}^T)\odot\mathbf{X}\big)[\xi_{\mathbf{X}}]$$
$$= D(\eta)[\xi_{\mathbf{X}}] - \big(D(\boldsymbol{\alpha})[\xi_{\mathbf{X}}]\mathbf{1}_m^T + \mathbf{1}_nD(\boldsymbol{\beta})[\xi_{\mathbf{X}}]^T\big)\odot\mathbf{X} - (\boldsymbol{\alpha}\mathbf{1}_m^T + \mathbf{1}_n\boldsymbol{\beta}^T)\odot\xi_{\mathbf{X}}.$$
Substituting the expressions for $\eta$, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ and directly computing the directional derivatives gives all the formulae in the theorem.

III. RIEMANNIAN OPTIMIZATION APPLIED TO OT PROBLEMS

In this section, we illustrate Riemannian optimization for solving various OT problems, starting by reviewing the framework of optimization on Riemannian manifolds.

A. Optimization on Manifolds

Early attempts to adapt standard optimization methods to manifolds were presented in [19], in which steepest descent, Newton and quasi-Newton methods were introduced. Second-order geometry related optimization algorithms, such as the Riemannian trust region algorithm, were proposed in [1], where the algorithms were applied to some specific manifolds such as the Stiefel and Grassmann manifolds.

This paper focuses only on the gradient descent method, which is the most widely used optimization method in machine learning. Suppose that $\mathcal{M}$ is a $D$-dimensional Riemannian manifold. Let $f:\mathcal{M}\to\mathbb{R}$ be a real-valued function defined on $\mathcal{M}$. Then, the optimization problem on $\mathcal{M}$ has the form

$$\min_{\mathbf{X}\in\mathcal{M}} f(\mathbf{X}).$$
For any $\mathbf{X}\in\mathcal{M}$ and $\xi_{\mathbf{X}}\in T_{\mathbf{X}}\mathcal{M}$, there always exists a geodesic starting at $\mathbf{X}$ with initial velocity $\xi_{\mathbf{X}}$, denoted by $\gamma_{\xi_{\mathbf{X}}}$. With this geodesic, the so-called exponential mapping $\exp_{\mathbf{X}}: T_{\mathbf{X}}\mathcal{M}\to\mathcal{M}$ is defined as
$$\exp_{\mathbf{X}}(\xi_{\mathbf{X}}) = \gamma_{\xi_{\mathbf{X}}}(1), \quad \text{for any } \xi_{\mathbf{X}}\in T_{\mathbf{X}}\mathcal{M}.$$

Thus the simplest Riemannian gradient descent (RGD) consists of the following two main steps:

1) Compute the Riemannian gradient of $f$ at the current position $\mathbf{X}^{(t)}$, i.e., $\xi_{\mathbf{X}^{(t)}} = \mathrm{grad}f(\mathbf{X}^{(t)})$;
2) Move in the descent direction according to $\mathbf{X}^{(t+1)} = \exp_{\mathbf{X}^{(t)}}(-\alpha\xi_{\mathbf{X}^{(t)}})$ with a step size $\alpha > 0$.

Step 1) is straightforward, as the Riemannian gradient can be calculated from the Euclidean gradient according to (8) in Lemma 4. However, it is generally difficult to compute the exponential map efficiently, as this requires second-order Riemannian geometrical elements to construct the geodesic, which is sometimes not unique at a point on the manifold. Therefore, instead of using the exponential map in RGD, an approximation, namely the retraction map, is commonly adopted. For the coupling matrix manifold $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$, a retraction mapping has been given in Lemma 6. Hence Step 2) of the RGD becomes
$$\mathbf{X}^{(t+1)} = R_{\mathbf{X}^{(t)}}(-\alpha\xi_{\mathbf{X}^{(t)}}).$$

Hence, for any given OT-based optimization problem
$$\min_{\mathbf{X}\in\mathcal{C}_n^m(\mathbf{p},\mathbf{q})} f(\mathbf{X}),$$
conducting the RGD algorithm comes down to the computation of the Euclidean gradient $\mathrm{Grad}f(\mathbf{X})$. Similarly, formulating second-order Riemannian optimization algorithms based on the Riemannian Hessian, such as the Riemannian Newton and Riemannian trust region methods, boils down to calculating the Euclidean Hessian; see Theorem 7.
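To make the two-step scheme concrete, the following is a minimal RGD loop in NumPy. It is an illustrative sketch, not the ManOpt-based implementation used in the experiments; tangent_projection and sinkhorn_knopp_projection refer to the hypothetical helpers sketched in Section II, and egrad is any user-supplied Euclidean gradient.

```python
import numpy as np

def riemannian_gradient_descent(egrad, X0, p, q, step=0.1, iters=200):
    """Minimal RGD on the coupling matrix manifold C(p, q).
    egrad(X): Euclidean gradient of the objective at X (user supplied).
    Uses Eq. (8) for the Riemannian gradient and Lemma 6 for the retraction.
    Illustrative sketch; helpers are assumed from earlier code blocks."""
    X = X0.copy()
    for _ in range(iters):
        G = egrad(X)
        # Riemannian gradient: project the metric-scaled Euclidean gradient
        xi = tangent_projection(G * X, X, p, q)            # Eq. (8)
        # Retraction of Lemma 6: element-wise scaling + Sinkhorn-Knopp projection
        M = X * np.exp(-step * xi / X)
        X = sinkhorn_knopp_projection(M, p, q)
        if np.linalg.norm(xi) < 1e-8:                      # simple stopping rule
            break
    return X

# Example for the classic OT problem of Section III-C, where Grad f(X) = C
# (C is a hypothetical cost matrix); X0 = np.outer(p, q) / p.sum() is a valid
# strictly positive starting plan for coupled p and q:
# X_opt = riemannian_gradient_descent(lambda X: C, np.outer(p, q) / p.sum(), p, q)
```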

B. Computational Complexity of Coupling Matrix Manifold Optimization

In this section we give a simple complexity analysis for optimizing a function defined on the coupling matrix manifold, taking the RGD algorithm as an example. Suppose that we minimize a given objective function $f(\mathbf{X})$ defined on $\mathcal{C}_n^m$. For the sake of simplicity, we consider the case of $m = n$.

In each step of RGD, we first calculate the Euclidean gradient $\mathrm{Grad}f(\mathbf{X}^{(t)})$ with a number of flops $E_t(n)$. In most cases shown in the next subsection, we have $E_t(n) = O(n^2)$. Before applying the descent step, we calculate the Riemannian gradient $\mathrm{grad}f(\mathbf{X}^{(t)})$ by the projection in Lemma 4 and retract via the mapping in Lemma 6, which is implemented by the Sinkhorn-Knopp algorithm in Algorithm 1. The complexity of the Sinkhorn-Knopp algorithm to obtain an $\varepsilon$-approximate solution is $O(n\log(n)\varepsilon^{-3}) = O(n\log(n))$ [2].

If RGD is conducted for $T$ iterations, the overall computational complexity is
$$O(n\log(n)T) + TE_t(n) = O(n\log(n)T) + O(Tn^2) = O(Tn^2).$$
This means the new algorithm is slower than the Sinkhorn-Knopp algorithm; however, that is the price we have to pay for OT problems where the Sinkhorn-Knopp algorithm is not applicable.

C. Application Examples

As mentioned before, basic Riemannian optimization algorithms are constructed from the Euclidean gradient and Hessian of the objective function. In the first part of our application examples, some classic OT problems are presented to illustrate the calculation of the Euclidean gradients and Hessians needed by the Riemannian algorithms.

1) The Classic OT Problem: The objective function of the classic OT problem [37] is
$$\min_{\mathbf{X}\in\mathcal{C}_n^m(\mathbf{p},\mathbf{q})} f(\mathbf{X}) = \mathrm{Tr}(\mathbf{X}^T\mathbf{C}), \quad (12)$$
where $\mathbf{C} = [C_{ij}]\in\mathbb{R}^{n\times m}$ is the given cost matrix and $f(\mathbf{X})$ gives the overall cost under the transport plan $\mathbf{X}$. The solution $\mathbf{X}^*$ to this optimization problem is the transport plan that induces the lowest overall cost $f(\mathbf{X}^*)$. When the cost is measured by the distance between the source object and the target object, the best transport plan $\mathbf{X}^*$ assists in defining the so-called Wasserstein distance between the source distribution and the target distribution.

Given that problem (12) is indeed a linear programming problem, it is straightforward to solve it by linear programming algorithms. In this paper, we solve the OT problem under the Riemannian optimization framework. For the classic OT, the Euclidean gradient and Hessian are easily computed as
$$\mathrm{Grad}f(\mathbf{X}) = \mathbf{C}$$
and
$$\mathrm{Hess}f(\mathbf{X})[\xi] = \mathbf{0}.$$
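As a point of comparison for the experiments in Section IV-A, a linear programming baseline for problem (12) can be written in a few lines. The following SciPy sketch is our own illustration (the function name classic_ot_lp is assumed), not the authors' code.

```python
import numpy as np
from scipy.optimize import linprog

def classic_ot_lp(C, p, q):
    """Solve the classic OT problem (12) as a linear program over the
    transportation polytope (illustrative baseline, not the paper's code)."""
    n, m = C.shape
    # Row-sum constraints X 1_m = p and column-sum constraints X^T 1_n = q,
    # written for the vectorized (row-major) plan x = vec(X).
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(C.flatten(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun   # optimal plan and its total cost
```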

2) The Entropy Regularized OT Problem: The classic OT problem can clearly be generalized to the manifold optimization setting on our coupling matrix manifold $\mathcal{C}_n^m(\mathbf{p},\mathbf{q})$, where $\mathbf{p}^T\mathbf{1}_n=\mathbf{q}^T\mathbf{1}_m$ is not necessarily equal to $1$ and the numbers of rows and columns can be unequal. To improve the efficiency of the algorithm, we add an entropy regularization term. Hence, the OT problem becomes
$$\min_{\mathbf{X}\in\mathcal{C}_n^m(\mathbf{p},\mathbf{q})} f(\mathbf{X}) = \mathrm{Tr}(\mathbf{X}^T\mathbf{C}) - \lambda H(\mathbf{X}),$$
where $H(\mathbf{X})$ is the discrete entropy of the coupling matrix, defined by
$$H(\mathbf{X}) \triangleq -\sum_{ij}X_{ij}\log(X_{ij}).$$
In terms of matrix operations, $H(\mathbf{X})$ has the form
$$H(\mathbf{X}) = -\mathbf{1}_n^T(\mathbf{X}\odot\log(\mathbf{X}))\mathbf{1}_m,$$
where $\log$ applies to each element of the matrix. The minimization is a strictly convex optimization problem, and for $\lambda>0$ the solution $\mathbf{X}^*$ is unique and has the form
$$\mathbf{X}^* = \mathrm{diag}(\boldsymbol{\mu})\mathbf{K}\,\mathrm{diag}(\boldsymbol{\nu}),$$
where $\mathbf{K} = e^{-\mathbf{C}/\lambda}$ is computed entry-wise [36], and $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$ are obtained by the Sinkhorn-Knopp algorithm.

Now, for the objective function $f(\mathbf{X}) = \mathrm{Tr}(\mathbf{X}^T\mathbf{C}) - \lambda H(\mathbf{X})$, one can easily check that the Euclidean gradient is
$$\mathrm{Grad}f(\mathbf{X}) = \mathbf{C} + \lambda(\mathbf{I} + \log(\mathbf{X})),$$
where $\mathbf{I}$ here denotes the $n\times m$ matrix of all ones, and the Euclidean Hessian, in terms of the mapping differential, is given by
$$\mathrm{Hess}f(\mathbf{X})[\xi] = \lambda(\xi\oslash\mathbf{X}).$$
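To use this problem within the Riemannian framework, only the Euclidean gradient and the Euclidean Hessian action are needed. The sketch below transcribes the two formulas above into NumPy callbacks (our own illustration; riemannian_gradient_descent refers to the earlier sketch).

```python
import numpy as np

def entropy_ot_egrad(X, C, lam):
    """Euclidean gradient of f(X) = <C, X> - lam * H(X):
    Grad f(X) = C + lam * (1 + log X), computed element-wise."""
    return C + lam * (1.0 + np.log(X))

def entropy_ot_ehess(X, xi, lam):
    """Euclidean Hessian acting on a direction xi: Hess f(X)[xi] = lam * (xi / X)."""
    return lam * (xi / X)

# These callbacks can be handed to the RGD sketch of Section III-A, e.g.
# X_opt = riemannian_gradient_descent(lambda X: entropy_ot_egrad(X, C, lam), X0, p, q)
```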


3) The Power Regularization for OT Problem: Dessein et al. [13] further extended the regularization to
$$\min_{\mathbf{X}\in\mathcal{C}_n^n(\mathbf{p},\mathbf{q})} \mathrm{Tr}(\mathbf{X}^T\mathbf{C}) + \lambda\phi(\mathbf{X}),$$
where $\phi$ is an appropriate convex function. As an example, we consider the squared regularization proposed in [15],
$$\min_{\mathbf{X}\in\mathcal{C}_n^n(\mathbf{p},\mathbf{q})} f(\mathbf{X}) = \mathrm{Tr}(\mathbf{X}^T\mathbf{C}) + \lambda\sum_{ij}X_{ij}^2,$$
and we apply a zero-truncation operator in the manifold algorithm. It is then straightforward to prove that
$$\mathrm{Grad}f(\mathbf{X}) = \mathbf{C} + 2\lambda\mathbf{X}$$
and
$$\mathrm{Hess}f(\mathbf{X})[\xi] = 2\lambda\xi.$$

The Tsallis Regularized Optimal Transport is used in [34] to define the trot distance, which comes with the following regularized problem:
$$\min_{\mathbf{X}\in\mathcal{C}_n^n(\mathbf{p},\mathbf{q})} f(\mathbf{X}) = \mathrm{Tr}(\mathbf{X}^T\mathbf{C}) - \frac{\lambda}{1-q}\sum_{ij}(X_{ij}^q - X_{ij}).$$
For the sake of convenience, we denote $\mathbf{X}^q := [X_{ij}^q]_{i=1,j=1}^{n,m}$ for any given constant $q > 0$. Then we have
$$\mathrm{Grad}f(\mathbf{X}) = \mathbf{C} - \frac{\lambda}{1-q}(q\mathbf{X}^{q-1} - \mathbf{I})$$
and
$$\mathrm{Hess}f(\mathbf{X})[\xi] = q\lambda\big[\mathbf{X}^{q-2}\odot\xi\big],$$
where $\mathbf{I}$ denotes the matrix of all ones.
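For completeness, the Euclidean gradients of these two regularized problems translate directly into code. The following NumPy sketch is our own illustration of the formulas above, assuming the Tsallis parameter q differs from 1.

```python
import numpy as np

def squared_reg_egrad(X, C, lam):
    """Euclidean gradient of f(X) = <C, X> + lam * sum_ij X_ij^2:
    Grad f(X) = C + 2*lam*X (illustrative sketch)."""
    return C + 2.0 * lam * X

def tsallis_egrad(X, C, lam, q):
    """Euclidean gradient of the Tsallis-regularized objective:
    Grad f(X) = C - lam/(1-q) * (q * X**(q-1) - 1), element-wise, with q != 1."""
    return C - lam / (1.0 - q) * (q * X ** (q - 1.0) - 1.0)
```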

4) The Order-Preserving OT Problem: The order-preserving OT problem was proposed in [42] and adopted by [43] for learning the distance between sequences. This learning process takes the local order of temporal sequences into account, and the learned transport defines a flexible alignment between two sequences; the optimal transport plan only assigns large loads to the most similar instance pairs of the two sequences.

For sequences $\mathbf{U}=(\mathbf{u}_1,\ldots,\mathbf{u}_n)$ and $\mathbf{V}=(\mathbf{v}_1,\ldots,\mathbf{v}_m)$ in their respective given orders, the distance matrix between them is
$$\mathbf{C} = [d(\mathbf{u}_i,\mathbf{v}_j)^2]_{i=1,j=1}^{n,m}.$$
Define an $n\times m$ matrix (distance between orders)
$$\mathbf{D} = \left[\frac{1}{(i/n - j/m)^2 + 1}\right]$$
and the (exponential) similarity matrix
$$\mathbf{P} = \frac{1}{\sigma\sqrt{2\pi}}\left[\exp\left(-\frac{l(i,j)^2}{2\sigma^2}\right)\right],$$
where $\sigma>0$ is the scaling factor and
$$l(i,j) = \left|\frac{i/n - j/m}{\sqrt{1/n^2 + 1/m^2}}\right|.$$
The (squared) distance between sequences $\mathbf{U}$ and $\mathbf{V}$ is given by
$$d^2(\mathbf{U},\mathbf{V}) = \mathrm{Tr}(\mathbf{C}^T\mathbf{X}^*), \quad (13)$$
where the optimal transport plan $\mathbf{X}^*$ is the solution to the following order-preserving regularized OT problem:
$$\mathbf{X}^* = \arg\min_{\mathbf{X}\in\mathcal{C}_n^m(\mathbf{p},\mathbf{q})} f(\mathbf{X}) = \mathrm{Tr}\big(\mathbf{X}^T(\mathbf{C}-\lambda_1\mathbf{D})\big) + \lambda_2\mathrm{KL}(\mathbf{X}\|\mathbf{P}),$$
where the KL-divergence is defined as
$$\mathrm{KL}(\mathbf{X}\|\mathbf{P}) = \sum_{ij}X_{ij}\big(\log(X_{ij}) - \log(P_{ij})\big),$$
and specifically $\mathbf{p} = \frac{1}{n}\mathbf{1}_n$ and $\mathbf{q} = \frac{1}{m}\mathbf{1}_m$ are uniform distributions. Hence
$$\mathrm{Grad}f(\mathbf{X}) = (\mathbf{C}-\lambda_1\mathbf{D}) + \lambda_2\big(\mathbf{I} + \log(\mathbf{X}) - \log(\mathbf{P})\big)$$
and
$$\mathrm{Hess}f(\mathbf{X})[\xi] = \lambda_2(\xi\oslash\mathbf{X}),$$
where $\mathbf{I}$ again denotes the $n\times m$ matrix of all ones.
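To plug this model into the Riemannian framework, one again only needs the Euclidean gradient. The sketch below constructs the auxiliary matrices D and P and the gradient according to the formulas above; it is our own illustration (1-based sequence positions are assumed), not the code used for the experiments.

```python
import numpy as np

def build_prior_matrices(n, m, sigma=1.0):
    """Construct the order-distance matrix D and the Gaussian prior P used in
    the order-preserving OT model (illustrative construction)."""
    i = (np.arange(1, n + 1) / n)[:, None]        # relative position i/n
    j = (np.arange(1, m + 1) / m)[None, :]        # relative position j/m
    D = 1.0 / ((i - j) ** 2 + 1.0)
    l = np.abs(i - j) / np.sqrt(1.0 / n ** 2 + 1.0 / m ** 2)
    P = np.exp(-l ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    return D, P

def order_preserving_egrad(X, C, D, P, lam1, lam2):
    """Euclidean gradient of f(X) = <C - lam1*D, X> + lam2*KL(X || P):
    Grad f(X) = (C - lam1*D) + lam2*(1 + log X - log P). Sketch only."""
    return (C - lam1 * D) + lam2 * (1.0 + np.log(X) - np.log(P))
```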

5) The OT Domain Adaptation Problem: OT has also been widely used for solving domain adaptation problems. The authors of [9] formalized two class-based regularized OT problems, namely the group-induced OT (OT-GL) and the Laplacian regularized OT (OT-Laplace). As OT-Laplace is found to be the best performer for domain adaptation, we only apply our coupling matrix manifold optimization to it, and we summarize its objective function here.

As pointed out in [9], this regularization aims at preserving the data graph structure during transport. Let $\mathbf{P}_s = [\mathbf{p}^s_1,\mathbf{p}^s_2,\ldots,\mathbf{p}^s_n]$ be the $n$ source data points and $\mathbf{P}_t = [\mathbf{p}^t_1,\mathbf{p}^t_2,\ldots,\mathbf{p}^t_m]$ the $m$ target data points, both defined in $\mathbb{R}^d$; thus $\mathbf{P}_s\in\mathbb{R}^{d\times n}$ and $\mathbf{P}_t\in\mathbb{R}^{d\times m}$. The purpose of domain adaptation is to transport the source $\mathbf{P}_s$ towards the target $\mathbf{P}_t$ so that the transported source $\widehat{\mathbf{P}}_s = [\widehat{\mathbf{p}}^s_1,\widehat{\mathbf{p}}^s_2,\ldots,\widehat{\mathbf{p}}^s_n]$ and the target $\mathbf{P}_t$ can be jointly used for other learning tasks.

Now suppose that for the source data we have extra label information $\mathbf{Y}_s = [y^s_1,y^s_2,\ldots,y^s_n]$. With this label information we sparsify the similarities $\mathbf{S}_s = [S_s(i,j)]_{i,j=1}^{n}\in\mathbb{R}^{n\times n}_+$ among the source data such that $S_s(i,j)=0$ if $y^s_i\neq y^s_j$ for $i,j=1,2,\ldots,n$. That is, we define a zero similarity between two source data points if they do not belong to the same class or do not have the same labels. Then the following regularization is proposed:
$$\Omega^s_c(\mathbf{X}) = \frac{1}{n^2}\sum_{i,j=1}^{n}S_s(i,j)\|\widehat{\mathbf{p}}^s_i - \widehat{\mathbf{p}}^s_j\|_2^2.$$
With a given transport plan $\mathbf{X}$, we can use the barycentric mapping in the target domain as the transported point for each source point [9]. When we use uniform marginals for both source and target and the $\ell_2$ cost, the transported source is expressed as
$$\widehat{\mathbf{P}}_s = n\mathbf{X}\mathbf{P}_t. \quad (14)$$
It is easy to verify that
$$\Omega^s_c(\mathbf{X}) = \mathrm{Tr}(\mathbf{P}_t^T\mathbf{X}^T\mathbf{L}_s\mathbf{X}\mathbf{P}_t), \quad (15)$$
where $\mathbf{L}_s = \mathrm{diag}(\mathbf{S}_s\mathbf{1}_n) - \mathbf{S}_s$ is the Laplacian of the graph $\mathbf{S}_s$, so the regularizer $\Omega^s_c(\mathbf{X})$ is quadratic with respect to $\mathbf{X}$. Similarly, when the Laplacian $\mathbf{L}_t$ of the target domain is available, the following symmetric Laplacian regularization is proposed:
$$\Omega_c(\mathbf{X}) = (1-\alpha)\mathrm{Tr}(\mathbf{P}_t^T\mathbf{X}^T\mathbf{L}_s\mathbf{X}\mathbf{P}_t) + \alpha\mathrm{Tr}(\mathbf{P}_s^T\mathbf{X}\mathbf{L}_t\mathbf{X}^T\mathbf{P}_s) = (1-\alpha)\Omega^s_c(\mathbf{X}) + \alpha\Omega^t_c(\mathbf{X}).$$

When $\alpha=0$, this reduces to the regularizer $\Omega^s_c(\mathbf{X})$ in (15). Finally, the OT domain adaptation is defined by the following Laplacian regularized OT problem:
$$\min_{\mathbf{X}\in\mathcal{C}_n^m(\mathbf{1}_n,\mathbf{1}_m)} f(\mathbf{X}) = \mathrm{Tr}(\mathbf{X}^T\mathbf{C}) - \lambda H(\mathbf{X}) + \frac{1}{2}\eta\,\Omega_c(\mathbf{X}). \quad (16)$$
Hence the Euclidean gradient and Euclidean Hessian are given by
$$\mathrm{Grad}f(\mathbf{X}) = \mathbf{C} + \lambda(\mathbf{I} + \log(\mathbf{X})) + \eta\big((1-\alpha)\mathbf{L}_s\mathbf{X}\mathbf{P}_t\mathbf{P}_t^T + \alpha\mathbf{P}_s\mathbf{P}_s^T\mathbf{X}\mathbf{L}_t\big)$$
and
$$\mathrm{Hess}f(\mathbf{X})[\xi] = \lambda(\xi\oslash\mathbf{X}) + \eta\big((1-\alpha)\mathbf{L}_s\xi\mathbf{P}_t\mathbf{P}_t^T + \alpha\mathbf{P}_s\mathbf{P}_s^T\xi\mathbf{L}_t\big),$$
respectively, where $\mathbf{I}$ denotes the $n\times m$ matrix of all ones.
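As before, only the Euclidean gradient is needed to run the Riemannian algorithms on (16). The sketch below is our own illustration of the gradient formula, under the assumption that the source and target samples are stored as the rows of matrices of sizes n-by-d and m-by-d so that the matrix products conform.

```python
import numpy as np

def laplacian_ot_egrad(X, C, Ls, Lt, source_feats, target_feats, lam, eta, alpha=0.0):
    """Euclidean gradient of the Laplacian-regularized objective (16):
    C + lam*(1 + log X)
      + eta*((1-alpha) * Ls X Pt Pt^T + alpha * Ps Ps^T X Lt),
    where Ps = source_feats (n x d) and Pt = target_feats (m x d) hold the
    samples as rows. Illustrative sketch only."""
    Ps = source_feats                      # (n, d)
    Pt = target_feats                      # (m, d)
    grad = C + lam * (1.0 + np.log(X))
    grad = grad + eta * ((1.0 - alpha) * Ls @ X @ Pt @ Pt.T
                         + alpha * Ps @ Ps.T @ X @ Lt)
    return grad
```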

IV. EXPERIMENTAL RESULTS AND COMPARISONS

In this section, we investigate the performance of our proposed methods. The implementation of the coupling matrix manifold follows the framework of the ManOpt Matlab toolbox (http://www.manopt.org), from which we call the conjugate gradient descent algorithm as our Riemannian optimization solver in the experiments. All experiments are carried out on a laptop computer with a 64-bit operating system, an Intel Core i5-8350U 1.90GHz CPU and 16GB RAM, using MATLAB 2019a.

A. Synthetic Data for the Classic OT Problem

First of all, we conduct a numerical experiment on a classic OT problem with synthetic data to demonstrate the performance of the proposed optimization algorithms.

Consider the following source load $\mathbf{p}$ and target load $\mathbf{q}$, and their per-unit cost matrix $\mathbf{C}$:
$$\mathbf{p} = \begin{bmatrix}3\\3\\3\\4\\2\\2\\2\\1\end{bmatrix},\quad \mathbf{q} = \begin{bmatrix}4\\2\\6\\4\\4\end{bmatrix},\quad \mathbf{C} = \begin{bmatrix}0 & 0 & 1.2 & 2 & 2\\ 2 & 4 & 4 & 4 & 0\\ 1 & 0 & 0 & 0 & 3\\ 0 & 1 & 2 & 1 & 3\\ 1 & 1 & 0 & 1 & 2\\ 2 & 1 & 2 & 0.8 & 3\\ 4 & 0 & 0 & 1 & 1\\ 0 & 1 & 0 & 1 & 3\end{bmatrix}.$$

For this setting, we solve the classic OT problem using the coupling matrix manifold optimization (CMM) and the standard linear programming (LinProg) algorithm, respectively. We visualize the learned transport plan matrices from both algorithms in Fig. 1. The results reveal that the linear programming algorithm is constrained only by a non-negativity condition on the entries of the transport plan, and hence the output transportation plan exhibits a sparse pattern.

Fig. 1: Two transport plan matrices via (a) Linear Programming and (b) Coupling Matrix Manifold Optimization.

In contrast, our coupling matrix manifold imposes strict positivity constraints and therefore generates a less sparse solution plan, which gives a preferred pattern in many practical problems. The proposed manifold optimization performs well in this illustrative example.

Next we consider an entropy regularized OT problem, which can be easily solved by the Sinkhorn algorithm. We test both the Sinkhorn algorithm and the new coupling matrix manifold optimization on the same synthetic problem over 100 regularizer values $\lambda$ on a log scale ranging over $[-2, 2]$, i.e., $\lambda = 0.001$ to $100.0$. The mean squared error is used as the criterion to measure the closeness between the transport plan matrices produced by the two algorithms.

Fig. 2: The error between the two transport plan matrices given by the two algorithms versus the regularizer $\lambda$.

From Fig. 2, we observe that the Sinkhorn algorithm breaks down for $\lambda < 0.001$ due to computational instability. On the contrary, the manifold-assisted algorithm generates reasonable results for a wider range of regularizer values. We also observe that both algorithms give almost exactly the same transport plan matrices when $\lambda > 0.1668$. However, in terms of computational time, the Sinkhorn algorithm is generally more efficient than the manifold-assisted method for the entropy regularized OT problem.

B. Experiments on the Order-Preserving OT

In this experiment, we demonstrate the performance in calculating the order-preserving Wasserstein distance [42] using a real dataset.


Algorithms  | 1NN    | 3NN    | 5NN    | 7NN    | 13NN   | 19NN
S-OPW [42]  | 0.8236 | 0.8454 | 0.8454 | 0.8418 | 0.8473 | 0.8290
(std)       | 0.0357 | 0.0215 | 0.0215 | 0.0220 | 0.0272 | 0.0240
CM-OPW      | 0.8091 | 0.8309 | 0.8255 | 0.8218 | 0.8109 | 0.8091
(std)       | 0.0275 | 0.0212 | 0.0194 | 0.0196 | 0.0317 | 0.0315

TABLE I: The classification accuracy of the kNN classifiers based on the two algorithms for the order-preserving Wasserstein distance.

The "Spoken Arabic Digits (SAD)" dataset, available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spoken+Arabic+Digit), contains 8,800 vectorial sequences from ten spoken Arabic digits. The sequences consist of time series of the mel-frequency cepstrum coefficients (MFCC) features extracted from speech signals. This is a classification task with ten classes. The full training set has 660 sequence samples per digit, each digit spoken repeatedly 10 times by 44 male and 44 female native Arabic speakers. For each digit, another 220 samples are retained as the testing set.

The experimental setting is similar to that in [42]. Based on the order-preserving Wasserstein distance (OPW) between any two sequences, we directly test the nearest neighbour (NN) classifier. To define the distance in (13), we use three hyperparameters: the width parameter $\sigma$ of the radial basis function (RBF) and the two regularizers $\lambda_1$ and $\lambda_2$. For comparative purposes, these hyperparameters are chosen to be $\sigma = 1$, $\lambda_1 = 50$ and $\lambda_2 = 0.1$, as in [42]. Our purpose here is to illustrate that the performance of the NN classifier based on the coupling matrix manifold optimization algorithm (named CM-OPW) is comparable to the NN classification results from the Sinkhorn algorithm (named S-OPW). We randomly choose 10% of the training data and 10% of the testing data for each run of the experiments. The mean classification accuracy and its standard error over five runs are reported in TABLE I.

In this experiment, we also observe that the distance calculation fails for some pairs of training and testing sequences due to the numerical instability of the Sinkhorn algorithm. Our conclusion is that the performance of the manifold-based algorithm is comparable in terms of classification accuracy. When $k = 1$, the test sequence is also viewed as a query to retrieve the training sequences, and the mean average precision is MAP = 0.1954 for S-OPW and MAP = 0.3654 for CM-OPW. Theoretically, the Sinkhorn algorithm is super-fast, outperforming all other existing algorithms; however, it is not applicable to OT problems with non-entropy regularizations. We demonstrate such problems in the next subsections.

C. Laplacian Regularized OT Problems: Synthetic Domain Adaptation

Courty et al. [9] analyzed the two-moon datasets and found that the OT domain adaptation method significantly outperformed the subspace alignment method.

We use the same experimental data and protocol as in [9] to perform a direct and fair comparison between the results¹.

¹We sincerely thank the authors of [9] for providing us with the complete simulated two-moon datasets.

Fig. 3: Two-moon example for increasing rotation angles: (a) rotation = 10°, (b) rotation = 30°, (c) rotation = 50°, (d) rotation = 90°.

The two domains represent the source and the target respectively, each presenting two moon shapes associated with two specific classes; see Fig. 3.

The source domain contains 150 data points sampled from the two moons. Similarly, the target domain has the same number of data points, sampled from two moon shapes rotated by a given angle from the base moons used in the source domain. A classifier for the data points from the two domains is trained once the transportation process is finished.

To test the generalization capability of the classifier based on the manifold optimization method, we sample a set of 1000 data points according to the distribution of the target domain and we repeat the experiment 10 times, each of which is conducted on 9 different target domains corresponding to rotations of 10°, 20°, 30°, 40°, 50°, 60°, 70°, 80° and 90°, respectively. We report the mean classification error and variance as comparison criteria.

We train SVM classifiers with a Gaussian kernel, whose parameters are automatically set by 5-fold cross-validation. The final results are shown in TABLE II. For comparative purposes, we also present the results based on the DASVM approach [8] and PBDA [21], taken from [9].

From TABLE II, we observe that the coupling matrix manifold assisted optimization algorithm significantly improves upon the GCG (generalized conditional gradient) algorithm, which ignores the manifold constraints and instead imposes a weaker Lagrangian condition in the objective function. The latter results in a sub-optimal transport plan, producing poorer transported source data points.


Rotation angle    | 10°    | 20°    | 30°    | 40°    | 50°    | 70°    | 90°
SVM (no adapt.)   | 0      | 0.104  | 0.24   | 0.312  | 0.4    | 0.764  | 0.828
DASVM [8]         | 0      | 0      | 0.259  | 0.284  | 0.334  | 0.747  | 0.82
PBDA [21]         | 0      | 0.094  | 0.103  | 0.225  | 0.412  | 0.626  | 0.687
OT-Laplace [9]    | 0      | 0      | 0.004  | 0.062  | 0.201  | 0.402  | 0.524
CM-OT-Lap (ours)  | 0.0027 | 0.0043 | 0.0014 | 0.0142 | 0.0301 | 0.0446 | 0.0797
(variance)        | 0.0000 | 0.0002 | 0.0000 | 0.0007 | 0.0013 | 0.0015 | 0.0057

TABLE II: Mean error rate over 10 realizations for the two-moon simulated example.

D. Laplacian Regularized OT Problems: Image Domain Adaptation

We now apply our manifold-based algorithm to solve the Laplacian regularized OT problem for challenging real-world adaptation tasks. In this experiment, we test domain adaptation for both handwritten digit images and face images for recognition. We follow the same setting used in [9] for a fair comparison.

1) Digit recognition: We use the two famous handwritten digit datasets USPS and MNIST as the source and target domains, and vice versa, in our experiment². The datasets share 10 classes (the single digits 0-9). We randomly sampled 1,800 images from USPS and 2,000 from MNIST. In order to unify the dimensions of the two domains, the MNIST images are re-sized to the same 16 × 16 resolution as USPS. The grey levels of all images are then normalized to produce the final feature space for both domains. For this case, we have two settings: U-M (USPS as source and MNIST as target) and M-U (MNIST as source and USPS as target).

2) Face Recognition: In the face recognition experiment, we use the PIE ("Pose, Illumination, Expression") dataset, which contains 32 × 32 images of 68 individuals under different pose, illumination and expression conditions³. In order to make a fair and reasonable comparison with [9], we select PIE05 (C05, denoted P1, left pose), PIE07 (C07, denoted P2, upward pose), PIE09 (C09, denoted P3, downward pose) and PIE29 (C29, denoted P4, right pose). These four domains induce 12 adaptation problems with increasing difficulty (the hardest adaptation is from the left pose to the right pose). Note that the large variability between the domains is due to illumination and expression.

3) Experiment Settings and Result Analysis: We generate the experimental results by applying the manifold-based algorithm to two types of Laplacian regularized problems, namely Problem (16) with α = 0 (CMM-OT-Lap) and with α = 0.5 (CMM-OT-symmLap). We follow the same experimental settings as in [9]. For all methods, the regularization parameter λ was set to 0.01; similarly, the parameter η that controls the strength of the Laplacian terms was set to 0.1.

In both the face and digit recognition experiments, a 1NN classifier is trained with the adapted source data and the target data, and we report the overall accuracy (OA) score (in %) calculated on testing samples from the target domain. In TABLE III we compare the OA of our CMM-OT solutions with the baseline methods and with the results generated by the methods provided in [9]. Note that we applied both the coupling matrix

²Both datasets can be found at http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html.

³http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html

OT Laplacian and the coupling matrix OT symmetric Laplacian algorithms in all experiments; due to the high similarity of the results generated by these two methods, we only list the OA obtained by the non-symmetric CMM-OT-Lap algorithm in the table.

The OA based on the solution generated by the CMM-based OT Laplacian algorithm outperforms all other methods in both the digit and face recognition experiments, with mean OA = 65.52% and 72.59%, respectively. On average, our method increases the OA by 4% and 16% over the previous results. However, for the adaptation problem of the highest difficulty, P1 to P4, we obtain a result similar to previous work, with OA = 47.54% from [9] and 48.98% from our method, respectively.

Domains | 1NN   | OT-IT | OT-Lap | CMM-OT-Lap
U-M     | 39.00 | 53.66 | 57.43  | 60.67
M-U     | 58.33 | 64.73 | 64.72  | 70.37
mean    | 48.66 | 59.20 | 61.07  | 65.52
P1-P2   | 23.79 | 53.73 | 58.92  | 58.08
P1-P3   | 23.50 | 57.43 | 57.62  | 62.65
P1-P4   | 15.69 | 47.21 | 47.54  | 48.98
P2-P1   | 24.27 | 60.21 | 62.74  | 93.10
P2-P3   | 44.45 | 63.24 | 64.29  | 69.18
P2-P4   | 25.86 | 51.48 | 53.52  | 65.10
P3-P1   | 20.95 | 57.50 | 57.87  | 91.70
P3-P2   | 40.17 | 63.61 | 65.75  | 75.66
P3-P4   | 26.16 | 52.33 | 54.02  | 87.60
P4-P1   | 18.14 | 45.15 | 45.67  | 90.30
P4-P2   | 24.37 | 50.71 | 52.50  | 66.46
P4-P3   | 27.30 | 52.10 | 52.71  | 62.29
mean    | 26.22 | 54.56 | 56.10  | 72.59

TABLE III: Overall recognition accuracies (in %) for both digit and face recognition.

V. CONCLUSIONS

This paper explores the so-called coupling matrix manifolds on which the majority of OT objective functions are defined. We formally defined the manifold, explored its tangent spaces, defined a Riemannian metric based on the Fisher information measure, and derived the formulas for the Riemannian gradient, the Riemannian Hessian and an appropriate retraction as the major ingredients for implementing Riemannian optimization on the manifold. We applied manifold-based optimization algorithms (Riemannian gradient descent and the second-order Riemannian trust region method) to several types of OT problems, including the classic OT problem, the entropy regularized OT problem, the power regularized OT problem, the state-of-the-art order-preserving Wasserstein distance problem and the OT problem in regularized domain adaptation applications. The results from three sets of numerical experiments demonstrate that the newly proposed Riemannian optimization algorithms perform as well as classic algorithms such as the Sinkhorn algorithm. We also find that the new algorithm outperforms the generalized conditional gradient when solving non-entropy regularized OT problems where the classic Sinkhorn algorithm is not applicable.

ACKNOWLEDGEMENT

This project is partially supported by the University of Sydney Business School ARC Bridging grant.

REFERENCES

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms onmatrix manifolds. Princeton University Press, 2008.

[2] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear timeapproximation algorithms for optimal transport via Sinkhorn iteration. InProceedings of the 31st International Conference on Neural InformationProcessing Systems, NIPS’17, pages 1961–1971, USA, 2017. CurranAssociates Inc.

[3] Luca Ambrogioni, Umut Guclu, Yagmur Gucluturk, Max Hinne, EricMaris, and Marcel A. J. van Gerven. Wasserstein variational inference.In Proceedings of the 32Nd International Conference on Neural Infor-mation Processing Systems, NIPS’18, pages 2478–2487, USA, 2018.Curran Associates Inc.

[4] Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN.CoRR, abs/1701.07875, 2017.

[5] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.[6] Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl Johann Simon-

Gabriel, and Bernhard Scholkopf. From optimal transport to generativemodeling: the vegan cookbook. Technical report, 2017.

[7] Haım Brezis. Remarks on the Monge-Kantorovich problem in thediscrete setting. Comptes Rendus Mathematique, 356(2):207–213, 2018.

[8] L. Bruzzone and M. Marconcini. Domain adaptation problems: ADASVM classification technique and a circular validation strategy. IEEETransactions on Pattern Analysis andMachine Intelligence, 32(5):770–787, 2010.

[9] Nicolas Courty, Remi Flamary, Devis Tuia, and Alain Rakotomamonjy.Optimal transport for domain adaptation. IEEE transactions on patternanalysis and machine intelligence, 39(9):1853–1865, 2016.

[10] Marco Cuturi. Sinkhorn distances: lightspeed computation of optimaltransport. In Advances in Neural Information Processing Systems,volume 26, pages 2292–2300, 2013.

[11] Marco Cuturi and Arnaud Doucet. Fast computation of Wassersteinbarycenters. In Eric P. Xing and Tony Jebara, editors, Proceedings ofthe 31st International Conference on Machine Learning, volume 32 ofProceedings of Machine Learning Research, pages 685–693, Bejing,China, 22–24 Jun 2014. PMLR.

[12] Jesus A De Loera and Edward D Kim. Combinatorics and geometry oftransportation polytopes: an update. Discrete geometry and algebraiccombinatorics, 625:37–76, 2014.

[13] Arnaud Dessein, Nicolas Papadakis, and Jean-Luc Rouas. Regularisedoptimal transport and the rot mover’s distance. Journal of MachineLearning Research, 19(15):1–53, 2018.

[14] A. Douik and B. Hassibi. Manifold optimization over the set of doublystochastic matrices: A second-order geometry. arXiv, 1802.02628:1–20,2018.

[15] Montacer Essid and Justin Solomon. Quadratically regularized op-timal transport on graphs. SIAM Journal on Scientific Computing,40(4):A1961–A1986, 2018.

[16] S. Ferradans, N. Papadakis, G. Peyre, and J.-F. Aujol. Regularized dis-crete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014.

[17] Remi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rako-tomamonjy. Wasserstein discriminant analysis. Machine Learning,107(12):1923–1945, Dec 2018.

[18] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, and Tomaso A. Poggio. Learning with a Wasserstein loss. InAdvances inNeural Information Processing Systems (NIPS), volume 28,2015.

[19] D. Gabay. Minimizing a differentiable function over a differential manifold. Journal of Optimization Theory and Applications, 37(2):177–219, 1982.

[20] Aude Genevay, Marco Cuturi, Gabriel Peyre, and Francis Bach. Stochastic optimization for large-scale optimal transport. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.

[21] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In Proceedings of International Conference on Machine Learning (ICML), pages 738–746, Atlanta, USA, 2013.

[22] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.

[23] X. Hong and Junbin Gao. Sparse density estimation on multinomial manifold combining local component analysis. In Proceedings of International Joint Conference on Neural Networks (IJCNN), pages 1–7, July 2015.

[24] Xia Hong and Junbin Gao. Estimating the square root of probability density function on Riemannian manifold. Expert Systems, In Press, 2018.

[25] Xia Hong, Junbin Gao, Sheng Chen, and Tanveer Zia. Sparse density estimation on the multinomial manifold. IEEE Transactions on Neural Networks and Learning Systems, 26:2972–2977, 2015.

[26] Leonid V Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201, 1942.

[27] P. A. Knight. The Sinkhorn-Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.

[28] Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced Wasserstein auto-encoders. In Proceedings of International Conference on Learning Representation (ICLR), 2019.

[29] Y. T. Lee and A. Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In Proceedings of IEEE 55th Annual Symposium on Foundations of Computer Science, pages 424–433, Oct 2014.

[30] Gal Maman, Or Yair, Danny Eytan, and Ronen Talmon. Domain adaptation using Riemannian geometry of SPD matrices. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4464–4468, Brighton, United Kingdom, May 2019. IEEE.

[31] Michael Miller and Jan Van Lent. Monge's optimal transport distance with applications for nearest neighbour image classification. CoRR, abs/1612.00181, 2016.

[32] Gaspard Monge. Memoire sur la theorie des deblais et des remblais. Histoire de l'Academie Royale des Sciences de Paris, 1781.

[33] Gregoire Montavon, Klaus-Robert Muller, and Marco Cuturi. Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, volume 29, pages 3718–3726, 2016.

[34] Boris Muzellec, Richard Nock, Giorgio Patrini, and Frank Nielsen. Tsallis regularized optimal transport and ecological inference. In Proceedings of AAAI, pages 2387–2393, 2017.

[35] Victor M. Panaretos and Yoav Zemel. Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application, 6:405–431, 2019.

[36] G. Peyre and M. Cuturi. Computational Optimal Transport: With Applications to Data Science. Foundations and Trends in Machine Learning Series. Now Publishers, 2019.

[37] Gabriel Peyre, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[38] Maurice Queyranne and Frits Spieksma. Multi-index transportation problems: Multi-index transportation problems MITP. Encyclopedia of Optimization, pages 2413–2419, 2009.

[39] Julien Rabin and Nicolas Papadakis. Convex color image segmentation with optimal transport distances. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 256–269. Springer, 2015.

[40] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[41] Justin Solomon, Fernando de Goes, Gabriel Peyre, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):66:1–66:11, July 2015.

[42] Bing Su and Gang Hua. Order-preserving Wasserstein distance for sequence matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1057, 2017.

[43] Bing Su and Ying Wu. Learning distance for sequences by learning a ground metric. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

[44] Yanfeng Sun, Junbin Gao, Xia Hong, Bamdev Mishra, and Baocai Yin. Heterogeneous tensor decomposition for clustering via manifold optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:476–489, 2016.

[45] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In Proceedings of International Conference on Learning Representation, 2018.

[46] Cedric Villani. Optimal Transport: Old and New, chapter The Wasserstein distances, pages 93–111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[47] Or Yair, Felix Dietrich, Ronen Talmon, and Ioannis G. Kevrekidis. Optimal transport on the manifold of SPD matrices for domain adaptation. CoRR, abs/1906.00616, 2019.

[48] Shunkang Zhang, Yuan Gao, Yuling Jiao, Jin Liu, Yang Wang, and Can Yang. Wasserstein-Wasserstein auto-encoders. CoRR, abs/1902.09323, 2019.

[49] Peng Zhao and Zhi-Hua Zhou. Label distribution learning by optimal transport. In Proceedings of The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), pages 4506–4513, 2018.

Dai Shi is a research student at the Business School, The University of Sydney. He used to be one of the Chinese reserve team members for the International Mathematical Olympiad. His current research interests include differential geometry, machine learning (especially manifold optimization) and the general theory of relativity.

Junbin Gao received the B.Sc. degree in computational mathematics from the Huazhong University of Science and Technology (HUST), China, in 1982, and the Ph.D. degree from the Dalian University of Technology, China, in 1991. He is currently Professor of big data analytics with The University of Sydney Business School, The University of Sydney, Australia. He was a Professor in computer science with the School of Computing and Mathematics, Charles Sturt University, Australia. From 1982 to 2001, he was an Associate Lecturer, a Lecturer, an Associate Professor, and a Professor with the Department of Mathematics, HUST. From 2001 to 2005, he was a Senior Lecturer and a Lecturer in computer science with the University of New England, Australia. His current research interests include machine learning, data analytics, Bayesian learning and inference, and image analysis.

Xia Hong received her university education at the National University of Defense Technology, P. R. China (BSc, 1984; MSc, 1987), and the University of Sheffield, UK (PhD, 1998), all in Automatic Control. She worked as a research assistant at the Beijing Institute of Systems Engineering, Beijing, China from 1987 to 1993. She worked as a research fellow in the Department of Electronics and Computer Science at the University of Southampton from 1997 to 2001. She was appointed by the University of Reading as Lecturer (2001), Reader (2009) and then Professor (2013). Her research interests cover machine learning algorithms, data analytics, neuro-fuzzy systems, nonlinear system modelling and identification, optimization, multi-sensor data fusion, pattern recognition, signal processing for control and communications, and bioinformatics.

Boris Choy received his B.Sc. (Honours) degree in mathematics from the University of Leeds, UK, M.Phil. degree in Statistics from the Chinese University of Hong Kong, Hong Kong, and Ph.D. degree from Imperial College London, UK. He is currently Senior Lecturer at The University of Sydney Business School, The University of Sydney, Australia. He was Lecturer at the University of Technology Sydney, Australia and Assistant Professor at The University of Hong Kong, Hong Kong. His current research interests include Bayesian methods and inference, data analytics, health informatics, insurance analytics, and risk assessment and management.

Zhiyong Wang received his B.Eng. and M.Eng. degrees in electronic engineering from South China University of Technology, Guangzhou, China, and his Ph.D. degree from Hong Kong Polytechnic University, Hong Kong. He is currently an Associate Professor and Associate Director of the Multimedia Laboratory with the School of Computer Science, The University of Sydney, Australia. His research interests focus on multimedia computing, including multimedia information processing, retrieval and management, Internet-based multimedia data mining, human-centered multimedia computing, and pattern recognition.