
Geometric losses for distributional learning

Arthur Mensch(1), Mathieu Blondel(2), Gabriel Peyre(1)

(1) Ecole Normale Superieure, DMA, CNRS

Paris, France

(2) NTT Communication Science Laboratories, Kyoto, Japan

June 17, 2019

1 Introduction

2 Fenchel-Young losses for distribution spaces

3 Geometric softmax from Sinkhorn negentropy

4 Statistical properties

5 Applications

Introduction: Predicting distributions

[Figure: schematic of predicting a distribution over outputs from scores, with a link function producing the prediction and a loss comparing it to the target distribution.]

Contribution: losses and links for continuous metrized output

Handling output geometry:
Link and loss with a cost between classes C : Y × Y → R
Output distributions over a continuous space Y

New geometric losses and associated link functions:
1 Construction from the duality between distributions and scores
2 Need: a convex functional on distribution space, provided by regularized optimal transport

Background: learning with a cost over outputs Y

Cost augmentation of losses [1, 2]: convex cost-aware loss L_c : [1, d] × R^d → R
⚠ No well-defined link function R^d → △^d: what should we predict at test time?

Use a Wasserstein distance between output distributions [3]: the ground metric C defines a distance W_C between distributions; predict with a softmax link:
ℓ(α, f) := W_C(softmax(f), α)
⚠ Non-convex loss, costly to compute

[1] Ioannis Tsochantaridis et al. "Large margin methods for structured and interdependent output variables". JMLR (2005).
[2] Kevin Gimpel and Noah A. Smith. "Softmax-margin CRFs: Training log-linear models with cost functions". NAACL (2010).
[3] Charlie Frogner et al. "Learning with a Wasserstein loss". NIPS (2015).
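As a concrete illustration of this baseline (not part of the paper's method), the sketch below approximates ℓ(α, f) = W_C(softmax(f), α) with entropy-regularized transport from the POT library; the ground cost, the target and the regularization strength reg are arbitrary toy choices.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def softmax(f):
    z = np.exp(f - f.max())
    return z / z.sum()

def wasserstein_loss(alpha, f, C, reg=0.1):
    """Entropy-regularized approximation of W_C(softmax(f), alpha)."""
    return ot.sinkhorn2(softmax(f), alpha, C, reg)

# Toy example: 3 classes on a line, squared distance as ground cost, a soft target.
y = np.arange(3.0)
C = (y[:, None] - y[None, :]) ** 2
alpha = np.array([0.05, 0.9, 0.05])
print(wasserstein_loss(alpha, np.array([2.0, 0.0, -1.0]), C))
```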

2 Fenchel-Young losses for distribution spaces

Predicting distributions from topological duality

[Figure: the duality pairing between output distributions α ∈ M_1^+(Y) and continuous score functions f ∈ C(Y).]

All you need is a convex functional

Fenchel-Young losses [4, 5]: a convex function Ω : △^d → R and its conjugate
Ω*(f) = max_{α∈△^d} ⟨α, f⟩ − Ω(α)

ℓ_Ω(α, f) = Ω(α) + Ω*(f) − ⟨α, f⟩ ≥ 0

These define link functions between the dual and the primal:
∇Ω(α) = argmin_{f∈R^d} ℓ_Ω(α, f)      ∇Ω*(f) = argmin_{α∈△^d} ℓ_Ω(α, f)
ℓ_Ω(α, f) = 0 ⇔ α = ∇Ω*(f)

[4] John C. Duchi et al. "Multiclass Classification, Information, Divergence, and Surrogate Risk". Annals of Statistics (2018).
[5] Mathieu Blondel et al. "Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms". AISTATS (2019).

Discrete canonical example: Shannon entropy

Ω(α) = −H(α) = Σ_{i=1}^d α_i log α_i        Ω*(f) = logsumexp(f)

Not defined on continuous distributions, and cost-agnostic.
We need an entropy defined on all distributions.
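To make the discrete construction concrete, here is a minimal numerical sketch (not the paper's code) of the Fenchel-Young loss generated by the Shannon negentropy: the conjugate is logsumexp, the link ∇Ω* is the usual softmax, and the loss reduces to the multinomial logistic loss, vanishing exactly at α = softmax(f).

```python
import numpy as np

def shannon_negentropy(alpha):
    """Omega(alpha) = sum_i alpha_i log alpha_i (negative Shannon entropy)."""
    a = np.asarray(alpha, dtype=float)
    a = a[a > 0]
    return float(np.sum(a * np.log(a)))

def logsumexp(f):
    """Conjugate Omega*(f) = log sum_i exp(f_i) for the Shannon negentropy on the simplex."""
    m = np.max(f)
    return m + np.log(np.sum(np.exp(f - m)))

def fenchel_young_loss(alpha, f):
    """l_Omega(alpha, f) = Omega(alpha) + Omega*(f) - <alpha, f> >= 0."""
    return shannon_negentropy(alpha) + logsumexp(f) - np.dot(alpha, f)

# The link grad Omega*(f) is the usual softmax; the loss vanishes exactly at alpha = softmax(f)
# and reduces to the multinomial logistic loss for a one-hot target.
f = np.array([2.0, 0.5, -1.0])
p = np.exp(f - logsumexp(f))
print(fenchel_young_loss(p, f))                          # ~ 0
print(fenchel_young_loss(np.array([1.0, 0.0, 0.0]), f))  # logistic loss of class 0
```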

3 Geometric softmax from Sinkhorn negentropy

Detour: regularized optimal transport [6]

Distance between distributions on Y:
OT_{C,ε}(α, β) = min_{π∈U(α,β)} ⟨C, π⟩ + ε KL(π | α⊗β)

Displacement cost of the optimal coupling, with U(α, β) = {π ∈ M_1^+(Y²) : π1 = α, π^⊤1 = β}
+ Entropic regularization: spreads the coupling

[6] Marco Cuturi. "Sinkhorn distances: Lightspeed computation of optimal transport". NIPS (2013).
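A minimal log-domain Sinkhorn sketch of this regularized problem (an illustration, not the paper's implementation): alternating dual updates enforce the two marginal constraints, and at convergence the dual value ⟨α, f⟩ + ⟨β, g⟩ equals OT_{C,ε}(α, β).

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_ot(alpha, beta, C, eps=1.0, n_iter=500):
    """min over couplings of <C, pi> + eps * KL(pi | alpha x beta), by Sinkhorn iterations
    on the dual potentials f, g (log-domain for numerical stability)."""
    log_a, log_b = np.log(alpha), np.log(beta)
    f, g = np.zeros_like(alpha), np.zeros_like(beta)
    for _ in range(n_iter):
        f = -eps * logsumexp(log_b + (g - C) / eps, axis=1)                    # row marginals
        g = -eps * logsumexp(log_a[:, None] + (f[:, None] - C) / eps, axis=0)  # column marginals
    return np.dot(alpha, f) + np.dot(beta, g)  # dual value at convergence

# Example: two histograms on three points with a squared-distance cost.
y = np.arange(3.0)
C = (y[:, None] - y[None, :]) ** 2
print(sinkhorn_ot(np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.3, 0.6]), C, eps=0.5))
```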

Sinkhorn negentropy from regularized optimal transport

Self-regularized optimal transport distance:
Ω_C(α) = −½ OT_{C,ε=2}(α, α) = −max_{f∈C(Y)} [ ⟨α, f⟩ − log⟨α⊗α, exp((f ⊕ f − C)/2)⟩ ]

Continuous, convex.

Special cases:
ε Ω_{C/ε}(α) → ½⟨α⊗α, −C⟩ as ε → +∞: an MMD-type negentropy
C = 0 on the diagonal and +∞ off-diagonal: Shannon negentropy
The Gini index is also recovered as a special case.
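For a discrete α, the maximizer above is a symmetric Sinkhorn potential f*, and with a suitable normalization the log term vanishes at the optimum, so Ω_C(α) = −⟨α, f*⟩. The sketch below (an illustrative, assumption-laden implementation, not the released code) computes f* by the standard averaged fixed-point iteration and evaluates the negentropy.

```python
import numpy as np
from scipy.special import logsumexp

def sym_potential(alpha, C, n_iter=500):
    """Symmetric Sinkhorn potential f* of the self-transport problem OT_{C, eps=2}(alpha, alpha):
    averaged fixed-point iteration f <- (f + T(f)) / 2 with
    T(f)_i = -2 log sum_j alpha_j exp((f_j - C_ij) / 2)."""
    log_alpha = np.log(alpha)
    f = np.zeros_like(alpha)
    for _ in range(n_iter):
        T = -2.0 * logsumexp(log_alpha + (f - C) / 2.0, axis=1)
        f = 0.5 * (f + T)
    return f

def sinkhorn_negentropy(alpha, C):
    """Omega_C(alpha) = -1/2 OT_{C, eps=2}(alpha, alpha) = -<alpha, f*> at the normalized optimum.
    Up to an additive constant, -f* is also the dual mapping grad Omega(alpha)."""
    return -np.dot(alpha, sym_potential(alpha, C))

# Sanity check: 0 on the diagonal and (numerically) infinite costs elsewhere
# should recover the Shannon negentropy sum_i alpha_i log alpha_i.
alpha = np.array([0.2, 0.5, 0.3])
C_shannon = 1e6 * (1.0 - np.eye(3))
print(sinkhorn_negentropy(alpha, C_shannon), np.sum(alpha * np.log(alpha)))
```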

Dual mapping from Sinkhorn negentropy

Ω(α) = −max_{f∈C(Y)} [ ⟨α, f⟩ − log⟨α⊗α, exp((f ⊕ f − C)/2)⟩ ]

∇Ω(α) = −argmax_{f∈C(Y)} [ ⟨α, f⟩ − log⟨α⊗α, exp((f ⊕ f − C)/2)⟩ ]

The gradient of Ω is thus (minus) the symmetric Sinkhorn potential of α.

Returning to the primal: geometric softmax

Ω* = g-logsumexp : f ↦ −log min_{α∈M_1^+(Y)} ⟨α⊗α, exp(−(f ⊕ f + C)/2)⟩

∇Ω* = g-softmax : f ↦ argmin_{α∈M_1^+(Y)} ⟨α⊗α, exp(−(f ⊕ f + C)/2)⟩

Both amount to minimizing a simple quadratic form over distributions.
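In the discrete case this is a convex quadratic program over the simplex, with kernel K_ij = exp(−(f_i + f_j + C_ij)/2). Below is a rough mirror-descent (exponentiated-gradient) sketch of g-softmax and the corresponding g-logsumexp; the step size and iteration count are ad-hoc choices, not values from the paper.

```python
import numpy as np

def g_kernel(f, C):
    """K_ij = exp(-(f_i + f_j + C_ij) / 2), the quadratic form behind the geometric softmax."""
    return np.exp(-(f[:, None] + f[None, :] + C) / 2.0)

def g_softmax(f, C, n_iter=2000, lr=0.1):
    """grad Omega*(f): argmin of alpha^T K alpha over the simplex, by exponentiated gradient."""
    K = g_kernel(f, C)
    alpha = np.full(len(f), 1.0 / len(f))
    for _ in range(n_iter):
        grad = 2.0 * K @ alpha              # gradient of the quadratic form
        alpha = alpha * np.exp(-lr * grad)  # multiplicative (mirror) step
        alpha /= alpha.sum()                # renormalize onto the simplex
    return alpha

def g_logsumexp(f, C):
    """Omega*(f) = -log min_alpha alpha^T K alpha."""
    alpha = g_softmax(f, C)
    return -np.log(alpha @ g_kernel(f, C) @ alpha)
```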

Geometric loss construction and computation

Training with the geometric logistic loss:
ℓ_C(α, f) = g-logsumexp_C(f) + sinkhorn-negentropy_C(α) − ⟨α, f⟩

Tractable for discrete Y: use mirror descent or L-BFGS (about 10× as costly as a softmax); backpropagation uses ∇Ω* = g-softmax.
Continuous case: Frank-Wolfe scheme, adding one Dirac at each iteration.
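Putting the pieces together, and reusing the hypothetical helpers sketched above (sinkhorn_negentropy, sym_potential, g_logsumexp), the loss can be assembled and sanity-checked: it should vanish when the score equals the link f = ∇Ω(α) = −f*(α).

```python
import numpy as np

def geometric_logistic_loss(alpha, f, C):
    """l_C(alpha, f) = Omega*_C(f) + Omega_C(alpha) - <alpha, f> >= 0."""
    return g_logsumexp(f, C) + sinkhorn_negentropy(alpha, C) - np.dot(alpha, f)

# Toy check on 3 points with a squared-distance cost.
y = np.arange(3.0)
C = (y[:, None] - y[None, :]) ** 2
alpha = np.array([0.2, 0.5, 0.3])
f_link = -sym_potential(alpha, C)                       # f = grad Omega(alpha)
print(geometric_logistic_loss(alpha, f_link, C))        # ~ 0
print(geometric_logistic_loss(alpha, np.zeros(3), C))   # > 0 for a generic score
```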

Properties of the geometric-softmax

[Figure: left, the first coordinate of ∇Ω*(f) as a function of f₁ − f₂ for the cost C = (0 γ; γ 0); right, −Ω([α, 1−α]) as a function of α; curves for γ = 0.1, γ = 2 and γ = ∞.]

∇Ω* returns from Sinkhorn potentials: ∇Ω* ∘ ∇Ω = Id
∇Ω* is non-injective (like the softmax)
∇Ω ∘ ∇Ω* projects f onto the symmetric Sinkhorn potentials
∇Ω*(f) can be sparse, since it comes from a minimization over the simplex
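The first property can be checked numerically with the hypothetical helpers sketched earlier (sym_potential, g_softmax): applying g-softmax to the link ∇Ω(α) = −f*(α) should return α.

```python
import numpy as np

# Check grad Omega*(grad Omega(alpha)) = alpha on a toy cost matrix.
y = np.arange(3.0)
C = (y[:, None] - y[None, :]) ** 2
alpha = np.array([0.2, 0.5, 0.3])
f = -sym_potential(alpha, C)   # grad Omega(alpha), up to an additive constant
print(g_softmax(f, C))         # approximately [0.2, 0.5, 0.3]
```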

Properties of the geometric-softmax

Varying the temperature ε, i.e. using ∇Ω*(f/ε): mode finding as ε → 0, positive deconvolution as ε → ∞.

4 Statistical properties

Consistent learning with the geometric logistic loss

Bregman divergence of the Sinkhorn negentropy:
D_Ω(α | β) = Ω(α) − Ω(β) − ⟨∇Ω(β), α − β⟩
= an asymmetric transport distance

Sampled distributions (x_i, α_i)_i ∈ X × M_1^+(Y).

Fisher consistency: minimizing the expected geometric logistic loss E[ℓ_C(α, g(x))] over score functions g : X → C(Y) and predicting β(x) = g-softmax(g(x)) also minimizes the expected Bregman divergence E[D_Ω(α | β(x))] over predictors β : X → M_1^+(Y).
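A sketch of this divergence in the discrete case, reusing the hypothetical helpers above; ∇Ω(β) is taken as the negated symmetric potential −f*(β), whose additive ambiguity cancels against α − β.

```python
import numpy as np

def sinkhorn_bregman(alpha, beta, C):
    """D_Omega(alpha | beta) = Omega(alpha) - Omega(beta) - <grad Omega(beta), alpha - beta>."""
    grad_beta = -sym_potential(beta, C)
    return (sinkhorn_negentropy(alpha, C) - sinkhorn_negentropy(beta, C)
            - np.dot(grad_beta, alpha - beta))

y = np.arange(3.0)
C = (y[:, None] - y[None, :]) ** 2
print(sinkhorn_bregman(np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.3, 0.3]), C))  # >= 0
```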

5 Applications

Applications: variational auto-encoder

Goal: generate nearly 1D distributions in 2D images
Dataset: Google Quickdraw
The usual sigmoid activation layer is replaced by the geometric softmax

Deconvolutional effect; cost-informed non-linearity
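A rough, purely illustrative sketch of that swap (the grid size and cost are hypothetical, and the actual model is trained end-to-end in the paper's released code): the decoder's per-pixel logits are mapped to a distribution over the pixel grid with g-softmax (sketched earlier) instead of an independent sigmoid per pixel.

```python
import numpy as np

# Hypothetical 28x28 pixel grid with squared Euclidean distances as the ground cost.
h = w = 28
ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
coords = np.stack([ii.ravel(), jj.ravel()], axis=1).astype(float)
C = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)

def output_layer(logits):
    """Map decoder logits over pixels, shape (h*w,), to a distribution on the grid."""
    return g_softmax(logits, C).reshape(h, w)
```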

Applications: variational auto-encoders

[Figure: reconstructions and generated samples with a softmax output layer vs. the geometric softmax.]

Better-defined generated images.

Conclusion

Geometric softmax:
New loss and projection onto output probabilities
Discrete or continuous outputs, aware of a cost between outputs
Fenchel duality in Banach spaces + regularized optimal transport
Applications to VAEs and ordinal regression
Code: github.com/arthurmensch/g-softmax

Future directions:
Improved computation methods (continuous Frank-Wolfe)
Geometric logistic loss for super-resolution [7]

[7] Nicholas Boyd et al. "DeepLoco: Fast 3D localization microscopy using neural networks". BioRxiv (2018), p. 267096.

Mathieu Blondel, Gabriel Peyre

Poster #179
