Sparse Sequence-to-Sequence Models

Ben Peters, Instituto de Telecomunicações

→ Vlad Niculae, IT

André Martins, IT & Unbabel

github.com/deep-spin/entmax https://vene.ro @vnfrombucharest

Sequence-to-Sequence With Attention (Bahdanau et al., 2015)

United Nations elections end today

Eleições das Nações Unidas

Encoder

Attention

Decoder

morphological inflection!

Attention weights computed with softmax: for some decoder state $s_t$, compute a contextually weighted average of the input, $c_t$:

$z_j = s_t^\top W^{(a)} h_j$

$\pi_j = \mathrm{softmax}_j(z)$

$c_t = \sum_j \pi_j h_j$
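The three attention equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `attention_context`, the dimensions, and the random inputs are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_context(s_t, H, W_a):
    """Hypothetical helper: bilinear attention for one decoder step.

    s_t: decoder state, shape (d_s,)
    H:   encoder states h_j stacked as rows, shape (n, d_h)
    W_a: attention matrix W^(a), shape (d_s, d_h)
    """
    z = H @ (W_a.T @ s_t)   # z_j = s_t^T W^(a) h_j for every j
    pi = softmax(z)         # pi_j = softmax_j(z)
    c_t = pi @ H            # c_t = sum_j pi_j h_j
    return pi, c_t

# Toy inputs (assumed sizes: 5 source positions, d_h = 4, d_s = 3).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
s_t = rng.normal(size=3)
W_a = rng.normal(size=(3, 4))
pi, c_t = attention_context(s_t, H, W_a)
```

With softmax, every entry of `pi` is strictly positive, so every source position contributes to `c_t`, however slightly.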

Predictive probability (also using softmax!):

$u_t = \tanh(W^{(u)} [s_t; c_t])$

$P(y_t \mid y_{1:t-1}, x) = \mathrm{softmax}(V u_t)$

$P(y_1 \mid x)$:

.70 Eleições

.11 Os

.10 As

.09 Nações

...

$10^{-6}$ Bucarest
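The output layer above can be sketched the same way. The names `next_token_probs`, `W_u`, and `V`, as well as the toy dimensions and vocabulary size, are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def next_token_probs(s_t, c_t, W_u, V):
    """Hypothetical helper: P(y_t | y_{1:t-1}, x) = softmax(V u_t)."""
    # u_t = tanh(W^(u) [s_t; c_t]), where [;] is vector concatenation.
    u_t = np.tanh(W_u @ np.concatenate([s_t, c_t]))
    return softmax(V @ u_t)

# Toy sizes (assumed): d_s = 3, d_h = 4, d_u = 6, vocabulary of 10 types.
rng = np.random.default_rng(0)
s_t = rng.normal(size=3)
c_t = rng.normal(size=4)
W_u = rng.normal(size=(6, 3 + 4))
V = rng.normal(size=(10, 6))
probs = next_token_probs(s_t, c_t, W_u, V)
```

Every vocabulary entry receives nonzero probability under softmax, which is why implausible words such as "Bucarest" above still get a tiny but positive mass.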

Predictive probability $P(y_2 \mid y_1, x)$:

.40 das

.30 para

.20 ás

...

$10^{-7}$ esquerda

Predictive probability $P(y_3 \mid y_2, y_1, x)$:

.80 Nações

.11 Representações

.03 assembleias

...

$10^{-8}$ resultados

Predictive probability $P(y_4 \mid y_3, y_2, y_1, x)$:

.90 Unidas

.05 Shopping

.01 ,

...

$10^{-5}$ aquático

d o g N PL

d o g s

Encoder

Attention

Decoder

The Space of Outputs

d o g s

b u c z

q a b e

z y z y

· · ·

p(·) = 0.60, p(·) = 0.13, p(·) = $10^{-4}$

The Space of Outputs: Made Sparse!

d o g s

b u c z

q a b e

z y z y

· · ·

p(·) = 0.70, p(·) = 0.20, p(·) = 0 !!

Softmax plays two roles in seq2seq:

attention weights: for some decoder state $s_t$, compute a contextually weighted average of the input, $c_t$:

$z_j = s_t^\top W^{(a)} h_j$

$\pi_j = \mathrm{softmax}_j(z)$

$c_t = \sum_j \pi_j h_j$

output probabilities: predict the probability of the next word:

$u_t = \tanh(W^{(u)} [s_t; c_t])$

$P(y_t \mid y_{1:t-1}, x) = \mathrm{softmax}(V u_t)$

Our work: replace softmax with a family of new sparsity-inducing alternatives

Sparse Attention Weights / Alignments

k ö m k N GEN SG PSS2

(Azeri)

w e l V IND PST PFV 2 SG

(North Frisian)

(Peters and Martins, 2019)

Sparse Predictive Probabilities

d r a w e d </s>

n </s>

What is softmax?

Often defined via $p_i := \frac{\exp z_i}{\sum_j \exp z_j}$, but where does it come from?

$\triangle := \{p \in \mathbb{R}^d : p \geq 0, \sum_j p_j = 1\}$

$p \in \triangle$: probability distribution over choices

Expected score under $p$: $\mathbb{E}_{i \sim p}\, z_i = p^\top z$

argmax maximizes expected score: $\operatorname{argmax}_{p \in \triangle} p^\top z$

Shannon entropy of $p$: $H^s(p) := -\sum_i p_i \log p_i$

softmax maximizes expected score + entropy: $\operatorname{argmax}_{p \in \triangle} p^\top z + H^s(p)$

[Figure: the simplex $\triangle$ for $d = 3$, with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1). Example points: $p = [0, 1, 0]$, $p = [0, 0, 1]$, $p = [1/3, 1/3, 1/3]$. For $z = [.7, .1, 1.5]$, argmax gives $p^\star = [0, 0, 1]$.]
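The claim that softmax maximizes expected score plus Shannon entropy can be checked numerically. The sketch below uses the slide's example $z = [.7, .1, 1.5]$; the helper name `objective` and the Dirichlet sampling of comparison points are assumptions made for this check, not part of the original deck.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def objective(p, z):
    # p^T z + H^s(p), using the convention 0 log 0 = 0.
    nz = p > 0
    return p @ z - np.sum(p[nz] * np.log(p[nz]))

z = np.array([0.7, 0.1, 1.5])

# Value of the objective at the softmax solution.
p_soft = softmax(z)
best = objective(p_soft, z)

# At the softmax point the objective equals log sum_j exp(z_j).
log_sum_exp = np.log(np.sum(np.exp(z)))

# Compare against many random points of the simplex.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(3), size=10000)
max_sampled = max(objective(p, z) for p in samples)

# The argmax vertex has zero entropy, so it only scores max(z).
p_argmax = np.eye(3)[np.argmax(z)]
vertex_value = objective(p_argmax, z)
```

No sampled point beats the softmax solution, and the argmax vertex $[0, 0, 1]$ scores strictly lower because it contributes no entropy.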

Generalizing Softmax Using Entropies

$\pi_H(z) = \operatorname{argmax}_{p \in \triangle} p^\top z + H(p)$

argmax: $H^0(p) = 0$

softmax: $H^s(p) = -\sum_j p_j \log p_j$

sparsemax: $H^g(p) = \frac{1}{2} \sum_j p_j (1 - p_j)$ (Martins and Astudillo, 2016)

α-entmax: $H^t_\alpha(p) = \frac{1}{\alpha(\alpha - 1)} \sum_j (p_j - p_j^\alpha)$

Tsallis α-entropy (Tsallis, 1988). Depicted: α = 1.5. Uncovers softmax (α → 1) and sparsemax (α = 2).

[Figure: mappings depicted on the simplex, axis ticks −1, 0, 1; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7].]

(Niculae and Blondel, 2017; Blondel et al., 2019)
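The entropies listed above can be compared directly. The sketch below, with assumed helper names, checks that the Tsallis entropy reduces to the Gini-style entropy $H^g$ at α = 2 and approaches the Shannon entropy as α → 1. It also includes a standard sort-based sparsemax, which rests on the fact that maximizing $p^\top z + H^g(p)$ over the simplex is the same as the Euclidean projection of $z$ onto it.

```python
import numpy as np

def shannon_entropy(p):
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def gini_entropy(p):
    return 0.5 * np.sum(p * (1 - p))

def tsallis_entropy(p, alpha):
    # H^t_alpha(p) = 1 / (alpha (alpha - 1)) * sum_j (p_j - p_j^alpha), alpha != 1
    return np.sum(p - p ** alpha) / (alpha * (alpha - 1))

def sparsemax(z):
    """Sort-based sparsemax: Euclidean projection of z onto the simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in decreasing order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    in_support = 1 + ks * z_sorted > cumsum  # true for a prefix k = 1..k*
    k = ks[in_support][-1]                   # support size k*
    tau = (cumsum[k - 1] - 1) / k            # threshold
    return np.maximum(z - tau, 0.0)

p = np.array([0.3, 0.2, 0.5])
gap_at_2 = abs(tsallis_entropy(p, 2.0) - gini_entropy(p))
gap_near_1 = abs(tsallis_entropy(p, 1 + 1e-6) - shannon_entropy(p))

z = np.array([0.7, 0.1, 1.5])
p_sparse = sparsemax(z)
```

For this `z`, the threshold works out to about 0.6, so `p_sparse` is approximately `[0.1, 0, 0.9]`: the lowest-scoring option is assigned exactly zero probability.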

[Figure: $\pi_\alpha([t, 0])_1$ as a function of $t$ from −3 to 3, for α = 1 (softmax), α = 1.25, α = 1.5, α = 2 (sparsemax), and α = 4.]

Computing α-entmax

$\pi_{H^t_\alpha}(z) := \operatorname{argmax}_{p \in \triangle} p^\top z + H^t_\alpha(p)$

Solution has the form:

$\pi_{H^t_\alpha}(z) = [(\alpha - 1) z - \tau \mathbf{1}]_+^{1/(\alpha - 1)}$

Algorithms:

bisection

• approximate; bracket $\tau \in [\tau_{lo}, \tau_{hi}]$

• gain 1 bit per $O(d)$ iteration

• float32 has 23 mantissa bits

sort-based

• exact algorithm, $O(d \log d)$

• available only for $\alpha \in \{1.5, 2\}$

• huge speed-up from partial sorting when expecting sparse solutions
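The bisection approach can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the code from the entmax repository: the function name `entmax_bisect`, the iteration count, and the final renormalization are choices made for this example. The bracket relies on two observations: at $\tau = \max_j (\alpha - 1) z_j - 1$ the largest coordinate alone already yields 1, and at $\tau = \max_j (\alpha - 1) z_j - d^{-(\alpha - 1)}$ no coordinate exceeds $1/d$.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """Approximate alpha-entmax of a 1-D score vector, for alpha > 1."""
    if alpha <= 1:
        raise ValueError("this sketch only covers alpha > 1")
    z = (alpha - 1) * np.asarray(z, dtype=float)
    d = z.shape[0]
    exponent = 1.0 / (alpha - 1)

    # Bracket tau: the sum of the candidate solution is >= 1 at tau_lo
    # and <= 1 at tau_hi, and it decreases monotonically in tau.
    tau_lo = z.max() - 1.0
    tau_hi = z.max() - (1.0 / d) ** (alpha - 1)

    for _ in range(n_iter):  # each O(d) iteration halves the bracket
        tau = 0.5 * (tau_lo + tau_hi)
        total = np.sum(np.maximum(z - tau, 0.0) ** exponent)
        if total >= 1.0:
            tau_lo = tau
        else:
            tau_hi = tau

    # Evaluate at tau_lo (sum >= 1) and renormalize to absorb the residual.
    p = np.maximum(z - tau_lo, 0.0) ** exponent
    return p / p.sum()

p_sparsemax = entmax_bisect(np.array([0.7, 0.1, 1.5]), alpha=2.0)
p_entmax = entmax_bisect(np.array([3.0, 0.0, -1.0]), alpha=1.5)
```

With α = 2 the result agrees with sparsemax on the same scores, and with α = 1.5 the second input puts all mass on the first coordinate while the other two are exactly zero. Since each iteration gains about one bit, roughly 23 iterations would already exhaust float32 precision; the default of 50 here is for float64.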

Morphological inflection

SIGMORPHON 2018. Shared multi-lingual model.

Medium: 1k pairs per language, 102 languages.

High: 10k pairs per language, 86 languages.

[Figure: accuracy in the Medium and High settings for α = 1 (softmax), α = 1.5, and α = 2 (sparsemax); annotated gain: +1.44 accuracy.]

Neural Machine Translation

[Figure: BLEU on DE-EN, EN-DE, JA-EN, EN-JA, RO-EN, and EN-RO for α = 1 (softmax), α = 1.5, and α = 2 (sparsemax); annotated gains: +.47, +.56, +.33, +.79, +1.03, +.72 BLEU.]

Sparse Mappings Don’t Slow Down Training

Training timing on three DE-EN runs.

[Figure: softmax vs. 1.5-entmax; x-axis: seconds, 1000 to 7000.]

Impact of Fine Tuning α

Grid search on DE-EN.

[Figure: grid over attention α and output α, with output α ticks from 1.00 to 2.25; scale from 60% to 63%.]

Sparse Seq2Seq: Conclusions

New family of sparse mappings, α-entmax, with algorithms for efficient forward & backward passes.

sparse attention weights

w e l V IND PST PFV 2 SG

sparse output space

d r a w e d </s>

n </s>

performance improvements

[Figure: softmax vs. 1.5-entmax; x-axis: seconds, 1000 to 7000.]

vlad@vene.ro github.com/deep-spin/entmax https://vene.ro @vnfrombucharest

Acknowledgements

This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal).

Some icons by Dave Gandy and Freepik via flaticon.com.

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). “Neural machine translation by jointly learning to align and translate”. In: Proc. of ICLR.

Blondel, Mathieu, André FT Martins, and Vlad Niculae (2019). “Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms”. In: Proc. AISTATS.

Martins, André FT and Ramón Fernandez Astudillo (2016). “From softmax to sparsemax: A sparse model of attention and multi-label classification”. In: Proc. of ICML.

Niculae, Vlad and Mathieu Blondel (2017). “A regularized framework for sparse and structured neural attention”. In: Proc. of NeurIPS.

Peters, Ben and André FT Martins (2019). “IT-IST at the SIGMORPHON 2019 shared task: Sparse two-headed models for inflection”. In: Proc. SIGMORPHON.

Tsallis, Constantino (1988). “Possible generalization of Boltzmann-Gibbs statistics”. In: Journal of Statistical Physics 52, pp. 479–487.
