Sparse Sequence-to-Sequence Models

Ben Peters, Instituto de Telecomunicações
→ Vlad Niculae, IT
André Martins, IT & Unbabel

github.com/deep-spin/entmax  https://vene.ro  @vnfrombucharest
Sequence-to-Sequence With Attention (Bahdanau et al., 2015)

Source: United Nations elections end today
Target: Eleições das Nações Unidas ...

[Figure: the encoder embeds each source word as v_j and produces hidden states h_j; attention combines the h_j into a context c_1 for the decoder state s_0.]

Attention weights are computed with softmax: for some decoder state s_t, compute a contextually weighted average c_t of the input:

    z_j = s_t^⊤ W^(a) h_j
    π_j = softmax_j(z)
    c_t = ∑_j π_j h_j
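A minimal PyTorch sketch of this attention step (illustrative names and toy shapes, not the authors' code):

```python
import torch

def attend(s_t, H, W_a):
    """One attention step: z_j = s_t^T W_a h_j, pi = softmax(z), c_t = sum_j pi_j h_j."""
    z = H @ (W_a.T @ s_t)           # (n,) scores against each encoder state h_j
    pi = torch.softmax(z, dim=-1)   # dense attention weights on the simplex
    c_t = pi @ H                    # (d_h,) contextually weighted average
    return c_t, pi

# toy usage with made-up dimensions
torch.manual_seed(0)
H = torch.randn(5, 8)    # 5 encoder states h_j of size 8
s_t = torch.randn(8)     # decoder state
W_a = torch.randn(8, 8)  # bilinear attention parameters W^(a)
c_t, pi = attend(s_t, H, W_a)
```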
The predictive probability also uses softmax:

    u_t = tanh(W^(u) [s_t; c_t])
    P(y_t | y_1:t−1, x) = softmax(V u_t)

P(y_1 | x): .70 Eleições, .11 Os, .10 As, .09 Nações, ..., 10^−6 Bucarest
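A matching sketch of the output layer, again with illustrative names, made-up dimensions, and a toy vocabulary size:

```python
import torch

torch.manual_seed(0)
d, vocab_size = 8, 100
s_t, c_t = torch.randn(d), torch.randn(d)
W_u = torch.randn(d, 2 * d)      # combines the concatenation [s_t; c_t]
V = torch.randn(vocab_size, d)   # output projection

u_t = torch.tanh(W_u @ torch.cat([s_t, c_t]))  # u_t = tanh(W^(u)[s_t; c_t])
p_y = torch.softmax(V @ u_t, dim=-1)           # P(y_t | y_1:t-1, x), dense over the vocabulary
assert torch.isclose(p_y.sum(), torch.tensor(1.0))
```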
Decoding proceeds step by step, and at every step softmax assigns nonzero probability to the entire vocabulary:

P(y_2 | y_1, x): .40 das, .30 para, .20 às, ..., 10^−7 esquerda
P(y_3 | y_2, y_1, x): .80 Nações, .11 Representações, .03 assembleias, ..., 10^−8 resultados
P(y_4 | y_3, y_2, y_1, x): .90 Unidas, .05 Shopping, .01 ",", ..., 10^−5 aquático
The same architecture performs morphological inflection:

Source: d o g N PL
Target: d o g s
The Space of Outputs

d o g s    p(·) = 0.60
b u c z    p(·) = 0.13
q a b e    p(·) = 10^−4
z y z y
· · ·

With softmax, every string in this exponentially large space, however implausible, receives nonzero probability.
The Space of Outputs: Made Sparse!

d o g s    p(·) = 0.70
b u c z    p(·) = 0.20
q a b e    p(·) = 0 !!
z y z y
· · ·
Softmax plays two roles in seq2seq:

Attention weights: for some decoder state s_t, compute a contextually weighted average c_t of the input:

    z_j = s_t^⊤ W^(a) h_j
    π_j = softmax_j(z)
    c_t = ∑_j π_j h_j

Output probabilities: predict the probability of the next word:

    u_t = tanh(W^(u) [s_t; c_t])
    P(y_t | y_1:t−1, x) = softmax(V u_t)

Our work: replace softmax with a family of new sparsity-inducing alternatives.
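In code, the swap is a drop-in replacement. A sketch using the entmax package from the repository above (the input tensor is a toy example):

```python
import torch
from entmax import entmax15, sparsemax

z = torch.tensor([0.7, 0.1, 1.5])

print(torch.softmax(z, dim=-1))  # dense: every entry strictly positive
print(entmax15(z, dim=-1))       # alpha = 1.5: some entries exactly zero
print(sparsemax(z, dim=-1))      # alpha = 2: typically the sparsest
```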
Sparse Attention Weights / Alignments

[Figure: attention heatmaps with many exact zeros, for inflecting an Azeri lemma with tags N GEN SG PSS2, and the North Frisian verb "wel" with tags V IND PST PFV 2 SG → "wulst".]

(Peters and Martins, 2019)
Sparse Predictive Probabilities

[Figure: the only continuations with nonzero probability when inflecting "d r a w": "…e d </s>" at 66.4%, "…n </s>" at 32.2%, and "</s>" at 1.4%.]
What is softmax?

Often defined via p_i := exp(z_i) / ∑_j exp(z_j), but where does it come from?

△ := {p ∈ R^d : p ≥ 0, ∑_j p_j = 1}; a point p ∈ △ is a probability distribution over choices.

Expected score under p: E_{i∼p} z_i = p^⊤ z. The argmax maximizes the expected score:

    p⋆ = argmax_{p∈△} p^⊤ z,    e.g., z = [.7, .1, 1.5] gives p⋆ = [0, 0, 1].

Shannon entropy of p: H_s(p) := −∑_i p_i log p_i. Softmax maximizes expected score plus entropy:

    softmax(z) = argmax_{p∈△} p^⊤ z + H_s(p).

[Figure: level curves of both objectives over the simplex △ with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1); the entropy term pulls the maximizer from the vertex [0, 0, 1] toward the interior.]
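This variational view is easy to check numerically. A small NumPy sketch (not from the talk): the objective value at softmax(z) dominates the value at any sampled distribution on △:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def objective(p, z):
    """Expected score plus Shannon entropy, with the convention 0 log 0 = 0."""
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return p @ z - plogp.sum()

z = np.array([0.7, 0.1, 1.5])
p_star = softmax(z)

rng = np.random.default_rng(0)
best_random = max(objective(p, z) for p in rng.dirichlet(np.ones(3), 10_000))
assert objective(p_star, z) >= best_random  # no sampled p beats softmax(z)
```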
Generalizing Softmax Using Entropies

    π_H(z) = argmax_{p∈△} p^⊤ z + H(p)

argmax:    H_0(p) = 0
softmax:   H_s(p) = −∑_j p_j log p_j
sparsemax: H_g(p) = 1/2 ∑_j p_j (1 − p_j)    (Martins and Astudillo, 2016)
α-entmax:  H_α^T(p) = 1/(α(α−1)) ∑_j (p_j − p_j^α)

H_α^T is the Tsallis α-entropy (Tsallis, 1988). It recovers softmax (α → 1) and sparsemax (α = 2); depicted: α = 1.5.

[Figure: the mapping in one coordinate, p_1 against z_1, and on the simplex △: e.g., softmax maps z to the dense [.3, .2, .5], while a sparse mapping gives [.3, 0, .7].]

(Niculae and Blondel, 2017; Blondel et al., 2019)
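For α = 2 this argmax is the Euclidean projection onto △ and has an exact sort-based solution (Martins and Astudillo, 2016). A minimal NumPy sketch, not the authors' optimized implementation:

```python
import numpy as np

def sparsemax(z):
    """Projection of z onto the simplex: argmax_p p^T z + H_g(p)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # descending
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv    # which coordinates stay nonzero
    k_z = k[support][-1]                 # support size
    tau = (cssv[k_z - 1] - 1) / k_z      # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([0.7, 0.1, 1.5]))  # -> [0.1, 0.0, 0.9]: exact zeros appear
```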
[Figure: π_α([t, 0])_1 as a function of t ∈ [−3, 3], for α = 1 (softmax), 1.25, 1.5, 2 (sparsemax), and 4; for α > 1 the curves reach exact 0 and 1 at finite t.]
Computing α-entmax

    π_{H_α^T}(z) := argmax_{p∈△} p^⊤ z + H_α^T(p)

The solution has the form

    π_{H_α^T}(z) = [(α − 1)z − τ1]_+^{1/(α−1)}.

Algorithms:

Bisection:
• approximate; bracket τ ∈ [τ_lo, τ_hi]
• gains 1 bit per O(d) iteration
• float32 has 23 mantissa bits

Sort-based (see the sketch above):
• exact algorithm, O(d log d)
• available only for α ∈ {1.5, 2}
• huge speed-up from partial sorting when expecting sparse solutions
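A sketch of the bisection route for general α > 1 (illustrative, not the package's implementation): the map τ ↦ ∑_j [(α−1)z_j − τ]_+^{1/(α−1)} is decreasing, so each halving of the bracket gains one bit of τ:

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=30):
    """Approximate alpha-entmax via bisection on the threshold tau."""
    x = (alpha - 1) * np.asarray(z, dtype=float)
    # at tau_lo the candidate p sums to >= 1; at tau_hi it sums to 0
    tau_lo, tau_hi = x.max() - 1.0, x.max()
    for _ in range(n_iter):
        tau = (tau_lo + tau_hi) / 2
        p = np.maximum(x - tau, 0.0) ** (1.0 / (alpha - 1))
        if p.sum() >= 1.0:
            tau_lo = tau        # threshold still too small
        else:
            tau_hi = tau
    p = np.maximum(x - tau_lo, 0.0) ** (1.0 / (alpha - 1))
    return p / p.sum()          # normalize away the residual bisection error

print(entmax_bisect([0.7, 0.1, 1.5], alpha=2))    # matches exact sparsemax
print(entmax_bisect([0.7, 0.1, 1.5], alpha=1.5))  # sparse, but softer
```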
Morphological Inflection

SIGMORPHON 2018, one shared multi-lingual model.
Medium: 1k pairs per language, 102 languages.
High: 10k pairs per language, 86 languages.

[Bar chart: accuracy (70–100) for α = 1 (softmax), α = 1.5, and α = 2 (sparsemax); the sparse models gain +1.83 accuracy on Medium and +1.44 on High over softmax.]
Neural Machine Translation

[Bar chart: BLEU for α = 1 (softmax), α = 1.5, and α = 2 (sparsemax); gains over softmax of +.47 (DE-EN), +.56 (EN-DE), +.33 (JA-EN), +.79 (EN-JA), +1.03 (RO-EN), and +.72 (EN-RO).]
Sparse Mappings Don’t Slow Down Training

Training timing on three DE-EN runs.

[Plot: validation accuracy (57–63%) against wall-clock training time (1,000–7,000 s) for softmax and 1.5-entmax.]
Impact of Fine-Tuning α

Grid search on DE-EN.

[Plots: validation accuracy as α varies over [1.00, 2.25], separately for the attention mapping (62–64%) and the output mapping (60–63%).]
Sparse Seq2Seq: Conclusions

New family of sparse mappings, α-entmax, with algorithms for efficient forward and backward passes.

• Sparse attention weights / alignments (e.g., the North Frisian inflection w e l + V IND PST PFV 2 SG → wulst).
• Sparse output space (e.g., only three continuations of "d r a w" receive nonzero probability: 66.4%, 32.2%, 1.4%).
• Performance improvements without slowing down training.
[email protected] github.com/deep-spin/entmax https://vene.ro @vnfrombucharest
Acknowledgements
This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal).

Some icons by Dave Gandy and Freepik via flaticon.com.
References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). “Neural machine translation by jointly learning to align and translate”. In: Proc. of ICLR.
Blondel, Mathieu, André F. T. Martins, and Vlad Niculae (2019). “Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms”. In: Proc. of AISTATS.
Martins, André F. T. and Ramón Fernandez Astudillo (2016). “From softmax to sparsemax: A sparse model of attention and multi-label classification”. In: Proc. of ICML.
Niculae, Vlad and Mathieu Blondel (2017). “A regularized framework for sparse and structured neural attention”. In: Proc. of NeurIPS.
Peters, Ben and André F. T. Martins (2019). “IT-IST at the SIGMORPHON 2019 shared task: Sparse two-headed models for inflection”. In: Proc. of SIGMORPHON.
Tsallis, Constantino (1988). “Possible generalization of Boltzmann-Gibbs statistics”. In: Journal of Statistical Physics 52, pp. 479–487.