Sparse Sequence-to-Sequence Models

Ben Peters, Instituto de Telecomunicações
→ Vlad Niculae, IT
André Martins, IT & Unbabel

github.com/deep-spin/entmax  https://vene.ro  @vnfrombucharest
Sequence-to-Sequence With Attention (Bahdanau et al., 2015)

Source: United Nations elections end today
Target: Eleições das Nações Unidas ...

[Figure: the encoder embeds each source word as v_j and produces hidden states h_j; attention combines the h_j into a context c_1 for the decoder state s_0.]

Attention weights are computed with softmax: for some decoder state s_t, compute a contextually weighted average c_t of the input:

    z_j = s_t^⊤ W^(a) h_j
    π_j = softmax_j(z)
    c_t = ∑_j π_j h_j
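A minimal PyTorch sketch of this attention step (illustrative names and toy shapes, not the authors' code):

```python
import torch

def attend(s_t, H, W_a):
    """One attention step: z_j = s_t^T W_a h_j, pi = softmax(z), c_t = sum_j pi_j h_j."""
    z = H @ (W_a.T @ s_t)           # (n,) scores against each encoder state h_j
    pi = torch.softmax(z, dim=-1)   # dense attention weights on the simplex
    c_t = pi @ H                    # (d_h,) contextually weighted average
    return c_t, pi

# toy usage with made-up dimensions
torch.manual_seed(0)
H = torch.randn(5, 8)    # 5 encoder states h_j of size 8
s_t = torch.randn(8)     # decoder state
W_a = torch.randn(8, 8)  # bilinear attention parameters W^(a)
c_t, pi = attend(s_t, H, W_a)
```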
The predictive probability also uses softmax:

    u_t = tanh(W^(u) [s_t; c_t])
    P(y_t | y_1:t−1, x) = softmax(V u_t)

P(y_1 | x): .70 Eleições, .11 Os, .10 As, .09 Nações, ..., 10^−6 Bucarest
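A matching sketch of the output layer, again with illustrative names, made-up dimensions, and a toy vocabulary size:

```python
import torch

torch.manual_seed(0)
d, vocab_size = 8, 100
s_t, c_t = torch.randn(d), torch.randn(d)
W_u = torch.randn(d, 2 * d)      # combines the concatenation [s_t; c_t]
V = torch.randn(vocab_size, d)   # output projection

u_t = torch.tanh(W_u @ torch.cat([s_t, c_t]))  # u_t = tanh(W^(u)[s_t; c_t])
p_y = torch.softmax(V @ u_t, dim=-1)           # P(y_t | y_1:t-1, x), dense over the vocabulary
assert torch.isclose(p_y.sum(), torch.tensor(1.0))
```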
Decoding proceeds step by step, and at every step softmax assigns nonzero probability to the entire vocabulary:

P(y_2 | y_1, x): .40 das, .30 para, .20 às, ..., 10^−7 esquerda
P(y_3 | y_2, y_1, x): .80 Nações, .11 Representações, .03 assembleias, ..., 10^−8 resultados
P(y_4 | y_3, y_2, y_1, x): .90 Unidas, .05 Shopping, .01 ",", ..., 10^−5 aquático
The same architecture performs morphological inflection:

Source: d o g N PL
Target: d o g s
The Space of Outputs

d o g s    p(·) = 0.60
b u c z    p(·) = 0.13
q a b e    p(·) = 10^−4
z y z y
· · ·

With softmax, every string in this exponentially large space, however implausible, receives nonzero probability.
The Space of Outputs: Made Sparse!

d o g s    p(·) = 0.70
b u c z    p(·) = 0.20
q a b e    p(·) = 0 !!
z y z y
· · ·
Softmax plays two roles in seq2seq:

Attention weights: for some decoder state s_t, compute a contextually weighted average c_t of the input:

    z_j = s_t^⊤ W^(a) h_j
    π_j = softmax_j(z)
    c_t = ∑_j π_j h_j

Output probabilities: predict the probability of the next word:

    u_t = tanh(W^(u) [s_t; c_t])
    P(y_t | y_1:t−1, x) = softmax(V u_t)

Our work: replace softmax with a family of new sparsity-inducing alternatives.
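In code, the swap is a drop-in replacement. A sketch using the entmax package from the repository above (the input tensor is a toy example):

```python
import torch
from entmax import entmax15, sparsemax

z = torch.tensor([0.7, 0.1, 1.5])

print(torch.softmax(z, dim=-1))  # dense: every entry strictly positive
print(entmax15(z, dim=-1))       # alpha = 1.5: some entries exactly zero
print(sparsemax(z, dim=-1))      # alpha = 2: typically the sparsest
```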
Sparse Attention Weights / Alignments

[Figure: attention heatmaps with many exact zeros, for inflecting an Azeri lemma with tags N GEN SG PSS2, and the North Frisian verb "wel" with tags V IND PST PFV 2 SG → "wulst".]

(Peters and Martins, 2019)
Sparse Predictive Probabilities

[Figure: the only continuations with nonzero probability when inflecting "d r a w": "…e d </s>" at 66.4%, "…n </s>" at 32.2%, and "</s>" at 1.4%.]
What is softmax?

Often defined via p_i := exp(z_i) / ∑_j exp(z_j), but where does it come from?

△ := {p ∈ R^d : p ≥ 0, ∑_j p_j = 1}; a point p ∈ △ is a probability distribution over choices.

Expected score under p: E_{i∼p} z_i = p^⊤ z. The argmax maximizes the expected score:

    p⋆ = argmax_{p∈△} p^⊤ z,    e.g., z = [.7, .1, 1.5] gives p⋆ = [0, 0, 1].

Shannon entropy of p: H_s(p) := −∑_i p_i log p_i. Softmax maximizes expected score plus entropy:

    softmax(z) = argmax_{p∈△} p^⊤ z + H_s(p).

[Figure: level curves of both objectives over the simplex △ with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1); the entropy term pulls the maximizer from the vertex [0, 0, 1] toward the interior.]
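This variational view is easy to check numerically. A small NumPy sketch (not from the talk): the objective value at softmax(z) dominates the value at any sampled distribution on △:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def objective(p, z):
    """Expected score plus Shannon entropy, with the convention 0 log 0 = 0."""
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return p @ z - plogp.sum()

z = np.array([0.7, 0.1, 1.5])
p_star = softmax(z)

rng = np.random.default_rng(0)
best_random = max(objective(p, z) for p in rng.dirichlet(np.ones(3), 10_000))
assert objective(p_star, z) >= best_random  # no sampled p beats softmax(z)
```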
Generalizing Softmax Using Entropies

    π_H(z) = argmax_{p∈△} p^⊤ z + H(p)

argmax:    H_0(p) = 0
softmax:   H_s(p) = −∑_j p_j log p_j
sparsemax: H_g(p) = 1/2 ∑_j p_j (1 − p_j)    (Martins and Astudillo, 2016)
α-entmax:  H_α^T(p) = 1/(α(α−1)) ∑_j (p_j − p_j^α)

H_α^T is the Tsallis α-entropy (Tsallis, 1988). It recovers softmax (α → 1) and sparsemax (α = 2); depicted: α = 1.5.

[Figure: the mapping in one coordinate, p_1 against z_1, and on the simplex △: e.g., softmax maps z to the dense [.3, .2, .5], while a sparse mapping gives [.3, 0, .7].]

(Niculae and Blondel, 2017; Blondel et al., 2019)
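For α = 2 this argmax is the Euclidean projection onto △ and has an exact sort-based solution (Martins and Astudillo, 2016). A minimal NumPy sketch, not the authors' optimized implementation:

```python
import numpy as np

def sparsemax(z):
    """Projection of z onto the simplex: argmax_p p^T z + H_g(p)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # descending
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv    # which coordinates stay nonzero
    k_z = k[support][-1]                 # support size
    tau = (cssv[k_z - 1] - 1) / k_z      # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([0.7, 0.1, 1.5]))  # -> [0.1, 0.0, 0.9]: exact zeros appear
```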
[Figure: π_α([t, 0])_1 as a function of t ∈ [−3, 3], for α = 1 (softmax), 1.25, 1.5, 2 (sparsemax), and 4; for α > 1 the curves reach exact 0 and 1 at finite t.]
Computing α-entmax

    π_{H_α^T}(z) := argmax_{p∈△} p^⊤ z + H_α^T(p)

The solution has the form

    π_{H_α^T}(z) = [(α − 1)z − τ1]_+^{1/(α−1)}.

Algorithms:

Bisection:
• approximate; bracket τ ∈ [τ_lo, τ_hi]
• gains 1 bit per O(d) iteration
• float32 has 23 mantissa bits

Sort-based (see the sketch above):
• exact algorithm, O(d log d)
• available only for α ∈ {1.5, 2}
• huge speed-up from partial sorting when expecting sparse solutions
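A sketch of the bisection route for general α > 1 (illustrative, not the package's implementation): the map τ ↦ ∑_j [(α−1)z_j − τ]_+^{1/(α−1)} is decreasing, so each halving of the bracket gains one bit of τ:

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=30):
    """Approximate alpha-entmax via bisection on the threshold tau."""
    x = (alpha - 1) * np.asarray(z, dtype=float)
    # at tau_lo the candidate p sums to >= 1; at tau_hi it sums to 0
    tau_lo, tau_hi = x.max() - 1.0, x.max()
    for _ in range(n_iter):
        tau = (tau_lo + tau_hi) / 2
        p = np.maximum(x - tau, 0.0) ** (1.0 / (alpha - 1))
        if p.sum() >= 1.0:
            tau_lo = tau        # threshold still too small
        else:
            tau_hi = tau
    p = np.maximum(x - tau_lo, 0.0) ** (1.0 / (alpha - 1))
    return p / p.sum()          # normalize away the residual bisection error

print(entmax_bisect([0.7, 0.1, 1.5], alpha=2))    # matches exact sparsemax
print(entmax_bisect([0.7, 0.1, 1.5], alpha=1.5))  # sparse, but softer
```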
Morphological Inflection

SIGMORPHON 2018, one shared multi-lingual model.
Medium: 1k pairs per language, 102 languages.
High: 10k pairs per language, 86 languages.

[Bar chart: accuracy (70–100) for α = 1 (softmax), α = 1.5, and α = 2 (sparsemax); the sparse models gain +1.83 accuracy on Medium and +1.44 on High over softmax.]
Neural Machine Translation

[Bar chart: BLEU for α = 1 (softmax), α = 1.5, and α = 2 (sparsemax); gains over softmax of +.47 (DE-EN), +.56 (EN-DE), +.33 (JA-EN), +.79 (EN-JA), +1.03 (RO-EN), and +.72 (EN-RO).]
Sparse Mappings Don’t Slow Down Training

Training timing on three DE-EN runs.

[Plot: validation accuracy (57–63%) against wall-clock training time (1,000–7,000 s) for softmax and 1.5-entmax.]
Impact of Fine-Tuning α

Grid search on DE-EN.

[Plots: validation accuracy as α varies over [1.00, 2.25], separately for the attention mapping (62–64%) and the output mapping (60–63%).]
Sparse Seq2Seq: Conclusions

New family of sparse mappings, α-entmax, with algorithms for efficient forward and backward passes.

• Sparse attention weights / alignments (e.g., the North Frisian inflection w e l + V IND PST PFV 2 SG → wulst).
• Sparse output space (e.g., only three continuations of "d r a w" receive nonzero probability: 66.4%, 32.2%, 1.4%).
• Performance improvements without slowing down training.
[email protected] github.com/deep-spin/entmax https://vene.ro @vnfrombucharest
Acknowledgements
This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal).

Some icons by Dave Gandy and Freepik via flaticon.com.
References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). “Neural machine translation by jointly learning to align and translate”. In: Proc. of ICLR.
Blondel, Mathieu, André F. T. Martins, and Vlad Niculae (2019). “Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms”. In: Proc. of AISTATS.
Martins, André F. T. and Ramón Fernandez Astudillo (2016). “From softmax to sparsemax: A sparse model of attention and multi-label classification”. In: Proc. of ICML.
Niculae, Vlad and Mathieu Blondel (2017). “A regularized framework for sparse and structured neural attention”. In: Proc. of NeurIPS.
Peters, Ben and André F. T. Martins (2019). “IT-IST at the SIGMORPHON 2019 shared task: Sparse two-headed models for inflection”. In: Proc. of SIGMORPHON.
Tsallis, Constantino (1988). “Possible generalization of Boltzmann-Gibbs statistics”. In: Journal of Statistical Physics 52, pp. 479–487.