What can Statistical Machine Translation teach Neural
Machine Translation about Structured Prediction?
Graham Neubig @ ICLR Workshop on Deep Reinforcement Learning Meets Structured Prediction
5/6/2019
Types of Prediction

• Two classes (binary classification)
  I hate this movie -> positive / negative
• Multiple classes (multi-class classification)
  I hate this movie -> very good / good / neutral / bad / very bad
• Exponential/infinite labels (structured prediction)
  I hate this movie -> PRP VBP DT NN (tagging)
  I hate this movie -> kono eiga ga kirai (translation)
...
Neubig & Watanabe, Computational Linguistics (2016)
Then: Symbolic Translation Models

kono eiga ga kirai -> I hate this movie

• First step: learn component models to maximize likelihood
  • Translation model P(y|x) -- e.g. P( movie | eiga )
  • Language model P(Y) -- e.g. P( hate | I )
  • Reordering model -- e.g. P( <swap> | eiga, ga kirai )
  • Length model P(|Y|) -- e.g. a word penalty for each word added
• Second step: learn a log-linear combination to maximize translation accuracy [Och 2004]

Minimum Error Rate Training in Statistical Machine Translation (Och 2004)
log P(Y | X) = Σ_i λ_i φ_i(X, Y) / Z
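A minimal sketch of this log-linear scoring over a candidate set (function and variable names are illustrative, not from the talk; the normalizer Z is written as an explicit log):

```python
import math

def log_linear_logprob(candidates, lam, y):
    """log P(y | x) under a log-linear model: score(y) = sum_i lam_i * phi_i(x, y),
    normalized over the candidate set. `candidates` maps each candidate y to its
    feature vector phi(x, y)."""
    def score(feats):
        return sum(l * f for l, f in zip(lam, feats))
    # log Z: log of the summed exponentiated scores over all candidates
    log_z = math.log(sum(math.exp(score(f)) for f in candidates.values()))
    return score(candidates[y]) - log_z
```

With symmetric features and weights, the candidates split the probability mass evenly, and the probabilities always sum to one.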
Now: Auto-regressive Neural Networks

[Figure: encoder-decoder network; the encoder reads "kono eiga ga kirai" and the decoder generates "I hate this movie </s>" one word at a time]

• All parameters trained end-to-end, usually to maximize likelihood (not accuracy!)
Standard MT System Training/Decoding
Decoder Structure

[Figure: the encoder output feeds a decoder that repeatedly classifies to predict the next word, I -> hate -> this -> movie -> </s>, each prediction conditioned on the previous words]

P(E | F) = ∏_{t=1}^{T} P(e_t | F, e_1, ..., e_{t-1})
Maximum Likelihood Training

• Maximize the likelihood of predicting the next word in the reference given the previous words

ℓ(E | F) = -log P(E | F) = -Σ_{t=1}^{T} log P(e_t | F, e_1, ..., e_{t-1})

• Also called "teacher forcing"
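A minimal sketch of this objective (the model is just a stand-in function returning next-word distributions; the names are illustrative):

```python
import math

def teacher_forcing_nll(step_probs, reference):
    """Negative log-likelihood of a reference translation under teacher forcing:
    at every step the model is conditioned on the *gold* prefix, and we sum
    -log P(e_t | F, e_1, ..., e_{t-1}).
    step_probs(prefix) -> dict mapping each next word to its probability."""
    nll = 0.0
    prefix = []
    for token in reference:
        nll += -math.log(step_probs(prefix)[token])
        prefix.append(token)  # feed the gold word, not the model's own prediction
    return nll
```

For a model that is uniform over a 4-word vocabulary, the loss of a 4-word reference is 4 log 4.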
Problem 1: Exposure Bias

• Teacher forcing assumes we feed in the correct previous input, but at test time we may make mistakes that propagate
• Exposure bias: the model is not exposed to mistakes during training, and cannot deal with them at test time
• Really important! One main source of commonly witnessed phenomena such as repetition.

[Figure: at test time each "classify" step is fed the model's own previous prediction, so an early mistake corrupts every later prefix]
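The test-time loop contrasts with teacher forcing: each step consumes the model's own previous output. A toy sketch (the toy distribution below is purely illustrative) shows how a model that prefers continuing over stopping gets stuck repeating a word:

```python
def greedy_decode(step_probs, max_len=10, eos="</s>"):
    """Greedy test-time decoding: unlike teacher forcing, each step is
    conditioned on the model's OWN previous predictions, so one early
    mistake changes every later prefix (exposure bias)."""
    prefix = []
    for _ in range(max_len):
        # pick the highest-probability next word given what we generated so far
        token = max(step_probs(prefix).items(), key=lambda kv: kv[1])[0]
        prefix.append(token)
        if token == eos:
            break
    return prefix

# Toy distribution that always prefers continuing over stopping: once "hate"
# is emitted, the decoder keeps emitting it until the length limit (repetition).
def toy(prefix):
    if not prefix:
        return {"I": 0.9, "</s>": 0.1}
    return {"hate": 0.6, "</s>": 0.4}
```

Running `greedy_decode(toy, max_len=5)` yields `["I", "hate", "hate", "hate", "hate"]`: the repetition never appears in teacher-forced training, where the gold prefix is always fed in.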
Problem 2: Disregard for Evaluation Metrics

• In the end, we want good translations
• Good translations can be measured with metrics, e.g. BLEU or METEOR
• Really important! Causes systematic problems:
  • Hypothesis-reference length mismatch
  • Dropped/repeated content
A Clear Example

• My (winning) submission to the Workshop on Asian Translation 2016 [Neubig 16]

[Chart: BLEU (23-27) and length ratio (80-100%) for MLE, MLE+Length, and MinRisk systems]

• Just training for (sentence-level) BLEU largely fixes length problems, and does much better than heuristics

Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016 (Neubig 16)
Error and Risk
Error

• Generate a translation

  Ê = argmax_E P(E | F)

• Calculate its "badness" (e.g. 1-BLEU, 1-METEOR)

  error(E, Ê) = 1 - BLEU(E, Ê)

• We would like to minimize error
• Problem: argmax is not differentiable, and thus not conducive to gradient-based optimization
In Phrase-based MT: Minimum Error Rate Training

• A clever trick for gradient-free optimization of linear models
• Pick a single direction in feature space
• Exactly calculate the loss surface in this direction only (over an n-best list for every hypothesis)

Example n-best lists (features φ1, φ2, φ3 and error for each hypothesis):

F1:     φ1   φ2   φ3   err
E1,1     1    0   -1   0.6
E1,2     0    1    0   0
E1,3     1    0    1   1

F2:     φ1   φ2   φ3   err
E2,1     1    0   -2   0.8
E2,2     3    0    1   0.3
E2,3     3    1    2   0

[Figure: (a, b) candidate scores for F1 and F2 as lines in α along direction d=(0, 0, 1) from λ=(-1, 1, 0); (c) the resulting piecewise-constant per-sentence error surfaces; (d) total error, minimized by choosing α=1.25, giving updated weights λ=(-1, 1, 1.25)]
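The line search on this slide can be sketched in a few lines: each hypothesis's score is a line in α, the 1-best can only change where two lines intersect, so the total error is piecewise constant and can be minimized exactly along the chosen direction. This is a sketch under those assumptions, not Och's actual implementation:

```python
import itertools

def mert_line_search(nbests, lam, d):
    """Exact 1-D MERT line search: find alpha minimizing the total error of
    the 1-best hypotheses under weights lam + alpha * d.
    nbests: one list per source sentence of (feature_vector, error) pairs."""
    def intercept_slope(phi):
        return (sum(l * p for l, p in zip(lam, phi)),   # score at alpha = 0
                sum(di * p for di, p in zip(d, phi)))   # change per unit alpha
    # alphas where any two hypotheses of the same sentence swap rank
    thresholds = set()
    for sent in nbests:
        for (p1, _), (p2, _) in itertools.combinations(sent, 2):
            (b1, m1), (b2, m2) = intercept_slope(p1), intercept_slope(p2)
            if m1 != m2:
                thresholds.add((b2 - b1) / (m1 - m2))
    ts = sorted(thresholds)
    if not ts:
        return 0.0  # error is constant along this direction
    # the 1-best (hence the error) is constant between thresholds,
    # so testing one alpha per interval is enough
    candidates = [ts[0] - 1.0] + [(a + b) / 2 for a, b in zip(ts, ts[1:])] + [ts[-1] + 1.0]
    def total_error(alpha):
        total = 0.0
        for sent in nbests:
            lines = [(intercept_slope(p), e) for p, e in sent]
            total += max(lines, key=lambda x: x[0][0] + alpha * x[0][1])[1]
        return total
    return min(candidates, key=total_error)
```

On the slide's example n-best lists, starting from λ=(-1, 1, 0) and searching along d=(0, 0, 1), any α in (0.25, 2) drives the total error to 0, consistent with the slide's choice α=1.25.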
A Smooth Approximation: Risk [Smith+ 2006, Shen+ 2015]

• Risk is defined as the expected error

risk(F, E, θ) = Σ_Ê P(Ê | F; θ) error(E, Ê)

• This includes the probability in the objective function -> differentiable!

Minimum Risk Annealing for Training Log-Linear Models (Smith and Eisner 2006)
Minimum risk training for neural machine translation (Shen et al. 2015)
Sub-sampling

• Create a small sample of sentences (5-50), and calculate risk over that:

risk(F, E, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F)}{Z} \; error(\tilde{E}, E)

• Samples can be created using random sampling or n-best search
• If random sampling, make sure to deduplicate
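The sub-sampled risk above can be sketched in a few lines of plain Python (a real system would use tensors and autodiff; the function and argument names here are illustrative). `scores` holds the model's log-probabilities for each sampled hypothesis, and `errors` holds error(Ẽ, E), e.g. 1 − sentence-level BLEU.

```python
import math

def sub_sampled_risk(scores, errors):
    """Risk over a small sample S: sum over sampled hypotheses of
    P(E~ | F) / Z * error(E~, E), where Z renormalizes within the sample.
    Deduplication matters: a repeated hypothesis would be double-counted."""
    # Numerically stable softmax over the sampled log-probabilities
    m = max(scores)
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)
    # Expected error under the renormalized distribution
    return sum((e / z) * err for e, err in zip(exp_scores, errors))
```

With equal scores this is just the mean error; as the model puts more mass on low-error hypotheses, the risk drops, which is exactly what the gradient pushes toward.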
Policy Gradient/REINFORCE

• Alternative way of maximizing expected reward / minimizing risk:

\ell_{reinforce}(X, Y) = -R(\hat{Y}, Y) \log P(\hat{Y} \mid X)

• Outputs that get a bigger reward will get a higher weight
• Can show this converges to the minimum-risk solution
But Wait, why is Everyone Using MLE for NMT?
When Training goes Bad...

Chances are, this is you 😔

Minimum Risk Training for Neural Machine Translation (Shen et al. 2015)
It Happens to the Best of Us

• Email from a famous MT researcher: "we also re-implemented MRT, but so far, training has been very unstable, and after improving for a bit, our models develop a bias towards producing ever-shorter translations..."
My Current Recipe for Stabilizing MRT/Reinforcement Learning
Warm-start

• Start training with maximum likelihood, then switch over to REINFORCE
• Works only in scenarios where we can run MLE (not latent-variable models or standard RL settings)
• MIXER (Ranzato et al. 2016) gradually transitions from MLE to the full objective
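One way to realize the warm-start is a scalar schedule mixing the two losses. Note that MIXER actually anneals per output position; this global linear ramp is a simplification, and the default step counts are made up:

```python
def rl_weight(step, warmup_steps=10000, anneal_steps=10000):
    """Weight on the REINFORCE term: 0 during the MLE warm-up,
    then annealed linearly up to 1. The combined loss would be
    (1 - w) * mle_loss + w * reinforce_loss."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / anneal_steps)
```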
Adding a Baseline

• Basic idea: we have expectations about our reward for a particular sentence

                                Reward   Baseline   R − B
  "This is an easy sentence"    0.8      0.95       −0.15
  "Buffalo Buffalo Buffalo"     0.3      0.1        0.2

• We can instead weight our likelihood by R − B to reflect when we did better or worse than expected:

\ell_{baseline}(X) = -(R(\hat{Y}, Y) - B(Y)) \log P(\hat{Y} \mid X)

• (Be careful not to backprop through the baseline)
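A minimal sketch of the baseline-adjusted loss, plus one common choice of B (an exponential moving average of observed rewards); the names and the `alpha` value are illustrative:

```python
def baseline_loss(log_prob, reward, baseline):
    """Baseline-adjusted REINFORCE loss for one sample:
    -(R(Y-hat, Y) - B(Y)) * log P(Y-hat | X).
    In an autodiff framework, detach/stop-gradient the baseline
    so no gradient flows through B."""
    return -(reward - baseline) * log_prob

def update_baseline(baseline, reward, alpha=0.05):
    """One simple choice of B: an exponential moving average of rewards."""
    return (1.0 - alpha) * baseline + alpha * reward
```

With the numbers from the table, the "easy" sentence (reward 0.8 below its baseline 0.95) gets a negative weight, while the hard one (0.3 above its baseline 0.1) gets a positive weight, as expected.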
Increasing Batch Size

• Because each sample will be high variance, we can sample many different examples before performing an update
• We can increase the number of examples (roll-outs) done before an update to stabilize
• We can also save previous roll-outs and re-use them when we update parameters (experience replay; Lin 1992)
Adding Temperature

• Temperature adjusts the peakiness of the distribution:

risk(F, E, \theta, \tau, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F; \theta)^{1/\tau}}{Z} \; error(\tilde{E}, E)

• With a small sample, setting temperature > 1 accounts for unsampled hypotheses that should be in the denominator

[Figure: the same distribution at τ = 1, τ = 0.5, τ = 0.25, and τ = 0.05, growing progressively peakier as τ decreases]
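The tempered renormalization amounts to a softmax over log-probabilities divided by τ; a small sketch (names are illustrative):

```python
import math

def tempered_probs(log_probs, tau):
    """Sample probabilities renormalized with temperature tau:
    p_i ∝ P_i^(1/tau), i.e. a softmax over log-probs / tau.
    tau > 1 flattens the distribution, reserving probability mass
    for hypotheses that were never sampled; tau < 1 sharpens it."""
    scaled = [lp / tau for lp in log_probs]
    # Numerically stable softmax
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```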
Contrasting Phrase-based SMT and NMT
Phrase-based SMT MERT and NMT MinRisk/REINFORCE

                          NMT+MinRisk       PBMT+MERT
  Model                   NMT               PBMT
  Optimized Parameters    Millions          5-30 log-linear weights (others MLE)
  Objective               Risk              Error
  Metric Granularity      Sentence level    Corpus level
  n-best Lists            Re-generated      Accumulated
Optimized Parameters

• Can we reduce the number of parameters optimized for NMT?
• Maybe we can optimize only some parts of the model? (Freezing Subnetworks to Analyze Domain Adaptation in NMT; Thompson et al. 2018)
• Maybe we can express models as a linear combination of a few hyper-parameters? (Contextualized Parameter Generation for Universal NMT; Platanios et al. 2018)

W = \sum_i \alpha_i W_i
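The linear combination W = Σᵢ αᵢ Wᵢ is easy to sketch; here the Wᵢ are nested lists standing in for weight matrices (a real implementation would use tensors), and only the handful of αᵢ would be optimized:

```python
def combine_weights(alphas, weight_mats):
    """W = sum_i alpha_i * W_i: express the model's weights as a
    linear combination of a few fixed basis matrices, so only the
    scalar alphas need to be optimized."""
    rows, cols = len(weight_mats[0]), len(weight_mats[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for a, w in zip(alphas, weight_mats):
        for r in range(rows):
            for c in range(cols):
                combined[r][c] += a * w[r][c]
    return combined
```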
Objective

• Can we move closer to minimizing error, which is what we want to do in the first place?
• Maybe we can gradually anneal the temperature to move towards a peakier distribution? (Minimum Risk Annealing for Training Log-Linear Models; Smith and Eisner 2006)

[Figure: training progression annealing τ from 1 to 0.5 to 0.25 to 0.05, the distribution growing peakier at each step]
Metric Granularity

• Two ways of measuring metrics:
  • Sentence-level: measure sentence-by-sentence, then average
  • Corpus-level: sum sufficient statistics, then calculate the score
• Regular BLEU is corpus-level, but mini-batch NMT optimization algorithms calculate sentence-level scores
• This causes problems, e.g. in sentence length! (Optimizing for Sentence-Level BLEU+1 Yields Short Translations; Nakov et al. 2012)
• Maybe we can keep a running average of the sufficient statistics to approximate corpus BLEU? (Online Large-Margin Training of Syntactic and Structural Translation Features; Chiang et al. 2008)
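The running-average idea can be sketched as an exponentially decayed "pseudo-document" of BLEU sufficient statistics, in the spirit of Chiang et al. (2008). This sketch assumes the per-sentence statistics (n-gram match/total counts and lengths) are computed elsewhere; the class name, interface, and decay constant are all illustrative:

```python
import math

class RunningBLEU:
    """Approximate corpus-level BLEU during minibatch training by keeping
    an exponentially decayed running sum of sufficient statistics."""

    def __init__(self, max_order=4, decay=0.9):
        self.decay = decay
        self.matches = [0.0] * max_order   # n-gram matches, per order
        self.totals = [0.0] * max_order    # n-gram counts, per order
        self.hyp_len = 0.0
        self.ref_len = 0.0

    def update(self, matches, totals, hyp_len, ref_len):
        """Decay the old statistics and add one sentence's statistics."""
        d = self.decay
        self.matches = [d * m + x for m, x in zip(self.matches, matches)]
        self.totals = [d * t + x for t, x in zip(self.totals, totals)]
        self.hyp_len = d * self.hyp_len + hyp_len
        self.ref_len = d * self.ref_len + ref_len

    def score(self):
        """BLEU over the decayed statistics: geometric mean of n-gram
        precisions times the brevity penalty."""
        if any(t == 0 for t in self.totals) or any(m == 0 for m in self.matches):
            return 0.0
        log_prec = sum(math.log(m / t)
                       for m, t in zip(self.matches, self.totals)) / len(self.totals)
        bp = min(0.0, 1.0 - self.ref_len / self.hyp_len)
        return math.exp(log_prec + bp)
```

Because each sentence is scored against a decayed history of the whole corpus, the length incentives stay close to those of corpus BLEU rather than sentence BLEU.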
N-best Lists

• In MERT for PBMT, we would accumulate n-best lists across epochs:

  Epoch 1: n-best 1
  Epoch 2: n-best 1 + new n-best 2
  Epoch 3: n-best 1 + new n-best 2 + new n-best 3

• Greatly stabilizes training! Even if the model learns horrible parameters, it still has good hypotheses from which to recover.
• Maybe we could do the same for NMT? Analogous to experience replay in RL (Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching; Lin 1992)
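The accumulation scheme above can be sketched as a small per-sentence pool of hypotheses with their errors, analogous to a replay buffer; the class and interface are illustrative:

```python
class NBestPool:
    """Accumulate n-best lists across epochs, MERT-style; analogous to an
    experience replay buffer in RL. Hypotheses are deduplicated by string,
    so good hypotheses from earlier epochs survive even when the current
    model proposes only bad ones."""

    def __init__(self):
        self.pool = {}  # hypothesis -> error(hypothesis, reference)

    def add_nbest(self, nbest):
        """Merge one epoch's n-best list (pairs of hypothesis, error)."""
        for hyp, err in nbest:
            self.pool[hyp] = err

    def best(self):
        # The lowest-error hypothesis seen so far, across all epochs
        return min(self.pool.items(), key=lambda kv: kv[1])
```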
Summary

• Neural MT has come a long way, and we can optimize for accuracy
• This is important: it fixes lots of problems that we'd otherwise use heuristic hacks for
• But no one does it... problems of stability and speed
• Still lots to remember from the past!

Optimization for Statistical Machine Translation, a Survey (Neubig and Watanabe 2016)

Thanks! Questions?