Reasoning about pragmatics with neural listeners and speakers
Jacob Andreas and Dan Klein
The reference game
2
The reference game
3
The reference game
4
The one with the snake
The reference game
5
Mike is holding a baseball bat
The reference game
6
bat a is holding Mike baseball
The reference game
7
They are sitting by a picnic table
The reference game
8
There is a bat
The reference game
9
There is a bat
The reference game
10
Why do we care about this game?
Don’t you think it’s a little cold in here?
Do you know what time it is?
Some of the children played in the park.
Deriving pragmatics from reasoning
11
Mike is holding a baseball bat
12
Jenny is running from the snake
Deriving pragmatics from reasoning
13
Mike is holding a baseball bat
Deriving pragmatics from reasoning
How to win
14
DERIVED STRATEGY: Reason about listener beliefs
DIRECT STRATEGY: Imitate successful human play
There is a snake
There is a snake
There is a snake
?
How to win
15
[Mao et al. 2015]
[Kazemzadeh et al. 2014]
[Fitzgerald et al. 2013]
[Monroe and Potts 2015]
[Smith et al. 2013]
[Vogel et al. 2013]
[Golland et al. 2010]
DERIVED STRATEGY: Reason about listener beliefs
DIRECT STRATEGY: Imitate successful human play
How to win
16
PRO: domain representation “for free”
CON: past work needs targeted data
PRO: pragmatics “for free”
CON: past work needs hand-engineering
DERIVED STRATEGY: Reason about listener beliefs
DIRECT STRATEGY: Imitate successful human play
How to win
17
DERIVED STRATEGY: Reason about listener beliefs
DIRECT STRATEGY: Imitate successful human play
Learn base models for interpretation & generation without pragmatic context
Explicitly reason about base models to get novel behavior
Data
Abstract Scenes Dataset
1000 scenes
10k sentences
Feature representations
18
Approach
19
Literal speaker
Literal listener
Sampler
Reasoning speaker
A literal speaker (S0)
20
Mike is holding a baseball bat
A literal speaker (S0)
21
Referent encoder
Referent decoder
Mike is holding a baseball bat
Module architectures
22
[Diagram: the referent encoder is an FC layer over referent features; the referent decoder feeds the n-gram context word_{<n} and the encoded referent through FC and ReLU layers to a softmax over the next word word_{n+1}]
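For concreteness, here is a minimal PyTorch sketch of these two S0 modules as the diagram lays them out. This is an illustration, not the authors’ released code: the module names, layer sizes, and the flattened n-gram context are all assumptions.

```python
import torch
import torch.nn as nn

class ReferentEncoder(nn.Module):
    """Single FC layer over referent feature vectors (sizes assumed)."""
    def __init__(self, n_features, d_emb):
        super().__init__()
        self.fc = nn.Linear(n_features, d_emb)

    def forward(self, ref_features):
        return self.fc(ref_features)

class ReferentDecoder(nn.Module):
    """Feedforward LM: n-gram context + conditioning vector -> next word."""
    def __init__(self, vocab_size, d_word, context_size, d_cond, d_hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_word)
        self.fc1 = nn.Linear(context_size * d_word + d_cond, d_hidden)
        self.fc2 = nn.Linear(d_hidden, vocab_size)

    def forward(self, context_words, cond):
        # context_words: (batch, context_size) word ids; cond: (batch, d_cond)
        ctx = self.embed(context_words).flatten(start_dim=1)
        h = torch.relu(self.fc1(torch.cat([ctx, cond], dim=-1)))
        return torch.log_softmax(self.fc2(h), dim=-1)  # log p(word_{n+1} | ...)
```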
Training S0
23
Mike is holding a baseball bat
S0
A literal speaker (S0)
24
Mike is holding a baseball bat
The sun is in the sky
Jenny is standing next to Mike
A literal listener (L0)
25
Mike is holding a baseball bat
A literal listener (L0)
26
[Diagram: a description encoder and two referent encoders feed a scorer; for “Mike is holding a baseball bat” the scorer assigns 0.87 to the target and 0.13 to the distractor]
Module architectures
27
[Diagram: the description encoder is an FC layer over n-gram features of the sentence; the encoded description and referent are summed, passed through ReLU and FC layers, and a softmax yields the choice of referent]
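Continuing the sketch above under the same caveats, the L0 modules might look as follows; the sum-then-ReLU combination is read off the diagram, everything else is assumed.

```python
import torch
import torch.nn as nn

class DescriptionEncoder(nn.Module):
    """Single FC layer over n-gram features of the description."""
    def __init__(self, n_ngram_features, d_emb):
        super().__init__()
        self.fc = nn.Linear(n_ngram_features, d_emb)

    def forward(self, ngram_features):
        return self.fc(ngram_features)

class Scorer(nn.Module):
    """Sum description and referent encodings, ReLU, FC, softmax over choice."""
    def __init__(self, d_emb):
        super().__init__()
        self.fc = nn.Linear(d_emb, 1)

    def forward(self, desc_enc, ref_encs):
        # ref_encs: one encoding per candidate referent
        scores = [self.fc(torch.relu(desc_enc + r)) for r in ref_encs]
        return torch.softmax(torch.cat(scores, dim=-1), dim=-1)  # choice probs
```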
Training L0
28
Mike is holding a baseball bat
(random distractor)
0.87
A literal listener (L0)
29
L0
Mike is holding a baseball bat
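A sketch of the training step these slides depict: L0 sees the true referent next to a randomly sampled distractor and is trained to put probability mass on the truth. The interface reuses the illustrative encoder and scorer modules sketched earlier; all names are assumptions.

```python
def l0_training_step(desc_encoder, ref_encoder, scorer, optimizer,
                     desc_feats, true_ref_feats, distractor_feats):
    probs = scorer(desc_encoder(desc_feats),
                   [ref_encoder(true_ref_feats),     # slot 0: true referent
                    ref_encoder(distractor_feats)])  # slot 1: random distractor
    loss = -torch.log(probs[:, 0]).mean()  # cross-entropy on the true slot
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```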
A reasoning speaker (S1)
30
Mike is holding a baseball bat
?
A reasoning speaker (S1)
31
Literal speaker samples: “The sun is in the sky”, “Jenny is standing next to Mike”, “Mike is a baseball bat”
Literal listener scores: 0.9, 0.5, 0.7
A reasoning speaker (S1)
32
Literal speaker samples: “The sun is in the sky”, “Jenny is standing next to Mike”, “Mike is a baseball bat”
Literal listener scores: 0.9, 0.5, 0.7
Literal speaker probabilities: 0.05, 0.09, 0.08
A reasoning speaker (S1)
33
Literal speaker samples: “The sun is in the sky”, “Jenny is standing next to Mike”, “Mike is a baseball bat”
Combined scores, weighting the literal listener against the literal speaker:
0.9^(1−λ) × 0.05^λ
0.5^(1−λ) × 0.09^λ
0.7^(1−λ) × 0.08^λ
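Written out, the scoring rule this slide illustrates (the paper’s Equation 8) combines the two base models as a weighted product; the proportionality form below is reconstructed from the slide and the Figure 5 caption.

```latex
% lambda = 0: all weight on the literal listener L0 (discriminative, possibly disfluent)
% lambda = 1: all weight on the literal speaker S0 (fluent, possibly generic)
p_{S_1}(d \mid t) \;\propto\; p_{L_0}(t \mid d)^{1-\lambda}\, p_{S_0}(d \mid t)^{\lambda}
```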
Experiments
34
Baselines
• Literal: the S0 model by itself
• Contrastive: a conditional LM trained on both the target image and a random distractor [Mao et al. 2015]
35
Results (test)
Literal      64%
Contrastive  69%
Reasoning    81%
36
Accuracy and fluency
Figure 5: Tradeoff between speaker and listener models, controlled by the parameter λ in Equation 8. With λ = 0, all weight is placed on the literal listener, and the model produces highly discriminative but somewhat disfluent captions. With λ = 1, all weight is placed on the literal speaker, and the model produces fluent but generic captions.
4.1 How good are the base models?
To measure the performance of the base models, we draw 10 samples d_{jk} for a subset of 100 pairs (r_{1,j}, r_{2,j}) in the Dev-All set. We collect human fluency and accuracy judgments for each of the 1000 total samples. This allows us to conduct a post-hoc search over values of λ: for a range of λ, we compute the average accuracy and fluency of the highest-scoring sample. By varying λ, we can view the tradeoff between accuracy and fluency that results from interpolating between the listener and speaker model: setting λ = 0 gives samples from p_{L0}, and λ = 1 gives samples from p_{S0}.
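A small sketch of this post-hoc search, assuming the ten samples per pair are stored with their log-probabilities under both base models and their human judgments (the data layout here is invented for illustration):

```python
def sweep_lambda(pairs, lambdas):
    """pairs: list of pairs; each pair is a list of sample tuples
    (log_p_l0, log_p_s0, accuracy, fluency)."""
    results = {}
    for lam in lambdas:
        # highest-scoring sample per pair under the interpolated objective
        picks = [max(pair, key=lambda s: (1 - lam) * s[0] + lam * s[1])
                 for pair in pairs]
        results[lam] = (sum(p[2] for p in picks) / len(picks),  # mean accuracy
                        sum(p[3] for p in picks) / len(picks))  # mean fluency
    return results
```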
Figure 5 shows the resulting accuracy and fluency for various values of λ. It can be seen that relying entirely on the listener gives the highest accuracy but degraded fluency. However, by adding only a very small weight to the speaker model, it is possible to achieve near-perfect fluency without a substantial decrease in accuracy. Example sentences for an individual reference game are shown in Figure 5; increasing λ causes captions to become more generic. For the remaining experiments in this paper, we take λ = 0.02, finding that this gives excellent performance on both metrics.
On the development set, λ = 0.02 results in an average fluency of 4.8 (compared to 4.8 for the literal speaker, λ = 1). This high fluency can be confirmed by inspection of model samples (Figure 4). We thus focus on accuracy for the remainder of the evaluation.
4.2 How many samples are needed?
Next we turn to the computational efficiency of the reasoning model. As in all sampling-based inference, the number of samples that must be drawn from the proposal is of critical interest: if too many samples are needed, the model will be too slow to use in practice. Having fixed λ = 0.02 in the preceding section, we measure accuracy for versions of the reasoning model that draw 1, 10, 100, and 1000 samples. Results are shown in Table 1. We find that gains continue up to 100 samples.
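The reasoning speaker’s inference is plain sample-and-rescore, so the sample budget trades accuracy against speed. A minimal sketch, with the sampling and scoring functions left as assumed placeholders:

```python
import math

def reasoning_speaker(target, distractor, sample_caption, p_l0, p_s0,
                      n_samples=100, lam=0.02):
    """Draw captions from S0 and return the one maximizing the Eq. 8 score."""
    best, best_score = None, -math.inf
    for _ in range(n_samples):
        d = sample_caption(target)  # d ~ p_S0(. | target)
        score = ((1 - lam) * math.log(p_l0(target, distractor, d))
                 + lam * math.log(p_s0(d, target)))
        if score > best_score:
            best, best_score = d, score
    return best
```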
4.3 Is reasoning necessary?
Because they do not require complicated inference procedures, direct approaches to pragmatics typically enjoy better computational efficiency than derived ones. Having built an accurate derived speaker, can we bootstrap a more efficient direct speaker?
To explore this, we constructed a “compiled” speaker model as follows: given reference candidates r_1 and r_2 and target t, this model produces embeddings e_1 and e_2, concatenates them together into a “contrast embedding” [e_t, e_¬t], and then feeds this whole embedding into a string decoder module. Like S0, this model generates captions without the need for discriminative rescoring; unlike S0, the contrast embedding means this model can in principle learn to produce pragmatic captions, if given access to pragmatic training data. Since no such training data exists, we train the compiled model on the output of the reasoning model.
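A sketch of that compiled speaker, reusing the illustrative ReferentEncoder and ReferentDecoder from the earlier sketches; the concatenation order and interface are assumptions:

```python
class CompiledSpeaker(nn.Module):
    """Decode a caption directly from a target-first contrast embedding."""
    def __init__(self, ref_encoder, vocab_size, d_word, context_size,
                 d_emb, d_hidden):
        super().__init__()
        self.ref_encoder = ref_encoder
        # conditioning vector is the concatenation [e_t, e_not_t]
        self.decoder = ReferentDecoder(vocab_size, d_word, context_size,
                                       d_cond=2 * d_emb, d_hidden=d_hidden)

    def forward(self, target_feats, distractor_feats, context_words):
        e_t = self.ref_encoder(target_feats)
        e_not_t = self.ref_encoder(distractor_feats)
        contrast = torch.cat([e_t, e_not_t], dim=-1)
        return self.decoder(context_words, contrast)
```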
[Figure 5 examples, (a) target vs. (b) distractor: λ = 0.0 (prefer L0) “a hamburger on the ground”; λ = 0.1 “mike is holding the burger”; λ = 0.2 (prefer S0) “the airplane is in the sky”]
Figure 5: Captions for the same pair with varying λ. Changing λ alters both the naturalness and specificity of the output.
37
How many samples?
[Plot: accuracy (50–100%) as a function of the number of samples drawn (1, 10, 100, 1000)]
38
Examples
(a) the sun is in the sky [contrastive]
(b) mike is wearing a chef’s hat [non-contrastive]
(c) the dog is standing beside jenny [contrastive]
(d) the plane is flying in the sky [contrastive]
Figure 4: Four randomly-chosen samples from our model. For each, the target image is shown on the left, the distractor image is shown on the right, and the description generated by the model is shown below. All descriptions are fluent, and generally succeed in uniquely identifying the target scene, even when they do not perfectly describe it (e.g. (c)). These samples are broadly representative of the model’s performance (Table 2).
                Dev acc. (%)    Test acc. (%)
Model           All    Hard     All    Hard
Literal (S0)    66     54       64     53
Contrastive     71     54       69     58
Reasoning (S1)  83     73       81     68

Table 2: Success rates at RG on abstract scenes. “Literal” is a captioning baseline corresponding to the base speaker S0. “Contrastive” is a reimplementation of the approach of Mao et al. (2015). “Reasoning” is the model from this paper. All differences between our model and baselines are significant (p < 0.05, Binomial).
of the base model (it improves noticeably over S0 on scenes with 2–3 differences), the overall gain is negligible (the difference in mean scores is not significant). The compiled model significantly underperforms the reasoning model. These results suggest either that the reasoning procedure is not easily approximated by a shallow neural network, or that example descriptions of randomly-sampled training pairs (which are usually easy to discriminate) do not provide a strong enough signal for a reflex learner to recover pragmatic behavior.
4.4 Final evaluation
Based on the preceding sections, we keep λ = 0.02 and use 100 samples to generate predictions.
                Accuracy (%) by # of differences
Model           1      2      3      4      Mean
Literal (S0)    50     66     70     78     66
Reasoning       64     86     88     94     83
Compiled (S1)   44     72     80     80     69

Table 3: Comparison of the “compiled” pragmatic speaker model with literal and explicitly reasoning speakers. The models are evaluated on subsets of the development set, arranged by difficulty: column headings indicate the number of differences between the target and distractor scenes.
We evaluate on the test set, comparing this Reasoning model S1 to two baselines: Literal, an image captioning model trained normally on the abstract scene captions (corresponding to our S0), and Contrastive, a model trained with a soft contrastive objective, and previously used for visual referring expression generation (Mao et al., 2015).
Results are shown in Table 2. Our reasoning model outperforms both the literal baseline and previous work by a substantial margin, achieving an improvement of 17% on all pairs and 15% on hard pairs.² Figures 4 and 6 show various representative examples.
² For comparison, a model with hand-engineered pragmatic behavior (trained using a feature representation with indicators on only those objects that appear in the target image but not the distractor) produces an accuracy of 78% and 69% on all and hard pairs, respectively.
39
Conclusions
• Standard neural kit of parts for base models
• Probabilistic reasoning for high-level goals
• A little bit of structure goes a long way!
42
Thank you!
“Compiling” the reasoning model
What if we train the contrastive model on the output of the reasoning model?
Results (dev)
Literal      66%
Compiled     69%
Reasoning    83%