Reasoning about pragmatics with neural listeners and speakers
Jacob Andreas and Dan Klein
The reference game
2
The reference game
3
The reference game
4
The one with the snake
The reference game
5
Mike is holding a baseball bat
The reference game
6
bat a is holding Mike baseball
The reference game
7
They are sitting by a picnic table
The reference game
8
There is a bat
The reference game
9
There is a bat
The reference game
10
Why do we care about this game?
Don’t you think it’s a little cold in here?
Do you know what time it is?
Some of the children played in the park.
Deriving pragmatics from reasoning
11
Mike is holding a baseball bat
12
Jenny is running from the snake
Deriving pragmatics from reasoning
13
Mike is holding a baseball bat
Deriving pragmatics from reasoning
How to win
14
DERIVED STRATEGY: Reason about listener beliefs
DIRECT STRATEGY: Imitate successful human play
There is a snake
There is a snake
There is a snake
?
How to win
15
[Mao et al. 2015]
[Kazemzadeh et al. 2014]
[Fitzgerald et al. 2013]
[Monroe and Potts 2015]
[Smith et al. 2013]
[Vogel et al. 2013]
[Golland et al. 2010]
DERIVED STRATEGY: Reason about listener beliefs
DIRECT STRATEGY: Imitate successful human play
How to win
16
PRO: domain representation “for free”
CON: past work needs targeted data
PRO: pragmatics “for free”
CON: past work needs hand-engineering
DERIVED STRATEGY: Reason about listener beliefs
DIRECT STRATEGY: Imitate successful human play
How to win
17
DERIVED STRATEGY: Reason about listener beliefs
DIRECT STRATEGY: Imitate successful human play
Learn base models for interpretation & generation without pragmatic context
Explicitly reason about base models to get novel behavior
Data
Abstract Scenes Dataset
1000 scenes
10k sentences
Feature representations
18
Approach
19
Literal speaker
Literal listener
Sampler
Reasoning speaker
A literal speaker (S0)
20
Mike is holding a baseball bat
A literal speaker (S0)
21
Referent encoder
Referent decoder
Mike is holding a baseball bat
Module architectures
22
[Diagram: the referent encoder is an FC layer over referent features; the referent decoder feeds the n-gram context word_{<n} and the encoded referent through FC and ReLU layers to a softmax over the next word word_{n+1}]
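For concreteness, here is a minimal PyTorch sketch of these two S0 modules as the diagram lays them out. This is an illustration, not the authors’ released code: the module names, layer sizes, and the flattened n-gram context are all assumptions.

```python
import torch
import torch.nn as nn

class ReferentEncoder(nn.Module):
    """Single FC layer over referent feature vectors (sizes assumed)."""
    def __init__(self, n_features, d_emb):
        super().__init__()
        self.fc = nn.Linear(n_features, d_emb)

    def forward(self, ref_features):
        return self.fc(ref_features)

class ReferentDecoder(nn.Module):
    """Feedforward LM: n-gram context + conditioning vector -> next word."""
    def __init__(self, vocab_size, d_word, context_size, d_cond, d_hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_word)
        self.fc1 = nn.Linear(context_size * d_word + d_cond, d_hidden)
        self.fc2 = nn.Linear(d_hidden, vocab_size)

    def forward(self, context_words, cond):
        # context_words: (batch, context_size) word ids; cond: (batch, d_cond)
        ctx = self.embed(context_words).flatten(start_dim=1)
        h = torch.relu(self.fc1(torch.cat([ctx, cond], dim=-1)))
        return torch.log_softmax(self.fc2(h), dim=-1)  # log p(word_{n+1} | ...)
```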
Training S0
23
Mike is holding a baseball bat
S0
A literal speaker (S0)
24
Mike is holding a baseball bat
The sun is in the sky
Jenny is standing next to Mike
A literal listener (L0)
25
Mike is holding a baseball bat
A literal listener (L0)
26
[Diagram: a description encoder and two referent encoders feed a scorer; for “Mike is holding a baseball bat” the scorer assigns 0.87 to the target and 0.13 to the distractor]
Module architectures
27
[Diagram: the description encoder is an FC layer over n-gram features of the sentence; the encoded description and referent are summed, passed through ReLU and FC layers, and a softmax yields the choice of referent]
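Continuing the sketch above under the same caveats, the L0 modules might look as follows; the sum-then-ReLU combination is read off the diagram, everything else is assumed.

```python
import torch
import torch.nn as nn

class DescriptionEncoder(nn.Module):
    """Single FC layer over n-gram features of the description."""
    def __init__(self, n_ngram_features, d_emb):
        super().__init__()
        self.fc = nn.Linear(n_ngram_features, d_emb)

    def forward(self, ngram_features):
        return self.fc(ngram_features)

class Scorer(nn.Module):
    """Sum description and referent encodings, ReLU, FC, softmax over choice."""
    def __init__(self, d_emb):
        super().__init__()
        self.fc = nn.Linear(d_emb, 1)

    def forward(self, desc_enc, ref_encs):
        # ref_encs: one encoding per candidate referent
        scores = [self.fc(torch.relu(desc_enc + r)) for r in ref_encs]
        return torch.softmax(torch.cat(scores, dim=-1), dim=-1)  # choice probs
```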
Training L0
28
Mike is holding a baseball bat
(random distractor)
0.87
A literal listener (L0)
29
L0
Mike is holding a baseball bat
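A sketch of the training step these slides depict: L0 sees the true referent next to a randomly sampled distractor and is trained to put probability mass on the truth. The interface reuses the illustrative encoder and scorer modules sketched earlier; all names are assumptions.

```python
def l0_training_step(desc_encoder, ref_encoder, scorer, optimizer,
                     desc_feats, true_ref_feats, distractor_feats):
    probs = scorer(desc_encoder(desc_feats),
                   [ref_encoder(true_ref_feats),     # slot 0: true referent
                    ref_encoder(distractor_feats)])  # slot 1: random distractor
    loss = -torch.log(probs[:, 0]).mean()  # cross-entropy on the true slot
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```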
A reasoning speaker (S1)
30
Mike is holding a baseball bat
?
A reasoning speaker (S1)
31
Literal speaker samples: “The sun is in the sky”, “Jenny is standing next to Mike”, “Mike is a baseball bat”
Literal listener scores: 0.9, 0.5, 0.7
A reasoning speaker (S1)
32
Literal speaker samples: “The sun is in the sky”, “Jenny is standing next to Mike”, “Mike is a baseball bat”
Literal listener scores: 0.9, 0.5, 0.7
Literal speaker probabilities: 0.05, 0.09, 0.08
A reasoning speaker (S1)
33
Literal speaker samples: “The sun is in the sky”, “Jenny is standing next to Mike”, “Mike is a baseball bat”
Combined scores, weighting the literal listener against the literal speaker:
0.9^(1−λ) × 0.05^λ
0.5^(1−λ) × 0.09^λ
0.7^(1−λ) × 0.08^λ
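Written out, the scoring rule this slide illustrates (the paper’s Equation 8) combines the two base models as a weighted product; the proportionality form below is reconstructed from the slide and the Figure 5 caption.

```latex
% lambda = 0: all weight on the literal listener L0 (discriminative, possibly disfluent)
% lambda = 1: all weight on the literal speaker S0 (fluent, possibly generic)
p_{S_1}(d \mid t) \;\propto\; p_{L_0}(t \mid d)^{1-\lambda}\, p_{S_0}(d \mid t)^{\lambda}
```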
Experiments
34
Baselines
• Literal: the S0 model by itself
• Contrastive: a conditional LM trained on both the target image and a random distractor [Mao et al. 2015]
35
Results (test)
Literal      64%
Contrastive  69%
Reasoning    81%
36
Accuracy and fluency
Figure 5: Tradeoff between speaker and listener models, controlled by the parameter λ in Equation 8. With λ = 0, all weight is placed on the literal listener, and the model produces highly discriminative but somewhat disfluent captions. With λ = 1, all weight is placed on the literal speaker, and the model produces fluent but generic captions.
4.1 How good are the base models?
To measure the performance of the base models, we draw 10 samples d_{jk} for a subset of 100 pairs (r_{1,j}, r_{2,j}) in the Dev-All set. We collect human fluency and accuracy judgments for each of the 1000 total samples. This allows us to conduct a post-hoc search over values of λ: for a range of λ, we compute the average accuracy and fluency of the highest-scoring sample. By varying λ, we can view the tradeoff between accuracy and fluency that results from interpolating between the listener and speaker model: setting λ = 0 gives samples from p_{L0}, and λ = 1 gives samples from p_{S0}.
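A small sketch of this post-hoc search, assuming the ten samples per pair are stored with their log-probabilities under both base models and their human judgments (the data layout here is invented for illustration):

```python
def sweep_lambda(pairs, lambdas):
    """pairs: list of pairs; each pair is a list of sample tuples
    (log_p_l0, log_p_s0, accuracy, fluency)."""
    results = {}
    for lam in lambdas:
        # highest-scoring sample per pair under the interpolated objective
        picks = [max(pair, key=lambda s: (1 - lam) * s[0] + lam * s[1])
                 for pair in pairs]
        results[lam] = (sum(p[2] for p in picks) / len(picks),  # mean accuracy
                        sum(p[3] for p in picks) / len(picks))  # mean fluency
    return results
```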
Figure 5 shows the resulting accuracy and fluency for various values of λ. It can be seen that relying entirely on the listener gives the highest accuracy but degraded fluency. However, by adding only a very small weight to the speaker model, it is possible to achieve near-perfect fluency without a substantial decrease in accuracy. Example sentences for an individual reference game are shown in Figure 5; increasing λ causes captions to become more generic. For the remaining experiments in this paper, we take λ = 0.02, finding that this gives excellent performance on both metrics.
On the development set, λ = 0.02 results in an average fluency of 4.8 (compared to 4.8 for the literal speaker, λ = 1). This high fluency can be confirmed by inspection of model samples (Figure 4). We thus focus on accuracy for the remainder of the evaluation.
4.2 How many samples are needed?
Next we turn to the computational efficiency of the reasoning model. As in all sampling-based inference, the number of samples that must be drawn from the proposal is of critical interest: if too many samples are needed, the model will be too slow to use in practice. Having fixed λ = 0.02 in the preceding section, we measure accuracy for versions of the reasoning model that draw 1, 10, 100, and 1000 samples. Results are shown in Table 1. We find that gains continue up to 100 samples.
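The reasoning speaker’s inference is plain sample-and-rescore, so the sample budget trades accuracy against speed. A minimal sketch, with the sampling and scoring functions left as assumed placeholders:

```python
import math

def reasoning_speaker(target, distractor, sample_caption, p_l0, p_s0,
                      n_samples=100, lam=0.02):
    """Draw captions from S0 and return the one maximizing the Eq. 8 score."""
    best, best_score = None, -math.inf
    for _ in range(n_samples):
        d = sample_caption(target)  # d ~ p_S0(. | target)
        score = ((1 - lam) * math.log(p_l0(target, distractor, d))
                 + lam * math.log(p_s0(d, target)))
        if score > best_score:
            best, best_score = d, score
    return best
```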
4.3 Is reasoning necessary?
Because they do not require complicated inference procedures, direct approaches to pragmatics typically enjoy better computational efficiency than derived ones. Having built an accurate derived speaker, can we bootstrap a more efficient direct speaker?
To explore this, we constructed a “compiled” speaker model as follows: given reference candidates r_1 and r_2 and target t, this model produces embeddings e_1 and e_2, concatenates them together into a “contrast embedding” [e_t, e_¬t], and then feeds this whole embedding into a string decoder module. Like S0, this model generates captions without the need for discriminative rescoring; unlike S0, the contrast embedding means this model can in principle learn to produce pragmatic captions, if given access to pragmatic training data. Since no such training data exists, we train the compiled model on the output of the reasoning model.
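A sketch of that compiled speaker, reusing the illustrative ReferentEncoder and ReferentDecoder from the earlier sketches; the concatenation order and interface are assumptions:

```python
class CompiledSpeaker(nn.Module):
    """Decode a caption directly from a target-first contrast embedding."""
    def __init__(self, ref_encoder, vocab_size, d_word, context_size,
                 d_emb, d_hidden):
        super().__init__()
        self.ref_encoder = ref_encoder
        # conditioning vector is the concatenation [e_t, e_not_t]
        self.decoder = ReferentDecoder(vocab_size, d_word, context_size,
                                       d_cond=2 * d_emb, d_hidden=d_hidden)

    def forward(self, target_feats, distractor_feats, context_words):
        e_t = self.ref_encoder(target_feats)
        e_not_t = self.ref_encoder(distractor_feats)
        contrast = torch.cat([e_t, e_not_t], dim=-1)
        return self.decoder(context_words, contrast)
```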
[Figure 5 examples, (a) target vs. (b) distractor: λ = 0.0 (prefer L0) “a hamburger on the ground”; λ = 0.1 “mike is holding the burger”; λ = 0.2 (prefer S0) “the airplane is in the sky”]
Figure 5: Captions for the same pair with varying λ. Changing λ alters both the naturalness and specificity of the output.
37
How many samples?
[Plot: accuracy (50–100%) as a function of the number of samples drawn (1, 10, 100, 1000)]
38
Examples
(a) the sun is in the sky [contrastive]
(b) mike is wearing a chef’s hat [non-contrastive]
(c) the dog is standing beside jenny [contrastive]
(d) the plane is flying in the sky [contrastive]
Figure 4: Four randomly-chosen samples from our model. For each, the target image is shown on the left, the distractor image is shown on the right, and the description generated by the model is shown below. All descriptions are fluent, and generally succeed in uniquely identifying the target scene, even when they do not perfectly describe it (e.g. (c)). These samples are broadly representative of the model’s performance (Table 2).
                Dev acc. (%)    Test acc. (%)
Model           All    Hard     All    Hard
Literal (S0)    66     54       64     53
Contrastive     71     54       69     58
Reasoning (S1)  83     73       81     68

Table 2: Success rates at RG on abstract scenes. “Literal” is a captioning baseline corresponding to the base speaker S0. “Contrastive” is a reimplementation of the approach of Mao et al. (2015). “Reasoning” is the model from this paper. All differences between our model and baselines are significant (p < 0.05, Binomial).
of the base model (it improves noticeably over S0 on scenes with 2–3 differences), the overall gain is negligible (the difference in mean scores is not significant). The compiled model significantly underperforms the reasoning model. These results suggest either that the reasoning procedure is not easily approximated by a shallow neural network, or that example descriptions of randomly-sampled training pairs (which are usually easy to discriminate) do not provide a strong enough signal for a reflex learner to recover pragmatic behavior.
4.4 Final evaluation
Based on the preceding sections, we keep λ = 0.02 and use 100 samples to generate predictions.
                Accuracy (%) by # of differences
Model           1      2      3      4      Mean
Literal (S0)    50     66     70     78     66
Reasoning       64     86     88     94     83
Compiled (S1)   44     72     80     80     69

Table 3: Comparison of the “compiled” pragmatic speaker model with literal and explicitly reasoning speakers. The models are evaluated on subsets of the development set, arranged by difficulty: column headings indicate the number of differences between the target and distractor scenes.
We evaluate on the test set, comparing this Reasoning model S1 to two baselines: Literal, an image captioning model trained normally on the abstract scene captions (corresponding to our S0), and Contrastive, a model trained with a soft contrastive objective, and previously used for visual referring expression generation (Mao et al., 2015).
Results are shown in Table 2. Our reasoning model outperforms both the literal baseline and previous work by a substantial margin, achieving an improvement of 17% on all pairs and 15% on hard pairs.² Figures 4 and 6 show various representative examples.
² For comparison, a model with hand-engineered pragmatic behavior (trained using a feature representation with indicators on only those objects that appear in the target image but not the distractor) produces an accuracy of 78% and 69% on all and hard pairs, respectively.
39
Conclusions
• Standard neural kit of parts for base models
• Probabilistic reasoning for high-level goals
• A little bit of structure goes a long way!
42
Thank you!
“Compiling” the reasoning model
What if we train the contrastive model on the output of the reasoning model?
Results (dev)
Literal      66%
Compiled     69%
Reasoning    83%