Grounded Language Learning Models for Ambiguous Supervision
Joohyun Kim
Supervising Professor: Raymond J. Mooney
Ph.D. Thesis Defense Talk, August 23, 2013



Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010) – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012) – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion


Language Grounding
• The process of acquiring the semantics of natural language with respect to relevant perceptual contexts
• Human children ground language in perceptual contexts via repeated exposure, in a statistical way (Saffran et al. 1999; Saffran 2003)
• Ideally, we want a computational system to learn in a similar way

Language Grounding Machine
"Iran's goalkeeper blocks the ball"  →  Machine  →  Block(IranGoalKeeper)
(Diagram: the machine maps the NL sentence to its meaning representation, bridging computer vision and language learning.)

Natural Language and Meaning Representation
• Natural Language (NL): a language that arises naturally from the innate nature of the human intellect, such as English, German, French, Korean, etc.
• Meaning Representation Language (MRL): a formal language that a machine can understand, such as logic or any computer-executable code
Example: "Iran's goalkeeper blocks the ball" (NL)  ↔  Block(IranGoalKeeper) (MRL)

Semantic Parsing and Surface Realization
• Semantic Parsing (NL → MRL): maps a natural-language sentence to a full, detailed semantic representation → the machine understands natural language
• Surface Realization (MRL → NL): generates a natural-language sentence from a meaning representation → the machine communicates in natural language
Example: "Iran's goalkeeper blocks the ball"  ↔  Block(IranGoalKeeper)

Conventional Language Learning Systems
• Require manually annotated corpora
• Time-consuming, hard to acquire, and not scalable
(Pipeline: manually annotated training corpora of NL/MRL pairs → semantic parser learner → semantic parser mapping NL to MRL)

Learning from Perceptual Environment
• Motivated by how children learn language in a rich, ambiguous perceptual environment accompanied by linguistic input
• Advantages:
– Naturally obtainable corpora
– Relatively easy to annotate
– Motivated by the natural process of human language learning

Navigation Example (slides from David Chen)
• Alice gives Bob an instruction in an unfamiliar language: "식당에서 우회전 하세요" ("Turn right at the restaurant"); Bob observes the action she intends.
• In a second scenario she says: "병원에서 우회전 하세요" ("Turn right at the hospital").
• Both scenarios involve making a right turn, so the phrase shared across scenarios can be grounded to "make a right turn"; the differing words 식당 ("restaurant") and 병원 ("hospital") can then be grounded to the respective landmarks.

Thesis Contributions
• Generative models for grounded language learning from an ambiguous perceptual environment
– A unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
– A general framework of probabilistic approaches that learn NL–MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
– Standard reranking is not applicable: there is no single gold-standard reference for the training data
– A weak response from the perceptual environment can train a discriminative reranker

Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010) – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012) – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Navigation Task (Chen and Mooney 2011)
• Learn to interpret and follow navigation instructions
– e.g., "Go down this hall and make a right when you see an elevator to your left."
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

Sample Environment (MacMahon et al. 2006)
(Map figure; legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair)

Executing Test Instruction
(demo)

Task Objective
• Learn the underlying meanings of instructions by observing human actions for those instructions
– Learn to map instructions (NL) into the correct formal plan of actions (MR)
• Learn from high ambiguity
– Training input: pairs of an NL instruction and a landmarks plan (Chen and Mooney 2011)
– The landmarks plan:
  · describes actions in the environment along with notable objects encountered on the way
  · overestimates the meaning of the instruction, including unnecessary details
  · only a subset of the plan is relevant to the instruction

Challenges
Instruction: "at the easel go left and then take a right onto the blue path at the corner"
Landmarks plan: Travel(steps: 1), Verify(at: EASEL, side: CONCRETE HALLWAY), Turn(LEFT), Verify(front: CONCRETE HALLWAY), Travel(steps: 1), Verify(side: BLUE HALLWAY, front: WALL), Turn(RIGHT), Verify(back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL)
Only a subset of the landmarks plan is the correct plan for the instruction, and there is an exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan.

Previous Work (Chen and Mooney 2011)
• Circumvents the combinatorial NL–MR correspondence problem
– Constructs supervised NL–MR training data by refining the landmarks plan with a learned semantic lexicon
  · Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
– Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
– Loses information during refinement:
  · Deterministically selects high-score lexemes
  · Ignores possibly useful low-score lexemes
  · Some relevant MR components are never considered at all

Proposed Solution (Kim and Mooney 2012)
• Learn a probabilistic semantic parser directly from the ambiguous training data
– Disambiguates the input and learns to map NL instructions to formal MR plans jointly
– Uses the semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL–MR correspondences
– Transforms the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals

System Diagram (Chen and Mooney 2011)
Learning system for parsing navigation instructions.
Training: (instruction, observed world state, action trace) → Navigation Plan Constructor → landmarks plan → plan refinement (possible information loss) → supervised refined plan → (supervised) semantic parser learner → semantic parser
Testing: (instruction, world state) → semantic parser → execution module (MARCO) → action trace

System Diagram of Proposed Solution
Learning system for parsing navigation instructions.
Training: (instruction, observed world state, action trace) → Navigation Plan Constructor → landmarks plan → probabilistic semantic parser learner (from ambiguous supervision) → semantic parser
Testing: (instruction, world state) → semantic parser → execution module (MARCO) → action trace

PCFG Induction Model for Grounded Language Learning (Börschinger et al. 2011)
• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)
• Limitations of Börschinger et al. (2011):
– Only works in low-ambiguity settings: one NL paired with a handful of MRs (on the order of 10s)
– Can only output MRs included in the PCFG constructed from the training data
• Proposed model:
– Uses semantic lexemes as the units of semantic concepts
– Disambiguates NL–MR correspondences at the semantic-concept (lexeme) level
– Handles a much higher level of ambiguous supervision
– Outputs novel MRs that never appear in the PCFG, by composing the MR parse from semantic lexeme MRs

Semantic Lexicon (Chen and Mooney 2011)
• A pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
– how probable subgraph g is given that phrase w is observed
• Examples:
– "to the stool": Travel(), Verify(at: BARSTOOL)
– "black easel": Verify(at: EASEL)
– "turn left and walk": Turn(), Travel()
• Scoring balances the co-occurrence of g and w against the general occurrence of g without w
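The lexeme scoring idea on this slide — prefer subgraphs g that co-occur with phrase w more often than they occur without w — can be sketched as follows. This is an illustrative sketch, not the exact GILL/SGOLL score; the `lexicon_scores` helper and the score form p(g|w) − p(g|¬w) are assumptions.

```python
from collections import Counter

def lexicon_scores(pairs):
    """Score (phrase w, MR subgraph g) pairs by how much more often g
    co-occurs with w than it occurs without w.
    `pairs`: list of (set_of_phrases, set_of_subgraphs) training examples.
    Illustrative scoring: score(g, w) = p(g | w) - p(g | not w)."""
    n = len(pairs)
    with_w, with_g, both = Counter(), Counter(), Counter()
    for phrases, graphs in pairs:
        for w in phrases:
            with_w[w] += 1
        for g in graphs:
            with_g[g] += 1
        for w in phrases:
            for g in graphs:
                both[(w, g)] += 1
    scores = {}
    for (w, g), c in both.items():
        p_g_given_w = c / with_w[w]
        without_w = n - with_w[w]
        p_g_without_w = (with_g[g] - c) / without_w if without_w else 0.0
        scores[(w, g)] = p_g_given_w - p_g_without_w
    return scores

pairs = [
    ({"to the stool"}, {"Travel()", "Verify(at BARSTOOL)"}),
    ({"to the stool"}, {"Travel()", "Verify(at BARSTOOL)", "Turn()"}),
    ({"turn left"}, {"Turn()"}),
]
scores = lexicon_scores(pairs)
# "Verify(at BARSTOOL)" always co-occurs with "to the stool" and never
# appears without it, so that pair gets the maximal score of 1.0.
```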

Lexeme Hierarchy Graph (LHG)
• A hierarchy of semantic lexemes, organized by the subgraph relationship and constructed for each training example
– Lexeme MRs = semantic concepts
– Lexeme hierarchy = semantic concept hierarchy
– Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

(Example LHG: the top node holds the full landmarks-plan MR — Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) — and descendant nodes hold progressively smaller lexeme MRs, such as Turn(RIGHT) Verify(side: HATRACK), Travel() Verify(at: EASEL), Turn() Verify(side: HATRACK), linked by the subgraph relation.)

PCFG Construction
• Add rules for each node in the LHG
– Each complex concept chooses which subconcepts to describe; these will finally be connected to the NL instruction
  · Each node generates all k-permutations of its child nodes, since we do not know which subset is correct
– NL words are generated from lexeme nodes by a unigram Markov process (Börschinger et al. 2011)
– PCFG rule weights are optimized by EM; the most probable MR components out of all possible combinations are estimated
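The k-permutation rule expansion above can be sketched as follows; `kperm_rules` is a hypothetical helper, and rules are shown as plain (parent, children) pairs rather than the model's actual grammar objects.

```python
from itertools import permutations

def kperm_rules(parent, children):
    """Enumerate PCFG rules expanding `parent` into every non-empty
    k-permutation of its child concepts (k = 1 .. len(children)),
    since we do not know which subset of subconcepts the NL describes."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, list(perm)))
    return rules

rules = kperm_rules("Turn(LEFT),Travel()", ["Turn(LEFT)", "Travel()"])
# 2 one-child permutations + 2 two-child permutations = 4 rules
```

With n children this emits a sum over k of P(n, k) rules, which is exactly the blow-up that motivates the simpler Unigram Generation model later in the talk.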

PCFG Construction (illustration)
• Child concepts are generated selectively from parent concepts
• All semantic concepts generate their relevant NL words
• Each semantic concept generates at least one NL word

Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside–Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
– Consider only the lexeme MRs responsible for generating NL words
– From the bottom of the tree, mark only the responsible MR components, which propagate to the top level
– Able to compose novel MRs never seen in the training data
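The CKY step above can be sketched for a toy grammar in Chomsky normal form; the rule and chart encodings here are illustrative assumptions, not the thesis implementation.

```python
def cky_best_parse(words, lexical, binary):
    """Most probable parse of `words` under a PCFG in Chomsky normal form.
    `lexical`: {(A, word): prob}; `binary`: {(A, B, C): prob} for A -> B C.
    Returns a chart {(i, j, A): (prob, backpointer)}."""
    n = len(words)
    chart = {}
    for i, w in enumerate(words):                     # fill the diagonal
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1, A)] = (p, w)
    for span in range(2, n + 1):                      # grow spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # split point
                for (A, B, C), p in binary.items():
                    if (i, k, B) in chart and (k, j, C) in chart:
                        prob = p * chart[(i, k, B)][0] * chart[(k, j, C)][0]
                        if prob > chart.get((i, j, A), (0.0, None))[0]:
                            chart[(i, j, A)] = (prob, (k, B, C))
    return chart

# Toy grammar: Plan -> TurnL Travel; lexeme nonterminals emit single words.
lexical = {("TurnL", "left"): 1.0, ("Travel", "walk"): 1.0}
binary = {("Plan", "TurnL", "Travel"): 1.0}
chart = cky_best_parse(["left", "walk"], lexical, binary)
```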

(Example: the most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner". Starting from the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT), only the lexeme MRs responsible for generating NL words — e.g. Turn(LEFT), Travel(), Verify(at: SOFA), Turn() — are marked from the bottom of the tree upward and composed into the final MR plan.)

Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model:
– Complexity caused by the Lexeme Hierarchy Graph and the k-permutation rules
– Tends to over-fit the training data
• Proposed solution: a simpler model
– Generates the relevant semantic lexemes one by one
– No extra PCFG rules for k-permutations
– Maintains a simpler PCFG rule set; faster to train

PCFG Construction
• Unigram Markov generation of relevant lexemes
– Each context MR generates its relevant lexemes one by one
– Permutations of the order in which relevant lexemes appear are thereby already accounted for
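The unigram Markov generation of lexemes can be sketched as a rule template; `unigram_rules` and the CTX/continue-or-stop encoding are illustrative assumptions.

```python
def unigram_rules(context, lexemes):
    """PCFG rules for unigram Markov generation of relevant lexemes:
    the context MR repeatedly emits one lexeme and either continues or
    stops, so every ordering is covered without k-permutation rules."""
    rules = []
    for lex in lexemes:
        rules.append((context, [lex, context]))  # emit lexeme, continue
        rules.append((context, [lex]))           # emit lexeme, stop
    return rules

lexemes = ["Turn(LEFT)", "Travel()", "Verify(at SOFA)"]
uni = unigram_rules("CTX", lexemes)  # 2 rules per lexeme -> 6 rules
```

For n relevant lexemes this yields only 2n rules, versus the factorial growth of the k-permutation expansion in the hierarchy model — which is the grammar-size and training-time advantage reported later.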

PCFG Construction (illustration)
• Each semantic concept is generated by a unigram Markov process
• All semantic concepts generate their relevant NL words

Parsing New NL Sentences
• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Composes the final MR parse from the lexeme MRs appearing in the parse tree
– Considers only the lexeme MRs responsible for generating NL words
– Marks the relevant lexeme MR components in the context MR appearing at the top nonterminal

(Example: the most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner". The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes — e.g. Turn(LEFT), Travel() Verify(at: SOFA), Turn() — one by one; the corresponding components are then marked in the context MR at the top nonterminal to form the final plan.)

Data
• 3 maps, 6 instructors, 1–15 followers per direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
– Word-segmented version by the Stanford Chinese Word Segmenter
– Character-segmented version
Example paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."
Each paragraph is split into single sentences, each paired with its segment of the observed action trace (action sequences such as Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward).

Data Statistics

                                  Paragraph       Single-Sentence
  Instructions                    706             3236
  Avg. # sentences                5.0 (±2.8)      1.0 (±0)
  Avg. # actions                  10.4 (±5.7)     2.1 (±2.4)
  Avg. # words / sentence
    English                       37.6 (±21.1)    7.8 (±5.1)
    Chinese (Word)                31.6 (±18.1)    6.9 (±4.9)
    Chinese (Character)           48.9 (±28.3)    10.6 (±7.3)
  Vocabulary
    English                       660             629
    Chinese (Word)                661             508
    Chinese (Character)           448             328

Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
– The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon-learning algorithms:
  · Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
  · Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
– Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

Parse Accuracy
• Evaluate how well the learned semantic parsers parse novel sentences in the test data
• Metric: partial parse accuracy (precision, recall, and F1)
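A partial parse-accuracy metric of the kind reported on the following slides can be sketched as precision/recall/F1 over matched MR components; the multiset-intersection matching used here is a simplifying assumption, not the exact matching of the thesis.

```python
from collections import Counter

def partial_parse_accuracy(gold_plans, predicted_plans):
    """Precision/recall/F1 over MR components, with matching simplified
    to multiset intersection of component strings per example."""
    tp = pred_total = gold_total = 0
    for gold, pred in zip(gold_plans, predicted_plans):
        g, p = Counter(gold), Counter(pred)
        tp += sum((g & p).values())          # matched components
        gold_total += sum(g.values())
        pred_total += sum(p.values())
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = partial_parse_accuracy(
    [["Turn(LEFT)", "Travel()", "Verify(at SOFA)"]],
    [["Turn(LEFT)", "Travel()"]],
)
# All predicted components are correct (precision 1.0) but one gold
# component is missed (recall 2/3).
```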

Parse Accuracy (English)

  System                            Precision   Recall   F1
  Chen & Mooney (2011)              90.16       55.41    68.59
  Chen (2012)                       88.36       57.03    69.31
  Hierarchy Generation PCFG Model   87.58       65.41    74.81
  Unigram Generation PCFG Model     86.1        68.79    76.44

Parse Accuracy (Chinese-Word)

  System                            Precision   Recall   F1
  Chen (2012)                       88.87       58.76    70.74
  Hierarchy Generation PCFG Model   80.56       71.14    75.53
  Unigram Generation PCFG Model     79.45       73.66    76.41

Parse Accuracy (Chinese-Character)

  System                            Precision   Recall   F1
  Chen (2012)                       92.48       56.47    70.01
  Hierarchy Generation PCFG Model   79.77       67.38    73.05
  Unigram Generation PCFG Model     79.73       75.52    77.55

End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
– Facing direction is also considered in the single-sentence case
– Paragraph execution is affected by even one failed single-sentence execution

End-to-End Execution Evaluations (English)

  System                            Single-Sentence   Paragraph
  Chen & Mooney (2011)              54.4              16.18
  Chen (2012)                       57.28             19.18
  Hierarchy Generation PCFG Model   57.22             20.17
  Unigram Generation PCFG Model     67.14             28.12

End-to-End Execution Evaluations (Chinese-Word)

  System                            Single-Sentence   Paragraph
  Chen (2012)                       58.7              20.13
  Hierarchy Generation PCFG Model   61.03             19.08
  Unigram Generation PCFG Model     63.4              23.12

End-to-End Execution Evaluations (Chinese-Character)

  System                            Single-Sentence   Paragraph
  Chen (2012)                       57.27             16.73
  Hierarchy Generation PCFG Model   55.61             12.74
  Unigram Generation PCFG Model     62.85             23.33

Discussion
• Better recall in parse accuracy
– Our probabilistic model also uses useful but low-score lexemes → more coverage
– The unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits the training data
– Complexity: the LHG and the k-permutation rules
– Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
– Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
– Overcomes intractability with complex MRLs
– Learns from more general, complex ambiguity
– Produces novel MR parses never seen during training

Comparison of Grammar Size and EM Training Time

                        Hierarchy Generation PCFG   Unigram Generation PCFG
  Data                  |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
  English               20,451      17.26           16,357      8.78
  Chinese (Word)        21,636      15.99           15,459      8.05
  Chinese (Character)   19,792      18.64           13,514      12.58

Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010) – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012) – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Discriminative Reranking
• An effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks:
– Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
– Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
– Part-of-speech tagging (Collins, EMNLP 2002)
– Semantic role labeling (Toutanova et al., ACL 2005)
– Named entity recognition (Collins, ACL 2002)
– Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
– Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

Discriminative Reranking
• With the generative model alone, the trained model outputs the 1-best candidate with maximum probability
(Testing example → trained generative model → 1-best candidate with maximum probability)

Discriminative Reranking
• Can we do better?
– A secondary discriminative model picks the best out of the n-best candidates from the baseline model
(Testing example → trained baseline generative model → GEN: n-best candidates, Candidate 1 … Candidate n → trained secondary discriminative model → best prediction as output)

How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
– There is no single gold-standard reference for each training example
– Instead, only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
– Evaluate candidate formal MRs by executing them in simulated worlds (as used in evaluating the final end-task plan execution)
– A weak indication of whether a candidate is good or bad
– Multiple candidate parses are used for the parameter update: the response signal is weak and distributed over all candidates

Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter (weight) vector is updated whenever the trained model predicts a wrong candidate
(Diagram: a training example yields n-best candidates with feature vectors a_1 … a_n and perceptron scores, e.g. −0.16, 1.21, −1.09, 1.46, 0.59; the best prediction is compared against the gold-standard reference with feature vector a_g, and the weights are updated by the feature-vector difference a_g − a_4.)
• For our generative models, however, no gold-standard reference is available.
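The averaged perceptron update on this slide can be sketched as follows; the dense feature lists and the tiny training set are illustrative assumptions.

```python
def rerank_train(examples, n_feats, epochs=5):
    """Averaged-perceptron reranker (Collins 2000) over n-best lists.
    Each example is (candidate_feature_vectors, gold_index)."""
    w = [0.0] * n_feats
    total = [0.0] * n_feats
    steps = 0
    for _ in range(epochs):
        for feats, gold in examples:
            scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]
            pred = max(range(len(feats)), key=scores.__getitem__)
            if pred != gold:  # update only on mistakes
                for i in range(n_feats):
                    w[i] += feats[gold][i] - feats[pred][i]
            steps += 1
            for i in range(n_feats):  # accumulate for averaging
                total[i] += w[i]
    return [t / steps for t in total]  # averaged weights

# One training example with two candidates; candidate 1 is the reference.
examples = [([[1.0, 0.0], [0.0, 1.0]], 1)]
w_avg = rerank_train(examples, n_feats=2)
```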

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
– The one most preferred in terms of plan execution
– Evaluate the MR plans composed from the candidate parses
– The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
– Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic: average over 10 trials)
– Prefer the candidate with the best execution success rate during training

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
(Diagram: the derived MRs MR_1 … MR_n of the n-best candidates are run through the MARCO execution module, yielding execution success rates such as 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with the highest rate (0.9) serves as the pseudo-gold reference, the perceptron's best prediction is compared against it, and the weights are updated by the feature-vector difference.)

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could also be useful
– Multiple parses may share the same maximum execution success rate
– "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or have ignorable details attached, and are sometimes inaccurate but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
– Use every candidate with a higher execution success rate than the currently best-predicted candidate
– Update with the feature-vector difference, weighted by the difference between execution success rates
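The multiple-parse update can be sketched as below: each candidate whose execution success rate beats the predicted candidate's rate contributes a feature-vector difference weighted by the rate gap. The exact weighting scheme here is an assumption for illustration.

```python
def multi_parse_update(w, feats, success_rates, lr=1.0):
    """Response-based update with multiple parses: every candidate with a
    higher execution success rate than the currently predicted candidate
    contributes, weighted by its success-rate margin over the prediction."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]
    pred = max(range(len(feats)), key=scores.__getitem__)
    for j, rate in enumerate(success_rates):
        gap = rate - success_rates[pred]
        if gap > 0:  # candidate j is preferred by the world's response
            for i in range(len(w)):
                w[i] += lr * gap * (feats[j][i] - feats[pred][i])
    return w

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rates = [0.4, 0.9, 0.6]          # execution success rates per candidate
w = multi_parse_update([1.0, 0.0], feats, rates)
# The prediction (candidate 0, rate 0.4) is pulled toward candidates 1
# and 2, each weighted by its rate margin (0.5 and 0.2).
```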

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
(Diagram, two update steps: with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and the current prediction being the candidate with rate 0.4, both higher-rated candidates — rates 0.6 and 0.9 — contribute feature-vector differences, each weighted by its success-rate margin over the prediction.)

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)
Example for "Turn left and find the sofa then turn around the corner":
  L1: Turn(LEFT) Verify(front: SOFA, back: EASEL) Travel(steps: 2) Verify(at: SOFA) Turn(RIGHT)
  L2: Turn(LEFT) Verify(front: SOFA)      L3: Travel(steps: 2) Verify(at: SOFA) Turn(RIGHT)
  L4: Turn(LEFT)      L5: Travel() Verify(at: SOFA)      L6: Turn()
  f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5 → "find") = 1

Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy; plan execution accuracy (end goal)
• Compared with the two baseline models
– The Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– We try to obtain 50 distinct composed MR plans (and their parses) out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so we generate a sufficiently large 1,000,000-best parse list from the baseline model

Response-based Update vs. Baseline (English)

  Metric                      Hierarchy (Baseline / Response-based)   Unigram (Baseline / Response-based)
  Parse F1                    74.81 / 73.32                           76.44 / 77.24
  Single-sentence execution   57.22 / 59.65                           67.14 / 68.27
  Paragraph execution         20.17 / 22.62                           28.12 / 29.2

Response-based Update vs. Baseline (Chinese-Word)

  Metric                      Hierarchy (Baseline / Response-based)   Unigram (Baseline / Response-based)
  Parse F1                    75.53 / 77.26                           76.41 / 77.74
  Single-sentence execution   61.03 / 64.12                           63.4 / 65.64
  Paragraph execution         19.08 / 21.29                           23.12 / 23.74

Response-based Update vs. Baseline (Chinese-Character)

  Metric                      Hierarchy (Baseline / Response-based)   Unigram (Baseline / Response-based)
  Parse F1                    73.05 / 76.26                           77.55 / 79.76
  Single-sentence execution   55.61 / 64.08                           62.85 / 65.5
  Paragraph execution         12.74 / 22.25                           23.33 / 25.35

Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

Response-based Update with Multiple vs. Single Parses (English)

  Metric                      Hierarchy (Single / Multi)   Unigram (Single / Multi)
  Parse F1                    73.32 / 73.43                77.24 / 77.81
  Single-sentence execution   59.65 / 62.81                68.27 / 68.93
  Paragraph execution         22.62 / 26.57                29.2 / 29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

  Metric                      Hierarchy (Single / Multi)   Unigram (Single / Multi)
  Parse F1                    77.26 / 78.8                 77.74 / 78.11
  Single-sentence execution   64.12 / 64.15                65.64 / 66.27
  Paragraph execution         21.29 / 21.55                23.74 / 25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

  Metric                      Hierarchy (Single / Multi)   Unigram (Single / Multi)
  Parse F1                    76.26 / 79.44                79.76 / 79.94
  Single-sentence execution   64.08 / 64.08                65.5 / 66.84
  Paragraph execution         22.25 / 22.58                25.35 / 27.16

Response-based Update with Multiple vs. Single Parses
• Using multiple parses generally improves performance
– A single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, that still capture the gist of the preferred actions
– A variety of preferable parses improves both the amount and the quality of the weak feedback

Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010) – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012) – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Future Directions
• Integrating syntactic components: learn a joint model of syntactic and semantic structure
• Large-scale data: data collection and model adaptation to large-scale settings
• Machine translation: application to summarized translation
• Real perceptual data: learning with raw features (sensory and vision data)

Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010) – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012) – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

Thank You

Page 2: Grounded Language Learning Models for Ambiguous  Supervision

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

2

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

3

Language Grounding

bull The process to acquire the semantics of natural language with respect to relevant perceptual contexts

bull Human child grounds language to perceptual contexts via repetitive exposure in statistical way (Saffran et al 1999 Saffran 2003)

bull Ideally we want computational system to learn from the similar way

4

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

5

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

Machine 6

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

7

Computer VisionLanguage Learning

Natural Language and Meaning Representation

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

8

Natural Language and Meaning Representation

Natural Language (NL)

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

9

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Natural Language and Meaning Representation

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

MRL Formal languages that machine can understand such as logic or any computer-executable code

Meaning Representation Language (MRL)Natural Language (NL)

10

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

MRL

Semantic Parsing (NL MRL)

11

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language

MRL

Semantic Parsing (NL MRL)

Surface Realization (NL MRL)

12

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable

Manually Annotated Training Corpora(NLMRL pairs)

Semantic Parser

MRLNL

Semantic Parser Learner

13

Learning from Perceptual Environment

bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input

bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language

learning

14

Navigation Example

식당에서 우회전 하세요Alice

Bob15Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요

16Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable because it requires annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 3: Grounded Language Learning Models for Ambiguous  Supervision

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

3

Language Grounding

bull The process to acquire the semantics of natural language with respect to relevant perceptual contexts

bull Human child grounds language to perceptual contexts via repetitive exposure in statistical way (Saffran et al 1999 Saffran 2003)

bull Ideally we want computational system to learn from the similar way

4

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

5

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

Machine 6

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

7

Computer VisionLanguage Learning

Natural Language and Meaning Representation

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

8

Natural Language and Meaning Representation

Natural Language (NL)

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

9

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Natural Language and Meaning Representation

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

MRL Formal languages that machine can understand such as logic or any computer-executable code

Meaning Representation Language (MRL)Natural Language (NL)

10

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

MRL

Semantic Parsing (NL MRL)

11

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language

MRL

Semantic Parsing (NL MRL)

Surface Realization (NL MRL)

12

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable

Manually Annotated Training Corpora(NLMRL pairs)

Semantic Parser

MRLNL

Semantic Parser Learner

13

Learning from Perceptual Environment

bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input

bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language

learning

14

Navigation Example

식당에서 우회전 하세요Alice

Bob15Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요

16Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88
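The multi-parse variant above can be sketched similarly. This is a hypothetical Python sketch, not the thesis code: it updates with every candidate whose execution success rate beats the currently predicted parse, scaling each feature-vector difference by the success-rate gap (the feature names are invented for illustration):

```python
# Hedged sketch: weight update with multiple preferable parses.
# Assumption: candidates are {"features": dict, "success": float} with
# "success" the MARCO execution success rate of the derived MR plan.

def dot(w, feats):
    """Perceptron score of a feature vector under weights w."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def multi_parse_update(w, candidates, lr=1.0):
    """Update with every candidate whose execution success rate is higher
    than the currently predicted parse, weighting each feature-vector
    difference by the gap in success rates."""
    predicted = max(candidates, key=lambda c: dot(w, c["features"]))
    for cand in candidates:
        gap = cand["success"] - predicted["success"]
        if gap <= 0:
            continue  # not preferable to the current prediction
        for f, v in cand["features"].items():
            w[f] = w.get(f, 0.0) + lr * gap * v
        for f, v in predicted["features"].items():
            w[f] = w.get(f, 0.0) - lr * gap * v
    return w

candidates = [
    {"features": {"a": 1.0}, "success": 0.9},
    {"features": {"b": 1.0}, "success": 0.6},
    {"features": {"c": 1.0}, "success": 0.2},
]
w = multi_parse_update({"c": 1.0}, candidates)
```

With a single pseudo-gold this reduces to one perceptron update; using all preferable parses spreads the weak execution feedback over more features, which matches the improvements reported above.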

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation for large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable because it requires manually annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models that learn NL-MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 4: Grounded Language Learning Models for Ambiguous  Supervision

Language Grounding

bull The process to acquire the semantics of natural language with respect to relevant perceptual contexts

bull Human child grounds language to perceptual contexts via repetitive exposure in statistical way (Saffran et al 1999 Saffran 2003)

bull Ideally we want computational system to learn from the similar way

4

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

5

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

Machine 6

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

7

Computer VisionLanguage Learning

Natural Language and Meaning Representation

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

8

Natural Language and Meaning Representation

Natural Language (NL)

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

9

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Natural Language and Meaning Representation

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

MRL Formal languages that machine can understand such as logic or any computer-executable code

Meaning Representation Language (MRL)Natural Language (NL)

10

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

MRL

Semantic Parsing (NL MRL)

11

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language

MRL

Semantic Parsing (NL MRL)

Surface Realization (NL MRL)

12

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable

Manually Annotated Training Corpora(NLMRL pairs)

Semantic Parser

MRLNL

Semantic Parser Learner

13

Learning from Perceptual Environment

bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input

bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language

learning

14

Navigation Example

식당에서 우회전 하세요Alice

Bob15Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요

16Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

                      Hierarchy   Unigram
Parse F1
  Single                  73.32     77.24
  Multi                   73.43     77.81
Single-sentence
  Single                  59.65     68.27
  Multi                   62.81     68.93
Paragraph
  Single                  22.62     29.2
  Multi                   26.57     29.1

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

                      Hierarchy   Unigram
Parse F1
  Single                  77.26     77.74
  Multi                   78.8      78.11
Single-sentence
  Single                  64.12     65.64
  Multi                   64.15     66.27
Paragraph
  Single                  21.29     23.74
  Multi                   21.55     25.95

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

                      Hierarchy   Unigram
Parse F1
  Single                  76.26     79.76
  Multi                   79.44     79.94
Single-sentence
  Single                  64.08     65.5
  Multi                   64.08     66.84
Paragraph
  Single                  22.25     25.35
  Multi                   22.58     27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, yet still capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88
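The multiple-parse variant can be sketched the same way: every candidate whose execution success rate beats the current prediction's contributes an update, weighted by the difference in success rates. A sketch under the same hypothetical `feats`/`exec_rate` helper assumptions as above:

```python
def multi_parse_update(weights, candidates, feats, exec_rate):
    """Perceptron step using all candidates that execute better than
    the current prediction, each weighted by its success-rate gap."""
    def score(c):
        return sum(weights.get(f, 0.0) * v for f, v in feats(c).items())
    predicted = max(candidates, key=score)
    base = exec_rate(predicted)
    for cand in candidates:
        gap = exec_rate(cand) - base
        if gap > 0:                       # preferable parse: weak positive feedback
            for f, v in feats(cand).items():
                weights[f] = weights.get(f, 0.0) + gap * v
            for f, v in feats(predicted).items():
                weights[f] = weights.get(f, 0.0) - gap * v
    return weights

# toy run: two candidates beat the prediction, with different gaps
w = multi_parse_update(
    {}, ["c1", "c2", "c3"],
    {"c1": {"a": 1.0}, "c2": {"b": 1.0}, "c3": {"c": 1.0}}.get,
    {"c1": 0.2, "c2": 0.6, "c3": 0.9}.get)
assert abs(w["b"] - 0.4) < 1e-9 and abs(w["c"] - 0.7) < 1e-9
assert abs(w["a"] + 1.1) < 1e-9
```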

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection, model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 5: Grounded Language Learning Models for Ambiguous  Supervision

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

5

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

Machine 6

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

7

Computer VisionLanguage Learning

Natural Language and Meaning Representation

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

8

Natural Language and Meaning Representation

Natural Language (NL)

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

9

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Natural Language and Meaning Representation

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

MRL Formal languages that machine can understand such as logic or any computer-executable code

Meaning Representation Language (MRL)Natural Language (NL)

10

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

MRL

Semantic Parsing (NL MRL)

11

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language

MRL

Semantic Parsing (NL MRL)

Surface Realization (NL MRL)

12

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable

Manually Annotated Training Corpora(NLMRL pairs)

Semantic Parser

MRLNL

Semantic Parser Learner

13

Learning from Perceptual Environment

bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input

bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language

learning

14

Navigation Example

식당에서 우회전 하세요Alice

Bob15Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요

16Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs. Baseline
• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                 Parse F1             Single-sentence      Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single             73.32     77.24      59.65     68.27      22.62     29.20
Multi              73.43     77.81      62.81     68.93      26.57     29.10

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                 Parse F1             Single-sentence      Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single             77.26     77.74      64.12     65.64      21.29     23.74
Multi              78.80     78.11      64.15     66.27      21.55     25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                 Parse F1             Single-sentence      Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single             76.26     79.76      64.08     65.50      22.25     25.35
Multi              79.44     79.94      64.08     66.84      22.58     27.16

87

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details that still capture the gist of the preferred actions
  – A variety of preferable parses improves the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Language Grounding Machine

Iran's goalkeeper blocks the ball

Block(IranGoalKeeper)

Machine

6

Language Grounding Machine

Iran's goalkeeper blocks the ball

Block(IranGoalKeeper)

Computer Vision / Language Learning

7

Natural Language and Meaning Representation

Iran's goalkeeper blocks the ball → Block(IranGoalKeeper)

8

Natural Language and Meaning Representation

Natural Language (NL)
NL: A language that arises naturally by the innate nature of human intellect, such as English, German, French, Korean, etc.

9

Iran's goalkeeper blocks the ball → Block(IranGoalKeeper)

Natural Language and Meaning Representation

NL: A language that arises naturally by the innate nature of human intellect, such as English, German, French, Korean, etc.
MRL: A formal language that machines can understand, such as logic or any computer-executable code.

Natural Language (NL) / Meaning Representation Language (MRL)

10

Iran's goalkeeper blocks the ball → Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

Semantic Parsing: maps a natural-language sentence to a full, detailed semantic representation
→ Machine understands natural language

Semantic Parsing (NL → MRL)

11

Iran's goalkeeper blocks the ball → Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

Semantic Parsing: maps a natural-language sentence to a full, detailed semantic representation
→ Machine understands natural language

Surface Realization: generates a natural-language sentence from a meaning representation
→ Machine communicates with natural language

Semantic Parsing (NL → MRL)
Surface Realization (MRL → NL)

12

Iran's goalkeeper blocks the ball → Block(IranGoalKeeper)

Conventional Language Learning Systems
• Require manually annotated corpora
• Time-consuming, hard to acquire, and not scalable

Manually Annotated Training Corpora (NL/MRL pairs) → Semantic Parser Learner → Semantic Parser (NL → MRL)

13

Learning from Perceptual Environment
• Motivated by how children learn language in a rich, ambiguous perceptual environment with linguistic input
• Advantages
  – Naturally obtainable corpora
  – Relatively easy to annotate
  – Motivated by the natural process of human language learning

14

Navigation Example

Alice: 식당에서 우회전 하세요 ("Turn right at the restaurant")
Bob
15 (Slide from David Chen)

Navigation Example

Alice: 병원에서 우회전 하세요 ("Turn right at the hospital")
Bob
16 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2
병원에서 우회전 하세요 ("Turn right at the hospital")
식당에서 우회전 하세요 ("Turn right at the restaurant")
17 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2
식당에서 우회전 하세요 ("Turn right at the restaurant")
병원에서 우회전 하세요 ("Turn right at the hospital")
18 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2
식당에서 우회전 하세요 ("Turn right at the restaurant")
병원에서 우회전 하세요 ("Turn right at the hospital")
Make a right turn
19 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2
병원에서 우회전 하세요 ("Turn right at the hospital")
식당에서 우회전 하세요 ("Turn right at the restaurant")
20 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2
식당 ("restaurant")
21 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2
병원에서 우회전 하세요 ("Turn right at the hospital")
식당에서 우회전 하세요 ("Turn right at the restaurant")
22 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2
병원 ("hospital")
23 (Slide from David Chen)

Thesis Contributions
• Generative models for grounded language learning from an ambiguous perceptual environment
  – Unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – General framework of probabilistic approaches that learn NL-MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not applicable
  – There is no single gold-standard reference for the training data
  – Weak response from the perceptual environment can train a discriminative reranker

24

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

25

Navigation Task (Chen and Mooney, 2011)
• Learn to interpret and follow navigation instructions
  – e.g., Go down this hall and make a right when you see an elevator to your left
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

26

Sample Environment (MacMahon et al., 2006)

[Map figure omitted]
Legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair

27

Executing Test Instruction

28


Task Objective
• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into correct formal plans of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney, 2011)
  – Landmarks plan
    • Describes actions in the environment along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant for the instruction

29

Challenges

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Landmarks plan: Travel ( steps: 1 ), Verify ( at: EASEL, side: CONCRETE HALLWAY ), Turn ( LEFT ), Verify ( front: CONCRETE HALLWAY ), Travel ( steps: 1 ), Verify ( side: BLUE HALLWAY, front: WALL ), Turn ( RIGHT ), Verify ( back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL )

30

Challenges

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Landmarks plan: Travel ( steps: 1 ), Verify ( at: EASEL, side: CONCRETE HALLWAY ), Turn ( LEFT ), Verify ( front: CONCRETE HALLWAY ), Travel ( steps: 1 ), Verify ( side: BLUE HALLWAY, front: WALL ), Turn ( RIGHT ), Verify ( back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL )

31

Challenges

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Correct plan: Travel ( steps: 1 ), Verify ( at: EASEL, side: CONCRETE HALLWAY ), Turn ( LEFT ), Verify ( front: CONCRETE HALLWAY ), Travel ( steps: 1 ), Verify ( side: BLUE HALLWAY, front: WALL ), Turn ( RIGHT ), Verify ( back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL )

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

32

Previous Work (Chen and Mooney, 2011)
• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney, 2012)
• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney, 2011) as the basic unit for building NL-MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals

34

35

System Diagram (Chen and Mooney, 2011)

Learning system for parsing navigation instructions:
Training: Observation (Instruction, World State) → Action Trace → Navigation Plan Constructor → Landmarks Plan → Plan Refinement (possible information loss) → Supervised Refined Plan → (Supervised) Semantic Parser Learner → Semantic Parser
Testing: Instruction, World State → Semantic Parser → Execution Module (MARCO) → Action Trace

36

System Diagram of Proposed Solution

Learning system for parsing navigation instructions:
Training: Observation (Instruction, World State) → Action Trace → Navigation Plan Constructor → Landmarks Plan → Probabilistic Semantic Parser Learner (from ambiguous supervision) → Semantic Parser
Testing: Instruction, World State → Semantic Parser → Execution Module (MARCO) → Action Trace

PCFG Induction Model for Grounded Language Learning (Borschinger et al., 2011)
• PCFG rules describe the generative process from MR components to the corresponding NL words

37

Hierarchy Generation PCFG Model (Kim and Mooney, 2012)
• Limitations of Borschinger et al. (2011)
  – Only works in low-ambiguity settings: 1 NL paired with a handful of MRs (order of 10s)
  – Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney, 2011)
• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is, given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()
• The score rewards co-occurrence of g and w, and penalizes general occurrence of g without w

39
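The intuition above, reward subgraphs that co-occur with the phrase and penalize those that occur without it, can be sketched as a conditional-probability difference. This is an illustrative approximation, not the exact GILL scoring function of Chen and Mooney (2011); the toy data are invented.

```python
# Sketch: score an (NL phrase w, MR subgraph g) pair by how much more often
# g appears in contexts where w is seen than in contexts where it is not.
def lexicon_score(pairs, w, g):
    """pairs: list of (set of NL words, set of MR subgraphs) examples."""
    with_w = [gs for ws, gs in pairs if w in ws]
    without_w = [gs for ws, gs in pairs if w not in ws]
    p_g_given_w = sum(g in gs for gs in with_w) / max(len(with_w), 1)
    p_g_without_w = sum(g in gs for gs in without_w) / max(len(without_w), 1)
    return p_g_given_w - p_g_without_w

data = [({"to", "the", "stool"}, {"Travel()", "Verify(at:BARSTOOL)"}),
        ({"turn", "left"}, {"Turn(LEFT)"}),
        ({"to", "the", "easel"}, {"Travel()", "Verify(at:EASEL)"})]
score = lexicon_score(data, "stool", "Verify(at:BARSTOOL)")  # 1.0: always co-occurs
```

Common words like "the" co-occur with many subgraphs, so the subtracted second term is what keeps such pairs from scoring highly.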

Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

[LHG figure omitted: a context MR Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into successively smaller lexeme subgraphs, e.g. Turn(), Travel(), Verify(at: EASEL) and Turn(RIGHT), Verify(side: HATRACK)]

40

PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which are finally connected to the NL instruction
    • Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
  – NL words are generated by lexeme nodes via a unigram Markov process (Borschinger et al., 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41

PCFG Construction

[Figure omitted: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]

42
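The k-permutation rule generation described above can be sketched with `itertools.permutations`; the nonterminal names here are illustrative, not the thesis's actual grammar symbols.

```python
from itertools import permutations

def kperm_rules(parent, children):
    """One PCFG production per non-empty k-permutation of the children,
    since we do not know which subset of subconcepts (in which order)
    the instruction actually describes."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, list(perm)))
    return rules

rules = kperm_rules("Turn(LEFT)_Travel()", ["Turn(LEFT)", "Travel()"])
# 2 one-child productions + 2 two-child productions = 4 rules
```

This also makes the complexity concern on the later Discussion slide concrete: the number of productions grows factorially in the number of children, which is one reason the Unigram Generation model keeps a smaller grammar.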

Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
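The CKY decoding step can be sketched for a binarized PCFG as follows; the grammar, rule probabilities, and symbol names are toy values for illustration, not the learned navigation grammar.

```python
import math

def cky_best_parse(words, lexical, binary):
    """Most probable analysis of each span under a binarized PCFG.
    lexical: {(A, word): prob}; binary: {(A, B, C): prob}.
    Returns {(i, j, A): (logprob, backpointer)}."""
    n = len(words)
    best = {}
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                best[i, i + 1, A] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # split point
                for (A, B, C), p in binary.items():
                    if (i, k, B) in best and (k, j, C) in best:
                        lp = math.log(p) + best[i, k, B][0] + best[k, j, C][0]
                        if (i, j, A) not in best or lp > best[i, j, A][0]:
                            best[i, j, A] = (lp, (k, B, C))
    return best

lex = {("Turn", "left"): 0.5, ("Travel", "walk"): 0.5}
bin_rules = {("Plan", "Turn", "Travel"): 1.0}
chart = cky_best_parse(["left", "walk"], lex, bin_rules)
```

Following the backpointers from the root entry recovers the Viterbi tree, from which the responsible lexeme MRs would then be composed into the final plan.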

[Figure omitted (slides 44-46): the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner", showing how lexeme MRs such as Turn(LEFT), Travel(steps: 2), Verify(at: SOFA), and Turn(RIGHT) are composed bottom-up into the final MR plan]

46

Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generate relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already taken into account

48

PCFG Construction

[Figure omitted: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]

49

Parsing New NL Sentences
• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Figure omitted (slides 51-54): the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner" under the Unigram Generation model; the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) at the top nonterminal generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(), whose components are marked in the context MR to form the final plan]

54

Data
• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

Paragraph: Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.
Single sentence: Take the wood path towards the easel. / At the easel, go left and then take a right on the blue path at the corner.
Actions: Turn, Forward / Turn left, Forward, Turn right, Forward x 3, Turn right, Forward / Forward, Turn left, Forward, Turn right, Turn

55

Data Statistics

                               Paragraph       Single-Sentence
# Instructions                   706              3236
Avg. # sentences                5.0 (±2.8)       1.0 (±0)
Avg. # actions                  10.4 (±5.7)      2.1 (±2.4)
Avg. # words/sent.
  English                       37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word                  31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character             48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                        660              629
  Chinese-Word                   661              508
  Chinese-Character              448              328

56

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney, 2006), trained on the resulting supervised data

57

Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

58

Parse Accuracy (English)

                                   Precision   Recall   F1
Chen & Mooney (2011)                 90.16      55.41   68.59
Chen (2012)                          88.36      57.03   69.31
Hierarchy Generation PCFG Model      87.58      65.41   74.81
Unigram Generation PCFG Model        86.10      68.79   76.44

59

Parse Accuracy (Chinese-Word)

                                   Precision   Recall   F1
Chen (2012)                          88.87      58.76   70.74
Hierarchy Generation PCFG Model      80.56      71.14   75.53
Unigram Generation PCFG Model        79.45      73.66   76.41

60

Parse Accuracy (Chinese-Character)

                                   Precision   Recall   F1
Chen (2012)                          92.48      56.47   70.01
Hierarchy Generation PCFG Model      79.77      67.38   73.05
Unigram Generation PCFG Model        79.73      75.52   77.55

61

End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in the single-sentence setting
  – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                                   Single-Sentence   Paragraph
Chen & Mooney (2011)                   54.40           16.18
Chen (2012)                            57.28           19.18
Hierarchy Generation PCFG Model        57.22           20.17
Unigram Generation PCFG Model          67.14           28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                   Single-Sentence   Paragraph
Chen (2012)                            58.70           20.13
Hierarchy Generation PCFG Model        61.03           19.08
Unigram Generation PCFG Model          63.40           23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                   Single-Sentence   Paragraph
Chen (2012)                            57.27           16.73
Hierarchy Generation PCFG Model        55.61           12.74
Unigram Generation PCFG Model          62.85           23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: the longer average sentence length makes PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity, avoids over-fitting, better generalization
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation         Unigram Generation
                      PCFG Model                   PCFG Model
Data                  |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English                20,451      17.26            16,357      8.78
Chinese (Word)         21,636      15.99            15,459      8.05
Chinese (Character)    19,792      18.64            13,514      12.58

67

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• An effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal
  – Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability

Testing example → Trained Generative Model → 1-best candidate with maximum probability (Candidate 1)

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

Testing example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 ... Candidate n) → Trained Secondary Discriminative Model → Best prediction (output)

71

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, we have weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good or bad
  – Multiple candidate parses for the parameter update
    • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Figure omitted: a training example passes through the trained baseline generative model (GEN) to produce n-best candidates with feature vectors a1 ... an and perceptron scores (e.g., -0.16, 1.21, -1.09, 1.46, 0.59); when the best prediction differs from the gold-standard reference with feature vector ag, the weights are updated by the feature-vector difference, e.g. ag - a4. In our generative models, the gold-standard reference is not available.]

73
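A minimal sketch of the averaged perceptron reranker (Collins, 2000): dense toy feature vectors stand in for the sparse parse features used in the thesis, and the example data are invented.

```python
# Averaged perceptron over n-best candidate lists (Collins, 2000).
def averaged_perceptron(examples, epochs=5):
    """examples: list of (candidate_feature_vectors, gold_index).
    Returns the averaged weight vector."""
    dim = len(examples[0][0][0])
    w = [0.0] * dim
    total = [0.0] * dim
    seen = 0
    for _ in range(epochs):
        for cands, gold in examples:
            # current best prediction under w
            pred = max(range(len(cands)),
                       key=lambda i: sum(wi * xi for wi, xi in zip(w, cands[i])))
            if pred != gold:  # move toward the gold candidate, away from the prediction
                for d in range(dim):
                    w[d] += cands[gold][d] - cands[pred][d]
            for d in range(dim):  # accumulate for averaging
                total[d] += w[d]
            seen += 1
    return [t / seen for t in total]

examples = [([[1.0, 0.0], [0.0, 1.0]], 1)]   # gold candidate is index 1
w_avg = averaged_perceptron(examples)        # -> [-1.0, 1.0]
```

Averaging the weights over all updates reduces the variance of the final model, which is why the averaged variant is the standard choice for reranking.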

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Figure omitted: the derived MRs of the n-best candidates (MR1 ... MRn) are run through the MARCO execution module, yielding execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron weights are updated by the feature-vector difference between it and the best prediction]

75
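Pseudo-gold selection by execution can be sketched with a stub executor standing in for MARCO; the plan names and the stub are invented for illustration, and MARCO's nondeterminism is handled by averaging over several trials.

```python
# Sketch: pick the pseudo-gold candidate by executing each derived MR plan.
def pseudo_gold(candidate_mrs, execute, trials=10):
    """execute(mr) -> True if the plan reached the intended destination.
    Averaging over trials accommodates a nondeterministic executor."""
    rates = [sum(execute(mr) for _ in range(trials)) / trials
             for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates

# Deterministic stub executor (illustration only): only plan_b succeeds.
execute = lambda mr: mr == "plan_b"
best, rates = pseudo_gold(["plan_a", "plan_b", "plan_c"], execute)  # best == 1
```

The returned success rates are kept alongside the winner because the multiple-parse update below also uses the rates of the non-winning candidates.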

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans are underspecified, or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
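The update rule just described, moving toward every candidate whose execution success rate beats the current prediction's, weighted by the rate difference, can be sketched as follows (feature vectors and rates are toy values):

```python
# Sketch: perceptron-style update using all candidates with a higher
# execution success rate than the predicted one, each weighted by the
# success-rate difference.
def multi_parse_update(w, feats, rates, pred):
    """w: weight vector; feats: candidate feature vectors;
    rates: execution success rates; pred: index of current prediction."""
    for i, r in enumerate(rates):
        if r > rates[pred]:
            weight = r - rates[pred]     # bigger rate gap -> stronger pull
            for d in range(len(w)):
                w[d] += weight * (feats[i][d] - feats[pred][d])
    return w

w = multi_parse_update([0.0, 0.0],
                       [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                       [0.6, 0.9, 0.4], pred=2)
# candidates 0 (+0.2) and 1 (+0.5) both beat the prediction's rate (0.4)
```

Setting the update weight to the rate difference means near-ties contribute almost nothing, while clearly better plans dominate the direction of the update.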

Weight Update with Multiple Parses

[Figure omitted (slides 77-78): among the n-best candidates with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59, every candidate whose success rate exceeds the currently predicted candidate's contributes a feature-vector-difference update, weighted by the success-rate difference]

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs. Baseline (Chinese-Character)

83

                   Baseline (Hierarchy / Unigram)   Response-based (Hierarchy / Unigram)
Parse F1           73.05 / 77.55                    76.26 / 79.76
Single-sentence    55.61 / 62.85                    64.08 / 65.5
Paragraph          12.74 / 23.33                    22.25 / 25.35

Response-based Update vs. Baseline
• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

                   Single (Hierarchy / Unigram)   Multi (Hierarchy / Unigram)
Parse F1           73.32 / 77.24                  73.43 / 77.81
Single-sentence    59.65 / 68.27                  62.81 / 68.93
Paragraph          22.62 / 29.2                   26.57 / 29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

                   Single (Hierarchy / Unigram)   Multi (Hierarchy / Unigram)
Parse F1           77.26 / 77.74                  78.8 / 78.11
Single-sentence    64.12 / 65.64                  64.15 / 66.27
Paragraph          21.29 / 23.74                  21.55 / 25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

                   Single (Hierarchy / Unigram)   Multi (Hierarchy / Unigram)
Parse F1           76.26 / 79.76                  79.44 / 79.94
Single-sentence    64.08 / 65.5                   64.08 / 66.84
Paragraph          22.25 / 25.35                  22.58 / 27.16

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88
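The execution success rates used as feedback come from running each candidate's composed plan in the simulated world; because the MARCO execution module is nondeterministic, the deck averages over repeated trials (10 in the thesis experiments). A minimal sketch with a hypothetical `execute` interface:

```python
def execution_success_rate(plan, execute, trials=10):
    """Estimate how often a candidate MR plan reaches the destination.

    execute: runs the plan once in the simulated world and returns True
    iff the follower ends at the intended position; it may be
    nondeterministic, so the success rate is averaged over trials.
    """
    successes = sum(1 for _ in range(trials) if execute(plan))
    return successes / trials
```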

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 7: Grounded Language Learning Models for Ambiguous  Supervision

Language Grounding Machine

Iranrsquos goalkeeper blocks the ball

Block(IranGoalKeeper)

7

Computer VisionLanguage Learning

Natural Language and Meaning Representation

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

8

Natural Language and Meaning Representation

Natural Language (NL)

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

9

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Natural Language and Meaning Representation

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

MRL Formal languages that machine can understand such as logic or any computer-executable code

Meaning Representation Language (MRL)Natural Language (NL)

10

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

MRL

Semantic Parsing (NL MRL)

11

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language

MRL

Semantic Parsing (NL MRL)

Surface Realization (NL MRL)

12

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable

Manually Annotated Training Corpora(NLMRL pairs)

Semantic Parser

MRLNL

Semantic Parser Learner

13

Learning from Perceptual Environment

bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input

bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language

learning

14

Navigation Example

식당에서 우회전 하세요Alice

Bob15Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요

16Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs. Baseline (Chinese-Word)

                     Parse F1            Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram   Hierarchy  Unigram
Baseline           75.53     76.41       61.03     63.40      19.08     23.12
Response-based     77.26     77.74       64.12     65.64      21.29     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                     Parse F1            Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram   Hierarchy  Unigram
Baseline           73.05     77.55       55.61     62.85      12.74     23.33
Response-based     76.26     79.76       64.08     65.50      22.25     25.35

83

Response-based Update vs. Baseline

• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                     Parse F1            Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram   Hierarchy  Unigram
Single             73.32     77.24       59.65     68.27      22.62     29.20
Multi              73.43     77.81       62.81     68.93      26.57     29.10

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                     Parse F1            Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram   Hierarchy  Unigram
Single             77.26     77.74       64.12     65.64      21.29     23.74
Multi              78.80     78.11       64.15     66.27      21.55     25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                     Parse F1            Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram   Hierarchy  Unigram
Single             76.26     79.76       64.08     65.50      22.25     25.35
Multi              79.44     79.94       64.08     66.84      22.58     27.16

87

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable because it requires annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 8: Grounded Language Learning Models for Ambiguous  Supervision

Natural Language and Meaning Representation

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

8

Natural Language and Meaning Representation

Natural Language (NL)

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

9

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Natural Language and Meaning Representation

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

MRL Formal languages that machine can understand such as logic or any computer-executable code

Meaning Representation Language (MRL)Natural Language (NL)

10

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

MRL

Semantic Parsing (NL MRL)

11

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language

MRL

Semantic Parsing (NL MRL)

Surface Realization (NL MRL)

12

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable

Manually Annotated Training Corpora(NLMRL pairs)

Semantic Parser

MRLNL

Semantic Parser Learner

13

Learning from Perceptual Environment

bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input

bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language

learning

14

Navigation Example

식당에서 우회전 하세요Alice

Bob15Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요

16Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline (English)

81

                 Parse F1               Single-sentence        Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline         74.81      76.44       57.22      67.14       20.17      28.12
Response-based   73.32      77.24       59.65      68.27       22.62      29.2

Response-based Update vs Baseline (Chinese-Word)

82

                 Parse F1               Single-sentence        Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline         75.53      76.41       61.03      63.4        19.08      23.12
Response-based   77.26      77.74       64.12      65.64       21.29      23.74

Response-based Update vs Baseline (Chinese-Character)

83

                 Parse F1               Single-sentence        Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline         73.05      77.55       55.61      62.85       12.74      23.33
Response-based   76.26      79.76       64.08      65.5        22.25      25.35

Response-based Update vs Baseline

• vs. the baseline models
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

         Parse F1               Single-sentence        Paragraph
         Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single   73.32      77.24       59.65      68.27       22.62      29.2
Multi    73.43      77.81       62.81      68.93       26.57      29.1

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

         Parse F1               Single-sentence        Paragraph
         Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single   77.26      77.74       64.12      65.64       21.29      23.74
Multi    78.8       78.11       64.15      66.27       21.55      25.95

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

         Parse F1               Single-sentence        Paragraph
         Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single   76.26      79.76       64.08      65.5        22.25      25.35
Multi    79.44      79.94       64.08      66.84       22.58      27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 9: Grounded Language Learning Models for Ambiguous  Supervision

Natural Language and Meaning Representation

Natural Language (NL)

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

9

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Natural Language and Meaning Representation

NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc

MRL Formal languages that machine can understand such as logic or any computer-executable code

Meaning Representation Language (MRL)Natural Language (NL)

10

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

MRL

Semantic Parsing (NL MRL)

11

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language

MRL

Semantic Parsing (NL MRL)

Surface Realization (NL MRL)

12

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable

Manually Annotated Training Corpora(NLMRL pairs)

Semantic Parser

MRLNL

Semantic Parser Learner

13

Learning from Perceptual Environment

bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input

bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language

learning

14

Navigation Example

식당에서 우회전 하세요Alice

Bob15Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요

16Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline (English)

81

Parse F1
                 Hierarchy   Unigram
Baseline           74.81      76.44
Response-based     73.32      77.24

Single-sentence execution
                 Hierarchy   Unigram
Baseline           57.22      67.14
Response-based     59.65      68.27

Paragraph execution
                 Hierarchy   Unigram
Baseline           20.17      28.12
Response-based     22.62      29.2

Response-based Update vs Baseline (Chinese-Word)

82

Parse F1
                 Hierarchy   Unigram
Baseline           75.53      76.41
Response-based     77.26      77.74

Single-sentence execution
                 Hierarchy   Unigram
Baseline           61.03      63.4
Response-based     64.12      65.64

Paragraph execution
                 Hierarchy   Unigram
Baseline           19.08      23.12
Response-based     21.29      23.74

Response-based Update vs Baseline (Chinese-Character)

83

Parse F1
                 Hierarchy   Unigram
Baseline           73.05      77.55
Response-based     76.26      79.76

Single-sentence execution
                 Hierarchy   Unigram
Baseline           55.61      62.85
Response-based     64.08      65.5

Paragraph execution
                 Hierarchy   Unigram
Baseline           12.74      23.33
Response-based     22.25      25.35

Response-based Update vs Baseline

• vs. baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Parse F1
            Hierarchy   Unigram
Single        73.32      77.24
Multi         73.43      77.81

Single-sentence execution
            Hierarchy   Unigram
Single        59.65      68.27
Multi         62.81      68.93

Paragraph execution
            Hierarchy   Unigram
Single        22.62      29.2
Multi         26.57      29.1

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Parse F1
            Hierarchy   Unigram
Single        77.26      77.74
Multi         78.8       78.11

Single-sentence execution
            Hierarchy   Unigram
Single        64.12      65.64
Multi         64.15      66.27

Paragraph execution
            Hierarchy   Unigram
Single        21.29      23.74
Multi         21.55      25.95

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Parse F1
            Hierarchy   Unigram
Single        76.26      79.76
Multi         79.44      79.94

Single-sentence execution
            Hierarchy   Unigram
Single        64.08      65.5
Multi         64.08      66.84

Paragraph execution
            Hierarchy   Unigram
Single        22.25      25.35
Multi         22.58      27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Natural Language and Meaning Representation

• Natural Language (NL): a language that arises naturally from the innate nature of human intellect, such as English, German, French, Korean, etc.
• Meaning Representation Language (MRL): a formal language that machines can understand, such as logic or any computer-executable code

10

Iran's goalkeeper blocks the ball → Block(Iran, GoalKeeper)

Semantic Parsing and Surface Realization

• Semantic Parsing: maps a natural-language sentence to a full, detailed semantic representation → machine understands natural language

Semantic Parsing (NL → MRL)

11

Iran's goalkeeper blocks the ball → Block(Iran, GoalKeeper)

Semantic Parsing and Surface Realization

• Semantic Parsing: maps a natural-language sentence to a full, detailed semantic representation → machine understands natural language
• Surface Realization: generates a natural-language sentence from a meaning representation → machine communicates with natural language

Semantic Parsing (NL → MRL)
Surface Realization (MRL → NL)

12

Iran's goalkeeper blocks the ball → Block(Iran, GoalKeeper)

Conventional Language Learning Systems

• Requires manually annotated corpora
• Time-consuming, hard to acquire, and not scalable

Manually Annotated Training Corpora (NL/MRL pairs) → Semantic Parser Learner → Semantic Parser (NL → MRL)

13

Learning from Perceptual Environment

• Motivated by how children learn language in a rich, ambiguous perceptual environment with linguistic input
• Advantages
  – Naturally obtainable corpora
  – Relatively easy to annotate
  – Motivated by the natural process of human language learning

14

Navigation Example

Alice: 식당에서 우회전 하세요 (Turn right at the restaurant)

Bob

15

Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요 (Turn right at the hospital)

16

Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

병원에서 우회전 하세요

식당에서 우회전 하세요

17

Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당에서 우회전 하세요

병원에서 우회전 하세요

18

Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19

Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

병원에서 우회전 하세요

식당에서 우회전 하세요

20

Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당 (restaurant)

21

Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

병원에서 우회전 하세요

식당에서 우회전 하세요

22

Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

병원 (hospital)

23

Slide from David Chen

Thesis Contributions

• Generative models for grounded language learning from ambiguous perceptual environment
  – Unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – General framework of probabilistic approaches that learn NL-MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not available
  – No single gold-standard reference for training data
  – Weak response from the perceptual environment can train a discriminative reranker

24

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

25

• Learn to interpret and follow navigation instructions
  – e.g., Go down this hall and make a right when you see an elevator to your left
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

[Map figure] Legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28


Task Objective

• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into the correct formal plan of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney 2011)
  – Landmarks plan
    • Describes actions in the environment along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant for the instruction

29

Challenges

30

Instruction: at the easel go left and then take a right onto the blue path at the corner

Landmarks plan: Travel(steps: 1), Verify(at: EASEL, side: CONCRETE HALLWAY), Turn(LEFT), Verify(front: CONCRETE HALLWAY), Travel(steps: 1), Verify(side: BLUE HALLWAY, front: WALL), Turn(RIGHT), Verify(back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL)

Challenges

31

Instruction: at the easel go left and then take a right onto the blue path at the corner

Landmarks plan: (same as above; only part of it corresponds to the instruction)

Challenges

32

Instruction: at the easel go left and then take a right onto the blue path at the corner

Correct plan: the relevant subset of the landmarks plan above

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map a novel instruction (NL) to the correct formal plan (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Disambiguate the input + learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL-MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

Training / Testing

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

Learning / Inference

Possible information loss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

Training / Testing

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from ambiguous supervision)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL paired with a handful of MRs (order of 10s)
  – Only outputs MRs included in the PCFG constructed from training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG, by composing the MR parse with semantic lexeme MRs

38

• Pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

Score intuition: co-occurrence of g and w, versus general occurrence of g without w
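One simple correlation score consistent with the intuition above is how much more likely the subgraph g is when phrase w is present than when it is absent. This exact form is an assumption for illustration, not necessarily the score used by Chen & Mooney (2011); counts are over training examples.

```python
def lexeme_score(n_gw, n_w, n_g, n_total):
    """Correlation between phrase w and MR subgraph g:
    p(g | w) minus p(g | not w), from co-occurrence counts.
    n_gw: examples containing both g and w; n_w: examples with w;
    n_g: examples with g; n_total: all examples."""
    p_g_given_w = n_gw / n_w
    p_g_without_w = (n_g - n_gw) / (n_total - n_w)
    return p_g_given_w - p_g_without_w

# "to the stool" co-occurs with Travel(), Verify(at: BARSTOOL) in 8 of its
# 10 occurrences, while that subgraph appears in 12 of 100 examples overall.
s = lexeme_score(8, 10, 12, 100)
```

A high score means g appears mostly when w does, so (w, g) is a plausible lexicon entry.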

Lexeme Hierarchy Graph (LHG)

• Hierarchy of semantic lexemes, related by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, further connected to NL word groundings

40

[Figure: an example LHG whose top-level lexeme Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into smaller lexemes such as Turn(), Travel(), Verify(at: EASEL) and Turn(RIGHT), Verify(side: HATRACK), Travel(), down to Turn() and Verify(side: HATRACK)]

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe that will finally be connected to the NL instruction
    • Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41
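Enumerating every k-permutation of a node's children is straightforward with the standard library; this sketch (names are illustrative) shows why the rule set grows quickly with the number of child concepts.

```python
from itertools import permutations

def k_permutation_rules(parent, children):
    """Enumerate PCFG rules expanding `parent` into every k-permutation
    of its child concepts (k = 1 .. len(children)), since we do not know
    which subset of subconcepts the instruction actually describes."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules

# A node with 3 child lexemes already yields 3 + 6 + 6 = 15 rules.
rules = k_permutation_rules("L1", ["L2", "L3", "L4"])
```

This combinatorial growth is exactly the complexity that motivates the simpler Unigram Generation model later in the talk.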

PCFG Construction

42

[Figure: child concepts are generated selectively from parent concepts; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
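The CKY step mentioned above can be sketched for a PCFG in Chomsky normal form. This is a minimal generic Viterbi-CKY, not the thesis implementation; the toy grammar and its symbols are invented for illustration.

```python
import math
from collections import defaultdict

def cky_best_parse(words, lexical, binary, start="S"):
    """Viterbi CKY for a PCFG in Chomsky normal form.
    lexical: {(A, word): prob}; binary: {(A, B, C): prob}.
    Returns (log-prob, tree) for the best parse rooted at `start`,
    where a tree is (label, child, ...) and leaves are words."""
    n = len(words)
    best = defaultdict(lambda: (-math.inf, None))  # (i, j, A) -> (logp, tree)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w and math.log(p) > best[(i, i + 1, A)][0]:
                best[(i, i + 1, A)] = (math.log(p), (A, w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for (A, B, C), p in binary.items():
                    lb, tb = best[(i, k, B)]
                    lc, tc = best[(k, j, C)]
                    if tb is None or tc is None:
                        continue
                    score = math.log(p) + lb + lc
                    if score > best[(i, j, A)][0]:
                        best[(i, j, A)] = (score, (A, tb, tc))
    return best[(0, n, start)]

# Toy navigation-flavored grammar: S -> V Dir, V -> "turn", Dir -> "left".
lexical = {("V", "turn"): 1.0, ("Dir", "left"): 1.0}
binary = {("S", "V", "Dir"): 1.0}
logp, tree = cky_best_parse(["turn", "left"], lexical, binary)
```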

[Figure (three slides): the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner" – lexeme nonterminals such as Turn(LEFT); Verify(front: SOFA); Travel(steps: 2), Verify(at: SOFA); and Turn(RIGHT) generate the NL words, and the responsible MR components are marked from the bottom up to compose the final MR plan Turn(LEFT), Verify(front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already accounted for

48

PCFG Construction

49

[Figure: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]

Parsing New NL Sentences

• Follows a similar scheme as the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

[Figure (four slides): for the test NL instruction "Turn left and find the sofa then turn around the corner", the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn() one by one, and the relevant lexeme MR components are marked in the context MR to compose the final plan]

54

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph: Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.

Single sentence: "Take the wood path towards the easel" → Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward
"At the easel go left and then take a right on the the blue path at the corner" → Forward, Turn left, Forward, Turn right, Turn

Data Statistics

56

                            Paragraph        Single-Sentence
Instructions                  706                3236
Avg. sentences                5.0 (±2.8)         1.0 (±0)
Avg. actions                 10.4 (±5.7)         2.1 (±2.4)
Avg. words/sent.
  English                    37.6 (±21.1)        7.8 (±5.1)
  Chinese-Word               31.6 (±18.1)        6.9 (±4.9)
  Chinese-Character          48.9 (±28.3)       10.6 (±7.3)
Vocabulary
  English                     660                629
  Chinese-Word                661                508
  Chinese-Character           448                328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58

Parse Accuracy (English)

                               Precision   Recall    F1
Chen & Mooney (2011)             90.16      55.41    68.59
Chen (2012)                      88.36      57.03    69.31
Hierarchy Generation PCFG        87.58      65.41    74.81
Unigram Generation PCFG          86.1       68.79    76.44

59

Parse Accuracy (Chinese-Word)

                               Precision   Recall    F1
Chen (2012)                      88.87      58.76    70.74
Hierarchy Generation PCFG        80.56      71.14    75.53
Unigram Generation PCFG          79.45      73.66    76.41

60

Parse Accuracy (Chinese-Character)

                               Precision   Recall    F1
Chen (2012)                      92.48      56.47    70.01
Hierarchy Generation PCFG        79.77      67.38    73.05
Unigram Generation PCFG          79.73      75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                               Single-Sentence   Paragraph
Chen & Mooney (2011)               54.4            16.18
Chen (2012)                        57.28           19.18
Hierarchy Generation PCFG          57.22           20.17
Unigram Generation PCFG            67.14           28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                               Single-Sentence   Paragraph
Chen (2012)                        58.7            20.13
Hierarchy Generation PCFG          61.03           19.08
Unigram Generation PCFG            63.4            23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                               Single-Sentence   Paragraph
Chen (2012)                        57.27           16.73
Hierarchy Generation PCFG          55.61           12.74
Unigram Generation PCFG            62.85           23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity, avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Composes novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                        Hierarchy Generation      Unigram Generation
Data                    |Grammar|  Time (hrs)     |Grammar|  Time (hrs)
English                   20,451     17.26          16,357      8.78
Chinese (Word)            21,636     15.99          15,459      8.05
Chinese (Character)       19,792     18.64          13,514     12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal
  – Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability

Testing Example → Trained Generative Model → 1-best candidate with maximum probability (Candidate 1)

70

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, weak supervision of surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: for a training example, the trained baseline generative model GENs n-best candidates Candidate 1 … Candidate n with feature vectors a1, a2, a3, a4, …, an and perceptron scores -0.16, 1.21, -1.09, 1.46, 0.59; the best prediction is compared against the gold-standard reference (feature vector ag), and the weights are updated by the feature-vector difference ag − a4. For our generative models a gold-standard reference is not available.]

73
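The basic perceptron step for reranking can be sketched with sparse feature dicts. This is a minimal generic sketch of the Collins-style update, not the thesis implementation.

```python
def score(weights, feats):
    """Dot product of sparse weight and feature dicts."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def perceptron_update(weights, feats_gold, feats_pred):
    """One perceptron step: when the top-scoring candidate is not the
    reference, add the reference's feature vector and subtract the
    wrongly predicted candidate's."""
    for k, v in feats_gold.items():
        weights[k] = weights.get(k, 0.0) + v
    for k, v in feats_pred.items():
        weights[k] = weights.get(k, 0.0) - v
    return weights

# After one update, the reference candidate outscores the wrong one.
w = perceptron_update({}, {"a": 1, "b": 1}, {"b": 1, "c": 1})
```

The averaged variant keeps a running sum of the weight vectors over all updates and uses their mean at test time.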

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates Candidate 1 … Candidate n, with derived MRs MR1 … MRn and perceptron scores 1.79, 0.21, -1.09, 1.46, 0.59, are run through the MARCO execution module, giving execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with the best rate becomes the pseudo-gold reference, and the perceptron weights are updated with the feature-vector difference between it and the best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature vector difference, weighted by the difference between execution success rates

76
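The multi-parse scheme above can be sketched as follows; all names are illustrative, and `exec_rate` stands for the averaged MARCO execution success rates.

```python
def score(weights, feats):
    """Dot product of sparse weight and feature dicts."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def multi_parse_update(weights, candidates, feats, exec_rate):
    """Every candidate whose execution success rate beats the currently
    best-scoring candidate contributes a feature-difference update,
    weighted by the gap in success rates."""
    pred = max(candidates, key=lambda c: score(weights, feats[c]))
    for c in candidates:
        gap = exec_rate[c] - exec_rate[pred]
        if gap <= 0:
            continue                      # not better than current prediction
        for k, v in feats[c].items():
            weights[k] = weights.get(k, 0.0) + gap * v
        for k, v in feats[pred].items():
            weights[k] = weights.get(k, 0.0) - gap * v
    return weights

# c1 currently scores highest but executes poorly; c2 executes well.
feats = {"c1": {"a": 1.0}, "c2": {"b": 1.0}}
rates = {"c1": 0.2, "c2": 0.9}
w = multi_parse_update({"a": 1.0}, ["c1", "c2"], feats, rates)
```

Weighting by the success-rate gap lets strongly better candidates move the weights more than marginally better ones.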

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: n-best candidates Candidate 1 … Candidate n with derived MRs MR1 … MRn, perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59, and execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 from the MARCO execution module; Update (1) applies the feature-vector difference between a better-executing candidate and the best prediction]

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Parse F1 (Hierarchy / Unigram):
  Baseline          75.53 / 76.41
  Response-based    77.26 / 77.74

Single-sentence execution (Hierarchy / Unigram):
  Baseline          61.03 / 63.46
  Response-based    64.12 / 65.64

Paragraph execution (Hierarchy / Unigram):
  Baseline          19.08 / 23.12
  Response-based    21.29 / 23.74

Response-based Update vs Baseline (Chinese-Character)

83

Parse F1 (Hierarchy / Unigram):
  Baseline          73.05 / 77.55
  Response-based    76.26 / 79.76

Single-sentence execution (Hierarchy / Unigram):
  Baseline          55.61 / 62.85
  Response-based    64.08 / 65.5

Paragraph execution (Hierarchy / Unigram):
  Baseline          12.74 / 23.33
  Response-based    22.25 / 25.35

Response-based Update vs Baseline

• vs Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Parse F1 (Hierarchy / Unigram):
  Single    73.32 / 77.24
  Multi     73.43 / 77.81

Single-sentence execution (Hierarchy / Unigram):
  Single    59.65 / 68.27
  Multi     62.81 / 68.93

Paragraph execution (Hierarchy / Unigram):
  Single    22.62 / 29.2
  Multi     26.57 / 29.1

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Parse F1 (Hierarchy / Unigram):
  Single    77.26 / 77.74
  Multi     78.8  / 78.11

Single-sentence execution (Hierarchy / Unigram):
  Single    64.12 / 65.64
  Multi     64.15 / 66.27

Paragraph execution (Hierarchy / Unigram):
  Single    21.29 / 23.74
  Multi     21.55 / 25.95

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Parse F1 (Hierarchy / Unigram):
  Single    76.26 / 79.76
  Multi     79.44 / 79.94

Single-sentence execution (Hierarchy / Unigram):
  Single    64.08 / 65.5
  Multi     64.08 / 66.84

Paragraph execution (Hierarchy / Unigram):
  Single    22.25 / 25.35
  Multi     22.58 / 27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 11: Grounded Language Learning Models for Ambiguous  Supervision

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

MRL

Semantic Parsing (NL MRL)

11

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Semantic Parsing and Surface Realization

NL

Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language

Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language

MRL

Semantic Parsing (NL MRL)

Surface Realization (NL MRL)

12

Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)

Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable

Manually Annotated Training Corpora(NLMRL pairs)

Semantic Parser

MRLNL

Semantic Parser Learner

13

Learning from Perceptual Environment

bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input

bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language

learning

14

Navigation Example

식당에서 우회전 하세요Alice

Bob15Slide from David Chen

Navigation Example

Alice

Bob

병원에서 우회전 하세요

16Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)
85

                   Parse F1             Single-sentence      Paragraph
                   Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Single             73.32   77.24        59.65   68.27        22.62   29.20
Multiple           73.43   77.81        62.81   68.93        26.57   29.10

Response-based Update with Multiple vs. Single Parses (Chinese-Word)
86

                   Parse F1             Single-sentence      Paragraph
                   Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Single             77.26   77.74        64.12   65.64        21.29   23.74
Multiple           78.80   78.11        64.15   66.27        21.55   25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)
87

                   Parse F1             Single-sentence      Paragraph
                   Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Single             76.26   79.76        64.08   65.50        22.25   25.35
Multiple           79.44   79.94        64.08   66.84        22.58   27.16

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – Single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans or plans with ignorable details, but capture the gist of preferred actions
  – A variety of preferable parses helps improve the amount and the quality of weak feedback
88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89

Future Directions
• Integrating syntactic components
  – Learn joint model of syntactic and semantic structure
• Large-scale data
  – Data collection, model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91

Conclusion
• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and the training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92

Thank You

Page 12: Grounded Language Learning Models for Ambiguous  Supervision

Semantic Parsing and Surface Realization
• Semantic Parsing: maps a natural-language sentence to a full, detailed semantic representation
  → Machine understands natural language
• Surface Realization: generates a natural-language sentence from a meaning representation
  → Machine communicates with natural language

Semantic Parsing (NL → MRL)
Surface Realization (MRL → NL)
12
Example: "Iran's goalkeeper blocks the ball" ↔ Block(IranGoalKeeper)

Conventional Language Learning Systems
• Requires manually annotated corpora
• Time-consuming, hard to acquire, and not scalable

[Diagram: Manually Annotated Training Corpora (NL/MRL pairs) → Semantic Parser Learner → Semantic Parser (NL → MRL)]
13

Learning from Perceptual Environment
• Motivated by how children learn language in a rich, ambiguous perceptual environment with linguistic input
• Advantages
  – Naturally obtainable corpora
  – Relatively easy to annotate
  – Motivated by the natural process of human language learning
14

Navigation Example

Alice: 식당에서 우회전 하세요 ("Turn right at the restaurant")

Bob
15 (Slide from David Chen)

Navigation Example

Alice

Bob

병원에서 우회전 하세요 ("Turn right at the hospital")

16 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 병원에서 우회전 하세요

식당에서 우회전 하세요

17 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 식당에서 우회전 하세요

병원에서 우회전 하세요

18 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 병원에서 우회전 하세요

식당에서 우회전 하세요

20 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2

식당

21 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 병원에서 우회전 하세요

식당에서 우회전 하세요

22 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 병원

23 (Slide from David Chen)

Thesis Contributions
• Generative models for grounded language learning from ambiguous perceptual environment
  – Unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – General framework of probabilistic approaches that learn NL–MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not available
  – No single gold-standard reference for training data
  – Weak response from the perceptual environment can train a discriminative reranker
24

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
25

Navigation Task (Chen and Mooney 2011)
• Learn to interpret and follow navigation instructions
  – e.g., "Go down this hall and make a right when you see an elevator to your left"
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions
26

Sample Environment (MacMahon et al. 2006)
[Figure: map of a virtual world with objects placed along hallways]
Legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair

27

Executing Test Instruction

28


Task Objective
• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into a correct formal plan of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney 2011)
  – Landmarks plan
    • Describes actions in the environment along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant for the instruction
29

Challenges

30

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruction

at the easel go left and then take a right onto the blue path at the corner

Correct plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

Previous Work (Chen and Mooney 2011)
• Circumvents the combinatorial NL–MR correspondence problem
  – Constructs supervised NL–MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all
33

Proposed Solution (Kim and Mooney 2012)
• Learn a probabilistic semantic parser directly from ambiguous training data
  – Disambiguate input + learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL–MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals
34

35

System Diagram (Chen and Mooney 2011)
[Diagram — learning system for parsing navigation instructions. Training: observed (instruction, world state, action trace) data passes through the Navigation Plan Constructor to produce a Landmarks Plan; Plan Refinement yields a Supervised Refined Plan (possible information loss), which trains the (Supervised) Semantic Parser Learner. Testing: the learned Semantic Parser maps an instruction and world state to a plan, executed by the Execution Module (MARCO) to produce an action trace.]

36

System Diagram of Proposed Solution
[Diagram — same pipeline, but the Landmarks Plan directly feeds a Probabilistic Semantic Parser Learner trained from ambiguous supervision (no plan-refinement step), producing the Semantic Parser used with the Execution Module (MARCO) at test time.]

PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)
• PCFG rules describe the generative process from MR components to corresponding NL words
37

Hierarchy Generation PCFG Model (Kim and Mooney 2012)
• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL – a handful of MRs (order of 10s)
  – Only outputs MRs included in the PCFG constructed from training data
• Proposed model
  – Use semantic lexemes as units of semantic concepts
  – Disambiguate NL–MR correspondences at the semantic concept (lexeme) level
  – Disambiguate a much higher level of ambiguous supervision
  – Output novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs
38

Semantic Lexicon (Chen and Mooney 2011)
• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is given seeing phrase w
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()
39
Score balances the co-occurrence of g and w against the general occurrence of g without w.
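The co-occurrence intuition can be sketched as a simple scoring function; this is a simplified stand-in for the actual lexicon-learning score, with illustrative counts and data:

```python
def lexeme_score(pairs, w, g):
    """Score how strongly MR subgraph g is associated with phrase w.

    pairs: list of (phrase_set, graph_set) training examples, where
    phrase_set holds NL phrases and graph_set holds MR components in
    the ambiguous context.  Score = P(g | w) - P(g | not w), a
    simplified version of the co-occurrence-based scoring idea.
    """
    with_w = [graphs for phrases, graphs in pairs if w in phrases]
    without_w = [graphs for phrases, graphs in pairs if w not in phrases]
    p_g_w = sum(g in graphs for graphs in with_w) / max(len(with_w), 1)
    p_g_not_w = sum(g in graphs for graphs in without_w) / max(len(without_w), 1)
    return p_g_w - p_g_not_w

# Tiny illustrative corpus of (phrases, ambiguous context MR components).
data = [({"to the stool"}, {"Travel()", "Verify(at BARSTOOL)"}),
        ({"to the stool"}, {"Travel()", "Verify(at BARSTOOL)", "Turn()"}),
        ({"turn left"},    {"Turn()", "Travel()"})]
```

Under this toy corpus, "to the stool" is strongly associated with Verify(at BARSTOOL) but not with the ubiquitous Travel().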

Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, further connected to NL word groundings
40

[Figure: example LHG — a top lexeme such as Turn(RIGHT) Verify(side: HATRACK, front: SOFA) Travel(steps: 3) Verify(at: EASEL) decomposes into smaller lexemes such as Turn() Travel() Verify(at: EASEL), Turn(RIGHT) Verify(side: HATRACK) Travel(), and Turn() Verify(side: HATRACK)]

PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which are finally connected to the NL instruction
    • Each node generates all k-permutations of children nodes — we do not know which subset is correct
  – NL words are generated by lexeme nodes via a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated
41

PCFG Construction
42
[Diagram: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]

Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm with training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data
43
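The most-probable-parse step can be illustrated with a toy probabilistic CKY over a grammar in Chomsky normal form; the grammar and weights below are illustrative, not the learned navigation PCFG:

```python
from collections import defaultdict

def cky(words, binary, lexical):
    """Viterbi CKY: best derivation probability for each span/label.

    binary:  {(A, B, C): p} for rules A -> B C
    lexical: {(A, w): p}    for rules A -> w
    Returns a table; table[(0, n)][start_symbol] is the Viterbi probability.
    """
    n = len(words)
    best = defaultdict(dict)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                best[(i, i + 1)][A] = max(best[(i, i + 1)].get(A, 0.0), p)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            k = i + length
            for j in range(i + 1, k):           # split point
                for (A, B, C), p in binary.items():
                    pb = best[(i, j)].get(B, 0.0)
                    pc = best[(j, k)].get(C, 0.0)
                    if pb and pc and p * pb * pc > best[(i, k)].get(A, 0.0):
                        best[(i, k)][A] = p * pb * pc
    return best

binary = {("S", "V", "O"): 1.0}
lexical = {("V", "turn"): 0.6, ("O", "left"): 0.5, ("O", "right"): 0.5}
table = cky(["turn", "left"], binary, lexical)
```

Backpointers (omitted here for brevity) would recover the tree itself, from which the responsible lexeme MRs are composed.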

[Figure (slides 44–46): most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner" — lexeme MRs such as Turn(LEFT), Travel() Verify(at: SOFA), and Turn(RIGHT) are marked bottom-up and composed into the final MR plan]

Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generate relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train
47

PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already considered
48
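The unigram Markov generation of lexemes can be sketched as a simple probability computation: each step independently emits one relevant lexeme, then the process stops with some probability. The parameterization and numbers are illustrative, not the learned model:

```python
import math

def unigram_sequence_logprob(lexemes, probs, stop=0.3):
    """Log-probability of emitting `lexemes` one by one under a unigram
    Markov process: at each step, continue with probability (1 - stop)
    and emit a lexeme drawn from `probs`, then finally stop.
    """
    lp = 0.0
    for lex in lexemes:
        lp += math.log((1 - stop) * probs[lex])
    return lp + math.log(stop)

# Illustrative lexeme-emission probabilities for one context MR.
probs = {"Turn(LEFT)": 0.5, "Travel()": 0.3, "Verify(at SOFA)": 0.2}
lp = unigram_sequence_logprob(["Turn(LEFT)", "Travel()"], probs)
```

Because each emission is independent, every ordering of the same lexeme multiset receives the same probability, which is why no extra permutation rules are needed.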

PCFG Construction
49
[Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]

Parsing New NL Sentences
• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark relevant lexeme MR components in the context MR appearing in the top nonterminal
50

[Figure (slides 51–54): most probable parse tree for "Turn left and find the sofa then turn around the corner" — the context MR Turn(LEFT) Verify(front: BLUE HALL, front: SOFA) Travel(steps: 2) Verify(at: SOFA) Turn(RIGHT) generates the relevant lexemes Turn(LEFT) and Travel() Verify(at: SOFA) Turn(), whose components are marked in the context MR to compose the final plan]

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version
55

Example (paragraph vs. single sentence):
  Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."
  → Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward
  Single sentence: each sentence is paired with its own action subsequence, e.g. "Take the wood path towards the easel" → Turn, Forward

Data Statistics
56

                        Paragraph        Single-Sentence
Instructions            706              3236
Avg. # sentences        5.0 (±2.8)       1.0 (±0)
Avg. # actions          10.4 (±5.7)      2.1 (±2.4)
Avg. # words/sent.
  English               37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word          31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character     48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English               660              629
  Chinese-Word          661              508
  Chinese-Character     448              328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data
57

Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy
58

Parse Accuracy (English)
59

                                   Precision   Recall   F1
Chen & Mooney (2011)               90.16       55.41    68.59
Chen (2012)                        88.36       57.03    69.31
Hierarchy Generation PCFG Model    87.58       65.41    74.81
Unigram Generation PCFG Model      86.10       68.79    76.44

Parse Accuracy (Chinese-Word)
60

                                   Precision   Recall   F1
Chen (2012)                        88.87       58.76    70.74
Hierarchy Generation PCFG Model    80.56       71.14    75.53
Unigram Generation PCFG Model      79.45       73.66    76.41

Parse Accuracy (Chinese-Character)
61

                                   Precision   Recall   F1
Chen (2012)                        92.48       56.47    70.01
Hierarchy Generation PCFG Model    79.77       67.38    73.05
Unigram Generation PCFG Model      79.73       75.52    77.55

End-to-End Execution Evaluations
• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence
  – Paragraph execution is affected by even one single-sentence execution
62

End-to-End Execution Evaluations (English)
63

                                   Single-Sentence   Paragraph
Chen & Mooney (2011)               54.40             16.18
Chen (2012)                        57.28             19.18
Hierarchy Generation PCFG Model    57.22             20.17
Unigram Generation PCFG Model      67.14             28.12

End-to-End Execution Evaluations (Chinese-Word)
64

                                   Single-Sentence   Paragraph
Chen (2012)                        58.70             20.13
Hierarchy Generation PCFG Model    61.03             19.08
Unigram Generation PCFG Model      63.40             23.12

End-to-End Execution Evaluations (Chinese-Character)
65

                                   Single-Sentence   Paragraph
Chen (2012)                        57.27             16.73
Hierarchy Generation PCFG Model    55.61             12.74
Unigram Generation PCFG Model      62.85             23.33

Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRL
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training
66

Comparison of Grammar Size and EM Training Time
67

                      Hierarchy Generation       Unigram Generation
Data                  |Grammar|   Time (hrs)     |Grammar|   Time (hrs)
English               20,451      17.26          16,357      8.78
Chinese (Word)        21,636      15.99          15,459      8.05
Chinese (Character)   19,792      18.64          13,514      12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68

Discriminative Reranking
• Effective approach to improve performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69

Discriminative Reranking
• Generative model
  – Trained model outputs the best result with maximum probability
[Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability]
70

Discriminative Reranking
• Can we do better?
  – Secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead provides weak supervision of surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    • Response signal is weak and distributed over all candidates
72

Reranking Model: Averaged Perceptron (Collins 2000)
• Parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates with feature vectors 𝒂₁ … 𝒂ₙ and perceptron scores (e.g. −0.16, 1.21, −1.09, 1.46, 0.59) → Perceptron → Best prediction; the Gold-Standard Reference (not available for our generative models) triggers an update with the feature-vector difference 𝒂_g − 𝒂₄]
73
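The averaged-perceptron reranking loop above can be sketched as follows; the dict-based feature vectors and training data are illustrative, not the thesis implementation:

```python
def perceptron_rerank_train(examples, epochs=5):
    """Train a reranking weight vector with the averaged perceptron.

    examples: list of (candidate_feature_dicts, gold_index).
    When the argmax candidate is not the reference, the weights move
    by the feature-vector difference; the returned vector is the
    average of the weights over all update steps (Collins 2000).
    """
    w, total, steps = {}, {}, 0

    def score(f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    for _ in range(epochs):
        for cands, gold in examples:
            pred = max(range(len(cands)), key=lambda i: score(cands[i]))
            if pred != gold:
                for k, v in cands[gold].items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in cands[pred].items():
                    w[k] = w.get(k, 0.0) - v
            steps += 1
            for k, v in w.items():              # accumulate for averaging
                total[k] = total.get(k, 0.0) + v
    return {k: v / steps for k, v in total.items()}

examples = [([{"good": 1.0}, {"bad": 1.0}], 0),
            ([{"bad": 1.0}, {"good": 1.0}], 1)]
w = perceptron_rerank_train(examples)
```

In standard reranking the gold index comes from annotated references; the following slides replace it with a pseudo-gold chosen by execution feedback.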

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – Most preferred one in terms of plan execution
  – Evaluate composed MR plans from candidate parses
  – MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training
74
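Selecting the pseudo-gold reference can be sketched as: execute each candidate plan several times and keep the one with the highest average success. The executor below is a toy stand-in for MARCO, and all names are illustrative:

```python
def pick_pseudo_gold(candidates, execute, trials=10):
    """Return the index of the candidate MR plan with the best execution
    success rate, plus all rates.  Success is averaged over `trials`
    runs because the executor (MARCO in the thesis) is nondeterministic.
    """
    def success_rate(plan):
        return sum(execute(plan) for _ in range(trials)) / trials

    rates = [success_rate(plan) for plan in candidates]
    best = max(range(len(candidates)), key=lambda i: rates[i])
    return best, rates

# Toy deterministic executor: succeeds iff the plan ends at position 7.
execute = lambda plan: plan.get("end") == 7
idx, rates = pick_pseudo_gold([{"end": 3}, {"end": 7}, {"end": 5}], execute)
```

The chosen index then plays the role of the gold index in the perceptron update.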

Response-based Update
• Select pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates → derived MRs 𝑴𝑹₁ … 𝑴𝑹ₙ → MARCO Execution Module → execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the highest-rate candidate becomes the pseudo-gold reference, and the perceptron (scores e.g. 1.79, 0.21, −1.09, 1.46, 0.59) is updated with the feature-vector difference from the best prediction]
75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
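The multi-candidate update can be sketched as: every candidate whose success rate beats the current prediction contributes a feature-difference update scaled by the rate gap. This is a simplified rendering of the idea, with illustrative feature dicts:

```python
def multi_parse_update(w, cands, rates):
    """One response-based update step with multiple parses.

    cands: candidate feature dicts; rates: execution success rates.
    The prediction is the perceptron argmax before the update; every
    candidate with a strictly higher success rate pushes the weights
    by its feature difference, weighted by the rate difference.
    """
    score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
    pred = max(range(len(cands)), key=lambda i: score(cands[i]))
    for i, f in enumerate(cands):
        gap = rates[i] - rates[pred]
        if gap > 0:
            for k in set(f) | set(cands[pred]):
                diff = f.get(k, 0.0) - cands[pred].get(k, 0.0)
                w[k] = w.get(k, 0.0) + gap * diff
    return w

# With zero initial weights the prediction is candidate 0 (rate 0.2);
# candidates 1 (rate 0.9) and 2 (rate 0.6) both contribute updates.
w = multi_parse_update({}, [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}],
                       [0.2, 0.9, 0.6])
```

Candidates with larger rate gaps thus pull the weights harder, spreading the weak response signal over several preferable parses.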

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: n-best candidates with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59; update (1) uses a candidate whose success rate exceeds the prediction's]
77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: same setup; update (2) uses the next candidate whose success rate exceeds the prediction's]
78

  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 13: Grounded Language Learning Models for Ambiguous  Supervision

Conventional Language Learning Systems
• Requires manually annotated corpora
• Time-consuming, hard to acquire, and not scalable

[Diagram: Manually Annotated Training Corpora (NL/MRL pairs) → Semantic Parser Learner → Semantic Parser, which maps NL → MRL]

13

Learning from Perceptual Environment

• Motivated by how children learn language in a rich, ambiguous perceptual environment with linguistic input
• Advantages
  – Naturally obtainable corpora
  – Relatively easy to annotate
  – Motivated by the natural process of human language learning

14

Navigation Example

Alice: 식당에서 우회전 하세요 (Turn right at the restaurant)

Bob

15 (Slide from David Chen)

Navigation Example

Alice: 병원에서 우회전 하세요 (Turn right at the hospital)

Bob

16 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2

병원에서 우회전 하세요 (Turn right at the hospital)
식당에서 우회전 하세요 (Turn right at the restaurant)

17 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2

식당에서 우회전 하세요 (Turn right at the restaurant)
병원에서 우회전 하세요 (Turn right at the hospital)

18 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2

식당에서 우회전 하세요 (Turn right at the restaurant)
병원에서 우회전 하세요 (Turn right at the hospital)

Make a right turn

19 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2

병원에서 우회전 하세요 (Turn right at the hospital)
식당에서 우회전 하세요 (Turn right at the restaurant)

20 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2

식당 (restaurant)

21 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2

병원에서 우회전 하세요 (Turn right at the hospital)
식당에서 우회전 하세요 (Turn right at the restaurant)

22 (Slide from David Chen)

Navigation Example

Scenario 1 / Scenario 2

병원 (hospital)

23 (Slide from David Chen)

Thesis Contributions
• Generative models for grounded language learning from an ambiguous perceptual environment
  – Unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – General framework of probabilistic approaches that learn NL–MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not applicable: no single gold-standard reference for training data
  – Weak response from the perceptual environment can train a discriminative reranker

24

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

25

Navigation Task (Chen and Mooney 2011)
• Learn to interpret and follow navigation instructions
  – e.g., “Go down this hall and make a right when you see an elevator to your left”
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

26

[Map diagram of the virtual world; legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair]

Sample Environment (MacMahon et al. 2006)

27

Executing Test Instruction

28


Task Objective
• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into the correct formal plan of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney 2011)
  – Landmarks plans
    • Describe actions in the environment along with notable objects encountered on the way
    • Overestimate the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant for the instruction

29

Challenges

30

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Correct plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL–MR correspondence problem
  – Constructs supervised NL–MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL–MR correspondences
  – Transforms the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals

34

35

System Diagram (Chen and Mooney 2011)

[Diagram: learning system for parsing navigation instructions. Training observations (instruction, world state, action trace) feed a Navigation Plan Constructor that produces a landmarks plan; Plan Refinement (possible information loss) yields a supervised refined plan for the (supervised) semantic parser learner. At test time, the learned semantic parser maps an instruction and world state to a plan, which the Execution Module (MARCO) executes to produce an action trace]

36

System Diagram of Proposed Solution

[Diagram: the same pipeline, but the landmarks plan goes directly to a probabilistic semantic parser learner trained from ambiguous supervision, with no separate plan-refinement step]

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (order of 10s)
  – Only outputs MRs included in the PCFG constructed from training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL–MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse with semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney 2011)
• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is, given that phrase w is seen
• Examples
  – “to the stool”: Travel(), Verify(at BARSTOOL)
  – “black easel”: Verify(at EASEL)
  – “turn left and walk”: Turn(), Travel()

39

[The score contrasts the cooccurrence of g and w with the general occurrence of g without w]
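The contrast between "cooccurrence of g and w" and "occurrence of g without w" can be sketched with simple counts. This is an illustrative reconstruction, not the exact scoring function of Chen and Mooney (2011); the function name and the particular score p(g | w) − p(g | ¬w) are assumptions.

```python
from collections import Counter

def lexicon_scores(pairs):
    """Score (phrase w, subgraph g) pairs by how much more often g
    appears in examples containing w than in examples without w.
    `pairs` is a list of (set_of_phrases, set_of_subgraphs) examples."""
    cooc = Counter()     # examples containing both w and g
    w_count = Counter()  # examples containing w
    g_count = Counter()  # examples containing g
    n = len(pairs)
    for phrases, graphs in pairs:
        for w in phrases:
            w_count[w] += 1
        for g in graphs:
            g_count[g] += 1
        for w in phrases:
            for g in graphs:
                cooc[(w, g)] += 1
    scores = {}
    for (w, g), c in cooc.items():
        p_with = c / w_count[w]
        # fraction of examples *without* w that still contain g
        p_without = (g_count[g] - c) / max(n - w_count[w], 1)
        scores[(w, g)] = p_with - p_without
    return scores
```

A phrase that reliably co-occurs with a subgraph scores high; a subgraph that appears everywhere scores low for every phrase.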

Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

40

[Example LHG: a full context MR (Turn(RIGHT), Verify(side HATRACK, front SOFA), Travel(steps 3), Verify(at EASEL)) at the root, decomposed by the subgraph relationship into smaller lexeme MRs such as Travel(), Verify(at EASEL); Turn(RIGHT), Verify(side HATRACK); and Turn(), Verify(side HATRACK)]

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which are finally connected to the NL instruction
    • Each node generates all k-permutations of its children nodes; we do not know which subset is correct
  – NL words are generated by lexeme nodes via a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41

PCFG Construction

42

[Diagram: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
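The CKY step above finds the highest-probability parse under the induced PCFG. A minimal probabilistic CKY sketch, assuming a toy grammar in Chomsky normal form (the rule encoding and symbol names are illustrative, not the thesis grammar):

```python
import math
from collections import defaultdict

def cky_best_parse(words, rules, root):
    """Most probable parse score of `words` under a CNF PCFG.
    `rules` maps nonterminal A -> list of (rhs, prob), where rhs is
    either (B, C) for a binary rule or (word,) for a lexical rule."""
    n = len(words)
    best = defaultdict(lambda: -math.inf)  # (i, j, A) -> best log prob
    back = {}                              # backpointers for tree recovery
    # lexical rules fill length-1 spans
    for i, w in enumerate(words):
        for a, rhss in rules.items():
            for rhs, p in rhss:
                if rhs == (w,) and math.log(p) > best[(i, i + 1, a)]:
                    best[(i, i + 1, a)] = math.log(p)
                    back[(i, i + 1, a)] = (w,)
    # binary rules combine adjacent spans, shortest spans first
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, rhss in rules.items():
                    for rhs, p in rhss:
                        if len(rhs) == 2:
                            b, c = rhs
                            s = math.log(p) + best[(i, k, b)] + best[(k, j, c)]
                            if s > best[(i, j, a)]:
                                best[(i, j, a)] = s
                                back[(i, j, a)] = (k, b, c)
    return best[(0, n, root)], back
```

Following the backpointers from `(0, n, root)` recovers the tree whose lexeme nonterminals are then composed into the final MR.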

[Slides 44–46: most probable parse tree for the test instruction “Turn left and find the sofa then turn around the corner”, built from lexeme MRs such as Turn(LEFT), Travel(), Verify(at SOFA), and Turn(RIGHT); marking the responsible components composes the final MR Turn(LEFT), Verify(front SOFA), Travel(steps 2), Verify(at SOFA), Turn(RIGHT)]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already accounted for

48
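One standard way to encode "generate relevant lexemes one by one" as PCFG rules is a chain where the context-MR nonterminal emits a lexeme and either continues or stops. This encoding and the symbol naming are assumptions for illustration, not the exact rule schema of the thesis:

```python
def unigram_generation_rules(context_mr, lexemes):
    """Sketch of a Unigram Generation rule set: the context-MR
    nonterminal CTX emits one lexeme nonterminal at a time via a
    unigram Markov chain, so no k-permutation rules are needed."""
    ctx = f"CTX[{context_mr}]"
    rules = []
    for lex in lexemes:
        rules.append((ctx, (f"LEX[{lex}]", ctx)))  # emit lexeme, continue
        rules.append((ctx, (f"LEX[{lex}]",)))      # emit lexeme, stop
    return rules
```

Because any lexeme can be emitted at any chain position, every ordering of the relevant lexemes is derivable without enumerating permutations, which keeps the grammar small.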

PCFG Construction

49

[Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Slides 51–54: most probable parse tree for the same test instruction under the Unigram Generation model: the context MR Turn(LEFT), Verify(front BLUE HALL, front SOFA), Travel(steps 2), Verify(at SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at SOFA); and Turn(); their components are marked in the context MR to form the final parse]

54

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph: “Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.”

Single sentences: “Take the wood path towards the easel.” / “At the easel, go left and then take a right on the blue path at the corner.” / …

Action traces: Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward / Forward, Turn left, Forward, Turn right / Turn

Data Statistics

                                   Paragraph      Single-Sentence
# Instructions                     706            3236
Avg. # sentences                   5.0 (±2.8)     1.0 (±0)
Avg. # actions                     10.4 (±5.7)    2.1 (±2.4)
Avg. # words/sent. (English)       37.6 (±21.1)   7.8 (±5.1)
Avg. # words/sent. (Chinese-Word)  31.6 (±18.1)   6.9 (±4.9)
Avg. # words/sent. (Chinese-Char.) 48.9 (±28.3)   10.6 (±7.3)
Vocabulary (English)               660            629
Vocabulary (Chinese-Word)          661            508
Vocabulary (Chinese-Character)     448            328

56

Evaluations
• Leave-one-map-out approach
  – 2 maps for training, 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58
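The F1 column in the tables that follow is the harmonic mean of precision and recall; for instance, the first row's 90.16 precision and 55.41 recall give an F1 near 68.59. A one-liner makes the relation checkable:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, as reported in the
    parse-accuracy tables (values in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```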

Parse Accuracy (English)

System                           Precision   Recall   F1
Chen & Mooney (2011)             90.16       55.41    68.59
Chen (2012)                      88.36       57.03    69.31
Hierarchy Generation PCFG Model  87.58       65.41    74.81
Unigram Generation PCFG Model    86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

System                           Precision   Recall   F1
Chen (2012)                      88.87       58.76    70.74
Hierarchy Generation PCFG Model  80.56       71.14    75.53
Unigram Generation PCFG Model    79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

System                           Precision   Recall   F1
Chen (2012)                      92.48       56.47    70.01
Hierarchy Generation PCFG Model  79.77       67.38    73.05
Unigram Generation PCFG Model    79.73       75.52    77.55

61

End-to-End Execution Evaluations
• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one single-sentence execution

62

End-to-End Execution Evaluations (English)

System                           Single-Sentence   Paragraph
Chen & Mooney (2011)             54.4              16.18
Chen (2012)                      57.28             19.18
Hierarchy Generation PCFG Model  57.22             20.17
Unigram Generation PCFG Model    67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

System                           Single-Sentence   Paragraph
Chen (2012)                      58.7              20.13
Hierarchy Generation PCFG Model  61.03             19.08
Unigram Generation PCFG Model    63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

System                           Single-Sentence   Paragraph
Chen (2012)                      57.27             16.73
Hierarchy Generation PCFG Model  55.61             12.74
Unigram Generation PCFG Model    62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                     Hierarchy Generation PCFG    Unigram Generation PCFG
Data                 |Grammar|    Time (hrs)      |Grammar|    Time (hrs)
English              20,451       17.26           16,357       8.78
Chinese (Word)       21,636       15.99           15,459       8.05
Chinese (Character)  19,792       18.64           13,514       12.58

67

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability

[Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]

71

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, we have weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good or bad
  – Multiple candidate parses for the parameter update
    • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: a training example passes through the trained baseline generative model (GEN) to produce n-best candidates with feature vectors a1 … an and perceptron scores (e.g., −0.16, 1.21, −1.09, 1.46, 0.59); the perceptron compares the best prediction against the gold-standard reference (feature vector a_g) and updates the weights by the feature-vector difference a_g − a_4. For our generative models, however, no gold-standard reference is available]

73
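For reference, the standard averaged-perceptron reranker that the slide adapts can be sketched as follows, assuming the usual supervised setting with a gold candidate per example (feature names and the training-data shape are illustrative):

```python
def averaged_perceptron_rerank(train, n_epochs=5):
    """Averaged-perceptron reranking in the style of Collins (2000).
    `train` is a list of (candidate_feature_dicts, gold_index)."""
    w, w_sum, t = {}, {}, 0

    def score(feats):
        return sum(w.get(f, 0.0) * v for f, v in feats.items())

    for _ in range(n_epochs):
        for candidates, gold in train:
            pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
            if pred != gold:
                # move toward the gold candidate, away from the prediction
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[pred].items():
                    w[f] = w.get(f, 0.0) - v
            # accumulate weights after every example for averaging
            for f, v in w.items():
                w_sum[f] = w_sum.get(f, 0.0) + v
            t += 1
    return {f: v / t for f, v in w_sum.items()}
```

The averaging over all intermediate weight vectors is what distinguishes this from the vanilla perceptron and reduces overfitting. The next slides replace the missing gold index with response feedback.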

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
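Averaging over repeated nondeterministic executions is straightforward; this sketch assumes some callable `execute` that runs one MARCO trial and reports success (the callable and its interface are hypothetical stand-ins):

```python
def execution_success_rate(execute, mr_plan, trials=10):
    """Average success of a candidate MR plan over repeated
    nondeterministic executions (the slides average 10 MARCO trials).
    `execute(mr_plan)` returns True when the plan reaches the
    intended destination."""
    return sum(bool(execute(mr_plan)) for _ in range(trials)) / trials
```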

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the derived MRs of the n-best candidates (MR1 … MRn) are run through the MARCO execution module, giving execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) becomes the pseudo-gold reference, and the perceptron updates from the best prediction toward it by the feature-vector difference (perceptron scores, e.g., 1.79, 0.21, −1.09, 1.46, 0.59)]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – “Lower” execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but containing correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with feature-vector differences weighted by the difference between execution success rates

76
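The bullet points above amount to one perceptron update per better-executing candidate, scaled by the rate gap. A minimal sketch, with names and the exact weighting scheme as assumptions rather than the paper's code:

```python
def response_based_update(w, candidates, exec_rates):
    """One response-based perceptron update.  `candidates` are feature
    dicts, `exec_rates` their MARCO execution success rates.  Every
    candidate whose rate beats the currently predicted candidate's
    rate contributes a feature-difference update weighted by the gap."""
    def score(feats):
        return sum(w.get(f, 0.0) * v for f, v in feats.items())

    pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    for i, rate in enumerate(exec_rates):
        gap = rate - exec_rates[pred]
        if gap > 0:  # this candidate executes better than the prediction
            for f, v in candidates[i].items():
                w[f] = w.get(f, 0.0) + gap * v
            for f, v in candidates[pred].items():
                w[f] = w.get(f, 0.0) - gap * v
    return w
```

Candidates far better than the prediction move the weights more; ties and worse candidates contribute nothing.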

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: update (1): with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and the current best prediction at rate 0.4, the perceptron first updates toward the candidate with rate 0.6]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: update (2): a further update toward the candidate with execution success rate 0.9]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front SOFA, back EASEL), Travel(steps 2), Verify(at SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front SOFA)    L3: Travel(steps 2), Verify(at SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at SOFA)    L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, “find”) = 1

79
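Extracting such indicator features from a parse tree is a simple traversal. In this sketch a tree is a nested tuple whose node labels mirror the slide's lexeme names; the feature encodings (parent→child, parent⇒child-sequence, lexeme–word) are illustrative choices:

```python
def parse_tree_features(tree):
    """Collect binary indicator features from a parse tree given as
    nested tuples, e.g. ("L3", ("L5", "find"), ("L6", "turn"))."""
    feats = set()

    def label(node):
        return node[0] if isinstance(node, tuple) else node

    def walk(node):
        if not isinstance(node, tuple):
            return  # terminal word, nothing to expand
        head, children = node[0], node[1:]
        for c in children:
            feats.add((head, "->", label(c)))          # parent -> child
        feats.add((head, "=>") + tuple(label(c) for c in children))  # full RHS
        for c in children:
            walk(c)

    walk(tree)
    return feats
```

Each feature fires with value 1 if present, matching the f(·) = 1 notation on the slide.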

Evaluations
• Leave-one-map-out approach
  – 2 maps for training, 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans (and corresponding parses) out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse lists from the baseline model

80
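Deduplicating a large k-best list down to n distinct MR plans, as described above, is a single pass with a seen-set (function name and data shape are illustrative):

```python
def distinct_mr_candidates(kbest, n=50):
    """Keep the first `n` parses whose composed MR plans are distinct.
    `kbest` is an iterable of (parse, mr_plan) pairs in decreasing
    model-score order, e.g. drawn from a 1,000,000-best list."""
    seen, kept = set(), []
    for parse, plan in kbest:
        if plan not in seen:
            seen.add(plan)
            kept.append((parse, plan))
            if len(kept) == n:
                break
    return kept
```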

Response-based Update vs. Baseline (English)

Parse F1           Baseline   Response-based
Hierarchy          74.81      73.32
Unigram            76.44      77.24

Single-sentence    Baseline   Response-based
Hierarchy          57.22      59.65
Unigram            67.14      68.27

Paragraph          Baseline   Response-based
Hierarchy          20.17      22.62
Unigram            28.12      29.2

81

Response-based Update vs. Baseline (Chinese-Word)

Parse F1           Baseline   Response-based
Hierarchy          75.53      77.26
Unigram            76.41      77.74

Single-sentence    Baseline   Response-based
Hierarchy          61.03      64.12
Unigram            63.4       65.64

Paragraph          Baseline   Response-based
Hierarchy          19.08      21.29
Unigram            23.12      23.74

82

Response-based Update vs. Baseline (Chinese-Character)

Parse F1           Baseline   Response-based
Hierarchy          73.05      76.26
Unigram            77.55      79.76

Single-sentence    Baseline   Response-based
Hierarchy          55.61      64.08
Unigram            62.85      65.5

Paragraph          Baseline   Response-based
Hierarchy          12.74      22.25
Unigram            23.33      25.35

83

Response-based Update vs Baseline

• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

Parse F1           Single   Multi
Hierarchy          73.32    73.43
Unigram            77.24    77.81

Single-sentence    Single   Multi
Hierarchy          59.65    62.81
Unigram            68.27    68.93

Paragraph          Single   Multi
Hierarchy          22.62    26.57
Unigram            29.2     29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

Parse F1           Single   Multi
Hierarchy          77.26    78.8
Unigram            77.74    78.11

Single-sentence    Single   Multi
Hierarchy          64.12    64.15
Unigram            65.64    66.27

Paragraph          Single   Multi
Hierarchy          21.29    21.55
Unigram            23.74    25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

Parse F1           Single   Multi
Hierarchy          76.26    79.44
Unigram            79.76    79.94

Single-sentence    Single   Multi
Hierarchy          64.08    64.08
Unigram            65.5     66.84

Paragraph          Single   Multi
Hierarchy          22.25    22.58
Unigram            25.35    27.16

87

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 14: Grounded Language Learning Models for Ambiguous  Supervision

Learning from Perceptual Environment

• Motivated by how children learn language in a rich, ambiguous perceptual environment with linguistic input
• Advantages
– Naturally obtainable corpora
– Relatively easy to annotate
– Motivated by the natural process of human language learning

14

Navigation Example

Alice: 식당에서 우회전 하세요 ("Turn right at the restaurant")
Bob

15
Slide from David Chen

Navigation Example

Alice
Bob

병원에서 우회전 하세요 ("Turn right at the hospital")

16
Slide from David Chen

Navigation Example

Scenario 1
Scenario 2

병원에서 우회전 하세요 ("Turn right at the hospital")
식당에서 우회전 하세요 ("Turn right at the restaurant")

17
Slide from David Chen

Navigation Example

Scenario 1
Scenario 2

식당에서 우회전 하세요 ("Turn right at the restaurant")
병원에서 우회전 하세요 ("Turn right at the hospital")

18
Slide from David Chen

Navigation Example

Scenario 1
Scenario 2

식당에서 우회전 하세요 ("Turn right at the restaurant")
병원에서 우회전 하세요 ("Turn right at the hospital")

Make a right turn

19
Slide from David Chen

Navigation Example

Scenario 1
Scenario 2

병원에서 우회전 하세요 ("Turn right at the hospital")
식당에서 우회전 하세요 ("Turn right at the restaurant")

20
Slide from David Chen

Navigation Example

Scenario 1
Scenario 2

식당 ("restaurant")

21
Slide from David Chen

Navigation Example

Scenario 1
Scenario 2

병원에서 우회전 하세요 ("Turn right at the hospital")
식당에서 우회전 하세요 ("Turn right at the restaurant")

22
Slide from David Chen

Navigation Example

Scenario 1
Scenario 2

병원 ("hospital")

23
Slide from David Chen

Thesis Contributions

• Generative models for grounded language learning from an ambiguous perceptual environment
– Unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
– General framework of probabilistic approaches that learn NL-MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
– Standard reranking is not available
– No single gold-standard reference for training data
– Weak response from the perceptual environment can train a discriminative reranker

24

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

25

Navigation Task (Chen and Mooney 2011)

• Learn to interpret and follow navigation instructions
– e.g., "Go down this hall and make a right when you see an elevator to your left."
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

26

Sample Environment (MacMahon et al. 2006)

[Figure: overhead map of a virtual world, with objects placed along interconnected hallways]

H – Hat Rack
L – Lamp
E – Easel
S – Sofa
B – Barstool
C – Chair

27

Executing Test Instruction

28


Task Objective

• Learn the underlying meanings of instructions by observing human actions for the instructions
– Learn to map instructions (NL) into a correct formal plan of actions (MR)
• Learn from high ambiguity
– Training input: pairs of NL instruction and landmarks plan (Chen and Mooney 2011)
– Landmarks plan
  • Describes actions in the environment along with notable objects encountered on the way
  • Overestimates the meaning of the instruction, including unnecessary details
  • Only a subset of the plan is relevant for the instruction

29

Challenges

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Landmarks plan: Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

30

Challenges

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Landmarks plan: [the same plan, with the components relevant to the instruction highlighted]

31

Challenges

Instruction: at the easel, go left and then take a right onto the blue path at the corner

Correct plan: [the subset of the landmarks plan that the instruction actually means]

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

32
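To make the combinatorics concrete: any subset of the landmarks-plan components could, in principle, be the meaning of the instruction, so the number of candidate meanings grows exponentially with plan length. A minimal sketch (the component strings are illustrative, not the actual MRL):

```python
from itertools import combinations

def candidate_meanings(plan_components):
    """Enumerate every non-empty subset of landmarks-plan components
    that could, in principle, be the meaning of one instruction."""
    subsets = []
    for k in range(1, len(plan_components) + 1):
        subsets.extend(combinations(plan_components, k))
    return subsets

plan = ["Travel(steps 1)", "Verify(at EASEL)", "Turn(LEFT)",
        "Verify(front CONCRETE HALLWAY)", "Turn(RIGHT)"]
cands = candidate_meanings(plan)
# 2^5 - 1 = 31 candidate component subsets from just five components
```

Even this ignores orderings and argument substructure, which only make the space larger.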

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL-MR correspondence problem
– Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
  • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
– Trains a supervised semantic parser to map a novel instruction (NL) to a correct formal plan (MR)
– Loses information during refinement
  • Deterministically selects high-score lexemes
  • Ignores possibly useful low-score lexemes
  • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from the ambiguous training data
– Disambiguates the input and learns to map NL instructions to formal MR plans
– Uses the semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL-MR correspondences
– Transforms the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals

34

35

System Diagram (Chen and Mooney 2011)

[Diagram: learning system for parsing navigation instructions. Training: an observation (instruction, world state, action trace) feeds the navigation plan constructor to produce a landmarks plan; plan refinement (possible information loss) yields a supervised refined plan for the (supervised) semantic parser learner. Testing: the learned semantic parser maps an instruction and world state to a plan, which the execution module (MARCO) runs to produce an action trace.]

36

System Diagram of Proposed Solution

[Diagram: the same pipeline, except the landmarks plan feeds directly into a probabilistic semantic parser learner (from ambiguous supervision), with no separate plan-refinement step.]

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
– Only works in low-ambiguity settings: 1 NL paired with a handful of MRs (order of 10s)
– Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
– Uses semantic lexemes as the units of semantic concepts
– Disambiguates NL-MR correspondences at the semantic-concept (lexeme) level
– Disambiguates a much higher level of ambiguous supervision
– Outputs novel MRs never seen in the PCFG by composing the MR parse from semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney 2011)

• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
– How probable graph g is, given that phrase w is seen
• Examples
– "to the stool": Travel(), Verify(at BARSTOOL)
– "black easel": Verify(at EASEL)
– "turn left and walk": Turn(), Travel()

Score intuition: co-occurrence of g and w, discounted by the general occurrence of g without w

39
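The score intuition above can be instantiated in several ways; the following is a hypothetical variant, p(g | w) − p(g | ¬w), computed from co-occurrence counts, and is not necessarily the exact formula of Chen and Mooney (2011):

```python
def lexeme_score(cooc, w_count, g_count, n_examples):
    """Association of MR subgraph g with NL phrase w:
    p(g | w) - p(g | not w).  A hypothetical instantiation of the
    correlation score sketched on the slide."""
    p_g_given_w = cooc / w_count
    without_w = n_examples - w_count
    p_g_without_w = (g_count - cooc) / without_w if without_w else 0.0
    return p_g_given_w - p_g_without_w

# "to the stool" co-occurs with Verify(at BARSTOOL) in 8 of the 10 examples
# containing the phrase; the subgraph appears in 12 of 100 examples overall.
score = lexeme_score(cooc=8, w_count=10, g_count=12, n_examples=100)
```

A positive score means the subgraph is more likely when the phrase is present than when it is absent.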

Lexeme Hierarchy Graph (LHG)

• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
– Lexeme MRs = semantic concepts
– Lexeme hierarchy = semantic concept hierarchy
– Shows how complicated semantic concepts hierarchically generate smaller concepts and are further connected to NL word groundings

[Figure: example LHG. A complex lexeme such as Turn(RIGHT) Verify(side HATRACK, front SOFA) Travel(steps 3) Verify(at EASEL) decomposes into smaller lexemes, e.g. Turn(RIGHT) Verify(side HATRACK) Travel and Travel Verify(at EASEL), down to single-action lexemes such as Turn and Verify(side HATRACK).]

40

PCFG Construction

• Add rules for each node in the LHG
– Each complex concept chooses which subconcepts to describe; these are finally connected to the NL instruction
  • Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
– NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
– PCFG rule weights are optimized by EM
  • The most probable MR components out of all possible combinations are estimated

41
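The k-permutation rule expansion can be sketched directly: a parent lexeme nonterminal gets one rule for every ordered selection of its children, for every selection size. The lexeme names below are illustrative.

```python
from itertools import permutations

def child_rules(parent, children):
    """Enumerate PCFG rules expanding a parent lexeme nonterminal into
    every k-permutation of its child lexemes (k = 1..len(children)),
    because we do not know which subset of subconcepts is described."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules

rules = child_rules("Turn(LEFT) Verify(at SOFA)",
                    ["Turn(LEFT)", "Verify(at SOFA)"])
# 2 one-child rules + 2 two-child orderings = 4 rules
```

This is also why the model's grammar grows quickly: with three children there are already 3 + 6 + 6 = 15 rules per node.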

PCFG Construction

[Figure: PCFG rules constructed for one training example]
– Child concepts are generated from parent concepts selectively
– All semantic concepts generate relevant NL words
– Each semantic concept generates at least one NL word

42

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
– Consider only the lexeme MRs responsible for generating NL words
– From the bottom of the tree, mark only responsible MR components that propagate to the top level
– Able to compose novel MRs never seen in the training data

43
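As a concrete illustration of the CKY step, here is a minimal Viterbi-CKY for a toy PCFG in Chomsky normal form. The grammar, probabilities, and nonterminal names are invented for the example and are far smaller than the induced grammars.

```python
import math
from collections import defaultdict

def viterbi_cky(words, lexical, binary):
    """Most-probable-parse CKY for a CNF PCFG.
    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    Returns a chart {(i, j, A): best log-prob of A over words[i:j]}."""
    n = len(words)
    chart = defaultdict(lambda: -math.inf)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1, A)] = max(chart[(i, i + 1, A)], math.log(p))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for m in range(i + 1, j):          # split point
                for (A, B, C), p in binary.items():
                    s = math.log(p) + chart[(i, m, B)] + chart[(m, j, C)]
                    if s > chart[(i, j, A)]:
                        chart[(i, j, A)] = s
    return chart

lexical = {("Turn", "turn"): 1.0, ("Dir", "left"): 1.0}
binary = {("Action", "Turn", "Dir"): 1.0}
chart = viterbi_cky(["turn", "left"], lexical, binary)
```

In the real model the chart would additionally keep backpointers so the best tree (and hence the responsible lexeme MRs) can be read off.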

[Slides 44-46: worked example. The most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner" is shown; the lexeme MRs responsible for generating NL words (e.g. Turn(LEFT), Travel() Verify(at SOFA), Turn()) are marked from the bottom of the tree and propagated upward to compose the final MR plan.]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
– Complexity caused by the Lexeme Hierarchy Graph and the k-permutation rules
– Tends to over-fit to the training data
• Proposed solution: a simpler model
– Generates the relevant semantic lexemes one by one
– No extra PCFG rules for k-permutations
– Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
– Each context MR generates its relevant lexemes one by one
– Permutations of the order in which relevant lexemes appear are thereby already covered

48

PCFG Construction

[Figure: PCFG rules constructed for one training example]
– Each semantic concept is generated by a unigram Markov process
– All semantic concepts generate relevant NL words

49

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
– Consider only the lexeme MRs responsible for generating NL words
– Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

[Slides 51-54: worked example. For the test instruction "Turn left and find the sofa then turn around the corner", the most probable parse tree has the full context MR at the top nonterminal; the relevant lexemes generated from it (e.g. Turn(LEFT), Travel() Verify(at SOFA), Turn()) are marked within the context MR to yield the final plan.]

54

Data

• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
– Word-segmented version by the Stanford Chinese Word Segmenter
– Character-segmented version

[Example: the paragraph "Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7." is split into single sentences, each paired with its segment of the overall action sequence (Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward).]

55

Data Statistics

                                Paragraph        Single-Sentence
# Instructions                  706              3236
Avg. # sentences                5.0 (±2.8)       1.0 (±0)
Avg. # actions                  10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sentence
  English                       37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word                  31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character             48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                       660              629
  Chinese-Word                  661              508
  Chinese-Character             448              328

56

Evaluations

• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
– The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
  • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
  • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
– Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

58
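A simplified stand-in for partial parse accuracy, treating an MR parse as a set of components and scoring the overlap with the gold parse; the actual metric may assign partial credit differently.

```python
def partial_parse_f1(predicted, gold):
    """Precision, recall, and F1 over the overlap of predicted vs. gold
    MR components (a simplified version of partial parse accuracy)."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    overlap = len(pred & ref)
    p = overlap / len(pred)
    r = overlap / len(ref)
    f1 = 2 * p * r / (p + r) if overlap else 0.0
    return p, r, f1

p, r, f1 = partial_parse_f1(
    ["Turn(LEFT)", "Travel()", "Verify(at SOFA)"],
    ["Turn(LEFT)", "Travel(steps 2)", "Verify(at SOFA)"])
```

Here two of three components match on each side, so precision and recall are both 2/3.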

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.10       68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
– Also considers facing direction in the single-sentence evaluation
– Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.40             16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.70             20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.40             23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion

• Better recall in parse accuracy
– Our probabilistic model also uses useful but low-score lexemes → more coverage
– Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to training data
– Complexity: LHG and k-permutation rules
– Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
– Less complexity avoids over-fitting and gives better generalization
• Better than Borschinger et al. 2011
– Overcomes intractability in complex MRLs
– Learns from more general, complex ambiguity
– Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation         Unigram Generation
Data                  |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English               20,451      17.26            16,357      8.78
Chinese (Word)        21,636      15.99            15,459      8.05
Chinese (Character)   19,792      18.64            13,514      12.58

67

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• An effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
– Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
– Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
– Part-of-speech tagging (Collins, EMNLP 2002)
– Semantic role labeling (Toutanova et al., ACL 2005)
– Named entity recognition (Collins, ACL 2002)
– Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
– Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
– The trained model outputs the best result with maximum probability

[Diagram: a testing example goes through the trained generative model, which outputs the 1-best candidate with maximum probability]

70

Discriminative Reranking

• Can we do better?
– A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: the trained baseline generative model (GEN) produces n-best candidates 1..n for a testing example; the trained secondary discriminative model selects the best prediction as output]

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
– There is no single gold-standard reference for each training example
– Instead, there is only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
– Evaluate candidate formal MRs by executing them in simulated worlds
  • Also used in evaluating the final end-task plan execution
– A weak indication of whether a candidate is good or bad
– Multiple candidate parses are used for the parameter update
  • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: the baseline generative model (GEN) produces n-best candidates with feature vectors a_1 ... a_n and perceptron scores (e.g. -0.16, 1.21, -1.09, 1.46, 0.59); when the best prediction differs from the gold-standard reference a_g, the weights are updated by the feature-vector difference a_g - a_4. For our generative models, such a gold-standard reference is not available.]

73
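For reference, the standard averaged-perceptron reranker, which assumes a gold candidate per example (exactly what our setting lacks), can be sketched as follows; the feature vectors are invented for illustration:

```python
def averaged_perceptron(examples, n_feats, epochs=5):
    """Averaged-perceptron reranker (Collins 2000).
    examples: list of (candidate_feature_vectors, gold_index)."""
    w = [0.0] * n_feats
    w_sum = [0.0] * n_feats
    steps = 0
    for _ in range(epochs):
        for cands, gold in examples:
            scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in cands]
            pred = max(range(len(cands)), key=scores.__getitem__)
            if pred != gold:
                # move the weights toward the gold candidate's features
                w = [wi + g - p for wi, g, p in zip(w, cands[gold], cands[pred])]
            w_sum = [ws + wi for ws, wi in zip(w_sum, w)]
            steps += 1
    return [ws / steps for ws in w_sum]   # averaged weights

# two candidates; feature 1 fires only on the gold parse
cands = [[1.0, 0.0], [0.0, 1.0]]
w = averaged_perceptron([(cands, 1)], n_feats=2)
```

Averaging the weight vector over all update steps reduces the variance of the final model relative to using the last weight vector.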

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
– The one most preferred in terms of plan execution
– Evaluate the composed MR plans from the candidate parses
– The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
  • Also used for evaluating end-goal plan execution performance
– Record the execution success rate
  • Whether each candidate MR reaches the intended destination
  • MARCO is nondeterministic: average over 10 trials
– Prefer the candidate with the best execution success rate during training

74

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates are composed into derived MRs MR_1 ... MR_n; the MARCO execution module assigns each an execution success rate (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron updates by the feature-vector difference between it and the best prediction (perceptron scores e.g. 1.79, 0.21, -1.09, 1.46, 0.59)]

75
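The pseudo-gold selection can be sketched as below. The executor is a toy stand-in for MARCO, whose nondeterminism is simulated with a seeded RNG so that the sketch is reproducible; success probabilities are invented for illustration.

```python
import random

def success_rate(mr_plan, execute, trials=10, seed=0):
    """Average execution success over repeated nondeterministic trials
    (MARCO itself is nondeterministic; `execute` is a stand-in that
    returns True/False for one trial)."""
    rng = random.Random(seed)
    return sum(execute(mr_plan, rng) for _ in range(trials)) / trials

def pick_pseudo_gold(candidates, execute, trials=10):
    """Choose the candidate MR plan with the highest success rate."""
    rates = [success_rate(c, execute, trials) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: rates[i])
    return best, rates

# toy executor: plan "A" succeeds 90% of the time, "B" only 40%
probs = {"A": 0.9, "B": 0.4}
best, rates = pick_pseudo_gold(["B", "A"],
                               lambda plan, rng: rng.random() < probs[plan])
```

Because both candidates are scored on the same seeded draw sequence, the higher-probability plan "A" (index 1) wins here.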

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
– Multiple parses may share the same maximum execution success rate
– "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
  • MR plans may be underspecified or have ignorable details attached
  • Sometimes inaccurate, but containing the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
– Use every candidate with a higher execution success rate than the currently best-predicted candidate
– Update with the feature-vector difference, weighted by the difference between execution success rates

76
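A sketch of the multiple-parse update under these assumptions; the feature vectors and success rates are invented for illustration, and the exact weighting scheme in the paper may differ in detail:

```python
def multi_parse_update(w, cands, rates):
    """Response-based perceptron update with multiple parses: every
    candidate whose execution success rate beats that of the currently
    predicted candidate contributes its feature difference, weighted by
    the gap in success rates."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in cands]
    pred = max(range(len(cands)), key=scores.__getitem__)
    for f, rate in zip(cands, rates):
        if rate > rates[pred]:
            gap = rate - rates[pred]
            w = [wi + gap * (fi - pi)
                 for wi, fi, pi in zip(w, f, cands[pred])]
    return w

cands = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w0 = [1.0, 0.0, 0.0]          # the model currently prefers candidate 0
w1 = multi_parse_update(w0, cands, rates=[0.2, 0.6, 0.9])
```

After the update, the candidate with the best success rate (index 2) scores highest, but candidate 1 also contributed feedback in proportion to its rate gap.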

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram (slides 77-78): each candidate whose execution success rate (e.g. 0.6, 0.9) exceeds that of the currently predicted parse (e.g. 0.4) triggers an update by its feature-vector difference; perceptron scores shown are e.g. 1.24, 1.83, -1.09, 1.46, 0.59]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Turn left and find the sofa then turn around the corner

L1: Turn(LEFT) Verify(front SOFA, back EASEL) Travel(steps 2) Verify(at SOFA) Turn(RIGHT)
L2: Turn(LEFT) Verify(front SOFA)
L3: Travel(steps 2) Verify(at SOFA) Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel() Verify(at SOFA)
L6: Turn()

f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
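Extracting such rule-composition indicators from a parse tree can be sketched as follows; trees are nested tuples and the labels are illustrative, matching the L1-L6 lexemes above only by name:

```python
def rule_features(tree):
    """Binary indicator features over parent-children compositions in a
    parse tree given as nested tuples (label, child1, child2, ...);
    leaves are plain strings (NL words)."""
    feats = set()

    def walk(node):
        if isinstance(node, str):        # NL word: no rule of its own
            return
        label, *children = node
        kids = tuple(c if isinstance(c, str) else c[0] for c in children)
        feats.add((label,) + kids)       # e.g. f(L3 -> L5 L6) = 1
        for c in children:
            walk(c)

    walk(tree)
    return feats

tree = ("L1", ("L3", ("L5", "find"), ("L6", "turn")), ("L2", "left"))
feats = rule_features(tree)
```

The resulting set feeds the perceptron as a sparse binary feature vector; each distinct composition becomes one feature dimension.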

Evaluations

• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)
• Compared with the two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Try to get the 50 best distinct composed MR plans, with their parses, out of the 1,000,000-best parses
  • Many parse trees differ insignificantly, leading to the same derived MR plans
  • Generate a sufficiently large 1,000,000-best parse list from the baseline model

80
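Deduplicating the n-best list by derived MR plan can be sketched as below; the MR-derivation function is a toy stand-in for composing the plan from a parse tree.

```python
def distinct_nbest(parses, derive_mr, k=50):
    """Keep the k best parses whose derived MR plans are distinct.
    `parses` is assumed sorted best-first; many parse trees differ only
    insignificantly and collapse to the same MR plan."""
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept

# toy: the parse id mod 3 stands in for the derived MR plan
parses = list(range(10))
kept = distinct_nbest(parses, derive_mr=lambda p: p % 3, k=2)
```

If fewer than k distinct plans exist in the list, the function simply returns all of them, which is why a very large underlying n-best list is generated first.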

Response-based Update vs. Baseline (English)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline         74.81      76.44      57.22      67.14      20.17      28.12
Response-based   73.32      77.24      59.65      68.27      22.62      29.20

81

Response-based Update vs. Baseline (Chinese-Word)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline         75.53      76.41      61.03      63.40      19.08      23.12
Response-based   77.26      77.74      64.12      65.64      21.29      23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline         73.05      77.55      55.61      62.85      12.74      23.33
Response-based   76.26      79.76      64.08      65.50      22.25      25.35

83

Response-based Update vs. Baseline

• vs. the baseline models
– The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single           73.32      77.24      59.65      68.27      22.62      29.20
Multi            73.43      77.81      62.81      68.93      26.57      29.10

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single           77.26      77.74      64.12      65.64      21.29      23.74
Multi            78.80      78.11      64.15      66.27      21.55      25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single           76.26      79.76      64.08      65.50      22.25      25.35
Multi            79.44      79.94      64.08      66.84      22.58      27.16

87

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
– The single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, yet capture the gist of the preferred actions
– A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure
• Large-scale data
– Data collection; model adaptation to large scale
• Machine translation
– Application to summarized translation
• Real perceptual data
– Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Navigation Example (slides from David Chen)

Alice: 식당에서 우회전 하세요 ("Turn right at the restaurant")
Bob

Alice: 병원에서 우회전 하세요 ("Turn right at the hospital")
Bob

Scenario 1: 병원에서 우회전 하세요 ("Turn right at the hospital")
Scenario 2: 식당에서 우회전 하세요 ("Turn right at the restaurant")

The action shared by both scenarios: Make a right turn

The remaining words ground to the landmarks that differ across scenarios: 식당 ("restaurant") and 병원 ("hospital")

Thesis Contributions

• Generative models for grounded language learning from ambiguous perceptual environment
  – Unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – General framework of probabilistic approaches that learn NL-MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not available: no single gold-standard reference for training data
  – Weak response from the perceptual environment can train a discriminative reranker

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Navigation Task (Chen and Mooney, 2011)

• Learn to interpret and follow navigation instructions
  – e.g., "Go down this hall and make a right when you see an elevator to your left."
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

Sample Environment (MacMahon et al. 2006)

[Figure: overhead map of a virtual world with objects placed along hallways]
Legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair

Executing Test Instruction

[Video: a follower executing a test instruction in the virtual world]

Task Objective

• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into correct formal plans of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney, 2011)
  – Landmarks plan:
    • Describes actions in the environment along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant for the instruction

Challenges

Instruction: "at the easel, go left and then take a right onto the blue path at the corner"

Landmarks plan: Travel ( steps 1 ), Verify ( at EASEL, side CONCRETE HALLWAY ), Turn ( LEFT ), Verify ( front CONCRETE HALLWAY ), Travel ( steps 1 ), Verify ( side BLUE HALLWAY, front WALL ), Turn ( RIGHT ), Verify ( back WALL, front BLUE HALLWAY, front CHAIR, front HATRACK, left WALL, right EASEL )

Correct plan: only the subset of the landmarks plan that the instruction actually describes.

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan.

Previous Work (Chen and Mooney, 2011)

• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining landmarks plans with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes; ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

Proposed Solution (Kim and Mooney, 2012)

• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney, 2011) as the basic unit for building NL-MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals

System Diagram (Chen and Mooney, 2011)

[Diagram: Training — instruction + world state + observed action trace → Navigation Plan Constructor → landmarks plan → Plan Refinement → supervised refined plan (possible information loss) → (Supervised) Semantic Parser Learner → Semantic Parser. Testing — instruction + world state → Semantic Parser → plan → Execution Module (MARCO) → action trace.]

System Diagram of Proposed Solution

[Diagram: same pipeline, but the landmarks plan feeds a Probabilistic Semantic Parser Learner (from ambiguous supervision) directly — there is no plan-refinement step and no intermediate information loss.]

PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)

• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney, 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (order of 10s)
  – Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

Semantic Lexicon (Chen and Mooney, 2011)

• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – Scores how probable graph g is given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()

[Figure: the score contrasts the co-occurrence of g and w with the general occurrence of g without w]
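As a rough sketch of such a correlation score (the exact statistic in Chen and Mooney 2011 differs in detail; the counts, smoothing, and data here are illustrative), a lexeme can be scored by contrasting how often g co-occurs with w against how often g occurs without w:

```python
def lexeme_score(pairs, w, g):
    """Score how probable MR subgraph g is given phrase w, by contrasting
    co-occurrence of g and w with occurrence of g without w (an illustrative
    reading of the slide's ratio, not the paper's exact formula)."""
    cooc = sum(1 for words, graphs in pairs if w in words and g in graphs)
    without = sum(1 for words, graphs in pairs if g in graphs and w not in words)
    return cooc / (without + 1)  # +1 smoothing avoids division by zero

# Toy (NL words, MR components of the landmarks plan) contexts:
pairs = [({"to", "the", "stool"}, {"Travel()", "Verify(at: BARSTOOL)"}),
         ({"turn", "left"}, {"Turn(LEFT)"}),
         ({"to", "the", "stool"}, {"Travel()", "Verify(at: BARSTOOL)"})]
print(lexeme_score(pairs, "stool", "Verify(at: BARSTOOL)"))  # 2.0
```

A phrase that always co-occurs with a subgraph, which is never seen without it, gets a high score; subgraphs that occur everywhere are penalized.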

Lexeme Hierarchy Graph (LHG)

• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

[Figure: an LHG whose root MR — Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) — decomposes into smaller lexeme subgraphs such as Turn(), Verify(at: EASEL); Travel(), Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK), Travel(); and Turn(), Verify(side: HATRACK)]

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which will finally be connected to the NL instruction
    • Each node generates all k-permutations of its children nodes — we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

[Diagram: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]
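The k-permutation blow-up can be sketched as follows (the rule encoding is illustrative): each LHG node gets one rule per ordered subset of its children, since we do not know which subset of subconcepts the instruction actually describes.

```python
from itertools import permutations

def node_rules(parent, children):
    """One PCFG rule per k-permutation (ordered subset) of the node's
    children, for k = 1 .. len(children)."""
    return [(parent, list(p))
            for k in range(1, len(children) + 1)
            for p in permutations(children, k)]

rules = node_rules("Turn(LEFT)+Travel()", ["Turn(LEFT)", "Travel()"])
print(len(rules))  # 4: two single-child rules plus both two-child orderings
```

With 3 children this already yields 3 + 6 + 6 = 15 rules, which is one source of the grammar-size and over-fitting issues discussed later.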

Parsing New NL Sentences

• PCFG rule weights are optimized with the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data
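A minimal Viterbi-CKY sketch for "most probable parse" over a binarized PCFG (the toy grammar and probabilities are hand-written stand-ins; the thesis' grammars are induced from data):

```python
def cky_best(words, rules, start):
    """Viterbi CKY over a CNF PCFG. `rules` maps LHS -> list of (RHS, prob),
    where RHS is a 1-tuple terminal or a 2-tuple of nonterminals. Returns
    the probability of the best parse of `words` rooted at `start`."""
    n = len(words)
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):  # fill terminal cells
        for lhs, prods in rules.items():
            for rhs, p in prods:
                if rhs == (word,):
                    best[i][i + 1][lhs] = max(best[i][i + 1].get(lhs, 0.0), p)
    for span in range(2, n + 1):      # combine adjacent spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for mid in range(i + 1, j):
                for lhs, prods in rules.items():
                    for rhs, p in prods:
                        if len(rhs) == 2 and rhs[0] in best[i][mid] and rhs[1] in best[mid][j]:
                            score = p * best[i][mid][rhs[0]] * best[mid][j][rhs[1]]
                            if score > best[i][j].get(lhs, 0.0):
                                best[i][j][lhs] = score
    return best[0][n].get(start, 0.0)

grammar = {"S": [(("NT_Turn", "NT_Travel"), 1.0)],
           "NT_Turn": [(("turn",), 0.6), (("left",), 0.4)],
           "NT_Travel": [(("walk",), 1.0)]}
print(cky_best(["turn", "walk"], grammar, "S"))  # 0.6
```

Tracking back-pointers alongside the scores recovers the tree itself, from which the lexeme MRs are read off.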

[Figure (three animation steps): the most probable parse tree for the test instruction "Turn left and find the sofa, then turn around the corner". The root MR — Turn(LEFT), Verify(front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) — decomposes into lexeme MRs such as Turn(LEFT), Verify(front: SOFA); Travel(), Verify(at: SOFA); and Turn(). The lexeme MRs responsible for generating NL words are marked and propagated upward to compose the final MR parse: Turn(LEFT), Travel(), Verify(at: SOFA), Turn()]

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already covered by the process

[Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]
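The contrast with the k-permutation construction can be sketched as follows (rule encoding illustrative): a context-MR nonterminal emits its relevant lexemes one at a time, so orderings come for free from repeated rule applications instead of enumerated permutation rules.

```python
def unigram_rules(context, lexemes):
    """Unigram Markov generation: Ctx -> Lex Ctx (keep generating) and
    Ctx -> Lex (stop), for each relevant lexeme."""
    rules = []
    for lex in lexemes:
        rules.append((context, [lex, context]))  # emit lexeme, continue
        rules.append((context, [lex]))           # emit last lexeme, stop
    return rules

rules = unigram_rules("CTX", ["Turn(LEFT)", "Travel()", "Verify(at: SOFA)"])
print(len(rules))  # 6 rules, vs 15 k-permutation rules for the same 3 lexemes
```

This linear (rather than factorial) growth in the rule set is consistent with the smaller grammars and shorter EM training times reported below.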

Parsing New NL Sentences

• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

[Figure (animation steps): the most probable parse tree for "Turn left and find the sofa, then turn around the corner" under the Unigram Generation model. The context MR — Turn(LEFT), Verify(front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) — generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn() one by one; their MR components are marked in the context MR at the top nonterminal to compose the final parse]

Data

• 3 maps, 6 instructors, 1–15 followers per direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

Example paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair, and at the chair take a right towards the stool. When you reach the stool you are at 7."

Each paragraph's action trace (Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward) is split per single sentence, e.g. "Take the wood path towards the easel" → Turn, Forward; "At the easel, go left and then take a right on the blue path at the corner" → Turn left, Forward, Turn right.

Data Statistics

                            Paragraph        Single-Sentence
# Instructions              706              3236
Avg. # sentences            5.0 (±2.8)       1.0 (±0)
Avg. # actions              10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sentence
  English                   37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word              31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character         48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                   660              629
  Chinese-Word              661              508
  Chinese-Character         448              328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – The semantic parser KRISP (Kate and Mooney, 2006) is trained on the resulting supervised data

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

Parse Accuracy (English)

                               Precision   Recall   F1
Chen & Mooney (2011)           90.16       55.41    68.59
Chen (2012)                    88.36       57.03    69.31
Hierarchy Generation PCFG      87.58       65.41    74.81
Unigram Generation PCFG        86.1        68.79    76.44

Parse Accuracy (Chinese-Word)

                               Precision   Recall   F1
Chen (2012)                    88.87       58.76    70.74
Hierarchy Generation PCFG      80.56       71.14    75.53
Unigram Generation PCFG        79.45       73.66    76.41

Parse Accuracy (Chinese-Character)

                               Precision   Recall   F1
Chen (2012)                    92.48       56.47    70.01
Hierarchy Generation PCFG      79.77       67.38    73.05
Unigram Generation PCFG        79.73       75.52    77.55

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered for single sentences
  – Paragraph execution is affected by even one failed single-sentence execution

End-to-End Execution (English)

                               Single-Sentence   Paragraph
Chen & Mooney (2011)           54.4              16.18
Chen (2012)                    57.28             19.18
Hierarchy Generation PCFG      57.22             20.17
Unigram Generation PCFG        67.14             28.12

End-to-End Execution (Chinese-Word)

                               Single-Sentence   Paragraph
Chen (2012)                    58.7              20.13
Hierarchy Generation PCFG      61.03             19.08
Unigram Generation PCFG        63.4              23.12

End-to-End Execution (Chinese-Character)

                               Single-Sentence   Paragraph
Chen (2012)                    57.27             16.73
Hierarchy Generation PCFG      55.61             12.74
Unigram Generation PCFG        62.85             23.33

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity avoids over-fitting and gives better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation        Unigram Generation
Data                  |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
English               20,451      17.26           16,357      8.78
Chinese (Word)        21,636      15.99           15,459      8.05
Chinese (Character)   19,792      18.64           13,514      12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability

[Diagram: testing example → trained generative model → 1-best candidate with maximum probability]

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: testing example → trained baseline generative model → GEN → n-best candidates (candidate 1 … candidate n) → trained secondary discriminative model → best prediction → output]

How can we apply discriminative reranking?

• Standard discriminative reranking cannot be applied to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead, training provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating final end-task plan execution
  – Gives a weak indication of whether a candidate is good or bad
  – Multiple candidate parses are used for the parameter update
    • The response signal is weak and distributed over all candidates

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: training example → trained baseline generative model → GEN → n-best candidates with feature vectors a₁ … aₙ and perceptron scores (e.g., -0.16, 1.21, -1.09, 1.46, 0.59); the best prediction is compared against a gold-standard reference a_g, and the weights are updated with the feature-vector difference a_g − a₄. For our generative models, a gold-standard reference is not available.]
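An averaged-perceptron reranker in miniature (the feature dictionaries and data are toy stand-ins; with our generative models the gold index is not given, which is what the response-based update replaces):

```python
from collections import defaultdict

def perceptron_rerank(train, n_epochs=5):
    """Averaged perceptron (Collins 2000): if the current top-scoring
    candidate is not the reference, add the reference's features and
    subtract the prediction's; weights are averaged over all steps."""
    w, w_sum, steps = defaultdict(float), defaultdict(float), 0
    score = lambda feats: sum(w[f] * v for f, v in feats.items())
    for _ in range(n_epochs):
        for candidates, gold in train:  # gold = index of reference candidate
            pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
            if pred != gold:
                for f, v in candidates[gold].items():
                    w[f] += v
                for f, v in candidates[pred].items():
                    w[f] -= v
            steps += 1
            for f, v in w.items():  # accumulate for averaging
                w_sum[f] += v
    return {f: v / steps for f, v in w_sum.items()}

avg = perceptron_rerank([([{"bad": 1.0}, {"good": 1.0}], 1)])
# avg["good"] ends up positive and avg["bad"] negative
```

Averaging the weights over all update steps makes the reranker far less sensitive to the order of training examples than the plain perceptron.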

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training
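The pseudo-gold selection can be sketched as follows (`execute` is a stand-in for a MARCO run returning whether the destination was reached; in the thesis MARCO is nondeterministic, hence the averaging over trials):

```python
def pseudo_gold(candidate_mrs, execute, trials=10):
    """Run each candidate MR plan `trials` times in the simulated world and
    return the index of the candidate with the best average success rate,
    along with all rates."""
    rates = [sum(1 for _ in range(trials) if execute(mr)) / trials
             for mr in candidate_mrs]
    best = max(range(len(rates)), key=lambda i: rates[i])
    return best, rates

best, rates = pseudo_gold(["plan_a", "plan_b"], lambda mr: mr == "plan_b")
# best == 1 and rates == [0.0, 1.0] for this deterministic stand-in
```

The selected index then plays the role of the missing gold reference in the perceptron update.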

Response-based Update

• Select a pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates' derived MRs MR₁ … MRₙ are run through the MARCO execution module; the execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2) select the pseudo-gold reference, which is compared with the best prediction by perceptron score (e.g., 1.79, 0.21, -1.09, 1.46, 0.59) to update the weights with the feature-vector difference]

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
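A sketch of the multi-parse update (the exact weighting in the ACL 2013 paper may differ; here each feature-vector difference is scaled by the success-rate gap):

```python
def multi_parse_update(weights, candidates, rates, predicted):
    """Update with every candidate whose execution success rate beats the
    currently predicted parse, scaling each feature difference by the gap
    between success rates."""
    for feats, rate in zip(candidates, rates):
        gap = rate - rates[predicted]
        if gap > 0:  # only candidates that execute better than the prediction
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + gap * v
            for f, v in candidates[predicted].items():
                weights[f] = weights.get(f, 0.0) - gap * v
    return weights

w = multi_parse_update({}, [{"a": 1.0}, {"b": 1.0}], [0.2, 0.9], 0)
# w["b"] is about 0.7 and w["a"] about -0.7
```

Candidates far better than the prediction thus move the weights more than marginal improvements.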

[Diagram (two update steps): every candidate whose execution success rate (e.g., 0.6, 0.4, 0.0, 0.9, 0.2) exceeds that of the currently predicted parse contributes a feature-vector difference, weighted by the success-rate gap, to the perceptron update]

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Example — "Turn left and find the sofa, then turn around the corner":
  L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L2: Turn(LEFT), Verify(front: SOFA)
  L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L4: Turn(LEFT)
  L5: Travel(), Verify(at: SOFA)
  L6: Turn()
  Features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5 → "find") = 1
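Such indicator features can be extracted as follows (the tree encoding and feature naming are my own, not the thesis' exact feature set):

```python
def rule_features(tree):
    """Binary indicator features over nonterminal/terminal compositions in a
    parse tree. A tree is (label, children); children are subtrees or word
    strings. Emits one feature per parent -> ordered-children rule."""
    feats = set()

    def walk(node):
        if isinstance(node, tuple):
            label, children = node
            rhs = " ".join(c[0] if isinstance(c, tuple) else c for c in children)
            feats.add(f"{label} -> {rhs}")
            for c in children:
                walk(c)

    walk(tree)
    return feats

tree = ("L1", [("L3", [("L5", ["find"]), ("L6", ["turn"])])])
print(rule_features(tree))
# {"L1 -> L3", "L3 -> L5 L6", "L5 -> find", "L6 -> turn"}
```

Each extracted rule name becomes one binary feature in the reranker's feature vector.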

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans, with their corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse lists from the baseline model

Response-based Update vs. Baseline (English)

              Parse F1               Single-sentence        Paragraph
              Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy     74.81     73.32        57.22     59.65        20.17     22.62
Unigram       76.44     77.24        67.14     68.27        28.12     29.2

Response-based Update vs. Baseline (Chinese-Word)

              Parse F1               Single-sentence        Paragraph
              Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy     75.53     77.26        61.03     64.12        19.08     21.29
Unigram       76.41     77.74        63.4      65.64        23.12     23.74

Response-based Update vs. Baseline (Chinese-Character)

              Parse F1               Single-sentence        Paragraph
              Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy     73.05     76.26        55.61     64.08        12.74     22.25
Unigram       77.55     79.76        62.85     65.5         23.33     25.35

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

Response-based Update with Multiple vs. Single Parses (English)

              Parse F1            Single-sentence      Paragraph
              Single  Multi       Single  Multi        Single  Multi
Hierarchy     73.32   73.43       59.65   62.81        22.62   26.57
Unigram       77.24   77.81       68.27   68.93        29.2    29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

              Parse F1            Single-sentence      Paragraph
              Single  Multi       Single  Multi        Single  Multi
Hierarchy     77.26   78.8        64.12   64.15        21.29   21.55
Unigram       77.74   78.11       65.64   66.27        23.74   25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

              Parse F1            Single-sentence      Paragraph
              Single  Multi       Single  Multi        Single  Multi
Hierarchy     76.26   79.44       64.08   64.08        22.25   22.58
Unigram       79.76   79.94       65.5    66.84        25.35   27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, while still capturing the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and the training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You

Navigation Example

Alice

Bob

병원에서 우회전 하세요 (Turn right at the hospital)

16 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 병원에서 우회전 하세요 (Turn right at the hospital)

식당에서 우회전 하세요 (Turn right at the restaurant)

17 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 식당에서 우회전 하세요 (Turn right at the restaurant)

병원에서 우회전 하세요 (Turn right at the hospital)

18 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 식당에서 우회전 하세요 (Turn right at the restaurant)

병원에서 우회전 하세요 (Turn right at the hospital)

Make a right turn

19 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 병원에서 우회전 하세요 (Turn right at the hospital)

식당에서 우회전 하세요 (Turn right at the restaurant)

20 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2

식당 (restaurant)

21 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 병원에서 우회전 하세요 (Turn right at the hospital)

식당에서 우회전 하세요 (Turn right at the restaurant)

22 (Slide from David Chen)

Navigation Example

Scenario 1

Scenario 2: 병원 (hospital)

23 (Slide from David Chen)

Thesis Contributions

• Generative models for grounded language learning from ambiguous perceptual environment
  – Unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – General framework of probabilistic approaches that learn NL-MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not available
  – No single gold-standard reference for training data
  – Weak response from perceptual environment can train discriminative reranker

24

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

25

• Learn to interpret and follow navigation instructions
  – e.g., Go down this hall and make a right when you see an elevator to your left
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H – Hat Rack

L – Lamp

E – Easel

S – Sofa

B – Barstool

C – Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28


Task Objective

• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into correct formal plan of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction / landmarks plan (Chen and Mooney 2011)
  – Landmarks plan:
    • Describes actions in the environment along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant for the instruction

29

Challenges

30

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruction

at the easel go left and then take a right onto the blue path at the corner

Correct plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential number of possibilities: a combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map a novel instruction (NL) to the correct formal plan (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from ambiguous training data
  – Disambiguate the input + learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL-MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals

34

35

System Diagram (Chen and Mooney 2011)

[Diagram: a learning system for parsing navigation instructions. In training, the observed Instruction, World State, and Action Trace feed a Navigation Plan Constructor that produces a Landmarks Plan; Plan Refinement turns it into a Supervised Refined Plan (possible information loss) for the (Supervised) Semantic Parser Learner. In testing, the learned Semantic Parser maps an Instruction and World State to a plan that the Execution Module (MARCO) runs to produce an Action Trace.]

36

System Diagram of Proposed Solution

[Diagram: the same pipeline, except the Landmarks Plan from the Navigation Plan Constructor directly feeds a Probabilistic Semantic Parser Learner trained from ambiguous supervision; the lossy Plan Refinement step is removed.]

PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)

37

• PCFG rules describe the generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (order of 10s)
  – Only outputs MRs included in the PCFG constructed from training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

38

• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable is graph g given that phrase w is seen?
• Examples
  – "to the stool": Travel(), Verify(at BARSTOOL)
  – "black easel": Verify(at EASEL)
  – "turn left and walk": Turn(), Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

The score contrasts co-occurrence of g and w against general occurrence of g without w
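The contrast above can be sketched in a few lines. This is a minimal illustration assuming the lexeme score is p(g|w) − p(g|¬w) estimated from document-level co-occurrence counts; the exact scoring function in Chen and Mooney (2011) may differ, and the toy phrases and subgraph strings below are made up.

```python
from collections import Counter

def lexicon_scores(examples):
    """Score (phrase, subgraph) pairs by p(g|w) - p(g|not w).

    examples: list of (set_of_ngrams, set_of_mr_subgraphs) pairs,
    one per (instruction, landmarks-plan) training example.
    """
    n = len(examples)
    w_count = Counter()   # examples containing phrase w
    g_count = Counter()   # examples containing subgraph g
    wg_count = Counter()  # examples containing both
    for ngrams, graphs in examples:
        for w in ngrams:
            w_count[w] += 1
        for g in graphs:
            g_count[g] += 1
        for w in ngrams:
            for g in graphs:
                wg_count[(w, g)] += 1
    scores = {}
    for (w, g), both in wg_count.items():
        p_g_given_w = both / w_count[w]
        without_w = n - w_count[w]
        p_g_without_w = (g_count[g] - both) / without_w if without_w else 0.0
        scores[(w, g)] = p_g_given_w - p_g_without_w
    return scores

# Toy data: "to the stool" reliably co-occurs with Verify(at BARSTOOL).
data = [
    ({"to the stool"}, {"Travel()", "Verify(at BARSTOOL)"}),
    ({"to the stool"}, {"Verify(at BARSTOOL)"}),
    ({"turn left"},    {"Turn(LEFT)"}),
]
s = lexicon_scores(data)
# s[("to the stool", "Verify(at BARSTOOL)")] == 1.0
```

High-scoring pairs become the lexemes used as nonterminals in the PCFG models below.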

Lexeme Hierarchy Graph (LHG)

• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts and are further connected to NL word groundings

40

[Diagram: an example LHG. The full landmarks-plan MR Turn(RIGHT), Verify(side HATRACK, front SOFA), Travel(steps 3), Verify(at EASEL) decomposes by the subgraph relation into smaller lexeme MRs such as Turn(RIGHT), Verify(side HATRACK) and Travel(), Verify(at EASEL), down to lexemes like Turn(), Verify(side HATRACK).]

PCFG Construction

• Add rules per each node in the LHG
  – Each complex concept chooses which subconcepts to describe that will finally be connected to the NL instruction
    • Each node generates all k-permutations of its children nodes: we do not know which subset is correct
  – NL words are generated by lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41
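The k-permutation rule generation above can be sketched directly with itertools; the node labels below are hypothetical lexeme MR strings, and the sketch only enumerates rule right-hand sides (weights would be initialized uniformly and learned by EM).

```python
from itertools import permutations

def kperm_rules(parent, children):
    """Enumerate PCFG rules Parent -> ordered k-subset of children,
    for every k >= 1 (which subset of subconcepts a complex concept
    actually describes is unknown, so every ordered subset is a
    candidate rule)."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules

rules = kperm_rules(
    "Turn(LEFT),Travel(),Verify(at SOFA)",
    ["Turn(LEFT)", "Travel(),Verify(at SOFA)", "Travel()"],
)
# Number of rules: P(3,1) + P(3,2) + P(3,3) = 3 + 6 + 6 = 15
```

The permutation count grows factorially in the number of children, which is one source of the grammar-size and over-fitting problems that motivate the Unigram Generation model later.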

PCFG Construction

42

Child concepts are generated from parent concepts selectively. All semantic concepts generate relevant NL words. Each semantic concept generates at least one NL word.

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
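The CKY step above can be illustrated with a minimal Viterbi-CKY sketch over a toy CNF PCFG. This is not the induced navigation grammar; the symbols TURN, DIR, and S are hypothetical stand-ins for lexeme nonterminals.

```python
def viterbi_cky(words, lexical, binary):
    """Most probable parse of `words` under a CNF PCFG.
    lexical: {(A, word): prob} for A -> word;
    binary:  {(A, B, C): prob} for A -> B C.
    Returns (prob, tree) rooted at 'S', or None if unparsable."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # fill diagonal
        for (A, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][A] = (p, (A, w))
    for span in range(2, n + 1):                       # grow spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[i][k] and C in chart[k][j]:
                        pb, tb = chart[i][k][B]
                        pc, tc = chart[k][j][C]
                        cand = p * pb * pc
                        if cand > chart[i][j].get(A, (0.0, None))[0]:
                            chart[i][j][A] = (cand, (A, tb, tc))
    return chart[0][n].get('S')

# Toy grammar: S -> TURN DIR, with unigram word emissions per lexeme.
lexical = {('TURN', 'turn'): 1.0, ('DIR', 'left'): 0.7, ('DIR', 'right'): 0.3}
binary = {('S', 'TURN', 'DIR'): 1.0}
prob, tree = viterbi_cky(['turn', 'left'], lexical, binary)
# prob == 0.7, tree == ('S', ('TURN', 'turn'), ('DIR', 'left'))
```

In the actual system, the lexeme MRs labeling the nonterminals of the returned tree are then composed bottom-up into the final MR plan.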

[Slides 44–46: worked example. For the test instruction "Turn left and find the sofa then turn around the corner," the most probable parse tree is computed with CKY; the lexeme MRs responsible for generating NL words, e.g. Turn(LEFT), Travel(), Verify(at SOFA), Turn(), are marked from the bottom of the tree upward and composed into the final MR parse.]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by a unigram Markov process. All semantic concepts generate relevant NL words.

Parsing New NL Sentences

• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Slides 51–54: worked example. For the test instruction "Turn left and find the sofa then turn around the corner," the top nonterminal carries the full context MR; the relevant lexemes generated one by one by the unigram Markov process, e.g. Turn(LEFT), Travel(), Verify(at SOFA), Turn(), are marked in the context MR to compose the final parse.]

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph: Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.

Single sentence: "Take the wood path towards the easel." / "At the easel, go left and then take a right on the blue path at the corner." / ... each paired with its own action subsequence (Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward, Forward, Turn left, Forward, Turn right, Turn in total)

Data Statistics

56

                                 Paragraph       Single-Sentence
# Instructions                   706             3236
Avg. # sentences                 5.0 (±2.8)      1.0 (±0)
Avg. # actions                   10.4 (±5.7)     2.1 (±2.4)
Avg. # words/sent.
  English                        37.6 (±21.1)    7.8 (±5.1)
  Chinese-Word                   31.6 (±18.1)    6.9 (±4.9)
  Chinese-Character              48.9 (±28.3)    10.6 (±7.3)
Vocabulary
  English                        660             629
  Chinese-Word                   661             508
  Chinese-Character              448             328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58
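Partial parse accuracy compares the MR components of the predicted plan against the gold plan. A minimal sketch, assuming plans are treated as multisets of atomic components (the thesis metric may award finer-grained partial credit within components):

```python
from collections import Counter

def parse_f1(predicted, gold):
    """Precision/recall/F1 over matched MR components (multisets)."""
    p, g = Counter(predicted), Counter(gold)
    matched = sum((p & g).values())         # multiset intersection
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = ["Turn(LEFT)", "Travel(steps 2)", "Verify(at SOFA)"]
gold = ["Turn(LEFT)", "Travel(steps 2)", "Verify(at SOFA)", "Turn(RIGHT)"]
prec, rec, f1 = parse_f1(pred, gold)
# prec == 1.0, rec == 0.75, f1 ≈ 0.857
```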

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence
  – Paragraph execution is affected by even one single-sentence execution

62

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.7              20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity; avoids over-fitting; better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRL
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation        Unigram Generation
Data                  |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
English               20,451      17.26           16,357      8.78
Chinese (Word)        21,636      15.99           15,459      8.05
Chinese (Character)   19,792      18.64           13,514      12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with max probability

[Diagram: a testing example goes into the trained generative model, which outputs the 1-best candidate with maximum probability.]

70

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: a testing example goes into the trained baseline generative model, whose GEN function produces n-best candidates (Candidate 1..n); the trained secondary discriminative model selects the best prediction as output.]

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, weak supervision of the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for the parameter update
    • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: the trained baseline generative model GENerates n-best candidates for a training example, with feature vectors a₁, a₂, a₃, a₄, ..., aₙ and perceptron scores (e.g. −0.16, 1.21, −1.09, 1.46, 0.59); when the best prediction differs from the gold-standard reference, the weights are updated by the feature-vector difference a_g − a₄. For our generative models, a gold-standard reference is not available.]

73
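The update scheme above can be sketched compactly. A minimal averaged-perceptron reranker, assuming candidates are given as sparse feature dicts and (for this sketch only) a gold candidate index is known; feature names are hypothetical:

```python
def perceptron_rerank_train(examples, epochs=5):
    """Averaged perceptron reranker (Collins 2000 style).
    examples: list of (candidate_feature_dicts, gold_index) pairs."""
    w, w_sum, t = {}, {}, 0

    def score(f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    for _ in range(epochs):
        for candidates, gold in examples:
            # current best prediction under the weight vector
            pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
            if pred != gold:
                # update by the feature-vector difference (gold - predicted)
                for k, v in candidates[gold].items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in candidates[pred].items():
                    w[k] = w.get(k, 0.0) - v
            t += 1
            for k, v in w.items():          # accumulate for averaging
                w_sum[k] = w_sum.get(k, 0.0) + v
    return {k: v / t for k, v in w_sum.items()}

# Candidate 1 fires a hypothetical feature that candidate 0 lacks.
examples = [([{'bias': 1.0}, {'bias': 1.0, 'f(L5,"find")': 1.0}], 1)]
w_avg = perceptron_rerank_train(examples)
# w_avg gives positive weight to the distinguishing feature
```

The response-based variant described next replaces the gold index with a pseudo-gold chosen by plan execution.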

Our generative models

NotAvailable

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
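Pseudo-gold selection can be sketched as follows. The `execute` callable is a hypothetical stand-in for the real MARCO execution module, which is nondeterministic; here a deterministic toy executor is used so the example is checkable.

```python
def execution_success_rate(plan, execute, trials=10):
    """Fraction of trials in which executing `plan` reaches the
    intended destination. `execute` stands in for the MARCO
    execution module and returns True/False per trial."""
    return sum(execute(plan) for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_plans, execute, trials=10):
    """Index of the candidate MR plan with the best success rate."""
    rates = [execution_success_rate(p, execute, trials)
             for p in candidate_plans]
    return max(range(len(rates)), key=rates.__getitem__), rates

# Deterministic toy executor: only plans ending at the sofa succeed.
execute = lambda plan: plan.endswith("Verify(at SOFA)")
plans = ["Turn(LEFT)", "Turn(LEFT),Travel(),Verify(at SOFA)"]
best, rates = pick_pseudo_gold(plans, execute)
# best == 1, rates == [0.0, 1.0]
```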

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: each of the n-best candidates derives an MR plan MR₁..MRₙ; the MARCO execution module records each plan's execution success rate (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the weights are updated by the feature-vector difference from the best prediction (perceptron scores e.g. 1.79, 0.21, −1.09, 1.46, 0.59).]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with feature-vector differences weighted by the difference between execution success rates

76
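The multiple-parse update can be sketched as below: every candidate whose execution success rate beats that of the currently predicted parse contributes a feature difference weighted by the rate gap. Feature dicts and rates are hypothetical toy values.

```python
def multi_parse_update(w, candidates, rates, pred):
    """Perceptron-style update from all candidates whose execution
    success rate exceeds that of the predicted candidate `pred`.
    Each contributes (features_i - features_pred) weighted by the
    success-rate difference."""
    for i, feats in enumerate(candidates):
        gap = rates[i] - rates[pred]
        if gap > 0:
            for k, v in feats.items():
                w[k] = w.get(k, 0.0) + gap * v
            for k, v in candidates[pred].items():
                w[k] = w.get(k, 0.0) - gap * v
    return w

candidates = [{'a': 1.0}, {'b': 1.0}, {'c': 1.0}]
rates = [0.4, 0.6, 0.9]   # predicted parse is index 0 (rate 0.4)
w = multi_parse_update({}, candidates, rates, pred=0)
# w ≈ {'a': -0.7, 'b': 0.2, 'c': 0.5} up to float rounding
```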

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, slides 77–78: candidates whose derived MR plans have higher execution success rates than the predicted candidate (e.g. 0.6 and 0.9 vs. 0.4) each contribute a weighted feature-vector difference to the perceptron update.]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Turn left and find the sofa then turn around the corner

L1: Turn(LEFT), Verify(front SOFA, back EASEL), Travel(steps 2), Verify(at SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front SOFA)
L3: Travel(steps 2), Verify(at SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at SOFA)
L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
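Extracting such indicator features can be sketched over a parse tree given as nested tuples; the tree below is a hypothetical toy, not an actual parse from the system.

```python
def tree_features(tree):
    """Binary indicator features for nonterminal/terminal compositions
    in a parse tree given as (label, child1, child2, ...) tuples with
    string leaves (words)."""
    feats = set()

    def walk(node):
        label, children = node[0], node[1:]
        parts = [c if isinstance(c, str) else c[0] for c in children]
        feats.add(f"{label} -> {' '.join(parts)}")   # local rule feature
        for c in children:
            if not isinstance(c, str):
                walk(c)

    walk(tree)
    return {f: 1 for f in feats}

tree = ('L1', ('L3', ('L5', 'find'), ('L6', 'turn')), ('L4', 'left'))
f = tree_features(tree)
# Contains 'L1 -> L3 L4', 'L3 -> L5 L6', 'L5 -> find', 'L6 -> turn'
```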

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50-best distinct composed MR plans and their parses out of 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse trees from the baseline model

80

Response-based Update vs. Baseline (English)

81

            Parse F1               Single-sentence        Paragraph
            Baseline  Resp-based   Baseline  Resp-based   Baseline  Resp-based
Hierarchy   74.81     73.32        57.22     59.65        20.17     22.62
Unigram     76.44     77.24        67.14     68.27        28.12     29.2

Response-based Update vs. Baseline (Chinese-Word)

82

            Parse F1               Single-sentence        Paragraph
            Baseline  Resp-based   Baseline  Resp-based   Baseline  Resp-based
Hierarchy   75.53     77.26        61.03     64.12        19.08     21.29
Unigram     76.41     77.74        63.4      65.64        23.12     23.74

Response-based Update vs. Baseline (Chinese-Character)

83

            Parse F1               Single-sentence        Paragraph
            Baseline  Resp-based   Baseline  Resp-based   Baseline  Resp-based
Hierarchy   73.05     76.26        55.61     64.08        12.74     22.25
Unigram     77.55     79.76        62.85     65.5         23.33     25.35

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

            Parse F1          Single-sentence    Paragraph
            Single   Multi    Single   Multi     Single   Multi
Hierarchy   73.32    73.43    59.65    62.81     22.62    26.57
Unigram     77.24    77.81    68.27    68.93     29.2     29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

            Parse F1          Single-sentence    Paragraph
            Single   Multi    Single   Multi     Single   Multi
Hierarchy   77.26    78.8     64.12    64.15     21.29    21.55
Unigram     77.74    78.11    65.64    66.27     23.74    25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

            Parse F1          Single-sentence    Paragraph
            Single   Multi    Single   Multi     Single   Multi
Hierarchy   76.26    79.44    64.08    64.08     22.25    22.58
Unigram     79.76    79.94    65.5     66.84     25.35    27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 17: Grounded Language Learning Models for Ambiguous  Supervision

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

17Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

• Learn to interpret and follow navigation instructions
  – e.g., "Go down this hall and make a right when you see an elevator to your left"
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

(Figure: sample map of a virtual world, with objects placed along colored hallways.)

H – Hat Rack
L – Lamp
E – Easel
S – Sofa
B – Barstool
C – Chair

Sample Environment (MacMahon et al. 2006)

27

Executing Test Instruction

28


Task Objective

• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into the correct formal plan of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney 2011)
  – Landmarks plan
    • Describes actions in the environment along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant for the instruction

29

Challenges

30

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruction

at the easel go left and then take a right onto the blue path at the corner

Correct plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map a novel instruction (NL) to the correct formal plan (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL-MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

(Diagram: Training — Observation {Instruction, World State, Action Trace} → Navigation Plan Constructor → Landmarks Plan → Plan Refinement (possible information loss) → Supervised Refined Plan → (Supervised) Semantic Parser Learner → Semantic Parser. Testing — Instruction + World State → Semantic Parser → Execution Module (MARCO) → Action Trace.)

System Diagram (Chen and Mooney 2011)

36

Learning system for parsing navigation instructions

(Diagram: the same pipeline, but the Landmarks Plan feeds a Probabilistic Semantic Parser Learner (from ambiguous supervision) directly, with no plan-refinement step.)

System Diagram of Proposed Solution

PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)

37

• PCFG rules to describe the generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (order of 10s)
  – Can only output MRs included in the PCFG constructed from training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney 2011)

• Pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable subgraph g is given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at BARSTOOL)
  – "black easel": Verify(at EASEL)
  – "turn left and walk": Turn(), Travel()

39

The score compares the co-occurrence of g and w against the general occurrence of g without w.
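That scoring intuition can be sketched in a few lines. This is a hypothetical illustration, not Chen and Mooney's exact formula: a candidate lexeme (w, g) scores highly when the subgraph g co-occurs with phrase w much more often than g occurs without w. All counts in the example are invented.

```python
def lexeme_score(cooc, w_count, g_count, total):
    """Score a candidate lexeme (phrase w, MR subgraph g) as
    P(g | w) minus P(g | not w), estimated from raw counts.

    cooc    -- # training pairs containing both w and g
    w_count -- # training pairs containing phrase w
    g_count -- # training pairs containing subgraph g
    total   -- total # of training pairs
    """
    p_with_w = cooc / w_count if w_count else 0.0
    rest = total - w_count
    p_without_w = (g_count - cooc) / rest if rest else 0.0
    return p_with_w - p_without_w

# Toy counts: "to the stool" almost always co-occurs with
# Travel(), Verify(at BARSTOOL) in its landmarks plan.
score = lexeme_score(cooc=8, w_count=10, g_count=12, total=100)
```

A lexeme that co-occurs rarely with its phrase gets a correspondingly lower score, which is what lets the learner rank alternative subgraphs for the same phrase.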

Lexeme Hierarchy Graph (LHG)

• Hierarchy of semantic lexemes, organized by the subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts that are further connected to NL word groundings

40

(Figure: example LHG for a landmarks plan. Top node: Turn(RIGHT), Verify(side HATRACK, front SOFA), Travel(steps 3), Verify(at EASEL). It decomposes by the subgraph relation into smaller lexeme MRs such as Turn(), Travel(), Verify(at EASEL); Turn(RIGHT), Verify(side HATRACK), Travel(); and Turn(), Verify(side HATRACK).)

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which are finally connected to the NL instruction
    • Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41
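The k-permutation expansion can be sketched with itertools. This is an illustrative reconstruction (the nonterminal names are invented): each LHG node emits one production per ordered subset of its children, because we do not know which subset of subconcepts the instruction actually describes.

```python
from itertools import permutations

def child_permutation_rules(parent, children):
    """Enumerate PCFG productions 'parent -> ordered k-subset of
    children' for every k = 1..len(children), mirroring the
    k-permutation rules added for one node of the LHG."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules

rules = child_permutation_rules(
    "Turn(LEFT)+Travel()+Verify(at SOFA)",
    ["Turn(LEFT)", "Travel()", "Verify(at SOFA)"])
```

For just 3 children this already yields 3 + 6 + 6 = 15 productions, which illustrates why the Hierarchy Generation model's grammar grows quickly.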

PCFG Construction

42

(Figure annotations: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word.)

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence, under the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components, which propagate to the top level
  – Able to compose novel MRs never seen in the training data

43

(Figure, slides 44-46: most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner". Lexeme nonterminals such as Turn(LEFT), Verify(front SOFA), Travel(steps 2), Verify(at SOFA), Turn(RIGHT) generate the words, and the responsible MR components are marked bottom-up to compose the final MR plan.)

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set → faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are thereby already covered

48

PCFG Construction

49

(Figure annotations: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words.)

Parsing New NL Sentences

• Follows a scheme similar to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

(Figure, slides 51-54: most probable parse tree for "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The context MR Turn(LEFT), Verify(front BLUE HALL, front SOFA), Travel(steps 2), Verify(at SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT) and Travel(), Verify(at SOFA), Turn(), whose components are then marked in the context MR appearing at the top nonterminal.)

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph:
  Take the wood path towards the easel. At the easel, go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.

Single sentences:
  Take the wood path towards the easel.
  At the easel, go left and then take a right on the the blue path at the corner.

Action traces:
  Turn, Forward / Turn left, Forward, Turn right, Forward ×3, Turn right, Forward
  Forward, Turn left, Forward, Turn right
  Turn

Data Statistics

56

                          Paragraph        Single-Sentence
# Instructions            706              3236
Avg. # sentences          5.0 (±2.8)       1.0 (±0)
Avg. # actions            10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sentence
  English                 37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                 660              629
  Chinese-Word            661              508
  Chinese-Character       448              328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58

Parse Accuracy (English)

                                   Precision   Recall    F1
Chen & Mooney (2011)                  90.16     55.41   68.59
Chen (2012)                           88.36     57.03   69.31
Hierarchy Generation PCFG Model       87.58     65.41   74.81
Unigram Generation PCFG Model         86.1      68.79   76.44

59

Parse Accuracy (Chinese-Word)

                                   Precision   Recall    F1
Chen (2012)                           88.87     58.76   70.74
Hierarchy Generation PCFG Model       80.56     71.14   75.53
Unigram Generation PCFG Model         79.45     73.66   76.41

60

Parse Accuracy (Chinese-Character)

                                   Precision   Recall    F1
Chen (2012)                           92.48     56.47   70.01
Hierarchy Generation PCFG Model       79.77     67.38   73.05
Unigram Generation PCFG Model         79.73     75.52   77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers the facing direction in the single-sentence setting
  – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                                   Single-Sentence   Paragraph
Chen & Mooney (2011)                    54.4           16.18
Chen (2012)                             57.28          19.18
Hierarchy Generation PCFG Model         57.22          20.17
Unigram Generation PCFG Model           67.14          28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                   Single-Sentence   Paragraph
Chen (2012)                             58.7           20.13
Hierarchy Generation PCFG Model         61.03          19.08
Unigram Generation PCFG Model           63.4           23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                   Single-Sentence   Paragraph
Chen (2012)                             57.27          16.73
Hierarchy Generation PCFG Model         55.61          12.74
Unigram Generation PCFG Model           62.85          23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity → avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                        Hierarchy Generation        Unigram Generation
Data                    |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
English                   20,451      17.26           16,357       8.78
Chinese (Word)            21,636      15.99           15,459       8.05
Chinese (Character)       19,792      18.64           13,514      12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability

(Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability.)

70

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

(Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates, Candidate 1 … Candidate n → Trained Secondary Discriminative Model → best prediction → Output.)

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Training instead provides weak supervision of surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good or bad
  – Multiple candidate parses for parameter update
    • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

(Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates, Candidate 1 … Candidate n, with feature vectors a₁ … aₙ and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59. The perceptron compares the best prediction against the gold-standard reference a_g and updates the weights by the feature-vector difference a_g − a₄. With our generative models, such a gold-standard reference is Not Available.)

73
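The averaged-perceptron update can be sketched as follows. This is a generic reranking sketch after Collins (2000); the feature representation and training-loop details are assumptions of the example:

```python
def train_averaged_perceptron(examples, n_feats, epochs=5):
    """Averaged-perceptron reranker. Each example is
    (candidate_feature_vectors, gold_index), where a feature vector
    is a dict {feature_id: value}. On a wrong prediction, add the
    gold candidate's features and subtract the predicted candidate's."""
    def score(w, f):
        return sum(w[i] * v for i, v in f.items())

    w = [0.0] * n_feats       # current weights
    w_sum = [0.0] * n_feats   # running sum for averaging
    steps = 0
    for _ in range(epochs):
        for cands, gold in examples:
            pred = max(range(len(cands)), key=lambda i: score(w, cands[i]))
            if pred != gold:
                for i, v in cands[gold].items():
                    w[i] += v
                for i, v in cands[pred].items():
                    w[i] -= v
            steps += 1
            for i in range(n_feats):
                w_sum[i] += w[i]
    return [s / steps for s in w_sum]  # averaged weights

# One toy example: candidate 1 (carrying feature 1) is the gold parse.
avg_w = train_averaged_perceptron([([{0: 1.0}, {1: 1.0}], 1)], n_feats=2)
```

Averaging the weight vector over all update steps, rather than keeping the final vector, is what makes the perceptron robust as a reranker.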

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

(Diagram: n-best candidates, Candidate 1 … Candidate n, with derived MRs MR₁ … MRₙ. The MARCO execution module yields execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores are 1.79, 0.21, −1.09, 1.46, 0.59. The candidate with the best success rate serves as the pseudo-gold reference, and the perceptron updates by the feature-vector difference against the best prediction.)

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

(Diagram, slides 77-78: n-best candidates with derived MRs MR₁ … MRₙ, execution success rates 0.6, 0.4, 0.0, 0.9, 0.2, and perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59. Updates (1) and (2) move the weights toward each candidate whose success rate exceeds that of the current best prediction.)

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Turn left and find the sofa then turn around the corner

L1: Turn(LEFT), Verify(front SOFA, back EASEL), Travel(steps 2), Verify(at SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front SOFA)
L3: Travel(steps 2), Verify(at SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at SOFA)
L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
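Extracting such binary rule-composition features from a parse tree can be sketched as below; the nested-tuple tree encoding is an assumption made for illustration.

```python
def rule_features(tree):
    """Binary indicator features for a parse tree encoded as nested
    tuples (label, child, child, ...), with plain strings as leaves.
    Each feature records a parent label with its sequence of child
    labels (or terminal words), e.g. ('L5', 'find')."""
    feats = set()

    def label(node):
        return node[0] if isinstance(node, tuple) else node

    def walk(node):
        if isinstance(node, tuple):
            parent, *children = node
            feats.add((parent,) + tuple(label(c) for c in children))
            for c in children:
                walk(c)

    walk(tree)
    return feats

# Tiny tree: L1 expands to L3, which expands to L5 (yielding "find") and L6.
feats = rule_features(("L1", ("L3", ("L5", "find"), "L6")))
```

Representing the features as a set of tuples makes the perceptron's feature-vector difference a simple symmetric-difference computation.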

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans, and the corresponding parses, out of 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse lists from the baseline model

80

Response-based Update vs. Baseline (English)

81

Parse F1:
                            Baseline   Response-based
  Hierarchy Generation        74.81        73.32
  Unigram Generation          76.44        77.24

Single-sentence execution:
  Hierarchy Generation        57.22        59.65
  Unigram Generation          67.14        68.27

Paragraph execution:
  Hierarchy Generation        20.17        22.62
  Unigram Generation          28.12        29.2

Response-based Update vs. Baseline (Chinese-Word)

82

Parse F1:
                            Baseline   Response-based
  Hierarchy Generation        75.53        77.26
  Unigram Generation          76.41        77.74

Single-sentence execution:
  Hierarchy Generation        61.03        64.12
  Unigram Generation          63.4         65.64

Paragraph execution:
  Hierarchy Generation        19.08        21.29
  Unigram Generation          23.12        23.74

Response-based Update vs. Baseline (Chinese-Character)

83

Parse F1:
                            Baseline   Response-based
  Hierarchy Generation        73.05        76.26
  Unigram Generation          77.55        79.76

Single-sentence execution:
  Hierarchy Generation        55.61        64.08
  Unigram Generation          62.85        65.5

Paragraph execution:
  Hierarchy Generation        12.74        22.25
  Unigram Generation          23.33        25.35

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
• It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

Parse F1:
                            Single     Multi
  Hierarchy Generation        73.32      73.43
  Unigram Generation          77.24      77.81

Single-sentence execution:
  Hierarchy Generation        59.65      62.81
  Unigram Generation          68.27      68.93

Paragraph execution:
  Hierarchy Generation        22.62      26.57
  Unigram Generation          29.2       29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

Parse F1:
                            Single     Multi
  Hierarchy Generation        77.26      78.8
  Unigram Generation          77.74      78.11

Single-sentence execution:
  Hierarchy Generation        64.12      64.15
  Unigram Generation          65.64      66.27

Paragraph execution:
  Hierarchy Generation        21.29      21.55
  Unigram Generation          23.74      25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

Parse F1:
                            Single     Multi
  Hierarchy Generation        76.26      79.44
  Unigram Generation          79.76      79.94

Single-sentence execution:
  Hierarchy Generation        64.08      64.08
  Unigram Generation          65.5       66.84

Paragraph execution:
  Hierarchy Generation        22.25      22.58
  Unigram Generation          25.35      27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but still capture the gist of the preferred actions
  – A variety of preferable parses improves the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL-MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 18: Grounded Language Learning Models for Ambiguous  Supervision

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

18Slide from David Chen

Navigation Example

Scenario 1

Scenario 2식당에서 우회전 하세요

병원에서 우회전 하세요

Make a right turn

19Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL–MR correspondence problem
 – Constructs supervised NL–MR training data by refining the landmarks plan with a learned semantic lexicon
  • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
 – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
 – Loses information during refinement
  • Deterministically selects high-score lexemes
  • Ignores possibly useful low-score lexemes
  • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from the ambiguous training data
 – Disambiguates the input and learns to map NL instructions to formal MR plans
 – Uses the semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL–MR correspondences
 – Transforms the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals

34

35

System Diagram (Chen and Mooney 2011)

[Diagram: a learning system for parsing navigation instructions — observed (instruction, world state, action trace) triples feed a navigation plan constructor that produces landmarks plans; plan refinement yields supervised refined plans (with possible information loss); a supervised semantic parser learner trains the semantic parser, whose output for a test instruction and world state is executed by the MARCO execution module to produce an action trace]

36

System Diagram of Proposed Solution

[Diagram: the same pipeline, except that a probabilistic semantic parser learner is trained directly from the ambiguous supervision of landmarks plans, with no plan refinement step in between]

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al 2011
 – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (order of 10s)
 – Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
 – Uses semantic lexemes as the units of semantic concepts
 – Disambiguates NL–MR correspondences at the semantic concept (lexeme) level
 – Disambiguates a much higher level of ambiguous supervision
 – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney 2011)
• A pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
 – How probable is graph g given that phrase w is seen?
• Examples
 – “to the stool”: Travel(), Verify(at: BARSTOOL)
 – “black easel”: Verify(at: EASEL)
 – “turn left and walk”: Turn(), Travel()
• Score intuition: the co-occurrence of g and w vs. the general occurrence of g without w

39
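As an illustration, a simplified co-occurrence score in this spirit (not the exact Chen and Mooney 2011 formula; the toy data below is invented) can be computed as:

```python
from collections import Counter
from itertools import product

def lexicon_scores(examples):
    """Score (word, MR-component) pairs by how strongly they co-occur.

    `examples` is a list of (words, graphs): the word set of one NL
    instruction and the MR components of its ambiguous context.  The score
    rewards components that appear together with a word and penalizes
    components that also appear without it.
    """
    cooc = Counter()         # g seen in a context whose instruction has w
    g_without_w = Counter()  # g seen in a context whose instruction lacks w
    vocab = set().union(*(words for words, _ in examples))
    for words, graphs in examples:
        for g, w in product(graphs, vocab):
            if w in words:
                cooc[w, g] += 1
            else:
                g_without_w[w, g] += 1
    # +1 smoothing keeps the score finite when g never occurs without w
    return {(w, g): cooc[w, g] / (1.0 + g_without_w[w, g])
            for (w, g) in cooc}

examples = [
    ({"to", "the", "stool"}, {"Travel()", "Verify(at BARSTOOL)"}),
    ({"turn", "left"}, {"Turn(LEFT)"}),
    ({"walk", "to", "the", "stool"}, {"Travel()", "Verify(at BARSTOOL)"}),
]
scores = lexicon_scores(examples)
# "stool" always co-occurs with Verify(at BARSTOOL) and never without it
assert scores["stool", "Verify(at BARSTOOL)"] == 2.0
```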

Lexeme Hierarchy Graph (LHG)
• A hierarchy of semantic lexemes, ordered by the subgraph relationship, constructed for each training example
 – Lexeme MRs = semantic concepts
 – Lexeme hierarchy = semantic concept hierarchy
 – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

40

[Figure: an example LHG — the root lexeme Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into smaller lexemes such as Turn(RIGHT) Verify(side: HATRACK) and Travel() Verify(at: EASEL), down to single actions like Turn() and Verify(at: EASEL)]

PCFG Construction

• Add rules for each node in the LHG
 – Each complex concept chooses which subconcepts to describe that will finally be connected to the NL instruction
  • Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
 – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al 2011)
 – PCFG rule weights are optimized by EM
  • The most probable MR components out of all possible combinations are estimated

41
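The k-permutation expansion can be sketched as follows; it also shows why this rule set grows quickly with the number of children:

```python
from itertools import permutations

def kperm_rules(parent, children):
    """All PCFG expansions of `parent` into ordered subsets
    (k-permutations) of its child lexemes.  We do not know which subset
    of subconcepts the instruction actually describes, so every one
    becomes a rule and EM later concentrates weight on the probable ones.
    """
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules

rules = kperm_rules("L1", ["L2", "L3", "L4"])
# 3 + 6 + 6 = 15 ordered expansions for just three children
assert len(rules) == 15
```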

PCFG Construction

42

[Diagram: the constructed PCFG — child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
 – Consider only the lexeme MRs responsible for generating NL words
 – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
 – Able to compose novel MRs never seen in the training data

43
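As a sketch of the decoding step, here is a minimal Viterbi CKY over a PCFG in Chomsky normal form (the actual model's unigram word generation and lexeme-marking are omitted; the toy grammar and names are invented):

```python
def cky_best(words, lexical, binary):
    """Viterbi CKY for a PCFG in CNF.
    lexical: {(A, word): prob}; binary: {(A, B, C): prob}.
    Returns the best derivation probability of `words` for each
    nonterminal spanning the whole sentence (backpointers omitted)."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][A] = max(chart[i][i + 1].get(A, 0.0), p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for m in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[i][m] and C in chart[m][j]:
                        cand = p * chart[i][m][B] * chart[m][j][C]
                        if cand > chart[i][j].get(A, 0.0):
                            chart[i][j][A] = cand
    return chart[0][n]

lexical = {("TurnL", "left"): 1.0, ("Travel", "walk"): 1.0}
binary = {("Plan", "TurnL", "Travel"): 0.5}
best = cky_best(["left", "walk"], lexical, binary)
assert abs(best["Plan"] - 0.5) < 1e-9
```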

[Figure (slides 44–46): the most probable parse tree for the test instruction “Turn left and find the sofa, then turn around the corner” — the responsible lexeme MRs, such as Turn(LEFT) and Travel() Verify(at: SOFA) Turn(), are marked from the bottom up and composed into the final MR plan]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
 – Complexity caused by the Lexeme Hierarchy Graph and k-permutations
 – Tends to over-fit to the training data
• Proposed solution: a simpler model
 – Generates relevant semantic lexemes one by one
 – No extra PCFG rules for k-permutations
 – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
 – Each context MR generates its relevant lexemes one by one
 – Permutations of the appearing orders of relevant lexemes are thereby already covered

48
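A sketch contrasting the two rule sets — the unigram Markov rules are linear in the number of lexemes, while the k-permutation rules grow explosively (`CTX` and the `L*` names are invented placeholders):

```python
from itertools import permutations

def unigram_rules(context, lexemes):
    """Rule set for the Unigram Generation model: the context MR emits
    relevant lexemes one by one (a unigram Markov chain over lexemes),
    so orderings come for free and no k-permutation rules are needed."""
    rules = [(context, (lex, context)) for lex in lexemes]  # emit, continue
    rules += [(context, (lex,)) for lex in lexemes]         # emit, stop
    return rules

def kperm_count(n):
    """Number of k-permutation expansions for a node with n children."""
    return sum(len(list(permutations(range(n), k)))
               for k in range(1, n + 1))

uni = unigram_rules("CTX", [f"L{i}" for i in range(6)])
assert len(uni) == 12          # linear in the number of lexemes
assert kperm_count(6) == 1956  # vs. near-exponential for k-permutations
```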

PCFG Construction

49

[Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
 – Consider only the lexeme MRs responsible for generating NL words
 – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

[Figure (slides 51–54): the most probable Unigram Generation parse tree for the test instruction “Turn left and find the sofa, then turn around the corner” — the context MR at the top nonterminal generates the relevant lexemes Turn(LEFT) and Travel() Verify(at: SOFA) Turn(), whose components are then marked in the context MR to form the final parse]

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
 – Word-segmented version by the Stanford Chinese Word Segmenter
 – Character-segmented version

55

Paragraph: “Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.”

[Figure: the paragraph is segmented into single-sentence steps, each paired with its action trace — e.g. Turn, Forward / Turn left, Forward, Turn right, Forward ×3, Turn right, Forward]

Data Statistics

56

                        Paragraph        Single-Sentence
# Instructions          706              3236
Avg. # sentences        5.0 (±2.8)       1.0 (±0)
Avg. # actions          10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sentence
  English               37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word          31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character     48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English               660              629
  Chinese-Word          661              508
  Chinese-Character     448              328

Evaluations
• Leave-one-map-out approach
 – 2 maps for training and 1 map for testing
 – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
 – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
  • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
  • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
 – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

58
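A minimal sketch of component-level precision/recall/F1 (the thesis matches MR subtrees; this multiset approximation is only illustrative):

```python
from collections import Counter

def partial_parse_f1(gold, predicted):
    """Precision/recall/F1 over matching MR components, as a simplified
    stand-in for partial parse accuracy."""
    g, p = Counter(gold), Counter(predicted)
    tp = sum((g & p).values())  # multiset intersection = matched components
    precision = tp / sum(p.values()) if p else 0.0
    recall = tp / sum(g.values()) if g else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["Turn(LEFT)", "Travel(steps:2)", "Verify(at:SOFA)"]
pred = ["Turn(LEFT)", "Travel(steps:2)", "Turn(RIGHT)"]
p, r, f1 = partial_parse_f1(gold, pred)
assert (p, r) == (2 / 3, 2 / 3)
```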

Parse Accuracy (English)

System                            Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

System                            Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

System                            Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
 – Facing direction is also considered for single sentences
 – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

System                            Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

System                            Single-Sentence   Paragraph
Chen (2012)                       58.7              20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

System                            Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion
• Better recall in parse accuracy
 – Our probabilistic model uses useful but low-score lexemes as well → more coverage
 – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
 – Complexity: LHG and k-permutation rules
 – Particularly weak on the Chinese-character corpus: the longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
 – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al 2011
 – Overcomes intractability in complex MRLs
 – Learns from more general, complex ambiguity
 – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

Data                  Hierarchy Generation PCFG    Unigram Generation PCFG
                      |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English               20,451      17.26            16,357      8.78
Chinese (Word)        21,636      15.99            15,459      8.05
Chinese (Character)   19,792      18.64            13,514      12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
 – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
 – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• An effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
 – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
 – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
 – Part-of-speech tagging (Collins, EMNLP 2002)
 – Semantic role labeling (Toutanova et al., ACL 2005)
 – Named entity recognition (Collins, ACL 2002)
 – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
 – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
 – The trained model outputs the best result with maximum probability

[Diagram: a trained generative model maps a testing example to the 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
 – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: for a testing example, the trained baseline generative model produces n-best candidates via GEN; a trained secondary discriminative model selects the best prediction as the output]

71

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
 – Lack of a single gold-standard reference for each training example
 – Instead, there is only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
 – Evaluate candidate formal MRs by executing them in simulated worlds
  • Also used in evaluating the final end-task plan execution
 – A weak indication of whether a candidate is good or bad
 – Multiple candidate parses are used for each parameter update
  • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: the baseline model generates n-best candidates with feature vectors a1 … an; the perceptron scores them (e.g. −0.16, 1.21, −1.09, 1.46, 0.59) and, when the best prediction differs from the gold-standard reference with features ag, updates the weights by the feature-vector difference ag − a4]

73

For our generative models, no gold-standard reference is available.
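The update scheme above can be sketched as a standard averaged perceptron reranker (a minimal illustration; real feature extraction and n-best lists are omitted, and the toy data below is invented):

```python
def rerank_train(examples, epochs=5):
    """Averaged-perceptron reranker (Collins 2000), sketched.
    Each example is (candidate_feature_vectors, gold_index).  When the
    current weights rank a non-gold candidate first, add the feature
    difference (gold - predicted); return the averaged weights."""
    dim = len(examples[0][0][0])
    w = [0.0] * dim
    total = [0.0] * dim
    steps = 0
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    for _ in range(epochs):
        for feats, gold in examples:
            pred = max(range(len(feats)), key=lambda i: dot(feats[i], w))
            if pred != gold:
                w = [wi + g - p
                     for wi, g, p in zip(w, feats[gold], feats[pred])]
            total = [t + wi for t, wi in zip(total, w)]
            steps += 1
    return [t / steps for t in total]

# toy data: the gold candidate always has a 1 in dimension 0
examples = [([[0, 1], [1, 0]], 1), ([[1, 1], [0, 2]], 0)]
w = rerank_train(examples)
assert w[0] > w[1]
```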

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
 – The one most preferred in terms of plan execution
 – Evaluate the MR plans composed from the candidate parses
 – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
  • Also used for evaluating end-goal plan execution performance
 – Record the execution success rate
  • Whether each candidate MR reaches the intended destination
  • MARCO is nondeterministic: average over 10 trials
 – Prefer the candidate with the best execution success rate during training

74
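The pseudo-gold selection step can be sketched as follows; `execute` here is a hypothetical stand-in for the nondeterministic MARCO follower, not its real API:

```python
def execution_success_rate(plan, execute, trials=10):
    """Average success of executing a candidate MR plan.  `execute`
    returns True when one run reaches the intended destination, so the
    nondeterminism is averaged over several trials."""
    return sum(execute(plan) for _ in range(trials)) / trials

def pick_pseudo_gold(candidates, execute, trials=10):
    """Index of the candidate with the best execution success rate."""
    rates = [execution_success_rate(c, execute, trials)
             for c in candidates]
    best = max(range(len(candidates)), key=lambda i: rates[i])
    return best, rates

# toy deterministic "world": only planC ever reaches the destination
candidates = ["planA", "planB", "planC"]
execute = lambda plan: plan == "planC"
idx, rates = pick_pseudo_gold(candidates, execute)
assert idx == 2 and rates == [0.0, 0.0, 1.0]
```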

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the MR plans MR1 … MRn derived from the n-best candidates are run by the MARCO execution module, yielding execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) becomes the pseudo-gold reference, and the perceptron weights are updated by its feature-vector difference from the current best prediction]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
 – Multiple parses may have the same maximum execution success rate
 – “Lower” execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
  • MR plans may be underspecified or have ignorable details attached
  • Sometimes inaccurate, but containing the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
 – Use the candidates with higher execution success rates than the currently best-predicted candidate
 – Update with the feature-vector difference, weighted by the difference between execution success rates

76
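A sketch of a single update with multiple parses, scaling each contribution by the success-rate difference (toy feature vectors and rates are invented):

```python
def multi_parse_update(w, feats, rates):
    """One response-based perceptron update using multiple parses:
    every candidate whose execution success rate beats the currently
    predicted candidate's rate pulls the weights toward it, scaled by
    the success-rate difference."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    pred = max(range(len(feats)), key=lambda i: dot(feats[i], w))
    for i, rate in enumerate(rates):
        if rate > rates[pred]:
            scale = rate - rates[pred]
            w = [wi + scale * (fi - fp)
                 for wi, fi, fp in zip(w, feats[i], feats[pred])]
    return w, pred

w, pred = multi_parse_update([0.0, 0.0],
                             [[1, 0], [0, 1], [1, 1]],
                             [0.2, 0.9, 0.6])
# prediction was candidate 0 (rate 0.2); candidates 1 and 2 both update
assert pred == 0
```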

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, step (1): with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 over the n-best candidates, a candidate whose rate exceeds the predicted parse's contributes its feature-vector difference, weighted by the rate difference]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, step (2): the next candidate with a higher execution success rate than the predicted parse contributes its weighted feature-vector difference as well]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Example lexemes for “Turn left and find the sofa, then turn around the corner”:
 L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
 L2: Turn(LEFT), Verify(front: SOFA)
 L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
 L4: Turn(LEFT)
 L5: Travel(), Verify(at: SOFA)
 L6: Turn()

79

f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, “find”) = 1
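A sketch of extracting such indicator features from a parse tree (the tree encoding is invented for illustration):

```python
def parse_tree_features(tree):
    """Binary indicator features over a parse tree: one feature per
    parent -> children expansion, plus one per (lexeme, word) grounding.
    `tree` is (label, children), with plain strings as NL-word leaves."""
    feats = set()
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0]
                         for c in children)
    feats.add((label,) + child_labels)  # expansion feature f(L -> ...)
    for c in children:
        if not isinstance(c, str):
            feats |= parse_tree_features(c)
    return feats

tree = ("L1", [("L3", [("L5", ["find"]), ("L6", ["turn"])])])
f = parse_tree_features(tree)
# mirrors the slide's examples: f(L1 -> L3), f(L3 -> L5 L6), f(L5, "find")
assert ("L1", "L3") in f and ("L3", "L5", "L6") in f and ("L5", "find") in f
```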

Evaluations
• Leave-one-map-out approach
 – 2 maps for training and 1 map for testing
 – Parse accuracy
 – Plan execution accuracy (end goal)
• Compared with the two baseline models
 – Hierarchy and Unigram Generation PCFG models
 – All reranking results use 50-best parses
 – We take the 50-best distinct composed MR plans (and their parses) out of 1,000,000-best parses
  • Many parse trees differ insignificantly, leading to the same derived MR plans
  • A sufficiently large 1,000,000-best parse list is generated from the baseline model

80

Response-based Update vs Baseline (English)

81

                  Parse F1          Single-Sentence    Paragraph
                  Hier.    Uni.     Hier.    Uni.      Hier.    Uni.
Baseline          74.81    76.44    57.22    67.14     20.17    28.12
Response-based    73.32    77.24    59.65    68.27     22.62    29.2

Response-based Update vs Baseline (Chinese-Word)

82

                  Parse F1          Single-Sentence    Paragraph
                  Hier.    Uni.     Hier.    Uni.      Hier.    Uni.
Baseline          75.53    76.41    61.03    63.4      19.08    23.12
Response-based    77.26    77.74    64.12    65.64     21.29    23.74

Response-based Update vs Baseline (Chinese-Character)

83

                  Parse F1          Single-Sentence    Paragraph
                  Hier.    Uni.     Hier.    Uni.      Hier.    Uni.
Baseline          73.05    77.55    55.61    62.85     12.74    23.33
Response-based    76.26    79.76    64.08    65.5      22.25    25.35

Response-based Update vs Baseline
• vs. baseline
 – The response-based approach performs better in the final end-task plan execution
 – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

           Parse F1          Single-Sentence    Paragraph
           Hier.    Uni.     Hier.    Uni.      Hier.    Uni.
Single     73.32    77.24    59.65    68.27     22.62    29.2
Multiple   73.43    77.81    62.81    68.93     26.57    29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

           Parse F1          Single-Sentence    Paragraph
           Hier.    Uni.     Hier.    Uni.      Hier.    Uni.
Single     77.26    77.74    64.12    65.64     21.29    23.74
Multiple   78.8     78.11    64.15    66.27     21.55    25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

           Parse F1          Single-Sentence    Paragraph
           Hier.    Uni.     Hier.    Uni.      Hier.    Uni.
Single     76.26    79.76    64.08    65.5      22.25    25.35
Multiple   79.44    79.94    64.08    66.84     22.58    27.16

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
 – A single-best pseudo-gold parse provides only weak feedback
 – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details that still capture the gist of the preferred actions
 – A variety of preferable parses improves the amount and quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
 – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
 – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
 – Learn a joint model of syntactic and semantic structure
• Large-scale data
 – Data collection; model adaptation to large scale
• Machine translation
 – Application to summarized translation
• Real perceptual data
 – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
 – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
 – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Navigation Example

[Slides 19–23: a worked example of grounding an unfamiliar language. The instructions “식당에서 우회전 하세요” (“Turn right at the restaurant”) and “병원에서 우회전 하세요” (“Turn right at the hospital”) are each observed with the follower action “Make a right turn” in two scenarios, so the shared phrase grounds to the turning action while 식당 (restaurant) and 병원 (hospital) ground to the landmarks]

Slides from David Chen

Thesis Contributions
• Generative models for grounded language learning from ambiguous perceptual environments
 – A unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
 – A general framework of probabilistic approaches that learn NL–MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
 – Standard reranking is not available: there is no single gold-standard reference for the training data
 – A weak response from the perceptual environment can train the discriminative reranker

24

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
 – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
 – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

25

Navigation Task (Chen and Mooney 2011)
• Learn to interpret and follow navigation instructions
 – e.g., “Go down this hall and make a right when you see an elevator to your left”
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
– Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (order of 10s)
– Can only output MRs included in the PCFG constructed from the training data
• Proposed model
– Uses semantic lexemes as units of semantic concepts
– Disambiguates NL-MR correspondences at the semantic concept (lexeme) level
– Handles a much higher level of ambiguous supervision
– Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

38

• Pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
– How probable graph g is given that phrase w is seen
• Examples
– "to the stool": Travel(), Verify(at BARSTOOL)
– "black easel": Verify(at EASEL)
– "turn left and walk": Turn(), Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

Lexeme score: co-occurrence of g and w vs. general occurrence of g without w
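The scoring idea above (co-occurrence of g with w, penalized by occurrences of g without w) can be sketched as a count-based estimate. This is an illustrative reconstruction, not Chen and Mooney's exact formula; all function and variable names are hypothetical.

```python
from collections import Counter

def lexeme_scores(examples):
    """Score (phrase w, subgraph g) lexeme candidates from co-occurrence.
    examples: list of (words, subgraphs) pairs, each a set of hashable items.
    A high score means g rarely occurs without w."""
    cooc = Counter()     # examples where w and g appear together
    g_count = Counter()  # examples where g appears at all
    for words, graphs in examples:
        for g in graphs:
            g_count[g] += 1
            for w in words:
                cooc[(w, g)] += 1
    # Fraction of g's occurrences that co-occur with w.
    return {(w, g): c / g_count[g] for (w, g), c in cooc.items()}
```

A lexeme like ("turn left", Turn(LEFT)) scores highly because Turn(LEFT) seldom appears in a landmarks plan whose instruction lacks that phrase.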

Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
– Lexeme MRs = semantic concepts
– Lexeme hierarchy = semantic concept hierarchy
– Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

40

[Example LHG: a landmarks-plan MR (Turn(RIGHT), Verify(side HATRACK, front SOFA), Travel(steps 3), Verify(at EASEL)) at the root, with successively smaller subgraph lexemes as descendants, e.g., Turn() Travel() Verify(at EASEL); Turn(RIGHT) Verify(side HATRACK) Travel(); Verify(at EASEL); Turn() Verify(side HATRACK).]

PCFG Construction

• Add rules for each node in the LHG
– Each complex concept chooses which subconcepts to describe that will finally be connected to the NL instruction
  • Each node generates all k-permutations of its children nodes (we do not know which subset is correct)
– NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al 2011)
– PCFG rule weights are optimized by EM
  • The most probable MR components out of all possible combinations are estimated

41
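The k-permutation expansion described above can be sketched with itertools; the (lhs, rhs) rule representation here is illustrative, not the thesis implementation.

```python
from itertools import permutations

def kperm_rules(parent, children):
    """Enumerate one PCFG rule per k-permutation of a node's children
    (k = 1 .. len(children)), since we don't know which subset of
    subconcepts the instruction actually describes.  EM later weights
    these competing rules."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, list(perm)))
    return rules
```

Note how quickly this grows: a node with n children yields sum over k of n!/(n-k)! rules, which is one source of the over-fitting discussed later.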

PCFG Construction

42

• Child concepts are generated from parent concepts selectively
• All semantic concepts generate relevant NL words
• Each semantic concept generates at least one NL word

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
– Consider only the lexeme MRs responsible for generating NL words
– From the bottom of the tree, mark only responsible MR components that propagate to the top level
– Able to compose novel MRs never seen in the training data

43
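The decoding step above can be illustrated with a toy probabilistic CKY over a grammar in Chomsky normal form. This sketch returns only the best parse's log-probability (tree recovery and MR composition are omitted), and the example grammar symbols are invented for illustration.

```python
import math

def cky_logprob(words, lexical, binary, start="S"):
    """Best-parse log-probability under a CNF PCFG.
    lexical maps (A, word) -> prob for rules A -> word;
    binary maps (A, B, C) -> prob for rules A -> B C.
    Returns the best log-probability, or None if no parse exists."""
    n = len(words)
    chart = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):  # fill the diagonal from lexical rules
        for (A, word), p in lexical.items():
            if word == w and math.log(p) > chart[i][i + 1].get(A, -math.inf):
                chart[i][i + 1][A] = math.log(p)
    for span in range(2, n + 1):   # combine smaller spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for mid in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[i][mid] and C in chart[mid][j]:
                        lp = math.log(p) + chart[i][mid][B] + chart[mid][j][C]
                        if lp > chart[i][j].get(A, -math.inf):
                            chart[i][j][A] = lp
    return chart[0][n].get(start)
```

In the actual models the nonterminals are lexeme MRs, so the best tree directly identifies which MR components were responsible for which words.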

[Slides 44-46: most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner". The context MR (Turn(LEFT), Verify(front BLUE HALL, front SOFA), Travel(steps 2), Verify(at SOFA), Turn(RIGHT)) expands through lexeme nonterminals such as Turn(LEFT), Travel() Verify(at SOFA), and Turn() that generate the NL words; the responsible MR components are marked from the bottom up and composed into the final parse Turn(LEFT), Travel(), Verify(at SOFA), Turn().]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
– Complexities caused by the Lexeme Hierarchy Graph and k-permutations
– Tends to over-fit to the training data
• Proposed solution: a simpler model
– Generates relevant semantic lexemes one by one
– No extra PCFG rules for k-permutations
– Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
– Each context MR generates its relevant lexemes one by one
– Permutations of the appearing orders of relevant lexemes are already accounted for

48
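The unigram construction can be sketched by contrast with the k-permutation scheme: the context MR nonterminal emits one relevant lexeme per step and either continues or stops, so the rule set stays linear in the number of lexemes. The representation is illustrative only.

```python
def unigram_rules(context, lexemes):
    """Unigram-Markov PCFG rules: the context MR emits its relevant
    lexemes one at a time, in any order, then stops.  Two rules per
    lexeme (emit-and-continue, emit-and-stop); weights left to EM."""
    rules = []
    for lex in lexemes:
        rules.append((context, [lex, context]))  # emit lex, keep generating
        rules.append((context, [lex]))           # emit lex and stop
    return rules
```

For n lexemes this yields 2n rules, versus the factorial blow-up of enumerating all k-permutations, which is why this model trains faster and generalizes better.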

PCFG Construction

49

• Each semantic concept is generated by a unigram Markov process
• All semantic concepts generate relevant NL words

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
– Consider only the lexeme MRs responsible for generating NL words
– Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Slides 51-54: most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The context MR (Turn(LEFT), Verify(front BLUE HALL, front SOFA), Travel(steps 2), Verify(at SOFA), Turn(RIGHT)) generates its relevant lexemes (Turn(LEFT); Travel() Verify(at SOFA); Turn()) one by one; their components are marked in the context MR at the top nonterminal to compose the final plan.]

54

Data
• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
– Word-segmented version by the Stanford Chinese Word Segmenter
– Character-segmented version

55

[Example: a paragraph instruction ("Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair, and at the chair take a right towards the stool. When you reach the stool you are at 7.") is hand-segmented into single sentences, each paired with its portion of the action trace (e.g., Turn, Forward; Turn left, Forward, Turn right, Forward x3, Turn right, Forward; ...).]

Data Statistics

                           Paragraph      Single-Sentence
# Instructions             706            3236
Avg # sentences            5.0 (±2.8)     1.0 (±0)
Avg # actions              10.4 (±5.7)    2.1 (±2.4)
Avg # words/sentence
  English                  37.6 (±21.1)   7.8 (±5.1)
  Chinese-Word             31.6 (±18.1)   6.9 (±4.9)
  Chinese-Character        48.9 (±28.3)   10.6 (±7.3)
Vocabulary
  English                  660            629
  Chinese-Word             661            508
  Chinese-Character        448            328

56

Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
– Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
  • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
  • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
– Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

58
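Partial parse accuracy can be sketched as component-level precision, recall, and F1 over the MR components of the predicted and gold plans. Representing components as strings in a set is a simplification for illustration.

```python
def prf1(predicted, gold):
    """Partial parse accuracy: precision, recall, F1 from the overlap
    of predicted and gold MR components (sets of component strings)."""
    matched = len(set(predicted) & set(gold))
    p = matched / len(predicted) if predicted else 0.0
    r = matched / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```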

Parse Accuracy (English)

                                  Precision   Recall    F1
Chen & Mooney (2011)                  90.16    55.41   68.59
Chen (2012)                           88.36    57.03   69.31
Hierarchy Generation PCFG Model       87.58    65.41   74.81
Unigram Generation PCFG Model         86.1     68.79   76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall    F1
Chen (2012)                           88.87    58.76   70.74
Hierarchy Generation PCFG Model       80.56    71.14   75.53
Unigram Generation PCFG Model         79.45    73.66   76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall    F1
Chen (2012)                           92.48    56.47   70.01
Hierarchy Generation PCFG Model       79.77    67.38   73.05
Unigram Generation PCFG Model         79.73    75.52   77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
– Also considers facing direction in single-sentence evaluation
– Paragraph execution is affected by even one failed single-sentence execution

62
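The strict metric can be sketched as an exact match on the final position (and heading, for single sentences), with paragraph success requiring every step to succeed. The (position, heading) state representation is an assumption for the sketch.

```python
def execution_success(final_state, goal_state, check_heading=True):
    """Strict end-to-end metric: success only if the follower ends at
    exactly the goal position (and, for single sentences, facing the
    goal direction).  States are (position, heading) tuples."""
    (pos, heading), (goal_pos, goal_heading) = final_state, goal_state
    if pos != goal_pos:
        return False
    return heading == goal_heading if check_heading else True

def paragraph_success(step_results):
    # One failed single-sentence execution fails the whole paragraph.
    return all(step_results)
```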

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)                      54.4        16.18
Chen (2012)                               57.28       19.18
Hierarchy Generation PCFG Model           57.22       20.17
Unigram Generation PCFG Model             67.14       28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                               58.7        20.13
Hierarchy Generation PCFG Model           61.03       19.08
Unigram Generation PCFG Model             63.4        23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                               57.27       16.73
Hierarchy Generation PCFG Model           55.61       12.74
Unigram Generation PCFG Model             62.85       23.33

65

Discussion
• Better recall in parse accuracy
– Our probabilistic model uses useful but low-score lexemes as well → more coverage
– Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
– Complexities: LHG and k-permutation rules
– Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
– Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al 2011
– Overcomes intractability in complex MRLs
– Learns from more general, complex ambiguity
– Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation        Unigram Generation
Data                  |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
English                 20,451      17.26           16,357       8.78
Chinese (Word)          21,636      15.99           15,459       8.05
Chinese (Character)     19,792      18.64           13,514      12.58

67

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
– Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
– Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
– Part-of-speech tagging (Collins, EMNLP 2002)
– Semantic role labeling (Toutanova et al., ACL 2005)
– Named entity recognition (Collins, ACL 2002)
– Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
– Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
– Trained model outputs the best result with maximum probability
[Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
– A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 ... Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]

71

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
– Lack of a single gold-standard reference for each training example
– Instead provides weak supervision of the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
– Evaluate candidate formal MRs by executing them in simulated worlds
  • Also used in evaluating the final end-task plan execution
– Weak indication of whether a candidate is good/bad
– Multiple candidate parses for parameter update
  • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates, each scored by the perceptron (e.g., -0.16, 1.21, -1.09, 1.46, 0.59); the weights are updated by the feature-vector difference between the gold-standard reference and the best prediction. For our generative models, no gold-standard reference is available.]

73
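A minimal averaged-perceptron reranker in the style of Collins (2000) can be sketched as below; in the grounded setting the gold index would be the pseudo-gold candidate chosen by execution feedback. This is an illustrative sketch, not the thesis implementation.

```python
from collections import defaultdict

def rerank_train(examples, n_epochs=5):
    """Averaged-perceptron reranker sketch (after Collins 2000).
    examples: list of (candidates, gold_index); each candidate is a
    sparse feature dict.  Returns the averaged weight vector."""
    w = defaultdict(float)      # current weights
    total = defaultdict(float)  # running sum for averaging
    steps = 0
    for _ in range(n_epochs):
        for cands, gold in examples:
            def score(f):
                return sum(w[k] * v for k, v in f.items())
            pred = max(range(len(cands)), key=lambda i: score(cands[i]))
            if pred != gold:
                # move weights toward the (pseudo-)gold candidate's features
                for k, v in cands[gold].items():
                    w[k] += v
                for k, v in cands[pred].items():
                    w[k] -= v
            steps += 1
            for k in list(w):   # accumulate for averaging
                total[k] += w[k]
    return {k: s / steps for k, s in total.items()}
```

Averaging the weights over all updates makes the reranker far less sensitive to the order of training examples than the final weight vector alone.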

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
– The most preferred one in terms of plan execution
– Evaluate the composed MR plans from candidate parses
– The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
  • Also used for evaluating end-goal plan execution performance
– Record the execution success rate
  • Whether each candidate MR reaches the intended destination
  • MARCO is nondeterministic: average over 10 trials
– Prefer the candidate with the best execution success rate during training

74

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: each of the n-best candidates is mapped to its derived MR and run through the MARCO Execution Module, yielding execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate serves as the pseudo-gold reference, and the perceptron weights are updated by the feature-vector difference between it and the best prediction.]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
– Multiple parses may have the same maximum execution success rate
– "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
  • MR plans are underspecified or have ignorable details attached
  • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
– Use candidates with higher execution success rates than the currently best-predicted candidate
– Update with the feature-vector difference, weighted by the difference between execution success rates

76
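The multiple-parse update can be sketched as one perceptron step per better-executing candidate, weighted by the success-rate difference. Function and argument names here are hypothetical.

```python
def multi_parse_update(weights, feats, success, pred):
    """Perceptron-style update using every candidate whose execution
    success rate beats the currently predicted candidate, weighted by
    the success-rate difference (sketch of the multi-parse variant).
    feats: list of sparse feature dicts; success: list of rates;
    pred: index of the model's current best candidate."""
    for i, rate in enumerate(success):
        margin = rate - success[pred]
        if margin > 0:  # only candidates that execute better than `pred`
            for k, v in feats[i].items():
                weights[k] = weights.get(k, 0.0) + margin * v
            for k, v in feats[pred].items():
                weights[k] = weights.get(k, 0.0) - margin * v
    return weights
```

Because each contribution is scaled by its margin, near-ties nudge the weights gently while clearly better plans dominate the update.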

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, slides 77-78: every candidate whose execution success rate (e.g., 0.6, 0.9) exceeds that of the currently predicted candidate (e.g., 0.2) contributes a weighted feature-vector difference to the perceptron update.]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front SOFA, back EASEL), Travel(steps 2), Verify(at SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front SOFA)   L3: Travel(steps 2), Verify(at SOFA), Turn(RIGHT)
L4: Turn(LEFT)   L5: Travel(), Verify(at SOFA)   L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
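Indicator features over tree compositions can be extracted as below. The nested-tuple tree encoding and feature-string format are assumptions for this sketch, not the thesis feature set.

```python
def tree_features(tree):
    """Binary indicator features for nonterminal/terminal compositions
    in a parse tree, e.g. f("L1 -> L2 L3") = 1.  Trees are nested
    tuples (label, child, ...) with string leaves for words."""
    feats = {}

    def walk(node):
        if isinstance(node, str):   # a word leaf: nothing to expand
            return
        label, *children = node
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        feats[f"{label} -> {rhs}"] = 1   # one indicator per local composition
        for c in children:
            walk(c)

    walk(tree)
    return feats
```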

Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)
• Compared with two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Try to get the 50-best distinct composed MR plans, and their corresponding parses, out of 1,000,000-best parses
  • Many parse trees differ insignificantly, leading to the same derived MR plans
  • Generate sufficiently large 1,000,000-best parse lists from the baseline model

80

Response-based Update vs Baseline (English)

             Parse F1             Single-Sentence       Paragraph
           Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy     74.81     73.32       57.22     59.65       20.17     22.62
Unigram       76.44     77.24       67.14     68.27       28.12     29.2

81

Response-based Update vs Baseline (Chinese-Word)

             Parse F1             Single-Sentence       Paragraph
           Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy     75.53     77.26       61.03     64.12       19.08     21.29
Unigram       76.41     77.74       63.4      65.64       23.12     23.74

82

Response-based Update vs Baseline (Chinese-Character)

             Parse F1             Single-Sentence       Paragraph
           Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy     73.05     76.26       55.61     64.08       12.74     22.25
Unigram       77.55     79.76       62.85     65.5        23.33     25.35

83

Response-based Update vs Baseline
• vs. the baselines, the response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

             Parse F1             Single-Sentence       Paragraph
            Single    Multi      Single    Multi       Single    Multi
Hierarchy     73.32     73.43      59.65     62.81       22.62     26.57
Unigram       77.24     77.81      68.27     68.93       29.2      29.1

85

Response-based Update with Multiple vs Single Parses (Chinese-Word)

             Parse F1             Single-Sentence       Paragraph
            Single    Multi      Single    Multi       Single    Multi
Hierarchy     77.26     78.8       64.12     64.15       21.29     21.55
Unigram       77.74     78.11      65.64     66.27       23.74     25.95

86

Response-based Update with Multiple vs Single Parses (Chinese-Character)

             Parse F1             Single-Sentence       Paragraph
            Single    Multi      Single    Multi       Single    Multi
Hierarchy     76.26     79.44      64.08     64.08       22.25     22.58
Unigram       79.76     79.94      65.5      66.84       25.35     27.16

87

Response-based Update with Multiple vs Single Parses
• Using multiple parses improves the performance in general
– A single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
– A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure
• Large-scale data
– Data collection; model adaptation to large scale
• Machine translation
– Application to summarized translation
• Real perceptual data
– Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

20Slide from David Chen

Navigation Example

Scenario 1

Scenario 2

식당

21Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원에서 우회전 하세요

식당에서 우회전 하세요

22Slide from David Chen

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence under the learned weights using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

[Figure (3 slides): most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner"; lexeme nonterminals such as Turn(LEFT), Verify(front: SOFA), Travel(steps: 2), Verify(at: SOFA), and Turn(RIGHT) dominate the NL words, and the responsible MR components are marked from the leaves up and composed into the final MR plan]

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generate relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the order in which relevant lexemes appear are thereby already accounted for

PCFG Construction

[Figure: unigram rule schema in which each semantic concept is generated by a unigram Markov process and all semantic concepts generate their relevant NL words]
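The rule-set savings over the k-permutation construction can be seen with a toy rule generator: a unigram Markov expansion needs a number of rules linear in the lexemes rather than factorial. All names here (`unigram_lexeme_rules`, `CTX`, `STOP`) are illustrative, not the thesis grammar:

```python
def unigram_lexeme_rules(context_mr, lexemes):
    """Unigram Markov expansion of a context MR: emit one relevant lexeme
    at a time, then either continue or stop.  Every ordering of lexemes is
    derivable without enumerating permutations, so the rule set stays
    linear in the number of lexemes."""
    cont = context_mr + "_CONT"
    rules = [(context_mr, (lex, cont)) for lex in lexemes]
    rules += [(cont, (lex, cont)) for lex in lexemes]
    rules.append((cont, ("STOP",)))
    return rules

rules = unigram_lexeme_rules("CTX", ["Turn(LEFT)", "Travel()", "Verify(at: SOFA)"])
print(len(rules))  # 2*3 + 1 = 7 rules, vs. 15 k-permutation rules for 3 children
```

The gap widens quickly: for 5 lexemes this scheme needs 11 rules while the k-permutation scheme needs 325.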

Parsing New NL Sentences

• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

[Figure (4 slides): most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner"; the context MR at the top nonterminal generates the relevant lexemes Turn(LEFT), Travel(), Verify(at: SOFA), Turn() one by one, and their components are marked in the context MR to form the final MR parse]

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

Example paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."
Single-sentence steps: "Take the wood path towards the easel." / "At the easel, go left and then take a right on the the blue path at the corner." / ...
Action traces: Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward / Forward, Turn left, Forward, Turn right / Turn / ...

Data Statistics

                              Paragraph        Single-Sentence
# Instructions                706              3236
Avg. # sentences              5.0 (±2.8)       1.0 (±0)
Avg. # actions                10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sentence
  English                     37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word                31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character           48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                     660              629
  Chinese-Word                661              508
  Chinese-Character           448              328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy
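Partial parse accuracy can be sketched as component-level precision/recall/F1 over the MR components of the predicted and gold plans. The exact matching criterion used in the thesis evaluation may allow partial component credit; this simplified version requires exact component matches:

```python
def parse_f1(predicted, gold):
    """Partial parse accuracy: credit each correctly recovered MR
    component rather than requiring the whole parse to match exactly."""
    matched = len(set(predicted) & set(gold))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = parse_f1(
    ["Turn(LEFT)", "Travel()"],
    ["Turn(LEFT)", "Travel(steps: 2)", "Verify(at: SOFA)"],
)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.33 0.4
```

Component-level scoring is what lets the tables below separate precision from recall, which is where the proposed models gain most.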

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

End-to-End Execution Evaluations

• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction for single-sentence instructions
  – Paragraph execution is affected by even one failed single-sentence execution

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.7              20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.4              23.12

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: the longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation PCFG   Unigram Generation PCFG
Data                  |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
English               20,451      17.26           16,357      8.78
Chinese (Word)        21,636      15.99           15,459      8.05
Chinese (Character)   19,792      18.64           13,514      12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Discriminative Reranking

• Effective approach to improve performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability

[Figure: a trained generative model maps a testing example to its 1-best candidate with maximum probability]

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Figure: the trained baseline generative model (GEN) produces n-best candidates for a testing example; a trained secondary discriminative model selects the best prediction as output]

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks plan) is provided
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds, as used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good or bad
  – Multiple candidate parses are used for the parameter update: the response signal is weak and distributed over all candidates

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Figure: the baseline generative model (GEN) produces n-best candidates for a training example; the perceptron scores each candidate's feature vector (e.g. −0.16, 1.21, −1.09, 1.46, 0.59) and the weight vector is updated by the feature-vector difference between the gold-standard reference and the best prediction. For our generative models, such a gold-standard reference is not available.]
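The update loop can be sketched as follows, assuming candidates arrive as dense feature vectors and the index of the reference candidate is known (in the grounded setting, the pseudo-gold candidate selected by execution stands in for the gold reference). Function and data names are illustrative:

```python
def averaged_perceptron_rerank(train, n_epochs=5):
    """Averaged perceptron for reranking (Collins 2000): each example is
    (candidate feature vectors, index of the reference candidate); weights
    are updated whenever the model prefers a non-reference candidate, and
    the weight vectors are averaged over all update steps."""
    dim = len(train[0][0][0])
    w = [0.0] * dim
    w_sum = [0.0] * dim
    steps = 0
    for _ in range(n_epochs):
        for candidates, ref in train:
            scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in candidates]
            pred = max(range(len(candidates)), key=scores.__getitem__)
            if pred != ref:
                for i in range(dim):
                    w[i] += candidates[ref][i] - candidates[pred][i]
            for i in range(dim):
                w_sum[i] += w[i]
            steps += 1
    return [s / steps for s in w_sum]

# One example with two candidate parses; feature 1 marks the reference parse.
data = [([[1.0, 0.0], [0.0, 1.0]], 1)]
w = averaged_perceptron_rerank(data)
print(w[1] > w[0])  # True
```

Averaging the weights over all steps, rather than keeping the final vector, is what makes the perceptron stable enough for reranking.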

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the MR plans composed from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world; it is also used for evaluating end-goal plan execution performance
  – Record the execution success rate: whether each candidate MR reaches the intended destination; MARCO is nondeterministic, so average over 10 trials
  – Prefer the candidate with the best execution success rate during training
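Pseudo-gold selection by execution can be sketched as below. `execute` stands in for one trial of the MARCO execution module (here a deterministic hypothetical stand-in); success rates are averaged over trials precisely because real execution is nondeterministic:

```python
def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Choose the pseudo-gold candidate: the derived MR plan with the
    highest average execution success rate in the simulated world.
    `execute(mr)` runs one (possibly nondeterministic) trial and returns
    True iff the follower reaches the intended destination."""
    rates = [sum(execute(mr) for _ in range(trials)) / trials
             for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=rates.__getitem__)
    return best, rates

# Deterministic stand-in for the MARCO execution module (hypothetical).
best, rates = pick_pseudo_gold(["planA", "planB"], lambda mr: mr == "planB")
print(best, rates)  # 1 [0.0, 1.0]
```

The returned rates are also reused later: candidates other than the single best one carry information about how preferable they are.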

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Figure: each of the n-best candidates is mapped to its derived MR plan; the MARCO execution module assigns each an execution success rate (e.g. 0.6, 0.4, 0.0, 0.9, 0.2), and the candidate with the highest rate becomes the pseudo-gold reference for the perceptron update]

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or have ignorable details attached, and are sometimes inaccurate but still contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
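The multiple-parse update can be sketched directly: every candidate whose success rate beats the currently predicted candidate contributes a feature-vector difference scaled by the rate gap. Function and argument names are illustrative:

```python
def multi_parse_update(w, feats, rates, pred):
    """Update the weight vector toward every candidate whose execution
    success rate beats the currently predicted candidate, scaling each
    feature-vector difference by the success-rate gap."""
    for f, r in zip(feats, rates):
        gap = r - rates[pred]
        if gap > 0:
            for d in range(len(w)):
                w[d] += gap * (f[d] - feats[pred][d])
    return w

w = multi_parse_update(
    [0.0, 0.0],
    feats=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    rates=[0.9, 0.2, 0.6],
    pred=1,  # currently predicted candidate, success rate 0.2
)
print(w)  # candidates 0 and 2 both pull the weights toward their features
```

Scaling by the gap means a candidate that barely outperforms the prediction nudges the weights only slightly, while a clearly better plan moves them strongly.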

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Figure (2 slides): the perceptron is updated toward every candidate whose execution success rate (e.g. 0.6, 0.4, 0.0, 0.9, 0.2) exceeds that of the currently predicted parse, one weighted feature-vector difference per such candidate]

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Example: "Turn left and find the sofa then turn around the corner"
  L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

  f(L1 → L3) = 1,  f(L3 → L5 | L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
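One plausible reading of these feature templates (plain rules, grandparent-conditioned rules, two-level rules, and lexical emissions) can be extracted from a nested-tuple tree as follows; the tree encoding and feature-string formats are invented for illustration:

```python
def tree_features(tree):
    """Binary indicator features over parse-tree compositions: CFG rules
    (parent->child), grandparent-conditioned rules, two-level rules, and
    lexical emissions (cf. Collins 2002; Ge & Mooney 2006)."""
    feats = set()

    def walk(node, parent=None):
        label, children = node
        subtrees = [c for c in children if isinstance(c, tuple)]
        for child in children:
            if isinstance(child, tuple):
                feats.add(f"{label}->{child[0]}")
                if parent is not None:
                    feats.add(f"{label}->{child[0]}|{parent}")
                walk(child, label)
            else:
                feats.add(f"{label}:{child}")  # lexical emission
        if subtrees:
            feats.add(label + "=>" + " ".join(c[0] for c in subtrees))

    walk(tree)
    return feats

# Hypothetical lexeme tree: L1 dominates L3, which expands to L5 and L6.
tree = ("L1", [("L3", [("L5", ["find"]), ("L6", ["turn"])])])
feats = tree_features(tree)
print(sorted(feats))
```

On this toy tree the extractor produces exactly the kinds of indicators on the slide, e.g. `L1->L3`, `L3->L5|L1`, `L3=>L5 L6`, and `L5:find`.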

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans (and their parses) out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model
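Collecting the 50 best distinct MR plans from a much larger ranked parse list is a dedup-while-scanning pass; `derive_mr` stands in for composing the MR plan from a parse tree:

```python
def distinct_mr_nbest(ranked_parses, derive_mr, n=50):
    """Scan a large ranked parse list (best first) and keep the first
    parse for each distinct derived MR plan, stopping at n: many parse
    trees differ insignificantly and collapse to the same plan."""
    seen, kept = set(), []
    for parse in ranked_parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append(parse)
            if len(kept) == n:
                break
    return kept

# Hypothetical parses t1..t4 deriving plans A, A, B, C.
kept = distinct_mr_nbest(["t1", "t2", "t3", "t4"],
                         {"t1": "A", "t2": "A", "t3": "B", "t4": "C"}.get,
                         n=2)
print(kept)  # ['t1', 't3']
```

Keeping the first parse per plan preserves the baseline ranking: each surviving candidate is the most probable parse of its plan.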

Response-based Update vs. Baseline (English)

            Parse F1               Single-Sentence        Paragraph
            Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy   74.81     73.32        57.22     59.65        20.17     22.62
Unigram     76.44     77.24        67.14     68.27        28.12     29.2

Response-based Update vs. Baseline (Chinese-Word)

            Parse F1               Single-Sentence        Paragraph
            Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy   75.53     77.26        61.03     64.12        19.08     21.29
Unigram     76.41     77.74        63.4      65.64        23.12     23.74

Response-based Update vs. Baseline (Chinese-Character)

            Parse F1               Single-Sentence        Paragraph
            Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy   73.05     76.26        55.61     64.08        12.74     22.25
Unigram     77.55     79.76        62.85     65.5         23.33     25.35

Response-based Update vs. Baseline

• vs. baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

Response-based Update with Multiple vs. Single Parses (English)

            Parse F1            Single-Sentence     Paragraph
            Single   Multi      Single   Multi      Single   Multi
Hierarchy   73.32    73.43      59.65    62.81      22.62    26.57
Unigram     77.24    77.81      68.27    68.93      29.2     29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

            Parse F1            Single-Sentence     Paragraph
            Single   Multi      Single   Multi      Single   Multi
Hierarchy   77.26    78.8       64.12    64.15      21.29    21.55
Unigram     77.74    78.11      65.64    66.27      23.74    25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

            Parse F1            Single-Sentence     Paragraph
            Single   Multi      Single   Multi      Single   Multi
Hierarchy   76.26    79.44      64.08    64.08      22.25    22.58
Unigram     79.76    79.94      65.5     66.84      25.35    27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – The single best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and quality of the weak feedback

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

Thank You

Navigation Example

[Figure (3 slides, from David Chen): two navigation scenarios illustrating learning from ambiguous perceptual context. Korean instructions are paired with observed actions: "병원에서 우회전 하세요" ("Turn right at the hospital") and "식당에서 우회전 하세요" ("Turn right at the restaurant"); by comparing Scenario 1 and Scenario 2, the learner can infer that 식당 refers to the restaurant and 병원 to the hospital]

Thesis Contributions

• Generative models for grounded language learning from an ambiguous perceptual environment
  – A unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – A general framework of probabilistic approaches that learn NL–MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not available: there is no single gold-standard reference for the training data
  – A weak response from the perceptual environment can train a discriminative reranker

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Navigation Task (Chen and Mooney 2011)

• Learn to interpret and follow navigation instructions
  – e.g. "Go down this hall and make a right when you see an elevator to your left"
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

Sample Environment (MacMahon et al. 2006)

[Figure: map of a virtual world whose hallway intersections hold objects marked H (hat rack), L (lamp), E (easel), S (sofa), B (barstool), and C (chair)]

Executing Test Instruction

[Video: an agent executing a test instruction in the virtual world]

Task Objective

• Learn the underlying meanings of instructions by observing human actions for those instructions
  – Learn to map instructions (NL) into the correct formal plan of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney 2011)
  – Landmarks plan:
    Describes actions in the environment along with notable objects encountered on the way
    Overestimates the meaning of the instruction, including unnecessary details
    Only a subset of the plan is relevant to the instruction

Challenges

Instruction: "at the easel, go left and then take a right onto the blue path at the corner"

Landmarks plan:
  Travel ( steps: 1 ), Verify ( at: EASEL, side: CONCRETE HALLWAY ), Turn ( LEFT ), Verify ( front: CONCRETE HALLWAY ), Travel ( steps: 1 ), Verify ( side: BLUE HALLWAY, front: WALL ), Turn ( RIGHT ), Verify ( back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL )

Only a subset of these components is the correct plan for the instruction, and finding it is a combinatorial matching problem between the instruction and the landmarks plan, with an exponential number of possibilities.

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL–MR correspondence problem
  – Constructs supervised NL–MR training data by refining the landmarks plan with a learned semantic lexicon: greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement:
    Deterministically selects high-score lexemes
    Ignores possibly useful low-score lexemes
    Some relevant MR components are not considered at all

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans jointly
  – Use the semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL–MR correspondences
  – Transform the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals

System Diagram (Chen and Mooney 2011)

[Figure: learning system for parsing navigation instructions. Training: observed (instruction, world state, action trace) triples pass through a navigation plan constructor to produce landmarks plans; plan refinement turns these into supervised refined plans for a (supervised) semantic parser learner, with possible information loss at the refinement step. Testing: the learned semantic parser maps an instruction and world state to a plan that the MARCO execution module executes to produce an action trace.]

System Diagram of Proposed Solution

[Figure: the same pipeline, except that the landmarks plans feed a probabilistic semantic parser learner that learns directly from the ambiguous supervision, with no separate refinement step and no information loss.]

PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)

• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: one NL sentence paired with a handful of MRs (order of 10s)
  – Can only output MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL–MR correspondences at the semantic-concept (lexeme) level
  – Handles a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

[Figure, slides 51-54: parsing the test NL instruction "Turn left and find the sofa, then turn around the corner". The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(); marking those components in the context MR yields the composed MR parse Turn(LEFT), Travel(), Verify(at: SOFA), Turn()]

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph instruction:
"Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair, and at the chair, take a right towards the stool. When you reach the stool, you are at 7."
Action trace: Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward

Single-sentence segments:
"Take the wood path towards the easel." → Turn, Forward
"At the easel, go left and then take a right on the blue path at the corner." → Turn left, Forward, Turn right

Data Statistics

56

                          Paragraph        Single-Sentence
Instructions              706              3236
Avg. sentences            5.0 (±2.8)       1.0 (±0)
Avg. actions              10.4 (±5.7)      2.1 (±2.4)
Avg. words/sent.
  English                 37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                 660              629
  Chinese-Word            661              508
  Chinese-Character       448              328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy

• Compared with Chen and Mooney 2011 and Chen 2012
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
      Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
      Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data

• Metric: partial parse accuracy

58
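As a rough illustration of a partial-credit metric, one could score the overlap of MR components between the predicted and gold parses; this sketch assumes flat component lists and is not the exact metric implementation used in the thesis:

```python
from collections import Counter

def prf1(predicted, gold):
    """Partial parse accuracy sketch: credit the overlap between the
    multisets of MR components in the predicted and gold parses."""
    p, g = Counter(predicted), Counter(gold)
    correct = sum((p & g).values())          # multiset intersection
    precision = correct / sum(p.values())
    recall = correct / sum(g.values())
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```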

Parse Accuracy (English)

                                   Precision   Recall   F1
Chen & Mooney (2011)               90.16       55.41    68.59
Chen (2012)                        88.36       57.03    69.31
Hierarchy Generation PCFG Model    87.58       65.41    74.81
Unigram Generation PCFG Model      86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                   Precision   Recall   F1
Chen (2012)                        88.87       58.76    70.74
Hierarchy Generation PCFG Model    80.56       71.14    75.53
Unigram Generation PCFG Model      79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                   Precision   Recall   F1
Chen (2012)                        92.48       56.47    70.01
Hierarchy Generation PCFG Model    79.77       67.38    73.05
Unigram Generation PCFG Model      79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination

• Strict metric: only successful if the final position matches exactly
  – Also consider facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                                   Single-Sentence   Paragraph
Chen & Mooney (2011)               54.4              16.18
Chen (2012)                        57.28             19.18
Hierarchy Generation PCFG Model    57.22             20.17
Unigram Generation PCFG Model      67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                   Single-Sentence   Paragraph
Chen (2012)                        58.7              20.13
Hierarchy Generation PCFG Model    61.03             19.08
Unigram Generation PCFG Model      63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                   Single-Sentence   Paragraph
Chen (2012)                        57.27             16.73
Hierarchy Generation PCFG Model    55.61             12.74
Unigram Generation PCFG Model      62.85             23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss

• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate

• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization

• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRL
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation PCFG    Unigram Generation PCFG
Data                  |Grammar|    Time (hrs)      |Grammar|    Time (hrs)
English               20451        17.26           16357        8.78
Chinese (Word)        21636        15.99           15459        8.05
Chinese (Character)   19792        18.64           13514        12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)

• Goal
  – Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability

[Diagram: a testing example goes into the trained generative model, which outputs the single 1-best candidate with maximum probability]

70

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: a testing example goes into the trained baseline generative model; GEN produces n-best candidates (Candidate 1 … Candidate n); a trained secondary discriminative model selects the best prediction as output]

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (as used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses are used for the parameter update
    (the response signal is weak and distributed over all candidates)

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: for a training example, GEN produces n-best candidates from the trained baseline generative model; the perceptron scores each candidate's feature vector a_1 … a_n (e.g. -0.16, 1.21, -1.09, 1.46, 0.59) and the weights are updated with the feature-vector difference a_g - a_4 between the gold-standard reference and the best prediction. For our generative models, no gold-standard reference is available]

73
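A minimal averaged perceptron in the spirit of Collins (2000), assuming each candidate is a dense feature vector; `averaged_perceptron` is an illustrative sketch, not the system's reranker:

```python
def averaged_perceptron(examples, n_feats, epochs=5):
    """Averaged perceptron sketch (after Collins 2000): on a wrong
    prediction, add the gold candidate's features and subtract the
    predicted one's; return the average of all intermediate weight
    vectors. `examples` is a list of (candidate_feature_vectors, gold_index)."""
    w = [0.0] * n_feats
    total = [0.0] * n_feats
    steps = 0
    for _ in range(epochs):
        for cands, gold in examples:
            # pick the candidate with the highest dot-product score
            pred = max(range(len(cands)),
                       key=lambda c: sum(wi * xi for wi, xi in zip(w, cands[c])))
            if pred != gold:
                for i in range(n_feats):
                    w[i] += cands[gold][i] - cands[pred][i]
            total = [t + wi for t, wi in zip(total, w)]
            steps += 1
    return [t / steps for t in total]
```

Averaging the intermediate weight vectors reduces variance compared with returning the final vector alone.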

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic: average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
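The pseudo-gold selection above can be sketched as follows, with `execute` standing in for one nondeterministic MARCO run (a hypothetical interface):

```python
def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Pseudo-gold selection sketch: run a (possibly nondeterministic)
    executor on each candidate MR plan and keep the one with the
    highest average success rate over `trials` runs."""
    def success_rate(mr):
        return sum(1 for _ in range(trials) if execute(mr)) / trials
    return max(candidate_mrs, key=success_rate)
```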

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the MARCO execution module derives MR plans from the n-best candidates and records execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron weights are updated with the feature-vector difference from the best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given only indirect supervision of human follower actions
    • MR plans may be underspecified or carry ignorable details
    • Sometimes inaccurate, but containing the correct MR components to reach the desired goal

• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
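A sketch of this multiple-parse update rule, where each better-executing candidate contributes a feature-vector difference scaled by the success-rate gap; the function name and sparse-dict data layout are illustrative assumptions:

```python
def multi_parse_update(weights, feats, rates, predicted):
    """Sketch of the multiple-parse update: every candidate whose
    execution success rate beats the currently predicted parse
    contributes a feature-vector difference, scaled by the gap in
    success rates. `feats` maps candidate -> sparse feature dict."""
    for cand, rate in rates.items():
        gap = rate - rates[predicted]
        if gap <= 0:
            continue
        for f, v in feats[cand].items():          # promote the better candidate
            weights[f] = weights.get(f, 0.0) + gap * v
        for f, v in feats[predicted].items():     # demote the predicted parse
            weights[f] = weights.get(f, 0.0) - gap * v
    return weights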

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, slides 77-78: MARCO execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2) over the n-best candidates' derived MRs; every candidate that beats the currently predicted parse contributes a feature-vector-difference update to the perceptron weights]

78

Features

• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)
L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at: SOFA)
L6: Turn()

79

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1
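Extracting such indicator features from a parse tree can be sketched as below, with trees encoded as (label, children) pairs; the encoding and function name are assumptions for illustration:

```python
def parse_tree_features(tree):
    """Binary indicator features: which parent -> children label
    compositions occur in a parse tree encoded as (label, children)."""
    feats = {}
    def walk(node):
        label, children = node
        if children:  # terminals (empty children) fire no rule feature
            feats[label + " -> " + " ".join(c[0] for c in children)] = 1
            for child in children:
                walk(child)
    walk(tree)
    return feats
```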

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)

• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans (and the corresponding parses) out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse lists from the baseline model

80

Response-based Update vs. Baseline (English)

81

Parse F1           Hierarchy   Unigram
Baseline           74.81       76.44
Response-based     73.32       77.24

Single-sentence    Hierarchy   Unigram
Baseline           57.22       67.14
Response-based     59.65       68.27

Paragraph          Hierarchy   Unigram
Baseline           20.17       28.12
Response-based     22.62       29.2

Response-based Update vs. Baseline (Chinese-Word)

82

Parse F1           Hierarchy   Unigram
Baseline           75.53       76.41
Response-based     77.26       77.74

Single-sentence    Hierarchy   Unigram
Baseline           61.03       63.4
Response-based     64.12       65.64

Paragraph          Hierarchy   Unigram
Baseline           19.08       23.12
Response-based     21.29       23.74

Response-based Update vs. Baseline (Chinese-Character)

83

Parse F1           Hierarchy   Unigram
Baseline           73.05       77.55
Response-based     76.26       79.76

Single-sentence    Hierarchy   Unigram
Baseline           55.61       62.85
Response-based     64.08       65.5

Paragraph          Hierarchy   Unigram
Baseline           12.74       23.33
Response-based     22.25       25.35

Response-based Update vs Baseline

• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

Parse F1           Hierarchy   Unigram
Single             73.32       77.24
Multiple           73.43       77.81

Single-sentence    Hierarchy   Unigram
Single             59.65       68.27
Multiple           62.81       68.93

Paragraph          Hierarchy   Unigram
Single             22.62       29.2
Multiple           26.57       29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

Parse F1           Hierarchy   Unigram
Single             77.26       77.74
Multiple           78.8        78.11

Single-sentence    Hierarchy   Unigram
Single             64.12       65.64
Multiple           64.15       66.27

Paragraph          Hierarchy   Unigram
Single             21.29       23.74
Multiple           21.55       25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

Parse F1           Hierarchy   Unigram
Single             76.26       79.76
Multiple           79.44       79.94

Single-sentence    Hierarchy   Unigram
Single             64.08       65.5
Multiple           64.08       66.84

Paragraph          Hierarchy   Unigram
Single             22.25       25.35
Multiple           22.58       27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details that still capture the gist of the preferred actions
  – A variety of preferable parses improves the amount and quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure

• Large-scale data
  – Data collection and model adaptation at large scale

• Machine translation
  – Application to summarized translation

• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data

• Grounded language learning from relevant perceptual context is promising, and the training corpus is easy to obtain

• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences with ambiguous supervision

• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Navigation Example

Scenario 1: 병원에서 우회전 하세요 ("Turn right at the hospital")
Scenario 2: 식당에서 우회전 하세요 ("Turn right at the restaurant")

22
Slide from David Chen

Navigation Example

Scenario 1 / Scenario 2: 병원 ("hospital")

23
Slide from David Chen

Thesis Contributions

• Generative models for grounded language learning from an ambiguous perceptual environment
  – A unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – A general framework of probabilistic approaches that learn NL-MR correspondences from ambiguous supervision

• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not available: no single gold-standard reference exists for the training data
  – A weak response from the perceptual environment can train the discriminative reranker

24

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

25

Navigation Task (Chen and Mooney 2011)

• Learn to interpret and follow navigation instructions
  – e.g. "Go down this hall and make a right when you see an elevator to your left"
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

26

Sample Environment (MacMahon et al. 2006)

[Map figure; legend:]
H – Hat Rack    L – Lamp    E – Easel    S – Sofa    B – Barstool    C – Chair

27

Executing Test Instruction

28


Task Objective

• Learn the underlying meanings of instructions by observing human actions for those instructions
  – Learn to map instructions (NL) into the correct formal plan of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney 2011)
  – Landmarks plan:
    • Describes actions in the environment, along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant to the instruction

29

Challenges (slides 30-32)

Instruction: "at the easel go left and then take a right onto the blue path at the corner"

Landmarks plan:
Travel ( steps: 1 ), Verify ( at: EASEL, side: CONCRETE HALLWAY ), Turn ( LEFT ), Verify ( front: CONCRETE HALLWAY ), Travel ( steps: 1 ), Verify ( side: BLUE HALLWAY, front: WALL ), Turn ( RIGHT ), Verify ( back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL )

Correct plan: only a subset of the landmarks plan (highlighted on the slide).

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans jointly
  – Use the semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL-MR correspondences
  – Transform the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals

34

35

System Diagram (Chen and Mooney 2011)

[Diagram: a learning system for parsing navigation instructions. Training: the instruction, world state, and observed action trace go through the Navigation Plan Constructor to produce a landmarks plan; Plan Refinement yields a supervised refined plan (with possible information loss) for the supervised semantic parser learner. Testing: the semantic parser maps an instruction and world state to a plan, which the execution module (MARCO) runs to produce an action trace]

36

System Diagram of Proposed Solution

[Diagram: the same pipeline, but the landmarks plan feeds a probabilistic semantic parser learner that learns directly from the ambiguous supervision, removing the lossy plan-refinement step]

PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)

37

• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL paired with a handful of MRs (on the order of 10s)
  – Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney 2011)

39

• A pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is, given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()

Score: co-occurrence of g and w vs. general occurrence of g without w
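One way to read the score illustrated above: compare how often g co-occurs with w against how often g occurs without w. A minimal sketch, assuming simple counts over the training examples (the exact scoring function in the cited work may differ):

```python
def lexeme_score(cooc, w_count, g_count, n_examples):
    """Lexicon scoring sketch: how much more likely subgraph g is when
    phrase w is present than when it is absent, p(g|w) - p(g|not w).
    cooc = examples containing both w and g; w_count / g_count = examples
    containing w / g; n_examples = total training examples."""
    p_g_given_w = cooc / w_count
    p_g_given_not_w = (g_count - cooc) / (n_examples - w_count)
    return p_g_given_w - p_g_given_not_w
```

A high score means g appears mostly where w does, so (w, g) is a good lexeme candidate; near-zero or negative scores mean g is just generally common.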

Lexeme Hierarchy Graph (LHG)

• A hierarchy of semantic lexemes, built by the subgraph relationship and constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which further connect to NL word groundings

40

[Figure: an LHG whose root Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into sub-lexemes such as Turn(), Travel(), Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK), Travel(); and Turn(), Verify(side: HATRACK)]

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe; these are finally connected to the NL instruction
    • Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41

PCFG Construction

42

Child concepts are generated from parent concepts selectively

All semantic concepts generate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs that appear in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
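The CKY step can be sketched as a small Viterbi parser over a binarized PCFG; the grammar encoding and the function name are assumptions for illustration, not the thesis implementation:

```python
import math

def cky_best(words, lexical, binary, start):
    """Viterbi CKY sketch: log-probability of the most probable parse of
    `words` under a PCFG in Chomsky normal form. `lexical[(A, w)]` and
    `binary[(A, B, C)]` map rules to probabilities (assumed encoding)."""
    n = len(words)
    best = {}  # (i, j, A) -> best log prob of A spanning words[i:j]
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                key = (i, i + 1, A)
                best[key] = max(best.get(key, -math.inf), math.log(p))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between B and C
                for (A, B, C), p in binary.items():
                    left, right = best.get((i, k, B)), best.get((k, j, C))
                    if left is not None and right is not None:
                        s = math.log(p) + left + right
                        if s > best.get((i, j, A), -math.inf):
                            best[(i, j, A)] = s
    return best.get((0, n, start))
```

Keeping back-pointers alongside `best` would recover the tree itself, from which the lexeme MRs can then be composed.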

[Figure, slides 44-45: the most probable parse tree for the test NL instruction "Turn left and find the sofa, then turn around the corner", with lexeme nonterminals such as Turn(LEFT); Travel(steps: 2), Verify(at: SOFA); and Turn(RIGHT) spanning the NL words]
Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

                                   Precision   Recall   F1
Chen (2012)                        88.87       58.76    70.74
Hierarchy Generation PCFG Model    80.56       71.14    75.53
Unigram Generation PCFG Model      79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                   Precision   Recall   F1
Chen (2012)                        92.48       56.47    70.01
Hierarchy Generation PCFG Model    79.77       67.38    73.05
Unigram Generation PCFG Model      79.73       75.52    77.55

61

End-to-End Execution Evaluations
• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also consider facing direction in single-sentence
  – Paragraph execution is affected by even one single-sentence execution

62

End-to-End Execution Evaluations (English)

                                   Single-Sentence   Paragraph
Chen & Mooney (2011)               54.4              16.18
Chen (2012)                        57.28             19.18
Hierarchy Generation PCFG Model    57.22             20.17
Unigram Generation PCFG Model      67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                   Single-Sentence   Paragraph
Chen (2012)                        58.7              20.13
Hierarchy Generation PCFG Model    61.03             19.08
Unigram Generation PCFG Model      63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                   Single-Sentence   Paragraph
Chen (2012)                        57.27             16.73
Hierarchy Generation PCFG Model    55.61             12.74
Unigram Generation PCFG Model      62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak in the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation PCFG     Unigram Generation PCFG
Data                  |Grammar|    Time (hrs)       |Grammar|    Time (hrs)
English               20451        17.26            16357        8.78
Chinese (Word)        21636        15.99            15459        8.05
Chinese (Character)   19792        18.64            13514        12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improve performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – Trained model outputs the best result with max probability

[Diagram: testing example → trained generative model → 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – Secondary discriminative model picks the best out of n-best candidates from the baseline model

[Diagram: testing example → trained baseline generative model → GEN → n-best candidates (Candidate 1 … Candidate n) → trained secondary discriminative model → best prediction → output]

71

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead provides weak supervision of surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)
• Parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: the trained baseline generative model produces n-best candidates with feature vectors a1 … an and perceptron scores (e.g., -0.16, 1.21, -1.09, 1.46, 0.59); on a mistake, the weights are updated by the feature-vector difference ag − a4 between the gold-standard reference and the best prediction. For our generative models, such a gold-standard reference is not available.]

73
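The reranking model itself can be sketched as a standard averaged perceptron over sparse feature vectors. This is an illustrative implementation, not the thesis code; `train` pairs each example's candidate feature dicts with a gold index, which is exactly the supervision grounded learning lacks (hence the response-based update that follows):

```python
def perceptron_score(w, feats):
    # Dot product of weight vector and sparse feature vector (dicts).
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def averaged_perceptron(train, epochs=5):
    """Collins-style averaged-perceptron reranking sketch.

    `train` is a list of (candidate_feature_vectors, gold_index)
    pairs.  Weights are updated on mistakes by the feature-vector
    difference between the gold candidate and the prediction, and the
    returned weights are averaged over all steps for stability.
    """
    w, w_sum, n = {}, {}, 0
    for _ in range(epochs):
        for candidates, gold in train:
            pred = max(range(len(candidates)),
                       key=lambda i: perceptron_score(w, candidates[i]))
            if pred != gold:  # update only on mistakes
                for k, v in candidates[gold].items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in candidates[pred].items():
                    w[k] = w.get(k, 0.0) - v
            n += 1
            for k, v in w.items():  # accumulate for averaging
                w_sum[k] = w_sum.get(k, 0.0) + v
    return {k: v / n for k, v in w_sum.items()}
```

Averaging the weight vector over all updates, rather than keeping the final one, is the standard trick that makes the perceptron far less sensitive to the order of training examples.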

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – Most preferred one in terms of plan execution
  – Evaluate composed MR plans from candidate parses
  – MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic: average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
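The selection step above can be sketched as follows, with `execute` a hypothetical stand-in for the nondeterministic MARCO execution module; the 10-trial averaging follows the slide:

```python
def execution_success_rate(mr_plan, execute, world, goal, trials=10):
    """Fraction of trials in which executing `mr_plan` reaches `goal`.

    `execute(mr_plan, world)` stands in for the nondeterministic MARCO
    execution module and returns the final position; averaging over
    several trials smooths out its nondeterminism.
    """
    hits = sum(execute(mr_plan, world) == goal for _ in range(trials))
    return hits / trials

def pick_pseudo_gold(candidate_mrs, execute, world, goal):
    # Pseudo-gold = the candidate whose plan executes best in the world.
    rates = [execution_success_rate(mr, execute, world, goal)
             for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```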

Response-based Update
• Select pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates' derived MRs (MR1 … MRn) are run through the MARCO execution module, yielding execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron weights are updated by the feature-vector difference between it and the best prediction.]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans are underspecified or have ignorable details attached, and are sometimes inaccurate but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
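The multiple-parse update can be sketched as follows; feature vectors are sparse dicts, and scaling each feature-vector difference by the success-rate gap is the weighting described above (all names are illustrative):

```python
def multi_parse_update(w, candidates, rates):
    """One response-based update using multiple candidate parses.

    `candidates` are sparse feature dicts and `rates` their execution
    success rates.  Every candidate that executes better than the
    current best prediction contributes (f(cand) - f(pred)) scaled by
    the success-rate difference, so strongly better candidates pull
    the weights harder than marginally better ones.
    """
    pred = max(range(len(candidates)),
               key=lambda i: sum(w.get(k, 0.0) * v
                                 for k, v in candidates[i].items()))
    for i, feats in enumerate(candidates):
        gap = rates[i] - rates[pred]
        if gap > 0:  # only candidates executing better than the prediction
            for k, v in feats.items():
                w[k] = w.get(k, 0.0) + gap * v
            for k, v in candidates[pred].items():
                w[k] = w.get(k, 0.0) - gap * v
    return w
```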

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: among the n-best candidates' derived MRs, every candidate whose MARCO execution success rate (e.g., 0.6, 0.4, 0.0, 0.9, 0.2) exceeds that of the current best prediction contributes a feature-vector difference to the update (Update (1)).]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: continuation of the previous slide, showing the second qualifying candidate contributing its feature-vector difference to the update (Update (2)).]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)
L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at: SOFA)
L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
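A sketch of how such binary composition features could be read off a parse tree, assuming (for illustration only) trees encoded as nested tuples of label strings:

```python
def rule_features(tree, feats=None):
    """Binary indicator features for parent→children compositions.

    A tree is (label, child, ...) with string leaves; each internal
    node fires one feature naming its label and its children's labels.
    This is an illustrative stand-in for the parse-tree features on
    the slide, not the exact feature set of the thesis.
    """
    if feats is None:
        feats = {}
    if isinstance(tree, str):  # leaf (NL word or bare label)
        return feats
    label, children = tree[0], tree[1:]
    key = label + " -> " + " ".join(
        c if isinstance(c, str) else c[0] for c in children)
    feats[key] = 1.0  # binary indicator
    for c in children:
        rule_features(c, feats)
    return feats
```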

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans (and corresponding parses) out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model

80
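Collecting the 50-best distinct plans can be sketched as a simple filter over the best-first parse list, with `derive_mr` a hypothetical helper mapping a parse tree to its composed MR plan:

```python
def distinct_nbest(parses, derive_mr, k=50):
    """Keep the top-k parses whose derived MR plans are distinct.

    `parses` is sorted best-first; `derive_mr` maps a parse to its
    composed MR plan (hypothetical helper).  Parses deriving an
    already-seen plan are skipped, so the k survivors carry k truly
    different plans for the reranker to choose among.
    """
    seen, out = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            out.append(parse)
            if len(out) == k:
                break
    return out
```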

Response-based Update vs. Baseline (English)

             Parse F1                 Single-Sentence          Paragraph
             Baseline  Response       Baseline  Response       Baseline  Response
Hierarchy    74.81     73.32          57.22     59.65          20.17     22.62
Unigram      76.44     77.24          67.14     68.27          28.12     29.2

81

Response-based Update vs. Baseline (Chinese-Word)

             Parse F1                 Single-Sentence          Paragraph
             Baseline  Response       Baseline  Response       Baseline  Response
Hierarchy    75.53     77.26          61.03     64.12          19.08     21.29
Unigram      76.41     77.74          63.4      65.64          23.12     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

             Parse F1                 Single-Sentence          Paragraph
             Baseline  Response       Baseline  Response       Baseline  Response
Hierarchy    73.05     76.26          55.61     64.08          12.74     22.25
Unigram      77.55     79.76          62.85     65.5           23.33     25.35

83

Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

             Parse F1                 Single-Sentence          Paragraph
             Single    Multi          Single    Multi          Single    Multi
Hierarchy    73.32     73.43          59.65     62.81          22.62     26.57
Unigram      77.24     77.81          68.27     68.93          29.2      29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

             Parse F1                 Single-Sentence          Paragraph
             Single    Multi          Single    Multi          Single    Multi
Hierarchy    77.26     78.8           64.12     64.15          21.29     21.55
Unigram      77.74     78.11          65.64     66.27          23.74     25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

             Parse F1                 Single-Sentence          Paragraph
             Single    Multi          Single    Multi          Single    Multi
Hierarchy    76.26     79.44          64.08     64.08          22.25     22.58
Unigram      79.76     79.94          65.5      66.84          25.35     27.16

87

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 23: Grounded Language Learning Models for Ambiguous  Supervision

Navigation Example

Scenario 1

Scenario 2병원

23Slide from David Chen

Thesis Contributionsbull Generative models for grounded language learning from

ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR

structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR

correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language

learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train

discriminative reranker

24

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

25

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• An effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability
[Diagram: a testing example feeds the trained generative model, which outputs the 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: a testing example feeds the trained baseline generative model, whose GEN function produces n-best candidates; the trained secondary discriminative model picks the best prediction as the output]

71

How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied to grounded language learning
  – No single gold-standard reference for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating the final end-task plan execution)
  – A weak indication of whether a candidate is good or bad
  – Multiple candidate parses per parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: for a training example, the trained baseline generative model (GEN) produces n-best candidates with feature vectors a1 … an and perceptron scores (e.g., −0.16, 1.21, −1.09, 1.46, 0.59); when the best prediction differs from the gold-standard reference (feature vector ag), the weights are updated by the feature-vector difference ag − a4]

73

For our generative models, such a gold-standard reference is not available.
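The slide's update rule can be sketched as a small averaged perceptron reranker in the style of Collins (2000). This is an illustrative sketch, not the thesis implementation: candidates are represented as sparse feature dicts, and a gold-standard index is assumed to be available.

```python
# Averaged perceptron reranking sketch: score n-best candidates, update the
# weights by the feature-vector difference (gold - predicted) on mistakes,
# and return the averaged weights for more stable test-time scoring.
from collections import defaultdict

def train_reranker(examples, n_epochs=10):
    """examples: list of (candidates, gold_index); each candidate is a
    sparse feature dict mapping feature -> value."""
    w = defaultdict(float)      # current weights
    w_sum = defaultdict(float)  # running sum for averaging
    t = 0
    for _ in range(n_epochs):
        for candidates, gold in examples:
            scores = [sum(w[f] * v for f, v in c.items()) for c in candidates]
            pred = max(range(len(candidates)), key=lambda i: scores[i])
            if pred != gold:
                # update by the feature-vector difference (gold - predicted)
                for f, v in candidates[gold].items():
                    w[f] += v
                for f, v in candidates[pred].items():
                    w[f] -= v
            t += 1
            for f, v in w.items():
                w_sum[f] += v
    return {f: v / t for f, v in w_sum.items()}  # averaged weights
```

With one training example whose two candidates each fire a distinct feature, the averaged weights end up preferring the gold candidate's feature.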

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
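The pseudo-gold selection above can be sketched as follows. This is a hedged sketch: `execute` is a hypothetical callback standing in for the nondeterministic MARCO execution module, and success rates are averaged over repeated trials.

```python
# Pseudo-gold selection sketch: run each candidate MR plan several times in a
# simulated world and prefer the candidate with the best success rate.

def execution_success_rate(plan, execute, world, n_trials=10):
    """Average success over repeated trials (the executor is nondeterministic)."""
    return sum(bool(execute(plan, world)) for _ in range(n_trials)) / n_trials

def pick_pseudo_gold(candidate_plans, execute, world):
    """Return the index of the best-executing candidate plus all rates."""
    rates = [execution_success_rate(p, execute, world) for p in candidate_plans]
    best = max(range(len(candidate_plans)), key=lambda i: rates[i])
    return best, rates
```

With a deterministic toy executor the candidate reaching the goal is selected.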

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: derived MRs MR1 … MRn from the n-best candidates are run through the MARCO execution module, giving execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron updates by the feature-vector difference between it and the best prediction (perceptron scores e.g. 1.79, 0.21, −1.09, 1.46, 0.59)]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold one could also be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still mean correct plans, given the indirect supervision of human follower actions
    · MR plans may be underspecified or have ignorable details attached
    · Sometimes inaccurate, but they contain correct MR components for reaching the desired goal
• Weight update with multiple candidate parses
  – Use every candidate with a higher execution success rate than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
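The multiple-parse update above can be sketched as follows; names are illustrative, and the weighting by the difference of execution success rates follows the slide.

```python
# Multiple-parse update sketch: every candidate whose execution success rate
# beats the currently best-scoring candidate contributes an update, weighted
# by the difference in success rates.

def multi_parse_update(w, candidates, rates):
    """w: feature->weight dict; candidates: list of sparse feature dicts;
    rates: execution success rate per candidate."""
    score = lambda c: sum(w.get(f, 0.0) * v for f, v in c.items())
    pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    for i, c in enumerate(candidates):
        gain = rates[i] - rates[pred]
        if gain > 0:  # candidate preferred by the response signal
            for f, v in c.items():
                w[f] = w.get(f, 0.0) + gain * v
            for f, v in candidates[pred].items():
                w[f] = w.get(f, 0.0) - gain * v
    return w
```

A single better-executing candidate shifts weight mass toward its features and away from the predicted candidate's features, in proportion to the rate gap.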

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagrams (slides 77–78): derived MRs MR1 … MRn from the n-best candidates are scored by the MARCO execution module (execution success rates e.g. 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores e.g. 1.24, 1.83, −1.09, 1.46, 0.59); updates (1) and (2) add the feature-vector difference for each candidate whose success rate beats the current best prediction]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
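A minimal sketch of extracting such binary rule-indicator features from a candidate parse tree. The tuple tree encoding and feature keys are illustrative assumptions, not the thesis implementation.

```python
# Binary rule-indicator features: one feature per parent/children composition
# observed anywhere in the parse tree.

def rule_features(tree, feats=None):
    """tree: (label, [children]) for nonterminals, a plain string for words."""
    if feats is None:
        feats = {}
    if isinstance(tree, str):
        return feats                      # terminal: nothing to add
    label, children = tree
    child_labels = [c if isinstance(c, str) else c[0] for c in children]
    feats[(label, tuple(child_labels))] = 1   # binary indicator
    for c in children:
        rule_features(c, feats)
    return feats
```

For the tree L1 → L3 → (L5 → "find", L6 → "sofa") this yields indicators for the composition L3 ⇒ L5 L6 and the lexical attachment of "find" to L5.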

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans (and their parses) out of the 1,000,000-best parses
    · Many parse trees differ insignificantly, leading to the same derived MR plans
    · Generate a sufficiently large 1,000,000-best parse list from the baseline model

80
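The 50-best deduplication step can be sketched as follows; `derive_mr` is a hypothetical stand-in for composing the MR plan from a parse tree.

```python
# Collect k distinct derived MR plans from a very large n-best list, since
# many parse trees collapse to the same MR plan.

def distinct_mr_candidates(nbest_parses, derive_mr, k=50):
    """nbest_parses: parses assumed sorted by model probability."""
    seen, kept = set(), []
    for parse in nbest_parses:
        mr = derive_mr(parse)
        if mr not in seen:           # keep only the first parse per MR plan
            seen.add(mr)
            kept.append((parse, mr))
            if len(kept) == k:
                break
    return kept
```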

Response-based Update vs. Baseline (English)

81

                  Parse F1                Single-Sentence         Paragraph
                  Hierarchy   Unigram     Hierarchy   Unigram     Hierarchy   Unigram
Baseline          74.81       76.44       57.22       67.14       20.17       28.12
Response-based    73.32       77.24       59.65       68.27       22.62       29.20

Response-based Update vs. Baseline (Chinese-Word)

82

                  Parse F1                Single-Sentence         Paragraph
                  Hierarchy   Unigram     Hierarchy   Unigram     Hierarchy   Unigram
Baseline          75.53       76.41       61.03       63.40       19.08       23.12
Response-based    77.26       77.74       64.12       65.64       21.29       23.74

Response-based Update vs. Baseline (Chinese-Character)

83

                  Parse F1                Single-Sentence         Paragraph
                  Hierarchy   Unigram     Hierarchy   Unigram     Hierarchy   Unigram
Baseline          73.05       77.55       55.61       62.85       12.74       23.33
Response-based    76.26       79.76       64.08       65.50       22.25       25.35

Response-based Update vs. Baseline
• vs. the baseline models
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

                  Parse F1                Single-Sentence         Paragraph
                  Hierarchy   Unigram     Hierarchy   Unigram     Hierarchy   Unigram
Single            73.32       77.24       59.65       68.27       22.62       29.20
Multiple          73.43       77.81       62.81       68.93       26.57       29.10

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

                  Parse F1                Single-Sentence         Paragraph
                  Hierarchy   Unigram     Hierarchy   Unigram     Hierarchy   Unigram
Single            77.26       77.74       64.12       65.64       21.29       23.74
Multiple          78.80       78.11       64.15       66.27       21.55       25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

                  Parse F1                Single-Sentence         Paragraph
                  Hierarchy   Unigram     Hierarchy   Unigram     Hierarchy   Unigram
Single            76.26       79.76       64.08       65.50       22.25       25.35
Multiple          79.44       79.94       64.08       66.84       22.58       27.16

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Thesis Contributions
• Generative models for grounded language learning from ambiguous perceptual environments
  – A unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – A general framework of probabilistic approaches that learn NL–MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not available: no single gold-standard reference in the training data
  – A weak response from the perceptual environment can train a discriminative reranker

24

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

25

• Learn to interpret and follow navigation instructions
  – e.g., "Go down this hall and make a right when you see an elevator to your left"
• Use virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infer language semantics by observing how humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

[Map diagram: a grid world containing the objects listed below]
H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28


Task Objective
• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into correct formal plans of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction and landmarks plan (Chen and Mooney 2011)
  – Landmarks plan:
    · Describes actions in the environment along with notable objects encountered on the way
    · Overestimates the meaning of the instruction, including unnecessary details
    · Only a subset of the plan is relevant for the instruction

29

Challenges

30

Instruction: "at the easel, go left and then take a right onto the blue path at the corner"

Landmarks plan: Travel ( steps: 1 ), Verify ( at: EASEL, side: CONCRETE HALLWAY ), Turn ( LEFT ), Verify ( front: CONCRETE HALLWAY ), Travel ( steps: 1 ), Verify ( side: BLUE HALLWAY, front: WALL ), Turn ( RIGHT ), Verify ( back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL )


Challenges

32

Instruction: "at the easel, go left and then take a right onto the blue path at the corner"

Correct plan (a subset of the landmarks plan): Travel ( steps: 1 ), Verify ( at: EASEL, side: CONCRETE HALLWAY ), Turn ( LEFT ), Verify ( front: CONCRETE HALLWAY ), Travel ( steps: 1 ), Verify ( side: BLUE HALLWAY, front: WALL ), Turn ( RIGHT ), Verify ( back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL )

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

Previous Work (Chen and Mooney 2011)
• Circumvents the combinatorial NL–MR correspondence problem
  – Constructs supervised NL–MR training data by refining the landmarks plan with a learned semantic lexicon
    · Greedily selects high-scoring lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    · Deterministically selects high-scoring lexemes
    · Ignores possibly useful low-scoring lexemes
    · Some relevant MR components are never considered at all

33

Proposed Solution (Kim and Mooney 2012)
• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans
  – Use the semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL–MR correspondences
  – Transform the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals

34

35

System Diagram (Chen and Mooney 2011): learning system for parsing navigation instructions
[Diagram: in training, each observation (instruction, world state, action trace) feeds a Navigation Plan Constructor that produces a landmarks plan; Plan Refinement turns this into a supervised refined plan (with possible information loss) for the (supervised) Semantic Parser Learner. In testing (inference), the learned Semantic Parser maps an instruction and world state to a plan, which the Execution Module (MARCO) executes into an action trace.]

36

System Diagram of Proposed Solution: learning system for parsing navigation instructions
[Diagram: in training, each observation (instruction, world state, action trace) feeds the Navigation Plan Constructor to produce a landmarks plan, which goes directly into the Probabilistic Semantic Parser Learner (learning from ambiguous supervision). In testing, the learned Semantic Parser maps an instruction and world state to a plan that the Execution Module (MARCO) executes into an action trace.]

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)
• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (on the order of 10s)
  – Can only output MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL–MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney 2011)
• A pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is, given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()

39

[Scoring intuition: reward the co-occurrence of g and w, discount the general occurrence of g without w]
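One plausible instantiation of the scoring intuition above (reward co-occurrence of g and w, discount occurrences of g without w); the exact formula in Chen & Mooney (2011) may differ, so treat this as an assumption-labeled sketch.

```python
# Lexeme scoring sketch: p(g | w) discounted by p(g | not w), estimated from
# simple co-occurrence counts over the training examples.

def lexeme_score(cooc_gw, count_w, count_g, n_examples):
    """cooc_gw: examples containing both g and w; count_w / count_g: examples
    containing w / g; n_examples: total number of training examples."""
    p_g_given_w = cooc_gw / count_w
    p_g_without_w = (count_g - cooc_gw) / max(n_examples - count_w, 1)
    return p_g_given_w - p_g_without_w
```

A subgraph that almost always appears with the phrase but rarely without it gets a score near 1; an unrelated subgraph scores near 0 or below.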

Lexeme Hierarchy Graph (LHG)
• A hierarchy of semantic lexemes, constructed for each training example by the subgraph relationship
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are ultimately connected to NL word groundings

40

[Diagram: an example LHG whose root lexeme MR — Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) — decomposes into smaller lexeme MRs such as Turn(RIGHT), Verify(side: HATRACK), Travel() and Travel(), Verify(at: EASEL), down to single-component lexemes like Verify(at: EASEL) and Verify(side: HATRACK)]

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe; these are finally connected to the NL instruction
    · Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
  – NL words are generated by lexeme nodes via a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM; the most probable MR components out of all possible combinations are estimated

41

PCFG Construction

42

[Diagram of example PCFG rules: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]

Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
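The bottom-up composition described above can be pictured with a small sketch; the tuple encoding of parse trees and lexeme components is illustrative, not the thesis data structure.

```python
# MR composition sketch: keep only the lexeme-MR components responsible for
# generating NL words and propagate them up to the root.

def compose_mr(tree):
    """tree: (components, children); components is a tuple of atomic MR
    strings, children are subtrees or NL word strings."""
    components, children = tree
    if all(isinstance(c, str) for c in children):
        return list(components)      # this lexeme directly generates NL words
    marked = []
    for child in children:           # gather responsible components bottom-up
        for comp in compose_mr(child):
            if comp not in marked:
                marked.append(comp)
    # keep this node's components in order, restricted to the marked ones
    return [c for c in components if c in marked] or marked
```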

[Diagrams (slides 44–46): the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner"; from the bottom of the tree, the lexeme MRs responsible for generating NL words — e.g., Turn(LEFT); Travel(), Verify(at: SOFA); Turn() — are marked and propagated to the top to compose the final MR plan]

46

Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexity caused by the Lexeme Hierarchy Graph and the k-permutation rules
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the appearing order of relevant lexemes are thereby already covered

48

PCFG Construction

49

[Diagram of example PCFG rules: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]

Parsing New NL Sentences
• Follows a scheme similar to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

[Diagrams (slides 51–54): the context MR — Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) — generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); Turn() one by one, and these generate the NL words of the test instruction "Turn left and find the sofa then turn around the corner"; the relevant lexeme MR components are then marked in the context MR to compose the final parse]

54

Data
• 3 maps, 6 instructors, 1–15 followers per direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Example:
  Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."
    Actions: Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward
  Single sentences (each paired with its own action sequence), e.g.:
    "Take the wood path towards the easel."
    "At the easel, go left and then take a right on the blue path at the corner."
    with action sequences such as: Turn; Forward; Turn left, Forward, Turn right

Data Statistics

56

                        Paragraph        Single-Sentence
# Instructions          706              3236
Avg. # sentences        5.0 (±2.8)       1.0 (±0)
Avg. # actions          10.4 (±5.7)      2.1 (±2.4)
Avg. # words/sentence
  English               37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word          31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character     48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English               660              629
  Chinese-Word          661              508
  Chinese-Character     448              328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-scoring lexemes, with two different lexicon-learning algorithms:
    · Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    · Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

58
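Partial parse accuracy can be sketched as precision/recall/F1 over shared MR components; the exact matching granularity used in the thesis is an assumption here.

```python
# Partial parse accuracy sketch: precision/recall/F1 on the multiset of
# atomic MR components shared between the predicted and gold plans.
from collections import Counter

def partial_parse_f1(predicted, gold):
    """predicted, gold: lists of atomic MR components, e.g. 'Turn(LEFT)'."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    p = overlap / len(predicted) if predicted else 0.0
    r = overlap / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A parser that predicts 2 of 3 gold components, with no spurious ones, gets precision 1.0, recall 2/3, and F1 0.8.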

Parse Accuracy (English)

System                            Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.10       68.79    76.44

59

Parse Accuracy (Chinese-Word)

System                            Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

System                            Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
– The most preferred one in terms of plan execution
– Evaluate the composed MR plans from the candidate parses
– The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
(also used for evaluating end-goal plan execution performance)
– Record the execution success rate: whether each candidate MR reaches the intended destination
(MARCO is nondeterministic: average over 10 trials)
– Prefer the candidate with the best execution success rate during training

74
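The pseudo-gold selection above reduces to averaging execution outcomes per candidate and taking the argmax. A sketch, with toy per-trial outcomes standing in for MARCO runs (the real module executes each candidate MR in the simulated world):

```python
def execution_success_rate(outcomes):
    """Average success over repeated trials: MARCO is nondeterministic,
    so each candidate MR is executed several times (10 on the slide)."""
    return sum(outcomes) / len(outcomes)

def pick_pseudo_gold(candidate_trials):
    """candidate_trials: one list of 0/1 trial outcomes per candidate MR.
    Returns the index of the pseudo-gold candidate and all rates."""
    rates = [execution_success_rate(t) for t in candidate_trials]
    best = max(range(len(rates)), key=rates.__getitem__)
    return best, rates

# Toy trial outcomes for four candidate MRs (1 = reached destination).
trials = [
    [1, 1, 0, 1, 1, 0, 1, 0, 1, 0],  # MR_1: rate 0.6
    [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],  # MR_2: rate 0.4
    [0] * 10,                        # MR_3: rate 0.0
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # MR_4: rate 0.9 -> pseudo-gold
]
best, rates = pick_pseudo_gold(trials)
print(best, rates[best])             # -> 3 0.9
```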

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: n-best candidates with derived MRs MR_1 … MR_n, MARCO execution success rates (0.6, 0.4, 0.0, 0.9, 0.2) and perceptron scores (1.79, 0.21, −1.09, 1.46, 0.59); the candidate with the highest execution success rate (0.9) serves as the pseudo-gold reference, and the perceptron updates with the feature-vector difference from its best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
– Multiple parses may have the same maximum execution success rate
– "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
(MR plans are underspecified or have ignorable details attached; sometimes inaccurate, but they contain the correct MR components to reach the desired goal)
• Weight update with multiple candidate parses
– Use candidates with higher execution success rates than the currently best-predicted candidate
– Update with the feature-vector difference, weighted by the difference between execution success rates

76
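One plausible reading of this multi-parse update, sketched with plain Python lists: the weighting below follows the slide's description (rate-difference weighting toward every candidate that executes better than the prediction), so treat the details as an assumption rather than the paper's exact equations.

```python
def multi_parse_update(w, feats, rates):
    """feats: per-candidate feature vectors; rates: execution success rates.
    Predict with the current w, then move w toward every candidate whose
    rate beats the prediction's, scaling each move by the rate gap."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]
    predicted = max(range(len(feats)), key=scores.__getitem__)
    base, base_rate = feats[predicted], rates[predicted]
    for f, r in zip(feats, rates):
        gap = r - base_rate
        if gap > 0:  # this candidate executes better than the prediction
            w = [wi + gap * (fj - fp) for wi, fj, fp in zip(w, f, base)]
    return w, predicted

# Hypothetical 2-dimensional feature vectors and execution success rates.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rates = [0.25, 0.5, 1.0]
w, pred = multi_parse_update([1.0, 0.0], feats, rates)
print(pred, w)   # -> 0 [0.75, 1.0]
```

Candidate 0 is predicted (highest score under the initial weights) but has the worst rate, so the weights are pulled toward candidates 1 and 2, more strongly toward candidate 2 whose rate gap is larger.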

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): with derived MRs MR_1 … MR_n, execution success rates (0.6, 0.4, 0.0, 0.9, 0.2) and perceptron scores (1.24, 1.83, −1.09, 1.46, 0.59), the perceptron updates with the feature-vector difference toward a first candidate whose execution success rate exceeds that of the current best prediction]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): the same n-best list as before; the update is repeated with the feature-vector difference toward the next candidate whose execution success rate exceeds that of the current best prediction]

78

Features

• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)  L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)  L5: Travel(), Verify(at: SOFA)  L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5 → "find") = 1

79
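Indicator features of this kind can be collected with a single walk over the parse tree. A sketch, with a made-up tree encoding (label, children) and hypothetical lexeme names:

```python
def rule_features(tree):
    """Collect binary indicator features f(parent -> children) = 1 for
    every nonterminal expansion in a parse tree. Nodes are
    (label, [children]) pairs; leaves are plain strings (NL words)."""
    feats = {}

    def walk(node):
        if isinstance(node, str):          # leaf word, nothing to expand
            return
        label, children = node
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        feats["f(%s -> %s)" % (label, rhs)] = 1  # binary: present or not
        for c in children:
            walk(c)

    walk(tree)
    return feats

# Toy parse: L1 expands to L2 and L3; L5 generates the word "find".
tree = ("L1", [("L2", ["turn", "left"]),
               ("L3", [("L5", ["find"]), ("L6", ["turn"])])])
feats = rule_features(tree)
print(feats["f(L3 -> L5 L6)"])   # -> 1
```

The resulting sparse dictionary is exactly what the perceptron above would take dot products over.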

Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)
• Compared with two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Try to get the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
(many parse trees differ insignificantly, leading to the same derived MR plans; a sufficiently large 1,000,000-best parse list is generated from the baseline model)

80
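Collecting the 50 best parses with distinct derived MR plans from a much larger n-best list is a first-seen deduplication over a best-first list. A sketch, with hypothetical (score, plan) pairs:

```python
def distinct_kbest(parses, k):
    """Keep the first (i.e., best-scoring) parse per derived MR plan
    until k distinct plans are collected. `parses` must be sorted
    best-first, as an n-best list from the baseline model would be."""
    seen, kept = set(), []
    for score, plan in parses:
        if plan not in seen:
            seen.add(plan)
            kept.append((score, plan))
            if len(kept) == k:
                break
    return kept

# Toy best-first list with duplicate derived plans.
nbest = [(-1.2, "Turn(LEFT)"), (-1.3, "Turn(LEFT)"),
         (-1.9, "Travel()"), (-2.0, "Turn(LEFT)"), (-2.4, "Turn(RIGHT)")]
print(distinct_kbest(nbest, 2))  # -> [(-1.2, 'Turn(LEFT)'), (-1.9, 'Travel()')]
```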

Response-based Update vs. Baseline (English)

81

Parse F1: Baseline Hierarchy 74.81, Unigram 76.44; Response-based Hierarchy 73.32, Unigram 77.24
Single-sentence: Baseline Hierarchy 57.22, Unigram 67.14; Response-based Hierarchy 59.65, Unigram 68.27
Paragraph: Baseline Hierarchy 20.17, Unigram 28.12; Response-based Hierarchy 22.62, Unigram 29.2

Response-based Update vs. Baseline (Chinese-Word)

82

Parse F1: Baseline Hierarchy 75.53, Unigram 76.41; Response-based Hierarchy 77.26, Unigram 77.74
Single-sentence: Baseline Hierarchy 61.03, Unigram 63.4; Response-based Hierarchy 64.12, Unigram 65.64
Paragraph: Baseline Hierarchy 19.08, Unigram 23.12; Response-based Hierarchy 21.29, Unigram 23.74

Response-based Update vs. Baseline (Chinese-Character)

83

Parse F1: Baseline Hierarchy 73.05, Unigram 77.55; Response-based Hierarchy 76.26, Unigram 79.76
Single-sentence: Baseline Hierarchy 55.61, Unigram 62.85; Response-based Hierarchy 64.08, Unigram 65.5
Paragraph: Baseline Hierarchy 12.74, Unigram 23.33; Response-based Hierarchy 22.25, Unigram 25.35

Response-based Update vs. Baseline

• vs. the baseline models
– The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

Parse F1: Single Hierarchy 73.32, Unigram 77.24; Multiple Hierarchy 73.43, Unigram 77.81
Single-sentence: Single Hierarchy 59.65, Unigram 68.27; Multiple Hierarchy 62.81, Unigram 68.93
Paragraph: Single Hierarchy 22.62, Unigram 29.2; Multiple Hierarchy 26.57, Unigram 29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

Parse F1: Single Hierarchy 77.26, Unigram 77.74; Multiple Hierarchy 78.8, Unigram 78.11
Single-sentence: Single Hierarchy 64.12, Unigram 65.64; Multiple Hierarchy 64.15, Unigram 66.27
Paragraph: Single Hierarchy 21.29, Unigram 23.74; Multiple Hierarchy 21.55, Unigram 25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

Parse F1: Single Hierarchy 76.26, Unigram 79.76; Multiple Hierarchy 79.44, Unigram 79.94
Single-sentence: Single Hierarchy 64.08, Unigram 65.5; Multiple Hierarchy 64.08, Unigram 66.84
Paragraph: Single Hierarchy 22.25, Unigram 25.35; Multiple Hierarchy 22.58, Unigram 27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
– A single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but they capture the gist of the preferred actions
– A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure
• Large-scale data
– Data collection and model adaptation at large scale
• Machine translation
– Application to summarized translation
• Real perceptual data
– Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible, and effective, with weak feedback from the perceptual environment

92

Thank You


Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50-best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate a sufficiently large 1,000,000-best parse list from the baseline model

80
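Collapsing the 1,000,000-best list down to the 50 best distinct MR plans can be sketched as below. Representing each parse as a (score, plan) pair with a hashable canonical plan form is an assumption made for illustration.

```python
def distinct_top_k(parses, k=50):
    """Keep the highest-scoring parse for each distinct derived MR plan.

    parses: iterable of (score, mr_plan) pairs; mr_plan must be hashable,
    e.g. a canonical string form of the composed plan.
    """
    best = []
    seen = set()
    # Visit parses from highest to lowest score; the first occurrence of
    # each plan is therefore its best-scoring parse
    for score, mr in sorted(parses, key=lambda p: p[0], reverse=True):
        if mr not in seen:
            seen.add(mr)
            best.append((score, mr))
            if len(best) == k:
                break
    return best
```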

Response-based Update vs. Baseline (English)

81

Parse F1         Hierarchy  Unigram
Baseline         74.81      76.44
Response-based   73.32      77.24

Single-sentence  Hierarchy  Unigram
Baseline         57.22      67.14
Response-based   59.65      68.27

Paragraph        Hierarchy  Unigram
Baseline         20.17      28.12
Response-based   22.62      29.2

Response-based Update vs. Baseline (Chinese-Word)

82

Parse F1         Hierarchy  Unigram
Baseline         75.53      76.41
Response-based   77.26      77.74

Single-sentence  Hierarchy  Unigram
Baseline         61.03      63.4
Response-based   64.12      65.64

Paragraph        Hierarchy  Unigram
Baseline         19.08      23.12
Response-based   21.29      23.74

Response-based Update vs. Baseline (Chinese-Character)

83

Parse F1         Hierarchy  Unigram
Baseline         73.05      77.55
Response-based   76.26      79.76

Single-sentence  Hierarchy  Unigram
Baseline         55.61      62.85
Response-based   64.08      65.5

Paragraph        Hierarchy  Unigram
Baseline         12.74      23.33
Response-based   22.25      25.35

Response-based Update vs. Baseline

• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

Parse F1         Hierarchy  Unigram
Single           73.32      77.24
Multi            73.43      77.81

Single-sentence  Hierarchy  Unigram
Single           59.65      68.27
Multi            62.81      68.93

Paragraph        Hierarchy  Unigram
Single           22.62      29.2
Multi            26.57      29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

Parse F1         Hierarchy  Unigram
Single           77.26      77.74
Multi            78.8       78.11

Single-sentence  Hierarchy  Unigram
Single           64.12      65.64
Multi            64.15      66.27

Paragraph        Hierarchy  Unigram
Single           21.29      23.74
Multi            21.55      25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

Parse F1         Hierarchy  Unigram
Single           76.26      79.76
Multi            79.44      79.94

Single-sentence  Hierarchy  Unigram
Single           64.08      65.5
Multi            64.08      66.84

Paragraph        Hierarchy  Unigram
Single           22.25      25.35
Multi            22.58      27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection, model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of a full probabilistic model for learning NL-MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 26: Grounded Language Learning Models for Ambiguous  Supervision

bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see

an elevator to your left bull Use virtual worlds and instructorfollower data

from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how

humans follow instructions

Navigation Task (Chen and Mooney 2011)

26

H

C

L

S S

B C

H

E

L

E

H ndash Hat Rack

L ndash Lamp

E ndash Easel

S ndash Sofa

B ndash Barstool

C - Chair

Sample Environment (MacMahon et al 2006)

27

Executing Test Instruction

28

>

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable because it requires annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 27: Grounded Language Learning Models for Ambiguous  Supervision

Sample Environment (MacMahon et al 2006)

Map legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair

27

Executing Test Instruction

28


Task Objective
• Learn the underlying meanings of instructions by observing human actions for those instructions
  – Learn to map instructions (NL) into correct formal plans of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of an NL instruction and a landmarks plan (Chen and Mooney 2011)
  – Landmarks plan:
    • Describes actions in the environment along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant to the instruction

29

Challenges

30

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruction

at the easel go left and then take a right onto the blue path at the corner

Correct plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans
  – Use the semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL-MR correspondences
  – Transform the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals

34

35

System Diagram (Chen and Mooney 2011): a learning system for parsing navigation instructions.
Training: {instruction, world state, observed action trace} → Navigation Plan Constructor → landmarks plan → Plan Refinement (possible information loss) → supervised refined plan → (Supervised) Semantic Parser Learner → semantic parser.
Testing: {instruction, world state} → Semantic Parser → plan → Execution Module (MARCO) → action trace.

36

System Diagram of Proposed Solution: the same pipeline, but the landmarks plan feeds the parser learner directly, with no separate refinement step.
Training: {instruction, world state, observed action trace} → Navigation Plan Constructor → landmarks plan → Probabilistic Semantic Parser Learner (from ambiguous supervision) → semantic parser.
Testing: {instruction, world state} → Semantic Parser → plan → Execution Module (MARCO) → action trace.

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

• PCFG rules describe the generative process from MR components to the corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al 2011
  – Only works in low-ambiguity settings: one NL paired with a handful of MRs (on the order of 10s)
  – Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher degree of ambiguous supervision
  – Outputs novel MRs never appearing in the PCFG by composing the MR parse from semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney 2011)

• A pair of an NL phrase w and an MR subgraph g
• Scored by the correlation between NL instructions and context MRs (landmarks plans): how probable is graph g given that phrase w is seen?
  – Based on the cooccurrence of g and w versus the general occurrence of g without w
• Examples
  – "to the stool": Travel(), Verify(at BARSTOOL)
  – "black easel": Verify(at EASEL)
  – "turn left and walk": Turn(), Travel()

39
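The cooccurrence-based score can be sketched as follows. This is an illustrative reconstruction, not the exact formula of Chen and Mooney (2011): it compares how often subgraph g appears with phrase w against how often g appears without w.

```python
def lexeme_score(pairs, w, g):
    """Correlation of MR subgraph g with NL phrase w:
    p(g | w seen) - p(g | w not seen).  An illustrative sketch; the
    exact scoring in Chen and Mooney (2011) may differ."""
    with_w = [mr for nl, mr in pairs if w in nl]
    without_w = [mr for nl, mr in pairs if w not in nl]
    p_with = sum(g in mr for mr in with_w) / max(len(with_w), 1)
    p_without = sum(g in mr for mr in without_w) / max(len(without_w), 1)
    return p_with - p_without

# toy corpus of (NL word set, MR component set) pairs
pairs = [({"to", "the", "stool"}, {"Travel()", "Verify(at BARSTOOL)"}),
         ({"turn", "left"}, {"Turn(LEFT)"}),
         ({"to", "the", "easel"}, {"Travel()", "Verify(at EASEL)"})]
print(lexeme_score(pairs, "stool", "Verify(at BARSTOOL)"))  # 1.0
```

A high score means g rarely occurs unless w is present, so (w, g) is a good lexeme candidate.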

Lexeme Hierarchy Graph (LHG)

• A hierarchy of semantic lexemes, organized by the subgraph relationship and constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

40

Example LHG (figure): the root lexeme Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into smaller lexemes such as Turn(), Travel(), Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK), Travel(); and Turn(), Verify(side: HATRACK).

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe; these are finally connected to the NL instruction
    • Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41
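The k-permutation rule blow-up can be made concrete with a small sketch (node names here are hypothetical, not taken from the actual grammar):

```python
from itertools import permutations

def kperm_rules(parent, children):
    """Enumerate PCFG rules rewriting a complex concept to every
    k-permutation of its child concepts (k = 1..n), since we do not
    know which subset of subconcepts a sentence actually describes."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, list(perm)))
    return rules

rules = kperm_rules("Turn_Verify_Travel", ["Turn", "Verify", "Travel"])
print(len(rules))  # 3 + 6 + 6 = 15 rules for just three children
```

Even three children already yield 15 rules, which is one source of the model complexity discussed later.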

PCFG Construction

42

Figure: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word.

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
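The CKY step can be illustrated with a toy grammar. This is a generic most-probable-parse CKY for a PCFG in Chomsky normal form, not the thesis's actual grammar; the rules and probabilities below are invented for the example.

```python
def cky(words, lexical, binary):
    """Most probable parse with CKY for a PCFG in Chomsky normal form.
    `lexical[(A, word)]` and `binary[(A, B, C)]` map rules to
    probabilities; returns chart of best inside scores and backpointers."""
    n = len(words)
    best, back = {}, {}
    for i, w in enumerate(words):                      # fill diagonal
        for (A, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, A)] = p
    for span in range(2, n + 1):                       # longer spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    score = best.get((i, k, B), 0.0) * best.get((k, j, C), 0.0) * p
                    if score > best.get((i, j, A), 0.0):
                        best[(i, j, A)] = score
                        back[(i, j, A)] = (k, B, C)
    return best, back

lexical = {("Turn", "turn"): 1.0, ("Dir", "left"): 1.0}
binary = {("Action", "Turn", "Dir"): 1.0}
best, _ = cky(["turn", "left"], lexical, binary)
print(best[(0, 2, "Action")])  # 1.0
```

The backpointers would then be followed top-down to read off the lexeme nonterminals that compose the final MR.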

Slides 44–46 (figures): the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner", with the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) at the top; the responsible lexeme MRs compose the final plan Turn(LEFT), Travel(), Verify(at: SOFA), Turn().

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexity caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the order in which relevant lexemes appear are thereby already covered

48

PCFG Construction

49

Figure: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words.
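The contrast with the k-permutation grammar can be sketched as rule-construction code (nonterminal naming is illustrative, not from the actual system):

```python
def unigram_rules(context, lexemes):
    """Rules for the Unigram Generation model: the context MR emits its
    relevant lexemes one at a time (unigram Markov process), so the
    grammar needs only two rules per lexeme instead of enumerating
    k-permutations of subconcepts."""
    state = f"Ctx[{context}]"
    rules = []
    for lx in lexemes:
        rules.append((state, [lx, state]))  # emit a lexeme, continue
        rules.append((state, [lx]))         # emit a lexeme, stop
    return rules

rules = unigram_rules("Turn_Travel_Verify", ["Turn", "Travel", "Verify"])
print(len(rules))  # 6 rules, vs 15 k-permutation rules for 3 children
```

Because the recursive state can emit the lexemes in any order, orderings need no dedicated rules, which is what keeps the grammar small and EM training fast.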

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

Slides 51–54 (figures): the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(), which are marked in the context MR to compose the final plan.

54

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Example (paragraph vs single sentence):
  Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair, and at the chair take a right towards the stool. When you reach the stool you are at 7."
  Single sentences: "Take the wood path towards the easel." / "At the easel, go left and then take a right on the blue path at the corner."
  Action traces: Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward / Forward, Turn left, Forward, Turn right / Turn

Data Statistics

56

                                Paragraph        Single-Sentence
Instructions                    706              3236
Avg # sentences                 5.0 (±2.8)       1.0 (±0)
Avg # actions                   10.4 (±5.7)      2.1 (±2.4)
Avg # words/sentence
  English                       37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word                  31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character             48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                       660              629
  Chinese-Word                  661              508
  Chinese-Character             448              328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training, 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, using two different lexicon-learning algorithms:
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy (precision, recall, and F1 over matched MR components)

58
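A minimal sketch of this metric, assuming partial credit is computed as precision/recall/F1 over matched MR components (the thesis's exact matching procedure may differ):

```python
def parse_f1(predicted, gold):
    """Precision, recall and F1 over matched MR components — partial
    credit for every correct component rather than exact-match plans."""
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = {"Turn(LEFT)", "Travel(steps 2)", "Verify(at SOFA)"}
gold = {"Turn(LEFT)", "Travel(steps 1)", "Verify(at SOFA)"}
print(parse_f1(pred, gold))  # each value is 2/3: one component differs
```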

Parse Accuracy (English)

                              Precision   Recall   F1
Chen & Mooney (2011)          90.16       55.41    68.59
Chen (2012)                   88.36       57.03    69.31
Hierarchy Generation Model    87.58       65.41    74.81
Unigram Generation Model      86.10       68.79    76.44

59

Parse Accuracy (Chinese-Word)

                              Precision   Recall   F1
Chen (2012)                   88.87       58.76    70.74
Hierarchy Generation Model    80.56       71.14    75.53
Unigram Generation Model      79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                              Precision   Recall   F1
Chen (2012)                   92.48       56.47    70.01
Hierarchy Generation Model    79.77       67.38    73.05
Unigram Generation Model      79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered for single sentences
  – Paragraph execution fails if even one single-sentence execution fails

62

End-to-End Execution Evaluations (English)

                              Single-Sentence   Paragraph
Chen & Mooney (2011)          54.40             16.18
Chen (2012)                   57.28             19.18
Hierarchy Generation Model    57.22             20.17
Unigram Generation Model      67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                              Single-Sentence   Paragraph
Chen (2012)                   58.70             20.13
Hierarchy Generation Model    61.03             19.08
Unigram Generation Model      63.40             23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                              Single-Sentence   Paragraph
Chen (2012)                   57.27             16.73
Hierarchy Generation Model    55.61             12.74
Unigram Generation Model      62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexity: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity avoids over-fitting and gives better generalization
• Better than Borschinger et al 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation         Unigram Generation
Data                  |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English               20451       17.26            16357       8.78
Chinese (Word)        21636       15.99            15459       8.05
Chinese (Character)   19792       18.64            13514       12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• An effective approach for improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result, i.e. the candidate with maximum probability

Figure: testing example → trained generative model → 1-best candidate with maximum probability.

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

Figure: testing example → trained baseline generative model → GEN → n-best candidates (candidate 1 … candidate n) → trained secondary discriminative model → best prediction → output.

71

How can we apply discriminative reranking?

• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead, we have weak supervision from the surrounding perceptual context (the landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds, as in the final end-task plan-execution evaluation
  – This gives a weak indication of whether a candidate is good or bad
  – Use multiple candidate parses per parameter update, since the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

Figure: a training example goes through the trained baseline generative model (GEN) to n-best candidates with feature vectors a1 … an; the perceptron scores them (e.g., -0.16, 1.21, -1.09, 1.46, 0.59), the best prediction is compared against the gold-standard reference, and the weights are updated by the feature-vector difference (e.g., ag - a4). For our generative models, such a gold-standard reference is not available.

73
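The averaged perceptron for reranking can be sketched compactly; the feature vectors below are a synthetic toy, not the thesis's feature set.

```python
import numpy as np

def train_averaged_perceptron(examples, n_epochs=5):
    """Averaged perceptron for reranking (Collins 2000), sketched.
    Each example is (list of candidate feature vectors, index of the
    reference candidate).  Predict the argmax candidate under the
    current weights; on a mistake, update toward the reference and away
    from the prediction; return the average of all weight vectors."""
    dim = len(examples[0][0][0])
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    steps = 0
    for _ in range(n_epochs):
        for candidates, ref in examples:
            pred = int(np.argmax([w @ f for f in candidates]))
            if pred != ref:
                w = w + candidates[ref] - candidates[pred]
            w_sum += w
            steps += 1
    return w_sum / steps

# toy example: candidate 1 is the reference
examples = [([np.array([0.0, 1.0]), np.array([1.0, 0.0])], 1)]
w_avg = train_averaged_perceptron(examples)
print(int(np.argmax([w_avg @ f for f in examples[0][0]])))  # 1
```

Averaging the weight vectors over all update steps, rather than keeping only the final vector, is the standard trick that makes the perceptron reranker more stable.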

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the MR plans composed from candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (it is also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination; MARCO is nondeterministic, so average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

Figure: the n-best candidates are mapped to derived MRs MR1 … MRn; the MARCO execution module assigns execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2) alongside the perceptron scores (e.g., 1.79, 0.21, -1.09, 1.46, 0.59); the candidate with the highest success rate becomes the pseudo-gold reference, and the weights are updated by its feature-vector difference from the best prediction.

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold one can also be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still indicate a correct plan, given the indirect supervision of human follower actions
    • MR plans may be underspecified or have ignorable details attached
    • They are sometimes inaccurate but contain the correct MR components needed to reach the desired goal
• Weight update with multiple candidate parses
  – Use all candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
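The multi-parse update can be sketched as follows. This is a simplified reconstruction of the idea (the exact weighting in the thesis may differ); the feature vectors and success rates are invented for illustration.

```python
import numpy as np

def response_based_update(w, feats, exec_rates):
    """One response-based perceptron update, sketched.  `feats[i]` is
    the feature vector of candidate i and `exec_rates[i]` its MARCO
    execution success rate.  Every candidate whose success rate beats
    that of the currently best-scoring candidate contributes a feature
    difference weighted by the gap in success rates."""
    pred = int(np.argmax([w @ f for f in feats]))
    for i, rate in enumerate(exec_rates):
        if rate > exec_rates[pred]:
            w = w + (rate - exec_rates[pred]) * (feats[i] - feats[pred])
    return w

feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
rates = [0.2, 0.9, 0.6]
w = response_based_update(np.zeros(2), feats, rates)
print(w)  # weight mass moves toward the higher-success candidates
```

Candidates 1 and 2 both out-execute the prediction, so both contribute, with the 0.9-rate candidate weighted most heavily.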

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

Figure (update 1): every candidate whose execution success rate (e.g., 0.9 or 0.6 vs the prediction's 0.4) exceeds that of the current best prediction contributes a weighted feature-vector difference to the update.

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

Figure (update 2): the second update step, using the next candidate with a higher execution success rate.

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)     L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)     L5: Travel(), Verify(at: SOFA)     L6: Turn()

79

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1
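Indicator features of this kind can be extracted with a short tree walk; the tree encoding and feature naming below are illustrative, not the thesis's exact feature templates.

```python
def tree_features(tree):
    """Binary indicator features over nonterminal compositions in a
    parse tree (in the style of Collins 2002; Lu et al. 2008).  A tree
    is (label, [children]) with plain strings as terminal words; each
    parent -> children expansion becomes one feature."""
    feats = set()

    def walk(node):
        if isinstance(node, str):      # terminal word
            return
        label, children = node
        kids = [c if isinstance(c, str) else c[0] for c in children]
        feats.add(f"{label}->{' '.join(kids)}")
        for c in children:
            walk(c)

    walk(tree)
    return feats

tree = ("L1", [("L3", [("L5", ["find"]), ("L6", ["turn"])])])
print(sorted(tree_features(tree)))  # ['L1->L3', 'L3->L5 L6', 'L5->find', 'L6->turn']
```

Each feature fires with value 1 when its composition appears, which is exactly what the perceptron's weight vector scores.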

Evaluations
• Leave-one-map-out approach
  – 2 maps for training, 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We collect 50-best distinct composed MR plans (and the corresponding parses) from sufficiently large 1,000,000-best parse lists generated by the baseline model, since many parse trees differ insignificantly and lead to the same derived MR plan

80

Response-based Update vs Baseline (English)

81

Parse F1 (%):
              Baseline   Response-based
  Hierarchy    74.81      73.32
  Unigram      76.44      77.24

Single-sentence execution accuracy (%):
              Baseline   Response-based
  Hierarchy    57.22      59.65
  Unigram      67.14      68.27

Paragraph execution accuracy (%):
              Baseline   Response-based
  Hierarchy    20.17      22.62
  Unigram      28.12      29.20

Response-based Update vs Baseline (Chinese-Word)

82

Parse F1 (%):
              Baseline   Response-based
  Hierarchy    75.53      77.26
  Unigram      76.41      77.74

Single-sentence execution accuracy (%):
              Baseline   Response-based
  Hierarchy    61.03      64.12
  Unigram      63.40      65.64

Paragraph execution accuracy (%):
              Baseline   Response-based
  Hierarchy    19.08      21.29
  Unigram      23.12      23.74

Response-based Update vs Baseline (Chinese-Character)

83

Parse F1 (%):
              Baseline   Response-based
  Hierarchy    73.05      76.26
  Unigram      77.55      79.76

Single-sentence execution accuracy (%):
              Baseline   Response-based
  Hierarchy    55.61      64.08
  Unigram      62.85      65.50

Paragraph execution accuracy (%):
              Baseline   Response-based
  Hierarchy    12.74      22.25
  Unigram      23.33      25.35

Response-based Update vs Baseline

• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Page 28: Grounded Language Learning Models for Ambiguous  Supervision

Executing Test Instruction

28

>

Task Objective

• Learn the underlying meanings of instructions by observing human actions for the instructions
  – Learn to map instructions (NL) into correct formal plans of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of NL instruction / landmarks plan (Chen and Mooney 2011)
  – Landmarks plan:
    • Describes actions in the environment along with notable objects encountered on the way
    • Overestimates the meaning of the instruction, including unnecessary details
    • Only a subset of the plan is relevant for the instruction

29

Challenges

30

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruction

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruction

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan

Previous Work (Chen and Mooney 2011)

• Circumvents the combinatorial NL–MR correspondence problem
  – Constructs supervised NL–MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from ambiguous training data
  – Disambiguate input + learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL–MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

Training / Testing

Action Trace / Navigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

Learning / Inference

Possible information loss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

Training / Testing

Action Trace / Navigation Plan Constructor

Probabilistic Semantic Parser Learner (from ambiguous supervision)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

• PCFG rules to describe the generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL – a handful of MRs (order of 10s)
  – Only outputs MRs included in the PCFG constructed from training data
• Proposed model
  – Use semantic lexemes as units of semantic concepts
  – Disambiguate NL–MR correspondences at the semantic concept (lexeme) level
  – Disambiguate a much higher level of ambiguous supervision
  – Output novel MRs not appearing in the PCFG by composing the MR parse with semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney 2011)

• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is given seeing phrase w
• Examples
  – "to the stool": Travel(), Verify(at BARSTOOL)
  – "black easel": Verify(at EASEL)
  – "turn left and walk": Turn(), Travel()

39

Score terms: co-occurrence of g and w vs. general occurrence of g without w
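The slide's two score terms can be sketched as a simple co-occurrence measure. This is an illustrative stand-in, not the exact Chen and Mooney (2011) formula: the function name, the data layout, and the plain count difference are all assumptions.

```python
def lexicon_score(pairs, phrase, subgraph):
    """Score how plausible MR subgraph `subgraph` is as the meaning of
    NL `phrase`: reward contexts where they co-occur, penalize contexts
    where the subgraph appears without the phrase (a sketch, not the
    Chen & Mooney 2011 definition)."""
    with_w = sum(1 for (w, g) in pairs if phrase in w and subgraph in g)
    without_w = sum(1 for (w, g) in pairs if phrase not in w and subgraph in g)
    return with_w - without_w

# Toy contexts: (instruction text, MR components in its landmarks plan).
data = [
    ("walk to the stool", {"Travel()", "Verify(at BARSTOOL)"}),
    ("to the stool then stop", {"Travel()", "Verify(at BARSTOOL)", "Turn(LEFT)"}),
    ("turn left", {"Turn(LEFT)"}),
]
```

Here "to the stool" scores higher with Travel() (co-occurs twice, never apart) than with Turn(LEFT) (one co-occurrence cancelled by one occurrence without the phrase).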

Lexeme Hierarchy Graph (LHG)

• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, further connected to NL word groundings

40

[Diagram: example LHG — a root lexeme MR Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) with child lexeme subgraphs such as Turn(), Travel(), Verify(at: EASEL) and Turn(RIGHT), Verify(side: HATRACK), Travel(), down to single-action lexemes like Verify(at: EASEL) and Turn(), Verify(side: HATRACK)]

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which are finally connected to the NL instruction
    • Each node generates all k-permutations of its children nodes – we do not know which subset is correct
  – NL words are generated by lexeme nodes via a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41
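The k-permutation rule expansion above can be sketched as follows. The rule format `(parent, children_tuple)` and the names are hypothetical; real nonterminals would be lexeme MRs from the LHG.

```python
from itertools import permutations

def child_permutation_rules(parent, children):
    """Enumerate PCFG rules parent -> (k-permutation of child concepts)
    for every k >= 1, since we do not know which subset of subconcepts
    the instruction actually describes (hypothetical rule format)."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules

# A node with three child concepts yields 3 + 6 + 6 = 15 rules,
# illustrating why this expansion blows up the grammar.
rules = child_permutation_rules("Turn_Verify_Travel", ["Turn", "Verify", "Travel"])
```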

PCFG Construction

42

[Rule schema: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence, under the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
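The MR-composition step above can be sketched as a bottom-up traversal. The tree encoding is a toy simplification (each node pairs its lexeme MR components with either child nodes or generated NL words); the real model marks components inside MR graphs.

```python
def compose_mr(tree):
    """Collect the lexeme MR components responsible for generating NL
    words: leaves ground words directly, and their components propagate
    to the top (a simplified sketch of the marking procedure)."""
    mr, children = tree
    if children and all(isinstance(c, str) for c in children):
        return set(mr)            # this lexeme directly grounds NL words
    out = set()
    for child in children:
        out |= compose_mr(child)  # propagate marked components upward
    return out

# Toy parse of "turn left / find the sofa": the root's components are
# kept only insofar as its children ground actual words.
tree = (["Turn(LEFT)", "Travel()", "Verify(at SOFA)"], [
    (["Turn(LEFT)"], ["turn", "left"]),
    (["Travel()", "Verify(at SOFA)"], ["find", "the", "sofa"]),
])
```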

Most probable parse tree for a test NL instruction

[Diagram, slides 44–46: parse tree for the NL instruction "Turn left and find the sofa then turn around the corner" — lexeme MR nonterminals such as Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) over the NL words; responsible MR components are marked bottom-up and propagated to the top to compose the final MR Turn(LEFT), Travel(), Verify(at: SOFA), Turn(RIGHT)]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexity caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generate relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates the relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already covered

48

PCFG Construction

49

[Rule schema: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]
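The unigram Markov generation keeps the rule set linear in the number of lexemes, in contrast to the k-permutation expansion. A sketch with a hypothetical rule format (`(lhs, rhs_tuple)`; "CTX" stands in for a context-MR nonterminal):

```python
def unigram_lexeme_rules(context, lexemes):
    """Unigram Markov generation: the context MR emits one relevant
    lexeme at a time and either continues or stops, so lexeme orderings
    need no extra permutation rules (hypothetical rule format)."""
    rules = []
    for lex in lexemes:
        rules.append((context, (lex, context)))  # emit lexeme, continue
        rules.append((context, (lex,)))          # emit lexeme, stop
    return rules

# Three lexemes yield only 6 rules, versus 15 k-permutation rules
# for the same three children in the hierarchy model.
rules = unigram_lexeme_rules("CTX", ["Turn(LEFT)", "Travel()", "Verify(at SOFA)"])
```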

Parsing New NL Sentences

• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

Most probable parse tree for a test NL instruction

[Diagram, slides 51–54: unigram-model parse of the NL instruction "Turn left and find the sofa then turn around the corner" — the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); Turn() one by one, and the relevant lexeme MR components are then marked in the context MR at the top nonterminal]

54

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

[Example: a paragraph instruction — "Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7" — is hand-segmented into single-sentence steps, each paired with its portion of the follower's action trace (Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward)]

Data Statistics

56

                          Paragraph         Single-Sentence
# Instructions            706               3236
Avg. # sentences          5.0 (±2.8)        1.0 (±0)
Avg. # actions            10.4 (±5.7)       2.1 (±2.4)
Avg. # words/sentence
  English                 37.6 (±21.1)      7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)      6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)      10.6 (±7.3)
Vocabulary
  English                 660               629
  Chinese-Word            661               508
  Chinese-Character       448               328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57
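The leave-one-map-out protocol above can be sketched as a generator of train/test folds (map names and the data layout are illustrative assumptions):

```python
def leave_one_map_out(examples):
    """Leave-one-map-out cross-validation: train on all maps but one,
    test on the held-out map. `examples` maps map-name -> list of
    instruction/plan examples (a sketch of the evaluation split)."""
    maps = sorted(examples)
    for held_out in maps:
        train = [e for m in maps if m != held_out for e in examples[m]]
        yield held_out, train, examples[held_out]

# Toy corpus with three maps (real maps are the corpus's virtual worlds).
data = {"grid": [1, 2], "jelly": [3], "l": [4, 5]}
folds = list(leave_one_map_out(data))
```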

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58
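Partial parse accuracy can be sketched as precision/recall/F1 over matching MR components. Flat component sets are a simplification (the actual metric matches parse structure); all names here are illustrative.

```python
def partial_parse_f1(predicted, gold):
    """Partial parse accuracy over flat MR component sets: precision is
    the fraction of predicted components that are correct, recall the
    fraction of gold components recovered (a simplified sketch)."""
    correct = len(predicted & gold)
    if not correct:
        return 0.0, 0.0, 0.0
    p = correct / len(predicted)
    r = correct / len(gold)
    return p, r, 2 * p * r / (p + r)

# A parse that recovers 2 of 3 gold components with nothing spurious.
p, r, f1 = partial_parse_f1({"Turn(LEFT)", "Travel()"},
                            {"Turn(LEFT)", "Travel()", "Verify(at SOFA)"})
```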

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence
  – Paragraph execution is affected by even one single-sentence execution

62

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.7              20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity; avoids over-fitting; better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation         Unigram Generation
                      PCFG Model                   PCFG Model
Data                  |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English               20,451      17.26            16,357      8.78
Chinese (Word)        21,636      15.99            15,459      8.05
Chinese (Character)   19,792      18.64            13,514      12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal
  – Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with max probability

[Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability (Candidate 1)]

70

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides weak supervision of the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates with feature vectors a1 … an and perceptron scores (-0.16, 1.21, -1.09, 1.46, 0.59) → Perceptron → Best prediction; the weights are updated with the feature-vector difference ag − a4 against the Gold Standard Reference — Not Available for our generative models]

73
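The averaged-perceptron update in the diagram can be sketched as follows, assuming dict-based sparse feature vectors; all names are illustrative. Note this standard form needs the gold reference that is unavailable in our setting, which motivates the response-based update next.

```python
def train_reranker(examples, epochs=3):
    """Averaged perceptron for reranking (Collins 2000), sketched:
    when the highest-scoring candidate is not the reference, add the
    feature-vector difference (gold - predicted) to the weights, and
    return the average of the weight vectors over all updates."""
    w, w_sum, n = {}, {}, 0

    def score(feats):
        return sum(w.get(k, 0.0) * v for k, v in feats.items())

    for _ in range(epochs):
        for candidates, gold in examples:
            pred = max(candidates, key=score)
            if pred != gold:
                for k, v in gold.items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in pred.items():
                    w[k] = w.get(k, 0.0) - v
            n += 1
            for k, v in w.items():          # accumulate for averaging
                w_sum[k] = w_sum.get(k, 0.0) + v
    return {k: v / n for k, v in w_sum.items()}

# One toy example whose gold candidate initially loses the tie-break.
w_avg = train_reranker([([{"bad": 1.0}, {"good": 1.0}], {"good": 1.0})])
```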

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate composed MR plans from candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
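Pseudo-gold selection by averaged execution success can be sketched as below. `execute` stands in for the real nondeterministic MARCO module, and all names are assumptions.

```python
def execution_success_rate(plan, execute, trials=10):
    """Average success over several trials, since execution is
    nondeterministic (`execute` returns True on reaching the goal)."""
    return sum(execute(plan) for _ in range(trials)) / trials

def pick_pseudo_gold(candidates, execute, trials=10):
    """Pseudo-gold reference = the candidate MR plan that reaches the
    intended destination most often (response-based selection sketch)."""
    return max(candidates, key=lambda c: execution_success_rate(c, execute, trials))

# Toy deterministic executor: only plan "A" ever succeeds.
rate = execution_success_rate("A", lambda plan: plan == "A")
best = pick_pseudo_gold(["B", "A", "C"], lambda plan: plan == "A")
```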

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: n-best candidates → derived MRs MR1 … MRn → MARCO Execution Module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron (scores 1.79, 0.21, -1.09, 1.46, 0.59) is updated with the feature-vector difference against the best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
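The multi-parse update described above can be sketched with dict-based feature vectors (names and the exact weighting are illustrative assumptions): every candidate whose execution success rate beats the currently predicted parse contributes, scaled by the rate gap.

```python
def multi_parse_update(weights, candidates, predicted):
    """Update with every candidate whose execution success rate exceeds
    that of the currently predicted parse, weighting each feature-vector
    difference by the rate gap (a sketch; `candidates` and `predicted`
    are (features, success_rate) pairs)."""
    pred_feats, pred_rate = predicted
    for feats, rate in candidates:
        if rate > pred_rate:
            gap = rate - pred_rate
            for k, v in feats.items():      # reward the better candidate
                weights[k] = weights.get(k, 0.0) + gap * v
            for k, v in pred_feats.items(): # penalize the prediction
                weights[k] = weights.get(k, 0.0) - gap * v
    return weights

# Only the 0.9-rate candidate beats the predicted parse's 0.4 rate.
w = multi_parse_update({}, [({"a": 1.0}, 0.9), ({"b": 1.0}, 0.1)],
                       ({"c": 1.0}, 0.4))
```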

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, slides 77–78: n-best candidates → derived MRs MR1 … MRn → MARCO Execution Module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); each candidate whose rate beats the predicted parse contributes a feature-vector difference — Update (1), Update (2) — weighted by the rate gap (perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59)]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
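Extracting such binary indicator features from a parse tree can be sketched as below. The tree encoding `(label, children)` and the feature-name format are toy assumptions, not the exact feature templates of the cited papers.

```python
def rule_indicator_features(tree):
    """Binary indicator features over a parse tree: one feature per
    nonterminal composition (rule) and one per word emission
    (hypothetical feature names; tree = (label, children))."""
    feats = {}
    label, children = tree
    if children and not isinstance(children[0], tuple):
        feats["word:%s->%s" % (label, children[0])] = 1  # terminal emission
        return feats
    kids = [child[0] for child in children]
    feats["rule:%s->%s" % (label, "+".join(kids))] = 1   # rule composition
    for child in children:
        feats.update(rule_indicator_features(child))
    return feats

# Toy parse: L1 expands to L4 ("left") and L5 ("find").
feats = rule_indicator_features(("L1", [("L4", ["left"]), ("L5", ["find"])]))
```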

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50-best distinct composed MR plans, and the corresponding parses, out of 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse trees from the baseline model

80

Response-based Update vs. Baseline (English)

81

             Parse F1               Single-Sentence         Paragraph
             Baseline   Response    Baseline   Response     Baseline   Response
Hierarchy    74.81      73.32       57.22      59.65        20.17      22.62
Unigram      76.44      77.24       67.14      68.27        28.12      29.2

Response-based Update vs. Baseline (Chinese-Word)

82

             Parse F1               Single-Sentence         Paragraph
             Baseline   Response    Baseline   Response     Baseline   Response
Hierarchy    75.53      77.26       61.03      64.12        19.08      21.29
Unigram      76.41      77.74       63.4       65.64        23.12      23.74

Response-based Update vs. Baseline (Chinese-Character)

83

             Parse F1               Single-Sentence         Paragraph
             Baseline   Response    Baseline   Response     Baseline   Response
Hierarchy    73.05      76.26       55.61      64.08        12.74      22.25
Unigram      77.55      79.76       62.85      65.5         23.33      25.35

Response-based Update vs. Baseline

• vs. baseline
  – The response-based approach performs better in the final end-task plan execution
  – Optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

             Parse F1            Single-Sentence      Paragraph
             Single    Multi     Single    Multi      Single    Multi
Hierarchy    73.32     73.43     59.65     62.81      22.62     26.57
Unigram      77.24     77.81     68.27     68.93      29.2      29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

             Parse F1            Single-Sentence      Paragraph
             Single    Multi     Single    Multi      Single    Multi
Hierarchy    77.26     78.8      64.12     64.15      21.29     21.55
Unigram      77.74     78.11     65.64     66.27      23.74     25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

             Parse F1            Single-Sentence      Paragraph
             Single    Multi     Single    Multi      Single    Multi
Hierarchy    76.26     79.44     64.08     64.08      22.25     22.58
Unigram      79.76     79.94     65.5      66.84      25.35     27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to annotation of training data
• Grounded language learning from relevant perceptual context is promising, and training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 29: Grounded Language Learning Models for Ambiguous  Supervision

Task Objectivebull Learn the underlying meanings of instructions by observing

human actions for the instructions ndash Learn to map instructions (NL) into correct formal plan of actions

(MR)bull Learn from high ambiguityndash Training input of NL instruction landmarks plan (Chen and Mooney

2011) pairsndash Landmarks plan

Describe actions in the environment along with notable objects encountered on the way

Overestimate the meaning of the instruction including unnecessary details

Only subset of the plan is relevant for the instruction29

Challenges

30

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

Semantic Lexicon (Chen and Mooney 2011)
• Pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
 – How probable is graph g given that phrase w is seen?
• Examples
 – "to the stool": Travel(), Verify(at: BARSTOOL)
 – "black easel": Verify(at: EASEL)
 – "turn left and walk": Turn(), Travel()

39

Score: co-occurrence of g and w, discounted by the general occurrence of g without w
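The scoring idea above can be sketched as a simple correlation statistic. The function below is a hedged illustration (the exact Chen & Mooney 2011 formula may differ): it scores a candidate lexeme (w, g) by how much more often g occurs in contexts containing w than in contexts without it.

```python
def lexicon_score(pairs, phrase, subgraph):
    """Score how well MR subgraph g is grounded by NL phrase w.

    `pairs` is a list of (sentence_words, context_subgraphs) training
    examples.  A simple correlation score in the spirit of Chen & Mooney
    (2011): p(g | w) - p(g | not w).  Names here are illustrative.
    """
    with_w = without_w = g_with_w = g_without_w = 0
    for words, graphs in pairs:
        if phrase in words:
            with_w += 1
            g_with_w += subgraph in graphs
        else:
            without_w += 1
            g_without_w += subgraph in graphs
    p_g_given_w = g_with_w / with_w if with_w else 0.0
    p_g_given_not_w = g_without_w / without_w if without_w else 0.0
    return p_g_given_w - p_g_given_not_w
```

A subgraph that co-occurs only with the phrase scores near 1; one that occurs everywhere is discounted toward 0.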

Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes, organized by the subgraph relationship and constructed for each training example
 – Lexeme MRs = semantic concepts
 – Lexeme hierarchy = semantic concept hierarchy
 – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

40

[Figure: example LHG — the top-level lexeme Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into subgraph lexemes such as Turn(), Travel(), Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK), Travel(); and Turn(), Verify(side: HATRACK).]

PCFG Construction

• Add rules for each node in the LHG
 – Each complex concept chooses which subconcepts to describe; these are finally connected to the NL instruction
 – Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
 – NL words are generated from lexeme nodes by a unigram Markov process (Börschinger et al. 2011)
 – PCFG rule weights are optimized by EM; the most probable MR components out of all possible combinations are estimated
41
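The k-permutation rule construction above can be sketched directly with itertools; node names are illustrative, not the thesis' actual rule encoding:

```python
from itertools import permutations

def kperm_rules(parent, children):
    """Enumerate PCFG rules expanding a parent lexeme node into every
    k-permutation of its child lexeme nodes (k = 1..len(children)),
    since we do not know which subset of subconcepts a sentence
    actually describes.  A sketch of the rule-construction step."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules
```

This blow-up (sum over k of k-permutations) is one source of the grammar-size complexity discussed later.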

PCFG Construction

42

[Figure: example of the constructed rules — child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word.]

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
 – Consider only the lexeme MRs responsible for generating NL words
 – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
 – Able to compose novel MRs never seen in the training data

43
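Given the learned rule weights, the CKY step can be sketched as follows — a minimal probabilistic CKY for a binarized grammar; the grammar encoding is illustrative, not the thesis' actual implementation:

```python
from collections import defaultdict

def cky_best_parse(words, lexical, binary):
    """Most probable parse of `words` under a binarized PCFG.

    `lexical` maps (nonterminal, word) -> prob; `binary` maps
    (parent, left, right) -> prob.  Returns the chart of best subtree
    probabilities and backpointers; a minimal sketch of the CKY step
    used to parse new NL sentences."""
    n = len(words)
    best = defaultdict(float)   # (i, j, symbol) -> prob of best subtree
    back = {}
    for i, w in enumerate(words):
        for (sym, word), p in lexical.items():
            if word == w and p > best[(i, i + 1, sym)]:
                best[(i, i + 1, sym)] = p
                back[(i, i + 1, sym)] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), p in binary.items():
                    score = p * best[(i, k, left)] * best[(k, j, right)]
                    if score > best[(i, j, parent)]:
                        best[(i, j, parent)] = score
                        back[(i, j, parent)] = (k, left, right)
    return best, back
```

Following the backpointers from the top nonterminal recovers the most probable tree, from which the responsible lexeme MRs are read off.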

[Figure (slides 44-46): most probable parse tree for the test NL instruction "Turn left and find the sofa, then turn around the corner" — the tree decomposes the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) through lexeme nonterminals such as Turn(LEFT), Verify(front: SOFA), Travel(), Verify(at: SOFA), and Turn(RIGHT); marking the MR components responsible for generating NL words and propagating them to the top level composes the final MR parse Turn(LEFT), Travel(), Verify(at: SOFA), Turn().]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
 – Complexity caused by the Lexeme Hierarchy Graph and the k-permutation rules
 – Tends to over-fit to the training data
• Proposed solution: a simpler model
 – Generates the relevant semantic lexemes one by one
 – No extra PCFG rules for k-permutations
 – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
 – Each context MR generates its relevant lexemes one by one
 – Permutations of the orders in which the relevant lexemes appear are thereby already accounted for

48

PCFG Construction

49

[Figure: example of the constructed rules — each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words.]

Parsing New NL Sentences

• Follows a scheme similar to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
 – Consider only the lexeme MRs responsible for generating NL words
 – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

[Figure (slides 51-54): under the Unigram Generation model, the most probable parse tree for "Turn left and find the sofa, then turn around the corner" has the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) at the top nonterminal, generating the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(); marking these lexeme MR components in the context MR yields the final composed plan Turn(LEFT), Travel(), Verify(at: SOFA), Turn().]

Data
• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
 – Word-segmented version by the Stanford Chinese Word Segmenter
 – Character-segmented version

55

Paragraph: Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.
Single sentences: "Take the wood path towards the easel." / "At the easel go left and then take a right on the the blue path at the corner." / ...
Action traces: Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward / Forward, Turn left, Forward, Turn right / Turn

Data Statistics

56

                        Paragraph        Single-Sentence
# Instructions          706              3236
Avg. # sentences        5.0 (±2.8)       1.0 (±0)
Avg. # actions          10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sent.
  English               37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word          31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character     48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English               660              629
  Chinese-Word          661              508
  Chinese-Character     448              328

Evaluations
• Leave-one-map-out approach
 – 2 maps for training and 1 map for testing
 – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
 – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
   Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
   Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
 – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy (precision, recall, F1)

58
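As a hedged illustration of such a partial-credit metric (the thesis' exact matching granularity may differ), precision, recall, and F1 over MR components can be computed as:

```python
def partial_parse_f1(predicted, gold):
    """Partial parse accuracy as precision/recall/F1 over MR components.

    `predicted` and `gold` are sets of MR components (e.g. actions with
    their arguments).  The exact component granularity used in the
    thesis may differ; this is an illustrative sketch."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    correct = len(predicted & gold)
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```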

Parse Accuracy (English)

                        Precision   Recall   F1
Chen & Mooney (2011)    90.16       55.41    68.59
Chen (2012)             88.36       57.03    69.31
Hierarchy Generation    87.58       65.41    74.81
Unigram Generation      86.10       68.79    76.44

59

Parse Accuracy (Chinese-Word)

                        Precision   Recall   F1
Chen (2012)             88.87       58.76    70.74
Hierarchy Generation    80.56       71.14    75.53
Unigram Generation      79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                        Precision   Recall   F1
Chen (2012)             92.48       56.47    70.01
Hierarchy Generation    79.77       67.38    73.05
Unigram Generation      79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
 – Facing direction is also considered in the single-sentence setting
 – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations(English)

                        Single-Sentence   Paragraph
Chen & Mooney (2011)    54.40             16.18
Chen (2012)             57.28             19.18
Hierarchy Generation    57.22             20.17
Unigram Generation      67.14             28.12

63

End-to-End Execution Evaluations(Chinese-Word)

                        Single-Sentence   Paragraph
Chen (2012)             58.70             20.13
Hierarchy Generation    61.03             19.08
Unigram Generation      63.40             23.12

64

End-to-End Execution Evaluations(Chinese-Character)

                        Single-Sentence   Paragraph
Chen (2012)             57.27             16.73
Hierarchy Generation    55.61             12.74
Unigram Generation      62.85             23.33

65

Discussion
• Better recall in parse accuracy
 – Our probabilistic model uses useful but low-score lexemes as well → more coverage
 – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
 – Complexity: LHG and k-permutation rules
 – Particularly weak on the Chinese-character corpus: the longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
 – Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. 2011
 – Overcomes intractability in complex MRLs
 – Learns from more general, complex ambiguity
 – Produces novel MR parses never seen during training
66

Comparison of Grammar Size and EM Training Time

67

Data                   Hierarchy Generation PCFG    Unigram Generation PCFG
                       |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English                20,451      17.26            16,357      8.78
Chinese (Word)         21,636      15.99            15,459      8.05
Chinese (Character)    19,792      18.64            13,514      12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
 – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
 – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
 – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
 – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
 – Part-of-speech tagging (Collins, EMNLP 2002)
 – Semantic role labeling (Toutanova et al., ACL 2005)
 – Named entity recognition (Collins, ACL 2002)
 – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
 – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal
 – Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
 – The trained model outputs the best result with maximum probability
[Figure: a trained generative model maps a testing example to the 1-best candidate with maximum probability.]

70

Discriminative Reranking
• Can we do better?
 – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Figure: the trained baseline generative model generates n-best candidates (GEN: candidate 1 ... candidate n) for a testing example; a trained secondary discriminative model selects the best prediction as the output.]

71

How can we apply discriminative reranking

• Impossible to apply standard discriminative reranking to grounded language learning
 – Lack of a single gold-standard reference for each training example
 – Instead, it provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
 – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating the final end-task plan execution)
 – Weak indication of whether a candidate is good or bad
 – Multiple candidate parses for parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Figure: the trained baseline generative model generates n-best candidates with feature vectors a1 ... an and perceptron scores (e.g. -0.16, 1.21, -1.09, 1.46, 0.59); when the best prediction differs from the gold-standard reference (feature vector ag), the weight vector is updated by the feature-vector difference ag - a4.]

73
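A minimal sketch of this reranking update (feature vectors as dicts; names illustrative, and omitting the averaging step of the averaged perceptron): if the top-scoring candidate differs from the reference, the weights move toward the reference features and away from the prediction.

```python
def perceptron_rerank_update(weights, candidates, gold_feats):
    """One update of a Collins-style reranking perceptron.

    `candidates` is a list of feature dicts for the n-best parses;
    `gold_feats` are the reference features.  If the highest-scoring
    candidate is not the gold one, weights move toward the gold
    features and away from the prediction.  Minimal sketch."""
    def score(feats):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    best = max(candidates, key=score)
    if best != gold_feats:
        for f, v in gold_feats.items():
            weights[f] = weights.get(f, 0.0) + v
        for f, v in best.items():
            weights[f] = weights.get(f, 0.0) - v
    return weights
```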

(For our generative models, such a gold-standard reference is not available.)

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
 – The one most preferred in terms of plan execution
 – Evaluate the composed MR plans from the candidate parses
 – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
 – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
 – Prefer the candidate with the best execution success rate during training

74
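The pseudo-gold selection step can be sketched as follows; `execute` stands in for a single nondeterministic run of the executor and is an assumption here, not the actual MARCO interface.

```python
def pick_pseudo_gold(candidates, execute, trials=10):
    """Choose a pseudo-gold candidate by simulated execution.

    `execute(mr)` returns True if one (nondeterministic) run of the
    executor reaches the destination; each candidate MR is averaged
    over `trials` runs and the best rate wins.  A sketch of the
    response-based selection."""
    def success_rate(mr):
        return sum(execute(mr) for _ in range(trials)) / trials
    rates = [(success_rate(mr), i) for i, mr in enumerate(candidates)]
    best_rate, best_i = max(rates)
    return best_i, best_rate
```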

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Figure: each of the n-best candidates derives an MR (MR1 ... MRn) that the MARCO execution module runs in the world, yielding execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron updates on the feature-vector difference between it and the best prediction.]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
 – Multiple parses may share the same maximum execution success rate
 – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or have ignorable details attached, and are sometimes inaccurate yet contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
 – Use the candidates with higher execution success rates than the currently best-predicted candidate
 – Update with feature-vector differences weighted by the difference between execution success rates

76
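The multi-parse variant can be sketched as follows — every candidate beating the predicted parse's execution success rate contributes a rate-weighted feature difference (an illustrative sketch, not the thesis' exact update):

```python
def multi_parse_update(weights, candidates, rates, predicted_i):
    """Response-based update using multiple parses.

    Every candidate whose execution success rate exceeds that of the
    currently predicted candidate contributes a feature-vector
    difference, weighted by the rate gap."""
    pred_feats = candidates[predicted_i]
    pred_rate = rates[predicted_i]
    for feats, rate in zip(candidates, rates):
        if rate > pred_rate:
            scale = rate - pred_rate
            for f in set(feats) | set(pred_feats):
                diff = feats.get(f, 0.0) - pred_feats.get(f, 0.0)
                weights[f] = weights.get(f, 0.0) + scale * diff
    return weights
```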

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Figure: among the n-best candidates, each one whose execution success rate (e.g. 0.6, 0.9) exceeds the predicted parse's rate contributes a feature-vector difference to the perceptron update.]

77

Weight Update with Multiple Parses
[Figure: the next candidate with a higher execution success rate likewise contributes its weighted feature-vector difference in a further update step.]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)
Turn left and find the sofa, then turn around the corner.
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()
f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
79
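Extracting such indicator features from a parse tree can be sketched as follows; the tree encoding and feature-name format are illustrative assumptions:

```python
def tree_composition_features(tree):
    """Collect binary parse-tree composition features.

    `tree` is a (label, children) tuple where children are subtrees or
    terminal strings.  Emits one indicator per parent -> children
    composition and per nonterminal/terminal pair, in the spirit of
    the reranking features above.  A sketch; naming is illustrative."""
    feats = {}
    def walk(node):
        label, children = node
        kids = [c[0] if isinstance(c, tuple) else c for c in children]
        feats["%s->%s" % (label, " ".join(kids))] = 1
        for c in children:
            if isinstance(c, tuple):
                walk(c)
            else:
                feats["%s,%s" % (label, c)] = 1
    walk(tree)
    return feats
```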

Evaluations
• Leave-one-map-out approach
 – 2 maps for training and 1 map for testing
 – Parse accuracy
 – Plan execution accuracy (end goal)
• Compared with the two baseline models
 – Hierarchy and Unigram Generation PCFG models
 – All reranking results use 50-best parses
 – We obtain 50-best distinct composed MR plans (and the corresponding parses) out of 1,000,000-best parses: many parse trees differ insignificantly, leading to the same derived MR plans, so a sufficiently large 1,000,000-best list is generated from the baseline model
80

Response-based Update vs Baseline(English)

81

Parse F1             Hierarchy   Unigram
  Baseline           74.81       76.44
  Response-based     73.32       77.24
Single-sentence      Hierarchy   Unigram
  Baseline           57.22       67.14
  Response-based     59.65       68.27
Paragraph            Hierarchy   Unigram
  Baseline           20.17       28.12
  Response-based     22.62       29.20

Response-based Update vs Baseline (Chinese-Word)

82

Parse F1             Hierarchy   Unigram
  Baseline           75.53       76.41
  Response-based     77.26       77.74
Single-sentence      Hierarchy   Unigram
  Baseline           61.03       63.40
  Response-based     64.12       65.64
Paragraph            Hierarchy   Unigram
  Baseline           19.08       23.12
  Response-based     21.29       23.74

Response-based Update vs Baseline(Chinese-Character)

83

Parse F1             Hierarchy   Unigram
  Baseline           73.05       77.55
  Response-based     76.26       79.76
Single-sentence      Hierarchy   Unigram
  Baseline           55.61       62.85
  Response-based     64.08       65.50
Paragraph            Hierarchy   Unigram
  Baseline           12.74       23.33
  Response-based     22.25       25.35

Response-based Update vs Baseline

• vs. baseline: the response-based approach performs better in the final end-task plan execution
 – It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Parse F1             Hierarchy   Unigram
  Single             73.32       77.24
  Multi              73.43       77.81
Single-sentence      Hierarchy   Unigram
  Single             59.65       68.27
  Multi              62.81       68.93
Paragraph            Hierarchy   Unigram
  Single             22.62       29.20
  Multi              26.57       29.10

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Parse F1             Hierarchy   Unigram
  Single             77.26       77.74
  Multi              78.80       78.11
Single-sentence      Hierarchy   Unigram
  Single             64.12       65.64
  Multi              64.15       66.27
Paragraph            Hierarchy   Unigram
  Single             21.29       23.74
  Multi              21.55       25.95

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Parse F1             Hierarchy   Unigram
  Single             76.26       79.76
  Multi              79.44       79.94
Single-sentence      Hierarchy   Unigram
  Single             64.08       65.50
  Multi              64.08       66.84
Paragraph            Hierarchy   Unigram
  Single             22.25       25.35
  Multi              22.58       27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves performance in general
 – A single best pseudo-gold parse provides only weak feedback
 – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details that still capture the gist of the preferred actions
 – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
 – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
 – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
 – Learn a joint model of syntactic and semantic structure
• Large-scale data
 – Data collection; model adaptation to large scale
• Machine translation
 – Application to summarized translation
• Real perceptual data
 – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
 – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
 – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
– Word-segmented version by the Stanford Chinese Word Segmenter
– Character-segmented version

55

Paragraph:
"Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."

Single sentences (first two shown):
"Take the wood path towards the easel."
"At the easel go left and then take a right on the the blue path at the corner."

Action traces (aligned to the sentences in the original slide): Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward, Forward, Turn left, Forward, Turn right, Turn

Data Statistics

                          Paragraph       Single-Sentence
# Instructions            706             3236
Avg. # sentences          5.0 (±2.8)      1.0 (±0)
Avg. # actions            10.4 (±5.7)     2.1 (±2.4)
Avg. # words / sentence
  English                 37.6 (±21.1)    7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)    6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)    10.6 (±7.3)
Vocabulary
  English                 660             629
  Chinese-Word            661             508
  Chinese-Character       448             328

56

Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
– The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
  Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
  Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
– Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data

• Metric: partial parse accuracy

58
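Partial parse accuracy is reported as precision, recall, and F1. A generic sketch of the scoring arithmetic over matched MR components follows; the exact component-matching criterion used in the thesis may differ, so the counts here are assumed inputs:

```python
def partial_parse_f1(n_correct, n_predicted, n_gold):
    """Precision/recall/F1 given counts of correctly predicted MR
    components, total predicted components, and total gold components."""
    p = n_correct / n_predicted if n_predicted else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, 3 correct components out of 4 predicted and 6 gold gives precision 0.75, recall 0.5, and F1 0.6.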

Parse Accuracy (English)

                                  Precision    Recall    F1
Chen & Mooney (2011)              90.16        55.41     68.59
Chen (2012)                       88.36        57.03     69.31
Hierarchy Generation PCFG Model   87.58        65.41     74.81
Unigram Generation PCFG Model     86.1         68.79     76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision    Recall    F1
Chen (2012)                       88.87        58.76     70.74
Hierarchy Generation PCFG Model   80.56        71.14     75.53
Unigram Generation PCFG Model     79.45        73.66     76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision    Recall    F1
Chen (2012)                       92.48        56.47     70.01
Hierarchy Generation PCFG Model   79.77        67.38     73.05
Unigram Generation PCFG Model     79.73        75.52     77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan produced by the semantic parser reaches the destination

• Strict metric: only successful if the final position matches exactly
– Also considers facing direction in single-sentence evaluation
– Paragraph execution fails if even one single-sentence execution fails

62

End-to-End Execution Evaluations (English)

                                  Single-Sentence    Paragraph
Chen & Mooney (2011)              54.4               16.18
Chen (2012)                       57.28              19.18
Hierarchy Generation PCFG Model   57.22              20.17
Unigram Generation PCFG Model     67.14              28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence    Paragraph
Chen (2012)                       58.7               20.13
Hierarchy Generation PCFG Model   61.03              19.08
Unigram Generation PCFG Model     63.4               23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence    Paragraph
Chen (2012)                       57.27              16.73
Hierarchy Generation PCFG Model   55.61              12.74
Unigram Generation PCFG Model     62.85              23.33

65

Discussion
• Better recall in parse accuracy
– Our probabilistic model also uses useful but low-score lexemes → more coverage
– Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
– Complexities: LHG and k-permutation rules
– Particularly weak on the Chinese-character corpus: the longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
– Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. (2011)
– Overcomes intractability in complex MRLs
– Learns from more general, complex ambiguity
– Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation PCFG Model    Unigram Generation PCFG Model
Data                  |Grammar|    Time (hrs)            |Grammar|    Time (hrs)
English               20,451       17.26                 16,357       8.78
Chinese (Word)        21,636       15.99                 15,459       8.05
Chinese (Character)   19,792       18.64                 13,514       12.58

67

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
– Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
– Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
– Part-of-speech tagging (Collins, EMNLP 2002)
– Semantic role labeling (Toutanova et al., ACL 2005)
– Named entity recognition (Collins, ACL 2002)
– Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
– Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
– The trained model outputs the single best result with maximum probability

[Diagram: a trained generative model maps a testing example to the 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
– A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: the trained baseline generative model generates n-best candidates (GEN) for a testing example; a trained secondary discriminative model selects the best prediction as output]

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
– Lack of a single gold-standard reference for each training example
– Instead, training provides only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
– Evaluate candidate formal MRs by executing them in simulated worlds, as is done when evaluating final end-task plan execution
– Gives a weak indication of whether a candidate is good or bad
– Allows multiple candidate parses per parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: the trained baseline generative model produces n-best candidates for a training example, scored by the perceptron (-0.16, 1.21, -1.09, 1.46, 0.59); when the best prediction differs from the gold-standard reference, the weight vector is updated with the feature-vector difference. For our generative models, such a gold-standard reference is not available.]

73
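A generic averaged-perceptron reranker in the spirit of Collins (2000) can be sketched as follows. The inputs (candidate feature dictionaries and a gold index) are assumptions for illustration; in the grounded setting the gold index must be replaced by the pseudo-gold selection described next:

```python
from collections import defaultdict

def averaged_perceptron(examples, epochs=5):
    """examples: list of (candidate_feature_dicts, gold_index).
    Returns averaged weights over all update steps."""
    w = defaultdict(float)       # current weights
    total = defaultdict(float)   # running sum for averaging
    t = 0
    for _ in range(epochs):
        for cands, gold in examples:
            t += 1
            # Predict the candidate with the highest perceptron score.
            pred = max(range(len(cands)),
                       key=lambda i: sum(w[f] * v for f, v in cands[i].items()))
            if pred != gold:
                # Update with the feature-vector difference (gold minus predicted).
                for f, v in cands[gold].items():
                    w[f] += v
                for f, v in cands[pred].items():
                    w[f] -= v
            for f, v in w.items():
                total[f] += v
    return {f: total[f] / t for f in total}
```

Averaging the weights over all steps reduces overfitting relative to using the final weight vector.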

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
– The one most preferred in terms of plan execution
– Evaluate the MR plans composed from the candidate parses
– The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world; it is also used for evaluating end-goal plan execution performance
– Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
– Prefer the candidate with the best execution success rate during training

74
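The pseudo-gold selection step can be sketched as below. The `execute` callback stands in for the nondeterministic MARCO executor and is a hypothetical interface, not the actual module's API:

```python
def pseudo_gold(candidate_mrs, execute, trials=10):
    """Pick the candidate MR whose simulated execution succeeds most often.
    execute(mr) -> bool is an assumed stand-in for one MARCO trial."""
    def rate(mr):
        # Average success over several trials (executor is nondeterministic).
        return sum(1 for _ in range(trials) if execute(mr)) / trials
    return max(candidate_mrs, key=rate)
```

The resulting pseudo-gold candidate then plays the role of the gold reference in the perceptron update.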

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the MRs derived from the n-best candidates are run through the MARCO execution module, yielding execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron weights are updated with the feature-vector difference between it and the best prediction (perceptron scores 1.79, 0.21, -1.09, 1.46, 0.59)]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
– Multiple parses may share the same maximum execution success rate
– "Lower" execution success rates can still indicate correct plans, given the indirect supervision of human-follower actions: MR plans may be underspecified or carry ignorable details, and are sometimes inaccurate yet contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
– Use candidates with higher execution success rates than the currently best-predicted candidate
– Update with the feature-vector difference, weighted by the difference between execution success rates

76

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): among candidates with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59, a first candidate whose rate exceeds the predicted parse's contributes its feature-vector difference, weighted by the rate difference]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): the next qualifying candidate likewise contributes its feature-vector difference, weighted by its execution-success-rate difference from the predicted parse]

78

Features

• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)      L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)      L5: Travel(), Verify(at: SOFA)      L6: Turn()

Example features: f(L1 → L3) = 1,  f(L3 → L5 or L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
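Extracting such indicator features from a parse tree can be sketched as follows. The tree encoding and the feature-name formats are illustrative assumptions in the spirit of Collins (2002), not the thesis' exact feature templates:

```python
def tree_features(tree):
    """Collect binary indicator features over local nonterminal/terminal
    compositions. tree: (label, [children]); leaf children are word strings."""
    label, children = tree
    kid_labels = [c if isinstance(c, str) else c[0] for c in children]
    feats = {f"{label}=>{' '.join(kid_labels)}"}    # whole local rule
    for k in kid_labels:
        feats.add(f"{label}->{k}")                  # parent-child fragment
    for c in children:
        if isinstance(c, str):
            feats.add(f"{label},{c}")               # lexical grounding feature
        else:
            feats |= tree_features(c)               # recurse into subtrees
    return feats
```

Each collected feature fires with value 1 in the perceptron's feature vector.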

Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)
• Compared with the two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Try to get 50 distinct composed MR plans (and the corresponding parses) out of the 1,000,000-best parses: many parse trees differ insignificantly, leading to the same derived MR plans, so sufficiently large 1,000,000-best parse lists are generated from the baseline model

80
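Collecting 50 distinct MR plans from an n-best parse list amounts to deduplicating by derived plan. A minimal sketch, where `derive_mr` is an assumed callback mapping a parse to its composed MR plan:

```python
def distinct_plan_parses(nbest_parses, derive_mr, k=50):
    """Return the top-k parses with pairwise-distinct derived MR plans.
    nbest_parses is assumed sorted by decreasing probability."""
    seen, out = set(), []
    for parse in nbest_parses:
        mr = derive_mr(parse)
        if mr not in seen:       # keep only the first (best) parse per plan
            seen.add(mr)
            out.append(parse)
            if len(out) == k:
                break
    return out
```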

Response-based Update vs. Baseline (English)

              Parse F1                Single-Sentence         Paragraph
              Baseline  Response      Baseline  Response      Baseline  Response
Hierarchy     74.81     73.32         57.22     59.65         20.17     22.62
Unigram       76.44     77.24         67.14     68.27         28.12     29.2

81

Response-based Update vs. Baseline (Chinese-Word)

              Parse F1                Single-Sentence         Paragraph
              Baseline  Response      Baseline  Response      Baseline  Response
Hierarchy     75.53     77.26         61.03     64.12         19.08     21.29
Unigram       76.41     77.74         63.4      65.64         23.12     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

              Parse F1                Single-Sentence         Paragraph
              Baseline  Response      Baseline  Response      Baseline  Response
Hierarchy     73.05     76.26         55.61     64.08         12.74     22.25
Unigram       77.55     79.76         62.85     65.5          23.33     25.35

83

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

              Parse F1                Single-Sentence         Paragraph
              Single    Multi         Single    Multi         Single    Multi
Hierarchy     73.32     73.43         59.65     62.81         22.62     26.57
Unigram       77.24     77.81         68.27     68.93         29.2      29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

              Parse F1                Single-Sentence         Paragraph
              Single    Multi         Single    Multi         Single    Multi
Hierarchy     77.26     78.8          64.12     64.15         21.29     21.55
Unigram       77.74     78.11         65.64     66.27         23.74     25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

              Parse F1                Single-Sentence         Paragraph
              Single    Multi         Single    Multi         Single    Multi
Hierarchy     76.26     79.44         64.08     64.08         22.25     22.58
Unigram       79.76     79.94         65.5      66.84         25.35     27.16

87

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
– The single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, while still capturing the gist of the preferred actions
– A variety of preferable parses improves the amount and quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure
• Large-scale data
– Data collection; model adaptation to large scale
• Machine translation
– Application to summarized translation
• Real perceptual data
– Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and does not scale, because training data must be annotated
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework: a full probabilistic model for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 31: Grounded Language Learning Models for Ambiguous  Supervision

Challenges

31

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Landmarks plan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Challenges

32

Instruc-tion

at the easel go left and then take a right onto the blue path at the corner

Correctplan

Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )

Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan

Previous Work (Chen and Mooney 2011)

bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining

landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR

components out of landmarks planndash Trains supervised semantic parser to map novel instruction

(NL) to correct formal plan (MR)ndash Loses information during refinement

Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all

33

Proposed Solution (Kim and Mooney 2012)

bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to

formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for

building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic

Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

34

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Reranking
• Effective approach to improve performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – Trained model outputs the best result with max probability

[diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – Secondary discriminative model picks the best out of n-best candidates from the baseline model

[diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]

71
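The n-best reranking setup above can be sketched as a simple scorer over candidates. This is an illustrative sketch, not the thesis's implementation; the feature function and weight values are made-up stand-ins.

```python
def rerank(candidates, feature_fn, weights):
    """Pick the candidate with the highest discriminative score.

    candidates: n-best parses from the trained baseline generative model.
    feature_fn: maps a parse to a {feature: value} dict.
    weights:    learned discriminative weight vector as a dict.
    """
    def score(parse):
        return sum(weights.get(f, 0.0) * v for f, v in feature_fn(parse).items())
    return max(candidates, key=score)

# Toy usage: token-indicator features and hand-set weights (both illustrative).
feature_fn = lambda parse: {tok: 1 for tok in parse.split()}
weights = {"Turn(LEFT)": 1.5, "Turn(RIGHT)": -0.5}
best = rerank(["Turn(RIGHT)", "Turn(LEFT)"], feature_fn, weights)
```

The discriminative model never generates candidates itself; it only re-scores the generative model's n-best list.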

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    (the response signal is weak and distributed over all candidates)

72

Reranking Model: Averaged Perceptron (Collins, 2000)
• Parameter weight vector is updated when the trained model predicts a wrong candidate

[diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates with feature vectors a1 … an and perceptron scores (-0.16, 1.21, -1.09, 1.46, 0.59); the gold-standard reference a_g is compared with the best prediction, and the weights are updated by the feature-vector difference a_g - a4. For our generative models, such a gold-standard reference is Not Available.]

73
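The averaged-perceptron update described above can be sketched as follows, assuming a feature function over candidate parses and a gold reference per example; names and the toy feature function are illustrative, not from the thesis code.

```python
from collections import defaultdict

def perceptron_train(examples, feature_fn, epochs=5):
    """Averaged perceptron reranker (in the style of Collins, 2000).

    examples:   list of (candidates, gold) pairs, where candidates is the
                n-best list and gold is the reference parse for that example.
    feature_fn: maps a parse to a {feature: count} dict.
    """
    weights = defaultdict(float)   # current weight vector
    totals = defaultdict(float)    # running sums for averaging
    steps = 0
    for _ in range(epochs):
        for candidates, gold in examples:
            steps += 1
            # Current best prediction under the perceptron score.
            pred = max(candidates, key=lambda c: sum(
                weights[f] * v for f, v in feature_fn(c).items()))
            if pred != gold:
                # Update by the feature-vector difference a_gold - a_pred.
                for f, v in feature_fn(gold).items():
                    weights[f] += v
                for f, v in feature_fn(pred).items():
                    weights[f] -= v
            for f, v in weights.items():
                totals[f] += v
    # Averaged weights are less sensitive to the order of late updates.
    return {f: v / steps for f, v in totals.items()}
```

The averaging step is what distinguishes this from a plain perceptron: the returned vector is the mean of the weight vectors seen after every training step.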

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – Most preferred one in terms of plan execution
  – Evaluate composed MR plans from candidate parses
  – MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record Execution Success Rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[diagram: n-best candidates → derived MRs (MR1 … MRn) → MARCO Execution Module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) becomes the pseudo-gold reference, and the perceptron (scores 1.79, 0.21, -1.09, 1.46, 0.59) is updated with the feature-vector difference from its best prediction]

75
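The pseudo-gold selection above can be sketched as follows. `execute_in_world` is a hypothetical stand-in for the MARCO execution module; the 10-trial averaging mirrors the slide's handling of MARCO's nondeterminism.

```python
def execution_success_rate(mr_plan, execute_in_world, trials=10):
    """Fraction of runs reaching the intended destination.

    execute_in_world is a stand-in for MARCO: it runs one trial of the MR
    plan and returns True on success. Execution is nondeterministic, so we
    average over several trials (the slide uses 10).
    """
    return sum(1 for _ in range(trials) if execute_in_world(mr_plan)) / trials

def pick_pseudo_gold(candidate_mrs, execute_in_world):
    """Prefer the candidate MR plan with the best execution success rate."""
    rates = [execution_success_rate(mr, execute_in_world) for mr in candidate_mrs]
    best_i = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return candidate_mrs[best_i], rates
```

During training the pseudo-gold candidate then plays the role the gold-standard reference plays in standard perceptron reranking.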

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    (MR plans are underspecified or have ignorable details attached; sometimes inaccurate, but they contain the correct MR components to reach the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
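The multiple-parse update can be sketched as follows: every candidate whose execution success rate beats the current prediction contributes a feature-difference update, scaled by the rate gap. Function and variable names are illustrative, not the thesis's actual code.

```python
def multi_parse_update(weights, candidates, rates, feature_fn):
    """Update toward all candidates that out-execute the current prediction.

    weights:    mutable {feature: weight} dict, updated in place.
    candidates: n-best parses.
    rates:      execution success rate of each candidate (same order).
    feature_fn: maps a parse to a {feature: count} dict.
    """
    # Current best prediction under the perceptron score.
    pred_i = max(range(len(candidates)), key=lambda i: sum(
        weights.get(f, 0.0) * v for f, v in feature_fn(candidates[i]).items()))
    for i, parse in enumerate(candidates):
        gap = rates[i] - rates[pred_i]
        if gap <= 0:
            continue  # only candidates that execute better contribute
        # Feature-vector difference, weighted by the success-rate difference.
        for f, v in feature_fn(parse).items():
            weights[f] = weights.get(f, 0.0) + gap * v
        for f, v in feature_fn(candidates[pred_i]).items():
            weights[f] = weights.get(f, 0.0) - gap * v
    return weights
```

Weighting by the rate gap lets several imperfect-but-preferable parses each nudge the model, instead of relying on a single pseudo-gold reference.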

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[diagram, slides 77-78: the perceptron's current best prediction is updated step by step (Update (1), Update (2)) using the feature-vector differences of the candidates whose MARCO execution success rates (0.6, 0.4, 0.0, 0.9, 0.2) exceed that of the current prediction; perceptron scores shown are 1.24, 1.83, -1.09, 1.46, 0.59]

77-78

Features
• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree
  (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1    f(L3 → L5 ∨ L1) = 1    f(L3 ⇒ L5 L6) = 1    f(L5 → "find") = 1

79
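Indicator features of this kind can be sketched as a walk over the parse tree. The (label, children) tree encoding below is a hypothetical representation chosen for the sketch, not the thesis's internal one.

```python
def indicator_features(tree):
    """Binary indicators for compositions of nonterminals/terminals.

    tree: (label, [children]) for a nonterminal node, or a plain string
    for a terminal. Emits one indicator per full rule (parent => child
    sequence) and one per parent-child pair, e.g. "L3 => L5 L6" and
    "L5 -> find".
    """
    feats = {}

    def visit(node):
        if isinstance(node, str):
            return  # terminal: no rule of its own
        label, children = node
        child_labels = [c if isinstance(c, str) else c[0] for c in children]
        feats["%s => %s" % (label, " ".join(child_labels))] = 1  # full rule
        for cl in child_labels:
            feats["%s -> %s" % (label, cl)] = 1  # single parent-child pair
        for c in children:
            visit(c)

    visit(tree)
    return feats
```

Because the values are binary indicators, repeated occurrences of the same composition do not increase the feature value.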

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50-best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses
    (Many parse trees differ insignificantly, leading to the same derived MR plans; generate sufficiently large 1,000,000-best parse lists from the baseline model)

80
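Extracting 50 distinct MR plans from a much larger n-best parse list amounts to a dedup over derived plans, sketched below; `derive_mr` is a hypothetical stand-in for composing the formal MR plan from a parse tree.

```python
def distinct_plans(parses, derive_mr, k=50):
    """Keep the top-k parses whose derived MR plans are all distinct.

    parses:    parse trees in decreasing order of probability
               (e.g. a 1,000,000-best list from the baseline model).
    derive_mr: composes the formal (hashable) MR plan from a parse tree.
    """
    seen = set()
    kept = []
    for parse in parses:
        plan = derive_mr(parse)
        if plan in seen:
            continue  # insignificantly different trees, same derived plan
        seen.add(plan)
        kept.append((parse, plan))
        if len(kept) == k:
            break
    return kept
```

Since the input is probability-ordered, each kept entry is the most probable parse for its plan.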

Response-based Update vs. Baseline (English)

                Parse F1              Single-sentence       Paragraph
                Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy       74.81     73.32       57.22     59.65       20.17     22.62
Unigram         76.44     77.24       67.14     68.27       28.12     29.20

81

Response-based Update vs. Baseline (Chinese-Word)

                Parse F1              Single-sentence       Paragraph
                Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy       75.53     77.26       61.03     64.12       19.08     21.29
Unigram         76.41     77.74       63.40     65.64       23.12     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                Parse F1              Single-sentence       Paragraph
                Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy       73.05     76.26       55.61     64.08       12.74     22.25
Unigram         77.55     79.76       62.85     65.50       23.33     25.35

83

Response-based Update vs. Baseline
• The response-based approach performs better than the baseline in the final end-task plan execution
  – It optimizes the model directly for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                Parse F1              Single-sentence       Paragraph
                Single    Multi       Single    Multi       Single    Multi
Hierarchy       73.32     73.43       59.65     62.81       22.62     26.57
Unigram         77.24     77.81       68.27     68.93       29.20     29.10

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                Parse F1              Single-sentence       Paragraph
                Single    Multi       Single    Multi       Single    Multi
Hierarchy       77.26     78.80       64.12     64.15       21.29     21.55
Unigram         77.74     78.11       65.64     66.27       23.74     25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                Parse F1              Single-sentence       Paragraph
                Single    Multi       Single    Multi       Single    Multi
Hierarchy       76.26     79.44       64.08     64.08       22.25     22.58
Unigram         79.76     79.94       65.50     66.84       25.35     27.16

87

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but still capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and does not scale, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse lists from the baseline model

80
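The leave-one-map-out protocol can be sketched as follows; `train` and `evaluate` are hypothetical stand-ins for the learner and the accuracy metric, and the map names merely mirror the three-map setup.

```python
def leave_one_map_out(maps, train, evaluate):
    """Train on all maps but one, test on the held-out map, average the folds.

    maps: list of map identifiers (three in this setup).
    train(maps): returns a model trained on the given maps.
    evaluate(model, map): returns the metric (e.g. parse F1) on one map.
    """
    results = {}
    for held_out in maps:
        model = train([m for m in maps if m != held_out])  # 2 maps for training
        results[held_out] = evaluate(model, held_out)       # 1 map for testing
    return sum(results.values()) / len(results)
```

Each map serves as the test set exactly once, so the reported number is the mean over three folds.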

Response-based Update vs. Baseline (English)

81

Parse F1            Hierarchy   Unigram
  Baseline              74.81     76.44
  Response-based        73.32     77.24

Single-sentence     Hierarchy   Unigram
  Baseline              57.22     67.14
  Response-based        59.65     68.27

Paragraph           Hierarchy   Unigram
  Baseline              20.17     28.12
  Response-based        22.62     29.20

Response-based Update vs. Baseline (Chinese-Word)

82

Parse F1            Hierarchy   Unigram
  Baseline              75.53     76.41
  Response-based        77.26     77.74

Single-sentence     Hierarchy   Unigram
  Baseline              61.03     63.40
  Response-based        64.12     65.64

Paragraph           Hierarchy   Unigram
  Baseline              19.08     23.12
  Response-based        21.29     23.74

Response-based Update vs. Baseline (Chinese-Character)

83

Parse F1            Hierarchy   Unigram
  Baseline              73.05     77.55
  Response-based        76.26     79.76

Single-sentence     Hierarchy   Unigram
  Baseline              55.61     62.85
  Response-based        64.08     65.50

Paragraph           Hierarchy   Unigram
  Baseline              12.74     23.33
  Response-based        22.25     25.35

Response-based Update vs. Baseline

• vs. baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

Parse F1            Hierarchy   Unigram
  Single                73.32     77.24
  Multi                 73.43     77.81

Single-sentence     Hierarchy   Unigram
  Single                59.65     68.27
  Multi                 62.81     68.93

Paragraph           Hierarchy   Unigram
  Single                22.62     29.20
  Multi                 26.57     29.10

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

Parse F1            Hierarchy   Unigram
  Single                77.26     77.74
  Multi                 78.80     78.11

Single-sentence     Hierarchy   Unigram
  Single                64.12     65.64
  Multi                 64.15     66.27

Paragraph           Hierarchy   Unigram
  Single                21.29     23.74
  Multi                 21.55     25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

Parse F1            Hierarchy   Unigram
  Single                76.26     79.76
  Multi                 79.44     79.94

Single-sentence     Hierarchy   Unigram
  Single                64.08     65.50
  Multi                 64.08     66.84

Paragraph           Hierarchy   Unigram
  Single                22.25     25.35
  Multi                 22.58     27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of a full probabilistic model for learning NL–MR correspondences with ambiguous supervision
• Discriminative reranking is possible, and effective, with weak feedback from the perceptual environment

92

Thank You


  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 34: Grounded Language Learning Models for Ambiguous  Supervision

Proposed Solution (Kim and Mooney 2012)

• Learn a probabilistic semantic parser directly from ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL–MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals

System Diagram (Chen and Mooney 2011)

(Diagram: learning system for parsing navigation instructions. Training: observation, instruction, and world state feed a navigation plan constructor that produces a landmarks plan; plan refinement yields a supervised refined plan, with possible information loss, for a supervised semantic parser learner. Testing: the learned semantic parser maps an instruction and world state to a plan that the MARCO execution module runs to produce an action trace.)

System Diagram of Proposed Solution

(Diagram: the same pipeline, except the landmarks plan feeds a probabilistic semantic parser learner trained directly from ambiguous supervision, with no plan refinement step and no intermediate information loss.)

PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)

• PCFG rules describe the generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: one NL sentence paired with a handful of MRs (on the order of 10s)
  – Can only output MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL–MR correspondences at the semantic concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse with semantic lexeme MRs

Semantic Lexicon (Chen and Mooney 2011)

• Pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans): how probable graph g is given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()
• Score: co-occurrence of g and w, discounted by the general occurrence of g without w
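The lexicon score described above can be illustrated in code. This is a simplified, hypothetical scoring function in the spirit of Chen and Mooney (2011), not the thesis implementation: it rewards co-occurrence of subgraph g with phrase w and penalizes occurrences of g in contexts that lack w.

```python
def lexicon_score(w, g, examples):
    """Score how probable MR subgraph g is given NL phrase w.

    Hypothetical sketch: `examples` is a list of (nl_words, mr_subgraphs)
    pairs, both represented as sets for simplicity.  Returns
    p(g | w) - p(g | not w): co-occurrence of g and w, discounted by the
    general occurrence of g without w.
    """
    cooc = sum(1 for nl, mrs in examples if w in nl and g in mrs)
    without = sum(1 for nl, mrs in examples if w not in nl and g in mrs)
    n_w = sum(1 for nl, mrs in examples if w in nl)
    n_not_w = len(examples) - n_w
    p_cooc = cooc / n_w if n_w else 0.0
    p_without = without / n_not_w if n_not_w else 0.0
    return p_cooc - p_without
```

A phrase that reliably predicts a subgraph scores near 1; a subgraph that appears mostly without the phrase scores negatively.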

Lexeme Hierarchy Graph (LHG)

• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts that are further connected to NL word groundings

(Diagram: an LHG whose root, Turn(RIGHT), Verify(side: HATRACK), Travel(steps: 3), Verify(at: EASEL), decomposes into smaller subgraphs such as Turn(), Travel(), Verify(at: EASEL) and Turn(RIGHT), Verify(side: HATRACK), Travel(), down to single-node lexemes like Turn() and Verify(side: HATRACK).)

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe that will finally be connected to the NL instruction: each node generates all k-permutations of its children, since we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM; the most probable MR components out of all possible combinations are estimated
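The k-permutation expansion can be sketched directly with `itertools`. This hypothetical helper enumerates one candidate rule per k-permutation of a node's children, for every k ≥ 1, since the correct subset and order are unknown at construction time.

```python
from itertools import permutations

def child_ordering_rules(parent, children):
    """Enumerate candidate PCFG rules for one LHG node (sketch).

    Because we do not know which subset of child concepts the parent
    actually verbalizes, or in what order, emit one rule
    parent -> (ordered children) per k-permutation, for each k >= 1.
    EM later distributes probability mass over these alternatives.
    """
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules
```

For a node with n children this yields sum over k of n!/(n−k)! rules, which is one source of the grammar-size blowup discussed later.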

PCFG Construction

(Diagram of the constructed rules: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word.)

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data
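The CKY step above can be sketched as a Viterbi parse over a PCFG in Chomsky normal form. A minimal illustration, not the thesis code; the toy grammar format (`unary` and `binary` probability dicts) is an assumption.

```python
import math

def cky_best_parse(words, unary, binary):
    """Most probable parse under a CNF PCFG (Viterbi CKY sketch).

    `unary` maps (nonterminal, word) -> prob; `binary` maps
    (parent, left, right) -> prob.  Returns (log-prob, tree) for the
    best root over the whole sentence, or None if no parse exists.
    """
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    # Fill in lexical cells.
    for i, w in enumerate(words):
        for (nt, word), p in unary.items():
            if word == w:
                chart[i][i + 1][nt] = (math.log(p), (nt, w))
    # Combine adjacent spans bottom-up, keeping the best tree per label.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):
                for (parent, left, right), p in binary.items():
                    if left in chart[i][j] and right in chart[j][k]:
                        lp = (math.log(p) + chart[i][j][left][0]
                              + chart[j][k][right][0])
                        if parent not in chart[i][k] or lp > chart[i][k][parent][0]:
                            tree = (parent, chart[i][j][left][1], chart[j][k][right][1])
                            chart[i][k][parent] = (lp, tree)
    cell = chart[0][n]
    return max(cell.values()) if cell else None
```

In the actual model the nonterminals are semantic lexemes, so reading responsible lexeme MRs off the returned tree is what composes the final MR.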

(Slides 44–46 diagrams: the most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner". The tree is rooted in the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT); marking only the lexeme MRs responsible for generating NL words, Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(), composes the final MR parse Turn(LEFT), Travel(), Verify(at: SOFA), Turn().)

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and the k-permutation rules
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already accounted for
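The unigram Markov generation can be illustrated with a toy probability computation. A sketch under the assumption that each lexeme is emitted independently until a stop symbol ends the sequence; it also shows why emission order needs no extra permutation rules.

```python
def unigram_sequence_prob(lexemes, probs, stop="STOP"):
    """Probability of emitting `lexemes` one by one under a unigram
    Markov process (sketch of the Unigram Generation model's idea).

    `probs` maps each lexeme (and the stop symbol) to its emission
    probability.  The product is order-independent, which is why
    permutations of the lexeme order need no dedicated rules.
    """
    p = 1.0
    for lex in lexemes:
        p *= probs[lex]
    return p * probs[stop]
```

Compare this with the hierarchy model, where each ordering required its own k-permutation rule and its own weight.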

PCFG Construction

(Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words.)

Parsing New NL Sentences

• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

(Slides 51–54 diagrams: the most probable parse tree for "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The top nonterminal carries the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT), which generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(); marking their components in the context MR composes the final MR parse.)

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

Example paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."

Single-sentence steps with action traces:
  "Take the wood path towards the easel." → Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward
  "At the easel, go left and then take a right on the blue path at the corner." → Forward, Turn left, Forward, Turn right

Data Statistics

                              Paragraph       Single-Sentence
  # Instructions              706             3236
  Avg. # sentences            5.0 (±2.8)      1.0 (±0)
  Avg. # actions              10.4 (±5.7)     2.1 (±2.4)
  Avg. # words / sentence
    English                   37.6 (±21.1)    7.8 (±5.1)
    Chinese-Word              31.6 (±18.1)    6.9 (±4.9)
    Chinese-Character         48.9 (±28.3)    10.6 (±7.3)
  Vocabulary size
    English                   660             629
    Chinese-Word              661             508
    Chinese-Character         448             328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms: Graph Intersection Lexicon Learning (GILL; Chen and Mooney 2011) and Subgraph Generation Online Lexicon Learning (SGOLL; Chen 2012)
  – Semantic parser: KRISP (Kate and Mooney 2006) trained on the resulting supervised data

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

Parse Accuracy (English)

  System                             Precision   Recall   F1
  Chen & Mooney (2011)               90.16       55.41    68.59
  Chen (2012)                        88.36       57.03    69.31
  Hierarchy Generation PCFG Model    87.58       65.41    74.81
  Unigram Generation PCFG Model      86.1        68.79    76.44

Parse Accuracy (Chinese-Word)

  System                             Precision   Recall   F1
  Chen (2012)                        88.87       58.76    70.74
  Hierarchy Generation PCFG Model    80.56       71.14    75.53
  Unigram Generation PCFG Model      79.45       73.66    76.41

Parse Accuracy (Chinese-Character)

  System                             Precision   Recall   F1
  Chen (2012)                        92.48       56.47    70.01
  Hierarchy Generation PCFG Model    79.77       67.38    73.05
  Unigram Generation PCFG Model      79.73       75.52    77.55

End-to-End Execution Evaluations

• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in the single-sentence setting
  – Paragraph execution is affected by even one failed single-sentence execution

End-to-End Execution Evaluations (English)

  System                             Single-Sentence   Paragraph
  Chen & Mooney (2011)               54.4              16.18
  Chen (2012)                        57.28             19.18
  Hierarchy Generation PCFG Model    57.22             20.17
  Unigram Generation PCFG Model      67.14             28.12

End-to-End Execution Evaluations (Chinese-Word)

  System                             Single-Sentence   Paragraph
  Chen (2012)                        58.7              20.13
  Hierarchy Generation PCFG Model    61.03             19.08
  Unigram Generation PCFG Model      63.4              23.12

End-to-End Execution Evaluations (Chinese-Character)

  System                             Single-Sentence   Paragraph
  Chen (2012)                        57.27             16.73
  Hierarchy Generation PCFG Model    55.61             12.74
  Unigram Generation PCFG Model      62.85             23.33

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: the LHG and the k-permutation rules
  – Particularly weak on the Chinese-character corpus: the longer average sentence length makes PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

Comparison of Grammar Size and EM Training Time

                        Hierarchy Generation       Unigram Generation
  Data                  |Grammar|   Time (hrs)     |Grammar|   Time (hrs)
  English               20451       17.26          16357       8.78
  Chinese (Word)        21636       15.99          15459       8.05
  Chinese (Character)   19792       18.64          13514       12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

Discriminative Reranking

• Generative model: the trained model outputs the best result with maximum probability

(Diagram: a testing example goes through the trained generative model, which returns the 1-best candidate with maximum probability.)

Discriminative Reranking

• Can we do better? A secondary discriminative model picks the best out of the n-best candidates from the baseline model.

(Diagram: testing example → trained baseline generative model → GEN → n-best candidates → trained secondary discriminative model → best prediction → output.)

How can we apply discriminative reranking?

• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead we have weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (the same setup used to evaluate end-task plan execution)
  – Gives a weak indication of whether a candidate is good or bad
  – Use multiple candidate parses for the parameter update: the response signal is weak and distributed over all candidates

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

(Diagram: a training example yields n-best candidates with feature vectors 𝒂₁ … 𝒂ₙ and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59; the perceptron's best prediction (candidate 4) is compared against a gold-standard reference 𝒂_g, and the weights are updated by the feature-vector difference 𝒂_g − 𝒂₄. For our generative models, such a gold-standard reference is not available.)
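The perceptron step on this slide can be sketched as follows. A minimal reranking-perceptron update (after Collins, 2000) without the averaging step; representing feature vectors as plain dicts is an assumption for illustration.

```python
def perceptron_update(weights, candidates, features, gold):
    """One update of a reranking perceptron (sketch).

    `candidates` are n-best parses, `features(c)` returns a dict of
    feature counts, and `gold` is the reference candidate.  When the
    current best-scoring candidate is not the gold one, move the
    weight vector toward the gold features and away from the
    mistaken prediction.
    """
    def score(c):
        return sum(weights.get(f, 0.0) * v for f, v in features(c).items())
    best = max(candidates, key=score)
    if best != gold:
        for f, v in features(gold).items():
            weights[f] = weights.get(f, 0.0) + v
        for f, v in features(best).items():
            weights[f] = weights.get(f, 0.0) - v
    return weights
```

The next slides replace `gold` with a pseudo-gold candidate chosen by execution feedback, since no true reference exists.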

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination; MARCO is nondeterministic, so average over 10 trials
  – Prefer the candidate with the best execution success rate during training
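The execution-based selection above can be sketched as below. `execute` stands in for a MARCO run and is an assumption (any callable returning success/failure); success is averaged over repeated trials because the executor is nondeterministic.

```python
def execution_success_rate(plan, execute, trials=10):
    """Average success of a nondeterministic executor over `trials`
    runs, mirroring how MARCO results are averaged on the slide.
    `execute` is a stand-in callable: plan -> bool.
    """
    return sum(execute(plan) for _ in range(trials)) / trials

def pick_pseudo_gold(candidates, execute, trials=10):
    """Choose the candidate MR plan with the best average execution
    success rate as the pseudo-gold reference for reranking."""
    return max(candidates,
               key=lambda c: execution_success_rate(c, execute, trials))
```

The chosen candidate then plays the role of `gold` in the perceptron update.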

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

(Diagram: the derived MRs MR₁ … MRₙ from the n-best candidates are run through the MARCO execution module, yielding execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with the highest rate becomes the pseudo-gold reference, and the weights are updated by the feature-vector difference against the perceptron's best prediction, whose scores are 1.79, 0.21, −1.09, 1.46, 0.59.)

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions: MR plans are underspecified or carry ignorable details, and are sometimes inaccurate yet contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use all candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
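The multi-parse update can be sketched as a rate-weighted perceptron step. The scaling by the success-rate gap follows the slide's description; the dict-based feature representation and exact form are illustrative assumptions.

```python
def multi_parse_update(weights, candidates, features, rates):
    """Response-based update with multiple parses (sketch).

    Every candidate whose execution success rate beats that of the
    current best prediction contributes a feature-difference update
    scaled by the rate gap, spreading the weak response signal over
    all preferable parses instead of a single pseudo-gold.
    """
    def score(c):
        return sum(weights.get(f, 0.0) * v for f, v in features(c).items())
    pred = max(candidates, key=score)
    for cand in candidates:
        gap = rates[cand] - rates[pred]
        if gap > 0:
            for f, v in features(cand).items():
                weights[f] = weights.get(f, 0.0) + gap * v
            for f, v in features(pred).items():
                weights[f] = weights.get(f, 0.0) - gap * v
    return weights
```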

(Slides 77–78 diagrams: weight update with multiple candidates. The perceptron's current best prediction is candidate 2 (score 1.83, execution success rate 0.4); candidates 1 and 4, whose success rates 0.6 and 0.9 are higher, each trigger a feature-vector-difference update against the prediction.)

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Example: "Turn left and find the sofa then turn around the corner"
  L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L2: Turn(LEFT), Verify(front: SOFA)     L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L4: Turn(LEFT)     L5: Travel(), Verify(at: SOFA)     L6: Turn()

  Example features: f(L1→L3) = 1, f(L3→L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1
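Indicator features of this kind can be extracted from a nested-tuple parse tree as follows. A simplified sketch: it emits only parent→children rule indicators and lexical attachments, not the full feature set of the thesis.

```python
def tree_features(tree):
    """Binary indicator features over parse-tree compositions (sketch).

    `tree` is a nested tuple (label, child1, child2, ...); leaves are
    plain strings (words).  Each internal node emits one indicator
    (label, (child labels...)), covering both nonterminal rules and
    lexical attachments like (L5, ("find",)).
    """
    feats = set()
    def walk(node):
        if isinstance(node, str):
            return
        label = node[0]
        kids = tuple(c if isinstance(c, str) else c[0] for c in node[1:])
        feats.add((label, kids))
        for c in node[1:]:
            walk(c)
    walk(tree)
    return feats
```

In the reranker these indicators form the sparse feature vector that the perceptron scores.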

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy; plan execution accuracy (end goal)
• Compared with two baseline models
  – The Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We collect up to 50 distinct composed MR plans and their parses out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best list is generated from the baseline model

Response-based Update vs. Baseline (English)

  Metric            Hierarchy: baseline → response   Unigram: baseline → response
  Parse F1          74.81 → 73.32                    76.44 → 77.24
  Single-sentence   57.22 → 59.65                    67.14 → 68.27
  Paragraph         20.17 → 22.62                    28.12 → 29.2

Response-based Update vs. Baseline (Chinese-Word)

  Metric            Hierarchy: baseline → response   Unigram: baseline → response
  Parse F1          75.53 → 77.26                    76.41 → 77.74
  Single-sentence   61.03 → 64.12                    63.4 → 65.64
  Paragraph         19.08 → 21.29                    23.12 → 23.74

Response-based Update vs. Baseline (Chinese-Character)

  Metric            Hierarchy: baseline → response   Unigram: baseline → response
  Parse F1          73.05 → 76.26                    77.55 → 79.76
  Single-sentence   55.61 → 64.08                    62.85 → 65.5
  Paragraph         12.74 → 22.25                    23.33 → 25.35

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution: it optimizes the model for plan execution

Response-based Update with Multiple vs. Single Parses (English)

  Metric            Hierarchy: single → multi   Unigram: single → multi
  Parse F1          73.32 → 73.43               77.24 → 77.81
  Single-sentence   59.65 → 62.81               68.27 → 68.93
  Paragraph         22.62 → 26.57               29.2 → 29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

  Metric            Hierarchy: single → multi   Unigram: single → multi
  Parse F1          77.26 → 78.8                77.74 → 78.11
  Single-sentence   64.12 → 64.15               65.64 → 66.27
  Paragraph         21.29 → 21.55               23.74 → 25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

  Metric            Hierarchy: single → multi   Unigram: single → multi
  Parse F1          76.26 → 79.44               79.76 → 79.94
  Single-sentence   64.08 → 64.08               65.5 → 66.84
  Paragraph         22.25 → 22.58               25.35 → 27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Future Directions

• Integrating syntactic components: learn a joint model of syntactic and semantic structure
• Large-scale data: data collection and model adaptation to large scale
• Machine translation: application to summarized translation
• Real perceptual data: learn with raw features (sensory and vision data)

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework, a full probabilistic model, for learning NL–MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 35: Grounded Language Learning Models for Ambiguous  Supervision

35

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

(Supervised) Semantic Parser Learner

Plan Refinement

Semantic Parser

Action Trace

System Diagram (Chen and Mooney 2011)

Landmarks Plan

Supervised Refined Plan

LearningInference

Possibleinformationloss

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model

• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Diagram, slides 51-54: unigram-model parse of the test NL instruction "Turn left and find the sofa, then turn around the corner". The context MR Turn(LEFT) Verify(front: BLUE HALL, front: SOFA) Travel(steps: 2) Verify(at: SOFA) Turn(RIGHT) appears in the top nonterminal; the relevant lexemes Turn(LEFT), Travel() Verify(at: SOFA), and Turn() are marked within it to compose the final MR parse.]

54

Data
• 3 maps, 6 instructors, 1-15 followers per direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph:
"Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."

Single sentences:
"Take the wood path towards the easel."
"At the easel go left and then take a right on the the blue path at the corner."
...

Action traces: Turn, Forward | Turn left, Forward, Turn right, Forward ×3, Turn right, Forward | Forward, Turn left, Forward, Turn right | Turn

Data Statistics

56

                               Paragraph        Single-Sentence
Instructions                   706              3236
Avg. # sentences               5.0 (±2.8)       1.0 (±0)
Avg. # actions                 10.4 (±5.7)      2.1 (±2.4)
Avg. # words/sent.
  English                      37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word                 31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character            48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                      660              629
  Chinese-Word                 661              508
  Chinese-Character            448              328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57
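The leave-one-map-out protocol above can be sketched as three train/test folds. The map names used here are an assumption (the standard navigation-corpus maps), not stated on this slide.

```python
# Sketch of leave-one-map-out evaluation: train on two maps, test on the third.
# Map names are assumed for illustration.
maps = ["Grid", "Jelly", "L"]

def leave_one_map_out(maps):
    for held_out in maps:
        train = [m for m in maps if m != held_out]
        yield train, held_out

splits = list(leave_one_map_out(maps))
# three folds, e.g. (["Jelly", "L"], "Grid")
```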

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data

• Metric: partial parse accuracy

58

Parse Accuracy (English)

System                            Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

System                            Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

System                            Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan from the output of the semantic parser reaches the destination

• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution

62
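The strict metric above can be sketched as exact state matching. The (x, y, orientation) state encoding here is an assumption for illustration, not the thesis representation.

```python
# Sketch of the strict end-to-end metric: a single-sentence execution counts
# only on an exact match, including facing direction; paragraph success
# requires reaching the exact final position. State encoding is assumed.
def single_sentence_success(final_state, goal_state):
    return final_state == goal_state            # exact (x, y, orientation) match

def paragraph_success(final_state, goal_position):
    # one failed sentence-level step derails everything after it,
    # so only the exact final position decides paragraph success
    return final_state[:2] == goal_position

ok = single_sentence_success((2, 3, "N"), (2, 3, "N"))
bad = single_sentence_success((2, 3, "N"), (2, 3, "E"))   # wrong facing direction
```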

End-to-End Execution Evaluations (English)

System                            Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

System                            Single-Sentence   Paragraph
Chen (2012)                       58.7              20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

System                            Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation PCFG     Unigram Generation PCFG
Data                  |Grammar|    Time (hrs)       |Grammar|    Time (hrs)
English               20,451       17.26            16,357       8.78
Chinese (Word)        21,636       15.99            15,459       8.05
Chinese (Character)   19,792       18.64            13,514       12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability
[Diagram, slide 70: a testing example is fed to the trained generative model, which returns the 1-best candidate with maximum probability.]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram, slide 71: the trained baseline generative model generates n-best candidates (GEN: Candidate 1 ... Candidate n); a trained secondary discriminative model selects the best prediction as the output.]

71

How can we apply discriminative reranking

• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Training instead provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (also used in evaluating the final end-task plan execution)
  – Gives a weak indication of whether a candidate is good/bad
  – Use multiple candidate parses for the parameter update
    (the response signal is weak and distributed over all candidates)

72

Reranking Model Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram, slide 73: the n-best candidates carry feature vectors a1 ... an and perceptron scores (e.g., -0.16, 1.21, -1.09, 1.46, 0.59); the weights are updated by the feature-vector difference a_g - a_4 between the gold-standard reference and the best prediction. For our generative models, such a gold-standard reference is not available.]

73
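The averaged perceptron described above can be sketched as follows, with toy feature vectors and candidates standing in for the real parse features of the thesis system.

```python
# Hedged sketch of an averaged-perceptron reranker (in the style of Collins 2000).
# Toy feature vectors; the real system scores n-best parse candidates.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_reranker(examples, n_epochs=5):
    """examples: list of (candidate_feature_vectors, gold_index)."""
    dim = len(examples[0][0][0])
    w = [0.0] * dim
    w_sum = [0.0] * dim           # running sum of weights for averaging
    n = 0
    for _ in range(n_epochs):
        for feats, gold in examples:
            pred = max(range(len(feats)), key=lambda i: dot(w, feats[i]))
            if pred != gold:
                # standard perceptron update toward the gold candidate
                w = [wi + g - p for wi, g, p in zip(w, feats[gold], feats[pred])]
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            n += 1
    return [s / n for s in w_sum]  # averaged weights reduce overfitting

examples = [
    ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], 1),
    ([[0.2, 0.1], [0.1, 0.9], [0.8, 0.0]], 1),
]
w = train_reranker(examples)
best = max(range(3), key=lambda i: dot(w, examples[0][0][i]))
# the averaged model now ranks the gold candidate first
```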

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic: average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
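Pseudo-gold selection can be sketched as follows. Here `execute` is a deterministic stand-in for the nondeterministic MARCO execution module; the thesis averages real executions over 10 trials.

```python
# Sketch of pseudo-gold selection by execution feedback. `execute` and the
# per-plan success counts are illustrative stand-ins for MARCO runs.
def execute(mr_plan, world, trial):
    # illustrative: each plan succeeds on a fixed number of the trials
    return trial < world["successes"][mr_plan]

def execution_success_rate(mr_plan, world, n_trials=10):
    return sum(execute(mr_plan, world, t) for t in range(n_trials)) / n_trials

def pick_pseudo_gold(candidates, world):
    rates = [execution_success_rate(mr, world) for mr in candidates]
    return max(range(len(candidates)), key=lambda i: rates[i]), rates

world = {"successes": {"planA": 9, "planB": 2, "planC": 5}}
idx, rates = pick_pseudo_gold(["planA", "planB", "planC"], world)
# planA, with the best success rate, becomes the pseudo-gold reference
```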

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram, slide 75: the MRs derived from the n-best candidates (MR1 ... MRn) are run through the MARCO execution module, yielding execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron weights are updated by the feature-vector difference between it and the best prediction.]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold can also be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still indicate a correct plan, given the indirect supervision from human follower actions
    (MR plans may be underspecified or carry ignorable details; sometimes inaccurate, yet containing the correct MR components to reach the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
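The multiple-parse update above can be sketched as follows; the feature vectors and success rates are toy values, not the thesis features.

```python
# Sketch of the multiple-parse weight update: every candidate whose execution
# success rate beats the current prediction contributes an update, weighted by
# the success-rate difference. Toy vectors for illustration.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multi_parse_update(w, feats, exec_rates):
    """feats: one feature vector per candidate; exec_rates: success rate per candidate."""
    pred = max(range(len(feats)), key=lambda i: dot(w, feats[i]))
    for i, rate in enumerate(exec_rates):
        if rate > exec_rates[pred]:
            scale = rate - exec_rates[pred]   # stronger signal -> bigger step
            w = [wi + scale * (fi - fp)
                 for wi, fi, fp in zip(w, feats[i], feats[pred])]
    return w

feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
rates = [0.2, 0.9, 0.6]
w = multi_parse_update([0.0, 0.0], feats, rates)
# candidates 1 and 2 (rates 0.9, 0.6) both push the weights away from
# candidate 0 (rate 0.2), which the zero-weight model currently predicts
```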

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, slides 77-78: among the n-best candidates' derived MRs (MR1 ... MRn) with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 from the MARCO execution module, every candidate whose rate exceeds the predicted parse's contributes a weighted feature-vector difference to the perceptron update.]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT) Verify(front: SOFA, back: EASEL) Travel(steps: 2) Verify(at: SOFA) Turn(RIGHT)
L2: Turn(LEFT) Verify(front: SOFA)    L3: Travel(steps: 2) Verify(at: SOFA) Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel() Verify(at: SOFA)    L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
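Extracting such binary composition features from a parse tree can be sketched as follows. The nested-tuple tree encoding and feature-name formats are assumptions for illustration, not the thesis format.

```python
# Sketch of binary parse-tree composition features: each observed full
# composition, parent->child pair, and lexeme->word attachment becomes one
# indicator feature. Tree encoding is an illustrative assumption.
def extract_features(tree):
    """tree: (label, children) where children are subtrees or NL word strings."""
    feats = set()
    label, children = tree
    if children and isinstance(children[0], tuple):
        feats.add(f"{label}=>{' '.join(c[0] for c in children)}")  # full composition
        for child in children:
            feats.add(f"{label}->{child[0]}")                      # parent-child pair
            feats |= extract_features(child)
    else:
        for w in children:                                         # lexeme -> NL word
            feats.add(f"{label},{w}")
    return feats

tree = ("L1", [("L3", [("L5", ["find", "sofa"]),
                       ("L6", ["turn"])])])
feats = extract_features(tree)
# contains e.g. "L1->L3", "L3=>L5 L6", "L5,find"
```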

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50-best distinct composed MR plans (and corresponding parses) out of the 1,000,000-best parses
    (many parse trees differ insignificantly, leading to the same derived MR plans; a sufficiently large 1,000,000-best parse list is generated from the baseline model)

80

Response-based Update vs. Baseline (English)

81

                   Baseline   Response-based
Parse F1
  Hierarchy        74.81      73.32
  Unigram          76.44      77.24
Single-sentence
  Hierarchy        57.22      59.65
  Unigram          67.14      68.27
Paragraph
  Hierarchy        20.17      22.62
  Unigram          28.12      29.2

Response-based Update vs. Baseline (Chinese-Word)

82

                   Baseline   Response-based
Parse F1
  Hierarchy        75.53      77.26
  Unigram          76.41      77.74
Single-sentence
  Hierarchy        61.03      64.12
  Unigram          63.4       65.64
Paragraph
  Hierarchy        19.08      21.29
  Unigram          23.12      23.74

Response-based Update vs. Baseline (Chinese-Character)

83

                   Baseline   Response-based
Parse F1
  Hierarchy        73.05      76.26
  Unigram          77.55      79.76
Single-sentence
  Hierarchy        55.61      64.08
  Unigram          62.85      65.5
Paragraph
  Hierarchy        12.74      22.25
  Unigram          23.33      25.35

Response-based Update vs Baseline

• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

                   Single    Multi
Parse F1
  Hierarchy        73.32     73.43
  Unigram          77.24     77.81
Single-sentence
  Hierarchy        59.65     62.81
  Unigram          68.27     68.93
Paragraph
  Hierarchy        22.62     26.57
  Unigram          29.2      29.1

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

                   Single    Multi
Parse F1
  Hierarchy        77.26     78.8
  Unigram          77.74     78.11
Single-sentence
  Hierarchy        64.12     64.15
  Unigram          65.64     66.27
Paragraph
  Hierarchy        21.29     21.55
  Unigram          23.74     25.95

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

                   Single    Multi
Parse F1
  Hierarchy        76.26     79.44
  Unigram          79.76     79.94
Single-sentence
  Hierarchy        64.08     64.08
  Unigram          65.5      66.84
Paragraph
  Hierarchy        22.25     22.58
  Unigram          25.35     27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable because training data must be annotated

• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain

• Our proposed models provide a general framework: a full probabilistic model for learning NL-MR correspondences from ambiguous supervision

• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Page 36: Grounded Language Learning Models for Ambiguous  Supervision

36

Learning system for parsing navigation instructions

Observation

Instruction

World State

Execution Module (MARCO)

Instruction

World State

TrainingTesting

Action TraceNavigation Plan Constructor

Probabilistic Semantic Parser Learner (from

ambiguous supervison)

Semantic Parser

Action Trace

System Diagram of Proposed Solution

Landmarks Plan

PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)

37

bull PCFG rules to describe generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney 2012)

bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings

1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training

databull Proposed model

ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept

(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR

parse with semantic lexeme MRs38

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead provides weak supervision of surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    • Response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)
• Parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates a_1 … a_n with perceptron scores (-0.16, 1.21, -1.09, 1.46, 0.59); the perceptron's best prediction (a_4) is compared against the gold-standard reference a_g, and the weights are updated by the feature-vector difference a_g - a_4. Note: a gold-standard reference is not available for our generative models.]
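The per-example update on this slide can be sketched as follows. This is a minimal sketch of the perceptron update over an n-best list, with hypothetical feature dictionaries standing in for the real feature vectors; the weight averaging that the averaged perceptron adds on top is omitted.

```python
from collections import defaultdict

def score(weights, feats):
    """Dot product of a sparse weight vector with a sparse feature vector."""
    return sum(weights[f] * v for f, v in feats.items())

def perceptron_update(weights, candidate_feats, gold_index):
    """One perceptron step over an n-best list: if the highest-scoring
    candidate is not the (pseudo-)gold one, add the gold feature vector
    and subtract the predicted one. Returns the predicted index."""
    predicted = max(range(len(candidate_feats)),
                    key=lambda i: score(weights, candidate_feats[i]))
    if predicted != gold_index:
        for f, v in candidate_feats[gold_index].items():
            weights[f] += v
        for f, v in candidate_feats[predicted].items():
            weights[f] -= v
    return predicted
```

After one mistaken prediction, the weights already rank the gold candidate first on this toy example.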

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – Most preferred one in terms of plan execution
  – Evaluate composed MR plans from candidate parses
  – MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
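The pseudo-gold selection above can be sketched as below, with `execute` a hypothetical stand-in for one trial of an execution module such as MARCO; rates are averaged over repeated trials because execution is nondeterministic.

```python
def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Pick the candidate MR plan with the highest execution success rate.
    `execute(mr)` runs one (possibly nondeterministic) trial and returns
    True iff the plan reaches the intended destination; the success rate
    is averaged over `trials` runs (10 in the setup described above)."""
    rates = [sum(execute(mr) for _ in range(trials)) / float(trials)
             for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```

With a deterministic stub executor, the candidate that reaches the goal gets rate 1.0 and is chosen as pseudo-gold.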

Response-based Update
• Select pseudo-gold reference based on MARCO execution results

[Diagram: n-best candidates → derived MRs (MR_1 … MR_n) → MARCO Execution Module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) becomes the pseudo-gold reference. The perceptron's best prediction by score (1.79, 0.21, -1.09, 1.46, 0.59) is compared against it, and the weights are updated by the feature-vector difference.]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
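The multi-parse variant above can be sketched as follows; all names are hypothetical, and the feature vectors are sparse dictionaries as in the earlier perceptron sketch.

```python
from collections import defaultdict

def multi_parse_update(weights, candidate_feats, rates, predicted):
    """Update toward every candidate whose execution success rate exceeds
    that of the currently predicted parse, weighting each feature-vector
    difference by the gap in success rates. `weights` is assumed to be a
    defaultdict(float) so unseen features start at zero."""
    for i, rate in enumerate(rates):
        gap = rate - rates[predicted]
        if gap <= 0:
            continue  # only candidates that execute better than the prediction
        keys = set(candidate_feats[i]) | set(candidate_feats[predicted])
        for f in keys:
            diff = (candidate_feats[i].get(f, 0.0)
                    - candidate_feats[predicted].get(f, 0.0))
            weights[f] += gap * diff
```

Candidates with larger success-rate gaps pull the weights proportionally harder, so a 0.9-rate parse contributes more than a 0.6-rate one.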

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): derived MRs (MR_1 … MR_n) with execution success rates (0.6, 0.4, 0.0, 0.9, 0.2) and perceptron scores (1.24, 1.83, -1.09, 1.46, 0.59); the weights are moved by the feature-vector difference toward the first candidate whose rate beats the prediction's.]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): the same n-best list (execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59); the update is repeated for the next candidate whose rate beats the prediction's, again weighted by the feature-vector difference.]

78

Features
• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1    f(L3 → L5 ∨ L1) = 1    f(L3 ⇒ L5 L6) = 1    f(L5, "find") = 1

79
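Extracting such indicator features from a parse tree can be sketched as below. The nested-tuple tree encoding and feature naming are illustrative assumptions; the actual feature set in the paper also includes longer-range compositions.

```python
def tree_features(tree):
    """Collect binary parent->children indicator features from a parse
    tree given as nested tuples, e.g. ("L1", ("L3", ("L5", "find"))).
    Each internal node contributes one rule-composition indicator."""
    feats = {}

    def walk(node):
        if isinstance(node, tuple) and len(node) > 1:
            label, children = node[0], node[1:]
            # Right-hand side: child labels for subtrees, the word itself for leaves.
            rhs = " ".join(c[0] if isinstance(c, tuple) else c for c in children)
            feats["%s -> %s" % (label, rhs)] = 1.0
            for child in children:
                walk(child)

    walk(tree)
    return feats
```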

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50-best distinct composed MR plans and corresponding parses out of 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse trees from the baseline model

80

Response-based Update vs. Baseline (English)

81

            Parse F1              Single-sentence        Paragraph
            Baseline  Response    Baseline  Response     Baseline  Response
Hierarchy   74.81     73.32       57.22     59.65        20.17     22.62
Unigram     76.44     77.24       67.14     68.27        28.12     29.2

Response-based Update vs. Baseline (Chinese-Word)

82

            Parse F1              Single-sentence        Paragraph
            Baseline  Response    Baseline  Response     Baseline  Response
Hierarchy   75.53     77.26       61.03     64.12        19.08     21.29
Unigram     76.41     77.74       63.4      65.64        23.12     23.74

Response-based Update vs. Baseline (Chinese-Character)

83

            Parse F1              Single-sentence        Paragraph
            Baseline  Response    Baseline  Response     Baseline  Response
Hierarchy   73.05     76.26       55.61     64.08        12.74     22.25
Unigram     77.55     79.76       62.85     65.5         23.33     25.35

Response-based Update vs. Baseline
• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

            Parse F1           Single-sentence     Paragraph
            Single   Multi     Single   Multi      Single   Multi
Hierarchy   73.32    73.43     59.65    62.81      22.62    26.57
Unigram     77.24    77.81     68.27    68.93      29.2     29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

            Parse F1           Single-sentence     Paragraph
            Single   Multi     Single   Multi      Single   Multi
Hierarchy   77.26    78.8      64.12    64.15      21.29    21.55
Unigram     77.74    78.11     65.64    66.27      23.74    25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

            Parse F1           Single-sentence     Paragraph
            Single   Multi     Single   Multi      Single   Multi
Hierarchy   76.26    79.44     64.08    64.08      22.25    22.58
Unigram     79.76    79.94     65.5     66.84      25.35    27.16

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable due to annotation of training data
• Grounded language learning from relevant perceptual context is promising, and the training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 37: Grounded Language Learning Models for Ambiguous  Supervision

PCFG Induction Model for Grounded Language Learning (Borschinger et al., 2011)

37

• PCFG rules describe the generative process from MR components to corresponding NL words

Hierarchy Generation PCFG Model (Kim and Mooney, 2012)
• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (on the order of 10s)
  – Only outputs MRs included in the PCFG constructed from training data
• Proposed model
  – Use semantic lexemes as units of semantic concepts
  – Disambiguate NL-MR correspondences at the semantic concept (lexeme) level
  – Disambiguate a much higher level of ambiguous supervision
  – Output novel MRs not appearing in the PCFG by composing the MR parse with semantic lexeme MRs

38

Semantic Lexicon (Chen and Mooney, 2011)
• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks, plans)
  – How probable graph g is given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()

39

[The score contrasts the cooccurrence of g and w against the general occurrence of g without w.]
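The contrast on this slide can be sketched as a simple score. This is an assumed, simplified form of a Chen & Mooney-style lexicon score (the exact formula and smoothing in the original may differ), with all counts taken over training examples.

```python
def lexeme_score(cooc, w_count, g_count, n_examples):
    """How probable subgraph g is given phrase w, minus how often g
    occurs without w. Arguments: cooc = examples containing both w and g;
    w_count / g_count = examples containing w / g; n_examples = total
    number of training examples."""
    p_g_given_w = cooc / float(w_count)
    p_g_without_w = (g_count - cooc) / float(max(n_examples - w_count, 1))
    return p_g_given_w - p_g_without_w
```

A subgraph that appears whenever the phrase does, and rarely otherwise, scores near 1; a subgraph that mostly appears without the phrase scores negatively.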

Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts and are further connected to NL word groundings

40

[Diagram: an LHG whose root MR Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into smaller lexeme MRs such as Turn(), Travel(), Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK), Travel(); and Turn(), Verify(side: HATRACK).]

PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which are finally connected to NL instructions
    • Each node generates all k-permutations of children nodes, since we do not know which subset is correct
  – NL words are generated by lexeme nodes via a unigram Markov process (Borschinger et al., 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated

41

PCFG Construction

42

[Diagram: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word.]

Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm with training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
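The Viterbi-style CKY step mentioned above can be sketched as follows. This is a minimal probabilistic CKY for a PCFG in Chomsky normal form with toy symbols; the actual grammar here is the induced hierarchy/unigram PCFG, and a real implementation would also keep back-pointers to recover the tree.

```python
def cky_best(words, lexicon, rules):
    """Best inside scores for each (start, end, symbol) span.
    `lexicon` maps (symbol, word) -> probability;
    `rules` maps (A, (B, C)) -> probability for binary rules A -> B C."""
    n = len(words)
    best = {}
    # Width-1 spans come from the lexicon.
    for i, w in enumerate(words):
        for (sym, word), p in lexicon.items():
            if word == w and p > best.get((i, i + 1, sym), 0.0):
                best[(i, i + 1, sym)] = p
    # Wider spans combine two adjacent sub-spans with a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):
                for (a, (b, c)), p in rules.items():
                    s = p * best.get((i, j, b), 0.0) * best.get((j, k, c), 0.0)
                    if s > best.get((i, k, a), 0.0):
                        best[(i, k, a)] = s
    return best
```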

[Slides 44-46: worked example. The most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner"; responsible lexeme MR components (e.g., Turn(LEFT); Travel(), Verify(at: SOFA), Turn()) are marked from the bottom up and composed into the final MR plan Turn(LEFT), Verify(front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT).]

46

Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generate relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already considered

48

PCFG Construction

49

[Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words.]

Parsing New NL Sentences
• Follows a similar scheme as the Hierarchy Generation PCFG model
• Compose the final MR parse from lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Slides 51-54: worked example for "Turn left and find the sofa then turn around the corner". The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(); their components are then marked in the context MR to compose the final parse.]

54

Data
• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

[Example: the paragraph "Take the wood path towards the easel. At the easel go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7." is segmented into single sentences, each paired with an action sequence such as Forward, Turn left, Forward, Turn right.]

Data Statistics

56

                                        Paragraph       Single-Sentence
# Instructions                          706             3236
Avg. # sentences                        5.0 (±2.8)      1.0 (±0)
Avg. # actions                          10.4 (±5.7)     2.1 (±2.4)
Avg. # words/sent.  English             37.6 (±21.1)    7.8 (±5.1)
                    Chinese-Word        31.6 (±18.1)    6.9 (±4.9)
                    Chinese-Character   48.9 (±28.3)    10.6 (±7.3)
Vocabulary          English             660             629
                    Chinese-Word        661             508
                    Chinese-Character   448             328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser KRISP (Kate and Mooney, 2006) trained on the resulting supervised data

57
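The leave-one-map-out protocol above can be sketched as a simple fold generator; the map names in the usage check below are placeholders.

```python
def leave_one_map_out(maps):
    """Yield (train_maps, test_map) folds where each map is held out once.
    With three maps this gives three folds, each training on two maps and
    testing on the remaining one."""
    for i, held_out in enumerate(maps):
        train = [m for j, m in enumerate(maps) if j != i]
        yield train, held_out
```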

Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.7              20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation       Unigram Generation
                      PCFG Model                 PCFG Model
Data                  |Grammar|   Time (hrs)     |Grammar|   Time (hrs)
English               20,451      17.26          16,357      8.78
Chinese (Word)        21,636      15.99          15,459      8.05
Chinese (Character)   19,792      18.64          13,514      12.58

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation required for training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English)
  • Response-based Update with Multiple vs Single Parses (Chinese-Word)
  • Response-based Update with Multiple vs Single Parses (Chinese-Character)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 39: Grounded Language Learning Models for Ambiguous  Supervision

bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and

context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w

bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()

Semantic Lexicon (Chen and Mooney 2011)

39

cooccurrenceof g and w

general occurrenceof g without w

Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

• Follows a similar scheme as in the Hierarchy Generation PCFG model

• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Slide diagrams: the most probable parse tree for "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT), Travel(), Verify(at: SOFA), Turn() one by one; their components are marked in the context MR to compose the final parse.]

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair, take a right towards the stool. When you reach the stool you are at 7."

Single sentence → action sequence:
  "Take the wood path towards the easel." → Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward
  "At the easel, go left and then take a right on the the blue path at the corner." → Forward, Turn left, Forward, Turn right

Data Statistics

56

                                         Paragraph       Single-Sentence
# Instructions                           706             3236
Avg. # sentences                         5.0 (±2.8)      1.0 (±0)
Avg. # actions                           10.4 (±5.7)     2.1 (±2.4)
Avg. # words/sent.  English              37.6 (±21.1)    7.8 (±5.1)
                    Chinese-Word         31.6 (±18.1)    6.9 (±4.9)
                    Chinese-Character    48.9 (±28.3)    10.6 (±7.3)
Vocabulary          English              660             629
                    Chinese-Word         661             508
                    Chinese-Character    448             328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    • Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data

• Metric: partial parse accuracy

58

Parse Accuracy (English)

                                     Precision   Recall   F1
Chen & Mooney (2011)                 90.16       55.41    68.59
Chen (2012)                          88.36       57.03    69.31
Hierarchy Generation PCFG Model      87.58       65.41    74.81
Unigram Generation PCFG Model        86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                     Precision   Recall   F1
Chen (2012)                          88.87       58.76    70.74
Hierarchy Generation PCFG Model      80.56       71.14    75.53
Unigram Generation PCFG Model        79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                     Precision   Recall   F1
Chen (2012)                          92.48       56.47    70.01
Hierarchy Generation PCFG Model      79.77       67.38    73.05
Unigram Generation PCFG Model        79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan from the output of the semantic parser reaches the destination

• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                                     Single-Sentence   Paragraph
Chen & Mooney (2011)                 54.4              16.18
Chen (2012)                          57.28             19.18
Hierarchy Generation PCFG Model      57.22             20.17
Unigram Generation PCFG Model        67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                     Single-Sentence   Paragraph
Chen (2012)                          58.7              20.13
Hierarchy Generation PCFG Model      61.03             19.08
Unigram Generation PCFG Model        63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                     Single-Sentence   Paragraph
Chen (2012)                          57.27             16.73
Hierarchy Generation PCFG Model      55.61             12.74
Unigram Generation PCFG Model        62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak in the Chinese-character corpus: the longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

Comparison of Grammar Size and EM Training Time

                        Hierarchy Generation PCFG    Unigram Generation PCFG
Data                    |Grammar|    Time (hrs)      |Grammar|    Time (hrs)
English                 20,451       17.26           16,357       8.78
Chinese (Word)          21,636       15.99           15,459       8.05
Chinese (Character)     19,792       18.64           13,514       12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability

[Diagram: for a testing example, the trained generative model outputs the 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: for a testing example, the trained baseline generative model generates n-best candidates (GEN); a trained secondary discriminative model picks the best prediction as output]

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: for a training example, the trained baseline generative model generates n-best candidates (GEN) with feature vectors a1 … an and perceptron scores (−0.16, 1.21, −1.09, 1.46, 0.59); when the best prediction (a4) differs from the gold-standard reference ag, the weights are updated with the feature-vector difference ag − a4. A gold-standard reference is not available for our generative models.]
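The averaged-perceptron reranker described above can be sketched as follows. This is a minimal illustration after Collins (2000), not the thesis code; the example representation (a feature matrix per training example plus a gold index) is an assumption.

```python
import numpy as np

def train_averaged_perceptron(examples, n_feats, epochs=5):
    """examples: list of (feats, gold) where feats is an
    (n_candidates, n_feats) array and gold indexes the reference
    candidate.  Returns the averaged weight vector."""
    w = np.zeros(n_feats)
    w_sum = np.zeros(n_feats)
    steps = 0
    for _ in range(epochs):
        for feats, gold in examples:
            pred = int(np.argmax(feats @ w))  # current best prediction
            if pred != gold:
                # update with the feature-vector difference
                w += feats[gold] - feats[pred]
            w_sum += w
            steps += 1
    return w_sum / steps  # averaging reduces variance of the final model

def rerank(w, feats):
    """Pick the candidate with the highest perceptron score."""
    return int(np.argmax(feats @ w))
```

In the grounded setting, the `gold` index is replaced by the pseudo-gold candidate selected from execution feedback, as the next slides describe.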

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
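The pseudo-gold selection just described can be sketched as follows. The `execute` callable is a hypothetical stand-in for one nondeterministic MARCO run returning whether the plan reached the destination.

```python
def select_pseudo_gold(candidate_mrs, execute, trials=10):
    """Pick the pseudo-gold candidate by execution success rate.
    execute(mr) -> bool simulates one run of the execution module;
    rates are averaged over `trials` runs per candidate (MARCO is
    nondeterministic)."""
    rates = []
    for mr in candidate_mrs:
        successes = sum(1 for _ in range(trials) if execute(mr))
        rates.append(successes / trials)
    # candidate with the best execution success rate becomes pseudo-gold
    best = max(range(len(rates)), key=lambda i: rates[i])
    return best, rates
```

The returned index plays the role of the gold index in the perceptron update; the per-candidate rates are also reused by the multiple-parse update below.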

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: each of the n-best candidates derives an MR (MR1 … MRn); the MARCO execution module assigns each an execution success rate (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) becomes the pseudo-gold reference, and the perceptron weights are updated with the feature-vector difference between it and the best prediction by perceptron score (1.79, 0.21, −1.09, 1.46, 0.59)]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal

• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with feature-vector differences weighted by the difference between execution success rates

76
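The multiple-parse update above can be sketched as one perceptron step. This is a hypothetical sketch: every candidate whose execution success rate beats the current prediction's rate contributes a feature difference, scaled by the rate gap.

```python
import numpy as np

def multi_parse_update(w, feats, rates):
    """One response-based update with multiple parses.
    feats: (n_candidates, n_feats) feature vectors
    rates: per-candidate execution success rates."""
    pred = int(np.argmax(feats @ w))  # currently best-predicted candidate
    for i, rate in enumerate(rates):
        if rate > rates[pred]:
            # weight the difference by the gap in execution success rates
            w = w + (rate - rates[pred]) * (feats[i] - feats[pred])
    return w
```

With zero initial weights, two candidates with feature vectors [1, 0] and [0, 1] and rates 0.2 and 0.9, the update moves the weights by 0.7 × ([0, 1] − [1, 0]).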

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagrams: the perceptron's best prediction has execution success rate 0.4; updates (1) and (2) are applied toward each candidate with a higher rate (0.6 and 0.9), using feature-vector differences weighted by the gap in execution success rates; perceptron scores shift accordingly (e.g. 1.24, 1.83, −1.09, 1.46, 0.59)]

78

Features

• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)        L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)        L5: Travel(), Verify(at: SOFA)        L6: Turn()

Example features: f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
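Binary indicator features of this kind can be extracted with a short tree walk. This is an illustrative sketch, not the thesis feature set; the `(label, children)` tuple encoding of parse trees is an assumption.

```python
def rule_features(tree):
    """Binary indicator features over parse-tree compositions:
    one feature per parent -> children nonterminal expansion and
    one per lexeme -> word emission.  `tree` is (label, [children]),
    with plain strings as terminal words."""
    feats = set()
    def walk(node):
        if isinstance(node, str):   # terminal word: emitted by parent
            return
        label, children = node
        kids = tuple(c if isinstance(c, str) else c[0] for c in children)
        feats.add((label, kids))    # e.g. f(L3 -> L5 L6) = 1
        for c in children:
            walk(c)
    walk(tree)
    return feats
```

Each extracted pair acts as a feature firing with value 1; the perceptron weight vector has one entry per such pair.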

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50-best distinct composed MR plans and their corresponding parses out of 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse trees from the baseline model
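The distinct-plan filtering above can be sketched as a simple deduplication pass over the n-best list. This is an assumed sketch; `derive_mr` stands in for composing the MR plan from a parse.

```python
def distinct_mr_parses(nbest, derive_mr, k=50):
    """From an n-best parse list (assumed sorted by model probability),
    keep the top parses whose derived MR plans are distinct, up to k.
    Many parse trees differ only insignificantly and collapse to the
    same MR plan, so a huge n-best list shrinks to a few plans."""
    seen = set()
    kept = []
    for parse in nbest:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept
```
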

Response-based Update vs. Baseline (English)

                    Parse F1             Single-sentence      Paragraph
                    Base     Resp        Base     Resp        Base     Resp
Hierarchy           74.81    73.32       57.22    59.65       20.17    22.62
Unigram             76.44    77.24       67.14    68.27       28.12    29.2

Response-based Update vs. Baseline (Chinese-Word)

                    Parse F1             Single-sentence      Paragraph
                    Base     Resp        Base     Resp        Base     Resp
Hierarchy           75.53    77.26       61.03    64.12       19.08    21.29
Unigram             76.41    77.74       63.4     65.64       23.12    23.74

Response-based Update vs. Baseline (Chinese-Character)

                    Parse F1             Single-sentence      Paragraph
                    Base     Resp        Base     Resp        Base     Resp
Hierarchy           73.05    76.26       55.61    64.08       12.74    22.25
Unigram             77.55    79.76       62.85    65.5        23.33    25.35

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                    Parse F1             Single-sentence      Paragraph
                    Single   Multi       Single   Multi       Single   Multi
Hierarchy           73.32    73.43       59.65    62.81       22.62    26.57
Unigram             77.24    77.81       68.27    68.93       29.2     29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                    Parse F1             Single-sentence      Paragraph
                    Single   Multi       Single   Multi       Single   Multi
Hierarchy           77.26    78.8        64.12    64.15       21.29    21.55
Unigram             77.74    78.11       65.64    66.27       23.74    25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                    Parse F1             Single-sentence      Paragraph
                    Single   Multi       Single   Multi       Single   Multi
Hierarchy           76.26    79.44       64.08    64.08       22.25    22.58
Unigram             79.76    79.94       65.5     66.84       25.35    27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data

• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain

• Our proposed models provide a general framework of a full probabilistic model for learning NL–MR correspondences with ambiguous supervision

• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes

by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic

conceptsndash Lexeme hierarchy = semantic

concept hierarchyndash Shows how complicated

semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings

40

Turn

RIGHT sideHATRACK

frontSOFA

steps3

atEASEL

Verify Travel Verify

Turn

atEASEL

Travel Verify

atEASEL

Verify

Turn

RIGHT sideHATRACK

Verify Travel

Turn

sideHATRACK

Verify

PCFG Construction

bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to

describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes

ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram

Markov process (Borschinger et al 2011)

ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible

combinations are estimated41

PCFG Construction

42

m

Child concepts are generated from parent concepts selec-tively

All semantic concepts gen-erate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the MRs MR₁ … MRₙ derived from the n-best candidate parses are run through the MARCO execution module, yielding execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) becomes the pseudo-gold reference, and the perceptron weights are updated with the feature-vector difference between it and the current best prediction (perceptron scores e.g. 1.79, 0.21, −1.09, 1.46, 0.59)]

75
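Pseudo-gold selection can be sketched as below. Here `execute` is a stand-in for one nondeterministic MARCO trial that reports whether the MR plan reached the destination; all names are illustrative.

```python
def execution_success_rate(mr, execute, trials=10):
    """Average success over repeated trials, since the execution module
    is nondeterministic; execute(mr) returns True/False for one trial."""
    return sum(bool(execute(mr)) for _ in range(trials)) / trials

def pseudo_gold(candidate_mrs, execute, trials=10):
    """Prefer the candidate MR plan with the best average success rate."""
    return max(candidate_mrs,
               key=lambda mr: execution_success_rate(mr, execute, trials))
```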

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still correspond to correct plans, given the indirect supervision of human follower actions: MR plans may be underspecified or carry ignorable details, and sometimes inaccurate plans still contain the correct MR components needed to reach the desired goal
• Weight update with multiple candidate parses
  – Use every candidate with a higher execution success rate than the currently best-predicted candidate
  – Update with each feature-vector difference, weighted by the difference between execution success rates

76
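A minimal sketch of this multi-parse update, assuming sparse feature dicts and precomputed execution success rates (names are illustrative, not the thesis implementation):

```python
def multi_parse_update(w, candidate_feats, success_rates, predicted_idx):
    """Every candidate whose execution success rate beats the currently
    predicted parse contributes a feature-vector difference, scaled by
    the difference between the two success rates."""
    base = success_rates[predicted_idx]
    pred_feats = candidate_feats[predicted_idx]
    for feats, rate in zip(candidate_feats, success_rates):
        if rate > base:
            scale = rate - base
            for f, v in feats.items():
                w[f] = w.get(f, 0.0) + scale * v
            for f, v in pred_feats.items():
                w[f] = w.get(f, 0.0) - scale * v
    return w
```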

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): the currently predicted candidate has execution success rate 0.2; candidate 1 (rate 0.6) contributes the first feature-vector-difference update, weighted by the rate difference (perceptron scores e.g. 1.24, 1.83, −1.09, 1.46, 0.59)]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): candidate 2 (execution success rate 0.9) likewise beats the predicted parse (rate 0.2) and contributes a second feature-vector-difference update, weighted by its rate difference]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
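Such indicator features can be read off a parse tree with a simple traversal. In this sketch, trees are hypothetical `(label, children)` tuples rather than the thesis's internal representation:

```python
def rule_features(tree, feats=None):
    """Collect binary indicators f(parent -> children) = 1 for every
    composition of nonterminals/terminals appearing in the tree."""
    if feats is None:
        feats = {}
    label, children = tree
    if children:
        # A child is either a subtree tuple or a terminal string.
        rhs = " ".join(c[0] if isinstance(c, tuple) else c for c in children)
        feats[f"{label} -> {rhs}"] = 1
        for c in children:
            if isinstance(c, tuple):
                rule_features(c, feats)
    return feats
```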

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans (and the corresponding parses) out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model

80
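The deduplication step can be sketched as a single scan over the large n-best list; here `derive_mr` is a stand-in for MR composition from a parse tree.

```python
def distinct_by_plan(parses, derive_mr, k=50):
    """Keep the first parse for each distinct derived MR plan, stopping
    once k distinct plans have been collected."""
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept
```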

Response-based Update vs. Baseline (English)

81

Parse F1               Hierarchy   Unigram
  Baseline               74.81       76.44
  Response-based         73.32       77.24

Single-sentence        Hierarchy   Unigram
  Baseline               57.22       67.14
  Response-based         59.65       68.27

Paragraph              Hierarchy   Unigram
  Baseline               20.17       28.12
  Response-based         22.62       29.20

Response-based Update vs. Baseline (Chinese-Word)

82

Parse F1               Hierarchy   Unigram
  Baseline               75.53       76.41
  Response-based         77.26       77.74

Single-sentence        Hierarchy   Unigram
  Baseline               61.03       63.40
  Response-based         64.12       65.64

Paragraph              Hierarchy   Unigram
  Baseline               19.08       23.12
  Response-based         21.29       23.74

Response-based Update vs. Baseline (Chinese-Character)

83

Parse F1               Hierarchy   Unigram
  Baseline               73.05       77.55
  Response-based         76.26       79.76

Single-sentence        Hierarchy   Unigram
  Baseline               55.61       62.85
  Response-based         64.08       65.50

Paragraph              Hierarchy   Unigram
  Baseline               12.74       23.33
  Response-based         22.25       25.35

Response-based Update vs. Baseline

• vs. Baseline
  – The response-based approach performs better on the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

Parse F1               Hierarchy   Unigram
  Single                 73.32       77.24
  Multi                  73.43       77.81

Single-sentence        Hierarchy   Unigram
  Single                 59.65       68.27
  Multi                  62.81       68.93

Paragraph              Hierarchy   Unigram
  Single                 22.62       29.20
  Multi                  26.57       29.10

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

Parse F1               Hierarchy   Unigram
  Single                 77.26       77.74
  Multi                  78.80       78.11

Single-sentence        Hierarchy   Unigram
  Single                 64.12       65.64
  Multi                  64.15       66.27

Paragraph              Hierarchy   Unigram
  Single                 21.29       23.74
  Multi                  21.55       25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

Parse F1               Hierarchy   Unigram
  Single                 76.26       79.76
  Multi                  79.44       79.94

Single-sentence        Hierarchy   Unigram
  Single                 64.08       65.50
  Multi                  64.08       66.84

Paragraph              Hierarchy   Unigram
  Single                 22.25       25.35
  Multi                  22.58       27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses generally improves performance
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learning with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and does not scale, because it requires annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 41: Grounded Language Learning Models for Ambiguous  Supervision

PCFG Construction

• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe; those are finally connected to the NL instruction
  – Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al., 2011)
  – PCFG rule weights are optimized by EM: the most probable MR components out of all possible combinations are estimated

41
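The k-permutation rule expansion above can be enumerated directly. This sketch is only meant to illustrate why the rule set grows quickly with the number of children:

```python
from itertools import permutations

def k_permutation_rules(parent, children):
    """Enumerate one PCFG rule per k-permutation of the children, for
    every k, since we do not know which subset a sentence describes."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules
```

Even three children already yield 3 + 6 + 6 = 15 rules, which is one source of the grammar-size blow-up discussed later.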

PCFG Construction

42

[Diagram: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside–Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components, which propagate to the top level
  – Able to compose novel MRs never seen in the training data

43
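For illustration, the CKY step can be sketched as a minimal recognizer over a probabilistic grammar in Chomsky normal form; the grammars in this work are far larger and are not binarized this way, so all names and rules below are illustrative.

```python
def cky_best(words, rules):
    """Minimal CKY: rules maps (lhs, rhs) -> probability, where rhs is
    either a terminal string or a pair (B, C) of nonterminals. Returns
    the best probability per nonterminal spanning the whole sentence."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    # Fill length-1 spans from terminal rules.
    for i, word in enumerate(words):
        for (lhs, rhs), p in rules.items():
            if rhs == word:
                cell = chart[i][i + 1]
                cell[lhs] = max(cell.get(lhs, 0.0), p)
    # Combine adjacent spans with binary rules.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for mid in range(i + 1, j):
                for (lhs, rhs), p in rules.items():
                    if isinstance(rhs, tuple):
                        b, c = rhs
                        if b in chart[i][mid] and c in chart[mid][j]:
                            score = p * chart[i][mid][b] * chart[mid][j][c]
                            if score > chart[i][j].get(lhs, 0.0):
                                chart[i][j][lhs] = score
    return chart[0][n]
```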

[Slides 44–46, diagram: the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner"; within the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT), the responsible lexeme MRs, e.g. Turn(LEFT), Travel(), Verify(at: SOFA), Turn(), are marked bottom-up and propagated to the top level to compose the final MR]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG model
  – Complexity caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the orders in which relevant lexemes appear are covered implicitly

48
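The one-by-one generation can be sketched as a tiny Markov sampler. The relevant lexemes are assumed to have already been derived from the context MR, and the continuation probability is illustrative:

```python
import random

def generate_lexemes(relevant_lexemes, p_continue=0.5, rng=None):
    """Emit relevant semantic lexemes one at a time, deciding after each
    emission whether to continue; orderings arise implicitly, with no
    k-permutation rules needed."""
    rng = rng or random.Random(0)
    emitted = [rng.choice(relevant_lexemes)]  # at least one lexeme
    while rng.random() < p_continue:
        emitted.append(rng.choice(relevant_lexemes))
    return emitted
```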

PCFG Construction

49

[Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

[Slides 51–54, diagram: the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner"; the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) appears at the top nonterminal, and the relevant lexemes Turn(LEFT), Travel(), Verify(at: SOFA), Turn() are marked within it to compose the final MR]

54

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

[Example: the paragraph instruction "Take the wood path towards the easel. At the easel go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7." is hand-segmented into single sentences, each paired with its segment of the full action sequence (Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward, …)]

Data Statistics

56

                            Paragraph       Single-Sentence
# Instructions              706             3236
Avg. # sentences            5.0 (±2.8)      1.0 (±0)
Avg. # actions              10.4 (±5.7)     2.1 (±2.4)
Avg. # words/sentence
  English                   37.6 (±21.1)    7.8 (±5.1)
  Chinese-Word              31.6 (±18.1)    6.9 (±4.9)
  Chinese-Character         48.9 (±28.3)    10.6 (±7.3)
Vocabulary
  English                   660             629
  Chinese-Word              661             508
  Chinese-Character         448             328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-scoring lexemes, with two different lexicon learning algorithms:
    Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL);
    Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – The semantic parser KRISP (Kate and Mooney, 2006) is trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

58
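As a rough sketch, partial credit can be scored as precision/recall/F1 over the multisets of MR components in the gold and predicted parses; this is a simplification of the actual metric, shown only to make the numbers on the next slides concrete.

```python
from collections import Counter

def prf1(gold_components, predicted_components):
    """Precision, recall, and F1 over multisets of MR components."""
    g, p = Counter(gold_components), Counter(predicted_components)
    overlap = sum((g & p).values())  # multiset intersection
    precision = overlap / sum(p.values()) if p else 0.0
    recall = overlap / sum(g.values()) if g else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```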

Parse Accuracy (English)

                                 Precision   Recall   F1
Chen & Mooney (2011)               90.16      55.41   68.59
Chen (2012)                        88.36      57.03   69.31
Hierarchy Generation PCFG Model    87.58      65.41   74.81
Unigram Generation PCFG Model      86.10      68.79   76.44

59

Parse Accuracy (Chinese-Word)

                                 Precision   Recall   F1
Chen (2012)                        88.87      58.76   70.74
Hierarchy Generation PCFG Model    80.56      71.14   75.53
Unigram Generation PCFG Model      79.45      73.66   76.41

60

Parse Accuracy (Chinese-Character)

                                 Precision   Recall   F1
Chen (2012)                        92.48      56.47   70.01
Hierarchy Generation PCFG Model    79.77      67.38   73.05
Unigram Generation PCFG Model      79.73      75.52   77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan produced from the semantic parser's output reaches the destination
• Strict metric: only successful if the final position matches exactly
  – The facing direction is also considered in the single-sentence setting
  – Paragraph execution fails if even one single-sentence execution fails

62
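The strict metric can be sketched as an exact comparison of end states; poses are assumed to be `(x, y, heading)` tuples here, which is illustrative rather than the evaluator's actual representation.

```python
def execution_accuracy(results, check_direction=True):
    """Fraction of executions whose final pose matches the goal exactly;
    the facing direction is also required in the single-sentence setting."""
    correct = 0
    for final, goal in results:
        if check_direction:
            correct += final == goal
        else:
            correct += final[:2] == goal[:2]  # compare (x, y) only
    return correct / len(results)
```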

End-to-End Execution Evaluations (English)

                                 Single-Sentence   Paragraph
Chen & Mooney (2011)                54.40             16.18
Chen (2012)                         57.28             19.18
Hierarchy Generation PCFG Model     57.22             20.17
Unigram Generation PCFG Model       67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                 Single-Sentence   Paragraph
Chen (2012)                         58.70             20.13
Hierarchy Generation PCFG Model     61.03             19.08
Unigram Generation PCFG Model       63.40             23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                 Single-Sentence   Paragraph
Chen (2012)                         57.27             16.73
Hierarchy Generation PCFG Model     55.61             12.74
Unigram Generation PCFG Model       62.85             23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexity: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: the longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity, avoids over-fitting, better generalization
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Composes novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation        Unigram Generation
Data                  |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
English               20,451      17.26           16,357      8.78
Chinese (Word)        21,636      15.99           15,459      8.05
Chinese (Character)   19,792      18.64           13,514      12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• An effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the single best result, with maximum probability

[Diagram: testing example → trained generative model → 1-best candidate with maximum probability]

70

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best result out of the n-best candidates from the baseline model

[Diagram: testing example → trained baseline generative model (GEN) → n-best candidates → trained secondary discriminative model → best prediction → output]

71


Page 42: Grounded Language Learning Models for Ambiguous  Supervision

PCFG Construction

42


Child concepts are generated from parent concepts selectively

All semantic concepts generate relevant NL words

Each semantic concept generates at least one NL word

Parsing New NL Sentences

• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data

• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm

• Compose the final MR parse from the lexeme MRs that appear in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data
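The bottom-up marking step can be sketched in a few lines. This is an illustrative toy, not the thesis implementation: the node encoding (`mr`, `words`, `children`) and the MR strings are hypothetical stand-ins.

```python
def compose_mr(node):
    """Return the marked MR components for a parse-tree node.

    A node is a dict with:
      'mr'       : list of MR components at this node
      'words'    : NL words generated directly by this node (leaves only)
      'children' : child nodes
    Only components supported by some word-generating descendant
    propagate to the top.
    """
    if node.get("words"):                 # leaf concept that emitted words
        return list(node["mr"])
    marked = []
    for child in node.get("children", []):
        for comp in compose_mr(child):
            if comp in node["mr"] and comp not in marked:
                marked.append(comp)       # keep only supported components
    return marked

# Toy tree for "Turn left and find the sofa":
tree = {
    "mr": ["Turn(LEFT)", "Verify(at: SOFA)", "Travel()"],
    "children": [
        {"mr": ["Turn(LEFT)"], "words": ["turn", "left"]},
        {"mr": ["Travel()", "Verify(at: SOFA)"], "words": ["find", "sofa"]},
    ],
}
print(compose_mr(tree))   # only word-supported components survive
```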

43

[Figure: most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner". Starting from the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT), the responsible lexeme MRs (e.g., Turn(LEFT); Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)) are marked bottom-up to compose the final MR Turn(LEFT), Verify(front: SOFA), Travel(), Verify(at: SOFA), Turn().]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data

• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – All orderings in which the relevant lexemes can appear are thereby covered without extra rules
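The rule-count argument can be made concrete with a small sketch. The actual thesis grammar is richer; the rule shapes below (`CTX -> LEX_i CTX | LEX_i`) are illustrative only, contrasted with the one-rule-per-ordering alternative:

```python
from itertools import permutations

def unigram_rules(ctx, lexemes):
    # Unigram Markov generation: emit one lexeme at a time, so every
    # ordering is reachable with only two rules per lexeme.
    rules = []
    for lex in lexemes:
        rules.append((ctx, (lex, ctx)))   # emit a lexeme, keep generating
        rules.append((ctx, (lex,)))       # emit the final lexeme and stop
    return rules

def permutation_rules(ctx, lexemes):
    # The k-permutation alternative: one rule per ordered subset.
    return [(ctx, perm) for k in range(1, len(lexemes) + 1)
            for perm in permutations(lexemes, k)]

lex = ["Turn(LEFT)", "Travel()", "Verify(at: SOFA)"]
print(len(unigram_rules("CTX", lex)))       # 6  (linear in #lexemes)
print(len(permutation_rules("CTX", lex)))   # 15 (grows factorially)
```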

48

PCFG Construction

49

Each semantic concept is generated by a unigram Markov process

All semantic concepts generate relevant NL words

Parsing New NL Sentences

• Follows a scheme similar to the Hierarchy Generation PCFG model

• Compose the final MR parse from the lexeme MRs that appear in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Figure: most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) at the top nonterminal generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(), whose components are marked in the context MR to compose the final MR.]

Data
• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph:
Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.
Actions: Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward

Single sentences, each paired with its action subsequence:
Take the wood path towards the easel. → Turn, Forward
At the easel, go left and then take a right on the blue path at the corner. → Turn left, Forward, Turn right

Data Statistics

56

                        Paragraph        Single-Sentence
Instructions            706              3236
Avg. # sentences        5.0 (±2.8)       1.0 (±0)
Avg. # actions          10.4 (±5.7)      2.1 (±2.4)
Avg. # words/sent.
  English               37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word          31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character     48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English               660              629
  Chinese-Word          661              508
  Chinese-Character     448              328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy

• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes with two different lexicon learning algorithms:
    Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney, 2006), trained on the resulting supervised data

57
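The leave-one-map-out protocol can be sketched as a short loop. `train` and `evaluate` below are placeholders for the actual learning and evaluation pipeline, and the map names follow the three MARCO maps used in this corpus:

```python
def leave_one_map_out(maps, train, evaluate):
    # Each map serves once as the test set; the rest are training data.
    scores = {}
    for held_out in maps:
        training = [m for m in maps if m != held_out]
        model = train(training)
        scores[held_out] = evaluate(model, held_out)
    return scores

maps = ["Grid", "Jelly", "L"]
train = lambda ms: set(ms)                       # toy "model": the training maps
evaluate = lambda model, m: m not in model       # held-out map is always unseen
print(leave_one_map_out(maps, train, evaluate))
```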

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data

• Metric: partial parse accuracy
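A partial-credit metric of this kind can be sketched as precision/recall/F1 over shared MR components. This is an assumption-laden illustration; the exact matching used in the thesis may differ:

```python
from collections import Counter

def partial_prf(predicted, gold):
    # Credit every MR component shared between prediction and gold.
    pred, gld = Counter(predicted), Counter(gold)
    matched = sum((pred & gld).values())          # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gld.values()) if gld else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = partial_prf(["Turn(LEFT)", "Travel()"],
                      ["Turn(LEFT)", "Travel()", "Verify(at: SOFA)"])
print(round(p, 2), round(r, 2), round(f, 2))   # 1.0 0.67 0.8
```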

58

Parse Accuracy (English)

                                    Precision   Recall   F1
Chen & Mooney (2011)                90.16       55.41    68.59
Chen (2012)                         88.36       57.03    69.31
Hierarchy Generation PCFG Model     87.58       65.41    74.81
Unigram Generation PCFG Model       86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                    Precision   Recall   F1
Chen (2012)                         88.87       58.76    70.74
Hierarchy Generation PCFG Model     80.56       71.14    75.53
Unigram Generation PCFG Model       79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                    Precision   Recall   F1
Chen (2012)                         92.48       56.47    70.01
Hierarchy Generation PCFG Model     79.77       67.38    73.05
Unigram Generation PCFG Model       79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination

• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution
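A minimal sketch of the strict metric, under the simplifying assumption that a state is a (position, heading) tuple and that a paragraph succeeds only when every sentence-level step does:

```python
def sentence_success(final_state, goal_state):
    # Strict match: position AND facing direction must agree.
    return final_state == goal_state

def paragraph_success(sentence_results):
    # One failed sentence-level execution fails the whole paragraph.
    return all(sentence_results)

print(sentence_success((3, 2, "N"), (3, 2, "N")))   # True
print(paragraph_success([True, True, False]))       # False
```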

62

End-to-End Execution Evaluations (English)

                                    Single-Sentence   Paragraph
Chen & Mooney (2011)                54.4              16.18
Chen (2012)                         57.28             19.18
Hierarchy Generation PCFG Model     57.22             20.17
Unigram Generation PCFG Model       67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                    Single-Sentence   Paragraph
Chen (2012)                         58.7              20.13
Hierarchy Generation PCFG Model     61.03             19.08
Unigram Generation PCFG Model       63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                    Single-Sentence   Paragraph
Chen (2012)                         57.27             16.73
Hierarchy Generation PCFG Model     55.61             12.74
Unigram Generation PCFG Model       62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak in the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Composes novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

Data                  Hierarchy Generation PCFG Model   Unigram Generation PCFG Model
                      |Grammar|   Time (hrs)            |Grammar|   Time (hrs)
English               20451       17.26                 16357       8.78
Chinese (Word)        21636       15.99                 15459       8.05
Chinese (Character)   19792       18.64                 13514       12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability

[Diagram: a testing example passes through the trained generative model, which returns the 1-best candidate with maximum probability.]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: a testing example passes through the trained baseline generative model (GEN), producing n-best candidates (Candidate 1 … Candidate n); the trained secondary discriminative model selects the best prediction as the output.]

71

How can we apply discriminative reranking?

• Standard discriminative reranking cannot be applied directly to grounded language learning
  – No single gold-standard reference exists for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks plan) is provided
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good or bad
  – Multiple candidate parses are used for the parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: a training example passes through the trained baseline generative model (GEN), yielding n-best candidates a_1 … a_n with perceptron scores (-0.16, 1.21, -1.09, 1.46, 0.59); the perceptron compares its best prediction a_4 against the gold-standard reference a_g and updates the weights with the feature-vector difference a_g − a_4. For our generative models, no gold-standard reference is available.]

73
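The averaged-perceptron update can be sketched as below. The feature vectors and training data are hypothetical; this illustrates the Collins (2000) style of reranking, not the thesis code:

```python
def perceptron_rerank(train, n_epochs=5):
    """train: list of (candidate_feature_vectors, reference_index) pairs."""
    dim = len(train[0][0][0])
    w = [0.0] * dim
    total = [0.0] * dim          # running sum of w for averaging
    steps = 0
    for _ in range(n_epochs):
        for candidates, ref in train:
            scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in candidates]
            best = max(range(len(candidates)), key=scores.__getitem__)
            if best != ref:      # wrong prediction: move toward the reference
                for i in range(dim):
                    w[i] += candidates[ref][i] - candidates[best][i]
            for i in range(dim):
                total[i] += w[i]
            steps += 1
    return [t / steps for t in total]    # averaged weights

data = [([[1.0, 0.0], [0.0, 1.0]], 0),   # reference is candidate 0
        ([[0.0, 1.0], [1.0, 0.0]], 1)]   # reference is candidate 1
w = perceptron_rerank(data)
print(w[0] > w[1])   # feature 0 marks the references, so its weight wins
```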

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
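Pseudo-gold selection can be sketched as follows. `execute` is a hypothetical stand-in for the MARCO execution module, and the candidate MRs here are just strings:

```python
def success_rate(mr, execute, trials=10):
    # MARCO is nondeterministic, so average success over several trials.
    return sum(execute(mr) for _ in range(trials)) / trials

def pick_pseudo_gold(candidates, execute, trials=10):
    rates = [success_rate(mr, execute, trials) for mr in candidates]
    best = max(range(len(candidates)), key=rates.__getitem__)
    return best, rates

# Deterministic stand-in executor for the demo: plan "B" always succeeds.
execute = lambda mr: mr == "B"
best, rates = pick_pseudo_gold(["A", "B", "C"], execute)
print(best, rates)   # 1 [0.0, 1.0, 0.0]
```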

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the derived MRs (MR_1 … MR_n) of the n-best candidates are executed by the MARCO module, giving execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate serves as the pseudo-gold reference, and the perceptron updates its weights with the feature-vector difference against its best prediction.]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or carry ignorable details, and are sometimes inaccurate yet contain the correct MR components to reach the desired goal

• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
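The multi-candidate update can be sketched as below. The feature vectors and rates are hypothetical; the key point is that each better-executing candidate contributes a feature-vector difference scaled by its rate gap:

```python
def multi_parse_update(w, feats, rates, predicted):
    # Every candidate whose execution success rate beats the currently
    # predicted parse contributes a weighted feature-vector difference.
    for i, (f, r) in enumerate(zip(feats, rates)):
        gap = r - rates[predicted]
        if i != predicted and gap > 0:
            for j in range(len(w)):
                w[j] += gap * (f[j] - feats[predicted][j])
    return w

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rates = [0.2, 0.9, 0.6]                     # candidate 0 was predicted
updated = multi_parse_update([0.0, 0.0], feats, rates, predicted=0)
print([round(x, 2) for x in updated])       # [-0.7, 1.1]
```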

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: the derived MRs of the n-best candidates are scored by the MARCO execution module (success rates 0.6, 0.4, 0.0, 0.9, 0.2); a candidate beating the current best prediction contributes its feature-vector difference to the perceptron update (Update 1).]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: second step of the same animation, in which another candidate with a higher execution success rate contributes its feature-vector difference (Update 2).]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
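Extracting such binary indicator features can be sketched as a tree walk. The tree encoding below (a `(label, children)` pair where a child is either a subtree or a word) is a hypothetical simplification of the actual feature set:

```python
def tree_features(tree, feats=None):
    """Collect binary features: one per composition rule observed in the
    parse tree and one per (nonterminal, word) emission."""
    if feats is None:
        feats = set()
    label, children = tree
    kids = []
    for c in children:
        if isinstance(c, str):
            feats.add((label, c))            # nonterminal emits a word
            kids.append(c)
        else:
            kids.append(c[0])
            tree_features(c, feats)          # recurse into the subtree
    feats.add((label, tuple(kids)))          # composition rule feature
    return feats

t = ("L1", [("L2", ["turn", "left"]), ("L3", [("L5", ["find"])])])
f = tree_features(t)
print(("L5", "find") in f, ("L1", ("L2", "L3")) in f)   # True True
```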

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model

80
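Collecting the k best distinct MR plans from a large n-best list reduces to deduplication in score order. `derive_mr` below is a hypothetical stand-in for composing an MR from a parse:

```python
def distinct_nbest(parses, derive_mr, k=50):
    # Walk the n-best list (assumed sorted by model score) and keep a
    # parse only when its derived MR plan has not been seen yet.
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept

# Toy demo: the "MR" is the whitespace-normalized, upper-cased parse, so
# insignificantly different parses collapse to the same plan.
parses = ["turn left", "Turn Left", "go forward", "turn  left"]
derive = lambda p: " ".join(p.upper().split())
print(distinct_nbest(parses, derive, k=50))   # ['turn left', 'go forward']
```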

Response-based Update vs. Baseline (English)

             Parse F1                    Single-sentence             Paragraph
             Baseline  Response-based    Baseline  Response-based    Baseline  Response-based
Hierarchy    74.81     73.32             57.22     59.65             20.17     22.62
Unigram      76.44     77.24             67.14     68.27             28.12     29.2

81

Response-based Update vs. Baseline (Chinese-Word)

             Parse F1                    Single-sentence             Paragraph
             Baseline  Response-based    Baseline  Response-based    Baseline  Response-based
Hierarchy    75.53     77.26             61.03     64.12             19.08     21.29
Unigram      76.41     77.74             63.4      65.64             23.12     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

             Parse F1                    Single-sentence             Paragraph
             Baseline  Response-based    Baseline  Response-based    Baseline  Response-based
Hierarchy    73.05     76.26             55.61     64.08             12.74     22.25
Unigram      77.55     79.76             62.85     65.5              23.33     25.35

83

Response-based Update vs Baseline

• vs. baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

             Parse F1           Single-sentence    Paragraph
             Single   Multi     Single   Multi     Single   Multi
Hierarchy    73.32    73.43     59.65    62.81     22.62    26.57
Unigram      77.24    77.81     68.27    68.93     29.2     29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

             Parse F1           Single-sentence    Paragraph
             Single   Multi     Single   Multi     Single   Multi
Hierarchy    77.26    78.8      64.12    64.15     21.29    21.55
Unigram      77.74    78.11     65.64    66.27     23.74    25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

             Parse F1           Single-sentence    Paragraph
             Single   Multi     Single   Multi     Single   Multi
Hierarchy    76.26    79.44     64.08    64.08     22.25    22.58
Unigram      79.76    79.94     65.5     66.84     25.35    27.16

87

Response-based Update with Multiple vs Single Parses

• Using multiple parses generally improves performance
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates may produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and does not scale, due to the annotation of training data

• Grounded language learning from relevant perceptual context is promising, and the training corpus is easy to obtain

• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision

• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 43: Grounded Language Learning Models for Ambiguous  Supervision

Parsing New NL Sentences

bull PCFG rule weights are optimized by Inside-Outside algorithm with training data

bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating

NL wordsndash From the bottom of the tree mark only responsible MR

components that propagate to the top levelndash Able to compose novel MRs never seen in the training data

43

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn left and find the sofa then turn around

the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify

Turn

LEFT frontSOFA

Verify

Turn

LEFT

Turn

RIGHT

steps2

atSOFA

Travel Verify Turn

RIGHT

atSOFA

Travel Verify Turn

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

(Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output)

71

How can we apply discriminative reranking?

• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference parse for each training example
  – Training instead provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used for evaluating the final end-task plan execution)
  – Gives a weak indication of whether a candidate is good or bad
  – Allows multiple candidate parses per parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

(Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with perceptron scores (−0.16, 1.21, −1.09, 1.46, 0.59) → Best prediction; the weights are updated with the feature-vector difference a_g − a_4 between the gold-standard reference a_g and the prediction a_4 — but for our generative models a gold-standard reference is not available)

73
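The update scheme above is the standard averaged perceptron; a minimal sketch follows, assuming dict-based feature vectors and a toy training set (both illustrative — this is not the thesis implementation):

```python
# Minimal averaged-perceptron reranker in the style of Collins (2000).
# Feature vectors are plain dicts; `examples` pairs each training item's
# candidate feature vectors with the index of its (pseudo-)gold candidate.

def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def train_averaged_perceptron(examples, epochs=5):
    w, w_sum, steps = {}, {}, 0
    for _ in range(epochs):
        for candidates, gold in examples:
            # Current best candidate under the model.
            pred = max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))
            if pred != gold:
                # Move weights toward the gold candidate, away from the prediction.
                for k, v in candidates[gold].items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in candidates[pred].items():
                    w[k] = w.get(k, 0.0) - v
            # Accumulate for averaging, which damps late-training oscillation.
            for k, v in w.items():
                w_sum[k] = w_sum.get(k, 0.0) + v
            steps += 1
    return {k: v / steps for k, v in w_sum.items()}
```

At test time the reranker simply returns the candidate maximizing `dot(w, features)`.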

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
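The pseudo-gold selection can be sketched as follows; `execute` stands in for the MARCO execution module (not reimplemented here), and the 10-trial averaging mirrors the slide:

```python
def execution_success_rate(execute, plan, world, trials=10):
    """Average a nondeterministic executor's 0/1 outcomes over several
    trials; `execute` is a stand-in for the MARCO execution module."""
    return sum(execute(plan, world) for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_plans, execute, world, trials=10):
    """Return (index, rate) of the candidate MR plan whose execution
    success rate is highest; that candidate serves as the pseudo-gold."""
    rates = [execution_success_rate(execute, p, world, trials)
             for p in candidate_plans]
    best = max(range(len(candidate_plans)), key=lambda i: rates[i])
    return best, rates[best]
```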

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

(Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR_1 … MR_n) → MARCO Execution Module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the 0.9 candidate becomes the pseudo-gold reference; perceptron scores (1.79, 0.21, −1.09, 1.46, 0.59) give the best prediction; update with the feature-vector difference)

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold one could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or have ignorable details attached, and are sometimes inaccurate yet contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use every candidate with a higher execution success rate than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
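A sketch of this multi-candidate update, assuming dict-based feature vectors and precomputed success rates (names and the exact weighting are illustrative):

```python
def multi_parse_update(w, candidates, rates, pred):
    """Perceptron-style update using every candidate whose execution
    success rate beats the current prediction's; each contribution is
    weighted by the difference in success rates (a sketch, not the
    exact thesis update)."""
    for feats, rate in zip(candidates, rates):
        margin = rate - rates[pred]
        if margin > 0:
            # Reward the better candidate, penalize the current prediction,
            # both scaled by how much better the candidate executed.
            for k, v in feats.items():
                w[k] = w.get(k, 0.0) + margin * v
            for k, v in candidates[pred].items():
                w[k] = w.get(k, 0.0) - margin * v
    return w
```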

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

(Diagram, update 1: n-best candidates → derived MRs (MR_1 … MR_n) → MARCO Execution Module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); perceptron scores (1.24, 1.83, −1.09, 1.46, 0.59); the feature-vector difference from the first better candidate updates the weights)

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

(Diagram, update 2: the same update is applied for the next candidate whose execution success rate beats the current prediction's)

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa, then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

79

f(L1 → L3) = 1    f(L3 → L5 ∨ L1) = 1    f(L3 ⇒ L5 L6) = 1    f(L5, "find") = 1
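Such indicator features can be read off a tree in one traversal. A sketch, representing trees as nested `(label, child, ...)` tuples with word leaves (a simplification of the actual feature templates):

```python
def indicator_features(tree):
    """Collect binary indicator features from a parse tree: one feature
    per local parent -> children rule and one per (nonterminal, word)
    pair. Trees are (label, child, ...) tuples; leaves are word strings."""
    feats = {}

    def walk(node):
        label, children = node[0], node[1:]
        kids = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        feats[(label, kids)] = 1.0                 # rule indicator
        for c in children:
            if isinstance(c, tuple):
                walk(c)
            else:
                feats[(label, c)] = 1.0            # lexical indicator

    walk(tree)
    return feats
```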

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We try to get the 50 best distinct composed MR plans (and the corresponding parses) out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model

80
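Filtering an oversized n-best list down to the best parses with distinct derived plans can be sketched as below (`derive_plan` stands in for MR composition from a parse):

```python
def distinct_by_plan(parses, derive_plan, k=50):
    """Walk a score-sorted n-best list and keep the first k parses whose
    derived MR plans are pairwise distinct; many trees differ only
    insignificantly and collapse to the same plan."""
    seen, kept = set(), []
    for parse in parses:
        plan = derive_plan(parse)
        if plan not in seen:
            seen.add(plan)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept
```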

Response-based Update vs. Baseline (English)

81

                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Baseline         74.81    76.44      57.22    67.14      20.17    28.12
Response-based   73.32    77.24      59.65    68.27      22.62    29.20

Response-based Update vs. Baseline (Chinese-Word)

82

                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Baseline         75.53    76.41      61.03    63.40      19.08    23.12
Response-based   77.26    77.74      64.12    65.64      21.29    23.74

Response-based Update vs. Baseline (Chinese-Character)

83

                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Baseline         73.05    77.55      55.61    62.85      12.74    23.33
Response-based   76.26    79.76      64.08    65.50      22.25    25.35

Response-based Update vs. Baseline

• The response-based approach performs better on the final end-task plan execution
  – It optimizes the model directly for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Single           73.32    77.24      59.65    68.27      22.62    29.20
Multiple         73.43    77.81      62.81    68.93      26.57    29.10

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Single           77.26    77.74      64.12    65.64      21.29    23.74
Multiple         78.80    78.11      64.15    66.27      21.55    25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Single           76.26    79.76      64.08    65.50      22.25    25.35
Multiple         79.44    79.94      64.08    66.84      22.58    27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and does not scale, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general, fully probabilistic framework for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You

(Slides 44–46, figure: the most probable parse tree for a test NL instruction. NL: "Turn left and find the sofa, then turn around the corner." Context MR: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT); the tree decomposes it into lexeme MRs such as Turn(LEFT), Verify(front: SOFA) and Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT), down to Turn(LEFT); Travel(), Verify(at: SOFA); and Turn().)

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit the training data
• Proposed solution: a simpler model
  – Generate relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a smaller PCFG rule set; faster to train
47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the orders in which relevant lexemes appear are thereby covered automatically

48
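The rule shapes this generation scheme needs can be sketched as follows; the nonterminal names and the two-rule emit/stop pattern are illustrative, not the thesis grammar:

```python
def unigram_pcfg_rules(context, lexemes):
    """Schematic rule set for unigram Markov generation: the context
    nonterminal emits one relevant lexeme and either continues or stops,
    so any ordering of lexemes is derivable without permutation rules."""
    ctx = "CTX_" + context
    rules = []
    for lex in lexemes:
        rules.append((ctx, ("LEX_" + lex, ctx)))   # emit a lexeme, continue
        rules.append((ctx, ("LEX_" + lex,)))       # emit a lexeme, stop
    return rules
```

Because any lexeme can be emitted at any step, the k! permutation rules of the hierarchy model become unnecessary.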

PCFG Construction

49

Each semantic concept is generated by a unigram Markov process; all semantic concepts generate the relevant NL words.

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs that appear in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50

(Slides 51–54, figure: the most probable parse tree for the test NL instruction "Turn left and find the sofa, then turn around the corner." The context MR — Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) — appears at the top nonterminal; the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn() are marked, yielding the composed MR parse.)

54

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version (by the Stanford Chinese Word Segmenter) and a character-segmented version

55

Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair, and at the chair take a right towards the stool. When you reach the stool you are at 7."
Single sentence: "Take the wood path towards the easel." / "At the easel, go left and then take a right on the blue path at the corner." / …
Actions: Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward / Forward, Turn left, Forward, Turn right / Turn

Data Statistics

56

                           Paragraph        Single-Sentence
Instructions               706              3236
Avg. sentences             5.0 (±2.8)       1.0 (±0)
Avg. actions               10.4 (±5.7)      2.1 (±2.4)
Avg. words/sent.
  English                  37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word             31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character        48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                  660              629
  Chinese-Word             661              508
  Chinese-Character        448              328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-scoring lexemes, with two different lexicon learning algorithms: Graph Intersection Lexicon Learning (GILL; Chen and Mooney, 2011) and Subgraph Generation Online Lexicon Learning (SGOLL; Chen, 2012)
  – The semantic parser KRISP (Kate and Mooney, 2006) is trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

58
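Precision, recall, and F1 over MR components can be sketched as below (a simplification: the actual partial parse accuracy also credits partial matches of MR subgraphs):

```python
def parse_prf(predicted, gold):
    """Precision, recall, and F1 over sets of MR components; a rough
    stand-in for partial parse accuracy, which also scores partial
    component matches."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```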

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.10       68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered in the single-sentence setting
  – Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.40             16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.70             20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.40             23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: the LHG and the k-permutation rules
  – Particularly weak on the Chinese-character corpus: a longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity avoids over-fitting and gives better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You


[Figure: parse diagrams relating the context MR Turn(LEFT), Verify(front: BLUE_HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) to candidate sets of relevant lexemes, e.g. Turn(LEFT), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)]

46

Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
  – Complexity caused by the Lexeme Hierarchy Graph and k-permutation rules
  – Tends to over-fit to the training data

• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the lexemes' appearing orders are thereby already covered

48

PCFG Construction

49

Each semantic concept is generated by a unigram Markov process.

All semantic concepts generate relevant NL words.
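As a toy illustration of this construction (the nonterminal names and rule shapes below are invented for the sketch; the thesis's actual rule schema and probabilities are not reproduced), the unigram Markov process can be encoded as a small set of PCFG-style rules in which the context nonterminal emits one relevant lexeme at a time and then either continues or stops:

```python
def unigram_generation_rules(context_mr, lexemes):
    """Sketch of the Unigram Generation model's rule set: the context
    nonterminal emits any relevant lexeme and returns to itself, so every
    ordering of lexeme emission is covered without k-permutation rules."""
    ctx = "CTX[%s]" % context_mr
    rules = []
    for lex in lexemes:
        # Emit one lexeme, then return to the context state.
        rules.append((ctx, ["LEX[%s]" % lex, ctx]))
        # Each lexeme nonterminal generates its NL words.
        rules.append(("LEX[%s]" % lex, ["Words"]))
    rules.append((ctx, []))  # stop emitting lexemes
    return rules

rules = unigram_generation_rules("Travel;Verify(at:SOFA)",
                                 ["Travel", "Verify(at:SOFA)"])
```

Because the context nonterminal recurses on itself after each emission, no extra rules are needed per emission order, which is exactly what keeps this grammar smaller than the hierarchy model's.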

Parsing New NL Sentences

• Follows a scheme similar to the Hierarchy Generation PCFG model

• Composes the final MR parse from the lexeme MRs appearing in the parse tree
  – Considers only the lexeme MRs responsible for generating NL words
  – Marks the relevant lexeme MR components in the context MR appearing in the top nonterminal

50
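A minimal sketch of that composition step (the flat-list MR representation and component names here are hypothetical; the real model operates over parse trees and structured MRs):

```python
def compose_final_mr(context_mr, used_lexeme_mrs):
    """Keep only the components of the context MR that are covered by a
    lexeme MR which actually generated NL words in the best parse."""
    covered = set()
    for lexeme in used_lexeme_mrs:
        covered.update(lexeme)
    # Preserve the context MR's original component order.
    return [c for c in context_mr if c in covered]

context = ["Turn(LEFT)", "Verify(front:BLUE_HALL)", "Travel(steps:2)",
           "Verify(at:SOFA)", "Turn(RIGHT)"]
used = [["Turn(LEFT)"], ["Travel(steps:2)", "Verify(at:SOFA)"], ["Turn(RIGHT)"]]
final_mr = compose_final_mr(context, used)
# The unsupported Verify(front:BLUE_HALL) component is dropped.
```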

[Figure: the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner", with the context MR Turn(LEFT), Verify(front: BLUE_HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) at the top nonterminal and the relevant lexemes Turn(LEFT), Travel(), Verify(at: SOFA), Turn() that generated the NL words]

54

[Figure: the composed final MR — the relevant lexeme MR components marked within the context MR]

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph instruction: "Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."

Single-sentence segments: "Take the wood path towards the easel." / "At the easel go left and then take a right on the the blue path at the corner." / …

Corresponding action sequences: Turn, Forward; Turn left, Forward, Turn right, Forward ×3, Turn right, Forward; Forward, Turn left, Forward, Turn right; Turn; …

Data Statistics

56

                          Paragraph        Single-Sentence
# Instructions            706              3236
Avg. # sentences          5.0 (±2.8)       1.0 (±0)
Avg. # actions            10.4 (±5.7)      2.1 (±2.4)
Avg. # words/sentence
  English                 37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                 660              629
  Chinese-Word            661              508
  Chinese-Character       448              328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    • Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data

• Metric: partial parse accuracy

58
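Partial parse accuracy gives credit for matching MR components rather than requiring an exact-match parse. A hedged sketch of component-level precision/recall/F1 (the thesis's exact partial-credit matching criterion may differ):

```python
def partial_parse_scores(predicted, gold):
    """Component-level precision, recall, and F1 between a predicted MR
    parse and the gold MR, both given as sets of MR components."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = partial_parse_scores(
    {"Turn(LEFT)", "Travel(steps:2)"},
    {"Turn(LEFT)", "Travel(steps:2)", "Verify(at:SOFA)"})
# Here the parser produced only correct components (precision 1.0)
# but missed one gold component (recall 2/3).
```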

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan composed from the semantic parser's output reaches the destination

• Strict metric: only successful if the final position matches exactly
  – The facing direction is also considered in single-sentence evaluation
  – Paragraph execution fails if even one single-sentence execution goes wrong

62
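The strict success criterion can be sketched as follows (the dictionary state representation is hypothetical, chosen only for illustration):

```python
def single_sentence_success(final, goal):
    """Strict end-to-end metric: the final position must match the goal
    exactly, and for single-sentence evaluation the facing direction too."""
    return final["pos"] == goal["pos"] and final["dir"] == goal["dir"]

def paragraph_success(final, goal):
    """Paragraph evaluation checks the final position; note that one bad
    single-sentence execution is enough to end up in the wrong place."""
    return final["pos"] == goal["pos"]

ok = single_sentence_success({"pos": (3, 5), "dir": "N"},
                             {"pos": (3, 5), "dir": "N"})
bad = single_sentence_success({"pos": (3, 5), "dir": "E"},
                              {"pos": (3, 5), "dir": "N"})
```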

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.7              20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexity: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: the longer average sentence length makes the PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation PCFG    Unigram Generation PCFG
Data                  |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English               20,451      17.26            16,357      8.78
Chinese (Word)        21,636      15.99            15,459      8.05
Chinese (Character)   19,792      18.64            13,514      12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal
  – Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – The trained model outputs the single best result with maximum probability

[Diagram: testing example → trained generative model → 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: testing example → trained baseline generative model → GEN → n-best candidates (Candidate 1 … Candidate n) → trained secondary discriminative model → best prediction → output]

71
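The reranking step itself is a linear scoring of each candidate's feature vector; a minimal sketch (the candidate names and feature maps are hypothetical):

```python
def rerank(candidates, features, weights):
    """Return the candidate maximizing the discriminative score
    w . f(c) over the n-best list from the baseline generative model."""
    def score(cand):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(cand).items())
    return max(candidates, key=score)

# Toy usage: two candidates distinguished by one rule feature each.
feats = {"c1": {"rule:A->B": 1.0}, "c2": {"rule:A->C": 1.0}}
best = rerank(["c1", "c2"], feats.get,
              {"rule:A->C": 2.5, "rule:A->B": -0.3})
```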

How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – No single gold-standard reference parse exists for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks plan) is available
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (as used in evaluating the final end-task plan execution)
  – This gives a weak indication of whether a candidate is good or bad
  – Use multiple candidate parses per parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated whenever the trained model predicts a wrong candidate

[Diagram: training example → trained baseline generative model → GEN → n-best candidates with feature vectors a1 … an and perceptron scores (e.g. -0.16, 1.21, -1.09, 1.46, 0.59); the best prediction is compared against the gold-standard reference with feature vector ag, and the weights are updated by the feature-vector difference ag - a4]

73

For our generative models, such a gold-standard reference is not available.
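For reference, the standard averaged-perceptron update depicted above looks roughly like this; it is a sketch that assumes each training example comes with a gold candidate, which is exactly what is missing in the grounded setting:

```python
from collections import defaultdict

def averaged_perceptron(examples, features, epochs=3):
    """Averaged perceptron reranker (Collins 2000): when the current best
    prediction differs from the gold candidate, add the gold feature
    vector and subtract the predicted one; return averaged weights."""
    w, w_sum, steps = defaultdict(float), defaultdict(float), 0
    def score(cand):
        return sum(w[f] * v for f, v in features(cand).items())
    for _ in range(epochs):
        for gold, candidates in examples:
            pred = max(candidates, key=score)
            if pred != gold:
                for f, v in features(gold).items():
                    w[f] += v
                for f, v in features(pred).items():
                    w[f] -= v
            steps += 1
            for f in list(w):            # accumulate for averaging
                w_sum[f] += w[f]
    return {f: total / steps for f, total in w_sum.items()}

# Toy usage: one example whose gold candidate carries feature "g".
feats = {"good": {"g": 1.0}, "bad": {"b": 1.0}}
avg = averaged_perceptron([("good", ["bad", "good"])], feats.get)
```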

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the MR plans composed from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination; MARCO is nondeterministic, so average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
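A sketch of the pseudo-gold selection; the `execute` callable stands in for the external MARCO execution module and is the only assumed interface here:

```python
def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Estimate each candidate plan's execution success rate by running
    it several times (the executor is nondeterministic) and return the
    candidate with the highest rate as the pseudo-gold reference."""
    rates = {}
    for mr in candidate_mrs:
        successes = sum(1 for _ in range(trials) if execute(mr))
        rates[mr] = successes / trials
    pseudo_gold = max(candidate_mrs, key=lambda mr: rates[mr])
    return pseudo_gold, rates

# Deterministic stand-in executor: only plan_b reaches the destination.
gold, rates = pick_pseudo_gold(["plan_a", "plan_b"],
                               lambda mr: mr == "plan_b")
```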

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: n-best candidates → derived MRs MR1 … MRn → MARCO execution module → execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron weights are updated by its feature-vector difference from the best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or have ignorable details attached, and are sometimes inaccurate yet contain the correct MR components to reach the desired goal

• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
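A sketch of the multi-parse update rule (the containers are hypothetical; the success rates would come from the MARCO executor as above):

```python
def multi_parse_update(weights, candidates, features, success_rate, predicted):
    """For every candidate whose execution success rate beats the
    currently predicted parse, move the weights toward that candidate's
    features, scaled by the success-rate difference."""
    base = success_rate[predicted]
    f_pred = features(predicted)
    for cand in candidates:
        gain = success_rate[cand] - base
        if gain <= 0:
            continue  # only candidates that execute better contribute
        f_cand = features(cand)
        for name in set(f_cand) | set(f_pred):
            delta = f_cand.get(name, 0.0) - f_pred.get(name, 0.0)
            weights[name] = weights.get(name, 0.0) + gain * delta
    return weights

feats = {"p1": {"a": 1.0}, "p2": {"b": 1.0}, "p3": {"c": 1.0}}
w = multi_parse_update({}, ["p1", "p2", "p3"], feats.get,
                       {"p1": 0.9, "p2": 0.6, "p3": 0.2}, predicted="p3")
```

Both p1 and p2 outperform the predicted p3, so each contributes an update proportional to how much better it executes.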

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: n-best candidates with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59; each candidate whose success rate exceeds the predicted parse's contributes a weighted feature-vector difference to the update (update 1)]

77

Weight Update with Multiple Parses

• The same weighted update is applied for each further candidate with a higher execution success rate than the predicted parse (update 2, …)

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)
L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at: SOFA)
L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
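A sketch of extracting such indicator features from a toy nested-tuple parse tree (the tree encoding is invented for illustration; the actual feature set also includes other composition types, as the examples above suggest):

```python
def indicator_features(tree):
    """Binary features for each nonterminal/terminal composition in a
    parse tree given as (label, [children]) tuples, where terminal
    words appear as plain strings."""
    feats = {}
    def walk(node):
        label, children = node
        child_labels = [c if isinstance(c, str) else c[0] for c in children]
        feats["%s -> %s" % (label, " ".join(child_labels))] = 1
        for c in children:
            if not isinstance(c, str):
                walk(c)
    walk(tree)
    return feats

tree = ("L1", [("L3", [("L5", ["find"]), ("L6", ["turn"])])])
feats = indicator_features(tree)
# Fires one binary feature per rule composition in the tree.
```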

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We collect up to 50 distinct composed MR plans (and their corresponding parses) out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best list is generated from the baseline model

80
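The distinct-plan filtering can be sketched as a single pass over the score-ordered parse list; the `derive_mr` function, mapping a parse to its composed plan, is assumed:

```python
def distinct_plan_nbest(parses, derive_mr, k=50):
    """Walk a large n-best parse list in score order and keep the first
    (highest-scoring) parse for each distinct composed MR plan, stopping
    once k distinct plans have been collected."""
    seen, kept = set(), []
    for parse in parses:
        plan = derive_mr(parse)
        if plan not in seen:
            seen.add(plan)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept

# Toy usage: parses 0..9 whose derived plan is parse % 3, so only
# three distinct plans exist.
top = distinct_plan_nbest(range(10), lambda p: p % 3, k=3)
```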

Response-based Update vs. Baseline (English)

81

                Parse F1              Single-Sentence       Paragraph
                Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy       74.81     73.32       57.22     59.65       20.17     22.62
Unigram         76.44     77.24       67.14     68.27       28.12     29.2

Response-based Update vs. Baseline (Chinese-Word)

82

                Parse F1              Single-Sentence       Paragraph
                Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy       75.53     77.26       61.03     64.12       19.08     21.29
Unigram         76.41     77.74       63.4      65.64       23.12     23.74

Response-based Update vs. Baseline (Chinese-Character)

83

                Parse F1              Single-Sentence       Paragraph
                Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy       73.05     76.26       55.61     64.08       12.74     22.25
Unigram         77.55     79.76       62.85     65.5        23.33     25.35

Response-based Update vs Baseline

• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

                Parse F1           Single-Sentence    Paragraph
                Single   Multi     Single   Multi     Single   Multi
Hierarchy       73.32    73.43     59.65    62.81     22.62    26.57
Unigram         77.24    77.81     68.27    68.93     29.2     29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

                Parse F1           Single-Sentence    Paragraph
                Single   Multi     Single   Multi     Single   Multi
Hierarchy       77.26    78.8      64.12    64.15     21.29    21.55
Unigram         77.74    78.11     65.64    66.27     23.74    25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

                Parse F1           Single-Sentence    Paragraph
                Single   Multi     Single   Multi     Single   Multi
Hierarchy       76.26    79.44     64.08    64.08     22.25    22.58
Unigram         79.76    79.94     65.5     66.84     25.35    27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates often produce underspecified plans, or plans with ignorable details that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data

• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain

• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision

• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

46

Unigram Generation PCFG Model

bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph

and k-permutationsndash Tend to over-fit to the training data

bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train

47

PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by unigram Markov process

All semantic concepts gen-erate relevant NL words

Parsing New NL Sentences

bull Follows the similar scheme as in Hierarchy Generation PCFG model

bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for

generating NL wordsndash Mark relevant lexeme MR components in the

context MR appearing in the top nonterminal

50

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
– The most preferred one in terms of plan execution
– Evaluate the composed MR plans from the candidate parses
– The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
– Record the execution success rate: whether each candidate MR reaches the intended destination; MARCO is nondeterministic, so average over 10 trials
– Prefer the candidate with the best execution success rate during training

74
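The update above can be sketched as a plain perceptron step. This is a minimal sketch, not the authors' code: feature vectors are assumed to be plain dicts, the execution success rates are assumed precomputed upstream by a MARCO-style module, and the weight averaging of Collins (2000) is omitted.

```python
def perceptron_response_update(weights, candidates):
    """One response-based perceptron step over an n-best list.

    candidates: list of (features, success_rate) pairs, where features is
    a dict and success_rate is the execution success rate of the
    candidate's composed MR plan (averaged over 10 trials upstream).
    """
    def score(feats):
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())

    # Pseudo-gold: the candidate whose plan reaches the destination most often.
    gold, _ = max(candidates, key=lambda c: c[1])
    # Current best prediction under the perceptron score.
    pred, _ = max(candidates, key=lambda c: score(c[0]))
    if pred is not gold:
        # Update with the feature-vector difference (pseudo-gold minus prediction).
        for k in set(gold) | set(pred):
            weights[k] = weights.get(k, 0.0) + gold.get(k, 0.0) - pred.get(k, 0.0)
    return weights
```

After one such update, the pseudo-gold candidate outscores the wrongly predicted one, so repeating the step on the same example makes no further change.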

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates (Candidate 1 … Candidate n) yield derived MRs MR1 … MRn with perceptron scores 1.79, 0.21, −1.09, 1.46, 0.59; the MARCO execution module assigns execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron updates with the feature vector difference from its best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
– Multiple parses may share the same maximum execution success rate
– "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
MR plans are underspecified or have ignorable details attached
Sometimes inaccurate, but they contain the correct MR components needed to reach the desired goal

• Weight update with multiple candidate parses
– Use all candidates with higher execution success rates than the currently best-predicted candidate
– Update with each feature vector difference, weighted by the difference between execution success rates

76
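The multiple-parse variant can be sketched under the same assumptions as before (dict feature vectors, precomputed success rates; a sketch, not the authors' implementation): every candidate whose execution success rate beats the currently predicted parse contributes an update scaled by the rate difference.

```python
def multi_parse_update(weights, candidates):
    """Update with every candidate that out-executes the predicted parse.

    candidates: list of (features, success_rate) pairs; each better
    candidate's update is weighted by the success-rate difference.
    """
    def score(feats):
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())

    pred, pred_rate = max(candidates, key=lambda c: score(c[0]))
    for feats, rate in candidates:
        if rate > pred_rate:
            scale = rate - pred_rate  # weight of this candidate's update
            for k in set(feats) | set(pred):
                weights[k] = weights.get(k, 0.0) + scale * (feats.get(k, 0.0) - pred.get(k, 0.0))
    return weights
```

Scaling by the rate difference lets strongly better candidates move the weights more than marginally better ones, so the weak response signal from all candidates is pooled into one update.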

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): among the n-best candidates with derived MRs MR1 … MRn, perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59 and execution success rates 0.6, 0.4, 0.0, 0.9, 0.2, one candidate with a higher rate than the prediction contributes a weighted feature vector difference]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): the next candidate whose execution success rate exceeds the prediction's contributes another weighted feature vector difference]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Turn left and find the sofa then turn around the corner

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)
L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at: SOFA)
L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5 → "find") = 1

79
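Such rule-indicator features can be extracted with a short recursive walk. This is a sketch under an assumed tree encoding (a pair of a label and a list of children, where terminals are plain NL word strings); it is not the thesis code, only an illustration of the feature type.

```python
def rule_features(tree, feats=None):
    """Binary indicators f(parent -> children) = 1 over a parse tree.

    A tree is (label, children); each child is either a subtree
    or a terminal NL word (a plain string) -- a hypothetical encoding.
    """
    if feats is None:
        feats = {}
    label, children = tree
    # Record the local rule as a binary indicator feature.
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    feats[label + " -> " + rhs] = 1.0
    # Recurse into nonterminal children.
    for c in children:
        if not isinstance(c, str):
            rule_features(c, feats)
    return feats
```

For the subtree where L1 expands to L3, L3 expands to L5 and L6, and L5 yields the word "find", this produces exactly the indicators shown on the slide.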

Evaluations

• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)

• Compared with the two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Try to get the 50-best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses
Many parse trees differ insignificantly, leading to the same derived MR plans
Generate sufficiently large 1,000,000-best parse lists from the baseline model

80

Response-based Update vs. Baseline (English)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline          74.81      76.44     57.22      67.14     20.17      28.12
Response-based    73.32      77.24     59.65      68.27     22.62      29.20

81

Response-based Update vs. Baseline (Chinese-Word)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline          75.53      76.41     61.03      63.46     19.08      23.12
Response-based    77.26      77.74     64.12      65.64     21.29      23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline          73.05      77.55     55.61      62.85     12.74      23.33
Response-based    76.26      79.76     64.08      65.50     22.25      25.35

83

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single            73.32      77.24     59.65      68.27     22.62      29.20
Multiple          73.43      77.81     62.81      68.93     26.57      29.10

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single            77.26      77.74     64.12      65.64     21.29      23.74
Multiple          78.80      78.11     64.15      66.27     21.55      25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single            76.26      79.76     64.08      65.50     22.25      25.35
Multiple          79.44      79.94     64.08      66.84     22.58      27.16

87

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
– A single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
– A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure
• Large-scale data
– Data collection and model adaptation at large scale
• Machine translation
– Application to summarized translation
• Real perceptual data
– Learning with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Unigram Generation PCFG Model

• Limitations of the Hierarchy Generation PCFG Model
– Complexities caused by the Lexeme Hierarchy Graph and k-permutations
– Tends to over-fit to the training data

• Proposed solution: a simpler model
– Generate relevant semantic lexemes one by one
– No extra PCFG rules for k-permutations
– Maintains a simpler PCFG rule set; faster to train

47

PCFG Construction

• Unigram Markov generation of relevant lexemes
– Each context MR generates its relevant lexemes one by one
– Permutations of the order in which the relevant lexemes appear are thereby already covered

48
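The rule shape behind this can be sketched as follows. This is an illustration only, with assumed names: real rules also carry probabilities and the NL-word emissions, which the sketch omits. Because the context nonterminal re-expands after each emission, every ordering of the lexemes is derivable with only two rules per lexeme, instead of explicit k-permutation rules.

```python
def unigram_generation_rules(context, lexemes):
    """Rules letting a context MR emit its relevant lexemes one at a time.

    context: name of the context-MR nonterminal; lexemes: its relevant
    semantic lexemes. Returns (lhs, rhs) rule pairs (sketch only).
    """
    rules = []
    for lex in lexemes:
        rules.append((context, [lex, context]))  # emit lex, keep generating
        rules.append((context, [lex]))           # emit lex and stop
    return rules
```

For k lexemes this yields 2k rules, while any of the k! emission orders remains derivable by chaining the recursive rule.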

PCFG Construction

[Diagram: each semantic concept is generated by a unigram Markov process, and all semantic concepts generate the relevant NL words]

49

Parsing New NL Sentences

• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
– Consider only the lexeme MRs responsible for generating NL words
– Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50
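The marking step can be sketched in a heavily simplified form. Assumptions, loudly labeled: MRs are flat lists of action strings and coverage is exact string match, whereas the real model marks components structurally inside nested MRs; the function name is hypothetical.

```python
def compose_mr(context_actions, relevant_lexemes):
    """Keep only the context-MR components covered by relevant lexemes.

    context_actions: flat list of action strings from the context MR;
    relevant_lexemes: lists of action strings, one per lexeme MR that
    generated NL words. Simplified sketch: exact string matching.
    """
    covered = set()
    for lex in relevant_lexemes:
        covered.update(lex)  # union of all lexeme components
    # Preserve the context MR's original action order.
    return [a for a in context_actions if a in covered]
```

In the running example, the context-MR components not covered by any word-generating lexeme (e.g. an extra Verify) are dropped from the final composed plan.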

[Diagram, slides 51-54: the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner". The context MR in the top nonterminal is Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT); the relevant lexemes that generate NL words are Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(). Marking these components in the context MR yields the final composed MR: Turn(LEFT), Travel(), Verify(at: SOFA), Turn().]

Data

• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
– Word-segmented version by the Stanford Chinese Word Segmenter
– Character-segmented version

Example paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair, and at the chair take a right towards the stool. When you reach the stool, you are at 7."
Paragraph action sequence: Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward
Single-sentence steps, e.g., "Take the wood path towards the easel." and "At the easel, go left and then take a right on the blue path at the corner.", each paired with its action subsequence

55

Data Statistics

                                   Paragraph       Single-Sentence
# Instructions                     706             3236
Avg. # sentences                   5.0 (±2.8)      1.0 (±0)
Avg. # actions                     10.4 (±5.7)     2.1 (±2.4)
Avg. # words/sentence
  English                          37.6 (±21.1)    7.8 (±5.1)
  Chinese-Word                     31.6 (±18.1)    6.9 (±4.9)
  Chinese-Character                48.9 (±28.3)    10.6 (±7.3)
Vocabulary
  English                          660             629
  Chinese-Word                     661             508
  Chinese-Character                448             328

56

Evaluations

• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy & plan execution accuracy

• Compared with Chen and Mooney (2011) and Chen (2012)
– The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
– Semantic parser: KRISP (Kate and Mooney, 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy

58

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.10       68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
– Also considers facing direction in the single-sentence setting
– Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.40             16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.70             20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.40             23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion

• Better recall in parse accuracy
– Our probabilistic model uses useful but low-score lexemes as well → more coverage
– Unified models are not vulnerable to intermediate information loss

• The Hierarchy Generation PCFG model over-fits to the training data
– Complexities: LHG and k-permutation rules
– Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate

• The Unigram Generation PCFG model is better
– Less complexity avoids over-fitting and gives better generalization

• Better than Borschinger et al. (2011)
– Overcomes intractability in complex MRLs
– Learns from more general, complex ambiguity
– Handles novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation PCFG    Unigram Generation PCFG
Data                  |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English               20,451      17.26            16,357      8.78
Chinese (Word)        21,636      15.99            15,459      8.05
Chinese (Character)   19,792      18.64            13,514      12.58

67

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68



PCFG Construction

bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one

by onendash Permutations of the appearing orders of relevant

lexemes are already considered

48

PCFG Construction

49

Each semantic concept is generated by a unigram Markov process

All semantic concepts generate relevant NL words

Parsing New NL Sentences

• Follows a similar scheme to the Hierarchy Generation PCFG model
• Composes the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50

[Slides 51–54: worked example of parsing a new NL instruction]

NL: "Turn left and find the sofa then turn around the corner"

Context MR: Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)

Relevant lexemes: Turn(LEFT), Travel(), Verify(at: SOFA), Turn()

The most probable parse tree for the test NL instruction selects the relevant lexemes from the context MR, and the marked lexeme MR components compose the final MR parse.

54

Data

• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph: "Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."

Single sentences with action traces:
  "Take the wood path towards the easel." – Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward
  "At the easel go left and then take a right on the the blue path at the corner." – Forward, Turn left, Forward, Turn right, Turn

Data Statistics

56

                          Paragraph        Single-Sentence
# Instructions            706              3236
Avg. # sentences          5.0 (±2.8)       1.0 (±0)
Avg. # actions            10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sent.
  English                 37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                 660              629
  Chinese-Word            661              508
  Chinese-Character       448              328

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes with two different lexicon learning algorithms
    Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57
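The leave-one-map-out protocol above can be sketched as a simple cross-validation splitter (the map names below are illustrative placeholders, not the corpus's actual map identifiers): each of the 3 maps is held out once, with the other 2 used for training.

```python
def leave_one_map_out(maps):
    """Yield (train_maps, test_map) folds, holding out each map once."""
    for held_out in maps:
        train = [m for m in maps if m != held_out]
        yield train, held_out

maps = ["map_a", "map_b", "map_c"]  # placeholder names for the 3 maps
folds = list(leave_one_map_out(maps))
# 3 folds; each trains on 2 maps and tests on the remaining one
```

Reported numbers are then averages over the 3 folds, so no test map is ever seen during training.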

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58
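One plausible reading of a partial parse metric is sketched below — credit is given per matched MR component rather than requiring the whole parse to be correct. This is a hedged illustration; the thesis's actual component-matching criterion may differ.

```python
def partial_parse_prf1(gold, predicted):
    """Precision/recall/F1 over matched MR components (set overlap)."""
    matched = len(set(gold) & set(predicted))
    precision = matched / len(predicted)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = partial_parse_prf1(
    gold=["Turn(LEFT)", "Travel(steps:2)", "Verify(at:SOFA)"],
    predicted=["Turn(LEFT)", "Travel(steps:2)", "Turn(RIGHT)"],
)
# p = r = f1 = 2/3 here: 2 of the 3 components on each side match
```

A metric of this shape explains how a model can trade precision for recall by emitting more (partially correct) components, which is relevant to the recall discussion later.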

Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.10       68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also consider the facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution

62
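The strict success criterion above can be sketched as follows (the position/direction encoding is a hypothetical simplification of the simulator's state): execution succeeds only if the final position matches the destination exactly, and for single-sentence instructions the facing direction must match too.

```python
def execution_success(final_pos, final_dir, goal_pos, goal_dir, single_sentence):
    """Strict end-to-end success: exact position match, plus exact
    facing-direction match when evaluating single-sentence instructions."""
    if final_pos != goal_pos:
        return False
    return final_dir == goal_dir if single_sentence else True

# Same cell, wrong heading: counts at paragraph level but not sentence level.
paragraph_ok = execution_success((3, 4), 90, (3, 4), 270, single_sentence=False)
single_ok = execution_success((3, 4), 90, (3, 4), 270, single_sentence=True)
```

Because a paragraph is executed sentence by sentence from wherever the previous step ended, one early failure under this strict test can doom the whole paragraph, which is why paragraph numbers are much lower than single-sentence numbers.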

End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.40             16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.70             20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.40             23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, generalizes better
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                     Hierarchy Generation         Unigram Generation
                     PCFG Model                   PCFG Model
Data                 |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English              20,451      17.26            16,357      8.78
Chinese (Word)       21,636      15.99            15,459      8.05
Chinese (Character)  19,792      18.64            13,514      12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – Trained model outputs the best result with max probability

[Figure: a testing example goes into the trained generative model, which outputs Candidate 1, the 1-best candidate with maximum probability]

70

Discriminative Reranking

• Can we do better?
  – Secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Figure: a testing example goes into the trained baseline generative model; GEN produces n-best candidates (Candidate 1 … Candidate n); the trained secondary discriminative model outputs the best prediction]

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    Used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    Response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins 2000)

• Parameter weight vector is updated when the trained model predicts a wrong candidate

[Figure: for a training example, the trained baseline generative model's GEN produces n-best candidates with feature vectors a₁, a₂, a₃, a₄, …, aₙ and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59; the perceptron's best prediction (a₄) is compared against the gold-standard reference (a_g) and the weights are updated with the feature-vector difference a_g − a₄. For our generative models, a gold-standard reference is Not Available.]

73
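The core perceptron update in the figure can be sketched as below. The feature vectors are toy values, and the "averaged" part of the method (keeping a running average of the weight vector over updates) is omitted for brevity: when the highest-scoring candidate is not the reference, the weights move toward the reference's feature vector.

```python
def score(w, a):
    """Linear perceptron score: dot product of weights and features."""
    return sum(wi * ai for wi, ai in zip(w, a))

def perceptron_update(w, candidates, gold_idx):
    """One perceptron step: if the argmax candidate is wrong, add the
    feature-vector difference (gold minus predicted) to the weights."""
    pred_idx = max(range(len(candidates)), key=lambda i: score(w, candidates[i]))
    if pred_idx != gold_idx:
        w = [wi + g - p for wi, g, p in zip(w, candidates[gold_idx], candidates[pred_idx])]
    return w

w = perceptron_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], gold_idx=1)
# -> [-1.0, 1.0]: weights move toward the gold candidate's features
```

The difficulty flagged on the slide is exactly the `gold_idx` argument: for these generative models no gold-standard reference exists, which motivates the response-based update that follows.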

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate composed MR plans from candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    Whether each candidate MR reaches the intended destination
    MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
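Pseudo-gold selection can be sketched as follows. A random stub stands in for the MARCO executor here (it is not MARCO, just an assumed Boolean-returning simulator): each candidate plan is executed several times and the candidate with the highest average success rate becomes the pseudo-gold reference.

```python
import random

def success_rate(plan, execute, trials=10):
    """Average success over repeated runs of a nondeterministic executor."""
    return sum(bool(execute(plan)) for _ in range(trials)) / trials

def pick_pseudo_gold(plans, execute, trials=10):
    """Return the index of the candidate with the best success rate."""
    rates = [success_rate(p, execute, trials) for p in plans]
    best = max(range(len(plans)), key=lambda i: rates[i])
    return best, rates

random.seed(0)
stub_marco = lambda plan: random.random() < plan["reliability"]  # stand-in, not MARCO
best, rates = pick_pseudo_gold([{"reliability": 0.2}, {"reliability": 0.9}], stub_marco)
```

Averaging over 10 trials matches the slide's point that a single nondeterministic run is too noisy a signal to pick a reference from.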

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Figure: the n-best candidates (Candidate 1 … Candidate n) derive MR plans MR₁ … MRₙ; the MARCO execution module assigns execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 while the perceptron assigns scores 1.79, 0.21, −1.09, 1.46, 0.59; the candidate with the highest success rate (0.9) becomes the pseudo-gold reference, and the weights are updated with the feature-vector difference from the best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    MR plans are underspecified or have ignorable details attached
    Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
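The multiple-parse update can be sketched as below (feature vectors and rates are toy values): every candidate whose execution success rate beats the current best prediction contributes its feature-vector difference, scaled by the success-rate gap.

```python
def multi_parse_update(w, feats, rates, pred_idx):
    """Add gap-weighted feature differences for every candidate that
    out-executes the currently predicted parse."""
    for f, r in zip(feats, rates):
        gap = r - rates[pred_idx]
        if gap > 0:  # this candidate is preferable to the current prediction
            w = [wi + gap * (fi - pi) for wi, fi, pi in zip(w, f, feats[pred_idx])]
    return w

w = multi_parse_update(
    [0.0, 0.0],
    feats=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    rates=[0.9, 0.4, 0.6],
    pred_idx=1,
)
# candidates 0 (gap 0.5) and 2 (gap 0.2) both push the weights;
# candidate 1 is the prediction itself and contributes nothing
```

Weighting by the gap means a candidate only slightly better than the prediction nudges the weights gently, while a much better one dominates the update, matching the slide's motivation.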

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Slides 77–78, figure: the n-best candidates derive MR plans MR₁ … MRₙ with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59; updates (1) and (2) apply the feature-vector differences of every candidate whose success rate exceeds that of the current best prediction]

78

Features

• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Turn left and find the sofa then turn around the corner

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1    f(L3 → L5 ∨ L1) = 1    f(L3 ⇒ L5 L6) = 1    f(L5, "find") = 1

79

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses
    Many parse trees differ insignificantly, leading to the same derived MR plans
    Generate sufficiently large 1,000,000-best parse trees from the baseline model

80
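The distinct-plan filtering described above can be sketched as a deduplication pass (the parse and plan values are placeholders): walk the n-best list in probability order and keep the first parse seen for each distinct derived MR plan.

```python
def distinct_by_plan(parses, derive_plan, n=50):
    """Keep the highest-probability parse per distinct derived MR plan,
    stopping once n distinct plans have been collected."""
    seen, kept = set(), []
    for parse in parses:  # assumed sorted by decreasing probability
        plan = derive_plan(parse)
        if plan not in seen:
            seen.add(plan)
            kept.append(parse)
        if len(kept) == n:
            break
    return kept

parses = ["t1", "t2", "t3", "t4"]                      # placeholder parse trees
plans = {"t1": "P1", "t2": "P1", "t3": "P2", "t4": "P3"}  # t1 and t2 share a plan
top = distinct_by_plan(parses, plans.get, n=2)
# -> ["t1", "t3"]: t2 is dropped as a duplicate of t1's plan
```

This is why a very large raw n-best list is needed: insignificant tree variants collapse onto the same plan, so reaching 50 distinct plans can require orders of magnitude more raw parses.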

Response-based Update vs. Baseline (English)

81

             Parse F1                Single-sentence         Paragraph
             Baseline  Response      Baseline  Response      Baseline  Response
Hierarchy    74.81     73.32         57.22     59.65         20.17     22.62
Unigram      76.44     77.24         67.14     68.27         28.12     29.20

Response-based Update vs. Baseline (Chinese-Word)

82

             Parse F1                Single-sentence         Paragraph
             Baseline  Response      Baseline  Response      Baseline  Response
Hierarchy    75.53     77.26         61.03     64.12         19.08     21.29
Unigram      76.41     77.74         63.40     65.64         23.12     23.74

Response-based Update vs. Baseline (Chinese-Character)

83

             Parse F1                Single-sentence         Paragraph
             Baseline  Response      Baseline  Response      Baseline  Response
Hierarchy    73.05     76.26         55.61     64.08         12.74     22.25
Unigram      77.55     79.76         62.85     65.50         23.33     25.35

Response-based Update vs. Baseline

• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

             Parse F1            Single-sentence       Paragraph
             Single  Multi       Single  Multi         Single  Multi
Hierarchy    73.32   73.43       59.65   62.81         22.62   26.57
Unigram      77.24   77.81       68.27   68.93         29.20   29.10

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

             Parse F1            Single-sentence       Paragraph
             Single  Multi       Single  Multi         Single  Multi
Hierarchy    77.26   78.80       64.12   64.15         21.29   21.55
Unigram      77.74   78.11       65.64   66.27         23.74   25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

             Parse F1            Single-sentence       Paragraph
             Single  Multi       Single  Multi         Single  Multi
Hierarchy    76.26   79.44       64.08   64.08         22.25   22.58
Unigram      79.76   79.94       65.50   66.84         25.35   27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection, model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 50: Grounded Language Learning Models for Ambiguous  Supervision

Parsing New NL Sentences
• Follows a similar scheme as the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal

50
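The composition step above can be sketched as a filter over context-MR components. This is a toy sketch: `compose_final_mr` and the list-based MR encoding are illustrative, not the thesis implementation.

```python
# Toy sketch: keep only the context-MR components that some relevant
# lexeme MR (one responsible for generating NL words) accounts for.
def compose_final_mr(context_mr, relevant_lexemes):
    marked = []
    for component in context_mr:
        # a component is "marked" if any relevant lexeme contains it
        if any(component in lexeme for lexeme in relevant_lexemes):
            marked.append(component)
    return marked

context = ["Turn(LEFT)", "Verify(front: SOFA)", "Travel(steps: 2)",
           "Verify(at: SOFA)", "Turn(RIGHT)"]
lexemes = [["Turn(LEFT)"],
           ["Travel(steps: 2)", "Verify(at: SOFA)"],
           ["Turn(RIGHT)"]]
print(compose_final_mr(context, lexemes))
# ['Turn(LEFT)', 'Travel(steps: 2)', 'Verify(at: SOFA)', 'Turn(RIGHT)']
```

Components of the context MR not backed by any relevant lexeme (here `Verify(front: SOFA)`) are dropped from the final MR.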

[Figure (slides 51–54): the most probable parse tree for the test NL instruction "Turn left and find the sofa, then turn around the corner." The context MR in the top nonterminal is Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT); the relevant lexemes responsible for generating NL words are Turn(LEFT), Travel(), Verify(at: SOFA), and Turn(), and their components are marked in the context MR to compose the final parse.]

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph instruction:
"Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."
Observed follower actions: Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward, …

Hand-segmented single-sentence steps:
"Take the wood path towards the easel." — Turn, Forward
"At the easel go left and then take a right on the the blue path at the corner." — Turn left, Forward, Turn right, …

Data Statistics

                            Paragraph       Single-Sentence
  # Instructions            706             3236
  Avg. # sentences          5.0 (±2.8)      1.0 (±0)
  Avg. # actions            10.4 (±5.7)     2.1 (±2.4)
  Avg. # words / sentence
    English                 37.6 (±21.1)    7.8 (±5.1)
    Chinese-Word            31.6 (±18.1)    6.9 (±4.9)
    Chinese-Character       48.9 (±28.3)    10.6 (±7.3)
  Vocabulary
    English                 660             629
    Chinese-Word            661             508
    Chinese-Character       448             328

56

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    • Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser KRISP (Kate and Mooney, 2006) trained on the resulting supervised data

57
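The leave-one-map-out protocol can be sketched as a generator over train/test splits. The map tags on the toy examples are illustrative assumptions:

```python
# Leave-one-map-out cross-validation sketch: hold out one map for testing,
# train on the examples from the remaining maps.
def leave_one_map_out(examples):
    """examples: list of (map_name, example) pairs."""
    maps = sorted({m for m, _ in examples})
    for held_out in maps:
        train = [x for m, x in examples if m != held_out]
        test = [x for m, x in examples if m == held_out]
        yield held_out, train, test

data = [("grid", 1), ("jelly", 2), ("l", 3), ("grid", 4)]
for held_out, train, test in leave_one_map_out(data):
    print(held_out, len(train), len(test))
```

With three maps this yields exactly the "2 maps for training, 1 map for testing" splits described above.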

Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58
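Partial parse accuracy can be scored as precision/recall/F1 over matched MR components. The sketch below counts exact component matches; the thesis may use a more refined matching.

```python
from collections import Counter

def partial_parse_accuracy(predicted, gold):
    """Precision/recall/F1 over MR components (multiset intersection)."""
    matched = sum((Counter(predicted) & Counter(gold)).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = partial_parse_accuracy(
    ["Turn(LEFT)", "Travel()", "Turn(RIGHT)"],
    ["Turn(LEFT)", "Travel(steps: 2)", "Turn(RIGHT)"])
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 0.67 0.67
```

A parse gets partial credit for each correct component even when the full MR is wrong, which is why recall differences between systems show up under this metric.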

Parse Accuracy (English)

                                     Precision   Recall   F1
  Chen & Mooney (2011)               90.16       55.41    68.59
  Chen (2012)                        88.36       57.03    69.31
  Hierarchy Generation PCFG Model    87.58       65.41    74.81
  Unigram Generation PCFG Model      86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

                                     Precision   Recall   F1
  Chen (2012)                        88.87       58.76    70.74
  Hierarchy Generation PCFG Model    80.56       71.14    75.53
  Unigram Generation PCFG Model      79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

                                     Precision   Recall   F1
  Chen (2012)                        92.48       56.47    70.01
  Hierarchy Generation PCFG Model    79.77       67.38    73.05
  Unigram Generation PCFG Model      79.73       75.52    77.55

61

End-to-End Execution Evaluations
• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered for single-sentence evaluation
  – Paragraph execution is derailed by even one failed single-sentence execution

62
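A toy sketch of this strict metric on a grid world follows. The world, poses, and action names here are illustrative stand-ins, not the MARCO simulator:

```python
# Toy grid world: a pose is (x, y, facing); actions move/rotate the agent.
def run(pose, actions):
    x, y, d = pose
    dirs = ["N", "E", "S", "W"]
    step = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
    for a in actions:
        if a == "Forward":
            dx, dy = step[d]
            x, y = x + dx, y + dy
        elif a == "Turn left":
            d = dirs[(dirs.index(d) - 1) % 4]
        elif a == "Turn right":
            d = dirs[(dirs.index(d) + 1) % 4]
    return (x, y, d)

def single_sentence_success(start, actions, gold_pose):
    # strict: final position AND facing direction must match exactly
    return run(start, actions) == gold_pose

def paragraph_success(start, sentence_plans, gold_xy):
    pose = start
    for plan in sentence_plans:  # one failed step derails all later steps
        pose = run(pose, plan)
    return pose[:2] == gold_xy

print(single_sentence_success((0, 0, "N"), ["Turn right", "Forward"],
                              (1, 0, "E")))  # True
```

Because each sentence starts from wherever the previous one ended, paragraph accuracy is much lower than single-sentence accuracy, as the result tables below show.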

End-to-End Execution Evaluations (English)

                                     Single-Sentence   Paragraph
  Chen & Mooney (2011)               54.4              16.18
  Chen (2012)                        57.28             19.18
  Hierarchy Generation PCFG Model    57.22             20.17
  Unigram Generation PCFG Model      67.14             28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                     Single-Sentence   Paragraph
  Chen (2012)                        58.7              20.13
  Hierarchy Generation PCFG Model    61.03             19.08
  Unigram Generation PCFG Model      63.4              23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                     Single-Sentence   Paragraph
  Chen (2012)                        57.27             16.73
  Hierarchy Generation PCFG Model    55.61             12.74
  Unigram Generation PCFG Model      62.85             23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                         Hierarchy Generation        Unigram Generation
                         PCFG Model                  PCFG Model
  Data                   |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
  English                20,451      17.26           16,357      8.78
  Chinese (Word)         21,636      15.99           15,459      8.05
  Chinese (Character)    19,792      18.64           13,514      12.58

67

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – The trained model outputs the single best result with maximum probability

[Figure: a trained generative model maps a testing example to its 1-best candidate with maximum probability.]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Figure: the trained baseline generative model generates (GEN) n-best candidates for a testing example; a trained secondary discriminative model selects the best prediction as output.]

71

How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating final end-task plan execution)
  – This gives a weak indication of whether a candidate is good or bad
  – Multiple candidate parses are used for the parameter update, since the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated whenever the trained model predicts a wrong candidate

[Figure: the trained baseline generative model generates (GEN) n-best candidates for a training example, with feature vectors a₁ … aₙ and perceptron scores (e.g. −0.16, 1.21, −1.09, 1.46, 0.59). When the top-scoring candidate differs from the gold-standard reference g, the weights are updated with the feature-vector difference a_g − a_pred. For our generative models, no such gold-standard reference is available.]

73
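A minimal averaged-perceptron reranker in the spirit of Collins (2000) can be sketched as follows. Feature vectors are dicts, and all names are illustrative:

```python
def perceptron_score(weights, feats):
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def train_reranker(examples, epochs=5):
    """examples: list of (candidate_feature_dicts, gold_index).
    When the top-scoring candidate is wrong, add the feature difference
    between the reference and the prediction; return averaged weights."""
    weights, total, steps = {}, {}, 0
    for _ in range(epochs):
        for candidates, gold in examples:
            pred = max(range(len(candidates)),
                       key=lambda i: perceptron_score(weights, candidates[i]))
            if pred != gold:
                for f, v in candidates[gold].items():
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in candidates[pred].items():
                    weights[f] = weights.get(f, 0.0) - v
            for f, v in weights.items():  # accumulate for averaging
                total[f] = total.get(f, 0.0) + v
            steps += 1
    return {f: v / steps for f, v in total.items()}

w = train_reranker([([{"bad": 1.0}, {"good": 1.0}], 1)])
print(w["good"] > w["bad"])  # True
```

Averaging the weights over all update steps (rather than keeping only the final vector) is what makes this the *averaged* perceptron, which is less prone to oscillation on noisy feedback.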

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the MR plans composed from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Figure: each of the n-best candidates derives an MR plan (MR₁ … MRₙ), which the MARCO execution module runs in the simulated world to obtain execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2). The candidate with the highest rate (0.9) becomes the pseudo-gold reference, and the perceptron (scores e.g. 1.79, 0.21, −1.09, 1.46, 0.59) is updated with the feature-vector difference between the pseudo-gold reference and its best prediction.]

75
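The pseudo-gold selection can be sketched with a stubbed executor standing in for MARCO; the stub, candidate names, and success probabilities below are illustrative:

```python
import random

def execution_success_rate(plan, execute, trials=10):
    # MARCO is nondeterministic, so average success over several trials
    return sum(execute(plan) for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_plans, execute):
    rates = [execution_success_rate(p, execute) for p in candidate_plans]
    best = max(range(len(rates)), key=lambda i: rates[i])
    return best, rates

# Stub executor: plan "A" reaches the goal ~90% of the time, "B" never does.
random.seed(0)
execute = lambda plan: plan == "A" and random.random() < 0.9
best, rates = pick_pseudo_gold(["B", "A"], execute)
print(best)  # 1
```

The index returned by `pick_pseudo_gold` plays the role of the gold index in the perceptron update, replacing the unavailable gold-standard reference.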

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans are underspecified or have ignorable details attached, and are sometimes inaccurate but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use all candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Figure, update (1): with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59, the best-scored prediction (1.83) has success rate 0.4; the first candidate with a higher rate (0.6) contributes a feature-vector difference update.]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Figure, update (2): continuing the same example, the other candidate with a higher rate than the prediction (0.9) contributes a second feature-vector difference update.]

78
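The two update steps pictured above can be sketched in one loop: every candidate whose execution success rate beats the current prediction contributes an update weighted by the rate gap. Feature vectors are dicts, and all names are illustrative:

```python
def score(weights, feats):
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def multi_parse_update(weights, candidates, rates):
    """candidates: feature dicts; rates: execution success rates."""
    pred = max(range(len(candidates)),
               key=lambda i: score(weights, candidates[i]))
    for i, feats in enumerate(candidates):
        gap = rates[i] - rates[pred]
        if gap > 0:  # only candidates that execute better than the prediction
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + gap * v
            for f, v in candidates[pred].items():
                weights[f] = weights.get(f, 0.0) - gap * v
    return weights

w = multi_parse_update({}, [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}],
                       [0.2, 0.9, 0.6])
print(sorted(w, key=w.get, reverse=True))  # ['b', 'c', 'a']
```

Candidates with larger success-rate gaps push the weights harder, so the signal from many weakly preferable parses accumulates instead of relying on a single pseudo-gold.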

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa, then turn around the corner."
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

Example features: 𝒇(L1 → L3) = 1, 𝒇(L3 → L5 ∨ L1) = 1, 𝒇(L3 ⇒ L5 L6) = 1, 𝒇(L5, "find") = 1

79
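Extracting such composition indicator features from a parse tree can be sketched as below. The nested-tuple tree encoding and label names are illustrative:

```python
# Binary rule-composition features: one indicator per (parent, children)
# composition occurring anywhere in the parse tree.
def tree_features(tree, feats=None):
    if feats is None:
        feats = set()
    if isinstance(tree, tuple):
        label, children = tree[0], tree[1:]
        child_labels = tuple(c[0] if isinstance(c, tuple) else c
                             for c in children)
        feats.add((label, child_labels))  # e.g. ("L3", ("L5", "L6"))
        for c in children:
            tree_features(c, feats)
    return feats

tree = ("L1", ("L3", ("L5", "find"), ("L6", "turn")))
for f in sorted(map(str, tree_features(tree))):
    print(f)
```

Each extracted pair becomes a binary feature whose weight the reranker learns; the set-valued return makes the features indicators rather than counts.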

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We collect the 50 best *distinct* composed MR plans (and their parses) out of the 1,000,000-best parses, since many parse trees differ insignificantly and lead to the same derived MR plan; the baseline model generates the sufficiently large 1,000,000-best parse list

80
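Collecting the 50 best distinct plans can be sketched as a single pass over the score-ordered parse list. `compose_plan` is a stand-in for the real MR composition:

```python
# Keep the first parse seen for each distinct composed plan, stopping at k,
# assuming parses arrive in decreasing score order.
def distinct_best(parses, compose_plan, k=50):
    seen, kept = set(), []
    for parse in parses:
        plan = compose_plan(parse)
        if plan not in seen:
            seen.add(plan)
            kept.append((plan, parse))
        if len(kept) == k:
            break
    return kept

parses = ["t1", "t2", "t3", "t4"]
compose_plan = lambda t: {"t1": "P1", "t2": "P1", "t3": "P2", "t4": "P3"}[t]
print(distinct_best(parses, compose_plan, k=2))
# [('P1', 't1'), ('P2', 't3')]
```

Deduplicating by composed plan rather than by tree is what makes a 1,000,000-best list necessary to reach 50 genuinely different candidates.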

Response-based Update vs. Baseline (English)

                 Parse F1             Single-sentence      Paragraph
                 Baseline  Response   Baseline  Response   Baseline  Response
  Hierarchy      74.81     73.32      57.22     59.65      20.17     22.62
  Unigram        76.44     77.24      67.14     68.27      28.12     29.2

81

Response-based Update vs. Baseline (Chinese-Word)

                 Parse F1             Single-sentence      Paragraph
                 Baseline  Response   Baseline  Response   Baseline  Response
  Hierarchy      75.53     77.26      61.03     64.12      19.08     21.29
  Unigram        76.41     77.74      63.4      65.64      23.12     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                 Parse F1             Single-sentence      Paragraph
                 Baseline  Response   Baseline  Response   Baseline  Response
  Hierarchy      73.05     76.26      55.61     64.08      12.74     22.25
  Unigram        77.55     79.76      62.85     65.5       23.33     25.35

83

Response-based Update vs. Baseline
• The response-based approach performs better on the final end-task plan execution
  – It optimizes the model directly for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                 Parse F1          Single-sentence    Paragraph
                 Single  Multi     Single  Multi      Single  Multi
  Hierarchy      73.32   73.43     59.65   62.81      22.62   26.57
  Unigram        77.24   77.81     68.27   68.93      29.2    29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                 Parse F1          Single-sentence    Paragraph
                 Single  Multi     Single  Multi      Single  Multi
  Hierarchy      77.26   78.8      64.12   64.15      21.29   21.55
  Unigram        77.74   78.11     65.64   66.27      23.74   25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                 Parse F1          Single-sentence    Paragraph
                 Single  Multi     Single  Multi      Single  Multi
  Hierarchy      76.26   79.44     64.08   64.08      22.25   22.58
  Unigram        79.76   79.94     65.5    66.84      25.35   27.16

87

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 51: Grounded Language Learning Models for Ambiguous  Supervision

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT

atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn left and find the sofa then turn around the corner

Most probable parse tree for a test NL instruction

NL

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Context MR

Context MR

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussion

• Better recall in parse accuracy
– Our probabilistic model uses useful but low-score lexemes as well → more coverage
– Unified models are not vulnerable to intermediate information loss

• Hierarchy Generation PCFG model over-fits to training data
– Complexities: LHG and k-permutation rules
– Particularly weak in the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate

• Unigram Generation PCFG model is better
– Less complexity: avoids over-fitting, better generalization

• Better than Borschinger et al. 2011
– Overcomes intractability in complex MRLs
– Learns from more general, complex ambiguity
– Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                        Hierarchy Generation PCFG Model    Unigram Generation PCFG Model
Data                      |Grammar|      Time (hrs)          |Grammar|      Time (hrs)
English                     20,451         17.26               16,357          8.78
Chinese (Word)              21,636         15.99               15,459          8.05
Chinese (Character)         19,792         18.64               13,514         12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve performance of generative models with a secondary discriminative model

• Applied to various NLP tasks
– Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
– Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
– Part-of-speech tagging (Collins, EMNLP 2002)
– Semantic role labeling (Toutanova et al., ACL 2005)
– Named entity recognition (Collins, ACL 2002)
– Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
– Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)

• Goal
– Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
– The trained model outputs the single best result with maximum probability

[Diagram: a testing example is fed to the trained generative model, which returns the 1-best candidate with maximum probability]

70

Discriminative Reranking

• Can we do better?
– A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: a testing example is fed to the trained baseline generative model (GEN), which produces n-best candidates 1 ... n; the trained secondary discriminative model then selects the best prediction as the output]

71
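The two-stage pipeline on this slide can be sketched in a few lines; `gen` and `discriminative_score` below are hypothetical stand-ins for the trained baseline and secondary models, not the thesis code.

```python
def rerank(example, gen, discriminative_score, n=50):
    """Secondary-model reranking: take the n-best candidates from the
    baseline generative model and return the candidate the discriminative
    model scores highest."""
    candidates = gen(example, n)
    return max(candidates, key=discriminative_score)

# Toy usage: here the "discriminative model" simply prefers longer strings.
best = rerank("example", lambda ex, n: ["a", "bbb", "cc"], len)
# best == "bbb"
```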

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: a training example passes through the trained baseline generative model (GEN), yielding n-best candidates with feature vectors a1, a2, a3, a4, ..., an and perceptron scores -0.16, 1.21, -1.09, 1.46, 0.59. The perceptron's best prediction (a4) is compared against the gold-standard reference ag, and the weights are updated with the feature-vector difference ag - a4. For our generative models, such a gold-standard reference is not available.]

73
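The update in the diagram can be sketched as one training pass of a reranking perceptron in the spirit of Collins' algorithm (weight averaging omitted for brevity); `gen`, `features`, and `gold_parse` are illustrative hooks, not the thesis API.

```python
def dot(u, v):
    # Inner product of two plain-list feature vectors.
    return sum(a * b for a, b in zip(u, v))

def perceptron_pass(examples, gen, features, gold_parse, w):
    """One pass of a reranking perceptron: whenever the best-scored
    candidate differs from the reference, add the feature-vector
    difference (reference minus prediction) to the weights."""
    for x in examples:
        best = max(gen(x), key=lambda c: dot(w, features(c)))
        gold = gold_parse(x)
        if best != gold:
            w = [wi + gi - bi
                 for wi, gi, bi in zip(w, features(gold), features(best))]
    return w
```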

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
– The most preferred one in terms of plan execution
– Evaluate the composed MR plans from candidate parses
– The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
– Also used for evaluating end-goal plan execution performance
– Record the execution success rate: whether each candidate MR reaches the intended destination; MARCO is nondeterministic, so average over 10 trials
– Prefer the candidate with the best execution success rate during training

74
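The selection step above can be sketched as follows; `execute` is an assumed hook for a MARCO-style simulator run that returns True when the plan reaches the goal, and the plan strings are made-up examples.

```python
def execution_success_rate(mr, execute, trials=10):
    """Average success over repeated runs, since the execution module is
    nondeterministic (the slide averages over 10 trials)."""
    return sum(1 for _ in range(trials) if execute(mr)) / trials

def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Prefer the candidate MR plan with the best execution success rate."""
    return max(candidate_mrs,
               key=lambda mr: execution_success_rate(mr, execute, trials))
```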

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates 1 ... n have derived MR plans MR1, MR2, MR3, MR4, ..., MRn. The MARCO execution module assigns execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; the perceptron scores are 1.79, 0.21, -1.09, 1.46, 0.59. The candidate with the best success rate (0.9) becomes the pseudo-gold reference, and the weights are updated with the feature-vector difference between it and the best prediction.]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
– Multiple parses may share the same maximum execution success rate
– "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
– MR plans may be underspecified or have ignorable details attached
– Sometimes inaccurate, but they contain the correct MR components needed to reach the desired goal

• Weight update with multiple candidate parses
– Use candidates with higher execution success rates than the currently best-predicted candidate
– Update with the feature-vector difference weighted by the difference between execution success rates

76
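The multiple-parse update can be sketched as below: every candidate whose execution success rate exceeds that of the currently predicted parse contributes its feature-vector difference, weighted by the rate difference. All names and the toy rates/features are illustrative, not the thesis code.

```python
def multi_parse_update(w, candidates, features, success_rate, predicted):
    """Weighted perceptron update with multiple preferable parses."""
    base = success_rate(predicted)
    for c in candidates:
        gain = success_rate(c) - base
        if gain > 0:
            # Scale the feature-vector difference by the rate difference.
            w = [wi + gain * (ci - pi)
                 for wi, ci, pi in zip(w, features(c), features(predicted))]
    return w

rates = {"a": 0.9, "b": 0.4, "p": 0.4}
feats = {"a": [1.0, 0.0], "b": [0.0, 1.0], "p": [0.0, 0.0]}
w = multi_parse_update([0.0, 0.0], ["a", "b", "p"], feats.get, rates.get, "p")
# only "a" beats the predicted parse's rate, so w ~ [0.5, 0.0]
```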

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): among the n-best candidates with derived plans MR1 ... MRn, execution success rates 0.6, 0.4, 0.0, 0.9, 0.2, and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59, one candidate whose success rate exceeds the predicted parse's rate contributes its feature-vector difference to the update.]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): with the same candidates (execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59), the other candidate whose success rate exceeds the predicted parse's rate also contributes its weighted feature-vector difference.]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner."

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

Example features: f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
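Extracting indicator features of this kind from a parse tree can be sketched as below; the nested-tuple tree encoding and the `rule_features` helper are toy illustrations in the spirit of the rule features above, not the thesis feature set.

```python
def rule_features(tree):
    """Binary indicator features over parent -> children compositions in a
    parse tree. A tree is encoded as a nested tuple (label, child, ...),
    where a child is either another tuple or a terminal string."""
    feats = set()

    def walk(node):
        if not isinstance(node, tuple):
            return  # terminal string: no rule of its own
        label, *children = node
        kids = " ".join(c[0] if isinstance(c, tuple) else c for c in children)
        feats.add(f"{label} -> {kids}")
        for c in children:
            walk(c)

    walk(tree)
    return feats

feats = rule_features(("L1", ("L3", ("L5", "find"), ("L6", "turn"))))
# contains "L1 -> L3", "L3 -> L5 L6", "L5 -> find", "L6 -> turn"
```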

Evaluations

• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)

• Compared with two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Try to get the 50-best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses
– Many parse trees differ insignificantly, leading to the same derived MR plans
– Generate a sufficiently large 1,000,000-best parse list from the baseline model

80
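Collapsing the large n-best list down to the 50 best distinct plans can be sketched as a simple deduplication pass; `derive_mr` is an assumed routine that composes a plan from a parse, and the parse/plan strings are made-up examples.

```python
def distinct_plan_nbest(ranked_parses, derive_mr, n=50):
    """Keep the best-scored parse for each distinct derived MR plan, since
    many parse trees differ only insignificantly and compose the same plan.
    `ranked_parses` is assumed to be sorted best-first."""
    seen, kept = set(), []
    for parse in ranked_parses:
        plan = derive_mr(parse)
        if plan not in seen:
            seen.add(plan)
            kept.append(parse)
            if len(kept) == n:
                break
    return kept
```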

Response-based Update vs Baseline(English)

81

                   Parse F1               Single-sentence          Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline           74.81     76.44        57.22     67.14        20.17     28.12
Response-based     73.32     77.24        59.65     68.27        22.62     29.20

Response-based Update vs Baseline (Chinese-Word)

82

                   Parse F1               Single-sentence          Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline           75.53     76.41        61.03     63.40        19.08     23.12
Response-based     77.26     77.74        64.12     65.64        21.29     23.74

Response-based Update vs Baseline(Chinese-Character)

83

                   Parse F1               Single-sentence          Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline           73.05     77.55        55.61     62.85        12.74     23.33
Response-based     76.26     79.76        64.08     65.50        22.25     25.35

Response-based Update vs Baseline

• vs. Baseline
– The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

                   Parse F1               Single-sentence          Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single             73.32     77.24        59.65     68.27        22.62     29.20
Multi              73.43     77.81        62.81     68.93        26.57     29.10

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

                   Parse F1               Single-sentence          Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single             77.26     77.74        64.12     65.64        21.29     23.74
Multi              78.80     78.11        64.15     66.27        21.55     25.95

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

                   Parse F1               Single-sentence          Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single             76.26     79.76        64.08     65.50        22.25     25.35
Multi              79.44     79.94        64.08     66.84        22.58     27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves the performance in general
– A single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates may produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
– A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure

• Large-scale data
– Data collection and model adaptation to large scale

• Machine translation
– Application to summarized translation

• Real perceptual data
– Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data

• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain

• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision

• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 52: Grounded Language Learning Models for Ambiguous  Supervision

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify

Turn

ContextMR

RelevantLexemes

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You

[Figure: an example context MR — Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) — shown alongside the relevant lexemes Turn(LEFT), Travel(), Verify(at: SOFA), Turn()]

54

[Figure: the same context MR — Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) — with its relevant lexemes]

Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Paragraph vs. single-sentence instructions:

Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair, and at the chair take a right towards the stool. When you reach the stool, you are at 7."
(observed action trace: Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward)

Single sentence: "Take the wood path towards the easel." (Forward, Turn left, Forward, Turn right) / "At the easel, go left and then take a right on the blue path at the corner." (Turn, …)

Data Statistics

56

                                Paragraph        Single-Sentence
# Instructions                  706              3236
Avg. # sentences                5.0 (±2.8)       1.0 (±0)
Avg. # actions                  10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sentence
  English                       37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word                  31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character             48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                       660              629
  Chinese-Word                  661              508
  Chinese-Character             448              328

Evaluations
• Leave-one-map-out approach
  – 2 maps for training, 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-scoring lexemes, using two different lexicon learning algorithms:
    Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney, 2006), trained on the resulting supervised data

57
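The leave-one-map-out protocol above can be sketched as a simple group-wise cross-validation: hold each map out in turn, training on the other two. This is an illustrative sketch; the map names in the example are placeholders, not necessarily the corpus's actual map identifiers.

```python
def leave_one_map_out(data):
    """Yield (held_out_map, train, test) splits: train on two maps,
    test on the third. `data` maps a map name to its examples."""
    maps = sorted(data)
    for held_out in maps:
        train = [ex for m in maps if m != held_out for ex in data[m]]
        yield held_out, train, data[held_out]
```

Each of the three splits trains a fresh model, and results are reported per held-out map.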

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy (precision, recall, and F1)

58
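The F1 figures on the following slides are the usual harmonic mean of precision and recall; a minimal helper:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as used for parse accuracy."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```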

Parse Accuracy (English)

                                  Precision  Recall  F1
Chen & Mooney (2011)              90.16      55.41   68.59
Chen (2012)                       88.36      57.03   69.31
Hierarchy Generation PCFG Model   87.58      65.41   74.81
Unigram Generation PCFG Model     86.1       68.79   76.44

59

Parse Accuracy (Chinese-Word)

                                  Precision  Recall  F1
Chen (2012)                       88.87      58.76   70.74
Hierarchy Generation PCFG Model   80.56      71.14   75.53
Unigram Generation PCFG Model     79.45      73.66   76.41

60

Parse Accuracy (Chinese-Character)

                                  Precision  Recall  F1
Chen (2012)                       92.48      56.47   70.01
Hierarchy Generation PCFG Model   79.77      67.38   73.05
Unigram Generation PCFG Model     79.73      75.52   77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered for single sentences
  – A paragraph execution fails if even one of its single-sentence executions fails

62

End-to-End Execution Evaluations (English)

                                  Single-Sentence  Paragraph
Chen & Mooney (2011)              54.4             16.18
Chen (2012)                       57.28            19.18
Hierarchy Generation PCFG Model   57.22            20.17
Unigram Generation PCFG Model     67.14            28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence  Paragraph
Chen (2012)                       58.7             20.13
Hierarchy Generation PCFG Model   61.03            19.08
Unigram Generation PCFG Model     63.4             23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence  Paragraph
Chen (2012)                       57.27            16.73
Hierarchy Generation PCFG Model   55.61            12.74
Unigram Generation PCFG Model     62.85            23.33

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model also exploits useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: the longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting and generalizes better
• Better than Börschinger et al. (2011)
  – Overcomes intractability for complex MRLs
  – Learns from more general, complex ambiguity
  – Handles novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

                      Hierarchy Generation          Unigram Generation
                      PCFG Model                    PCFG Model
Data                  |Grammar|   Time (hrs)        |Grammar|   Time (hrs)
English               20,451      17.26             16,357      8.78
Chinese (Word)        21,636      15.99             15,459      8.05
Chinese (Character)   19,792      18.64             13,514      12.58

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• An effective approach: improve the performance of a generative model with a secondary discriminative model
• Applied to various NLP tasks:
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model: the trained model outputs the 1-best result, the candidate with maximum probability
[Diagram: testing example → trained generative model → 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: testing example → trained baseline generative model → GEN → n-best candidates (candidate 1 … candidate n) → trained secondary discriminative model → best prediction → output]

71

How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead, training provides only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (as in the final end-task plan execution evaluation)
  – This gives a weak indication of whether a candidate is good or bad
  – Use multiple candidate parses for the parameter update, since the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated whenever the trained model predicts a wrong candidate
[Diagram: training example → trained baseline generative model → GEN → n-best candidates a1 … an with perceptron scores (−0.16, 1.21, −1.09, 1.46, 0.59); the best prediction a4 is compared against the gold-standard reference ag, and the weights are updated with the feature-vector difference ag − a4]

73

(For our generative models, a gold-standard reference is not available.)
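The averaged perceptron reranker described above can be sketched as follows. This is a minimal pure-Python sketch, not the thesis implementation: each training example is an n-best list of candidate feature vectors plus a gold index, and the returned weights are the average over all updates.

```python
def averaged_perceptron(examples, n_features, epochs=10):
    """Rerank with an averaged perceptron (Collins, 2000).

    `examples` is a list of (candidates, gold_index) pairs, where
    `candidates` is a list of feature vectors, one per n-best candidate.
    """
    w = [0.0] * n_features
    w_sum = [0.0] * n_features
    seen = 0

    def score(feats):
        return sum(wi * fi for wi, fi in zip(w, feats))

    for _ in range(epochs):
        for candidates, gold in examples:
            pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
            if pred != gold:
                # move the weights toward the gold candidate's features
                for j in range(n_features):
                    w[j] += candidates[gold][j] - candidates[pred][j]
            for j in range(n_features):
                w_sum[j] += w[j]
            seen += 1
    return [s / seen for s in w_sum]
```

Averaging the weight vector over all updates is the standard trick for stabilizing the perceptron's final model.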

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the MR plans composed from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (it is also used to evaluate end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
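The pseudo-gold selection above can be sketched as two small helpers. This is an illustrative sketch: `execute` stands in for the MARCO execution module, which is not shown here.

```python
def execution_success_rate(execute, mr_plan, world, trials=10):
    """Average success of a nondeterministic executor (the slides run MARCO
    10 times per candidate MR); `execute` returns True on reaching the goal."""
    return sum(bool(execute(mr_plan, world)) for _ in range(trials)) / trials

def pick_pseudo_gold(candidates, rates):
    """Index of the candidate whose MR has the highest execution success rate."""
    return max(range(len(candidates)), key=lambda i: rates[i])
```

The selected candidate then plays the role of the gold reference in the perceptron update.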

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates → derived MRs MR1 … MRn → MARCO execution module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); perceptron scores (1.79, 0.21, −1.09, 1.46, 0.59); the candidate with the highest success rate (0.9) becomes the pseudo-gold reference, and the weights are updated with the feature-vector difference from the best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold one can also be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or carry ignorable details, and sometimes inaccurate plans still contain the correct MR components needed to reach the desired goal
• Weight update with multiple candidate parses
  – Use every candidate with a higher execution success rate than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
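The multi-parse update rule above can be sketched as follows — an illustrative sketch, with feature vectors as plain lists; the function names are mine, not the thesis's.

```python
def multi_parse_update(w, feats, rates, pred):
    """Add, for every candidate whose execution success rate beats the
    currently predicted parse, the feature-vector difference weighted by
    the success-rate difference."""
    for i, rate in enumerate(rates):
        if rate > rates[pred]:
            weight = rate - rates[pred]
            w = [wj + weight * (fij - fpj)
                 for wj, fij, fpj in zip(w, feats[i], feats[pred])]
    return w
```

Weighting by the rate difference lets strongly preferred candidates pull harder on the weights than marginally better ones.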

Weight Update with Multiple Parses
• Update the weights with every candidate whose execution success rate exceeds that of the currently predicted parse
[Diagram, update (1): derived MRs MR1 … MRn with execution success rates (0.6, 0.4, 0.0, 0.9, 0.2) and perceptron scores (1.24, 1.83, −1.09, 1.46, 0.59); the first qualifying candidate contributes a weighted feature-vector difference]

77

Weight Update with Multiple Parses
• Update the weights with every candidate whose execution success rate exceeds that of the currently predicted parse
[Diagram, update (2): the same n-best list; the next qualifying candidate contributes its weighted feature-vector difference as well]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Example: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)   L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)   L5: Travel(), Verify(at: SOFA)   L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
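Indicator features of this kind can be sketched as a walk over the parse tree that records each parent → children composition. This is an illustrative sketch with a toy tree encoding, not the thesis's feature extractor.

```python
def rule_features(tree):
    """Binary indicator features over the parent -> children compositions in
    a parse tree; a tree is a (label, [children]) pair, leaves have no children."""
    feats = {}
    label, children = tree
    if children:
        rhs = " ".join(child[0] for child in children)
        feats["%s -> %s" % (label, rhs)] = 1  # indicator, not a count
        for child in children:
            feats.update(rule_features(child))
    return feats
```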

Evaluations
• Leave-one-map-out approach
  – 2 maps for training, 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We take the 50 best distinct composed MR plans (and their corresponding parses) from a sufficiently large 1,000,000-best parse list generated by the baseline model, since many parse trees differ only insignificantly and lead to the same derived MR plan

80
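The deduplication step above — collecting the first 50 distinct derived MR plans from a very large n-best list — can be sketched as follows; `derive_mr` stands in for the composition of an MR plan from a parse tree, which is not shown.

```python
def distinct_mr_candidates(parses, derive_mr, k=50):
    """Walk a large n-best parse list in order and keep the first parse seen
    for each distinct derived MR plan, stopping after k distinct plans."""
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append((mr, parse))
            if len(kept) == k:
                break
    return kept
```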

Response-based Update vs. Baseline (English)

81

                          Hierarchy         Unigram
Parse F1                  74.81 → 73.32     76.44 → 77.24
Single-sentence           57.22 → 59.65     67.14 → 68.27
Paragraph                 20.17 → 22.62     28.12 → 29.2
(Baseline → Response-based)

Response-based Update vs. Baseline (Chinese-Word)

82

                          Hierarchy         Unigram
Parse F1                  75.53 → 77.26     76.41 → 77.74
Single-sentence           61.03 → 64.12     63.4 → 65.64
Paragraph                 19.08 → 21.29     23.12 → 23.74
(Baseline → Response-based)

Response-based Update vs. Baseline (Chinese-Character)

83

                          Hierarchy         Unigram
Parse F1                  73.05 → 76.26     77.55 → 79.76
Single-sentence           55.61 → 64.08     62.85 → 65.5
Paragraph                 12.74 → 22.25     23.33 → 25.35
(Baseline → Response-based)

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
• It optimizes the model directly for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

                          Hierarchy         Unigram
Parse F1                  73.32 → 73.43     77.24 → 77.81
Single-sentence           59.65 → 62.81     68.27 → 68.93
Paragraph                 22.62 → 26.57     29.2 → 29.1
(Single → Multi)

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

                          Hierarchy         Unigram
Parse F1                  77.26 → 78.8      77.74 → 78.11
Single-sentence           64.12 → 64.15     65.64 → 66.27
Paragraph                 21.29 → 21.55     23.74 → 25.95
(Single → Multi)

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

                          Hierarchy         Unigram
Parse F1                  76.26 → 79.44     79.76 → 79.94
Single-sentence           64.08 → 64.08     65.5 → 66.84
Paragraph                 22.25 → 22.58     25.35 → 27.16
(Single → Multi)

Response-based Update with Multiple vs. Single Parses

• Using multiple parses generally improves performance
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation for large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and does not scale, because training data must be manually annotated.

• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain.

• Our proposed models provide a general, fully probabilistic framework for learning NL–MR correspondences under ambiguous supervision.

• Discriminative reranking is feasible and effective with only weak feedback from the perceptual environment.

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 54: Grounded Language Learning Models for Ambiguous  Supervision

54

Turn

LEFTfrontBLUEHALL

frontSOFA

steps2

atSOFA

Verify Travel Verify Turn

RIGHT

Turn

LEFT atSOFA

Travel Verify Turn

Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier

(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)

bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version

55

Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7

Paragraph Single sentenceTake the wood path towards the easel

At the easel go left and then take a right on the the blue path at the corner

Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward

Forward Turn left Forward Turn right

Turn

Data Statistics

56

Paragraph Single-Sentence

Instructions 706 3236

Avg sentences 50 (plusmn28) 10 (plusmn0)

Avg actions 104 (plusmn57) 21 (plusmn24)

Avg words sent

English 376 (plusmn211) 78 (plusmn51)

Chinese-Word 316 (plusmn181) 69 (plusmn49)

Chinese-Character 489 (plusmn283) 106 (plusmn73)

Vo-cabu-lary

English 660 629

Chinese-Word 661 508

Chinese-Character 448 328

Evaluationsbull Leave-one-map-out approach

ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy

bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy

selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)

ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data

57

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluations

• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)
• Compared with two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Take the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
  Many parse trees differ insignificantly, leading to the same derived MR plans
  Generate sufficiently large 1,000,000-best parse trees from the baseline model

80

Response-based Update vs. Baseline (English)

81

(Baseline → Response-based)
Parse F1: Hierarchy 74.81 → 73.32, Unigram 76.44 → 77.24
Single-sentence: Hierarchy 57.22 → 59.65, Unigram 67.14 → 68.27
Paragraph: Hierarchy 20.17 → 22.62, Unigram 28.12 → 29.2

Response-based Update vs. Baseline (Chinese-Word)

82

(Baseline → Response-based)
Parse F1: Hierarchy 75.53 → 77.26, Unigram 76.41 → 77.74
Single-sentence: Hierarchy 61.03 → 64.12, Unigram 63.4 → 65.64
Paragraph: Hierarchy 19.08 → 21.29, Unigram 23.12 → 23.74

Response-based Update vs. Baseline (Chinese-Character)

83

(Baseline → Response-based)
Parse F1: Hierarchy 73.05 → 76.26, Unigram 77.55 → 79.76
Single-sentence: Hierarchy 55.61 → 64.08, Unigram 62.85 → 65.5
Paragraph: Hierarchy 12.74 → 22.25, Unigram 23.33 → 25.35

Response-based Update vs. Baseline

• vs. Baseline
– The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

85

(Single → Multi)
Parse F1: Hierarchy 73.32 → 73.43, Unigram 77.24 → 77.81
Single-sentence: Hierarchy 59.65 → 62.81, Unigram 68.27 → 68.93
Paragraph: Hierarchy 22.62 → 26.57, Unigram 29.2 → 29.1

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

(Single → Multi)
Parse F1: Hierarchy 77.26 → 78.8, Unigram 77.74 → 78.11
Single-sentence: Hierarchy 64.12 → 64.15, Unigram 65.64 → 66.27
Paragraph: Hierarchy 21.29 → 21.55, Unigram 23.74 → 25.95

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

(Single → Multi)
Parse F1: Hierarchy 76.26 → 79.44, Unigram 79.76 → 79.94
Single-sentence: Hierarchy 64.08 → 64.08, Unigram 65.5 → 66.84
Paragraph: Hierarchy 22.25 → 22.58, Unigram 25.35 → 27.16

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
– A single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but still capture the gist of the preferred actions
– A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure
• Large-scale data
– Data collection; model adaptation to large-scale settings
• Machine translation
– Application to summarized translation
• Real perceptual data
– Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You


Data

• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
– Word-segmented version by the Stanford Chinese Word Segmenter
– Character-segmented version

55

Example (Paragraph vs. Single sentence):
Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."
Single sentences: "Take the wood path towards the easel." / "At the easel, go left and then take a right on the blue path at the corner."
Action sequences:
Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward
Forward, Turn left, Forward, Turn right
Turn

Data Statistics

56

                        Paragraph       Single-Sentence
# Instructions          706             3236
Avg. # sentences        5.0 (±2.8)      1.0 (±0)
Avg. # actions          10.4 (±5.7)     2.1 (±2.4)
Avg. # words / sent.
  English               37.6 (±21.1)    7.8 (±5.1)
  Chinese-Word          31.6 (±18.1)    6.9 (±4.9)
  Chinese-Character     48.9 (±28.3)    10.6 (±7.3)
Vocabulary
  English               660             629
  Chinese-Word          661             508
  Chinese-Character     448             328

Evaluations

• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
– Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
  Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
  Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
– Semantic parser: KRISP (Kate and Mooney, 2006), trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58
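A minimal stand-in for this metric, assuming parses are compared as sets of MR components so that partially correct parses still earn credit (the thesis's actual matching procedure may differ):

```python
def prf1(gold, predicted):
    """Precision/recall/F1 over sets of MR components; a simplified
    stand-in for partial parse accuracy (credit for matching parts)."""
    tp = len(gold & predicted)          # components shared by both parses
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {"Turn(LEFT)", "Travel(steps: 2)", "Verify(at: SOFA)"}
pred = {"Turn(LEFT)", "Travel(steps: 2)", "Turn(RIGHT)"}
print(prf1(gold, pred))  # partial credit despite the wrong final turn
```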

Parse Accuracy (English)

(Precision / Recall / F1)
Chen & Mooney (2011): 90.16 / 55.41 / 68.59
Chen (2012): 88.36 / 57.03 / 69.31
Hierarchy Generation PCFG Model: 87.58 / 65.41 / 74.81
Unigram Generation PCFG Model: 86.1 / 68.79 / 76.44

59

Parse Accuracy (Chinese-Word)

(Precision / Recall / F1)
Chen (2012): 88.87 / 58.76 / 70.74
Hierarchy Generation PCFG Model: 80.56 / 71.14 / 75.53
Unigram Generation PCFG Model: 79.45 / 73.66 / 76.41

60

Parse Accuracy (Chinese-Character)

(Precision / Recall / F1)
Chen (2012): 92.48 / 56.47 / 70.01
Hierarchy Generation PCFG Model: 79.77 / 67.38 / 73.05
Unigram Generation PCFG Model: 79.73 / 75.52 / 77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
– Also considers facing direction in single-sentence
– Paragraph execution is affected by even one failed single-sentence execution

62

End-to-End Execution Evaluations (English)

(Single-Sentence / Paragraph)
Chen & Mooney (2011): 54.4 / 16.18
Chen (2012): 57.28 / 19.18
Hierarchy Generation PCFG Model: 57.22 / 20.17
Unigram Generation PCFG Model: 67.14 / 28.12

63

End-to-End Execution Evaluations (Chinese-Word)

(Single-Sentence / Paragraph)
Chen (2012): 58.7 / 20.13
Hierarchy Generation PCFG Model: 61.03 / 19.08
Unigram Generation PCFG Model: 63.4 / 23.12

64

End-to-End Execution Evaluations (Chinese-Character)

(Single-Sentence / Paragraph)
Chen (2012): 57.27 / 16.73
Hierarchy Generation PCFG Model: 55.61 / 12.74
Unigram Generation PCFG Model: 62.85 / 23.33

65

Discussion

• Better recall in parse accuracy
– Our probabilistic model uses useful but low-score lexemes as well → more coverage
– Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to training data
– Complexities: LHG and k-permutation rules
– Particularly weak in the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
– Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. (2011)
– Overcomes intractability in complex MRLs
– Learns from more general, complex ambiguity
– Novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

67

Data                  Hierarchy Generation PCFG Model   Unigram Generation PCFG Model
                      |Grammar|   Time (hrs)            |Grammar|   Time (hrs)
English               20451       17.26                 16357       8.78
Chinese (Word)        21636       15.99                 15459       8.05
Chinese (Character)   19792       18.64                 13514       12.58

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
– Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
– Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
– Part-of-speech tagging (Collins, EMNLP 2002)
– Semantic role labeling (Toutanova et al., ACL 2005)
– Named entity recognition (Collins, ACL 2002)
– Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
– Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
– The trained model outputs the best result with max probability

Testing Example → Trained Generative Model → 1-best candidate with maximum probability (Candidate 1)

70

Discriminative Reranking

• Can we do better?
– A secondary discriminative model picks the best out of the n-best candidates from the baseline model

Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output

71

How can we apply discriminative reranking?

• It is impossible to apply standard discriminative reranking to grounded language learning
– Lack of a single gold-standard reference for each training example
– Instead, training provides only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
– Evaluate candidate formal MRs by executing them in simulated worlds (as used in evaluating the final end-task plan execution)
– A weak indication of whether a candidate is good/bad
– Multiple candidate parses for each parameter update, since the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n)
Feature vectors: a1, a2, a3, a4, …, an; gold-standard reference vector: ag
Perceptron scores: -0.16, 1.21, -1.09, 1.46, 0.59
Perceptron picks the best prediction; update by the feature-vector difference ag - a4
(Gold-standard reference: not available for our generative models)

73
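The update on this slide can be sketched as a standard averaged perceptron over n-best candidates. This is a minimal illustration, assuming candidates are represented as feature sets and a gold index is known, which is exactly what the slide notes our generative models lack:

```python
def score(w, feats):
    """Dot product of a sparse weight vector with a binary feature set."""
    return sum(w.get(f, 0.0) for f in feats)

def train_averaged_perceptron(examples, epochs=5):
    """examples: list of (candidates, gold) pairs, where candidates is a
    list of feature sets (one per n-best parse) and gold is the index of
    the reference parse.  Returns the averaged weight vector."""
    w, total, steps = {}, {}, 0
    for _ in range(epochs):
        for candidates, gold in examples:
            steps += 1
            pred = max(range(len(candidates)),
                       key=lambda i: score(w, candidates[i]))
            if pred != gold:  # update only on a wrong prediction
                for f in candidates[gold]:
                    w[f] = w.get(f, 0.0) + 1.0
                for f in candidates[pred]:
                    w[f] = w.get(f, 0.0) - 1.0
            for f, v in w.items():  # accumulate for averaging
                total[f] = total.get(f, 0.0) + v
    return {f: v / steps for f, v in total.items()}
```

Averaging the weight vectors over all update steps, rather than keeping only the final vector, is the regularization that makes this variant robust on small n-best lists.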

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
– The most preferred one in terms of plan execution
– Evaluate composed MR plans from candidate parses
– The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
– Record the execution success rate
  Whether each candidate MR reaches the intended destination
  MARCO is nondeterministic: average over 10 trials
– Prefer the candidate with the best execution success rate during training

74

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR1 … MRn) → MARCO execution module
Execution success rates: 0.6, 0.4, 0.0, 0.9, 0.2
Perceptron scores: 1.79, 0.21, -1.09, 1.46, 0.59
The candidate with the highest execution success rate (0.9) serves as the pseudo-gold reference; the perceptron updates by the feature-vector difference from its best prediction

75
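A sketch of this response-based step, assuming each candidate comes with a feature set and a MARCO-style execution success rate (the sparse-vector representation and names are illustrative, not the thesis code):

```python
def score(w, feats):
    """Dot product of a sparse weight vector with a binary feature set."""
    return sum(w.get(f, 0.0) for f in feats)

def response_based_update(w, candidates, exec_rates):
    """One perceptron step where the pseudo-gold reference is the candidate
    with the highest execution success rate, standing in for the
    unavailable gold-standard parse."""
    pred = max(range(len(candidates)), key=lambda i: score(w, candidates[i]))
    pseudo_gold = max(range(len(candidates)), key=lambda i: exec_rates[i])
    if pred != pseudo_gold:
        for f in candidates[pseudo_gold]:   # reward the pseudo-gold features
            w[f] = w.get(f, 0.0) + 1.0
        for f in candidates[pred]:          # penalize the wrong prediction
            w[f] = w.get(f, 0.0) - 1.0
    return w

# Success rates as on the slide: the fourth candidate (0.9) is pseudo-gold.
w = response_based_update({}, [{"a"}, {"b"}, {"c"}, {"d"}, {"e"}],
                          [0.6, 0.4, 0.0, 0.9, 0.2])
```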

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
– Multiple parses may have the same maximum execution success rate
– "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
  MR plans are underspecified, or have ignorable details attached
  Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
– Use candidates with higher execution success rates than the currently best-predicted candidate
– Update with the feature-vector difference, weighted by the difference between execution success rates

76
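The multiple-parse variant can be sketched by updating toward every candidate that out-executes the current prediction, scaling each update by the gap in success rates. This is a hedged illustration of the idea, not the exact thesis update rule:

```python
def score(w, feats):
    """Dot product of a sparse weight vector with a binary feature set."""
    return sum(w.get(f, 0.0) for f in feats)

def multi_parse_update(w, candidates, exec_rates):
    """Update toward all candidates whose execution success rate beats the
    currently predicted parse, weighting each feature-vector difference by
    the gap in success rates (so marginally better parses nudge the
    weights gently, clearly better ones pull harder)."""
    pred = max(range(len(candidates)), key=lambda i: score(w, candidates[i]))
    for i, feats in enumerate(candidates):
        gap = exec_rates[i] - exec_rates[pred]
        if gap > 0:  # candidate i executes better than the prediction
            for f in feats:
                w[f] = w.get(f, 0.0) + gap
            for f in candidates[pred]:
                w[f] = w.get(f, 0.0) - gap
    return w
```

Compared with the single pseudo-gold step, every preferable parse contributes here, which is why the weak execution signal goes further.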


desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 57: Grounded Language Learning Models for Ambiguous  Supervision

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – Ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms
    • Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    • Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser KRISP (Kate and Mooney, 2006) trained on the resulting supervised data

57

Parse Accuracy

• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy

58

Parse Accuracy (English)

                                 Precision  Recall  F1
Chen & Mooney (2011)               90.16     55.41  68.59
Chen (2012)                        88.36     57.03  69.31
Hierarchy Generation PCFG Model    87.58     65.41  74.81
Unigram Generation PCFG Model      86.1      68.79  76.44

59

Parse Accuracy (Chinese-Word)

                                 Precision  Recall  F1
Chen (2012)                        88.87     58.76  70.74
Hierarchy Generation PCFG Model    80.56     71.14  75.53
Unigram Generation PCFG Model      79.45     73.66  76.41

60

Parse Accuracy (Chinese-Character)

                                 Precision  Recall  F1
Chen (2012)                        92.48     56.47  70.01
Hierarchy Generation PCFG Model    79.77     67.38  73.05
Unigram Generation PCFG Model      79.73     75.52  77.55

61

End-to-End Execution Evaluations

• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered in single-sentence evaluation
  – A paragraph execution fails if even one of its single-sentence executions goes wrong

62

End-to-End Execution Evaluations (English)

                                 Single-Sentence  Paragraph
Chen & Mooney (2011)                  54.4          16.18
Chen (2012)                           57.28         19.18
Hierarchy Generation PCFG Model       57.22         20.17
Unigram Generation PCFG Model         67.14         28.12

63

End-to-End Execution Evaluations (Chinese-Word)

                                 Single-Sentence  Paragraph
Chen (2012)                           58.7          20.13
Hierarchy Generation PCFG Model       61.03         19.08
Unigram Generation PCFG Model         63.4          23.12

64

End-to-End Execution Evaluations (Chinese-Character)

                                 Single-Sentence  Paragraph
Chen (2012)                           57.27         16.73
Hierarchy Generation PCFG Model       55.61         12.74
Unigram Generation PCFG Model         62.85         23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak in the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Handles novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                       Hierarchy Generation      Unigram Generation
                           PCFG Model                PCFG Model
Data                   |Grammar|  Time (hrs)     |Grammar|  Time (hrs)
English                  20,451     17.26          16,357      8.78
Chinese (Word)           21,636     15.99          15,459      8.05
Chinese (Character)      19,792     18.64          13,514     12.58

67

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – Trained model outputs the best result with max probability

[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]

70

Discriminative Reranking

• Can we do better?
  – Secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]

71
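The two-stage pipeline on this slide can be sketched in a few lines. This is a minimal illustration, not the thesis system: `gen` and `score` are hypothetical stand-ins for the baseline generative model and the discriminative scorer.

```python
# Minimal sketch of n-best reranking: a baseline model proposes scored
# candidates, a secondary discriminative scorer picks the final output.
# Both gen and score are hypothetical stand-ins for the real models.

def rerank(example, gen, score, n=50):
    """gen: example -> list of (candidate, baseline_prob); score: candidate -> float."""
    # Keep the n best candidates by baseline probability (the GEN step).
    nbest = sorted(gen(example), key=lambda cp: cp[1], reverse=True)[:n]
    # Let the discriminative model pick the best of the n-best.
    return max(nbest, key=lambda cp: score(cp[0]))[0]
```

Note that the reranker can only help within the n-best list; if the baseline never proposes the correct candidate, reranking cannot recover it.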

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    • Response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• Parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a₁, a₂, a₃, a₄, …, aₙ and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59; the weights are updated by the feature-vector difference a_g − a₄ between the gold-standard reference (feature vector a_g) and the best prediction (Candidate 4). For our generative models, such a gold-standard reference is not available.]

73
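A bare-bones sketch of the perceptron update assumed on this slide, with sparse feature vectors as dicts. This is an illustrative reconstruction, not the thesis code.

```python
# Illustrative perceptron reranker step (Collins, 2000): when the
# highest-scoring candidate differs from the reference, add the
# feature-vector difference (reference - prediction) to the weights.

def score(weights, feats):
    """Sparse dot product between weight vector and feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def perceptron_step(weights, candidates, gold_idx):
    """candidates: list of sparse feature dicts; gold_idx: reference index."""
    pred = max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))
    if pred != gold_idx:  # update only on a wrong prediction
        for f, v in candidates[gold_idx].items():
            weights[f] = weights.get(f, 0.0) + v
        for f, v in candidates[pred].items():
            weights[f] = weights.get(f, 0.0) - v
    return weights
```

In the full *averaged* perceptron, the weight vectors after every step are averaged at the end of training, which reduces overfitting to late examples.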

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
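The selection step above can be sketched as follows; `execute_plan` is a hypothetical stand-in for one MARCO trial that returns 1.0 when the plan reaches the destination and 0.0 otherwise.

```python
# Sketch of pseudo-gold selection: average each candidate plan's success
# over repeated nondeterministic executions, then keep the best one.
# execute_plan is an assumed stand-in for a single MARCO trial.

def success_rate(plan, execute_plan, trials=10):
    """Average success over `trials` executions of one MR plan."""
    return sum(execute_plan(plan) for _ in range(trials)) / trials

def pick_pseudo_gold(plans, execute_plan, trials=10):
    """Return (index, rate) of the plan with the highest success rate."""
    rates = [success_rate(p, execute_plan, trials) for p in plans]
    best = max(range(len(plans)), key=lambda i: rates[i])
    return best, rates[best]
```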

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR₁ … MRₙ) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with the highest rate (0.9) becomes the pseudo-gold reference, and the perceptron (scores 1.79, 0.21, −1.09, 1.46, 0.59) is updated with the feature-vector difference between it and the best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
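The weighted multi-parse update can be sketched as below; feature vectors are dicts, and all names are illustrative rather than the thesis implementation.

```python
# Sketch of the multiple-parse update: every candidate whose execution
# success rate beats the currently predicted parse contributes a feature
# difference, scaled by the success-rate gap.

def multi_parse_update(weights, candidates, rates, pred_idx):
    """candidates: sparse feature dicts; rates: execution success rates."""
    base = rates[pred_idx]
    for i, feats in enumerate(candidates):
        gap = rates[i] - base
        if gap <= 0:  # only better-executing parses give feedback
            continue
        for f, v in feats.items():
            weights[f] = weights.get(f, 0.0) + gap * v
        for f, v in candidates[pred_idx].items():
            weights[f] = weights.get(f, 0.0) - gap * v
    return weights
```

Scaling by the gap means a candidate that barely beats the prediction nudges the weights only slightly, while a much better one moves them further.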

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): n-best candidates → derived MRs (MR₁ … MRₙ) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59; one better-executing candidate contributes its feature-vector difference]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): same pipeline and numbers as the previous slide; a second better-executing candidate contributes its feature-vector difference]

78

Features

• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner."

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)        L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)        L5: Travel(), Verify(at: SOFA)        L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, find) = 1

79
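Rule-indicator features of this kind can be collected with a short recursive walk over a parse tree. The nested-tuple tree encoding below is an assumed illustration, not the representation used in the thesis.

```python
# Minimal sketch of binary grammar-rule indicator features: every
# (parent -> children) production seen in a parse tree becomes a 0/1
# feature. Trees are encoded as nested tuples: (label, child, child, ...).

def rule_features(tree):
    """Collect binary indicators for each production in a nested-tuple tree."""
    feats = {}

    def walk(node):
        if not isinstance(node, tuple):  # terminal symbol: nothing to emit
            return
        label, children = node[0], node[1:]
        child_labels = tuple(c[0] if isinstance(c, tuple) else c
                             for c in children)
        feats[(label, child_labels)] = 1  # binary indicator, not a count
        for c in children:
            walk(c)

    walk(tree)
    return feats
```

For example, a tree where L1 expands to L3, L3 to L5, and L5 to the word "find" yields the three indicators f(L1 → L3), f(L3 → L5), and f(L5 → find).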

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans (and corresponding parses) out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse lists from the baseline model

80

Response-based Update vs. Baseline (English)

Parse F1            Hierarchy  Unigram
Baseline              74.81     76.44
Response-based        73.32     77.24

Single-sentence     Hierarchy  Unigram
Baseline              57.22     67.14
Response-based        59.65     68.27

Paragraph           Hierarchy  Unigram
Baseline              20.17     28.12
Response-based        22.62     29.2

81

Response-based Update vs. Baseline (Chinese-Word)

Parse F1            Hierarchy  Unigram
Baseline              75.53     76.41
Response-based        77.26     77.74

Single-sentence     Hierarchy  Unigram
Baseline              61.03     63.4
Response-based        64.12     65.64

Paragraph           Hierarchy  Unigram
Baseline              19.08     23.12
Response-based        21.29     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

Parse F1            Hierarchy  Unigram
Baseline              73.05     77.55
Response-based        76.26     79.76

Single-sentence     Hierarchy  Unigram
Baseline              55.61     62.85
Response-based        64.08     65.5

Paragraph           Hierarchy  Unigram
Baseline              12.74     23.33
Response-based        22.25     25.35

83

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
  – It directly optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

Parse F1            Hierarchy  Unigram
Single                73.32     77.24
Multi                 73.43     77.81

Single-sentence     Hierarchy  Unigram
Single                59.65     68.27
Multi                 62.81     68.93

Paragraph           Hierarchy  Unigram
Single                22.62     29.2
Multi                 26.57     29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

Parse F1            Hierarchy  Unigram
Single                77.26     77.74
Multi                 78.8      78.11

Single-sentence     Hierarchy  Unigram
Single                64.12     65.64
Multi                 64.15     66.27

Paragraph           Hierarchy  Unigram
Single                21.29     23.74
Multi                 21.55     25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

Parse F1            Hierarchy  Unigram
Single                76.26     79.76
Multi                 79.44     79.94

Single-sentence     Hierarchy  Unigram
Single                64.08     65.5
Multi                 64.08     66.84

Paragraph           Hierarchy  Unigram
Single                22.25     25.35
Multi                 22.58     27.16

87

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, while still capturing the gist of the preferred actions
  – A variety of preferable parses improves the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework: a full probabilistic model for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 58: Grounded Language Learning Models for Ambiguous  Supervision

Parse Accuracy

bull Evaluate how well the learned semantic parsers can parse novel sentences in test data

bull Metric partial parse accuracy

58

Parse Accuracy (English)

Precision Recall F1

9016

5541

6859

8836

5703

6931

8758

6541

7481

861

6879

7644

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

59

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability
  – [Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability]

70

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
  – [Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]

71
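As a concrete sketch of the reranking step just described: the baseline model's GEN supplies n-best candidates with feature vectors, and the discriminative model returns the candidate with the highest weighted feature score. All names and feature strings below are illustrative, not from the thesis implementation.

```python
def rerank(candidates, weights):
    """Return the candidate whose feature vector scores highest under `weights`.

    candidates: list of (label, feature_dict) pairs from the baseline model's GEN.
    weights: dict mapping feature name -> learned weight.
    """
    def score(features):
        # dot product between the weight vector and a sparse feature vector
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return max(candidates, key=lambda cand: score(cand[1]))

# Hypothetical 2-best list and weight vector:
n_best = [
    ("parse_1", {"rule:L1->L3": 1.0, "lex:L5,find": 1.0}),
    ("parse_2", {"rule:L1->L2": 1.0}),
]
w = {"rule:L1->L3": 0.5, "lex:L5,find": 0.3, "rule:L1->L2": 0.4}
best = rerank(n_best, w)  # parse_1 scores 0.8 vs 0.4 for parse_2
```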

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides only weak supervision from the surrounding perceptual context (landmarks, plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (as used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses per parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate
  – [Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates with feature vectors a_1 … a_n and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59; the weight vector is updated with the feature-vector difference a_g − a_4 between the gold-standard reference a_g and the best prediction]
  – For our generative models, a gold-standard reference is not available

73
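The averaged-perceptron update described on this slide can be sketched as follows: a minimal Collins-style reranker with dict-based sparse feature vectors, assuming one (pseudo-)gold reference per training example. Names are mine, not the thesis code.

```python
def averaged_perceptron(train, n_epochs=5):
    """Averaged perceptron for reranking (after Collins).

    train: list of (gold_features, [candidate_features, ...]) pairs,
           where gold_features also appears among the candidates.
    Returns the averaged weight vector as a dict.
    """
    w, w_sum, steps = {}, {}, 0
    def score(feats):
        return sum(w.get(k, 0.0) * v for k, v in feats.items())
    for _ in range(n_epochs):
        for gold, candidates in train:
            pred = max(candidates, key=score)
            if pred is not gold:
                # promote gold features, demote the wrongly predicted ones
                for k, v in gold.items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in pred.items():
                    w[k] = w.get(k, 0.0) - v
            steps += 1
            for k, v in w.items():          # accumulate for averaging
                w_sum[k] = w_sum.get(k, 0.0) + v
    return {k: v / steps for k, v in w_sum.items()}
```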

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
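A minimal sketch of this pseudo-gold selection, with a stand-in `executor` callable in place of the real MARCO module (which is nondeterministic, hence the averaging over trials):

```python
def execution_success_rate(executor, plan, trials=10):
    """Estimate a candidate MR plan's success rate by running a possibly
    nondeterministic execution module several times and averaging.
    `executor(plan)` is a stand-in returning True when the plan reaches
    the intended destination."""
    return sum(executor(plan) for _ in range(trials)) / trials

def pick_pseudo_gold(candidates, executor, trials=10):
    """Return the candidate MR plan with the best execution success rate;
    it serves as the pseudo-gold reference for the perceptron update."""
    return max(candidates,
               key=lambda plan: execution_success_rate(executor, plan, trials))
```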

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results
  – [Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR_1 … MR_n) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.79, 0.21, −1.09, 1.46, 0.59; the perceptron is updated with the feature-vector difference between the pseudo-gold reference (rate 0.9) and the best prediction]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
  – MR plans may be underspecified or have ignorable details attached; sometimes inaccurate, but they contain correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
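One plausible reading of this weighted multi-parse update, sketched with dict feature vectors; the exact update form in the thesis may differ, and all names are illustrative.

```python
def multi_parse_update(w, candidates, pred_idx):
    """Perceptron-style update using multiple parses.

    Every candidate with a higher execution success rate than the currently
    predicted parse contributes a feature-vector difference, weighted by the
    success-rate gap.

    w: weight dict (mutated and returned).
    candidates: list of (feature_dict, success_rate) pairs.
    pred_idx: index of the model's current best prediction.
    """
    pred_feats, pred_rate = candidates[pred_idx]
    for feats, rate in candidates:
        gap = rate - pred_rate
        if gap > 0:  # only parses that executed better than the prediction
            for k in set(feats) | set(pred_feats):
                diff = feats.get(k, 0.0) - pred_feats.get(k, 0.0)
                w[k] = w.get(k, 0.0) + gap * diff
    return w
```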

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
  – [Diagram, update (1): n-best candidates → derived MRs (MR_1 … MR_n) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59; a feature-vector difference is taken against the first candidate whose success rate exceeds the prediction's]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
  – [Diagram, update (2): the same pipeline, with a feature-vector difference taken against the next candidate whose success rate exceeds the prediction's]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)
L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at: SOFA)
L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
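The rule-composition indicators above can be sketched as a small extraction routine over parse trees encoded as nested tuples; the tree encoding and feature-name format here are illustrative assumptions, not the thesis representation.

```python
def rule_indicator_features(tree):
    """Extract binary rule-composition indicator features from a parse tree.

    The tree is a nested tuple (label, child, child, ...), e.g.
    ("L1", ("L3", ("L5", "find"), ("L6",))).  Each internal node yields an
    indicator like "L1->L3" set to 1, mirroring features such as f(L1 -> L3) = 1.
    """
    feats = {}
    def walk(node):
        if isinstance(node, tuple) and len(node) > 1:
            head, children = node[0], node[1:]
            # child labels: a tuple's first element, or the terminal itself
            labels = [c[0] if isinstance(c, tuple) else c for c in children]
            feats["%s->%s" % (head, " ".join(labels))] = 1
            for c in children:
                walk(c)
    walk(tree)
    return feats
```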

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50-best distinct composed MR plans (and corresponding parses) out of the 1,000,000-best parses: many parse trees differ insignificantly, leading to the same derived MR plans, so a sufficiently large 1,000,000-best parse list is generated from the baseline model

80
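The 50-best-distinct-MR filtering can be sketched as a simple scan over the score-sorted parse list, keeping a parse only when its derived MR plan has not been seen before; `derive_mr` is a stand-in for the real plan-composition step.

```python
def distinct_mr_parses(parses, derive_mr, k=50):
    """Keep the top-k parses with distinct derived MR plans.

    parses: parse list assumed sorted by model score (best first), taken from
            a much larger n-best list.
    derive_mr: callable mapping a parse to its composed (hashable) MR plan.
    """
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:       # skip parses that collapse to a seen MR plan
            seen.add(mr)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept
```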

Response-based Update vs. Baseline (English)

Parse F1           Hierarchy   Unigram
Baseline           74.81       76.44
Response-based     73.32       77.24

Single-sentence    Hierarchy   Unigram
Baseline           57.22       67.14
Response-based     59.65       68.27

Paragraph          Hierarchy   Unigram
Baseline           20.17       28.12
Response-based     22.62       29.2

81

Response-based Update vs. Baseline (Chinese-Word)

Parse F1           Hierarchy   Unigram
Baseline           75.53       76.41
Response-based     77.26       77.74

Single-sentence    Hierarchy   Unigram
Baseline           61.03       63.4
Response-based     64.12       65.64

Paragraph          Hierarchy   Unigram
Baseline           19.08       23.12
Response-based     21.29       23.74

82

Response-based Update vs. Baseline (Chinese-Character)

Parse F1           Hierarchy   Unigram
Baseline           73.05       77.55
Response-based     76.26       79.76

Single-sentence    Hierarchy   Unigram
Baseline           55.61       62.85
Response-based     64.08       65.5

Paragraph          Hierarchy   Unigram
Baseline           12.74       23.33
Response-based     22.25       25.35

83

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

Parse F1           Hierarchy   Unigram
Single             73.32       77.24
Multi              73.43       77.81

Single-sentence    Hierarchy   Unigram
Single             59.65       68.27
Multi              62.81       68.93

Paragraph          Hierarchy   Unigram
Single             22.62       29.2
Multi              26.57       29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

Parse F1           Hierarchy   Unigram
Single             77.26       77.74
Multi              78.8        78.11

Single-sentence    Hierarchy   Unigram
Single             64.12       65.64
Multi              64.15       66.27

Paragraph          Hierarchy   Unigram
Single             21.29       23.74
Multi              21.55       25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

Parse F1           Hierarchy   Unigram
Single             76.26       79.76
Multi              79.44       79.94

Single-sentence    Hierarchy   Unigram
Single             64.08       65.5
Multi              64.08       66.84

Paragraph          Hierarchy   Unigram
Single             22.25       25.35
Multi              22.58       27.16

87

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and a training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

Parse Accuracy (English)

Model                              Precision   Recall   F1
Chen & Mooney (2011)               90.16       55.41    68.59
Chen (2012)                        88.36       57.03    69.31
Hierarchy Generation PCFG Model    87.58       65.41    74.81
Unigram Generation PCFG Model      86.1        68.79    76.44

59

Parse Accuracy (Chinese-Word)

Model                              Precision   Recall   F1
Chen (2012)                        88.87       58.76    70.74
Hierarchy Generation PCFG Model    80.56       71.14    75.53
Unigram Generation PCFG Model      79.45       73.66    76.41

60

Parse Accuracy (Chinese-Character)

Model                              Precision   Recall   F1
Chen (2012)                        92.48       56.47    70.01
Hierarchy Generation PCFG Model    79.77       67.38    73.05
Unigram Generation PCFG Model      79.73       75.52    77.55

61


  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 60: Grounded Language Learning Models for Ambiguous  Supervision

Parse Accuracy (Chinese-Word)

Precision Recall F1

8887

5876

7074

8056

7114

7553

7945

73667641

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

60

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, generalizes better
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                        Hierarchy Generation        Unigram Generation
                        PCFG Model                  PCFG Model
Data                    |Grammar|    Time (hrs)     |Grammar|    Time (hrs)
English                    20451         17.26         16357          8.78
Chinese (Word)             21636         15.99         15459          8.05
Chinese (Character)        19792         18.64         13514         12.58

67

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks:
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – The trained model outputs the single best result with maximum probability

[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]

71

How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Training instead provides weak supervision from the surrounding perceptual context (landmarks, plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (as used in evaluating the final end-task plan execution)
  – A weak indication of whether a candidate is good or bad
  – Multiple candidate parses are used for the parameter update
    (the response signal is weak and distributed over all candidates)

72

Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: a training example passes through the trained baseline generative model (GEN), producing n-best candidates with feature vectors a1 … an and perceptron scores (-0.16, 1.21, -1.09, 1.46, 0.59); when the best prediction (here candidate 4) differs from the gold-standard reference, the weights are updated by the feature-vector difference ag - a4. For our generative models, a gold-standard reference is not available.]

73
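The averaged-perceptron reranker of Collins (2000) can be sketched as follows. This is an illustrative rendering under assumed names, not the thesis implementation; candidates are sparse feature dicts and a gold index is assumed available (the response-based variant replaces it with a pseudo-gold).

```python
def dot(w, feats):
    """Sparse dot product of a weight dict and a feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def train_perceptron(examples, n_epochs=10):
    """examples: list of (candidates, gold_index) pairs, where each
    candidate is a sparse feature dict.  Returns averaged weights."""
    w, w_sum, t = {}, {}, 0
    for _ in range(n_epochs):
        for candidates, gold in examples:
            pred = max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))
            if pred != gold:
                # standard perceptron step: w += f(gold) - f(pred)
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[pred].items():
                    w[f] = w.get(f, 0.0) - v
            t += 1
            for f, v in w.items():  # accumulate weights for averaging
                w_sum[f] = w_sum.get(f, 0.0) + v
    return {f: v / t for f, v in w_sum.items()}
```

At test time the reranker simply returns the candidate maximizing `dot(w, features)`; averaging the weights over all updates reduces overfitting to late training examples.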

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
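A minimal sketch of this selection step, with a stub standing in for the MARCO executor (the function names here are illustrative assumptions, not the thesis code):

```python
def execution_success_rate(mr, execute, trials=10):
    """Average success of a (possibly nondeterministic) executor over trials."""
    return sum(execute(mr) for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Return (index of the candidate MR with the best rate, all rates)."""
    rates = [execution_success_rate(mr, execute, trials) for mr in candidate_mrs]
    return max(range(len(rates)), key=lambda i: rates[i]), rates
```

The chosen index then plays the role of the gold reference in the perceptron update.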

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates' derived MRs MR1 … MRn are run through the MARCO execution module, giving execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the best rate (0.9) becomes the pseudo-gold reference, and the perceptron (scores 1.79, 0.21, -1.09, 1.46, 0.59) is updated by the feature-vector difference from its best prediction.]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still mean a correct plan, given the indirect supervision of human follower actions
    (MR plans may be underspecified or carry ignorable details; some are inaccurate yet contain the correct MR components to reach the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
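The multiple-parse update above can be sketched as follows (a hypothetical rendering: every candidate whose success rate beats the predicted candidate's contributes a feature-difference update, scaled by the rate gap):

```python
def multi_parse_update(w, candidates, rates, pred):
    """candidates: list of sparse feature dicts; rates: execution success
    rates; pred: index of the current best-scoring candidate.  Mutates w."""
    for i, feats in enumerate(candidates):
        gain = rates[i] - rates[pred]
        if gain <= 0:
            continue  # only candidates better than the current prediction
        for f, v in feats.items():          # w += gain * f(candidate_i)
            w[f] = w.get(f, 0.0) + gain * v
        for f, v in candidates[pred].items():  # w -= gain * f(prediction)
            w[f] = w.get(f, 0.0) - gain * v
    return w
```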

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): the n-best candidates' derived MRs MR1 … MRn are run through the MARCO execution module, giving execution success rates (0.6, 0.4, 0.0, 0.9, 0.2) against perceptron scores (1.24, 1.83, -1.09, 1.46, 0.59); the first candidate whose success rate exceeds the predicted parse's contributes a weighted feature-vector difference.]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): with the same success rates (0.6, 0.4, 0.0, 0.9, 0.2) and perceptron scores (1.24, 1.83, -1.09, 1.46, 0.59), the next candidate whose success rate exceeds the predicted parse's contributes its weighted feature-vector difference as well.]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)
L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at: SOFA)
L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
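Indicator features of this kind can be sketched as a walk over the parse tree. This is an illustrative toy using nested tuples; the actual feature set covers richer compositions (e.g. grandparent and long-range descendant features) than this sketch shows.

```python
def extract_features(tree, feats=None):
    """tree: (label, child, child, ...), where a child is either a terminal
    string or another such tuple.  Emits one binary rule indicator per node."""
    if feats is None:
        feats = {}
    label, children = tree[0], list(tree[1:])
    names = [c if isinstance(c, str) else c[0] for c in children]
    if names:
        feats["%s -> %s" % (label, " ".join(names))] = 1  # rule indicator
    for c in children:
        if not isinstance(c, str):
            extract_features(c, feats)
    return feats
```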

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and their corresponding parses, out of 1,000,000-best parses
    (many parse trees differ insignificantly and lead to the same derived MR plan, so sufficiently large 1,000,000-best parse lists are generated from the baseline model)

80
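The distinct-plan filtering step can be sketched as follows (helper names are assumptions): parses arrive sorted by decreasing model probability, and only the first parse per derived MR plan is kept until n are collected.

```python
def distinct_mr_nbest(parses, derive_mr, n=50):
    """parses: iterable sorted by decreasing probability;
    derive_mr: maps a parse to its composed MR plan."""
    seen, kept = set(), []
    for p in parses:
        mr = derive_mr(p)
        if mr in seen:
            continue  # same derived plan as a higher-probability parse
        seen.add(mr)
        kept.append(p)
        if len(kept) == n:
            break
    return kept
```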

Response-based Update vs Baseline (English)

                   Parse F1             Single-sentence       Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Baseline               74.81    76.44       57.22    67.14        20.17    28.12
Response-based         73.32    77.24       59.65    68.27        22.62    29.20

81

Response-based Update vs Baseline (Chinese-Word)

                   Parse F1             Single-sentence       Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Baseline               75.53    76.41       61.03    63.40        19.08    23.12
Response-based         77.26    77.74       64.12    65.64        21.29    23.74

82

Response-based Update vs Baseline (Chinese-Character)

                   Parse F1             Single-sentence       Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Baseline               73.05    77.55       55.61    62.85        12.74    23.33
Response-based         76.26    79.76       64.08    65.50        22.25    25.35

83

Response-based Update vs Baseline
• Compared with the baseline, the response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

                   Parse F1             Single-sentence       Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Single                 73.32    77.24       59.65    68.27        22.62    29.20
Multi                  73.43    77.81       62.81    68.93        26.57    29.10

85

Response-based Update with Multiple vs Single Parses (Chinese-Word)

                   Parse F1             Single-sentence       Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Single                 77.26    77.74       64.12    65.64        21.29    23.74
Multi                  78.80    78.11       64.15    66.27        21.55    25.95

86

Response-based Update with Multiple vs Single Parses (Chinese-Character)

                   Parse F1             Single-sentence       Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Single                 76.26    79.76       64.08    65.50        22.25    25.35
Multi                  79.44    79.94       64.08    66.84        22.58    27.16

87

Response-based Update with Multiple vs Single Parses
• Using multiple parses improves performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework: a full probabilistic model for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible, and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 61: Grounded Language Learning Models for Ambiguous  Supervision

Parse Accuracy (Chinese-Character)

Precision Recall F1

9248

5647

7001

7977

6738

7305

7973

75527755

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

61

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 62: Grounded Language Learning Models for Ambiguous  Supervision

End-to-End Execution Evaluations

bull Test how well the formal plan from the output of semantic parser reaches the destination

bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one

single-sentence execution

62

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations (Chinese-Character)

Model                            Single-Sentence  Paragraph
Chen (2012)                      57.27            16.73
Hierarchy Generation PCFG Model  55.61            12.74
Unigram Generation PCFG Model    62.85            23.33

65

Discussion

• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak in the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Handles novel MR parses never seen during training

66

Comparison of Grammar Size and EM Training Time

                     Hierarchy Generation     Unigram Generation
                     PCFG Model               PCFG Model
Data                 |Grammar|  Time (hrs)    |Grammar|  Time (hrs)
English              20451      17.26         16357      8.78
Chinese (Word)       21636      15.99         15459      8.05
Chinese (Character)  19792      18.64         13514      12.58

67

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking

• Effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

• Generative model
  – The trained model outputs the best result with maximum probability
[Diagram: a testing example goes into the trained generative model, which outputs the 1-best candidate with maximum probability]

70

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: a testing example goes into the trained baseline generative model, which generates (GEN) n-best candidates; the trained secondary discriminative model picks the best prediction as the output]

71

How can we apply discriminative reranking?

• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Training instead provides weak supervision from the surrounding perceptual context (landmarks, plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for the parameter update
    (the response signal is weak and distributed over all candidates)

72

Reranking Model: Averaged Perceptron (Collins 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: a training example goes into the trained baseline generative model, which generates (GEN) n-best candidates with feature vectors a_1 … a_n; the perceptron scores each candidate (e.g. -0.16, 1.21, -1.09, 1.46, 0.59), and when the best prediction differs from the gold-standard reference (feature vector a_g), the weights are updated by the feature-vector difference, e.g. a_g − a_4. In our generative models, no gold-standard reference is available.]

73
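The Collins-style averaged perceptron sketched in the diagram can be written down compactly. This is a sketch under assumptions: candidates are sparse feature dicts, `gold` is the index of the reference candidate, and the weight vector is averaged over all updates; it is not the thesis implementation.

```python
from collections import defaultdict

def perceptron_rerank_train(examples, n_epochs=5):
    """Averaged-perceptron reranker training (Collins 2000 style).
    Each example is (candidates, gold), where candidates is a list of
    sparse feature dicts and gold is the reference candidate's index."""
    w = defaultdict(float)       # current weight vector
    w_sum = defaultdict(float)   # running sum of weights for averaging
    steps = 0
    for _ in range(n_epochs):
        for candidates, gold in examples:
            score = lambda f: sum(w[k] * v for k, v in f.items())
            pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
            if pred != gold:
                # Update by the feature-vector difference: w += f(gold) - f(pred)
                for k, v in candidates[gold].items():
                    w[k] += v
                for k, v in candidates[pred].items():
                    w[k] -= v
            for k, v in w.items():   # accumulate for the averaged weights
                w_sum[k] += v
            steps += 1
    return {k: v / steps for k, v in w_sum.items()}
```

Averaging the weights over all training steps is what makes the perceptron robust as a reranker; the final model scores each candidate by a dot product with these averaged weights.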

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic: average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results
[Diagram: the MRs derived from the n-best candidate parses (MR_1 … MR_n) are run through the MARCO execution module, yielding execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the best rate becomes the pseudo-gold reference, and the perceptron weights are updated by the feature-vector difference between it and the best prediction]

75
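The pseudo-gold selection step can be sketched as follows. `execute` here is a hypothetical stand-in for the MARCO execution module's interface (a callable returning whether one nondeterministic run reached the destination), not its real API.

```python
def select_pseudo_gold(candidate_mrs, execute, n_trials=10):
    """Pick the pseudo-gold candidate: the MR plan with the highest
    execution success rate in the (nondeterministic) simulated world.
    Each candidate is executed n_trials times and the successes averaged."""
    def success_rate(mr):
        # execute(mr) returns True when the plan reaches the destination
        return sum(execute(mr) for _ in range(n_trials)) / n_trials
    rates = [success_rate(mr) for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```

During training the candidate at index `best` plays the role of the gold-standard reference in the perceptron update.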

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    (MR plans are underspecified or have ignorable details attached; sometimes inaccurate but containing the correct MR components to reach the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: update (1) — one qualifying candidate (execution success rate 0.6, above the prediction's rate) contributes a feature-vector difference weighted by the rate difference]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: update (2) — another qualifying candidate (execution success rate 0.9) contributes a further weighted feature-vector difference]

78
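The multiple-parse update described above can be sketched as one function. This is a sketch: the exact scaling used in the thesis is only described as "weighted by the difference between execution success rates", so the linear weighting below is an assumed form.

```python
def multi_parse_update(w, candidates, rates, pred):
    """Update weight dict w using every candidate whose execution success
    rate beats the current prediction's rate. Each such candidate pushes
    the weights toward its features, scaled by the rate margin (assumed
    linear weighting). candidates are sparse feature dicts."""
    for i, feats in enumerate(candidates):
        margin = rates[i] - rates[pred]
        if i == pred or margin <= 0:
            continue  # only candidates strictly better than the prediction
        for k, v in feats.items():             # w += margin * f(candidate)
            w[k] = w.get(k, 0.0) + margin * v
        for k, v in candidates[pred].items():  # w -= margin * f(pred)
            w[k] = w.get(k, 0.0) - margin * v
    return w
```

With a single pseudo-gold this reduces to one update; with several better-executing candidates, each contributes feedback proportional to how much better it executes.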

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree
  (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Turn left and find the sofa then turn around the corner

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1    f(L3 → L5 ∨ L1) = 1    f(L3 ⇒ L5 L6) = 1    f(L5, "find") = 1

79
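Extracting such binary rule-composition indicators from a parse tree is straightforward. The nested-tuple tree encoding below (`(label, children...)`, strings as leaves) is a hypothetical representation chosen for the sketch.

```python
def rule_features(tree):
    """Binary indicator features over parent -> children compositions in a
    parse tree, e.g. f(L1 -> L3) = 1. The tree is a (label, children...)
    tuple nesting; plain strings are terminal leaves."""
    feats = {}
    def walk(node):
        if isinstance(node, str):
            return  # terminal leaf: no rule rooted here
        label, children = node[0], node[1:]
        child_labels = " ".join(c if isinstance(c, str) else c[0] for c in children)
        feats[f"{label} -> {child_labels}"] = 1  # binary indicator
        for c in children:
            walk(c)
    walk(tree)
    return feats
```

For a parse like `("L1", ("L3", ("L5", "find"), "L6"))` this yields indicators for `L1 -> L3`, `L3 -> L5 L6`, and the lexical composition `L5 -> find`.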

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50-best distinct composed MR plans, and the corresponding parses, out of 1,000,000-best parses
    (many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model)

80
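The deduplication step above, keeping the top parses with distinct derived MR plans, can be sketched as follows; `derive_mr` is a hypothetical function mapping a parse to its (hashable) composed MR plan.

```python
def distinct_nbest(parses, derive_mr, n=50):
    """Keep the top-n parses with distinct derived MR plans. Many parse
    trees differ insignificantly and yield the same MR, so dedup by the
    derived plan. parses are assumed sorted by model score, best first."""
    seen, kept = set(), []
    for p in parses:
        mr = derive_mr(p)
        if mr not in seen:
            seen.add(mr)
            kept.append(p)
            if len(kept) == n:
                break
    return kept
```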

Response-based Update vs. Baseline (English)

Parse F1         Hierarchy  Unigram
Baseline         74.81      76.44
Response-based   73.32      77.24

Single-sentence  Hierarchy  Unigram
Baseline         57.22      67.14
Response-based   59.65      68.27

Paragraph        Hierarchy  Unigram
Baseline         20.17      28.12
Response-based   22.62      29.2

81

Response-based Update vs. Baseline (Chinese-Word)

Parse F1         Hierarchy  Unigram
Baseline         75.53      76.41
Response-based   77.26      77.74

Single-sentence  Hierarchy  Unigram
Baseline         61.03      63.4
Response-based   64.12      65.64

Paragraph        Hierarchy  Unigram
Baseline         19.08      23.12
Response-based   21.29      23.74

82

Response-based Update vs. Baseline (Chinese-Character)

Parse F1         Hierarchy  Unigram
Baseline         73.05      77.55
Response-based   76.26      79.76

Single-sentence  Hierarchy  Unigram
Baseline         55.61      62.85
Response-based   64.08      65.5

Paragraph        Hierarchy  Unigram
Baseline         12.74      23.33
Response-based   22.25      25.35

83

Response-based Update vs. Baseline

• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

Parse F1         Hierarchy  Unigram
Single           73.32      77.24
Multiple         73.43      77.81

Single-sentence  Hierarchy  Unigram
Single           59.65      68.27
Multiple         62.81      68.93

Paragraph        Hierarchy  Unigram
Single           22.62      29.2
Multiple         26.57      29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

Parse F1         Hierarchy  Unigram
Single           77.26      77.74
Multiple         78.8       78.11

Single-sentence  Hierarchy  Unigram
Single           64.12      65.64
Multiple         64.15      66.27

Paragraph        Hierarchy  Unigram
Single           21.29      23.74
Multiple         21.55      25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

Parse F1         Hierarchy  Unigram
Single           76.26      79.76
Multiple         79.44      79.94

Single-sentence  Hierarchy  Unigram
Single           64.08      65.5
Multiple         64.08      66.84

Paragraph        Hierarchy  Unigram
Single           22.25      25.35
Multiple         22.58      27.16

87

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable because of the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and a training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 63: Grounded Language Learning Models for Ambiguous  Supervision

End-to-End Execution Evaluations(English)

Single-Sentence Paragraph

544

1618

5728

1918

5722

2017

6714

2812

Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model

63

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 64: Grounded Language Learning Models for Ambiguous  Supervision

End-to-End Execution Evaluations(Chinese-Word)

Single-Sentence Paragraph

587

2013

6103

1908

634

2312

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

64

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation       Unigram Generation
                      PCFG Model                 PCFG Model
Data                  |Grammar|   Time (hrs)     |Grammar|   Time (hrs)
English               20,451      17.26          16,357      8.78
Chinese (Word)        21,636      15.99          15,459      8.05
Chinese (Character)   19,792      18.64          13,514      12.58

67

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68

Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – Trained model outputs the best result with max probability

[Diagram: a testing example goes through the trained generative model, which returns the single (1-best) candidate with maximum probability.]

70

Discriminative Reranking
• Can we do better?
  – Secondary discriminative model picks the best out of n-best candidates from the baseline model

[Diagram: a testing example goes through the trained baseline generative model (GEN), producing n-best candidates; a trained secondary discriminative model picks the best prediction as the output.]

71
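The reranking step above reduces to scoring each n-best candidate with a learned weight vector over its features and taking the argmax. A minimal sketch, with purely illustrative feature names and weights (not the thesis' actual feature set):

```python
# Rerank n-best candidates by a linear score over sparse feature vectors.
def rerank(candidates, weights):
    """candidates: list of (parse, feature_dict); weights: feature -> float."""
    def score(features):
        return sum(weights.get(f, 0.0) * v for f, v in features.items())
    # Return the candidate parse with the highest discriminative score.
    return max(candidates, key=lambda c: score(c[1]))[0]

nbest = [
    ("parse_A", {"rule:L1->L3": 1.0, "word:find": 1.0}),
    ("parse_B", {"rule:L1->L2": 1.0}),
]
weights = {"rule:L1->L3": 0.5, "word:find": 0.3, "rule:L1->L2": 0.2}
print(rerank(nbest, weights))  # parse_A (score 0.8 vs. 0.2)
```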

How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides only weak supervision from the surrounding perceptual context (landmarks, plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update: the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)
• Parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: the trained baseline generative model (GEN) produces n-best candidates with feature vectors a1 … an and perceptron scores (e.g., -0.16, 1.21, -1.09, 1.46, 0.59); when the best prediction differs from the gold-standard reference (feature vector ag), the weights are updated with the feature-vector difference ag − a4.]

• For our generative models, a gold-standard reference is not available.

73
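The perceptron step can be sketched as follows. This is a simplified illustration (weight averaging over iterations is omitted, and feature names are made up), not the thesis' implementation:

```python
from collections import defaultdict

def perceptron_update(weights, gold_feats, nbest_feats):
    """One perceptron step; nbest_feats is a list of candidate feature dicts."""
    score = lambda f: sum(weights[k] * v for k, v in f.items())
    pred = max(nbest_feats, key=score)       # current best prediction
    if pred != gold_feats:                   # wrong prediction -> update
        for k, v in gold_feats.items():
            weights[k] += v                  # promote gold features
        for k, v in pred.items():
            weights[k] -= v                  # demote predicted features
    return weights

w = defaultdict(float, {"rule:B": 1.0})
gold = {"rule:A": 1.0}
perceptron_update(w, gold, [{"rule:A": 1.0}, {"rule:B": 1.0}])
print(w["rule:A"], w["rule:B"])  # 1.0 0.0
```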

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the Execution Success Rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic; average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
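Selecting the pseudo-gold parse amounts to averaging execution success over repeated trials and keeping the best candidate. A sketch under stated assumptions: `execute` stands in for the real MARCO module, and the toy executor below is deterministic for the sake of the example:

```python
def success_rate(mr, execute, trials=10):
    """Average success over repeated trials (the real executor is nondeterministic)."""
    return sum(1.0 for _ in range(trials) if execute(mr)) / trials

def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    # Candidate with the highest average execution success becomes the reference.
    return max(candidate_mrs, key=lambda mr: success_rate(mr, execute, trials))

# Toy stand-in for plan execution: does this MR reach the destination?
reaches_goal = {"Turn(LEFT), Travel(2)": True, "Turn(RIGHT)": False}
best = pick_pseudo_gold(list(reaches_goal), reaches_goal.get)
print(best)  # Turn(LEFT), Travel(2)
```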

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: MRs derived from the n-best candidates are run by the MARCO execution module; their execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2) pick out the pseudo-gold reference, and the perceptron weights are updated with the feature-vector difference between it and the best prediction (perceptron scores, e.g., 1.79, 0.21, -1.09, 1.46, 0.59).]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans are underspecified or have ignorable details attached, and are sometimes inaccurate but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
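The multiple-parse update above can be sketched as follows; the feature names are illustrative, and this is a simplified reading of the scheme (every candidate beating the predicted parse contributes an update scaled by the success-rate gap):

```python
from collections import defaultdict

def multi_parse_update(weights, candidates, predicted):
    """candidates: list of (features, success_rate); predicted: (features, rate)."""
    pred_feats, pred_rate = predicted
    for feats, rate in candidates:
        if rate > pred_rate:                 # only better-executing candidates
            scale = rate - pred_rate         # weight update by the rate gap
            for k, v in feats.items():
                weights[k] += scale * v      # promote the better candidate
            for k, v in pred_feats.items():
                weights[k] -= scale * v      # demote the current prediction
    return weights

w = defaultdict(float)
predicted = ({"x": 1.0}, 0.2)
cands = [({"y": 1.0}, 0.9), ({"z": 1.0}, 0.6), predicted]
multi_parse_update(w, cands, predicted)
# w: y += 0.7, z += 0.4, x -= 1.1 (up to floating-point rounding)
```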

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, step 1: among the n-best candidates (execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59), a candidate with a higher success rate than the best prediction contributes a feature-vector-difference update.]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, step 2: the next candidate with a higher execution success rate than the best prediction contributes a second feature-vector-difference update.]

78

Features
• Binary indicator: whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)      L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)      L5: Travel(), Verify(at: SOFA)      L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
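Extracting such binary indicator features amounts to walking the parse tree and switching on one feature per observed parent-to-children composition. A minimal sketch with an assumed tuple encoding of trees (not the thesis' actual data structures):

```python
def indicator_features(tree, feats=None):
    """tree: (label, [children]) nested tuples; returns the set of on-features."""
    if feats is None:
        feats = set()
    label, children = tree
    if children:
        # One binary feature per parent -> children composition, e.g. "L1->L3".
        feats.add(f"{label}->" + " ".join(c[0] for c in children))
        for c in children:
            indicator_features(c, feats)
    return feats

tree = ("L1", [("L3", [("L5", []), ("L6", [])])])
print(sorted(indicator_features(tree)))  # ['L1->L3', 'L3->L5 L6']
```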

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans (and corresponding parses) out of the 1,000,000-best parses: many parse trees differ insignificantly, leading to the same derived MR plans, so sufficiently large 1,000,000-best parse lists are generated from the baseline model

80
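The deduplication step above can be sketched as keeping the first (highest-scoring) parse per distinct derived MR plan, up to a limit. The parse names and plans below are illustrative:

```python
def distinct_by_plan(parses, derive_mr, limit=50):
    """parses: score-sorted list; derive_mr maps a parse to its composed MR plan."""
    seen, kept = set(), []
    for p in parses:
        mr = derive_mr(p)
        if mr not in seen:       # keep only the first parse per distinct plan
            seen.add(mr)
            kept.append(p)
            if len(kept) == limit:
                break
    return kept

parses = ["p1", "p2", "p3", "p4"]
plans = {"p1": "Turn(LEFT)", "p2": "Turn(LEFT)", "p3": "Travel()", "p4": "Turn(RIGHT)"}
print(distinct_by_plan(parses, plans.get, limit=2))  # ['p1', 'p3']
```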

Response-based Update vs. Baseline (English)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline         74.81      76.44      57.22      67.14      20.17      28.12
Response-based   73.32      77.24      59.65      68.27      22.62      29.2

81

Response-based Update vs. Baseline (Chinese-Word)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline         75.53      76.41      61.03      63.4       19.08      23.12
Response-based   77.26      77.74      64.12      65.64      21.29      23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                 Parse F1              Single-sentence       Paragraph
                 Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline         73.05      77.55      55.61      62.85      12.74      23.33
Response-based   76.26      79.76      64.08      65.5       22.25      25.35

83

Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

         Parse F1              Single-sentence       Paragraph
         Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single   73.32      77.24      59.65      68.27      22.62      29.2
Multi    73.43      77.81      62.81      68.93      26.57      29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

         Parse F1              Single-sentence       Paragraph
         Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single   77.26      77.74      64.12      65.64      21.29      23.74
Multi    78.8       78.11      64.15      66.27      21.55      25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

         Parse F1              Single-sentence       Paragraph
         Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single   76.26      79.76      64.08      65.5       22.25      25.35
Multi    79.44      79.94      64.08      66.84      22.58      27.16

87

Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable due to annotation of training data
• Grounded language learning from relevant perceptual context is promising, and the training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 65: Grounded Language Learning Models for Ambiguous  Supervision

End-to-End Execution Evaluations(Chinese-Character)

Single-Sentence Paragraph

5727

1673

5561

1274

6285

2333

Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model

65

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 66: Grounded Language Learning Models for Ambiguous  Supervision

Discussionbull Better recall in parse accuracy

ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage

ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data

ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights

bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization

bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking
• Generative model
  – The trained model outputs the single best result with maximum probability
[Figure: a trained generative model maps a testing example to the 1-best candidate with maximum probability]

70

Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Figure: a testing example goes through the trained baseline generative model (GEN), producing n-best candidates (Candidate 1 … Candidate n); a trained secondary discriminative model then selects the best prediction as output]

71
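The reranking step itself is just a linear scorer over candidate features. A minimal sketch in Python, assuming candidates are represented as sparse feature dicts and a weight vector has already been learned (all names here are illustrative, not from the thesis implementation):

```python
def rerank(weights, nbest_feats):
    """Return the index of the n-best candidate with the highest linear score."""
    def score(feats):
        # Sparse dot product of the weight vector and a feature vector.
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(range(len(nbest_feats)), key=lambda i: score(nbest_feats[i]))

# Usage: a candidate carrying the higher-weighted feature wins.
w = {"good": 2.0, "bad": -1.0}
best = rerank(w, [{"bad": 1.0}, {"good": 1.0}])  # selects index 1
```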

How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – No single gold-standard reference parse exists for each training example
  – Training instead provides only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds, as in the final end-task plan-execution evaluation
  – Gives a weak indication of whether a candidate is good or bad
  – Allows multiple candidate parses per parameter update, since the response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated whenever the trained model predicts a wrong candidate
[Figure: a training example goes through the trained baseline generative model (GEN) to produce n-best candidates with feature vectors 𝐚₁ … 𝐚ₙ and perceptron scores (−0.16, 1.21, −1.09, 1.46, 0.59); when the best prediction differs from the gold-standard reference (feature vector 𝐚_g), the weights are updated by the feature-vector difference 𝐚_g − 𝐚₄. For our generative models, such a gold-standard reference is not available.]

73
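The perceptron step above can be sketched as follows, assuming each candidate parse is a sparse feature dict and the gold candidate's index is known. For brevity this shows the plain-perceptron update; the averaged variant would additionally average the weight vectors accumulated across updates. All names are illustrative, not the thesis implementation:

```python
def score(weights, feats):
    """Sparse dot product of the weight vector and a feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def perceptron_update(weights, candidates, gold_idx):
    """One perceptron step: if the current top-scoring candidate is not the
    gold one, add (gold features - predicted features) to the weights."""
    pred_idx = max(range(len(candidates)),
                   key=lambda i: score(weights, candidates[i]))
    if pred_idx != gold_idx:
        for f, v in candidates[gold_idx].items():
            weights[f] = weights.get(f, 0.0) + v
        for f, v in candidates[pred_idx].items():
            weights[f] = weights.get(f, 0.0) - v
    return pred_idx
```

After an update, the gold candidate outscores the wrongly predicted one on this example, which is exactly the behavior the reranker is trained toward.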

Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world; it is also used for evaluating end-goal plan-execution performance
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training

74
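The pseudo-gold selection can be sketched as below, with `execute` a stand-in for the MARCO execution module (a hypothetical callable returning whether a single trial reached the destination):

```python
def success_rate(execute, mr_plan, world, trials=10):
    """Fraction of trials in which executing the plan reaches the goal,
    averaged over repeated trials because execution is nondeterministic."""
    return sum(bool(execute(mr_plan, world)) for _ in range(trials)) / trials

def pick_pseudo_gold(execute, candidate_mrs, world, trials=10):
    """Return (index, rate) of the candidate MR with the best success rate."""
    rates = [success_rate(execute, mr, world, trials) for mr in candidate_mrs]
    best = max(range(len(rates)), key=lambda i: rates[i])
    return best, rates[best]
```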

Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Figure: the derived MRs of the n-best candidates (MR₁ … MRₙ) are run through the MARCO execution module, yielding execution success rates (0.6, 0.4, 0.0, 0.9, 0.2) alongside perceptron scores (1.79, 0.21, −1.09, 1.46, 0.59); the candidate with the highest success rate becomes the pseudo-gold reference, and the perceptron updates toward it by the feature-vector difference from the best prediction]

75

Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still indicate correct plans, given the indirect supervision of human follower actions: MR plans may be underspecified or carry ignorable details, and sometimes inaccurate plans still contain the correct MR components needed to reach the desired goal
• Weight update with multiple candidate parses
  – Use every candidate with a higher execution success rate than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
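A sketch of this multiple-parse update, assuming each candidate carries a sparse feature dict and a precomputed execution success rate (names illustrative, not the thesis implementation):

```python
def score(weights, feats):
    """Sparse dot product of the weight vector and a feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def multi_parse_update(weights, candidates, rates):
    """Every candidate executing better than the current prediction
    contributes an update, scaled by the success-rate difference."""
    pred = max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))
    for i, feats in enumerate(candidates):
        margin = rates[i] - rates[pred]
        if margin > 0:  # this candidate executes better than the prediction
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + margin * v
            for f, v in candidates[pred].items():
                weights[f] = weights.get(f, 0.0) - margin * v
```

Scaling by `margin` means candidates far better than the prediction pull the weights harder than marginal improvements, which distributes the weak response signal across all preferable parses.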

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Figure, update (1): among the derived MRs of the n-best candidates, with execution success rates (0.6, 0.4, 0.0, 0.9, 0.2) and perceptron scores (1.24, 1.83, −1.09, 1.46, 0.59), the first candidate whose success rate exceeds the prediction's contributes a feature-vector-difference update]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Figure, update (2): the next such candidate contributes a second feature-vector-difference update in the same way]

78

Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

Example features: 𝒇(L1→L3) = 1, 𝒇(L3→L5 ∨ L1) = 1, 𝒇(L3⇒L5 L6) = 1, 𝒇(L5, find) = 1

79
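Rule-composition indicators like these can be read off a parse tree with a short traversal. A sketch assuming trees are `(label, children)` tuples with string leaves; the exact feature templates used in the thesis may differ:

```python
def tree_features(tree, feats=None):
    """Collect binary rule-composition indicators from a parse tree.
    Each feature names a parent label and its immediate children."""
    if feats is None:
        feats = set()
    label, children = tree
    # Child names: nonterminal labels or terminal strings.
    kids = [c if isinstance(c, str) else c[0] for c in children]
    if kids:
        feats.add(f"{label}->{' '.join(kids)}")  # local rule indicator
    for c in children:
        if not isinstance(c, str):
            tree_features(c, feats)  # recurse into subtrees
    return feats
```

Because the features are a set of indicators, two parses can be compared (and their feature-vector difference taken) simply by set operations or by treating each indicator as a dimension with value 1.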

Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – To obtain 50-best *distinct* composed MR plans (and their corresponding parses), generate a sufficiently large 1,000,000-best parse list from the baseline model, since many parse trees differ insignificantly and lead to the same derived MR plan

80
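The deduplication step can be sketched as a single pass over the best-first parse list, with `plan_of` a stand-in for composing a parse into its MR plan:

```python
def distinct_nbest(parses, plan_of, n=50):
    """Keep the first n parses whose composed MR plans are distinct.
    `parses` is assumed sorted best-first; `plan_of` maps a parse to a
    hashable representation of its composed MR plan."""
    seen, kept = set(), []
    for p in parses:
        plan = plan_of(p)
        if plan not in seen:  # skip parses that duplicate an earlier plan
            seen.add(plan)
            kept.append(p)
            if len(kept) == n:
                break
    return kept
```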

Response-based Update vs. Baseline (English)

                      Parse F1          Single-sentence        Paragraph
                  Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Baseline          74.81    76.44       57.22    67.14       20.17    28.12
Response-based    73.32    77.24       59.65    68.27       22.62    29.20

81

Response-based Update vs. Baseline (Chinese-Word)

                      Parse F1          Single-sentence        Paragraph
                  Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Baseline          75.53    76.41       61.03    63.40       19.08    23.12
Response-based    77.26    77.74       64.12    65.64       21.29    23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                      Parse F1          Single-sentence        Paragraph
                  Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Baseline          73.05    77.55       55.61    62.85       12.74    23.33
Response-based    76.26    79.76       64.08    65.50       22.25    25.35

83

Response-based Update vs. Baseline
• The response-based approach outperforms the baselines on the final end-task plan execution, because it directly optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                      Parse F1          Single-sentence        Paragraph
                  Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Single            73.32    77.24       59.65    68.27       22.62    29.20
Multiple          73.43    77.81       62.81    68.93       26.57    29.10

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                      Parse F1          Single-sentence        Paragraph
                  Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Single            77.26    77.74       64.12    65.64       21.29    23.74
Multiple          78.80    78.11       64.15    66.27       21.55    25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                      Parse F1          Single-sentence        Paragraph
                  Hier.   Unigram      Hier.   Unigram      Hier.   Unigram
Single            76.26    79.76       64.08    65.50       22.25    25.35
Multiple          79.44    79.94       64.08    66.84       22.58    27.16

87

Response-based Update with Multiple vs. Single Parses
• Using multiple parses generally improves performance
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion
• Conventional language learning is expensive and not scalable because it requires annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 67: Grounded Language Learning Models for Ambiguous  Supervision

Comparison of Grammar Size and EM Training Time

67

Data

Hierarchy GenerationPCFG Model

Unigram GenerationPCFG Model

|Grammar| Time (hrs) |Grammar| Time (hrs)

English 20451 1726 16357 878

Chinese (Word) 21636 1599 15459 805

Chinese (Character) 19792 1864 13514 1258

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 68: Grounded Language Learning Models for Ambiguous  Supervision

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

68

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: derived MRs (MR1 … MRn) are scored by the MARCO execution module (execution success rates 0.6, 0.4, 0.0, 0.9, 0.2) and by the perceptron (scores 1.24, 1.83, -1.09, 1.46, 0.59); update (1) applies the feature vector difference against one candidate whose success rate exceeds the best prediction's.]

77

Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: same setting as the previous slide; update (2) applies the feature vector difference against another candidate whose success rate exceeds the best prediction's.]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Turn left and find the sofa then turn around the corner

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79

Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)

• Compared with two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Try to get the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
  Many parse trees differ insignificantly, leading to the same derived MR plans
  Generate sufficiently large 1,000,000-best parse trees from the baseline model

80
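The deduplication step described above (keeping the 50 best parses with distinct derived MR plans) can be sketched as a single scan over the probability-sorted n-best list; the `(parse, mr_plan)` pairing is an assumed input format:

```python
def distinct_mr_parses(nbest, k=50):
    """Scan a probability-sorted n-best parse list and keep the first
    (highest-probability) parse for each distinct derived MR plan,
    stopping after k distinct plans (sketch)."""
    seen, kept = set(), []
    for parse, mr in nbest:
        if mr not in seen:
            seen.add(mr)
            kept.append((parse, mr))
            if len(kept) == k:
                break
    return kept
```

Because candidates are already sorted by model probability, the retained parse for each MR plan is its most probable derivation.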

Response-based Update vs. Baseline (English)

                    Parse F1             Single-sentence        Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Baseline           74.81     76.44     57.22      67.14       20.17     28.12
Response-based     73.32     77.24     59.65      68.27       22.62     29.20

81

Response-based Update vs. Baseline (Chinese-Word)

                    Parse F1             Single-sentence        Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Baseline           75.53     76.41     61.03      63.46       19.08     23.12
Response-based     77.26     77.74     64.12      65.64       21.29     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                    Parse F1             Single-sentence        Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Baseline           73.05     77.55     55.61      62.85       12.74     23.33
Response-based     76.26     79.76     64.08      65.50       22.25     25.35

83

Response-based Update vs. Baseline

• vs. Baseline
– The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                    Parse F1             Single-sentence        Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Single             73.32     77.24     59.65      68.27       22.62     29.20
Multi              73.43     77.81     62.81      68.93       26.57     29.10

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                    Parse F1             Single-sentence        Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Single             77.26     77.74     64.12      65.64       21.29     23.74
Multi              78.80     78.11     64.15      66.27       21.55     25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                    Parse F1             Single-sentence        Paragraph
                 Hierarchy  Unigram   Hierarchy  Unigram    Hierarchy  Unigram
Single             76.26     79.76     64.08      65.50       22.25     25.35
Multi              79.44     79.94     64.08      66.84       22.58     27.16

87

Response-based Update with Multiple vs. Single Parses

• Using multiple parses improves the performance in general
– A single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but still capture the gist of the preferred actions
– A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learning with raw features (sensory and vision data)

90

Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and does not scale, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 69: Grounded Language Learning Models for Ambiguous  Supervision

Discriminative Rerankingbull Effective approach to improve performance of generative

models with secondary discriminative modelbull Applied to various NLP tasks

ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)

ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)

ndash Part-of-speech tagging (Collins EMNLP 2002)

ndash Semantic role labeling (Toutanova et al ACL 2005)

ndash Named entity recognition (Collins ACL 2002)

ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)

ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)

bull Goal ndash Adapt discriminative reranking to grounded language learning

69

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 70: Grounded Language Learning Models for Ambiguous  Supervision

Discriminative Reranking

bull Generative modelndash Trained model outputs the best result with max probability

TrainedGenerative

Model

1-best candidate with maximum probability

Candidate 1

Testing Example

70

Discriminative Rerankingbull Can we do better

ndash Secondary discriminative model picks the best out of n-best candidates from baseline model

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Testing Example

TrainedSecondary

DiscriminativeModel

Best prediction

Output

71

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learning with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences with ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You

Discriminative Reranking

• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: a testing example is fed to the trained baseline generative model (GEN), which produces n-best candidates (Candidate 1, ..., Candidate n); the trained secondary discriminative model then selects the best prediction as the output.]

71
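The reranking step itself is simple once the weights are learned: score each n-best candidate's feature vector against the weight vector and return the argmax. A minimal sketch, where the candidate outputs, feature names, and weight values are illustrative assumptions rather than the thesis's actual features:

```python
# Minimal reranking sketch: pick the highest-scoring n-best candidate
# under a learned weight vector. All names and values are illustrative.

def rerank(candidates, w):
    """candidates: list of (output, {feature: value}) pairs."""
    score = lambda feats: sum(w.get(k, 0.0) * v for k, v in feats.items())
    return max(candidates, key=lambda c: score(c[1]))[0]

w = {"good_rule": 2.0, "bad_rule": -1.0}
nbest = [
    ("plan1", {"good_rule": 1.0}),                   # score  2.0
    ("plan2", {"good_rule": 1.0, "bad_rule": 1.0}),  # score  1.0
    ("plan3", {"bad_rule": 2.0}),                    # score -2.0
]
best = rerank(nbest, w)  # "plan1"
```

The baseline generative model supplies `nbest`; all the discriminative component adds is the linear scoring over parse features.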

How can we apply discriminative reranking?

• It is impossible to apply standard discriminative reranking directly to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, weak supervision is provided by the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for the parameter update
    • The response signal is weak and distributed over all candidates

72

Reranking Model: Averaged Perceptron (Collins, 2000)

• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: a training example is fed to the trained baseline generative model (GEN), producing n-best candidates a1, a2, ..., an with perceptron scores (e.g., -0.16, 1.21, -1.09, 1.46, 0.59); when the best prediction (here a4) differs from the gold-standard reference ag, the weights are updated by the feature-vector difference ag - a4. For our generative models, a gold-standard reference is not available.]

73
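In code, one step of this Collins-style update can be sketched as follows. The feature dictionaries and the gold index are toy assumptions, and a real averaged perceptron would additionally keep a running average of the weight vector across iterations:

```python
# Sketch of a Collins-style perceptron update for reranking: if the model's
# best-scoring candidate is not the gold reference, add the gold feature
# vector and subtract the predicted one. Toy data; the averaged variant
# would also accumulate weights over iterations.

def perceptron_step(w, candidates, gold):
    score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
    best = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    if best != gold:
        for k, v in candidates[gold].items():   # move toward the gold parse
            w[k] = w.get(k, 0.0) + v
        for k, v in candidates[best].items():   # move away from the mistake
            w[k] = w.get(k, 0.0) - v
    return best

w = {}
candidates = [{"f1": 1.0}, {"f2": 1.0}, {"f1": 1.0, "f2": 1.0}]
for _ in range(5):  # repeat until the gold candidate (index 2) wins
    best = perceptron_step(w, candidates, gold=2)
```

After a couple of mistakes and updates, the gold candidate scores highest and the weights stop changing.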

Response-based Weight Update

• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
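The selection loop can be sketched as below, with a deterministic stand-in for the nondeterministic MARCO executor; the `execute` callable and the toy outcome sequences are assumptions for illustration, not the real module's interface:

```python
# Sketch of pseudo-gold selection: run each candidate MR several times in the
# (nondeterministic) executor, average the successes, and keep the candidate
# with the best rate. The executor below is a deterministic toy stand-in
# for MARCO.

def execution_success_rate(mr, execute, trials=10):
    return sum(execute(mr) for _ in range(trials)) / trials

def pick_pseudo_gold(mrs, execute, trials=10):
    rates = [execution_success_rate(mr, execute, trials) for mr in mrs]
    best = max(range(len(mrs)), key=lambda i: rates[i])
    return best, rates

# Toy outcome sequences (1 = reached the intended destination in that trial).
outcomes = {
    "MR1": [1, 1, 1, 0, 0, 1, 0, 1, 0, 1],  # 6/10 successes
    "MR2": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # 9/10 successes
    "MR3": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # 2/10 successes
}
streams = {mr: iter(seq) for mr, seq in outcomes.items()}
best, rates = pick_pseudo_gold(list(outcomes), lambda mr: next(streams[mr]))
```

Here `best` indexes "MR2", whose averaged success rate (0.9) is highest, so it would serve as the pseudo-gold reference for the perceptron update.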

Response-based Update

• Select the pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates (Candidate 1, ..., Candidate n) yield derived MRs MR1, ..., MRn; the MARCO execution module assigns each an execution success rate (e.g., 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) serves as the pseudo-gold reference, and the perceptron (scores e.g. 1.79, 0.21, -1.09, 1.46, 0.59) is updated by the feature-vector difference between it and the best prediction.]

75

Weight Update with Multiple Parses

• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
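A sketch of this weighted update follows; the feature vectors and success rates are illustrative, and the scaling of each feature-vector difference by the success-rate gap follows the description above:

```python
# Sketch of the multiple-parse weight update: every candidate whose execution
# success rate exceeds the currently best-scoring parse's rate contributes
# (its features - predicted features), scaled by the success-rate gap.
# Feature vectors and rates below are illustrative.

def multi_parse_update(w, feats, rates):
    score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
    pred = max(range(len(feats)), key=lambda i: score(feats[i]))
    for i, f in enumerate(feats):
        gap = rates[i] - rates[pred]
        if gap > 0:  # this candidate executes better than the prediction
            for k, v in f.items():
                w[k] = w.get(k, 0.0) + gap * v
            for k, v in feats[pred].items():
                w[k] = w.get(k, 0.0) - gap * v
    return pred

w = {"a": 1.0}
feats = [{"a": 1.0}, {"b": 1.0}, {"a": 1.0, "b": 1.0}]
rates = [0.2, 0.9, 0.6]
pred = multi_parse_update(w, feats, rates)  # model currently prefers index 0
```

Both better-executing candidates (rates 0.9 and 0.6) pull the weights away from the predicted parse, with the larger gap contributing more.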

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: the candidates' derived MRs MR1, ..., MRn receive execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 from the MARCO execution module and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59; update (1) uses the feature-vector difference with the first candidate whose success rate exceeds the predicted parse's.]

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram: same setting as the previous slide; update (2) repeats the weighted update with the next candidate whose success rate exceeds the predicted parse's.]

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa, then turn around the corner."

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)
L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at: SOFA)
L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
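Such indicator features can be read off a parse tree mechanically. A sketch with a toy tree encoding (the tuple format and labels are illustrative assumptions, not the thesis's exact feature templates):

```python
# Sketch of extracting binary rule-composition indicator features from a
# parse tree: one feature per parent -> children rule plus one per terminal
# attachment. The (label, children) tree encoding is an assumption.

def rule_features(tree):
    """tree: (label, children), where children is a list of subtrees
    or a terminal string."""
    label, children = tree
    if isinstance(children, str):                       # lexical feature, e.g. f(L5, "find")
        return {(label, children)}
    feats = {(label, tuple(c[0] for c in children))}    # rule feature, e.g. f(L3 -> L5 L6)
    for c in children:
        feats |= rule_features(c)
    return feats

tree = ("L1", [("L3", [("L5", "find"), ("L6", "turn")])])
feats = rule_features(tree)
```

Each extracted tuple becomes a binary feature whose weight the perceptron learns; a candidate's feature vector is just the set of compositions present in its parse tree.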

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Aim for the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate a sufficiently large 1,000,000-best parse list from the baseline model

80
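The 50-best-distinct-plans step amounts to de-duplicating the much larger k-best list by derived plan. A sketch, where `derive` and the toy parses are placeholders:

```python
# Sketch of keeping the n best *distinct* MR plans from a long k-best parse
# list: parses that derive an already-seen plan are skipped. The `derive`
# function and the toy parses are illustrative placeholders.

def distinct_by_plan(parses, derive, n):
    seen, kept = set(), []
    for parse in parses:          # parses assumed sorted best-first
        plan = derive(parse)
        if plan not in seen:
            seen.add(plan)
            kept.append(parse)
            if len(kept) == n:
                break
    return kept

# Toy parses: (parse id, derived plan); many parses share the same plan.
parses = [(0, "A"), (1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "B")]
kept = distinct_by_plan(parses, lambda p: p[1], n=2)  # [(0, "A"), (2, "B")]
```

Keeping only the first parse per plan is what makes a huge 1,000,000-best list necessary: most of its entries collapse onto a handful of distinct plans.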

Response-based Update vs Baseline (English)

81

Parse F1 (%):                    Hierarchy   Unigram
  Baseline                       74.81       76.44
  Response-based                 73.32       77.24

Single-sentence plan execution (%):
  Baseline                       57.22       67.14
  Response-based                 59.65       68.27

Paragraph plan execution (%):
  Baseline                       20.17       28.12
  Response-based                 22.62       29.20

Response-based Update vs Baseline (Chinese-Word)

82

Parse F1 (%):                    Hierarchy   Unigram
  Baseline                       75.53       76.41
  Response-based                 77.26       77.74

Single-sentence plan execution (%):
  Baseline                       61.03       63.46
  Response-based                 64.12       65.64

Paragraph plan execution (%):
  Baseline                       19.08       23.12
  Response-based                 21.29       23.74

Response-based Update vs Baseline (Chinese-Character)

83

Parse F1 (%):                    Hierarchy   Unigram
  Baseline                       73.05       77.55
  Response-based                 76.26       79.76

Single-sentence plan execution (%):
  Baseline                       55.61       62.85
  Response-based                 64.08       65.50

Paragraph plan execution (%):
  Baseline                       12.74       23.33
  Response-based                 22.25       25.35

Response-based Update vs Baseline

• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 72: Grounded Language Learning Models for Ambiguous  Supervision

How can we apply discriminative reranking

bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual

context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated

worldsUsed in evaluating the final end-task plan execution

ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update

Response signal is weak and distributed over all candidates

72

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 73: Grounded Language Learning Models for Ambiguous  Supervision

Reranking Model Averaged Perceptron (Collins 2000)

bull Parameter weight vector is updated when trained model predicts a wrong candidate

TrainedBaseline

GenerativeModel

GEN

hellip

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate nhellip

Training Example

Perceptron

Gold StandardReference

Best prediction

Updatefeaturevector119938120783

119938120784

119938120785

119938120786

119938119951119938119944

119938119944minus119938120786

perceptronscore

-016

121

-109

146

059

73

Our generative models

NotAvailable

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse.

[Figure: the same n-best candidates, now scored by the updated perceptron:]

Candidate   Execution success rate   Perceptron score
1           0.6                      1.24
2           0.4                      1.83
3           0.0                      -1.09
4           0.9                      1.46
n           0.2                      0.59

The best prediction is now candidate 2 (score 1.83, success rate 0.4); Update (1) moves the weights toward candidate 1, whose execution success rate (0.6) is higher than the prediction's.

77

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse.

[Figure: the same n-best candidates and scores as above.]

Candidate   Execution success rate   Perceptron score
1           0.6                      1.24
2           0.4                      1.83
3           0.0                      -1.09
4           0.9                      1.46
n           0.2                      0.59

Update (2) moves the weights toward candidate 4, whose execution success rate (0.9) is also higher than the prediction's.

78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006).

Example: "Turn left and find the sofa then turn around the corner"

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1    f(L3 → L5 or L1) = 1    f(L3 ⇒ L5 L6) = 1    f(L5, "find") = 1

79
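These indicator features can be read off a candidate parse roughly as follows (a sketch assuming a simple (label, children) tuple representation, not the system's actual tree classes):

```python
def rule_features(tree):
    # Collect binary indicators f(parent -> child labels) = 1 for every
    # composition of nonterminals/terminals appearing in the parse tree.
    feats = set()

    def walk(node):
        if isinstance(node, str):      # terminal leaf
            return
        label, children = node
        kids = tuple(c if isinstance(c, str) else c[0] for c in children)
        feats.add((label, kids))       # e.g. ('L3', ('L5', 'L6'))
        for c in children:
            walk(c)

    walk(tree)
    return feats
```

Each element of the returned set becomes one binary feature of the candidate parse for the reranker.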

Evaluations

• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)

• Compared with two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Try to get 50-best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best list of parse trees is generated from the baseline model

80
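The leave-one-map-out protocol amounts to a simple cross-validation loop; a minimal sketch with stub `train`/`evaluate` callables (names and map labels hypothetical):

```python
def leave_one_map_out(maps, train, evaluate):
    # For each map, train on the remaining maps and test on the held-out one.
    results = {}
    for held_out in maps:
        model = train([m for m in maps if m != held_out])
        results[held_out] = evaluate(model, held_out)
    return results
```

With three maps this yields three train/test splits, each using two maps for training and one for testing.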

Response-based Update vs Baseline (English)

81

Parse F1
                 Hierarchy   Unigram
Baseline         74.81       76.44
Response-based   73.32       77.24

Plan execution (single-sentence)
                 Hierarchy   Unigram
Baseline         57.22       67.14
Response-based   59.65       68.27

Plan execution (paragraph)
                 Hierarchy   Unigram
Baseline         20.17       28.12
Response-based   22.62       29.2

Response-based Update vs Baseline (Chinese-Word)

82

Parse F1
                 Hierarchy   Unigram
Baseline         75.53       76.41
Response-based   77.26       77.74

Plan execution (single-sentence)
                 Hierarchy   Unigram
Baseline         61.03       63.4
Response-based   64.12       65.64

Plan execution (paragraph)
                 Hierarchy   Unigram
Baseline         19.08       23.12
Response-based   21.29       23.74

Response-based Update vs Baseline (Chinese-Character)

83

Parse F1
                 Hierarchy   Unigram
Baseline         73.05       77.55
Response-based   76.26       79.76

Plan execution (single-sentence)
                 Hierarchy   Unigram
Baseline         55.61       62.85
Response-based   64.08       65.5

Plan execution (paragraph)
                 Hierarchy   Unigram
Baseline         12.74       23.33
Response-based   22.25       25.35

Response-based Update vs Baseline

• vs baseline:
– The response-based approach performs better in the final end-task plan execution.
– It optimizes the model for plan execution.

84

Response-based Update with Multiple vs Single Parses (English)

85

Parse F1
         Hierarchy   Unigram
Single   73.32       77.24
Multi    73.43       77.81

Plan execution (single-sentence)
         Hierarchy   Unigram
Single   59.65       68.27
Multi    62.81       68.93

Plan execution (paragraph)
         Hierarchy   Unigram
Single   22.62       29.2
Multi    26.57       29.1

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Parse F1
         Hierarchy   Unigram
Single   77.26       77.74
Multi    78.8        78.11

Plan execution (single-sentence)
         Hierarchy   Unigram
Single   64.12       65.64
Multi    64.15       66.27

Plan execution (paragraph)
         Hierarchy   Unigram
Single   21.29       23.74
Multi    21.55       25.95

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Parse F1
         Hierarchy   Unigram
Single   76.26       79.76
Multi    79.44       79.94

Plan execution (single-sentence)
         Hierarchy   Unigram
Single   64.08       65.5
Multi    64.08       66.84

Plan execution (paragraph)
         Hierarchy   Unigram
Single   22.25       25.35
Multi    22.58       27.16

Response-based Update with Multiple vs Single Parses

• Using multiple parses generally improves performance:
– The single-best pseudo-gold parse provides only weak feedback.
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions.
– A variety of preferable parses helps improve the amount and the quality of the weak feedback.

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection, model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable, due to the annotation of training data.

• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain.

• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences with ambiguous supervision.

• Discriminative reranking is possible and effective with weak feedback from the perceptual environment.

92

Thank You

Page 74: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Weight Update

bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and

evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance

ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials

ndash Prefer the candidate with the best execution success rate during training

74

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 75: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Updatebull Select pseudo-gold reference based on MARCO execution

results

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Pseudo-goldReference

Best prediction

UpdateDerived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

179

021

-109

146

059

75

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 76: Grounded Language Learning Models for Ambiguous  Supervision

Weight Update with Multiple Parses

bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given

indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the

desired goal

bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently

best-predicted candidatendash Update with feature vector difference weighted by difference

between execution success rates

76

Weight Update with Multiple Parses

• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, shown in two animation steps (Update (1), Update (2)): the perceptron scores the n-best candidate parses (Candidate 1 … Candidate n; scores 1.24, 1.83, −1.09, 1.46, …, 0.59) and selects a best prediction; each candidate's derived MR (MR1 … MRn) is executed by the MARCO execution module to obtain an execution success rate (0.6, 0.4, 0.0, 0.9, …, 0.2); the weights are updated with the feature-vector difference between each higher-success candidate and the best prediction]

77–78

Features

• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)   L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)   L5: Travel(), Verify(at: SOFA)   L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5 → "find") = 1

79
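A minimal sketch of extracting such binary composition features, assuming a simple (label, children) tree encoding; the encoding and the feature-name format are illustrative assumptions, not the exact feature set used in the thesis:

```python
# Binary indicator features over parse-tree compositions:
# parent-child pairs and full rule expansions, collected recursively.

def tree_features(tree):
    """tree: (label, [children]), with leaves encoded as (word, [])."""
    feats = set()
    label, children = tree
    for child in children:
        feats.add(f"{label}->{child[0]}")      # parent-child pair
    if children:
        rhs = " ".join(c[0] for c in children)
        feats.add(f"{label}=>{rhs}")           # full rule expansion
    for child in children:
        feats |= tree_features(child)          # recurse into subtrees
    return feats

# Toy tree loosely mirroring the example above.
tree = ("L1", [("L3", [("L5", [("find", [])]), ("L6", [])])])
feats = tree_features(tree)
```

In a reranker each such feature name is mapped to a 0/1 (or count) entry of the feature vector that the perceptron scores.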

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse lists from the baseline model

80
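The leave-one-map-out protocol above can be sketched as a simple loop; the map names and the train/evaluate callables are placeholders, not the thesis code:

```python
# Leave-one-map-out evaluation: each map is held out once for testing
# while the remaining maps are used for training.

maps = ["map_A", "map_B", "map_C"]

def leave_one_map_out(maps, train, evaluate):
    results = {}
    for held_out in maps:
        train_maps = [m for m in maps if m != held_out]
        model = train(train_maps)
        results[held_out] = evaluate(model, held_out)
    return results

# Toy run: the "model" is just the number of training maps,
# and evaluation returns it unchanged.
res = leave_one_map_out(
    maps,
    train=lambda ms: len(ms),
    evaluate=lambda model, m: model,
)
```

With three maps this yields three train/test folds, and reported numbers are averages over the folds.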

Response-based Update vs. Baseline (English)

                 Parse F1            Single-sentence     Paragraph
                 Hierarchy  Unigram  Hierarchy  Unigram  Hierarchy  Unigram
Baseline           74.81     76.44     57.22     67.14     20.17     28.12
Response-based     73.32     77.24     59.65     68.27     22.62     29.2

81

Response-based Update vs. Baseline (Chinese-Word)

                 Parse F1            Single-sentence     Paragraph
                 Hierarchy  Unigram  Hierarchy  Unigram  Hierarchy  Unigram
Baseline           75.53     76.41     61.03     63.4      19.08     23.12
Response-based     77.26     77.74     64.12     65.64     21.29     23.74

82

Response-based Update vs. Baseline (Chinese-Character)

                 Parse F1            Single-sentence     Paragraph
                 Hierarchy  Unigram  Hierarchy  Unigram  Hierarchy  Unigram
Baseline           73.05     77.55     55.61     62.85     12.74     23.33
Response-based     76.26     79.76     64.08     65.5      22.25     25.35

83

Response-based Update vs. Baseline

• vs. Baseline
  – The response-based approach performs better on the final end-task plan execution
  – It optimizes the model for plan execution

84

Response-based Update with Multiple vs. Single Parses (English)

                 Parse F1            Single-sentence     Paragraph
                 Hierarchy  Unigram  Hierarchy  Unigram  Hierarchy  Unigram
Single             73.32     77.24     59.65     68.27     22.62     29.2
Multi              73.43     77.81     62.81     68.93     26.57     29.1

85

Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                 Parse F1            Single-sentence     Paragraph
                 Hierarchy  Unigram  Hierarchy  Unigram  Hierarchy  Unigram
Single             77.26     77.74     64.12     65.64     21.29     23.74
Multi              78.8      78.11     64.15     66.27     21.55     25.95

86

Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                 Parse F1            Single-sentence     Paragraph
                 Hierarchy  Unigram  Hierarchy  Unigram  Hierarchy  Unigram
Single             76.26     79.76     64.08     65.5      22.25     25.35
Multi              79.44     79.94     64.08     66.84     22.58     27.16

87

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback

88

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learning with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and does not scale, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 77: Grounded Language Learning Models for Ambiguous  Supervision

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (1)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

77

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 78: Grounded Language Learning Models for Ambiguous  Supervision

Weight Update with Multiple Parses

bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse

n-best candidates

Candidate 1

Candidate 2

Candidate 3

Candidate 4

Candidate n

hellip

Perceptron

Best prediction

Update (2)Derived

MRs

119924119929120783

119924119929120784

119924119929120785

119924119929120786

119924119929119951

Feature vector Difference

MARCOExecutionModule

ExecutionSuccess

Rate120782 120788120782 120786120782 120782

120782 120791

120782 120784

PerceptronScore

124

183

-109

146

059

78

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 79: Grounded Language Learning Models for Ambiguous  Supervision

bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)

L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)

Features

Turn left and find the sofa then turn around the corner

L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)

L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()

79

119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783

Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)

bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according

parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR

plansGenerate sufficiently large 1000000-best parse trees from baseline

model80

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable because training data must be manually annotated
• Grounded language learning from relevant perceptual context is promising, and training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 80: Grounded Language Learning Models for Ambiguous  Supervision

Evaluations

• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate a sufficiently large 1,000,000-best parse list from the baseline model

80
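The leave-one-map-out protocol above can be sketched as a tiny cross-validation loop. `train` and `evaluate` are placeholder callables standing in for the actual system components, which the slide does not spell out:

```python
def leave_one_map_out(maps, train, evaluate):
    """Each map serves once as the test set; the rest are pooled for training.

    maps: list of per-map example lists (3 maps in the talk's setup);
    train(examples) -> model; evaluate(model, held_out_map) -> accuracy.
    Returns the mean accuracy over the folds.
    """
    scores = []
    for i, held_out in enumerate(maps):
        # Pool the examples of every map except the held-out one.
        training_data = [ex for j, m in enumerate(maps) if j != i for ex in m]
        model = train(training_data)
        scores.append(evaluate(model, held_out))
    return sum(scores) / len(scores)
```

The same loop is run once for parse accuracy and once for plan execution accuracy by swapping the `evaluate` callable.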

Response-based Update vs Baseline (English)

                   Parse F1             Single-sentence      Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline             74.81     76.44      57.22     67.14      20.17     28.12
Response-based       73.32     77.24      59.65     68.27      22.62     29.20

81
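The 50-best distinct-plan filtering described in the evaluation setup amounts to deduplicating an n-best list by derived plan. A minimal sketch, assuming parses arrive best-first and derived plans are hashable; `derive_plan` is a placeholder for the system's parse-to-plan step:

```python
def k_best_distinct(parses, derive_plan, k=50):
    """Keep the highest-scoring parse for each distinct derived MR plan.

    parses: iterable of parses ordered best-first;
    derive_plan(p) -> hashable MR plan. Stops after k distinct plans.
    """
    seen, kept = set(), []
    for p in parses:
        plan = derive_plan(p)
        if plan not in seen:  # many parse trees collapse to the same plan
            seen.add(plan)
            kept.append(p)
            if len(kept) == k:
                break
    return kept
```

This is why the talk generates up to 1,000,000-best parses: insignificantly different trees collapse to the same plan, so a much larger raw list is needed to reach 50 distinct plans.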

Response-based Update vs Baseline (Chinese-Word)

                   Parse F1             Single-sentence      Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline             75.53     76.41      61.03     63.46      19.08     23.12
Response-based       77.26     77.74      64.12     65.64      21.29     23.74

82

Response-based Update vs Baseline (Chinese-Character)

                   Parse F1             Single-sentence      Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline             73.05     77.55      55.61     62.85      12.74     23.33
Response-based       76.26     79.76      64.08     65.50      22.25     25.35

83

Response-based Update vs Baseline

• vs Baseline
  – The response-based approach performs better on the final end-task plan execution
  – It optimizes the model for plan execution

84
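One way to read the response-based update: pick as pseudo-gold the candidate parse whose plan executes best in the environment, then take a standard perceptron step toward it. A hedged sketch, with names and interfaces assumed for illustration rather than taken from the deck:

```python
def response_based_target(candidates, execution_rate):
    """Select the pseudo-gold parse: the candidate whose derived plan
    has the highest execution success rate in the environment."""
    return max(candidates, key=execution_rate)

def perceptron_step(weights, predicted_feats, target_feats, lr=1.0):
    """Standard perceptron update: move weights toward the target parse's
    features and away from the current top-ranked parse's features."""
    for f in set(predicted_feats) | set(target_feats):
        weights[f] = weights.get(f, 0.0) + lr * (
            target_feats.get(f, 0.0) - predicted_feats.get(f, 0.0)
        )
    return weights
```

Because the target is chosen by execution success rather than by a gold parse, the update directly optimizes the model for the end task, which matches the pattern in the table above: execution accuracy improves even where parse F1 does not.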

Response-based Update with Multiple vs Single Parses (English)

                   Parse F1             Single-sentence      Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single               73.32     77.24      59.65     68.27      22.62     29.20
Multiple             73.43     77.81      62.81     68.93      26.57     29.10

85

Response-based Update with Multiple vs Single Parses (Chinese-Word)

                   Parse F1             Single-sentence      Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single               77.26     77.74      64.12     65.64      21.29     23.74
Multiple             78.80     78.11      64.15     66.27      21.55     25.95

86

Response-based Update with Multiple vs Single Parses (Chinese-Character)

                   Parse F1             Single-sentence      Paragraph
                   Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single               76.26     79.76      64.08     65.50      22.25     25.35
Multiple             79.44     79.94      64.08     66.84      22.58     27.16

87

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 81: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Update vs Baseline(English)

81

Hierarchy Unigram

7481

7644

7332

7724

Parse F1

BaselineResponse-based

Hierarchy Unigram

5722

6714

5965

6827

Single-sentence

Baseline Single

Hierarchy Unigram

2017

2812

2262

292

Paragraph

BaselineResponse-based

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 82: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Update vs Baseline (Chinese-Word)

82

Hierarchy Unigram

7553

7641

7726

7774

Parse F1

BaselineResponse-based

Hierarchy Unigram

6103

6346412

6564

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1908

2312

2129

2374

Paragraph

BaselineResponse-based

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 83: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Update vs Baseline(Chinese-Character)

83

Hierarchy Unigram

7305

7755

7626

7976

Parse F1

BaselineResponse-based

Hierarchy Unigram

5561

62856408

655

Single-sentence

BaselineResponse-based

Hierarchy Unigram

1274

23332225

2535

Paragraph

BaselineResponse-based

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 84: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Update vs Baseline

bull vs Baselinendash Response-based approach performs better in the final end-

task plan executionndash Optimize the model for plan execution

84

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88
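As a rough sketch of this idea (not the thesis implementation), the response-based update can be written as a plain perceptron-style reranker: candidate parses are scored with a weight vector, and weights are pushed toward parses whose plans execute successfully in the environment. The names `phi`, `exec_success`, and `tau` are illustrative assumptions, and the weight-averaging step of the full averaged perceptron (Collins, 2000) is omitted for brevity:

```python
def phi(parse):
    """Hypothetical feature extractor: a parse is a list of symbols,
    mapped to a sparse {feature: count} dictionary."""
    return {f: parse.count(f) for f in set(parse)}

def score(w, parse):
    """Linear reranking score under weight vector w."""
    return sum(w.get(f, 0.0) * v for f, v in phi(parse).items())

def response_based_update(w, candidates, exec_success, multi=True, tau=1.0):
    """Perceptron-style update driven by plan-execution feedback.

    candidates:   n-best candidate parses for one instruction
    exec_success: parse -> execution success rate (weak feedback)
    multi=True  promotes every candidate at or above threshold tau;
    multi=False uses only the single best-executing (pseudo-gold) parse.
    """
    best_model = max(candidates, key=lambda p: score(w, p))
    if multi:
        references = [p for p in candidates if exec_success(p) >= tau]
    else:
        references = [max(candidates, key=exec_success)]
    for ref in references:
        if ref is best_model:
            continue  # model already prefers a successful parse
        # Standard perceptron step: reward the reference parse,
        # penalize the model's current top choice.
        for f, v in phi(ref).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(best_model).items():
            w[f] = w.get(f, 0.0) - v
    return w
```

With `multi=True`, every candidate whose plan executes above the threshold contributes an update, which is the sense in which a variety of preferable parses increases the amount of weak feedback available to the reranker.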

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learning with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 85: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Update with Multiple vs Single Parses (English)

85

Hierarchy Unigram

7332

7724

7343

7781

Parse F1

Single Multi

Hierarchy Unigram

5965

6827

6281

6893

Single-sentence

Single Multi

Hierarchy Unigram

2262

292

2657

291

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 86: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Update with Multiple vs Single Parses (Chinese-Word)

86

Hierarchy Unigram

7726

7774

788

7811

Parse F1

Single Multi

Hierarchy Unigram

6412

6564

6415

6627

Single-sentence

Single Multi

Hierarchy Unigram

2129

2374

2155

2595

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 87: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Update with Multiple vs Single Parses (Chinese-Character)

87

Hierarchy Unigram

7626

79767944

7994

Parse F1

Single Multi

Hierarchy Unigram

6408

655

6408

6684

Single-sentence

Single Multi

Hierarchy Unigram

2225

2535

2258

2716

Paragraph

Single Multi

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 88: Grounded Language Learning Models for Ambiguous  Supervision

Response-based Update with Multiple vs Single Parses

bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak

feedbackndash Candidates with low execution success rates

produce underspecified plans or plans with ignorable details but capturing gist of preferred actions

ndash A variety of preferable parses help improve the amount and the quality of weak feedback

88

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 89: Grounded Language Learning Models for Ambiguous  Supervision

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

89

Future Directions

bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure

bull Large-scale datandash Data collection model adaptation to large-scale

bull Machine translationndash Application to summarized translation

bull Real perceptual datandash Learn with raw features (sensory and vision data)

90

Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and

Mooney COLING 2010)ndash Learning to sportscast

bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions

bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)

bull Future Directionsbull Conclusion

91

Conclusion

bull Conventional language learning is expensive and not scalable due to annotation of training data

bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain

bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision

bull Discriminative reranking is possible and effective with weak feedback from perceptual environment

92

Thank You

  • Grounded Language Learning Models for Ambiguous Supervision
  • Outline
  • Outline (2)
  • Language Grounding
  • Language Grounding Machine
  • Language Grounding Machine (2)
  • Language Grounding Machine (3)
  • Natural Language and Meaning Representation
  • Natural Language and Meaning Representation (2)
  • Natural Language and Meaning Representation (3)
  • Semantic Parsing and Surface Realization
  • Semantic Parsing and Surface Realization (2)
  • Conventional Language Learning Systems
  • Learning from Perceptual Environment
  • Navigation Example
  • Navigation Example (2)
  • Navigation Example (3)
  • Navigation Example (4)
  • Navigation Example (5)
  • Navigation Example (6)
  • Navigation Example (7)
  • Navigation Example (8)
  • Navigation Example (9)
  • Thesis Contributions
  • Outline (3)
  • Slide 26
  • Slide 27
  • Slide 28
  • Task Objective
  • Challenges
  • Challenges (2)
  • Challenges (3)
  • Previous Work (Chen and Mooney 2011)
  • Proposed Solution (Kim and Mooney 2012)
  • System Diagram (Chen and Mooney 2011)
  • System Diagram of Proposed Solution
  • PCFG Induction Model for Grounded Language Learning (Borschinge
  • Hierarchy Generation PCFG Model (Kim and Mooney 2012)
  • Semantic Lexicon (Chen and Mooney 2011)
  • Lexeme Hierarchy Graph (LHG)
  • PCFG Construction
  • PCFG Construction (2)
  • Parsing New NL Sentences
  • Slide 44
  • Slide 45
  • Slide 46
  • Unigram Generation PCFG Model
  • PCFG Construction (3)
  • PCFG Construction (4)
  • Parsing New NL Sentences (2)
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Data
  • Data Statistics
  • Evaluations
  • Parse Accuracy
  • Parse Accuracy (English)
  • Parse Accuracy (Chinese-Word)
  • Parse Accuracy (Chinese-Character)
  • End-to-End Execution Evaluations
  • End-to-End Execution Evaluations (English)
  • End-to-End Execution Evaluations (Chinese-Word)
  • End-to-End Execution Evaluations (Chinese-Character)
  • Discussion
  • Comparison of Grammar Size and EM Training Time
  • Outline (4)
  • Discriminative Reranking
  • Discriminative Reranking (2)
  • Discriminative Reranking (3)
  • How can we apply discriminative reranking
  • Reranking Model Averaged Perceptron (Collins 2000)
  • Response-based Weight Update
  • Response-based Update
  • Weight Update with Multiple Parses
  • Weight Update with Multiple Parses (2)
  • Weight Update with Multiple Parses (3)
  • Features
  • Evaluations (2)
  • Response-based Update vs Baseline (English)
  • Response-based Update vs Baseline (Chinese-Word)
  • Response-based Update vs Baseline (Chinese-Character)
  • Response-based Update vs Baseline
  • Response-based Update with Multiple vs Single Parses (English
  • Response-based Update with Multiple vs Single Parses (Chinese
  • Response-based Update with Multiple vs Single Parses (Chinese (2)
  • Response-based Update with Multiple vs Single Parses
  • Outline (5)
  • Future Directions
  • Outline (6)
  • Conclusion
  • Thank You
Page 90: Grounded Language Learning Models for Ambiguous  Supervision

Future Directions

• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure

• Large-scale data
  – Data collection and model adaptation to large-scale settings

• Machine translation
  – Application to summarized translation

• Real perceptual data
  – Learn with raw features (sensory and vision data)

90

Outline

• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91

Conclusion

• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and training corpora are easy to obtain
• Our proposed models provide a general, fully probabilistic framework for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92
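The reranking conclusion above refers to averaged-perceptron reranking in the style of Collins (2000) with a response-based update: lacking gold parses, the pseudo-gold is the n-best candidate whose execution earns the best feedback from the environment. A minimal sketch of that training loop; the function and argument names (`rerank_train`, `feats`, `execute`) are illustrative assumptions, not the thesis code:

```python
def rerank_train(examples, feats, execute, epochs=5):
    """Averaged-perceptron reranker with response-based updates (sketch).

    examples: list of n-best candidate-parse lists, one list per sentence.
    feats(parse) -> dict of feature counts.
    execute(parse) -> reward in [0, 1] from executing the parse.
    """
    w, w_sum, t = {}, {}, 0
    for _ in range(epochs):
        for candidates in examples:
            # Model's current top candidate under the weight vector w
            score = lambda p: sum(w.get(f, 0.0) * v for f, v in feats(p).items())
            best_model = max(candidates, key=score)
            # Pseudo-gold: candidate with the best environment feedback
            pseudo_gold = max(candidates, key=execute)
            if execute(best_model) < execute(pseudo_gold):
                # Standard perceptron update toward the pseudo-gold parse
                for f, v in feats(pseudo_gold).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in feats(best_model).items():
                    w[f] = w.get(f, 0.0) - v
            # Accumulate weights for averaging (Collins 2000)
            for f, v in w.items():
                w_sum[f] = w_sum.get(f, 0.0) + v
            t += 1
    return {f: v / t for f, v in w_sum.items()}
```

At test time the averaged weights simply pick the highest-scoring candidate from the n-best list; averaging reduces the variance of the final perceptron weights.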

Thank You
