1 David Chen Advisor: Raymond Mooney Research Preparation Exam August 21, 2008 Learning to Sportscast: A Test of Grounded Language Acquisition

1

David Chen

Advisor: Raymond Mooney

Research Preparation Exam

August 21, 2008

Learning to Sportscast: A Test of Grounded Language Acquisition

2

Semantics of Language

• The meaning of words, phrases, etc.

• Crucial to communication

• Example: “Spanish goalkeeper Iker Casillas blocks the ball”

– Merriam-Webster: (transitive verb) to interfere usually legitimately with (as an opponent) in various games or sports

– WordNet: (v) parry, deflect

3

Language Grounding

• Problem: We are circularly defining the meanings of words in terms of other words.

• The meanings of many words are grounded in our perception of the physical world: red, ball, cup, run, hit, fall, etc.

– Symbol Grounding: Harnad (1990)

• Even many abstract words and meanings are metaphorical abstractions of terms grounded in the physical world: up, down, over, in, etc.

– Lakoff and Johnson’s Metaphors We Live By

  • It’s difficult to put my ideas into words.

  • Interest in competitions is up.

4

Grounding Language

Casillas blocks the ball

5

Grounding Language


Block(Casillas)

6

Natural Language and Meaning Representation

Natural Language (NL)

NL: A language that has evolved naturally, such as English, German, French, Chinese, etc.

MRL: Formal languages such as logic or any computer-executable code

Meaning Representation Language (MRL)


Block(Casillas)

7

Semantic Parsing and Tactical Generation

NL

Semantic Parsing: maps a natural-language sentence to a complete, detailed semantic representation

Tactical Generation: Generates a natural-language sentence from a meaning representation.

MRL

Semantic Parsing (NL → MRL)

Tactical Generation (MRL → NL)


Block(Casillas)

8

Learning Approach

Manually Annotated Training Corpora (NL/MRL pairs)

Semantic Parser

NL → MRL

Semantic Parser Learner

9

Learning Approach

Manually Annotated Training Corpora (NL/MRL pairs)

Tactical Generator

MRL → NL

Tactical Generator Learner

10

Example of Annotated Training Corpus

Alice passes the ball to Bob

Bob turns the ball over to John

John passes to Fred

Fred shoots for the goal

Paul blocks the ball

Paul kicks off to Nancy

…

Pass(Alice, Bob)

Turnover(Bob, John)

Pass(John, Fred)

Kick(Fred)

Block(Paul)

Pass(Paul, Nancy)

…



11

Example of Annotated Training Corpus

Alice passes the ball to Bob

Bob turns the ball over to John

John passes to Fred

Fred shoots for the goal

Paul blocks the ball

Paul kicks off to Nancy

…

P1(C1, C2)

P2(C2, C3)

P1(C3, C4)

P3(C4)

P4(C5)

P5(C5, C6)

…



12

Learning Language from Perceptual Context

• Constructing annotated corpora for language learning is difficult

• Children acquire language through exposure to linguistic input in the context of a rich, relevant, perceptual environment

• Ideally, a computer system can learn language in the same manner

13

Goals

• Learn to ground the semantics of language

Block

• Learn language through correlated linguistic and visual inputs

14

Challenge

“西班牙守門員擋下了球”

15

Challenge

A linguistic input may correspond to many possible events

?

??


16

Challenge

A linguistic input may correspond to many possible events

?

??

Pass(GermanyPlayer1, GermanyPlayer2)

Kick(GermanyPlayer2)

Block(SpanishGoalie)


17

Overview

• Sportscasting task

• Related works

• Tactical generation

• Strategic generation

• Human evaluation

18

Learning to Sportscast

• Robocup Simulation League games

• No speech recognition

– Record commentaries in text form

• No computer vision

– Rule-based system to automatically extract game events in symbolic form

• Concentrate on linguistic issues

19

Robocup Simulation League

Purple goalie blocked the ball

20

Learning to Sportscast

• Learn to sportscast by observing sample human sportscasts

• Build a function that maps between natural language (NL) and meaning representation (MR)

– NL: Textual commentaries about the game

– MR: Predicate logic formulas that represent events in the game
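As an illustration of the MR side of this mapping, the sketch below reads MR strings written in the form the game trace uses, such as `pass ( Pink11, Pink8 )` or `ballstopped`, into a predicate name and an argument tuple. The `MR` type and the `parse_mr` helper are hypothetical names introduced only for this example; they are not part of the system described in the talk.

```python
import re
from typing import NamedTuple, Tuple


class MR(NamedTuple):
    """A meaning representation: a predicate applied to zero or more constants."""
    predicate: str
    args: Tuple[str, ...]


# Predicate name, optionally followed by a parenthesized, comma-separated argument list.
_MR_PATTERN = re.compile(r"^\s*(\w+)\s*(?:\(\s*(.*?)\s*\))?\s*$")


def parse_mr(text: str) -> MR:
    """Parse strings such as 'pass ( Pink11, Pink8 )' or 'ballstopped'."""
    match = _MR_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognizable MR: {text!r}")
    predicate, arg_text = match.group(1), match.group(2)
    args = tuple(a.strip() for a in arg_text.split(",")) if arg_text else ()
    return MR(predicate, args)


if __name__ == "__main__":
    print(parse_mr("pass ( Pink11, Pink8 )"))
    print(parse_mr("ballstopped"))
```

Keeping the predicate separate from its arguments makes it easy to group events by type later, for example when counting how often each event type is commented on.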

21

Robocup Sportscaster Trace

Natural Language Commentary | Meaning Representation

Purple goalie turns the ball over to Pink8

badPass ( Purple1, Pink8 )

Pink11 looks around for a teammate

Pink8 passes the ball to Pink11

Purple team is very sloppy today

Pink11 makes a long pass to Pink8

Pink8 passes back to Pink11

turnover ( Purple1, Pink8 )

pass ( Pink11, Pink8 )


ballstopped


kick ( Pink11 )

kick ( Pink8)

kick ( Pink11 )

kick ( Pink11 )

kick ( Pink8 )

22

Robocup Sportscaster Trace

Natural Language Commentary | Meaning Representation


P6 ( C1, C19 )






P5 ( C1, C19 )

P2 ( C22, C19 )

P2 ( C19, C22 )

P0

P2 ( C19, C22 )

P1 ( C22 )

P1( C19 )

P1 ( C22 )

P1 ( C22 )

P1 ( C19 )

23

Robocup Data

• Collected human textual commentary for the 4 Robocup championship games from 2001-2004.

– Avg # events/game = 2,613

– Avg # sentences/game = 509

• Each sentence matched to all events within previous 5 seconds.

– Avg # MRs/sentence = 2.5 (min 1, max 12)

• Manually annotated with correct matchings of sentences to MRs (for evaluation purposes only).
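The 5-second matching rule can be sketched as a simple time-window filter. This is a minimal illustration only: the timestamps, the inclusive window boundaries, and the `candidate_events` name are assumptions made for the example, not details given in the talk.

```python
from typing import List, Tuple


def candidate_events(
    sentence_time: float,
    events: List[Tuple[float, str]],
    window: float = 5.0,
) -> List[str]:
    """Return the MRs of all events that occurred within `window` seconds
    before the sentence (boundaries assumed inclusive)."""
    return [
        mr
        for event_time, mr in events
        if sentence_time - window <= event_time <= sentence_time
    ]


if __name__ == "__main__":
    # Hypothetical timestamps in seconds; the MR strings follow the trace format.
    events = [
        (10.0, "pass ( Pink11, Pink8 )"),
        (12.5, "kick ( Pink8 )"),
        (16.0, "ballstopped"),
        (20.0, "kick ( Pink11 )"),
    ]
    print(candidate_events(17.0, events))
```

Because every event in the window is kept, a sentence usually ends up with several candidate MRs, which is the ambiguity the learners have to resolve.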

24

Overview


• Related works




25

Semantic Parser Learners

• Learn a function from NL to MR

NL: “Purple3 passes the ball to Purple5”

MR: Pass ( Purple3, Purple5 )

Semantic Parsing (NL → MR)

Tactical Generation (MR → NL)

• We experiment with two semantic parser learners

– WASP (Wong & Mooney, 2006; 2007)

– KRISP (Kate & Mooney, 2006)

26

WASP: Word Alignment-based Semantic Parsing

• Uses statistical machine translation techniques

– Synchronous context-free grammars (SCFG) [Wu, 1997; Melamed, 2004; Chiang, 2005]

– Word alignments [Brown et al., 1993; Och & Ney, 2003]

• Capable of both semantic parsing and tactical generation

27

KRISP: Kernel-based Robust Interpretation by Semantic Parsing

• Productions of MR language are treated like semantic concepts

• SVM classifier is trained for each production with string subsequence kernel

• These classifiers are used to compositionally build MRs of the sentences

• More resistant to noisy supervision but incapable of tactical generation

28

KRISPER: KRISP with EM-like Retraining

• Extension of KRISP that learns from ambiguous supervision [Kate & Mooney, 2007]

• Uses an iterative EM-like method to gradually converge on a correct meaning for each sentence.

29

KRISPER











ballstopped


kick ( Pink11 )

kick ( Pink8)

kick ( Pink11 )

kick ( Pink11 )

kick ( Pink8 )

1. Assume every possible meaning for a sentence is correct

30











ballstopped


kick ( Pink11 )

kick ( Pink8)

kick ( Pink11 )

kick ( Pink11 )

kick ( Pink8 )

[Figure: edges between sentences and their candidate MRs, labeled with weights of 1/2 or 1/3]

KRISPER

2. Resulting NL-MR pairs are weighted and given to semantic parser learner

31

KRISPER











ballstopped


kick ( Pink11 )

kick ( Pink8)

kick ( Pink11 )

kick ( Pink11 )

kick ( Pink8 )

3. Estimate the confidence of each NL-MR pair using the resulting trained semantic parser

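[Figure: edges between sentences and their candidate MRs, labeled with confidence scores 0.65, 0.87, 0.22, 0.35, 0.13, 0.85, 0.81, 0.37, 0.76, 0.49, 0.76, 0.67, 0.86]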

32


KRISPER










ballstopped


kick ( Pink11 )

kick ( Pink8)

kick ( Pink11 )

kick ( Pink11 )

kick ( Pink8 )

4. Use maximum weighted matching on a bipartite graph to find the best NL-MR pairs [Munkres, 1957]

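[Figure: edges between sentences and their candidate MRs, labeled with confidence scores 0.65, 0.87, 0.22, 0.35, 0.13, 0.85, 0.81, 0.37, 0.76, 0.49, 0.76, 0.67, 0.86]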
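To make step 4 concrete, the sketch below finds a maximum weighted one-to-one matching between sentences and candidate MRs. It deliberately uses exhaustive search rather than the Munkres algorithm cited on the slide, so it is only practical for a handful of sentences; the function name, the sentence labels, and the scores in the usage example are made up for illustration.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def best_matching(scores: Dict[Tuple[str, str], float]) -> List[Tuple[str, str]]:
    """Search all one-to-one sentence/MR assignments and return the one
    with the highest total score (brute-force stand-in for Munkres)."""
    sentences = sorted({s for s, _ in scores})
    mrs = sorted({m for _, m in scores})
    # Pad with None so that any sentence may also be left unmatched.
    slots = mrs + [None] * len(sentences)
    best_total, best_pairs = float("-inf"), []
    for assignment in permutations(slots, len(sentences)):
        # Keep only assignments that correspond to an actual candidate edge.
        pairs = [
            (s, m)
            for s, m in zip(sentences, assignment)
            if m is not None and (s, m) in scores
        ]
        total = sum(scores[p] for p in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs


if __name__ == "__main__":
    # Hypothetical confidence scores for two sentences sharing two candidate MRs.
    scores = {
        ("s1", "kick ( Pink8 )"): 0.65,
        ("s1", "pass ( Pink8, Pink11 )"): 0.87,
        ("s2", "pass ( Pink8, Pink11 )"): 0.85,
        ("s2", "kick ( Pink8 )"): 0.22,
    }
    print(best_matching(scores))
```

In this example a greedy choice would give s1 its highest-scoring MR (0.87) and leave s2 with 0.22, whereas the matching prefers the assignment with the larger total (0.65 + 0.85).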

33


KRISPER










ballstopped


kick ( Pink11 )

kick ( Pink8)

kick ( Pink11 )

kick ( Pink11 )

kick ( Pink8 )

5. Give the best pairs to the semantic parser learner in the next iteration, and repeat until convergence
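The five steps can be sketched as a small retraining loop. This is a toy illustration under strong assumptions: the "parser learner" is replaced by weighted word/MR co-occurrence counts, and the bipartite matching of step 4 is simplified to picking the most confident candidate per sentence. None of the names below come from KRISP or KRISPER.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, List[str]]  # (sentence, candidate MRs)


def train(weighted_pairs: List[Tuple[str, str, float]]) -> Dict[str, Dict[str, float]]:
    """Toy stand-in for a parser learner: weighted word/MR co-occurrence counts."""
    counts: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for sentence, mr, weight in weighted_pairs:
        for word in sentence.lower().split():
            counts[word][mr] += weight
    return counts


def confidence(model: Dict[str, Dict[str, float]], sentence: str, mr: str) -> float:
    """Average, over the words, of the share of each word's weight assigned to `mr`."""
    words = sentence.lower().split()
    total = 0.0
    for word in words:
        row = model.get(word, {})
        mass = sum(row.values())
        if mass > 0:
            total += row.get(mr, 0.0) / mass
    return total / len(words)


def em_like_retraining(data: List[Example], iterations: int = 5) -> List[str]:
    # Steps 1-2: keep every candidate MR, weighted uniformly per sentence.
    pairs = [(s, mr, 1.0 / len(cands)) for s, cands in data for mr in cands]
    chosen: List[str] = []
    for _ in range(iterations):
        model = train(pairs)
        # Steps 3-4 (simplified): keep the most confident candidate per sentence.
        new_chosen = [
            max(cands, key=lambda mr: confidence(model, s, mr)) for s, cands in data
        ]
        if new_chosen == chosen:  # Step 5: stop once the choices no longer change.
            break
        chosen = new_chosen
        pairs = [(s, mr, 1.0) for (s, _), mr in zip(data, chosen)]
    return chosen


if __name__ == "__main__":
    data = [
        ("Pink8 passes to Pink11", ["pass", "kick"]),
        ("Pink11 passes to Pink8", ["pass", "ballstopped"]),
        ("Pink8 shoots", ["kick"]),
    ]
    print(em_like_retraining(data))
```

The loop structure is the point of the sketch; a real system would plug in an actual semantic parser learner for `train` and `confidence`.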

34

Overview


• Related works




35

Tactical Generation

• Learn how to generate NL from MR

• Example: Pass(Pink2, Pink3) → “Pink2 kicks the ball to Pink3”

• Two steps

1. Disambiguate the training data

2. Learn a language generator

36

WASPER

• WASP with EM-like retraining to handle ambiguous training data.

• Same augmentation as added to KRISP to create KRISPER.

37

KRISPER-WASP

• First train KRISPER to disambiguate the data

• Then train WASP on the resulting unambiguously supervised data.

38

WASPER-GEN

• Determines the best matching based on generation (MR→NL).

• Score each potential NL/MR pair by using the currently trained WASP⁻¹ generator.

• Compute NIST MT score [NIST report, 2002] between the generated sentence and the potential matching sentence.

39

NIST scores

Target: Purple2 passes quickly to Purple3

Candidate: Purple2 passes to Purple3

1-grams: Purple2, passes, to, Purple3 (4/4 matched)

2-grams: Purple2 passes, passes to, to Purple3 (2/3 matched)

3-grams: Purple2 passes to, passes to Purple3 (0/2 matched)

4-gram: Purple2 passes to Purple3 (0/1 matched)
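The n-gram counts in this example can be reproduced with a short script. It only counts how many candidate n-grams also occur in the target; the actual NIST score additionally weights n-grams by how informative they are and applies a brevity penalty, neither of which is modeled here. The function names are chosen for this sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_matches(candidate: str, target: str, max_n: int = 4) -> Dict[int, Tuple[int, int]]:
    """For each n, return (matched candidate n-grams, total candidate n-grams)."""
    cand_tokens, targ_tokens = candidate.split(), target.split()
    result = {}
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand_tokens, n))
        targ_counts = Counter(ngrams(targ_tokens, n))
        # Clip each candidate n-gram count by its count in the target.
        matched = sum(min(c, targ_counts[g]) for g, c in cand_counts.items())
        result[n] = (matched, sum(cand_counts.values()))
    return result


if __name__ == "__main__":
    counts = ngram_matches(
        candidate="Purple2 passes to Purple3",
        target="Purple2 passes quickly to Purple3",
    )
    for n, (matched, total) in counts.items():
        print(f"{n}-grams: {matched}/{total}")
```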

40

System Overview

Purple7 loses the ball to Pink2

Sportscaster Robocup Simulator

Ambiguous Training Data

Pink2 kicks the ball to Pink5


Pink8 shoots the ball

Turnover ( purple7 , pink2 )

Pass ( pink5 , pink8)

Pass ( purple5, purple7 )

Kick ( pink2 )

Pass ( pink2 , pink5 )

Kick ( pink5 )

Ballstopped

Kick ( pink8 )

Semantic Parser


Unambiguous Training Data

Pink2 kicks the ball to Pink5

Pink5 makes a long pass to Pink8

Pink8 shoots the ball

Kick ( pink8 )

Pass ( pink2 , pink5 )

Kick ( pink5 )



41

KRISPER and WASPER












Semantic Parser






Kick ( pink5 )


(KRISP/WASP)


42

WASPER-GEN












Tactical Generator






Kick ( pink5 )

Tactical Generator Learner (WASP)


43

Systems

System | Disambiguation | Learning language generator

WASP (lower baseline) | Random | WASP

KRISPER (Kate & Mooney, 2007) | KRISP | N/A

WASPER | WASP | WASP

KRISPER-WASP | KRISP | WASP

WASPER-GEN | WASP’s language generator | WASP

WASP with gold matching (upper baseline) | N/A | WASP

44


Systems

Matching

45

Matching

• 4 Robocup championship games from 2001-2004.

– Avg # events/game = 2,613

– Avg # sentences/game = 509

• Leave-one-game-out cross-validation

• Metric:

– Precision: % of system’s annotations that are correct

– Recall: % of gold-standard annotations correctly produced

– F-measure: Harmonic mean of precision and recall
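The three metrics can be written out as a short function over sets of (sentence, MR) annotations. This is a minimal sketch; representing annotations as Python sets of pairs, and returning 0.0 when a denominator is zero, are choices made for the example.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (sentence id, MR)


def precision_recall_f(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure of predicted annotations against gold ones."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical annotations: 4 predicted, 3 of them correct, 6 in the gold standard.
    gold = {("s1", "a"), ("s2", "b"), ("s3", "c"), ("s4", "d"), ("s5", "e"), ("s6", "f")}
    predicted = {("s1", "a"), ("s2", "b"), ("s3", "c"), ("s4", "x")}
    print(precision_recall_f(predicted, gold))
```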

46

Matching Results

[Bar chart: F-measure (axis 0 to 0.8), average results on leave-one-out experiments, for WASP, KRISPER, WASPER, and WASPER-GEN]

47


Systems

Tactical Generation

48

Tactical Generation

• 4 Robocup championship games from 2001-2004.

– Avg # events/game = 2,613

– Avg # sentences/game = 509

• Leave-one-game-out cross-validation

• NIST score [NIST report, 2002]

– Evaluates the quality of machine translations based on matching n-grams
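Leave-one-game-out cross-validation can be sketched as a generator of train/test splits in which each game is held out once. The game labels below are placeholders, on the assumption of one game per year from 2001 to 2004.

```python
from typing import Iterator, List, Tuple


def leave_one_game_out(games: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training games, held-out game), holding out each game exactly once."""
    for i, held_out in enumerate(games):
        training = games[:i] + games[i + 1:]
        yield training, held_out


if __name__ == "__main__":
    # Placeholder labels for the four games.
    for training, held_out in leave_one_game_out(["2001", "2002", "2003", "2004"]):
        print(f"train on {training}, test on {held_out}")
```

Reported scores are then averages over the four held-out games.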

49

Tactical Generation Results

[Bar chart: NIST score (axis 0 to 6), average results on leave-one-out experiments, for WASP, WASPER, KRISPER-WASP, WASPER-GEN, and WASP with gold matching]

50

Overview


• Related works




51

Strategic Generation

• Generation requires not only knowing how to say something (tactical generation) but also what to say (strategic generation).

• For automated sportscasting, one must be able to effectively choose which events to describe.

52

Example of Strategic Generation

pass ( purple7 , purple6 )

ballstopped

kick ( purple6 )


ballstopped

kick ( purple2 )


kick ( purple3 )

badPass ( purple3 , pink9 )

turnover ( purple3 , pink9 )

53

Strategic Generation

• For each event type (e.g. pass, kick) estimate the probability that it is described by the sportscaster.

• Requires correct NL/MR matching

– Use estimated matching from tactical generation

– Iterative Generation Strategy Learning
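Given a sentence-to-event matching, the per-event-type probability can be estimated by simple counting. The sketch below assumes events are represented by their predicate names and that the matching is available as a set of indices of commented events; it illustrates the estimate from a matching only, not IGSL.

```python
from collections import Counter
from typing import Dict, List, Set


def comment_probabilities(events: List[str], commented: Set[int]) -> Dict[str, float]:
    """Estimate, for each event type, the fraction of its occurrences
    that were matched to a commentary sentence."""
    totals = Counter(events)
    described = Counter(events[i] for i in commented)
    return {event_type: described[event_type] / totals[event_type] for event_type in totals}


if __name__ == "__main__":
    # Hypothetical event sequence; indices 0, 2, and 3 were matched to sentences.
    events = ["pass", "kick", "pass", "kick", "kick", "ballstopped"]
    print(comment_probabilities(events, {0, 2, 3}))
```

A sportscaster could then comment on an event only when its type's estimated probability is high enough; the threshold itself would be a further design choice.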

54

Iterative Generation Strategy Learning (IGSL)

• Directly estimates the likelihood of an event being commented on

• Self-training iterations to improve estimates

• Uses events not associated with any NL as negative evidence

55

Strategic Generation Performance

• Evaluate how well the system can predict which events a human comments on

• Metric:

– Precision: % of system’s annotations that are correct

– Recall: % of gold-standard annotations correctly produced

– F-measure: Harmonic mean of precision and recall

56

Strategic Generation Results

[Bar chart: F-measure (axis 0 to 0.8), average results on leave-one-game-out cross-validation, for strategies inferred from WASP, inferred from KRISPER, inferred from WASPER, inferred from WASPER-GEN, IGSL, and inferred from gold matching]

57

Overview


• Related works




58

Human Evaluation (Quasi Turing Test)

• 4 fluent English speakers as judges

• 8 commented game clips

– 2-minute clips randomly selected from each of the 4 games

– Each clip commented once by a human, and once by the machine

• Presented in random counter-balanced order

• Judges were not told which ones were human or machine generated

59

Demo Clip

• Game clip commentated using WASPER-GEN with IGSL, since this gave the best results for generation.

• FreeTTS was used to synthesize speech from textual output.

60

Human Evaluation

Commentator | English Fluency | Semantic Correctness | Sportscasting Ability

Human | 3.94 | 4.25 | 3.63

Machine | 3.44 | 3.56 | 2.94

Difference | 0.5 | 0.69 | 0.69

Score | English Fluency | Semantic Correctness | Sportscasting Ability

5 | Flawless | Always | Excellent

4 | Good | Usually | Good

3 | Non-native | Sometimes | Average

2 | Disfluent | Rarely | Bad

1 | Gibberish | Never | Terrible

61

Future Work

• Expand MRs beyond simple logic formulas

• Apply approach to learning situated language in a computer video-game environment (Gorniak & Roy, 2005)

• Apply approach to captioned images or video using computer vision to extract objects, relations, and events from real perceptual data (Fleischman & Roy, 2007)

62

Conclusion

• Current language learning work uses expensive, unrealistic training data.

• We have developed a language learning system that can learn from language paired with an ambiguous perceptual environment.

• We have evaluated it on the task of learning to sportscast simulated Robocup games.

• The system learns to sportscast almost as well as humans.
