Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues
Ryu Iida, Masaaki Yasuhara, Takenobu Tokunaga (Tokyo Institute of Technology)
IJCNLP 2011 (Nov 9, 2011)
Research background
- Typical coreference/anaphora resolution: researchers have tackled the problems provided by the MUC, ACE and CoNLL shared tasks (a.k.a. OntoNotes), mainly focusing on the linguistic aspects of reference
- Multi-modal research community (Byron 2005; Prasov and Chai 2008; Prasov and Chai 2010; Schütte et al. 2010; Iida et al. 2010): essential for human-computer interaction; identifies the referents of referring expressions in a static scene or a situated world, taking extra-linguistic clues into account
Multi-modal reference resolution
[Figure: example interaction illustrating the three sources of information]
- Dialogue history: "Move the triangle to the left." / "Rotate the triangle at top right 60 degrees clockwise." / "All right… done it." / "O.K."
- Action history: …piece 1: move (X:230, Y:150); piece 7: move (X:311, Y:510); piece 3: rotate 60°
- Eye gaze
Aim
- Integrate several types of multi-modal information into a machine learning-based reference resolution model
- Investigate which kinds of clues are effective for multi-modal reference resolution
Multi-modal problem setting: related work
- 3D virtual world (Byron 2005; Stoia et al. 2008): e.g. participants controlled an avatar in a virtual world to explore for hidden treasures; scene updates occur frequently, so referring expressions are relatively skewed toward exophoric cases
- Static scene (Dale 1992): the centrality and size of each object on the computer display are fixed throughout the dialogues, so changes in the visual salience of objects are not observed
Evaluation data set creation
- REX-J corpus (Spanger et al. 2010): dialogues and transcripts of collaborative work (solving Tangram puzzles) by pairs of Japanese participants
- The puzzle-solving task was designed to require frequent use of both anaphoric and exophoric referring expressions
Setting for collecting data
[Figure: the solver and the operator sit at separate displays divided by a shield screen; each display shows the working area; the goal shape is shown to the solver but is not available to the operator]
Collecting eye-gaze data
- Recruited 18 Japanese graduate students and split them into 9 pairs; all pairs knew each other previously and were of the same sex and approximately the same age
- Each pair was instructed to solve 4 different Tangram puzzles
- Used the Tobii T60 Eye Tracker, sampling at 60 Hz, to record users' eye gaze with an accuracy of 0.5 degrees
- 5 dialogues in which the tracking results contained more than 40% errors were removed
Annotating referring expressions
- Annotation was conducted with the multimedia annotation tool ELAN
- An annotator manually detects a referring expression and then selects its referent from the possible puzzle pieces shown on the computer display
- Total number of annotated referring expressions: 1,462 instances in 27 dialogues
  - 1,192 instances (81.5%) in the solver's utterances
  - 270 instances (18.5%) in the operator's utterances
Multi-modal reference resolution: base model
- Ranking candidate referents is important for better accuracy (Iida et al. 2003; Yang et al. 2003; Denis and Baldridge 2008)
- Apply the Ranking SVM algorithm (Joachims 2002), which learns a weight vector that ranks candidates given a partial ranking of each referent
- Training instances: to define the partial ranking of candidate referents, simply rank the referent referred to by a given referring expression in first place and all other candidates in second place (see the sketch below)
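To make the training-instance construction concrete, here is a minimal sketch that writes such partial rankings in SVM-rank's input format; the extract_features helper and the expression/candidate objects are assumptions for illustration, not part of the paper:

    # Sketch: emit pairwise-ranking training data in SVM-rank's input
    # format, where a larger target value means a higher rank.
    # extract_features() is an assumed helper that maps an
    # (expression, candidate) pair to a {feature_index: value} dict.
    def write_ranking_instances(expressions, out_path):
        with open(out_path, "w") as out:
            for qid, expr in enumerate(expressions, start=1):
                for cand in expr.candidates:
                    # the referent is ranked first (higher target value),
                    # every other candidate piece second
                    target = 2 if cand is expr.referent else 1
                    feats = extract_features(expr, cand)
                    feat_str = " ".join(f"{i}:{v}" for i, v in sorted(feats.items()))
                    out.write(f"{target} qid:{qid} {feat_str}\n")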
Feature set
1. Linguistic features: Ling (Iida et al. 2010), 10 features; capture the linguistic salience of each referent based on the discourse history
2. Task-specific features: TaskSp (Iida et al. 2010), 12 features; consider visual salience based on recent movements of the mouse cursor and the pieces recently manipulated by the operator
3. Eye-gaze features: Gaze (proposed), 14 features
Eye gaze as a clue to reference
- Saccades: quick, simultaneous movements of both eyes in the same direction
- Eye fixations: maintaining the visual gaze on a single location
- The direction of eye gaze directly reflects the focus of attention (Richardson et al. 2007), so eye fixations are used as clues for identifying the pieces being focused on
- Saccades and eye fixations are separated by dispersion-threshold identification (Salvucci and Anderson 2001); a sketch follows
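A minimal sketch of the dispersion-threshold idea, assuming gaze samples as (time, x, y) tuples; the threshold and duration values below are illustrative defaults, not the paper's settings:

    # Sketch of dispersion-threshold identification (I-DT): a window of
    # gaze points counts as a fixation while its dispersion
    # (max_x - min_x) + (max_y - min_y) stays under a threshold and the
    # window spans at least a minimum duration.
    def dispersion(window):
        xs = [x for _, x, _ in window]
        ys = [y for _, _, y in window]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    def detect_fixations(samples, disp_threshold=40.0, min_duration=0.1):
        """samples: list of (t, x, y); returns (t_start, t_end, cx, cy)."""
        fixations = []
        i = 0
        while i < len(samples):
            # initial window spanning the minimum fixation duration
            j = i
            while j < len(samples) and samples[j][0] - samples[i][0] < min_duration:
                j += 1
            if j == len(samples):
                break  # remaining samples are too short for a fixation
            if dispersion(samples[i:j + 1]) <= disp_threshold:
                # grow the window while dispersion stays under the threshold
                while (j + 1 < len(samples)
                       and dispersion(samples[i:j + 2]) <= disp_threshold):
                    j += 1
                xs = [x for _, x, _ in samples[i:j + 1]]
                ys = [y for _, _, y in samples[i:j + 1]]
                fixations.append((samples[i][0], samples[j][0],
                                  sum(xs) / len(xs), sum(ys) / len(ys)))
                i = j + 1  # consume the points belonging to the fixation
            else:
                i += 1  # otherwise slide the window forward by one point
        return fixations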
Eye-gaze features
[Figure: timeline of eye fixations on pieces a–g while the speaker utters "First you need to move the smallest triangle to the left"; fixations on piece_a and piece_b fall within the window [t − T, t], where t is the onset time of the referring expression and T = 1500 msec (Prasov and Chai 2010)]
The Gaze features encode how frequently or how long the speaker fixates on each piece; a sketch of this computation follows.
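A minimal sketch of computing per-piece fixation time inside [t − T, t], under the simplifying assumption that pieces are hit-tested with bounding boxes (the exact feature definitions in the paper may differ):

    # Sketch: accumulate fixation time per puzzle piece inside the
    # window [t - T, t] preceding a referring expression uttered at t.
    T = 1.5  # seconds (1500 msec, following Prasov and Chai 2010)

    def contains(bbox, x, y):
        x0, y0, x1, y1 = bbox
        return x0 <= x <= x1 and y0 <= y <= y1

    def gaze_features(fixations, pieces, t):
        """fixations: (t_start, t_end, x, y) tuples; pieces: {pid: bbox}."""
        fix_time = {pid: 0.0 for pid in pieces}
        for t_start, t_end, x, y in fixations:
            # clip each fixation to the window [t - T, t]
            start, end = max(t_start, t - T), min(t_end, t)
            if start >= end:
                continue
            for pid, bbox in pieces.items():
                if contains(bbox, x, y):
                    fix_time[pid] += end - start
        longest = max(fix_time, key=fix_time.get)
        # Gaze10-like feature: the piece was fixated at all in [t - T, t];
        # Gaze9-like feature: it has the longest fixation time of all pieces
        return {pid: {"fixated": fix_time[pid] > 0.0,
                      "longest": pid == longest and fix_time[pid] > 0.0}
                for pid in pieces}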
Empirical evaluation
- Compared models with different combinations of the three types of features
- Conducted 5-fold cross-validation
- The proposed model uses model separation (Iida et al. 2010): because the referential behaviour of pronouns is completely different from that of non-pronouns, two reference resolution models are created separately (see the sketch below)
  - pronoun model: identifies the referent of a given pronoun
  - non-pronoun model: identifies the referent of all other expressions (e.g. NPs)
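A minimal sketch of this routing; the ranker objects, their score method, and the pronoun test are placeholders for illustration:

    # Sketch: dispatch each referring expression to the ranker trained
    # for its type, then take the top-scored candidate as the referent.
    PRONOUNS = {"これ", "それ", "あれ"}  # illustrative demonstratives only

    def resolve(expr, candidates, pronoun_model, nonpronoun_model):
        model = pronoun_model if expr.surface in PRONOUNS else nonpronoun_model
        return max(candidates, key=lambda c: model.score(expr, c))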
Results for pronouns and non-pronouns (accuracy, %)

model               pronoun   non-pronoun
Ling                 56.0       65.4
Gaze                 56.7       48.0
TaskSp               79.2       21.1
Ling+Gaze            66.5       75.7
Ling+TaskSp          79.0       67.1
TaskSp+Gaze          78.0       48.4
Ling+TaskSp+Gaze     78.7       76.0
Overall results (accuracy, %)

model               accuracy
Ling                 61.8
Gaze                 51.2
TaskSp               42.8
Ling+Gaze            72.3
Ling+TaskSp          71.5
TaskSp+Gaze          59.5
Ling+TaskSp+Gaze     77.0
Investigation of the significance of features
Calculate the weight of each feature f according to the following formula:

weight(f) = \sum_{x \in SV} \alpha_x \cdot I(f, x)

where SV is the set of support vectors in a ranker, \alpha_x is the weight of support vector x, and I(f, x) is a function that returns 1 if feature f occurs in x (and 0 otherwise).
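Assuming each support vector is available as its coefficient α_x paired with the set of binary features active in it, the computation amounts to a weighted count; a minimal sketch:

    # Sketch: weight(f) = sum over support vectors x of alpha_x * I(f, x),
    # where I(f, x) = 1 iff feature f is active in support vector x.
    from collections import defaultdict

    def feature_weights(support_vectors):
        """support_vectors: iterable of (alpha, active_feature_set) pairs."""
        weights = defaultdict(float)
        for alpha, features in support_vectors:
            for f in features:
                weights[f] += alpha
        return dict(weights)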
Weights of features in each model

        pronoun model          non-pronoun model
rank    feature    weight     feature    weight
1       TaskSp1    0.4744     Ling6      0.6149
2       TaskSp3    0.2684     Gaze10     0.1566
3       Ling1      0.2298     Gaze9      0.1566
4       TaskSp7    0.1929     Gaze7      0.1255
5       TaskSp9    0.1605     Gaze11     0.1225
6       Gaze10     0.1547     Gaze14     0.1134
7       Gaze9      0.1547     Gaze13     0.1134
8       Ling6      0.1442     Gaze12     0.1026
9       Gaze7      0.1267     Ling2      0.1014
10      Ling2      0.1164     Gaze1      0.0750

TaskSp1: the mouse cursor was over a piece at the beginning of the uttering of a referring expression
TaskSp3: the time distance is less than or equal to 10 sec after the mouse cursor was over a piece
Ling6: the shape attributes of a piece are compatible with the attributes of a referring expression
Gaze10: a fixation on a piece exists in the time period [t − T, t]
Gaze9: the fixation time of a piece in the time period [t − T, t] is the longest of all pieces
Summary
- Investigated the impact of multi-modal information on reference resolution in Japanese situated dialogues
- The results demonstrate that:
  - the referents of pronouns rely on the visual focus of attention, as indicated by mouse cursor movements
  - non-pronouns are strongly related to eye fixations on their referents
  - integrating these two types of multi-modal information with linguistic information increases the accuracy of reference resolution
Future work
- Further data collection is needed: all objects in the Tangram puzzle (i.e. puzzle pieces) have nearly the same size, which excludes the factor that a relatively larger object occupying the computer display has higher prominence than smaller objects
- Zero anaphors in utterances need to be annotated, given their frequent use in Japanese