www.amiproject.org Collaborative Annotation of the AMI Meeting Corpus Jean Carletta University of Edinburgh

Carletta, 20 June 2007

AMI Partners

NXT Major Development Sites

AMI's aim

• aim: to develop technologies for browsing meetings and to assist people during meetings

• interdisciplinary: signal processing, language engineering, theoretical linguistics, human-computer interfaces, organizational psychology, ...

Why annotation?

• For basic scientific understanding, e.g.,
  • How do people choose a next speaker?
  • What is the relationship between speech and gesture during deixis?

• For machine learning
  • Hand-code e.g. statement vs. question
  • Identify features for each, like word sequences and prosody
  • Use the data to fit a statistical classifier that codes new data automatically
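The machine-learning loop on this slide (hand-code, extract features, fit a classifier) can be sketched roughly as follows. This is a minimal illustration, not the AMI pipeline: the six hand-coded utterances are invented, the features are only word unigrams and bigrams (prosody is left out), and the classifier is a plain multinomial naive Bayes chosen for brevity.

```python
import math
from collections import Counter, defaultdict


def features(utterance):
    """Word unigrams and bigrams, standing in for 'word sequences'."""
    words = ["<s>"] + utterance.lower().split() + ["</s>"]
    return words[1:-1] + [f"{a}_{b}" for a, b in zip(words, words[1:])]


def train(examples):
    """Fit a multinomial naive Bayes model from (utterance, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for f in features(text):
            feature_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feature_counts, vocab


def classify(model, utterance):
    """Return the most probable label, using add-one smoothing."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(feature_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for f in features(utterance):
            score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented hand-coded data (hypothetical, for illustration only).
HAND_CODED = [
    ("do we need a remote", "question"),
    ("what is the budget", "question"),
    ("do you agree", "question"),
    ("we need a remote", "statement"),
    ("the budget is fixed", "statement"),
    ("i agree with you", "statement"),
]

model = train(HAND_CODED)
prediction = classify(model, "do we need a budget")
print(prediction)
```

With real data the hand-coded labels would come from the corpus annotations, and prosodic features would be added alongside the word features.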

AMI Meeting Rooms

4 close- and 2 wide-view cameras, 4 head-set and 8 array microphones, presentation screen capture, whiteboard capture, pen devices, plus extra site-dependent devices

TNO Edinburgh IDIAP

IS1004d, 3:07 - 4:11

Corpus Overview

• 100 hrs of well-recorded meetings

• orthographically transcribed with word timings by forced alignment

• ASR output

• heavily annotated by hand for communicative behaviours

• Creative Commons Share-Alike licensing, with demo DVD

Hand Annotations

• transcription with word-level timings from forced alignment (100%)

• timestamping against signal (10-30%)
  • head gestures; hand gestures for addressing and interactions with objects; location in room; gaze; emotion?

• discourse structure (70%)
  • dialogue acts (some w/ addressing), named entities, topic segments, linked extractive and abstractive summaries

Costs in person-hrs/hr

annotation                                  person-hrs/hr
transcription                               30
topic segments + abstractive summaries      6-10
dialogue acts w/ some relations             20
addressing                                  12
extractive summaries linked to abstract     1
named entities                              2-5
hand gestures (rough timings)               6
head gestures (rough timings)               6
head gestures (precision timings)           20
movement around room                        4
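To make the scale of these figures concrete, the sketch below multiplies a per-hour cost by a number of meeting hours. The costs are copied from the slide, and the 100-hour figure is the corpus size given in the overview, with transcription covering all of it. The `effort` helper and the (low, high) representation of ranges are assumptions made for this illustration.

```python
# Person-hours of annotation effort per hour of meeting, as listed on the slide.
# Single values are stored as (x, x) so ranges and points share one shape.
COSTS = {
    "transcription": (30, 30),
    "topic segments + abstractive summaries": (6, 10),
    "dialogue acts w/ some relations": (20, 20),
    "addressing": (12, 12),
    "extractive summaries linked to abstract": (1, 1),
    "named entities": (2, 5),
    "hand gestures (rough timings)": (6, 6),
    "head gestures (rough timings)": (6, 6),
    "head gestures (precision timings)": (20, 20),
    "movement around room": (4, 4),
}


def effort(task, meeting_hours):
    """Return (low, high) person-hours needed to annotate `meeting_hours` of data."""
    low, high = COSTS[task]
    return low * meeting_hours, high * meeting_hours


# The whole corpus (100 hrs) is transcribed, so transcription alone is:
transcription_effort = effort("transcription", 100)
print(transcription_effort)
```

At 30 person-hours per hour, transcribing the full 100 hours comes to 3000 person-hours, which is why the less expensive annotation layers matter for coverage.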

Core Problems

• How do we represent all of these kinds of annotation on the same base data, including both structural relationships and timing?

• How do we allow for multiple (human and machine) annotations of the same property, so that we can compare them?

NITE XML Toolkit

• Mature toolkit for handling annotations with temporal ordering and full structural relations

• Data storage format designed to support distributed corpus development

• Libraries for data handling, query, and writing graphical user interfaces

• End user annotation tools for common tasks

• Command line utilities for analysis, feature extraction

• Open source

NXT corpus design

• data model is multi-rooted tree with arbitrary graph structure over the top
  • each node has one set of children, multiple parents

• annotations often naturally map to a tree
  • corpus design to decide where trees intersect

• NXT can represent arbitrary graphs but the more the data has this character, the less useful the query language is
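The "one set of children, multiple parents" idea can be illustrated with a small sketch. This is not NXT's Java API: the `Node` class, `add_child`, and `descendants` are hypothetical names, and the two annotation layers are invented to show two trees intersecting at shared word nodes.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, since nodes link to each other
class Node:
    name: str
    children: list = field(default_factory=list)  # one ordered set of children
    parents: list = field(default_factory=list)   # possibly several parents

    def add_child(self, child):
        self.children.append(child)
        child.parents.append(self)


def descendants(node):
    """All nodes dominated by `node`, depth-first in child order, no repeats."""
    seen, out = set(), []
    stack = list(reversed(node.children))
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        out.append(current)
        stack.extend(reversed(current.children))
    return out


# Two trees from different (hypothetical) layers intersect at the same words.
words = [Node("time"), Node("-"), Node("line")]
dialogue_act = Node("dialogue-act")
emphasis = Node("emphasis")
for w in words:
    dialogue_act.add_child(w)
    emphasis.add_child(w)

print([p.name for p in words[0].parents])
print([n.name for n in descendants(emphasis)])
```

Each word has a single (empty) child list but two parents, one per layer, so both roots reach the same base data.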

Stand-off XML

extract from Bdb001.A.words.xml

<w nite:id="Bdb001.w.1,342" starttime="356.39" endtime="" c="W">time</w>
<w nite:id="Bdb001.w.1,343" starttime="" endtime="" c="HYPH">-</w>
<w nite:id="Bdb001.w.1,344" starttime="" endtime="356.59" c="W">line</w>

extract from Bdb001.A.speech-quality.xml

<speechquality nite:id="Bdb001.emphasis.16" type="emphasis">
  <nite:child href="Bdb001.A.words.xml#id(Bdb001.w.1,342)..id(Bdb001.w.1,344)" />
</speechquality>
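One way to read the stand-off link in this extract is as a file name plus an inclusive range of element ids. The sketch below resolves that range with Python's standard library rather than with NXT itself. The wrapping `<root>` element and the namespace URI bound to `nite:` are assumptions added so the fragment parses on its own, and `parse_href` and `resolve_range` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

NITE_NS = "http://nite.sourceforge.net/"  # assumed URI for the nite: prefix

# The three words from the slide, wrapped in an assumed root element.
WORDS_XML = f"""<root xmlns:nite="{NITE_NS}">
<w nite:id="Bdb001.w.1,342" starttime="356.39" endtime="" c="W">time</w>
<w nite:id="Bdb001.w.1,343" starttime="" endtime="" c="HYPH">-</w>
<w nite:id="Bdb001.w.1,344" starttime="" endtime="356.59" c="W">line</w>
</root>"""

HREF = "Bdb001.A.words.xml#id(Bdb001.w.1,342)..id(Bdb001.w.1,344)"


def parse_href(href):
    """Split 'file#id(a)..id(b)' into (file, first_id, last_id)."""
    match = re.fullmatch(r"([^#]+)#id\(([^)]+)\)(?:\.\.id\(([^)]+)\))?", href)
    if match is None:
        raise ValueError(f"unrecognised href: {href!r}")
    filename, first, last = match.groups()
    return filename, first, last or first  # a single id is a range of one


def resolve_range(words_root, first_id, last_id):
    """Return the sibling elements from first_id to last_id, inclusive."""
    id_attr = f"{{{NITE_NS}}}id"
    words = list(words_root)
    ids = [w.get(id_attr) for w in words]
    start, end = ids.index(first_id), ids.index(last_id)
    return words[start:end + 1]


filename, first_id, last_id = parse_href(HREF)
emphasised = resolve_range(ET.fromstring(WORDS_XML), first_id, last_id)
print(filename, "".join(w.text for w in emphasised))
```

Because the emphasis annotation only points at the words, several annotators can each add their own files over the same word file without editing it.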

Metadata file

Like a set of DTDs for the XML files, plus:

• connections between the files

• list of "observations" (coded dialogues/group discussions/texts)

• catalog for finding signals and data on disk

Simple example query

($w word)($r reference): ($w@POS = "NN") && ($r ^ $w)

Return list of 2-tuples of words and referring expressions where the word’s part of speech is NN and the word is in the referring expression.
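The semantics described here can be mimicked in a few lines of Python to show what the result looks like. This is an emulation on invented toy data, not the NXT query engine: `Word`, `Reference`, and `run_query` are hypothetical names, and dominance is simplified to direct containment in one flat level.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Word:
    text: str
    pos: str


@dataclass(frozen=True)
class Reference:
    label: str
    words: tuple  # the words this referring expression contains


def run_query(words, references):
    """Emulate: ($w word)($r reference): ($w@POS = "NN") && ($r ^ $w)."""
    return [
        (w, r)
        for w, r in product(words, references)
        if w.pos == "NN" and any(w is member for member in r.words)
    ]


# Invented example: "the remote works", plus an unrelated noun.
the = Word("the", "DT")
remote = Word("remote", "NN")
works = Word("works", "VBZ")
button = Word("button", "NN")
ref = Reference("r1", (the, remote))

result = run_query([the, remote, works, button], [ref])
print([(w.text, r.label) for w, r in result])
```

Only "remote" is returned: "the" is in the referring expression but is not a noun, and "button" is a noun but lies outside it.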

General features of the language

• Match variable by no type, single type, or disjunctive type

• Attribute and content tests for existence, ordering, equality, match to regexp

• The usual boolean combinators

• Quantifiers forall and exists

• Filtering by passing results to another query to create a result tree (not list)

Uses for queries

• Exploring the data in a browser

• Basic frequency counts

• Verifying data quality

• Indexing complexes for further use

• Finding things for screen rendering in GUI

Only configuration needed to:

• search/index data in NXT format

• display data in a standardized (ugly) way

• set up annotation tools for some common tasks
  • dialogue act
  • named entity
  • time-stamped labelling

• [named entity demo]

Programming tailored interfaces

• development time is 1.5 days - 2 weeks depending on
  • how clear the spec is
  • complexity of the interface and whether our "transcription view" middleware fits
  • familiarity with Swing

Named entity coder

Summary

• NXT provides infrastructure for collaborative annotation that
  • Is distributed
  • Provides structural relationships
  • Provides timing w.r.t. signals
  • Works for large-scale projects

• NXT’s best current demonstration is in the AMI Meeting Corpus