Annotation as Algebra: a formal framework for linguistic annotation
Mark Liberman, University of Pennsylvania (joint work with Steven Bird, Melbourne University)
HP Labs Bangalore, 8/21/2003

Page 1: Annotation as Algebra: a formal framework for linguistic annotation

Annotation as Algebra: a formal framework for linguistic annotation

Mark Liberman
University of Pennsylvania

(joint work with Steven Bird, Melbourne University)

HP Labs Bangalore, 8/21/2003

Page 2: Annotation as Algebra: a formal framework for linguistic annotation

Outline

- Motivation
- Sketch of the idea
- Survey of linguistic annotation
- Annotation graphs as a formal framework
- Practical implementations and experience
- Issues for the future

Page 3: Annotation as Algebra: a formal framework for linguistic annotation

What linguistic annotation is (and isn’t)

- “Linguistic annotation” means symbolic descriptions of specific linguistic signals
  - e.g. transcriptions, parses, etc.
- It does not include things like:
  - metadata
    - e.g. information about speakers, recordings, documents, etc.
    - typically stored in an RDB referenced by elements of linguistic annotation
  - lexicons
- but these can be treated in a common framework

Page 4: Annotation as Algebra: a formal framework for linguistic annotation

Motivation

- A jungle of annotation file formats
  - e.g. more than 20 common formats for time-marked orthographic transcriptions
  - many new formats every year
- Multiple annotations of the same data
- No good way to search annotations
  - different coding needed for each format
  - extra difficulty of searches across formats
- Problems for:
  - tool builders
  - researchers
  - corpus builders and maintainers

Page 5: Annotation as Algebra: a formal framework for linguistic annotation

Basic idea #1: what to do

- Abstract away from file formats, to the logical structure of linguistic annotation
- Replace two-level model with three-level model
  - as in database technology several decades ago
  - so many applications can access many kinds of data through a consistent API
- Choose a logical structure with good properties
  - simple, conceptually natural, computationally efficient
  - algebra to facilitate boolean combination of queries

Page 6: Annotation as Algebra: a formal framework for linguistic annotation

Two-level model:

Page 7: Annotation as Algebra: a formal framework for linguistic annotation

Three-level model:

Page 8: Annotation as Algebra: a formal framework for linguistic annotation

Basic idea #2: how to do it

- Three kinds of assertion recur in linguistic annotation
  - assigning a label: “This chunk of stuff has property X”
  - sequencing labels: “chunk B immediately follows chunk A”
  - anchoring the edges of labels: “this chunk boundary has coordinates k” (in time, space, text...)
- Formalized as a labeled DAG, these primitives provide a logical structure adequate for all linguistic annotation
- The result also defines an algebra, useful for searching and in other ways

Page 9: Annotation as Algebra: a formal framework for linguistic annotation

Basic assertion type 1: labeling

Associate a “label” (typed, structured symbolic information) with a region of a linguistic signal

Page 10: Annotation as Algebra: a formal framework for linguistic annotation

Basic assertion type 2: sequencing

Example:

The stretch of signal labeled “this” is followed by a stretch of signal labeled “is”

Page 11: Annotation as Algebra: a formal framework for linguistic annotation

Basic assertion type 3: anchoring

Example:

The stretch of signal labeled “this” begins 137.4592 seconds from the start of file XYZ.

Page 12: Annotation as Algebra: a formal framework for linguistic annotation

Informal formalization

An “annotation graph” (AG) is:
- a directed acyclic graph
- whose arcs are labeled with fielded records
  - e.g. phoneme=“p” or word=“this”
- whose nodes may be labeled with signal coordinates
  - e.g. 3.45692 seconds

Labeling → arc labels
Sequencing →
Anchoring → signal coordinates on nodes

That’s all!
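The informal definition can be sketched in a few lines of Python. This is only an illustration under assumed names (`Arc`, `AnnotationGraph`, `add_node`, `add_arc` are hypothetical and are not the AGTK API); it treats sequencing as two arcs sharing a node, which is one natural reading of the graph structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Arc:
    """One labeled arc: a fielded record spanning two nodes."""
    source: str
    target: str
    label: Dict[str, str]


@dataclass
class AnnotationGraph:
    # node id -> signal coordinate (None when the node is unanchored)
    anchors: Dict[str, Optional[float]] = field(default_factory=dict)
    arcs: List[Arc] = field(default_factory=list)

    def add_node(self, node_id: str, offset: Optional[float] = None) -> None:
        self.anchors[node_id] = offset

    def add_arc(self, source: str, target: str, **label: str) -> None:
        # Nodes mentioned by an arc are created unanchored if not yet known.
        for node_id in (source, target):
            self.anchors.setdefault(node_id, None)
        self.arcs.append(Arc(source, target, dict(label)))


ag = AnnotationGraph()
ag.add_node("n0", 137.4592)          # anchoring: "this" begins at 137.4592 s
ag.add_arc("n0", "n1", word="this")  # labeling
ag.add_arc("n1", "n2", word="is")    # sequencing: "is" starts where "this" ends
```

Nodes n1 and n2 stay unanchored here, which the definition allows: node coordinates are optional.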

Page 13: Annotation as Algebra: a formal framework for linguistic annotation

Outcome

API, open source toolkit (C,C++,TCL,Python); sample tools:

Java version (“ATLAS”) developed by NIST

Page 14: Annotation as Algebra: a formal framework for linguistic annotation

Annotation formats & tools

- Surveyed in 1999 by Liberman and Bird
- Documented on web page http://ldc.upenn.edu/annotation
- Used in designing annotation graph system & AG software
- Survey is updated periodically

Page 15: Annotation as Algebra: a formal framework for linguistic annotation

Some animals in the annotation zoo

1 TIMIT
2 BAS Partitur
3 CHILDES
4 LACITO
5 LDC CALLHOME
6 NIST UTF
7 Switchboard (four types of annotation)
8 ... etc. ...

Page 16: Annotation as Algebra: a formal framework for linguistic annotation

Sample TIMIT data

train/dr1/fjsp0/sa1.wrd:     train/dr1/fjsp0/sa1.phn:
2360 5200 she                0 2360 h#
5200 9680 had                2360 3720 sh
9680 11077 your              3720 5200 iy
11077 16626 dark             5200 6160 hv
16626 22179 suit             6160 8720 ae
22179 24400 in               8720 9680 dcl
24400 30161 greasy           9680 10173 y
30161 36150 wash             10173 11077 axr
36720 41839 water            11077 12019 dcl
41839 44680 all              12019 12257 d
44680 49066 year             ...

Page 17: Annotation as Algebra: a formal framework for linguistic annotation

TIMIT interpreted graphically

[Figure: the word “had” spanning the phones hv, ae, dcl, with boundaries at 5200, 6160, 8720 and 9680]

Page 18: Annotation as Algebra: a formal framework for linguistic annotation

TIMIT as Annotation Graph

W = word level
5200 9680 had

P = phoneme level
5200 6160 hv
6160 8720 ae
8720 9680 dcl
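A rough sketch of how TIMIT lines might be read into arcs, assuming the "start end label" layout shown in the sample data. The function name `timit_to_arcs` and the record fields `type` and `label` are illustrative choices, not part of any TIMIT or AGTK tool.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, Dict[str, str]]  # (start sample, end sample, label record)


def timit_to_arcs(text: str, tier: str) -> List[Arc]:
    """Turn 'start end label' lines into arcs typed by tier (e.g. 'W' or 'P')."""
    arcs = []
    for line in text.strip().splitlines():
        start, end, label = line.split(maxsplit=2)
        arcs.append((int(start), int(end), {"type": tier, "label": label}))
    return arcs


wrd = "5200 9680 had"
phn = "5200 6160 hv\n6160 8720 ae\n8720 9680 dcl"

arcs = timit_to_arcs(wrd, "W") + timit_to_arcs(phn, "P")
# Nodes are the distinct sample offsets; arcs that share an offset share a node.
nodes = sorted({offset for start, end, _ in arcs for offset in (start, end)})
```

Because every TIMIT boundary is time-marked, all nodes are anchored and the word arc and its three phone arcs meet at the shared offsets 5200 and 9680.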

Page 19: Annotation as Algebra: a formal framework for linguistic annotation

BAS Partitur

Goal: a common format for research results from many German speech projects.

A multi-tier description of speech signals:

KAN - the canonical transcription
ORT - orthographic transcription
TRL - transliteration
MAU - phonetic transcription
DAS - dialogue act transcription

Page 20: Annotation as Algebra: a formal framework for linguistic annotation

BAS Partitur: example

KAN:0 j'a:      ORT:0 ja       MAU: 4160 1119 0 j
KAN:1 S'2:n@n   ORT:1 schönen  MAU: 5280 2239 0 a:
KAN:2 d'aNk     ORT:2 Dank     MAU: 7520 2399 1 S
KAN:3 das+      ORT:3 das      MAU: 9920 1599 1 2:
KAN:4 vE:r@+    ORT:4 wäre     MAU: 11520 479 1 n
KAN:5 z'e:6     ORT:5 sehr     MAU: 12000 479 1 n
KAN:6 n'Et      ORT:6 nett     MAU: 12480 479 -1

DAS:0,1,2 @(THANK_INIT BA)
DAS:3,4,5,6 @(FEEDBACK_ACKNOWLEDGEMENT BA)

Page 21: Annotation as Algebra: a formal framework for linguistic annotation

BAS Partitur graphical structure:

[Figure: KAN (j'a:, S'2:n@n), ORT (ja, sch"onen), DAS (@(THANK_INIT BA)) and MAU (j, a:) tiers drawn over the time points 4160, 5280, 7520]

KAN:0 j'a:      ORT:0 ja        MAU: 4160 1119 0 j
KAN:1 S'2:n@n   ORT:1 sch"onen  MAU: 5280 2239 0 a:
DAS:0,1,2 @(THANK_INIT BA)

Page 22: Annotation as Algebra: a formal framework for linguistic annotation

Partitur differences from TIMIT

- File organization:
  - everything is in a single file (even metadata)
- Time marking:
  - time anchors are in only one tier (MAU)
  - time anchors use <start offset, duration-1>
- Relationship between the tiers:
  - KAN tier supplies a set of identifiers
  - MAU tier: several lines for each KAN line
  - DAS tier: one line for several KAN lines
- Temporal structure:
  - MAU and DAS define convex intervals
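The <start offset, duration-1> convention can be checked against the example data: 4160 + 1119 + 1 = 5280, which is the start of the next segment. A minimal sketch of that conversion follows; the helper name `parse_mau` is hypothetical, and it assumes the five-field MAU lines shown in the example.

```python
from typing import Tuple


def parse_mau(line: str) -> Tuple[int, int, int, str]:
    """Parse 'MAU: start duration-1 kan_index phone' into (start, end, kan_index, phone)."""
    tag, start, dur_minus_1, kan_index, phone = line.split()
    if tag != "MAU:":
        raise ValueError(f"not a MAU line: {line!r}")
    begin = int(start)
    end = begin + int(dur_minus_1) + 1  # the file stores duration minus one
    return begin, end, int(kan_index), phone


lines = ["MAU: 4160 1119 0 j", "MAU: 5280 2239 0 a:", "MAU: 7520 2399 1 S"]
segments = [parse_mau(line) for line in lines]

# The time span of KAN word 0 runs from its first phone's start to its last phone's end.
word0 = [seg for seg in segments if seg[2] == 0]
word0_span = (word0[0][0], word0[-1][1])
```

Since only the MAU tier carries times, the other tiers get their extents this way, through the KAN index in the third field.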

Page 23: Annotation as Algebra: a formal framework for linguistic annotation

BAS Partitur: Annotation graph

ORT: 0 ja        MAU: 4160 1119 0 j
ORT: 1 sch"onen  MAU: 5280 2239 0 a:
                 MAU: 7520 2399 1 S
                 MAU: 9920 1599 1 2:
                 MAU: 11520 479 1 n

DAS:0,1,2 @(THANK_INIT BA)

Page 24: Annotation as Algebra: a formal framework for linguistic annotation

CHILDES

- Child language acquisition data
- Archive organized by Brian MacWhinney at CMU
- CHAT transcription format
- Tools for creating, browsing, searching
- Contributions by many researchers around the world

Page 25: Annotation as Algebra: a formal framework for linguistic annotation

CHILDES Annotation

*ROS: yahoo.
%snd: "boys73a.aiff" 7349 8338
*FAT: you got a lot more to do # don't you?
%snd: "boys73a.aiff" 8607 9999
*MAR: yeah.
%snd: "boys73a.aiff" 10482 10839
*MAR: because I'm not ready to go to <the bathroom> [>] +/.
%snd: "boys73a.aiff" 11621 13784

Page 26: Annotation as Algebra: a formal framework for linguistic annotation

CHILDES differences from TIMIT

- long recordings with multiple speakers
- time specified at turn level only
- there are gaps between the turns
- the transcription contains embedded annotations

Page 27: Annotation as Algebra: a formal framework for linguistic annotation

CHILDES annotation graph

*ROS: yahoo.
%snd: "boys73a.aiff" 7349 8338
*FAT: you got a lot more to do # don't you?
%snd: "boys73a.aiff" 8607 9999

NB: incomplete time info, disconnected structure

Page 28: Annotation as Algebra: a formal framework for linguistic annotation

CHILDES: RDB connection

ID  NAME   ROLE    AGE     SEX   BIRTH
1   Ross   Child   6;3.11  male  23-DEC-1977
2   Mark   Child   4;4.15  male  19-NOV-1979
3   Brian  Father
4   Mary   Mother

“metadata” about speakers, recordings etc. stored separately in relational tables

Page 29: Annotation as Algebra: a formal framework for linguistic annotation

LACITO

- Langues et Civilisations à Tradition Orale
  - recordings of unwritten languages, collected and transcribed over three decades
  - preservation and dissemination
- Based on XML
  - markup for alignment to audio signal
  - different XSL style sheets for display
  - generating HTML with hyperlinks to audio clips

Page 30: Annotation as Algebra: a formal framework for linguistic annotation

LACITO example

<S id="s1">
  <AUDIO start="2.3656" end="7.9256"/>
  <TRANSCR>
    <W><FORM>nakpu</FORM> <GLS>deux</GLS></W>
    <W><FORM>nonotso</FORM> <GLS>soeurs</GLS></W>
    <W><FORM>si&x014b;</FORM> <GLS>bois</GLS></W>
    <W><FORM>pa</FORM> <GLS>faire</GLS></W>
    <W><FORM>la&x0294;natshem</FORM> <GLS>allerent</GLS></W>
    <W><FORM>are</FORM> <GLS>dit.on</GLS></W>
    <PONCT>.</PONCT>
  </TRANSCR>
  <TRADUC lang="Francais">On raconte que deux soeurs allerent chercher du bois.</TRADUC>
  <TRADUC lang="Anglais">They say that two sisters went to get firewood.</TRADUC>
</S>

Page 31: Annotation as Algebra: a formal framework for linguistic annotation

LACITO as AG

<AUDIO start="2.3656" end="7.9256"/>
<W><FORM>nakpu</FORM> <GLS>deux</GLS></W>
<W><FORM>nonotso</FORM> <GLS>soeurs</GLS></W>
<W><FORM>si&x014b;</FORM> <GLS>bois</GLS></W>
<W><FORM>pa</FORM> <GLS>faire</GLS></W>
<TRADUC lang="Francais">On raconte que deux ...</TRADUC>
<TRADUC lang="Anglais">They say that two ...</TRADUC>

Page 32: Annotation as Algebra: a formal framework for linguistic annotation

LACITO discussion

- Two kinds of partiality for times:
  - where they are simply unknown
  - where they are inappropriate
- Unknown times:
  - the annotation is incomplete
  - time-alignment is coarse-grained
- Inappropriate times:
  - for word boundaries in the phrasal translation
  - for punctuation?

Page 33: Annotation as Algebra: a formal framework for linguistic annotation

LDC Call Home example

980.18 989.56 A: you know, given how he's how far he's gotten, you know, he got his degree at &Tufts and all, I found that surprising that for the first time as an adult they're diagnosing this. %um
989.42 991.86 B: %mm. I wonder about it. But anyway.
991.75 994.65 A: yeah, but that's what he said. And %um
994.19 994.46 B: yeah.
995.21 996.59 A: He %um
996.51 997.61 B: Whatever's helpful.
997.40 1002.55 A: Right. So he found this new job as a financial consultant and seems to be happy with that.
1003.14 1003.45 B: Good.

Page 34: Annotation as Algebra: a formal framework for linguistic annotation

LDC CallHome as AG

995.21 996.59 A: He %um
996.51 997.61 B: Whatever's helpful.
997.40 1002.55 A: Right. So ...

Page 35: Annotation as Algebra: a formal framework for linguistic annotation

CallHome discussion

- Speaker overlap
  - No special devices, just turn time-marks
  - Scales for an arbitrary number of speakers
- Information about word-level overlap is left ambiguous
- Additional time references could easily specify word overlap
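Since overlap is carried only by the turn time-marks, it can be recovered by comparing intervals. The sketch below is one simple way to do that, using three turns from the example; the function name `overlaps` and the tuple layout are assumptions made for illustration.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]  # (start, end, speaker)


def overlaps(turns: List[Turn]) -> List[Tuple[str, str, float, float]]:
    """Return (speaker1, speaker2, overlap_start, overlap_end) for overlapping turn pairs."""
    found = []
    for i, (s1, e1, spk1) in enumerate(turns):
        for s2, e2, spk2 in turns[i + 1:]:
            lo, hi = max(s1, s2), min(e1, e2)
            # Two turns by different speakers overlap when their intersection is non-empty.
            if spk1 != spk2 and lo < hi:
                found.append((spk1, spk2, lo, hi))
    return found


turns = [(995.21, 996.59, "A"), (996.51, 997.61, "B"), (997.40, 1002.55, "A")]
result = overlaps(turns)
```

The pairwise comparison works for any number of speakers, but it only says that the turns overlap in time, not which words do: that is the ambiguity noted above.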

Page 36: Annotation as Algebra: a formal framework for linguistic annotation

NIST UTF (circa 1999)

- NIST: National Institute of Standards and Technology (USA)
- UTF: “Universal Transcription Format”
- Intended to generalize over several earlier LDC broadcast news and conversation transcription formats
- Special treatment for:
  - metadata, time stamps, speaker overlap, contractions

N.B. now abandoned in favor of AG-based representations

Page 37: Annotation as Algebra: a formal framework for linguistic annotation

NIST UTF example (from BN)

<turn speaker="Roger_Hedgecock" spkrtype="male" dialect="native"
      start="2348.811875" end="2391.606000" mode="spontaneous" fidelity="high">
<time sec="2387.353875"> on welfare and away from real ownership \{breath and
<contraction e_form="[that=>that]['s=>is]">that's a real problem in this
<b_overlap start="2391.115375" end="2391.606000"> country<e_overlap>
</turn>
<turn speaker="Gloria_Allred" spkrtype="female" dialect="native"
      start="2391.299625" end="2439.820312" mode="spontaneous" fidelity="high">
<b_overlap start="2391.299625" end="2391.606000">well i<e_overlap> think the real problem is that %uh these kinds of republican attacks
<time sec="2395.462500">i see as code words for discrimination
</turn>

Page 38: Annotation as Algebra: a formal framework for linguistic annotation

NIST UTF: turn element

<turn speaker="Roger_Hedgecock" spkrtype="male" dialect= "native" start="2348.811875" end="2391.606000" mode="spontaneous" fidelity="high">

Page 39: Annotation as Algebra: a formal framework for linguistic annotation

NIST UTF: Contraction

<contraction e_form="[that=>that]['s=>is]"> that's

Page 40: Annotation as Algebra: a formal framework for linguistic annotation

NIST UTF: overlap

<b_overlap start="2391.115375" end="2391.606000">country<e_overlap>

Page 41: Annotation as Algebra: a formal framework for linguistic annotation

NIST UTF: discussion

- Relational data (e.g. speaker demographics) is embedded in the annotation (redundantly).
- Time stamps are stored in three different places.
- Speaker overlap is convolved with the speaker turn, so time relation with an external event disrupts the internal structure of a turn
- Contractions are treated in a way that facilitates link to lexicon, but may be hard to ignore in a search function

Page 42: Annotation as Algebra: a formal framework for linguistic annotation

NIST UTF as AG

Page 43: Annotation as Algebra: a formal framework for linguistic annotation

AG contraction treatment

- Additional textual annotations:
  - e.g. for expanding a contraction
  - don't complicate the existing representation
  - facilitates search

Page 44: Annotation as Algebra: a formal framework for linguistic annotation

NIST UTF / AG version

- Metadata
  - stored in a separate RDB table (cf. CHILDES)
- Time stamps
  - stored in a single place -- AG nodes
- Speaker overlap
  - not convolved with the speaker turn
  - so temporal relationship with an external event remains external to the structure of a turn
- Contractions
  - no new device, easily ignored in search
- No artificial order on speaker turns

Page 45: Annotation as Algebra: a formal framework for linguistic annotation

Switchboard

- Corpus of 2400 5-minute telephone conversations collected at Texas Instruments in 1991
- Transcribed and aligned on three levels:
  - conversation, speaker turn, word
- Subsequently annotated for:
  - POS, syntactic structure, breath groups, disfluencies, speech acts, phonetic segments, etc.
- Then re-transcribed with many corrections!

-- Proliferation of layers with different tokenizations
-- Problem of correction after annotation

Page 46: Annotation as Algebra: a formal framework for linguistic annotation

SWB example (1, 2)

B 21.86 0.26 Metric
B 22.12 0.26 system,
B 22.38 0.18 no
B 22.56 0.06 one's
B 22.86 0.32 very,
B 23.88 0.14 uh,
B 24.02 0.16 no
B 24.18 0.32 one
B 24.52 0.28 wants
B 24.80 0.06 it
B 24.86 0.12 at
B 24.98 0.22 all
B 25.66 0.22 seems
B 25.88 0.22 like.

[ Metric/JJ system/NN ] ,/, [ no/DT one/NN ] 's/BES very/RB ,/, [ uh/UH ] ,/, [ no/DT one/NN ] wants/VBZ [ it/PRP ] at/IN [ all/DT ] seems/VBZ like/IN ./.

Page 47: Annotation as Algebra: a formal framework for linguistic annotation

SWB example (3, 4)

B.22: Yeah, / no one seems to be adopting it. / Metric system, [ no one's very, + {F uh, } no one wants ] it at all seems like. /

((S (NP-TPC Metric system) , (S-TPC-1 (EDITED (RM [) (S (NP-SBJ no one) (VP 's (ADJP-PRD-UNF very))) , (IP +)) (INTJ uh) , (NP-SBJ no one) (VP wants (RS ]) (NP it) (ADVP at all))) (NP-SBJ *) (VP seems (SBAR like (S *T*-1))) . E_S))

Page 48: Annotation as Algebra: a formal framework for linguistic annotation

Switchboard: AG

Page 49: Annotation as Algebra: a formal framework for linguistic annotation

Another multiple annotation

It is quite realistic to have this many diverse annotations (and more!) for the same material...

Page 50: Annotation as Algebra: a formal framework for linguistic annotation

AG formalization: Background

- Annotation - the basic action:
  - associate a label with an extent of signal
  - labels may be of different types
  - different types may span different amounts of time; need not form a hierarchy
- Minimal formalization:
  - directed graph
  - typed, fielded records on the arcs
  - optional time references on the nodes

Page 51: Annotation as Algebra: a formal framework for linguistic annotation

Timelines

- Nodes are anchored to signals using offsets
- An annotation may reference more than one signal, e.g.
  - simultaneous audio and video signals
  - signals from multiple microphones
  - audio and physiological signals
- All the signals covered by a given annotation must be from the same "flow of time" = timeline T
  - but signals may cover a timeline only partially
- (Other ordered sets, such as the sequence of characters in a text, may also be treated as timelines...)

Page 52: Annotation as Algebra: a formal framework for linguistic annotation

Two Signals, One Timeline

(Could be treated as a single multi-channel signal -- but different channels might be in different files, have different frame rates, etc.)

Page 53: Annotation as Algebra: a formal framework for linguistic annotation

AG: Formal Definition

An Annotation Graph G over a label set L and timeline T is a 3-tuple <N,A,t>:

- N = set of nodes
- A = set of arcs labelled with elements of L
- t = partial function from N to T

satisfying the following conditions:

1 <N,A> is acyclic, with no nodes of degree zero
2 for any path from node n1 to n2, if t(n1) and t(n2) are defined, then t(n1) <= t(n2)
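The two conditions can be checked mechanically. The following is a small, unoptimized sketch, assuming nodes are strings, arcs are (source, target, label) triples that mention only listed nodes, and the partial function t is a dict holding only the anchored nodes. The name `is_valid_ag` is hypothetical.

```python
from typing import Dict, List, Tuple

Arc = Tuple[str, str, str]  # (source node, target node, label)


def is_valid_ag(nodes: List[str], arcs: List[Arc], t: Dict[str, float]) -> bool:
    """Check conditions 1 and 2 for <N, A, t>; t maps only the anchored nodes."""
    succ = {n: [] for n in nodes}
    touched = set()
    for src, dst, _label in arcs:
        succ[src].append(dst)
        touched.update((src, dst))
    if touched != set(nodes):           # condition 1: no nodes of degree zero
        return False

    # Condition 1: acyclic. Depth-first search with three states per node.
    state = {n: 0 for n in nodes}       # 0 = unseen, 1 = on stack, 2 = done

    def has_cycle(n: str) -> bool:
        state[n] = 1
        for m in succ[n]:
            if state[m] == 1 or (state[m] == 0 and has_cycle(m)):
                return True
        state[n] = 2
        return False

    if any(state[n] == 0 and has_cycle(n) for n in nodes):
        return False

    # Condition 2: along any path, anchored nodes have non-decreasing times.
    def reachable(n: str, seen: set) -> set:
        for m in succ[n]:
            if m not in seen:
                seen.add(m)
                reachable(m, seen)
        return seen

    for n1 in t:
        for n2 in reachable(n1, set()):
            if n2 in t and t[n1] > t[n2]:
                return False
    return True


nodes = ["n1", "n2", "n3"]
arcs = [("n1", "n2", "word=this"), ("n2", "n3", "word=is")]
ok = is_valid_ag(nodes, arcs, {"n1": 1.23, "n3": 3.15})
backwards = is_valid_ag(nodes, arcs, {"n1": 3.15, "n3": 1.23})
```

Note that the unanchored middle node n2 does not get in the way: condition 2 only constrains pairs of nodes for which t is defined.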

Page 54: Annotation as Algebra: a formal framework for linguistic annotation

Condition 1

1. <N,A> is acyclic, with no nodes of degree zero

1a. AGs are acyclic
- expresses the linearity of signal annotations
- an important property wrt implementations and to QLs containing path expressions

1b. AGs have no orphan nodes
- the only point of nodes is to anchor the arcs
- avoids the situation of AGs that are identical but for orphan nodes

Page 55: Annotation as Algebra: a formal framework for linguistic annotation

Condition 2

for any path from node n1 to n2, if t(n1) and t(n2) are defined, then t(n1) <= t(n2)

2. AGs respect the flow of time (or the structure of another anchoring space)

[Figure: example graphs with time-anchored nodes (1.23, 3.15)]

Page 56: Annotation as Algebra: a formal framework for linguistic annotation

AG: Interpretation of Labels

Arc labels may be interpreted as:
- substantive content conforming to a coding practice
- as meta-commentary
- as a reference to other material
- as an identifier
- as arbitrary binary data

Choice of label interpretations falls outside the scope of the formalism

Page 57: Annotation as Algebra: a formal framework for linguistic annotation

AG: Expressiveness

Is the formalism too minimalist?

Some things that some people want:
1. cross-reference from a label to another arbitrary label, arc or node
2. labels as well as anchors for nodes
3. anchoring nodes to arcs or labels rather than timelines
4. anchoring arcs/labels in 2- or 3-dimensional spaces
5. recursive structures in labels

“Core AG” has sufficient expressive capacity to encode, in an intuitive way, all commonly used formats, and also good properties wrt creation, maintenance, search

Our strategy:
- see how far we can go with this core
- dispense with more complex syntax and focus on semantics
- but some of (1) has been added in core AG implementation, and (4) has been added in “ATLAS” (NIST version)

Page 58: Annotation as Algebra: a formal framework for linguistic annotation

Structures for a single layer

All of these have (one or more) natural representations in the basic AG formalism.

Multiple layers can of course be added in a general way.

Page 59: Annotation as Algebra: a formal framework for linguistic annotation

Equivalence classes

Equivalence classes (joint reference to an external ID) provide a way to establish symmetrical inter-label linkages without any new formal devices

Page 60: Annotation as Algebra: a formal framework for linguistic annotation

AG as algebra

- An AG can be represented as a set of arcs, each with an associated label and (optionally-anchored) source and destination nodes
- The power set of this arc set defines a boolean algebra (as usual)
- Every member of the power set is itself a well-defined AG
- This algebra can be used for queries, just as the relational algebra is for RDBs
- Adding e.g. pointers from labels to other arcs compromises this property (because arc subsets are not well-formed if pointers cannot be dereferenced)
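A minimal illustration of the arc-set view, using Python frozensets so that AND, OR and NOT become intersection, union and difference. The `Arc` tuple and the example labels are made up for the sketch, and node anchors are left out for brevity.

```python
from typing import FrozenSet, NamedTuple


class Arc(NamedTuple):
    source: str
    target: str
    label: str


AG = FrozenSet[Arc]  # an AG viewed purely as a set of arcs

full: AG = frozenset({
    Arc("n0", "n1", "word=this"),
    Arc("n1", "n2", "word=is"),
    Arc("n0", "n2", "speaker=A"),
})

# Two simple queries, each returning a subset of the arcs (itself an AG).
words = frozenset(a for a in full if a.label.startswith("word="))
from_n0 = frozenset(a for a in full if a.source == "n0")

both = words & from_n0     # AND
either = words | from_n0   # OR
not_words = full - words   # NOT, relative to the full arc set
```

Every result is again a set of arcs whose nodes are exactly those the arcs mention, so it has no orphan nodes and stays acyclic. A label that pointed at another arc could end up in a subset that no longer contains its target, which is the problem raised in the last bullet.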

Page 61: Annotation as Algebra: a formal framework for linguistic annotation

AG as RDB

- An AG can therefore also be interpreted as a relational table
  - or (more conveniently) as a set of three relational tables
- This allows standard RDB implementations to be used for AG storage and retrieval
- Obvious advantages, though standard RDB may not use AG structure optimally...

Page 62: Annotation as Algebra: a formal framework for linguistic annotation

Relational Representation

[Figure: arc Ann1: <l1,l2,...,ln> running from anchor a1 (offset t1) to anchor a2 (offset t2)]

Three relations: anchor, annotation (=arc), feature (=label)

Page 63: Annotation as Algebra: a formal framework for linguistic annotation

Anchor Relation

[Figure: arc Ann1: <l1,l2,...,ln> running from anchor a1 (offset t1) to anchor a2 (offset t2)]

AnchorId  Offset
a1        t1
a2        t2

Page 64: Annotation as Algebra: a formal framework for linguistic annotation

Annotation (arc) Relation

[Figure: arc Ann1: <l1,l2,...,ln> running from anchor a1 (offset t1) to anchor a2 (offset t2)]

AnnotationId  Source  Destination
Ann1          a1      a2

Page 65: Annotation as Algebra: a formal framework for linguistic annotation

Feature Relation

[Figure: arc Ann1: <l1,l2,...,ln> running from anchor a1 (offset t1) to anchor a2 (offset t2)]

AnnotationId  Feature  Value
Ann1          F1       l1
Ann1          F2       l2
...           ...      ...
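The three relations can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table and column names follow the slides, while the column types, the in-memory database and the feature names `type` and `label` are assumptions. The single arc stored is the word "had" from the TIMIT example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE anchor     (AnchorId TEXT PRIMARY KEY, "Offset" REAL);
    CREATE TABLE annotation (AnnotationId TEXT PRIMARY KEY, Source TEXT, Destination TEXT);
    CREATE TABLE feature    (AnnotationId TEXT, Feature TEXT, Value TEXT);
""")

# One arc: the word "had" spanning samples 5200..9680.
con.executemany("INSERT INTO anchor VALUES (?, ?)", [("a1", 5200), ("a2", 9680)])
con.execute("INSERT INTO annotation VALUES ('Ann1', 'a1', 'a2')")
con.executemany("INSERT INTO feature VALUES (?, ?, ?)",
                [("Ann1", "type", "W"), ("Ann1", "label", "had")])

# Reassemble the arc: its label plus the offsets of its two anchors.
rows = con.execute("""
    SELECT f.Value, s."Offset", d."Offset"
    FROM feature f
    JOIN annotation a ON a.AnnotationId = f.AnnotationId
    JOIN anchor s ON s.AnchorId = a.Source
    JOIN anchor d ON d.AnchorId = a.Destination
    WHERE f.Feature = 'label'
""").fetchall()
```

Keeping features in their own table lets one arc carry any number of fields without changing the schema, at the price of a join per lookup.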

Page 66: Annotation as Algebra: a formal framework for linguistic annotation

Queries across multiple tables

ID    Sex  DR  Ht
AKS0  F    1   5'04"
ASW0  F    5   5'06"
BJL0  F    5   5'07"

train/dr2/fbjl0/

ha     /hh aa1/
habit  /hh ae1 b ix t/
had    /hh ae1 d/
hafta  /hh ae1 f t ax/

Page 67: Annotation as Algebra: a formal framework for linguistic annotation

Queries on AG Tables

select * from FEATURE
where FEATURE.AGID="Timit:AG80"

select ANNOTATIONID, SPKRINFO.ID
from FEATURE, SPKRINFO
where SPKRINFO.DR=1
and SPKRINFO.Ht=70
and FEATURE.VALUE="dark"
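A runnable variant of the second query, again with sqlite3. Several things here are assumptions for the sake of the sketch: the `SPKRID` column linking a feature row to a speaker is hypothetical (the slide's query gives no join condition), heights are stored in inches, and the FEATURE rows are invented. Only the three speaker rows come from the slides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SPKRINFO (ID TEXT, Sex TEXT, DR INTEGER, Ht INTEGER);
    CREATE TABLE FEATURE  (AGID TEXT, ANNOTATIONID TEXT, SPKRID TEXT,
                           FEATURE TEXT, VALUE TEXT);
""")

# Speaker rows from the slide, heights converted to inches (5'04" = 64).
con.executemany("INSERT INTO SPKRINFO VALUES (?, ?, ?, ?)",
                [("AKS0", "F", 1, 64), ("ASW0", "F", 5, 66), ("BJL0", "F", 5, 67)])

# Illustrative feature rows; the ids and SPKRID values are made up.
con.executemany("INSERT INTO FEATURE VALUES (?, ?, ?, ?, ?)",
                [("Timit:AG1", "Ann4", "AKS0", "word", "dark"),
                 ("Timit:AG2", "Ann4", "BJL0", "word", "dark"),
                 ("Timit:AG1", "Ann5", "AKS0", "word", "suit")])

# All annotations of "dark" by speakers from dialect region 1 who are 64 inches tall.
hits = con.execute("""
    SELECT FEATURE.ANNOTATIONID, SPKRINFO.ID
    FROM FEATURE JOIN SPKRINFO ON SPKRINFO.ID = FEATURE.SPKRID
    WHERE SPKRINFO.DR = 1 AND SPKRINFO.Ht = 64 AND FEATURE.VALUE = 'dark'
""").fetchall()
```

Without some join condition the query would pair every matching feature row with every matching speaker, so a real schema needs a link of this kind between annotation graphs and speaker metadata.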

Page 68: Annotation as Algebra: a formal framework for linguistic annotation

AG software

- AGTK provides API and language bindings
  - version 2.0 recently released
- Sample applications
- Open-source license
- Available on sourceforge:

Page 69: Annotation as Algebra: a formal framework for linguistic annotation

AGTK architecture

Page 70: Annotation as Algebra: a formal framework for linguistic annotation

API Summary

- Functions for creating, accessing, modifying, storing and loading AGs
- C++ library
- Compiles on Unix and Windows
- Scripting language access: Python, Tcl/tk

Page 71: Annotation as Algebra: a formal framework for linguistic annotation

File I/O Library

- Approach:
  - build import methods for all widely used formats
  - public API & documentation to encourage others to contribute code for their formats
- Currently supported:
  - AIF (ATLAS Interchange Format - XML)
  - BAS, BU, CALLHOME, CSV, Switchboard, TIMIT, Treebank, xlabel

Page 72: Annotation as Algebra: a formal framework for linguistic annotation

Integration with other tools

Example: WaveSurfer/SNACK
Sjölander and Beskow
www.speech.kth.se/wavesurfer/

- open source software for sound visualization, analysis and manipulation
- Linux, Windows 95/98/NT/2k, Mac, Solaris, ...
- customizable, extensible, embeddable
- can read and write: wav, au, aiff, mp3, csl, sd, sphere
- unlimited file size
- Unicode support

Page 73: Annotation as Algebra: a formal framework for linguistic annotation

Wavesurfer Screenshot 1

Page 74: Annotation as Algebra: a formal framework for linguistic annotation

Wavesurfer Screenshot 2

Page 75: Annotation as Algebra: a formal framework for linguistic annotation

Wavesurfer Screenshot 3

Page 76: Annotation as Algebra: a formal framework for linguistic annotation

Wavesurfer Screenshot 4

Page 77: Annotation as Algebra: a formal framework for linguistic annotation

Annotation Component: Spreadsheet (TRAINS+DAMSL)

Annotation here presented in spreadsheet mode

- Each row is an annotation of a stretch of signal
- Each column is a type of annotation

Page 78: Annotation as Algebra: a formal framework for linguistic annotation

TableTrans tool

Seamless integration of AGTK for annotation, and Wavesurfer for audio display and playback.

Page 79: Annotation as Algebra: a formal framework for linguistic annotation

Components in TableTrans

Page 80: Annotation as Algebra: a formal framework for linguistic annotation

Another annotation GUI

Page 81: Annotation as Algebra: a formal framework for linguistic annotation

Issues for the future

- Some positive things
  - “stand-off” (rather than in-line) annotation is now common
    - though by no means universal
    - but in-line annotators mostly realize they are sinful
  - AGTK implementation is mature
    - libraries are well designed & implemented
    - good integration with GUIs and DB backends
    - can read/write many common formats
  - Some AG-based tools are good
    - basically, those that have really been used
    - demand pull & influence of users on development

Page 82: Annotation as Algebra: a formal framework for linguistic annotation

Issues for the future

- Some things need more work
  - AG API and AGTK are not yet widely used
  - Many AG-based tools are rough sketches
  - NIST ATLAS is not popular with researchers (java, complexity)
  - For many projects, something simpler & less general is still the local optimum:
    - lines of tab-separated fields, or
    - in-line mark-up (XML or ad hoc), or
    - other legacy or new ad hoc formats
- but it’s still early days...