Feb 23, 2005 1 Interlingua Annotation of Multilingual Corpora (IAMTC) Project Lori Levin and Teruko Mitamura Language Technologies Institute Carnegie Mellon University


Feb 23, 2005 1

Interlingua Annotation of Multilingual Corpora

(IAMTC) Project

Lori Levin and Teruko Mitamura

Language Technologies Institute

Carnegie Mellon University


IAMTC project members

Collaboration: New Mexico, Maryland, Columbia, MITRE, CMU, ISI

Members: Bonnie Dorr (Maryland), David Farwell (NMSU), Rebecca Green (Maryland), Nizar Habash (Columbia), Stephen Helmreich (NMSU), Eduard Hovy (ISI), Lori Levin (CMU), Keith Miller (MITRE), Teruko Mitamura (CMU), Owen Rambow (Columbia), Flo Reeder (MITRE), Advaith Siddharthan (Columbia)


IL-Annotation Outcomes

• IL design
  – Three levels of depth: IL0, IL1, and IL2
• Annotation methodology
  – Manuals, tools, evaluations
• Annotated parallel texts
  – Foreign language original and multiple English translations
  – Foreign languages: Arabic, French, Hindi, Japanese, Korean, Spanish


Uniqueness of Annotation Effort

• Multi-parallel
  – Three versions of each text
    • Original language and two English translations
  – Shows multiple surface realizations of the same meaning
• Multi-lingual
  – Each text is in at least two languages (English and one other)
  – The methodology is applied to multi-parallel corpora in six languages.
    • Arabic, French, Hindi, Japanese, Korean, Spanish


Motivation

• Interlingua designed for MT
  – Multiple English translations of same source show translation divergences. Some phenomena:
    • Lexical level: word changes
    • Syntactic level: phrasing, thematization, nominalization
    • Semantic level: additional/different content
    • Discourse level: multi-clause structure, anaphor
    • Pragmatic level: Speech Acts, implicatures, style, interpersonal
• Causes of divergence
  – Genuine ambiguity/vagueness of source meaning
  – Translator error/reinterpretation


IL Development: Staged, deepening
• IL0:
  – Shows simple dependency structure
• IL1:
  – Replace open class lexical items with concept names
  – Replace grammatical relation labels with semantic role labels
• IL2: (under development)
  – Separates shared portions and unresolved portions of divergent sentences


Details of IL0
• Deep syntactic dependency representation:
  – Removes auxiliary verbs, determiners, and some function words
  – Normalizes passives, clefts, etc.
  – Removes strongly governed prepositions
  – Includes syntactic roles (Subj, Obj)


Construction of IL0

• Dependency parsers
  • Connexor (English), Tapanainen and Jarvinen, 1997
  • CaboCha (Japanese)
• Hand-corrected
• Extensive manual and instructions on IAMTC Wiki website
  – for English, Spanish, Japanese, and possibly others


Syntactic Variation Resolved at IL0

• Passive
  • The gangster killed at least 3 innocent bystanders.
  • At least 3 innocent bystanders were killed by the gangster.
• Other transitivity alternations


Example of IL0

TrEd, Pajas, 1998

Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”


Example of IL0
• Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

announced V Root
Mohamed PN Subj
Sheikh PN Mod
Defense_Minister PN Mod
who Pron Subj
also Adv Mod
of P Mod
UAE PN Obj
at P Mod
ceremony N Obj
inauguration N Mod


Details of IL1

• Associate open-class lexical items with Omega Ontology items

• Replace syntactic relations by one of approx. 20 semantic (theta) roles (from Dorr) e.g., AGENT, THEME, GOAL, INSTR…

• No treatment of prepositions, quantification, negation, time, modality, idioms, proper names, NP-internal structure…

• Nodes may receive more than one concept
  – Average: about 1.2


Construction of IL1

• TIAMAT annotation tool

• Manual for converting IL0 to IL1 is available


Syntactic Variation Resolved at IL1

• Lexical Synonymy
  – The toddler sobbed, and he attempted to console her.
  – The baby wailed, and he tried to comfort her.
• Thematic Divergence
  – Bob enjoys playing with his kids.
  – Playing with his kids pleases Bob.


Example of IL1

Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”


Example of IL1: internal representation

The study led them to ask the Czech government to recapitalize CSA at this level.

[3, lead, V, lead, Root, LEAD<GET, GUIDE]
[2, study, N, study, AGENT, SURVEY<WORK, REPORT]
[4, they, N, they, THEME, ---, ---]
[6, ask, V, ask, PROPOSITION, ---, ---]
[9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION]
[8, Czech, Adj, Czech, MOD, CZECH~CZECHOSLOVAKIA, ---]
[11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST]
[12, csa, N, csa, THEME, AIRLINE<LINE, ---]
[16, at, P, value_at, GOAL, ---, ---]
[15, level, N, level, ---, DEGREE, MEASURE]
[14, this, Det, this, ---, ---, ---]

Semantic Roles

Concepts from the Omega Ontology
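
The bracketed records on this slide can be read mechanically. The following is a minimal Python sketch, assuming each record holds seven comma-separated fields (node index, word, part of speech, lemma, semantic role, and two concept slots) and that `---` marks an empty slot. The class name `Il1Node`, the field names, and the function `parse_il1` are illustrative labels chosen here, not names from the IAMTC tools.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Il1Node:
    index: int            # position of the word in the sentence
    word: str
    pos: str              # part of speech, e.g. V, N, Adj
    lemma: str
    role: Optional[str]   # semantic role, or None when the slot is "---"
    concepts: List[str]   # non-empty concept slots, in source order


def parse_il1(text: str) -> List[Il1Node]:
    """Parse records of the form [idx, word, pos, lemma, role, c1, c2]."""
    nodes = []
    for body in re.findall(r"\[([^\]]*)\]", text):
        fields = [f.strip() for f in body.split(",")]
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {body!r}")
        idx, word, pos, lemma, role, c1, c2 = fields
        nodes.append(
            Il1Node(
                index=int(idx),
                word=word,
                pos=pos,
                lemma=lemma,
                role=None if role == "---" else role,
                concepts=[c for c in (c1, c2) if c != "---"],
            )
        )
    return nodes


sample = (
    "[3, lead, V, lead, Root, LEAD<GET, GUIDE] "
    "[4, they, N, they, THEME, ---, ---] "
    "[15, level, N, level, ---, DEGREE, MEASURE]"
)
nodes = parse_il1(sample)
```

Keeping the concept slots as a list makes it easy to count how many concepts each node received, which is the quantity behind the "about 1.2" average quoted earlier.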


Tiamat: annotation interface

For each new sentence:

For each word to be annotated (shown with dependents)


Tiamat: annotation interface

For each new sentence:

Candidate concepts Step 1: find Omega concepts for objects and events


Tiamat: annotation interface (note: similarity to PDT annotation interface)

For each new sentence:

Candidate concepts Step 1: find Omega concepts for objects and events

Step 2: select event frame (theta roles)


Details of IL2
• Start capturing meaning:
  – Handle proper names: one of around 5 classes (PERSON, LOCATION, TIME, ORGANIZATION…)
  – Conversives (buy vs. sell) at the FrameNet level
  – Non-literal language usage (open the door to customers vs. start doing business)
  – Extended paraphrases involving syntax, lexicon, grammatical features
  – Possible incorporation of other ‘standardized’ notations for temporal and spatial expressions
• Still excluded:
  – Quantification and negation
  – Discourse structure
  – Pragmatics


Variation Resolved at IL2

• Morphological Derivation
  – I was surprised that he destroyed the old house.
  – I was surprised by his destruction of the old house.
• Differences in clause subordination
  – This is Joe’s new car, which he bought in New York.
  – This is Joe’s new car. He bought it in New York.
• N-N Compounds
  – She loves velvet dresses.
  – She loves dresses made of velvet.


IL2 (continued)

• Head Switching
  – Mike Mussina excels at pitching.
  – Mike Mussina pitches well.
  – Mike Mussina is a good pitcher.
• Lexical Conflation
  – Lindbergh flew across the Atlantic Ocean.
  – Lindbergh crossed the Atlantic Ocean by plane.


Not normalized

• Comparatives vs. Superlatives
  – He’s smarter than everybody else.
  – He’s the smartest one.
• Different Sentence Types
  – Who composed the Brandenburg Concertos?
  – Tell me who composed the Brandenburg Concertos.
• Inverse Relationship
  – Only 20% of the participants arrived on time.
  – 80% of the participants were late.
• Inference
  – The Porto player kicked the ball into the net.
  – The Porto player scored a goal.
• Viewpoint Variation
  – Stop getting in the way.
  – Stop trying to help.


Note from Lori

• In my version of Powerpoint the color blocks on the next slide don’t line up with the text correctly.

• I didn’t have time to fix it, so I inserted the other version of the same slide.

• If you have time to fix the color box version, then you can delete the two slides after that.

• Otherwise, you can delete the color box version.


Theoretical goal: Getting at meaning

• Semantically identical

K1E1: Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF. … The Subscribers cannot switch again to another provider for the first 3 months, but they can cancel the switch in 14 days if they are not satisfied with services like voice quality.

K1E2: Starting January 1st of next year, customers of SK Telecom can change their service company to LG Telecom or KTF … Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they can cancel the change anytime within 14 days if problems such as poor call quality are experienced.

• Semantically equivalent
• Additional/less information
• Semantically different:
• Different information


Getting at Meaning (Two translations of Korean original text)

Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF.

The Subscribers cannot switch again to another provider for the first 3 months, but they can cancel the switch in 14 days if they are not satisfied with services like voice quality.

Starting January 1st of next year, customers of SK Telecom can change their service company to LG Telecom or KTF … Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they can cancel the change anytime within 14 days if problems such as poor call quality are experienced.


Color Key

• Black: same meaning and same expression

• Green: small syntactic difference

• Blue: Lexical difference

• Red: Not contained in the other text

• Purple: Larger difference.
  – Need to use some inference to know that the meaning is the same


Getting at meaning (Two translations of a Japanese original text)

• This year, • too, • in addition to • the birth • of Mitsubishi Chemical, • which has already been announced, • other rather large-scale mergers • may continue, • and be recorded • as a "year of mergers."

• This year, • which has already seen • the announcement • of the birth • of Mitsubishi Chemical Corporation • as well as • the continuous • numbers of big mergers, • may • too • be recorded • as the “year of the merger” • for all we know.

More lexical similarity.

More differences in dependency relations.


Common Aspects of Meaning

This year, too, in addition to the birth of Mitsubishi Chemical, which has already been announced, other rather large-scale mergers may continue, and be recorded as a "year of mergers."

This year, which has already seen the announcement of the birth of Mitsubishi Chemical Corporation as well as the continuous numbers of big mergers, may too be recorded as the “year of the merger” for all we know.

• Big mergers continue this year

• Mergers continue in addition to the birth of Mitsubishi Chemical

• Birth of Mitsubishi Chemical

• Someone announces the birth of Mitsubishi Chemical

• Someone records this year as the year of the merger


Divergences that can be resolved

This year, too, in addition to the birth of Mitsubishi Chemical, which has already been announced, other rather large-scale mergers may continue, and be recorded as a "year of mergers."

This year, which has already seen the announcement of the birth of Mitsubishi Chemical Corporation as well as the continuous numbers of big mergers, may too be recorded as the “year of the merger” for all we know.

• Mergers are big

• Someone announces the birth of Mitsubishi Chemical

• Someone records something as the year of the merger


Benefits for Other Projects

• MT

• Question Answering

• Summarization

• Information Retrieval

• Information Extraction

• Text Mining

• Etc.


Approaches to Evaluation
• Inter-annotator agreement: completed
• Sentence generation from extracted annotation structure
• Comparison of interlingual structures (graph comparisons)
• Ontology growth (or shrinkage) rate (per unit of text)
  – Competing goals:
    • Addressing coverage gaps (1/3 of open class words marked as having no concept)
    • Omega seems too rich: Hard to distinguish between senses; Granularity of concept selection


Inter-annotator Agreement

• Is the IL sufficiently defined to permit consistent annotation?
  – Ontology
  – Theta-roles
  – Coverage and precision


Evaluation webpage


Inter-annotator agreement

• Difficulty is that more than one sense can be selected for a given annotation
  – Standard kappa does not apply in this case
• Two alternatives for calculating expected probability of agreement:
  – Agreement and kappa for positive senses
  – Agreement and kappa for all senses
• Both were explored
  – Positive sense agreement, kappa shown here


Positive agreement annotations

• Construct a table for each word:
  – For each annotator and each sense, whether or not that sense was selected by that annotator

          S1   S2   S3   S4   …   Sn
  A1       1    0    1    0   …    0
  A2       0    1    1    0   …    0
  N(S)     1    1    2    0   …    0

• Calculate agreement = $\frac{\sum_{i=1}^{n} N(S_i)\,(N(S_i) - 1)}{\sum_{i=1}^{n} N(S_i)}$

• Calculate kappa using Monte Carlo simulation of P(E)
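
The calculation on this slide can be sketched in Python. The sketch reads the agreement formula as the sum of N(S_i)(N(S_i) - 1) over senses divided by the sum of N(S_i). The slide does not say how P(E) is simulated, so the sampling scheme below is an assumption: each annotator keeps the same number of selected senses but picks them uniformly at random, and kappa is then the usual (P(A) - P(E)) / (1 - P(E)). The function names are hypothetical.

```python
import random
from typing import List


def positive_agreement(table: List[List[int]]) -> float:
    """table[a][i] is 1 if annotator a selected sense i, else 0."""
    counts = [sum(column) for column in zip(*table)]  # N(S_i) per sense
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(c * (c - 1) for c in counts) / total


def monte_carlo_kappa(table: List[List[int]], trials: int = 10000, seed: int = 0) -> float:
    """Estimate P(E) by random re-selection (assumed scheme), then compute kappa."""
    rng = random.Random(seed)
    n_senses = len(table[0])
    picks = [sum(row) for row in table]  # senses chosen by each annotator
    observed = positive_agreement(table)
    expected = 0.0
    for _ in range(trials):
        simulated = []
        for k in picks:
            chosen = set(rng.sample(range(n_senses), k))
            simulated.append([1 if i in chosen else 0 for i in range(n_senses)])
        expected += positive_agreement(simulated)
    expected /= trials
    if expected >= 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# The two-annotator example from the slide, truncated to four senses.
example = [
    [1, 0, 1, 0],  # A1
    [0, 1, 1, 0],  # A2
]
agreement = positive_agreement(example)
kappa = monte_carlo_kappa(example)
```

For the slide's example the sense counts are 1, 1, 2, 0, so the agreement is 2 / 4 = 0.5: of the four selections made, the two on S3 coincide.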


Evaluation results – positive examples

Column groups: (1) annotators who finished 95% of their annotations; (2) annotators who finished 90% of their annotations; (3) annotators who finished 50% of their annotations; (4) all annotators.

               (1)                     (2)                     (3)                     (4)
               A#    APA     Kappa     A#    APA     Kappa     A#    APA     Kappa     A#    APA     Kappa
Mikrokosmos    3.50  0.7445  0.7432    4.42  0.7310  0.7296    6.33  0.6105  0.6085    9.42  0.4552  0.4540
WordNet        6.08  0.6600  0.6565    7.00  0.6538  0.6502    8.33  0.5982  0.5941    9.42  0.5174  0.5125
Theta Roles    5.75  0.5378  0.5089    6.58  0.5492  0.5210    8.00  0.4845  0.4522    9.42  0.3924  0.3544


All cases count

• Count 0,0 and 1,1 agreements – T00 , T11

• Count 0,1 and 1,0 disagreements – T10 , T01

• Count number of 0 & 1 for annotators 1 & 2 - A01, A11; A02, A12

• Divide all counts by number of senses

• Agreement = T00 + T11

• Kappa = 2 * ((T00 * T11) – (T10 * T01)) / ((A01 * A12) + (A02 * A11)) [marginal prob.]
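
A short Python sketch of the all-cases calculation for one pair of annotators follows. It assumes that in Txy the first digit is annotator 1's choice and the second is annotator 2's, and that A0k and A1k are the proportions of 0s and 1s given by annotator k; the slide does not spell out this convention. The function name is hypothetical. The final lines check that the slide's formula matches the textbook form of Cohen's kappa, (Po - Pe) / (1 - Pe), on a small example.

```python
from typing import List, Tuple


def all_cases_agreement_kappa(a1: List[int], a2: List[int]) -> Tuple[float, float]:
    """a1, a2: 0/1 selections by two annotators over the same list of senses."""
    if len(a1) != len(a2) or not a1:
        raise ValueError("annotators must rate the same non-empty set of senses")
    n = len(a1)
    pairs = list(zip(a1, a2))
    # Cell proportions: first digit = annotator 1, second digit = annotator 2.
    t00 = pairs.count((0, 0)) / n
    t11 = pairs.count((1, 1)) / n
    t10 = pairs.count((1, 0)) / n
    t01 = pairs.count((0, 1)) / n
    # Marginal proportions of 0s and 1s for each annotator.
    a01, a11 = t00 + t01, t10 + t11   # annotator 1
    a02, a12 = t00 + t10, t01 + t11   # annotator 2
    agreement = t00 + t11
    denominator = a01 * a12 + a02 * a11
    if denominator == 0:
        return agreement, float("nan")  # both annotators gave a single identical label
    kappa = 2 * (t00 * t11 - t10 * t01) / denominator
    return agreement, kappa


ann1 = [1, 1, 0, 0, 0]
ann2 = [1, 0, 0, 0, 0]
agree, kappa = all_cases_agreement_kappa(ann1, ann2)

# Cross-check against (Po - Pe) / (1 - Pe) using the same marginals.
p1, p2 = sum(ann1) / len(ann1), sum(ann2) / len(ann2)
chance = p1 * p2 + (1 - p1) * (1 - p2)
reference_kappa = (agree - chance) / (1 - chance)
```

Because most senses are selected by neither annotator, T00 dominates the counts, which is consistent with the agreement figures being high while kappa stays moderate in the results that follow.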


All Cases Agreement / Kappa

               Zero-Pairs   All cases          Exclude zero-pairs
                            Agree    Kappa     Agree    Kappa
Theta Roles    78.58        0.945    0.418     0.943    0.392
WordNet        112.16       0.886    0.564     0.879    0.534
Mikrokosmos    258.5        0.811    0.522     0.784    0.433


Annotation Issues

1. Post-annotation consistency checking
  – Novice annotators may make inconsistent annotations within the same text.
  – Intra-annotator consistency checking procedure
    • e.g. If two nodes in different sentences are co-indexed, then annotators must ensure that the two nodes carry the same meaning in the context of the two different sentences

2. Post-annotation reconciliation


Post-annotation reconciliation
• Question: How much can annotators be brought into agreement?
• Procedure:
  – Annotator sees all annotations, votes Yes/Maybe/No on each
  – Annotators then discuss all differences (telephone conf)
  – Annotators then vote again, independently
  – We collapse all Yes and Maybe votes, compare them with No to identify all serious disagreement
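
The final collapsing step of the procedure can be sketched in Python. The sketch assumes a "serious disagreement" is any annotation that, after collapsing Yes and Maybe into one accepting class, still has at least one accepting vote and at least one No. The data layout and the name `serious_disagreements` are hypothetical.

```python
from typing import Dict, List

ACCEPT = {"Yes", "Maybe"}
REJECT = {"No"}


def serious_disagreements(votes: Dict[str, Dict[str, str]]) -> List[str]:
    """votes maps annotation id -> {annotator name: "Yes" | "Maybe" | "No"}."""
    flagged = []
    for annotation_id, by_annotator in votes.items():
        collapsed = set()
        for vote in by_annotator.values():
            if vote in ACCEPT:
                collapsed.add("accept")
            elif vote in REJECT:
                collapsed.add("reject")
            else:
                raise ValueError(f"unknown vote {vote!r} on {annotation_id}")
        if len(collapsed) > 1:  # accepting and rejecting votes coexist
            flagged.append(annotation_id)
    return flagged


second_round = {
    "node-3/GUIDE": {"ann1": "Yes", "ann2": "Maybe", "ann3": "Yes"},
    "node-3/LEAD<GET": {"ann1": "Yes", "ann2": "No", "ann3": "Maybe"},
    "node-9/AUTHORITIES": {"ann1": "No", "ann2": "No", "ann3": "No"},
}
flagged = serious_disagreements(second_round)
```

Only the second annotation is flagged here: a mix of Yes and Maybe counts as agreement, and a unanimous No is also agreement.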


Results of Reconciliation

• Annotators derive common methodology

• Small errors and oversights removed during discussion

• Inter-annotator agreement improved

• Serious problems of interpretation or error identified


Annotation across Translations

Question: How different are the translations?
• Procedure:
  – Annotator sees annotations across both translations, identifies differences of form and meaning
  – Annotator selects ‘true’ meaning(s)
• Results (work still in progress):
  – Impacts ontology richness/conciseness
  – Improvement in Interlingua representation ‘depth’
  – Useful for IL2 design development
• Observations:
  – This is very hard work
  – Methodology unclear: what is seen first, how to show alternatives, what to do with results…


Outcomes: how have we done?
• IL design
  – IL0 and IL1 finished
  – IL2 in the works
• Annotation methodology
  – Manuals for IL0 in at least three languages
  – Manual for converting IL0 to IL1
  – Annotation tools for IL0 and IL1
  – Evaluation of inter-coder agreement
  – Procedure for annotator reconciliation
• Around 144 annotated parallel texts in IL0 and IL1
  – Six texts from six different source languages
  – Two English translations of each text
  – 10-12 annotators for each text


Next Steps

• Foreign language annotation standards and tools

• Development of IL2

• Addressing coverage gaps (1/3 of open class words marked as having no concept)


Contact information

• URLs and Wiki pages:
  – Project website: http://aitc.aitcnet.org/nsf/iamtc/
  – PIs: http://sparky.umiacs.umd.edu:8000/IAMTC/IAMTC.wiki
  – Annotators: http://sparky.umiacs.umd.edu:8000/IAMTC-Annotator/IAMTC-Annotator.wiki
• Text Annotation: anyone interested to try???
  – Download the tools
  – Download the texts
  – Have fun (if you’re so inclined!)…


Extra Slides


IAMTC Tasks
• Interlingua Content Development
  – Three level design: IL0, IL1, IL2 (and possibly more…)
  – Linguistic/semantic divergences
    • Noun-noun compound
    • Thematic roles
    • Named entities and Time expressions
    • Conjunctions
    • Ontology reduction
• Tool Development
• Evaluation Methodology
• Annotation of 7 languages