Active Annotation of Corpora Kepa J. Rodriguez Text Analysis Seminar at the Göttingen Center of Digital Humanities 02.05.2012

Page 1: Active Annotation of Corpora

Active Annotation of Corpora
Kepa J. Rodriguez
Text Analysis Seminar at the Göttingen Center of Digital Humanities
02.05.2012

Page 2: Active Annotation of Corpora

Outline

• Goal of the presentation.
• The LUNA corpus.
• Active annotation.
  – Concept.
  – Algorithm.
  – Evaluation.
• Potential use of Active Annotation in humanities projects.

Page 3: Active Annotation of Corpora

Goal of the presentation

• Introduce the concepts of:
  – Active Learning.
  – Active Annotation.
• Present their use in the annotation of the LUNA corpus.
• Discuss the utility of Active Annotation in humanities projects.

Page 4: Active Annotation of Corpora

The LUNA Corpus (1)

• The corpus consists of:
  – 3000 Human-Human and 8100 WOZ (Wizard of Oz) dialogues.
  – Multiple annotation levels: POS, entities, coreference, predicate structure, dialogue acts, etc.
  – Dialogues in French, Italian and Polish.
• French subcorpus:
  – Application domains: travel information and reservation, IT help desk, telecom customer care, and financial information transaction.
  – Human-Machine dialogues: 7100.
• Italian subcorpus:
  – Application domain: IT help desk.
  – 2500 Human-Human and 500 WOZ dialogues.
• Polish subcorpus:
  – Application domain: public transportation information.
  – 500 Human-Human and 500 WOZ dialogues.

More information about the annotation scheme and levels: http://www.ist-luna.eu/pdf/schemepresentationPdm.pdf

Page 5: Active Annotation of Corpora

The LUNA Corpus (2)

[Operator:] allora m'ha detto che [non riusciva]c1 ad [accedere]c2 [al computer]c3 e [le manca]c4 [la procedura]c5
so, you have told me that you cannot access the computer, and that you need the procedure

c1 trouble : unable_to
c2 action : access
c3 computer-hardware : pc
c4 trouble : lack_of
c5 computer-software : procedure

[Caller:] esatto
exactly

[Operator:] allora avrei bisogno [dell' RWS]c6 [del PC]c7
so I need the RWS of the computer

c6 code-identificationCode : rws
c7 computer-hardware : pc

[Caller:] si allora [tredici zero ottantasei]c8
yes, 13 0 86

c8 code-identificationCode-rws : 13086
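
The inline notation of this example can be read mechanically. The following is only a sketch, assuming that each annotated span is written as "[text]cN" exactly as on this slide; the pattern, the function name `extract_concepts`, and the `labels` dictionary are illustrative and are not part of the LUNA tooling.

```python
import re

# Assumed pattern for the slide's inline notation: "[span]cN".
# Speaker tags such as "[Operator:]" are skipped because no "cN" follows them.
SPAN = re.compile(r"\[([^\]]+)\]c(\d+)")

def extract_concepts(utterance, labels):
    """Return (id, surface text, concept label) triples for one utterance."""
    return [
        (f"c{num}", text, labels.get(f"c{num}"))
        for text, num in SPAN.findall(utterance)
    ]

utterance = "[Operator:] allora avrei bisogno [dell' RWS]c6 [del PC]c7"
labels = {
    "c6": "code-identificationCode : rws",
    "c7": "computer-hardware : pc",
}
concepts = extract_concepts(utterance, labels)
```

Each triple keeps the concept id, the annotated words, and the attribute-value label together, which is the unit an annotator checks.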

Page 6: Active Annotation of Corpora

Active annotation (1)

Components of active annotation:
• Active learning paradigm:
  – Selection of examples for annotation.
• Potential error detection:
  – Cases in which the manual annotation seems to be ambiguous or contradictory.

Page 7: Active Annotation of Corpora

Active annotation (2)

• Active learning paradigm:
  – A paradigm based on statistical learning.
  – A first small set is randomly chosen and manually annotated.
  – This set is used to train a model and annotate the rest of the samples.
  – The most informative examples are selected to update the statistical model.
    • Most informative = lowest confidence score.
• Use of active learning:
  – Speed up annotation.
  – Support annotators in their work.
  – Select examples to be annotated: which examples from a large amount of data will be useful for my purposes?
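
The selection rule "most informative = lowest confidence score" can be sketched in a few lines. This is a minimal illustration, assuming the model exposes one confidence value per example; the dialogue ids and scores are invented for the example.

```python
def select_least_confident(confidences, k):
    """Return the ids of the k examples with the lowest confidence score."""
    ranked = sorted(confidences, key=confidences.get)  # ascending confidence
    return ranked[:k]

# Hypothetical confidence scores produced by a model for four dialogues.
scores = {"dlg-01": 0.93, "dlg-02": 0.41, "dlg-03": 0.77, "dlg-04": 0.18}
batch = select_least_confident(scores, k=2)
```

Here `batch` contains the two dialogues the model is least sure about, which are the ones sent to the annotators next.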

Page 8: Active Annotation of Corpora

Active annotation (3)

Learning curve comparison: active vs. random learning (Riccardi and Hakkani-Tür, 2005)

Page 9: Active Annotation of Corpora

Active annotation (4)

• Likely error detection:
  – Re-annotate the training data using the statistical model.
  – Extract the examples in which the manual and the automatic annotation differ.
  – Send them to human supervision.
• Use of the likely error detection:
  – If the manual annotation is correct, the example is hard to learn:
    • Analyze which new features can be implemented to enrich the model.
  – If the annotation is erroneous:
    • Correct it.
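
The extraction step above amounts to a comparison between two label sets. A minimal sketch, assuming both annotations are stored as dictionaries keyed by example id (the ids and labels below are made up for illustration):

```python
def likely_errors(manual, automatic):
    """Return the ids whose manual and automatic labels differ."""
    return [ex for ex, label in manual.items() if automatic.get(ex) != label]

# Hypothetical manual labels and the model's re-annotation of the same data.
manual = {"c1": "trouble", "c2": "action", "c3": "computer-hardware"}
automatic = {"c1": "trouble", "c2": "trouble", "c3": "computer-hardware"}
flagged = likely_errors(manual, automatic)
```

The flagged ids are only candidates: a human still decides whether each one is an annotation error or a hard-to-learn example.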

Page 10: Active Annotation of Corpora

Annotation algorithm

1. Randomly select a small number of dialogues and annotate them manually from scratch (SL).
2. Train a model M using SL.
3. While (labeler/data available):
   a) Use M to automatically annotate the unannotated part of the corpus (Su).
   b) Rank the automatically annotated examples of Su according to the confidence measure given by M.
   c) Select a batch of k dialogues with the lowest score (Sk).
   d) Ask for human control/correction on Sk.
   e) Use M to automatically annotate SL and produce SaL.
   f) Look at the differences between SL and SaL:
      i. HARD TO LEARN EXAMPLE: Add new features when training M.
      ii. ANNOTATION AMBIGUITIES: Hire human annotators to disambiguate SL.
   g) SL = SL + Sk.
   h) Train a new model M with SL.
   i) Go to 3.a).
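
The loop can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the LUNA implementation: the word-vote model in `train` and `predict` is a stand-in for the statistical model M, `oracle` plays the human annotator, and the utterances and labels are invented. Step 3f is reduced to collecting the disagreements, since deciding between "hard to learn" and "annotation ambiguity" is left to humans.

```python
import random
from collections import Counter, defaultdict

def train(labeled):
    """Toy stand-in for M: count, per word, the labels it occurred with."""
    votes = defaultdict(Counter)
    for text, label in labeled.items():
        for word in text.split():
            votes[word][label] += 1
    return votes

def predict(model, text):
    """Return (label, confidence); confidence is the winning vote share."""
    tally = Counter()
    for word in text.split():
        tally.update(model.get(word, {}))
    if not tally:
        return None, 0.0  # nothing known about this example
    label, count = tally.most_common(1)[0]
    return label, count / sum(tally.values())

def active_annotation(pool, oracle, seed_size=2, k=2, rounds=3, rng=None):
    rng = rng or random.Random(0)
    pool = list(pool)
    # Step 1: a small random set annotated from scratch (SL).
    labeled = {x: oracle(x) for x in rng.sample(pool, seed_size)}
    model = train(labeled)                                # step 2
    suspects = []
    for _ in range(rounds):                               # step 3
        unlabeled = [x for x in pool if x not in labeled]  # Su
        if not unlabeled:
            break
        # 3a-3c: annotate Su, rank by confidence, take the k lowest (Sk).
        ranked = sorted(unlabeled, key=lambda x: predict(model, x)[1])
        batch = ranked[:k]
        # 3d: human control/correction, simulated by the oracle.
        corrected = {x: oracle(x) for x in batch}
        # 3e-3f: re-annotate SL and keep the disagreements for review.
        suspects = [x for x, y in labeled.items() if predict(model, x)[0] != y]
        labeled.update(corrected)                         # 3g
        model = train(labeled)                            # 3h
    return model, labeled, suspects

# Invented gold labels standing in for the human annotator.
gold = {
    "cannot access computer": "trouble",
    "need the procedure": "trouble",
    "give me the code": "request",
    "what is the code": "request",
    "cannot print": "trouble",
    "give me the password": "request",
}
model, labeled, suspects = active_annotation(gold, gold.get)
```

With six examples, a seed of two, and batches of two, the loop annotates the whole pool in two rounds and stops in the third because nothing is left to select.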

Page 11: Active Annotation of Corpora

Evaluation (2)

• Annotator point of view:
  – Annotation from scratch: 80-90 minutes/file.
  – Supervision after the 3rd active annotation loop: 20-25 minutes/file.
  – Annotators concentrate more on:
    • Difficult/interesting issues.
    • Giving feedback about the model.
• Error detection: no statistics.
  – Most of the reported feedback requests were annotation errors.
  – Some of the reported feedback requests were caused by ambiguities and helped to add features to enrich the model.

Page 12: Active Annotation of Corpora

Evaluation (1)

• Wizard of Oz dialogues

Act-turn   Size in turns   Error rate
1          200             59.2%
2          400             44.4%
3          600             39.3%
4          800             6.4%
5          1200            0.0%

• Human-human dialogues

Act-turn   Size in dialogues   Error rate
1          10                  71.2%
2          20                  59.5%
3          30                  54.0%
4          40                  51.1%
5          60                  45.7%
6          80                  42.4%

Page 13: Active Annotation of Corpora

Discussion

• Questions.
• Annotation tasks in the GCDH:
  – Corpus of Coptic Texts.
  – …..

Page 14: Active Annotation of Corpora

References

• LUNA project: http://www.ist-luna.eu
• Raymond, Rodriguez and Riccardi (2008): Active Annotation in the LUNA Italian Corpus of Spontaneous Dialogues. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco.
• Riccardi, G. and Hakkani-Tür, D. (2005): Active learning: theory and applications to automatic speech recognition. In IEEE Transactions on Speech and Audio Processing.

Page 15: Active Annotation of Corpora

Text Analysis Seminar at the Göttingen Center of Digital Humanities

Thanks!!!