
Using Machine Learning to Annotate Data for NLP Tasks Semi-Automatically

Gerhard B. van Huyssteen, Martin J. Puttkammer, Suléne Pilon and Hendrik J. Groenewald

Centre for Text Technology (CTexT)
Research Unit: Languages and Literature in the South African Context
North-West University, Potchefstroom Campus (PUK), South Africa
{Gerhard.VanHuyssteen; Martin.Puttkammer; Sulene.Pilon; Handre.Groenewald}@nwu.ac.za

30 September 2007; Borovets


Overview

• Introduction

• End-User Requirements

• Solution: Design & Implementation

• Evaluation

• Conclusion


Human Language Technologies

• HLTs depend on the availability of linguistic data
  – Specialized lexicons
  – Annotated and raw corpora
  – Formalized grammar rules
• Creating such resources is expensive and protracted
  – Especially for less-resourced languages


Less-resourced Languages

• "languages for which few digital resources exist; and thus, languages whose computerization poses unique challenges. [They] are languages with limited financial, political, and legal resources… " (Garrett, 2006)

• Implicit in this definition:– Lacks human resources (little attention in research or discussions)– Lacks computational linguists working on these languages

• Research question:– How could one facilitate development of linguistic data by

enabling non-experts to collaborate in the computerization of less-resourced languages?


Methodology I

• Empower linguists and mother-tongue speakers to deliver annotated data
  – of high quality
  – in the shortest possible time
• Accelerate the annotation of linguistic data by mother-tongue speakers through
  – user-friendly environments
  – bootstrapping
  – machine learning instead of rule-based techniques


Methodology II

• The general idea:
  – development of gold standards
  – development of annotated data
  – bootstrapping
• With the click of a button:
  – annotate data
  – train a machine-learning algorithm


Central Point of Departure I

• Annotators are invaluable resources
• Based on our experience with less-resourced languages, annotators
  – mostly have word-processing skills
  – are used to a GUI-based environment
  – usually have limited skills in a computational or programming environment
• In the worst cases, annotators have difficulties with
  – file management
  – unzipping
  – proper encoding of text files


Central Point of Departure II

• Aim of this project: enabling annotators to focus on what they are good at, namely enriching data with expert linguistic knowledge
• Training the machine learner occurs automatically


End-user Requirements I

• Unstructured interviews with four annotators:
  1. What do you find unpleasant about your work as an annotator?
  2. What will make your life as an annotator easier?


End-user Requirements II

1. What do you find unpleasant about your work as an annotator?
  – Repetitiveness
    • leads to a lack of concentration/motivation
  – Feeling "useless"
    • annotators do not see results


End-user Requirements III

2. What will make your life as an annotator easier?
  – A friendly environment (i.e. GUI-based, not lists of words)
  – Bite-sized chunks of data rather than endless lists
  – Correcting data rather than annotating from scratch
    • the program should already suggest a possible annotation
    • click or drag to annotate
  – Reference works need to be available
  – Automatic data management


Solution: TurboAnnotate

• A user-friendly annotation environment
  – bootstrapping with machine learning
  – for creating gold standards/annotated lists
• Inspired by DictionaryMaker (Davel and Peche, 2006) and Alchemist (University of Chicago, 2004)


DictionaryMaker


Alchemist


[Figure: Simplified Workflow of TurboAnnotate — Start → Make Training Set → Make Gold Standard; then, in a loop: Auto-Make Classifier → Auto-Evaluate (against the Gold Standard) → Auto-Make Annotated Set → Verify Annotated Set → Continue? (Yes: verified data extend the Training Set and the loop repeats; No: End with the final Annotated Set)]


Step 1: Create Gold Standard

• Create a gold standard
  – an independent test set for evaluating classifier performance
  – 1000 random instances are used
  – the annotator only has to select one data file (a sampling sketch follows below)
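A minimal Python sketch of this step: the 1,000-instance sample size comes from the slide, while the file handling, function name and fixed seed are assumptions for illustration.

    import random

    GOLD_SIZE = 1000  # slide: 1000 random instances form the gold standard

    def make_gold_standard(base_list_path, gold_path, rest_path, seed=0):
        # Read the raw word list the annotator selected (one word per line).
        with open(base_list_path, encoding="utf-8") as f:
            words = [line.strip() for line in f if line.strip()]
        # Draw a random sample as the held-out gold standard; it is used
        # only for evaluation, never as training data (see Accuracy slide).
        random.seed(seed)
        random.shuffle(words)
        gold, rest = words[:GOLD_SIZE], words[GOLD_SIZE:]
        with open(gold_path, "w", encoding="utf-8") as f:
            f.write("\n".join(gold) + "\n")
        with open(rest_path, "w", encoding="utf-8") as f:
            f.write("\n".join(rest) + "\n")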



Step 2: Verify Annotations

• New data are sourced from the base list
  – automatically annotated by the classifier
  – presented to the annotator in the "Annotate" tab


TurboAnnotate: Annotation Environment


[Screenshot: the TurboAnnotate annotation environment — an "Annotate" tab with a queue of incoming words (e.g. vreeslik, ontstellend, misvormd, swierig, voortreflik), the current suggestion spog*ge*rig with an Accept button, already-verified items (lel*ik, prag*tig, vies*lik, ou*lik, fan*tas*ties), Incoming/Done/To Do counters, and Project, Search/Edit, Results, Options, Help, Cancel & Exit, Save & Exit and Save & Train controls]



Step 3: Verify Annotated Set

• Bootstrapping, inspired by DictionaryMaker
• 200 words per chunk; the classifier is trained in the background
• The annotator verifies each suggestion
  – click "Accept" or correct the instance
• Verified data serve as training data
• The process iterates until the desired results are reached (see the loop sketch below)
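The loop below is a minimal Python skeleton of this bootstrapping cycle, not TurboAnnotate's actual code; the 200-word chunk size is from the slide, while train, classify, verify and evaluate are assumed callbacks (in TurboAnnotate these would wrap TiMBL and the GUI).

    CHUNK = 200  # slide: 200 words per chunk

    def bootstrap(base_list, training_set, gold, train, classify, verify,
                  evaluate, target=0.99):
        # Iterate until the base list is exhausted or accuracy is acceptable.
        while base_list:
            classifier = train(training_set)          # trained in the background
            chunk, base_list = base_list[:CHUNK], base_list[CHUNK:]
            suggestions = [classify(classifier, w) for w in chunk]
            verified = verify(suggestions)            # annotator accepts or corrects
            training_set.extend(verified)             # verified data become training data
            if evaluate(classifier, gold) >= target:  # auto-evaluate on the gold standard
                break
        return training_set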


The Machine Learning System I

• Tilburg Memory-Based Learner (TiMBL)
  – wide success and applicability in the field of natural language processing
  – available for research purposes
  – relatively easy to use
• On the downside
  – it performs best with large quantities of data
• For the tasks of hyphenation and compound analysis, however, TiMBL performs well even with small quantities of data


The Machine Learning System II

• Default parameter settings are used
• Task-specific feature selection
• Performance is evaluated against the gold standard
  – for hyphenation and compound analysis, accuracy is determined at word level, not per instance (a sketch follows below)
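A sketch of the word-level measure, under the assumption that per-character classifications have already been reassembled into whole annotated words: a word only counts as correct if every split in it matches the gold standard.

    def word_level_accuracy(gold_words, predicted_words):
        # A word is correct only when the whole annotated form matches,
        # i.e. one wrong character-level decision spoils the entire word.
        pairs = list(zip(gold_words, predicted_words))
        correct = sum(1 for gold, pred in pairs if gold == pred)
        return 100.0 * correct / len(pairs)

    # e.g. word_level_accuracy(["spog-ge-rig", "ou-lik"],
    #                          ["spog-ge-rig", "oul-ik"]) == 50.0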


Features I

• All input words are converted to feature vectors
  – a splitting window
  – context of 3 positions to the left and right
• Class
  – hyphenation: indicates whether a break follows
  – compound analysis: 3 possible classes
    • + indicates a word boundary
    • _ indicates a valence morpheme
    • = indicates no break


Features II


• Example: eksamenlokaal 'examination room' (eksamen+lokaal); a feature-extraction sketch follows below
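A hedged Python sketch of how such feature vectors could be built: the window of 3 left/right context positions and the +/_/= classes are from the slides, while the padding symbol, the exact instance layout and the boundary encoding for this word are assumptions.

    CONTEXT = 3   # slide: context of 3 positions left and right
    PAD = "#"     # assumed padding symbol for word edges

    def instances(word, classes):
        # One instance per character: 3 left-context characters, the focus
        # character, 3 right-context characters, and the class that says what
        # follows the focus (+ word boundary, _ valence morpheme, = no break).
        padded = PAD * CONTEXT + word + PAD * CONTEXT
        for i, cls in enumerate(classes):
            window = padded[i:i + 2 * CONTEXT + 1]
            yield ",".join(list(window) + [cls])

    # eksamen+lokaal: assume a word boundary after the 7th character.
    word = "eksamenlokaal"
    classes = ["="] * len(word)
    classes[6] = "+"
    for line in instances(word, classes):
        print(line)  # comma-separated, one instance per line, class last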


Parameter Optimisation I

• Large variations in accuracy occur when the parameter settings of MBL algorithms are changed
• Finding the best combination of parameters
  – exhaustive searches are undesirable: slow and computationally expensive


Parameter Optimisation II

• Alternative: Paramsearch (Van den Bosch, 2005)
  – delivers combinations of algorithmic parameters that are estimated to perform well
• PSearch
  – our own modification of Paramsearch
  – only run after all data have been annotated
  – ensures the best possible classifier (a sketch of the idea follows below)
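To illustrate the idea (not Paramsearch's actual interface): score candidate TiMBL-style settings on held-out data and keep the winner; Paramsearch itself prunes this search with progressive sampling rather than trying every combination on all the data. The parameter names and the train_and_score callback are assumptions.

    from itertools import product

    # Assumed stand-ins for typical TiMBL parameters.
    GRID = {
        "k": [1, 3, 5, 11],                               # nearest neighbours
        "weighting": ["gain_ratio", "info_gain", "none"], # feature weighting
        "metric": ["overlap", "mvdm"],                    # distance metric
    }

    def best_parameters(train_and_score, grid=GRID):
        # Enumerate candidate settings and keep the combination whose
        # held-out accuracy (returned by the callback) is highest.
        candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
        return max(candidates, key=train_and_score)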


Criteria

• Two criteria
  – accuracy
  – human effort (time)
• Evaluated on the tasks of hyphenation and compound analysis for Afrikaans and Setswana
• Four human annotators
  – two well experienced in annotating
  – two considered novices in the field


Accuracy

• Two kinds of accuracy
  – classifier accuracy
  – human accuracy
• Expressed as the percentage of correctly annotated words over the total number of words
• The gold standard is excluded from the training data


Classifier Accuracy (Hyphenation)

# Words in Training Data   Accuracy: Afrikaans   Accuracy: Setswana
 200                       38.60%                94.50%
 600                       54.00%                98.30%
1000                       58.30%                98.80%
2000                       68.50%                98.90%


Human Accuracy

• Two separate unseen datasets of 200 words for each language
  – the first dataset annotated in an ordinary text editor
  – the second dataset annotated with TurboAnnotate


Human Accuracy


Annotation Tool             Accuracy (Hyph)   Time (s) (Hyph)   Accuracy (CA)   Time (s) (CA)
Text Editor (200 words)     93.25%            1325              91.50%          802
TurboAnnotate (200 words)   98.34%            1258              94.00%          748


Human Effort I

• Two questions
  – Is it faster to annotate with TurboAnnotate?
  – What would the predicted saving on human effort be on a large dataset?


Human Effort II


# Words in Training Set   Time (s) (Hyph)   Time (s) (CA)
   0                      1258              748
 600                       663              614
2000                       573              582


Human Effort III

• It is about 1 minute faster to annotate 200 words with TurboAnnotate
• On a larger dataset (40,000 words) this amounts to a difference of only circa 3.5 uninterrupted human hours
• The picture changes when the effect of bootstrapping is considered: extrapolating to 42,967 words yields
  – a saving of 51 hours (68%) for hyphenation
  – a saving of 9 hours (41%) for compound analysis
  (a back-of-the-envelope sketch of this extrapolation follows below)
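A back-of-the-envelope Python sketch of the shape of this extrapolation, using the hyphenation timings from the Human Effort II slide. It interpolates per-chunk annotation time as the training set grows and assumes the time stays flat beyond 2,000 training words; since the classifier actually keeps improving past that point, this simplified model understates the 51-hour (68%) saving reported on the slide.

    # Seconds per 200-word chunk (hyphenation) at given training-set sizes.
    MEASURED = [(0, 1258), (600, 663), (2000, 573)]
    EDITOR = 1325               # s per 200 words in a plain text editor
    CHUNK, TOTAL = 200, 42967   # extrapolation target from the slide

    def chunk_time(trained):
        # Piecewise-linear interpolation; assumed flat after the last point.
        for (x0, y0), (x1, y1) in zip(MEASURED, MEASURED[1:]):
            if trained <= x1:
                return y0 + (y1 - y0) * (trained - x0) / (x1 - x0)
        return MEASURED[-1][1]

    chunks = TOTAL // CHUNK
    turbo = sum(chunk_time(i * CHUNK) for i in range(chunks))
    editor = EDITOR * chunks
    print(f"saving ~{(editor - turbo) / 3600:.0f} h ({1 - turbo / editor:.0%})")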


Conclusion

• TurboAnnotate helps to increase the accuracy of human annotators
• It saves human effort


Future Work

• Other lexical annotation tasks
  – creating lexicons for spelling checkers
  – creating data for morphological analysis
    • stemming
    • lemmatization
• Improve the GUI
• Network solution
• Active learning
• Experiment with C5.0


TurboAnnotate

• Requirements:
  – Linux
  – Perl 5.8
  – Gtk+ 2.10
  – TiMBL 5.1
• Open source
• Available at http://www.nwu.ac.za/ctext


Acknowledgements

• This work was supported by a grant from the South African National Research Foundation (GUN: FA2004042900059).
• We also acknowledge the inputs and contributions of
  – Ansu Berg
  – Pieter Nortjé
  – Rigardt Pretorius
  – Martin Schlemmer
  – Wikus Slabbert
