MANDIAC: A Web-based Annotation System For Manual
Arabic Diacritization
Collaborators: Houda Bouamor, Wajdi Zaghouani, Mahmoud Ghoneim, Abdelati Hawwari, Mona Diab and Kemal Oflazer
Ossama Obeid
Carnegie Mellon University in Qatar
Introduction
• Arabic text is composed of consonants, long vowels, and short vowels (diacritics).
• Absence of diacritics:
o Adds lexical and morphological ambiguity.
o Confusing to beginners.
o Impacts performance of Arabic NLP tasks.
• Very few texts are diacritized.
Introduction
Possible pronunciation and meanings of the undiacritized Arabic word ذكر.
Introduction
• Most automatic diacritization systems are trained on Arabic Treebanks.
• Different genres and dialects need new datasets:
o Creating them is time consuming.
o Must ensure data quality and consistency.
Currently Available Annotation Tools
• Very basic text-editor-like interfaces.
• Can’t handle a large number of documents and annotators.
• Not easily customizable.
MANDIAC
• Web-based.
• Intuitive and easy to use.
• Easily manages thousands of documents.
• Distributes tasks (including IAA evaluation tasks) to tens of annotators.
• Doubles annotation speed!
• Based on QAWI.
• Provides Annotation and Annotation Management interfaces.
Annotation Interface
• Token-based annotation system similar to QAWI.
• Annotators can choose pre-computed diacritizations (derived using MADAMIRA) and/or manually edit diacritics.
• Additional features to increase annotator productivity.
Annotation Interface
Extra Features:
• Undo/Redo buttons
• Edits restricted to diacritics only
• Timer
• Counter indicating number of words left to annotate
• Link to annotation guidelines
• Token highlighting:
o Annotated words
o Tokens that should not be edited (e.g. digits, non-Arabic words, punctuation)
• Flag documents
• Mark tokens as ambiguous
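The "edits restricted to diacritics only" and "tokens that should not be edited" features can be illustrated with a minimal sketch. This is not MANDIAC's actual code: the function names, the set of code points treated as diacritics, and the rule that a token is editable only if it contains an Arabic letter are all assumptions made for illustration.

```python
# Hypothetical helpers; names and rules are assumptions, not MANDIAC's code.

# Tanween, short vowels, shadda, sukun (U+064B..U+0652) plus dagger alif (U+0670).
ARABIC_DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653)) | {"\u0670"}


def strip_diacritics(token: str) -> str:
    """Return the token with every diacritic mark removed."""
    return "".join(ch for ch in token if ch not in ARABIC_DIACRITICS)


def is_diacritic_only_edit(original: str, edited: str) -> bool:
    """Accept an edit only if the base letters are unchanged."""
    return strip_diacritics(original) == strip_diacritics(edited)


def is_editable(token: str) -> bool:
    """Heuristic: a token is editable if it has at least one Arabic letter.

    Digits, punctuation, and non-Arabic words fall outside U+0621..U+064A
    and would be highlighted as locked.
    """
    return any("\u0621" <= ch <= "\u064A" for ch in token)
```

An interface built this way could reject any keystroke that makes `is_diacritic_only_edit` return False, which would also rule out the base-letter typos that a plain text editor allows.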
Annotation Interface
Annotation Interface at a glance
Annotation Interface
Dropdown showing the top 3 automatically diacritized candidates.
Manual token editor
Management Interface
User Management:
• Add/remove users.
• Add users to annotation groups.
• Display user activity log and statistics.
Management Interface
Annotation Workflow Management:
• Upload files in various formats.
• Organize files into groups.
• Assign files to individuals or to a group (for IAA).
• Highlight tasks as untouched, edited, or completed.
Management Interface
Evaluation and Monitoring:
• Evaluate IAA.
• Compare annotations to gold reference.
• Use WER and DER as metrics.
• 10% of assigned documents are randomly assigned for IAA.
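As a rough illustration of the two metrics, the sketch below computes DER (share of characters whose diacritics differ from the reference) and WER (share of words with at least one such character). The slides do not state which variant of these metrics MANDIAC uses, so the definitions, the function names, and the assumption that both texts share the same tokens and base letters are illustrative only.

```python
# Hypothetical metric sketch; the exact definitions are assumptions.

DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653)) | {"\u0670"}


def split_marks(word):
    """Split a word into (base_char, marks) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference: str, hypothesis: str):
    """Return (DER, WER) for two diacritizations of the same text."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("texts must be non-empty with the same word count")
    char_total = char_errors = word_errors = 0
    for ref, hyp in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_marks(ref), split_marks(hyp)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ; only diacritics may vary")
        # Sort marks so that shadda+vowel and vowel+shadda compare equal.
        errors = sum(
            sorted(rm) != sorted(hm)
            for (_, rm), (_, hm) in zip(ref_pairs, hyp_pairs)
        )
        char_total += len(ref_pairs)
        char_errors += errors
        word_errors += errors > 0
    return char_errors / char_total, word_errors / len(ref_words)
```

The same function could serve both uses listed on the slide: comparing an annotator against a gold reference, or comparing two annotators on a shared IAA document.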
Management Interface
User management view
Management Interface
Task assignment popup
Task list view
System Design and Architecture
• Four main components:
o Annotation interface
o Management interface
o Back-end server
o MADAMIRA
Component interaction diagram
System Design and Architecture
Data storage:
• Relational database (SQL):
o Fast data search and retrieval.
o Almost any SQL database can be used.
• Annotation data stored as JSON blobs:
o Flexible data format.
o Quickly add new functionality and annotation modes with little back-end modification.
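The storage idea (relational columns for search, a JSON blob for the annotation itself) can be sketched as follows. The table name, column names, and JSON fields are hypothetical, and SQLite is used only because it ships with Python; the slides do not say which SQL database or schema MANDIAC uses.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) annotations table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS annotations (
               doc_id    INTEGER NOT NULL,
               annotator TEXT    NOT NULL,
               status    TEXT    NOT NULL,  -- untouched, edited, completed
               data      TEXT    NOT NULL,  -- annotation payload as JSON
               PRIMARY KEY (doc_id, annotator)
           )"""
    )
    return conn


def save_annotation(conn, doc_id, annotator, status, data):
    """Insert or overwrite one annotator's work on one document."""
    conn.execute(
        "INSERT OR REPLACE INTO annotations (doc_id, annotator, status, data) "
        "VALUES (?, ?, ?, ?)",
        (doc_id, annotator, status, json.dumps(data, ensure_ascii=False)),
    )
    conn.commit()


def load_annotation(conn, doc_id, annotator):
    """Return the decoded JSON payload, or None if nothing is stored."""
    row = conn.execute(
        "SELECT data FROM annotations WHERE doc_id = ? AND annotator = ?",
        (doc_id, annotator),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

With this layout, a new per-token field such as an `ambiguous` flag would be one more key in the JSON payload and would need no schema change, which is the kind of flexibility the slide describes.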
Evaluation
Experimental setup:
• Around 1,500 words were extracted from Penn Arabic Treebank.
• Five annotators were asked to fully diacritize the extracted words:
o First half of the text using a text editor.
o Second half of the text with MANDIAC:
− Use automatically diacritized candidate if possible.
− Manually edit otherwise.
Evaluation
• Experimental results:
o Using a text editor: 302 words/hour
o Using MANDIAC: 618 words/hour
• Using the text editor introduced typos.
Acknowledgements
• This project has been funded by the Qatar National Research Fund (grant NPRP 6-1020-1-199).
• We also thank the annotators for their feedback on MANDIAC.
Thank You!