Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
Wajdi Zaghouani and Dana Awad
Carnegie Mellon University Qatar
LREC 2016
Outline
• Context and Motivation
• Goals
• Method
• Guidelines
• Tool
• Evaluation
Context and Motivation: Punctuation Annotation
• Punctuation marks are used to create sense, clarity, and stress in sentences.
• They are also used to structure and organize text.
• In Arabic, punctuation marks are a relatively modern innovation, since Arabic historically did not use punctuation.
• Punctuation rules in Arabic are not applied consistently: usage is highly individual and depends on the writer's style and preferences.
Context and Motivation: Punctuation Annotation
• From an NLP perspective, punctuation marks are useful for:
  – automatic sentence segmentation tasks (sentence boundaries and phrase boundaries).
• The absence of punctuation can be confusing for both humans and computers.
  – Jones et al. (2003), for example, showed that sentence breaks are critical for text legibility.
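The segmentation use case above can be sketched in a few lines; the mark inventory, the regex, and the function name here are illustrative assumptions, not part of the QALB pipeline:

```python
import re

# Sentence-final marks used for splitting (illustrative set):
# period, exclamation mark, and the Arabic question mark (U+061F).
SENT_FINAL = ".!\u061F"

def split_sentences(text: str) -> list[str]:
    # Split after a sentence-final mark followed by whitespace,
    # keeping the mark attached to its sentence.
    pattern = r"(?<=[" + re.escape(SENT_FINAL) + r"])\s+"
    return [p for p in re.split(pattern, text.strip()) if p]
```

A text with no punctuation comes back as one long "sentence", which is exactly the ambiguity a punctuated corpus helps systems resolve.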
The Larger Context: The QALB Project
• QALB (Qatar Arabic Language Bank) Project.
• Funded by the Qatar National Research Fund, Grant # NPRP-4-1058-1-168.
  – Columbia/NYUAD (Habash) & CMUQ (Oflazer and Mohit).
  – Resources and system construction for Arabic error correction, including punctuation correction/insertion.
  – A 2-million-word corpus from:
    • Native news comments.
    • Native and non-native essays.
    • Machine translation output correction.
Goals
• Advance research on punctuation correction in Modern Standard Arabic text and in Arabic NLP in general.
• Use the annotated corpus to develop automatic punctuation insertion/correction systems.
• Create and share with the research community a unique large text corpus of Arabic with punctuation added/corrected in a consistent manner.
Method
• Created punctuation correction guidelines.
• Used the standard general Arabic punctuation rules commonly used today, as described in Awad's (2013) PhD thesis.
• Created and implemented an annotation procedure to ensure consistent annotation, using quality-control measures.
The QALB Guidelines
The punctuation guidelines are part of the general QALB project guidelines.
• Spelling errors
• Punctuation errors
• Lexical errors
• Morphology errors
• Syntactic errors
• Dialectal usage correction
Punctuation Guidelines (1)
• Our punctuation guidelines address each type of punctuation error.
• They describe the correction process for existing incorrect punctuation.
• They specify when to add missing punctuation marks.
Punctuation Guidelines (2)
• We provide the annotators with a detailed annotation procedure and explain how to deal with borderline cases.
• We include many annotated examples to illustrate some specific cases of punctuation correction rules.
Guidelines Example: The Arabic Comma
Our Arabic comma correction rules state the following four uses as valid:
1. Separating coordinated and main-clause sentences.
2. Separating items in an enumeration to avoid repetition.
3. Introducing an explanation or a definition of the previous word.
4. Separating the parts of conditional sentences.
Example of punctuation errors
More rules…
• The annotators should convert any Latin punctuation sign to its Arabic equivalent:
  – The Arabic comma ، versus ,
  – The Arabic semicolon ؛ versus ;
  – The Arabic question mark ؟ versus ?
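The Latin-to-Arabic conversion rule above is a plain character mapping; a minimal sketch using Python's `str.translate` (the function name is ours, and the mapping covers only the three pairs listed in the guidelines):

```python
# Map each Latin mark to its Arabic counterpart:
# , -> ، (U+060C), ; -> ؛ (U+061B), ? -> ؟ (U+061F)
LATIN_TO_ARABIC = str.maketrans({
    ",": "\u060C",
    ";": "\u061B",
    "?": "\u061F",
})

def arabize_punctuation(text: str) -> str:
    # Replace every occurrence in one pass; other characters are untouched.
    return text.translate(LATIN_TO_ARABIC)
```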
QALB Annotation Interface
• QAWI: QALB Annotation Web Interface (Obeid et al., 2013)
• Move
• Delete
• Edit
• Split
• Merge
QAWI Annotation Interface
QALB Annotation Management Interface
Edited words are highlighted in blue
Annotation Action History
Evaluation
• To evaluate the punctuation annotation quality, we measure the inter-annotator agreement (IAA) on randomly selected files to ensure that the annotators are consistently following the annotation guidelines.
• The IAA is measured over all pairs of annotations to compute the AWER (Average Word Error Rate).
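The slides define AWER as word error rate averaged over all annotation pairs; a minimal, illustrative sketch (function names are ours, not from the QALB tooling), using word-level Levenshtein distance normalized by reference length:

```python
from itertools import combinations

def wer(reference: list[str], hypothesis: list[str]) -> float:
    # Word-level edit distance (insertions, deletions, substitutions)
    # divided by the reference length.
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / m if m else 0.0

def average_wer(annotations: list[list[str]]) -> float:
    # Average pairwise WER over all annotator pairs for one file.
    pairs = list(combinations(annotations, 2))
    return sum(wer(a, b) for a, b in pairs) / len(pairs)
```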
Evaluation
• The IAA results shown in the Table were computed over 10 files from each corpus (30 files and 4,116 words total) annotated by at least three different annotators.
Average percent Inter-Annotator Agreement (IAA) and the Average WER (AWER) obtained for the three corpora
Evaluation
• Higher agreement was observed for the MT corpus.
  – Our analysis revealed that this was partially caused by the much lower frequency of the comma in the MT corpus compared to the L1 and L2 corpora, as shown in the table on the next slide.
• The comma in Arabic is not always used consistently, as it is optional in many cases.
Distribution of each punctuation mark in the three corpora.
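A per-mark distribution like the one in the table can be obtained with a simple count; a hedged sketch with an illustrative mark set (not the exact inventory from the paper):

```python
from collections import Counter

# Illustrative marks: Arabic comma, semicolon, question mark,
# plus marks Arabic shares with Latin script.
MARKS = {"\u060C", "\u061B", "\u061F", ".", "!", ":"}

def punctuation_counts(text: str) -> Counter:
    # Count each punctuation mark's occurrences in the corpus text.
    return Counter(ch for ch in text if ch in MARKS)
```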
THANK YOU