Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
Wajdi Zaghouani and Dana Awad
Carnegie Mellon University Qatar
LREC 2016
Outline
• Context and Motivation
• Goals
• Method
• Guidelines
• Tool
• Evaluation
Context and Motivation: Punctuation Annotation
• Punctuation marks are used to create sense, clarity, and stress in sentences.
• They are also used to structure and organize text.
• In Arabic, punctuation marks are a relatively modern innovation, since Arabic historically did not use punctuation.
• Punctuation rules in Arabic are not applied consistently: usage is highly individual and depends on the writer's style and preferences.
Context and Motivation: Punctuation Annotation
• From an NLP perspective, punctuation marks are useful for:
  – automatic sentence segmentation tasks (sentence boundaries and phrase boundaries).
• The absence of punctuation can be confusing for both humans and computers.
  – Jones et al. (2003), for example, showed that sentence breaks are critical for text legibility.
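The segmentation use case above can be sketched in a few lines; the mark inventory, the regex, and the function name here are illustrative assumptions, not part of the QALB pipeline:

```python
import re

# Sentence-final marks used for splitting (illustrative set):
# period, exclamation mark, and the Arabic question mark (U+061F).
SENT_FINAL = ".!\u061F"

def split_sentences(text: str) -> list[str]:
    # Split after a sentence-final mark followed by whitespace,
    # keeping the mark attached to its sentence.
    pattern = r"(?<=[" + re.escape(SENT_FINAL) + r"])\s+"
    return [p for p in re.split(pattern, text.strip()) if p]
```

A text with no punctuation comes back as one long "sentence", which is exactly the ambiguity a punctuated corpus helps systems resolve.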
The Larger Context: The QALB Project
• QALB (Qatar Arabic Language Bank) Project.
• Funded by the Qatar National Research Fund, Grant # NPRP-4-1058-1-168.
  – Columbia/NYUAD (Habash) & CMUQ (Oflazer and Mohit).
  – Resources and system construction for Arabic error correction, including punctuation correction/insertion.
  – A 2-million-word corpus from:
    • Native news comments.
    • Native and non-native essays.
    • Machine translation output correction.
Goals
• Advance research on punctuation correction in Modern Standard Arabic text and in Arabic NLP in general.
• Use the annotated corpus to develop automatic punctuation insertion/correction systems.
• Create and share with the research community a unique large text corpus of Arabic with punctuation added/corrected in a consistent manner.
Method
• Created punctuation correction guidelines.
• Used the standard general Arabic punctuation rules commonly used today, as described in Awad's (2013) PhD thesis.
• Created and implemented an annotation procedure to ensure consistent annotation, using quality-control measures.
The QALB Guidelines
The punctuation guidelines are part of the general QALB project guidelines.
• Spelling errors
• Punctuation errors
• Lexical errors
• Morphology errors
• Syntactic errors
• Dialectal usage correction
Punctuation Guidelines (1)
• Our punctuation guidelines address each type of punctuation error.
• They describe the correction process for existing incorrect punctuation.
• They specify when to add missing punctuation marks.
Punctuation Guidelines (2)
• We provide the annotators with a detailed annotation procedure and explain how to deal with borderline cases.
• We include many annotated examples to illustrate some specific cases of punctuation correction rules.
Guidelines Example: The Arabic Comma
Our Arabic comma correction rules state the following four uses as valid:
1. Separating coordinated and main-clause sentences.
2. Separating items in an enumeration to avoid repetition.
3. Introducing an explanation or a definition of the previous word.
4. Separating the parts of conditional sentences.
Example of punctuation errors
More rules…
• The annotators should convert any Latin punctuation sign to its Arabic equivalent:
  – The Arabic comma ، versus ,
  – The Arabic semicolon ؛ versus ;
  – The Arabic question mark ؟ versus ?
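The Latin-to-Arabic conversion rule above is a plain character mapping; a minimal sketch using Python's `str.translate` (the function name is ours, and the mapping covers only the three pairs listed in the guidelines):

```python
# Map each Latin mark to its Arabic counterpart:
# , -> ، (U+060C), ; -> ؛ (U+061B), ? -> ؟ (U+061F)
LATIN_TO_ARABIC = str.maketrans({
    ",": "\u060C",
    ";": "\u061B",
    "?": "\u061F",
})

def arabize_punctuation(text: str) -> str:
    # Replace every occurrence in one pass; other characters are untouched.
    return text.translate(LATIN_TO_ARABIC)
```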
QALB Annotation Interface
• QAWI: QALB Annotation Web Interface (Obeid et al., 2013)
• Move
• Delete
• Edit
• Split
• Merge
QAWI Annotation Interface
QALB Annotation Management Interface
Edited words are highlighted in blue
Annotation Action History
Evaluation
• To evaluate the punctuation annotation quality, we measure the inter-annotator agreement (IAA) on randomly selected files to ensure that the annotators are consistently following the annotation guidelines.
• The IAA is measured over all pairs of annotations to compute the AWER (Average Word Error Rate).
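The slides define AWER as word error rate averaged over all annotation pairs; a minimal, illustrative sketch (function names are ours, not from the QALB tooling), using word-level Levenshtein distance normalized by reference length:

```python
from itertools import combinations

def wer(reference: list[str], hypothesis: list[str]) -> float:
    # Word-level edit distance (insertions, deletions, substitutions)
    # divided by the reference length.
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / m if m else 0.0

def average_wer(annotations: list[list[str]]) -> float:
    # Average pairwise WER over all annotator pairs for one file.
    pairs = list(combinations(annotations, 2))
    return sum(wer(a, b) for a, b in pairs) / len(pairs)
```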
Evaluation
• The IAA results shown in the Table were computed over 10 files from each corpus (30 files and 4,116 words total) annotated by at least three different annotators.
Average percent Inter-Annotator Agreement (IAA) and the Average WER (AWER) obtained for the three corpora
Evaluation
• Higher agreement was observed for the MT corpus.
  – Our analysis revealed that this was partially caused by the much lower frequency of the comma in the MT corpus compared to the L1 and L2 corpora, as shown in the table on the next slide.
• The comma in Arabic is not always used consistently, as it is optional in many cases.
Distribution of each punctuation mark in the three corpora.
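A per-mark distribution like the one in the table can be obtained with a simple count; a hedged sketch with an illustrative mark set (not the exact inventory from the paper):

```python
from collections import Counter

# Illustrative marks: Arabic comma, semicolon, question mark,
# plus marks Arabic shares with Latin script.
MARKS = {"\u060C", "\u061B", "\u061F", ".", "!", ":"}

def punctuation_counts(text: str) -> Counter:
    # Count each punctuation mark's occurrences in the corpus text.
    return Counter(ch for ch in text if ch in MARKS)
```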
THANK YOU