Wajdi Zaghouani Carnegie Mellon University Qatar Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation Wajdi Zaghouani and Dana Awad LREC 2016

P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Page 1: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Wajdi Zaghouani
Carnegie Mellon University Qatar

Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
Wajdi Zaghouani and Dana Awad

LREC 2016

Page 2: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Outline

• Context and Motivation
• Goals
• Method
• Guidelines
• Tool
• Evaluation

Page 3: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Context and Motivation: Punctuation Annotation

• Punctuation marks are used to create sense, clarity and stress in sentences.
• They are also used to structure and organize text.
• In Arabic, punctuation marks are a relatively modern innovation, since Arabic historically did not use punctuation.
• Punctuation rules in Arabic are not always applied consistently, since usage is highly individual and depends on writing style and preference.

Page 4: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Context and Motivation: Punctuation Annotation

• From an NLP perspective, punctuation marks can be useful in:
  – automatic sentence segmentation tasks (sentence boundaries and phrase boundaries).
• The absence of punctuation can be confusing for both humans and computers.
  – Jones et al. (2003), for example, showed that sentence breaks are critical for text legibility.
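
As a rough illustration of why punctuation helps sentence segmentation, the Python sketch below splits text after a sentence-final mark followed by whitespace. The set of sentence-final marks and the function name `split_sentences` are assumptions made for this example; real segmenters handle abbreviations, quotations and many other cases.

```python
import re

# Sentence-final marks assumed for this sketch: full stop, exclamation mark,
# and the Latin and Arabic (U+061F) question marks.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")

def split_sentences(text):
    """Split text after sentence-final punctuation followed by whitespace."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]
```

Text with no punctuation at all comes back as a single unsegmented string, which is the situation the slide describes as confusing for both humans and computers.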

Page 5: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

The Larger Context: The QALB Project

• QALB (Qatar Arabic Language Bank) Project.
• Funded by the Qatar National Research Fund, Grant # NPRP-4-1058-1-168.
  – Columbia/NYUAD (Habash) & CMUQ (Oflazer and Mohit).
  – Resources and system construction for Arabic error correction, including punctuation correction/insertion.
  – 2-million-word corpus from:
    • Native news comments.
    • Native and non-native essays.
    • Machine Translation output correction.

Page 7: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Goals

• Advance research on punctuation correction in Modern Standard Arabic text and on Arabic NLP in general.

• Use the annotated corpus to develop automatic punctuation insertion/correction systems.

• Create and share with the research community a unique large text corpus of Arabic with punctuation added/corrected in a consistent manner.

Page 9: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Method

• Created punctuation correction guidelines.
• Used the standard general Arabic punctuation rules commonly used today and described in Awad's (2013) PhD thesis.
• Created and implemented an annotation procedure to ensure consistent annotation using quality control measures.

Page 11: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

The QALB Guidelines

The punctuation guidelines are part of the general QALB project guidelines.

• Spelling Errors
• Punctuation Errors
• Lexical Errors
• Morphology Errors
• Syntactic Errors
• Dialectal Usage Correction

Page 12: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Punctuation Guidelines (1)

• Our punctuation guidelines cover each type of punctuation error.

• They describe how to correct existing incorrect punctuation.

• They also specify when to add missing punctuation marks.

Page 13: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Punctuation Guidelines (2)

• We provide the annotators with a detailed annotation procedure and we explain how to deal with borderline cases.

• We include many annotated examples to illustrate some specific cases of punctuation correction rules.

Page 14: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Guidelines Example: The Arabic Comma

Our Arabic comma correction rules state the following four uses as valid:

1. Separate coordinated and main-clause sentences.
2. Separate items in an enumeration to avoid repetition.
3. Provide an explanation or a definition of the previous word.
4. Separate the parts of conditional sentences.

Page 15: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Example of punctuation errors

Page 16: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

More rules…

• The annotators should convert any Latin punctuation sign to its Arabic equivalent.

  – The Arabic comma ، versus ,

  – The Arabic semicolon ؛ versus ;

  – The Arabic question mark ؟ versus ?
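
The conversion rule above can be sketched in Python as a simple character mapping. The function name `to_arabic_punctuation` is hypothetical, and the sketch assumes every Latin comma, semicolon and question mark in the text should be converted; an annotator would still need to skip cases such as commas inside numbers or Latin-script passages.

```python
# Latin punctuation marks and their Arabic equivalents, as listed on the slide.
LATIN_TO_ARABIC = str.maketrans({
    ",": "\u060C",  # Arabic comma
    ";": "\u061B",  # Arabic semicolon
    "?": "\u061F",  # Arabic question mark
})

def to_arabic_punctuation(text):
    """Replace Latin comma, semicolon and question mark with Arabic forms."""
    return text.translate(LATIN_TO_ARABIC)
```

Marks that are shared between the two scripts, such as the full stop, are left untouched by this mapping.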

Page 18: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

QALB Annotation Interface

• QAWI: QALB Annotation Web Interface (Obeid et al., 2013)

• Move

• Delete

• Edit

• Split

• Merge

Page 19: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

QAWI Annotation Interface

Page 20: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

QALB Annotation Management Interface

Edited words are highlighted in blue

Page 21: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Annotation Action History

Page 23: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Evaluation

• To evaluate the punctuation annotation quality, we measure the inter-annotator agreement (IAA) on randomly selected files to ensure that the annotators are consistently following the annotation guidelines.

• The IAA is measured over all pairs of annotations to compute the AWER (Average Word Error Rate).
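
One way to read the AWER computation is sketched below in Python. Several details are assumptions not stated on the slide: annotations are whitespace-tokenized with punctuation marks as separate tokens, WER is the token-level Levenshtein distance divided by the reference length, and each annotation serves as the reference once per pair. The function names are hypothetical, and this is not the QALB project's own scoring code.

```python
from itertools import permutations

def word_edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate of hyp against ref, as a percentage."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * word_edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

def average_wer(annotations):
    """Mean WER over all ordered pairs of annotations of the same text."""
    pairs = list(permutations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotations")
    return sum(wer(a, b) for a, b in pairs) / len(pairs)
```

With punctuation tokenized separately, two annotators who differ only in one inserted comma differ by exactly one token, so a lower AWER corresponds to more consistent punctuation decisions.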

Page 24: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Evaluation

• The IAA results shown in the Table were computed over 10 files from each corpus (30 files and 4,116 words total) annotated by at least three different annotators.

Average percent Inter-Annotator Agreement (IAA) and the Average WER (AWER) obtained for the three corpora

Page 25: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Evaluation

• A higher agreement was observed with the MT corpus.
  – Our analysis revealed that this was partially caused by the much lower frequency of the comma in the MT corpus compared to the L1 and L2 corpora, as shown in the table on the next slide.

• The comma in Arabic is not always used consistently, as it is optional in many cases.

Page 26: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

Distribution of each punctuation mark in the three corpora.

Page 27: P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation

THANK YOU