Towards A New Arabic Corpus of Dyslexic Texts

Maha Alamri (Elp003@bangor.ac.uk) and William John Teahan

School of Computer Science, Bangor University

THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS LREC2016

Outline

Introduction.

Arabic Corpus of Dyslexic Texts.

Towards Automatic Correction of Dyslexic Errors.

Conclusion.

Introduction

This presentation covers two things: the creation of a new Arabic corpus of texts written by people with dyslexia, and software for automatically correcting spelling in such texts.

Dyslexia:

The term has its roots in the Greek ‘dys-’, meaning difficulty with, and ‘-lexia’, which means language or word.

It denotes an inability to master the use of written language, including problems with comprehension.

1 in 10 people has dyslexia.

The main area of interest lies where the following areas overlap:

[Figure: overlapping areas labelled Dyslexia, Arabic Corpus, and Automatic spelling correction.]

The term "automatic spelling correction" denotes the way in which a program identifies a misspelled word and then alters it to its correct form.

Spelling Errors

Common Spelling Errors (Damerau, 1964):

Additional letters e.g. unniverse.

Omitted letters e.g. univ rse.

Substituted letters e.g. umiverse.

Swapped letters e.g. uinverse.
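
The four error types above can be told apart mechanically when a misspelling differs from its target by a single edit. The following is a minimal Python sketch, not the method used in this work: the function name `classify_single_error` and the label strings are assumptions chosen for illustration, and the omitted-letter example is written without the gap shown on the slide.

```python
def classify_single_error(misspelt: str, target: str) -> str:
    """Name the single-edit error that turns `target` into `misspelt`.

    Returns "addition", "omission", "substitution" or "transposition",
    or "other" when the pair differs in any other way (or not at all).
    """
    # Skip the common prefix; only the differing middle part matters.
    start = 0
    while start < min(len(misspelt), len(target)) and misspelt[start] == target[start]:
        start += 1
    # Skip the common suffix without running back past the prefix.
    end_m, end_t = len(misspelt), len(target)
    while end_m > start and end_t > start and misspelt[end_m - 1] == target[end_t - 1]:
        end_m -= 1
        end_t -= 1
    mid_m, mid_t = misspelt[start:end_m], target[start:end_t]

    if len(mid_m) == 1 and mid_t == "":
        return "addition"        # one extra letter in the misspelling
    if mid_m == "" and len(mid_t) == 1:
        return "omission"        # one letter missing from the misspelling
    if len(mid_m) == 1 and len(mid_t) == 1:
        return "substitution"    # one letter replaced by another
    if len(mid_m) == 2 and mid_m == mid_t[::-1]:
        return "transposition"   # two adjacent letters swapped
    return "other"


for word in ["unniverse", "univrse", "umiverse", "uinverse"]:
    print(word, "->", classify_single_error(word, "universe"))
```

Errors that combine several edits fall into "other" here; handling them would need a full edit-distance alignment.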

Dyslexia Spelling Errors

Words that contain silent letters (knife).

Morphemes that change when affixes are added: explain – explanation.

Dyslexic writers struggle with the relationship between the sound of a word and how it is spelt.

An inability to retain orthographic symbols in memory makes it difficult for dyslexic writers to remember the correct order of letters in a word.

Spelling errors by Arabic writers with dyslexia

Phonetic errors.

Irregular spelling rules.

Word omission.

Hamza.

Long vowel.

Exchanging consonants.

Difficulty in writing the letters in the correct shape.

Words are spelt the way the writer hears them in the local spoken dialect.

Arabic Corpus of Dyslexic Texts

Misspelling rates are noticeably higher in texts written by children. The texts were therefore collected from female primary school students who had been professionally diagnosed with dyslexia and were taught in resource rooms.

BDAC information:

Text: Writing exercises (Homework).

Size: 1067 words containing 694 errors.

Year: 2013.

Language: Arabic.

Country of production: Saudi Arabia (Riyadh).

The Bangor Dyslexic Arabic Corpus (BDAC) is a preliminary version that aims to investigate whether a corpus can be used as an aid for Arabic dyslexic writers.

Example Dyslexic Text

[Figure: scanned image of one of the texts, written by a nine-year-old girl with dyslexia.]

[Figure: the basic errors found in this example.]

Analysis of the BDAC errors

3. Substitution (47 times), commonly found in: replacing (Heh - ه) with (Teh Marbuta - ة) or vice versa; changing (Heh - ه or Teh Marbuta - ة) with the letter (Teh - ت) or vice versa; and exchanging the letter (Dad - ض) with (Zah - ظ) or vice versa.

4. Transposition (19 times).
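
A tally of substitution pairs like the one above can be produced by comparing each misspelt word with its intended form, character by character. The Python sketch below shows one way to do this under stated assumptions: the function name `tally_substitutions` is hypothetical, and the word pairs are illustrative examples, not entries from the BDAC.

```python
from collections import Counter


def tally_substitutions(pairs):
    """Count (written, intended) character pairs in equal-length word pairs."""
    counts = Counter()
    for written, intended in pairs:
        if len(written) != len(intended):
            continue  # an addition or omission, not a one-to-one error
        for w_ch, i_ch in zip(written, intended):
            if w_ch != i_ch:
                counts[(w_ch, i_ch)] += 1
    return counts


# Hypothetical (written, intended) pairs, NOT taken from the BDAC.
EXAMPLE_PAIRS = [
    ("مدرسه", "مدرسة"),  # Heh written for Teh Marbuta
    ("حديقه", "حديقة"),  # Heh written for Teh Marbuta
    ("مكتبت", "مكتبة"),  # Teh written for Teh Marbuta
    ("ظابط", "ضابط"),    # Zah written for Dad
]

for (written, intended), n in tally_substitutions(EXAMPLE_PAIRS).most_common():
    print(f"{written} written for {intended}: {n}")
```

This simple alignment counts a transposition as two substitutions, so transposed pairs would need to be filtered out first to reproduce separate substitution and transposition totals.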

Towards Automatic Correction of Dyslexic Errors

The main tool employed was the Text Mining Toolkit (TMT). TMT is a software package designed for tasks involving compression-based language modelling, text categorisation, text correction, and text segmentation.

The toolkit was used to correct a small number of the dyslexic errors with a method similar to the one described by Alhawiti (2014), which was found effective for correcting errors in Arabic OCR text.

First, it was crucial to choose a large corpus of Arabic text to train the compression-based language model created by the toolkit. After a review of suitable corpora, the Bangor Arabic Compression Corpus (BACC), created by Dr. Khaled Alhawiti, was chosen.

Because of current limitations in the TMT software, correction was applied only to one-to-one character errors. It used the toolkit’s markup correction capabilities, which find the most probable corrected sequence given the compression-based language model.
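
The idea of picking the most probable corrected sequence can be illustrated with a much simpler stand-in. The Python sketch below is not TMT and does not use its API: it assumes a character bigram model with add-one smoothing in place of the toolkit's compression-based model, a five-word toy training list in place of the BACC, and a hypothetical confusion table (`CONFUSABLE`) built from the substitution pairs reported in the error analysis.

```python
import math
from collections import Counter
from itertools import product

# Assumed confusion table: each letter maps to itself plus its alternatives.
CONFUSABLE = {
    "ه": "هة",
    "ة": "ةهت",
    "ت": "تة",
    "ض": "ضظ",
    "ظ": "ظض",
}

# Toy training data, standing in for a large corpus such as the BACC.
TRAINING_WORDS = ["مدرسة", "مكتبة", "حديقة", "ضابط", "بيت"]


class BigramModel:
    """Character bigram model with add-one smoothing (a stand-in, not PPM)."""

    def __init__(self, words):
        self.pair_counts = Counter()
        self.context_counts = Counter()
        self.alphabet = {"$"}  # "$" marks the end of a word
        for word in words:
            padded = "^" + word + "$"  # "^" marks the start of a word
            self.alphabet.update(word)
            for prev, cur in zip(padded, padded[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def code_length(self, word):
        """Approximate bits needed to encode `word`; lower is more probable."""
        padded = "^" + word + "$"
        size = len(self.alphabet)
        bits = 0.0
        for prev, cur in zip(padded, padded[1:]):
            p = (self.pair_counts[(prev, cur)] + 1) / (self.context_counts[prev] + size)
            bits -= math.log2(p)
        return bits


def correct_word(word, model):
    """Return the one-to-one variant of `word` with the shortest code length."""
    options = [CONFUSABLE.get(ch, ch) for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    # On a tie, min() keeps the first candidate, which is the original word.
    return min(candidates, key=model.code_length)


model = BigramModel(TRAINING_WORDS)
print(correct_word("مدرسه", model))
```

With this toy training list the misspelling مدرسه is changed to مدرسة. Enumerating every variant grows exponentially with the number of confusable letters, so a longer text would need a dynamic-programming search instead.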

Experimental Results

All errors involving more than one character were removed.

BDAC Corpus:

Text: 1067 words.

Errors: 694.

One-to-one character errors: 280.

Word: 153 errors, 99 corrected.

Sentences: 80 errors, 49 corrected.

Paragraphs: 47 errors, 39 corrected.

Total: 280 errors, 187 corrected.

The TMT software corrected 187 of the 280 one-to-one character errors (about 67%), which is more than half.
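
The reported counts can be turned into rates with a few lines of arithmetic. The Python sketch below uses only the numbers given on the slides; the names `RESULTS`, `percent` and `correction_rates` are assumptions made for illustration.

```python
# Counts as reported on the slides: (one-to-one character errors, corrected).
RESULTS = {
    "Word": (153, 99),
    "Sentences": (80, 49),
    "Paragraphs": (47, 39),
    "Total": (280, 187),
}
CORPUS_WORDS = 1067   # size of the BDAC in words
CORPUS_ERRORS = 694   # all errors found in the BDAC


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


def correction_rates(results):
    """Map each row name to the percentage of its errors that were corrected."""
    return {name: percent(done, errors) for name, (errors, done) in results.items()}


for name, rate in correction_rates(RESULTS).items():
    print(f"{name}: {rate}% corrected")
print(f"Errors per 100 words: {percent(CORPUS_ERRORS, CORPUS_WORDS)}")
print(f"Share of errors that are one-to-one: {percent(280, CORPUS_ERRORS)}%")
```

The three per-category rows add up to the totals (153 + 80 + 47 = 280 errors and 99 + 49 + 39 = 187 corrected), which is a quick consistency check on the reconstructed figures.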

Conclusion

The corpus used in this study offers a useful platform for analysing dyslexic text.

It gives a better understanding of how these errors occur and of the factors that determine them, which makes it suitable for assisting dyslexic writers.

This corpus can serve as a platform for other researchers to build upon.

A preliminary investigation was undertaken into using automatic processing techniques to assist Arabic dyslexic writers, and some initial success was achieved in the automatic correction of dyslexic errors in Arabic text.

Future work will require considerably more resources and effort to extend the corpus with more text for analysis.

Thank you. Any questions?