P02- Towards a New Arabic Corpus of Dyslexic Texts


Towards A New Arabic Corpus of Dyslexic Texts

Maha Alamri (Elp003@bangor.ac.uk)

William John Teahan (W.J.Teahan@bangor.ac.uk)

School of Computer Science, Bangor University.

THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS LREC2016


Outline

Introduction.

Arabic Corpus of Dyslexic Texts.

Towards Automatic Correction of Dyslexic Errors.

Conclusion.


Introduction

The focus of this presentation is the creation of a new Arabic corpus of texts written by dyslexics, and of software for automatic spelling correction of such texts.

Dyslexia:

The term has its roots in the Greek ‘dys-’, meaning difficulty with, and ‘-lexia’, meaning language or word.

An inability to master the use of written language, including issues with comprehension.

1 in 10 people have dyslexia.


The main area of interest lies in the overlap of the following areas, as illustrated in the diagram:

Dyslexia, Arabic, Corpus.

Automatic spelling correction

The term denotes the way in which a misspelled word is identified by a program and is then altered to its correct form.


Spelling Errors

Common Spelling Errors (Damerau, 1964):

Additional letters e.g. unniverse.

Omitted letters e.g. univ rse.

Substituted letters e.g. umiverse.

Swapped letters e.g. uinverse.
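The four categories above can be illustrated with a minimal sketch that, given a misspelling and the intended word, reports which single operation separates them. The function name is hypothetical, and the check only handles words that differ by exactly one such operation.

```python
def classify_single_error(wrong: str, right: str) -> str:
    """Name the single Damerau-style error that turns `right` into `wrong`."""
    if wrong == right:
        return "none"
    # Additional letter: deleting one letter from the misspelling gives the word.
    if len(wrong) == len(right) + 1:
        for i in range(len(wrong)):
            if wrong[:i] + wrong[i + 1:] == right:
                return "addition"
    # Omitted letter: deleting one letter from the word gives the misspelling.
    if len(wrong) + 1 == len(right):
        for i in range(len(right)):
            if right[:i] + right[i + 1:] == wrong:
                return "omission"
    if len(wrong) == len(right):
        diffs = [i for i in range(len(right)) if wrong[i] != right[i]]
        # Substituted letter: exactly one position differs.
        if len(diffs) == 1:
            return "substitution"
        # Swapped letters: two adjacent positions hold each other's letters.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and wrong[diffs[0]] == right[diffs[1]]
                and wrong[diffs[1]] == right[diffs[0]]):
            return "transposition"
    return "other"


for misspelling in ["unniverse", "univrse", "umiverse", "uinverse"]:
    print(misspelling, classify_single_error(misspelling, "universe"))
```

Words with more than one error fall into "other" here; handling them would need a full edit-distance alignment rather than these direct checks.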


Dyslexia Spelling Errors

Words that contain silent letters (e.g. knife).

Morphemes that change when affixes are added (e.g. explain, explanation).

Dyslexic writers struggle with the relationship between the sound of a word and how it is spelt.

An inability to retain orthographic symbols in memory makes it difficult for dyslexics to remember the right order of letters in a word.


Spelling errors by Arabic writers with dyslexia

Phonetic errors.

Irregular spelling rules.

Word omission.

Hamza.

Long vowel.

Exchanging consonants.

Difficulty in writing the letters in the correct shape.

Words are spelt the way the writer hears them in the local spoken dialect.


Arabic Corpus of Dyslexic Texts

The rate of misspellings is noticeably higher in texts written by children. The texts were therefore collected from female primary school students who have been professionally diagnosed with dyslexia and are taught in resource rooms.

BDAC information:

Text: Writing exercises (Homework).

Size: 1067 words containing 694 errors.

Year: 2013.

Language: Arabic.

Country of production: Saudi Arabia (Riyadh).

The Bangor Dyslexic Arabic Corpus (BDAC) is a preliminary version that aims to investigate whether a corpus can be used as an aid for Arabic dyslexic writers.


Example Dyslexic Text

Screenshot of a scanned image of one of the texts, written by a nine-year-old girl with dyslexia.


This example includes basic errors as below:


Analysis of the BDAC errors


3. Substitution (47 times), commonly found in: replacing (Heh - ه) with (Teh Marbuta - ة) or vice versa; changing (Teh - ت) to the letter (Heh - ه or Teh Marbuta - ة) or vice versa; and exchanging the letter (Dad - ض) with (Zah - ظ) or vice versa.

4. Transposition (19 times).


Towards Automatic Correction of Dyslexic Errors

The main tool employed was the Text Mining Toolkit (TMT). TMT is a software package designed specifically for tasks involving compression-based language modelling, text categorisation and correction, and text segmentation.

The toolkit was used to correct a small number of the dyslexic errors using a method similar to the one described by Alhawiti (2014), which was found effective for correcting errors in Arabic OCR text.


First, it was crucial to choose a large corpus of Arabic text to train the compression-based language model created by the toolkit. After researching suitable corpora, the Bangor Arabic Compression Corpus (BACC), created by Dr. Khaled Alhawiti, was chosen.

Due to the current limitations of the TMT software, correction of the dyslexic texts was applied only to one-to-one character errors. It used the toolkit’s markup correction capability, which finds the most probable corrected sequence given the compression-based language model.
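The idea of choosing the most probable corrected sequence for one-to-one character errors can be sketched as follows. This is not the TMT's API or its compression model: as a stand-in, the sketch uses a character bigram model with add-one smoothing and picks the candidate with the shortest code length in bits. The confusion sets are taken from the substitution errors reported for the BDAC; the function names and the tiny training string are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Letters that may replace each other (from the BDAC substitution errors).
CONFUSIONS = {
    "ه": ["ه", "ة", "ت"],
    "ة": ["ة", "ه", "ت"],
    "ت": ["ت", "ه", "ة"],
    "ض": ["ض", "ظ"],
    "ظ": ["ظ", "ض"],
}


def train_bigrams(text):
    """Count character unigrams and bigrams in the training text."""
    padded = " " + text + " "
    return Counter(padded), Counter(zip(padded, padded[1:]))


def cost_bits(seq, unigrams, bigrams):
    """Approximate code length of `seq` in bits (add-one smoothed bigrams)."""
    vocab = len(unigrams) + 1  # one extra slot for unseen characters
    padded = " " + seq + " "
    return sum(
        -math.log2((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(padded, padded[1:])
    )


def correct_word(word, unigrams, bigrams):
    """Try every one-to-one substitution and keep the cheapest candidate."""
    options = [CONFUSIONS.get(ch, [ch]) for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    return min(candidates, key=lambda c: cost_bits(c, unigrams, bigrams))


# Hypothetical toy training text; the study trained on the BACC instead.
unigrams, bigrams = train_bigrams("مدرسة مدرسة مدرسة كتاب")
print(correct_word("مدرسه", unigrams, bigrams))
```

Enumerating every combination grows exponentially with the number of confusable letters in a word, so a practical system would search the candidates more selectively.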


Experimental Results

All errors containing more than one character were removed.

BDAC Corpus:

Text: 1067 words.

Errors: 694.

One-to-one character errors: 280.


Errors and corrections by text type:

Word: 153 errors, 99 corrected.

Sentences: 80 errors, 49 corrected.

Paragraphs: 47 errors, 39 corrected.

Total: 280 errors, 187 corrected.

The TMT software was able to correct more than half of the one-to-one character errors (187 of 280, about 67%).
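The correction rates implied by the reported counts can be checked with a few lines of arithmetic. The counts are copied from the results above; the dictionary and function names are hypothetical.

```python
# (errors, corrected) per text type, as reported for the BDAC experiment.
RESULTS = {
    "Word": (153, 99),
    "Sentences": (80, 49),
    "Paragraphs": (47, 39),
}


def correction_rate(errors, corrected):
    """Fraction of one-to-one character errors that were corrected."""
    return corrected / errors


total_errors = sum(errors for errors, _ in RESULTS.values())
total_corrected = sum(corrected for _, corrected in RESULTS.values())

for name, (errors, corrected) in RESULTS.items():
    print(f"{name}: {corrected}/{errors} = {correction_rate(errors, corrected):.1%}")
print(f"Total: {total_corrected}/{total_errors} = "
      f"{correction_rate(total_errors, total_corrected):.1%}")
```

The per-type counts sum to the reported totals of 280 errors and 187 corrections.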


Conclusion

The corpus used in this study offers a useful platform for analysing dyslexic text.

It provides a better understanding of how these errors occur and of the factors that determine them, and is therefore suitable for assisting dyslexic writers.

This corpus can serve as a platform for other researchers to build upon.

A preliminary investigation was undertaken into using automatic processing techniques as a form of assistance for Arabic dyslexic writers and some initial success was achieved in the automatic correction of dyslexic errors in Arabic text.

Future work will require considerably more resources and effort to extend the corpus with more text for analysis.


Thank you. Any questions?