P02- Towards a New Arabic Corpus of Dyslexic Texts


Maha Alamri, Elp003@bangor.ac.uk

William John Teahan, W.J.Teahan@bangor.ac.uk

School of Computer Science, Bangor University.

THE 2ND WORKSHOP ON ARABIC CORPORA AND PROCESSING TOOLS LREC2016

Outline

Introduction.

Arabic Corpus of Dyslexic Texts.

Towards Automatic Correction of Dyslexic Errors.

Conclusion.

Introduction

The focus of this presentation is the creation of a new Arabic corpus of texts written by dyslexics, together with software for automatic spelling correction of such texts.

Dyslexia:

The term has its roots in the Greek 'dys-', meaning difficulty with, and '-lexia', meaning language or words.

An inability to master the use of written language, including problems with comprehension.

1 in 10 people have dyslexia.

Introduction

The main area of interest lies in the zone where the following fields overlap, as illustrated:

[Figure: overlap diagram with the labels Dyslexia, Arabic, Corpus]

Automatic spelling correction

The term denotes the way in which a misspelled word is identified by a program and is then altered to its correct form.

Spelling Errors

Common Spelling Errors (Damerau, 1964):

Additional letters e.g. unniverse.

Omitted letters e.g. univrse.

Substituted letters e.g. umiverse.

Swapped letters e.g. uinverse.
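The four error types above can be told apart mechanically once the intended word is known. The following is a minimal sketch, not part of the original slides: the function name `classify_single_error` is hypothetical, and it only handles words that differ from the target by a single error.

```python
def classify_single_error(misspelt, target):
    """Return the Damerau error type, or None if it is not a single error."""
    if misspelt == target:
        return None
    shortest = min(len(misspelt), len(target))
    # Length of the common prefix.
    i = 0
    while i < shortest and misspelt[i] == target[i]:
        i += 1
    # Length of the common suffix that does not overlap the prefix.
    j = 0
    while (j < shortest - i
           and misspelt[len(misspelt) - 1 - j] == target[len(target) - 1 - j]):
        j += 1
    # What remains is the part of each word that actually differs.
    bad = misspelt[i:len(misspelt) - j]
    good = target[i:len(target) - j]
    if len(bad) == 1 and good == "":
        return "addition"
    if bad == "" and len(good) == 1:
        return "omission"
    if len(bad) == 1 and len(good) == 1:
        return "substitution"
    if len(bad) == 2 and bad == good[::-1]:
        return "transposition"
    return None


for wrong in ["unniverse", "univrse", "umiverse", "uinverse"]:
    print(wrong, classify_single_error(wrong, "universe"))
```

Stripping the shared prefix and suffix first keeps the check simple: each of the four types leaves a differing core of at most two characters.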

Dyslexia Spelling Errors

Words that contain silent letters (e.g. knife).

Morphemes that change when affixes are added, e.g. explain – explanation.

Dyslexic writers struggle with the relationship between the sound of a word and how it is spelt.

The inability to retain orthographic symbols in memory makes it difficult for dyslexics to remember the right order of letters in a word.

Spelling errors by Arabic writers with dyslexia

Phonetic errors.

Irregular spelling rules.

Word omission.

Hamza.

Long vowel.

Exchanging consonants.

Difficulty in writing the letters in the correct shape.

Spelling an Arabic word according to how the writer hears it in the local spoken dialect.

Arabic Corpus of Dyslexic Texts

The rate of misspellings is noticeably higher in texts written by children. The texts were therefore collected from female primary school students who had been professionally diagnosed with dyslexia and were being taught in resource rooms.

BDAC information:

Text: Writing exercises (Homework).

Size: 1067 words containing 694 errors.

Year: 2013.

Language: Arabic.

Country of production: Saudi Arabia (Riyadh).

The Bangor Dyslexic Arabic Corpus (BDAC) is a preliminary version, which aims to investigate whether a corpus can be used as an aid for Arabic dyslexic writers.

Example Dyslexic Text

[Figure: scanned image of one of the texts, written by a dyslexic female child (nine years old)]

Example Dyslexic Text

This example includes basic errors as below:

Analysis of the BDAC errors

3. Substitution (47 times), commonly found in: replacing (Heh - ه) with (Teh Marbuta - ة) or vice versa; changing (Heh - ه) or (Teh Marbuta - ة) to the letter (Teh - ت) or vice versa; and exchanging the letter (Dad - ض) with (Zah - ظ) or vice versa.

4. Transposition (19 times).
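The substitution pairs listed in item 3 can be turned into a small candidate generator. This is an illustrative sketch only: the names `CONFUSION_GROUPS` and `substitution_variants` are hypothetical, and the groups contain just the letters named in the analysis above, not a complete inventory of dyslexic confusions.

```python
# Letters that the analysis above reports as being exchanged for one another.
CONFUSION_GROUPS = [
    {"ه", "ة", "ت"},  # Heh, Teh Marbuta, Teh
    {"ض", "ظ"},       # Dad, Zah
]


def substitution_variants(word):
    """Return every spelling that differs from word by one confusable letter."""
    variants = set()
    for i, ch in enumerate(word):
        for group in CONFUSION_GROUPS:
            if ch in group:
                for alt in group - {ch}:
                    variants.add(word[:i] + alt + word[i + 1:])
    return variants


# A word written with a final Heh yields the Teh Marbuta and Teh spellings.
print(substitution_variants("مدرسه"))
```

A corrector could score each variant with a language model and keep the most probable one; the generator itself makes no judgement about which spelling is right.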

Towards Automatic Correction of Dyslexic Errors

The main tool employed was the Text Mining Toolkit (TMT). TMT is a software package designed specifically for tasks revolving around compression-based language modelling, text categorisation, text correction, and text segmentation.

The toolkit was used to correct a small number of the dyslexic errors using a method similar to the one that Alhawiti (2014) found effective for correcting errors in Arabic OCR text.

Towards Automatic Correction of Dyslexic Errors

First, it was crucial to choose a large corpus of Arabic text to train the compression-based language model created by the toolkit. After researching suitable corpora, the Bangor Arabic Compression Corpus (BACC) created by Dr. Khaled Alhawiti was chosen.

Due to the current limitations of the TMT software, correction of the dyslexic texts was applied only to one-to-one character errors, using the toolkit's markup correction capability, which finds the most probable corrected sequence given the compression-based language model.
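The idea of choosing the most probable corrected sequence can be sketched in a few lines. This is not the TMT software or its API: as a stated simplification, a character bigram model with add-one smoothing stands in for the compression-based model, and all names (`train_bigram_model`, `cost_in_bits`, `correct`) and the toy training text are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(text):
    """Count characters and character pairs in the training text."""
    padded = " " + text
    return Counter(padded), Counter(zip(padded, padded[1:])), set(padded)


def cost_in_bits(word, model):
    """Code length of word under an add-one smoothed character bigram model."""
    unigrams, bigrams, alphabet = model
    padded = " " + word + " "
    vocab = len(alphabet) + 1  # one extra slot for unseen characters
    bits = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        bits -= math.log2(p)
    return bits


def correct(word, model, confusions):
    """Try each one-to-one substitution and keep the cheapest spelling."""
    candidates = [word]
    for i, ch in enumerate(word):
        for alt in confusions.get(ch, ()):
            candidates.append(word[:i] + alt + word[i + 1:])
    return min(candidates, key=lambda w: cost_in_bits(w, model))


# Toy training text; a real model would be trained on a large corpus.
model = train_bigram_model("مدرسة كبيرة مدرسة جميلة")
confusions = {"ه": "ة", "ة": "ه"}  # Heh <-> Teh Marbuta
print(correct("مدرسه", model, confusions))
```

Fewer bits means a more probable spelling, which is the same criterion a compression-based model applies, although a model of that kind would normally use longer contexts than a single preceding character.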

Experimental Results

All errors containing more than one character were removed.

[Table: BDAC Corpus. Text: 1067 words. Errors: 694. One-to-one character errors: value not recoverable.]
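The filtering step can be sketched as follows, assuming that a one-to-one character error means a word in which exactly one character has been replaced by exactly one other character. That reading is an assumption, and the function names are hypothetical.

```python
def is_one_to_one(error, correct):
    """True when exactly one character was replaced by exactly one other."""
    if len(error) != len(correct):
        return False
    return sum(a != b for a, b in zip(error, correct)) == 1


def keep_one_to_one(pairs):
    """Keep only the (error, correct) pairs that are one-to-one errors."""
    return [(e, c) for e, c in pairs if is_one_to_one(e, c)]


pairs = [
    ("umiverse", "universe"),  # substitution: kept
    ("univrse", "universe"),   # omission: lengths differ, removed
    ("uinverse", "universe"),  # transposition: two positions differ, removed
]
print(keep_one_to_one(pairs))
```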

Experimental Results

[Figure: charts comparing Error and Correct counts, with panels including Sentences and Paragraphs; values not recoverable]

The TMT software was able to correct more than half of the one-to-one character errors.

Conclusion

The corpus used in this study offers a useful platform for analysing dyslexic text.

It provides a better understanding of how these errors occur and of the factors that determine them, which makes it suitable for assisting dyslexic writers.

This corpus can serve as a platform for other researchers to build upon.

A preliminary investigation was undertaken into using automatic processing techniques as a form of assistance for Arabic dyslexic writers and some initial success was achieved in the automatic correction of dyslexic errors in Arabic text.

Future work will require considerably more resources and effort to extend the corpus with more text for analysis.

Thank you. Any questions?
