
Intelligent Text Document Correction System Based on Similarity Technique


  • Intelligent Text Document

    Correction System Based on

    Similarity Technique

    A Thesis

    Submitted to the Council of the College of Information Technology,

    University of Babylon in Partial Fulfillment of the Requirements

    for the Degree of Master of Sciences in Computer Sciences.

    By

    Marwa Kadhim Obeid Al-Rikaby

    Supervised by

    Prof. Dr. Abbas Mohsen Al-Bakry

2015 A.D. 1436 A.H.

    Ministry of Higher Education and

    Scientific Research

University of Babylon - College of Information

    Technology

    Software Department



    Supervisor Certification

    I certify that this thesis was prepared under my supervision at the Department of

    Software / Information Technology / University of Babylon, by Marwa

Kadhim Obeid Al-Rikaby as partial fulfillment of the requirements for the degree of Master of Sciences in Computer Science.

    Signature:

Supervisor: Prof. Dr. Abbas Mohsen Al-Bakry

Title: Professor.

Date:   /   / 2015

    The Head of the Department Certification

    In view of the available recommendation, we forward this thesis for debate by

    the examining committee.

    Signature:

Name: Dr. Eman Salih Al-Shamery

    Title: Assist. Professor.

    Date: / / 2015


    To

    Master of creatures,

    Loved by Allah,

    The Prophet Muhammad

    (Allah bless him and his family)


    Acknowledgements

    All praise be to Allah Almighty who enabled me to complete this task

    successfully and utmost respect to His last Prophet Mohammad PBUH.

First, my appreciation is due to my advisor, Prof. Dr. Abbas Mohsen Al-Bakry, for his advice and guidance that led to the completion of this thesis.

    I would like to thank the staff of the Software Department for the help they

    have offered, especially, the head of the Software Department Dr. Eman Salih

    Al-Shamery.

    Most importantly, I would like to thank my parents, my sisters, my brothers

    and my friends for their support.


    Abstract

Automatic text correction is one of the human-computer interaction challenges. It is directly involved in several application areas, such as correcting digitized handwritten text, and indirectly in others, such as correcting users' queries before applying a retrieval process in interactive databases.

The automatic text correction process passes through two major phases: error detection and candidate suggestion. Techniques for both phases are categorized into procedural and statistical. Procedural techniques are based on using rules to govern a text's acceptability, including Natural Language Processing techniques. Statistical techniques, on the other hand, depend on statistics and probabilities collected from large corpora based on what is commonly used by humans.

In this work, natural language processing techniques are used as the basis for analysis and for both spelling and grammar acceptance checking of English texts. A prefix dependent hash-indexing scheme is used to shorten the time of looking up the underlying dictionary, which contains all English tokens. The dictionary is used as a base for the error detection process.

Candidates generation is based on calculating the similarity of the source token, measured using an improved Levenshtein method, to the dictionary tokens and ranking them accordingly; however, this process is time intensive; therefore, tokens are divided into smaller groups according to spelling similarity in a way that preserves random-access availability. Finally, candidates suggestion involves examining a set of features related to commonly committed mistakes. The system selects the optimal candidate, which provides the highest suitability without violating grammar rules, to generate linguistically accepted text.

Testing the system's accuracy showed better results than Microsoft Word and some other systems. The enhanced similarity measure reduced the time complexity to stay within the bounds of the original Levenshtein method while discovering an additional error type.


    Table of Contents

Subject    Page No.

    Chapter One : Overview

    1.1 Introduction 1

    1.2 Problem Statement 3

    1.3 Literature Review 5

    1.4 Research Objectives 10

    1.5 Thesis Outlines 11

    Chapter Two: Background and Related Concepts

    Part I: Natural Language Processing 12

    2.1 Introduction 12

    2.2 Natural Language Processing Definition 12

    2.3 Natural Language Processing Applications 13

    2.3.1 Text Techniques 14

    2.3.2 Speech Techniques 15

    2.4 Natural Language Processing and Linguistics 16

    2.4.1 Linguistics 16

    2.4.1.1 Terms of Linguistic Analysis 17

    2.4.1.2 Linguistic Units Hierarchy 19

    2.4.1.3 Sentence Structure and Constituency 19

    2.4.1.4 Language and Grammar 20

    2.5 Natural Language Processing Techniques 22

    2.5.1 Morphological Analysis 22

    2.5.2 Part of Speech Tagging 23

    2.5.3 Syntactic Analysis 26

    2.5.4 Semantic Analysis 27

    2.5.5 Discourse Integration 27

    2.5.6 Pragmatic Analysis 28

    2.6 Natural Language Processing Challenges 28

    2.6.1 Linguistics Units Challenges 28

    2.6.1.1 Tokenization 28

    2.6.1.2 Segmentation 29

    2.6.2 Ambiguity 31

    2.6.2.1 Lexical Ambiguity 31


    2.6.2.2 Syntactic Ambiguity 31

    2.6.2.3 Semantic Ambiguity 32

    2.6.2.4 Anaphoric Ambiguity 32

    2.6.3 Language Change 32

    2.6.3.1 Phonological Change 33

    2.6.3.2 Morphological Change 33

    2.6.3.3 Syntactic Change 33

    2.6.3.4 Lexical Change 33

    2.6.3.5 Semantic Change 34

    Part II: Text Correction 35

    2.7 Introduction 35

    2.8 Text Errors 35

    2.8.1 Non-words Errors 36

    2.8.2 Real-word Errors 36

    2.9 Error Detection Techniques 37

    2.9.1 Dictionary Looking Up 37

    2.9.1.1 Dictionaries Resources 37

    2.9.1.2 Dictionaries Structures 38

    2.9.2 N-gram Analysis 39

    2.10 Error Correction Techniques 40

    2.10.1 Minimum Edit Distance Techniques 40

    2.10.2 Similarity Key Techniques 43

    2.10.3 Rule Based Techniques 43

    2.10.4 Probabilistic Techniques 43

    2.11 Suggestion of Corrections 44

    2.12 The Suggested Approach 44

    2.12.1 Finding Candidates Using Minimum Edit Distance 45

    2.12.2 Candidates Mining 45

    2.12.3 Part-of-Speech Tagging and Parsing 46

    Chapter Three : Hashed Dictionary and Looking Up Technique

    3.1 Introduction 48

    3.2 Hashing 48

    3.2.1 Hash Function 49

    3.2.2 Formulation 52

    3.2.3 Indexing 53

    3.3 Looking Up Procedure 56


    3.4 Dictionary Structure Properties 58

    3.5 Similarity Based Looking-Up 59

    3.5.1 Bi-grams Generation 60

    3.5.2 Primary Centroids Selection 62

    3.5.3 Centroids Referencing 63

    3.6 Application of Similarity Based Looking up approach 64

    3.7 The Similarity Based Looking up Properties 67

    Chapter Four : Error Detection and Candidates Generation

    4.1 Introduction 69

    4.2 Non-word Error Detection 69

    4.3 Real-Words Error Detection 71

    4.4 Candidates Generation 72

    4.4.1 Candidates Generation for Non-word Errors 72

    4.4.1.2 Enhanced Levenshtein Method 74

    4.4.1.3 Similarity Measure 78

    4.4.1.4 Looking for Candidates 79

    4.4.2 Candidates Generation for Real-words Errors 81

    Chapter Five : Text Correction and Candidates Suggestion

    5.1 Introduction 82

    5.2 Correction and Candidates Suggestion Structure 82

    5.3 Named-Entity Recognition 85

    5.4 Candidates Ranking 86

    5.4.1 Edit Distance Based Similarity 87

    5.4.2 First and End Symbols Matching 87

    5.4.3 Difference in Lengths 88

    5.4.4 Transposition Probability 89

    5.4.5 Confusion Probability 90

    5.4.6 Consecutive Letters (Duplication) 91

    5.4.7 Different Symbols Existence 92

    5.5 Syntax Analysis 93

    5.5.1 Sentence Phrasing 93

    5.5.2 Candidates Optimization 95

    5.5.3 Grammar Correction 95

    5.5.4 Document Correction 97

    Chapter Six: Experimental Results, Conclusions, and Future Works


    6.1 Experimental Results 98

    6.1.1 Tagging and Error Detection Time Reduction 98

    6.1.1.1 Successful Looking Up 99

    6.1.1.2 Failure Looking Up 100

6.1.2 Candidates Generation and Similarity Search Space Reduction 101

    6.1.3 Time Reduction of the Damerau-Levenshtein method 103

    6.1.4 Features Effect on Candidates Suggestion 104

    6.2 Conclusions 107

    6.3 Future Works 108

    References 110

    Appendix A 117

    Appendix B 122

    List of Figures

Figure No.    Title    Page No.

    (2.1) NLP dimensions 16

    (2.2) Linguistics analysis steps 17

    (2.3) Linguistic Units Hierarchy 19

    (2.4) Classification of POS tagging models 24

    (2.5) An example of lexical change 34

    (2.6) Outlines of Spell Correction Algorithm 38

    (2.7) Levenshtein Edit Distance Algorithm 41

    (2.8) Damerau-Levenshtein Edit Distance Algorithm 42

    (2.9) The Suggested System Block Diagram 47

    (3.1) Token Hashing Algorithm 54


    (3.2) Dictionary Structure and Indexing Scheme 55

    (3.3) Algorithm of Looking Up Procedure 57

    (3.4) Semi Hash Clustering block diagram 61

    (3.5) Similarity Based Hashing algorithm 64

    (3.6) Block diagram of candidates generation using SBL 66

    (3.7) Similarity Based Looking up algorithm 68

    (4.1) Tagging Flow Chart 70

    (4.2) The Enhanced Levenshtein Method Algorithm 76

    (4.3) Original Levenshtein Example 77

    (4.4) Damerau-Levenshtein Example 77

    (4.5) Enhanced Levenshtein Example 78

    (5.1) Candidates ranking flowchart 84

    (5.2) Syntax analysis flowchart 94

    (6.1) Tokens distribution in primary packets 99

    (6.2) Tokens distribution in secondary packets 99

(6.3) Time complexity variance of Levenshtein, Damerau-Levenshtein, and Enhanced Levenshtein (our modification) 103

(6.4) Suggestion Accuracy with a comparison to Microsoft Office Word on a sample from Wikipedia 104

(6.5) Testing the suggested system accuracy and comparing the results with other systems using the same dataset 105

(6.6) Discarding one feature at a time for optimal candidate selection 106

    (6.7) Using one feature at a time for optimal candidate selection 107


    List of Tables

Table No.    Title    Page No.

    (1-1) Summary of Literature Review 9

    (3-1) Alphabet Encoding 50

    (3-2) Addressing Range 52

    (3-3) Predicting errors using Bi-grams analysis 61

    (5-1) Transposition Matrix 90

    (5-2) Confusion Matrix 91

    List of Symbols and Abbreviations

    Meaning Abbreviation

Alphabet Σ

    Adjectival Phrase A

    Absolute Difference abs

    Sentence Complement C

    Context Free Grammar CFG

    Dictionary D

Deoxyribonucleic Acid DNA

    Error E

    Grammar G

    Grammar Error Correction GEC

    Hidden Markov Model HMM

    Information Retrieval IR

    Machine Translation MT

    Named Entity NE

    Named-Entity Recognition NER

    Noun Group NG

    Natural Language Generation NLG

    Natural Language Processing NLP

    Natural Languages NLs

    Natural Language Understanding NLU


    Noun Phrase NP

big-Oh notation (= at most) O( )

    Optical Character Recognition OCR

    Production Rule P

    Part Of Speech POS

    Prepositional Phrase PP

    Query Q

    Ranking Value R

    Relative Distance R_Dist

    Start Symbol S

Statistical Machine Translation SMT

    Speech Recognition SR

    String1, String2 St1,St2

    Variable V

    Adverbial Phrase v

    Verb Phrase VP

big-Omega notation (= at least) Ω( )


    Chapter One

    Overview

    1.1 Introduction

Natural Language Processing, also known as Computational Linguistics, is the field of computer science that deals with linguistics; it is a form of human-computer interaction where formalization is applied to the elements of human language so they can be processed by a computer [Ach14]. Natural Language Processing (NLP) is the implementation of systems that are capable of manipulating and processing the sentences of natural languages (NLs) [Jac02] like English, Arabic, and Chinese, not formal languages like Python, Java, and C++, nor descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12]. The NLP task is the designing and building of software for analyzing, understanding, and generating spoken and/or written NLs. [Man08] [Mis13]

    NLP has many applications such as automatic summarization, Machine

    Translation (MT), Part-Of-Speech (POS) Tagging, Speech Recognition

    (SR), Optical Character Recognition (OCR), Information Retrieval (IR),

    Opinion Mining [Nad11], and others [Wol11].

Text Correction is another significant application of NLP. It includes both Spell Checking and Grammar Error Correction (GEC). Spell checking research extends back to the middle of the 20th century with Les Earnest at Stanford University, but the first application was created in 1971 by Ralph Gorin, Earnest's student, for the DEC PDP-10 mainframe with a dictionary of 10,000 English words. [Set14] [Pet80]

Grammar error correction, in spite of its central role in semantic and meaning representations, has been largely ignored by the NLP community. In recent years, an improvement has been noticed in automatic GEC techniques. [Voo05] [Jul13] However, most of these techniques are limited to specific domains such as real-word spell correction [Hwe14], subject-verb disagreement [Han06], verb tense misuse [Gam10], determiners or articles, and improper preposition usage. [Tet10] [Dah11]

Different techniques like edit distance [Wan74], rule-based techniques [Yan83], similarity key techniques [Pol83] [Pol84], n-grams [Zha98], probabilistic techniques [Chu91], neural nets [Hod03], and the noisy channel model [Tou02] have been proposed for text correction purposes. Each technique needs some sort of resource. Edit distance, rule-based, and similarity key techniques require a dictionary (or lexicon); n-gram and probabilistic techniques work with statistical and frequency information; neural nets are trained with training patterns; and so on.

Text correction, spelling and grammar, is an extensive process that typically includes three major steps: [Ach14] [Jul13]

The first step is to detect the incorrect words. The most popular way to decide whether a word is misspelled is to look for it in a dictionary, a list of correctly spelled words. This way can detect non-word errors but not real-word errors [Kuk92] [Mis13], because an unintended word may still match a word in the dictionary. NLs have a large number of words, resulting in a huge dictionary; therefore, the task of looking up every word consumes a long time. In GEC, this step is more complicated: it requires applying more analysis at the level of sentences and phrases, using computational linguistics basics, to detect the word that makes the sentence incorrect.
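To make the idea concrete, the following minimal Python sketch illustrates dictionary-based non-word detection (an illustration only, not this system's implementation; the dictionary file name is hypothetical):

def load_dictionary(path="dictionary.txt"):
    # One correctly spelled token per line; a set gives O(1) average lookup.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f}

def non_word_errors(text, dictionary):
    # Flag tokens absent from the dictionary. Real-word errors pass this
    # test unnoticed, exactly as described above.
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return [t for t in tokens if t and t not in dictionary]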

Next, a list of candidates or alternatives should be generated for the incorrect word (misspelled or misused). This list is preferred to be short and to contain the words with the highest similarity or suitability; to produce it, a technique is needed to calculate the similarity of the incorrect word to every word in the dictionary. Efficiency and accuracy are major factors in the selection of such a technique. GEC requires broad knowledge of diverse grammatical error categories and extensive linguistic technique to identify alternatives, because a grammatical error may not result from a single word.

Finally, the intended word, or a list of alternatives containing it, is suggested. This task requires ranking the words according to their similarity to the incorrect word; other considerations may or may not be taken into account depending on the technique in use.
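As a quick illustration of the second and third steps together, Python's standard difflib module can produce a ranked shortlist of similar dictionary words (shown only to fix the idea; the system itself uses the enhanced Levenshtein measure and the features described in later chapters):

import difflib

def suggest(misspelled, dictionary, n=5):
    # Returns up to n dictionary words ranked by a similarity ratio in [0, 1].
    return difflib.get_close_matches(misspelled, dictionary, n=n, cutoff=0.6)

print(suggest("studnet", ["student", "study", "sturdy", "stupid"]))
# e.g. ['student', ...]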

Text mining techniques have started to enter the area of text correction; clustering [Zam14], Named-Entity Recognition (NER) [Bal00] [Rit11], and Information Retrieval [Kir11] are examples. Statistics and probabilities have also played a great role, specifically in analyzing common mistakes and n-gram datasets [Ahm09] [Gol96] [Amb08]. Clustering, at both the syllable and phonetic levels, can be used to reduce the looking-up space; NER may help in avoiding interpreting proper nouns as misspellings; statistics merged with NLP techniques provide more precise parsing and POS tagging, usually in context dependent applications. The application of a given technique differs according to what level of correction is intended; it starts from the character level [Far14], passes through word, phrase (usually in GEC), and sentence, and ends at the context or document subject level.

    1.2 Problem Statement

Although many text checking and correction systems have been produced, each has its own variations in terms of input quality restrictions, techniques used, output accuracy, speed, performance conditions, etc. [Ahm09] [Pet80]. This field of NLP is truly open research from all sides because there is no complete algorithm or technique that handles all considerations.


The limited linguistic knowledge, the huge number of lexicons, the extensive grammar, language ambiguity and change over time, the variety of committed errors, and computational requirements are challenges facing the process of developing a text correction application.

In this work, some of the above mentioned problems are solved using a set of solutions:

- Integrating two lexicon datasets (WordNet and Ispell).
- Using a brute-force approach to solve some sorts of ambiguity.
- Applying hashing and indexing in looking up the dictionary.
- Reducing the search space in the candidates collecting process by grouping similarly spelled words into semi clusters.

The Levenshtein method [Hal11] is also enhanced to consider Damerau's four types of errors within a time period shorter than the Damerau-Levenshtein method [Hal11]. Named-Entity Recognition, letter confusion and transposition, and the candidate length effect are used as features to optimize the candidates' suggestion, in addition to applying rules of Part-Of-Speech tags and sentence constituency for checking sentence grammar correctness, whether or not the sentence is lexically correct.
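For reference, the following is a compact Python version of the standard (restricted) Damerau-Levenshtein distance covering the four error types: insertion, deletion, substitution, and transposition of adjacent letters. It is shown only as a baseline sketch; the enhanced method itself is detailed in Chapter Four:

def damerau_levenshtein(s1, s2):
    # Dynamic programming over a (len(s1)+1) x (len(s2)+1) cost table.
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(s2) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution (or match)
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(s1)][len(s2)]

print(damerau_levenshtein("form", "from"))  # 1: one adjacent transposition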

The three proposed components of this system are: (1) a spell error detector based on a fast looking-up technique in a dictionary of more than 300,000 tokens, constructed by applying a string prefix dependent hash function and indexing method, while the grammar error detector is a brute-force parser; (2) for candidates generation, an enhancement was implemented on the Levenshtein method to consider Damerau's four error types and then used to measure similarity according to the minimum edit distance and the effect of difference in lengths, with the dictionary tokens grouped into spell based clusters to reduce the search space; (3) the candidates suggestion exploits NER features, transposition error and confusion statistics, affixes analysis (including first and last letter matching), length of candidates, and parsing success.
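The prefix-dependent idea can be pictured with the following rough Python sketch (an assumed two-level bucketing for illustration only; the actual hash function and index layout are defined in Chapter Three):

from collections import defaultdict

def prefix_key(token, width=2):
    # Tokens sharing the same first letters land in the same packet,
    # so a lookup scans only a small fraction of the dictionary.
    return token[:width].lower()

def build_index(tokens):
    index = defaultdict(set)
    for t in tokens:
        index[prefix_key(t)].add(t.lower())
    return index

index = build_index(["apple", "apply", "banana", "band"])
print("apply" in index[prefix_key("apply")])  # True, scanning only the 'ap' packet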

    1.3 Literature Review

Asha A. and Bhuma V. R., 2014, introduced a probabilistic approach to string transformation that includes a model consisting of rules and weights for training, and an algorithm that depends on scoring and ranking according to a conditional probability distribution for generating the top k-candidates at the character level, where both high and low frequency words can be generated. Spell checking is one of many applications to which the approach was applied; the misspelled strings (words or characters) are transformed, by applying a number of operators, into the k most similar strings in a dictionary (start and end letters are constants). [Ach14]

Mariano F., Zheng Y., and others, 2014, tackled the correction of grammatical errors by pipelining processes, combining results from multiple systems. The components of the approach are: a rule based error corrector that uses rules automatically derived from the Cambridge Learner Corpus, based on N-grams that have been annotated as incorrect; an SMT system that translates incorrectly written English into correct English; NLTK1, used to perform segmentation, tokenization, and POS tagging; a candidates generation step that produces all the possible combinations of corrections for the sentence, in addition to the sentence itself to consider the "no correction" option; finally, the candidates are ranked using a language model. [Fel14]

    __________________________________________________________

    1 The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research

    and teaching in computational linguistics and natural language processing. NLTK is written in Python and

    distributed under the GPL open source license. Over the past year the toolkit has been rewritten,

    simplifying many linguistic data structures and taking advantage of recent enhancements in the Python

    language.


Anubhav G., 2014, presented a rule-based approach that used two POS taggers, the Stanford parser and TreeTagger, to correct non-native English speakers' grammatical errors. The detection of errors depends on the outputs of the two taggers; if they differ, then the sentence is not correct. Errors are corrected using the Nodebox English Linguistics library. Error correction includes subject-verb disagreement, verb form, and errors detected by POS tag mismatch. [Gup14]

Stephan R., 2013, proposed a model for spelling correction based on treating words as "documents" and spell correction as a form of document retrieval, in that the model retrieves the best matching correct spelling for a given input. The words are transformed into tiny documents of bits, and Hamming distance is used to predict the closest string of bits from a dictionary holding the correctly spelled words as strings of bits. The model is knowledge free and only contains a list of correct words. [Raa13]

Youssef B., 2012, produced a parallel spell checking algorithm for spelling error detection and correction. The algorithm is based on information from the Yahoo! N-gram dataset 2.0; it is a shared memory model allowing concurrency among threads for both parallel multi-processor and multi-core machines. The three major components (error detector, candidates generator, and error corrector) are designed to run in a parallel fashion. The error detector, based on unigrams, detects non-word errors; the candidates generator is based on bi-grams; the error corrector, which is context sensitive, is based on 5-gram information. [Bas12]

Hongsuck S., Jonghoon L., Seokhwan K., Kyusong L., Sechun K., and Gary G. L., 2012, presented a novel method for grammatical error correction by building a meta-classifier. The meta-classifier decides the final output depending on the internal results from several base classifiers; they used multiple grammatical-error-tagged corpora with different properties in various aspects. The method focused on articles, and the correction arises only when a mismatch occurs with the observed articles. [Seo12]

Kirthi J., Neeju N. J., and P. Nithiya, 2011, proposed a semantic information retrieval system performing automatic spell correction for user queries before applying the retrieval process. The correcting procedure depends on matching the misspelled word against a dictionary of correctly spelled words using the Levenshtein algorithm. If an incorrect word is encountered, the system retrieves the most similar word depending on the Levenshtein measure and the occurrence frequency of the misspelled word. [Kir11]

Farag, Ernesto, and Andreas, 2008, developed a language-independent spell checker. It is based on an enhancement of the N-gram model: creating a ranked list of correction candidates derived from N-gram statistics and lexical resources, then selecting the most promising candidates as correction suggestions. Their algorithm assigns weights to the possible suggestions to detect non-word errors. They depended on a "MultiWordNet" dictionary of about 80,000 entries. [Ahm09]

Mays, Damerau, and Mercer, 2008, designed a noisy-channel model of real-word spelling error correction. They assumed that the observed sentence is a signal passed through a noisy channel, where the channel reflects the typist and the distortion reflects errors committed by the typist. The probability of the sentence's correctness, given by the channel (typist), is a parameter associated with that sentence. The probability of every word in the sentence being the intended one is equivalent to the sentence correctness probability, and each word is associated with a set of spelling-variant words excluding the word itself. Correction can be applied to one word in the sentence by replacing the incorrect one with another from its candidates set (its real-word spelling variations) so that it gives the maximum probability. [Amb08]

Stoyan, Svetla, and others, 2005, described an approach for lexical post-correction of the output of an optical character recognizer (OCR), developed within a research project. They worked on multiple sides: on the dictionary side, they enriched their large dictionaries with specialty dictionaries; for candidate selection, they used a very fast searching algorithm that depends on Levenshtein automata for efficiently selecting correction candidates with a bound not exceeding 3; and they ranked candidates depending on a number of features such as frequency and edit distance. [Mih04]

Suzan V., 2002, described a context sensitive spell checking algorithm based on the BESL spell checker lexicons and word trigrams for detecting and correcting real-word errors using probability information. The algorithm splits the input text into trigrams, and every trigram is looked up in a precompiled database containing a list of trigrams and the number of their occurrences in the corpus used for compiling the database. A trigram is correct if it is in the trigram database; otherwise, it is considered an erroneous trigram containing a real-word error. The correction algorithm uses the BESL spell checker to find candidates, but those most frequent in the trigram database are suggested to the user. [Ver02]
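Schematically, the trigram check described in [Ver02] amounts to the following sketch (an illustration of the idea only, not the BESL implementation; the trigram database would be precompiled from a corpus):

def word_trigrams(words):
    # Split a token list into overlapping word trigrams.
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def suspicious_trigrams(words, trigram_db):
    # A trigram absent from the precompiled database possibly
    # contains a real-word error.
    return [t for t in word_trigrams(words) if t not in trigram_db]

db = {("the", "dog", "ate"), ("dog", "ate", "a")}
print(suspicious_trigrams("the dog eat a".split(), db))
# [('the', 'dog', 'eat'), ('dog', 'eat', 'a')]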


Table 1.1: Summary of Literature Review

No. 1, Reference [Ach14]
Methodology: Generating the top k-candidates at the character level for both high and low frequency words.
Technique: A model consisting of rules and weights, and an algorithm dependent on a conditional probability distribution.

No. 2, Reference [Fel14]
Methodology: Grammatical error correction based on generating all possible correct alternatives for the sentence.
Technique: Combining the results of multiple systems: a rule based error corrector, an SMT incorrect-English-to-correct-English translator, and NLTK for segmentation, tokenization, and tagging.

No. 3, Reference [Gup14]
Methodology: Correction of non-native English speakers' grammatical errors.
Technique: Error detection uses the Stanford parser and TreeTagger; correction is based on the Nodebox English Linguistics library.

No. 4, Reference [Raa13]
Methodology: Dictionary based spell correction that treats the misspelled word as a document.
Technique: Converting the misspelled word into a tiny document of bits and retrieving the most similar documents using Hamming distance.

No. 5, Reference [Bas12]
Methodology: Context sensitive spell checking using a shared memory model allowing concurrency among threads for parallel execution.
Technique: Different N-gram levels for error detection, candidates generation, and candidates suggestion depending on the Yahoo! N-gram dataset 2.0.

No. 6, Reference [Seo12]
Methodology: Meta-classifier for grammatical error correction focused mainly on articles.
Technique: Deciding the output depending on the internal results from several base classifiers.

No. 7, Reference [Kir11]
Methodology: Automatic spell correction for user queries before applying the retrieval process.
Technique: Using the Levenshtein algorithm for both error detection and correction in a dictionary looking-up technique.

No. 8, Reference [Ahm09]
Methodology: Language independent model for non-word error correction based on N-gram statistics and lexical resources.
Technique: Ranking a list of correction candidates by assigning weights to the possible suggestions depending on a "MultiWordNet" dictionary of about 80,000 entries.

No. 9, Reference [Amb08]
Methodology: Noisy channel model for real-word error correction based on probability.
Technique: The channel represents the typist, the distortion represents the error, and the noise probability is a parameter.

No. 10, Reference [Mih04]
Methodology: OCR output post-correction.
Technique: Levenshtein automata for candidates generation and frequency for ranking.

No. 11, Reference [Ver02]
Methodology: Context sensitive spell checking algorithm based on trigrams.
Technique: Splitting texts into word trigrams and matching them against the precompiled BESL spell checker lexicons; suggestion depends on probability information.

    1.4 Research Objectives

This research attempts to design and implement a smart text document correction system for English texts. It is based on mining a typed text to detect spelling and grammar errors and to give the optimal suggestion(s) from a set of candidates. Its steps are:

1. Analyzing the given text by using Natural Language Processing techniques, detecting the erroneous words at each step.

2. Looking up candidates for the erroneous words and ranking them according to a given set of features and conditions to be the initial solutions.

3. Optimizing the initial solutions depending on the information extracted from the given text and the detected errors.


4. Recovering the input text document with the optimal solutions and associating the best set of candidates with each detected incorrect word.

    1.5 Thesis Outlines

The next five chapters are:

1. Chapter Two, "Background and Related Concepts", consists of two parts. The first overviews NLP fundamentals, applications, and techniques, whereas the second is about text correction techniques.

2. Chapter Three, "Dictionary Structure and Looking up Technique", describes the suggested approach of constructing the system's dictionary for both perfect matching and similarity looking up.

3. Chapter Four, "Error Detection and Candidates Generation", declares the suggested technique for indicating incorrect words and the method of generating candidates.

4. Chapter Five, "Automatic Text Correction and Candidates Suggestion", describes the techniques of suggestion selection and optimization.

5. Chapter Six, "Experimental Results, Conclusions, and Future Works", shows the experimental results of applying the techniques described in chapters three, four, and five, the conclusions of the system, and future directions.


    Chapter Two

    Background and Related Concepts

    Part I

    Natural Language Processing

    2.1 Introduction

Natural Language Processing (NLP) began in the late 1940s. It was focused on machine translation; in 1958, NLP was linked to information retrieval by the Washington International Conference on Scientific Information. [Jon01] Primary ideas for developing applications for detecting and correcting text errors started in that period of time. [Pet80] [Boo58]

Natural Language Processing has attracted great interest from that time to the present day because it plays an important role in the interaction between humans and computers. It represents the intersection of linguistics and artificial intelligence [Nad11], where machines can be programmed to manipulate natural language.

    2.2 Natural Language Processing Definition

    "Natural Language Processing (NLP) is the computerized approach

    for analyzing text that is based on both a set of theories and a set of

    technologies." [Sag13]

NLP describes the function of software or hardware components in a computer system that are capable of analyzing or synthesizing human languages (spoken or written) [Jac02] [Mis13] like English, Arabic, Chinese, etc., not formal languages like Python, Java, C++, etc., nor descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12].

    "NLP is a tool that can reside inside almost any text processing

    software application" [Wol11]

We can define NLP as a subfield of Artificial Intelligence encompassing anything needed by a computer to understand and generate natural language. It is based on processing human language for two tasks. The first receives a natural language input (text or speech), applies analysis, reasons about what was meant by that input, and outputs in computer language; this is the task of Natural Language Understanding (NLU). The second task is to generate human sentences according to specific considerations; the input is in computer language but the output is in human languages; this is called Natural Language Generation (NLG). [Raj09]

    "Natural Language Understanding is associated with the more

    ambitious goals of having a computer system actually comprehend natural

    language as a human being might". [Jac02]

    2.3 Natural Language Processing Applications

Despite its wide usage in computer systems, NLP has entirely disappeared into the background, where it is invisible to the user and adds significant business value. [Wol11]

    The major distinction of NLP applications from other data

    processing systems is that they use Language Knowledge. Natural

    Language Processing applications are mainly divided into two categories

    according to the given NL format [Mom12] [Wol11]:


2.3.1 Text Technologies

Spell and Grammar Checking: systems dealing with indicating lexical and grammar errors and suggesting corrections.

Text Categorization and Information Filtering: in such applications, NLP represents the documents linguistically and compares each one to the others. In text categorization, the documents are grouped according to their linguistic representation characteristics into several categories. Information filtering singles out, from a collection of documents, the documents that satisfy some criterion.

Information Retrieval: finds and collects information relevant to a given query. A user expresses the information need by a query, then the system attempts to match the given query to the database documents that satisfy the user's query. The query and documents are transformed into a sort of linguistic structure, and the matching is performed accordingly.

Summarization: according to an information need or a query from the user, this type of application finds the most relevant part of the document.

Information Extraction: refers to the automatic extraction of structured information from unstructured sources; structured information like entities, their relationships, and the attributes describing them. This can integrate structured and unstructured data sources, if both exist, and pose queries spanning the integrated information, giving better results than applying keyword searches alone.

Question Answering: works with plain speech or text input and applies an information search based on the input, such as IBM Watson, the reigning JEOPARDY! champion, which reads questions, understands their intention, and then looks up the knowledge library to find a match.

Machine Translation: translates a given text from one natural language to another; some applications have the ability to recognize the language of the given text even if the user didn't specify it correctly.

    Data Fusion: Combining extracted information from several text

    files into a database or an ontology.

Optical Character Recognition: digitizing handwritten and printed texts, i.e., converting characters from images to digital codes.

Classification: this NLP application type sorts and organizes information into relevant categories, like e-mail spam filters and the Google News service. NLP has also entered other applications such as educational essay test-scoring systems, voice-mail phone trees, and even e-mail spam detection software.

    2.3.2 Speech Technologies

Speech Recognition: mostly used in telephone voice response systems serving clients. Its task is processing plain speech. It is also used to convert speech into text.

    Speech Synthesis: means converting text into speech. This

    process requires working at the level of phones and converting

    from alphabetic symbols into sound signals.


    2.4 Natural Language Processing and Linguistics

Natural Language Processing is concerned with three dimensions: language, algorithm, and problem, as presented in figure (2.1). The language dimension considers linguistics; the algorithm dimension covers NLP techniques and tasks; while the problem dimension depicts the mechanisms applied to solve problems. [Bha12]

    2.4.1 Linguistics

Natural Language is a means of communication. It is a system of arbitrary signals such as voice sounds and written symbols. [Ali11] Linguistics is the scientific study of language; it starts from the simple acoustic signals which form sounds and ends with pragmatic understanding to produce the full context meaning.

There are two major levels of linguistic analysis, Speech Recognition (SR) and Natural Language Processing (NLP), as shown in figure (2.2).

    Figure (2.1) : NLP dimensions [Bha12]


    2.4.1.1 Terms of Linguistic Analysis

A natural language, as a formal language does, has a set of basic components that may vary from one language to another but remain bounded under specific considerations, giving each language its special characteristics.

From the computational view, a language is a set of strings generated over a finite alphabet and can be characterized by a grammar. The definition of the three abstracted names depends on the language itself; i.e. strings, alphabet, and grammar formulate and characterize a language.

[Figure (2.2) : Linguistics analysis steps [Cha10]. The analysis proceeds from the acoustic signal through phones (phonetics, phonology), letters and strings (lexicon), morphemes (morphology), words, phrases and sentences (syntax), meaning out of context (semantics), and meaning in context (pragmatics); SR covers the lower levels and NLP the upper ones.]

Strings:

In natural language processing, the strings are the morphemes of the language, their combinations (words), and the combinations of their combinations (sentences), but linguistics goes somewhat deeper than this. It starts with phones, the primitive acoustic patterns, which are significant and distinguishable from one natural language to another. Phonology groups phones together to produce phonemes, represented by symbols. Morphemes consist of one or more symbols; thus, NLs can be further distinguished.

Alphabet:

When individual symbols, usually thousands, represent words, the language is "logographic"; if the individual symbols represent syllables, it is a "syllabic" one; but when they represent sounds, the language is "alphabetic". Syllabic and alphabetic languages typically have fewer than 100 symbols, unlike logographic ones.

English is an alphabetic language system consisting of 26 symbols; these symbols represent phones, which are combined into morphemes, which may or may not be combined further to form words.

Grammar:

Grammar is a set of rules specifying the legal structures of the language; it is a declarative representation of the language's syntactic facts. Usually, grammar is represented by a set of production rules.


    2.4.1.2 Linguistic Units Hierarchy

Language can be divided into pieces; there is a typical structure or form for every level of analysis. Those pieces can be put into a hierarchical structure starting from a meaningful sentence at the top level and proceeding in the separation of building units until reaching the primary acoustic sounds. Figure (2.3) presents an example.

Figure (2.3) : Linguistic Units Hierarchy; the sentence "The teacher talked to the students" decomposed into phrases, words, morphemes (the + teach + er + talk + ed + to + the + student + s), and phonemes, whose symbols follow the phonetic codes used by the OXFORD dictionaries.

    2.4.1.3 Sentence Structure and Constituency

    "It is constantly necessary to refer to units smaller than the sentence

    itself units such as those which are commonly referred as CLAUSE,

    PHRASE, WORD, and MORPHEME. The relation between one unit and

    another unit of which it is a part is CONSTITUENCY." [Qui85]

The task of dividing a sentence into constituents is a complex task that


requires incorporating more than one analysis stage: tokenization, segmentation, parsing, and sometimes stemming are usually merged together to build the parse tree for a given sentence.

    2.4.1.4 Language and Grammar

A language is a 'set' of sentences, and a sentence is a 'sequence' of 'symbols' [Gru08]; it can be generated given its context free grammar G = (V, Σ, S, P). [Cla10]

Commonly, grammars are represented as a set of production rules which are taken by the parser and compared against the input sentences. Every matched rule adds something to the sentence's complete structure, which is called a 'parse tree'. [Ric91]

Context free grammar (CFG) is a popular method for generating formal grammars. It is used extensively to define language syntax. The four components of the grammar are defined in CFG as [Sag13]:

Terminals (Σ): the basic elements which form the strings of the language.

Nonterminals or Syntactic Variables (V): sets of strings defining the language which is generated by the grammar. Nonterminals represent a key in syntax analysis and translation via imposing a hierarchical structure on the language.

Set of production rules (P): this set defines the way of combining terminals with nonterminals to produce strings. A production rule consists of a variable on its left side, representing its head, and a body of terminals and/or nonterminals on its right side.

Start symbol (S): the designated nonterminal from which every derivation starts.

The following is an example describing the structure of an English sentence:


V = {S, NP, N, VP, V, Art}

Σ = {boy, icecream, dog, bite, like, ate, the, a}

P = {S → NP VP,
     NP → N,
     NP → Art N,
     VP → V NP,
     N → boy | icecream | dog,
     V → ate | like | bite,
     Art → the | a}
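This grammar can be tried directly; the sketch below writes it in the notation of the NLTK toolkit mentioned in Section 1.3 (assuming NLTK is installed; this is an illustration, not part of the proposed system):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> N | Art N
VP -> V NP
N -> 'boy' | 'icecream' | 'dog'
V -> 'ate' | 'like' | 'bite'
Art -> 'the' | 'a'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog ate the icecream".split()):
    print(tree)
# (S (NP (Art the) (N dog)) (VP (V ate) (NP (Art the) (N icecream))))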

The grammar specifies two things about the language: [Ric91]

- Its weak generative capacity: the limited set of sentences which can be completely matched by a series of grammar rules.

- Its strong generative capacity: the grammatical structure(s) of each sentence in the language.

Generally, for each grammar there are an infinite number of sentences which can be structured with it. The strength and importance of grammars lie in their ability to supply structure to an infinite number of sentences, because they succinctly summarize the structures of an infinite number of objects of a certain class. [Gru08]

The grammar is said to be generative if it has a fixed-size set of production rules which, if followed, can generate every sentence in the language in a finite number of actions. [Gru08]


    2.5 Natural Language Processing Techniques

    2.5.1 Morphological Analysis

Morphology is the study of how words are constructed from morphemes, which represent the minimal meaning-bearing primitive units of a language. [Raj09] [Jur00]

There are two broad classes of morphemes: stems and affixes; the distinction between the two classes is language dependent in that it varies from one language to another. The stem usually refers to the main part of the word, and affixes can be added to words to give them additional meaning. [Jur00]

Furthermore, affixes can be divided into four categories according to the position where they are added. Prefixes, suffixes, circumfixes, and infixes generally refer to the different types of affixes, but it is not necessary for a language to have all the types. English accepts both prefixes, which precede stems, and suffixes, which follow stems, while there is no good example of a circumfix (preceding and following a stem) in English, and infixing (inserting inside the stem) is not allowed (unlike German and Philippine languages, respectively). [Jur00]

    Morphology is concerned with recognizing the modification of base

    words to form other words with different syntactic categories but similar

    meanings.

Generally, three forms of word modification are found [Jur00]:

Inflection: syntactic rules change the textual representation of words, such as adding the suffix 's' to convert nouns into plurals, or adding 'er' and 'est' to convert regular adjectives into comparative and superlative forms, respectively. This type of modification usually results in a word from the same word class as the stem word.

Derivation: new words are produced by adding morphemes; this is usually more complex and harder in meaning than inflectional morphology. It often occurs in a regular manner and results in words that differ in their word class from the stem word, like adding the suffix 'ness' to 'happy' to produce 'happiness'.

Compounding: this type modifies stem words with other stem words by grouping them, like grouping 'head' with 'ache' to produce 'headache'. In English, this type is infrequent.

Morphological processing, also known as stemming, depends heavily on the language being analyzed. The output is the set of morphemes that are combined to form words. Morphemes can be stem words, affixes, and punctuation marks.
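As a small illustration of suffix stripping, NLTK's Porter stemmer (a rule-based approximation, assumed here only for demonstration) reduces inflected and derived forms to stems, which need not themselves be dictionary words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["talked", "students", "happiness"]:
    print(word, "->", stemmer.stem(word))
# talked -> talk, students -> student, happiness -> happi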

    2.5.2 Part Of Speech Tagging

Part of Speech (POS) tagging is the process of giving the proper lexical information or POS tag (also known as word classes, lexical tags, and morphological classes), encoded as a symbol, to every word (or token) in a sentence. [Sco99] [Has06b]

    In English, POS tags are classified into four basic classes of words: [Qui85]

    1. Closed classes: include prepositions, pronouns, determiners,

    conjunctions, modal verbs and primary verbs.

    2. Open classes: include nouns, adjectives, adverbs, and full verbs.

3. Numerals: include cardinal numbers and ordinals.

    4. Interjections: include small set of words like oh, ah, ugh, phew.

Usually, a POS tag indicates one or more of the previous classes, and it sometimes holds other features like the tense of the verb or the number (plural or singular). POS tagging may generate tagged corpora or serve as a preprocessing step for subsequent NLP processes. [Sco99]

The performance of most tagging systems is typically limited because they only use the local lexical information available in the sentence, in contrast to syntax analyzing systems, which exploit both lexical and structural information. [Sco99] More research was done, and several models and methods have been proposed to enhance tagger performance; they fall mainly into supervised and unsupervised methods, where the main difference between the two categories is that the training corpora are pre-tagged in supervised methods, unlike in unsupervised methods, which need advanced computational methods to obtain such corpora. [Has06a] [Has06b] Figure (2.4) presents the main categories and shows some examples.

    In both categories, the following are the most popular:

    Figure (2.4) : Classification of POS tagging models [Has06a]


Statistical (stochastic, or probabilistic) methods: taggers using these methods are first trained on a correctly tagged set of sentences, which allows the tagger to disambiguate words by extracting implicit rules or picking the most probable tag based on the words surrounding the given word in the sentence. Examples of these methods are Maximum-Entropy models, Hidden Markov Models (HMM), and Memory Based models.

Rule based methods: a sequence of rules, a set of hand written rules, is applied to detect the best tag set for the sentence regardless of any probability maximization. The set of rules needs to be written properly and checked by human experts. Examples: the path-voting constraint models and decision tree models.

Transformational approach: combines both statistical methods and rule based methods to first find the most probable set of available tags and then apply a set of rules to select the best.

Neural Networks: with a linear separator or a full neural network, these have been used for tagging processes.

The methods described above, as in any other research area, have their advantages and disadvantages, but there is a major difficulty facing all of them: the tagging of unknown words (words that have never been seen before in the training corpora). While rule-based approaches depend on a special set of rules to handle such situations, stochastic and neural net methods lack this feature and use other means, such as suffix analysis and n-grams, by applying morphological analysis; some methods use a default set of tags to disambiguate unknown words. [Has06a]
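For illustration, NLTK's default statistical tagger can be applied as follows (assuming the required NLTK resources are downloaded; the tag set shown is the Penn Treebank one):

import nltk

tokens = nltk.word_tokenize("The teacher talked to the students")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('teacher', 'NN'), ('talked', 'VBD'),
#       ('to', 'TO'), ('the', 'DT'), ('students', 'NNS')]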


    2.5.3 Syntactic Analysis

    "Syntax is the study of the relationships between linguistics forms,

    how they are arranged in sequence, and which sequences are well-

    formed". [Yul00]

    Syntactic analysis, also referred by "Parsing", is the process of

    converting the sentence from its flat format which is represented as a

    sequence of words into a structure that defines its units and the relations

    between these units. [Raj09]

Hence, the goal of this technique is to transform natural language into an internal system representation. The format of this representation may be dependency graphs, frames, trees, or some other structural representation. Syntactic parsing attempts only to convert sentences into either dependency links representing the utterance's syntactic structure or a tree structure, and the output of this process is called a "parse tree" or simply a "parse". [Dzi04] The parse tree of the sentence holds its meaning at the level of the smallest parts ("words" in the terms of language scientists, "tokens" in the terms of computer scientists). [Gru08]

Syntactic analysis makes use of both the results of morphological analysis and Part-Of-Speech tagging to build the structural description of the sentence by applying the grammar rules of the language under consideration; if a sentence violates the rules, it is rejected and marked as incorrect. [Raj09]

    The two main components of every syntax analyzer are:

Grammar: the grammar provides the analyzer with the set of production rules used to construct the structure of sentences, and it specifies the correctness of every given sentence.


Good grammars make a careful distinction between the sentence/word level, which they often call syntax or syntaxis, and the word/letter level, which they call morphology. [Gru08]

Parser: the parser reconstructs the production tree (or trees) by applying the grammar to indicate how the given sentence (if correctly constructed) was produced from that grammar. Parsing is thus the process of structuring a linear representation in accordance with a given grammar.

Today, most parsers combine context-free grammars with probability models to determine the most likely syntactic structure out of the many that are accepted as parse trees for an utterance. [Dzi04]
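As a toy-scale illustration of a parser (not the grammar used in this work), the following Python sketch hand-codes a recursive-descent parser for the tiny grammar S → NP VP, VP → V NP, NP → "John" | Det N, and returns the parse tree as nested tuples; the lexicon is an illustrative assumption:

    # Tiny grammar: S -> NP VP ; VP -> V NP ; NP -> "John" | Det N
    def parse_sentence(tokens):
        np, rest = parse_NP(tokens)
        if np is None:
            return None
        vp, rest = parse_VP(rest)
        if vp is None or rest:           # reject if anything is left over
            return None
        return ("S", np, vp)

    def parse_NP(tokens):
        if tokens and tokens[0] == "John":
            return ("NP", "John"), tokens[1:]
        if len(tokens) >= 2 and tokens[0] in ("his", "the") and tokens[1] in ("uncle", "house"):
            return ("NP", ("Det", tokens[0]), ("N", tokens[1])), tokens[2:]
        return None, tokens

    def parse_VP(tokens):
        if tokens and tokens[0] in ("visits", "sees"):
            np, rest = parse_NP(tokens[1:])
            if np:
                return ("VP", ("V", tokens[0]), np), rest
        return None, tokens

    print(parse_sentence("John visits his uncle".split()))
    # ('S', ('NP', 'John'), ('VP', ('V', 'visits'), ('NP', ('Det', 'his'), ('N', 'uncle'))))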

    2.5.4 Semantic Analysis

    "Semantics is the study of the relationships between linguistic

    forms and entities in the words; that is, how words literally connect to

    things." [Yul00]

    This technique and the later following it are basically depended by

    language understanding. Semantic analysis is the process of assigning

    meanings to the syntactic structures of the sentences regardless of its

    context. [Yul00] [Raj09]

    2.5.5 Discourse Integration

Discourse analysis is concerned with studying the effect of sentences on each other: it shows how a given sentence is affected by the one preceding it and how it affects the sentence following it. Discourse integration is relevant to understanding texts and paragraphs rather than simple sentences; thus, discourse knowledge is important in the interpretation of pronouns and temporal aspects of the conveyed information. [Ric91] [Raj09]

    2.5.6 Pragmatic Analysis

This step interprets the structure that represents what was said in order to determine what was actually meant. Context is a fundamental resource for processing here. [Ric91]

    2.6 Natural Language Processing Challenges

The challenges of natural language processing are too numerous to be summarized in a limited list; every processing step, from the starting point to the outputting of results, carries a set of problems that natural language processors vary in their ability to handle. However, the application where NLP is used is usually concerned with a specific task rather than with all processing steps in all their details; this is an advantage for the NLP community, since it helps to outline the challenges and problems according to the task under consideration.

For our research area, we are precisely concerned with the set of problems that directly affect the task of text correction; the next subsections describe some of them:

    2.6.1 Linguistic Units Challenges:

The task of text correction spans the levels from characters up to paragraphs and full texts; at every level there is a set of difficulties that the handling analyzer faces:

    2.6.1.1 Tokenization

In this process the lexical analyzer, usually called a "tokenizer", divides the text into smaller units; the output of this step is a series of morphemes, words, expressions and punctuation marks (called tokens). It involves locating token boundaries (where one token ends and another begins).

Issues that arise in tokenization and should be addressed are [Nad11]:

Dependence on the language type: a language includes, in addition to its symbols, a set of orthographic conventions which are used in writing to indicate the boundaries of linguistic units. English employs whitespace to separate words, but this is not sufficient to tokenize a text in a complete and unambiguous manner, because the same character may serve different uses (as is the case with punctuation), there are words with multiple parts (such as words divided by a hyphen at the end of a line, and some cases of prefix addition), and many expressions consist of more than one word.

Encoding problems: syllabic and alphabetic writing systems are usually encoded using a single byte, but languages with larger character sets require two or more bytes. A problem arises when the same set of encodings represents different character sets, whereas tokenizers are targeted at a specific encoding for a specific language.

Other problems, such as the dependency on the application requirements, which indicate what constituent is defined as a token; in computational linguistics the definition should precisely indicate what the next processing step requires. The tokenizer should also have the ability to recognize irregularities in texts, such as misspellings, erratic spacing and punctuation, etc.
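The following minimal Python sketch illustrates the first of these issues: a regular-expression tokenizer that keeps multi-part abbreviations and hyphenated words as single tokens. The pattern and its coverage are illustrative assumptions; as the sample output shows, it still mishandles contractions:

    import re

    TOKEN_PATTERN = re.compile(
        r"[A-Za-z]\.(?:[A-Za-z]\.)+"   # abbreviations such as "U.S."
        r"|\w+(?:-\w+)*"               # words, including hyphenated ones
        r"|[^\w\s]"                    # any single punctuation symbol
    )

    def tokenize(text):
        return TOKEN_PATTERN.findall(text)

    print(tokenize("The U.S. spell-checker works; doesn't it?"))
    # ['The', 'U.S.', 'spell-checker', 'works', ';', 'doesn', "'", 't', 'it', '?']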

    2.6.1.2 Segmentation

Segmenting text means dividing it into small meaningful pieces, typically referred to as "sentences"; a sentence consists of one or more tokens and carries a meaning which may not be completely clear. This task requires full knowledge of the scope of punctuation marks, since they are the major factor in denoting the starts and ends of sentences.

Segmentation becomes more complicated as the uses of punctuation multiply. Some punctuation marks can be part of a token rather than a stopping mark, as is the case with periods (.) used in abbreviations.

However, there is a set of factors that can help make the segmentation process more accurate [Nad11]:

Case distinction: English sentences normally start with a capital letter (but proper nouns are also capitalized).

POS tags: the tags surrounding a punctuation mark can assist this process, but multi-tag situations complicate it, such as the use of -ing verbs as nouns.

The length of the word (in the case of abbreviation disambiguation; notice that a period may mark the end of a sentence and an abbreviation at the same time).

Morphological information: this task requires finding the stem of a word by suffix removal.

It is usual not to separate the tokenization and segmentation processes; they are often merged together to solve most of the above problems, specifically the segmentation problems.
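A minimal Python sketch combining two of these factors, a known-abbreviation list and the case of the following token; the abbreviation list is an illustrative assumption and far from complete:

    ABBREVIATIONS = {"dr.", "prof.", "mr.", "e.g.", "i.e.", "etc."}

    def segment(text):
        tokens = text.split()
        sentences, current = [], []
        for i, token in enumerate(tokens):
            current.append(token)
            if token.endswith((".", "!", "?")):
                is_abbrev = token.lower() in ABBREVIATIONS
                next_capital = i + 1 < len(tokens) and tokens[i + 1][0].isupper()
                # split only when the period is not an abbreviation and the
                # next token starts a new sentence (or the text ends here)
                if not is_abbrev and (next_capital or i + 1 == len(tokens)):
                    sentences.append(" ".join(current))
                    current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(segment("Prof. Al-Bakry supervised the work. It was finished in 2015."))
    # ['Prof. Al-Bakry supervised the work.', 'It was finished in 2015.']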

A sentence is described as an indeterminate unit because of the difficulty of deciding where one sentence ends and another starts, while the grammar is indeterminate from the standpoint of deciding which sentence is grammatically correct, because this question permits divisive answers; the difficulty of discourse segmentation is not the only reason, but also grammatical acceptability, meaning, goodness or badness of style, lexical acceptability, context acceptability, etc. [Qui85]

    2.6.2 Ambiguity

An input is ambiguous if there is more than one alternative linguistic structure for it. [Jur00]

There are two major types of sentence ambiguity: genuine ambiguity and computer ambiguity. In the first, the sentence really has two different meanings to an intelligent hearer; in the second, the sentence has one meaning, but for the computer it has more than one, and this type, unlike the first, is a real problem facing NLP applications. [Not]

Ambiguity as an NLP problem is found in every processing step [Not] [Bha12]:

    2.6.2.1 Lexical Ambiguity

Lexical ambiguity is the possibility for a word to have more than one meaning or more than one POS tag. Obviously, meaning ambiguity leads to semantic ambiguity, and tag ambiguity leads to syntactic ambiguity, because it can produce more than one parse tree. Frequency is an available solution to this problem.

    2.6.2.2 Syntactic Ambiguity

The sentence has more than one syntactic structure; in particular, the common English ambiguity sources are:

Phrase attachment: how a certain phrase or clause in the sentence can be attached to another when there is more than one possibility. Crossing is not allowed in parse trees; therefore, a parser generates a parse tree for each accepted state.

Conjunction: sometimes the parser is puzzled over which phrase a conjunction should be connected to.

Noun group structure: the rule NG → NG NG allows English to generate long series of nouns strung together.

Some of these problems can be resolved by applying syntactic constraints.

    2.6.2.3 Semantic Ambiguity

Even when a sentence is lexically and syntactically unambiguous, there is sometimes more than one interpretation for it, because a phrase or a word may refer to more than one meaning.

"Selection restrictions" or "semantic constraints" are one way to disambiguate such sentences: two concepts are combined into one only if both concepts, or one of them, have specific features. Frequency in context can also help in deciding the meaning of a word.

    2.6.2.4 Anaphoric Ambiguity

This is the possibility for a word or a phrase to refer to something previously mentioned, when the reference admits more than one possibility.

This type can be resolved by parallel structures or by recency rules.

    2.6.3 Language Change

    "All living languages change with time, it is fortunate that they do so

    rather slowly compare to the human life". Language change is represented

    by the change of grammars of people who speak the language and it has

    been shown that English was changed in its lexicon, phonological,

  • Chapter Two Part I: Natural Language Processing _________________________________________________________________________

    33

    morphological, syntax, and semantic components of the grammar over the

    past 1,500 years. [Fro07]

    2.6.3.1 Phonological Change

Regular sound correspondences show how the phonological system changes. The phonological system is governed, like any other linguistic system, by a set of rules, and this set of phonemes and phonological rules is subject to change by modification, deletion and addition of new rules. A change in the phonological rules can affect the lexicon, in that some English word formations depend on sounds; for example, vowel sounds differentiate the nouns house and bath from the verbs house and bathe.

    2.6.3.2 Morphological Change

Morphological rules, like the phonological ones, are subject to addition, loss and change. Mostly, the usage of suffixes is the active area of change, where the way suffixes are added to the ends of stems affects the resulting words and therefore changes the lexicon.

    2.6.3.3 Syntactic Change

Syntactic changes are influenced by morphological changes, which are in turn influenced by phonological changes. This type of change includes all grammar modifications that are mainly based on the reordering of words inside the sentence.

    2.6.3.4 Lexical Change

Change of lexical categories is the most common in this type of change; an example is the usage of nouns as verbs, verbs as nouns, and adjectives as nouns. Lexical change also includes the addition of new words, the borrowing or loaning of words from other languages, and the loss of existing words.

Figure (2.5): An example of lexical change (comic strip: Darby Conley / Get Fuzzy, UFS, Inc., 24 Feb. 2012)

    2.6.3.5 Semantic Change

As the category of a word can change, its semantic representation or meaning can change, too. Three types of change are possible for a word:

Broadening: the meaning of a word is expanded so that it means everything it used to mean and more.

Narrowing: the reverse of broadening; the word's meaning is reduced from a more general meaning to a specific one.

Shifting: the word's reference is shifted to another meaning somewhat different from the original one.



    Part II

    Text Correction

    2.7 Introduction

Text correction is the process of indicating incorrect words in an input text, finding candidates (or alternatives), and suggesting the candidates as corrections for the incorrect word. The term incorrect refers to two different types of erroneous words: misspelled and misused. Mainly, the process is divided into two distinct phases: an error detection phase, which indicates the incorrect words, and an error correction phase, which combines both generating and suggesting candidates.

Devising techniques and algorithms for correcting texts in an automatic manner is a primal open research challenge that started in the early 1960s and continues until now, because the existing correction techniques are limited in their accuracy and application scope [Kuk92]. Usually, a correction application concerns a specific type of errors, because it is a complex task to computationally predict an intended word written by a human.

    2.8 Text Errors

A word can be mistaken in two ways. The first is by incorrectly spelling the word, due to a lack of knowledge of the word's spelling or to unintentionally mistyping symbol(s) within the word; this type of error is known as a non-word error, where the word cannot be found in the language lexicon.

The second is by using a correctly spelled word in a wrong position in the sentence or in an unsuitable context; these errors are known as real-word errors, where the incorrect word is accepted in the language lexicon. [Gol96][Amb08]

Non-word errors are easier to detect; real-word errors, in contrast, need more information about the language's syntactic and semantic nature. Accordingly, correction techniques are divided into isolated-word error detection, which is concerned with non-word errors, and context-sensitive error correction, which deals with real-word errors. [Gol96]

    2.8.1 Non-word errors

These errors include words that are not found in the lexicon; a misspelled word contains one or more of the following errors:

Substitution: one or more symbols are changed.

Deletion: one or more symbols are missing from the intended word.

Insertion: symbol(s) are added at the front, the end, or any index in the word.

Transposition: two adjacent symbols are swapped.

These four errors are known as the Damerau edit operations.
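To make the four operations concrete, the following minimal Python sketch (illustrative, not the proposed system) enumerates every string lying exactly one Damerau edit operation away from a word; such enumeration is one common way of generating correction candidates:

    import string

    def damerau_variants(word, alphabet=string.ascii_lowercase):
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
        inserts = [l + c + r for l, r in splits for c in alphabet]
        return set(deletes + transposes + substitutes + inserts)

    variants = damerau_variants("peace")
    print("pece" in variants,    # deletion of 'a'
          "paece" in variants,   # transposition of 'e' and 'a'
          "pearce" in variants)  # insertion of 'r'
    # True True True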

    2.8.2 Real-word errors

These errors occur by mistaking an intended word for another one that is lexically accepted. Real-word errors can result from phonetic confusion, like using the word "piece" instead of "peace", which usually leads to semantically unaccepted sentences even after non-word correction has been applied, or from misspelling the intended word in a way that produces another lexically accepted word. [Amb08]

Sometimes the confusion results in syntactically unaccepted sentences, like writing the sentence "John visit his uncle" instead of "John visits his uncle".

Correcting real-word errors is context sensitive in that it needs to check the surrounding words and sentences before suggesting candidates.

    2.9 Error Detection Techniques

Indicating whether a word is correct or not depends on the type of correction procedure. Non-word error detection usually checks the acceptance of a word against the language dictionary (the lexicon) and marks any mismatched word as incorrect, while real-word error detection is a more complex task that requires analysing larger parts of the text, typically paragraphs or the full text [Kuk92]. In this work, we mainly focus on non-word error detection techniques.

Dustin defined a spelling error as an error E in a given query word Q which is not an entry in the dictionary D at hand [Bos05], and he outlined an algorithm for spelling correction as shown in figure (2.6).

    Spell error detection techniques can be classified into two major types:

    2.9.1 Dictionary Looking Up

All the words of a given text are matched against every word in a pre-created dictionary or a list of all acceptable words in the language under consideration (or most of them, since some languages have a huge number of words and collecting them all is a semi-impossible task). A word is incorrect if and only if no match is found. This technique is robust but suffers from the long time required for checking: as the dictionary size becomes larger, the look-up time becomes longer. [Kuk92] [Mis13]

    2.9.1.1 Dictionaries Resources

There are many systems that deal with collecting and updating languages' lexical dictionaries. An example of these systems is the WordNet online application, a large database of English lexicons. Lexicons (nouns, verbs, adjectives, adverbs, etc.) are interlinked by lexical and conceptual-semantic relations. The structure of WordNet is a network of meaningfully related words and concepts, and this structure has made it a good tool for NLP and computational linguistics.

Another example is the ISPELL text corrector, an online spell checker that provides interfaces for many western languages. ISPELL is the latest version of R. Gorin's spell checker, which was developed for Unix. Suggestion of a spelling correction is based on a single Levenshtein edit distance, and depends on looking up every token of the input text in a huge lexical dictionary. [ISP14]

    2.9.1.2 Dictionaries Structures

The standard look-up technique is to match every token of the text against every token in the dictionary, but this process requires a long time, because NL dictionaries are usually of huge size and string matching needs more time than other data types do. A solution to this challenge is to reduce the search space in a way that keeps similarly spelled tokens grouped together.

Figure (2.6): Outline of a Spell Correction Algorithm [Bos05]

Algorithm: Spell_correction
Input: word w
Output: suggestion(s), a set of alternatives for w
Begin
  If (is_mistake(w))
  Begin
    candidates = get_candidates(w)
    suggestions = filter_candidates(candidates)
    Return suggestions
  End
  Else
    Return is_correct
End.
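The outline above translates directly into Python. In the following sketch, LEXICON, get_candidates and filter_candidates are illustrative placeholders standing in for the detection and correction techniques of sections 2.9 and 2.10:

    LEXICON = {"peace", "piece", "place", "john", "visits", "his", "uncle"}

    def is_mistake(word):
        return word.lower() not in LEXICON

    def get_candidates(word):
        # stand-in for edit-distance or similarity-key candidate generation
        return [w for w in LEXICON if abs(len(w) - len(word)) <= 1]

    def filter_candidates(word, candidates):
        # stand-in for ranking; here: by the number of shared letters
        return sorted(candidates, key=lambda w: -len(set(w) & set(word.lower())))[:3]

    def spell_correction(word):
        if is_mistake(word):
            return filter_candidates(word, get_candidates(word))
        return "is_correct"

    print(spell_correction("peice"))  # e.g. ['piece', 'peace', 'place']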


Grouping according to spelling or phonetics [Mis13] and using hash tables are two fundamental ways to minimize the search space.

Hashing techniques apply a hash function to generate a numeric key from strings. The numeric keys are references to packets (buckets) of tokens that generate the same key indices; hash functions differ in their ability to distribute tokens and in how much they minimize the search space. A perfect hash function generates no collisions (a collision is hashing two different tokens to the same key index), and a uniform hash function distributes tokens among packets uniformly. The optimal hash function is a uniform perfect hash function, which hashes exactly one token to every packet; such a situation is impossible with dictionaries due to the variance of tokens. [Nie09]

Spelling- and phonetics-dependent groups use a limited set of packets and generate keys according to spelling or pronunciation; they are another style of hashing, and sometimes of clustering. SPEEDCOP and Soundex are examples. [Mis13] [Kuk92]
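As an illustration of pronunciation-based grouping, the following is a minimal Python sketch of the classic Soundex key; the letter-to-digit map is the standard one, but the implementation is a simplified reading of the algorithm rather than a definitive one:

    SOUNDEX_MAP = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                   **dict.fromkeys("dt", "3"), "l": "4",
                   **dict.fromkeys("mn", "5"), "r": "6"}

    def soundex(word):
        word = word.lower()
        key, last = word[0].upper(), SOUNDEX_MAP.get(word[0], "")
        for ch in word[1:]:
            code = SOUNDEX_MAP.get(ch, "")
            if code and code != last:
                key += code
            if ch not in "hw":        # 'h' and 'w' do not reset the previous code
                last = code
        return (key + "000")[:4]      # pad with zeros and cut to four symbols

    print(soundex("peace"), soundex("piece"))    # P200 P200 (same packet)
    print(soundex("Robert"), soundex("Rupert"))  # R163 R163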

    2.9.2 N-gram Analysis

N-grams are defined as subsequences of length n of words or strings, where n is variable, often taking the value one to produce unigrams (or monograms), two to produce bigrams (sometimes called "digrams"), or three to produce trigrams; it rarely takes larger values. This technique detects errors by examining each n-gram in the given string and looking it up in a precompiled n-gram statistics table. The decision depends on the existence of such an n-gram or on the frequency of its occurrence: if the n-gram is not found, or is highly infrequent, then the words or strings which contain it are considered incorrect. [Kuk92] [Mis13]
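A minimal Python sketch of this idea, assuming character bigrams with boundary markers and a tiny illustrative corpus in place of a real statistics table:

    from collections import Counter

    def bigrams(word):
        word = f"_{word.lower()}_"       # '_' marks the word boundaries
        return [word[i:i + 2] for i in range(len(word) - 1)]

    def build_table(corpus_words):
        table = Counter()
        for w in corpus_words:
            table.update(bigrams(w))
        return table

    def looks_misspelled(word, table, min_count=1):
        # a word is suspicious if any of its bigrams is unseen or too rare
        return any(table[b] < min_count for b in bigrams(word))

    table = build_table(["peace", "piece", "place", "uncle", "visits"])
    print(looks_misspelled("peace", table))  # False
    print(looks_misspelled("pqace", table))  # True: the bigram 'pq' never occurs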


    2.10 Error Correction Techniques

Many techniques have been proposed to solve the problem of generating candidates for a detected misspelled word; they vary in the required resources, application scope, time and space complexity, and accuracy. The most common are [Kuk92] [Mis13]:

    2.10.1 Minimum Edit Distance Techniques

This technique is based on counting the minimum number of primal operations required to convert the source string into the target one. Some researchers take the primal operations to be insertion, deletion, and substitution of one letter by another; others add the transposition of two adjacent letters as a fourth primal operation. Examples are the Levenshtein algorithm, which counts a distance of one for every primal operation; the Hamming algorithm, which works like Levenshtein but is limited to strings of equal length; and the Longest Common Substring, which finds the mutual substring between two words.

Levenshtein, shown in figure (2.7) [Hal11], is preferred because it places no limitation on the types of symbols or on their lengths. It can be executed in time complexity O(M·N), where M and N are the lengths of the two input strings.

The algorithm can detect three types of errors (substitution, deletion, and insertion). It does not count the transposition of two adjacent symbols as one edit operation; instead, it counts such errors as two consecutive substitution operations, giving an edit distance of 2.


One of the well-known modifications of the original Levenshtein method is due to Fred Damerau, whose research found that the four error types together account for about 80% to 90% of all errors; the resulting measure is known as the Damerau-Levenshtein distance. [Dam64]

The modified method requires a longer execution time than the original: in every checking round, the method applies an additional comparison to check whether a transposition took place in the string, then applies another comparison to select the minimum between the previous distance and the distance with the occurrence of a transposition. This step doubles the time complexity, resulting in O(2·M·N).

Figure (2.7): Levenshtein Edit Distance Algorithm [Hal11]

Algorithm: Levenshtein Edit Distance
Input: String1, String2
Output: number of edit operations
Step 1: Declaration
  if String1 is empty, return length of String2
  if String2 is empty, return length of String1
  create matrix distance(0..length of String1, 0..length of String2)
  distance(r, 0) = r for every row r; distance(0, c) = c for every column c
Step 2: Calculate distance
  for each symbol x in String1 do
    for each symbol y in String2 do
    begin
      if x = y then cost = 0 else cost = 1
      r = index of x, c = index of y
      min1 = distance(r - 1, c) + 1          // deletion
      min2 = distance(r, c - 1) + 1          // insertion
      min3 = distance(r - 1, c - 1) + cost   // substitution
      distance(r, c) = minimum(min1, min2, min3)
    end
Step 3: Return the value of the last cell in the distance matrix
  return distance(length of String1, length of String2)
End.
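For reference, the dynamic programming of figure (2.7) can be rendered in a few lines of Python; this sketch uses a full (M+1)×(N+1) matrix indexed from zero:

    def levenshtein(s1, s2):
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for r in range(m + 1):
            d[r][0] = r                       # r deletions empty a prefix of s1
        for c in range(n + 1):
            d[0][c] = c                       # c insertions build a prefix of s2
        for r in range(1, m + 1):
            for c in range(1, n + 1):
                cost = 0 if s1[r - 1] == s2[c - 1] else 1
                d[r][c] = min(d[r - 1][c] + 1,         # deletion
                              d[r][c - 1] + 1,         # insertion
                              d[r - 1][c - 1] + cost)  # substitution
        return d[m][n]

    print(levenshtein("peace", "piece"))  # 2 (two substitutions)
    print(levenshtein("form", "from"))    # 2: a transposition costs 2 here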


Hence, in this work, the original Levenshtein method (figure (2.7)) is modified to consider Damerau's four error types within a time complexity shorter than that consumed by the Damerau-Levenshtein algorithm and close to the original method. Figure (2.8) shows Damerau's modification of the Levenshtein method.

Algorithm: Damerau-Levenshtein Distance
Input: String1, String2
Output: number of Damerau edit operations
Step 1: Declaration
  if String1 is empty, return length of String2
  if String2 is empty, return length of String1
  create matrix distance(0..length of String1, 0..length of String2)
  distance(r, 0) = r for every row r; distance(0, c) = c for every column c
Step 2: Calculate distance
  for each symbol x in String1 do
    for each symbol y in String2 do
    begin
      if x = y then cost = 0 else cost = 1
      r = index of x, c = index of y
      min1 = distance(r - 1, c) + 1          // deletion
      min2 = distance(r, c - 1) + 1          // insertion
      min3 = distance(r - 1, c - 1) + cost   // substitution
      distance(r, c) = minimum(min1, min2, min3)
      if not (String1 starts with x) and not (String2 starts with y) then
        if (the symbol preceding x = y) and (the symbol preceding y = x) then
          distance(r, c) = minimum(distance(r, c), distance(r - 2, c - 2) + cost)  // transposition
    end
Step 3: Return the value of the last cell in the distance matrix
  return distance(length of String1, length of String2)
End.

Figure (2.8): Damerau-Levenshtein Edit Distance Algorithm [Dam64]
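The corresponding Python sketch of figure (2.8) adds only the transposition check inside the Levenshtein loop:

    def damerau_levenshtein(s1, s2):
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for r in range(m + 1):
            d[r][0] = r
        for c in range(n + 1):
            d[0][c] = c
        for r in range(1, m + 1):
            for c in range(1, n + 1):
                cost = 0 if s1[r - 1] == s2[c - 1] else 1
                d[r][c] = min(d[r - 1][c] + 1,         # deletion
                              d[r][c - 1] + 1,         # insertion
                              d[r - 1][c - 1] + cost)  # substitution
                if (r > 1 and c > 1 and s1[r - 1] == s2[c - 2]
                        and s1[r - 2] == s2[c - 1]):   # adjacent swap found
                    d[r][c] = min(d[r][c], d[r - 2][c - 2] + cost)
        return d[m][n]

    print(damerau_levenshtein("form", "from"))  # 1: the swap counts as one edit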


    2.10.2 Similarity Key Techniques

As its name indicates, this technique finds a key that groups similarly spelled words together. The similarity key is computed for the misspelled word and mapped to a pointer that refers to the group of words whose spelling is similar to the input one. The Soundex algorithm finds keys depending on the pronunciation of words, while the SPEEDCOP system rearranges the letters of a word by placing the first letter, followed by the consonants, and finally the vowels, according to their occurrence sequence in the word and without duplication. [Kuk92] [Mis13]
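A minimal Python sketch of a SPEEDCOP-style key built as described above (first letter, then consonants, then vowels, in occurrence order and without duplication); treating 'y' as a consonant is an assumption of this sketch:

    def speedcop_key(word):
        word = word.lower()
        seen, consonants, vowels = {word[0]}, [], []
        for ch in word[1:]:
            if ch in seen:
                continue                     # keep each letter only once
            seen.add(ch)
            (vowels if ch in "aeiou" else consonants).append(ch)
        return word[0] + "".join(consonants) + "".join(vowels)

    # Similarly spelled words receive the same key and land in the same group:
    print(speedcop_key("necessary"), speedcop_key("neccessary"))  # ncsryea ncsryea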

    2.10.3 Rule Based Techniques

This approach applies a set of rules, derived from common mistake patterns, to the misspelled word in order to transform it into a valid one. After all the applicable rules have been applied, the set of generated words that are valid in the dictionary is suggested as candidates.

    2.10.4 Probabilistic Techniques

Two methods are mainly based on statistics and probability:

1) Transition method: depends on the probability of a given letter being followed by another one. The probability is estimated from n-gram statistics collected over a large corpus.

2) Confusion method: depends on the probability of a given letter being confused with, or mistaken for, another one. Probabilities in this method are source dependent; for example, Optical Character Recognition (OCR) systems vary in their accuracy and in the basics of how they recognize letters, and Speech Recognition (SR) systems usually confuse sounds.
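A minimal Python sketch of the transition method, assuming character bigrams with a boundary marker, add-alpha smoothing, and a tiny illustrative corpus in place of real n-gram statistics:

    from collections import Counter

    def train_transitions(words):
        pairs, singles = Counter(), Counter()
        for w in words:
            w = f"_{w.lower()}_"
            pairs.update(w[i:i + 2] for i in range(len(w) - 1))
            singles.update(w[:-1])
        return pairs, singles

    def word_probability(word, pairs, singles, alpha=1.0):
        # product of smoothed P(next letter | current letter) over the word;
        # 27 below stands for the twenty-six letters plus the boundary marker
        p = 1.0
        w = f"_{word.lower()}_"
        for i in range(len(w) - 1):
            p *= (pairs[w[i:i + 2]] + alpha) / (singles[w[i]] + alpha * 27)
        return p

    pairs, singles = train_transitions(["peace", "piece", "place", "palace"])
    print(word_probability("peace", pairs, singles) >
          word_probability("pqace", pairs, singles))  # True: 'pq' is improbable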


    2.11 Suggestion of Corrections

    Suggesting corrections may be merged within the candidates'

    generation; it is fully dependent on the ou