
Intelligent Text Document Correction System Based on Similarity Technique


  • Intelligent Text Document

    Correction System Based on

    Similarity Technique

    A Thesis

    Submitted to the Council of the College of Information Technology,

    University of Babylon in Partial Fulfillment of the Requirements

    for the Degree of Master of Sciences in Computer Sciences.

    By

    Marwa Kadhim Obeid Al-Rikaby

    Supervised by

    Prof. Dr. Abbas Mohsen Al-Bakry

2015 A.D. 1436 A.H.

    Ministry of Higher Education and

    Scientific Research

University of Babylon - College of Information

    Technology

    Software Department



    Supervisor Certification

    I certify that this thesis was prepared under my supervision at the Department of

    Software / Information Technology / University of Babylon, by Marwa

Kadhim Obeid Al-Rikaby as partial fulfillment of the requirements for the degree of Master of Sciences in Computer Science.

    Signature:

Supervisor: Prof. Dr. Abbas Mohsen Al-Bakry

Title: Professor.

Date:   /   / 2015

    The Head of the Department Certification

    In view of the available recommendation, we forward this thesis for debate by

    the examining committee.

    Signature:

Name: Dr. Eman Salih Al-Shamery

    Title: Assist. Professor.

    Date: / / 2015


    To

    Master of creatures,

    Loved by Allah,

    The Prophet Muhammad

    (Allah bless him and his family)


    Acknowledgements

    All praise be to Allah Almighty who enabled me to complete this task

    successfully and utmost respect to His last Prophet Mohammad PBUH.

First, my appreciation is due to my advisor, Prof. Dr. Abbas Mohsen Al-Bakry, for his advice and guidance that led to the completion of this thesis.

    I would like to thank the staff of the Software Department for the help they

    have offered, especially, the head of the Software Department Dr. Eman Salih

    Al-Shamery.

    Most importantly, I would like to thank my parents, my sisters, my brothers

    and my friends for their support.


    Abstract

Automatic text correction is one of the human-computer interaction challenges. It is directly involved in several application areas, such as correcting digitized handwritten text, and indirectly in others, such as correcting users' queries before applying a retrieval process in interactive databases.

The automatic text correction process passes through two major phases: error detection and candidate suggestion. Techniques for both phases are categorized into procedural and statistical. Procedural techniques are based on using rules to govern a text's acceptability, including Natural Language Processing techniques. Statistical techniques, on the other hand, depend on statistics and probabilities collected from large corpora based on what is commonly used by humans.

In this work, natural language processing techniques are used as the basis for analysis and for both spelling and grammar acceptance checking of English texts. A prefix dependent hash-indexing scheme is used to shorten the time of looking up the underlying dictionary, which contains all English tokens. The dictionary is used as a base for the error detection process.

Candidates generation is based on calculating the similarity of the source token, measured using an improved Levenshtein method, to the dictionary tokens and ranking them accordingly; however, this process is time intensive; therefore, tokens are divided into smaller groups according to spelling similarity in a way that preserves random-access availability. Finally, candidates suggestion involves examining a set of features related to commonly committed mistakes. The system selects the optimal candidate, which provides the highest suitability without violating grammar rules, to generate linguistically accepted text.

Testing the system's accuracy showed better results than Microsoft Word and some other systems. The enhanced similarity measure reduced the time complexity to stay within the bounds of the original Levenshtein method while discovering an additional error type.


    Table of Contents

Subject    Page No.

    Chapter One : Overview

    1.1 Introduction 1

    1.2 Problem Statement 3

    1.3 Literature Review 5

    1.4 Research Objectives 10

    1.5 Thesis Outlines 11

    Chapter Two: Background and Related Concepts

    Part I: Natural Language Processing 12

    2.1 Introduction 12

    2.2 Natural Language Processing Definition 12

    2.3 Natural Language Processing Applications 13

    2.3.1 Text Techniques 14

    2.3.2 Speech Techniques 15

    2.4 Natural Language Processing and Linguistics 16

    2.4.1 Linguistics 16

    2.4.1.1 Terms of Linguistic Analysis 17

    2.4.1.2 Linguistic Units Hierarchy 19

    2.4.1.3 Sentence Structure and Constituency 19

    2.4.1.4 Language and Grammar 20

    2.5 Natural Language Processing Techniques 22

    2.5.1 Morphological Analysis 22

    2.5.2 Part of Speech Tagging 23

    2.5.3 Syntactic Analysis 26

    2.5.4 Semantic Analysis 27

    2.5.5 Discourse Integration 27

    2.5.6 Pragmatic Analysis 28

    2.6 Natural Language Processing Challenges 28

    2.6.1 Linguistics Units Challenges 28

    2.6.1.1 Tokenization 28

    2.6.1.2 Segmentation 29

    2.6.2 Ambiguity 31

    2.6.2.1 Lexical Ambiguity 31


    2.6.2.2 Syntactic Ambiguity 31

    2.6.2.3 Semantic Ambiguity 32

    2.6.2.4 Anaphoric Ambiguity 32

    2.6.3 Language Change 32

    2.6.3.1 Phonological Change 33

    2.6.3.2 Morphological Change 33

    2.6.3.3 Syntactic Change 33

    2.6.3.4 Lexical Change 33

    2.6.3.5 Semantic Change 34

    Part II: Text Correction 35

    2.7 Introduction 35

    2.8 Text Errors 35

    2.8.1 Non-words Errors 36

    2.8.2 Real-word Errors 36

    2.9 Error Detection Techniques 37

    2.9.1 Dictionary Looking Up 37

    2.9.1.1 Dictionaries Resources 37

    2.9.1.2 Dictionaries Structures 38

    2.9.2 N-gram Analysis 39

    2.10 Error Correction Techniques 40

    2.10.1 Minimum Edit Distance Techniques 40

    2.10.2 Similarity Key Techniques 43

    2.10.3 Rule Based Techniques 43

    2.10.4 Probabilistic Techniques 43

    2.11 Suggestion of Corrections 44

    2.12 The Suggested Approach 44

    2.12.1 Finding Candidates Using Minimum Edit Distance 45

    2.12.2 Candidates Mining 45

    2.12.3 Part-of-Speech Tagging and Parsing 46

    Chapter Three : Hashed Dictionary and Looking Up Technique

    3.1 Introduction 48

    3.2 Hashing 48

    3.2.1 Hash Function 49

    3.2.2 Formulation 52

    3.2.3 Indexing 53

    3.3 Looking Up Procedure 56


    3.4 Dictionary Structure Properties 58

    3.5 Similarity Based Looking-Up 59

    3.5.1 Bi-grams Generation 60

    3.5.2 Primary Centroids Selection 62

    3.5.3 Centroids Referencing 63

    3.6 Application of Similarity Based Looking up approach 64

    3.7 The Similarity Based Looking up Properties 67

    Chapter Four : Error Detection and Candidates Generation

    4.1 Introduction 69

    4.2 Non-word Error Detection 69

    4.3 Real-Words Error Detection 71

    4.4 Candidates Generation 72

    4.4.1 Candidates Generation for Non-word Errors 72

    4.4.1.2 Enhanced Levenshtein Method 74

    4.4.1.3 Similarity Measure 78

    4.4.1.4 Looking for Candidates 79

    4.4.2 Candidates Generation for Real-words Errors 81

    Chapter Five : Text Correction and Candidates Suggestion

    5.1 Introduction 82

    5.2 Correction and Candidates Suggestion Structure 82

    5.3 Named-Entity Recognition 85

    5.4 Candidates Ranking 86

    5.4.1 Edit Distance Based Similarity 87

    5.4.2 First and End Symbols Matching 87

    5.4.3 Difference in Lengths 88

    5.4.4 Transposition Probability 89

    5.4.5 Confusion Probability 90

    5.4.6 Consecutive Letters (Duplication) 91

    5.4.7 Different Symbols Existence 92

    5.5 Syntax Analysis 93

    5.5.1 Sentence Phrasing 93

    5.5.2 Candidates Optimization 95

    5.5.3 Grammar Correction 95

    5.5.4 Document Correction 97

    Chapter Six: Experimental Results, Conclusions, and Future Works


    6.1 Experimental Results 98

    6.1.1 Tagging and Error Detection Time Reduction 98

    6.1.1.1 Successful Looking Up 99

    6.1.1.2 Failure Looking Up 100

6.1.2 Candidates Generation and Similarity Search Space Reduction 101

    6.1.3 Time Reduction of the Damerau-Levenshtein method 103

    6.1.4 Features Effect on Candidates Suggestion 104

    6.2 Conclusions 107

    6.3 Future Works 108

    References 110

    Appendix A 117

    Appendix B 122

    List of Figures

Figure No.    Title    Page No.

    (2.1) NLP dimensions 16

    (2.2) Linguistics analysis steps 17

    (2.3) Linguistic Units Hierarchy 19

    (2.4) Classification of POS tagging models 24

    (2.5) An example of lexical change 34

    (2.6) Outlines of Spell Correction Algorithm 38

    (2.7) Levenshtein Edit Distance Algorithm 41

    (2.8) Damerau-Levenshtein Edit Distance Algorithm 42

    (2.9) The Suggested System Block Diagram 47

    (3.1) Token Hashing Algorithm 54


    (3.2) Dictionary Structure and Indexing Scheme 55

    (3.3) Algorithm of Looking Up Procedure 57

    (3.4) Semi Hash Clustering block diagram 61

    (3.5) Similarity Based Hashing algorithm 64

    (3.6) Block diagram of candidates generation using SBL 66

    (3.7) Similarity Based Looking up algorithm 68

    (4.1) Tagging Flow Chart 70

    (4.2) The Enhanced Levenshtein Method Algorithm 76

    (4.3) Original Levenshtein Example 77

    (4.4) Damerau-Levenshtein Example 77

    (4.5) Enhanced Levenshtein Example 78

    (5.1) Candidates ranking flowchart 84

    (5.2) Syntax analysis flowchart 94

    (6.1) Tokens distribution in primary packets 99

    (6.2) Tokens distribution in secondary packets 99

(6.3) Time complexity variance of Levenshtein, Damerau-Levenshtein, and Enhanced Levenshtein (our modification) 103

(6.4) Suggestion Accuracy with a comparison to Microsoft Office Word on a sample from Wikipedia 104

(6.5) Testing the suggested system accuracy and comparing the results with other systems using the same dataset 105

(6.6) Discarding one feature at a time for optimal candidate selection 106

    (6.7) Using one feature at a time for optimal candidate selection 107


    List of Tables

Table No.    Title    Page No.

    (1-1) Summary of Literature Review 9

    (3-1) Alphabet Encoding 50

    (3-2) Addressing Range 52

    (3-3) Predicting errors using Bi-grams analysis 61

    (5-1) Transposition Matrix 90

    (5-2) Confusion Matrix 91

    List of Symbols and Abbreviations

    Meaning Abbreviation

Alphabet Σ

    Adjectival Phrase A

    Absolute Difference abs

    Sentence Complement C

    Context Free Grammar CFG

    Dictionary D

Deoxyribonucleic Acid DNA

    Error E

    Grammar G

    Grammar Error Correction GEC

    Hidden Markov Model HMM

    Information Retrieval IR

    Machine Translation MT

    Named Entity NE

    Named-Entity Recognition NER

    Noun Group NG

    Natural Language Generation NLG

    Natural Language Processing NLP

    Natural Languages NLs

    Natural Language Understanding NLU


    Noun Phrase NP

big-Oh notation (= at most) O( )

    Optical Character Recognition OCR

    Production Rule P

    Part Of Speech POS

    Prepositional Phrase PP

    Query Q

    Ranking Value R

    Relative Distance R_Dist

    Start Symbol S

Statistical Machine Translation SMT

    Speech Recognition SR

    String1, String2 St1,St2

    Variable V

    Adverbial Phrase v

    Verb Phrase VP

big-Omega notation (= at least) Ω( )


    Chapter One

    Overview

    1.1 Introduction

Natural Language Processing, also known as Computational Linguistics, is the field of computer science that deals with linguistics; it is a form of human-computer interaction where formalization is applied to the elements of human language so they can be processed by a computer [Ach14]. Natural Language Processing (NLP) is the implementation of systems that are capable of manipulating and processing the sentences of natural languages (NLs) [Jac02] like English, Arabic, and Chinese, not formal languages like Python, Java, and C++, nor descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12]. The NLP task is the designing and building of software for analyzing, understanding, and generating spoken and/or written NLs. [Man08] [Mis13]

    NLP has many applications such as automatic summarization, Machine

    Translation (MT), Part-Of-Speech (POS) Tagging, Speech Recognition

    (SR), Optical Character Recognition (OCR), Information Retrieval (IR),

    Opinion Mining [Nad11], and others [Wol11].

Text Correction is another significant application of NLP. It includes both Spell Checking and Grammar Error Correction (GEC). Spell checking research extends back to the middle of the 20th century with Les Earnest at Stanford University, but the first application was created in 1971 by Ralph Gorin, Earnest's student, for the DEC PDP-10 mainframe with a dictionary of 10,000 English words. [Set14] [Pet80]

Grammar error correction, in spite of its central role in semantic and meaning representations, has been largely ignored by the NLP community. In recent years, an improvement has been noticed in automatic GEC techniques. [Voo05] [Jul13] However, most of these techniques are limited to specific domains such as real-word spell correction [Hwe14], subject-verb disagreement [Han06], verb tense misuse [Gam10], determiners or articles, and improper preposition usage. [Tet10] [Dah11]

Different techniques like edit distance [Wan74], rule-based techniques [Yan83], similarity key techniques [Pol83] [Pol84], n-grams [Zha98], probabilistic techniques [Chu91], neural nets [Hod03], and the noisy channel model [Tou02] have been proposed for text correction purposes. Each technique needs some sort of resource. Edit distance, rule-based, and similarity key techniques require a dictionary (or lexicon); n-gram and probabilistic techniques work with statistical and frequency information; neural nets are trained with training patterns; and so on.

Text correction, spelling and grammar, is an extensive process that typically includes three major steps: [Ach14] [Jul13]

The first step is to detect the incorrect words. The most popular way to decide whether a word is misspelled is to look for it in a dictionary, a list of correctly spelled words. This way can detect non-word errors but not real-word errors [Kuk92] [Mis13], because an unintended word may still match a word in the dictionary. NLs have a large number of words, resulting in a huge dictionary; therefore, the task of looking up every word consumes a long time. In GEC, this step is more complicated: it requires applying more analysis at the level of sentences and phrases, using computational linguistics basics, to detect the word that makes the sentence incorrect.
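To make the idea concrete, the following minimal Python sketch illustrates dictionary-based non-word detection (an illustration only, not this system's implementation; the dictionary file name is hypothetical):

def load_dictionary(path="dictionary.txt"):
    # One correctly spelled token per line; a set gives O(1) average lookup.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f}

def non_word_errors(text, dictionary):
    # Flag tokens absent from the dictionary. Real-word errors pass this
    # test unnoticed, exactly as described above.
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return [t for t in tokens if t and t not in dictionary]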

Next, a list of candidates or alternatives should be generated for the incorrect word (misspelled or misused). This list is preferred to be short and to contain the words with the highest similarity or suitability; to produce it, a technique is needed to calculate the similarity of the incorrect word to every word in the dictionary. Efficiency and accuracy are major factors in the selection of such a technique. GEC requires broad knowledge of diverse grammatical error categories and extensive linguistic technique to identify alternatives, because a grammatical error may not result from a single word.

Finally, the intended word, or a list of alternatives containing it, is suggested. This task requires ranking the words according to their similarity to the incorrect word; other considerations may or may not be taken into account depending on the technique in use.
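As a quick illustration of the second and third steps together, Python's standard difflib module can produce a ranked shortlist of similar dictionary words (shown only to fix the idea; the system itself uses the enhanced Levenshtein measure and the features described in later chapters):

import difflib

def suggest(misspelled, dictionary, n=5):
    # Returns up to n dictionary words ranked by a similarity ratio in [0, 1].
    return difflib.get_close_matches(misspelled, dictionary, n=n, cutoff=0.6)

print(suggest("studnet", ["student", "study", "sturdy", "stupid"]))
# e.g. ['student', ...]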

Text mining techniques have started to enter the area of text correction; clustering [Zam14], Named-Entity Recognition (NER) [Bal00] [Rit11], and Information Retrieval [Kir11] are examples. Statistics and probabilities have also played a great role, specifically in analyzing common mistakes and n-gram datasets [Ahm09] [Gol96] [Amb08]. Clustering, at both the syllable and phonetic levels, can be used to reduce the looking-up space; NER may help in avoiding interpreting proper nouns as misspellings; statistics merged with NLP techniques provide more precise parsing and POS tagging, usually in context dependent applications. The application of a given technique differs according to what level of correction is intended; it starts from the character level [Far14], passes through word, phrase (usually in GEC), and sentence, and ends at the context or document subject level.

    1.2 Problem Statement

Although many text checking and correction systems have been produced, each has its own variations in terms of input quality restrictions, techniques used, output accuracy, speed, performance conditions, etc. [Ahm09] [Pet80]. This field of NLP is truly open research from all sides because there is no complete algorithm or technique that handles all considerations.


The limited linguistic knowledge, the huge number of lexicons, the extensive grammar, language ambiguity and change over time, the variety of committed errors, and computational requirements are challenges facing the process of developing a text correction application.

In this work, some of the above mentioned problems are solved using a set of solutions:

- Integrating two lexicon datasets (WordNet and Ispell).
- Using a brute-force approach to solve some sorts of ambiguity.
- Applying hashing and indexing in looking up the dictionary.
- Reducing the search space in the candidates collecting process by grouping similarly spelled words into semi clusters.

The Levenshtein method [Hal11] is also enhanced to consider Damerau's four types of errors within a time period shorter than the Damerau-Levenshtein method [Hal11]. Named-Entity Recognition, letter confusion and transposition, and the candidate length effect are used as features to optimize the candidates' suggestion, in addition to applying rules of Part-Of-Speech tags and sentence constituency for checking sentence grammar correctness, whether or not the sentence is lexically correct.
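For reference, the following is a compact Python version of the standard (restricted) Damerau-Levenshtein distance covering the four error types: insertion, deletion, substitution, and transposition of adjacent letters. It is shown only as a baseline sketch; the enhanced method itself is detailed in Chapter Four:

def damerau_levenshtein(s1, s2):
    # Dynamic programming over a (len(s1)+1) x (len(s2)+1) cost table.
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(s2) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution (or match)
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(s1)][len(s2)]

print(damerau_levenshtein("form", "from"))  # 1: one adjacent transposition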

The three proposed components of this system are: (1) a spell error detector based on a fast looking-up technique in a dictionary of more than 300,000 tokens, constructed by applying a string prefix dependent hash function and indexing method, while the grammar error detector is a brute-force parser; (2) for candidates generation, an enhancement was implemented on the Levenshtein method to consider Damerau's four error types and then used to measure similarity according to the minimum edit distance and the effect of difference in lengths, with the dictionary tokens grouped into spell based clusters to reduce the search space; (3) the candidates suggestion exploits NER features, transposition error and confusion statistics, affixes analysis (including first and last letter matching), length of candidates, and parsing success.
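The prefix-dependent idea can be pictured with the following rough Python sketch (an assumed two-level bucketing for illustration only; the actual hash function and index layout are defined in Chapter Three):

from collections import defaultdict

def prefix_key(token, width=2):
    # Tokens sharing the same first letters land in the same packet,
    # so a lookup scans only a small fraction of the dictionary.
    return token[:width].lower()

def build_index(tokens):
    index = defaultdict(set)
    for t in tokens:
        index[prefix_key(t)].add(t.lower())
    return index

index = build_index(["apple", "apply", "banana", "band"])
print("apply" in index[prefix_key("apply")])  # True, scanning only the 'ap' packet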

    1.3 Literature Review

Asha A. and Bhuma V. R., 2014, introduced a probabilistic approach to string transformation that includes a model consisting of rules and weights for training, and an algorithm that depends on scoring and ranking according to a conditional probability distribution for generating the top k-candidates at the character level, where both high and low frequency words can be generated. Spell checking is one of many applications to which the approach was applied; the misspelled strings (words or characters) are transformed, by applying a number of operators, into the k most similar strings in a dictionary (start and end letters are constants). [Ach14]

Mariano F., Zheng Y., and others, 2014, tackled the correction of grammatical errors by pipelining processes, combining results from multiple systems. The components of the approach are: a rule based error corrector that uses rules automatically derived from the Cambridge Learner Corpus, based on N-grams that have been annotated as incorrect; an SMT system that translates incorrectly written English into correct English; NLTK1, used to perform segmentation, tokenization, and POS tagging; a candidates generation step that produces all the possible combinations of corrections for the sentence, in addition to the sentence itself to consider the "no correction" option; finally, the candidates are ranked using a language model. [Fel14]

    __________________________________________________________

    1 The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research

    and teaching in computational linguistics and natural language processing. NLTK is written in Python and

    distributed under the GPL open source license. Over the past year the toolkit has been rewritten,

    simplifying many linguistic data structures and taking advantage of recent enhancements in the Python

    language.


Anubhav G., 2014, presented a rule-based approach that used two POS taggers, the Stanford parser and TreeTagger, to correct non-native English speakers' grammatical errors. The detection of errors depends on the outputs of the two taggers; if they differ, then the sentence is not correct. Errors are corrected using the Nodebox English Linguistics library. Error correction includes subject-verb disagreement, verb form, and errors detected by POS tag mismatch. [Gup14]

Stephan R., 2013, proposed a model for spelling correction based on treating words as "documents" and spell correction as a form of document retrieval, in that the model retrieves the best matching correct spelling for a given input. The words are transformed into tiny documents of bits, and Hamming distance is used to predict the closest string of bits from a dictionary holding the correctly spelled words as strings of bits. The model is knowledge free and only contains a list of correct words. [Raa13]

Youssef B., 2012, produced a parallel spell checking algorithm for spelling error detection and correction. The algorithm is based on information from the Yahoo! N-gram dataset 2.0; it is a shared memory model allowing concurrency among threads for both parallel multi-processor and multi-core machines. The three major components (error detector, candidates generator, and error corrector) are designed to run in a parallel fashion. The error detector, based on unigrams, detects non-word errors; the candidates generator is based on bi-grams; the error corrector, which is context sensitive, is based on 5-gram information. [Bas12]

Hongsuck S., Jonghoon L., Seokhwan K., Kyusong L., Sechun K., and Gary G. L., 2012, presented a novel method for grammatical error correction by building a meta-classifier. The meta-classifier decides the final output depending on the internal results from several base classifiers; they used multiple grammatical-error-tagged corpora with different properties in various aspects. The method focused on articles, and the correction arises only when a mismatch occurs with the observed articles. [Seo12]

Kirthi J., Neeju N. J., and P. Nithiya, 2011, proposed a semantic information retrieval system performing automatic spell correction for user queries before applying the retrieval process. The correcting procedure depends on matching the misspelled word against a dictionary of correctly spelled words using the Levenshtein algorithm. If an incorrect word is encountered, the system retrieves the most similar word depending on the Levenshtein measure and the occurrence frequency of the misspelled word. [Kir11]

Farag, Ernesto, and Andreas, 2008, developed a language-independent spell checker. It is based on an enhancement of the N-gram model: creating a ranked list of correction candidates derived from N-gram statistics and lexical resources, then selecting the most promising candidates as correction suggestions. Their algorithm assigns weights to the possible suggestions to detect non-word errors. They depended on a "MultiWordNet" dictionary of about 80,000 entries. [Ahm09]

Mays, Damerau, and Mercer, 2008, designed a noisy-channel model of real-word spelling error correction. They assumed that the observed sentence is a signal passed through a noisy channel, where the channel reflects the typist and the distortion reflects errors committed by the typist. The probability of the sentence's correctness, given by the channel (typist), is a parameter associated with that sentence. The probability of every word in the sentence being the intended one is equivalent to the sentence correctness probability, and each word is associated with a set of spelling-variant words excluding the word itself. Correction can be applied to one word in the sentence by replacing the incorrect one with another from its candidates set (its real-word spelling variations) so that it gives the maximum probability. [Amb08]

Stoyan, Svetla, and others, 2005, described an approach for lexical post-correction of the output of an optical character recognizer (OCR), developed within a research project. They worked on multiple sides: on the dictionary side, they enriched their large dictionaries with specialty dictionaries; for candidate selection, they used a very fast searching algorithm that depends on Levenshtein automata for efficiently selecting correction candidates with a bound not exceeding 3; and they ranked candidates depending on a number of features such as frequency and edit distance. [Mih04]

Suzan V., 2002, described a context sensitive spell checking algorithm based on the BESL spell checker lexicons and word trigrams for detecting and correcting real-word errors using probability information. The algorithm splits the input text into trigrams, and every trigram is looked up in a precompiled database containing a list of trigrams and the number of their occurrences in the corpus used for compiling the database. A trigram is correct if it is in the trigram database; otherwise, it is considered an erroneous trigram containing a real-word error. The correction algorithm uses the BESL spell checker to find candidates, but those most frequent in the trigram database are suggested to the user. [Ver02]
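Schematically, the trigram check described in [Ver02] amounts to the following sketch (an illustration of the idea only, not the BESL implementation; the trigram database would be precompiled from a corpus):

def word_trigrams(words):
    # Split a token list into overlapping word trigrams.
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def suspicious_trigrams(words, trigram_db):
    # A trigram absent from the precompiled database possibly
    # contains a real-word error.
    return [t for t in word_trigrams(words) if t not in trigram_db]

db = {("the", "dog", "ate"), ("dog", "ate", "a")}
print(suspicious_trigrams("the dog eat a".split(), db))
# [('the', 'dog', 'eat'), ('dog', 'eat', 'a')]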


Table 1.1: Summary of Literature Review

No. 1, Reference [Ach14]
Methodology: Generating the top k-candidates at the character level for both high and low frequency words.
Technique: A model consisting of rules and weights, and an algorithm dependent on a conditional probability distribution.

No. 2, Reference [Fel14]
Methodology: Grammatical error correction based on generating all possible correct alternatives for the sentence.
Technique: Combining the results of multiple systems: a rule based error corrector, an SMT incorrect-English-to-correct-English translator, and NLTK for segmentation, tokenization, and tagging.

No. 3, Reference [Gup14]
Methodology: Correction of non-native English speakers' grammatical errors.
Technique: Error detection uses the Stanford parser and TreeTagger; correction is based on the Nodebox English Linguistics library.

No. 4, Reference [Raa13]
Methodology: Dictionary based spell correction that treats the misspelled word as a document.
Technique: Converting the misspelled word into a tiny document of bits and retrieving the most similar documents using Hamming distance.

No. 5, Reference [Bas12]
Methodology: Context sensitive spell checking using a shared memory model allowing concurrency among threads for parallel execution.
Technique: Different N-gram levels for error detection, candidates generation, and candidates suggestion depending on the Yahoo! N-gram dataset 2.0.

No. 6, Reference [Seo12]
Methodology: Meta-classifier for grammatical error correction focused mainly on articles.
Technique: Deciding the output depending on the internal results from several base classifiers.

No. 7, Reference [Kir11]
Methodology: Automatic spell correction for user queries before applying the retrieval process.
Technique: Using the Levenshtein algorithm for both error detection and correction in a dictionary looking-up technique.

No. 8, Reference [Ahm09]
Methodology: Language independent model for non-word error correction based on N-gram statistics and lexical resources.
Technique: Ranking a list of correction candidates by assigning weights to the possible suggestions depending on a "MultiWordNet" dictionary of about 80,000 entries.

No. 9, Reference [Amb08]
Methodology: Noisy channel model for real-word error correction based on probability.
Technique: The channel represents the typist, the distortion represents the error, and the noise probability is a parameter.

No. 10, Reference [Mih04]
Methodology: OCR output post-correction.
Technique: Levenshtein automata for candidates generation and frequency for ranking.

No. 11, Reference [Ver02]
Methodology: Context sensitive spell checking algorithm based on trigrams.
Technique: Splitting texts into word trigrams and matching them against the precompiled BESL spell checker lexicons; suggestion depends on probability information.

    1.4 Research Objectives

This research attempts to design and implement a smart text document correction system for English texts. It is based on mining a typed text to detect spelling and grammar errors and to give the optimal suggestion(s) from a set of candidates. Its steps are:

1. Analyzing the given text by using Natural Language Processing techniques, detecting the erroneous words at each step.

2. Looking up candidates for the erroneous words and ranking them according to a given set of features and conditions to be the initial solutions.

3. Optimizing the initial solutions depending on the information extracted from the given text and the detected errors.


4. Recovering the input text document with the optimal solutions and associating the best set of candidates with each detected incorrect word.

    1.5 Thesis Outlines

The next five chapters are:

1. Chapter Two, "Background and Related Concepts", consists of two parts. The first overviews NLP fundamentals, applications, and techniques, whereas the second is about text correction techniques.

2. Chapter Three, "Dictionary Structure and Looking up Technique", describes the suggested approach of constructing the system's dictionary for both perfect matching and similarity looking up.

3. Chapter Four, "Error Detection and Candidates Generation", declares the suggested technique for indicating incorrect words and the method of generating candidates.

4. Chapter Five, "Automatic Text Correction and Candidates Suggestion", describes the techniques of suggestion selection and optimization.

5. Chapter Six, "Experimental Results, Conclusions, and Future Works", shows the experimental results of applying the techniques described in chapters three, four, and five, the conclusions of the system, and future directions.


    Chapter Two

    Background and Related Concepts

    Part I

    Natural Language Processing

    2.1 Introduction

Natural Language Processing (NLP) began in the late 1940s. It was focused on machine translation; in 1958, NLP was linked to information retrieval by the Washington International Conference on Scientific Information. [Jon01] Primary ideas for developing applications for detecting and correcting text errors started in that period of time. [Pet80] [Boo58]

Natural Language Processing has attracted great interest from that time to the present day because it plays an important role in the interaction between humans and computers. It represents the intersection of linguistics and artificial intelligence [Nad11], where machines can be programmed to manipulate natural language.

    2.2 Natural Language Processing Definition

    "Natural Language Processing (NLP) is the computerized approach

    for analyzing text that is based on both a set of theories and a set of

    technologies." [Sag13]

NLP describes the function of software or hardware components in a computer system that are capable of analyzing or synthesizing human languages (spoken or written) [Jac02] [Mis13] like English, Arabic, Chinese, etc., not formal languages like Python, Java, C++, etc., nor descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12].

    "NLP is a tool that can reside inside almost any text processing

    software application" [Wol11]

We can define NLP as a subfield of Artificial Intelligence encompassing anything needed by a computer to understand and generate natural language. It is based on processing human language for two tasks. The first receives a natural language input (text or speech), applies analysis, reasons about what was meant by that input, and outputs in computer language; this is the task of Natural Language Understanding (NLU). The second task is to generate human sentences according to specific considerations; the input is in computer language but the output is in human languages; this is called Natural Language Generation (NLG). [Raj09]

    "Natural Language Understanding is associated with the more

    ambitious goals of having a computer system actually comprehend natural

    language as a human being might". [Jac02]

    2.3 Natural Language Processing Applications

Despite its wide usage in computer systems, NLP has entirely disappeared into the background, where it is invisible to the user and adds significant business value. [Wol11]

    The major distinction of NLP applications from other data

    processing systems is that they use Language Knowledge. Natural

    Language Processing applications are mainly divided into two categories

    according to the given NL format [Mom12] [Wol11]:


2.3.1 Text Technologies

Spell and Grammar Checking: systems dealing with indicating lexical and grammar errors and suggesting corrections.

Text Categorization and Information Filtering: in such applications, NLP represents the documents linguistically and compares each one to the others. In text categorization, the documents are grouped according to their linguistic representation characteristics into several categories. Information filtering singles out, from a collection of documents, the documents that satisfy some criterion.

Information Retrieval: finds and collects information relevant to a given query. A user expresses the information need by a query, then the system attempts to match the given query to the database documents that satisfy the user's query. The query and documents are transformed into a sort of linguistic structure, and the matching is performed accordingly.

Summarization: according to an information need or a query from the user, this type of application finds the most relevant part of the document.

Information Extraction: refers to the automatic extraction of structured information from unstructured sources; structured information like entities, their relationships, and the attributes describing them. This can integrate structured and unstructured data sources, if both exist, and pose queries spanning the integrated information, giving better results than applying keyword searches alone.

Question Answering: works with plain speech or text input and applies an information search based on the input, such as IBM Watson, the reigning JEOPARDY! champion, which reads questions, understands their intention, and then looks up the knowledge library to find a match.

Machine Translation: translates a given text from one natural language to another; some applications have the ability to recognize the language of the given text even if the user didn't specify it correctly.

    Data Fusion: Combining extracted information from several text

    files into a database or an ontology.

Optical Character Recognition: digitizing handwritten and printed texts, i.e., converting characters from images to digital codes.

Classification: this NLP application type sorts and organizes information into relevant categories, like e-mail spam filters and the Google News service. NLP has also entered other applications such as educational essay test-scoring systems, voice-mail phone trees, and even e-mail spam detection software.

    2.3.2 Speech Technologies

Speech Recognition: mostly used in telephone voice response systems serving clients. Its task is processing plain speech. It is also used to convert speech into text.

    Speech Synthesis: means converting text into speech. This

    process requires working at the level of phones and converting

    from alphabetic symbols into sound signals.


    2.4 Natural Language Processing and Linguistics

Natural Language Processing is concerned with three dimensions: language, algorithm, and problem, as presented in figure (2.1). The language dimension considers linguistics; the algorithm dimension covers NLP techniques and tasks; while the problem dimension depicts the mechanisms applied to solve problems. [Bha12]

    2.4.1 Linguistics

Natural Language is a means of communication. It is a system of arbitrary signals such as voice sounds and written symbols. [Ali11] Linguistics is the scientific study of language; it starts from the simple acoustic signals which form sounds and ends with pragmatic understanding to produce the full context meaning.

There are two major levels of linguistic analysis, Speech Recognition (SR) and Natural Language Processing (NLP), as shown in figure (2.2).

    Figure (2.1) : NLP dimensions [Bha12]


    2.4.1.1 Terms of Linguistic Analysis

A natural language, as a formal language does, has a set of basic components that may vary from one language to another but remain bounded under specific considerations, giving each language its special characteristics.

From the computational view, a language is a set of strings generated over a finite alphabet and can be characterized by a grammar. The definition of the three abstracted names depends on the language itself; i.e. strings, alphabet, and grammar formulate and characterize a language.

[Figure (2.2) : Linguistics analysis steps [Cha10]. The analysis proceeds from the acoustic signal through phones (phonetics, phonology), letters and strings (lexicon), morphemes (morphology), words, phrases and sentences (syntax), meaning out of context (semantics), and meaning in context (pragmatics); SR covers the lower levels and NLP the upper ones.]

Strings:

In natural language processing, the strings are the morphemes of the language, their combinations (words), and the combinations of their combinations (sentences), but linguistics goes somewhat deeper than this. It starts with phones, the primitive acoustic patterns, which are significant and distinguishable from one natural language to another. Phonology groups phones together to produce phonemes, represented by symbols. Morphemes consist of one or more symbols; thus, NLs can be further distinguished.

Alphabet:

When individual symbols, usually thousands, represent words, the language is "logographic"; if the individual symbols represent syllables, it is a "syllabic" one; but when they represent sounds, the language is "alphabetic". Syllabic and alphabetic languages typically have fewer than 100 symbols, unlike logographic ones.

English is an alphabetic language system consisting of 26 symbols; these symbols represent phones, which are combined into morphemes, which may or may not be combined further to form words.

Grammar:

Grammar is a set of rules specifying the legal structures of the language; it is a declarative representation of the language's syntactic facts. Usually, grammar is represented by a set of production rules.


    2.4.1.2 Linguistic Units Hierarchy

Language can be divided into pieces; there is a typical structure or form for every level of analysis. Those pieces can be put into a hierarchical structure starting from a meaningful sentence at the top level and proceeding in the separation of building units until reaching the primary acoustic sounds. Figure (2.3) presents an example.

Figure (2.3) : Linguistic Units Hierarchy; the sentence "The teacher talked to the students" decomposed into phrases, words, morphemes (the + teach + er + talk + ed + to + the + student + s), and phonemes, whose symbols follow the phonetic codes used by the OXFORD dictionaries.

    2.4.1.3 Sentence Structure and Constituency

    "It is constantly necessary to refer to units smaller than the sentence

    itself units such as those which are commonly referred as CLAUSE,

    PHRASE, WORD, and MORPHEME. The relation between one unit and

    another unit of which it is a part is CONSTITUENCY." [Qui85]

The task of dividing a sentence into constituents is a complex task that


requires incorporating more than one analysis stage: tokenization, segmentation, parsing, and sometimes stemming are usually merged together to build the parse tree for a given sentence.

    2.4.1.4 Language and Grammar

A language is a 'set' of sentences, and a sentence is a 'sequence' of 'symbols' [Gru08]; it can be generated given its context free grammar G = (V, Σ, S, P). [Cla10]

Commonly, grammars are represented as a set of production rules which are taken by the parser and compared against the input sentences. Every matched rule adds something to the sentence's complete structure, which is called a 'parse tree'. [Ric91]

Context free grammar (CFG) is a popular method for generating formal grammars. It is used extensively to define language syntax. The four components of the grammar are defined in CFG as [Sag13]:

Terminals (Σ): the basic elements which form the strings of the language.

Nonterminals or Syntactic Variables (V): sets of strings defining the language which is generated by the grammar. Nonterminals represent a key in syntax analysis and translation via imposing a hierarchical structure on the language.

Set of production rules (P): this set defines the way of combining terminals with nonterminals to produce strings. A production rule consists of a variable on its left side, representing its head, and a body of terminals and/or nonterminals on its right side.

Start symbol (S): the designated nonterminal from which every derivation starts.

The following is an example describing the structure of an English sentence:


V = {S, NP, N, VP, V, Art}

Σ = {boy, icecream, dog, bite, like, ate, the, a}

P = {S → NP VP,
     NP → N,
     NP → Art N,
     VP → V NP,
     N → boy | icecream | dog,
     V → ate | like | bite,
     Art → the | a}
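This grammar can be tried directly; the sketch below writes it in the notation of the NLTK toolkit mentioned in Section 1.3 (assuming NLTK is installed; this is an illustration, not part of the proposed system):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> N | Art N
VP -> V NP
N -> 'boy' | 'icecream' | 'dog'
V -> 'ate' | 'like' | 'bite'
Art -> 'the' | 'a'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog ate the icecream".split()):
    print(tree)
# (S (NP (Art the) (N dog)) (VP (V ate) (NP (Art the) (N icecream))))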

The grammar specifies two things about the language: [Ric91]

- Its weak generative capacity: the limited set of sentences which can be completely matched by a series of grammar rules.

- Its strong generative capacity: the grammatical structure(s) of each sentence in the language.

Generally, for each grammar there are an infinite number of sentences which can be structured with it. The strength and importance of grammars lie in their ability to supply structure to an infinite number of sentences, because they succinctly summarize the structures of an infinite number of objects of a certain class. [Gru08]

The grammar is said to be generative if it has a fixed-size set of production rules which, if followed, can generate every sentence in the language in a finite number of actions. [Gru08]


    2.5 Natural Language Processing Techniques

    2.5.1 Morphological Analysis

Morphology is the study of how words are constructed from morphemes, which represent the minimal meaning-bearing primitive units of a language. [Raj09] [Jur00]

There are two broad classes of morphemes: stems and affixes; the distinction between the two classes is language dependent in that it varies from one language to another. The stem usually refers to the main part of the word, and affixes can be added to words to give them additional meaning. [Jur00]

Furthermore, affixes can be divided into four categories according to the position where they are added. Prefixes, suffixes, circumfixes, and infixes generally refer to the different types of affixes, but it is not necessary for a language to have all the types. English accepts both prefixes, which precede stems, and suffixes, which follow stems, while there is no good example of a circumfix (preceding and following a stem) in English, and infixing (inserting inside the stem) is not allowed (unlike German and Philippine languages, respectively). [Jur00]

    Morphology is concerned with recognizing the modification of base

    words to form other words with different syntactic categories but similar

    meanings.

Generally, three forms of word modification are found [Jur00]:

Inflection: syntactic rules change the textual representation of words, such as adding the suffix 's' to convert nouns into plurals, or adding 'er' and 'est' to convert regular adjectives into comparative and superlative forms, respectively. This type of modification usually results in a word from the same word class as the stem word.

Derivation: new words are produced by adding morphemes; this is usually more complex and harder in meaning than inflectional morphology. It often occurs in a regular manner and results in words that differ in their word class from the stem word, like adding the suffix 'ness' to 'happy' to produce 'happiness'.

Compounding: this type modifies stem words with other stem words by grouping them, like grouping 'head' with 'ache' to produce 'headache'. In English, this type is infrequent.

Morphological processing, also known as stemming, depends heavily on the language being analyzed. The output is the set of morphemes that are combined to form words. Morphemes can be stem words, affixes, and punctuation marks.
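As a small illustration of suffix stripping, NLTK's Porter stemmer (a rule-based approximation, assumed here only for demonstration) reduces inflected and derived forms to stems, which need not themselves be dictionary words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["talked", "students", "happiness"]:
    print(word, "->", stemmer.stem(word))
# talked -> talk, students -> student, happiness -> happi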

    2.5.2 Part Of Speech Tagging

Part of Speech (POS) tagging is the process of giving the proper lexical information or POS tag (also known as word classes, lexical tags, and morphological classes), encoded as a symbol, to every word (or token) in a sentence. [Sco99] [Has06b]

    In English, POS tags are classified into four basic classes of words: [Qui85]

    1. Closed classes: include prepositions, pronouns, determiners,

    conjunctions, modal verbs and primary verbs.

    2. Open classes: include nouns, adjectives, adverbs, and full verbs.

3. Numerals: include cardinal numbers and ordinals.

    4. Interjections: include small set of words like oh, ah, ugh, phew.

Usually, a POS tag indicates one or more of the previous classes, and it sometimes holds other features like the tense of the verb or the number (plural or singular). POS tagging may generate tagged corpora or serve as a preprocessing step for subsequent NLP processes. [Sco99]

The performance of most tagging systems is typically limited because they only use the local lexical information available in the sentence, in contrast to syntax analyzing systems, which exploit both lexical and structural information. [Sco99] More research was done, and several models and methods have been proposed to enhance tagger performance; they fall mainly into supervised and unsupervised methods, where the main difference between the two categories is that the training corpora are pre-tagged in supervised methods, unlike in unsupervised methods, which need advanced computational methods to obtain such corpora. [Has06a] [Has06b] Figure (2.4) presents the main categories and shows some examples.

    In both categories, the following are the most popular:

    Figure (2.4) : Classification of POS tagging models [Has06a]


Statistical (stochastic, or probabilistic) methods: taggers using these methods are first trained on a correctly tagged set of sentences, which allows the tagger to disambiguate words by extracting implicit rules or picking the most probable tag based on the words surrounding the given word in the sentence. Examples of these methods are Maximum-Entropy models, Hidden Markov Models (HMM), and Memory Based models.

Rule based methods: a sequence of rules, a set of hand written rules, is applied to detect the best tag set for the sentence regardless of any probability maximization. The set of rules needs to be written properly and checked by human experts. Examples: the path-voting constraint models and decision tree models.

Transformational approach: combines both statistical methods and rule based methods to first find the most probable set of available tags and then apply a set of rules to select the best.

Neural Networks: with a linear separator or a full neural network, these have been used for tagging processes.

The methods described above, as in any other research area, have their advantages and disadvantages, but there is a major difficulty facing all of them: the tagging of unknown words (words that have never been seen before in the training corpora). While rule-based approaches depend on a special set of rules to handle such situations, stochastic and neural net methods lack this feature and use other means, such as suffix analysis and n-grams, by applying morphological analysis; some methods use a default set of tags to disambiguate unknown words. [Has06a]
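For illustration, NLTK's default statistical tagger can be applied as follows (assuming the required NLTK resources are downloaded; the tag set shown is the Penn Treebank one):

import nltk

tokens = nltk.word_tokenize("The teacher talked to the students")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('teacher', 'NN'), ('talked', 'VBD'),
#       ('to', 'TO'), ('the', 'DT'), ('students', 'NNS')]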


    2.5.3 Syntactic Analysis

    "Syntax is the study of the relationships between linguistics forms,

    how they are arranged in sequence, and which sequences are well-

    formed". [Yul00]

    Syntactic analysis, also referred by "Parsing", is the process of

    converting the sentence from its flat format which is represented as a

    sequence of words into a structure that defines its units and the relations

    between these units. [Raj09]

Hence, the goal of this technique is to transform natural language into an internal system representation. The format of this representation may be dependency graphs, frames, trees, or some other structural representation. Syntactic parsing attempts only to convert sentences into either dependency links representing the utterance's syntactic structure or a tree structure, and the output of this process is called a "parse tree" or simply a "parse". [Dzi04] The parse tree of the sentence holds its meaning at the level of the smallest parts ("words" in the terms of language scientists, "tokens" in the terms of computer scientists). [Gru08]

Syntactic analysis makes use of both the results of morphological analysis and Part-Of-Speech tagging to build the structural description of the sentence by applying the grammar rules of the language under consideration; if a sentence violates the rules, it is rejected and marked as incorrect. [Raj09]

    The two main components of every syntax analyzer are:

Grammar: the grammar provides the analyzer with the set of production rules used to construct the structure of sentences, and it specifies the correctness of every given sentence.


Good grammars make a careful distinction between the sentence/word level, which they often call syntax or syntaxis, and the word/letter level, which they call morphology. [Gru08]

Parser: the parser reconstructs the production tree (or trees) by applying the grammar to indicate how the given sentence (if correctly constructed) was produced from that grammar. Parsing is thus the process of structuring a linear representation in accordance with a given grammar.

Today, most parsers combine context-free grammars with probability models to determine the most likely syntactic structure out of the many that are accepted as parse trees for an utterance. [Dzi04]
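As a toy-scale illustration of a parser (not the grammar used in this work), the following Python sketch hand-codes a recursive-descent parser for the tiny grammar S → NP VP, VP → V NP, NP → "John" | Det N, and returns the parse tree as nested tuples; the lexicon is an illustrative assumption:

    # Tiny grammar: S -> NP VP ; VP -> V NP ; NP -> "John" | Det N
    def parse_sentence(tokens):
        np, rest = parse_NP(tokens)
        if np is None:
            return None
        vp, rest = parse_VP(rest)
        if vp is None or rest:           # reject if anything is left over
            return None
        return ("S", np, vp)

    def parse_NP(tokens):
        if tokens and tokens[0] == "John":
            return ("NP", "John"), tokens[1:]
        if len(tokens) >= 2 and tokens[0] in ("his", "the") and tokens[1] in ("uncle", "house"):
            return ("NP", ("Det", tokens[0]), ("N", tokens[1])), tokens[2:]
        return None, tokens

    def parse_VP(tokens):
        if tokens and tokens[0] in ("visits", "sees"):
            np, rest = parse_NP(tokens[1:])
            if np:
                return ("VP", ("V", tokens[0]), np), rest
        return None, tokens

    print(parse_sentence("John visits his uncle".split()))
    # ('S', ('NP', 'John'), ('VP', ('V', 'visits'), ('NP', ('Det', 'his'), ('N', 'uncle'))))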

    2.5.4 Semantic Analysis

    "Semantics is the study of the relationships between linguistic

    forms and entities in the words; that is, how words literally connect to

    things." [Yul00]

    This technique and the later following it are basically depended by

    language understanding. Semantic analysis is the process of assigning

    meanings to the syntactic structures of the sentences regardless of its

    context. [Yul00] [Raj09]

    2.5.5 Discourse Integration

Discourse analysis is concerned with studying the effect of sentences on each other: it shows how a given sentence is affected by the one preceding it and how it affects the sentence following it. Discourse integration is relevant to understanding texts and paragraphs rather than simple sentences; thus, discourse knowledge is important in the interpretation of pronouns and temporal aspects of the conveyed information. [Ric91] [Raj09]

    2.5.6 Pragmatic Analysis

This step interprets the structure that represents what was said in order to determine what was actually meant. Context is a fundamental resource for processing here. [Ric91]

    2.6 Natural Language Processing Challenges

The challenges of natural language processing are too numerous to be summarized in a limited list; every processing step, from the starting point to the outputting of results, carries a set of problems that natural language processors vary in their ability to handle. However, the application where NLP is used is usually concerned with a specific task rather than with all processing steps in all their details; this is an advantage for the NLP community, since it helps to outline the challenges and problems according to the task under consideration.

For our research area, we are precisely concerned with the set of problems that directly affect the task of text correction; the next subsections describe some of them:

    2.6.1 Linguistic Units Challenges:

The task of text correction spans the levels from characters up to paragraphs and full texts; at every level there is a set of difficulties that the handling analyzer faces:

    2.6.1.1 Tokenization

In this process the lexical analyzer, usually called a "tokenizer", divides the text into smaller units; the output of this step is a series of morphemes, words, expressions and punctuation marks (called tokens). It involves locating token boundaries (where one token ends and another begins).

Issues that arise in tokenization and should be addressed are [Nad11]:

Dependence on the language type: a language includes, in addition to its symbols, a set of orthographic conventions which are used in writing to indicate the boundaries of linguistic units. English employs whitespace to separate words, but this is not sufficient to tokenize a text in a complete and unambiguous manner, because the same character may serve different uses (as is the case with punctuation), there are words with multiple parts (such as words divided by a hyphen at the end of a line, and some cases of prefix addition), and many expressions consist of more than one word.

Encoding problems: syllabic and alphabetic writing systems are usually encoded using a single byte, but languages with larger character sets require two or more bytes. A problem arises when the same set of encodings represents different character sets, whereas tokenizers are targeted at a specific encoding for a specific language.

Other problems, such as the dependency on the application requirements, which indicate what constituent is defined as a token; in computational linguistics the definition should precisely indicate what the next processing step requires. The tokenizer should also have the ability to recognize irregularities in texts, such as misspellings, erratic spacing and punctuation, etc.
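The following minimal Python sketch illustrates the first of these issues: a regular-expression tokenizer that keeps multi-part abbreviations and hyphenated words as single tokens. The pattern and its coverage are illustrative assumptions; as the sample output shows, it still mishandles contractions:

    import re

    TOKEN_PATTERN = re.compile(
        r"[A-Za-z]\.(?:[A-Za-z]\.)+"   # abbreviations such as "U.S."
        r"|\w+(?:-\w+)*"               # words, including hyphenated ones
        r"|[^\w\s]"                    # any single punctuation symbol
    )

    def tokenize(text):
        return TOKEN_PATTERN.findall(text)

    print(tokenize("The U.S. spell-checker works; doesn't it?"))
    # ['The', 'U.S.', 'spell-checker', 'works', ';', 'doesn', "'", 't', 'it', '?']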

    2.6.1.2 Segmentation

Segmenting text means dividing it into small meaningful pieces, typically referred to as "sentences"; a sentence consists of one or more tokens and carries a meaning which may not be completely clear. This task requires full knowledge of the scope of punctuation marks, since they are the major factor in denoting the starts and ends of sentences.

Segmentation becomes more complicated as the uses of punctuation multiply. Some punctuation marks can be part of a token rather than a stopping mark, as is the case with periods (.) used in abbreviations.

However, there is a set of factors that can help make the segmentation process more accurate [Nad11]:

Case distinction: English sentences normally start with a capital letter (but proper nouns are also capitalized).

POS tags: the tags surrounding a punctuation mark can assist this process, but multi-tag situations complicate it, such as the use of -ing verbs as nouns.

The length of the word (in the case of abbreviation disambiguation; notice that a period may mark the end of a sentence and an abbreviation at the same time).

Morphological information: this task requires finding the stem of a word by suffix removal.

It is usual not to separate the tokenization and segmentation processes; they are often merged together to solve most of the above problems, specifically the segmentation problems.
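A minimal Python sketch combining two of these factors, a known-abbreviation list and the case of the following token; the abbreviation list is an illustrative assumption and far from complete:

    ABBREVIATIONS = {"dr.", "prof.", "mr.", "e.g.", "i.e.", "etc."}

    def segment(text):
        tokens = text.split()
        sentences, current = [], []
        for i, token in enumerate(tokens):
            current.append(token)
            if token.endswith((".", "!", "?")):
                is_abbrev = token.lower() in ABBREVIATIONS
                next_capital = i + 1 < len(tokens) and tokens[i + 1][0].isupper()
                # split only when the period is not an abbreviation and the
                # next token starts a new sentence (or the text ends here)
                if not is_abbrev and (next_capital or i + 1 == len(tokens)):
                    sentences.append(" ".join(current))
                    current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(segment("Prof. Al-Bakry supervised the work. It was finished in 2015."))
    # ['Prof. Al-Bakry supervised the work.', 'It was finished in 2015.']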

A sentence is described as an indeterminate unit because of the difficulty of deciding where one sentence ends and another starts, while the grammar is indeterminate from the standpoint of deciding which sentence is grammatically correct, because this question permits divisive answers; the difficulty of discourse segmentation is not the only reason, but also grammatical acceptability, meaning, goodness or badness of style, lexical acceptability, context acceptability, etc. [Qui85]

    2.6.2 Ambiguity

An input is ambiguous if there is more than one alternative linguistic structure for it. [Jur00]

There are two major types of sentence ambiguity: genuine ambiguity and computer ambiguity. In the first, the sentence really has two different meanings to an intelligent hearer; in the second, the sentence has one meaning, but for the computer it has more than one, and this type, unlike the first, is a real problem facing NLP applications. [Not]

Ambiguity as an NLP problem is found in every processing step [Not] [Bha12]:

    2.6.2.1 Lexical Ambiguity

Lexical ambiguity is the possibility for a word to have more than one meaning or more than one POS tag. Obviously, meaning ambiguity leads to semantic ambiguity, and tag ambiguity leads to syntactic ambiguity, because it can produce more than one parse tree. Frequency is an available solution to this problem.

    2.6.2.2 Syntactic Ambiguity

The sentence has more than one syntactic structure; in particular, the common English ambiguity sources are:

Phrase attachment: how a certain phrase or clause in the sentence can be attached to another when there is more than one possibility. Crossing is not allowed in parse trees; therefore, a parser generates a parse tree for each accepted state.

Conjunction: sometimes the parser is puzzled over which phrase a conjunction should be connected to.

Noun group structure: the rule NG → NG NG allows English to generate long series of nouns strung together.

Some of these problems can be resolved by applying syntactic constraints.

    2.6.2.3 Semantic Ambiguity

Even when a sentence is lexically and syntactically unambiguous, there is sometimes more than one interpretation for it, because a phrase or a word may refer to more than one meaning.

"Selection restrictions" or "semantic constraints" are one way to disambiguate such sentences: two concepts are combined into one only if both concepts, or one of them, have specific features. Frequency in context can also help in deciding the meaning of a word.

    2.6.2.4 Anaphoric Ambiguity

This is the possibility for a word or a phrase to refer to something previously mentioned, when the reference admits more than one possibility.

This type can be resolved by parallel structures or by recency rules.

    2.6.3 Language Change

    "All living languages change with time, it is fortunate that they do so

    rather slowly compare to the human life". Language change is represented

    by the change of grammars of people who speak the language and it has

    been shown that English was changed in its lexicon, phonological,

  • Chapter Two Part I: Natural Language Processing _________________________________________________________________________

    33

    morphological, syntax, and semantic components of the grammar over the

    past 1,500 years. [Fro07]

    2.6.3.1 Phonological Change

Regular sound correspondences show how the phonological system changes. The phonological system is governed, like any other linguistic system, by a set of rules, and this set of phonemes and phonological rules is subject to change by modification, deletion and addition of new rules. A change in the phonological rules can affect the lexicon, in that some English word formations depend on sounds; for example, vowel sounds differentiate the nouns house and bath from the verbs house and bathe.

    2.6.3.2 Morphological Change

Morphological rules, like the phonological ones, are subject to addition, loss and change. Mostly, the usage of suffixes is the active area of change, where the way suffixes are added to the ends of stems affects the resulting words and therefore changes the lexicon.

    2.6.3.3 Syntactic Change

Syntactic changes are influenced by morphological changes, which are in turn influenced by phonological changes. This type of change includes all grammar modifications that are mainly based on the reordering of words inside the sentence.

    2.6.3.4 Lexical Change

Change of lexical categories is the most common in this type of change; an example is the usage of nouns as verbs, verbs as nouns, and adjectives as nouns. Lexical change also includes the addition of new words, the borrowing or loaning of words from other languages, and the loss of existing words.

Figure (2.5): An example of lexical change (comic strip: Darby Conley / Get Fuzzy, UFS, Inc., 24 Feb. 2012)

    2.6.3.5 Semantic Change

As the category of a word can change, its semantic representation or meaning can change, too. Three types of change are possible for a word:

Broadening: the meaning of a word is expanded so that it means everything it used to mean and more.

Narrowing: the reverse of broadening; the word's meaning is reduced from a more general meaning to a specific one.

Shifting: the word's reference is shifted to another meaning somewhat different from the original one.



    Part II

    Text Correction

    2.7 Introduction

Text correction is the process of indicating incorrect words in an input text, finding candidates (or alternatives), and suggesting the candidates as corrections for the incorrect word. The term incorrect refers to two different types of erroneous words: misspelled and misused. Mainly, the process is divided into two distinct phases: an error detection phase, which indicates the incorrect words, and an error correction phase, which combines both generating and suggesting candidates.

Devising techniques and algorithms for correcting texts in an automatic manner is a primal open research challenge that started in the early 1960s and continues until now, because the existing correction techniques are limited in their accuracy and application scope [Kuk92]. Usually, a correction application concerns a specific type of errors, because it is a complex task to computationally predict an intended word written by a human.

    2.8 Text Errors

A word can be mistaken in two ways. The first is by incorrectly spelling the word, due to a lack of knowledge of the word's spelling or to unintentionally mistyping symbol(s) within the word; this type of error is known as a non-word error, where the word cannot be found in the language lexicon.

The second is by using a correctly spelled word in a wrong position in the sentence or in an unsuitable context; these errors are known as real-word errors, where the incorrect word is accepted in the language lexicon. [Gol96][Amb08]

Non-word errors are easier to detect; real-word errors, in contrast, need more information about the language's syntactic and semantic nature. Accordingly, correction techniques are divided into isolated-word error detection, which is concerned with non-word errors, and context-sensitive error correction, which deals with real-word errors. [Gol96]

    2.8.1 Non-word errors

These errors include words that are not found in the lexicon; a misspelled word contains one or more of the following errors:

Substitution: one or more symbols are changed.

Deletion: one or more symbols are missing from the intended word.

Insertion: symbol(s) are added at the front, the end, or any index in the word.

Transposition: two adjacent symbols are swapped.

These four errors are known as the Damerau edit operations.
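To make the four operations concrete, the following minimal Python sketch (illustrative, not the proposed system) enumerates every string lying exactly one Damerau edit operation away from a word; such enumeration is one common way of generating correction candidates:

    import string

    def damerau_variants(word, alphabet=string.ascii_lowercase):
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
        inserts = [l + c + r for l, r in splits for c in alphabet]
        return set(deletes + transposes + substitutes + inserts)

    variants = damerau_variants("peace")
    print("pece" in variants,    # deletion of 'a'
          "paece" in variants,   # transposition of 'e' and 'a'
          "pearce" in variants)  # insertion of 'r'
    # True True True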

    2.8.2 Real-word errors

These errors occur by mistaking an intended word for another one that is lexically accepted. Real-word errors can result from phonetic confusion, like using the word "piece" instead of "peace", which usually leads to semantically unaccepted sentences even after non-word correction has been applied, or from misspelling the intended word in a way that produces another lexically accepted word. [Amb08]

Sometimes the confusion results in syntactically unaccepted sentences, like writing the sentence "John visit his uncle" instead of "John visits his uncle".

Correcting real-word errors is context sensitive in that it needs to check the surrounding words and sentences before suggesting candidates.

    2.9 Error Detection Techniques

Indicating whether a word is correct or not depends on the type of correction procedure. Non-word error detection usually checks the acceptance of a word against the language dictionary (the lexicon) and marks any mismatched word as incorrect, while real-word error detection is a more complex task that requires analysing larger parts of the text, typically paragraphs or the full text [Kuk92]. In this work, we mainly focus on non-word error detection techniques.

Dustin defined a spelling error as an error E in a given query word Q which is not an entry in the dictionary D at hand [Bos05], and he outlined an algorithm for spelling correction as shown in figure (2.6).

    Spell error detection techniques can be classified into two major types:

    2.9.1 Dictionary Looking Up

All the words of a given text are matched against every word in a pre-created dictionary or a list of all acceptable words in the language under consideration (or most of them, since some languages have a huge number of words and collecting them all is a semi-impossible task). A word is incorrect if and only if no match is found. This technique is robust but suffers from the long time required for checking: as the dictionary size becomes larger, the look-up time becomes longer. [Kuk92] [Mis13]

    2.9.1.1 Dictionaries Resources

There are many systems that deal with collecting and updating languages' lexical dictionaries. An example of these systems is the WordNet online application, a large database of English lexicons. Lexicons (nouns, verbs, adjectives, adverbs, etc.) are interlinked by lexical and conceptual-semantic relations. The structure of WordNet is a network of meaningfully related words and concepts, and this structure has made it a good tool for NLP and computational linguistics.

Another example is the ISPELL text corrector, an online spell checker that provides interfaces for many western languages. ISPELL is the latest version of R. Gorin's spell checker, which was developed for Unix. Suggestion of a spelling correction is based on a single Levenshtein edit distance, and depends on looking up every token of the input text in a huge lexical dictionary. [ISP14]

    2.9.1.2 Dictionaries Structures

The standard look-up technique is to match every token of the text against every token in the dictionary, but this process requires a long time, because NL dictionaries are usually of huge size and string matching needs more time than other data types do. A solution to this challenge is to reduce the search space in a way that keeps similarly spelled tokens grouped together.

Figure (2.6): Outline of a Spell Correction Algorithm [Bos05]

Algorithm: Spell_correction
Input: word w
Output: suggestion(s), a set of alternatives for w
Begin
  If (is_mistake(w))
  Begin
    candidates = get_candidates(w)
    suggestions = filter_candidates(candidates)
    Return suggestions
  End
  Else
    Return is_correct
End.
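The outline above translates directly into Python. In the following sketch, LEXICON, get_candidates and filter_candidates are illustrative placeholders standing in for the detection and correction techniques of sections 2.9 and 2.10:

    LEXICON = {"peace", "piece", "place", "john", "visits", "his", "uncle"}

    def is_mistake(word):
        return word.lower() not in LEXICON

    def get_candidates(word):
        # stand-in for edit-distance or similarity-key candidate generation
        return [w for w in LEXICON if abs(len(w) - len(word)) <= 1]

    def filter_candidates(word, candidates):
        # stand-in for ranking; here: by the number of shared letters
        return sorted(candidates, key=lambda w: -len(set(w) & set(word.lower())))[:3]

    def spell_correction(word):
        if is_mistake(word):
            return filter_candidates(word, get_candidates(word))
        return "is_correct"

    print(spell_correction("peice"))  # e.g. ['piece', 'peace', 'place']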


Grouping according to spelling or phonetics [Mis13] and using hash tables are two fundamental ways to minimize the search space.

Hashing techniques apply a hash function to generate a numeric key from strings. The numeric keys are references to packets (buckets) of tokens that generate the same key indices; hash functions differ in their ability to distribute tokens and in how much they minimize the search space. A perfect hash function generates no collisions (a collision is hashing two different tokens to the same key index), and a uniform hash function distributes tokens among packets uniformly. The optimal hash function is a uniform perfect hash function, which hashes exactly one token to every packet; such a situation is impossible with dictionaries due to the variance of tokens. [Nie09]

Spelling- and phonetics-dependent groups use a limited set of packets and generate keys according to spelling or pronunciation; they are another style of hashing, and sometimes of clustering. SPEEDCOP and Soundex are examples. [Mis13] [Kuk92]
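As an illustration of pronunciation-based grouping, the following is a minimal Python sketch of the classic Soundex key; the letter-to-digit map is the standard one, but the implementation is a simplified reading of the algorithm rather than a definitive one:

    SOUNDEX_MAP = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                   **dict.fromkeys("dt", "3"), "l": "4",
                   **dict.fromkeys("mn", "5"), "r": "6"}

    def soundex(word):
        word = word.lower()
        key, last = word[0].upper(), SOUNDEX_MAP.get(word[0], "")
        for ch in word[1:]:
            code = SOUNDEX_MAP.get(ch, "")
            if code and code != last:
                key += code
            if ch not in "hw":        # 'h' and 'w' do not reset the previous code
                last = code
        return (key + "000")[:4]      # pad with zeros and cut to four symbols

    print(soundex("peace"), soundex("piece"))    # P200 P200 (same packet)
    print(soundex("Robert"), soundex("Rupert"))  # R163 R163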

    2.9.2 N-gram Analysis

N-grams are defined as subsequences of length n of words or strings, where n is variable, often taking the value one to produce unigrams (or monograms), two to produce bigrams (sometimes called "digrams"), or three to produce trigrams; it rarely takes larger values. This technique detects errors by examining each n-gram in the given string and looking it up in a precompiled n-gram statistics table. The decision depends on the existence of such an n-gram or on the frequency of its occurrence: if the n-gram is not found, or is highly infrequent, then the words or strings which contain it are considered incorrect. [Kuk92] [Mis13]
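A minimal Python sketch of this idea, assuming character bigrams with boundary markers and a tiny illustrative corpus in place of a real statistics table:

    from collections import Counter

    def bigrams(word):
        word = f"_{word.lower()}_"       # '_' marks the word boundaries
        return [word[i:i + 2] for i in range(len(word) - 1)]

    def build_table(corpus_words):
        table = Counter()
        for w in corpus_words:
            table.update(bigrams(w))
        return table

    def looks_misspelled(word, table, min_count=1):
        # a word is suspicious if any of its bigrams is unseen or too rare
        return any(table[b] < min_count for b in bigrams(word))

    table = build_table(["peace", "piece", "place", "uncle", "visits"])
    print(looks_misspelled("peace", table))  # False
    print(looks_misspelled("pqace", table))  # True: the bigram 'pq' never occurs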


    2.10 Error Correction Techniques

Many techniques have been proposed to solve the problem of generating candidates for a detected misspelled word; they vary in the required resources, application scope, time and space complexity, and accuracy. The most common are [Kuk92] [Mis13]:

    2.10.1 Minimum Edit Distance Techniques

This technique is based on counting the minimum number of primal operations required to convert the source string into the target one. Some researchers take the primal operations to be insertion, deletion, and substitution of one letter by another; others add the transposition of two adjacent letters as a fourth primal operation. Examples are the Levenshtein algorithm, which counts a distance of one for every primal operation; the Hamming algorithm, which works like Levenshtein but is limited to strings of equal length; and the Longest Common Substring, which finds the mutual substring between two words.

Levenshtein, shown in figure (2.7) [Hal11], is preferred because it places no limitation on the types of symbols or on their lengths. It can be executed in time complexity O(M·N), where M and N are the lengths of the two input strings.

The algorithm can detect three types of errors (substitution, deletion, and insertion). It does not count the transposition of two adjacent symbols as one edit operation; instead, it counts such errors as two consecutive substitution operations, giving an edit distance of 2.


One of the well-known modifications of the original Levenshtein method is due to Fred Damerau, whose research found that the four error types together account for about 80% to 90% of all errors; the resulting measure is known as the Damerau-Levenshtein distance. [Dam64]

The modified method requires a longer execution time than the original: in every checking round, the method applies an additional comparison to check whether a transposition took place in the string, then applies another comparison to select the minimum between the previous distance and the distance with the occurrence of a transposition. This step doubles the time complexity, resulting in O(2·M·N).

Figure (2.7): Levenshtein Edit Distance Algorithm [Hal11]

Algorithm: Levenshtein Edit Distance
Input: String1, String2
Output: number of edit operations
Step 1: Declaration
  if String1 is empty, return length of String2
  if String2 is empty, return length of String1
  create matrix distance(0..length of String1, 0..length of String2)
  distance(r, 0) = r for every row r; distance(0, c) = c for every column c
Step 2: Calculate distance
  for each symbol x in String1 do
    for each symbol y in String2 do
    begin
      if x = y then cost = 0 else cost = 1
      r = index of x, c = index of y
      min1 = distance(r - 1, c) + 1          // deletion
      min2 = distance(r, c - 1) + 1          // insertion
      min3 = distance(r - 1, c - 1) + cost   // substitution
      distance(r, c) = minimum(min1, min2, min3)
    end
Step 3: Return the value of the last cell in the distance matrix
  return distance(length of String1, length of String2)
End.
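For reference, the dynamic programming of figure (2.7) can be rendered in a few lines of Python; this sketch uses a full (M+1)×(N+1) matrix indexed from zero:

    def levenshtein(s1, s2):
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for r in range(m + 1):
            d[r][0] = r                       # r deletions empty a prefix of s1
        for c in range(n + 1):
            d[0][c] = c                       # c insertions build a prefix of s2
        for r in range(1, m + 1):
            for c in range(1, n + 1):
                cost = 0 if s1[r - 1] == s2[c - 1] else 1
                d[r][c] = min(d[r - 1][c] + 1,         # deletion
                              d[r][c - 1] + 1,         # insertion
                              d[r - 1][c - 1] + cost)  # substitution
        return d[m][n]

    print(levenshtein("peace", "piece"))  # 2 (two substitutions)
    print(levenshtein("form", "from"))    # 2: a transposition costs 2 here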


Hence, in this work, the original Levenshtein method (figure (2.7)) is modified to consider Damerau's four error types within a time complexity shorter than that consumed by the Damerau-Levenshtein algorithm and close to the original method. Figure (2.8) shows Damerau's modification of the Levenshtein method.

Algorithm: Damerau-Levenshtein Distance
Input: String1, String2
Output: number of Damerau edit operations
Step 1: Declaration
  if String1 is empty, return length of String2
  if String2 is empty, return length of String1
  create matrix distance(0..length of String1, 0..length of String2)
  distance(r, 0) = r for every row r; distance(0, c) = c for every column c
Step 2: Calculate distance
  for each symbol x in String1 do
    for each symbol y in String2 do
    begin
      if x = y then cost = 0 else cost = 1
      r = index of x, c = index of y
      min1 = distance(r - 1, c) + 1          // deletion
      min2 = distance(r, c - 1) + 1          // insertion
      min3 = distance(r - 1, c - 1) + cost   // substitution
      distance(r, c) = minimum(min1, min2, min3)
      if not (String1 starts with x) and not (String2 starts with y) then
        if (the symbol preceding x = y) and (the symbol preceding y = x) then
          distance(r, c) = minimum(distance(r, c), distance(r - 2, c - 2) + cost)  // transposition
    end
Step 3: Return the value of the last cell in the distance matrix
  return distance(length of String1, length of String2)
End.

Figure (2.8): Damerau-Levenshtein Edit Distance Algorithm [Dam64]
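The corresponding Python sketch of figure (2.8) adds only the transposition check inside the Levenshtein loop:

    def damerau_levenshtein(s1, s2):
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for r in range(m + 1):
            d[r][0] = r
        for c in range(n + 1):
            d[0][c] = c
        for r in range(1, m + 1):
            for c in range(1, n + 1):
                cost = 0 if s1[r - 1] == s2[c - 1] else 1
                d[r][c] = min(d[r - 1][c] + 1,         # deletion
                              d[r][c - 1] + 1,         # insertion
                              d[r - 1][c - 1] + cost)  # substitution
                if (r > 1 and c > 1 and s1[r - 1] == s2[c - 2]
                        and s1[r - 2] == s2[c - 1]):   # adjacent swap found
                    d[r][c] = min(d[r][c], d[r - 2][c - 2] + cost)
        return d[m][n]

    print(damerau_levenshtein("form", "from"))  # 1: the swap counts as one edit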


    2.10.2 Similarity Key Techniques

As its name indicates, this technique finds a key that groups similarly spelled words together. The similarity key is computed for the misspelled word and mapped to a pointer that refers to the group of words whose spelling is similar to the input one. The Soundex algorithm finds keys depending on the pronunciation of words, while the SPEEDCOP system rearranges the letters of a word by placing the first letter, followed by the consonants, and finally the vowels, according to their occurrence sequence in the word and without duplication. [Kuk92] [Mis13]
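A minimal Python sketch of a SPEEDCOP-style key built as described above (first letter, then consonants, then vowels, in occurrence order and without duplication); treating 'y' as a consonant is an assumption of this sketch:

    def speedcop_key(word):
        word = word.lower()
        seen, consonants, vowels = {word[0]}, [], []
        for ch in word[1:]:
            if ch in seen:
                continue                     # keep each letter only once
            seen.add(ch)
            (vowels if ch in "aeiou" else consonants).append(ch)
        return word[0] + "".join(consonants) + "".join(vowels)

    # Similarly spelled words receive the same key and land in the same group:
    print(speedcop_key("necessary"), speedcop_key("neccessary"))  # ncsryea ncsryea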

    2.10.3 Rule Based Techniques

This approach applies a set of rules, derived from common mistake patterns, to the misspelled word in order to transform it into a valid one. After all the applicable rules have been applied, the set of generated words that are valid in the dictionary is suggested as candidates.

    2.10.4 Probabilistic Techniques

Two methods are mainly based on statistics and probability:

1) Transition method: depends on the probability of a given letter being followed by another one. The probability is estimated from n-gram statistics collected over a large corpus.

2) Confusion method: depends on the probability of a given letter being confused with, or mistaken for, another one. Probabilities in this method are source dependent; for example, Optical Character Recognition (OCR) systems vary in their accuracy and in the basics of how they recognize letters, and Speech Recognition (SR) systems usually confuse sounds.
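A minimal Python sketch of the transition method, assuming character bigrams with a boundary marker, add-alpha smoothing, and a tiny illustrative corpus in place of real n-gram statistics:

    from collections import Counter

    def train_transitions(words):
        pairs, singles = Counter(), Counter()
        for w in words:
            w = f"_{w.lower()}_"
            pairs.update(w[i:i + 2] for i in range(len(w) - 1))
            singles.update(w[:-1])
        return pairs, singles

    def word_probability(word, pairs, singles, alpha=1.0):
        # product of smoothed P(next letter | current letter) over the word;
        # 27 below stands for the twenty-six letters plus the boundary marker
        p = 1.0
        w = f"_{word.lower()}_"
        for i in range(len(w) - 1):
            p *= (pairs[w[i:i + 2]] + alpha) / (singles[w[i]] + alpha * 27)
        return p

    pairs, singles = train_transitions(["peace", "piece", "place", "palace"])
    print(word_probability("peace", pairs, singles) >
          word_probability("pqace", pairs, singles))  # True: 'pq' is improbable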


    2.11 Suggestion of Corrections

    Suggesting corrections may be merged within the candidates'

    generation; it is fully dependent on the ou