Intelligent Text Document
Correction System Based on
Similarity Technique
A Thesis
Submitted to the Council of the College of Information Technology,
University of Babylon in Partial Fulfillment of the Requirements
for the Degree of Master of Science in Computer Science.
By
Marwa Kadhim Obeid Al-Rikaby
Supervised by
Prof. Dr. Abbas Mohsen Al-Bakry
2015 A.D. 1436 A.H.
Ministry of Higher Education and
Scientific Research
University of Babylon- College of Information
Technology
Software Department
Supervisor Certification
I certify that this thesis was prepared under my supervision at the Department of
Software / Information Technology / University of Babylon, by Marwa
Kadhim Obeid Al-Rikaby as a partial fulfillment of the requirement for the
degree of Master of Science in Computer Science.
Signature:
Supervisor : Prof. Dr. Abbas Mohsen Al-Bakry
Title : Professor.
Date : / / 2015
The Head of the Department Certification
In view of the available recommendation, we forward this thesis for debate by
the examining committee.
Signature:
Name : Dr. Eman Salih Al-Shamery
Title: Assist. Professor.
Date: / / 2015
To
Master of creatures,
Loved by Allah,
The Prophet Muhammad
(Allah bless him and his family)
Acknowledgements
All praise be to Allah Almighty who enabled me to complete this task
successfully and utmost respect to His last Prophet Mohammad PBUH.
First, my appreciation is due to my advisor Prof. Dr. Abbas Mohsen Al-
Bakry, for his advice and guidance that led to the completion of this thesis.
I would like to thank the staff of the Software Department for the help they
have offered, especially, the head of the Software Department Dr. Eman Salih
Al-Shamery.
Most importantly, I would like to thank my parents, my sisters, my brothers
and my friends for their support.
Abstract
Automatic text correction is one of the human-computer interaction
challenges. It is directly involved in several application areas, such as correcting digitized handwritten text, and indirectly in others, such as correcting users' queries before applying a retrieval process in interactive databases.
The automatic text correction process passes through two major phases: error detection and candidates suggestion. Techniques for both phases are categorized into procedural and statistical. Procedural techniques are based on using rules to govern text acceptability, including Natural Language Processing techniques. Statistical techniques, on the other hand, depend on statistics and probabilities collected from large corpora based on what is commonly used by humans.
In this work, natural language processing techniques are used as bases for analysis and for both spelling and grammar acceptance checking of English texts. A prefix-dependent hash-indexing scheme is used to shorten the time of looking up the dictionary at hand, which contains all English tokens. The dictionary is used as a base for the error detection process.
Candidates generation is based on calculating the similarity of the source token, measured using an improved Levenshtein method, to the dictionary tokens and ranking them accordingly; however, this process is time intensive, therefore tokens are divided into smaller groups according to spelling similarity in a way that keeps random access available. Finally, candidates suggestion involves examining a set of features related to commonly committed mistakes. The system selects the optimal candidate which provides the highest suitability and does not violate grammar rules, in order to generate linguistically accepted text.
Testing the system accuracy showed better results than Microsoft Word and some other systems. The enhanced similarity measure reduced the time complexity to within the bounds of the original Levenshtein method while discovering an additional error type.
Table of Contents
Subject Page
No.
Chapter One : Overview
1.1 Introduction 1
1.2 Problem Statement 3
1.3 Literature Review 5
1.4 Research Objectives 10
1.5 Thesis Outlines 11
Chapter Two: Background and Related Concepts
Part I: Natural Language Processing 12
2.1 Introduction 12
2.2 Natural Language Processing Definition 12
2.3 Natural Language Processing Applications 13
2.3.1 Text Techniques 14
2.3.2 Speech Techniques 15
2.4 Natural Language Processing and Linguistics 16
2.4.1 Linguistics 16
2.4.1.1 Terms of Linguistic Analysis 17
2.4.1.2 Linguistic Units Hierarchy 19
2.4.1.3 Sentence Structure and Constituency 19
2.4.1.4 Language and Grammar 20
2.5 Natural Language Processing Techniques 22
2.5.1 Morphological Analysis 22
2.5.2 Part of Speech Tagging 23
2.5.3 Syntactic Analysis 26
2.5.4 Semantic Analysis 27
2.5.5 Discourse Integration 27
2.5.6 Pragmatic Analysis 28
2.6 Natural Language Processing Challenges 28
2.6.1 Linguistics Units Challenges 28
2.6.1.1 Tokenization 28
2.6.1.2 Segmentation 29
2.6.2 Ambiguity 31
2.6.2.1 Lexical Ambiguity 31
2.6.2.2 Syntactic Ambiguity 31
2.6.2.3 Semantic Ambiguity 32
2.6.2.4 Anaphoric Ambiguity 32
2.6.3 Language Change 32
2.6.3.1 Phonological Change 33
2.6.3.2 Morphological Change 33
2.6.3.3 Syntactic Change 33
2.6.3.4 Lexical Change 33
2.6.3.5 Semantic Change 34
Part II: Text Correction 35
2.7 Introduction 35
2.8 Text Errors 35
2.8.1 Non-words Errors 36
2.8.2 Real-word Errors 36
2.9 Error Detection Techniques 37
2.9.1 Dictionary Looking Up 37
2.9.1.1 Dictionaries Resources 37
2.9.1.2 Dictionaries Structures 38
2.9.2 N-gram Analysis 39
2.10 Error Correction Techniques 40
2.10.1 Minimum Edit Distance Techniques 40
2.10.2 Similarity Key Techniques 43
2.10.3 Rule Based Techniques 43
2.10.4 Probabilistic Techniques 43
2.11 Suggestion of Corrections 44
2.12 The Suggested Approach 44
2.12.1 Finding Candidates Using Minimum Edit Distance 45
2.12.2 Candidates Mining 45
2.12.3 Part-of-Speech Tagging and Parsing 46
Chapter Three : Hashed Dictionary and Looking Up Technique
3.1 Introduction 48
3.2 Hashing 48
3.2.1 Hash Function 49
3.2.2 Formulation 52
3.2.3 Indexing 53
3.3 Looking Up Procedure 56
3.4 Dictionary Structure Properties 58
3.5 Similarity Based Looking-Up 59
3.5.1 Bi-grams Generation 60
3.5.2 Primary Centroids Selection 62
3.5.3 Centroids Referencing 63
3.6 Application of Similarity Based Looking up approach 64
3.7 The Similarity Based Looking up Properties 67
Chapter Four : Error Detection and Candidates Generation
4.1 Introduction 69
4.2 Non-word Error Detection 69
4.3 Real-Words Error Detection 71
4.4 Candidates Generation 72
4.4.1 Candidates Generation for Non-word Errors 72
4.4.1.2 Enhanced Levenshtein Method 74
4.4.1.3 Similarity Measure 78
4.4.1.4 Looking for Candidates 79
4.4.2 Candidates Generation for Real-words Errors 81
Chapter Five : Text Correction and Candidates Suggestion
5.1 Introduction 82
5.2 Correction and Candidates Suggestion Structure 82
5.3 Named-Entity Recognition 85
5.4 Candidates Ranking 86
5.4.1 Edit Distance Based Similarity 87
5.4.2 First and End Symbols Matching 87
5.4.3 Difference in Lengths 88
5.4.4 Transposition Probability 89
5.4.5 Confusion Probability 90
5.4.6 Consecutive Letters (Duplication) 91
5.4.7 Different Symbols Existence 92
5.5 Syntax Analysis 93
5.5.1 Sentence Phrasing 93
5.5.2 Candidates Optimization 95
5.5.3 Grammar Correction 95
5.5.4 Document Correction 97
Chapter Six: Experimental Results, Conclusions, and Future Works
6.1 Experimental Results 98
6.1.1 Tagging and Error Detection Time Reduction 98
6.1.1.1 Successful Looking Up 99
6.1.1.2 Failure Looking Up 100
6.1.2 Candidates Generation and Similarity Search Space Reduction 101
6.1.3 Time Reduction of the Damerau-Levenshtein method 103
6.1.4 Features Effect on Candidates Suggestion 104
6.2 Conclusions 107
6.3 Future Works 108
References 110
Appendix A 117
Appendix B 122
List of Figures
Figure No. Title Page No.
(2.1) NLP dimensions 16
(2.2) Linguistics analysis steps 17
(2.3) Linguistic Units Hierarchy 19
(2.4) Classification of POS tagging models 24
(2.5) An example of lexical change 34
(2.6) Outlines of Spell Correction Algorithm 38
(2.7) Levenshtein Edit Distance Algorithm 41
(2.8) Damerau-Levenshtein Edit Distance Algorithm 42
(2.9) The Suggested System Block Diagram 47
(3.1) Token Hashing Algorithm 54
(3.2) Dictionary Structure and Indexing Scheme 55
(3.3) Algorithm of Looking Up Procedure 57
(3.4) Semi Hash Clustering block diagram 61
(3.5) Similarity Based Hashing algorithm 64
(3.6) Block diagram of candidates generation using SBL 66
(3.7) Similarity Based Looking up algorithm 68
(4.1) Tagging Flow Chart 70
(4.2) The Enhanced Levenshtein Method Algorithm 76
(4.3) Original Levenshtein Example 77
(4.4) Damerau-Levenshtein Example 77
(4.5) Enhanced Levenshtein Example 78
(5.1) Candidates ranking flowchart 84
(5.2) Syntax analysis flowchart 94
(6.1) Tokens distribution in primary packets 99
(6.2) Tokens distribution in secondary packets 99
(6.3) Time complexity Variance of Levenshtein, Damerau-
Levenshtein, and Enhanced Levenshtein (our modification) 103
(6.4) Suggestion Accuracy with a comparison to Microsoft Office
Word on a Sample from the Wikipedia 104
(6.5) Testing the suggested system accuracy and comparing the
results with other systems using the same dataset 105
(6.6) Discarding one feature at a time for optimal candidate
selection 106
(6.7) Using one feature at a time for optimal candidate selection 107
List of Tables
Table No. Title Page No.
(1-1) Summary of Literature Review 9
(3-1) Alphabet Encoding 50
(3-2) Addressing Range 52
(3-3) Predicting errors using Bi-grams analysis 61
(5-1) Transposition Matrix 90
(5-2) Confusion Matrix 91
List of Symbols and Abbreviations
Meaning Abbreviation
Alphabet Σ
Adjectival Phrase A
Absolute Difference abs
Sentence Complement C
Context Free Grammar CFG
Dictionary D
Deoxyribonucleic Acid DNA
Error E
Grammar G
Grammar Error Correction GEC
Hidden Markov Model HMM
Information Retrieval IR
Machine Translation MT
Named Entity NE
Named-Entity Recognition NER
Noun Group NG
Natural Language Generation NLG
Natural Language Processing NLP
Natural Languages NLs
Natural Language Understanding NLU
Noun Phrase NP
big-Oh notation ( =at most) O( )
Optical Character Recognition OCR
Production Rule P
Part Of Speech POS
Prepositional Phrase PP
Query Q
Ranking Value R
Relative Distance R_Dist
Start Symbol S
Statistical Machine Translation SMT
Speech Recognition SR
String1, String2 St1,St2
Variable V
Adverbial Phrase v
Verb Phrase VP
big-Omega notation (= at least) Ω( )
Chapter One
Overview
1.1 Introduction
Natural Language Processing, also known as Computational Linguistics, is the field of computer science that deals with linguistics; it is a form of human-computer interaction where formalization is applied to the elements of human language so that they can be performed by a computer [Ach14]. Natural Language Processing (NLP) is the implementation of systems that are capable of manipulating and processing sentences of natural languages (NLs) [Jac02] like English, Arabic and Chinese, not formal languages like Python, Java and C++, nor descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12]. The NLP task is the design and building of software for analyzing, understanding and generating spoken and/or written NLs. [Man08] [Mis13]
NLP has many applications such as automatic summarization, Machine
Translation (MT), Part-Of-Speech (POS) Tagging, Speech Recognition
(SR), Optical Character Recognition (OCR), Information Retrieval (IR),
Opinion Mining [Nad11], and others [Wol11].
Text Correction is another significant application of NLP. It includes both Spell Checking and Grammar Error Correction (GEC). Spell checking research extends back to the middle of the 20th century with the work of Les Earnest at Stanford University, but the first application was created in 1971 by Ralph Gorin, Earnest's student, for the DEC PDP-10 mainframe with a dictionary of 10,000 English words. [Set14] [Pet80]
Grammar error correction, in spite of its central role in semantic and meaning representation, has been largely ignored by the NLP community. In recent years, an improvement has been noticed in automatic GEC techniques. [Voo05] [Jul13] However, most of these techniques are limited to specific domains such as real-word spell correction [Hwe14], subject-verb disagreement [Han06], verb tense misuse [Gam10], determiners or articles, and improper preposition usage. [Tet10] [Dah11]
Different techniques like edit distance [Wan74], rule-based techniques [Yan83], similarity key techniques [Pol83] [Pol84], n-grams [Zha98], probabilistic techniques [Chu91], neural nets [Hod03] and noisy channel models [Tou02] have been proposed for text correction purposes. Each technique needs some sort of resource: edit distance, rule-based and similarity key techniques require a dictionary (or lexicon); n-gram and probabilistic techniques work with statistical and frequency information; neural nets are trained with training patterns, etc.
Text correction, covering both spelling and grammar, is an extensive process that typically includes three major steps: [Ach14] [Jul13]
The first step is to detect the incorrect words. The most popular way to decide whether a word is misspelled is to look for it in a dictionary, a list of correctly spelled words. This approach can detect non-word errors but not real-word errors [Kuk92] [Mis13], because an unintended word may still match a word in the dictionary. NLs have a large number of words, resulting in a huge dictionary; therefore, the task of looking up every word consumes a long time. In GEC this step is more complicated; it requires applying more analysis at the level of sentences and phrases, using computational linguistics basics, to detect the word that makes the sentence incorrect.
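As an illustration only (not the thesis implementation, which uses the hashed, prefix-indexed dictionary described in Chapter Three), a minimal dictionary-lookup detector for non-word errors might look like the following Python sketch; the word-list file name is a placeholder:

    import re

    def load_dictionary(path="words.txt"):   # hypothetical word-list file, one word per line
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def non_word_errors(text, dictionary):
        # every alphabetic token absent from the dictionary is flagged as a non-word error
        tokens = re.findall(r"[A-Za-z']+", text)
        return [t for t in tokens if t.lower() not in dictionary]

    dictionary = load_dictionary()
    print(non_word_errors("Ths is a simple exmple", dictionary))
    # expected: ['Ths', 'exmple'], assuming both are absent from the word list

As the text notes, such a scan cannot catch real-word errors, and a naive linear or set lookup over a very large word list motivates the indexing scheme developed later.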
Next, a list of candidates or alternatives should be generated for the incorrect word (misspelled or misused). This list is preferred to be short and to contain the words with the highest similarity or suitability; to produce it, a technique is needed to calculate the similarity of the incorrect word to every word in the dictionary. Efficiency and accuracy are major factors in the selection of such a technique. GEC requires broad knowledge of diverse grammatical error categories and extensive linguistic technique to identify alternatives, because a grammatical error may not result from a single word.
Finally, the intended word, or a list of alternatives that contains the intended word, is suggested. This task requires ranking the words according to their similarity to the incorrect word; other considerations may or may not be taken into account depending on the technique in use.
Text mining techniques started to enter the area of text correction;
Clustering [Zam14], Named-Entity Recognition (NER) [Bal00] [Rit11] and
Information Retrieval [Kir11] are examples. Statistics and probabilities have also played a great role, specifically in analyzing common mistakes and n-gram datasets [Ahm09] [Gol96] [Amb08]. Clustering, at both the syllable and phonetic levels, can be used to reduce the looking-up space; NER may help in avoiding interpreting proper nouns as misspellings; statistics have been merged with NLP techniques to provide more precise parsing and POS tagging, usually in context dependent applications. The application of a given technique
differs according to what level of correction is intended; it starts from the
character level [Far14], passes through word, phrase (usually in GEC),
sentence, and ends in the context or document subject level.
1.2 Problem Statement
Although many text checking and correction systems have been produced, each has its own variances in terms of input quality restrictions, techniques used, output accuracy, speed, performance conditions, etc. [Ahm09] [Pet80]. This field of NLP is still an open research area from all sides, because there is no complete algorithm or technique that handles all considerations.
The limited linguistic knowledge, the huge number of lexicons, the
extensive grammar, language ambiguity and change over time, variety of
committed errors and computational requirements are challenges facing the
process of developing a text correction application.
In this work, some of the above mentioned problems are solved using a
set of solutions:
- Integrating two lexicon datasets (WordNet and Ispell).
- Using a brute-force approach to solve some sorts of ambiguity.
- Applying hashing and indexing in looking up the dictionary.
- Reducing the search space in the candidates collecting process by grouping similarly spelled words into semi clusters.
The Levenshtein method [Hal11] is also enhanced to consider Damerau's four types of errors within a time period shorter than the Damerau-Levenshtein method [Hal11]. Named Entity Recognition, letter confusion and transposition, and the candidate length effect are used as features to optimize the candidates' suggestion, in addition to applying rules of Part-Of-Speech tags and sentence constituency for checking sentence grammar correctness, whether or not the sentence is lexically correct.
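For reference, the standard Damerau-Levenshtein distance in its common optimal-string-alignment form, which covers Damerau's four error types (insertion, deletion, substitution and adjacent transposition), can be sketched as follows; this is textbook background, not the enhanced method presented in Chapter Four:

    def damerau_levenshtein(s, t):
        # dynamic-programming edit distance with adjacent transpositions
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                      # i deletions
        for j in range(n + 1):
            d[0][j] = j                      # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
        return d[m][n]

    print(damerau_levenshtein("becuase", "because"))   # 1: a single adjacent transposition

The quadratic table is what makes the measure expensive over a large dictionary, which is why the thesis combines it with search-space reduction.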
The proposed system consists of three components: (1) a spell error detector based on a fast looking-up technique in a dictionary of more than 300,000 tokens, constructed by applying a string prefix dependent hash function and indexing method; the grammar error detector is a brute-force parser. (2) For candidates generation, an enhancement was implemented on the Levenshtein method to consider Damerau's four error types; it is then used to measure similarity according to the minimum edit distance and the effect of the difference in lengths, and the dictionary tokens are grouped into spelling-based clusters to reduce the search space. (3) The candidates suggestion exploits NER features, transposition error and confusion statistics, affix analysis (including first and last letter matching), the length of candidates, and parsing success.
1.3 Literature Review
Asha A. and Bhuma V. R., 2014, introduced a probabilistic approach to string transformation that includes a model consisting of rules and weights for training, and an algorithm that depends on scoring and ranking according to a conditional probability distribution for generating the top k candidates at the character level, where both high and low frequency words can be generated. Spell checking is one of many applications to which the approach was applied; the misspelled strings (words or characters) are transformed, by applying a number of operators, into the k most similar strings in a dictionary (start and end letters are constants). [Ach14]
Mariano F., Zheng Y., and others, 2014, tackled the correction of grammatical errors by pipelining processes that combine results from multiple systems. The components of the approach are: a rule based error corrector that uses rules automatically derived from the Cambridge Learner Corpus, based on N-grams that have been annotated as incorrect; an SMT system that translates incorrectly written English into correct English; NLTK1, which was used to perform segmentation, tokenization, and POS tagging; candidates generation, which produces all the possible combinations of corrections for the sentence, in addition to the sentence itself to consider the "no correction" option; finally, the candidates are ranked using a language model. [Fel14]
__________________________________________________________
1 The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research
and teaching in computational linguistics and natural language processing. NLTK is written in Python and
distributed under the GPL open source license. Over the past year the toolkit has been rewritten,
simplifying many linguistic data structures and taking advantage of recent enhancements in the Python
language.
Anubhav G., 2014, presented a rule-based approach that used two POS taggers, the Stanford parser and Tree Tagger, to correct non-native English speakers' grammatical errors. The detection of errors depends on the outputs of the two taggers: if they differ, then the sentence is not correct. Errors are corrected using the Nodebox English Linguistic library. Error correction includes subject-verb disagreement, verb form, and errors detected by POS tag mismatch. [Gup14]
Stephan R., 2013, proposed a model for spelling correction based on treating words as "documents" and spell correction as a form of document retrieval, in that the model retrieves the best matching correct spelling for a given input. The words are transformed into tiny documents of bits, and Hamming distance is used to predict the closest string of bits from a dictionary holding the correctly spelled words as strings of bits. The model is knowledge free and only contains a list of correct words. [Raa13]
Youssef B., 2012, produced a parallel spell checking algorithm for spelling error detection and correction. The algorithm is based on information from the Yahoo! N-gram dataset 2.0; it is a shared memory model allowing concurrency among threads for both parallel multi-processor and multi-core machines. The three major components (error detector, candidates generator and error corrector) are designed to run in a parallel fashion. The error detector, based on unigrams, detects non-word errors; the candidates generator is based on bi-grams; the error corrector, which is context sensitive, is based on 5-gram information. [Bas12]
Hongsuck S., Jonghoon L., Seokhwan K., Kyusong L., Sechun K., and Gary G. L., 2012, presented a novel method for grammatical error correction by building a meta-classifier. The meta-classifier decides the final output depending on the internal results from several base classifiers; they used multiple corpora tagged with grammatical errors and having different properties in various aspects. The method focused on articles, and correction arises only when a mismatch occurs with the observed articles. [Seo12]
Kirthi J., Neeju N.J., and P.Nithiya, 2011, proposed a semantic
information retrieval system performing automatic spell correction for
user queries before applying the retrieval process. The correcting
procedure depends on matching the misspelled word against a dictionary of correctly spelled words using the Levenshtein algorithm. If an incorrect word is encountered, then the system retrieves the most similar word depending on the Levenshtein measure and the occurrence frequency of the misspelled word. [Kir11]
Farag, Ernesto, and Andreas, 2008, developed a language-independent spell checker. It is based on an enhancement of the N-gram model through creating a ranked list of correction candidates derived from N-gram statistics and lexical resources, then selecting the most promising candidates as correction suggestions. Their algorithm assigns weights to the possible suggestions to detect non-word errors. They relied on a "MultiWordNet" dictionary of about 80,000 entries. [Ahm09]
Mays, Damerau, and Mercer, 2008, designed a noisy-channel model of real-word spelling error correction. They assumed that the observed sentence is a signal passed through a noisy channel, where the channel reflects the typist and the distortion reflects errors committed by the typist. The probability of the sentence's correctness, given by the channel (typist), is a parameter associated with that sentence. The probability of every word in the sentence being the intended one is equivalent to the sentence correctness probability, and each word is associated with a set of spelling-variant words excluding the word itself. Correction can be applied to one word in the sentence by replacing the incorrect one with another from its candidates set (its real-word spelling variations) so that it gives the maximum probability. [Amb08]
Stoyan, Svetla, and others, 2005, described an approach for lexical post-correction of the output of an optical character recognizer (OCR), developed within two research projects. They worked on multiple sides: on the dictionary side, they enriched their large dictionaries with specialty dictionaries; for candidate selection, they used a very fast searching algorithm that depends on Levenshtein automata for efficiently selecting the correction candidates with a bound not exceeding 3; and they ranked candidates depending on a number of features such as frequency and edit distance. [Mih04]
Suzan V., 2002, described a context sensitive spell checking algorithm based on the BESL spell checker lexicons and word trigrams for detecting and correcting real-word errors using probability information. The algorithm splits up the input text into trigrams, and every trigram is looked up in a precompiled database which contains a list of trigrams and their occurrence counts in the corpus used to compile the database. The trigram is correct if it is in the trigram database; otherwise, it is considered an erroneous trigram containing a real-word error. The correction algorithm uses the BESL spell checker to find candidates, but the most frequent ones in the trigram database are suggested to the user. [Ver02]
Table 1.1: Summary of Literature Review

1. [Ach14]
   Methodology: Generating the top K-candidates at the character level for both high and low frequency words.
   Technique: A model consisting of rules and weights, and a conditional probability distribution dependent algorithm.

2. [Fel14]
   Methodology: Grammatical error correction based on generating all possible correct alternatives for the sentence.
   Technique: Combining the results of multiple systems: a rule based error corrector, an SMT English to correct English translator, and NLTK for segmentation, tokenization and tagging.

3. [Gup14]
   Methodology: Correction of non-native English speakers' grammatical errors.
   Technique: Error detection using the Stanford parser and Tree Tagger; correction based on the Nodebox English Linguistic library.

4. [Raa13]
   Methodology: Dictionary based spell correction that treats the misspelled word as a document.
   Technique: Converting the misspelled word into a tiny document of bits and retrieving the most similar documents using Hamming distance.

5. [Bas12]
   Methodology: Context sensitive spell checking using a shared memory model allowing concurrency among threads for parallel execution.
   Technique: Different N-gram levels for error detection, candidates generation, and candidates suggestion depending on the Yahoo! N-Grams dataset 2.0.

6. [Seo12]
   Methodology: Meta-classifier for grammatical error correction focused mainly on articles.
   Technique: Deciding the output depending on the internal results from several base classifiers.

7. [Kir11]
   Methodology: Automatic spell correction for user queries before applying the retrieval process.
   Technique: Using the Levenshtein algorithm for both error detection and correction in a dictionary looking up technique.

8. [Ahm09]
   Methodology: Language independent model for non-word error correction based on N-gram statistics and lexical resources.
   Technique: Ranking a list of correction candidates by assigning weights to the possible suggestions depending on a "MultiWordNet" dictionary of about 80,000 entries.

9. [Amb08]
   Methodology: Noisy channel model for real-word error correction based on probability.
   Technique: The channel represents the typist, the distortion represents the error, and the noise probability is a parameter.

10. [Mih04]
    Methodology: OCR output post-correction.
    Technique: Levenshtein automata for candidates generation and frequency for ranking.

11. [Ver02]
    Methodology: Context sensitive spell checking algorithm based on tri-grams.
    Technique: Splitting texts into word trigrams and matching them against the precompiled BESL spell checker lexicons; suggestion depends on probability information.
1.4 Research Objectives
This research attempts to design and implement a smart text document correction system for English texts. It is based on mining a typed text to detect spelling and grammar errors and giving the optimal suggestion(s) from a set of candidates; its steps are:
1. Analyzing the given text by using Natural Language Processing
techniques, at each step detect the erroneous words.
2. Looking up candidates for the erroneous words and ranking them
according to a given set of features and conditions to be the initial
solutions.
3. Optimizing the initial solutions depending on the extracted
information from the given text and the detected errors.
4. Recovering the input text document with the optimal solutions and associating the best set of candidates with each detected incorrect word.
1.5 Thesis Outlines
The next five chapters are:
1. Chapter Two: "Background and Related Concepts" consists of two parts. The first overviews NLP fundamentals, applications and techniques, whereas the second is about text correction techniques.
2. Chapter Three: "Dictionary Structure and Looking up Technique"
describes the suggested approach of constructing the dictionary of the
system for both perfect matching and similarity looking up.
3. Chapter Four: "Error Detection and Candidates Generation", declares
the suggested technique for indicating incorrect words and the method
of generating candidates.
4. Chapter Five: "Automatic Text Correction and Candidates
Suggestion", describes the techniques of suggestions selection and
optimization.
5. Chapter Six: "Experimental Results, Conclusions, and Future Works" presents the experimental results of applying the techniques described in chapters three, four and five, the conclusions of the system, and future directions.
Chapter Two
Background and Related Concepts
Part I
Natural Language Processing
2.1 Introduction
Natural Language Processing (NLP) began in the late 1940s, focused on machine translation; in 1958, NLP was linked to information retrieval by the Washington International Conference on Scientific Information [Jon01]; primary ideas for developing applications for detecting and correcting text errors started in that period. [Pet80] [Boo58]
Natural Language Processing has attracted great interest from that time until today because it plays an important role in the interaction between humans and computers. It represents the intersection of linguistics and artificial intelligence [Nad11], where machines can be programmed to manipulate natural language.
2.2 Natural Language Processing Definition
"Natural Language Processing (NLP) is the computerized approach
for analyzing text that is based on both a set of theories and a set of
technologies." [Sag13]
NLP describes the function of software or hardware components in a computer system that are capable of analyzing or synthesizing human languages (spoken or written) [Jac02] [Mis13], like English, Arabic and Chinese, not formal languages like Python, Java and C++, nor descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12].
"NLP is a tool that can reside inside almost any text processing
software application" [Wol11]
We can define NLP as a subfield of Artificial Intelligence that encompasses anything needed by a computer to understand and generate natural language. It is based on processing human language for two tasks: the first receives a natural language input (text or speech), applies analysis, reasons about what was meant by that input, and produces output in computer language; this is the task of Natural Language Understanding (NLU). The second task is to generate human sentences according to specific considerations; the input is in computer language but the output is in human language; this is called Natural Language Generation (NLG). [Raj09]
"Natural Language Understanding is associated with the more
ambitious goals of having a computer system actually comprehend natural
language as a human being might". [Jac02]
2.3 Natural Language Processing Applications
Despite its wide usage in computer systems, NLP has entirely disappeared into the background, where it is invisible to the user and adds significant business value. [Wol11]
The major distinction of NLP applications from other data
processing systems is that they use Language Knowledge. Natural
Language Processing applications are mainly divided into two categories
according to the given NL format [Mom12] [Wol11]:
2.3.1 Text Technologies
- Spell and Grammar Checking: systems that deal with indicating lexical and grammar errors and suggesting corrections.
- Text Categorization and Information Filtering: in such applications, NLP represents the documents linguistically and compares each one to the others. In text categorization, the documents are grouped according to their linguistic representation characteristics into several categories. Information filtering singles out, from a collection of documents, the documents that satisfy some criterion.
- Information Retrieval: finds and collects information relevant to a given query. A user expresses the information need by a query, then the system attempts to match the given query to the database documents that satisfy the user's query. Query and documents are transformed into a sort of linguistic structure, and the matching is performed accordingly.
- Summarization: according to an information need or a query from the user, this type of application finds the most relevant part of the document.
- Information Extraction: refers to the automatic extraction of structured information, such as entities, their relationships, and attributes describing them, from unstructured sources. This can integrate structured and unstructured data sources, if both exist, and pose queries spanning the integrated information, giving better results than applying keyword searches alone.
- Question Answering: works with plain speech or text input and applies an information search based on the input, such as IBM Watson, the reigning JEOPARDY! champion, which reads questions, understands their intention, and then looks up its knowledge library to find a match.
- Machine Translation: translates a given text from a specific natural language to another natural language; some applications have the ability to recognize the language of the given text even if the user didn't specify it correctly.
- Data Fusion: combining extracted information from several text files into a database or an ontology.
- Optical Character Recognition: digitizing handwritten and printed texts, i.e., converting characters from images to digital codes.
- Classification: this type of NLP application sorts and organizes information into relevant categories, like e-mail spam filters and the Google News service.
NLP has also entered other applications such as educational essay test-scoring systems, voice-mail phone trees, and even e-mail spam detection software.
2.3.2 Speech Technologies
- Speech Recognition: mostly used on telephone voice response systems serving clients. Its task is processing plain speech. It is also used to convert speech into text.
- Speech Synthesis: converting text into speech. This process requires working at the level of phones and converting from alphabetic symbols into sound signals.
2.4 Natural Language Processing and Linguistics
Natural Language Processing is concerned with three dimensions: language, algorithm and problem, as presented in figure (2.1). On the language dimension, NLP considers linguistics; the algorithm dimension covers NLP techniques and tasks, while the problem dimension depicts the mechanisms applied to solve problems. [Bha12]
2.4.1 Linguistics
Natural Language is a means of communication. It is a system of arbitrary signals such as voice sounds and written symbols. [Ali11] Linguistics is the scientific study of language; it starts from the simple acoustic signals which form sounds and ends with pragmatic understanding to produce the full context meaning.
There are two major levels of linguistic analysis, Speech Recognition (SR) and Natural Language Processing (NLP), as shown in figure (2.2).
Figure (2.1) : NLP dimensions [Bha12]
2.4.1.1 Terms of Linguistic Analysis
A natural language, as a formal language does, has a set of basic components that may vary from one language to another but remain bounded under specific considerations, giving the special characteristics to every language.
From the computational view, a language is a set of strings generated over a finite alphabet and can be characterized by a grammar.

Figure (2.2): Linguistics analysis steps [Cha10]. (The figure traces the analysis levels from the acoustic signal through phones, letters and strings, morphemes, words, and phrases and sentences, up to meaning out of context and meaning in context, covering phonetics, phonology, the lexicon, morphology, syntax, semantics and pragmatics, and marking the SR and NLP levels.)

The definitions of these three abstracted terms depend on the language itself; i.e., strings, alphabet and grammar formulate and characterize the language.
Strings:
In natural language processing, the strings are the morphemes of the language, their combinations (words) and the combinations of their combinations (sentences), but linguistics goes somewhat deeper than this. It starts with phones, the primitive acoustic patterns, which are significant and distinguishable from one natural language to another. Phonology groups phones together to produce phonemes, which are represented by symbols. Morphemes consist of one or more symbols; thus, NLs can be further distinguished.
Alphabet:
When individual symbols, usually thousands, represent words then
the language is "logographic"; if the individual symbols represent syllables,
it is a "syllabic" one. But when they represent sounds, the language is
"alphabetic". Syllabic and alphabetic languages have typically less than 100
symbols, unlike logographic.
English is an alphabetic language system consisting of 26 symbols; these symbols represent phones that are combined into morphemes, which may or may not be combined further to form words.
Grammar:
Grammar is a set of rules specifying the legal structure of the
language; it is a declarative representation about the language syntactic
facts. Usually, grammar is represented by a set of productive rules.
2.4.1.2 Linguistic Units Hierarchy
Language can be divided into pieces; there is a typical structure or
form for every level of analysis. Those pieces can be put into a hierarchical
structure starting from a meaningful sentence at the top level, proceeding through the separation of building units until reaching the primary acoustic sounds. Figure (2.3) presents an example.
Figure (2.3): Linguistic Units Hierarchy (the sentence "The teacher talked to the students" decomposed into phrase, word, morpheme and phoneme levels; the phoneme symbols follow the OXFORD dictionaries' coding)
2.4.1.3 Sentence Structure and Constituency
"It is constantly necessary to refer to units smaller than the sentence
itself units such as those which are commonly referred as CLAUSE,
PHRASE, WORD, and MORPHEME. The relation between one unit and
another unit of which it is a part is CONSTITUENCY." [Qui85]
The task of dividing a sentence into constituents is a complex task
that requires incorporating more than one analysis stage; tokenization, segmentation, parsing (and sometimes stemming) are usually merged together to build the parse tree for a given sentence.
2.4.1.4 Language and Grammar
A language is a 'set' of sentences and a sentence is a 'sequence' of
'symbols' [Gru08]; it can be generated given its context free grammar
G = (V, Σ, S, P). [Cla10]
Commonly, grammars are represented as a set of production rules
which is taken by the parser and compared against the input sentences.
Every matched rule adds something to the sentence's complete structure, which is called a 'parse tree'. [Ric91]
Context free grammar (CFG) is a popular method for generating
formal grammars. It is used extensively to define languages syntax. The
four components of the grammar are defined in CFG as [Sag13]:
- Terminals (Σ): represent the basic elements which form the strings of the language.
- Nonterminals or Syntactic Variables (V): sets of strings that define the language generated by the grammar. Nonterminals represent a key in syntax analysis and translation by imposing a hierarchical structure on the language.
- Set of production rules (P): this set defines the way of combining terminals with nonterminals to produce strings. A production rule consists of a variable on its left side that represents its head.
- Start symbol (S).
The following is an example that describes the structure of an English sentence:
V = {S, NP, N, VP, V, Art}
Σ = {boy, icecream, dog, bite, like, ate, the, a}
P = {S → NP VP,
NP → N,
NP → Art N,
VP → V NP,
N → boy | icecream | dog,
V → ate | like | bite,
Art → the | a}
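For illustration, this toy grammar can be written and parsed with the NLTK library (an off-the-shelf tool chosen here for the example; the thesis itself uses its own brute-force parser):

    import nltk

    # the production rules above, rewritten in NLTK's CFG notation
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | Art N
    VP -> V NP
    N -> 'boy' | 'icecream' | 'dog'
    V -> 'ate' | 'like' | 'bite'
    Art -> 'the' | 'a'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the boy ate the icecream".split()):
        print(tree)
    # (S (NP (Art the) (N boy)) (VP (V ate) (NP (Art the) (N icecream))))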
The grammar specifies two things about the language: [Ric91]
- Its weak generative capacity: the limited set of sentences which can be completely matched by a series of grammar rules.
- Its strong generative capacity: the grammatical structure(s) of each sentence in the language.
Generally, there is an infinite number of sentences that can be structured with each grammar. The strength and importance of grammars lies in their ability to supply structure to an infinite number of sentences, because they succinctly summarize the structures of an infinite number of objects of a certain class. [Gru08]
The grammar is said to be generative if it has a fixed-size set of production rules which, if followed, can generate every sentence in the language. [Gru08]
2.5 Natural Language Processing Techniques
2.5.1 Morphological Analysis
Morphology is the study of how words are constructed from
morphemes which represent the minimal meaning-bearing language
primitive units.[Raj09] [Jur00]
There are two broad classes of morphemes: stems and affixes; the
distinction between the two classes is language dependent in that it varies
from one language to another. The stem, usually, refers to the main part of
the word and the affixes can be added to the words to give it additional
meaning. [Jur00]
Furthermore, affixes can be divided into four categories according to the position where they are added. Prefixes, suffixes, circumfixes and infixes generally refer to the different types of affixes, but it is not necessary for a language to have all the types. English accepts both prefixes, which precede stems, and suffixes, which follow stems, while there is no good example of a circumfix (preceding and following a stem) in English, and infixing (inserting inside the stem) is not allowed (unlike German and Philippine languages, respectively). [Jur00]
Morphology is concerned with recognizing the modification of base
words to form other words with different syntactic categories but similar
meanings.
Generally, three forms of word modifications are found [Jur00]:
Inflection: syntactic rules change the textual representation of the
words; such as adding the suffix 's' to convert nouns into plurals,
adding 'er' and 'est' convert regular adjectives into comparative and
Chapter Two Part I: Natural Language Processing _________________________________________________________________________
23
superlative forms, consecutively. This type of modification usually
results a word from the same word class of the stem word.
Derivation: new words are produced by adding morphemes, usually
more complex and harder in meaning than inflectional morphology.
It often occurs in a regular manner and results words differ in their
word class from the stem word. Like adding the suffix 'ness' to
'happy' to produce 'happiness'.
Compounding: this type modifies stem words by another stem words
by grouping them. Like grouping 'head' with 'ache' to produce
'headache'. In English, this type is infrequent.
Morphological processing, also known as stemming, depends heavily on
the analyzed language. The output is the set of morphemes that are
combined to form words. Morphemes can be stem words, affixes, and
punctuations.
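As a small, hedged illustration (the thesis does not prescribe a particular stemmer), NLTK's Porter stemmer shows how such inflectional and derivational suffixes are stripped during morphological processing:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["talked", "students", "happiness"]:
        # e.g. talked -> talk, students -> student, happiness -> happi
        print(word, "->", stemmer.stem(word))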
2.5.2 Part Of Speech Tagging
Part of Speech (POS) tagging is the process of assigning the proper lexical information or POS tag (also known as word class, lexical tag, or morphological class), which is encoded as a symbol, to every word (or token) in a sentence. [Sco99] [Has06b]
In English, POS tags are classified into four basic classes of words: [Qui85]
1. Closed classes: include prepositions, pronouns, determiners,
conjunctions, modal verbs and primary verbs.
2. Open classes: include nouns, adjectives, adverbs, and full verbs.
3. Numerals: include numbers and orders.
4. Interjections: include small set of words like oh, ah, ugh, phew.
Usually, a POS tag indicates one or more of the previous pieces of information, and it sometimes holds other features like the tense of the verb or the number (plural or singular). POS tagging may generate tagged corpora or serve as a preprocessing step for subsequent NLP processes. [Sco99]
The performance of most tagging systems is typically limited because they only use the local lexical information available in the sentence, as opposed to syntax analyzing systems, which exploit both lexical and structural information. [Sco99] More research was done and several models and methods have been proposed to enhance taggers' performance; they fall mainly into supervised and unsupervised methods, where the main difference between the two categories is the set of training corpora, which is pre-tagged in supervised methods, unlike unsupervised methods, which need advanced computational methods for obtaining such corpora. [Has06a] [Has06b] Figure (2.4) presents the main categories and shows some examples.
In both categories, the following are the most popular:
Figure (2.4) : Classification of POS tagging models [Has06a]
- Statistical (stochastic, or probabilistic) methods: taggers which use these methods are first trained on a correctly tagged set of sentences, which allows the tagger to disambiguate words by extracting implicit rules or picking the most probable tag based on the words surrounding the given word in the sentence. Examples of these methods are Maximum-Entropy models, Hidden Markov Models (HMM), and Memory Based models.
- Rule based methods: a sequence of rules, a set of hand written rules, is applied to detect the best tag set for the sentence regardless of any probability maximization. The set of rules needs to be written properly and checked by human experts. Examples: the path-voting constraint models and decision tree models.
- Transformational approach: combines both statistical methods and rule based methods to first find the most probable set of available tags and then applies a set of rules to select the best.
- Neural Networks: with a linear separator or a full neural network, have been used for tagging processes.
The methods described above, as in any other research area, have their advantages and disadvantages, but there is a major difficulty facing all of them: the tagging of unknown words (words that have never been seen before in the training corpora). While rule-based approaches depend on a special set of rules to handle such situations, stochastic and neural-net methods lack this feature and use other ways such as suffix analysis and n-grams by applying morphological analysis; some methods use a default set of tags to disambiguate unknown words. [Has06a]
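A brief illustration of statistical tagging using NLTK's default English tagger (an off-the-shelf example, not the tagger developed in this thesis); the required NLTK resources must be downloaded first:

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("The teacher talked to the students")
    print(nltk.pos_tag(tokens))
    # expected: [('The', 'DT'), ('teacher', 'NN'), ('talked', 'VBD'),
    #            ('to', 'TO'), ('the', 'DT'), ('students', 'NNS')]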
2.5.3 Syntactic Analysis
"Syntax is the study of the relationships between linguistics forms,
how they are arranged in sequence, and which sequences are well-
formed". [Yul00]
Syntactic analysis, also referred by "Parsing", is the process of
converting the sentence from its flat format which is represented as a
sequence of words into a structure that defines its units and the relations
between these units. [Raj09]
Hence, the goal of this technique is to transform natural language
into an internal system representation. The format of this representation
may be dependency graphs, frames, trees or some other structural
representations. Syntactic parsing attempts only for converting sentences
into either dependency links representing the utterance syntactic structure
or a tree structure and the output of this process is called "parse tree" or
simply a "parse". [Dzi04]The parse tree of the sentence holds its meaning
in the level of the smallest parts ("words" in terms of language scientist,
"tokens" in terms of computer scientists). [Gru08]
Syntactic analysis makes use of both the results of morphological
analysis and Part-Of-Speech tagging to build the structural description of
the sentence by applying the grammar rules of the language under
consideration; if a sentence violates the rules then it is rejected and
assigned as incorrect. [Raj09]
The two main components of every syntax analyzer are:
- Grammar: the grammar provides the analyzer with the set of production rules that will lead it to construct the structure of the sentences and specifies the correctness of every given sentence. Good grammars make a careful distinction between the sentence/word level, which they often call syntax or syntaxis, and the word/letter level, which they call morphology. [Gru08]
- Parser: the parser reconstructs the production tree (or trees) by applying the grammar to indicate how the given sentence (if correctly constructed) was produced from that grammar. Parsing is the process of structuring a linear representation in accordance with a given grammar.
Today, most parsers combine context free grammars with probability
models to determine the most likely syntactic structure out of many others
that are accepted as parse trees for an utterance. [Dzi04]
2.5.4 Semantic Analysis
"Semantics is the study of the relationships between linguistic
forms and entities in the words; that is, how words literally connect to
things." [Yul00]
This technique and those following it are basically depended upon by language understanding. Semantic analysis is the process of assigning meanings to the syntactic structures of sentences regardless of their context. [Yul00] [Raj09]
2.5.5 Discourse Integration
Discourse analysis is concerned with studying the effect of sentences on each other. It shows how a given sentence is affected by the one preceding it and how it affects the sentence following it. Discourse Integration is relevant to understanding texts and paragraphs rather than simple sentences; so, discourse knowledge is important in the interpretation
of temporal aspects (like pronouns) in the conveyed information. [Ric91]
[Raj09]
2.5.6 Pragmatic Analysis
This step interprets the structure that represents what is said in order to determine what was actually meant. Context is a fundamental resource
for processing here. [Ric91]
2.6 Natural Language Processing Challenges
The challenges of natural language processing are too many to be summarized in a limited list; with every processing step, from the start point to the outputting of results, there is a set of problems that natural language processors vary in their ability to handle. However, the application where NLP is used is usually concerned with a specific task rather than considering all processing steps with all their details; this is an advantage for the NLP community that helps to outline the challenges and problems according to the task under consideration.
For our research area, we are precisely concerned with the set of problems that directly affect the task of text correction; the next subsections describe some of them:
2.6.1 Linguistic Units Challenges:
The task of text correction starts from the level of characters and goes up to paragraphs and full texts; at every level there is a set of difficulties that the handling analyzer faces:
2.6.1.1 Tokenization
In this process, the lexical analyzer, usually called "Tokenizer",
divides the text into smaller units and the output of this step is a series of
Chapter Two Part I: Natural Language Processing _________________________________________________________________________
29
morphemes, words, expressions and punctuations (called tokens). It
involves locating tokens boundaries (where one token ends and another
begins).
Issues that arise in tokenization and should be addressed are [Nad11]:
- Problems depending on language type: languages include, in addition to their symbols, a set of orthographic conventions which are used in writing to indicate the boundaries of linguistic units. English employs whitespace to separate words, but this isn't sufficient to tokenize a text in a complete and unambiguous manner, because the same character may be used in different ways (as is the case with punctuation), there are words with multiple parts (such as words divided with a hyphen at the end of a line and some cases of prefix addition), and many expressions consist of more than one word.
- Encoding problems: syllabic and alphabetic writing systems are usually encoded using a single byte, but languages with larger character sets require more than two bytes. The problem arises when the same set of encodings represents different character sets, whereas tokenizers are targeted at a specific encoding for a specific language.
- Other problems, such as the dependency on the application requirements, which indicate what constituent is defined as a token; in computational linguistics the definition should precisely indicate what the next processing step requires. The tokenizer should also have the ability to recognize irregularities in texts such as misspellings, erratic spacing and punctuation, etc.
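A naive regular-expression tokenizer (our sketch, not the system's tokenizer) makes these boundary issues concrete: punctuation is split off uniformly, so abbreviation periods are separated even when they arguably belong to the token:

    import re

    def tokenize(text):
        # words (possibly hyphenated or with apostrophes), numbers, or single punctuation marks
        return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+|[^\w\s]", text)

    print(tokenize("Dr. Smith re-checked the text, didn't he?"))
    # ['Dr', '.', 'Smith', 're-checked', 'the', 'text', ',', "didn't", 'he', '?']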
2.6.1.2 Segmentation
Segmenting text means dividing it into small meaningful pieces, typically referred to as "sentences"; a sentence consists of one or more tokens and carries a meaning which may not be completely clear. This task requires full knowledge of the scope of punctuation marks, since they are the major factor in denoting the starts and ends of sentences.
Segmentation becomes more complicated as punctuation usages multiply. Some punctuation marks can be part of a token rather than a stopping mark, as is the case with periods (.) when used with abbreviations.
However, there is a set of factors that can help in making the
segmentation process more accurate [Nad11]:
Case distinction: English sentences normally start with a capital
letter (but proper nouns also do).
POS tags: the tags surrounding a punctuation mark can assist this
process, but multi-tag situations complicate it, such as the use
of -ing verbs as nouns.
The length of the word (useful for abbreviation
disambiguation; notice that a period may mark the end of a sentence
and an abbreviation at the same time).
Morphological information: this task requires finding the stem
of a word by suffix removal.
Tokenization and segmentation are usually not kept separate; they are
commonly merged together to solve most of the above problems,
specifically the segmentation problems. A simple segmentation sketch
that uses some of the above factors follows.
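As a rough illustration only (not the segmentation method used in this work),
the sketch below splits on sentence-final punctuation while consulting a
small, assumed abbreviation list and the case of the following word:

# A tiny, assumed abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e.", "feb."}

def segment(text):
    """Split text into sentences using punctuation, capitalization and
    a known-abbreviation list as weak evidence of sentence boundaries."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token.endswith((".", "!", "?")):
            is_abbrev = token.lower() in ABBREVIATIONS
            next_is_capital = (i + 1 < len(tokens)
                               and tokens[i + 1][0].isupper())
            # End the sentence only if the token is not a known abbreviation
            # and the next word (if any) starts with a capital letter.
            if not is_abbrev and (next_is_capital or i + 1 == len(tokens)):
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

if __name__ == "__main__":
    print(segment("Prof. Smith arrived. He was late. The meeting had started."))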
A sentence is described as an indeterminate unit because of the difficulty
in deciding where one ends and another starts, while grammar is
indeterminate from the standpoint of deciding 'which sentence is
grammatically correct?', because this question permits divided answers;
discourse segmentation difficulty is not the only reason, but so are
grammatical acceptability, meaning, stylistic goodness or badness, lexical
acceptability, context acceptability, etc. [Qui85]
2.6.2 Ambiguity
An input is ambiguous if there is more than one alternative linguistic
structure for it. [Jur00]
There are two major types of sentence ambiguity: genuine ambiguity and
computer ambiguity. In the first, the sentence really has two different
meanings to an intelligent hearer; in the second, the sentence has one
meaning but, for the computer, it has more than one, and this type, unlike
the first, is the real problem facing NLP applications. [Not]
Ambiguity as an NLP problem is found in every processing step [Not]
[Bha12]:
2.6.2.1 Lexical Ambiguity
Lexical ambiguity is the possibility for a word to have more than one
meaning or more than one POS tag.
Obviously, meaning ambiguity leads to semantic ambiguity, and tag
ambiguity leads to syntactic ambiguity because it can produce more than one
parse tree. Frequency information is an available solution for this problem.
2.6.2.2 Syntactic Ambiguity
The sentence has more than one syntactic structure; in particular, the
common sources of ambiguity in English are:
Phrase attachment: how a certain phrase or clause in the sentence
can be attached to another when there is more than one possibility.
Crossing is not allowed in parse trees; therefore, a parser generates a
parse tree for each accepted state.
Conjunction: sometimes the parser is befuddled in selecting which phrase
a conjunct should be connected to.
Noun group structure: the rule
NG -> NG NG
allows English to string long series of nouns together.
Some of these problems can be resolved by applying syntactic constraints.
2.6.2.3 Semantic Ambiguity
Even when a sentence is lexically and syntactically unambiguous, there is
sometimes more than one interpretation for it. This is because a
phrase or a word may refer to more than one meaning.
"Selection restrictions" or "semantic constraints" are a way to
disambiguate such sentences: two concepts are combined into one reading only
if both of the concepts, or one of them, has specific features. Frequency in
context can also help in deciding the meaning of a word.
2.6.2.4 Anaphoric Ambiguity
This is the possibility for a word or a phrase to refer to something
previously mentioned, when there is more than one possible referent.
This type can be resolved by parallel structures or recency rules.
2.6.3 Language Change
"All living languages change with time, it is fortunate that they do so
rather slowly compare to the human life". Language change is represented
by the change of grammars of people who speak the language and it has
been shown that English was changed in its lexicon, phonological,
Chapter Two Part I: Natural Language Processing _________________________________________________________________________
33
morphological, syntax, and semantic components of the grammar over the
past 1,500 years. [Fro07]
2.6.3.1 Phonological Change
Regular sound correspondences show how the phonological system
changes. The phonological system is governed, like any other
linguistic system, by a set of rules, and this set of phonemes and
phonological rules is subject to change by modification, deletion and
addition of new rules. A change in phonological rules can affect the
lexicon in that some English word formations depend on sounds; for example,
vowel sounds differentiate the nouns house and bath
from the verbs house and bathe.
2.6.3.2 Morphological Change
Morphological rules, like the phonological ones, are subject to addition,
loss and change. Mostly, the usage of suffixes is the active area of change,
where the way they are added to the ends of stems affects the resulting
words and therefore changes the lexicon.
2.6.3.3 Syntactic Change
Syntactic changes are influenced by morphological changes, which are in
turn influenced by phonological changes. This type of change includes all
kinds of grammar modifications that are mainly based on the reordering of
words inside the sentence.
2.6.3.4 Lexical Change
Change of lexical categories is the most common in this type of
change. An example of this situation is the usage of nouns as verbs, verbs
as nouns, and adjectives as nouns. Lexical change also includes the
addition of new words, the borrowing of loan words from another language,
and the loss of existing words.
Figure (2.5) : An example of lexical change (Darby Conley, Get Fuzzy, UFS, Inc., 24 Feb. 2012)
2.6.3.5 Semantic Change
As the category of a word can change, its semantic
representation or meaning can change, too. Three types of change are
possible for a word:
Broadening: the meaning of a word expands to cover everything
it has been used for and more.
Narrowing: the reverse of broadening; the word's meaning is
reduced from a more general meaning to a specific one.
Shifting: the word's reference shifts to another meaning
somewhat different from the original one.
Part II
Text Correction
2.7 Introduction
Text correction is the process of indicating incorrect words in an input
text, finding candidates (or alternatives) and suggesting the candidates as
corrections for the incorrect word. The term incorrect refers to two different
types of erroneous words: misspelled and misused. Mainly, the process
is divided into two distinct phases: an error detection phase, which indicates
the incorrect words, and an error correction phase, which combines both
generating and suggesting candidates.
Devising techniques and algorithms for correcting texts in an
automatic manner is a prime open research challenge that started in the
early 1960s and continues until now, because the existing correction
techniques are limited in their accuracy and application scope [Kuk92].
Usually, a correction application is concerned with a specific type of errors
because it is a complex task to computationally predict an intended word
written by a human.
2.8 Text Errors
A word can be mistaken in two ways. The first is by incorrectly
spelling the word, due to a lack of information about its spelling or by
mistyping symbol(s) within the word; this type of error is
known as a non-word error, where the word cannot be found in the language
lexicon.
The second is by using a correctly spelled word in a wrong position in the
sentence or in an unsuitable context. These errors are known as real-word
errors, where the incorrect word is accepted in the language lexicon.
[Gol96][Amb08]
Non-word errors are easier to detect, unlike real-word errors; the
latter need more information about the syntax and the semantic nature of the
language. Accordingly, correction techniques are divided into isolated-word
error detection, which is concerned with non-word errors, and
context-sensitive error correction, which deals with real-word errors. [Gol96]
2.8.1 Non-word errors
These errors include words that are not found in the lexicon; a
misspelled word contains one or more of the following errors:
Substitution: one or more symbols are changed (e.g., "wird" for "word").
Deletion: one or more symbols are missing from the intended word (e.g., "wrd").
Insertion: symbol(s) are added at the front, the end, or any index in the word (e.g., "woird").
Transposition: two adjacent symbols are swapped (e.g., "wrod").
The four errors are known as the Damerau edit operations.
2.8.2 Real-word errors
These errors occur when an intended word is mistaken for another
one that is lexically accepted. Real-word errors can result from
phonetic confusion, like using the word "piece" instead of "peace", which
usually leads to semantically unaccepted sentences after applying non-word
correction, or even from misspelling the intended word and producing
another lexically accepted word. [Amb08]
Sometimes the confusion results in syntactically unaccepted
sentences, as in writing the sentence "John visit his uncle" instead of "John
visits his uncle".
Correcting real-word errors is context sensitive in that it needs to
check the surrounding words and sentences before suggesting candidates.
2.9 Error Detection Techniques
Indicating whether a word is correct or not depends on the type of
correction procedure; non-word error detection usually checks the
acceptance of a word in the language dictionary (the lexicon) and marks any
unmatched word as incorrect, while real-word error detection is a more
complex task that requires analysing larger parts of the text, typically
paragraphs and the full text [Kuk92]. In this work, we mainly focus on
non-word error detection techniques.
Dustin defined a spelling error as an error E in a given query word Q which is
not an entry in the underhand dictionary D [Bos05]. He outlined an
algorithm for spelling correction as shown in figure (2.6).
Spell error detection techniques can be classified into two major types:
2.9.1 Dictionary Looking Up
All the words of a given text are matched against every word in a
pre created dictionary or a list of all acceptable words in the language under-
consideration (or most of them since some languages have a huge number of
words and collecting them totally is semi impossible task). The word is
incorrect if and only if there is no match found. This technique is robust but
suffers from the long time required for checking; as the dictionary size
becomes larger, looking up time becomes longer. [Kuk92] [Mis13]
2.9.1.1 Dictionaries Resources
Many systems deal with collecting and updating language
lexical dictionaries. An example of these systems is the WordNet online
application; it is a large database of English lexicons. Lexicons (nouns,
verbs, adjectives, articles, etc.) are interlinked by lexical and
conceptual-semantic relations. The structure of WordNet is a network of
meaningfully related words and concepts, and this structure has made it a
good tool for NLP and Computational Linguistics.
Another example is the ISPELL text corrector, an online spell
checker that provides interfaces for many western languages. ISPELL is
the latest version of R. Gorin's spell checker, which was developed for Unix.
Suggesting a spelling correction is based on a single Levenshtein edit
distance and depends on looking up every token of the input text in a huge
lexical dictionary. [ISP14]
2.9.1.2 Dictionaries Structures
The standard looking-up technique is to match every token in the text
against every token in the dictionary, but this process requires a long time
because natural-language dictionaries are usually of huge sizes and string
matching needs a longer time than other data types do. A solution to this
challenge is to reduce the search space in a way that keeps similar tokens
grouped together.
Figure (2.6) : Outlines of Spell Correction Algorithm [Bos05]
Algorithm: Spell_correction
Input: word w
Output: suggestion(s) a set of alternatives for w
Begin
If (is_mistake(w))
Begin
Candidates=get_candidates( w)
Suggestions=filter_candidates( candidates)
Return suggestions
End
Else
Return is_correct
End.
Grouping according to spelling or phonetics [Mis13], and using hash tables, are
two fundamental ways to minimize the search space.
Hashing techniques apply a hash function to generate a numeric key
from strings. The numeric keys are references to packets of tokens that
generate the same key index; hash functions differ in their ability to
distribute tokens and in how much they minimize the search space. A perfect
hash function generates no collisions (hashing two different tokens to the
same key index), and a uniform hash function distributes tokens among
packets uniformly. The optimal hash function is a uniform perfect hash
function, which hashes exactly one token to every packet; such a situation is
impossible with dictionaries due to the variance of tokens. [Nie09]
Spelling- and phonetics-dependent groups use a limited set of packets and
generate keys according to spelling or pronunciation; they are another style of
hashing, and sometimes of clustering. SPEEDCOP and Soundex are
examples. [Mis13] [Kuk92]
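As a rough, hypothetical sketch of such grouping (not the exact prefix indexing
scheme used in this work), the fragment below hashes each token to a packet
keyed by its first letters, so that a lookup only scans tokens sharing the
same prefix; all names and the sample dictionary are illustrative assumptions:

from collections import defaultdict

def prefix_key(token, length=2):
    """Hash a token to a packet key derived from its first letters."""
    return token[:length].lower()

def build_index(dictionary_tokens, length=2):
    """Group dictionary tokens into packets addressed by their prefix key."""
    index = defaultdict(set)
    for token in dictionary_tokens:
        index[prefix_key(token, length)].add(token.lower())
    return index

def is_known(word, index, length=2):
    """Check a word only against the packet sharing its prefix."""
    return word.lower() in index[prefix_key(word, length)]

if __name__ == "__main__":
    index = build_index(["peace", "piece", "people", "word", "world"])
    print(is_known("peace", index))   # True
    print(is_known("paece", index))   # False: flagged as a non-word error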
2.9.2 N-gram Analysis
N-grams are defined as subsequences of n letters or words, where n
is variable; it often takes the value one to produce unigrams (or monograms),
two to produce bigrams (sometimes called "digrams"), three to produce
trigrams, and only rarely takes larger values. This technique detects errors by
examining each n-gram in the given string and looking it up in a
precompiled n-gram statistics table. The decision depends on the existence
of such an n-gram or the frequency of its occurrence: if the n-gram is not found,
or is highly infrequent, then the words or strings which contain it are considered incorrect.
[Kuk92] [Mis13]
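A minimal sketch of character-bigram checking, assuming the statistics table
is simply the set of bigrams observed in a small word list (a real table would
be compiled from a large corpus and would also record frequencies):

def char_ngrams(word, n=2):
    """Return the list of character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_bigram_table(corpus_words):
    """Precompile the set of bigrams that occur in the reference words."""
    table = set()
    for word in corpus_words:
        table.update(char_ngrams(word.lower()))
    return table

def looks_incorrect(word, table):
    """Flag the word if any of its bigrams is absent from the table."""
    return any(bigram not in table for bigram in char_ngrams(word.lower()))

if __name__ == "__main__":
    table = build_bigram_table(["peace", "piece", "people", "word", "correct"])
    print(looks_incorrect("peace", table))   # False
    print(looks_incorrect("pzace", table))   # True: "pz" never occurs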
2.10 Error Correction Techniques
Many techniques have been proposed to solve the problem of
generating candidates for a detected misspelled word; they vary in the
required resources, application scope, time and space complexity, and
accuracy. The most common are [Kuk92] [Mis13]:
2.10.1 Minimum Edit Distance Techniques
This technique is based on counting the minimum number of primal
operations required to convert the source string into the target one. Some
researchers take the primal operations to be insertion, deletion, and
substitution of one letter by another; others add the transposition of
two adjacent letters as a fourth primal operation. Examples are the Levenshtein
algorithm, which counts a distance of one for every primal operation; the Hamming
algorithm, which works like Levenshtein but is limited to strings of equal
lengths; and the Longest Common Substring, which finds the mutual substring between
two words.
Levenshtein, shown in figure (2.7) [Hal11], is preferred because it has
no limitation on the types of symbols or on their lengths. It can be executed
in a time complexity of O(M.N), where M and N are the lengths of the two
input strings.
The algorithm can detect three types of errors (substitution, deletion,
and insertion). It does not count the transposition of two adjacent symbols
as one edit operation; instead, it counts such an error as two consecutive
substitution operations, giving an edit distance of 2.
One of the well-known modifications of the original Levenshtein
method is due to Fred Damerau, whose research found that about 80% to 90%
of errors are covered by these four error types altogether; the resulting
measure is known as the Damerau-Levenshtein Distance. [Dam64]
The modified method requires a longer execution time than the original:
in every checking round, the method applies an additional comparison to check
whether a transposition took place in the string, then applies another
comparison to select the minimum value between the previous distance and
the distance with the occurrence of a transposition operation. This additional step
Figure (2.7) : Levenshtein Edit Distance Algorithm [Hal11]
1. Algorithm: Levenshtein Edit Distance
2. Input: String1, String2
3. Output: Edit Operations Number
4. Step1: Declaration
5. distance(length of String1,Length of String2)=0, min1=0, min2=0, min3=0,
cost=0
6. Step2: Calculate Distance
7. if String1 is NULL return Length of String2
8. if String2 is NULL return Length of String1
9. for each symbol x in String1 do
10. for each symbol y in String2 do
11. begin
12. if x = y
13. cost = 0
14. else
15. cost = 1
16. r=index of x, c=index of y
17. min1 = (distance(r - 1, c) + 1) // deletion
18. min2 = (distance(r, c - 1) + 1) //insertion
19. min3 = (distance(r - 1,c - 1) + cost) //substitution
20. distance( r , c )=minimum(min1 ,min2 ,min3)
21. end
22. Step3: return the value of the last cell in the distance matrix
23. return distance(Length of String1,Length of String2)
24. End.
multiplied the time complexity by a factor of 2, resulting in O(2*M.N). Hence, in
this work, the original Levenshtein method (figure (2.7)) is modified to
consider Damerau's four error types within a time complexity shorter
than the time consumed by the Damerau-Levenshtein algorithm and close to that
of the original method. Figure (2.8) shows Damerau's modification of the
Levenshtein method.
1. Algorithm: Damerau-Levenshtein Distance
2. Input: String1, String2
3. Output: Damerau Edit Operations Number
4. Step1: Declaration
5. distance(length of String1,Length of String2)=0, min1=0, min2=0,
min3=0, cost=0
6. Step2: Calculate Distance
7. if String1 is NULL return Length of String2
8. if String2 is NULL return Length of String1
9. for each symbol x in String1 do
10. for each symbol y in String2 do
11. begin
12. if x = y
13. cost = 0
14. else
15. cost = 1
16. r=index of x, c=index of y
17. min1 = (distance(r - 1, c) + 1) // deletion
18. min2 = (distance(r, c - 1) + 1) //insertion
19. min3 = (distance(r - 1,c - 1) + cost) //substitution
20. distance( r , c )=minimum(min1 ,min2 ,min3)
21. if not(String1 starts with x) and not (String2 starts with y) then
22. if (the symbol preceding x = y) and (the symbol preceding y = x) then
23. distance(r,c) = minimum(distance(r,c), distance(r-2,c-2)+cost)
24. end
25. Step3: return the value of the last cell in the distance matrix
26. return distance(Length of String1,Length of String2)
27. End.
Figure (2.8) : Damerau-Levenshtein Edit Distance Algorithm [Dam64]
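For illustration, a runnable Python rendering of the algorithm in figure (2.8)
might look as follows; it is a sketch only, and unlike the figure it makes the
initialization of the first row and column of the distance matrix explicit,
which the pseudocode leaves implicit:

def damerau_levenshtein(s1, s2):
    """Edit distance counting substitution, deletion, insertion and the
    transposition of two adjacent symbols as single operations."""
    m, n = len(s1), len(s2)
    if m == 0:
        return n
    if n == 0:
        return m
    # distance[r][c] = distance between the first r symbols of s1
    # and the first c symbols of s2.
    distance = [[0] * (n + 1) for _ in range(m + 1)]
    for r in range(m + 1):
        distance[r][0] = r          # deletions only
    for c in range(n + 1):
        distance[0][c] = c          # insertions only
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            cost = 0 if s1[r - 1] == s2[c - 1] else 1
            distance[r][c] = min(distance[r - 1][c] + 1,        # deletion
                                 distance[r][c - 1] + 1,        # insertion
                                 distance[r - 1][c - 1] + cost) # substitution
            if (r > 1 and c > 1 and s1[r - 1] == s2[c - 2]
                    and s1[r - 2] == s2[c - 1]):
                # adjacent transposition counted as a single operation
                distance[r][c] = min(distance[r][c],
                                     distance[r - 2][c - 2] + cost)
    return distance[m][n]

if __name__ == "__main__":
    print(damerau_levenshtein("word", "wrod"))    # 1 (one transposition)
    print(damerau_levenshtein("peace", "piece"))  # 2 (two substitutions)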
2.10.2 Similarity Key Techniques
As its name indicates, this technique finds a key that groups
similarly spelled words together. The similarity key is computed for the
misspelled word and mapped to a pointer that refers to the group of words that
are similar in spelling to the input one. The Soundex algorithm derives keys
from the pronunciation of words, while the SPEEDCOP system
rearranges the letters of a word by placing the first letter, followed by the
consonants, and finally the vowels, according to their order of occurrence in the
word and without duplication. [Kuk92] [Mis13] A sketch of such a key function
is shown below.
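The following is a minimal sketch of a SPEEDCOP-style similarity key, assuming
only the key layout described above (first letter, then unique consonants, then
unique vowels, each in order of occurrence); it is illustrative and is not the
exact key computation used in SPEEDCOP or in this work:

VOWELS = set("aeiou")

def similarity_key(word):
    """First letter, then unique consonants, then unique vowels,
    each kept in their order of occurrence within the word."""
    word = word.lower()
    first, rest = word[0], word[1:]
    consonants, vowels, seen = [], [], {first}
    for ch in rest:
        if ch in seen or not ch.isalpha():
            continue
        seen.add(ch)
        (vowels if ch in VOWELS else consonants).append(ch)
    return first + "".join(consonants) + "".join(vowels)

if __name__ == "__main__":
    # Similarly spelled words collapse to the same key ("ncsryea" for both).
    print(similarity_key("necessary"))
    print(similarity_key("neccessary"))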
2.10.3 Rule Based Techniques
This approach applies a set of rules, based on common mistake patterns, to the
misspelled word in order to transform it into a valid one. After
applying all the applicable rules, the set of generated words that are valid in
the dictionary is suggested as candidates. A small sketch of this idea follows.
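A toy sketch under assumed rules; the rule list and the dictionary here are
invented for illustration and are not taken from any particular system:

# Each rule rewrites a common mistake pattern into its usual correction.
RULES = [("ei", "ie"), ("teh", "the"), ("acomm", "accomm")]
DICTIONARY = {"believe", "receive", "the", "accommodate"}

def rule_based_candidates(word):
    """Apply every applicable rewrite rule and keep results found in the dictionary."""
    candidates = set()
    for wrong, right in RULES:
        if wrong in word:
            rewritten = word.replace(wrong, right)
            if rewritten in DICTIONARY:
                candidates.add(rewritten)
    return candidates

if __name__ == "__main__":
    print(rule_based_candidates("beleive"))   # {'believe'}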
2.10.4 Probabilistic Techniques
Two methods are mainly based on statistics and probability:
1) Transition method: depends on the probability of a given letter being
followed by another one. The probability is estimated from n-gram
statistics over a large corpus (a sketch follows this list).
2) Confusion method: depends on the probability of a given letter being
confused with, or mistaken for, another one. Probabilities in this method are
source dependent; for example, Optical Character Recognition (OCR)
systems vary in their accuracy and in the way they recognize letters,
and Speech Recognition (SR) systems usually confuse sounds.
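As a rough sketch of the transition method only (the tiny corpus and the
add-one smoothing are illustrative assumptions), letter transition
probabilities can be estimated from bigram counts and used to score how
plausible a string is:

from collections import Counter

def train_transitions(corpus_words):
    """Estimate counts for P(next letter | current letter) from letter bigrams."""
    bigrams, unigrams = Counter(), Counter()
    for word in corpus_words:
        word = word.lower()
        for a, b in zip(word, word[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams

def transition_prob(a, b, bigrams, unigrams):
    """Probability that letter a is followed by letter b (add-one smoothing)."""
    return (bigrams[(a, b)] + 1) / (unigrams[a] + 26)

def word_score(word, bigrams, unigrams):
    """Product of the transition probabilities along the word."""
    word = word.lower()
    score = 1.0
    for a, b in zip(word, word[1:]):
        score *= transition_prob(a, b, bigrams, unigrams)
    return score

if __name__ == "__main__":
    bigrams, unigrams = train_transitions(["the", "then", "there", "peace"])
    # The plausible string scores higher than the implausible one.
    print(word_score("the", bigrams, unigrams) > word_score("tqe", bigrams, unigrams))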
2.11 Suggestion of Corrections
Suggesting corrections may be merged within the candidates'
generation; it is fully dependent on the ou