USING RULE-BASED METHODS AND MACHINE LEARNING FOR SHORT ANSWER SCORING BENEDITH MULONGO, FREDRIK PIHLQVIST KTH ROYAL INSTITUTE OF TECHNOLOGY ELEKTROTEKNIK OCH DATAVETENSKAP
Abstract
Automatic grading of short texts is an area that spans everything
from natural language processing to machine learning. The project
uses machine learning to predict the correctness of free-text
answers. Natural language processing is used to analyse text and
extract important underlying relations in the text.
Several approximate solutions exist today for automatic grading of
short free-text answers. Two prominent methods are machine learning
and rule-based methods. We present an alternative method that
combines machine learning with a rule-based method to approximately
solve the aforementioned problem.
The study implements a rule-based method, a machine learning method
and a final combination of the two. The combined method is
evaluated by looking at the relative changes in performance
compared with the rule-based and the machine learning methods.
The results show no increase in accuracy for the combined method
compared with the machine learning method alone. However, the
combined method uses a small amount of labeled data and reaches an
accuracy nearly equal to that of the machine learning method, which
is positive.
Further investigation in this area is needed; this thesis is only a
small contribution to new methods in automatic grading.
Keywords: machine learning; natural language processing; automatic
grading; rule-based system; self-learning
Abstract
Automatic correction of short text answers is an area that involves
everything from natural language processing to machine learning.
Our project deals with machine learning for predicting the
correctness of candidate answers and natural language processing to
analyse text and extract important underlying relationships in the
text.
Several approximate solutions exist today for automatically
correcting short answers, ranging from rule-based methods to
machine learning methods. We investigate how automatic answer
scoring can be solved by combining machine learning methods with a
rule-based method for a given dataset.
The study is about implementing a rule-based method, a machine
learning method and a final combination of both these methods. The
evaluation of the combined method is done by measuring its relative
performance compared to the rule-based method and machine learning
method.
The results show no increase in the accuracy of the combined method
compared to the machine learning method alone. However, the
combined method uses only a small amount of labeled data and
reaches an accuracy almost equal to that of the machine learning
method, which is positive.
Further investigation in this area is needed; this thesis is only a
small contribution of new approaches and methods in automatic short
answer scoring.
Keywords: machine learning; natural language processing; automatic
answer scoring; rule-based system; self-learning
Table of Contents
1 Introduction
3.3 Experimental System
5.3 Combined Results
6.3 Combined Analysis
1 Introduction
The education system is currently changing, with more courses
being given online. This shift towards e-learning enables courses
to reach more students worldwide than a traditional classroom
setting does. But how do these students get their newly acquired
knowledge assessed?
When the response is only a number or a choice between predefined
candidate answers, the assessment is relatively easy. But whenever
the answer is in free text and the question is relatively open,
many difficulties arise. This is due to the richness of natural
languages, which enables two answers with different vocabulary and
word usage to be similar in meaning despite their apparent
linguistic and syntactic dissimilarity.
The main problem is how to capture the semantic similarity between
the student answers and the reference answers, and how that
similarity can be used to grade the student answers. Different
approaches are investigated in this thesis.
1.1 Background
Automatic Short Answer Grading (ASAG) is a term that covers the
process of grading student answers given in free-text form. When a
student answers a question in this medium, at least one person
normally needs to grade it manually. When a computer does this
grading automatically, the result is an automatic short answer
grading system. Such a system is useful when a question can be
answered in multiple ways and every possible solution cannot be
precomputed [1].
Several approaches have been tested over the years to improve the
results of Automatic Short Answer Grading. One of the most
successful and popular is machine learning. It relies on a large
dataset of graded answers to train on, and works by extracting
features from the training set and using them to predict the grade
of ungraded answers [2].
Another method to grade short answers is to design a rule set that
works as a filter. A student answer passes through this rule set,
and the outcome is the grade of the answer. This requires extensive
knowledge of the question area from the constructor of the rule
set [3].
There have been many attempts to combine different machine learning
approaches to build a robust model. One earlier study presents a
system that makes use of synonyms to build an additional training
dataset. After the dataset has been extended with synonyms,
different decision tree classifiers are trained on the data. The
rules generated by the decision trees are then extracted and used
to classify new, unseen data [4].
1.2 Problem
How can a student answer be graded automatically if the answer is a
short text in natural language? There is currently no fully working
automatic short answer grading tool in broad use. The solutions
that are in use are often limited, either in scope or in
correctness [5].
The biggest issue with Automatic Short Answer Grading is the amount
of work required to get it up and running. The machine learning
approach needs a large amount of data to train on before a model
can be used. Such data is often hard to obtain. Even when enough
data is available, reusing the same question in real classroom
situations can enable cheating and reward memorization.
Another issue is that the machine learning process must be repeated
whenever another exam or question needs to be assessed. The
knowledge learned when building a model for one type of question or
exam is not easily transferable to other questions or to exams in
other subjects. If this repetition is cumbersome, the usefulness of
the automated scoring system is questionable.
Besides the machine learning approach, there is another approach
based on rules, in which an advanced pattern matcher suited to the
given subject is used to map the student answers to predefined
rules and facts. Whenever a match occurs the student gets a
predefined number of points. The problem with this approach is the
amount of preparatory work needed to write the rules and the
difficulty of anticipating all the ways in which an answer can be
written.
The thesis investigates how these two methods can be combined to
avoid the difficulties of each. The research question is:
How can machine learning and rule-based scoring be combined to
automatically grade short answers?
1.3 Purpose
The purpose of the thesis is to investigate, through experiments,
three different approaches to implementing a model for automatic
short answer scoring. The observations and results obtained are
analysed to determine whether the approaches are good solutions for
short answer scoring in future work.
1.4 Goal
The goal of this project is to find out how machine learning based
methods that solve the short answer scoring problem can be combined
with more rule-based methods in order to decrease the amount of
data needed for training and increase the accuracy of the
predictor, while alleviating the need to write detailed
hand-crafted rules.
Benefits, Ethics and Sustainability
Today there is a huge demand for online courses, and many
universities offer education online. Due to the number of students
enrolled it is almost infeasible to hire enough graders to assess
the students' answers [6]. Courses of this kind are known as
massive open online courses. For such systems it is very beneficial
to have an automatic assessment tool that can grade not only
multiple-choice questions but also free-text answers to open
questions. In that respect this project is useful, even though it
raises many questions of a legal and ethical character.
The impartiality and fault tolerance of the system are ethical
aspects that must be considered for automatic assessment tools. The
system must be impartial and give the score that the student
deserves, and at the same time be protected against attacks and
free from programming errors.
Furthermore, the system's scores might not be legally defensible:
the system may not be held legally accountable for the score given,
and the legitimacy of a grade given by the system may be questioned
[7]. Many more questions concerning ethics and sustainability may
be raised, but those are outside the scope of this thesis, which
studies solely the technical aspects of constructing an automatic
answer scoring system.
1.5 Methodology
The thesis project uses a quantitative research approach, applying
experimental methods to answer the research question [8]. The
experiment involves building three different systems, two built on
established methods and one on a combination of the two. The
systems are evaluated using standard numerical measures used in
this field. These include accuracy, precision and recall.
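As a minimal sketch of the measures named above, the following pure-Python functions compute accuracy and per-class precision and recall from lists of true and predicted scores. The function names and the toy scores are illustrative assumptions and are not taken from the thesis implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true and predicted scores on a 0-3 scale.
y_true = [3, 2, 3, 0, 1, 3]
y_pred = [3, 3, 3, 0, 1, 2]
acc = accuracy(y_true, y_pred)
prec3, rec3 = precision_recall(y_true, y_pred, positive=3)
```

For a multi-class score, precision and recall are computed once per class in this sketch; how the per-class values are averaged is a separate design choice.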
To properly evaluate the experimental systems the same dataset is
used to build and test the systems. The dataset originates from
earlier research in the same area. The final stage involves
validating the systems and drawing conclusions based on the
experimental results.
1.6 Scope
The attribute extraction algorithms implemented are often the
simplest versions of the ones referenced. This reflects our prior
knowledge in the field.
This thesis does not answer why the machine learning algorithms
used perform better or worse than one another. The focus is on the
features extracted from the written text. The machine learning
algorithms are studied only to get a solid knowledge base to work
from.
The data handling methods applied in the thesis are not error-free.
Tokenization, synonym lookup and lemmatization of words are done
using existing implementations, with the limitations that this
entails.
Textual entailment between the student answer and the reference
answer is not considered, and no negation detection algorithm for
the dataset is implemented in this research.
A big impediment to rapid and easy deployment of short answer
assessment tools is their inability to be useful outside the area
the system is trained for. This means that the process must be
repeated every time a new subject is to be assessed automatically.
We have not investigated how the short answer scoring system can be
used outside the subject it is trained on; the ability to do so is
called transfer learning [9].
1.7 Disposition
Section 2 presents and discusses the necessary background and
related work. The methodologies and methods used during the thesis
work are presented in Section 3. Section 4 presents the approaches
used and the choices made during the system implementation. The
results for the machine learning system, the rule-based system and
the combined method are presented in Section 5. The results are
analysed and discussed in Section 6, and Section 7 concludes the
thesis and gives suggestions for future work.
2 Background
This section presents the technical background necessary to
understand the thesis. Machine learning and three classifiers are
presented in Section 2.1 Classifiers. Section 2.2 Theoretical
background explains the individual methods used in the thesis.
Section 2.3 Related work presents earlier studies in the research
area.
2.1 Classifiers
A classifier is a machine learning model that uses data X and
labels y in order to predict the labels of unseen samples. Three
classifiers are used in this thesis: Logistic Regression, Naive
Bayes and Random Forest.
Machine Learning
Machine learning is a field within computer science and artificial
intelligence. Machine learning uses a large amount of data combined
with mathematical and statistical techniques to give the computer
the ability to learn patterns that can be used to improve at a task
without the task being explicitly programmed.
A fair number of different machine learning algorithms exist today.
The three classifiers used in this thesis work are presented below.
They were picked because of their generally high performance during
initial testing.
Logistic Regression
Logistic regression is a machine learning method originating in
statistics. Logistic regression uses a logit or sigmoid function to
calculate the probability of the outcome; the function links the
predictors to the outcomes.
The logistic function is defined by the equation below and is
represented in Figure 1.
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Figure 1: Logistic Regression Function
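Let p be the probability that the student answer has score = 3 and
~p = 1 - p the probability that the student answer has a score
different from 3. The odds for score = 3 are:
$$\text{odds} = \frac{p}{\sim p} = \frac{p}{1 - p}$$
The trick behind logistic regression is to take the logit, the
natural logarithm of the odds, and model it as a linear function of
the features x with weights W:
$$\operatorname{logit}(p) = \ln(\text{odds}) = \ln\left(\frac{p}{1 - p}\right) = W^{T} x \quad\Longrightarrow\quad p = \frac{1}{1 + e^{-W^{T} x}}$$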
The probability p is then used to predict the class of an instance
given the features x and the weights W: the instance is assigned to
the class for which the probability is highest. The weights are
estimated by maximum likelihood.
Gradient descent is an algorithm commonly used in logistic
regression to learn the weights from the data. A thorough
explanation of the implementation of gradient descent and logistic
regression can be found in Machine Learning in Action [20] and
Primer with Matlab [21].
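The steps above can be sketched in a few lines of plain Python: the sigmoid turns the weighted sum of the features into a probability p, and batch gradient descent adjusts the weights W. This is only an illustration under stated assumptions; the constant bias feature, the learning rate, the number of epochs and the toy data are all hypothetical and are not taken from the thesis.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, x):
    """Probability p that x belongs to the positive class."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def train(samples, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the negative log-likelihood."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            error = predict_proba(weights, x) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


# Toy data (assumed): the first component is a constant bias term,
# the second is a single numeric feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
w = train(X, y)
predictions = [1 if predict_proba(w, x) >= 0.5 else 0 for x in X]
```

The 0.5 threshold corresponds to picking the class with the higher probability in the two-class case.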
Naive Bayes
Naive Bayes is a probabilistic classifier based on Bayes' theorem,
shown below, which in short gives the probability of an event given
knowledge of prior events [22].
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
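The approach assigns an instance to the target class with the
highest probability. This is done by letting A be the target class
y and B the feature vector {x_1, x_2, ..., x_n}:
$$
\begin{aligned}
\hat{y} &= \arg\max_{y \in Y} P(y \mid x_1, x_2, \ldots, x_n) \\
        &= \arg\max_{y \in Y} \frac{P(x_1, x_2, \ldots, x_n \mid y) \cdot P(y)}{P(x_1, x_2, \ldots, x_n)} \\
        &= \arg\max_{y \in Y} P(x_1, x_2, \ldots, x_n \mid y) \cdot P(y)
\end{aligned}
$$
The denominator does not depend on y and can therefore be dropped
in the last step.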
A classifier can be trained by looking at the feature vectors and
target classes in the training data and estimating the
probabilities from frequencies. The difficulty comes as the feature
set grows: estimating the joint probability
P(x_1, x_2, ..., x_n | y) then becomes very difficult.
The naive approach assumes that the features are conditionally
independent given the target class. The joint probability is then
just the product of the per-feature probabilities:
$$\hat{y} = \arg\max_{y \in Y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
Training the method involves estimating the probability of each
feature value given each target class. The method then uses these
probabilities to determine the class of a new instance.
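A minimal sketch of this training and prediction procedure is given below, with probabilities estimated by counting. The add-alpha (Laplace) smoothing, the assumption of binary features, the function names and the toy data are additions for illustration only and are not described in the thesis.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Estimate the counts behind P(y) and P(x_i | y)."""
    class_counts = Counter(labels)
    # feature_counts[y][i][value] = samples of class y with x_i == value
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(samples, labels):
        for i, value in enumerate(x):
            feature_counts[y][i][value] += 1
    return class_counts, feature_counts


def predict(model, x, n_values=2, alpha=1.0):
    """Return argmax_y P(y) * prod_i P(x_i | y), with add-alpha smoothing."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for y, count in class_counts.items():
        score = count / total  # prior P(y)
        for i, value in enumerate(x):
            seen = feature_counts[y][i][value]
            score *= (seen + alpha) / (count + alpha * n_values)
        if score > best_score:
            best_class, best_score = y, score
    return best_class


# Hypothetical binary features, e.g. presence of two keywords.
X = [(1, 1), (1, 0), (0, 0), (0, 1)]
y = ["correct", "correct", "wrong", "wrong"]
model = train_naive_bayes(X, y)
```

With many features, the product of small probabilities is usually replaced by a sum of logarithms to avoid numerical underflow.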
Random Forest
This classifier is an ensemble algorithm, which means that it
combines several models to determine the class of the target.
Specifically, random forest uses several decision trees.
A decision tree predicts the class by constructing a tree of
decisions. As the sample to be predicted passes through the tree, a
decision is made at each step. The outcome of the process is the
predicted class of the sample.
The trees are trained on randomly selected portions of the training
set. A majority vote among the trees is held to determine the
predicted class of the target [23]. The division and voting are
depicted in Figure 2.
Figure 2: Random Forest
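The division and voting can be sketched as follows. To keep the example short, each "tree" is a one-level decision tree (a stump) and the random portions are bootstrap samples drawn with replacement; both are simplifying assumptions, as are all names and the toy data. A real random forest grows deeper trees and also samples features.

```python
import random
from collections import Counter


def train_stump(samples, labels):
    """Fit a one-level decision tree: the best (feature, threshold) split."""
    best = None
    for f in range(len(samples[0])):
        for threshold in sorted({x[f] for x in samples}):
            left = [y for x, y in zip(samples, labels) if x[f] <= threshold]
            right = [y for x, y in zip(samples, labels) if x[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def stump_predict(stump, x):
    f, threshold, left_label, right_label = stump
    return left_label if x[f] <= threshold else right_label


def train_forest(samples, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample of the training set."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(samples)) for _ in samples]
        forest.append(train_stump([samples[i] for i in idx],
                                  [labels[i] for i in idx]))
    return forest


def forest_predict(forest, x):
    """Majority vote over the predictions of all trees."""
    votes = Counter(stump_predict(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy one-feature data (assumed).
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
forest = train_forest(X, y)
```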
2.2 Theoretical background
The theoretical background explains the knowledge needed to
understand the process used in this thesis.
Data analysis
Data analysis is the first step in every machine learning project.
A thorough understanding of the data makes it easier to detect
outliers, clean the data and build a suitable model.
Data pre-processing, in turn, is the technique used to prepare the
data, for example by removing noise, missing values and other data
samples that may distort the prediction result.
In short answer scoring, the only data available is the text of the
students' answers. It is difficult to analyse raw texts and find
interesting relations without first transforming the texts into a
different format. Nonetheless, we can analyse the frequency of some
words in the text, as depicted in Figure 3. We can also conduct
more advanced analysis by looking for the most frequent bigrams,
also called collocated words.
Furthermore, a grammar checker can be used to correct the grammar
of the student answers. The grammar checker has some inherent
problems because of its implementation in Python. The result is
still better than not performing the grammar check.
Cleaning
The dataset used is full of noise of different kinds: composite
words that should be separated, grammatical errors, words that do
not fit in the phrases, grading errors etc. Cleaning the dataset is
therefore of great importance. The cleaning process involves
separating composite words such as “container.also” into “container
also”. Furthermore, an approximate string matching algorithm is
used to compensate for such errors.
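The two cleaning steps can be illustrated with the Python standard library. The regular expression, the similarity cutoff and the small vocabulary below are assumptions made for this sketch; the thesis does not state which approximate string matching algorithm was used, and `difflib` is used here only as a stand-in.

```python
import difflib
import re


def split_composites(text):
    """Replace punctuation that glues two words together with a space."""
    return re.sub(r"(?<=[a-z])[.,;:](?=[a-zA-Z])", " ", text)


def correct_word(word, vocabulary, cutoff=0.8):
    """Replace a word by its closest vocabulary entry, if one is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical vocabulary built from the reference answers.
vocabulary = ["vinegar", "container", "quantity"]
cleaned = split_composites("container.also")
fixed = correct_word("vineger", vocabulary)
```

A word with no sufficiently similar vocabulary entry is returned unchanged, so unknown but valid words are not overwritten.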
Tokenize
Tokenization is the process of taking a sentence and creating a
list of its words in the same order as in the sentence. This
technique makes it easier to perform operations on text. An example
of tokenizing is shown below.
I want to have fun → [I, want, to, have, fun]
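A minimal tokenizer that reproduces the example above can be written with a regular expression. The pattern is an assumption for illustration; the thesis relies on an existing tokenizer implementation that is not shown here.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens, keeping their original order."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


tokens = tokenize("I want to have fun")
```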
Stemming
Stemming is a method to find the root of a word. For example, the
root of "running" is "run". The process is performed on individual
words, often on a tokenized list. Different stemming techniques
exist, but the idea is always the same: to get the root of the
word.
Lemmatization
Lemmatization is like stemming but uses a vocabulary and
morphological analysis of the word to return its lemma. The
difference is that stemming does not consider the context the word
was used in. Lemmatization tries to do this by using the word's
part of speech.
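The contrast between the two techniques can be sketched as follows. The suffix list is a toy stand-in for a real stemming algorithm, and the lemma table and part-of-speech tags are hypothetical; a real lemmatizer consults a full vocabulary.

```python
def naive_stem(word):
    """Strip a few common English suffixes; no vocabulary is consulted."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("better", "ADJ"): "good",
}


def lemmatize(word, pos):
    """Look up the lemma of a word for a given part of speech."""
    return LEMMAS.get((word, pos), word)
```

The stemmer maps "running" to "run" regardless of context, while the lemmatizer returns different lemmas for the verb and the noun, and can map "better" to "good", which no suffix rule would produce.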
Bag of words
A bag-of-words (BOW) model is a common way to vectorize a phrase or
text using each unique word and its occurrence count.
Given two texts T_1 and T_2 and a word list or dictionary D, two
bag-of-words vectors can be created based on the dictionary.
D = {w_1, w_2, ..., w_10}
BOW(T_1, D) = [1,1,1,1,0,1,1,0,1,1]
BOW(T_2, D) = [1,1,1,0,1,1,0,1,1,1]
In this case the vectors are binary, but that does not need to be
the case. They are binary only because each word occurs at most
once; a zero marks a dictionary word that does not occur in the
text. For example, the word "quantity" does not occur in the second
text, so its entry there is zero.
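A bag-of-words vector can be computed by counting, for each dictionary word, how often it occurs in the text. The texts and the dictionary below are hypothetical and are not the ones used in the thesis; they are chosen so that "quantity" is missing from the second text.

```python
def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in dictionary]


# Hypothetical dictionary and texts.
dictionary = ["we", "need", "the", "quantity", "amount", "of", "vinegar"]
text1 = "We need the quantity of vinegar"
text2 = "We need the amount of vinegar"
vec1 = bag_of_words(text1, dictionary)
vec2 = bag_of_words(text2, dictionary)
```

Because the function counts occurrences, a word that appears twice in a text yields an entry of 2, so the vectors are not restricted to binary values.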
N-grams
An n-gram representation is used to represent and analyse a text.
Given the sentence ”We need to know the quantity of vinegar”, we
can analyse the sentence by considering each word on its own,
[We, need, to, know, the, quantity, of, vinegar]; this
representation is called unigram. Alternatively we can consider the
sentence as a sequence of word pairs, [We need, need to, to know,
know the, the quantity, quantity of, of vinegar]; this
representation is called bigram. The process continues to trigrams
and, in general, to n-grams.
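The unigram and bigram lists above can be generated with one small function; the function name is an assumption for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "We need to know the quantity of vinegar".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sentence of k tokens yields k - n + 1 n-grams, so the eight-word example gives seven bigrams and six trigrams.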
Feature extraction
Feature extraction is a technique applied before building a machine
learning model. It is used to find characteristics or attributes of
each sample in the dataset. For example, a feature or attribute of
a text can be the length of the text, the presence or absence of
some keywords, or the similarity score between the sample and some
reference answers. Those features are used to predict the class of
unseen samples, given that we know their attributes or features.
[x_11, x_21] → y_1
[x_12, x_22] → y_2
[x_13, x_23] → ?
The example above shows how two labeled feature vectors can be used
to infer the class of a third sample given its feature vector.
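The three kinds of feature mentioned above (length, keyword presence and similarity to a reference answer) can be combined into one feature vector as sketched below. The Jaccard word overlap is an assumed stand-in for the similarity score, and the answers and keywords are hypothetical.

```python
def extract_features(answer, reference, keywords):
    """Map a student answer to a small numeric feature vector."""
    answer_tokens = set(answer.lower().split())
    reference_tokens = set(reference.lower().split())
    union = answer_tokens | reference_tokens
    # Jaccard overlap stands in for the similarity to a reference answer.
    similarity = len(answer_tokens & reference_tokens) / len(union) if union else 0.0
    keyword_flags = [1 if k in answer_tokens else 0 for k in keywords]
    return [len(answer.split()), similarity] + keyword_flags


features = extract_features(
    "we need the amount of vinegar",
    "we need to know the quantity of vinegar",
    keywords=["vinegar", "quantity"],
)
```

The resulting vector has the layout [length in words, similarity, one flag per keyword] and can be passed to any of the classifiers from Section 2.1.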
Feature selection
After different features have been implemented, a feature selection
algorithm can be used to find the most predictive features. The
most predictive features are those that carry a lot of information
about the category of interest: knowing the state of those features
increases our knowledge of the class the sample belongs to. In many
cases feature selection increases the performance of the
classifier. It can also decrease the training time by reducing the
dimension of the feature space.
A well-known algorithm for feature selection is the Boruta
algorithm [11]. In this thesis we use the feature selection from
Sklearn (see footnote 1). A complete and theoretical introduction
to feature selection can be found in An Introduction to Variable
and Feature Selection [18] and Improving Question Classification by
Feature Extraction and Selection [19].
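The idea of keeping the features that carry the most information about the class can be illustrated with a simple filter that ranks discrete features by their mutual information with the label. This is neither the Boruta algorithm nor the specific Sklearn routine used in the thesis, which is not named in this section; the function names and the toy data are assumptions.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in bits) between a discrete feature and the class."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    fx = Counter(feature_values)
    fy = Counter(labels)
    return sum(
        (c / n) * math.log2((c / n) / ((fx[x] / n) * (fy[y] / n)))
        for (x, y), c in joint.items()
    )


def select_k_best(samples, labels, k):
    """Return the indices of the k features with the highest mutual information."""
    scores = [
        mutual_information([x[i] for x in samples], labels)
        for i in range(len(samples[0]))
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: feature 0 determines the label, feature 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
selected = select_k_best(X, y, k=1)
```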
2.3 Related work
As described in the introductory chapter, this thesis is about
automatic short answer scoring. It is a very broad subject with
many hard problems for which no general solutions are known. To
build a well-performing answer scoring system, expert knowledge in
the domain for which the system is built is required, as is
knowledge of computational linguistics, data science, computer
science, machine learning and mathematics.
Many studies have been conducted in the area of short answer
scoring, and different approaches have been tested. They range from
approaches that rely solely on a full understanding of the question
using advanced natural language processing techniques to methods
that rely on pattern recognition and machine learning. Some studies
have also considered combinations of machine learning with
rule-based approaches.
A common characteristic of the aforementioned approaches is that
they are difficult to use outside the domain for which they were
built. This is the main reason why it is difficult to deploy short
answer scoring systems on a large scale. Every time a new exam is
to be assessed, a new system needs to be built, which is very
cumbersome.
Machine Learning
Data plays a vital role when machine learning is applied in a given
domain, and the area of short answer scoring is no exception. Many
studies have dealt with the application of machine learning to
short answer scoring [7]. Of these, we review only two here,
because they are the most relevant to this thesis work and because
they used the same dataset as this thesis.
1 Sklearn (scikit-learn) is one of the most popular open source
machine learning libraries written in Python. More information can
be found at http://scikit-learn.org/stable/index.html.
The interesting attributes implemented in the paper are the
following:
Latent Dirichlet Allocation (LDA): an algorithm for topic modelling
that automatically finds the topics of unlabelled texts [10]. The
authors constructed two LDA topic spaces. The similarity between
the student’s answer and each of the two LDA topic spaces was used
as a feature of the student’s answer.
Well-formedness features: used to check the grammatical correctness
of the answer.
Length features: lengths measured in words and in characters were
used as attributes.
Language model features: two language models were trained, and the
perplexity of each answer under those language models was used as a
feature.
The results obtained in the paper are presented in Table 1.
Table 1: Prognosis Essay Scoring and Article Relevancy Using
Multi-Text Features and Machine Learning
In the paper Prognosis Essay Scoring [11], the authors followed
almost the same approach as the one explained above. The only
difference is that they used a restricted number of features:
word2vec, regular expressions, text statistics and n-grams. Those
features were used to train a random forest and a gradient boosting
classifier, with the Boruta algorithm for feature selection. Their
results, labelled "proposed", are presented in Table 2.
This thesis takes inspiration from the two aforementioned
approaches but uses different classifiers and feature extraction
algorithms. A detailed explanation of the methods used and the
results obtained is presented in the following chapters.
Rule-based systems
A rule-based system can be described as a system that consists of a
working memory or knowledge base, a rule base, an inference engine
and an execution engine [12]. The knowledge base or working memory
holds the facts and conditions. The rule base describes relations
between premises and conclusions. The inference engine has a
pattern matcher that matches rules against the facts and an agenda
that lists all relevant and applicable rules. The execution engine
decides which rules to apply given the input.
Two main algorithms are used for inference in a rule-based system:
forward chaining and backward chaining [13]. Forward chaining is
data-driven (bottom-up): it begins with the facts and uses the
rules to infer conclusions or trigger actions. Backward chaining is
goal-driven (top-down): it starts with a hypothesis or goal and
searches the rule space for rules that could be used to prove it,
setting new subgoals to prove as the process moves forward.
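Forward chaining can be sketched in a few lines: rules are applied to the known facts until nothing new can be derived. The rule format (a tuple of premises and one conclusion) and the fact names are hypothetical and chosen only for illustration.

```python
def forward_chain(facts, rules):
    """Apply rules (premises -> conclusion) until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rules for grading an answer about an experiment.
rules = [
    (("mentions_quantity", "mentions_vinegar"), "names_missing_information"),
    (("names_missing_information",), "award_point"),
]
derived = forward_chain({"mentions_quantity", "mentions_vinegar"}, rules)
```

Backward chaining would instead start from the goal "award_point" and look for rules whose conclusion matches it, recursively trying to establish their premises.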
In this thesis we do not use the rigid, logical definition of a
rule-based system described above. In the context of the thesis, a
rule-based system is a system that interprets the student answer.
This is done by building a model of the reference answers and using
a pattern matcher to compute the relation between the student’s
answers and the reference answers. To some extent this process can
also be described as an information extraction technique. The
reference answers and the patterns formed can be considered rules,
the student answer can be regarded as a fact, and the system infers
whether the fact follows the rules.
Figure 4: Illustration of a simple mark scheme template
The authors used the above template to match the student answer to
the reference answers. They also used text pre-processing and
natural language processing techniques with sentence analysis.
S. Pulman and J. Sukkarieh follow almost the same approach as T.
Mitchell in their paper, but with a strong emphasis on synonym
construction [15]. They use an information extraction approach with
handmade patterns as rules. They also tried an inductive
programming approach but found it not to be very promising.
A more complete resource on information extraction and rule-based
approaches is “A systematic approach to the automated marking of
short-answer questions” [16]. The authors begin by spell checking
the student answers. They then use the Stanford parser, partly to
find the parts of speech (POS) of the student answer and partly to
obtain its typed-dependency parse tree. The POS tags are used to
analyse the answer syntactically by means of the Question Answer
Language, an EBNF grammar developed by the authors to describe the
patterns that correct answers must follow. The typed-dependency
parse is used to analyse the grammatical relations in the text in
order to ensure that the text follows the predefined grammatical
patterns.
A different approach uses rules to match the student’s answer with
predefined correct model answers, or reference answers [3]. The
paper describes a scoring algorithm and gives it in pseudocode. The
algorithm can automatically grade a student answer given predefined
patterns constructed from the model answers. Each time the
student’s answer matches one of those patterns, the student's score
is increased by one.
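The scoring idea described above, one point per matched pattern, can be sketched with regular expressions. This is not the pseudocode from the cited paper; the patterns, the function name and the example answers are hypothetical.

```python
import re

# Hypothetical patterns built from the model answers of one question.
# Each pattern that matches the student answer adds one point.
PATTERNS = [
    re.compile(r"\b(quantity|amount|volume)\b.*\bvinegar\b"),
    re.compile(r"\b(type|kind)\b.*\bcontainer\b"),
]


def score_answer(answer, patterns=PATTERNS):
    """Count how many model-answer patterns occur in the student answer."""
    text = answer.lower()
    return sum(1 for pattern in patterns if pattern.search(text))


full = score_answer("You need the amount of vinegar and the type of container")
partial = score_answer("You need the amount of vinegar")
none = score_answer("I do not know")
```

Listing synonyms inside a pattern, as in the first alternative group, is one way to anticipate different wordings of the same model answer.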
Combined approach
There have been many attempts to combine different machine learning
approaches in order to build a robust model. In Hybrid approach for
automatic short answer marking, the authors describe a system that
uses synonyms to build additional training datasets and builds a
separate decision tree classifier for each model answer. The rules
are extracted from the decision trees and are then used to classify
new, unseen text answers [4].
Although not considered in this work, there is some previous work
on transfer learning for short answer scoring [9]. Transfer
learning is a very promising field, especially for short answer
scoring, because it would alleviate the need to repeat the same
training and/or information extraction process for each new
question or exam. That repetition is cumbersome and undesirable if
we want a broad deployment of this kind of system.
The approach used in the thesis differs from the aforementioned
methods: it uses a semi-supervised approach combined with a
rule-based system. A semi-supervised algorithm is an algorithm that
uses a combination of labeled and unlabeled data to predict new
examples. The semi-supervised algorithm used in this work is
self-learning [17].
The self-learning method used in the thesis combines a machine
learning method and a rule-based method. The combination lies in
how the machine learning method trains itself. This is done to
increase performance and decrease the amount of data needed during
training.
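The general self-learning loop can be sketched as follows: a base classifier is trained on the labeled data, confidently predicted unlabeled samples are added with their predicted labels, and the classifier is retrained. This is a generic illustration and not the thesis's specific combination with the rule-based method; the one-dimensional nearest-centroid classifier, the confidence measure, the threshold and the toy data are all assumptions.

```python
def fit_centroids(samples, labels):
    """Base classifier: one mean feature value per class (1-D nearest centroid)."""
    centroids = {}
    for label in set(labels):
        values = [x for x, y in zip(samples, labels) if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def predict_with_confidence(centroids, x):
    """Return the nearest class and a confidence in [0, 1] (needs >= 2 classes)."""
    distances = sorted((abs(x - c), label) for label, c in centroids.items())
    (d_best, label), (d_second, _) = distances[0], distances[1]
    total = d_best + d_second
    confidence = 1.0 - d_best / total if total else 0.5
    return label, confidence


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Repeatedly add confidently predicted unlabeled samples to the training set."""
    samples = [x for x, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(samples, labels)
        confident = []
        for x in pool:
            label, confidence = predict_with_confidence(centroids, x)
            if confidence >= threshold:
                confident.append((x, label))
        if not confident:
            break
        for x, label in confident:
            samples.append(x)
            labels.append(label)
            pool.remove(x)
    return fit_centroids(samples, labels), pool


# Toy data (assumed): two labeled samples and three unlabeled ones.
model, remaining = self_train([(0.0, 0), (1.0, 1)], [0.1, 0.9, 0.5])
```

Samples that never reach the confidence threshold stay unlabeled, which limits how far wrong pseudo-labels can spread.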
3 Methodology
To answer the research question presented in this thesis, a
research method needs to be applied. There are two basic research
methodologies: quantitative and qualitative [8]. Quantitative
research implies that the research question can be answered with
quantifiable results. It is usually applied in experiments, testing
and studies of computer systems. This method requires the use of
statistical methods to validate the results.
Qualitative research is the opposite and focuses on non-numerical
methods. The methodology focuses on understanding meanings,
opinions and behaviour to reach a result.
3.1 Research Approaches
The two most common research approaches are the inductive and the
deductive approach [24]. The inductive approach is based on
analysing the data and formulating views and opinions about the
phenomenon. The deductive approach verifies or falsifies a
hypothesis. This is done by testing a theory on a large dataset,
and the result must be measurable. This thesis uses a deductive
approach to verify the given hypothesis.
3.2 Data collection and analysis
The data collected is based on a well-known dataset from the data
science competition website Kaggle [35]. The dataset is used for
experimental studies in short answer scoring. Each entry consists
of a student answer and two scores ranging from 0 to 3, where 3 is
full marks.
The dataset is transformed in order to work with our research
methods. This is done by dividing the dataset into three separate
sets: a train-set, a validation-set and a test-set. The division is
done by randomly selecting elements from the original dataset. The
train-set is used for training an initial, or base, classifier for
the given research problem. The validation-set is used for tuning
variables in the experimental system to increase its performance
and accuracy. The test-set is used to measure the goodness of the
model, that is, its ability to generalise beyond known examples.
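The random division can be illustrated with a minimal sketch. It assumes the 70% / 10% / 20% proportions reported later for the machine learning experiments; the function name `split_dataset` and the fixed seed are illustrative choices, not part of the thesis code.

```python
import random

def split_dataset(samples, train=0.7, validation=0.1, seed=0):
    """Randomly partition samples into train, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return train_set, validation_set, test_set

train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))  # 70 10 20
```

Keeping the three lists disjoint is what allows the test-set to give an unbiased estimate after tuning on the validation-set.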
3.3 Experimental System
The system is built in three parts, where each part corresponds to
a method constructed to tackle the research question. Each part is
evaluated separately using existing evaluation methods. Two of the
methods are based on known techniques which have given quantifiable
results in the past. The third is a combination of the previous two
and is tested in the same manner.
The system follows an incremental model in which it is improved
over successive iterations of development [25]. The incremental
model is chosen because it allows the results to be improved step
by step and is relatively inexpensive to apply to a software
system.
Python 3 is used to build the system [26]. The Python programming
language has many tools for achieving the results needed to answer
the research question. A very important tool is the machine
learning library scikit-learn (Sklearn), which is used extensively
in this thesis [27].
3.4 Evaluation
All the implemented systems are evaluated in the same manner. The
evaluations are based on established methods in machine learning
and Automatic Short Answer Grading [2]. The assessment measures
include accuracy, precision, recall and f1-score. All the measures
presented here are used later and are important for assessing the
model, but we focus mainly on accuracy when comparing the
results.
Accuracy
Accuracy is the fraction of correct predictions of the model over
the whole dataset. Here $\hat{y}_i$ is the predicted value and
$y_i$ is the true value of the $i$-th sample, and
$n_{\mathrm{samples}}$ is the number of elements in the dataset
that were tested. The accuracy is a value between 0 and 1 [29].
\[
\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i)
\]
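As a small illustration of the accuracy measure, the following sketch counts the matching predictions and divides by the number of samples. The function name `accuracy` and the example scores are chosen here for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true value."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Three of the four predicted scores match the true scores
print(accuracy([0, 1, 2, 3], [0, 2, 2, 3]))  # 0.75
```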
To reduce the uncertainty in the accuracy, a statistical method
called cross-validation is sometimes used instead. The training set
is divided into multiple separate sets and the accuracy of each of
them is calculated separately. The results are then summarised by
their mean score together with the standard deviation. Here $k$ is
the number of sets, $s_i$ is the true value and $\hat{s}_i$ is the
predicted value for set $i$. The cross-validation score is given by
$\omega$.
\[
\omega = \frac{1}{k} \sum_{i=1}^{k} \mathrm{accuracy}(s_i, \hat{s}_i)
\]
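The bookkeeping behind cross-validation can be sketched as follows. The sketch only shows how the data indices might be divided into folds and how per-fold accuracies are summarised by mean and standard deviation; the helper names and the use of contiguous folds and the population standard deviation are assumptions made for this example.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def summarise_folds(fold_accuracies):
    """Return the mean and population standard deviation of fold accuracies."""
    return statistics.mean(fold_accuracies), statistics.pstdev(fold_accuracies)

folds = k_fold_indices(10, 3)
mean, std = summarise_folds([0.5, 0.6, 0.7])
```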
Figure 1: Confusion Matrix
Figure 5 shows a confusion matrix for two classes, 0 and 1. A
confusion matrix is a matrix used to assess the performance of a
model by comparing the true values and the predicted ones. It shows
how well the model is able to predict the true value of each
sample.
True positive (TP) = true value is predicted to be true
True negative (TN) = false value is predicted to be false
False positive (FP) = false value is predicted to be true
False negative (FN) = true value is predicted to be false
Precision
Precision, or positive predictive value, is the fraction of
relevant instances among all the retrieved instances. It gives a
measure of how well the model classifies instances into the right
category. It is computed from the true positives (tp) and false
positives (fp) as the ratio of tp to all instances predicted as
positive [31].
\[
\mathrm{precision} = \frac{tp}{tp + fp}
\]
Recall
Recall, or sensitivity, is the fraction of relevant instances that
are retrieved over the total number of relevant instances. It is
computed from the true positives and false negatives (fn) as the
ratio of tp to the sum of tp and fn [32].
\[
\mathrm{recall} = \frac{tp}{tp + fn}
\]
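The f1-score is the harmonic mean of precision and recall.
\[
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]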
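The three measures can be computed directly from the confusion counts. The sketch below treats one class as the positive class, as one would do per score in a multi-class setting; the function names and the toy labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and f1-score, returning 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

tp, fp, fn = confusion_counts([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

In this toy example two of the three positive samples are found and one negative sample is wrongly flagged, so precision, recall and f1-score all equal 2/3.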
4 Work
This section will present the process used in this thesis and the
way the work is conducted. The data will be presented in section
4.1 and the remaining sections 4.2-4.4 will present the work
conducted in the thesis project.
4.1 Data
The dataset is from the Automated Student Assessment Prize (ASAP),
which is a joint effort of the Hewlett Foundation and Open
Educational Solutions to gather all the current approaches to
automated scoring systems for open-ended student response tasks.
The competition is available at Kaggle [35]. The dataset is one of
the biggest publicly available datasets for short answer scoring
today [2]. Figure 6 shows the structure of the dataset.
Figure 2: Dataset Structure
The dataset is a tab-separated values file with 5 columns. The
first column is the identity number of each answer, which is unique
and ranges from 1 to 27 588. The second column is the essay set:
there are 10 questions, and EssaySet identifies the question and
ranges from 1 to 10. Score 1 is the first score and Score 2 is the
second score, given by two different correctors; the final score is
Score 1. The essay text is the student’s text answer.
In this project, we only consider the first question and only
assess our system using Score 1. The reasons for these limitations
are partly the time available and partly the reliability of Score
1, which is the final human score. The final dataset used in this
project corresponds to EssaySet 1 with Score 1 as label, which is
exactly 1672 data samples.
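Selecting one essay set from the tab-separated file can be sketched with the standard `csv` module. The column names `Id`, `EssaySet`, `Score1`, `Score2` and `EssayText` are assumed here from the description above, and the three rows are invented stand-ins for the real file, not actual dataset entries.

```python
import csv
import io

# Invented sample rows standing in for the real tab-separated file
SAMPLE_TSV = (
    "Id\tEssaySet\tScore1\tScore2\tEssayText\n"
    "1\t1\t3\t3\tYou need to know how much vinegar was used.\n"
    "2\t1\t0\t1\tI do not know.\n"
    "3\t2\t2\t2\tAn answer to another question.\n"
)

def load_essay_set(tsv_file, essay_set=1):
    """Return the answer texts and Score1 labels of one essay set."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    texts, labels = [], []
    for row in reader:
        if int(row["EssaySet"]) == essay_set:
            texts.append(row["EssayText"])
            labels.append(int(row["Score1"]))
    return texts, labels

texts, labels = load_essay_set(io.StringIO(SAMPLE_TSV))
print(labels)  # [3, 0]
```

With the real file, `io.StringIO(SAMPLE_TSV)` would be replaced by an opened file object.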
There are some grammatical errors in the dataset and some
assessment errors from the human correctors, but those errors
affect only part of the dataset.
The errors can be caused by misspellings made when pupils wrote
their answers on paper. Other errors can be due to the optical
character recognition system used when translating handwritten
texts to machine-encoded texts.
4.2 Machine Learning Method
The matrix of features X and their corresponding label y as
depicted in section 2.2 are used in three different
classifiers:
• Logistic regression
• Naive Bayes
• Random forest
The machine learning model for each classifier is evaluated and
assessed and the result obtained can be found in section 5.
Approach
Using machine learning for short answer scoring, we want to find
patterns in the dataset that can help us predict the score of other
student answers not present in the training dataset. The only
things available are the students’ answer texts and their
corresponding scores.
As stated before, machine learning algorithms work with numerical
values (see footnote 2). The main difficulty is to implement
algorithms that transform the student answer into numerical values
that can be fed into a machine learning algorithm. To achieve this
transformation from free text to numerical values, 19 features, or
characteristics of the text, are implemented. Each feature takes
the text as input and returns a numerical value, for example a
similarity score between the text and a reference answer or the
number of keywords present in the text.
Feature Implementation
Features are characteristics of text. Depending on the numerical
values of the features, sometimes called attributes, these
characteristics can help us determine whether the text is worth 3
points or 0 points.
A simple example of a feature is the number of keywords present in
the student’s answer text. A detailed explanation of the features
implemented below can be found in [2][7][10][36]–[42]. The
following features are implemented and computed for each
answer.
Cosine Similarity
The simplest form of the cosine similarity uses the bag-of-words
representation of two texts and computes the similarity between
their respective vectors $v_1$ and $v_2$ using the following
formula:
\[
\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| |v_2|}
\]
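A minimal sketch of the bag-of-words cosine similarity is given here. It assumes simple lower-casing and whitespace tokenisation, which is a simplification chosen for the example and not necessarily the preprocessing used in the thesis.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count how often each lower-cased, whitespace-separated word occurs."""
    return Counter(text.lower().split())

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the bag-of-words vectors of two texts."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction
    return dot / (norm_a * norm_b)

score = cosine_similarity("measure the vinegar", "measure vinegar")
```

Texts without any shared word get a similarity of 0, and identical texts get a similarity of 1.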
The similarity score between the vector representation of the
reference texts and the student answers was used as a feature in
the project.
Footnote 2: Most machine learning algorithms work with numerical
values. There are string kernel functions that can work directly
with strings, but the final result of the kernel function is still
a numerical value. Some algorithms may also use categorical values
such as {sunny, cold, rainy, ...}.
Keywords
Keywords is the simplest feature: we have a list of keywords and
simply count how many times each keyword occurs in the student
answers.
We use two versions of the keywords feature: one that simply counts
the occurrences of each keyword and returns the result as a
feature, and another that normalizes the result. Those two features
are then saved in a feature matrix as explained above.
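The two keyword features might be sketched as follows. The text does not state how the count is normalized, so dividing by the number of words in the answer is an assumption made for this example, as are the function name and the keyword list.

```python
def keyword_features(answer, keywords):
    """Return the raw keyword count and a length-normalized version of it."""
    tokens = answer.lower().split()
    total = sum(tokens.count(keyword.lower()) for keyword in keywords)
    # Assumption: normalize by answer length so long answers are not favoured
    normalized = total / len(tokens) if tokens else 0.0
    return total, normalized

count, normalized = keyword_features(
    "Measure the vinegar and the mass", ["vinegar", "mass", "temperature"]
)
```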
Latent Semantic Analysis
A complete explanation of latent semantic analysis (LSA) would be
lengthy and outside the scope of this thesis. Nonetheless, latent
semantic analysis can be defined as a method to capture the latent
(hidden) semantic space of two or more texts even if they do not
necessarily share the same words.
Given the dictionary [we, need, to, know, measure, the, quantity,
amount, of, vinegar] and the vector representations of four texts
using this dictionary, we can build the following matrix:
[ [1, 1, 1, 1],
  [1, 1, 0, 3],
  [1, 1, 1, 1],
  [1, 0, 1, 0],
  [0, 1, 0, 1],
  [1, 1, 1, 1],
  [3, 0, 1, 0],
  [0, 1, 0, 1],
  [1, 1, 1, 2],
  [1, 1, 1, 1] ]
Each column is the transposed vector representation of one text
using the bag-of-words approach with a common dictionary.
After the construction of the matrix, we calculate the singular
value decomposition of the matrix X:
\[
X = U \Sigma V^T
\]
The obtained matrices are reduced to a lower dimension $k$, which
gives an approximation of X and at the same time reduces the
computation time needed, as
\[
X_k = U_k \Sigma_k V_k^T
\]
is in a lower dimension.
The terms are represented by the rows of $U_k$ and the texts
(documents) by the rows of $V_k$. The similarity between two terms
or two documents in the semantic space can be computed using the
cosine similarity between the two vectors.
Two versions of the LSA model have been implemented as features.
The first uses one reference answer in which each creditworthy
sentence is treated as a document. The feature is calculated by
averaging the similarity scores between the student’s answer and
each sentence in the reference answer.
The second LSA model is built with a set of a hundred reference
answers as documents, and the highest similarity between the
student’s answer and the reference answer set is returned as a
feature.
Partial Word Overlap
Partial word overlap is a method to compare two texts A and B by
computing the word overlap between them using the following
formula:
\[
\mathrm{overlap}(A, B) = \frac{|A \cap B|}{|A + B|}
\]
We use the word partial here because it is not a complete matching
of words between the two texts in comparison, but an approximate
matching in which we allow some difference between words. For
example, vinegar and vinnagar will be matched by approximate string
matching.
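One way to sketch the approximate matching is with `difflib.SequenceMatcher` from the Python standard library. The similarity threshold of 0.75 and the reading of the denominator as the total number of distinct words in both texts are assumptions made for this example; the thesis may have used a different string matcher.

```python
from difflib import SequenceMatcher

def is_partial_match(word_a, word_b, threshold=0.75):
    """Treat two words as matching when their similarity ratio is high enough."""
    return SequenceMatcher(None, word_a, word_b).ratio() >= threshold

def partial_word_overlap(text_a, text_b, threshold=0.75):
    """Approximately matched words of text_a divided by the words of both texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    matched = sum(
        1 for wa in words_a
        if any(is_partial_match(wa, wb, threshold) for wb in words_b)
    )
    return matched / (len(words_a) + len(words_b))

overlap = partial_word_overlap("measure the vinegar", "the vinnagar")
```

Here "the" matches exactly and "vinegar" matches the misspelled "vinnagar", giving two matched words out of five words in total.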
Language Model
The language model is a method commonly used in automatic word
suggestion or completion, for example when typing in Google or
other search engines.
A language model is used to calculate the probability of the next
words given the preceding observed words. For example:
\[
P(\text{lose weight} \mid \text{how to}), \quad P(\text{draw} \mid \text{how to})
\]
When typing ‘how to’ in Google, the phrases ‘lose weight’ and
‘draw’ appear among the first suggestions. That means that people
searching in Google often type ‘lose weight’ or ‘draw’ after ‘how
to’. To estimate those probabilities, we need a very large corpus
that is representative of the goal or application considered. To
avoid zero probabilities, the model is sometimes approximated by a
unigram model in which the words are assumed to be independent.
We have built a large corpus of acceptable answers and calculate
the perplexity of each student answer as a feature, that is, a
measure of how probable the answer text is under the model built
from the corpus.
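A unigram version of the perplexity feature can be sketched as follows. Add-one smoothing is used here to avoid zero probabilities for unseen words; this smoothing choice, the function names and the two-sentence corpus are assumptions made for the example. A lower perplexity means the answer is more probable under the corpus model.

```python
import math
from collections import Counter

def train_unigram(corpus_texts):
    """Count word occurrences over all texts of the corpus."""
    return Counter(word for text in corpus_texts for word in text.lower().split())

def perplexity(answer, counts):
    """Unigram perplexity of an answer with add-one smoothing."""
    tokens = answer.lower().split()
    if not tokens:
        return float("inf")
    total = sum(counts.values())
    vocabulary = len(counts) + 1  # one extra slot for unseen words
    log_prob = sum(
        math.log((counts[token] + 1) / (total + vocabulary)) for token in tokens
    )
    return math.exp(-log_prob / len(tokens))

counts = train_unigram(["measure the vinegar", "the amount of vinegar"])
seen = perplexity("the vinegar", counts)
unseen = perplexity("banana pizza", counts)
```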
Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) is an advanced probabilistic
topic model. Because the exact estimation of the probabilities used
in the model is NP-hard, approximate methods such as Gibbs sampling
are used.
We use the algorithm implemented in Gensim to implement the LDA
features. Two versions are implemented.
The first version builds an LDA model from the reference answers.
The student answers are then reduced to the same dimension as the
reference answers, and the topic probability distribution of the
student answers is used as a feature.
Word Alignment
Word alignment is a way to compare two texts by estimating how many
semantically similar words the two texts have in common. It
resembles the partial word overlap in many ways, but here we are
only interested in the word-to-word semantic similarity, not
necessarily the syntactic similarity.
The formula to calculate the word alignment is:
\[
\mathrm{sim}(T^{(1)}, T^{(2)}) = \frac{n_a(T^{(1)}) + n_a(T^{(2)})}{n_c(T^{(1)}) + n_c(T^{(2)})}
\]
$T^{(1)}$ and $T^{(2)}$ are the two input texts
$n_c$ = number of content words in the input text (all words
except stop words)
$n_a$ = number of aligned words in the input text (without stop
words)
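The alignment score can be illustrated with a small sketch in which two content words are aligned when they are identical or listed as synonyms. The stop word list and the synonym table are hypothetical stand-ins for a real word aligner and are labeled as such in the code.

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and"}  # illustrative list
SYNONYMS = {"amount": {"quantity"}, "quantity": {"amount"}}  # hypothetical table

def content_words(text):
    """Lower-cased words of the text with stop words removed."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

def are_aligned(word_1, word_2):
    """Two words align when they are equal or listed as synonyms."""
    return word_1 == word_2 or word_2 in SYNONYMS.get(word_1, set())

def alignment_score(text_1, text_2):
    """Aligned content words of both texts over all content words of both texts."""
    c1, c2 = content_words(text_1), content_words(text_2)
    if not c1 and not c2:
        return 0.0
    aligned_1 = sum(1 for w in c1 if any(are_aligned(w, v) for v in c2))
    aligned_2 = sum(1 for v in c2 if any(are_aligned(w, v) for w in c1))
    return (aligned_1 + aligned_2) / (len(c1) + len(c2))

score = alignment_score("measure the amount of vinegar", "the quantity of vinegar")
```

In this example four of the five content words are aligned, so the score is 0.8.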
Corpus Similarity
Corpus similarity uses a set of keywords from the reference answer
and looks for each such word in the student’s answer. If a word is
not found, the algorithm looks up its synonyms and again matches
them against the student’s answer. The number of matched words is
used as a feature.
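A sketch of this lookup with a synonym fallback is shown here. The small synonym dictionary is a hypothetical stand-in for a lexical resource such as WordNet, and the function name is chosen for illustration.

```python
# Hypothetical synonym table standing in for a resource such as WordNet
SYNONYM_TABLE = {"amount": {"quantity", "volume"}, "vinegar": {"acid"}}

def corpus_similarity(answer, keywords, synonyms=SYNONYM_TABLE):
    """Count keywords found in the answer directly or through a synonym."""
    tokens = set(answer.lower().split())
    matched = 0
    for keyword in keywords:
        if keyword in tokens:
            matched += 1  # direct match
        elif tokens & synonyms.get(keyword, set()):
            matched += 1  # matched through one of its synonyms
    return matched

matches = corpus_similarity(
    "the quantity of vinegar", ["amount", "vinegar", "temperature"]
)
print(matches)  # 2
```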
Jaccard
Jaccard similarity is used to calculate the similarity between the
vector representations A and B of the student answer and a
reference answer using the following formula:
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
The similarity between the student’s answer and each reference
answer is calculated, and the average score is returned as a
feature.
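The Jaccard feature can be sketched on word sets as follows, averaging over the reference answers as described. Representing each text as a set of lower-cased words is a simplifying assumption for this example.

```python
def jaccard(text_a, text_b):
    """Size of the word intersection divided by the size of the word union."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def jaccard_feature(answer, references):
    """Average Jaccard similarity between the answer and each reference."""
    scores = [jaccard(answer, reference) for reference in references]
    return sum(scores) / len(scores) if scores else 0.0

print(jaccard("a b c", "b c d"))  # 0.5
```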
Dice Similarity
Dice similarity calculates the similarity between the vector
representations A and B of the student answer and a reference
answer using the following formula:
\[
\mathrm{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|}
\]
The similarity between the student’s answer and each reference
answer is calculated, and the highest score is returned as a
feature.
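A corresponding sketch for the Dice feature uses the standard Dice coefficient on word sets and returns the highest score over the reference answers. The word-set representation and the function names are assumptions made for the example.

```python
def dice(text_a, text_b):
    """Twice the shared words divided by the summed sizes of both word sets."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0

def dice_feature(answer, references):
    """Highest Dice similarity between the answer and any reference."""
    return max((dice(answer, reference) for reference in references), default=0.0)

best = dice_feature("a b", ["a b", "c d"])
print(best)  # 1.0
```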
BLEU Score
BLEU score is a metric used to estimate the quality of a machine
translation. The quality estimate is calculated by comparing the
machine translation against a set of human reference translations.
The score is a way to benchmark the performance of the machine
translation and its quality. There have been recent attempts to
modify the BLEU scoring algorithm to fit different needs, for
example in textual entailment or in essay scoring [36]. In order to
use BLEU as features, we have used both an unmodified BLEU
algorithm from the Natural Language Toolkit and a modified BLEU
algorithm based on article [36], although some steps in the paper
have been disregarded or modified.
Ngram
The reference answers are first broken up into sentences, each of
them creditworthy. Every sentence is analysed independently. Using
the approximate string matching algorithm for Python called Ngram,
we calculate the similarity between the student’s answer and every
sentence. The three highest values are returned as a feature.
Key
The feature extracted is the number of unique keywords found by
comparing the answer to a selected list of keywords. The keywords
are selected by modifying the initial reference answers. The
modification is done by removing punctuation, stop words and
duplicated words. The remaining words are put into a list $k_i$.
This is done for all the reference sentences, resulting in a big
list K, represented as follows:
\[
K = \{k_1, k_2, \dots, k_n\}
\]
The extraction algorithm compares the student’s answer to each
element in the list K. The comparison is done by counting how many
of the keywords in $k_i$ were found in the student’s answer. The
results are stored in a list S.
\[
S = \{s_1, s_2, \dots, s_n\}, \quad s_i \in [0, |k_i|]
\]
The resulting numerical value returned is the sum of the three
highest values $s_i$ in S.
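The Key feature can be sketched in a few lines: each keyword list is compared with the words of the answer, and the three highest match counts are summed. The keyword lists and the answer in the example are invented for illustration.

```python
def key_feature(answer, keyword_lists):
    """Sum of the three highest per-list keyword match counts."""
    tokens = set(answer.lower().split())
    # One score per keyword list: how many of its keywords occur in the answer
    scores = [len(tokens & set(keywords)) for keywords in keyword_lists]
    return sum(sorted(scores, reverse=True)[:3])

keyword_lists = [
    ["vinegar", "amount"],
    ["temperature"],
    ["size", "container", "material"],
    ["rinse", "water"],
]
value = key_feature(
    "the amount of vinegar and the size of the container", keyword_lists
)
print(value)  # 4
```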
The reference answers are divided up into sentences. This is done
by selecting the nouns and their corresponding synonyms. Each
sentence is put into a list. Stemming is applied to all words in
order to get a higher similarity rate. Each sentence is tested
independently against the student’s answer. If the student’s answer
contains all the words present in the reference sentence, a credit
is given. The process is repeated for all sentences. The sum of all
obtained credits is returned as a feature, where a sum exceeding 3
is replaced by 3. An example of reference sentences is given in
Figure 7.
Figure 3: Reference Sentences
Feature Extraction
Feature extraction refers here to the computation of the features
for each student answer in the training set. Each feature mentioned
above was computed for every student answer in order to build a
matrix that a machine learning algorithm can use to make a
prediction.
For each student answer $a$ we have computed a feature vector and a
score represented as follows:
\[
\vec{x}(a) = [f_1, f_2, f_3, \dots, f_{16}]
\]
\[
y(a) \in [0, 3]
\]
The computed feature vectors are collected in the matrix X and
their corresponding scores in the label vector y, together forming
our training set.
Feature Selection
Since many features are implemented, some of them may be useless,
inaccurate or less predictive than others. Therefore, we need a
method to select the best performing features given the scores to
be predicted. By best we mean here the features with the highest
accuracy or the highest information gain with respect to the
prediction classes.
The process of choosing the best performing features is called
feature selection, as explained in the background chapter. The
feature selection algorithm implemented in the scikit-learn Python
library is used.
Machine Learning System Architecture
The validation data is used for parameter tuning, in order to
increase the system performance or change the system behaviour. It
is therefore of great importance that this set is isolated from the
test data to avoid wrongful results or bias. A step by step process
is shown in Figure 8.
4.3 Rule-based System Method
In this section a detailed overview of the final rule-based system
will be presented. Natural language processing is used to build the
rule system and the process will be explained in detail.
Rule-based System Architecture
The system is built in two main levels. In the first level, a
filter decides which pattern matcher should be used in the next
level. The output of the system is the predicted score of the
student’s answer. A step-by-step depiction of the system is shown
in Figure 9.
The filter level takes two inputs. The first input is the student’s
answer, which is processed as a list of sentences
$S = \{s_1, s_2, \dots, s_n\}$, and the second input is a list of
keywords $K$. The output is a list of sentences
$S' = \{s'_1, s'_2, \dots, s'_m\}$ for each trial, where the trials
are given by a list $T = \{t_1, t_2, \dots, t_k\}$. Each list $S'$
can contain nothing or all the sentences present in the original
student’s answer.
If a sentence contains a keyword, that sentence is attached to the
corresponding sentence list $S'$. Each pattern can be different;
the input of the pattern box is a list of sentences. The test is
performed in the same manner for every trial: if a sentence passes
the pattern box the output is true, otherwise it is false.
The output of each trial is summed up into a single value $p$. The
maximum number of points this system can give is 3. Therefore, the
final output of the system is $\min(p, 3)$, which ensures that the
final score is in the correct range [0,3].
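The two levels can be sketched as a keyword filter followed by a pattern test, with the summed trial outputs capped at 3. In this sketch each trial is a pair of a keyword set and a pattern predicate; the predicate `has_three_words` is a hypothetical stand-in for the part of speech pattern matcher, and the sentences are invented.

```python
def rule_based_score(sentences, trials):
    """Sum one point per passed trial and cap the total at 3."""
    points = 0
    for keywords, pattern in trials:
        # Filter level: keep only the sentences containing a keyword of this trial
        selected = [
            sentence for sentence in sentences
            if any(keyword in sentence.lower().split() for keyword in keywords)
        ]
        # Pattern level: a trial yields 0 or 1 however many sentences pass
        if any(pattern(sentence) for sentence in selected):
            points += 1
    return min(points, 3)

def has_three_words(sentence):
    """Hypothetical pattern standing in for a part of speech comparison."""
    return len(sentence.split()) >= 3

sentences = [
    "You need the amount of vinegar",
    "Also the temperature",
    "And the size of the container",
    "The material matters too",
]
trials = [
    ({"vinegar"}, has_three_words),
    ({"temperature"}, has_three_words),
    ({"size"}, has_three_words),
    ({"material"}, has_three_words),
    ({"time"}, has_three_words),
]
print(rule_based_score(sentences, trials))  # 3
```

Four trials pass in this example, but the cap keeps the score within the valid range.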
Figure 4: Machine Learning Process
Patterns and Rules
Following a simple rule-based method, two main techniques are used
to predict the score of the student answer. The first is the
keyword filter and the second is the grammatical similarity
comparator based on part of speech tagging.
Keyword selection
From the reference answers, six keywords are extracted. The
keywords selected are nouns and are important in order to
accurately answer the given question. Other words, such as
temperature and sample, are extracted from the analysis of the
dataset. Each keyword is extended with its corresponding synonyms
from WordNet [43]. Finally, each word in the student’s answer is
matched against the original word list in Figure 10 or their
synonyms taken from WordNet.
Figure 6: Keyword List
Part of Speech structure
The system used to filter each student’s answer is constructed in
the same manner. The difference is in the structure of the
sentences. This is handled by analysing the grammatical structure
of the sentence with the help of the part of speech tagging from
NLTK.
Figure 11 shows how the comparison is made between the student’s
answer and the reference example. If the part of speech tags of the
student’s answer and the reference sentence overlap during the
comparison, then the answer is worth a point. Each trial results in
either 0 or 1, no matter how many sentences are tested.
Figure 7: Part of Speech Pattern Matcher
To begin with, the student answer is broken up into sentences. A
possible representation of a student answer is
$A = \{s_1, s_2, \dots, s_n\}$. Every sentence is put into the
first level of the system, the pattern matcher. The pattern matcher
prepares the sentence for the next level by returning a list of
possible patterns to test the sentence on for further analysis:
\[
s_i \rightarrow \{p_1, p_2, \dots, p_m\}
\]
The second level analyses and compares the student’s answer against
the patterns selected by the pattern matcher. If the similarity is
strong enough, the student answer gets a point, otherwise no
point.
4.4 Combined Method
In the third part of this project, we have combined a machine
learning approach with a rule-based system for the short answer
scoring problem. The machine learning algorithm used is a
semi-supervised algorithm called self-learning. This algorithm is
selected due to its simplicity and its ability to learn from few
data samples. The rule-based system is the same as the system
described in section 4.3.
Combined attributes
The initial machine learning model used in the self-learning
algorithm is the same as the model presented in section 4.2 with 19
different features. The feature selection implemented in
scikit-learn is used in order to reduce dimensionality and increase
the accuracy of the classifier.
Self Training Implementation
As explained in the introductory chapter, there have been many
attempts to combine different techniques for short answer scoring.
We have used here a self-learning method combined with the
rule-based system described before.
The pseudo-code of the self-learning system is presented in Figure
12.
Figure 8: Self Learning Rule-based Algorithm
The condition in the if statement has been changed in order to use
the prediction from the rule-based system instead. Self-learning
usually uses the probability of its own prediction and labels the
unlabeled data only if the probability is above a predefined
threshold. Instead of using the classifier’s own confidence
estimate, the rule-based system is used to compensate for some of
the errors that the machine learning model can make due to the few
data examples used in training and the weakness of the model.
Here the unlabeled examples are labeled only if the prediction of
the rule-based system coincides with the prediction of the
classifier. It is possible that both the classifier and the
rule-based system are wrong about a given example, but the overall
benefit of combining them is expected to be higher. It also
alleviates the need to label those samples beforehand.
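The agreement condition can be sketched as a small self-training loop. This is not the thesis implementation: the classifier is replaced by a toy one-dimensional nearest-centroid model, the rule-based system by a hypothetical threshold rule, and all names and data are invented for the illustration. Only the control flow, labeling an example when classifier and rule agree, follows the description above.

```python
def self_train(fit, labeled, unlabeled, rule_predict, max_rounds=10):
    """Grow the labeled set with examples on which classifier and rule agree."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        predict = fit(labeled)
        agreed, remaining = [], []
        for x in unlabeled:
            label = predict(x)
            if label == rule_predict(x):  # agreement replaces the confidence test
                agreed.append((x, label))
            else:
                remaining.append(x)
        if not agreed:
            break  # nothing new to learn from
        labeled.extend(agreed)
        unlabeled = remaining
    return fit(labeled), labeled, unlabeled

def fit_nearest_centroid(labeled):
    """Toy classifier: predict the label whose mean value is closest."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    centroids = {y: total / count for y, (total, count) in sums.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def toy_rule(x):
    """Hypothetical rule-based prediction."""
    return 1 if x >= 6 else 0

model, labeled, remaining = self_train(
    fit_nearest_centroid, [(1.0, 0), (9.0, 1)], [2.0, 8.0, 5.5], toy_rule
)
```

In this run the examples 2.0 and 8.0 are added to the labeled set, while 5.5 stays unlabeled because the classifier and the rule disagree on it.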
5 Results
The presentation of the results of the experiments in this chapter
will follow the overall structure of the thesis work. The result
for the machine learning approach will be presented first, followed
by the presentation of the rule-based system result and finally the
result for the combination of the two methods will be
presented.
In each part, the model evaluation result will be presented and the
result for the model optimization will be shown whenever it is
applicable.
5.1 Machine Learning Results
Three machine learning classifiers are used to build the machine
learning model, and their respective accuracy and other metrics are
measured. Only the logistic regression results are presented here.
The results of Naive Bayes and Random Forest are available in the
appendix.
Feature correlation and importance
The feature vector plays a big role in the accuracy of the machine
learning model. As described in section 2, it is of great
importance to find a set of features with high predictive power in
order to build an accurate model. Even if not all features are
predictive, an analysis of their respective correlations can help.
Figure 13 shows the correlation between all features used in the
machine learning model.
Reading the correlation matrix shown in Figure 13, we can infer
that four features are not very correlated with the rest of the
features: Bingo score, align, key and LDA are either weakly or
negatively correlated with the others. This corresponds to the
parameter found when searching for the best number of features to
select using grid search. The best number of features to select
varies between 15 and 16, which agrees with 19 - 4 = 15 (total
number of features - bad features = best features).
Figure 9: Feature Correlation Matrix
Logistic Regression
The logistic regression algorithm implemented in the scikit-learn
library in Python is used to implement our machine learning model.
Before using the feature vector, we have used the feature selection
algorithm implemented in the scikit-learn library to increase the
performance of the model using the validation set.
The dataset is divided into three parts, where the train part is
70% of the data, the test part is 20% and the validation part is
only 10%. The result is presented below.
Feature Selection
The feature selection algorithm is used to find the best features
to use in the model in order to increase its accuracy. The
parameter found is 15 features, as shown in Figure 14.
Figure 10: Feature Selection on Logistic Regression
Confusion Matrix
The confusion matrix of the logistic regression model is shown in
Figure 15. The matrix is normalized.
Figure 11: Logistic Regression Confusion Matrix
Numerical Results
The accuracy and cross-validation score for the logistic regression
model are shown in Table 3. Table 4 shows the precision, recall and
f1-score of the model. Support is the number of samples the numbers
are based on.
Table 3: Logistic Regression Accuracy
Table 4: Logistic Regression Test score
ROC curve
Figure 16 depicts the ROC curve for each individual class given by
the implemented logistic regression model. The area under each
curve is also shown in the figure.
Figure 12: Logistic Regression ROC-curve
5.2 Rule-based System Results
We present here the results of the rule-based system, whose
structure is explained in section 4.3.
Confusion Matrix
Figure 17 shows the confusion matrix of the rule-based
system.
Figure 13: Rule-based Confusion Matrix
Numerical Results
The accuracy, precision, recall and f1-score are shown in Table
5.
Table 5: Rule-based Test Scores
5.3 Combined Results
The dataset is divided in four parts. Two different partitions are
tested. The first partition uses 20% for training, 70 % for testing
and 10% for validation, the second partition uses 30% for training,
60% for testing and 10 % for validation.
Confusion Matrix
The confusion matrix and combined results are shown in Figure 18
and Figure 19.
Figure 14: 20% for training, 70 % for testing and 10% for
validation
Figure 15: 30% for training, 60 % for testing and 10% for
validation
Comparison
This section compares simple logistic regression with the
self-learning rule-based system. Table 6 and the following tables
show the numerical results of the different systems for the
partitions used.
Table 6: 20% for training, 70 % for testing and 10% for
validation
Table 7: 30% for training, 60 % for testing and 10% for
validation
Table 8: 60 % for training, 30 % for testing and 10 % for
validation
6 Analysis
In this section, the results presented in section 5 will be
analysed. The structure of this section will follow the general
structure used in this thesis so far. The analysis of the machine
learning system will be presented in section 6.1. In section 6.2,
the rule-based system will be discussed thoroughly and finally in
section 6.3 the hybrid system will be analysed and commented on.
6.1 Machine Learning Analysis
The machine learning method achieves an accuracy of 55 % on the
test dataset, which represents 20 % of the total samples. This
accuracy is obtained using feature selection and parameter
optimization. Given the number of features implemented, the
accuracy is very low. One possible reason is a poor implementation
of particular features. Figure 14: Feature Selection Graph shows
that the increase in accuracy as the number of features increases
is very small. Figure 13: Feature Correlation Matrix shows that
most features have a positive correlation with each other. This
indicates that the features reinforce each other’s correct
predictions but also the incorrect ones.
Another reason behind the low accuracy is how the reference answers
are written. The initial set of reference answers is only a subset
of all correct answers, and the space of all correct answers cannot
be anticipated. With a closed set of reference answers, it would be
less difficult to extend the reference answers with synonyms and
paraphrases.
One factor that made the scores difficult to predict was the fact
that a student answer with a score of 3 could be entirely different
from another student answer that also got a score of 3. This
incoherency makes it difficult for the machine learning method to
determine the class boundaries for the different scores, because
the features picked mostly measure similarity to the reference
answers and rely on this similarity to differentiate the answers.
Recasting the dataset as three binary questions would then likely
improve the method.
Figure 16: Logistic Regression ROC-curve indicates that scores 0
and 3 are much easier to classify than scores 1 and 2. The reason
behind this behaviour is that scores 3 and 0 are very different
from each other and therefore easier to recognize. If the initial
problem had been changed to a binary classification problem, it
would have been much easier to distinguish probably correct answers
from probably incorrect answers with high accuracy. Therefore, a
binary discriminative classifier could already be implemented with
high accuracy in order to reduce the number of exams that need to
be corrected by a human. The only answers corrected by humans would
then be those having a high probability of being correct.
Regarding Table 4: Logistic Regression Test score, we can see that
the precision and recall results completely agree with the result
from the ROC curve. The answers with scores 0 and 3 have higher
precision and recall values. The precision here is a measure of how
well the classifier could label answers worth score 0 as score 0,
and the recall is the ability not to label the answers in the wrong
class.
In summary, we can infer from the result that the decision boundary
between neighbouring scores, such as 3 and 2 or 1 and 0, is very
narrow, as the confusion matrix in Figure 15 shows. From the
confusion matrix, it can be observed that 76 % of the answers
belonging to score 0 were correctly classified as score 0 and 12 %
of them were wrongly labeled as belonging to class 1 instead. Those
observations show the difficulty of correctly determining which
score an answer is worth when the decision has to be made between
very close classes such as 0 and 1 or 2 and 3. The same behaviour
can be observed for the other neighbouring classes.
6.2 Rule-based Analysis
The rule-based system is intended as an independent solution to the
short answer scoring problem. It uses no training data. The
accuracy is calculated for the whole dataset and is 45 %. This is
lower than the machine learning method but still quite good, given
that no training data is used. The rule-based method is used for
benchmarking purposes and for helping the machine learning
algorithm to label new examples.
The rule-based system suffers from the same problem that is faced
when constructing a machine learning system for short answer
scoring: the difficulty of writing beforehand a set of acceptable
reference answers or patterns against which the students' answers
can be corrected.
Regarding our dataset, the set of all acceptable reference answers
is not a closed set. If a thorough analysis of the different
possible answers had been possible, and if the set had been more
closed, it would have been much easier to write rules and patterns
for the rule-based system.
6.3 Combined Analysis
Using 20% of the data for training and 70% for testing, the machine
learning method outperforms the self-learning rule-based method on
the test data by only 2.5%. However, under cross-validation, the
self-learning rule-based method has a higher accuracy.
Using 30% of the data for training and 60% for testing, the machine
learning method still outperforms the self-learning rule-based
method, now by 5.3% on the test data. There is a large decrease in
test accuracy, but the cross-validation accuracy of the
self-learning rule-based approach is much higher than that of the
machine learning method.
The higher cross-validation accuracy of the self-learning
rule-based method may be due to the data selected during the
self-learning approach. The data selected by the self-learning
rule-based method are more reliable, which leads to a reduction of
noise in the dataset.
One reason why the accuracy does not improve much is the limitation
that the rule-based system puts on the self-training. The input
from the rule-based system should have had a weight attached to it,
which could then have been examined further during testing.
We must also consider how the two methods process the training
data. The self-training method uses only half of the labels in the
training set. In many real-world applications, reducing the
required amount of labeled data has a positive impact, as it
reduces the need to annotate data. This assumes that labeling the
data is a time-consuming task.
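One possible reading of the self-learning loop discussed in this
section is sketched below. It is not the implementation used in
the experiments. The nearest-centroid classifier is a stand-in for
the machine learning model, the rule-based scorer is a toy
function, and the rule that a pseudo-label is accepted only when
the classifier is confident and the rule-based score agrees is an
assumption, as are all names, thresholds and data.

```python
def fit_centroids(X, y):
    """Stand-in base classifier: one mean feature vector per score."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}


def predict(model, x):
    """Return (score, confidence); needs at least two classes in the model.
    Confidence is a crude distance ratio between 0.5 and 1.0."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, centre)) ** 0.5, label)
        for label, centre in model.items()
    )
    (d1, best), (d2, _) = ranked[0], ranked[1]
    confidence = 1.0 - d1 / (d1 + d2) if d1 + d2 else 0.5
    return best, confidence


def self_train(X_lab, y_lab, X_unlab, rule_score, threshold=0.8, max_rounds=5):
    """Grow the labeled set with pseudo-labels the rule-based scorer confirms."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model = fit_centroids(X_lab, y_lab)
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict(model, x)
            # Assumed acceptance rule: confident AND rule-based agreement.
            if confidence >= threshold and rule_score(x) == label:
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:  # nothing new was labeled, so stop early
            break
    return fit_centroids(X_lab, y_lab), X_lab, y_lab, pool


def rule_based_score(x):
    """Toy rule standing in for the rule-based system."""
    return 1 if x[0] + x[1] > 1.0 else 0


X_lab, y_lab = [[0.0, 0.0], [1.0, 1.0]], [0, 1]
X_unlab = [[0.1, 0.0], [0.9, 1.0], [0.5, 0.5], [0.2, 0.1]]
model, X_all, y_all, leftover = self_train(X_lab, y_lab, X_unlab,
                                           rule_based_score)
```

Attaching a weight to the rule-based input, as suggested above,
would amount to replacing the hard agreement test with a weighted
combination of the two scores.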
7 Conclusion
We have shown that the machine learning model can be combined with
the rule-based system by using a semi-supervised algorithm called
self-learning.
The goal of this thesis was to reduce the amount of labeled data
used for training and at the same time increase the accuracy of the
system. Regarding the reduction of the amount of labeled data
needed for training, we can claim that the combined method used a
smaller amount of labeled data than the machine learning method.
However, the results give no evidence that a small amount of
training data may increase the accuracy of the model.
Although there is no evidence that a small amount of training data
increases the accuracy of the combined model, the results presented
in Table 8 show that the combined method performs better than the
machine learning approach when a considerable amount of unlabeled
training data is used.
Further investigation of using a small amount of labeled data
together with a large amount of unlabeled training data is needed,
because labeled data are difficult to obtain: a human corrector is
required. Furthermore, the methods used here may spark interest in
this subject and help further research.
7.1 Future Work
The features implemented in this thesis work need to be
investigated further, because the accuracy obtained is not very
good. These features affect the accuracy of the machine learning
methods on the test data.
There is a large amount of noise in the dataset, which is to be
expected because the answers are written by children. A good method
for cleaning the data of noise and grammatical errors may increase
the accuracy of the model and should be investigated.
Furthermore, a better thesaurus of synonyms must be found in order
to create paraphrases of the student answers, which may also help
the rule-based system. The rule-based system can also be extended
by using an amount of training data in order to write better rules
and to learn rules from the data.
Regarding the combination of the rule-based system with
self-learning, the method has some drawbacks. Although the
self-learning algorithm is easier to combine with the rule-based
system, it is prone to errors because the model is trained using
the same features even when the rule-based system is used. An
extension of the self-learning algorithm called co-training can be
used in order to build a more robust model.
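A simplified co-training loop is sketched below to illustrate the
idea. Two classifiers are trained on two different views of the
features, and whichever view is more confident about an unlabeled
answer supplies its pseudo-label to the shared training set. The
nearest-centroid classifier, the split of the feature vector into
two halves, the threshold and the toy data are all assumptions made
for illustration; this was not implemented in the thesis.

```python
def fit_centroids(X, y):
    """Stand-in classifier: one mean feature vector per score."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}


def predict(model, x):
    """Return (score, confidence in [0.5, 1.0]); needs two or more classes."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, centre)) ** 0.5, label)
        for label, centre in model.items()
    )
    (d1, best), (d2, _) = ranked[0], ranked[1]
    return best, (1.0 - d1 / (d1 + d2) if d1 + d2 else 0.5)


def co_train(labeled, unlabeled, view_a, view_b, threshold=0.8, max_rounds=5):
    """Simplified co-training over two feature views."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        X = [x for x, _ in labeled]
        y = [label for _, label in labeled]
        model_a = fit_centroids([view_a(x) for x in X], y)
        model_b = fit_centroids([view_b(x) for x in X], y)
        remaining, added = [], 0
        for x in pool:
            label_a, conf_a = predict(model_a, view_a(x))
            label_b, conf_b = predict(model_b, view_b(x))
            # The more confident view supplies the pseudo-label.
            if conf_a >= conf_b:
                label, conf = label_a, conf_a
            else:
                label, conf = label_b, conf_b
            if conf >= threshold:
                labeled.append((x, label))
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if not added:  # neither view could label anything new
            break
    return labeled, pool


# Hypothetical views: first two features versus last two features.
def first_half(x):
    return x[:2]


def second_half(x):
    return x[2:]


seed = [([0.0, 0.0, 0.0, 0.0], 0), ([1.0, 1.0, 1.0, 1.0], 1)]
unlabeled = [[0.1, 0.0, 0.5, 0.5], [0.5, 0.5, 0.9, 1.0], [0.5, 0.5, 0.5, 0.5]]
grown, leftover = co_train(seed, unlabeled, first_half, second_half)
grown_labels = [label for _, label in grown]
```

Because the two views rely on different features, an answer that
one view cannot separate may still be labeled by the other, which
is the property that self-learning with a single feature set lacks.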
References
[1] J. Lockwood, “Handbook of Automated Essay Evaluation Current
Applications and New Directions Mark D. Shermis and Jill Burstein
(eds.) (2013) New York: Routledge. Pp. 194 ISBN: 9780415810968,”
Writing & Pedagogy, vol. 6, no. 2, pp. 437–441, 2014.
[2] D. Higgins et al., “Is getting the right answer just about
choosing the right words? The role of syntactically-informed
features in short answer scoring,” arXiv [cs.CL],
04-Mar-2014.
[3] S. Pulman, “Automarking: using computational linguistics to
score short, free-text responses,” 2003.
[4] S. T. Alotaibi and A. A. Mirza, “Hybrid approach for automatic
short answer marking,” in Southwest decision sciences forty-third
annual meeting.-2012, 2010.
[5] S. Burrows, I. Gurevych, and B. Stein, “The Eras and Trends of
Automatic Short Answer Grading,” International Journal of
Artificial Intelligence in Education, vol. 25, no. 1, pp. 60–117,
2014.
[6] S. Jing, O. C. Santos, J. G. Boticario, C. Romero, M.
Pechenizkiy, and A. Merceron, “Automatic Grading of Short Answers
for MOOC via Semi-supervised Document Clustering,” in EDM, 2015,
pp. 554–555.
[7] D. R. P. Marn, “Automatic evaluation of users’ short essays by
using statistical and shallow natural language processing
techniques,” Master’s thesis, Universidad Autónoma de Madrid.
http://www.ii.uam.es/dperez/tea.pdf, 2004.
[8] A. Håkansson, “Portal of Research Methods and Methodologies for
Research Projects and Degree Projects,” in Proceedings of the
International Conference on Frontiers in Education: Computer
Science and Computer Engineering (FECS); Athens, 2013, pp.
1–7.
[9] S. Roy, H. S. Bhatt, and Y. Narahari, “An Iterative Transfer
Learning Based Ensemble Technique for Automatic Short Answer
Grading,” arXiv [cs.CL], 16-Sep-2016.
[10] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet
Allocation,” J. Mach. Learn. Res., vol. 3, no. Jan, pp. 993–1022,
2003.
[12] E. Friedman-Hill, Jess in Action: Rule-based Systems in Java.
Manning Publications, 2003.
[13] C. Grosan and A. Abraham, Intelligent Systems: A Modern
Approach. Springer Science & Business Media, 2011.
[14] T. Mitchell, T. Russell, P. Broomhead, and N. Aldridge,
“Towards robust computerised marking of free-text responses,”
2002.
[15] S. G. Pulman and J. Z. Sukkarieh, “Automatic short answer
marking,” in Proceedings of the second workshop on Building
Educational Applications Using NLP - EdAppsNLP 05, 2005.
[16] R. Siddiqi and C. Harrison, “A systematic approach to the
automated marking of short-answer questions,” Multitopic
Conference, 2008. INMIC, 2008.
[17] V. Iosifidis and E. Ntoutsi, “Large Scale Sentiment Learning
with Limited Labels,” in Proceedings of the 23rd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
2017, pp. 1823–1832.
[18] I. Guyon and A. Elisseeff, “An Introduction to Variable and
Feature Selection,” J. Mach. Learn. Res., vol. 3, no. Mar, pp.
1157–1182, 2003.
[19] N. Van-Tu and L. Anh-Cuong, “Improving Question Classification
by Feature Extraction and Selection,” Indian J. Sci. Technol., vol.
9, no. 17, 2016.
[20] P. Harrington, Machine learning in action, vol. 5. Manning
Greenwich, CT, 2012.
[21] P. Wallisch, Neural Data Science: A Primer with MATLAB and
Python. Elsevier Science Publishing Company, 2017.
[22] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997, pp.
177–180.
[23] “How the random forest algorithm works in machine learning,”
Dataaspirant, 22-May-2017. [Online]. Available:
http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/.
[Accessed: 18-May-2018].
[24] W. M. K. Trochim and J. P. Donnelly, “Research methods
knowledge base,” 2001.
[25] “What is Incremental model- advantages, disadvantages and when
to use it?,” 17-Apr-2018. [Online]. Available:
http://istqbexamcertification.com/what-is-incremental-model-advantages-disadvantages-and-when-to-use-it/.
[Accessed: 18-May-2018].
[26] “Welcome to Python.org,” Python.org. [Online]. Available:
https://www.python.org/. [Accessed: 18-May-2018].
[27] “scikit-learn: machine learning in Python — scikit-learn
0.19.1 documentation.” [Online]. Available:
http://scikit-learn.org/stable/index.html. [Accessed: 18-May-2018].
[28] “Natural Language Toolkit — NLTK 3.3 documentation.” [Online].
Available: https://www.nltk.org/. [Accessed: 18-May-2018].
[30] “sklearn.metrics.confusion_matrix — scikit-learn 0.19.1
documentation.” [Online]. Available:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html.
[Accessed: 18-May-2018].
[31] “sklearn.metrics.precision_score — scikit-learn 0.19.1
documentation.” [Online]. Available:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html.
[Accessed: 18-May-2018].
[33] “sklearn.metrics.f1_score — scikit-learn 0.19.1
documentation.” [Online]. Available:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
[Accessed: 18-May-2018].
[34] A. P. Bradley, “The use of the area under the ROC curve in the
evaluation of machine learning algorithms,” Pattern Recognit., vol.
30, no. 7, pp. 1145–1159, Jul. 1997.
[35] “The Hewlett Foundation: Short Answer Scoring.” [Online].
Available: https://www.kaggle.com/c/asap-sas. [Accessed:
23-Apr-2018].
[36] F. Noorbehbahani and A. A. Kardan, “The automatic assessment
of free text answers using a modified BLEU algorithm,” Comput.
Educ., vol. 56, no. 2, pp. 337–345, 2011.
[37] Yanling Li, Y. Li, and Y. Yan, “New similarity measures for
automatic short answer scoring in spontaneous non-native speech,”
in International Conference on Automatic Control and Artificial
Intelligence (ACAI 2012), 2012.
[38] F. S. Pribadi, T. B. Adji, and A. E. Permanasari, “Automated
Short Answer Scoring using Weighted Cosine Coefficient,” in 2016
IEEE Conference on e-Learning, e-Management and e-Services (IC3e),
2016.
[39] Omran and Omran, “AUTOMATIC ESSAY GRADING SYSTEM FOR SHORT
ANSWERS IN ENGLISH LANGUAGE,” Journal of Computer Science, vol. 9,
no. 10, pp. 1369–1382, 2013.
[40] A. Islam and D. Z. Inkpen, “Semantic similarity of short
texts,” in Current Issues in Linguistic Theory, 2009, pp.
227–236.
[41] T. K. Landauer, P. W. Foltz, and D. Laham, “An introduction to
latent semantic analysis,” Discourse Process., vol. 25, no. 2–3,
pp. 259–284, 1998.
[42] D. Jurafsky and J. H. Martin, Speech and Language Processing:
An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition, 2000.
[43] G. A. Miller, “WordNet: A Lexical Database for English,”
Commun. ACM, vol. 38, no. 11, pp. 39–41, Nov. 1995.