
Imputing Missing Values Using the Expectation-Maximization Algorithm

Thulare Evans : 1799336

Supervisor : Dr. Ritesh Ajoodha

A research project submitted to the

DEPARTMENT OF COMPUTER SCIENCE AND APPLIED MATHEMATICS

UNIVERSITY OF THE WITWATERSRAND, JOHANNESBURG

SOUTH AFRICA

In partial fulfilment of the requirements for the degree of Master of Science (MSc)

e-Science

25 SEPTEMBER 2018


Declaration

I, Thulare Evans Molahlegi, declare that this research is my own original work and has not been submitted before for any other degree, part of a degree, or examination at the University of the Witwatersrand or any other university.


Dedication

I dedicate this research project to Almighty God my creator, my source of inspiration. He

has been the source of my strength throughout this research project.

I also dedicate this work to my family and friends. A special feeling of gratitude to my loving and caring single parent, Mrs Nelly Thulare, who has encouraged me all the way and taught me that even the largest task can be accomplished if it is done one step at a time.

To my brother Mr Mahlatse Thulare, I will always appreciate all you have done.


Acknowledgements

I would like to thank the Almighty for His showers of blessings throughout my work to

complete the research project successfully.

I would also like to express my deepest and sincere gratitude to my research project supervisor, Dr. Ritesh Ajoodha (Ritso), for giving me the opportunity and providing me with

priceless guidance throughout. His energetic personality, vision, sincerity and motivation

have deeply inspired me. It was a great privilege and honour to work and study under his

supervision.

I am extremely grateful to my parents, Mrs Thulare SN and Mrs Thulare MM for their

love, understanding, prayers and sacrifices for educating and preparing me for my future.

The support of the DST-CSIR National e-Science Postgraduate Teaching and Training Platform (NEPTTP) towards this research is hereby acknowledged. Opinions expressed and

conclusions arrived at, are those of the author and are not necessarily to be attributed to the

NEPTTP.


Abstract

The study aims to evaluate the performance of the Expectation-Maximization (EM) algorithm when estimating missing values and to observe how the estimated distributions diverge from the true distributions, using the Kullback-Leibler (KL) divergence, on a generated data set of 40 000 observations simulated from a Bayesian network (BN). A BN was used to generate the data because it can precisely control the correlation between variables in the data set. Missing at Random (MAR) was assumed. Different percentages of the data set were hidden and then estimated, to see how well the EM algorithm performs as the percentage of missing values increases.

It was found that the EM algorithm does not perform well when a large percentage of the values in a data set is missing. The KL divergence showed that the more missing values we estimate, say more than 50 per cent of the data set, the more we lose the structure of the data, and the EM algorithm produces estimates that are less reliable. The KL divergence plot showed that the EM algorithm can perform well when estimating missing values for less than 50% of the data set; estimating a massive proportion of missing values, say 80 or 90 per cent of the data, gives misleading and inaccurate results.

Keywords: Expectation-Maximization algorithm, Kullback-Leibler divergence, Bayesian networks, Missing at Random.


Contents

Declaration I

Dedication II

Acknowledgements III

Abstract IV

List of Figures VII

List of Tables VIII

List of Abbreviations IX

1 Introduction 1

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.2 Expectation-Maximization (EM) Algorithm . . . . . . . . . . . . 3

1.1.3 Kullback-Leibler (KL) divergence . . . . . . . . . . . . . . . . . 4

1.2 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5 Structure of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Literature Review 8

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.1 Missing Data Mechanisms . . . . . . . . . . . . . . . . . . . . . 9

2.3 Methods for Handling Missing Data . . . . . . . . . . . . . . . . . . . . 10

2.3.1 Listwise and Pairwise Deletion . . . . . . . . . . . . . . . . . . . 11

2.3.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . 12

2.3.3 Single and Multiple Imputation . . . . . . . . . . . . . . . . . . 13

2.3.4 Sampling Importance Resampling . . . . . . . . . . . . . . . . . 15


2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Methodology 17

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Ground truth Bayesian network . . . . . . . . . . . . . . . . . . . . . . 18

3.3 Sampling Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.4 Separating your complete dataset into copies with missing components . . 19

3.5 Learning Bayesian networks from the missing datasets . . . . . . . . . . 19

3.5.1 Parameter learning . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.5.2 Model learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.6 Evaluating the learned Bayesian network using KL divergence . . . . . . 20

3.7 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Results and Discussion 21

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.2 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 Conclusion and Recommendation 25

5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.2 Recommendations for Future Work . . . . . . . . . . . . . . . . . . . . 25


List of Figures

3.1 The diagram shows the BN structures . . . . . . . . . . . . . . . . . . . 18

4.1 The diagram shows the BN structure for generating the data set . . . . . . 21

4.2 The diagram constitutes 10% of the authentic data and 90% of estimated data 22

4.3 The figure shows the KL divergence with missing data percentage . . . . 23

4.4 The figure shows the KL divergence with missing data percentage and 50

observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


List of Tables

4.1 Sample of a data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.2 Sample of the data set with 90% of the values missing . . . . . . . . . . 22


List of Abbreviations

EM : Expectation Maximization

ML : Maximum Likelihood

KNN : K-Nearest Neighbour

KNNimpute : K-Nearest Neighbour impute

KL-Divergence : Kullback-Leibler Divergence

MAR : Missing at Random

MCAR : Missing Completely at Random

MNAR : Missing Not at Random

BNs : Bayesian Networks

MI : Multiple Imputation

SI : Single Imputation

MVNI : Multivariate Normal Imputation

MICEs : Multiple Imputation by Chained Equations

RF : Random Forest

SVD : Singular Value Decomposition

SVDimpute : Singular Value Decomposition impute

MSE : Mean Squared Error

cGDI : Column-wise Guided Data Imputation

SIR : Sampling Importance Resampling

SRMI : Sequential Regression Multivariate Imputation

GA : Genetic Algorithm

PGMs : Probabilistic Graphical Models

DAG : Directed Acyclic Graph

MAP : Maximum a Posteriori

IID : Independent and Identically Distributed


Chapter 1

Introduction

Missing data are a part of almost all research, and we all have to decide how to deal with them from time to time, whether by imputing them or ignoring them. The most popular and simplest method of handling missing data is to ignore the attributes with missing observations [Azadeh et al. 2013]. The problem of missing data is common, and determining the right approach to mitigate it is often a major challenge for machine learning practitioners working with real-world data, since many statistical models and machine learning algorithms rely on a complete dataset.

Data problems with missing values and latent variables are common in practice. Missing data are variables without observations. Most statistical procedures eliminate entire cases whenever they encounter missing data in any variable included in the analysis [Ramezani et al. 2017]. In surveys, missing data can have many causes: the person taking the survey may not understand a question and leave it unanswered, may refuse to answer some questions due to privacy concerns, may give answers that are not relevant to the questions, or may lose interest along the way while completing the questionnaire. Every question that has no answer is regarded as a missing data point. In research, missing data may also occur due to human error (for example, forgetting to record a certain measurement).

It is well known that naive approaches that ignore the missing data, such as complete case analysis, can lead to seriously biased parameter estimates and can also affect the data quality and thereby the final knowledge discovered from it. An advisable way to deal with missing data is imputation before processing, that is, to estimate and fill in the missing data using approaches such as Expectation Maximization (EM), regression imputation, K-Nearest Neighbour (KNN) imputation, mean imputation, hot-deck imputation, and so on. To apply any of these methods, one must first understand the nature of the missing data.


1.1 Background

Recently, Balakrishnan et al. [2017] highlighted that the Expectation-Maximization (EM) algorithm is a maximum-likelihood tool that is widely used to estimate missing values in a dataset, and that there is now a very rich literature on its behaviour (e.g., [Wu 1983], [Xu and Jordan 1996], [Neal and Hinton 1998], [Hastie et al. 2009]).

1.1.1 Bayesian Networks

Griffiths and Yuille [2008] have shown that Bayesian Networks (BNs), also known as 'belief networks', 'causal networks' or just Bayes nets, belong to the family of Probabilistic Graphical Models (PGMs) that can be used to build models from data or to represent multivariate probability distributions. They are used in many areas such as machine learning, text mining, natural language processing, speech recognition, signal processing, bioinformatics, error-control codes, medical diagnosis, weather forecasting, and cellular networks. BNs combine principles from graph theory, probability theory, computer science and statistics. A BN can be viewed as a graph made up of nodes and directed links between them, where nodes represent variables and a link is added between two nodes to indicate that one directly influences the other. Every BN is a Directed Acyclic Graph (DAG). A network is defined as B = 〈G, Θ〉, where G is the DAG whose nodes X1, X2, ..., Xn represent random variables and Θ represents the parameters of the network. Θ contains the parameters θxi|πi = PB(xi | πi), one for each value xi of Xi given its set of parents πi in G:

\[
P_B(X_1, X_2, X_3, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid \pi_i) = \prod_{i=1}^{n} \theta_{X_i \mid \pi_i} \tag{1.1}
\]

The goal of this paper is to generate data using a BN and to use that BN to capture the relationships between the variables, so that when we estimate the missing values using the EM algorithm we can exploit the conditional (dependent) probabilities. For example, if X1 is missing or hidden, we can estimate it from P(X1 | X2, X3, X4, X5).
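As a concrete illustration, the sketch below builds a small, hypothetical three-variable binary network and computes such a conditional query by enumerating the factorization of Equation 1.1. The structure and probability tables are invented for the example and are not the network used in this study.

```python
# Hypothetical three-variable binary network X1 -> X2 and X1 -> X3, used only
# to illustrate Equation 1.1 and the conditional query above. States are coded
# 1/2 as in the generated data; all CPT values are made up for the example.
p_x1 = {1: 0.4, 2: 0.6}                                       # P(X1)
p_x2_given_x1 = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.2, 2: 0.8}}    # P(X2 | X1)
p_x3_given_x1 = {1: {1: 0.5, 2: 0.5}, 2: {1: 0.9, 2: 0.1}}    # P(X3 | X1)

def joint(x1, x2, x3):
    # Equation 1.1: product of each node's probability given its parents.
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x1[x1][x3]

def posterior_x1(x2, x3):
    # P(X1 | X2 = x2, X3 = x3): enumerate the hidden X1 and normalize.
    scores = {x1: joint(x1, x2, x3) for x1 in (1, 2)}
    total = sum(scores.values())
    return {x1: score / total for x1, score in scores.items()}

print(posterior_x1(x2=1, x3=2))   # distribution over the hidden X1
```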

1.1.2 Expectation-Maximization (EM) Algorithm

The EM algorithm belongs to the family of maximum-likelihood approaches to estimating missing data. Unlike methods such as complete case analysis or hot-deck imputation, the EM algorithm does not 'fill in' the missing values directly; instead, it estimates the parameters by maximizing the complete-data log-likelihood function, and then uses those estimated parameters to estimate the missing values in the data set. It estimates the parameters by iterating between the E-step and the M-step [Dempster et al. 1977].

In the E-step, the log-likelihood function of the parameters given the data is calculated. It is assumed that the data are partitioned into observed data and missing data. Let X be the complete data, so that X = (Xobs, Xmis). The distribution of X depends on an unknown parameter θ, i.e.

\[
P(X \mid \theta) = P(X_{obs}, X_{mis} \mid \theta) = P(X_{obs} \mid \theta)\, P(X_{mis} \mid X_{obs}, \theta) \tag{1.2}
\]

Equation 1.2 can be written as the likelihood function below.

\[
L(\theta \mid X) = L(\theta \mid X_{obs}, X_{mis}) = c\, L(\theta \mid X_{obs})\, P(X_{mis} \mid X_{obs}, \theta) \tag{1.3}
\]

where c is a constant arising from the missing data mechanism; it can be ignored when working under the MAR assumption, and since we assume MAR here, c is ignored. Taking the log of both sides of Equation 1.3 gives the following equation.

\[
\ell(\theta \mid X) = \ell(\theta \mid X_{obs}) + \log P(X_{mis} \mid X_{obs}, \theta) + \log c \tag{1.4}
\]

where l(θ|X) is the complete-data log-likelihood, l(θ|Xobs) is the observed-data log-likelihood, log c is a constant, and P(Xmis|Xobs, θ) is the predictive distribution of the missing data given θ [Schafer 1997].

Since Xmis is unknown and cannot be used directly, we take a current guess of θ, denoted θ(t), and compute the expectation of l(θ|X) with respect to P(Xmis|Xobs, θ(t)), i.e.


\[
\begin{aligned}
Q(\theta \mid \theta^{(t)}) &= E\left[\ell(\theta \mid X) \mid X_{obs}, \theta^{(t)}\right] \\
&= \int \ell(\theta \mid X)\, P(X_{mis} \mid X_{obs}, \theta^{(t)})\, dX_{mis} \\
&= \ell(\theta \mid X_{obs}) + \int \log P(X_{mis} \mid X_{obs}, \theta)\, P(X_{mis} \mid X_{obs}, \theta^{(t)})\, dX_{mis}
\end{aligned}
\tag{1.5}
\]

At the M-step of the EM algorithm, θ is obtained by maximizing the expectation of the complete-data log-likelihood from the E-step. Mathematically,

\[
\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}) \tag{1.6}
\]

The EM algorithm starts with an initial guess θ(0) and iterates between the E-step and the M-step until it converges, that is, until successive estimates of θ are nearly identical [Dong and Peng 2013].

In summary, the EM algorithm is an iterative method for finding the maximum likelihood (ML) or maximum a posteriori (MAP) estimate for models with missing values. The following four steps describe how EM works; a minimal code sketch follows the list.

1. Initialization step: obtain an initial estimate θ(0); this can be a random initialization.

2. Expectation step: the parameters θ(t−1) from the previous step are held fixed, and the expected values of the missing values (or, in most cases, functions of the expected values of the missing values) are computed.

3. Maximization step: given the expectations obtained in the E-step, new parameter values θ(t) are estimated that maximize the expected complete-data likelihood.

4. Exit condition: if the likelihood of the observations has barely changed, the algorithm terminates; otherwise it returns to the E-step and iterates.
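To make the loop concrete, the following is a minimal, hypothetical sketch of EM for a two-variable binary network X1 → X2 in which some X1 entries are missing: the E-step computes the posterior probability of the hidden X1 values and the M-step re-estimates the parameters from the resulting expected counts. It is only an illustration under these simplifying assumptions, not the implementation used in this study.

```python
import numpy as np

def em_binary_pair(data, n_iter=50, tol=1e-8):
    # data: array of shape (n, 2) with values 0/1; column 0 (X1) may be np.nan.
    p_x1 = 0.5                                 # initial guess for P(X1 = 1)
    p_x2_given_x1 = np.array([0.5, 0.5])       # P(X2 = 1 | X1 = 0), P(X2 = 1 | X1 = 1)
    for _ in range(n_iter):
        x1, x2 = data[:, 0], data[:, 1].astype(int)
        # E-step: posterior probability that the hidden X1 equals 1 in each row.
        lik1 = p_x1 * np.where(x2 == 1, p_x2_given_x1[1], 1 - p_x2_given_x1[1])
        lik0 = (1 - p_x1) * np.where(x2 == 1, p_x2_given_x1[0], 1 - p_x2_given_x1[0])
        resp = np.where(np.isnan(x1), lik1 / (lik1 + lik0), x1)
        # M-step: maximum-likelihood estimates from the expected counts.
        new_p_x1 = resp.mean()
        new_p_x2 = np.array([
            np.sum((1 - resp) * x2) / np.sum(1 - resp),
            np.sum(resp * x2) / np.sum(resp),
        ])
        converged = abs(new_p_x1 - p_x1) + np.abs(new_p_x2 - p_x2_given_x1).sum() < tol
        p_x1, p_x2_given_x1 = new_p_x1, new_p_x2
        if converged:
            break
    return p_x1, p_x2_given_x1

# Example usage with a few rows in which X1 is unobserved.
toy = np.array([[1, 1], [0, 0], [np.nan, 1], [1, 1], [np.nan, 0], [0, 0]], dtype=float)
print(em_binary_pair(toy))
```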

1.1.3 Kullback-Leibler (KL) divergence

In this section we take a look at a way of comparing two probability distributions, called the Kullback-Leibler (KL) divergence. In statistics, we use the KL divergence to measure how much information we lose when we approximate one distribution with another. The KL divergence is a natural distance measure from a probability distribution p(x) to an estimated probability distribution q(x). It is commonly used in pattern recognition and in the fields of speech and image recognition [Hershey and Olsen 2007].

In machine learning, the KL divergence is also called the relative entropy between two probability distribution functions p(x) and q(x):

\[
D(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx \tag{1.7}
\]

Below are the properties of the KL divergence.

1. D(p||p) = 0: the divergence of a distribution from itself is zero (self-similarity).

2. D(p||q) = 0 if and only if p = q (self-identification).

3. D(p||q) ≥ 0 for all p, q (positivity).

It must be noted that the larger the difference between p(x) and q(x), the higher the value of D(p||q); if there is little difference between p(x) and q(x), the value of D(p||q) is small. Finally, the KL divergence is not a metric, since in general D(p||q) ≠ D(q||p). The importance of the KL divergence lies in its ability to quantify how far off the estimate of a distribution may be from the true distribution. The objective of this project is to check how close q(x) remains to p(x) as we increase the proportion of missing values in the data.
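For discrete distributions, the integral in Equation 1.7 becomes a sum over states. The short sketch below is a minimal illustration of that computation; the probability vectors are made up for the example.

```python
import numpy as np

# Discrete KL divergence between a true distribution p and an estimated
# distribution q over the same finite set of outcomes (both given as
# probability vectors). A minimal sketch, not the exact code used here.
def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, None)   # avoid log(0)
    mask = p > 0                                          # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# The divergence grows as the estimate drifts away from the truth.
p_true = [0.7, 0.3]
print(kl_divergence(p_true, [0.7, 0.3]))   # ~0.0: identical distributions
print(kl_divergence(p_true, [0.5, 0.5]))   # larger: estimate is off
```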

1.2 Aims and Objectives

This research aims to present empirical results which demonstrate the ability of the EM algorithm to recover different percentages of missing values.

The objectives of this paper are:

∗ To evaluate the performance of the EM algorithm during the estimation of missing values.

∗ To determine how the algorithm performs when estimating missing data at different percentages of hidden values.

∗ To use the KL divergence to quantify how far the estimated distribution is from the original distribution as the percentage of missing data increases.

∗ To recognise the types and patterns of missing data, and to identify when the EM algorithm can be used and when it is unbiased.

For the above objectives to be met, discrete data are generated randomly and Missing at Random (MAR) is assumed, since values are removed at random. A total of 40 000 observations is generated. We hide points in the data set, estimate the hidden values, and then compare the true values with the estimated ones.

The KL divergence is applied to compare the distribution which generated the complete dataset with the distribution obtained from learning on, say, 10 percent of the data; the resulting value tells us how far we are from the true distribution of the dataset. We continue learning from different percentages of the data and calculate the KL divergence for each data set after hiding the corresponding percentage.

1.3 Problem Definition

Handling missing values is important in research in order to analyse the data successfully. If researchers fail to handle missing values, they might end up drawing inaccurate conclusions. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data if it is not taken into consideration. Papers such as Papageorgiou et al. [2018], Agarwal and Tangirala [2017], Zhang [2006] and Zarate et al. [2006] have highlighted the importance of recovering missing data. To address this problem, one should understand the nature of the missing data before trying to impute or delete the missing values. This research project might help researchers when imputing missing values using the EM algorithm: it gives an understanding of how the EM algorithm performs for different percentages of missing data.


1.4 Research Question

The research questions of this research project are:

∗ How does the EM algorithm perform when estimating missing values?

∗ How does the EM algorithm perform on a small data set?

∗ What happens if we use the EM algorithm to estimate a massive proportion of missing values, about 90% of the data?

1.5 Structure of the Report

In this chapter (Chapter 1), the introduction, background of the study, aims and objectives, problem definition and research questions are outlined in order. A literature review of missing data and of the methods used to handle it is given in Chapter 2. Chapter 3 presents the methodology followed by this research. The analysis and discussion of the study are given in Chapter 4, and finally the conclusions and recommendations are presented in Chapter 5.


Chapter 2

Literature Review

2.1 Introduction

Missing data occur when no data value is stored for a variable in an observation or dataset. Missing data occur commonly and affect the conclusions that are drawn from a dataset [Ghahramani and Jordan 1995]. Missing data can be caused by the non-response of a respondent, a respondent not understanding a question, incorrect measurement, human error, and so on. Every survey question that has no answer is a missing data point. There is no perfect way to deal with missing data.

2.2 Missing Data

Abdella and Marwala [2005] show that missing values in a dataset refer to the case where some components of the dataset are not available for all data items in the database, or may not even be defined. Missing values create problems in many applications, particularly in fields that depend on accurate data.

Several research studies have concentrated on the impact of missing values in a dataset and on their management. Treating missing values is considered an important step in the analysis, since it improves the effectiveness of the knowledge discovery process [Nancy et al. 2017]. In fields that are highly dependent on data for decision making, missing data are still a problem that needs to be solved [Zha et al. 2013].

According to Matta et al. [2017], missing values are a common feature of longitudinal studies and, if not taken into consideration, can reduce statistical power and lead to biased parameter estimates. In the context of longitudinal studies, the statistical literature uses the terms incomplete data and missing data interchangeably.


2.2.1 Missing Data Mechanisms

Abdella and Marwala [2005] illustrate the methods that have been used to handle missing values in areas such as statistics, mathematics and other disciplines. The right way to handle missing data depends on how the data points have gone missing. The three types of missing data mechanisms are: Missing Completely at Random (MCAR), Missing at Random (MAR), and non-ignorable missingness. MCAR occurs if the probability of a missing value for a variable X is not related to the value of X or to any other variable in the dataset; this happens only when the missingness does not depend on the variables of interest. MAR arises if the probability of missing data on a variable X depends on other variables but not on X itself. Finally, non-ignorable missingness occurs if the probability of missing data on X is related to the value of X itself; this mechanism is more difficult to approximate and model than the other two.

Recently, Vazifehdan et al. [2018] showed that real-world datasets often include missing values for various reasons. This is a major challenge when using machine learning approaches, since most learning algorithms cannot work with missing data. Imputation of missing values is very useful for obtaining unbiased predictions with machine learning tools. Vazifehdan et al. [2018] and Schafer and Graham [2002] indicate three types of missing values that should be considered when using an imputation method:

1. Missing at Random (MAR)

2. Missing Completely at Random (MCAR)

3. Missing Not at Random (MNAR)

Checking the missing-data mechanism is equivalent to testing the randomness of the data. When investigating the missing-data mechanisms, one may test for correlation between missingness and other variables in the dataset. If the correlation coefficients are low, this indicates MCAR; high correlation coefficients reflect MAR [Musil et al. 2002]. When the MCAR mechanism is rejected, MAR or MNAR is assumed. Karangwa et al. [2016] describe methods that have been developed to model MAR and MCAR data, namely single imputation methods such as mean imputation, regression imputation and interpolation, and multiple imputation based methods such as Multivariate Normal Imputation (MVNI) and Multiple Imputation by Chained Equations (MICEs).

2.3 Methods for Handling Missing Data

According to Petrozziello and Jordanov [2017], Random Forest (RF) is known as an efficient algorithm for classification; however, its performance depends on the quality of the dataset. In many fields, common methods for dealing with missing values make use of estimation and imputation approaches whose efficiency is tied to assumptions about the features of the data. The strategy of data imputation before classification is preferred, where missing values are estimated and filled in using information from the existing dataset. The types of imputation mentioned in that paper are mean imputation, hot-deck imputation, K-Nearest Neighbours imputation (KNNimpute), regression imputation, Bayesian estimation and Expectation Maximization (EM). The RF algorithm is designed for a complete dataset, yet incomplete datasets are common in classification problems. Experimental results on different datasets showed that the RF algorithm is an outstanding method for solving the classification problem on an incomplete dataset. Royston and others [2004] show that hot-deck imputation may perform poorly when many rows of data have at least one missing value. Troyanskaya et al. [2001] found that KNNimpute appears to be a more robust and sensitive method for missing value estimation than Singular Value Decomposition impute (SVDimpute); both methods outperform the commonly used row average method.

Medical data are likely to contain missing values for reasons such as human error, different interpretations and administrators' faults. In clinical diagnosis, machine learning and data mining are common technologies used for analysis. However, machine learning methods applied to data with a high volume of missing values lead to a high error rate, because they cannot estimate a high volume of missing data properly due to their univariate nature. The issue to be taken into consideration is that such datasets often include missing values, which reduces diagnostic accuracy [Nekouie and Moattar 2018]. The risk of deleting the data is that critical information might be lost, which will significantly affect modelling and analytical results [Scheffer 2002].


Hlalele [2009] highlighted that imputing missing values has been an area of interest, especially in the statistics community, because of the bias missing values introduce into results. Missing values have led to the development of models and methods that impute missing data. Once imputed, the missing values are substituted by estimated values so that the dataset can be analysed using techniques that require a complete dataset. It should be expected that missing data will have an impact on data analysis and decision making. The most common way of dealing with the problem of missing data is imputation of the missing information, and many machine learning techniques have been employed to handle missing data points. A hybrid missing data imputation model was developed to impute missing values in datasets and was refined to increase its accuracy.

2.3.1 Listwise and Pairwise Deletion

Matta et al. [2017] show that many studies facing the challenge of missing values choose to omit subjects with missing data completely. This method is usually called listwise deletion or complete case analysis, and it may result in an unacceptable level of bias. It is tempting to simply remove the missing values on the grounds that there is no way of knowing what they could have been, but missing data are theoretically challenging, particularly for data analysis. Matta et al. [2017] present three general methods of analysing incomplete datasets.

1. Likelihood-based (including Bayesian) methods - semi-parametric methods that allow the analyst to be specific about the parametric model through estimating equations.

2. Multiple Imputation (MI) - a method that handles missing data in multivariate analysis. Rubin (1977) was the first to propose multiple imputation for missing data.

3. Weighting - the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about the appropriate distributions; after weighting, no observation should have a weight of zero.

Vazifehdan et al. [2018] mention that removing missing data points, or using the listwise deletion method, is acceptable only when the proportion of missing values is small.


Karangwa et al. [2016] and Bailey et al. [1994] show that a traditional way to handle missing values in a dataset is to eliminate them from the analysis through listwise deletion. This strategy is supported by most statistical packages, such as STATA, SPSS, and SAS. Problems appear at the analysis stage when there is a lot of missing data: when the proportion of missing data is high, listwise deletion reduces the sample size, and as a result a sample that is not representative of the population is obtained, leading to reduced power of the statistical tests, biased parameter estimates, and large standard errors. The amount of missing data has a negative impact on the analysis when the missing values are simply excluded. Generally, whatever the degree of missingness, problems associated with missing data will always arise.
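The sample-size effect described above is easy to see in a small, hypothetical illustration: with five variables and roughly 10% of values missing in each, listwise deletion discards around 40% of the rows. The data and numbers below are invented for the example.

```python
import numpy as np

# Illustration of listwise deletion: any row containing at least one missing
# value is dropped, which shrinks the sample dramatically when several
# variables each have a modest amount of missingness.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
X[rng.random(X.shape) < 0.10] = np.nan            # ~10% of entries missing per variable
complete_rows = ~np.isnan(X).any(axis=1)          # rows with no missing value at all
print(X.shape[0], "->", int(complete_rows.sum())) # roughly 59% of the rows survive
```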

As highlighted by Enders [2001], the analysis of missing data used to revolve around listwise and pairwise deletion methods. Software packages have more recently implemented methods other than deletion for treating missing values.

2.3.2 Expectation Maximization

Missing values in discrete datasets can be imputed using Bayesian networks, and the Expectation Maximization (EM) algorithm is one of the best methods for doing so. The advantage of EM is that it can be trained with missing data, meaning it deals with missing values quite naturally. In recent years, machine learning and deep learning have improved the imputation of missing values in many fields. Dempster et al. [1977] show that if each iteration of an algorithm consists of an expectation step followed by a maximization step, then the algorithm is an Expectation Maximization (EM) algorithm. The main purpose of the EM algorithm is to provide an iterative computation of the maximum likelihood estimate for data in which some variables are unobserved [Wu 1983; McLachlan and Krishnan 2007].

Enders [2001] indicated that the EM algorithm is an iterative two-step method in which the missing values are assigned and the unknown parameters estimated. In the first step, usually called the E-step, the missing values are substituted by the conditional expectation of the missing data points given the observed data; in practice this means that the missing data are replaced by predictions calculated from regression equations. In the second step, called the M-step, the maximum likelihood estimates of the mean vector and covariance matrix are calculated using the values computed in the E-step, as if the data were complete. The covariance and regression coefficients calculated in the M-step are used to recompute the missing data estimates in the next E-step, and the process iterates until the difference between covariance matrices in successive M-steps falls below some specified convergence criterion [Enders 2001; Borman 2004]. It should be noted that, used in this way, the EM algorithm does not directly provide estimates of linear model parameters such as regression coefficients; it is used to calculate maximum likelihood estimates of a mean vector and covariance matrix, and the missing values in the original dataset are then estimated and imputed using regression equations generated from the new covariance matrix [Dellaert 2002].
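A minimal sketch of this regression-based E-step/M-step cycle is given below for a bivariate normal model in which only the second variable has missing entries. The data and simplifications are illustrative assumptions for this report, not code from the cited works.

```python
import numpy as np

def em_bivariate_normal(x1, x2, n_iter=100):
    # x1 is fully observed; x2 may contain np.nan entries.
    x1 = np.asarray(x1, float)
    x2 = np.asarray(x2, float)
    miss = np.isnan(x2)
    mu = np.array([x1.mean(), np.nanmean(x2)])
    cov = np.cov(x1[~miss], x2[~miss])          # initial estimates from complete cases
    for _ in range(n_iter):
        # E-step: regression of X2 on X1 gives the conditional expectation of
        # each missing value, plus the conditional variance for second moments.
        beta = cov[0, 1] / cov[0, 0]
        cond_var = cov[1, 1] - beta * cov[0, 1]
        x2_filled = np.where(miss, mu[1] + beta * (x1 - mu[0]), x2)
        ex2_sq = np.where(miss, x2_filled ** 2 + cond_var, x2 ** 2)
        # M-step: ML estimates of the mean vector and covariance matrix from
        # the completed (expected) sufficient statistics.
        mu = np.array([x1.mean(), x2_filled.mean()])
        s11 = np.mean(x1 ** 2) - mu[0] ** 2
        s12 = np.mean(x1 * x2_filled) - mu[0] * mu[1]
        s22 = np.mean(ex2_sq) - mu[1] ** 2
        cov = np.array([[s11, s12], [s12, s22]])
    return mu, cov

# Example usage on simulated data with ~30% of X2 missing at random.
rng = np.random.default_rng(1)
a = rng.normal(0, 1, 500)
b = 0.8 * a + rng.normal(0, 0.5, 500)
b[rng.random(500) < 0.3] = np.nan
print(em_bivariate_normal(a, b))
```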

Understanding the difference between MCAR and MAR is critical. Listwise and pairwise deletion can only yield unbiased estimates when the missing data are MCAR, which in practice is difficult to meet. Maximum likelihood, on the other hand, yields unbiased estimates under both MCAR and MAR. It has been shown that even when the data are MCAR, maximum likelihood methods give more efficient parameter estimates than listwise and pairwise deletion [Allison 2001], and maximum likelihood is regarded as giving less biased estimates under MAR than other methods. There are currently three maximum likelihood estimation approaches that can handle missing data: the multiple-group approach, full information maximum likelihood estimation, and the EM algorithm [Allison 2001; Enders 2001].

EM alternates between an E-step, which calculates an expectation of the likelihood by treating the missing variables as if they were observed, and a maximization (M-step). The main challenges of the EM algorithm are the computational complexity of the E-step and the M-step and its slow convergence: the algorithm can take more iterations than necessary to converge [Sundararajan 2016].

2.3.3 Single and Multiple Imputation

The study of Tai et al. [2016] aimed to compare two data imputation methods and to provide a framework for evaluating the performance of imputed data. The two methods are the Single Imputation (SI) technique and the Multiple Imputation (MI) technique. A complete dataset was identified and randomly split into training and testing datasets. Using the training set, regression-based single imputation was used to estimate the length of stay in intensive care, and multiple imputation was applied in the same way. The length-of-stay distributions and the cross-validation metric (root mean squared error) were compared to determine which imputation performs better; MI performed better than single imputation.

MI is one of the most widely applicable methods for dealing with missing values in multivariate analysis [Abdella and Marwala 2005]. Little and Rubin [2014] list the idea of MI as follows (a sketch of the pooling steps is given after the list):

1. Impute the missing values using a proper model that includes random variation.

2. Repeat this step n times (normally 3 to 5 times), creating n complete datasets.

3. Execute the desired analysis on each dataset using standard complete-data methods.

4. Calculate the mean of the parameter estimates across the n samples to obtain a single point estimate.

5. Calculate the standard errors by averaging the squared standard errors of the n estimates, calculating the variance of the estimates across samples, and finally combining the two quantities.
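Steps 4 and 5 (pooling the n sets of results) can be sketched as follows, assuming the per-dataset estimates and standard errors are already available. This is an illustrative sketch of the combining rules rather than code from the cited studies.

```python
import numpy as np

def pool_mi(estimates, std_errors):
    # estimates, std_errors: one value per imputed dataset (n datasets).
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    n = len(estimates)
    pooled_estimate = estimates.mean()                  # step 4: mean of the estimates
    within_var = np.mean(std_errors ** 2)               # mean of the squared errors
    between_var = np.var(estimates, ddof=1)             # variance across the n samples
    total_var = within_var + (1 + 1 / n) * between_var  # combine the two quantities
    return pooled_estimate, np.sqrt(total_var)

# Example usage with five hypothetical imputed-analysis results.
print(pool_mi([2.1, 1.9, 2.0, 2.2, 2.05], [0.30, 0.28, 0.31, 0.29, 0.30]))
```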

Multiple Imputation (MI) and Maximum Likelihood (ML) perform better than other methods even when the missing values in the dataset are non-ignorable, and the two methods appear to be the best choice for handling missing data in most cases. The researchers found that their proposed model was more than 95% accurate [Abdella and Marwala 2005].

Ramezani et al. [2017] show that most previous papers worked around the problem of missing values by ignoring them, removing all records with missing values from the training set. The researchers introduced a novel intelligent system for handling diabetic data with missing values; this method used multiple imputation to increase the accuracy of the results. Missing values in high-dimensional data cause inaccurate results in classification techniques, and the handling of missing values is a critical issue in the mathematical modelling of data. Walczak and Massart [2001] mention that multiple imputation is another statistical technique for analysing incomplete datasets and obtaining missing values; orthogonal transformation techniques were used to reduce the dimensionality of the input data.

Petrozziello and Jordanov [2017] investigated data imputation techniques for the pre-processing of datasets with missing values. Real-world datasets contain missing values due either to sensor failures or to human error. Petrozziello and Jordanov [2017] and Nelwamondo [2008] mention that dealing with missing data is a very important step in data cleaning, since machine learning, statistical analysis and other processes require complete datasets. The overall accuracy is measured by evaluating the estimates of the missing values on a dataset. To tackle the problem of missing values, Column-wise Guided Data Imputation (cGDI) was proposed; cGDI selects the best model from a multitude of imputation techniques through a learning process on the known data. In experiments on 13 publicly available datasets, cGDI performed better than the alternatives.

2.3.4 Sampling Importance Resampling

In science and engineering, missing data are frequently encountered. Wang et al. [2017] focus on the estimation of parameters in estimating equations with non-ignorable missing data. Sampling Importance Resampling (SIR) was proposed to calculate the conditional expectation for non-respondents. It is well known that ignoring missing values can lead to seriously biased parameter estimates if the distribution of respondents differs from that of non-respondents. For MAR data, methods such as the parametric likelihood method, imputation methods, and the inverse probability weighted method are suggested. SIR, however, was found to be excessive, especially for high-dimensional cases.

2.4 Conclusion

Missing data cannot easily be avoided; the best one can do is to reduce their occurrence through careful trial design and conduct. Sensitivity analyses should be part of the primary analysis of the data, and considering the sensitivity of the assumptions about the missing data mechanism should be a compulsory component of the analysis. Listwise and pairwise deletion should not be applied to handle missing data unless the amount of missingness is small. Single imputation can produce biased results if the proportion of missing data is larger than 5% [Stuart et al. 2009]. Hot-deck imputation has been suggested as a better practical solution to missing data problems [Myers 2011]. MI, which imputes each missing value multiple times, is a powerful and flexible technique for dealing with missing data; its advantages are that it is easy to implement and appropriate for large datasets.

The EM algorithm is a favoured method for estimating the parameters of Bayesian networks in the presence of incomplete data. EM is a useful tool for building statistical and mathematical analyses, it works well for probabilistic models, and it is relatively simple to implement. Many articles have highlighted that the EM algorithm produces unbiased results.


Chapter 3

Methodology

3.1 Introduction

This chapter describes the methodology followed by this research. The sections below explain how the research was conducted.

The figure below (Figure 3.1) shows the steps followed to arrive at the results. First, a BN structure named 'bnet' was created, as shown in the figure, and used to generate a random dataset; the dataset has 4 rows and 10 000 columns, so the total number of data points generated is 40 000. Starting from these 40 000 data points, we first hide everything, showing 0% of the complete data set, and name the result 'bnet0'; this implies that in the structure (distribution) bnet0 there is 100% estimated data. Secondly, we show only 10% of the data and hide 90%, store it, create another BN from that remaining 10% of the data, and name it bnet10. We continue in this way until we hide nothing, i.e. we show 100% of the data and name the structure bnet100. For each of the bnets represented in the diagram, we estimate the missing values using the EM algorithm. We want to see how well or badly the model performs as we keep losing data, and we will see this by plotting the KL divergence.


Figure 3.1: The diagram shows the BN structures

The last step is to calculate the KL divergence for each Bayesian network from bnet0 to bnet100. We calculate their KL divergence using their estimated (imputed) values and see how different the resulting distributions are, compared with the original distribution that created the complete data set.

3.2 Ground truth Bayesian network

Ground truth is a term used in different fields that refers to information provided by direct observation, that is, to the process of collecting valid or provable data. In this study, the ground truth Bayesian network is the network used to generate the complete data set.

3.3 Sampling Data set

Sampling a data set is the process of generating a data set from a Bayesian network. In this study we create a BN structure with four correlated variables and sample the data set from it, as sketched below.
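The following is a minimal, hypothetical sketch of forward (ancestral) sampling from a chain-structured binary network; the actual structure and conditional probability tables used in this study are different.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain-structured binary network X1 -> X2 -> X3 -> X4 with
# invented conditional probability tables, shown only to illustrate sampling.
p_x1 = 0.6                          # P(X1 = 1)
p_child = {0: 0.2, 1: 0.8}          # P(child = 1 | parent) for every edge

def sample_row():
    x1 = rng.random() < p_x1
    x2 = rng.random() < p_child[int(x1)]
    x3 = rng.random() < p_child[int(x2)]
    x4 = rng.random() < p_child[int(x3)]
    # Encode as 1/2 to match the True/False coding used in the generated data.
    return [2 if v else 1 for v in (x1, x2, x3, x4)]

data = np.array([sample_row() for _ in range(10_000)])  # 10 000 samples of 4 variables = 40 000 points
print(data.shape, data[:3])
```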


3.4 Separating your complete dataset into copies with missing components

After generating the data set, we split it into copies: in each copy some of the data are kept and some are hidden. At the first point we hide 100% of the data and retain 0% (this means we are left with an empty data set); we then estimate 100% of the values in the data set, store them in a Bayesian structure and create a distribution for that data set. The next step is to hide 90% of the data and retain only 10% of the original data set; we estimate the 90% of missing values, store them in a Bayesian structure and again create its own distribution. We continue with the same procedure from hiding 100% of the data down to hiding 0% of the data. When we hide 0% of the data, we retain 100% of the original data set, meaning we hide nothing: we take the sample as it is and create its own Bayesian structure and distribution. A sketch of the hiding step is given below.
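The sketch below illustrates one way of producing such copies, assuming entries are hidden uniformly at random and marked as NaN; the helper function and stand-in data are illustrative, not the exact code used in the study.

```python
import numpy as np

def hide_values(data, percent_missing, seed=0):
    # Replace a randomly chosen percentage of entries with NaN (hidden).
    rng = np.random.default_rng(seed)
    masked = data.astype(float).copy()
    n_hidden = int(round(masked.size * percent_missing / 100))
    flat_idx = rng.choice(masked.size, size=n_hidden, replace=False)
    masked.flat[flat_idx] = np.nan          # hidden entries become "not observed"
    return masked

complete = np.ones((4, 10_000))             # stand-in for the sampled data set
copies = {p: hide_values(complete, p) for p in range(0, 101, 10)}
print({p: int(np.isnan(c).sum()) for p, c in copies.items()})  # hidden count per copy
```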

3.5 Learning Bayesian networks from the missing datasets

This step, also called structural learning, is the process of using the observed data to learn the links or paths of the BN. The structure of the network is determined by marginal and conditional independence tests. In this section we discuss the methods used when learning BNs. Learning a BN means learning both the conditional probability distributions (the parameters) and the graphical model of dependencies (the structure). The objective of this research project is to find a posterior distribution over models, given the observed data set.

3.5.1 Parameter learning

Once the structure of the network has been determined, we determine the parameters. Parameter learning is the process of using the data to determine the distributions of the BN. In this research project we used a Bayesian network that applies the EM algorithm to perform maximum likelihood (ML) estimation of the BN parameters; these learned parameters are then used to estimate the missing values in the data set. To learn the parameters from an incomplete data set, we had to turn to numerical optimization techniques. We use the Bayesian approach to learn the parameters from a posterior distribution on the parameter space, obtained by applying Bayes's rule, and we then use the expectation of the parameters with respect to the posterior distribution.

3.5.2 Model learning

In the fully Bayesian approach, the entire posterior distribution over models is found by integration and the application of Bayes's rule. For model selection we used the maximum a posteriori (MAP) model, which avoids computing the normalizing constant.

3.6 Evaluating the learned Bayesian network using KL divergence

After learning the Bayesian network, which consists of parameter learning and model learning, we evaluate the learned structure using the KL divergence. We stored the learned BNs for each percentage of missing data, from 100% down to 0%, and each learned BN is used to create its own distribution. After creating the different distributions, the final step is to compare the estimated distributions with the original distribution that created the data set.
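A sketch of this final evaluation step is shown below. Because the learned distributions depend on the EM runs, they are replaced here by placeholder distributions that drift towards uniform as the missing-data percentage grows; only the KL computation and plotting pattern are meant to be illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def kl(p, q, eps=1e-12):
    # Discrete KL divergence D(p || q) over a finite set of joint states.
    p, q = np.asarray(p, float), np.clip(np.asarray(q, float), eps, None)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

true_joint = np.array([0.40, 0.10, 0.15, 0.35])   # toy "ground truth" over 4 states
uniform = np.full(4, 0.25)
percentages = np.arange(0, 101, 10)
# Placeholder "learned" distributions: blend of the truth and uniform noise.
learned = {pc: (1 - pc / 100) * true_joint + (pc / 100) * uniform for pc in percentages}

divergences = [kl(true_joint, learned[pc]) for pc in percentages]
plt.plot(percentages, divergences, marker="o")
plt.xlabel("Percentage of missing data")
plt.ylabel("KL divergence from true distribution")
plt.show()
```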

3.7 Motivation

Bayesian networks are flexible probabilistic models whose use is growing rapidly in many fields, including genetics and genomics [Bae et al. 2016]. BNs were chosen in this research project to generate the data because they can precisely manage the correlation of the variables in the data set. The EM algorithm was chosen because of its strong performance when estimating missing values in a data set [Stephens and Scheet 2005], and, lastly, Solanki et al. [2006] show that the KL divergence is the best method to use when comparing two distributions.


Chapter 4

Results and Discussion

4.1 Introduction

This chapter describes the analysis of the imputation of missing values and the comparison of distributions using randomly generated data with 40 000 observations. The complete data set was used to intentionally hide values at different percentages, from 0% to 100%, and the missing values were then estimated using the EM algorithm for each percentage of missingness. The resulting distributions were drawn and compared with the distribution that generated the complete (original) data set.

4.2 Data analysis

A Bayesian network (BN) structure was first created to randomly generate the data to be analysed; it generated 40 000 observations. We used a BN structure to generate the data because we wanted to precisely control the relationships, or correlations, in the simulated data. The generated data set is discrete and consists of 4 variables (rows) and 10 000 observations (columns). Each value follows a Bernoulli distribution, since the data are composed of the numbers 1 and 2. Figure 4.1 below shows how the data were created.

Figure 4.1: The diagram shows the BN structure for generating the data set

Table 4.1 below shows a sample of the data set generated by the Bayesian network. It consists only of the values 2 and 1, which in our case represent True and False respectively (2 is True, 1 is False).

Table 4.1: Sample of a data set

      Col1  Col2  Col3  ...  Col10000
  1    1     1     2    ...     1
  2    1     1     2    ...     2
  3    2     2     2    ...     1
  4    1     2     2    ...     2

From the generated data, the next step was to hide some of the data points according to the percentage of missing values. Table 4.2 shows the copy that retains 10% of the complete data set (the rest is hidden), where NaN simply means the value is not observed.

Table 4.2: Sample of the data set with 90% of the values missing

      Col1  Col2  Col3  ...  Col10000
  1    NaN    1    NaN  ...     1
  2     1    NaN    2   ...    NaN
  3     2    NaN   NaN  ...     1
  4    NaN    2     2   ...    NaN

Data were revealed from 0% to 100%: we start with an empty data set, showing 0% of the data and trying to estimate the whole data set, and proceed until we show 100% of the data set. From Table 4.2, we estimate the 90% of missing values using the EM algorithm, fill in the missing values with the estimated ones, and then generate a new BN structure. Figure 4.2 represents the new BN structure with its 90% estimated values.

Figure 4.2: The diagram constitutes 10% of the authentic data and 90% of estimated data

The last step was to compare the distribution for each missing percentage with the original distribution using the KL divergence. A distribution was drawn for each BN structure created, including the first BN structure that generated the data. The distribution of the second BN structure, which consists of 90% estimated data, was compared with the distribution that generated the complete data set, and the KL divergence was found to be extremely high, meaning that the distributions diverge strongly (they differ a lot). We continued comparing the distributions until we compared the distribution that generated the complete data set with the one that has 0% estimated missing values, for which we obtained a very small (close to zero) KL divergence, implying that the distributions are almost the same.

Figure 4.3: The figure shows the KL divergence with missing data percentage

The plot shows the decrease in KL divergence as we learn from more observed data. The KL divergence shows that if about 40% of the values in a data set are latent, the EM algorithm can estimate them well: we observe from the plot that when we have 60% of the data set, the estimated distribution is already converging towards the original distribution. It is also clear that the more values we estimate, the more we lose the structure of the data. From the plot we observe that if about 50% or more of the data is missing, the EM algorithm may fail to estimate the missing values truly; in particular, if only 10% of the data is available (meaning 90% of the data is missing), the structure of the data will be completely different.


4.3 Discussion

It is observed that even if we use a sophisticated method of estimating missing data such as the EM algorithm, it will not produce reliable results if there is a large amount of missing or latent values, such as 50% or more of the data set. We should have at most about 40%, and certainly less than 50%, missing values, because if more than 50% of the values are missing the estimated values will be completely different from the true values. The figure below shows that with very few observations, for example around 250, the EM algorithm performed so badly that the results can hardly be interpreted; it produced good results when we had about 40 000 observations.

Figure 4.4: The figure shows the KL divergence with missing data percentage and 50 observations


Chapter 5

Conclusion and Recommendation

5.1 Conclusion

The results of this study show that the EM algorithm can produce reliable estimates of missing values when the data set is large and the percentage of missing data is not too high, say less than 50 per cent. With about 30 000 or more observations, and with less than roughly 50% of the values missing, it performs well. The KL divergence confirms that the more missing values we estimate in the data set, say more than 50 per cent, the more of the structure of the data we lose. We observed on the KL divergence plot that the EM algorithm performs well when the estimated missing values amount to less than 50% of the data set. If one estimates a massive share of missing values, say 80 or 90 per cent of the data, the results will be misleading and inaccurate. It is therefore advisable not to use the EM algorithm when a data set has a massive proportion of missing values.

5.2 Recommendations for Future Work

Firstly, for future work, instead of simply imputing or estimating missing values in the data set with the EM algorithm, one should first check how much data is missing. This is necessary for methods such as the EM algorithm because it does not perform well for every kind of missing data; it has its limitations. One should also take the missingness mechanisms, such as MAR, MNAR and MCAR, into consideration. Further investigation is also needed into why the EM algorithm performs badly as the percentage of missing data increases.
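A first step in that direction can be as simple as reporting the proportion of missing entries before any imputation is attempted; a minimal sketch is given below (the function name missing_fraction is illustrative).

import numpy as np

def missing_fraction(X):
    """Overall and per-column fraction of NaN entries in a data matrix."""
    mask = np.isnan(np.asarray(X, dtype=float))
    return mask.mean(), mask.mean(axis=0)

# If the overall fraction is close to or above 0.5, the results in Chapter 4
# suggest that EM-based imputation should be used with caution, if at all.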

The effect of the variables in the data set should also be investigated, as well as the use of another distribution, such as the Multinomial distribution, instead of the Bernoulli distribution. A further consideration for future work is the use of more complex independence assumptions and many more samples.

