Detecting Conversion of Mild Cognitive Impairment to Alzheimer's Disease
A Comparison Between Classifiers
Maria Inês Barbosa Silva Crujo Uva
Thesis to obtain the Master of Science Degree in
Biomedical Engineering
Supervisor: Prof. Maria Margarida Campos da Silveira
Examination Committee
Chairperson: Prof. Raul Daniel Lavado Carneiro Martins
Supervisor: Prof. Maria Margarida Campos da Silveira
Members of the Committee: Dr. Durval Campos Costa
Prof. João Miguel Raposo Sanches
November 2014
Our human compassion binds us the one to the other - not in pity or patronizingly, but as human beingswho have learnt how to turn our common suffering into hope for the future.
Nelson Mandela
Acknowledgments
This work was supported by Fundação para a Ciência e Tecnologia through the Alzheimer's Disease: Image Analysis and Recognition (ADIAR) project (PTDC/SAU-ENB/114606/2009).
Data collection and sharing for this project was done by the Alzheimer’s Disease Neuroimaging
Initiative (ADNI).
I would like to thank all of those who contributed, directly or indirectly, to the success of this work:
• My advisor, Professor Margarida Silveira, for sharing her scientific knowledge and for supporting me throughout the project, always with very constructive suggestions.
• My classmates, with whom I spent great times during this journey at IST.
• My family, friends and João, for having been an unconditional support in all stages of my life.
To all of you, I send my sincere acknowledgments.
Abstract
The incidence of dementia worldwide is expected to double in the next 20 years. Alzheimer's Disease (AD) is the most common form of dementia affecting people over the age of 65. Early diagnosis of the disease is very important, because late detection is a barrier to improving the quality of life of those who suffer from it. The efforts toward early detection of AD gave rise to the concept of Mild Cognitive Impairment (MCI), which is used to characterize a person between the normal and AD stages. The diagnosis can be made by an expert physician using neuroimaging techniques such as Positron Emission Tomography (PET) and Magnetic Resonance Imaging (MRI). In recent years, Computer Aided Diagnosis (CAD) tools have been helping the detection of the disease, since they are not subject to subjective evaluation. In this study, three different classifiers were used on Fluorodeoxyglucose (FDG)-PET images to detect MCI to AD conversion: Support Vector Machine (SVM), AdaSVM and AdaBoost. To improve the predictive power, different feature selection and feature extraction methods were also studied. The best performance was obtained with AdaBoost when Control Normal (CN) and AD patients were used as the training set and the resulting classifier was tested on the MCI cohort, achieving an accuracy of 79%, a sensitivity of 76.9%, a specificity of 80.2% and a Balanced Accuracy (BA) of 78.6%. The area under the Receiver Operating Characteristic (ROC) curve was also computed, reaching 83.4%.
Keywords
Alzheimer’s Disease, Mild Cognitive Impairment, Positron Emission Tomography, Computer Aided
Diagnosis, Feature Selection, Feature Extraction
Resumo
In an era in which the number of people classified as having dementia is expected to double, and with Alzheimer's Disease (AD) being the most common manifestation of dementia affecting the population over 65, an early diagnosis becomes essential. The efforts made to characterize the disease in its earliest phase led to the appearance of the concept of Mild Cognitive Impairment (MCI), which is by definition an intermediate state between normal cognition (CN) and the manifestation of AD symptoms. This diagnosis can be made by a specialist using neuropsychological tests or medical images, such as those obtained by positron emission tomography (PET) or magnetic resonance imaging (MRI).
In recent years, technological development has enabled the appearance of several computer-aided diagnosis (CAD) tools, which have greatly improved diagnostic performance, since they do not require any kind of subjective evaluation.
In this dissertation, three distinct types of classifiers were studied, namely SVMs, AdaBoost and AdaSVM, with the goal of assessing which classifier best detects the conversion from MCI to AD using FDG-PET images. Different types of feature selection and extraction were also studied. The best performance was achieved by AdaBoost, which reached an accuracy of 79%, with a sensitivity of 76.9%, a specificity of 80.2% and a balanced accuracy of 78.6%. The ability to detect a new positive case with greater certainty was also evaluated, registering a performance of 83.4%.
Palavras Chave
Alzheimer's Disease, Mild Cognitive Impairment, Positron Emission Tomography, Computer Aided Diagnosis, Feature Selection, Feature Extraction
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Numbers and Facts of the Disease . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Pathology, Diagnosis and Treatment . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 PET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3.A The Importance of PET as a diagnostic technique for AD . . . . . . . . 7
1.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 State of the Art 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Materials and Methods 19
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 ADNI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2 Subjects Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.3 Imaging Acquisition and Processing . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 SVM - Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.A SVM - Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.B SVM - Mathematical Concepts . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1.C Multiclass SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2.A AdaBoost - Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2.B AdaBoost - Mathematical Concepts . . . . . . . . . . . . . . . . . . . . 28
3.2.3 AdaSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Voxel Intensity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1.A 1st Mask: Voxels inside the brain volume . . . . . . . . . . . . . . . . . 33
3.4.1.B 2nd Mask: Voxels inside specific brain regions . . . . . . . . . . . . . . 33
3.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.1 Pearson Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.3 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Experimental Results 39
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Model’s adjustable parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 SVMs Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.1 Training with CN and AD subjects . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.2 Training and testing with MCI data . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.3 Training with all classes and testing with MCI population . . . . . . . . . . . . . . 51
4.4 AdaBoost Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Training with CN and AD subjects and using only the voxels within the Regions
of Interest (ROIs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.2 Training and testing with MCI data and using only the voxels within the ROIs . . 57
4.5 AdaSVM Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.1 Training with CN and AD subjects and using only the voxels within the ROIs . . . 59
4.5.2 Training and testing with MCI data and using only the voxels within the ROIs . . 60
4.6 Summary of all the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Conclusions and Future Work 65
Bibliography 69
Appendix A ROIs A-1
List of Figures
1.1 Statistics of dementia. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Costs of dementia. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Changes in several causes of death. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Different stages of AD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 PET explanation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Metabolic activation in CN, MCI and AD patients. . . . . . . . . . . . . . . . . . . . . . . 7
1.7 A scheme of all strategies adopted during this dissertation. . . . . . . . . . . . . . . . . 9
3.1 Basic concepts of SVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 A schematic interpretation of a multiclass SVM algorithm. . . . . . . . . . . . . . . . . . 27
3.3 Basic concepts of AdaBoost. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Basic concepts of AdaSVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 A schematic view of the cross-validation process. . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Feature space transformation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 ROIs transformation mask. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 An example of the influence of different penalizations. . . . . . . . . . . . . . . . . . . . 41
4.2 ROC curves for a SVM classifier with CN and AD as a training set. . . . . . . . . . . . . 43
4.3 Sensitivity and specificity variation for a SVM classifier when CN and AD are used as
a training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 The outputs of SVM classifiers by using the voxels inside the entire brain volume. . . . . 45
4.5 The outputs of SVM classifiers by using the voxels which fall inside the ROIs. . . . . . . 46
4.6 Features selected by PCC to build the classifier when CN and AD subjects are used
for training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.7 Features selected by MI to build the classifier when CN and AD subjects are used for
training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.8 A mathematical interpretation of correlation. . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 ROC curves for a SVM classifier with MCI as a training set. . . . . . . . . . . . . . . . . 50
4.10 Sensitivity and specificity variation for a SVM classifier when MCI subjects are used as
a training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.11 ROC curves for a SVM classifier with all classes as a training set. . . . . . . . . . . . . . 52
4.12 A comparison between SVM classifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.13 ROC curves for AdaBoost and SVMs classifiers with CN and AD as a training set and
by using different feature selection criterion. . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.14 Features selected by Boosting when CN and AD subjects are used for training the model. 57
4.15 Subject’s weights in each iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.16 A comparison between different classifiers when CN and AD are used for training. . . . 60
4.17 A comparison between different classifiers when MCI subjects are used for training. . . 61
4.18 A comparison between the outputs of AdaSVM and AdaBoost classifiers when CN and
AD patients are used for training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.19 A comparison between the outputs of AdaSVM and AdaBoost classifier when MCI
subjects are used for training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.1 Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
A.2 (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.3 (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.4 (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.5 (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
List of Tables
2.1 Summary of the State of the Art. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Summary of the State of the Art (Continued). . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Dataset description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 SVM Results when CN and AD are used for training. . . . . . . . . . . . . . . . . . . . . 43
4.2 SVM Results when MCI subjects are used for training. . . . . . . . . . . . . . . . . . . . 49
4.3 SVM Results when all classes are used for training. . . . . . . . . . . . . . . . . . . . . 52
4.4 AdaBoost Results when CN and AD are used for training just with information from ROIs. 55
4.5 AdaBoost Results when MCI subjects are used for training just with information from
ROIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 AdaSVM Results when CN and AD are used for training just with information from ROIs. 59
4.7 AdaSVM Results when MCI subjects are used for training just with information from
ROIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8 A summary of all the results obtained during this thesis. . . . . . . . . . . . . . . . . . . 62
Acronyms
ACC Accuracy
AD Alzheimer’s Disease
ADNI Alzheimer’s Disease Neuroimaging Initiative
AUC Area Under the Curve
BA Balanced Accuracy
CAD Computer Aided Diagnosis
CDR Clinical Dementia Rating
CN Control Normal
CSF Cerebrospinal Fluid
CV Cross-Validation
FDA Food and Drug Administration
FDG Fluorodeoxyglucose
FWHM Full Width at Half Maximum
LDA Linear Discriminant Analysis
LNOCV Leave-N-Out Cross-Validation
LOOCV Leave-One-Out Cross-Validation
MCI Mild Cognitive Impairment
MCI-C MCI Converters
MCI-NC MCI Non-Converters
MI Mutual Information
ML Machine Learning
MMSE Mini Mental State Examination
MRI Magnetic Resonance Imaging
NBIB National Institute of Biomedical Imaging and Bioengineering
NFTs Neurofibrillary Tangles
NIA National Institute on Aging
NIH National Institutes of Health
NINCDS-ADRDA National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer's Disease and Related Disorders Association
PCC Pearson Correlation Coefficient
PET Positron Emission Tomography
RBF Radial Basis Function
ROC Receiver Operating Characteristic
ROIs Regions of Interest
SENS Sensitivity
SPEC Specificity
SVM Support Vector Machine
VEB Voxels in the Entire Brain volume
VI Voxel Intensity
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1 Motivation
Every seven seconds a new case of dementia appears in the world, and the prevalence of the disease keeps increasing [1]. Due to increasing life expectancy, the incidence of dementia is expected to double during the next 20 years, reaching an estimated 115.4 million people worldwide in 2050 [1,2]. On average these people live just four to eight years after diagnosis and, since up to now there is no cure, they die of the disease.
1.1.1 Numbers and Facts of the Disease
Alzheimer's Disease (AD) was first described by Alois Alzheimer, a German neurologist, in 1906. At that time the disease was considered rare, because average life expectancy was about 50 years and so very few people reached the critical age of 65, after which the likelihood of developing dementia almost doubles every five years [3,4]. Today, with increased life expectancy, AD constitutes the most common cause of dementia [3,5]. By dementia we mean a mental disorder that completely changes the lives of patients and their families, characterized by the loss of memory and other intellectual abilities severe enough to interfere with daily life activities, due to the death or malfunction of nerve cells [3,6,7].
As Figure 1.1 shows, much of the increase in incidence comes from people with dementia in low- and middle-income countries [1], where the majority of people do not receive a diagnosis and therefore have difficulty accessing correct treatment [8].
Figure 1.1: Trends of growth of people with dementia (in millions) [1].
Dementia is a leading cause of disability and need for care. It has associated direct costs, related to medical and social care, and indirect costs, due to unpaid caregiving [9]. The amount of money spent on dementia worldwide crossed the threshold of 604 billion dollars in 2010. The costs are enormous and inequitably distributed [9]. If dementia were a country, it would be the world's 18th largest economy (see Figure 1.2) [4].
Figure 1.2: Comparison between the costs of dementia in US and others countries economies [4].
According to the Alzheimer's Association, between 2000 and 2010 the number of deaths resulting from heart disease, stroke, prostate cancer and HIV decreased by 16%, 23%, 8% and 42% respectively, while the proportion of deaths related to AD increased by 68% (see Figure 1.3) [7]. Something has to be done to stop this epidemic, and there is no time to lose in managing the impact of the disease [8].
Figure 1.3: Changes in several causes of death [7].
Late diagnosis constitutes a barrier to improving the quality of life of those who suffer from the disease. An early diagnosis is essential because it enables the patient to gain time: at the initial phase of the disease, the patient is still capable of making his own future plans, as well as taking part in decisions related to his care [8]. The societal costs can also be anticipated and managed and, consequently, the burden of the disease can be reduced [9].
The efforts to characterize the early signs of AD have attracted a lot of attention in recent years and led to the appearance of the concept of Mild Cognitive Impairment (MCI) [2,10]. The term MCI is used to characterize a person who has problems with memory or with another thinking skill that do not interfere with daily life activities [3]. MCI is considered a transition stage between normal aging and AD. Several studies have been carried out with the goal of better understanding MCI, and today researchers know that there are different types of MCI, depending on the cognitive domains affected; the subtype most related to AD is amnestic MCI, i.e., where the impaired areas are those involving memory [2,3]. The annual rate of conversion from amnestic MCI to AD is around 12%, much higher than the rate observed in normal non-demented subjects [2]. It is important to identify individuals who are likely to convert, so that treatment can be initiated and mental decline delayed as much as possible.
Today we have sensitive markers and methodologies that allow preclinical detection of neurodegenerative diseases such as AD, in which the brain shows patterns of atrophy and a decrease in metabolic rate [11]. Neuroimaging techniques are one example of such methodologies. Structural Magnetic Resonance Imaging (MRI), which measures the degree of atrophy, and metabolic Positron Emission Tomography (PET), which, by using 18F-Fluorodeoxyglucose (FDG), a labeled analogue of glucose, can measure reductions in cerebral metabolic rate, are the most clinically used modalities [11,12]. Together with Computer Aided Diagnosis (CAD), neuroimaging has become a powerful tool for identifying an individual in the preclinical stage, i.e., before the patient has symptoms but once the pathologies that lead to dementia have started to occur [8].
1.1.2 Pathology, Diagnosis and Treatment
AD is an incurable physical disease that affects the human brain and is characterized by the accumulation of insoluble fibrous material in the central nervous system, which leads to cell death [13]. Up to now the cause of the disease is still unknown, but a set of factors such as age, genetic inheritance, environmental factors and lifestyle constitute the majority of the risk factors for its onset [5].
Although it is not fully understood, there are three consistent hallmarks of the disease: accumulation of senile plaques of β-amyloid peptides, accumulation of Neurofibrillary Tangles (NFTs) composed of τ-protein, and neuronal degeneration [1,14]. The disease is characterized by cognitive decline, notably short-term recall memory loss [12,14]. However, plaques and NFTs are not unique to AD: they are also involved in the normal process of aging as well as in other neurodegenerative disorders [8].
Since the hallmarks are not exclusive to AD, the diagnosis of the disease is difficult. It was long based on the criteria of the Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV-TR), and of the National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA), published in 1984. According to these criteria, the diagnosis should be made in two steps: first, the identification of a dementia syndrome and, second, the application of a protocol consistent with the clinical characteristics of the AD phenotype. Today the clinical phenotype of AD is no longer described in such exclusionary terms, to allow for the detection of the early stages of the disease [15]. Autopsy confirmation of the histopathological changes related to AD remains mandatory for a definite diagnosis [1].
To help in the task of dementia diagnosis, clinicians commonly use the Mini Mental State Examination (MMSE), a series of questions and tests covering different mental abilities, with a maximum score of 30 points. Typically, a normal person has an MMSE score above 27; nevertheless, a score below this value does not mean that a person has dementia, as there may be other reasons that justify a low score [16]. In fact, one of the most relevant pitfalls of the MMSE is the impact of cognitive reserve, i.e., the degree of education. This particularly affects subjects with a high education level, in whom the early signs of the disease can easily be hidden. In these cases, neuroimaging constitutes an important tool because it cannot be manipulated [17]. The severity of the symptoms of dementia is commonly measured on another numerical scale, known as the Clinical Dementia Rating (CDR), which ranges between 0 and 3 according to the strength of the symptoms.
AD is a progressive disease, which means it gets worse over time, but the speed of deterioration differs across subjects. The deposits of β-amyloid plaques vary both in form and in size. By contrast, the neurofibrillary changes show a characteristic pattern of distribution: they start in the medial temporal lobe and spread over the whole cortex [18]. This regular pattern allows the classification of the disease into six stages according to the cortical and subcortical neurofibrillary changes. As represented in Figure 1.4, the six stages can be grouped into: transentorhinal stages, corresponding to the silent period of the disease; limbic stages, characterized by the onset of symptoms; and isocortical stages, in which AD is fully developed [3,13].
Figure 1.4: Neurofibrillary changes along different stages of AD [13].
Up to now there is no cure for AD, and treatments focus on slowing down the progression of the disease and improving patient symptoms [19]. The U.S. Food and Drug Administration (FDA) has approved two different types of drugs. The first type, cholinesterase inhibitors, prevents the breakdown of acetylcholine, a chemical messenger involved in cognitive brain domains such as memory and learning. The second type, NMDA receptor antagonists, acts by blocking this receptor, which is involved in information processing, thereby slowing down the brain damage. Different drugs are appropriate for different stages of the disease [3,19]. Recent studies, carried out at Saint Louis University and published in the Journal of Alzheimer's Disease, demonstrated that an experimental antisense oligonucleotide drug can reverse the symptoms of AD in mice genetically engineered to model the disease [20]. This constitutes hope of a future cure for the millions of subjects who suffer from the disease.
1.1.3 PET
PET is a non-invasive nuclear medicine technique that produces an image of the spatial distribution of radiopharmaceuticals introduced into the body, i.e., it measures physiology and function rather than providing an anatomical map of the body [21].
The radiopharmaceuticals are designed to bind to the ligand of interest; they therefore have to be labeled analogues of a biologically active molecule, and their choice depends on what we want to see. The radiotracer undergoes radioactive decay and a positron is emitted. This positron, after a short distance, annihilates with an electron and two γ-rays are formed. These travel in opposite directions, at an angle of 180°, each with an energy of 511 keV, until they are detected. By reconstructing the annihilation lines, the original image can be obtained (see Figure 1.5) [21,22].
PET is a great tool for diagnosis and is largely used in areas such as oncology, cardiology and
neurology [21].
Figure 1.5: Formation and detection of γ-rays [22].
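The detection-and-reconstruction principle above can be illustrated with a deliberately simplified numeric sketch: plain unfiltered backprojection on a toy 2-D grid, with a made-up source position. Real PET reconstruction works from noisy coincidence counts and adds filtering, attenuation and scatter corrections, none of which are modeled here.

```python
import numpy as np

# Toy 2-D illustration of PET image formation: each detected annihilation
# defines a line of response (LOR); accumulating (backprojecting) many LORs
# through the same emitter recovers its position. Grid size and source
# location are invented for the example.
n = 64
source = (40, 24)                      # hypothetical emitter (row, col)

angles = np.deg2rad(np.arange(0, 180, 5))
image = np.zeros((n, n))
rows, cols = np.mgrid[:n, :n]

for theta in angles:
    # Signed distance of every pixel from the LOR through the source.
    d = (rows - source[0]) * np.cos(theta) + (cols - source[1]) * np.sin(theta)
    image += np.abs(d) < 0.5           # smear the LOR back across the grid

# The backprojected counts peak where all the lines intersect: the source.
peak = np.unravel_index(np.argmax(image), image.shape)
```

Every simulated line passes through the source, so the image accumulates one count per angle at that pixel and `peak` lands exactly on it.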
1.1.3.A The Importance of PET as a diagnostic technique for AD
In the preclinical detection of AD, developing preventive measures is a vital step to cover the gap between the onset of the disease and the beginning of symptoms [23].
For this kind of imaging a radiopharmaceutical called FDG is used. FDG works as an analogue of glucose: the chemistry of the two molecules is very similar, but FDG has a fluorine isotope instead of the normal hydroxyl group at the second carbon.
The human brain needs energy, which it obtains through glucose metabolism, and FDG initially follows the same pathway as glucose. In a first step, FDG is phosphorylated by the hexokinase enzyme and transformed into FDG-6-phosphate. After this phosphorylation step, FDG-6-phosphate is not able to continue the glucose cycle and remains trapped inside the cells. The change in the molecule's chemistry does not allow for efficient excretion, and so the radionuclide accumulates and shows up in the image [21].
A highly metabolically active tissue shows high FDG uptake, and a reduction in FDG uptake is indicative that something is wrong. The extent of the hypometabolism can predict the severity of the disease [11,22,24].
Since it is possible to quantify the metabolic reduction, this neuroimaging technique allows characterization of the disease stage, as represented in Figure 1.6, so that an appropriate treatment can be applied even before the onset of clinical signs.
Figure 1.6: Decrease in glucose consumption in patients with MCI and AD when compared with CN subjects.
1.2 Proposed Approach
In Machine Learning (ML), objects are represented by vectors in which each entry corresponds to one feature of the object. Typically each vector contains a huge number of features, which raises several problems. On the one hand, the task of creating these vectors is not straightforward, since there are different ways to do it and, most of the time, predicting which one is best for the problem at hand is quite difficult. On the other hand, in classification problems the number of features is much larger than the number of examples, which leads to a dimensionality problem. To overcome this issue, the number of features has to be reduced. The number of selected features influences the performance of the classifier and, once again, there are many different possibilities for choosing the best feature set. On top of this, to build the classifier that best fits the classification problem, there are also many different ML algorithms, which constitutes one more variable to take into account.
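As a toy illustration of this representation problem (all dimensions here are invented and far smaller than real FDG-PET volumes), flattening each subject's scan into a row of voxel features immediately yields many more features than examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a few dozen subjects, each a small synthetic
# 3-D "volume" standing in for an FDG-PET scan.
n_subjects = 40
volume_shape = (16, 16, 16)
volumes = rng.random((n_subjects, *volume_shape))

# Design matrix: one row per subject, one column per voxel feature.
X = volumes.reshape(n_subjects, -1)

print(X.shape)  # (40, 4096): 4096 features for only 40 examples
```

Even this tiny volume produces over a hundred times more features than subjects, which is why the feature extraction and selection steps below are needed.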
Our aim in this thesis is to study different methodologies for CAD of AD using FDG-PET images from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
The main goal of this study is a comparison between three different classification procedures used on FDG-PET images to detect MCI to AD conversion. The three classifiers were Support Vector Machine (SVM), AdaSVM and AdaBoost and, within each of the algorithms, different strategies were investigated.
The first problem to be addressed was the feature extraction procedure. Herein, Voxel Inten-
sity (VI), which is directly proportional to the severity of the disease, is used, but to reduce the di-
mensionality of the problem two different pre-processing steps were also computed. In the first one,
a binary mask was applied to keep just the voxels that fell inside the brain volume, so all
the background of the FDG-PET images was discarded. Although reduced, the dimensionality of
the problem remains huge, and therefore this approach was only used when the classification was
performed by the SVM algorithm. In the second approach a binary mask was again used, but now
to keep just the voxels within certain Regions of Interest (ROIs). With this last strategy, not only
is the dimensionality of the problem reduced but the discriminative power of those regions can also
be evaluated.
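The masking step described above amounts to indexing the image array with a boolean mask. A minimal sketch, assuming the FDG-PET image and the mask are already loaded as NumPy arrays of the same shape (the toy data below is illustrative, not ADNI data):

```python
import numpy as np

def extract_voxel_features(image, mask):
    """Return a 1-D feature vector with one entry per voxel inside the mask."""
    return image[mask]

# Toy 4x4x4 "image" with an 8-voxel cubic "ROI" mask
image = np.arange(64, dtype=float).reshape(4, 4, 4)
mask = np.zeros((4, 4, 4), dtype=bool)
mask[:2, :2, :2] = True              # voxels kept by the mask

features = extract_voxel_features(image, mask)
print(features.shape)  # → (8,) : dimensionality drops from 64 to 8
```

The same function serves both pre-processing steps: only the mask changes, from the whole brain volume to the selected ROIs.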
Three different feature selection methods were tested. The first two were the Pearson Cor-
relation Coefficient (PCC) and Mutual Information (MI), two different ways of selecting features ac-
cording to a ranking criterion. These two methods were used to choose the features that built
the SVM model. The third approach, known as boosting, which is the embedded feature se-
lection procedure of the AdaBoost algorithm, was used as the feature selection criterion for AdaBoost and
for AdaSVM. All the strategies explained so far are outlined in Figure 1.7.
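The two ranking criteria can be sketched with NumPy and scikit-learn; the synthetic data, the feature count and the cutoff k below are illustrative assumptions, not the values used in this thesis:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))       # 60 subjects, 100 voxel features
y = rng.integers(0, 2, size=60)      # 0 = non-converter, 1 = converter
X[:, 0] += 2.0 * y                   # make feature 0 informative

# Rank by the absolute Pearson correlation of each feature with the label
pcc = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Rank by the estimated mutual information between each feature and the label
mi = mutual_info_classif(X, y, random_state=0)

k = 10                               # keep the k best-ranked features
top_pcc = np.argsort(pcc)[::-1][:k]
top_mi = np.argsort(mi)[::-1][:k]
```

Both criteria produce a score per feature; the classifier is then trained only on the k top-ranked columns of X.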
To conclude, three different tests were performed when the classification task was carried out by
the SVM and AdaSVM methods. These were included in this study with the intent of understanding
the importance of applying different penalties for misclassification when the data is imbalanced, i.e.,
when the numbers of negative and positive examples differ.
All these strategies will be explained in detail throughout this thesis, and the results will tell which is the
best strategy to adopt for early detection of MCI to AD conversion.
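One common way of applying different misclassification penalties is per-class weighting of the SVM cost parameter, sketched here with scikit-learn's SVC; the synthetic imbalanced data is a placeholder, and this is not presented as the exact weighting scheme evaluated later in the thesis:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Imbalanced synthetic data: 90 "non-converters" vs. 30 "converters"
X = np.vstack([rng.normal(0.0, 1.0, (90, 5)), rng.normal(1.0, 1.0, (30, 5))])
y = np.array([-1] * 90 + [1] * 30)

# class_weight rescales the misclassification penalty C per class;
# 'balanced' weights classes inversely to their frequency, so errors on
# the minority (converter) class cost more.
clf = SVC(kernel="linear", C=1.0, class_weight="balanced").fit(X, y)
print(clf.score(X, y))
```

Without such weighting, a classifier can reach a high plain accuracy simply by favouring the majority class, which is why balanced measures matter here.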
Figure 1.7: A scheme of all the different approaches included in this thesis. Inside the orange rectangle are represented the two masks used during the feature extraction procedure: Mask 1 selects the voxels inside the entire brain volume and Mask 2 selects just those that fell inside the ROIs. In the green rectangle are represented the three distinct feature selection procedures and, inside the blue one, the three classification algorithms already mentioned. The different tests performed to understand the best way of tuning the models' parameters are represented by the black circles.
1.3 Original Contributions
During the last years, several works have addressed MCI to AD prediction but, as a careful
reading of the next chapter shows, the work developed so far has, to our knowledge, mainly
focused on MRI images and on SVM and LDA classifiers.
This work brings some innovative techniques to MCI to AD conversion detection, since it introduces
the AdaBoost classifier to this problem and, at the same time, uses only PET images for the diag-
nosis. Moreover, it also tests more efficient feature extraction techniques, reducing the dimension
of the classification problem by considering only the regions capable of differentiating between
MCI converters and MCI non-converters. Feature selection is not forgotten or left to chance, and
different methods are tested with special emphasis on boosting which, together with the SVM
classifier, originates a new classification technique known as AdaSVM. Still regarding SVMs,
different ways of tuning the SVM parameters were also explored to guarantee the best possible
results. Here we introduced a method to choose the C parameter based on the Balanced Accuracy
of the model, which constitutes a novelty with respect to the state of the art known so far.
The data used in this thesis were obtained from the ADNI database, which makes this study
comparable to the ones published and mentioned in Chapter 2. To finish, it is important to note that
when the same group has to be divided into training and testing sets, the use of cross-validation
ensures that the model is not prone to overestimating the results.
1.4 Thesis Outline
The remainder of this thesis is organized in four chapters. Chapter 2 contains the State of the Art.
In Chapter 3, all the materials and methods used in this dissertation are described. The chapter
begins with a detailed description of the population, and then all ML methods as well as all feature
extraction and selection techniques are explained. Chapter 4 contains all the results obtained in
this thesis and is divided into different subsections to account for the differences between classifiers,
i.e., SVM, AdaBoost and AdaSVM. Still in Chapter 4, a complete discussion of the results and a
comparison between the classifiers are presented. Finally, Chapter 5 encloses some conclusions and
the future work that can still be done in this area.
2 State of the Art

Contents
2.1 Introduction
2.2 Previous Work
2.3 Summary
2.1 Introduction
To deal with the problematic of early diagnosis of AD several imaging processing techniques are
being used on imaging modalities such as PET and MRI, as already was addressed in section 1.1.1,
but these methodologies are not able to accurately per se predict the conversion from MCI to AD [25].
In recent years the problem of searching for patterns in data has received a lot of attention and
some machine learning techniques have been largely exploited [26].
The result of a machine learning algorithm can be seen as y(x), where x is a high-dimensional
input feature vector and y constitutes the output of the machine. This function is learnt during the
training phase, where a model is built, and its capacity for generalization is assessed during the test
phase, in which new examples are used to avoid bias. Different problems can be solved using
machine learning algorithms: supervised learning problems, where the correct label of the
test set is known; unsupervised learning problems, in which the label is unknown and the models
try to find groups of similar examples called clusters; and reinforcement learning, which deals
with the problem of finding suitable actions to take in order to maximize a reward [26].
Within the problem of early diagnosis of AD, pattern recognition and machine learning tech-
niques are very useful because they allow an effective characterization of group differences as well as
a better identification of individuals at risk of cognitive decline from high-dimensional input data [27].
The high-dimensional feature vector makes the model prone to overfitting, a problem
also known as the curse of dimensionality, in which the number of features is far greater than
the number of training examples. There are some solutions to overcome this issue. One of them
is the definition of ROIs to reduce the dimensionality of the problem, an approach followed by many re-
searchers, as we will see in the next section, but others continue to prefer to use the
whole brain structure, since prior knowledge of the affected regions is then not a requirement [25,28].
2.2 Previous Work
In recent years the problem of predicting the conversion from MCI to AD has received a lot of
attention and thus many studies have been published. To ascertain the most recent developments a
brief chronological review of the most important contributions will be presented.
In 2009, Misra et al. [10] tried to find predictors of short-term conversion from MCI to AD by
measuring the spatial distribution of brain atrophy and its longitudinal changes in MCI Converters
(MCI-C) and MCI Non-Converters (MCI-NC). The group comprised 27 MCI-C and 76 MCI-NC
patients from the ADNI cohort (described later, in Section 3.1.1), in which the classification was
based on CDR changes. Voxel-Based Morphometry (VBM) was used to analyse the baseline Magnetic
Resonance images with the goal of identifying a minimal set of brain regions whose
volumes best discriminate between the two groups. Two different types of classifiers were used. In
the first attempt, a predictive model was built based on Control Normal (CN) and AD information and
then applied to MCI patients. In the second one, the model was built and tested on MCI data in
a leave-one-out cross-validation scheme. During this study different kernels as well as different numbers of
features were tested, and the maximum Accuracy (ACC) achieved was 81,5%. The Receiver Operating
Characteristic (ROC) curve, which represents the trade-off between Specificity (SPEC) and
Sensitivity (SENS), was computed, and the Area Under the Curve (AUC) was 77%. These constitute
the best results published so far; however, the small dataset makes the comparison with other studies a
difficult task [25].
In the same year, Querbes et al. [17] published another study on the same topic. Here, 382
ADNI patients were used, of whom 130 were CN, 50 MCI-NC, 72 MCI-C and 130 AD
subjects. For each patient, cortical thickness was computed from the baseline Magnetic Resonance (MR)
images. In this study the brain was divided into different zones which, combined with age in a Linear
Discriminant Analysis (LDA) and using an automatic procedure, yielded the optimal set of ROIs,
i.e., the ones most discriminative for the proposed task. Those zones were used to compute
the normalized thickness index that is used for prediction. Although this study reported an ACC of
73% with a SENS of 75% and a SPEC of 69%, the results are biased because the same test
set was used both to determine the best zones and to test the predictive power of the model, making the
results most likely overestimated [25].
Years later, in 2011, Davatzikos et al. [28] also contributed to this problem with a dataset compris-
ing 54 AD, 63 CN, 69 MCI-C and 170 MCI-NC patients from ADNI. They used a VBM analysis to
predict the conversion and two different approaches for classification, in which they trained the classi-
fiers with CN and AD information and tested on the MCI cohort. With the first classifier, which used just VBM
information, an ACC of 55,8% was achieved, with a SENS of 94,7%, a SPEC of 37,8% and an AUC of
73,4%. When information from the τ -protein, a Cerebrospinal Fluid (CSF) biomarker, was included in
the SVM classifier, the ACC increased to 61,7% with a SENS of 84,2% and a SPEC of 51,2%,
but the AUC decreased slightly to 67,7%. The low discriminative performance between MCI-NC and
MCI-C was attributed to the complexity and variance of the patterns of brain atrophy in MCI patients,
which are far more difficult to describe than those of CN or AD patients.
Also in 2011, Westman et al. proposed a combined method to detect MCI to AD conversion. This
method joined the ADNI database with AddNeuroMed, a European programme, and comprised
a total of 1067 patients, with 295 AD, 84 MCI-C, 353 MCI-NC and 335 CN subjects. To
predict the conversion, each subject's MRI image was analysed to obtain cortical
thickness and volumetric change measurements. The classifier was trained with CN and AD
information and, at the end, 71% of the MCI-C subjects were correctly labelled as AD-like and 60% of
the MCI-NC patients as CN-like, which corresponds to an ACC of 76% [29].
Cuingnet et al. published, also in 2011, a study in which ten different approaches to predict the
conversion from MCI to AD were used to build an SVM classifier. These ten approaches can be
divided into three groups: five were voxel-based methods, three were based on cortical
thickness and two on the hippocampus, one focused on hippocampal volume and the other
on hippocampal shape. The population comprised 210 subjects from the ADNI database and was split
into a training group with 67 MCI-NC and 39 MCI-C and a testing group with 67 MCI-NC and
37 MCI-C. Of those ten methods, just four proved to be slightly more accurate than chance. The
best results, achieved when the whole brain was used in a voxel approach, correspond to a SENS of
57%, a SPEC of 78% and an ACC of 71%. When just the cortex was considered, a SENS of 32%,
a SPEC of 91% and an ACC of 70% were achieved. Finally, when just the hippocampal volume was
taken into account, the SVM obtained a SENS of 62%, a SPEC of 69% and an ACC of 67% [30].
Wolz et al. tried to improve the ACC of MCI to AD prediction by combining features from several
structural MRI analysis techniques; to do so, they used the baseline MRI images of 405 sub-
jects from the ADNI dataset and an SVM classifier. The features used for classification were:
hippocampal volume, cortical thickness, tensor-based morphometry and a new method based
on manifold learning, in which Laplacian eigenmaps are computed to estimate a low-dimensional
representation of the images based on pairwise image similarities. Because the number of
features in the cortical thickness and tensor-based morphometry approaches is enormous, only the features
that fell inside certain predefined ROIs were evaluated but, since this selection was
done using subjects from the test population, the reported results may be slightly overestimated [25].
The best results were achieved when just the manifold-learning features were used, with a
SENS of 77%, a SPEC of 48% and an ACC of 65%. Moreover, they also performed the same test
with the population used by Cuingnet et al. (see [30]) and an LDA classifier. The best results
were achieved when all feature extraction methods were combined, with a SENS of 67%, a SPEC of
69% and an ACC of 68%. These results were better than the ones obtained by the SVM classifier and
also had a higher SENS than those published by Cuingnet et al. [31].
In 2012, Eskildsen et al. [25] published a study in which not only the progression from MCI to AD
was estimated but also the time taken by subjects to convert. To do that, the MRI scans acquired 6 months
(122 subjects), 12 months (128 subjects), 24 months (61 subjects) and 36 months (29 subjects) prior
to AD diagnosis were collected for the MCI-C group. Each group of MCI-C was compared with the MCI-NC
group (134 subjects) based on VBM. The most discriminative features, i.e., those with the largest
atrophy pattern, were chosen and the ROIs were determined. Within these ROIs, features
were chosen during LDA classification and, by combining the four classifiers, they obtained an ACC of
73,5% with a SENS of 63,8% and a SPEC of 84,3%. When comparing MCI-C36 with MCI-NC they
registered an ACC of 69,9%, a SENS of 55,2%, a SPEC of 73,1% and an AUC of 63,5%; for the MCI-C24
vs. MCI-NC classifier, an ACC of 66,7%, a SENS of 59%, a SPEC of 70,2% and an AUC of 67,3%
were obtained; for the MCI-C12 vs. MCI-NC classifier, an ACC of 72,9%, a SENS of 75,8%, a SPEC of
70,2% and an AUC of 76,2% were achieved; finally, for the MCI-C6 vs. MCI-NC classifier they obtained
an ACC of 75,8%, with a SENS of 75,4%, a SPEC of 76,1% and an AUC of 80,9%. Later in
the study they combined the VBM information with age, and the following results were achieved: for the
MCI-C36 vs. MCI-NC classifier, an ACC of 72,4%, a SENS of 48,3%, a SPEC of 77,6% and an AUC
of 63,7%; for the MCI-C24 vs. MCI-NC classifier, an ACC of 67,2%, a SENS of 55,7%, a SPEC of 72,4%
and an AUC of 70,7%; for the MCI-C12 vs. MCI-NC classifier, an ACC of 70,6% with a SENS of 72,7%,
a SPEC of 68,7% and an AUC of 76,3%; for the MCI-C6 vs. MCI-NC classifier they obtained an ACC of
74,6% with a SENS of 72,1%, a SPEC of 76,9% and an AUC of 81,1%. Several conclusions can be
drawn from this study: a longer time before diagnosis decreases both the SENS and the AUC and,
by including age in the LDA, the AUC, which is the most accurate measurement for imbalanced data, is
slightly better.
Still in 2012, Coupe et al. [32] published one more study on the detection of AD in the pre-clinical
stages of the disease, in which the ability to predict conversion is a challenging problem because the
brain pattern changes are subtler. Here, a new feature extraction method was exploited, Scoring
by Nonlocal Image Patch Estimator (SNIPE), in which the nonlocal similarity of the subject to
the whole training dataset is computed, which reduces problems related to intersubject variability since
a one-to-many mapping is allowed. For this purpose, the whole ADNI baseline dataset was used:
231 CN, 238 MCI-NC, 167 MCI-C and 198 AD, in which the Hippocampus (HC) and the Entorhinal
Cortex (EC) were selected and graded using an LDA classifier. During the classification phase,
different ways of performing Cross-Validation (CV) were also studied, namely Leave-One-Out
Cross-Validation (LOOCV), repeated Leave-N-Out Cross-Validation (LNOCV) and stratified k-fold.
For the grading method with the LOOCV procedure, the MCI-NC vs. MCI-C classifier
achieved an ACC of 71% with a SENS of 70% and a SPEC of 71%. For the LNOCV procedure with 100
repetitions, an ACC of 73% with a SENS of 72% and a SPEC of 74% was obtained. Finally,
the k-fold CV procedure scored an ACC of 73% with a SENS of 68% and a SPEC of 76%. The AUC
was not specified for any of the tests.
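For illustration, the CV schemes mentioned above differ only in how the folds are drawn. A minimal scikit-learn sketch with a synthetic imbalanced dataset (the data and the LDA model are placeholders, not the SNIPE features used in the study):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
y = np.array([0] * 25 + [1] * 15)    # imbalanced labels, e.g. MCI-NC vs. MCI-C
X[y == 1] += 1.5                     # separate the classes a little

lda = LinearDiscriminantAnalysis()

# LOOCV: one subject per test fold; stratified k-fold keeps class proportions
loo_acc = cross_val_score(lda, X, y, cv=LeaveOneOut()).mean()
kfold_acc = cross_val_score(lda, X, y, cv=StratifiedKFold(n_splits=5)).mean()
print(loo_acc, kfold_acc)
```

Stratification matters with imbalanced groups, since an unstratified fold may contain very few, or no, converters.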
During the same year, Cho et al. also evaluated the predictive power of an LDA classifier in distin-
guishing between MCI converters and MCI non-converters. To do that, they used an ADNI population
with 131 MCI-NC and 72 MCI-C subjects, which was later split into training and testing sets, and
extracted cortical thickness data from the MR volumes, which were then filtered to remove high-
frequency noise. In the end, this study achieved a SENS of 63% with a SPEC of 76%
and an ACC of 71% [33].
In 2013, Young et al. [34] used a Gaussian Process (GP) with multimodal data from MRI, FDG-PET
and genetic biomarkers to differentiate between MCI-NC and MCI-C. The classification was done with 73
CN, 96 MCI-NC, 47 MCI-C and 63 AD patients from the ADNI dataset. The GP model was
trained with CN and AD data and then applied to the MCI data. The output probabilities are dichotomised
to produce a binary classification, i.e., whether the patient will convert or not. For the multimodal
kernel construction two main approaches were taken: Grid Search (GS), in which the weights of each
modality are chosen from a set of previously known values, or Maximum Likelihood (ML), in which
the kernel parameters are learnt from the training data, so there is no need to resort to a GS with
CV. The results for the GP method using only the information extracted from MRI were: an ACC of 64,3%
with a SENS of 53,2%, a SPEC of 69,8%, a Balanced Accuracy (BA) of 61,5% and an AUC of 64,3%.
Using only the information from PET, an ACC of 65,0% was reached, with a SENS of
66,0%, a SPEC of 64,6%, a BA of 65,7% and an AUC of 76,7%. Making use of the multimodal kernel
and tuning the kernel parameters by means of ML, the classifier achieved an ACC of 69,9%, with a
SENS of 78,7%, a SPEC of 65,6%, a BA of 74,1% and an AUC of 79,5%. Finally, the GS
approach registered an ACC of 67,1% with a SENS of 76,6%, a SPEC of 62,5%, a BA of 70,6% and
an AUC of 75,1%. All these values are summarized in Table 2.1.
As can be seen, in recent years many authors have driven their efforts towards improving MCI to AD conversion
prediction, and these studies were mainly focused on SVM and LDA classifiers. Nevertheless,
other publications had a huge impact on the development of this study, such as [35–40]. The one
published by Silveira et al. in 2010 [41] was also very important, since a comparison
between SVM, AdaBoost and AdaSVM was performed for three different classification tasks, i.e., distinguishing
between CN and AD, between CN and MCI, and between MCI and AD. Although the specific sep-
aration between MCI-NC and MCI-C was not taken into account, the results obtained for AdaBoost
proved so promising that we were encouraged to apply the same principles to the problem at
hand. The results for CN/MCI, in terms of ACC, are shown in Table 2.2 for informational purposes.
2.3 Summary
Table 2.1: A summary of all the results presented in this section in terms of Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC) and Area Under the ROC Curve (AUC). All results in %; "-" means not reported.

Article                        | Participants                           | Biomarker(s)        | Method                     | ACC  | SENS | SPEC | AUC
Misra et al. (2009) [10]       | 76 MCI-NC, 27 MCI-C                    | MRI                 | VBM                        | 81,5 | -    | -    | 77
Querbes et al. (2009) [17]     | 130 CN, 50 MCI-NC, 72 MCI-C, 130 AD    | MRI                 | ROIs                       | 73   | 75   | 69   | -
Davatzikos et al. (2011) [28]  | 63 CN, 170 MCI-NC, 69 MCI-C, 54 AD     | MRI                 | VBM                        | 55,8 | 94,5 | 37,8 | 73,4
                               |                                        | MRI + CSF τ-protein | VBM                        | 61,7 | 84,2 | 51,2 | 67,7
Westman et al. (2011) [29]     | 295 AD, 353 MCI-NC, 84 MCI-C, 335 CN   | MRI                 | Cortical Thickness, Volume | 76   | 71   | 60   | -
Cuingnet et al. (2011) [30]    | 134 MCI-NC, 76 MCI-C                   | MRI                 | Hippocampus                | 67   | 62   | 69   | -
                               |                                        |                     | Voxel Approach             | 71   | 57   | 78   | -
                               |                                        |                     | Cortical Thickness         | 70   | 32   | 91   | -
Wolz et al. (2011) [31]        | 238 MCI-NC, 167 MCI-C                  | MRI                 | Manifold Learning (SVM)    | 65   | 77   | 48   | -
                               | 134 MCI-NC, 78 MCI-C                   |                     | Combination (LDA)          | 68   | 67   | 69   | -
Table 2.2: (Continued) A summary of all the results presented in this section in terms of Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC) and Area Under the ROC Curve (AUC). All results in %; "-" means not reported.

Article                      | Participants                          | Biomarker(s)              | Method             | ACC   | SENS | SPEC | AUC
Eskildsen et al. (2012) [25] | 29 MCI-C36 vs. 134 MCI-NC             | MRI                       | VBM                | 69,9  | 55,2 | 73,1 | 63,5
                             |                                       | MRI & Age                 | VBM                | 72,4  | 48,3 | 77,6 | 63,7
                             | 61 MCI-C24 vs. 134 MCI-NC             | MRI                       | VBM                | 66,7  | 59,0 | 70,2 | 67,3
                             |                                       | MRI & Age                 | VBM                | 67,2  | 55,7 | 72,4 | 70,7
                             | 128 MCI-C12 vs. 134 MCI-NC            | MRI                       | VBM                | 72,9  | 75,8 | 70,2 | 76,2
                             |                                       | MRI & Age                 | VBM                | 70,6  | 72,7 | 68,7 | 76,3
                             | 122 MCI-C6 vs. 134 MCI-NC             | MRI                       | VBM                | 75,8  | 75,4 | 76,1 | 80,9
                             |                                       | MRI & Age                 | VBM                | 74,6  | 72,1 | 76,9 | 81,1
Coupe et al. (2012) [32]     | 231 CN, 198 AD, 238 MCI-NC, 167 MCI-C | SNIPE                     | ROIs, LOOCV        | 71    | 70   | 71   | -
                             |                                       |                           | ROIs, LNOCV        | 73    | 72   | 74   | -
                             |                                       |                           | ROIs, k-fold CV    | 73    | 68   | 76   | -
Cho et al. (2012) [33]       | 131 MCI-NC, 72 MCI-C                  | MRI                       | Cortical Thickness | 71    | 63   | 76   | -
Young et al. (2013) [34]     | 73 CN, 96 MCI-NC, 47 MCI-C, 63 AD     | MRI                       | GP                 | 64,3  | 53,2 | 69,8 | 64,3
                             |                                       | PET                       | GP                 | 65,0  | 66,0 | 64,6 | 76,7
                             |                                       | MRI + PET + Genetics (GS) | GP                 | 67,1  | 76,6 | 62,5 | 75,1
                             |                                       | MRI + PET + Genetics (ML) | GP                 | 69,9  | 78,7 | 65,6 | 79,5
Silveira et al. [41]         | 113 MCI, 81 CN                        | PET                       | AdaBoost           | 79,63 | -    | -    | -
                             |                                       |                           | SVM                | 74,07 | -    | -    | -
                             |                                       |                           | AdaSVM             | 71,52 | -    | -    | -
3 Materials and Methods

Contents
3.1 Data
3.2 Machine Learning Algorithms
3.3 Cross-Validation
3.4 Feature Extraction
3.5 Feature Selection
In this chapter, the characteristics of the participants in this study, as well as the classifica-
tion techniques, will be presented. The feature selection and feature extraction methods will also be
addressed.
3.1 Data
3.1.1 ADNI
ADNI is a public-private partnership launched in 2003 by the National Institute on Aging (NIA), the
National Institute of Biomedical Imaging and Bioengineering (NIBIB), the FDA, private pharmaceuti-
cal companies, such as AstraZeneca, Novartis and Merck, and non-profit foundations, such as the Alzheimer's
Association, in conjunction with the National Institutes of Health (NIH) Foundation. Several cores con-
stitute ADNI: a clinical coordination centre, two neuroimaging cores, a biomarker core, an informatics
core and a biostatistics core. Its policy follows three main goals [2,42,43]:
1. Identification of possible neuroimaging biomarkers that allow early detection of AD;
2. Application of prevention as well as treatment techniques during the early stage of the disease;
3. Creation of a database of imaging and clinical data.
ADNI began in October 2004, and its first goal was to recruit 200 CN individuals and 400
subjects suffering from MCI, to be followed for a period of three years, and 200 patients with AD, to
be followed for two years [42]. In 2011 the study included over 1000 patients and the results had
exceeded expectations [43].
In the next subsections, the constitution of the ADNI database, as well as
how images can be processed to be comparable, is addressed.
3.1.2 Subjects Characterization
For the present study, individuals from the ADNI [43] cohort who had baseline PET images available
were selected. In an initial phase 404 individuals were chosen, but this number was reduced to 286
because 118 subjects did not have all neuropsychological test measurements available. The 286 individuals
were divided into four groups according to their neuropsychological test scores: 78 CN, 91 MCI-NC,
52 MCI-C and 65 AD. Table 3.1 summarizes the main characteristics of each group.
Table 3.1: Description of the groups present in this study. Values are presented in Mean ± Standard Deviation format.

Group                 | CN          | MCI-NC      | MCI-C       | AD
Number of patients    | 78          | 91          | 52          | 65
Age [years]           | 75,9 ± 4,9  | 75,5 ± 7,3  | 74,7 ± 6,9  | 76,0 ± 6,7
Gender (% of females) | 37,2        | 30,8        | 40,4        | 40,0
MMSE                  | 29,1 ± 1,0  | 27,1 ± 1,6  | 27,4 ± 1,6  | 23,4 ± 2,0
CDR                   | 0,00 ± 0,00 | 0,50 ± 0,00 | 0,50 ± 0,00 | 0,81 ± 0,25
3.1.3 Imaging Acquisition and Processing
The FDG-PET images in ADNI were acquired using one of the following PET scanners:
General Electric, Siemens or Philips, and could be acquired according to three different protocols
[44,45]:
1. Dynamic Protocol: six 5-minute frames, which should start 30 minutes after FDG injection [30
- 60 min];
2. Static Protocol: a single 30-minute frame, which should start 30 minutes after FDG injection [30
- 60 min];
3. Dynamic Quantitative Protocol: 33 frames acquired over 60 minutes. The acquisition should
start immediately after FDG injection [0 - 60 min].
The quantitative studies require a stricter technical protocol, so they are performed by only a
few sites. The qualitative studies are easier to perform and are therefore more broadly used [44].
With the aim of having a uniform database and making images from different scanners more
comparable, there are strict processing steps that are applied to the PET image data sequentially
[46]:
1. Dynamic co-registration: for registration purposes, all images are separated into different frames.
To decrease the number of artifacts caused by patient motion during the acquisition, each frame
is co-registered to the first extracted frame, and the recombination yields a dynamic image set.
These image sets have the same image size, the same voxel dimension and
the same spatial orientation as the original PET image data, which is called
the 'native' space. This step can only be applied to images acquired under protocol 1 or 3, i.e.,
the Dynamic Protocol or the Dynamic Quantitative Protocol;
2. Averaging: a single 30-minute PET image frame, in the 'native' space, is created by averaging the
six 5-minute frames, in the case of the Dynamic Protocol, or the last 6 frames, if the performed
protocol was the Dynamic Quantitative;
3. Image and voxel size standardization: each subject's co-registered averaged image is reori-
ented into a 160x160x96 voxel grid with 1,5 mm cubic voxels, in which the anterior-posterior axis
is parallel to the AC-PC line;
4. Resolution standardization: the previously obtained image is smoothed and a uniform isotropic
resolution of 8 mm Full Width at Half Maximum (FWHM) is achieved.
Before starting the analysis, all images have to go through two more transformation steps in order to
make a comparison between different scanners possible:
5. Talairach warping: all images are mapped to the Talairach space, so the location of all
brain structures becomes independent of the shape, size and differences of the brains across
individuals. After Talairach warping, a 128x128x60 grid is generated;
6. Intensity standardization: the intensity of the image is normalized using a subject-specific mask,
which has to ensure that the average of all voxels within the mask is exactly equal to one.
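Step 6 can be sketched in a few lines of NumPy; the random image and the all-ones mask below are stand-ins for a real FDG-PET volume and its subject-specific mask:

```python
import numpy as np

def normalize_intensity(image, mask):
    """Scale voxel intensities so the mean inside the mask equals one."""
    return image / image[mask].mean()

image = np.random.default_rng(2).uniform(1.0, 5.0, size=(8, 8, 8))
mask = np.ones_like(image, dtype=bool)   # stand-in for a subject-specific mask

normalized = normalize_intensity(image, mask)
print(normalized[mask].mean())  # → 1.0 (up to floating-point rounding)
```

Dividing the whole volume by the in-mask mean leaves relative voxel intensities unchanged, which is what makes images from different scanners comparable.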
3.2 Machine Learning Algorithms
In this section, all the ML algorithms used in the present study will be explained in detail. ML is a great
tool because it tries to find a model that minimizes the error on a set of training examples while, at the
same time, allowing some form of inductive bias, which gives the algorithm the ability to generalize
[42].
3.2.1 SVM - Support Vector Machines
SVM became popular some years ago and has the important property of solving a convex opti-
mization problem, which means that any local solution is also a global optimum [26].
3.2.1.A SVM - Basic Concepts
Historically, the origin of SVM dates back to 1962, when Vapnik and Lerner proposed algorithms
for pattern recognition; nevertheless, the notion of SVM as we know it today was only
introduced in 1995, also by Vapnik, this time together with Corinna Cortes [47, 48]. They
defined SVM as a machine designed for binary classification problems where the input vectors are
non-linearly mapped to a very high-dimensional feature space. In this feature space the SVM tries to find
a hyperplane that minimizes the classification error on the training examples and, at the same time,
maximizes the margin, i.e., the distance between the hyperplane and the closest examples in feature
space, which are called support vectors [48–51]. A schematic representation of an SVM is
shown in Figure 3.1.
Figure 3.1: This figure illustrates the basic concepts of SVMs. The support vectors are indicated by the circles. Image based on Bishop [26].
SVMs can have good generalization ability even when working in a huge feature space,
because the hyperplane can be constructed using just the support vectors. Vapnik showed
that the expected probability of an error on a test example is bounded by [48]:

E[Pr(error)] ≤ E[number of support vectors] / (number of training vectors)    (3.1)

As can be seen from Equation (3.1), this bound does not depend on the dimensionality of the space.
Most of the time the data is not linearly separable and, to obtain an optimal separating hyperplane,
high-dimensional kernel transformations have to be applied; in extreme cases, i.e., when the
data distribution is really complex, this can lead to poorer generalization ability. This concept of
hard margins fell in 1995, when Vapnik and Cortes introduced the soft margin, which protects
against overfitting by allowing some data points to be misclassified [26].
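The effect of the soft margin can be illustrated with scikit-learn, where the parameter C controls the penalty for margin violations; the 2-D Gaussian data below is synthetic and the C values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping 2-D Gaussian clouds, labels -1 and +1
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Small C -> wide soft margin: more tolerated violations, more support vectors.
# Large C -> approaches hard-margin behaviour: fewer support vectors.
soft = SVC(kernel="linear", C=0.01).fit(X, y)
hard = SVC(kernel="linear", C=100.0).fit(X, y)
print(len(soft.support_), len(hard.support_))
```

Only the support vectors enter the decision function, which is why, per the bound in Equation (3.1), their number rather than the feature dimension governs the expected test error.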
3.2.1.B SVM - Mathematical Concepts
Let us consider a binary classification problem where the training set χ is such that:

χ = {(x1, y1), . . . , (xN, yN)}, xn ∈ R^K, yn ∈ {−1, 1}    (3.2)
where xn, with n ∈ {1, . . . , N}, is the K-dimensional feature vector of each instance, yn ∈ {−1, 1}
is the class label and N is the total number of training examples. The hyperplane that separates
the training data is the one parameterized by vector w and a bias constant b, given by the following
equation:
w · x + b = 0 (3.3)
Assuming that the data is linearly separable, the decision function is given by
y(x) = w · x + b (3.4)
If the decision function given by Equation (3.4) correctly classifies all training instances, i.e., y(x) > 0
for instances with label y = 1 and y(x) < 0 for instances with y = −1, the hyperplane parameterized
by the vector w and the constant b can be rescaled into the canonical separating hyperplane, for which
the points closest to it satisfy [26]:

yi(w · xi + b) = 1    (3.5)

This can be written compactly as the following constraint, which should be fulfilled by all training
points:

yi(w · xi + b) ≥ 1, ∀i    (3.6)

where the points xi that satisfy the equality, with i ∈ {1, . . . , M} and M ≤ N, are the support vectors.
The optimal hyperplane, among the set of separating hyperplanes, is the one that maximizes the
margin, i.e., the one with the greatest distance between the hyperplane and the support
vectors, the nearest vectors. The distance is given by y(x)/‖w‖, where ‖w‖ is the magnitude of
the vector w, so the distance of the closest points to the decision boundary is given by:

d((w, b), xi) = yi(w · xi + b) / ‖w‖    (3.7)

In order to maximize the margin, and according to Equation (3.7), the inverse of the magnitude of
w, ‖w‖⁻¹, should be maximized, which is equivalent to minimizing (1/2)‖w‖² subject to the constraint
(3.6), and so we have to solve the following quadratic optimization problem [26]:

Minimize:   (1/2)‖w‖²
Subject to: yi(w · xi + b) ≥ 1, ∀i    (3.8)
To solve Equation (3.8) we make use of Lagrange multipliers, αi ≥ 0 with i ∈ {1, . . . ,M} and
M equal to the total number of support vectors, which are very useful when the function to be optimized
has to fulfill one or more constraints. The Lagrangian form of this problem is given by:
L(w, b, Λ) = (1/2)‖w‖² − ∑_{i=1}^{M} αi {yi(w · xi + b) − 1} (3.9)
where Λ = (α1, . . . , αM).
Making use of the dual representation of the Lagrangian function it is possible to eliminate the
vector w as well as the constant b from Equation (3.9). By doing this transformation, the optimization
problem given by Equation (3.8) is easier to solve. First, let us consider the derivatives of the Lagrangian
function (3.9) with respect to both w and b:
∂L(w, b, Λ)/∂w = w − ∑_{i=1}^{M} αi yi xi = 0 (3.10)

∂L(w, b, Λ)/∂b = ∑_{i=1}^{M} αi yi = 0 (3.11)
By using the conditions (3.10) and (3.11) the dual representation of the problem given by Equation
(3.8) can be written as:
L(Λ) = ∑_{i=1}^{M} αi − (1/2) ∑_{i=1}^{M} ∑_{l=1}^{M} αi αl yi yl K(xi, xl) (3.12)
The solution is obtained by maximizing Equation (3.12) subject to the following constraints:
αi ≥ 0, i = 1, . . . ,M (3.13)

∑_{i=1}^{M} αi yi = 0 (3.14)
In Equation (3.12), K(xi, xl) is the kernel function and can be defined as K(xi, xl) = φ(xi)ᵀφ(xl),
in which φ is the feature space transformation. For a linear transformation we can write φ(xi) = xi
and φ(xl) = xl, so the kernel function reduces to K(xi, xl) = φ(xi)ᵀφ(xl) = xiᵀxl = xi · xl
and the Lagrangian dual representation to:
L(Λ) = ∑_{i=1}^{M} αi − (1/2) ∑_{i=1}^{M} ∑_{l=1}^{M} αi αl yi yl (xi · xl) (3.15)
Going deeper into this problem, we must highlight that the optimization of a problem
using Lagrange multipliers subject to an inequality constraint requires that the solution fulfill
three properties known as the Karush-Kuhn-Tucker (KKT) conditions [26]:
αn ≥ 0 (3.16)
yny(xn)− 1 ≥ 0 (3.17)
αn {yny(xn)− 1} = 0 (3.18)
where n indexes the training instances. Through the analysis of Equations (3.16) and (3.18) it
is easy to see that, for every point, either αn = 0 or yny(xn) = 1; vectors other than support vectors
have αn = 0 and thus play no role in the solution, so, as we have been showing so far, only the support
vectors matter. The bias parameter b can be computed by:
b = (1/NS) ∑_{i∈S} ( yi − ∑_{l∈S} αl yl (xi · xl) ) (3.19)
where NS is the total number of support vectors.
As we have already pointed out, sometimes the data is too complex to be linearly separable. The
dual representation allows the introduction of kernels, which can apply transformations to the
feature space other than the linear one, such as the Radial Basis Function (RBF) kernel, widely used in CAD of
AD:
K(xi, xl) = φ(xi)ᵀφ(xl) = exp{−γ‖xi − xl‖²} (3.20)
Even using those high dimensional transformations, most of the time a perfect linear separation of the
data is impossible, therefore the SVM has to be modified to account for data misclassification. To do
that, non-negative slack variables, ξn ≥ 0, n = 1, . . . , N , with N equal to the number of instances,
were introduced. This extension of SVM theory is also known as the soft margin hyperplane and was
introduced by Cortes and Vapnik in 1995 [48].
The slack variable, ξ, penalizes misclassified data as a function of distance. For points that
fall inside the margin but on the correct side of the decision boundary, the slack variables take values
between 0 and 1, 0 < ξ ≤ 1. For points that lie on the wrong side of the decision boundary,
the slack variables take values greater than 1, ξ > 1 [26].
Now, to find the optimal solution we have to maximize the margin and, at the same time, minimize
the errors, thus the optimization problem presented in Equation (3.8) becomes:
Minimize: C ∑_{n=1}^{N} ξn + (1/2)‖w‖²
Subject to: yn(w · xn + b) ≥ 1 − ξn, n = 1, . . . , N (3.21)
where C is an adjustable parameter which controls the width of the margin according to the
cost of misclassification.
The Lagrangian function, as presented in Equation (3.9), becomes:

L(w, b, Λ) = (1/2)‖w‖² + C ∑_{n=1}^{N} ξn − ∑_{n=1}^{N} αn {yn(w · xn + b) − 1 + ξn} − ∑_{n=1}^{N} µn ξn (3.22)
in which αn ≥ 0 and µn ≥ 0 are Lagrange multipliers. Finally, by computing the partial derivatives
∂L/∂w, ∂L/∂b and ∂L/∂ξn and setting each one equal to 0, it is possible to obtain the optimization
problem in dual Lagrangian form:
Maximize: L(Λ) = ∑_{n=1}^{N} αn − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} αn αm yn ym K(xn, xm)
Subject to: 0 ≤ αn ≤ C, n = 1, . . . , N, and ∑_{n=1}^{N} αn yn = 0 (3.23)
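To make the dual concrete, the objective and constraints of Equation (3.23) can be evaluated for any candidate set of multipliers. A minimal Python sketch (the toy data and helper names below are ours, not the thesis's; an actual solver such as SMO would search this feasible set for the maximizer):

```python
def linear_kernel(a, b):
    return sum(u * v for u, v in zip(a, b))

def dual_objective(alpha, X, y, kernel=linear_kernel):
    # L(Lambda) = sum_n alpha_n - 1/2 sum_n sum_m alpha_n alpha_m y_n y_m K(x_n, x_m)
    N = len(X)
    linear_term = sum(alpha)
    quad_term = sum(alpha[n] * alpha[m] * y[n] * y[m] * kernel(X[n], X[m])
                    for n in range(N) for m in range(N))
    return linear_term - 0.5 * quad_term

def is_feasible(alpha, y, C, tol=1e-9):
    # box constraint 0 <= alpha_n <= C and equality constraint sum_n alpha_n y_n = 0
    box = all(0.0 <= a <= C for a in alpha)
    balance = abs(sum(a * yn for a, yn in zip(alpha, y))) < tol
    return box and balance
```

For example, with X = [[1, 0], [−1, 0]], y = [1, −1] and α = [0.5, 0.5], the multipliers are feasible for any C ≥ 0.5 and the objective evaluates to 0.5.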
3.2.1.C Multiclass SVMs
Up to now, just the binary classification problem was addressed, but some problems
involve more than two classes (D > 2). In this case, there are different methods to deal
with this, in which several binary classifiers are combined to build a multiclass classifier. Two main
approaches are commonly used: the one-versus-the-rest approach, in which D different classifiers are
built by using the class Cd as the positive examples and the remaining D − 1 classes as negative
ones, and the one-versus-one approach, the methodology adopted in this study [26].
In the one-versus-one approach, D(D − 1)/2 different classifiers are built to cover all possible pairs
of classes. All classifiers classify the test population and the class is assigned according
to the number of 'votes', which can be an ambiguous rule since two or more
classes may receive the same number of 'votes' (see Figure 3.2) [26].
Figure 3.2: A schematic interpretation of a multiclass SVM algorithm.
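The one-versus-one voting scheme can be sketched as follows (the stub classifiers in the usage example are hypothetical placeholders for trained binary SVMs):

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(classifiers, x, classes):
    # classifiers maps each class pair (a, b) to a decision function that
    # returns a positive value for class a and a negative value for class b
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if classifiers[(a, b)](x) > 0 else b
        votes[winner] += 1
    # ties between classes are possible here, which is exactly the
    # ambiguity of the voting scheme mentioned in the text
    return votes.most_common(1)[0][0]
```

With three classes, three pairwise classifiers vote; for instance, if the (CN, MCI) and (MCI, AD) classifiers both favor MCI, MCI wins with two votes regardless of the (CN, AD) outcome.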
3.2.2 AdaBoost
Adaptive Boosting, also known as AdaBoost, was developed in 1996 by Freund and Schapire and
is the most common form of boosting algorithm [26]. It is called adaptive due to the fact that each
classifier is built taking into account the errors committed by the previous ones [52].
3.2.2.A AdaBoost - Basic Concepts
AdaBoost is an ensemble method that is used to boost the classification by using a combination
of weak and inaccurate rules, i.e., it combines a collection of weak classifiers to form a strong and
more accurate one [41,53,54].
AdaBoost is a sequential classifier in which each weak classifier uses the example weighting
coefficients computed by the previous classifier to perform its classification, as can
be seen in Figure 3.3 [26]. After each round of learning, the weighting coefficient of
each example is updated in order to penalize those which were misclassified and, at the end, all
weak classifiers are combined into a single weighted classifier that can give good results even if each
weak classifier performs only slightly better than chance [26,54].
Figure 3.3: This figure illustrates the basic concepts of AdaBoost [26].
3.2.2.B AdaBoost - Mathematical Concepts
Let us consider the same notation (see Equation (3.2)). Each weak classifier selects a single feature
and the optimal threshold which best separates the positive and the negative examples, and can be
represented by [41,54]:

y(x, f, θ) = −1 if f(x) < θ, 1 otherwise (3.24)
in which y(x, f, θ) represents the weak classifier, f the selected feature and θ the optimal feature
threshold. The threshold has to be carefully chosen because a high threshold lowers the detection
rate, but if the threshold is too low the number of false positives will increase [54].
The learning process of the boosting algorithm starts with the weighting vector initialization. For
the first round of boosting the weights of the examples are set to one over the total number of examples
[53].
w(1) = 1/N (3.25)
After this initialization step, a weak classifier for each feature is constructed. To evaluate the
performance of each weak classifier, a weighted error function is computed and the one that presents
the minimum error is selected [54]:
εk = min_{f,θ} ∑_{i=1}^{N} wi |y(xi, f, θ) − yi| (3.26)
The set of values that minimizes the error function is {fk, θk}, and thus y(x, fk, θk) can be written
simply as yk(x). Note that there are K·N candidate weak classifiers, i.e., one for each
feature/threshold combination. Although the training error may stop changing, after a few rounds
the confidence in the predictions increases considerably, which accounts for the good generalization
performance. The level of confidence is measured by a quantity called the margin [53].
margin = Correct label classification / Incorrect label classification (3.27)
As can be seen from Equation (3.27), the margin is small when the ratio between the
correct and the incorrect label weight is close to 1. In those cases the confidence is low and the
generalization power is poor. Conversely, if this ratio is larger, the margin is big and the generalization
performance is better [53].
After the selection of another weak classifier, the vector of weights is updated to penalize the
classification errors that occur:

w(k+1) = w(k) exp(αk ei) (3.28)

in which ei ∈ {−1, 1} equals 1 if the ith example is misclassified and −1 otherwise, so that the weights
of misclassified examples increase. αk is defined as αk = (1/2) ln((1 − εk)/εk) [53].
After the weights vector is updated and before another learning round is performed the vector has
to be normalized.
wi(k+1) = wi(k+1) / ∑_{j=1}^{N} wj(k+1) (3.29)
with i = {1, . . . , N}, where N is the total number of training examples and k = {1, . . . ,K}, in which
K is the length of the features vector.
Finally, through the ensemble of all K weak classifiers, a more accurate classifier is obtained, also
known as the strong classifier [53]:

Y(x) = 1 if ∑_{k=1}^{K} αk yk(x) ≥ 0, −1 otherwise (3.30)
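The full loop of Equations (3.24)–(3.30) can be condensed into a short decision-stump implementation. This is a simplified sketch on hypothetical toy data; in particular, the stump of Equation (3.24) has no polarity term, so this version only fits problems where the positive class lies above the threshold:

```python
import math

def stump_predict(x, f, theta):
    # weak classifier of Eq. (3.24): -1 if the feature value is below the
    # threshold, +1 otherwise (no polarity term, a simplification)
    return -1 if x[f] < theta else 1

def train_adaboost(X, y, rounds):
    n, n_feat = len(X), len(X[0])
    w = [1.0 / n] * n                                    # Eq. (3.25)
    ensemble = []
    for _ in range(rounds):
        best = None
        for f in range(n_feat):
            for theta in sorted({row[f] for row in X}):  # candidate thresholds
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, f, theta) != yi)
                if best is None or err < best[0]:
                    best = (err, f, theta)
        eps, f, theta = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)            # avoid log of zero
        alpha = 0.5 * math.log((1 - eps) / eps)
        # re-weight: raise misclassified examples, lower the rest (Eq. 3.28)
        w = [wi * math.exp(alpha if stump_predict(xi, f, theta) != yi else -alpha)
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]                     # normalize, Eq. (3.29)
        ensemble.append((alpha, f, theta))
    return ensemble

def strong_predict(ensemble, x):
    # Eq. (3.30): weighted vote of all the weak classifiers
    return 1 if sum(a * stump_predict(x, f, t) for a, f, t in ensemble) >= 0 else -1
```

On a one-dimensional toy set with negatives at low values and positives at high values, a single round already finds the separating threshold; further rounds refine the weighted vote.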
3.2.3 AdaSVM
SVM and, more recently, AdaBoost are machine learning techniques widely used in classification
tasks. Both have been separately applied to medical image analysis, but recently the potential
of applying the two methods together has also been studied. Here we apply AdaBoost and SVM in a
sequential way, a method that takes the name of AdaSVM [51].
Since the work done in 2000 by Tieu and Viola, in the AdaBoost algorithm each weak classifier
depends on only one feature and, as a result, the boosting process can be seen as a greedy feature
selection procedure because, in each round, a new weak classifier is selected [54]. The dependency
between features is encoded by the example weights, and so the features are selected in an
efficient way [54]. On top of this, as can be seen in Figure 3.3, AdaBoost makes
its prediction based on a weighted majority voting scheme in which the more
accurate classifiers have the largest weights, and thus the training error of the final hypothesis quickly
decreases towards zero [26,51,53,54].
Despite all its advantages, such as the speed of learning due to the simplicity of the weak
classifiers, sometimes it is better to have a single more robust classifier instead of a set of less
accurate ones [55]. One possible way to overcome this disadvantage is to use SVM in a sequential
way, i.e., we use AdaBoost to select the features that best fit the problem, thereby reducing the
dimensionality, and SVM as the final classifier trained on the features previously
selected by boosting [51]. With this procedure, the quality of the classification can be improved
[56]. Figure 3.4 shows a schematic representation of the AdaSVM algorithm.
Figure 3.4: This figure illustrates the basic concepts of AdaSVM.
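The key step of AdaSVM, turning the boosting rounds into a feature selection, can be sketched as below, assuming the ensemble is a list of (alpha, feature_index, threshold) tuples produced by an AdaBoost run (a representation we chose for illustration; the thesis does not prescribe one):

```python
def boosted_feature_indices(ensemble):
    # each boosting round selected one (feature, threshold) pair;
    # the distinct feature indices form the reduced set fed to the SVM
    return sorted({f for _alpha, f, _theta in ensemble})

def reduce_features(X, selected):
    # keep only the boosting-selected columns of the data matrix
    return [[row[j] for j in selected] for row in X]
```

The final SVM is then trained on `reduce_features(X_train, selected)` instead of the full feature vectors, which is where the dimensionality reduction comes from.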
3.3 Cross-Validation
In prediction theory, the best classifier is the one that minimizes the error on a test set. Moreover,
tuning the model's parameters is often an important step to improve the accuracy
of the classifier. To account for these requirements the dataset is usually split into three different groups:
a training set, used during the construction of the model, a validation set, used to
choose the parameters that maximize the accuracy of the model, and a test set, used to assess the predictive
power of the model. This separation is essential because the first source of bias in pattern
recognition techniques, also known as "double-dipping", occurs when a sample is involved in its own
classification, causing the obtained results to be overestimated [32].
There are different ways to split the dataset into the corresponding groups. The first is to separate
it into three mutually exclusive subsets where 2/4 of the dataset corresponds to the training population,
1/4 to the validation group and the remaining 1/4 to the test set. Although it is a valid
approach, this method raises some problems. In datasets where the number of features is
much larger than the number of examples, a problem known as the curse of dimensionality
that will be addressed in Section 3.5, this separation into three different groups leaves the training
group unrepresentative, which harms classification accuracy. The prediction error of the
resulting model is not representative of the problem, since the model does not have enough information.
To overcome the aforementioned issue, different dataset separation methods were developed, and
the most widely used is k-fold cross validation, a technique broadly employed to
evaluate the prediction error of models. Here, the data is randomly partitioned into K sets of
approximately equal size and each of the K sets is used for testing while the other K−1 are used to
build the model. Within each training set, another cross validation step is
usually performed to obtain training and validation sets (see Figure 3.5). At the end of the process,
K different models are obtained and the resulting error corresponds to an average. Although it is
computationally costly, since K different models have to be built, cross-validation has some advantages,
such as making full use of the data and achieving a more accurate evaluation of the prediction error
with a lower variance, since it corresponds to an average over K folds [57,58].
Figure 3.5: A schematic view of the cross-validation process.
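The k-fold partitioning described above can be sketched as follows (index bookkeeping only; shuffling and the inner validation loop are omitted for brevity):

```python
def kfold_splits(n, k):
    # partition indices 0..n-1 into k folds of roughly equal size;
    # each fold is the test set once, the remaining k-1 folds train the model
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]
```

In a nested scheme, each training list would itself be re-split the same way to tune parameters on validation folds before the outer test fold is ever touched.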
3.4 Feature Extraction
3.4.1 Voxel Intensity
The aim of this thesis is the construction of different CAD systems to forecast the conversion
from MCI to AD, as outlined in Section 1.2. For that, VI from 3D FDG-PET images was used as the
feature extraction method.

The VI values, V (x, y, z), were taken directly from the ADNI FDG-PET images and are closely related
to glucose uptake; for that reason, as explained in Section 1.1.3, VI values constitute a
good candidate for detecting preclinical stages of AD. As detailed in Section 3.1.3, the FDG-PET
images from ADNI are already processed, resulting in a normalized set of images all having the same
dimensions, 128×128×60. In this way, the domain D of V (x, y, z) can be stated as:

D = {(x, y, z) ∈ N : 1 ≤ x ≤ 128, 1 ≤ y ≤ 128, 1 ≤ z ≤ 60} , V (x, y, z) ∈ [0, 32700] (3.31)
3.4.1.A 1st Mask: Voxels inside the brain volume
In order to reduce the dimensionality of the classification task, another pre-processing step was
applied in which only the voxels inside the brain were considered for classification purposes, while
those lying outside the brain were ignored. To do that, a binary mask was built which is true for
voxels inside the brain and false otherwise. First, an average brain is computed using the CN patients'
information, and then this volume is thresholded at 5% of its maximum value to account for some
variation across the population. The result of applying this mask is shown in Figure 3.6.
Figure 3.6: On the left, a slice of the average brain of all CN patients. On the right, the same slice after the binary mask application.
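The masking steps above can be sketched on flattened volumes as below (toy numbers for illustration; real ADNI volumes have 128×128×60 voxels):

```python
def brain_mask(cn_volumes, fraction=0.05):
    # average the CN volumes voxel-wise, then keep voxels whose mean
    # intensity exceeds `fraction` (5%) of the maximum of the average brain
    n = len(cn_volumes)
    mean = [sum(vol[j] for vol in cn_volumes) / n
            for j in range(len(cn_volumes[0]))]
    cutoff = fraction * max(mean)
    return [m > cutoff for m in mean]

def apply_mask(volume, mask):
    # discard voxels outside the brain; only the rest enter the feature vector
    return [v for v, inside in zip(volume, mask) if inside]
```

Every subject's volume is reduced with the same mask, so all feature vectors keep the same length and voxel correspondence.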
3.4.1.B 2nd Mask: Voxels inside specific brain regions
In this thesis the application of masks other than the previous one was also exploited.
Those masks correspond to different ROIs, and our aim was to find the set of regions that have the
most discriminative power during the classification task, i.e., the regions that are most involved in the
conversion process. These masks were built based on the brain regions previously delimited by Dr.
Durval Campos Costa, from the Champalimaud Foundation (Figure 3.7).

For the purpose of this study, the selected regions are: the Left Lateral Temporal lobe, the Left
Dorsolateral Parietal, the Right Dorsolateral Parietal, the Superior Anterior Cingulate and the Posterior
Cingulate and Precuneus. Those regions showed a higher discriminative power and some of them,
such as the Left Lateral Temporal Lobe and the Posterior Cingulate and Precuneus, are in line with
other studies like [17]. The other three regions were included following indications from Dr. Durval
Campos Costa, and the resulting set achieved better results.
Although both masks can be seen as a feature selection step rather than a feature extraction
procedure, it is more natural to describe them here, as a pre-processing step, because they depend
only on anatomic position and not on the features' information.
3.5 Feature Selection
Although the feature extraction step accounts for a reduction in computational time, it does not give
any statistical information about the dataset, and the feature vectors obtained after the pre-processing step
Figure 3.7: In 3.7(a) the Left Lateral Temporal lobe appears in dark red, the Right Lateral Temporal in red and the Inferior Anterior Cingulate in dark blue. In 3.7(b) the Left Mesial Temporal lobe appears in dark blue, the Right Mesial Temporal in blue and the Inferior Frontal Gyrus/Orbitofrontal in dark red. In 3.7(c) the Left Dorsolateral Parietal lobe appears in dark red, the Right Dorsolateral Parietal in blue, the Superior Anterior Cingulate in red and the Posterior Cingulate and Precuneus in dark blue.
can easily have a huge dimension. A feature vector with high dimensionality combined with a comparatively
small sample size is a common problem in pattern recognition and creates serious
challenges for the classifier's performance, since the risk of overfitting is higher, i.e., the classifier can
perform well on the training sample but lack generalization power and perform poorly on
unseen data.
During the feature selection procedure, the feature vector of each instance is reduced to a new
space of variables with reduced dimensionality. Typically the problem becomes easier to solve
due to the fact that the selected features are the ones that are most discriminative
for the problem at hand [26].
There are some benefits of using feature selection procedures [59]:
• Facilitating data visualization: with a reduced number of features the identification of patterns in
data is easier;
• Reducing storage requirements: with fewer features, the memory occupied by the information is
also smaller;

• Reducing training and utilization times: the time needed to train and test the model is shorter
because the dimensionality of the problem is smaller;

• Improving prediction performance: the relevant features are selected and the noisy ones are
discarded, and thus the classifier performance can be improved.
The feature selection procedure can be seen as a search problem where the number of features
is reduced to a subset of relevant ones. Four main strategies are used for this task. The
first broad class of methods is known as filter methods; here the features are selected based
only on training set characteristics, and the performance of the classifier does not influence the final
decision. The major advantage of these methods is that they perform faster than the others. In the second class
of methods, known as wrapper methods, the adopted strategy is different: the features are
selected based on the maximal ACC of the model on the training data. Although this strategy achieves
better results, it is more computationally costly. In the third class, the embedded methods, the features are
selected during the training process, and this approach is specific to the applied algorithm. Finally, the
last class, known as hybrid methods, combines the first two approaches mentioned herein,
i.e., a subset of candidate features is first quickly selected by a filter approach and this subset
is then reduced based on a more accurate wrapper strategy [59–61].
In the next three sections, the feature selection procedures exploited in this thesis are
summarized, but first it is important to clarify some notation. In the following sections, consider the
dataset χ in which:

χ = {(x(1), y(1)), . . . , (x(N), y(N))} (3.32)

where x(i) = (x1(i), . . . , xK(i)) is the K-dimensional feature vector of the ith example of the dataset,
with i ∈ {1, . . . , N}.
3.5.1 Pearson Correlation Coefficient
In statistical analysis, the dependence between two variables can be measured by correlation. The
PCC, represented by r, measures the linear association between two variables [62], in this case
between each feature xk and the class label y, and can be represented by:

r = cov(xk, y) / ( √var(xk) √var(y) ) (3.33)
where cov stands for covariance and var for variance. An estimate of r is given by [63,64]:

r = ∑_{i=1}^{N} (xk(i) − x̄k)(y(i) − ȳ) / ( √(∑_{i=1}^{N} (xk(i) − x̄k)²) √(∑_{i=1}^{N} (y(i) − ȳ)²) ) (3.34)
where x̄k is the mean value of the kth feature and ȳ the mean value of the label vector y. The
r coefficient has no units and takes values from −1 to 1, −1 ≤ r ≤ +1. A value
of r close to ±1 indicates a strong linear relationship between the two variables regardless of
direction: a positive value indicates a direct relation, i.e., an increase in the first variable
means an increase in the second one, while a negative value represents an inverse relation, in which
one variable decreases as the other increases [62]. Here, the best features for the model are
chosen based on a ranking of the absolute value of the r coefficient.
To conclude this topic, for the special case where the correlation is computed between a continuous
variable and a dichotomous one, as in the case at hand, the correlation coefficient can be simplified to:

rpb = ((M1 − M−1) / sxk) √(N1 N−1 / N²) (3.35)

where rpb is the point-biserial correlation coefficient, a mathematical equivalent of the Pearson
correlation coefficient [65], M1 is the mean value of the examples x(i) with label y = 1, M−1 is the
mean value of the examples x(i) with label y = −1, N1 and N−1 are the numbers of examples with
label 1 and −1 respectively, N is the total number of examples in the dataset and sxk
is the standard deviation of xk, given by:
sxk = √( (1/N) ∑_{i=1}^{N} (xk(i) − x̄k)² ) (3.36)
Here we consider a dichotomous variable y ∈ {−1, 1}, but the same holds for values other
than −1 and 1.
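Equations (3.35) and (3.36) translate directly into a ranking procedure; a minimal Python sketch on toy data (function names are ours):

```python
import math

def point_biserial(xs, ys):
    # Eq. (3.35): correlation between a continuous feature and labels in {-1, 1}
    n = len(xs)
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    m1, m_1 = sum(pos) / len(pos), sum(neg) / len(neg)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)   # Eq. (3.36)
    return (m1 - m_1) / std * math.sqrt(len(pos) * len(neg) / n ** 2)

def rank_features(X, y, top):
    # rank features by |r| and keep the `top` highest, as in Section 3.5.1
    n_feat = len(X[0])
    scores = [(abs(point_biserial([row[k] for row in X], y)), k)
              for k in range(n_feat)]
    return [k for _score, k in sorted(scores, reverse=True)[:top]]
```

A feature that perfectly separates the two labels scores |r| = 1 and is ranked first, while a feature whose class means coincide scores 0 and is discarded.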
3.5.2 Boosting
As explained in Sections 3.2.2 and 3.2.3, each weak classifier depends on only one feature, and
thus the process of choosing a weak classifier is, in fact, a feature selection procedure. The features
are chosen based on error minimization (see Equation (3.26)), i.e., those that most accurately
span the classification problem are selected [51]. Boosting can be seen as an embedded method because the
features are selected during the training process and, although it is more computationally costly,
prior knowledge of the features best suited for classification is not required [51].
3.5.3 Mutual Information
There are several different approaches to measuring the linear relationship between two variables,
such as the Pearson Correlation Coefficient, already mentioned in Section 3.5.1, or the Euclidean
distance but, sometimes, the relation between two variables cannot be expressed in a linear way.
MI is a filter method based on information theory and captures different types of dependencies
between two variables, not just the linear ones [66]. The relation between two random variables x
and y can be measured by considering the divergence between the joint distribution and the product
of the marginals, and is given by [26]:

I(x, y) = ∬ p(x, y) log( p(x, y) / (p(x)p(y)) ) dx dy (3.37)
or, in the discrete case, where the integrals are replaced by summations and the continuous
distributions are estimated by frequency counts [67]:

I(x, y) = ∑_{x∈X} ∑_{y∈Y} P(x, y) log( P(x, y) / (P(x)P(y)) ) (3.38)
where X and Y are the sets of all possible values of the variables x and y, respectively. The aim of using
Mutual Information as a feature selection procedure is to select the features that maximize the joint
mutual information [68]:

I(X1:K, Y) = ∑_{T⊆S} I({T ∪ Y}) (3.39)
where ∑_{T⊆S} denotes the sum over all possible subsets T drawn from S, and S is the set of all possible
input features, S = {X1, . . . , XK}. If we assume that there are no higher order relations other than
conditional and unconditional pairwise relations, the summation is truncated such that |T| ≤ 2, which
gives [68]:

I(X1:K, Y) ≈ ∑_{i=1}^{K} I(Xi, Y) + ∑_{j=1}^{K} ∑_{l=j+1}^{K} I({Xj Xl, Y}) (3.40)
where K is the total number of features in the dataset. The feature selection procedure is an
iterative algorithm in which the utility of choosing feature XK when K − 1 features have already been
selected is quantified by I(XK; Y | X1:K−1) = I(X1:K; Y) − I(X1:K−1; Y). By using the approximation
presented in Equation (3.40) and the definition of interaction information, this utility can be expressed
as:

Jfou = I(XK; Y) − ∑_{i=1}^{K−1} [ I(XK; Xi) − I(XK; Xi | Y) ] (3.41)
where Jfou is the first-order utility (first-order pairwise interaction) (FOU) of including feature
XK [68]. The FOU consists of three different components: the feature's own mutual information with
the class label, a penalization for high correlation with already selected features, and a positive
contribution accounting for the dependency on the class conditional probabilities, which indicates that
the best feature is the one with the best trade-off between these three components [68]. The FOU can
be written in a parameterized way in which not only is the trade-off easier to understand, but different
mutual-information-based feature selection criteria can also be subsumed, since different methods
essentially differ in the weights given to each component [68]:
Jfou = I(XK; Y) − β ∑_{i=1}^{K−1} I(XK; Xi) + γ ∑_{i=1}^{K−1} I(XK; Xi | Y) (3.42)
In the present work, we set β and γ to zero and consider just the MI component between
the feature and the class label. For feature selection purposes, an MI ranking, which can take values
between 0 and 1, is computed and the features with the highest MI values are selected. Once again,
note that the redundancy between features is not taken into account.
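The frequency-count estimate of Equation (3.38) takes only a few lines of Python; a minimal sketch (we use base-2 logarithms, so values for binary variables fall in [0, 1] bits; the thesis does not state the base):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # Eq. (3.38): I(x, y) = sum_xy P(x,y) * log2( P(x,y) / (P(x) P(y)) ),
    # with all probabilities estimated by frequency counts
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

A feature identical to the label yields 1 bit of mutual information, while an independent feature yields 0; ranking features by this value implements the selection criterion used here.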
4 Experimental Results

Contents
4.1 Introduction
4.2 Model's adjustable parameters
4.3 SVMs Results
4.4 AdaBoost Results
4.5 AdaSVM Results
4.6 Summary of all the Results
4.1 Introduction
One of the goals of this work is to perform a comparison between the three machine
learning techniques already mentioned, i.e., SVM, AdaSVM and AdaBoost. Within each method,
different approaches were also studied in order to evaluate the importance of different types of feature
extraction and feature selection techniques, as well as to find the best way of tuning the model's
parameters. In the next sections all those approaches will be explained in more detail and all results
will be presented.
This chapter also compares all the studies in order to infer which is the best classifier, the best
feature extraction procedure and the best feature selection method for the detection of MCI to AD
conversion.
Before starting, it must be clarified how the performance of a model is evaluated. There
are many different ways to measure the prediction power of a model, such as the ACC,
the BA, the SENS or the AUC. Although the most frequently used measure is the ACC, which measures
the effectiveness of predicting the right class label [39], sometimes it is not the best way to
evaluate a classifier. Here, the AUC is the most appropriate measure, since the dataset in
this thesis is imbalanced and the number of negative examples (MCI-NC) is almost twice the number
of positive ones (MCI-C) [27]. Note that the AUC measures the probability of assigning a higher value
to a positive sample when one positive and one negative sample are drawn at random [39].
4.2 Model’s adjustable parameters
When the data is imbalanced, the tendency of the classifier is to classify everything
according to the most significant class, so assigning different weights to different classes becomes very
important, as it can impose a higher penalization on the less represented class, thereby shifting
the decision boundary and making the model more sensitive (see Figure 4.1). In these cases,
the C parameter, which as noted in Section 3.2.1.B controls the width of the margin according to the
cost of misclassification in the SVM algorithm, is also very important and should be tuned properly.
Three different approaches were tested in the SVM and AdaSVM studies, and their characteristics
are clarified next.
1. Without Weights (WOW): the same penalty is given for a mistake committed on either class.
The C parameter is chosen using a grid search, from a previously determined set of values,
based on the model's ACC. For this purpose a 10-fold CV is performed for each possible value
of C, and the selected value is the one that maximizes the mean ACC score over the 10 validation
sets;
2. With Weights (WW): the penalty for a mistake committed on the majority class is lower than for a
mistake committed on the less represented one. The weight of the majority class is set to
1 and the weight of the minority class is found by dividing the number of subjects belonging
to the majority class by the number of individuals in the minority class. The C
parameter is chosen exactly as in the previous test, but now taking the different
penalizations into account;
3. With Weights and based on Balanced Accuracy (WWBA): as in the previous approach, different
classes have different mistake penalizations, but now the C parameter is tuned based on a
different criterion, the BA (the mean of the true positive rate, also known as
SENS, and the true negative rate, normally called SPEC).
Figure 4.1: Consider a synthetic dataset where the negative class (blue crosses) has 10 times more samples than the positive one (red crosses). If weights are not used, the decision boundary (blue line) is such that, on one hand, the negative test examples (black circles) are almost all correctly classified and, on the other hand, more positive examples (pink circles) are misclassified. When the weights are taken into account, the decision boundary shifts (green line) and the sensitivity of the model, the number of positive examples correctly classified, increases but, at the same time, the specificity decreases.
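The weighting scheme of the WW test and the tuning criterion of the WWBA test can be sketched as follows (function names are ours, chosen for illustration):

```python
def class_weights(y):
    # WW scheme: majority class weighted 1, minority class weighted by the
    # ratio of majority-class to minority-class subjects
    pos = sum(1 for label in y if label == 1)
    neg = len(y) - pos
    if pos < neg:
        return {1: neg / pos, -1: 1.0}
    return {1: 1.0, -1: pos / neg}

def balanced_accuracy(y_true, y_pred):
    # WWBA criterion: mean of sensitivity (TPR) and specificity (TNR)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)
```

During the grid search over C, each candidate model is scored either by mean ACC (WOW, WW) or by this balanced accuracy (WWBA) on the validation folds.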
4.3 SVMs Results
There are many possibilities for building an SVM model. Here, three different ways of performing
a linear SVM classification were exploited:
1. The classifier is trained with patterns from the CN and AD population and tested on the MCI set.
In this case the CN subjects are used as the training patterns for the MCI-NC class whereas the
AD subjects are considered the training patterns for the MCI-C class;
2. The classifier is trained and tested using only patterns from the MCI population;
3. The classifier is trained with all subjects in the dataset and tested on the MCI population. As
in 1, the CN subjects are considered as MCI-NC like and the AD subjects are considered as
MCI-C like.
In each case, the influence of different feature extraction and feature selection techniques were
also studied.
4.3.1 Training with CN and AD subjects
In this first approach, the linear SVM model is built based on information from CN and AD patients
and then applied to the MCI population. The number of features varies between 50 and 10.000 and
the selected features, from a pool of more than 300.000, are those that have the highest PCC. For
classification purposes, the three tests previously mentioned in section 4.2 were performed.
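The PCC ranking step can be sketched as below. The data here is synthetic (the real features are voxel intensities) and the sizes are scaled down from the 300.000-voxel pool purely for illustration; the five "informative" voxels are an artificial construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subj, n_vox, k = 60, 500, 50                  # scaled-down problem sizes
y = rng.integers(0, 2, n_subj).astype(float)    # 0 = CN, 1 = AD (toy labels)
X = rng.normal(size=(n_subj, n_vox))            # stand-in voxel intensities
X[:, :5] += 2.0 * y[:, None]                    # five voxels carry class signal

# Pearson correlation coefficient of every voxel with the class label.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
pcc = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Keep the k voxels whose correlation magnitude is highest.
selected = np.argsort(-np.abs(pcc))[:k]
```

On this toy data the five planted voxels end up at the top of the ranking, which is exactly the behaviour the filter relies on.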
The same study was then repeated but, instead of using all the voxels inside the brain volume, only
the information from the voxels inside the five regions described in section 3.4.1.B was used.
Once again, the number of selected features varies between 50 and 10.000, now from around
28.000 possibilities, and the selected features are those with the highest PCC value.
The third study is identical to the first one, but MI was used as the feature selection criterion
instead of PCC. The aim of this study was to understand the influence of the feature selection
method on the final classification performance, i.e., whether the model benefits from a criterion
different from the previous one, which only considered the linear relation between the variables
and the class label.
Finally, the second study was repeated, again with MI as the feature selection criterion instead
of PCC.
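MI ranking can be sketched with a simple histogram estimator, under the same synthetic setup as before. The bin count, the problem sizes and the planted informative voxels are illustrative assumptions; the actual experiments may have used a different MI estimator.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X; Y) in nats, for a continuous voxel x
    and a discrete class label y."""
    joint, _, _ = np.histogram2d(x, y, bins=(bins, 2))
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
n_subj, n_vox, k = 60, 200, 20
y = rng.integers(0, 2, n_subj).astype(float)
X = rng.normal(size=(n_subj, n_vox))
X[:, :5] += 3.0 * y[:, None]                  # strongly informative voxels

mi = np.array([mutual_information(X[:, j], y) for j in range(n_vox)])
selected = np.argsort(-mi)[:k]                # highest-MI voxels, as in the text
```

Unlike PCC, this score also responds to non-linear dependence between a voxel and the label, which is the motivation stated above for trying it.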
Figure 4.2 shows the best ROC curves of all these approaches, where the red trace corresponds
to the first classifier (WOW test), the blue trace to the second (WWBA test), the green to the third
(WOW test) and the black trace to the fourth one (WOW test).
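The AUC values reported alongside these ROC curves can be computed directly from the classifier's decision values via the rank (Mann-Whitney) statistic. This is a generic sketch, not the exact routine used in the experiments, and it ignores tied scores; the four decision values at the end are made up.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC = probability that a random positive outranks a random negative,
    computed from the ranks of the decision values (ties not handled)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy decision values for 2 converters (label 1) and 2 non-converters (label 0).
example_auc = auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Here one converter scores below one non-converter, so `example_auc` is 0.75 rather than 1.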
For all the approaches the best results, regardless of the number of features, are summarized in
Table 4.1.
Table 4.1: The highest results for the SVM classifier when CN and AD are used for training, in terms of Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA) and Area Under the Curve (AUC).

SVM           Tests  ACC(%)  SENS(%)  SPEC(%)  BA(%)  AUC(%)
PCC           WOW    71,3    61,5     80,2     68,8   77,0
              WW     70,6    67,3     75,8     69,9   76,5
              WWBA   71,3    67,3     75,8     70,0   76,9
ROIs & PCC    WOW    74,1    65,4     84,6     70,2   80,8
              WW     74,1    63,5     84,6     71,8   80,6
              WWBA   73,4    71,2     81,4     72,9   81,3
MI            WOW    73,4    76,9     71,4     74,2   78,0
              WW     70,6    69,2     78,0     69,8   77,3
              WWBA   70,6    69,2     75,8     69,5   77,3
ROIs & MI     WOW    75,5    71,2     84,6     71,1   81,9
              WW     74,1    76,9     79,1     71,3   81,4
              WWBA   74,1    75,0     82,4     74,3   81,4
Figure 4.2: A comparison between the best ROC curves achieved when CN and AD subjects are used for training, using the information from the whole brain (2500 features), from the ROIs (10.000 features), from the whole brain with MI as the feature selection criterion (1000 features), and applying the same criterion to the ROIs (10.000 features).
Although SVMs are somewhat robust to imbalanced data, because they only take into account
the support vectors, i.e., the points in the feature space that are close to the decision boundary [27],
the introduction of different penalties according to the number of examples in each class is usually
important for the success of the classification task.
In the first study, as Table 4.1 shows, the overall performance remains essentially unchanged,
but the use of weights introduces a slight difference in the results: when weights are used, the
SENS of the model increases by about 6% and the SPEC decreases in almost the same proportion,
as Figure 4.3 confirms. These results are in line with our expectations: since the MCI-C class is
less represented, applying the weights shifts the decision boundary (see Figure 4.1), so more
positive examples are correctly classified but, on the other hand, more negative ones are
mislabelled. Here, the differences between the second and the third tests are not significant.
(a) Sensitivity (b) Specificity
Figure 4.3: Sensitivity and Specificity variation as a function of the number of features for an SVM classifier when CN and AD are used as the training set and all voxels inside the entire brain volume are considered. Here the features are selected according to the PCC ranking criterion.
The second study is similar to the first one, but a different feature extraction mask is used. The
SENS and the SPEC reached in this study are almost 4% better than in the previous one, and the
ACC as well as the BA improved by about 3%. The AUC achieved is also 4% better, which shows
that the set of regions considered is quite discriminative for the problem of predicting MCI-to-AD
conversion. The same conclusion can be drawn from Figure 4.2, where the ROC curve of the
second classifier, in which only the voxels inside the ROIs are used (blue trace), always lies
above the ROC curve of the first one, in which all voxels inside the brain volume are taken into
account (red trace).
The transition from the MCI stage to AD is not linear, so the use of feature selection methods
other than PCC can be beneficial for the model. With the aim of studying the influence of
different feature selection procedures on the predictive power of the classifier, the same two
studies were repeated using MI ranking as the criterion to select the best features. When only
the voxels inside the ROIs were considered, the AUC of the classifier is almost 3% better, which
once again puts in evidence the high predictive power of those regions (see Figure 4.2). The ACC,
the BA and the SPEC are also higher than when the voxels within the entire brain volume were
used. The obtained results are, in general, better than the ones achieved with the PCC criterion
(see Figure 4.2), especially in terms of SENS, which improved by between 2% and 15%. To better
understand this discrepancy, the classifier outputs for the tests that registered the best SENS
results were compared; they are presented in Figures 4.4 and 4.5.
Figure 4.4: A comparison between the outputs (decision values per subject, grouped as CN, MCI-NC, MCI-C and AD) of SVM classifiers when CN and AD are used for training, with PCC (top) and MI (bottom) as the feature selection procedure.
Figure 4.5: A comparison between the outputs (decision values per subject, grouped as CN, MCI-NC, MCI-C and AD) of SVM classifiers when CN and AD are used for training and only the voxels that fall inside the ROIs are used, with PCC (top) and MI (bottom) as the feature selection procedure.
First of all, as the figures show, the MCI-NC pattern is closer to the CN pattern and the MCI-C
pattern to the AD one, in line with our expectations. In terms of classification performance, the
number of misclassifications when the ROIs are used is always smaller than when all voxels inside
the brain volume are taken into account. Moreover, when MI is used as the feature selection
procedure, the number of mislabelled samples within the training classes is also smaller.
It is also important to note that, in the third study, when all Voxels in the Entire Brain volume
(VEB) are considered and the features are selected according to the MI ranking criterion, the
SENS of the classifier in the WOW test is higher, which is not in line with our expectations (see
Figure 4.1). This happens because the number of false positives in the WOW test is already very
high, so the introduction of weights has the opposite effect: the SPEC increases and the SENS
decreases.
Finally, to conclude this discussion of SVM classifiers with CN and AD as the training classes,
and to better understand what is happening, the selected features were analysed (see Figures 4.6
and 4.7).
Figure 4.6: Features selected by PCC to build the classifier when CN and AD subjects are used for training.
Figure 4.7: Features selected by MI to build the classifier when CN and AD subjects are used for training.
The features selected by PCC influence each other, since the correlation coefficient value of one
voxel depends on the values registered by its direct neighbours, and so the selected feature
pattern shows less variation, as can be seen in Figure 4.6. This dependency is easier to
understand with Figure 4.8, in which a PCC filtering operation is simulated. On the other hand,
as Figure 4.7 shows, the MI method applies a different kind of filter; the pattern of selected
features is therefore sparser and the model can predict AD conversion more accurately.
Figure 4.8: A mathematical interpretation of correlation. From the bottom figure it is possible to understand that the pixel of interest (the one in the middle) is evaluated as -1 x 83 - 1 x 94 + 1 x 48 + 1 x 95 = -34 [69].
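The worked example from the caption can be reproduced as a tiny neighbourhood filter. The exact kernel layout is an assumption inferred from the quoted arithmetic: −1 weights on the two neighbours that enter negatively and +1 on the two that enter positively.

```python
import numpy as np

# 3x3 patch around the pixel of interest; the four neighbour values are the
# ones quoted in the caption (83, 94, 48, 95), the rest are irrelevant here.
patch = np.array([[0, 83,  0],
                  [94, 0, 48],
                  [0, 95,  0]])

# Assumed kernel layout reproducing the caption's arithmetic.
kernel = np.array([[0, -1, 0],
                   [-1, 0, 1],
                   [0,  1, 0]])

response = int(np.sum(patch * kernel))   # -1*83 - 1*94 + 1*48 + 1*95 = -34
```

Because each response mixes the values of direct neighbours, adjacent voxels obtain similar scores, which is why the PCC-selected pattern in Figure 4.6 looks spatially smooth.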
4.3.2 Training and testing with MCI data
In this study, the model was built and tested based only on the MCI subjects' information. To
avoid overfitting, the separation between the MCI training set and the MCI testing set was done
by 10-fold CV, repeated 10 times. As in all the studies mentioned so far, the number of selected
features varies between 50 and 10.000, and those selected are the ones with the highest PCC value.
Once again, three different tests were performed.
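The 10-times-repeated 10-fold CV protocol can be sketched generically as below. To keep the sketch self-contained, a nearest-centroid classifier stands in for the SVM and the data is synthetic; only the splitting and averaging logic mirrors the protocol described above.

```python
import numpy as np

def repeated_kfold(X, y, fit, score, k=10, repeats=10, seed=0):
    """Mean and standard deviation of `score` over `repeats` reshuffles
    of a k-fold cross-validation split."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(len(y)), k)
        for f in range(k):
            train = np.hstack([folds[j] for j in range(k) if j != f])
            model = fit(X[train], y[train])
            scores.append(score(model, X[folds[f]], y[folds[f]]))
    return float(np.mean(scores)), float(np.std(scores))

# Stand-in classifier: assign each subject to the nearest class centroid.
def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def accuracy(model, X, y):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    pred = np.array(classes)[np.argmin(dists, axis=0)]
    return float(np.mean(pred == y))

rng = np.random.default_rng(4)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5)) + 2.0 * y[:, None]   # well-separated toy classes
mean_acc, std_acc = repeated_kfold(X, y, fit_centroids, accuracy)
```

The Mean ± Standard Deviation entries in Tables 4.2 and 4.3 are aggregates of exactly this kind, taken over the 100 fold-level scores.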
The second study is identical to the first, but a different feature extraction mask is used: the
one that only takes into account the information from the ROIs, as previously described in
section 3.4.1.B.
The third study is similar to the first one, i.e., the classifier was also built based on information
from the entire brain region, but MI ranking was used as the feature selection criterion.
Finally, the procedures of the second study were repeated, with the features again selected
according to the MI criterion.
For all the approaches described, the highest results are presented in Table 4.2, regardless of the
number of selected features. The results correspond to the mean values of the 10-fold CV procedure.
Figure 4.9 shows the best ROC curves of all these approaches, where the red trace corresponds
to the first classifier (WOW test), the blue trace to the second (WW test), the green to the third (WW
test) and the black trace to the fourth one (WOW test).
Table 4.2: The highest results, in terms of Mean±Standard Deviation, for the SVM classifier when MCI subjects are used for training. Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

SVM           Tests  ACC(%)     SENS(%)    SPEC(%)    BA(%)      AUC(%)
PCC           WOW    67,2±2,4   42,7±4,0   84,6±2,3   60,9±2,9   64,6±2,1
              WW     64,6±2,9   49,0±5,1   74,2±3,8   61,2±3,2   64,4±2,7
              WWBA   64,1±2,5   52,5±3,6   70,7±3,7   61,6±2,3   64,1±2,5
ROIs & PCC    WOW    70,1±1,4   37,7±3,2   90,9±2,7   62,3±1,2   68,3±1,4
              WW     71,0±0,9   47,7±4,1   86,3±1,4   65,2±1,2   68,3±1,4
              WWBA   68,7±2,2   52,9±4,2   80,1±3,9   64,1±1,8   67,8±1,1
MI            WOW    65,7±2,5   36,9±5,9   83,2±4,4   59,3±2,1   62,9±2,6
              WW     63,4±2,5   45,6±5,1   74,7±4,0   59,1±3,3   63,1±2,3
              WWBA   63,5±2,3   52,1±2,8   70,8±2,5   61,1±2,1   62,4±2,1
ROIs & MI     WOW    70,6±1,1   38,1±3,2   92,6±1,2   62,3±1,1   68,3±1,2
              WW     70,0±1,0   46,9±3,8   85,8±1,2   63,9±1,6   67,8±1,0
              WWBA   68,0±1,4   54,2±3,6   77,7±2,6   64,3±1,4   67,8±1,4
Figure 4.9: A comparison between the best ROC curves achieved when MCI subjects are used for training and testing, using the information from the whole brain (250 features), from the ROIs (10.000 features), from the whole brain with MI as the feature selection criterion (500 features), and applying the same criterion to the ROIs (10.000 features).
When the classifier is trained and tested with MCI subjects, the performance levels are lower,
because MCI information is noisier due to the variety of metabolic patterns found across the MCI
population. As already mentioned, patients do not follow a perfectly linear transition, so it is
sometimes difficult to know which stage a patient is in. Once again, when only the information
from the ROIs is used, the results are slightly better. Although the SENS shows no improvement,
and in the WOW and WW tests is even worse, the SPEC of the model increases by between 6%
and 12% and the AUC by about 4%. The superiority of the latter classifier can be verified in
Figure 4.9, where the corresponding ROC curve (blue trace) almost always lies above the ROC
curve of the first study, in which all voxels inside the brain volume are used for classification
(red trace).
In the third and fourth studies, in which MI is used as the feature selection procedure, the results
are very close to those achieved in the first and second ones, respectively (see Figure 4.9), and
thus it is not possible to say which feature selection method is best for this problem.
In all the tests performed in each study, the obtained results are in accordance with our
expectations: when weights are introduced, the SENS of the model increases and, at the same
time, the SPEC decreases. The study in which these differences are most noticeable is the last
one, where the WOW and WWBA tests differ by more than 15% (see Figure 4.10); however, these
large differences appear in all studies, because the number of MCI-NC subjects is more than twice
the number of MCI-C subjects (see Figure 4.1 for more details).
(a) Sensitivity (b) Specificity
Figure 4.10: Sensitivity and Specificity variation as a function of the number of features for an SVM classifier when MCI subjects are used as the training set and only the voxels inside the ROIs are considered. Here the features are selected according to the MI ranking criterion.
4.3.3 Training with all classes and testing with MCI population
During this study, the model is learnt in a different way: instead of a binary classifier, a
multiclass approach is used, as explained in Section 3.2.1.C. All classes, CN, MCI-NC, MCI-C
and AD, were used for training, but only the MCI individuals were considered for testing. Once
again, three different tests were performed. Here, different penalties were applied according to
the number of individuals in each class: the majority class has the lowest mistake penalization,
that is, 1, and the class with the fewest individuals has the highest one, found by dividing the
total number of examples in the majority class by the total number of examples in the least
represented one. To avoid overfitting, the separation between the MCI training set and the MCI
testing set was done by 10-fold CV, repeated 10 times. The number of selected features varies
between 50 and 10.000, and they correspond to those with the highest PCC.
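The penalty rule just described can be written down directly. The class counts below are hypothetical, used purely to illustrate the rule; the actual group sizes are given elsewhere in the thesis.

```python
# Hypothetical class sizes for the four training groups (illustrative only).
counts = {"CN": 68, "MCI-NC": 91, "MCI-C": 52, "AD": 70}

# The majority class gets penalty 1; every other class is penalised by the
# ratio N_majority / N_class, so mistakes on rarer classes cost more.
n_major = max(counts.values())
penalty = {c: n_major / n for c, n in counts.items()}
```

With these counts, MCI-NC (the majority) keeps penalty 1 while MCI-C, the smallest group, receives the largest penalty, 91/52.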
The other three studies follow the same principles as before: the second study is a repetition of
the first one using only the features that fall inside the ROIs; the third study also follows the
first one but uses MI as the feature selection criterion; and the fourth study is a repetition of
the second one with MI as the feature selection method.
For all the approaches described in this section, the highest results are presented in Table 4.3,
regardless of the number of selected features. The results correspond to the mean values of the
10-fold CV procedure.
Figure 4.11 shows the best ROC curves of all these approaches, where the red trace corresponds
to the first classifier (WWBA test), the blue trace to the second (WWBA test), the green to the
third (WOW test) and the black trace to the fourth one (WWBA test).
Table 4.3: The highest results, in terms of Mean±Standard Deviation, for the SVM classifier when all subjects are used for training. Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

SVM           Tests  ACC(%)     SENS(%)    SPEC(%)    BA(%)      AUC(%)
PCC           WOW    71,0±0,9   41,0±2,0   89,3±0,7   64,9±1,4   72,6±0,8
              WW     70,6±0,8   61,7±3,7   84,9±1,8   66,1±1,6   72,8±0,7
              WWBA   69,5±1,4   63,8±5,8   82,4±2,9   66,3±2,7   73,9±1,2
ROIs & PCC    WOW    72,7±0,8   45,2±3,4   90,2±1,2   66,1±0,7   75,6±1,7
              WW     69,4±1,1   57,5±2,5   78,7±2,7   66,3±1,1   76,5±1,2
              WWBA   70,1±1,0   59,8±2,5   78,7±1,7   67,2±0,9   77,6±2,0
MI            WOW    71,0±1,5   41,9±3,1   89,6±0,9   64,8±1,7   74,0±1,2
              WW     70,7±1,2   59,6±3,1   80,0±1,9   67,2±1,1   73,5±1,1
              WWBA   69,9±1,5   60,7±3,7   79,8±2,5   66,5±1,5   73,5±1,5
ROIs & MI     WOW    72,5±1,7   47,9±4,5   91,2±0,9   67,1±1,9   74,9±0,8
              WW     69,9±1,8   56,7±4,6   80,3±2,2   66,2±2,7   76,0±1,8
              WWBA   70,8±1,4   58,1±3,5   81,1±1,9   67,0±1,3   76,8±1,1
Figure 4.11: A comparison between the best ROC curves achieved when all subjects are used for training and testing, using the information from the whole brain (5000 features), from the ROIs (10.000 features), from the whole brain with MI as the feature selection criterion (2500 features), and applying the same criterion to the ROIs (5000 features).
When all classes are considered for training, the results for the same number of features (10.000)
exhibit an intermediate behaviour. As can be seen in Figure 4.12, the ROC curve of this study
lies below the one corresponding to the classifier built with the CN and AD patients' information
(red trace) but above the one obtained when only MCI patients are considered for training (black
trace), since the SENS of the model improves by up to 11% and the AUC increases by about 9%.
Figure 4.12: A comparison between the three different ways of training the SVM classifier explored herein, when the ROIs are used.
Once again, it is important to highlight that the SENS of the model increases considerably with
the introduction of weights. This improvement holds for all the studies and sometimes reaches the
remarkable value of 23%, as in the first study. The classifier built on information from the ROIs
shows a performance improvement of almost 4%.
As in the case where only the MCI information is considered, there are no significant differences
between the two feature selection procedures, which is also shown by the ROC curves in
Figure 4.11, where the red and green traces, as well as the blue and black traces, almost overlap.
To conclude the SVM analysis, the best way to build a classifier is to use the CN and AD subjects
as the training set and test the model on the MCI patients, since this classifier achieves the best
performance in terms of SENS and AUC. Here, the best feature selection procedure is MI, since
the obtained results are slightly better. Moreover, in the other two ways of building a classifier,
i.e., when only the MCI class or when all classes are used during the training phase, the feature
selection method does not make much difference, probably due to the noisy nature of the MCI
data. On the other hand, the feature extraction mask applied during the pre-processing phase
plays an important role: when only the voxels inside the ROIs are used, the achieved results are
at least 3% better, which proves the high discriminative power of those regions in distinguishing
between MCI converters and non-converters, and also puts in evidence that the feature selection
procedures cannot be perfect. Although the number of features needed to obtain these results
is more than double that of the pre-processing step explained in section 3.4.1.A (10.000 features),
the computational time decreases, since the dimensionality of the problem is much smaller
(almost 12 times). Still regarding the SVM classifiers, although for the majority of the studies
the best AUC results were achieved through the WOW test, the best way to tune the C parameter
is the WWBA method, because in almost every study the SENS increases, meaning that the
model becomes more prone to detecting a new positive case, which is also a very important
aspect when deciding which classifier is best for the problem at hand. The differences in AUC
between these two tests are not very significant, and in several cases the AUC in the WWBA
test registers a higher value. This analysis allows us to conclude that the introduction of weights
does not have much influence on the overall performance, but strongly influences the capacity of
the classifier to detect a new positive case.
Up to now, only SVMs were used for classification. As already mentioned, two other
classification approaches were also explored; the methods and results are explained in the next
two sections. At the end, Table 4.8 presents a summary of all the results obtained during this
thesis.
4.4 AdaBoost Results
Two different ways to build the AdaBoost classifier were investigated:
1. The classifier is trained with CN and AD population and tested on the MCI set;
2. The classifier is trained and tested based only on MCI information.
These options correspond to the first two considered when building an SVM classifier. The third
option, i.e., using all classes for training, is not considered, since the AdaBoost toolbox used in
these experiments does not handle multiclass classification problems.
Before presenting the results, it is important to note that this method was only tested with the
ROIs and not with the whole-brain voxels approach. In the Boosting method, the features are
selected in a different way: the AdaBoost algorithm automatically selects the most important
features for classification and therefore does not need a prior feature selection step, so the
performance level can be high even in high-dimensional problems. To do so, however, the
AdaBoost algorithm builds KN classifiers, in which K is the number of features and N is the
total number of training examples. To reduce the computational cost, and in order to directly
compare these studies with the ones performed with the SVM classifier, only the smallest
problems were considered.
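A compact sketch of AdaBoost with decision stumps, showing both the K×N stump search per round and the implicit feature selection. This is a generic textbook implementation on synthetic data, not the toolbox used in the experiments; the data sizes, number of rounds and the single planted informative feature are illustrative assumptions.

```python
import numpy as np

def best_stump(X, y, w):
    """Search all K features x N thresholds for the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)                     # (error, feature, thr, sign)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= t, 1.0, -1.0)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, rounds=10):
    w = np.full(len(y), 1.0 / len(y))              # uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = best_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = s * np.where(X[:, j] <= t, 1.0, -1.0)
        w *= np.exp(-alpha * y * pred)             # boost misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    agg = sum(a * s * np.where(X[:, j] <= t, 1.0, -1.0)
              for a, j, t, s in ensemble)
    return np.where(agg >= 0, 1.0, -1.0)

rng = np.random.default_rng(5)
y = np.repeat([1.0, -1.0], 15)
X = rng.normal(size=(30, 8))
X[:, 2] += 3.0 * y                                 # only feature 2 is informative
ens = adaboost(X, y)
used_features = {j for _, j, _, _ in ens}          # the implicitly selected voxels
train_acc = float(np.mean(predict(ens, X) == y))
```

Because each weak learner touches a single feature, the set `used_features` is the feature selection that comes for free with boosting, which is the property exploited in the studies below.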
4.4.1 Training with CN and AD subjects and using only the voxels within the ROIs
The model was built using the information from the ROIs of the CN and AD population previously
mentioned and tested on the MCI set. The number of features used during the classification
process varies between 50 and 1.000, chosen by boosting from a total of around 28.000
possibilities. The results are shown in Table 4.4.
Table 4.4: The highest results for the AdaBoost classifier when CN and AD subjects are used for training and only the voxels within the ROIs are used (4.4.1). Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

Study     ACC(%)  SENS(%)  SPEC(%)  BA(%)  AUC(%)
AdaBoost  79,0    76,9     80,2     78,6   83,4
If we compare this method with the second and fourth studies explained in section 4.3.1, which
cover classification with the SVM algorithm, improvements of 2,1% and 1,5% in AUC, respectively,
are noted. Figure 4.13 compares the best ROC curve of each of those studies; the ROC curve
corresponding to this study (red trace) is almost always above the others, and was obtained with
far fewer features (20 times fewer). The SENS is also almost 7% higher than the best obtained in
the second SVM study, which is a point in favour of AdaBoost, since misclassifying a patient as
a healthy subject may bring severe consequences and thus the SENS should be as high as
possible [39]. When compared with the fourth SVM study, in which MI was used as the feature
selection criterion, the SENS shows no difference, but this value is reached with only 50 features,
whereas the SVM needs 50 times more features to obtain the same level of performance.
Figure 4.13: A comparison between the best ROC curves achieved for different feature selection criteria when CN and AD subjects are used for training and only the voxels inside the ROIs are considered.
To conclude, and as discussed in section 3.2.2, the features chosen are those that minimize the
training error (see Equation 3.26). Each weak classifier selects its feature independently, and
thus the selected pattern is highly dispersed (Figure 4.14).
Figure 4.14: Features selected by Boosting when CN and AD subjects are used for training the model.
4.4.2 Training and testing with MCI data and using only the voxels within the ROIs
In this study, the features that belong to the ROIs were used to train and test an AdaBoost
classifier with the MCI subjects. As in the SVM studies, the separation between the MCI training
set and the MCI testing set was done by 10-fold CV, repeated 10 times. The results in terms of
ACC, SENS, SPEC, BA and AUC are presented in Table 4.5. The weight of each subject at each
iteration was also analysed and can be seen in Figure 4.15.
Table 4.5: The highest results, in terms of Mean±Standard Deviation, for the AdaBoost classifier when MCI subjects are used for training and only the voxels within the ROIs are used (4.4.2). Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

Study     ACC(%)    SENS(%)   SPEC(%)   BA(%)     AUC(%)
AdaBoost  69,0±1,8  51,7±1,8  80,5±3,2  66,0±1,8  69,5±1,6
Figure 4.15: Subjects' weights at each iteration.
This study shows that using the MCI subjects as the training set decreases the performance of
the classifier, even with AdaBoost. Nevertheless, the obtained results are slightly better (by 1%
or 2%) than when the SVM is used, and with fewer features, which is an important aspect since
it reduces the computational time.
As can be seen in Figure 4.15, some examples of the training set increase their weight at each
iteration, which suggests that they are mislabelled. However, because this is not a dominant
behaviour, the low performance of the classifier is probably due to the nature of the data, that
is, the difficulty in distinguishing between two patterns as similar as those of MCI converters and
non-converters. This is an important aspect to highlight, since the AdaBoost algorithm puts
more emphasis on the misclassified data; thus, if the examples are mislabelled, the importance
given to them does not reflect the truth and the model cannot achieve good generalization power.
4.5 AdaSVM Results
Finally, the AdaBoost and SVM algorithms are joined together to form a new classifier called
AdaSVM. To perform this kind of classification, the features are first selected by AdaBoost and
then fed to the SVM classifier. Only the two studies previously mentioned in Section 4.4 were
repeated for AdaSVM classification. Once again, only the voxels that lie inside the ROIs were
considered, due to computational time limitations.
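The two-stage AdaSVM pipeline — boosting picks the features, the SVM does the final classification — can be sketched end to end. To keep the example self-contained, the boosting stage is reduced to ranking features by their best single-stump error, and the SVM is a plain subgradient-descent linear version; data sizes, the planted informative features and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 40
y = np.repeat([1.0, -1.0], 40)
X = rng.normal(size=(n, p))
X[:, [3, 7, 11]] += 1.5 * y[:, None]            # three informative "voxels"

def stump_error(x, y):
    """Lowest training error of a one-feature threshold classifier."""
    return min(min(np.mean(np.where(x <= t, 1.0, -1.0) != y),
                   np.mean(np.where(x <= t, -1.0, 1.0) != y))
               for t in np.unique(x))

# Stage 1 (stand-in for AdaBoost's selection): rank features by stump error
# and keep the unique indices of the best ones ("repeated features" removed).
errors = np.array([stump_error(X[:, j], y) for j in range(p)])
selected = np.unique(np.argsort(errors)[:5])

# Stage 2: train a linear SVM by subgradient descent on those features only.
Xs = X[:, selected]
w, b = np.zeros(len(selected)), 0.0
for _ in range(300):
    viol = y * (Xs @ w + b) < 1                 # margin violations
    w -= 0.01 * (w - y[viol] @ Xs[viol])
    b -= 0.01 * (-np.sum(y[viol]))
train_acc = float(np.mean(np.sign(Xs @ w + b) == y))
```

The design point is that stage 1 shrinks the feature space drastically before the SVM ever sees the data, which is where the computational saving reported below comes from.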
4.5.1 Training with CN and AD subjects and using only the voxels within the ROIs
The set of features previously selected by boosting in section 4.4.1 is filtered to eliminate
repeated features and then used to train an SVM classifier, with the CN and AD population as
the training set and the MCI subjects for testing. The same three tests explained in section 4.2
were performed, and the highest results are presented in Table 4.6.
Table 4.6: The highest results for the AdaSVM classifier when CN and AD are used for training and only the voxels within the ROIs are used (4.5.1). Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

AdaSVM  ACC(%)  SENS(%)  SPEC(%)  BA(%)  AUC(%)
WOW     75,5    78,8     75,8     75,4   79,6
WW      74,1    80,8     75,8     75,0   79,2
WWBA    75,5    76,9     76,9     75,8   80,2
As Table 4.6 shows, the results in terms of SENS are at least 4% better than the ones presented
in the second and fourth SVM studies when CN and AD are used for training. In terms of ACC,
BA and SPEC, the results of all the aforementioned studies are very close. The AUC is also very
similar, but the same performance is achieved with a much smaller number of features, which is
a point in favour of boosting as the feature selection procedure when compared with PCC or MI.
Nevertheless, when compared with AdaBoost, this method obtained worse results, except for the
SENS, which improved by 4%. As expected, the introduction of different penalizations improves
the SENS of the model.
A comparison between the ROC curves achieved by the four methods is presented in Figure 4.16.
Here we can verify that, for the same number of features (500), the SVM classifier with PCC as
the feature selection method (blue trace) has the worst performance. It is followed by the SVM
with MI as the feature selection criterion (black trace), AdaSVM (green trace) and, finally,
AdaBoost (red trace), which shows the best result. From this direct comparison, it is also
possible to conclude that not only is AdaBoost the best classifier for differentiating between
MCI-NC and MCI-C when CN and AD are used as the training population, but Boosting is also
the best feature selection procedure for this classification task.
Figure 4.16: A comparison between different classifiers and classification procedures for a fixed number of features (500) when CN and AD subjects are used for training and only the voxels that lie inside the ROIs are used.
4.5.2 Training and testing with MCI data and using only the voxels within the ROIs
As in the first AdaSVM study, the set of features previously selected by boosting in section 4.4.2
is filtered to eliminate repeated features and then used to train an SVM classifier with the MCI
data. Once again, to avoid overfitting, the classification results are an average over a 10 times
repeated 10-fold CV procedure, with each of the 10 folds having its corresponding set of features.
The same three tests (WOW, WW, WWBA) were performed and the results are presented in
Table 4.7.
Table 4.7: The highest results for the AdaSVM classifier when MCI data is used for training and only the voxels within the ROIs are used (4.5.2). Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

AdaSVM  ACC(%)    SENS(%)   SPEC(%)   BA(%)     AUC(%)
WOW     63,4±2,9  47,5±4,7  72,7±3,9  59,8±2,8  62,1±1,8
WW      63,2±2,0  52,9±2,6  69,0±2,1  60,9±2,0  63,2±2,5
WWBA    62,6±1,8  52,5±3,7  68,9±2,1  60,4±1,9  62,5±1,7
In terms of SENS, the results are slightly better (by 1% or 2%) than the ones obtained with
AdaBoost, but in general, when compared with the earlier studies that use the same conditions
to build the classifier, this method obtained worse results, which once again highlights the noisy
nature of the data. Figure 4.17 compares the best ROC curves for the aforementioned studies,
in which only the MCI patients' information was considered during the training and testing
phases and only the voxels that fall within the ROIs were used (the feature extraction procedure
explained in 3.4.1.B). As can be seen, when MCI subjects are used for training, resorting to
Boosting as the feature selection criterion is not a good option. Boosting is an embedded method
of AdaBoost, and when it is applied as a feature selection method for an SVM trained on noisy
data, the discriminative power of the classifier becomes poor.
Figure 4.17: A comparison between different classifiers and classification procedures for a fixed number of features (500) when MCI subjects are used for training and only the voxels that lie inside the ROIs are used.
It is also important to note that for the AdaSVM classifier the best way of tuning the C parameter
is again the WWBA method, since in almost every study both the SENS and the AUC increase, with
the exception of the first study mentioned herein, which presents a higher SENS when the WW
method is used.
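One common way to implement such a C-tuning scheme is a cross-validated grid search scored on balanced accuracy with balanced class weights; this is a hedged sketch of the general idea, not the exact WWBA procedure, and the grid and data are illustrative:

```python
# Tuning the SVM C parameter by grid search on balanced accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=200, weights=[0.6, 0.4],
                           random_state=0)
search = GridSearchCV(SVC(kernel="linear", class_weight="balanced"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      scoring="balanced_accuracy", cv=5).fit(X, y)
print("best C:", search.best_params_["C"])
```

Scoring on balanced accuracy rather than raw accuracy prevents the chosen C from simply favouring the majority (non-converter) class.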
4.6 Summary of all the Results
Table 4.8 summarizes the best results achieved in each study in terms of ACC, SENS, SPEC, BA and
AUC, regardless of the number of features or the way in which the C parameter was tuned. When the
study was performed with a 10 times 10-fold CV procedure, the results appear in Mean ± Standard
Deviation format.
Table 4.8: A summary of all the results obtained during this thesis. Training Classes (TC), Feature Extraction (FE), Voxels in the Entire Brain volume (VEB), Feature Selection (FS), Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).
Classifier   TC      FE    FS        ACC(%)      SENS(%)     SPEC(%)     BA(%)       AUC(%)
SVM          CN&AD   VEB   PCC       71,3        67,3        80,2        70,0        77,0
SVM          CN&AD   ROIs  PCC       74,1        71,2        84,6        72,9        81,3
SVM          CN&AD   VEB   MI        73,4        76,9        78,0        74,2        78,0
SVM          CN&AD   ROIs  MI        75,5        76,9        84,6        74,3        81,9
SVM          MCI     VEB   PCC       67,2±2,4    52,5±3,6    84,6±2,3    61,6±2,3    64,6±2,1
SVM          MCI     ROIs  PCC       71,0±0,9    52,9±4,2    90,9±2,7    65,2±1,2    68,3±1,4
SVM          MCI     VEB   MI        65,7±2,5    52,1±2,8    83,2±4,4    61,1±2,1    63,1±2,3
SVM          MCI     ROIs  MI        70,6±1,1    54,2±3,6    92,6±1,2    64,3±1,4    68,3±1,2
SVM          ALL     VEB   PCC       71,0±0,9    63,8±5,8    89,3±0,7    66,3±2,7    73,9±1,2
SVM          ALL     ROIs  PCC       72,7±0,8    59,8±2,5    90,2±1,2    67,2±0,9    77,6±2,0
SVM          ALL     VEB   MI        71,0±1,5    60,7±3,7    89,6±0,9    67,2±1,1    74,0±1,2
SVM          ALL     ROIs  MI        72,5±1,7    58,1±3,5    91,2±0,9    67,1±1,9    76,8±1,1
AdaBoost     CN&AD   ROIs  Boosting  79,0        76,9        80,2        78,6        83,4
AdaBoost     MCI     ROIs  Boosting  69,0±1,8    51,7±1,8    80,5±3,2    66,0±1,8    69,5±1,6
AdaSVM       CN&AD   ROIs  Boosting  75,5        80,8        76,9        75,8        80,2
AdaSVM       MCI     ROIs  Boosting  63,4±2,9    52,9±2,6    72,7±3,9    60,9±2,0    63,2±2,5
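The metrics in the table relate to the confusion matrix as follows; the counts below are hypothetical, chosen only to approximately reproduce the best AdaBoost row (with MCI-C taken as the positive class):

```python
# Hypothetical confusion-matrix counts (MCI-C = positive class), chosen to
# approximately reproduce the best AdaBoost row in Table 4.8.
tp, fn, fp, tn = 40, 12, 19, 77

acc = (tp + tn) / (tp + tn + fp + fn)   # Accuracy
sens = tp / (tp + fn)                   # Sensitivity: detected converters
spec = tn / (tn + fp)                   # Specificity: detected non-converters
ba = (sens + spec) / 2                  # Balanced Accuracy
print(f"ACC={acc:.1%}  SENS={sens:.1%}  SPEC={spec:.1%}  BA={ba:.1%}")
```

Balanced accuracy averages the per-class rates, which is why it is the fairer figure when converters are the minority class.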
The following paragraphs summarize all the results presented so far. The best way to perform
feature extraction and feature selection, and the best classifier for each training and testing
situation, are among the questions addressed. The best way of tuning the model parameters in SVM
and AdaSVM, as well as the variation of the per-subject weighting vector obtained with the
AdaBoost classifier (see Figure 4.15), are two further topics that are not left out.
In SVMs, the best results were achieved when the classifier uses the CN and AD population as a
training set. In this case, the chosen feature selection procedure also proves to be an important
aspect, with MI outperforming PCC in both tests (see Figure 4.2). The same did not hold when
the model was built based only on MCI subjects' information, where the results obtained with the
different feature selection methods are too close to say which method better captures the
differences between converter and non-converter patients (see Figure 4.9). When all classes are
used for training, an intermediate performance level is achieved, which clearly shows the noisy
nature of MCI data and how much a classifier benefits from stable data such as CN and AD
patients' information (see Figure 4.12). For SVM classifiers, the best way of tuning the C
parameter proves to be the WWBA method.
The AdaBoost algorithm outperforms SVMs in all studies (see Figures 4.16 and 4.17), which can be
explained by differences in model construction. These high performances are obtained with fewer
features and, consequently, the necessary computing time is also reduced.
Since patients do not follow a linear evolution between the normal and AD stages, when Boosting
is used as the feature selection procedure the performance of SVMs improves in the case where CN
and AD were used as the training set, but the same did not hold in the case where MCI data was
considered, which shows that the Boosting methodology is not a good option as a feature selection
procedure when noisy data is used to build a classifier.
As with the SVM classifiers, for AdaSVM the WWBA method proved to be the best way of tuning the C
parameter. The hypothesis raised in Section 3.2.3, that joining the AdaBoost and SVM algorithms
sequentially would improve the predictive power of the model, is not verified. AdaBoost
outperforms AdaSVM in all studies, no matter which subjects are used to build the classifier, in
almost all aspects, with the exception of SENS, where the results improve by 4% for models built
on the CN and AD population and by 1% when only MCI information is used to train the SVM model.
To better understand what is happening, we looked at the outputs of the AdaSVM and AdaBoost
classifiers when the CN and AD population is used for training, and when MCI subjects are used
for the same purpose (see Figures 4.18 and 4.19).
[Plots: decision values per subject for the AdaBoost (left) and AdaSVM (right) classifiers, with CN, MCI-NC, MCI-C and AD subjects marked.]
Figure 4.18: A comparison between the outputs of the AdaSVM and AdaBoost classifiers when CN and AD patients are used for training.
[Plots: decision values per subject for the AdaBoost (left) and AdaSVM (right) classifiers, with MCI-NC and MCI-C training and test examples marked.]
Figure 4.19: A comparison between the outputs of the AdaSVM and AdaBoost classifiers when MCI subjects are used for training.
As can be seen in Figure 4.18, although AdaBoost presents a better class separation, AdaSVM
misclassifies fewer positive test examples, and thus the SENS of the model, i.e., its capacity to
detect a positive example, is higher.
In Figure 4.19 the noisy nature of the data is once again evident. Looking at the AdaSVM
classifier output, where the classifier commits several classification errors even within the
training classes, we can understand its worse classification performance. However, it is
important to highlight that even when MCI subjects are used in both the training and testing
phases, the AdaSVM classifier makes fewer errors among the positive testing examples and
therefore registers a better performance in terms of SENS than AdaBoost.
To conclude this discussion, the classifier that shows the best performance is AdaBoost trained
on CN and AD patients and tested on the MCI cohort, which achieved an ACC of 79%, with a SENS of
76,9%, a SPEC of 80,2% and a BA of 78,6%. The AUC was also computed and registered a remarkable
value of 83,4%.
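The AUC values reported throughout can be computed directly from a classifier's continuous decision values; the labels and scores below are illustrative, not the thesis data:

```python
# AUC from decision values: the probability that a randomly chosen positive
# (converter) is scored above a randomly chosen negative. Values illustrative.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]                      # 1 = converter
scores = [-1.2, -0.4, 0.1, 0.8, 1.5, -0.2, 2.0, 0.3]   # decision values
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.3f}")  # -> 0.875
```

Unlike accuracy, the AUC does not depend on a single decision threshold, which is why it complements the ACC/SENS/SPEC figures in the tables above.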
5 Conclusions and Future Work
The early detection of AD constitutes a major and important challenge today, as it allows early
treatment and an improvement in quality of life. In the last 20 years much attention has been
given to the detection of MCI conversion, which is, by definition, an intermediate stage between
normal elderly and AD, and new forms of CAD have emerged.
The present work tries to bring some alternatives to CAD in the task of detecting MCI-to-AD
conversion. Up to now, detection has mainly been based on SVM algorithms, so we tried to bring
something new to the state of the art by using AdaBoost for classification. A variant of SVM,
known as AdaSVM, was also studied. In addition, different methods for feature selection and
feature extraction were tested with the aim of understanding which is the best approach to
differentiate between the two patterns, i.e., between MCI converters and non-converters.
The classification method, as well as the way in which the classifier is built, has a strong
influence on the final classification performance. Three different classifiers were tested here:
SVM, AdaBoost and AdaSVM. AdaBoost proved to be the best choice. The way in which the classifier
is built also plays a role of major importance. In this thesis, three different approaches were
used for SVM and two for AdaBoost and AdaSVM. The one that showed the highest classification
performance was AdaBoost with the model trained on CN and AD information and then applied to the
MCI subjects. This is due to the noisy nature of MCI information, which introduces uncertainty
into the classification.
In the context of feature selection procedures, three different approaches were studied: PCC, MI
and Boosting. Here, it is difficult to say which one is the best choice, because it depends on
the classifier and on the conditions under which it is constructed. For example, although
Boosting achieves the best classification performance in AdaSVM when the classifier is trained
with the CN and AD population and tested on MCI subjects, the same does not hold when the model
is built with MCI information. Another example of such differences occurs with the SVM
classifier: when the model is built with CN and AD information, MI proves to be the best option,
but when the same classifier is trained with MCI patients, PCC and MI achieve similar results.
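The two filter criteria can be sketched as follows: each scores every voxel against the label and the top-ranked voxels are kept (synthetic data and k = 20 are illustrative, not the thesis values):

```python
# PCC vs. MI filter feature selection: score each feature against the label,
# keep the top-k. Synthetic data stands in for the PET voxels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=120, n_features=300, n_informative=10,
                           random_state=0)
pcc = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
mi = mutual_info_classif(X, y, random_state=0)
top_pcc = np.argsort(pcc)[::-1][:20]
top_mi = np.argsort(mi)[::-1][:20]
print("overlap between the two rankings:", len(set(top_pcc) & set(top_mi)))
```

PCC only measures linear association with the label, whereas MI can also capture nonlinear dependencies, which is one reason the two rankings need not coincide.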
In terms of feature extraction procedures, VI is used in all the studies performed in this
thesis. In addition, two different masks were applied in a pre-processing step to reduce the
dimensionality of the problem. The first mask considers only the voxels inside the brain, so all
the PET image background is discarded. The second mask is more selective and takes into account
only the voxels inside specific brain regions, i.e., the Left Lateral Temporal Lobe, the Left
Dorsolateral Parietal, the Right Dorsolateral Parietal, the Superior Anterior Cingulate, and the
Posterior Cingulate and Precuneus. These regions showed a higher predictive power in the
detection of MCI-to-AD conversion, which indicates that they are the regions most affected during
the transition from MCI-NC to MCI-C.
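This masking step amounts to indexing the 3D volume with a binary mask and flattening the in-ROI voxels into a feature vector; the shapes below are illustrative, not the actual ADNI image dimensions:

```python
# ROI-mask feature extraction: a binary mask selects the voxels inside the
# chosen regions and flattens them into a 1D feature vector.
import numpy as np

rng = np.random.default_rng(0)
volume = rng.random((91, 109, 91))       # one synthetic PET volume
mask = np.zeros_like(volume, dtype=bool)
mask[30:60, 40:70, 30:60] = True         # stand-in for the ROI labels
features = volume[mask]                  # 1D vector of in-ROI voxels only
print("voxels kept:", features.size, "of", volume.size)
```

The same mask must be applied to every subject so that feature i always corresponds to the same voxel location across the cohort.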
The aim of this thesis was to compare different classification methods for detecting MCI-to-AD
conversion. In summary, the AdaBoost classifier trained on CN and AD patients and using only the
voxels that fall inside the ROIs shows a better discriminative power than classifiers trained on
the MCI set, which is consistent with the noisy nature of MCI data and demonstrates that a
classifier that can separate CN from AD subjects is also able to separate MCI-NC from MCI-C.
AdaBoost as well as AdaSVM also show a high sensitivity, i.e., the capacity to detect a positive
example, which is very important since misclassifying an AD patient can have severe consequences.
Naturally, some changes can be made with the aim of improving the classification performance. One
of them is to perform multiclass classification with the AdaBoost algorithm. This approach was
tested only with the SVM algorithm, where it performed better than the SVM model trained with MCI
information alone, so it would be interesting to compare the results achieved by a multiclass SVM
with those achieved by AdaBoost and verify whether, once again, AdaBoost outperforms SVM in the
classification task.
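A hedged sketch of such a multiclass setup, using a one-vs-rest linear SVM over four synthetic classes standing in for CN, MCI-NC, MCI-C and AD (data and parameters are illustrative):

```python
# Multiclass (one-vs-rest) linear SVM over four classes standing in for
# CN, MCI-NC, MCI-C and AD. Synthetic data; all values illustrative.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=100, n_informative=12,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
clf = LinearSVC(dual=False).fit(X, y)    # one-vs-rest by default
print("decision_function shape:", clf.decision_function(X).shape)
```

The same four-class label vector could be fed directly to AdaBoost (e.g. via the SAMME multiclass extension), which is the comparison suggested above.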
Still among the classification algorithm alternatives, a robust version of the AdaBoost algorithm
could also be tried. Since AdaBoost focuses mainly on the errors committed, if an example is
wrongly labelled, AdaBoost will give it more and more importance iteration after iteration and
the performance level becomes poor. With a robust version of AdaBoost, wrongly labelled examples
would be ignored and the classifier performance could improve, which constitutes a good strategy
when dealing with noisy data such as MCI data.
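This robustification idea can be illustrated with a toy AdaBoost loop whose weight update is capped, so persistently misclassified (possibly mislabelled) examples cannot dominate later rounds; this is one simple variant among several, and all values are illustrative:

```python
# Toy AdaBoost with a hard cap on example weights, limiting the influence of
# possibly mislabelled examples. The cap value and data are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def capped_adaboost(X, y, n_rounds=20, cap=0.05):
    """y in {-1, +1}; returns the stumps and their alpha coefficients."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)        # standard AdaBoost update
        w = np.minimum(w / w.sum(), cap)      # cap suspicious examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] > 0, 1, -1)
y[:5] *= -1                                   # inject some label noise
stumps, alphas = capped_adaboost(X, y)
score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
print("training accuracy:", float(np.mean(np.sign(score) == y)))
```

Without the cap, the five flipped labels would accumulate ever-larger weights; the cap keeps their influence bounded, which mimics the behaviour a robust AdaBoost variant would seek on noisy MCI labels.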
In terms of feature selection, methodologies other than PCC, MI and Boosting could be employed in
an attempt to improve the classification performance. Several studies have been done in the area
of feature selection procedures, such as those by Bicacro [70] and Morgado [71], but these works
are not focused on the detection of MCI-to-AD conversion, which constitutes a gap that should be
filled.
In terms of feature extraction procedures, there are also different approaches that can be
followed, some of them explored by the aforementioned authors, such as two- and three-dimensional
local binary patterns, local variance and 3D Haar-like features. All these strategies can now be
applied to the problem of distinguishing between MCI-NC and MCI-C with the intent of improving
the classification performance. On top of this, we could also try to find more discriminative
regions that would make the problem smaller in terms of the number of features.
All the ideas mentioned so far have the single goal of improving the early detection of AD. Other
strategies, such as different machine learning techniques, can also be explored to detect AD with
the intention of starting treatment as soon as possible to delay neuronal degeneration and,
ideally, in the future, in line with recent developments, to fully treat the disease and thus
avoid its progression.
In recent years, much work has been done with CAD systems, especially to detect the presence or
absence of AD. The problem of detecting MCI-to-AD conversion is harder because of the similarity
between the patterns involved, and needs to be explored more broadly. These 20 years of work are
just the beginning of a long but interesting journey.
Bibliography
[1] C. P. Ferri, R. Sousa, E. Albanese, W. S. Ribeiro, and M. Honyashiki, “World Alzheimer Report
2009,” 2009.
[2] S. G. Mueller, M. W. Weiner, L. J. Thal, R. C. Petersen, C. R. Jack, W. Jagust, J. Q. Tro-
janowski, A. W. Toga, and L. Beckett, “Ways toward an early diagnosis in Alzheimer’s disease:
The Alzheimer’s Disease Neuroimaging Initiative (ADNI),” Alzheimer´s and Dementia, pp. 55–66,
2005.
[3] Alzheimer’s Association, “Basics of Alzheimer Disease: What it is and what you can do,” 2012.
[4] A. Wimo and M. Prince, “World Alzheimer Report 2010,” September 2009.
[5] Alzheimer’s Society, “What is Alzheimer’s Disease,” 2013.
[6] D. Religa, K. Spangberg, A. Wimo, A.-K. Edlun, B. Winblad, and M. Eriksdotter Jonhagen, “De-
mentia Diagnosis Differs in Men and Women and Depends on Age and Dementia Severity: Data
from SveDem, the Swedish Dementia Quality Registry,” Dementia and Geriatric Cognitive Disor-
ders, vol. 33, pp. 90–95, 2012.
[7] W. Thies and L. Bleiler, “2013 Alzheimer’s Disease Facts and Figures,” Alzheimer’s and Demen-
tia, vol. 9, pp. 208–245, 2013.
[8] M. Prince, R. Bryce, and C. Ferri, “World Alzheimer Report 2011,” September 2011.
[9] A. Wimo, L. Jonsson, J. Bond, M. Prince, and B. Winblad, “The worldwide economic impact of
dementia 2010,” Alzheimer’s and Dementia, vol. 9, pp. 1–11, 2013.
[10] C. Misra, Y. Fan, and C. Davatzikos, “Baseline and longitudinal patterns of brain atrophy in
MCI patients, and their use in prediction of short-term conversion to AD: Results from ADNI,”
NeuroImage, vol. 44, pp. 1415–142, 2009.
[11] L. Mosconi, R. Mistur, R. Switalski, W. H. Tsui, L. Glodzik, Y. Li, E. Pirraglia, S. De Santi, B. Reis-
berg, T. Wisniewski, and M. J. de Leon, “FDG-PET changes in brain glucose metabolism from
normal cognition to pathologically verified Alzheimer’s Disease,” European Journal of Nuclear
Medicine and Molecular Imaging, vol. 36, pp. 811–822, January 2009.
[12] L. Mosconi, M. Brys, L. Glodzik Sobanska, S. De Santi, H. Rusinek, and M. J. de Leon, “Early
detection of Alzheimer’s Disease using neuroimaging,” Experimental Gerontology, vol. 42, pp.
129–138, July 2006.
[13] H. Braak and E. Braak, “Neuropathological stageing of Alzheimer-related changes,” Acta Neu-
ropathologia, vol. 82, pp. 239–259, June 1991.
[14] J. Jackson Siegal, “Our current understanding of the pathophysiology of Alzheimer’s Disease,”
August 2005.
B. Dubois, H. H. Feldman, C. Jacova, S. T. DeKosky, P. Barberger-Gateau, J. Cummings,
A. Delacourte, D. Galasko, S. Gauthier, G. Jicha, K. Meguro, J. O'Brien, F. Pasquier, P. Robert,
M. Rossor, S. Salloway, Y. Stern, P. J. Visser, and P. Scheltens, “Research criteria for the
diagnosis of Alzheimer's Disease: revising the NINCDS-ADRDA criteria,” The Lancet Neurology,
vol. 6, no. 8, pp. 734–746, 2007.
[16] Alzheimer’s Society, “The Mini Mental Stade Examination (MMSE),” 2012.
O. Querbes, F. Aubry, J. Pariente, J.-A. Lotterie, J.-F. Demonet, V. Duret, M. Puel, I. Berry,
J.-C. Fort, P. Celsis, and The Alzheimer's Disease Neuroimaging Initiative, “Early diagnosis of
Alzheimer's Disease using cortical thickness: impact of cognitive reserve,” Brain, vol. 132,
pp. 2036–2047, 2009.
[18] S. Duchesne, C. Bocti, K. De Sousa, G. B. Frisoni, H. Chertkow, and D. L. Collins, “Amnestic
MCI future clinical status prediction using baseline MRI features,” Neurobiology of aging, vol. 31,
no. 9, pp. 1606–1617, 2010.
[19] Alzheimer’s Society, “Drug treatments for Alzheimer’s Disease,” 2014.
[20] Medical Press. (2014) Compound reverses symptoms of Alzheimer's Disease in mice, research
shows. [Online]. Available: http://medicalxpress.com/news/
2014-05-compound-reverses-symptoms-alzheimer-disease.html
[21] Andrew Webb, Introduction to Biomedical Imaging. IEEE Press Series in Biomedical Engineer-
ing, 2003.
[22] Holger Grull, Nuclear Imaging and Radiochemistry, 2013.
[23] L. G. Apostolova and P. M. Thompson, “Mapping progressive brain structural changes in early
Alzheimer’s Disease and Mild Cognitive Impairment,” Neuropsychologia, vol. 46, no. 6, pp. 1597–
1612, 2008.
[24] E. Salmon, F. Lekeu, C. Bastin, G. Garraux, and F. Collette, “Functional imaging of cognition in
Alzheimer’s Disease using positron emission tomography,” Neuropsychologia, vol. 46, no. 6, pp.
1613–1623, 2008.
[25] S. F. Eskildsen, P. Coupe, D. García-Lorenzo, V. Fonov, J. C. Pruessner, and D. L. Collins,
“Prediction of Alzheimer’s Disease in subjects with Mild Cognitive Impairment from the ADNI
cohort using patterns of cortical thinning,” NeuroImage, vol. 65, pp. 511–521, 2013.
[26] Christopher M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.
[27] Y. Cui, P. S. Sachdev, D. M. Lipnicki, J. S. Jin, S. Luo, W. Zhu, N. A. Kochan, S. Reppermund,
T. Liu, J. N. Trollor et al., “Predicting the development of Mild Cognitive Impairment: A new use
of pattern recognition,” Neuroimage, vol. 60, no. 2, pp. 894–901, 2012.
[28] C. Davatzikos, P. Bhatt, L. M. Shaw, K. N. Batmanghelich, and J. Q. Trojanowski, “Prediction
of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification,” Neurobiology of
aging, vol. 32, no. 12, pp. 2322–e19, 2011.
[29] E. Westman, A. Simmons, J. Muehlboeck, P. Mecocci, B. Vellas, M. Tsolaki, I. Kłoszewska,
H. Soininen, M. W. Weiner, S. Lovestone et al., “AddNeuroMed and ADNI: similar patterns of
Alzheimer’s atrophy and automated MRI classification accuracy in Europe and North America,”
Neuroimage, vol. 58, no. 3, pp. 818–828, 2011.
[30] R. Cuingnet, E. Gerardin, J. Tessieras, G. Auzias, S. Lehericy, M.-O. Habert, M. Chupin, H. Be-
nali, and O. Colliot, “Automatic classification of patients with Alzheimer’s Disease from structural
MRI: a comparison of ten methods using the ADNI database,” Neuroimage, vol. 56, no. 2, pp.
766–781, 2011.
[31] R. Wolz, V. Julkunen, J. Koikkalainen, E. Niskanen, D. P. Zhang, D. Rueckert, H. Soininen,
J. Lotjonen, Alzheimer’s Disease Neuroimaging Initiative et al., “Multi-method analysis of MRI
images in early diagnostics of Alzheimer’s Disease,” PloS one, vol. 6, no. 10, p. e25446, 2011.
[32] P. Coupe, S. F. Eskildsen, J. V. Manjon, V. S. Fonov, J. C. Pruessner, M. Allard, and D. L. Collins,
“Scoring by nonlocal image patch estimator for early detection of Alzheimer’s Disease,” NeuroIm-
age: clinical, vol. 1, no. 1, pp. 141–152, 2012.
[33] Y. Cho, J.-K. Seong, Y. Jeong, and S. Y. Shin, “Individual subject classification for Alzheimer’s
Disease based on incremental learning using a spatial frequency representation of cortical thick-
ness data,” Neuroimage, vol. 59, no. 3, pp. 2217–2230, 2012.
[34] J. Young, M. Modat, M. J. Cardoso, A. Mendelson, D. Cash, and S. Ourselin, “Accurate multi-
modal probabilistic prediction of conversion to Alzheimer’s Disease in patients with Mild Cognitive
Impairment,” NeuroImage: clinical, vol. 2, pp. 735–745, 2013.
[35] D. Zhang, Y. Wang, L. Zhou, H. Yuan, and D. Shen, “Multimodal classification of Alzheimer’s
Disease and Mild Cognitive Impairment,” Neuroimage, vol. 55, no. 3, pp. 856–867, 2011.
[36] P. Vemuri, S. D. Weigand, D. S. Knopman, K. Kantarci, B. F. Boeve, R. C. Petersen, and C. R.
Jack Jr, “Time-to-event voxel-based techniques to assess regional atrophy associated with MCI
risk of progression to AD,” Neuroimage, vol. 54, no. 2, pp. 985–991, 2011.
[37] S. J. Teipel, C. Born, M. Ewers, A. L. Bokde, M. F. Reiser, H.-J. Moller, and H. Hampel, “Multivari-
ate deformation-based analysis of brain atrophy to predict Alzheimer’s disease in Mild Cognitive
Impairment,” Neuroimage, vol. 38, no. 1, pp. 13–24, 2007.
[38] C. Hinrichs, V. Singh, L. Mukherjee, G. Xu, M. K. Chung, and S. C. Johnson, “Spatially aug-
mented LPboosting for AD classification with evaluations on the ADNI dataset,” Neuroimage,
vol. 48, no. 1, pp. 138–149, 2009.
[39] C.-Y. Wee, P.-T. Yap, D. Zhang, K. Denny, J. N. Browndyke, G. G. Potter, K. A. Welsh Bohmer,
L. Wang, and D. Shen, “Identification of MCI individuals using structural and functional connec-
tivity networks,” Neuroimage, vol. 59, no. 3, pp. 2045–2056, 2012.
[40] G. Chetelat, B. Landeau, F. Eustache, F. Mezenge, F. Viader, V. de La Sayette, B. Desgranges,
and J.-C. Baron, “Using voxel-based morphometry to map the structural changes associated
with rapid conversion in MCI: a longitudinal MRI study,” Neuroimage, vol. 27, no. 4, pp. 934–946,
2005.
[41] M. Silveira and J. Marques, “Boosting Alzheimer’s Disease diagnosis using PET images,” in
Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 2556–
2559.
[42] C. Hinrichs, V. Singh, G. Xu, and S. C. Johnson, “Predictive markers for AD in a multi-modality
framework: an analysis of MCI progression in the ADNI population,” Neuroimage, vol. 55, no. 2,
pp. 574–589, 2011.
[43] ADNI. Alzheimer’s Disease Neuroimaging Initiative. [Online]. Available: http://adni-info.org/
Home.aspx
[44] PET Technical Procedures Manual. [Online]. Available: http://www.adni-info.org/Scientists/
ADNIStudyProcedures.aspx
[45] K. R. Gray, R. Wolz, R. A. Heckemann, P. Aljabar, A. Hammers, and D. Rueckert, “Multi-region
analysis of longitudinal FDG-PET for the classification of Alzheimer’s disease,” NeuroImage,
vol. 60, no. 1, pp. 221–229, 2012.
[46] ADNI. PET Pre-processing. [Online]. Available: http://adni.loni.usc.edu/methods/pet-analysis/
pre-processing/
[47] V. Vapnik, “Pattern recognition using generalized portrait method,” Automation and remote con-
trol, vol. 24, pp. 774–780, 1963.
[48] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–
297, 1995.
[49] R. Filipovych and C. Davatzikos, “Semi-supervised pattern classification of medical images: ap-
plication to mild cognitive impairment (MCI),” NeuroImage, vol. 55, no. 3, pp. 1109–1119, 2011.
[50] E. Mayoraz and E. Alpaydin, “Support vector machines for multi-class classification,” in Engineer-
ing Applications of Bio-Inspired Artificial Neural Networks. Springer, 1999, pp. 833–842.
[51] J. H. Morra, Z. Tu, L. G. Apostolova, A. E. Green, A. W. Toga, and P. M. Thompson, “Comparison
of Adaboost and support vector machines for detecting Alzheimer’s Disease through automated
hippocampal segmentation,” Medical Imaging, IEEE Transactions on, vol. 29, no. 1, pp. 30–43,
2010.
[52] A. Savio, M. García-Sebastián, M. Graña, and J. Villanua, “Results of an Adaboost approach
on Alzheimer’s Disease detection on MRI,” in Bioinspired Applications in Artificial and Natural
Computation. Springer, 2009, pp. 114–123.
[53] Robert E. Schapire, “Explaining Adaboost.”
[54] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer
vision, vol. 57, no. 2, pp. 137–154, 2004.
[55] X. Li, L. Wang, and E. Sung, “A study of Adaboost with SVM based weak learners,” in Neural
Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, vol. 1.
IEEE, 2005, pp. 196–201.
[56] L. Huang, Z. Pan, H. Lu et al., “Automated Diagnosis of Alzheimer’s Disease with Degenerate
SVM-Based Adaboost,” in Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2013
5th International Conference on, vol. 2. IEEE, 2013, pp. 298–301.
[57] C. Bergmeir, M. Costantini, and J. M. Benıtez, “On the usefulness of cross-validation for direc-
tional forecast evaluation,” Computational Statistics & Data Analysis, vol. 76, pp. 132–143, 2014.
[58] R. Kohavi et al., “A study of cross-validation and bootstrap for accuracy estimation and model
selection,” in IJCAI, vol. 14, no. 2, 1995, pp. 1137–1145.
[59] R. Jensen and Q. Shen, Computational intelligence and feature selection: rough and fuzzy ap-
proaches. John Wiley & Sons, 2008, vol. 8.
[60] P. Langley et al., Selection of relevant features in machine learning. Defense Technical Infor-
mation Center, 1994.
[61] N. Hoque, D. Bhattacharyya, and J. Kalita, “MIFS-ND: A mutual information-based feature selec-
tion method,” Expert Systems with Applications, vol. 41, no. 14, pp. 6371–6385, 2014.
[62] P. Sedgwick, “Pearson’s correlation coefficient,” BMJ: British Medical Journal, vol. 345, 2012.
[63] P. Ahlgren, B. Jarneving, and R. Rousseau, “Requirements for a cocitation similarity measure,
with special reference to Pearson’s correlation coefficient,” Journal of the American Society for
Information Science and Technology, vol. 54, no. 6, pp. 550–560, 2003.
[64] J. Wang, “Pearson Correlation Coefficient,” in Encyclopedia of Systems Biology. Springer, 2013,
pp. 1671–1671.
[65] R. F. Tate, “Correlation between a discrete and a continuous variable. Point-biserial correlation,”
The Annals of mathematical statistics, pp. 603–607, 1954.
[66] R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig, “The mutual information: detecting
and evaluating dependencies between variables,” Bioinformatics, vol. 18, no. suppl 2, pp. S231–
S240, 2002.
[67] G. Brown, A. Pocock, M.-J. Zhao, and M. Lujan, “Conditional likelihood maximisation: a uni-
fying framework for information theoretic feature selection,” The Journal of Machine Learning
Research, vol. 13, no. 1, pp. 27–66, 2012.
[68] G. Brown, “A new perspective for information theoretic feature selection,” in International Confer-
ence on Artificial Intelligence and Statistics, 2009, pp. 49–56.
[69] L. Florack, Mathematical Techniques for Image Analysis, 2008.
[70] Eduardo Bicacro, “Alzheimer's Disease Diagnosis using 3D Brain Images,” Master's thesis,
Instituto Superior Tecnico, Universidade de Lisboa, 2011.
[71] Pedro Morgado, “Automated Diagnosis of Alzheimer's Disease using PET Images,” Master's
thesis, Instituto Superior Tecnico, Universidade de Lisboa, 2012.
A ROIs
(a) (12) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal.
(b) (13) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal; Blue - Left Lateral Temporal; Green - Right Lateral Temporal.
(c) (14) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal; Blue - Left Lateral Temporal; Green - Right Lateral Temporal.
(d) (15) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal; Blue - Left Lateral Temporal; Green - Right Lateral Temporal.
(e) (16) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal; Blue - Left Lateral Temporal; Green - Right Lateral Temporal.
(f) (17) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(g) (18) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(h) (19) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(i) (20) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
Figure A.1: Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation.
(a) (21) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(b) (22) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(c) (23) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(d) (24) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(e) (25) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(f) (26) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(g) (27) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(h) (28) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(i) (29) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
Figure A.2: (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation.
(a) (30) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(b) (31) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(c) (32) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(d) (33) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(e) (34) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(f) (35) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Yellow - Inferior Anterior Cingulate; Brown - Posterior Cingulate and Precuneus.
(g) (36) Brown - Posterior Cingulate and Precuneus; Red - Superior Anterior Cingulate.
(h) (37) Brown - Posterior Cingulate and Precuneus; Red - Superior Anterior Cingulate; Yellow - Left Dorsolateral Parietal; Orange - Right Dorsolateral Parietal.
(i) (38) Brown - Posterior Cingulate and Precuneus; Red - Superior Anterior Cingulate; Yellow - Left Dorsolateral Parietal; Orange - Right Dorsolateral Parietal.
Figure A.3: (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation.
(a) (39) (b) (40) (c) (41)
(d) (42) (e) (43) (f) (44)
(g) (45) (h) (46) (i) (47)
Figure A.4: (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation. Brown - Posterior Cingulate and Precuneus; Red - Superior Anterior Cingulate; Yellow - Left Dorsolateral Parietal; Orange - Right Dorsolateral Parietal.
(a) (48) (b) (49) (c) (50)
Figure A.5: (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation. Brown - Superior Anterior Cingulate.
A-6