Detecting Conversion of Mild Cognitive Impairment to Alzheimer's Disease
A Comparison Between Classifiers
Maria Inês Barbosa Silva Crujo Uva
Thesis to obtain the Master of Science Degree in
Biomedical Engineering
Supervisor: Prof. Maria Margarida Campos da Silveira
Examination Committee
Chairperson: Prof. Raul Daniel Lavado Carneiro Martins
Supervisor: Prof. Maria Margarida Campos da Silveira
Members of the Committee: Dr. Durval Campos Costa
Prof. João Miguel Raposo Sanches
November 2014
Our human compassion binds us the one to the other - not in pity or patronizingly, but as human beingswho have learnt how to turn our common suffering into hope for the future.
Nelson Mandela
Acknowledgments
This work was supported by Fundação para a Ciência e Tecnologia through the Alzheimer's Disease: Image Analysis and Recognition (ADIAR) project (PTDC/SAU-ENB/114606/2009).
Data collection and sharing for this project was done by the Alzheimer’s Disease Neuroimaging
Initiative (ADNI).
I would like to thank all of those who contributed, directly or indirectly, to the success of this work:
• My advisor, Professor Margarida Silveira, for sharing her scientific knowledge and for supporting me throughout the project, always with very constructive suggestions.
• My classmates, with whom I spent great times during this journey at IST.
• My family, friends and João, for having been an unconditional support in all stages of my life.
To all of you, I send my sincere acknowledgments.
Abstract
The incidence of dementia worldwide is expected to double in the next 20 years. Alzheimer's Disease (AD) is the most common form of dementia affecting people over the age of 65. Early diagnosis of the disease is very important, because late detection is a barrier to improving the quality of life of those who suffer from it. The efforts toward early detection of AD gave rise to the concept of Mild Cognitive Impairment (MCI), which is used to characterize a person between the normal and AD stages. The diagnosis can be made by an expert physician using neuroimaging techniques such as Positron Emission Tomography (PET) and Magnetic Resonance Imaging (MRI). In recent years, Computer Aided Diagnosis (CAD) tools have been helping the detection of the disease, since they are not subject to subjective evaluation. In this study, three different classifiers were used on Fluorodeoxyglucose (FDG)-PET images to detect MCI to AD conversion: Support Vector Machine (SVM), AdaSVM and AdaBoost. To improve the predictive power, different feature selection and feature extraction methods were also studied. The best performance was obtained with AdaBoost when Control Normal (CN) and AD patients were used as the training set and the resulting classifier was tested on the MCI cohort, achieving an accuracy of 79%, a sensitivity of 76.9%, a specificity of 80.2% and a Balanced Accuracy (BA) of 78.6%. The area under the Receiver Operating Characteristic (ROC) curve was also computed, reaching 83.4%.
Keywords
Alzheimer’s Disease, Mild Cognitive Impairment, Positron Emission Tomography, Computer Aided
Diagnosis, Feature Selection, Feature Extraction
Resumo
In an era in which the number of people classified as having dementia is expected to double, and with Alzheimer's Disease (AD) being the most common manifestation of dementia affecting the population over 65, an early diagnosis becomes essential. The efforts made to characterize the disease in its earliest phase led to the appearance of the concept of Mild Cognitive Impairment (MCI), which is by definition an intermediate state between normal cognition (CN) and the manifestation of AD symptoms. This diagnosis can be made by a specialist using neuropsychological tests or medical images, such as those obtained by positron emission tomography (PET) or magnetic resonance imaging (MRI).
In recent years, technological development has enabled the appearance of several computer-aided diagnosis (CAD) tools, which have greatly improved diagnostic performance, since they do not require any kind of subjective evaluation.
In this dissertation, three distinct types of classifiers were studied, namely SVMs, AdaBoost and AdaSVM, with the goal of assessing which classifier best detects the conversion from MCI to AD using FDG-PET images. Different types of feature selection and extraction were also studied. The best performance was achieved by AdaBoost, which reached an accuracy of 79%, with a sensitivity of 76.9%, a specificity of 80.2% and a balanced accuracy of 78.6%. The ability to detect a new positive case with greater certainty was also evaluated, registering a performance of 83.4%.
Palavras Chave
Alzheimer's Disease, Mild Cognitive Impairment, Positron Emission Tomography, Computer Aided Diagnosis, Feature Selection, Feature Extraction
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Numbers and Facts of the Disease . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Pathology, Diagnosis and Treatment . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 PET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3.A The Importance of PET as a diagnostic technique for AD . . . . . . . . 7
1.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 State of the Art 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Materials and Methods 19
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 ADNI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2 Subjects Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.3 Imaging Acquisition and Processing . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 SVM - Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.A SVM - Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.B SVM - Mathematical Concepts . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1.C Multiclass SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2.A AdaBoost - Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2.B AdaBoost - Mathematical Concepts . . . . . . . . . . . . . . . . . . . . 28
3.2.3 AdaSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Voxel Intensity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1.A 1st Mask: Voxels inside the brain volume . . . . . . . . . . . . . . . . . 33
3.4.1.B 2nd Mask: Voxels inside specific brain regions . . . . . . . . . . . . . . 33
3.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.1 Pearson Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.3 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Experimental Results 39
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Model’s adjustable parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 SVMs Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.1 Training with CN and AD subjects . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.2 Training and testing with MCI data . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.3 Training with all classes and testing with MCI population . . . . . . . . . . . . . . 51
4.4 AdaBoost Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Training with CN and AD subjects and using only the voxels within the Regions
of Interest (ROIs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.2 Training and testing with MCI data and using only the voxels within the ROIs . . 57
4.5 AdaSVM Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.1 Training with CN and AD subjects and using only the voxels within the ROIs . . . 59
4.5.2 Training and testing with MCI data and using only the voxels within the ROIs . . 60
4.6 Summary of all the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Conclusions and Future Work 65
Bibliography 69
Appendix A ROIs A-1
List of Figures
1.1 Statistics of dementia. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Costs of dementia. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Changes in several causes of death. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Different stages of AD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 PET explanation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Metabolic activation in CN, MCI and AD patients. . . . . . . . . . . . . . . . . . . . . . . 7
1.7 A scheme of all strategies adopted during this dissertation. . . . . . . . . . . . . . . . . 9
3.1 Basic concepts of SVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 A schematic interpretation of a multiclass SVM algorithm. . . . . . . . . . . . . . . . . . 27
3.3 Basic concepts of AdaBoost. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Basic concepts of AdaSVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 A schematic view of the cross-validation process. . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Feature space transformation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 ROIs transformation mask. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 An example of the influence of different penalizations. . . . . . . . . . . . . . . . . . . . 41
4.2 ROC curves for a SVM classifier with CN and AD as a training set. . . . . . . . . . . . . 43
4.3 Sensitivity and specificity variation for a SVM classifier when CN and AD are used as
a training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 The outputs of SVM classifiers by using the voxels inside the entire brain volume. . . . . 45
4.5 The outputs of SVM classifiers by using the voxels which fall inside the ROIs. . . . . . . 46
4.6 Features selected by PCC to build the classifier when CN and AD subjects are used
for training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.7 Features selected by MI to build the classifier when CN and AD subjects are used for
training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.8 A mathematical interpretation of correlation. . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 ROC curves for a SVM classifier with MCI as a training set. . . . . . . . . . . . . . . . . 50
4.10 Sensitivity and specificity variation for a SVM classifier when MCI subjects are used as
a training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.11 ROC curves for a SVM classifier with all classes as a training set. . . . . . . . . . . . . . 52
4.12 A comparison between SVM classifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.13 ROC curves for AdaBoost and SVMs classifiers with CN and AD as a training set and
by using different feature selection criterion. . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.14 Features selected by Boosting when CN and AD subjects are used for training the model. 57
4.15 Subject’s weights in each iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.16 A comparison between different classifiers when CN and AD are used for training. . . . 60
4.17 A comparison between different classifiers when MCI subjects are used for training. . . 61
4.18 A comparison between the outputs of AdaSVM and AdaBoost classifiers when CN and
AD patients are used for training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.19 A comparison between the outputs of AdaSVM and AdaBoost classifier when MCI
subjects are used for training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.1 Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
A.2 (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.3 (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.4 (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.5 (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from Champalimaud Foundation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
List of Tables
2.1 Summary of the State of the Art. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Summary of the State of the Art (Continued). . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Dataset description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 SVM Results when CN and AD are used for training. . . . . . . . . . . . . . . . . . . . . 43
4.2 SVM Results when MCI subjects are used for training. . . . . . . . . . . . . . . . . . . . 49
4.3 SVM Results when all classes are used for training. . . . . . . . . . . . . . . . . . . . . 52
4.4 AdaBoost Results when CN and AD are used for training just with information from ROIs. 55
4.5 AdaBoost Results when MCI subjects are used for training just with information from
ROIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 AdaSVM Results when CN and AD are used for training just with information from ROIs. 59
4.7 AdaSVM Results when MCI subjects are used for training just with information from
ROIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8 A summary of all the results obtained during this thesis. . . . . . . . . . . . . . . . . . . 62
Acronyms
ACC Accuracy
AD Alzheimer’s Disease
ADNI Alzheimer’s Disease Neuroimaging Initiative
AUC Area Under the Curve
BA Balanced Accuracy
CAD Computer Aided Diagnosis
CDR Clinical Dementia Rating
CN Control Normal
CSF Cerebrospinal Fluid
CV Cross-Validation
FDA Food and Drug Administration
FDG Fluorodeoxyglucose
FWHM Full Width at Half Maximum
LDA Linear Discriminant Analysis
LNOCV Leave-N-Out Cross-Validation
LOOCV Leave-One-Out Cross-Validation
MCI Mild Cognitive Impairment
MCI-C MCI Converters
MCI-NC MCI Non-Converters
MI Mutual Information
ML Machine Learning
MMSE Mini Mental State Examination
MRI Magnetic Resonance Imaging
NBIB National Institute of Biomedical Imaging and Bioengineering
NFTs Neurofibrillary Tangles
NIA National Institute on Aging
NIH National Institutes of Health
NINCDS-ADRDA National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer's Disease and Related Disorders Association
PCC Pearson Correlation Coefficient
PET Positron Emission Tomography
RBF Radial Basis Function
ROC Receiver Operating Characteristic
ROIs Regions of Interest
SENS Sensitivity
SPEC Specificity
SVM Support Vector Machine
VEB Voxels in the Entire Brain volume
VI Voxel Intensity
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1 Motivation
Every seven seconds a new case of dementia appears in the world, and the prevalence of the disease keeps increasing [1]. Due to increasing life expectancy, the incidence of dementia is expected to double during the next 20 years, reaching an estimated 115.4 million people worldwide in 2050 [1,2]. On average these people live just four to eight years after diagnosis and, since up to now there is no cure, they die of the disease.
1.1.1 Numbers and Facts of the Disease
Alzheimer's Disease (AD) was first described by Alois Alzheimer, a German neurologist, in 1906. At that time the disease was considered rare, because average life expectancy was about 50 years and so very few people reached the critical age of 65, after which the likelihood of developing dementia almost doubles every five years [3,4]. Today, with increased life expectancy, AD constitutes the most common cause of dementia [3,5]. By dementia we mean a mental disorder that completely changes the lives of patients and their families, characterized by the loss of memory and other intellectual abilities severe enough to interfere with daily life activities, due to the death or malfunction of nerve cells [3,6,7].
As Figure 1.1 shows, much of the increase in incidence comes from people with dementia in low- and middle-income countries [1], where the majority of people do not receive a diagnosis and therefore have difficulty accessing correct treatment [8].
Figure 1.1: Trends of growth of people with dementia (in millions) [1].
Dementia is a leading cause of disability and need for care. It has associated direct costs, related to medical and social care, and indirect costs, due to unpaid caregiving [9]. The amount of money spent on dementia worldwide crossed the threshold of 604 billion dollars in 2010. The costs are enormous and inequitably distributed [9]. If dementia were a country, it would be the world's 18th largest economy (see Figure 1.2) [4].
Figure 1.2: Comparison between the costs of dementia in US and others countries economies [4].
According to the Alzheimer's Association, between 2000 and 2010 the number of deaths resulting from heart disease, stroke, prostate cancer and HIV decreased by 16%, 23%, 8% and 42% respectively, while the proportion of deaths related to AD increased by 68% (see Figure 1.3) [7]. Something has to be done to stop this epidemic, and there is no time to lose in managing the impact of the disease [8].
Figure 1.3: Changes in several causes of death [7].
Late diagnosis constitutes a barrier to improving the quality of life of those who suffer from the disease. An early diagnosis is essential because it enables the patient to gain time: at the initial phase of the disease, the patient is still capable of making his own future plans, as well as taking part in decisions related to his care [8]. The societal costs can also be anticipated and managed and, consequently, the burden of the disease can be reduced [9].
The efforts to characterize the early signs of AD have attracted a lot of attention in recent years and led to the appearance of the concept of Mild Cognitive Impairment (MCI) [2,10]. The term MCI is used to characterize a person who has problems with memory or with another thinking skill that do not interfere with daily life activities [3]. MCI is considered a transition stage between normal aging and AD. Several studies have been carried out with the goal of better understanding MCI, and today researchers know that there are different types of MCI, depending on the cognitive domains affected; the subtype most related to AD is amnestic MCI, i.e., where the impaired areas are those involving memory [2,3]. The annual rate of conversion from amnestic MCI to AD is around 12%, much higher than the rate observed in normal non-demented subjects [2]. It is important to identify individuals who are likely to convert, so that treatment can be initiated and mental decline delayed as much as possible.
Today we have sensitive markers and methodologies that allow preclinical detection of neurodegenerative diseases such as AD, in which the brain shows patterns of atrophy and a decrease in metabolic rate [11]. Neuroimaging techniques are one example of such methodologies. Structural Magnetic Resonance Imaging (MRI), which measures the degree of atrophy, and metabolic Positron Emission Tomography (PET), which, by using 18F-Fluorodeoxyglucose (FDG), a labeled analogue of glucose, can measure reductions in cerebral metabolic rate, are the most clinically used modalities [11,12]. Together with Computer Aided Diagnosis (CAD), neuroimaging has become a powerful tool for identifying an individual in the preclinical stage, i.e., before the patient has symptoms but once the pathologies that lead to dementia have started to occur [8].
1.1.2 Pathology, Diagnosis and Treatment
AD is an incurable physical disease that affects the human brain and is characterized by the accumulation of insoluble fibrous material in the central nervous system, which leads to cell death [13]. Up to now the cause of the disease is still unknown, but a set of factors such as age, genetic inheritance, environmental factors and lifestyle constitute the majority of the risk factors for its onset [5].
Although it is not fully understood, there are three consistent hallmarks of the disease: accumulation of senile plaques of β-amyloid peptides, accumulation of Neurofibrillary Tangles (NFTs) composed of τ-protein, and neuronal degeneration [1,14]. The disease is characterized by cognitive decline, notably short-term recall memory loss [12,14]. However, plaques and NFTs are not unique to AD: they are also involved in the normal process of aging as well as in other neurodegenerative disorders [8].
Since the hallmarks are not exclusive to AD, the diagnosis of the disease is difficult. It was long based on the criteria of the Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV-TR), and of the National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA), published in 1984. According to these criteria, the diagnosis should be made in two steps: first, the identification of a dementia syndrome and, second, the application of a protocol consistent with the clinical characteristics of the AD phenotype. Today the clinical phenotype of AD is no longer described in such exclusionary terms, to allow for the detection of the early stages of the disease [15]. Autopsy confirmation of the histopathological changes related to AD remains mandatory for a definite diagnosis [1].
To help in the task of dementia diagnosis, clinicians commonly use the Mini Mental State Examination (MMSE), a series of questions and tests covering different mental abilities, with a maximum score of 30 points. Typically, a normal person has an MMSE score above 27; nevertheless, a score below this value does not mean that a person has dementia, as there may be other reasons that justify a low score [16]. In fact, one of the most relevant pitfalls of the MMSE is the impact of cognitive reserve, i.e., the degree of education. This particularly affects subjects with a high education level, in whom the early signs of the disease can easily be hidden. In these cases, neuroimaging constitutes an important tool because it cannot be manipulated [17]. The severity of the symptoms of dementia is commonly measured on another numerical scale, known as the Clinical Dementia Rating (CDR), which ranges between 0 and 3 according to the strength of the symptoms.
AD is a progressive disease, which means it gets worse over time, but the speed of deterioration differs across subjects. The deposits of β-amyloid plaques vary both in form and in size. By contrast, the neurofibrillary changes show a characteristic pattern of distribution: they start in the medial temporal lobe and spread over the whole cortex [18]. This regular pattern allows the classification of the disease into six stages according to the cortical and subcortical neurofibrillary changes. As represented in Figure 1.4, the six stages can be grouped into: transentorhinal stages, corresponding to the silent period of the disease; limbic stages, characterized by the onset of symptoms; and isocortical stages, in which AD is fully developed [3,13].
Figure 1.4: Neurofibrillary changes along different stages of AD [13].
Up to now there is no cure for AD, and treatments focus on slowing down the progression of the disease and improving patient symptoms [19]. The U.S. Food and Drug Administration (FDA) has approved two different types of drugs. The first type, cholinesterase inhibitors, prevents the breakdown of acetylcholine, a chemical messenger involved in cognitive brain domains such as memory and learning. The second type, NMDA receptor antagonists, acts by blocking this receptor, which is involved in information processing, thereby slowing down the brain damage. Different drugs are appropriate for different stages of the disease [3,19]. Recent studies, carried out at Saint Louis University and published in the Journal of Alzheimer's Disease, demonstrated that an experimental antisense oligonucleotide drug can reverse the symptoms of AD in mice genetically engineered to model the disease [20]. This constitutes hope of a future cure for the millions of subjects who suffer from the disease.
1.1.3 PET
PET is a non-invasive nuclear medicine technique that produces an image of the spatial distribution of radiopharmaceuticals introduced into the body, i.e., it measures physiology and function rather than providing an anatomical map of the body [21].
The radiopharmaceuticals are designed to bind to the ligand of interest; they therefore have to be labeled analogues of a biologically active molecule, and their choice depends on what we want to see. The radiotracer undergoes radioactive decay and a positron is emitted. This positron, after a short distance, annihilates with an electron and two γ-rays are formed. These travel in opposite directions, at an angle of 180°, each with an energy of 511 keV, until they are detected. By reconstructing the annihilation lines, the original image can be obtained (see Figure 1.5) [21,22].
PET is a great tool for diagnosis and is largely used in areas such as oncology, cardiology and
neurology [21].
Figure 1.5: Formation and detection of γ-rays [22].
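The detection-and-reconstruction principle above can be illustrated with a deliberately simplified numeric sketch: plain unfiltered backprojection on a toy 2-D grid, with a made-up source position. Real PET reconstruction works from noisy coincidence counts and adds filtering, attenuation and scatter corrections, none of which are modeled here.

```python
import numpy as np

# Toy 2-D illustration of PET image formation: each detected annihilation
# defines a line of response (LOR); accumulating (backprojecting) many LORs
# through the same emitter recovers its position. Grid size and source
# location are invented for the example.
n = 64
source = (40, 24)                      # hypothetical emitter (row, col)

angles = np.deg2rad(np.arange(0, 180, 5))
image = np.zeros((n, n))
rows, cols = np.mgrid[:n, :n]

for theta in angles:
    # Signed distance of every pixel from the LOR through the source.
    d = (rows - source[0]) * np.cos(theta) + (cols - source[1]) * np.sin(theta)
    image += np.abs(d) < 0.5           # smear the LOR back across the grid

# The backprojected counts peak where all the lines intersect: the source.
peak = np.unravel_index(np.argmax(image), image.shape)
```

Every simulated line passes through the source, so the image accumulates one count per angle at that pixel and `peak` lands exactly on it.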
1.1.3.A The Importance of PET as a diagnostic technique for AD
In the preclinical detection of AD, developing preventive measures is a vital step to cover the gap between the onset of the disease and the beginning of symptoms [23].
For this kind of imaging a radiopharmaceutical called FDG is used. FDG works as an analogue of glucose: the chemistry of the two molecules is very similar, but FDG has a fluorine isotope instead of the normal hydroxyl group at the second carbon.
The human brain needs energy, which it obtains through glucose metabolism, and FDG initially follows the same pathway as glucose. In a first step, FDG is phosphorylated by the hexokinase enzyme and transformed into FDG-6-phosphate. After this phosphorylation step, FDG-6-phosphate is not able to continue the glucose cycle and remains trapped inside the cells. The change in the molecule's chemistry does not allow for efficient excretion, and so the radionuclide accumulates and shows up in the image [21].
A highly metabolically active tissue shows high FDG uptake, and a reduction in FDG uptake is indicative that something is wrong. The extent of the hypometabolism can predict the severity of the disease [11,22,24].
Since it is possible to quantify the metabolic reduction, this neuroimaging technique allows characterization of the disease stage, as represented in Figure 1.6, so that an appropriate treatment can be applied even before the onset of clinical signs.
Figure 1.6: Decrease in glucose consumption in patients with MCI and AD when compared with CN subjects.
1.2 Proposed Approach
In Machine Learning (ML), objects are represented by vectors in which each entry corresponds to one feature of the object. Typically each vector contains a huge number of features, which raises several problems. On the one hand, the task of creating these vectors is not straightforward, since there are different ways to do it and, most of the time, predicting which one is best for the problem at hand is quite difficult. On the other hand, in classification problems the number of features is much larger than the number of examples, which leads to a dimensionality problem. To overcome this issue, the number of features has to be reduced. The number of selected features influences the performance of the classifier and, once again, there are many different possibilities for choosing the best feature set. On top of this, to build the classifier that best fits the classification problem, there are also many different ML algorithms, which constitutes one more variable to take into account.
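As a toy illustration of this representation problem (all dimensions here are invented and far smaller than real FDG-PET volumes), flattening each subject's scan into a row of voxel features immediately yields many more features than examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a few dozen subjects, each a small synthetic
# 3-D "volume" standing in for an FDG-PET scan.
n_subjects = 40
volume_shape = (16, 16, 16)
volumes = rng.random((n_subjects, *volume_shape))

# Design matrix: one row per subject, one column per voxel feature.
X = volumes.reshape(n_subjects, -1)

print(X.shape)  # (40, 4096): 4096 features for only 40 examples
```

Even this tiny volume produces over a hundred times more features than subjects, which is why the feature extraction and selection steps below are needed.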
Our aim in this thesis is to study different methodologies for CAD of AD using FDG-PET images from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
The main goal of this study is a comparison between three different classification procedures used on FDG-PET images to detect MCI to AD conversion. The three classifiers were Support Vector Machine (SVM), AdaSVM and AdaBoost and, within each of the algorithms, different strategies were investigated.
The first problem to be addressed was the feature extraction procedure. Herein, Voxel Inten-
sity (VI), which is directly proportional to the severity of the disease, is used, but to reduce the di-
mensionality of the problem two different pre-processing steps were also computed. In the first one,
a binary mask was applied to keep just the voxels that fell inside the brain volume, so all
the background of the FDG-PET images was discarded. Although reduced, the dimensionality of
the problem remains huge, and therefore this approach was only used when the classification was
performed by the SVM algorithm. In the second approach a binary mask was again used, but now
to keep just the voxels within certain Regions of Interest (ROIs). With this last strategy, not only
is the dimensionality of the problem reduced but the discriminative power of those regions can also
be evaluated.
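The masking step described above amounts to indexing the image array with a boolean mask. A minimal sketch, assuming the FDG-PET image and the mask are already loaded as NumPy arrays of the same shape (the toy data below is illustrative, not ADNI data):

```python
import numpy as np

def extract_voxel_features(image, mask):
    """Return a 1-D feature vector with one entry per voxel inside the mask."""
    return image[mask]

# Toy 4x4x4 "image" with an 8-voxel cubic "ROI" mask
image = np.arange(64, dtype=float).reshape(4, 4, 4)
mask = np.zeros((4, 4, 4), dtype=bool)
mask[:2, :2, :2] = True              # voxels kept by the mask

features = extract_voxel_features(image, mask)
print(features.shape)  # → (8,) : dimensionality drops from 64 to 8
```

The same function serves both pre-processing steps: only the mask changes, from the whole brain volume to the selected ROIs.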
Three different feature selection methods were tested. The first two were the Pearson Cor-
relation Coefficient (PCC) and Mutual Information (MI), two different ways of selecting features ac-
cording to a ranking criterion. These two methods were used to choose the features that built
the SVM model. The third approach, known as boosting, which is the embedded feature se-
lection procedure of the AdaBoost algorithm, was used as the feature selection criterion for AdaBoost and
for AdaSVM. All the strategies explained so far are outlined in Figure 1.7.
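The two ranking criteria can be sketched with NumPy and scikit-learn; the synthetic data, the feature count and the cutoff k below are illustrative assumptions, not the values used in this thesis:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))       # 60 subjects, 100 voxel features
y = rng.integers(0, 2, size=60)      # 0 = non-converter, 1 = converter
X[:, 0] += 2.0 * y                   # make feature 0 informative

# Rank by the absolute Pearson correlation of each feature with the label
pcc = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Rank by the estimated mutual information between each feature and the label
mi = mutual_info_classif(X, y, random_state=0)

k = 10                               # keep the k best-ranked features
top_pcc = np.argsort(pcc)[::-1][:k]
top_mi = np.argsort(mi)[::-1][:k]
```

Both criteria produce a score per feature; the classifier is then trained only on the k top-ranked columns of X.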
To conclude, three different tests were performed when the classification task was carried out by
the SVM and AdaSVM methods. These were included in this study with the intent of understanding
the importance of applying different penalties for misclassification when the data is imbalanced, i.e.,
when the numbers of negative and positive examples differ.
All these strategies will be explained in detail throughout this thesis, and the results will tell which is the
best strategy to adopt for early detection of MCI to AD conversion.
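One common way of applying different misclassification penalties is per-class weighting of the SVM cost parameter, sketched here with scikit-learn's SVC; the synthetic imbalanced data is a placeholder, and this is not presented as the exact weighting scheme evaluated later in the thesis:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Imbalanced synthetic data: 90 "non-converters" vs. 30 "converters"
X = np.vstack([rng.normal(0.0, 1.0, (90, 5)), rng.normal(1.0, 1.0, (30, 5))])
y = np.array([-1] * 90 + [1] * 30)

# class_weight rescales the misclassification penalty C per class;
# 'balanced' weights classes inversely to their frequency, so errors on
# the minority (converter) class cost more.
clf = SVC(kernel="linear", C=1.0, class_weight="balanced").fit(X, y)
print(clf.score(X, y))
```

Without such weighting, a classifier can reach a high plain accuracy simply by favouring the majority class, which is why balanced measures matter here.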
Figure 1.7: A scheme of all the different approaches included in this thesis. Inside the orange rectangle are represented the two masks used during the feature extraction procedure: Mask 1 selects the voxels inside the entire brain volume and Mask 2 selects just those that fell inside the ROIs. In the green rectangle are represented the three distinct feature selection procedures and, inside the blue one, the three classification algorithms already mentioned. The different tests performed to understand the best way of tuning the models' parameters are represented by the black circles.
1.3 Original Contributions
During the last years, several works have addressed MCI to AD prediction but, as a careful
reading of the next chapter shows, the work developed so far has, to our knowledge, mainly
focused on MRI images and on SVM and LDA classifiers.
This work brings some innovative techniques to MCI to AD conversion detection, since it introduces
the AdaBoost classifier to this problem and, at the same time, uses only PET images for the diag-
nosis. Moreover, it also tests more efficient feature extraction techniques, reducing the dimension
of the classification problem by considering only the regions capable of differentiating between
MCI converters and MCI non-converters. Feature selection is not forgotten or left to chance, and
different methods are tested with special emphasis on boosting which, together with the SVM
classifier, originates a new classification technique known as AdaSVM. Still regarding SVMs,
different ways of tuning the SVM parameters were also explored to guarantee the best possible
results. Here we introduced a method to choose the C parameter based on the Balanced Accuracy
of the model, which constitutes a novelty with respect to the state of the art known so far.
The data used in this thesis were obtained from the ADNI database, which makes this study
comparable to the ones published and mentioned in Chapter 2. To finish, it is important to note that
when the same group has to be divided into training and testing sets, the use of cross-validation
ensures that the model is not prone to overestimating the results.
1.4 Thesis Outline
The remainder of this thesis is organized in four chapters. Chapter 2 contains the State of the Art.
In Chapter 3, all the materials and methods used in this dissertation are described. The chapter
begins with a detailed description of the population, and then all ML methods as well as all feature
extraction and selection techniques are explained. Chapter 4 contains all the results obtained in
this thesis and is divided into different subsections to account for the differences between classifiers,
i.e., SVM, AdaBoost and AdaSVM. Still in Chapter 4, a complete discussion of the results and a
comparison between the classifiers are presented. Finally, Chapter 5 encloses some conclusions and
the future work that can still be done in this area.
2 State of the Art

Contents
2.1 Introduction
2.2 Previous Work
2.3 Summary
2.1 Introduction
To deal with the problematic of early diagnosis of AD several imaging processing techniques are
being used on imaging modalities such as PET and MRI, as already was addressed in section 1.1.1,
but these methodologies are not able to accurately per se predict the conversion from MCI to AD [25].
In recent years the problem of searching for patterns in data has received a lot of attention and
some machine learning techniques have been largely exploited [26].
The result of a machine learning algorithm can be seen as y(x), where x is a high-dimensional
input feature vector and y constitutes the output of the machine. This function is learnt during the
training phase, where a model is built, and its capacity for generalization is assessed during the test
phase, in which new examples are used to avoid bias. Different problems can be solved using
machine learning algorithms: supervised learning problems, where the correct label of the
test set is known; unsupervised learning problems, in which the label is unknown and the models
try to find groups of similar examples called clusters; and reinforcement learning, which deals
with the problem of finding suitable actions to take in order to maximize a reward [26].
Within the problem of early diagnosis of AD, pattern recognition and machine learning tech-
niques are very useful because they allow an effective characterization of group differences as well as
a better identification of individuals at risk of cognitive decline from high-dimensional input data [27].
The high-dimensional feature vector makes the model prone to overfitting, a problem
also known as the curse of dimensionality, in which the number of features is far greater than
the number of training examples. There are some solutions to overcome this issue. One of them
is the definition of ROIs to reduce the dimensionality of the problem, an approach followed by many re-
searchers, as we will see in the next section, but others continue to prefer to use the
whole brain structure, since prior knowledge of the affected regions is then not a requirement [25,28].
2.2 Previous Work
In recent years the problem of predicting the conversion from MCI to AD has received a lot of
attention and thus many studies have been published. To ascertain the most recent developments a
brief chronological review of the most important contributions will be presented.
In 2009, Misra et al. [10] tried to find predictors of short-term conversion from MCI to AD by
measuring the spatial distribution of brain atrophy and its longitudinal changes in MCI Converters
(MCI-C) and MCI Non-Converters (MCI-NC). The group comprised 27 MCI-C and 76 MCI-NC
patients from the ADNI cohort (described later, in Section 3.1.1), in which the classification was
based on CDR changes. Voxel-Based Morphometry (VBM) was used to analyse the baseline Magnetic
Resonance images with the goal of identifying a minimal set of brain regions whose
volumes best discriminate between the two groups. Two different types of classifiers were used. In
the first attempt, a predictive model was built based on Control Normal (CN) and AD information and
then applied to MCI patients. In the second one, the model was built and tested on MCI data in
a leave-one-out cross-validation scheme. During this study different kernels as well as different numbers of
features were tested, and the maximum Accuracy (ACC) achieved was 81,5%. The Receiver Operating
Characteristic (ROC) curve, which represents the trade-off between Specificity (SPEC) and
Sensitivity (SENS), was computed, and the Area Under the Curve (AUC) was 77%. These constitute
the best results published so far; however, the small dataset makes the comparison with other studies a
difficult task [25].
In the same year, Querbes et al. [17] published another study on the same topic. Here, 382
ADNI patients were used, of whom 130 were CN, 50 MCI-NC, 72 MCI-C and 130 AD
subjects. For each patient, cortical thickness was computed from the baseline Magnetic Resonance (MR)
images. In this study the brain was divided into different zones which, combined with age in a Linear
Discriminant Analysis (LDA) and using an automatic procedure, yielded the optimal set of ROIs,
i.e., the ones most discriminative for the proposed task. Those zones were used to compute
the normalized thickness index that is used for prediction. Although this study reported an ACC of
73% with a SENS of 75% and a SPEC of 69%, the results are biased because the same test
set was used both to determine the best zones and to test the predictive power of the model, making the
results most likely overestimated [25].
Years later, in 2011, Davatzikos et al. [28] also contributed to this problem with a dataset compris-
ing 54 AD, 63 CN, 69 MCI-C and 170 MCI-NC patients from ADNI. They used a VBM analysis to
predict the conversion and two different approaches for classification, in which they trained the classi-
fiers with CN and AD information and tested on the MCI cohort. With the first classifier, which used just VBM
information, an ACC of 55,8% was achieved, with a SENS of 94,7%, a SPEC of 37,8% and an AUC of
73,4%. When information from the τ -protein, a Cerebrospinal Fluid (CSF) biomarker, was included in
the SVM classifier, the ACC increased to 61,7% with a SENS of 84,2% and a SPEC of 51,2%,
but the AUC decreased slightly to 67,7%. The low discriminative performance between MCI-NC and
MCI-C was attributed to the complexity and variance of the patterns of brain atrophy in MCI patients,
which are far more difficult to describe than those of CN or AD patients.
Also in 2011, Westman et al. proposed a combined method to detect MCI to AD conversion. This
method joined the ADNI database with AddNeuroMed, a European programme, and comprised
a total of 1067 patients, with 295 AD, 84 MCI-C, 353 MCI-NC and 335 CN subjects. To
predict the conversion, each subject's MRI image was analysed to obtain cortical
thickness and volumetric change measurements. The classifier was trained with CN and AD
information and, at the end, 71% of the MCI-C subjects were correctly labelled as AD-like and 60% of
the MCI-NC patients as CN-like, which corresponds to an ACC of 76% [29].
Cuingnet et al. published, also in 2011, a study in which ten different approaches to predict the
conversion from MCI to AD were used to build an SVM classifier. These ten approaches can be
divided into three groups: five were voxel-based methods, three were based on cortical
thickness and two on the hippocampus, one focused on hippocampal volume and the other
on hippocampal shape. The population comprised 210 subjects from the ADNI database and was split
into a training group with 67 MCI-NC and 39 MCI-C and a testing group with 67 MCI-NC and
37 MCI-C. Of those ten methods, just four proved to be slightly more accurate than chance. The
best results, achieved when the whole brain was used in a voxel approach, correspond to a SENS of
57%, a SPEC of 78% and an ACC of 71%. When just the cortex was considered, a SENS of 32%,
a SPEC of 91% and an ACC of 70% were achieved. Finally, when just the hippocampal volume was
taken into account, the SVM obtained a SENS of 62%, a SPEC of 69% and an ACC of 67% [30].
Wolz et al. tried to improve the ACC of MCI to AD prediction by combining features from several
structural MRI analysis techniques; to do so, they used the baseline MRI images of 405 sub-
jects from the ADNI dataset and an SVM classifier. The features used for classification were:
hippocampal volume, cortical thickness, tensor-based morphometry and a new method based
on manifold learning, in which Laplacian eigenmaps are computed to estimate a low-dimensional
representation of the images based on pairwise image similarities. Because the number of
features in the cortical thickness and tensor-based morphometry approaches is enormous, only the features
that fell inside certain predefined ROIs were evaluated but, since this selection was
done using subjects from the test population, the reported results may be slightly overestimated [25].
The best results were achieved when just the manifold-learning features were used, with a
SENS of 77%, a SPEC of 48% and an ACC of 65%. Moreover, they also performed the same test
with the population used by Cuingnet et al. (see [30]) and an LDA classifier. The best results
were achieved when all feature extraction methods were combined, with a SENS of 67%, a SPEC of
69% and an ACC of 68%. These results were better than the ones obtained by the SVM classifier and
also had a higher SENS than those published by Cuingnet et al. [31].
In 2012, Eskildsen et al. [25] published a study in which not only the progression from MCI to AD
was estimated but also the time taken by subjects to convert. To do that, the MRI scans acquired 6 months
(122 subjects), 12 months (128 subjects), 24 months (61 subjects) and 36 months (29 subjects) prior
to AD diagnosis were collected for the MCI-C group. Each group of MCI-C was compared with the MCI-NC
group (134 subjects) based on VBM. The most discriminative features, i.e., those with the largest
atrophy pattern, were chosen and the ROIs were determined. Within these ROIs, features
were chosen during LDA classification and, by combining the four classifiers, they obtained an ACC of
73,5% with a SENS of 63,8% and a SPEC of 84,3%. When comparing MCI-C36 with MCI-NC they
registered an ACC of 69,9%, a SENS of 55,2%, a SPEC of 73,1% and an AUC of 63,5%; for the MCI-C24
vs. MCI-NC classifier, an ACC of 66,7%, a SENS of 59%, a SPEC of 70,2% and an AUC of 67,3%
were obtained; for the MCI-C12 vs. MCI-NC classifier, an ACC of 72,9%, a SENS of 75,8%, a SPEC of
70,2% and an AUC of 76,2% were achieved; finally, for the MCI-C6 vs. MCI-NC classifier they obtained
an ACC of 75,8%, with a SENS of 75,4%, a SPEC of 76,1% and an AUC of 80,9%. Later in
the study they combined the VBM information with age, and the following results were achieved: for the
MCI-C36 vs. MCI-NC classifier, an ACC of 72,4%, a SENS of 48,3%, a SPEC of 77,6% and an AUC
of 63,7%; for the MCI-C24 vs. MCI-NC classifier, an ACC of 67,2%, a SENS of 55,7%, a SPEC of 72,4%
and an AUC of 70,7%; for the MCI-C12 vs. MCI-NC classifier, an ACC of 70,6% with a SENS of 72,7%,
a SPEC of 68,7% and an AUC of 76,3%; for the MCI-C6 vs. MCI-NC classifier they obtained an ACC of
74,6% with a SENS of 72,1%, a SPEC of 76,9% and an AUC of 81,1%. Several conclusions can be
drawn from this study: a longer time before diagnosis decreases both the SENS and the AUC and,
by including age in the LDA, the AUC, which is the most accurate measurement for imbalanced data, is
slightly better.
Still in 2012, Coupe et al. [32] published one more study on the detection of AD in the pre-clinical
stages of the disease, in which the ability to predict conversion is a challenging problem because the
brain pattern changes are subtler. Here, a new feature extraction method was exploited, Scoring
by Nonlocal Image Patch Estimator (SNIPE), in which the nonlocal similarity of the subject to
the whole training dataset is computed, which reduces problems related to intersubject variability since
a one-to-many mapping is allowed. For this purpose, the whole ADNI baseline dataset was used:
231 CN, 238 MCI-NC, 167 MCI-C and 198 AD, in which the Hippocampus (HC) and the Entorhinal
Cortex (EC) were selected and graded using an LDA classifier. During the classification phase,
different ways of performing Cross-Validation (CV) were also studied, namely Leave-One-Out
Cross-Validation (LOOCV), repeated Leave-N-Out Cross-Validation (LNOCV) and stratified k-fold.
For the grading method with the LOOCV procedure, the MCI-NC vs. MCI-C classifier
achieved an ACC of 71% with a SENS of 70% and a SPEC of 71%. For the LNOCV procedure with 100
repetitions, an ACC of 73% with a SENS of 72% and a SPEC of 74% was obtained. Finally,
the k-fold CV procedure scored an ACC of 73% with a SENS of 68% and a SPEC of 76%. The AUC
was not specified for any of the tests.
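For illustration, the CV schemes mentioned above differ only in how the folds are drawn. A minimal scikit-learn sketch with a synthetic imbalanced dataset (the data and the LDA model are placeholders, not the SNIPE features used in the study):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
y = np.array([0] * 25 + [1] * 15)    # imbalanced labels, e.g. MCI-NC vs. MCI-C
X[y == 1] += 1.5                     # separate the classes a little

lda = LinearDiscriminantAnalysis()

# LOOCV: one subject per test fold; stratified k-fold keeps class proportions
loo_acc = cross_val_score(lda, X, y, cv=LeaveOneOut()).mean()
kfold_acc = cross_val_score(lda, X, y, cv=StratifiedKFold(n_splits=5)).mean()
print(loo_acc, kfold_acc)
```

Stratification matters with imbalanced groups, since an unstratified fold may contain very few, or no, converters.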
During the same year, Cho et al. also evaluated the predictive power of an LDA classifier in distin-
guishing between MCI converters and MCI non-converters. To do that, they used an ADNI population
with 131 MCI-NC and 72 MCI-C subjects, which was later split into training and testing sets, and
extracted cortical thickness data from the MR volumes, which were then filtered to remove high-
frequency noise. In the end, this study achieved a SENS of 63% with a SPEC of 76%
and an ACC of 71% [33].
In 2013, Young et al. [34] used a Gaussian Process (GP) with multimodal data from MRI, FDG-PET
and genetic biomarkers to differentiate between MCI-NC and MCI-C. The classification was done with 73
CN, 96 MCI-NC, 47 MCI-C and 63 AD patients from the ADNI dataset. The GP model was
trained with CN and AD data and then applied to the MCI data. The output probabilities are dichotomised
to produce a binary classification, i.e., whether the patient will convert or not. For the multimodal
kernel construction two main approaches were taken: Grid Search (GS), in which the weights of each
modality are chosen from a set of previously known values, or Maximum Likelihood (ML), in which
the kernel parameters are learnt from the training data, so there is no need to resort to a GS with
CV. The results for the GP method using only the information extracted from MRI were: an ACC of 64,3%
with a SENS of 53,2%, a SPEC of 69,8%, a Balanced Accuracy (BA) of 61,5% and an AUC of 64,3%.
Using only the information from PET, an ACC of 65,0% was reached, with a SENS of
66,0%, a SPEC of 64,6%, a BA of 65,7% and an AUC of 76,7%. Making use of the multimodal kernel
and tuning the kernel parameters by means of ML, the classifier achieved an ACC of 69,9%, with a
SENS of 78,7%, a SPEC of 65,6%, a BA of 74,1% and an AUC of 79,5%. Finally, the GS
approach registered an ACC of 67,1% with a SENS of 76,6%, a SPEC of 62,5%, a BA of 70,6% and
an AUC of 75,1%. All these values are summarized in Table 2.1.
As can be seen, in recent years many authors have driven their efforts towards improving MCI to AD conversion
prediction, and these studies were mainly focused on SVM and LDA classifiers. Nevertheless,
other publications had a huge impact on the development of this study, such as [35–40]. The one
published by Silveira et al. in 2010 [41] was also very important, since a comparison
between SVM, AdaBoost and AdaSVM was performed for three different classification tasks, i.e., distinguishing
between CN and AD, between CN and MCI, and between MCI and AD. Although the specific sep-
aration between MCI-NC and MCI-C was not taken into account, the results obtained for AdaBoost
proved so promising that we were encouraged to apply the same principles to the problem at
hand. The results for CN/MCI, in terms of ACC, are shown in Table 2.2 for informational purposes.
2.3 Summary
Table 2.1: A summary of all the results presented in this section in terms of Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC) and Area Under the ROC Curve (AUC). All results in %; "-" means not reported.

Article                        | Participants                           | Biomarker(s)        | Method                     | ACC  | SENS | SPEC | AUC
Misra et al. (2009) [10]       | 76 MCI-NC, 27 MCI-C                    | MRI                 | VBM                        | 81,5 | -    | -    | 77
Querbes et al. (2009) [17]     | 130 CN, 50 MCI-NC, 72 MCI-C, 130 AD    | MRI                 | ROIs                       | 73   | 75   | 69   | -
Davatzikos et al. (2011) [28]  | 63 CN, 170 MCI-NC, 69 MCI-C, 54 AD     | MRI                 | VBM                        | 55,8 | 94,5 | 37,8 | 73,4
                               |                                        | MRI + CSF τ-protein | VBM                        | 61,7 | 84,2 | 51,2 | 67,7
Westman et al. (2011) [29]     | 295 AD, 353 MCI-NC, 84 MCI-C, 335 CN   | MRI                 | Cortical Thickness, Volume | 76   | 71   | 60   | -
Cuingnet et al. (2011) [30]    | 134 MCI-NC, 76 MCI-C                   | MRI                 | Hippocampus                | 67   | 62   | 69   | -
                               |                                        |                     | Voxel Approach             | 71   | 57   | 78   | -
                               |                                        |                     | Cortical Thickness         | 70   | 32   | 91   | -
Wolz et al. (2011) [31]        | 238 MCI-NC, 167 MCI-C                  | MRI                 | Manifold Learning (SVM)    | 65   | 77   | 48   | -
                               | 134 MCI-NC, 78 MCI-C                   |                     | Combination (LDA)          | 68   | 67   | 69   | -
Table 2.2: (Continued) A summary of all the results presented in this section in terms of Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC) and Area Under the ROC Curve (AUC). All results in %; "-" means not reported.

Article                      | Participants                          | Biomarker(s)              | Method             | ACC   | SENS | SPEC | AUC
Eskildsen et al. (2012) [25] | 29 MCI-C36 vs. 134 MCI-NC             | MRI                       | VBM                | 69,9  | 55,2 | 73,1 | 63,5
                             |                                       | MRI & Age                 | VBM                | 72,4  | 48,3 | 77,6 | 63,7
                             | 61 MCI-C24 vs. 134 MCI-NC             | MRI                       | VBM                | 66,7  | 59,0 | 70,2 | 67,3
                             |                                       | MRI & Age                 | VBM                | 67,2  | 55,7 | 72,4 | 70,7
                             | 128 MCI-C12 vs. 134 MCI-NC            | MRI                       | VBM                | 72,9  | 75,8 | 70,2 | 76,2
                             |                                       | MRI & Age                 | VBM                | 70,6  | 72,7 | 68,7 | 76,3
                             | 122 MCI-C6 vs. 134 MCI-NC             | MRI                       | VBM                | 75,8  | 75,4 | 76,1 | 80,9
                             |                                       | MRI & Age                 | VBM                | 74,6  | 72,1 | 76,9 | 81,1
Coupe et al. (2012) [32]     | 231 CN, 198 AD, 238 MCI-NC, 167 MCI-C | SNIPE                     | ROIs, LOOCV        | 71    | 70   | 71   | -
                             |                                       |                           | ROIs, LNOCV        | 73    | 72   | 74   | -
                             |                                       |                           | ROIs, k-fold CV    | 73    | 68   | 76   | -
Cho et al. (2012) [33]       | 131 MCI-NC, 72 MCI-C                  | MRI                       | Cortical Thickness | 71    | 63   | 76   | -
Young et al. (2013) [34]     | 73 CN, 96 MCI-NC, 47 MCI-C, 63 AD     | MRI                       | GP                 | 64,3  | 53,2 | 69,8 | 64,3
                             |                                       | PET                       | GP                 | 65,0  | 66,0 | 64,6 | 76,7
                             |                                       | MRI + PET + Genetics (GS) | GP                 | 67,1  | 76,6 | 62,5 | 75,1
                             |                                       | MRI + PET + Genetics (ML) | GP                 | 69,9  | 78,7 | 65,6 | 79,5
Silveira et al. [41]         | 113 MCI, 81 CN                        | PET                       | AdaBoost           | 79,63 | -    | -    | -
                             |                                       |                           | SVM                | 74,07 | -    | -    | -
                             |                                       |                           | AdaSVM             | 71,52 | -    | -    | -
3 Materials and Methods

Contents
3.1 Data
3.2 Machine Learning Algorithms
3.3 Cross-Validation
3.4 Feature Extraction
3.5 Feature Selection
In this chapter, the characteristics of the participants in this study, as well as the classifica-
tion techniques, will be presented. The feature selection and feature extraction methods will also be
addressed.
3.1 Data
3.1.1 ADNI
ADNI is a public-private partnership launched in 2003 by the National Institute on Aging (NIA), the
National Institute of Biomedical Imaging and Bioengineering (NIBIB), the FDA, private pharmaceuti-
cal companies, such as AstraZeneca, Novartis and Merck, and non-profit foundations, such as the Alzheimer's
Association, in conjunction with the National Institutes of Health (NIH) Foundation. Several cores con-
stitute ADNI: a clinical coordination centre, two neuroimaging cores, a biomarker core, an informatics
core and a biostatistics core. Its policy follows three main goals [2,42,43]:
1. Identification of possible neuroimaging biomarkers that allow early detection of AD;
2. Application of prevention as well as treatment techniques during the early stage of the disease;
3. Creation of a database of imaging and clinical data.
ADNI began in October 2004, and its first goal was to recruit 200 CN individuals and 400
subjects suffering from MCI, to be followed for a period of three years, and 200 patients with AD, to
be followed for two years [42]. In 2011 the study included over 1000 patients and the results had
exceeded expectations [43].
In the next subsections, the constitution of the ADNI database, as well as
how images can be processed to be comparable, is addressed.
3.1.2 Subjects Characterization
For the present study, individuals from the ADNI [43] cohort who had baseline PET images available
were selected. In an initial phase 404 individuals were chosen, but this number was reduced to 286
because 118 subjects did not have all neuropsychological test measurements available. The 286 individuals
were divided into four groups according to their neuropsychological test scores: 78 CN, 91 MCI-NC,
52 MCI-C and 65 AD. Table 3.1 summarizes the main characteristics of each group.
Table 3.1: Description of the groups present in this study. Values are presented in Mean ± Standard Deviation format.

Group                 | CN          | MCI-NC      | MCI-C       | AD
Number of patients    | 78          | 91          | 52          | 65
Age [years]           | 75,9 ± 4,9  | 75,5 ± 7,3  | 74,7 ± 6,9  | 76,0 ± 6,7
Gender (% of females) | 37,2        | 30,8        | 40,4        | 40,0
MMSE                  | 29,1 ± 1,0  | 27,1 ± 1,6  | 27,4 ± 1,6  | 23,4 ± 2,0
CDR                   | 0,00 ± 0,00 | 0,50 ± 0,00 | 0,50 ± 0,00 | 0,81 ± 0,25
3.1.3 Imaging Acquisition and Processing
The FDG-PET images in ADNI were acquired using one of the following PET scanners:
General Electric, Siemens or Philips, and could be acquired according to three different protocols
[44,45]:
1. Dynamic Protocol: six 5-minute frames, which should start 30 minutes after FDG injection [30
- 60 min];
2. Static Protocol: a single 30-minute frame, which should start 30 minutes after FDG injection [30
- 60 min];
3. Dynamic Quantitative Protocol: 33 frames acquired over 60 minutes. The acquisition should
start immediately after FDG injection [0 - 60 min].
The quantitative studies require a stricter technical protocol, so they are performed by only a
few sites. The qualitative studies are easier to perform and are therefore more broadly used [44].
With the aim of having a uniform database and making images from different scanners more
comparable, there are strict processing steps that are applied to the PET image data sequentially
[46]:
1. Dynamic co-registration: for registration purposes, all images are separated into different frames.
To decrease the number of artifacts caused by patient motion during the acquisition, each frame
is co-registered to the first extracted frame, and the recombination yields a dynamic image set.
These image sets have the same image size, the same voxel dimension and
the same spatial orientation as the original PET image data, which is called
the 'native' space. This step can only be applied to images acquired under protocol 1 or 3, i.e.,
the Dynamic Protocol or the Dynamic Quantitative Protocol;
2. Averaging: a single 30-minute PET image frame, in the 'native' space, is created by averaging the
six 5-minute frames, in the case of the Dynamic Protocol, or the last 6 frames, if the performed
protocol was the Dynamic Quantitative;
3. Image and voxel size standardization: each subject's co-registered averaged image is reori-
ented into a 160x160x96 voxel grid with 1,5 mm cubic voxels, in which the anterior-posterior axis
is parallel to the AC-PC line;
4. Resolution standardization: the previously obtained image is smoothed and a uniform isotropic
resolution of 8 mm Full Width at Half Maximum (FWHM) is achieved.
Before starting the analysis, all images have to go through two more transformation steps in order to
make a comparison between different scanners possible:
5. Talairach warping: all images are mapped to the Talairach space, so the location of all
brain structures becomes independent of the shape, size and differences of the brains across
individuals. After Talairach warping, a 128x128x60 grid is generated;
6. Intensity standardization: the intensity of the image is normalized using a subject-specific mask,
which has to ensure that the average of all voxels within the mask is exactly equal to one.
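Step 6 can be sketched in a few lines of NumPy; the random image and the all-ones mask below are stand-ins for a real FDG-PET volume and its subject-specific mask:

```python
import numpy as np

def normalize_intensity(image, mask):
    """Scale voxel intensities so the mean inside the mask equals one."""
    return image / image[mask].mean()

image = np.random.default_rng(2).uniform(1.0, 5.0, size=(8, 8, 8))
mask = np.ones_like(image, dtype=bool)   # stand-in for a subject-specific mask

normalized = normalize_intensity(image, mask)
print(normalized[mask].mean())  # → 1.0 (up to floating-point rounding)
```

Dividing the whole volume by the in-mask mean leaves relative voxel intensities unchanged, which is what makes images from different scanners comparable.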
3.2 Machine Learning Algorithms
In this section, all the ML algorithms used in the present study will be explained in detail. ML is a great
tool because it tries to find a model that minimizes the error on a set of training examples while, at the
same time, allowing some form of inductive bias, which gives the algorithm the ability to generalize
[42].
3.2.1 SVM - Support Vector Machines
SVM became popular some years ago and has the important property of solving a convex opti-
mization problem, which means that any local solution is also a global optimum [26].
3.2.1.A SVM - Basic Concepts
Historically, the origin of SVM dates back to 1962, when Vapnik and Lerner proposed algorithms
for pattern recognition; nevertheless, the notion of SVM as we know it today was only
introduced in 1995, also by Vapnik, this time together with Corinna Cortes [47, 48]. They
defined SVM as a machine designed for binary classification problems where the input vectors are
non-linearly mapped to a very high-dimensional feature space. In this feature space the SVM tries to find
a hyperplane that minimizes the classification error on the training examples and, at the same time,
maximizes the margin, i.e., the distance between the hyperplane and the closest examples in feature
space, which are called support vectors [48–51]. A schematic representation of an SVM is
shown in Figure 3.1.
Figure 3.1: This figure illustrates the basic concepts of SVMs. The support vectors are indicated by the circles. Image based on Bishop [26].
SVMs can have good generalization ability even when working in a huge feature space,
because the hyperplane can be constructed using just the support vectors. Vapnik showed
that the expected probability of an error on a test example is bounded by [48]:

E[Pr(error)] ≤ E[number of support vectors] / (number of training vectors)    (3.1)

As can be seen from Equation (3.1), this bound does not depend on the dimensionality of the space.
Most of the time the data is not linearly separable and, to obtain an optimal separating hyperplane,
high-dimensional kernel transformations have to be applied; in extreme cases, i.e., when the
data distribution is really complex, this can lead to poorer generalization ability. This concept of
hard margins fell in 1995, when Vapnik and Cortes introduced the soft margin, which protects
against overfitting by allowing some data points to be misclassified [26].
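The effect of the soft margin can be illustrated with scikit-learn, where the parameter C controls the penalty for margin violations; the 2-D Gaussian data below is synthetic and the C values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping 2-D Gaussian clouds, labels -1 and +1
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Small C -> wide soft margin: more tolerated violations, more support vectors.
# Large C -> approaches hard-margin behaviour: fewer support vectors.
soft = SVC(kernel="linear", C=0.01).fit(X, y)
hard = SVC(kernel="linear", C=100.0).fit(X, y)
print(len(soft.support_), len(hard.support_))
```

Only the support vectors enter the decision function, which is why, per the bound in Equation (3.1), their number rather than the feature dimension governs the expected test error.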
3.2.1.B SVM - Mathematical Concepts
Let us consider a binary classification problem where the training set χ is such that:

χ = {(x1, y1), . . . , (xN, yN)}, xn ∈ R^K, yn ∈ {−1, 1}    (3.2)
where xn, with n ∈ {1, . . . , N}, is the K-dimensional feature vector of each instance, yn ∈ {−1, 1}
is the class label and N is the total number of training examples. The hyperplane that separates
the training data is the one parameterized by vector w and a bias constant b, given by the following
equation:
w · x + b = 0 (3.3)
Assuming that the data is linearly separable, the decision function is given by
y(x) = w · x + b (3.4)
If the decision function given by Equation (3.4) correctly classifies all training instances, i.e., y(x) > 0
for instances with label y = 1 and y(x) < 0 for instances with y = −1, the hyperplane parameterized
by the vector w and the constant b can be rescaled into the canonical separating hyperplane, for which
the points closest to it satisfy [26]:

yi(w · xi + b) = 1    (3.5)

This can be written compactly as the following constraint, which should be fulfilled by all training
points:

yi(w · xi + b) ≥ 1, ∀i    (3.6)

where the points xi that satisfy the equality, with i ∈ {1, . . . , M} and M ≤ N, are the support vectors.
The optimal hyperplane, among the set of separating hyperplanes, is the one that maximizes the
margin, i.e., the one with the greatest distance between the hyperplane and the support
vectors, the nearest vectors. The distance is given by y(x)/‖w‖, where ‖w‖ is the magnitude of
the vector w, so the distance of the closest points to the decision boundary is given by:

d((w, b), xi) = yi(w · xi + b) / ‖w‖    (3.7)

In order to maximize the margin, and according to Equation (3.7), the inverse of the magnitude of
w, ‖w‖⁻¹, should be maximized, which is equivalent to minimizing (1/2)‖w‖² subject to the constraint
(3.6), and so we have to solve the following quadratic optimization problem [26]:

Minimize:   (1/2)‖w‖²
Subject to: yi(w · xi + b) ≥ 1, ∀i    (3.8)
To solve Equation (3.8) we make use of Lagrange multipliers, αi ≥ 0 with i ∈ {1, . . . ,M} and
M equal to the total number of support vectors, which are very useful when the function to be optimized
has to fulfill one or more constraints. The Lagrangian form of this problem is given by:
L(w, b, Λ) = (1/2)‖w‖² − ∑_{i=1}^{M} αi {yi(w · xi + b) − 1} (3.9)
where Λ = (α1, . . . , αM).
Making use of the dual representation of the Lagrangian function it is possible to eliminate the
vector w as well as the constant b from Equation (3.9). By doing this transformation, the optimization
problem given by Equation (3.8) is easier to solve. First, let us consider the derivatives of the Lagrangian
function (3.9) with respect to both w and b:
∂L(w, b, Λ)/∂w = w − ∑_{i=1}^{M} αi yi xi = 0 (3.10)

∂L(w, b, Λ)/∂b = ∑_{i=1}^{M} αi yi = 0 (3.11)
By using the conditions (3.10) and (3.11) the dual representation of the problem given by Equation
(3.8) can be written as:
L(Λ) = ∑_{i=1}^{M} αi − (1/2) ∑_{i=1}^{M} ∑_{l=1}^{M} αi αl yi yl K(xi, xl) (3.12)
The solution is obtained by maximizing Equation (3.12) subject to the following constraints:
αi ≥ 0, i = 1, . . . ,M (3.13)

∑_{i=1}^{M} αi yi = 0 (3.14)
In Equation (3.12), K(xi, xl) is the kernel function and can be defined as K(xi, xl) = φ(xi)ᵀφ(xl),
in which φ is the feature space transformation. For a linear transformation we can write φ(xi) = xi
and φ(xl) = xl, so the kernel function reduces to K(xi, xl) = φ(xi)ᵀφ(xl) = xiᵀxl = xi · xl
and the Lagrangian dual representation to:
L(Λ) = ∑_{i=1}^{M} αi − (1/2) ∑_{i=1}^{M} ∑_{l=1}^{M} αi αl yi yl (xi · xl) (3.15)
Going deeper into this problem, we must highlight that the optimization of a problem
using Lagrange multipliers subject to an inequality constraint requires that the solution fulfill
three properties known as the Karush-Kuhn-Tucker (KKT) conditions [26]:
αn ≥ 0 (3.16)
yny(xn)− 1 ≥ 0 (3.17)
αn {yny(xn)− 1} = 0 (3.18)
where n indexes the training instances. Through the analysis of Equations (3.16) and (3.18) it
is easy to see that, for every point, either αn = 0 or yny(xn) = 1; vectors other than support vectors
have αn = 0 and thus play no role in the solution, so, as we have been showing so far, only the support
vectors matter. The bias parameter b can be computed by:
b = (1/NS) ∑_{i∈S} ( yi − ∑_{l∈S} αl yl (xi · xl) ) (3.19)
where NS is the total number of support vectors.
As we have already pointed out, sometimes the data is too complex to be linearly separable. The
dual representation allows the introduction of kernels, which can apply transformations to the
feature space other than the linear one, such as the Radial Basis Function (RBF) kernel, widely used in CAD of
AD:
K(xi, xl) = φ(xi)ᵀφ(xl) = exp{−γ‖xi − xl‖²} (3.20)
Even using those high dimensional transformations, most of the time a perfect linear separation of the
data is impossible, therefore the SVM has to be modified to account for data misclassification. To do
that, non-negative slack variables, ξn ≥ 0, n = 1, . . . , N , with N equal to the number of instances,
were introduced. This extension of SVM theory is also known as the soft margin hyperplane and was
introduced by Cortes and Vapnik in 1995 [48].
The slack variable, ξ, penalizes misclassified data as a function of distance. For points that
fall inside the margin but on the correct side of the decision boundary, the slack variables take values
between 0 and 1, 0 < ξ ≤ 1. For points that lie on the wrong side of the decision boundary,
the slack variables take values greater than 1, ξ > 1 [26].
Now, to find the optimal solution we have to maximize the margin and, at the same time, minimize
the errors, thus the optimization problem presented in Equation (3.8) becomes:
Minimize: C ∑_{n=1}^{N} ξn + (1/2)‖w‖²
Subject to: yn(w · xn + b) ≥ 1 − ξn, n = 1, . . . , N (3.21)
where C is an adjustable parameter which controls the width of the margin according to the
cost of misclassification.
The Lagrangian function, as presented in Equation (3.9), becomes:

L(w, b, Λ) = (1/2)‖w‖² + C ∑_{n=1}^{N} ξn − ∑_{n=1}^{N} αn {yn(w · xn + b) − 1 + ξn} − ∑_{n=1}^{N} µn ξn (3.22)
in which αn ≥ 0 and µn ≥ 0 are Lagrange multipliers. Finally, by computing the partial derivatives
∂L/∂w, ∂L/∂b and ∂L/∂ξn and setting each one equal to 0, it is possible to obtain the optimization
problem in dual Lagrangian form:
Maximize: L(Λ) = ∑_{n=1}^{N} αn − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} αn αm yn ym K(xn, xm)
Subject to: 0 ≤ αn ≤ C, n = 1, . . . , N, and ∑_{n=1}^{N} αn yn = 0 (3.23)
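To make the dual concrete, the objective and constraints of Equation (3.23) can be evaluated for any candidate set of multipliers. A minimal Python sketch (the toy data and helper names below are ours, not the thesis's; an actual solver such as SMO would search this feasible set for the maximizer):

```python
def linear_kernel(a, b):
    return sum(u * v for u, v in zip(a, b))

def dual_objective(alpha, X, y, kernel=linear_kernel):
    # L(Lambda) = sum_n alpha_n - 1/2 sum_n sum_m alpha_n alpha_m y_n y_m K(x_n, x_m)
    N = len(X)
    linear_term = sum(alpha)
    quad_term = sum(alpha[n] * alpha[m] * y[n] * y[m] * kernel(X[n], X[m])
                    for n in range(N) for m in range(N))
    return linear_term - 0.5 * quad_term

def is_feasible(alpha, y, C, tol=1e-9):
    # box constraint 0 <= alpha_n <= C and equality constraint sum_n alpha_n y_n = 0
    box = all(0.0 <= a <= C for a in alpha)
    balance = abs(sum(a * yn for a, yn in zip(alpha, y))) < tol
    return box and balance
```

For example, with X = [[1, 0], [−1, 0]], y = [1, −1] and α = [0.5, 0.5], the multipliers are feasible for any C ≥ 0.5 and the objective evaluates to 0.5.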
3.2.1.C Multiclass SVMs
Up to now, just the binary classification problem was addressed, but some problems
involve more than two classes (D > 2). In this case, there are different methods to deal
with this, in which several binary classifiers are combined to build a multiclass classifier. Two main
approaches are commonly used: the one-versus-the-rest approach, in which D different classifiers are
built by using the class Cd as the positive examples and the remaining D − 1 classes as negative
ones, and the one-versus-one approach, the methodology adopted in this study [26].
In the one-versus-one approach, D(D − 1)/2 different classifiers are built to cover all possible pairs
of classes. All classifiers classify the test population and the class is assigned according
to the number of 'votes', which can be an ambiguous rule since two or more
classes may receive the same number of 'votes' (see Figure 3.2) [26].
Figure 3.2: A schematic interpretation of a multiclass SVM algorithm.
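The one-versus-one voting scheme can be sketched as follows (the stub classifiers in the usage example are hypothetical placeholders for trained binary SVMs):

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(classifiers, x, classes):
    # classifiers maps each class pair (a, b) to a decision function that
    # returns a positive value for class a and a negative value for class b
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if classifiers[(a, b)](x) > 0 else b
        votes[winner] += 1
    # ties between classes are possible here, which is exactly the
    # ambiguity of the voting scheme mentioned in the text
    return votes.most_common(1)[0][0]
```

With three classes, three pairwise classifiers vote; for instance, if the (CN, MCI) and (MCI, AD) classifiers both favor MCI, MCI wins with two votes regardless of the (CN, AD) outcome.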
3.2.2 AdaBoost
Adaptive Boosting, also known as AdaBoost, was developed in 1996 by Freund and Schapire and
is the most common form of boosting algorithm [26]. It is called adaptive due to the fact that each
classifier is built taking into account the errors committed by the previous ones [52].
3.2.2.A AdaBoost - Basic Concepts
AdaBoost is an ensemble method that is used to boost the classification by using a combination
of weak and inaccurate rules, i.e., it combines a collection of weak classifiers to form a strong and
more accurate one [41,53,54].
AdaBoost is a sequential classifier in which each weak classifier uses the example weighting
coefficients computed by the previous classifier to perform its classification, as can
be seen in Figure 3.3 [26]. After each round of learning, the weighting coefficient of
each example is updated in order to penalize those which were misclassified and, at the end, all
weak classifiers are combined into a single weighted classifier that can give good results even if each
weak classifier performs only slightly better than chance [26,54].
Figure 3.3: This figure illustrates the basic concepts of AdaBoost [26].
3.2.2.B AdaBoost - Mathematical Concepts
Let us consider the same notation (see Equation (3.2)). Each weak classifier selects a single feature
and the optimal threshold which best separates the positive and the negative examples, and can be
represented by [41,54]:

y(x, f, θ) = −1 if f(x) < θ, 1 otherwise (3.24)
in which y(x, f, θ) represents the weak classifier, f the selected feature and θ the optimal feature
threshold. The threshold has to be carefully chosen because a high threshold lowers the detection
rate, but if the threshold is too low the number of false positives will increase [54].
The learning process of the boosting algorithm starts with the weighting vector initialization. For
the first round of boosting the weights of the examples are set to one over the total number of examples
[53].
w(1) = 1/N (3.25)
After this initialization step, a weak classifier for each feature is constructed. To evaluate the
performance of each weak classifier, a weighted error function is computed and the one that presents
the minimum error is selected [54]:
εk = min_{f,θ} ∑_{i=1}^{N} wi |y(xi, f, θ) − yi| (3.26)
The set of values that minimizes the error function is {fk, θk}, and thus y(x, fk, θk) can be written
simply as yk(x). Note that there are K·N candidate weak classifiers, i.e., one for each
feature/threshold combination. Although the training error may stop changing, after a few rounds
the confidence in the predictions increases considerably, which accounts for the good generalization
performance. The level of confidence is measured by a quantity called the margin [53].
margin = Correct label classification / Incorrect label classification (3.27)
As can be seen from Equation (3.27), the margin is small when the ratio between the
correct and the incorrect label weight is close to 1. In those cases the confidence is low and the
generalization power is poor. Conversely, if this ratio is larger, the margin is big and the generalization
performance is better [53].
After the selection of another weak classifier, the vector of weights is updated to penalize the
classification errors that occur:

w(k+1) = w(k) exp(αk ei) (3.28)

in which ei ∈ {−1, 1} equals 1 if the ith example is misclassified and −1 otherwise, so that the weights
of misclassified examples increase. αk is defined as αk = (1/2) ln((1 − εk)/εk) [53].
After the weights vector is updated and before another learning round is performed the vector has
to be normalized.
wi(k+1) = wi(k+1) / ∑_{j=1}^{N} wj(k+1) (3.29)
with i = {1, . . . , N}, where N is the total number of training examples and k = {1, . . . ,K}, in which
K is the length of the features vector.
Finally, through the ensemble of all K weak classifiers, a more accurate classifier is obtained, also
known as the strong classifier [53]:

Y(x) = 1 if ∑_{k=1}^{K} αk yk(x) ≥ 0, −1 otherwise (3.30)
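The full loop of Equations (3.24)–(3.30) can be condensed into a short decision-stump implementation. This is a simplified sketch on hypothetical toy data; in particular, the stump of Equation (3.24) has no polarity term, so this version only fits problems where the positive class lies above the threshold:

```python
import math

def stump_predict(x, f, theta):
    # weak classifier of Eq. (3.24): -1 if the feature value is below the
    # threshold, +1 otherwise (no polarity term, a simplification)
    return -1 if x[f] < theta else 1

def train_adaboost(X, y, rounds):
    n, n_feat = len(X), len(X[0])
    w = [1.0 / n] * n                                    # Eq. (3.25)
    ensemble = []
    for _ in range(rounds):
        best = None
        for f in range(n_feat):
            for theta in sorted({row[f] for row in X}):  # candidate thresholds
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, f, theta) != yi)
                if best is None or err < best[0]:
                    best = (err, f, theta)
        eps, f, theta = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)            # avoid log of zero
        alpha = 0.5 * math.log((1 - eps) / eps)
        # re-weight: raise misclassified examples, lower the rest (Eq. 3.28)
        w = [wi * math.exp(alpha if stump_predict(xi, f, theta) != yi else -alpha)
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]                     # normalize, Eq. (3.29)
        ensemble.append((alpha, f, theta))
    return ensemble

def strong_predict(ensemble, x):
    # Eq. (3.30): weighted vote of all the weak classifiers
    return 1 if sum(a * stump_predict(x, f, t) for a, f, t in ensemble) >= 0 else -1
```

On a one-dimensional toy set with negatives at low values and positives at high values, a single round already finds the separating threshold; further rounds refine the weighted vote.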
3.2.3 AdaSVM
SVM and, more recently, AdaBoost are machine learning techniques widely used in classification
tasks. Both have been separately applied to medical image analysis, but recently the potential
of applying the two methods together has also been studied. Here we apply AdaBoost and SVM in a
sequential way, a method that takes the name of AdaSVM [51].
Since the work done in 2000 by Tieu and Viola, in the AdaBoost algorithm each weak classifier
depends on only one feature and, as a result, the boosting process can be seen as a greedy feature
selection procedure because, in each round, a new weak classifier is selected [54]. The dependency
between features is encoded by the example weights, and so the features are selected in an
efficient way [54]. On top of this, as can be seen in Figure 3.3, AdaBoost makes
its prediction based on a weighted majority voting scheme in which the more
accurate classifiers have the largest weights, and thus the training error of the final hypothesis quickly
decreases towards zero [26,51,53,54].
Despite all its advantages, such as the speed of learning due to the simplicity of the weak
classifiers, sometimes it is better to have a single more robust classifier instead of a set of less
accurate ones [55]. One possible way to overcome this disadvantage is to use SVM in a sequential
way, i.e., we use AdaBoost to select the features that best fit the problem, thereby reducing the
dimensionality, and SVM as the final classifier trained on the features previously
selected by boosting [51]. With this procedure, the quality of the classification can be improved
[56]. Figure 3.4 shows a schematic representation of the AdaSVM algorithm.
Figure 3.4: This figure illustrates the basic concepts of AdaSVM.
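The key step of AdaSVM, turning the boosting rounds into a feature selection, can be sketched as below, assuming the ensemble is a list of (alpha, feature_index, threshold) tuples produced by an AdaBoost run (a representation we chose for illustration; the thesis does not prescribe one):

```python
def boosted_feature_indices(ensemble):
    # each boosting round selected one (feature, threshold) pair;
    # the distinct feature indices form the reduced set fed to the SVM
    return sorted({f for _alpha, f, _theta in ensemble})

def reduce_features(X, selected):
    # keep only the boosting-selected columns of the data matrix
    return [[row[j] for j in selected] for row in X]
```

The final SVM is then trained on `reduce_features(X_train, selected)` instead of the full feature vectors, which is where the dimensionality reduction comes from.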
3.3 Cross-Validation
In prediction theory, the best classifier is the one that minimizes the error on a test set. Moreover,
tuning the model's parameters is often an important step to improve the accuracy
of the classifier. To account for these requirements the dataset is usually split into three different groups:
a training set, used during the construction of the model, a validation set, used to
choose the parameters that maximize the accuracy of the model, and a test set, used to assess the predictive
power of the model. This separation is essential because the first source of bias in pattern
recognition techniques, also known as "double-dipping", occurs when a sample is involved in its own
classification, causing the obtained results to be overestimated [32].
There are different ways to split the dataset into the corresponding groups. The first is to separate
it into three mutually exclusive subsets where 2/4 of the dataset corresponds to the training population,
1/4 to the validation group and the remaining 1/4 to the test set. Although it is a valid
approach, this method raises some problems. In datasets where the number of features is
much larger than the number of examples, a problem known as the curse of dimensionality
that will be addressed in Section 3.5, this separation into three different groups leaves the training
group unrepresentative, which harms classification accuracy. The prediction error of the
resulting model is not representative of the problem, since the model does not have enough information.
To overcome the aforementioned issue, different dataset separation methods were developed, and
the most widely used is k-fold cross validation, a technique broadly employed to
evaluate the prediction error of models. Here, the data is randomly partitioned into K sets of
approximately equal size and each of the K sets is used for testing while the other K−1 are used to
build the model. Within each training set, another cross validation step is
usually performed to obtain training and validation sets (see Figure 3.5). At the end of the process,
K different models are obtained and the resulting error corresponds to an average. Although it is
computationally costly, since K different models have to be built, cross-validation has some advantages,
such as making full use of the data and achieving a more accurate evaluation of the prediction error
with a lower variance, since it corresponds to an average over K folds [57,58].
Figure 3.5: A schematic view of the cross-validation process.
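The k-fold partitioning described above can be sketched as follows (index bookkeeping only; shuffling and the inner validation loop are omitted for brevity):

```python
def kfold_splits(n, k):
    # partition indices 0..n-1 into k folds of roughly equal size;
    # each fold is the test set once, the remaining k-1 folds train the model
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]
```

In a nested scheme, each training list would itself be re-split the same way to tune parameters on validation folds before the outer test fold is ever touched.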
3.4 Feature Extraction
3.4.1 Voxel Intensity
The aim of this thesis is the construction of different CAD systems to forecast the conversion
from MCI to AD, as outlined in Section 1.2. For that, VI from 3D FDG-PET images was used as the
feature extraction method.

The VI values, V (x, y, z), were taken directly from the ADNI FDG-PET images and are closely related
to glucose uptake; for that reason, as explained in Section 1.1.3, VI values constitute a
good candidate for detecting preclinical stages of AD. As detailed in Section 3.1.3, the FDG-PET
images from ADNI are already processed, resulting in a normalized set of images all having the same
dimensions, 128×128×60. In this way, the domain D of V (x, y, z) can be stated as:

D = {(x, y, z) ∈ N : 1 ≤ x ≤ 128, 1 ≤ y ≤ 128, 1 ≤ z ≤ 60} , V (x, y, z) ∈ [0, 32700] (3.31)
3.4.1.A 1st Mask: Voxels inside the brain volume
In order to reduce the dimensionality of the classification task, another pre-processing step was
applied in which only the voxels inside the brain were considered for classification purposes, while
those lying outside the brain were ignored. To do that, a binary mask was built which is true for
voxels inside the brain and false otherwise. First, an average brain is computed using the CN patients'
information, and then this volume is thresholded at 5% of its maximum value to account for some
variation across the population. The result of applying this mask is shown in Figure 3.6.
Figure 3.6: On the left, a slice of the average brain of all CN patients. On the right, the same slice after the binary mask application.
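The masking steps above can be sketched on flattened volumes as below (toy numbers for illustration; real ADNI volumes have 128×128×60 voxels):

```python
def brain_mask(cn_volumes, fraction=0.05):
    # average the CN volumes voxel-wise, then keep voxels whose mean
    # intensity exceeds `fraction` (5%) of the maximum of the average brain
    n = len(cn_volumes)
    mean = [sum(vol[j] for vol in cn_volumes) / n
            for j in range(len(cn_volumes[0]))]
    cutoff = fraction * max(mean)
    return [m > cutoff for m in mean]

def apply_mask(volume, mask):
    # discard voxels outside the brain; only the rest enter the feature vector
    return [v for v, inside in zip(volume, mask) if inside]
```

Every subject's volume is reduced with the same mask, so all feature vectors keep the same length and voxel correspondence.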
3.4.1.B 2nd Mask: Voxels inside specific brain regions
In this thesis the application of masks other than the previous one was also exploited.
Those masks correspond to different ROIs, and our aim was to find the set of regions that have the
most discriminative power during the classification task, i.e., the regions that are most involved in the
conversion process. These masks were built based on the brain regions previously delimited by Dr.
Durval Campos Costa, from the Champalimaud Foundation (Figure 3.7).

For the purpose of this study, the selected regions are: the Left Lateral Temporal lobe, the Left
Dorsolateral Parietal, the Right Dorsolateral Parietal, the Superior Anterior Cingulate and the Posterior
Cingulate and Precuneus. Those regions showed a higher discriminative power and some of them,
such as the Left Lateral Temporal Lobe and the Posterior Cingulate and Precuneus, are in line with
other studies like [17]. The other three regions were included following indications from Dr. Durval
Campos Costa, and the resulting set achieved better results.
Although both masks can be seen as a feature selection step rather than a feature extraction
procedure, it is more natural to describe them here, as a pre-processing step, because they depend
only on anatomic position and not on the features' information.
3.5 Feature Selection
Although the feature extraction step accounts for a reduction in computational time, it does not give
any statistical information about the dataset, and the feature vectors obtained after the pre-processing step
Figure 3.7: In 3.7(a) the Left Lateral Temporal lobe appears in dark red, the Right Lateral Temporal in red and the Inferior Anterior Cingulate in dark blue. In 3.7(b) the Left Mesial Temporal lobe appears in dark blue, the Right Mesial Temporal in blue and the Inferior Frontal Gyrus/Orbitofrontal in dark red. In 3.7(c) the Left Dorsolateral Parietal lobe appears in dark red, the Right Dorsolateral Parietal in blue, the Superior Anterior Cingulate in red and the Posterior Cingulate and Precuneus in dark blue.
can easily have a huge dimension. A feature vector with high dimensionality combined with a comparatively
small sample size is a common problem in pattern recognition and creates serious
challenges for the classifier's performance, since the risk of overfitting is higher, i.e., the classifier can
perform well on the training sample but lack generalization power and perform poorly on
unseen data.
During the feature selection procedure, the feature vector of each instance is reduced to a new
space of variables with reduced dimensionality. Typically the problem becomes easier to solve
due to the fact that the selected features are the ones that are most discriminative
for the problem at hand [26].
There are some benefits of using feature selection procedures [59]:
• Facilitating data visualization: with a reduced number of features the identification of patterns in
data is easier;
• Reducing storage requirements: with fewer features, the memory occupied by the information is
also smaller;

• Reducing training and utilization times: the time needed to train and test the model is shorter
because the dimensionality of the problem is smaller;

• Improving prediction performance: the relevant features are selected and the noisy ones are
discarded, and thus the classifier performance can be improved.
The feature selection procedure can be seen as a search problem where the number of features
is reduced to a subset of relevant ones. Four main strategies are used for this task. The
first broad class of methods is known as filter methods; here the features are selected based
only on training set characteristics, and the performance of the classifier does not influence the final
decision. The major advantage of these methods is that they perform faster than the others. In the second class
of methods, known as wrapper methods, the adopted strategy is different: the features are
selected based on the maximal ACC of the model on the training data. Although this strategy achieves
better results, it is more computationally costly. In the third class, the embedded methods, the features are
selected during the training process, and this approach is specific to the applied algorithm. Finally, the
last class, known as hybrid methods, combines the first two approaches mentioned herein,
i.e., a subset of candidate features is first quickly selected by a filter approach and this subset
is then reduced based on a more accurate wrapper strategy [59–61].
In the next three sections, the feature selection procedures exploited in this thesis are
summarized, but first it is important to clarify some notation. In the following sections, consider the
dataset χ in which:

χ = {(x(1), y(1)), . . . , (x(N), y(N))} (3.32)

where x(i) = (x1(i), . . . , xK(i)) is the K-dimensional feature vector of the ith example of the dataset,
with i ∈ {1, . . . , N}.
3.5.1 Pearson Correlation Coefficient
In statistical analysis, the dependence between two variables can be measured by correlation. The
PCC, represented by r, measures the linear association between two variables [62], in this case
between each feature xk and the class label y, and can be represented by:

r = cov(xk, y) / ( √var(xk) √var(y) ) (3.33)
where cov stands for covariance and var for variance. An estimate of r is given by [63,64]:

r = ∑_{i=1}^{N} (xk(i) − x̄k)(y(i) − ȳ) / ( √(∑_{i=1}^{N} (xk(i) − x̄k)²) √(∑_{i=1}^{N} (y(i) − ȳ)²) ) (3.34)
where x̄k is the mean value of the kth feature and ȳ the mean value of the label vector y. The
r coefficient has no units and takes values from −1 to 1, −1 ≤ r ≤ +1. A value
of r close to ±1 indicates a strong linear relationship between the two variables regardless of
direction: a positive value indicates a direct relation, i.e., an increase in the first variable
means an increase in the second one, while a negative value represents an inverse relation, in which
one variable decreases as the other increases [62]. Here, the best features for the model are
chosen based on a ranking of the absolute value of the r coefficient.
To conclude this topic, for the special case where the correlation is computed between a continuous
variable and a dichotomous one, as in the case at hand, the correlation coefficient can be simplified to:

rpb = ((M1 − M−1) / sxk) √(N1 N−1 / N²) (3.35)

where rpb is the point-biserial correlation coefficient, a mathematical equivalent of the Pearson
correlation coefficient [65], M1 is the mean value of the examples x(i) with label y = 1, M−1 is the
mean value of the examples x(i) with label y = −1, N1 and N−1 are the numbers of examples with
label 1 and −1 respectively, N is the total number of examples in the dataset and sxk
is the standard deviation of xk, given by:
sxk = √( (1/N) ∑_{i=1}^{N} (xk(i) − x̄k)² ) (3.36)
Here we consider a dichotomous variable y ∈ {−1, 1}, but the same holds for values other
than −1 and 1.
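Equations (3.35) and (3.36) translate directly into a ranking procedure; a minimal Python sketch on toy data (function names are ours):

```python
import math

def point_biserial(xs, ys):
    # Eq. (3.35): correlation between a continuous feature and labels in {-1, 1}
    n = len(xs)
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    m1, m_1 = sum(pos) / len(pos), sum(neg) / len(neg)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)   # Eq. (3.36)
    return (m1 - m_1) / std * math.sqrt(len(pos) * len(neg) / n ** 2)

def rank_features(X, y, top):
    # rank features by |r| and keep the `top` highest, as in Section 3.5.1
    n_feat = len(X[0])
    scores = [(abs(point_biserial([row[k] for row in X], y)), k)
              for k in range(n_feat)]
    return [k for _score, k in sorted(scores, reverse=True)[:top]]
```

A feature that perfectly separates the two labels scores |r| = 1 and is ranked first, while a feature whose class means coincide scores 0 and is discarded.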
3.5.2 Boosting
As explained in Sections 3.2.2 and 3.2.3, each weak classifier depends on only one feature, and
thus the process of choosing a weak classifier is, in fact, a feature selection procedure. The features
are chosen based on error minimization (see Equation (3.26)), i.e., those that most accurately
span the classification problem are selected [51]. Boosting can be seen as an embedded method because the
features are selected during the training process and, although it is more computationally costly,
prior knowledge of the features best suited for classification is not required [51].
3.5.3 Mutual Information
There are several different approaches to measuring the linear relationship between two variables,
such as the Pearson Correlation Coefficient, already mentioned in Section 3.5.1, or the Euclidean
distance but, sometimes, the relation between two variables cannot be expressed in a linear way.
MI is a filter method based on information theory and captures different types of dependencies
between two variables, not just the linear ones [66]. The relation between two random variables x
and y can be measured by considering the divergence between the joint distribution and the product
of the marginals, and is given by [26]:

I(x, y) = ∬ p(x, y) log( p(x, y) / (p(x)p(y)) ) dx dy (3.37)
or, in the discrete case, where the integrals are replaced by summations and the continuous
distributions are estimated by frequency counts [67]:

I(x, y) = ∑_{x∈X} ∑_{y∈Y} P(x, y) log( P(x, y) / (P(x)P(y)) ) (3.38)
where X and Y are the sets of all possible values of the variables x and y, respectively. The aim of using
Mutual Information as a feature selection procedure is to select the features that maximize the joint
mutual information [68]:

I(X1:K, Y) = ∑_{T⊆S} I({T ∪ Y}) (3.39)
where ∑_{T⊆S} denotes the sum over all possible subsets T drawn from S, and S is the set of all possible
input features, S = {X1, . . . , XK}. If we assume that there are no higher order relations other than
conditional and unconditional pairwise relations, the summation is truncated such that |T| ≤ 2, which
gives [68]:

I(X1:K, Y) ≈ ∑_{i=1}^{K} I(Xi, Y) + ∑_{j=1}^{K} ∑_{l=j+1}^{K} I({Xj Xl, Y}) (3.40)
where K is the total number of features in the dataset. The feature selection procedure is an
iterative algorithm in which the utility of choosing feature XK when K − 1 features have already been
selected is quantified by I(XK; Y | X1:K−1) = I(X1:K; Y) − I(X1:K−1; Y). By using the approximation
presented in Equation (3.40) and the definition of interaction information, this utility can be expressed
as:

Jfou = I(XK; Y) − ∑_{i=1}^{K−1} [ I(XK; Xi) − I(XK; Xi | Y) ] (3.41)
where Jfou is the first-order utility (first-order pairwise interaction) (FOU) of including feature
XK [68]. The FOU consists of three different components: the feature's own mutual information with
the class label, a penalization for high correlation with already selected features, and a positive
contribution accounting for the dependency on the class conditional probabilities, which indicates that
the best feature is the one with the best trade-off between these three components [68]. The FOU can
be written in a parameterized way in which not only is the trade-off easier to understand, but different
mutual-information-based feature selection criteria can also be subsumed, since different methods
essentially differ in the weights given to each component [68]:
Jfou = I(XK; Y) − β ∑_{i=1}^{K−1} I(XK; Xi) + γ ∑_{i=1}^{K−1} I(XK; Xi | Y) (3.42)
In the present work, we set β and γ to zero and consider just the MI component between
the feature and the class label. For feature selection purposes, an MI ranking, which can take values
between 0 and 1, is computed and the features with the highest MI values are selected. Once again,
note that the redundancy between features is not taken into account.
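The frequency-count estimate of Equation (3.38) takes only a few lines of Python; a minimal sketch (we use base-2 logarithms, so values for binary variables fall in [0, 1] bits; the thesis does not state the base):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # Eq. (3.38): I(x, y) = sum_xy P(x,y) * log2( P(x,y) / (P(x) P(y)) ),
    # with all probabilities estimated by frequency counts
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

A feature identical to the label yields 1 bit of mutual information, while an independent feature yields 0; ranking features by this value implements the selection criterion used here.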
4 Experimental Results

Contents
4.1 Introduction
4.2 Model's adjustable parameters
4.3 SVMs Results
4.4 AdaBoost Results
4.5 AdaSVM Results
4.6 Summary of all the Results
4.1 Introduction
One of the goals of this work is to perform a comparison between the three machine
learning techniques already mentioned, i.e., SVM, AdaSVM and AdaBoost. Within each method,
different approaches were also studied in order to evaluate the importance of different types of feature
extraction and feature selection techniques, as well as to find the best way of tuning the model's
parameters. In the next sections all those approaches will be explained in more detail and all results
will be presented.
This chapter also compares all the studies in order to infer which is the best classifier, the best
feature extraction procedure and the best feature selection method for the detection of MCI to AD
conversion.
Before starting, it must be clarified how the performance of a model is evaluated. There
are many different ways to measure the prediction power of a model, such as the ACC,
the BA, the SENS or the AUC. Although the most frequently used measure is the ACC, which measures
the effectiveness of predicting the right class label [39], sometimes it is not the best way to
evaluate a classifier. Here, the AUC is the most appropriate measure, since the dataset in
this thesis is imbalanced and the number of negative examples (MCI-NC) is almost twice the number
of positive ones (MCI-C) [27]. Note that the AUC measures the probability of assigning a higher value
to a positive sample when one positive and one negative sample are drawn at random [39].
4.2 Model’s adjustable parameters
When the data is imbalanced, the tendency of the classifier is to classify everything
according to the most significant class, so assigning different weights to different classes becomes very
important, as it can impose a higher penalization on the less represented class, thereby shifting
the decision boundary and making the model more sensitive (see Figure 4.1). In these cases,
the C parameter, which as noted in Section 3.2.1.B controls the width of the margin according to the
cost of misclassification in the SVM algorithm, is also very important and should be tuned properly.
Three different approaches were tested in the SVM and AdaSVM studies, and their characteristics
are clarified next.
1. Without Weights (WOW): the same penalty is given for a mistake committed on either class.
The C parameter is chosen using a grid search, from a previously determined set of values,
based on the model's ACC. For this purpose a 10-fold CV is performed for each possible value
of C, and the selected value is the one that maximizes the mean ACC score over the 10 validation
sets;
2. With Weights (WW): the penalty for a mistake committed on the majority class is lower than for a
mistake committed on the less represented one. The weight of the majority class is set to
1 and the weight of the minority class is found by dividing the number of subjects belonging
to the majority class by the number of individuals in the minority class. The C
parameter is chosen exactly as in the previous test, but now taking the different
penalizations into account;
3. With Weights and based on Balanced Accuracy (WWBA): as in the previous approach, different
classes have different mistake penalizations, but now the C parameter is tuned based on a
different criterion, the BA (the mean of the true positive rate, also known as
SENS, and the true negative rate, normally called SPEC).
Figure 4.1: Consider a synthetic dataset where the negative class (blue crosses) has 10 times more samples than the positive one (red crosses). If weights are not used, the decision boundary (blue line) is such that, on one hand, the negative test examples (black circles) are almost all correctly classified and, on the other hand, more positive examples (pink circles) are misclassified. When the weights are taken into account, the decision boundary shifts (green line) and the sensitivity of the model, the number of positive examples correctly classified, increases but, at the same time, the specificity decreases.
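The weighting scheme of the WW test and the tuning criterion of the WWBA test can be sketched as follows (function names are ours, chosen for illustration):

```python
def class_weights(y):
    # WW scheme: majority class weighted 1, minority class weighted by the
    # ratio of majority-class to minority-class subjects
    pos = sum(1 for label in y if label == 1)
    neg = len(y) - pos
    if pos < neg:
        return {1: neg / pos, -1: 1.0}
    return {1: 1.0, -1: pos / neg}

def balanced_accuracy(y_true, y_pred):
    # WWBA criterion: mean of sensitivity (TPR) and specificity (TNR)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)
```

During the grid search over C, each candidate model is scored either by mean ACC (WOW, WW) or by this balanced accuracy (WWBA) on the validation folds.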
4.3 SVMs Results
There are many possibilities for building an SVM model. Here, three different ways of performing
a linear SVM classification were exploited:
1. The classifier is trained with patterns from the CN and AD population and tested on the MCI set.
In this case the CN subjects are used as the training patterns for the MCI-NC class whereas the
AD subjects are considered the training patterns for the MCI-C class;
2. The classifier is trained and tested using only patterns from the MCI population;
3. The classifier is trained with all subjects in the dataset and tested on the MCI population. As
in 1, the CN subjects are considered as MCI-NC like and the AD subjects are considered as
MCI-C like.
In each case, the influence of different feature extraction and feature selection techniques were
also studied.
4.3.1 Training with CN and AD subjects
In this first approach, the linear SVM model is built based on information from CN and AD patients
and then applied to the MCI population. The number of features varies between 50 and 10.000 and
the selected features, from a pool of more than 300.000, are those that have the highest PCC. For
classification purposes, the three tests previously mentioned in section 4.2 were performed.
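The PCC ranking step can be sketched as below. The data here is synthetic (the real features are voxel intensities) and the sizes are scaled down from the 300.000-voxel pool purely for illustration; the five "informative" voxels are an artificial construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subj, n_vox, k = 60, 500, 50                  # scaled-down problem sizes
y = rng.integers(0, 2, n_subj).astype(float)    # 0 = CN, 1 = AD (toy labels)
X = rng.normal(size=(n_subj, n_vox))            # stand-in voxel intensities
X[:, :5] += 2.0 * y[:, None]                    # five voxels carry class signal

# Pearson correlation coefficient of every voxel with the class label.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
pcc = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Keep the k voxels whose correlation magnitude is highest.
selected = np.argsort(-np.abs(pcc))[:k]
```

On this toy data the five planted voxels end up at the top of the ranking, which is exactly the behaviour the filter relies on.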
The same study was then repeated but, instead of using all the voxels inside the brain volume, only
the information from the voxels inside the five regions described in section 3.4.1.B was used.
Once again, the number of selected features varies between 50 and 10.000, now from around
28.000 possibilities, and the selected features are those with the highest PCC value.
The third study is identical to the first one, but MI was used as the feature selection criterion
instead of PCC. The aim of this study was to understand the influence of the feature selection
method on the final classification performance, i.e., whether the model benefits from a criterion
different from the previous one, which only considered the linear relation between the variables
and the class label.
Finally, the second study was repeated, again with MI as the feature selection criterion instead
of PCC.
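MI ranking can be sketched with a simple histogram estimator, under the same synthetic setup as before. The bin count, the problem sizes and the planted informative voxels are illustrative assumptions; the actual experiments may have used a different MI estimator.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X; Y) in nats, for a continuous voxel x
    and a discrete class label y."""
    joint, _, _ = np.histogram2d(x, y, bins=(bins, 2))
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
n_subj, n_vox, k = 60, 200, 20
y = rng.integers(0, 2, n_subj).astype(float)
X = rng.normal(size=(n_subj, n_vox))
X[:, :5] += 3.0 * y[:, None]                  # strongly informative voxels

mi = np.array([mutual_information(X[:, j], y) for j in range(n_vox)])
selected = np.argsort(-mi)[:k]                # highest-MI voxels, as in the text
```

Unlike PCC, this score also responds to non-linear dependence between a voxel and the label, which is the motivation stated above for trying it.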
Figure 4.2 shows the best ROC curves of all these approaches, where the red trace corresponds
to the first classifier (WOW test), the blue trace to the second (WWBA test), the green to the third
(WOW test) and the black trace to the fourth one (WOW test).
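The AUC values reported alongside these ROC curves can be computed directly from the classifier's decision values via the rank (Mann-Whitney) statistic. This is a generic sketch, not the exact routine used in the experiments, and it ignores tied scores; the four decision values at the end are made up.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC = probability that a random positive outranks a random negative,
    computed from the ranks of the decision values (ties not handled)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy decision values for 2 converters (label 1) and 2 non-converters (label 0).
example_auc = auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Here one converter scores below one non-converter, so `example_auc` is 0.75 rather than 1.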
For all the approaches the best results, regardless of the number of features, are summarized in
Table 4.1.
Table 4.1: The highest results for the SVM classifier when CN and AD are used for training, in terms of Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA) and Area Under the Curve (AUC).

SVM           Tests  ACC(%)  SENS(%)  SPEC(%)  BA(%)  AUC(%)
PCC           WOW    71,3    61,5     80,2     68,8   77,0
              WW     70,6    67,3     75,8     69,9   76,5
              WWBA   71,3    67,3     75,8     70,0   76,9
ROIs & PCC    WOW    74,1    65,4     84,6     70,2   80,8
              WW     74,1    63,5     84,6     71,8   80,6
              WWBA   73,4    71,2     81,4     72,9   81,3
MI            WOW    73,4    76,9     71,4     74,2   78,0
              WW     70,6    69,2     78,0     69,8   77,3
              WWBA   70,6    69,2     75,8     69,5   77,3
ROIs & MI     WOW    75,5    71,2     84,6     71,1   81,9
              WW     74,1    76,9     79,1     71,3   81,4
              WWBA   74,1    75,0     82,4     74,3   81,4
Figure 4.2: A comparison between the best ROC curves achieved when CN and AD subjects are used for training, using the information from the whole brain (2500 features), from the ROIs (10.000 features), from the whole brain with MI as the feature selection criterion (1000 features), and applying the same criterion to the ROIs (10.000 features).
Although SVMs are somewhat robust to imbalanced data, because they only take into account
the support vectors, i.e., the points in the feature space that are close to the decision boundary [27],
the introduction of different penalties according to the number of examples in each class is usually
important for the success of the classification task.
In the first study, as Table 4.1 shows, the overall performance remains essentially unchanged,
but the use of weights introduces a slight difference in the results: when weights are used, the
SENS of the model increases by about 6% and the SPEC decreases in almost the same proportion,
as Figure 4.3 confirms. These results are in line with our expectations: since the MCI-C class is
less represented, applying the weights shifts the decision boundary (see Figure 4.1), so more
positive examples are correctly classified but, on the other hand, more negative ones are
mislabelled. Here, the differences between the second and the third tests are not significant.
(a) Sensitivity (b) Specificity
Figure 4.3: Sensitivity and Specificity variation as a function of the number of features for an SVM classifier when CN and AD are used as the training set and all voxels inside the entire brain volume are considered. Here the features are selected according to the PCC ranking criterion.
The second study is similar to the first one, but a different feature extraction mask is used. The
SENS and the SPEC reached in this study are almost 4% better than in the previous one, and the
ACC as well as the BA improved by about 3%. The AUC achieved is also 4% better, which shows
that the set of regions considered is quite discriminative for the problem of predicting MCI-to-AD
conversion. The same conclusion can be drawn from Figure 4.2, where the ROC curve of the
second classifier, in which only the voxels inside the ROIs are used (blue trace), always lies
above the ROC curve of the first one, in which all voxels inside the brain volume are taken into
account (red trace).
The transition from the MCI stage to AD is not linear, so the use of feature selection methods
other than PCC can be beneficial for the model. With the aim of studying the influence of
different feature selection procedures on the predictive power of the classifier, the same two
studies were repeated using MI ranking as the criterion to select the best features. When only
the voxels inside the ROIs were considered, the AUC of the classifier is almost 3% better, which
once again puts in evidence the high predictive power of those regions (see Figure 4.2). The ACC,
the BA and the SPEC are also higher than when the voxels within the entire brain volume were
used. The obtained results are, in general, better than the ones achieved with the PCC criterion
(see Figure 4.2), especially in terms of SENS, which improved by between 2% and 15%. To better
understand this discrepancy, the classifier outputs for the tests that registered the best SENS
results were compared; they are presented in Figures 4.4 and 4.5.
Figure 4.4: A comparison between the outputs (decision values per subject, grouped as CN, MCI-NC, MCI-C and AD) of SVM classifiers when CN and AD are used for training, with PCC (top) and MI (bottom) as the feature selection procedure.
Figure 4.5: A comparison between the outputs (decision values per subject, grouped as CN, MCI-NC, MCI-C and AD) of SVM classifiers when CN and AD are used for training and only the voxels that fall inside the ROIs are used, with PCC (top) and MI (bottom) as the feature selection procedure.
First of all, as the figures show, the MCI-NC pattern is closer to the CN pattern and the MCI-C
pattern to the AD one, in line with our expectations. In terms of classification performance, the
number of misclassifications when the ROIs are used is always smaller than when all voxels inside
the brain volume are taken into account. Moreover, when MI is used as the feature selection
procedure, the number of mislabelled samples within the training classes is also smaller.
It is also important to note that, in the third study, when all Voxels in the Entire Brain volume
(VEB) are considered and the features are selected according to the MI ranking criterion, the
SENS of the classifier in the WOW test is higher, which is not in line with our expectations (see
Figure 4.1). This happens because the number of false positives in the WOW test is already very
high, so the introduction of weights has the opposite effect: the SPEC increases and the SENS
decreases.
Finally, to conclude this discussion of SVM classifiers with CN and AD as the training classes,
and to better understand what is happening, the selected features were analysed (see Figures 4.6
and 4.7).
Figure 4.6: Features selected by PCC to build the classifier when CN and AD subjects are used for training.
Figure 4.7: Features selected by MI to build the classifier when CN and AD subjects are used for training.
The features selected by PCC influence each other, since the correlation coefficient value of one
voxel depends on the values registered by its direct neighbours, and so the selected feature
pattern shows less variation, as can be seen in Figure 4.6. This dependency is easier to
understand with Figure 4.8, in which a PCC filtering operation is simulated. On the other hand,
as Figure 4.7 shows, the MI method applies a different kind of filter; the pattern of selected
features is therefore sparser and the model can predict AD conversion more accurately.
Figure 4.8: A mathematical interpretation of correlation. From the bottom figure it is possible to understand that the pixel of interest (the one in the middle) is evaluated as -1 x 83 - 1 x 94 + 1 x 48 + 1 x 95 = -34 [69].
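The worked example from the caption can be reproduced as a tiny neighbourhood filter. The exact kernel layout is an assumption inferred from the quoted arithmetic: −1 weights on the two neighbours that enter negatively and +1 on the two that enter positively.

```python
import numpy as np

# 3x3 patch around the pixel of interest; the four neighbour values are the
# ones quoted in the caption (83, 94, 48, 95), the rest are irrelevant here.
patch = np.array([[0, 83,  0],
                  [94, 0, 48],
                  [0, 95,  0]])

# Assumed kernel layout reproducing the caption's arithmetic.
kernel = np.array([[0, -1, 0],
                   [-1, 0, 1],
                   [0,  1, 0]])

response = int(np.sum(patch * kernel))   # -1*83 - 1*94 + 1*48 + 1*95 = -34
```

Because each response mixes the values of direct neighbours, adjacent voxels obtain similar scores, which is why the PCC-selected pattern in Figure 4.6 looks spatially smooth.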
4.3.2 Training and testing with MCI data
In this study, the model was built and tested based only on the MCI subjects' information. To
avoid overfitting, the separation between the MCI training set and the MCI testing set was done
by 10-fold CV, repeated 10 times. As in all the studies mentioned so far, the number of selected
features varies between 50 and 10.000, and those selected are the ones with the highest PCC value.
Once again, three different tests were performed.
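The 10-times-repeated 10-fold CV protocol can be sketched generically as below. To keep the sketch self-contained, a nearest-centroid classifier stands in for the SVM and the data is synthetic; only the splitting and averaging logic mirrors the protocol described above.

```python
import numpy as np

def repeated_kfold(X, y, fit, score, k=10, repeats=10, seed=0):
    """Mean and standard deviation of `score` over `repeats` reshuffles
    of a k-fold cross-validation split."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(len(y)), k)
        for f in range(k):
            train = np.hstack([folds[j] for j in range(k) if j != f])
            model = fit(X[train], y[train])
            scores.append(score(model, X[folds[f]], y[folds[f]]))
    return float(np.mean(scores)), float(np.std(scores))

# Stand-in classifier: assign each subject to the nearest class centroid.
def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def accuracy(model, X, y):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    pred = np.array(classes)[np.argmin(dists, axis=0)]
    return float(np.mean(pred == y))

rng = np.random.default_rng(4)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5)) + 2.0 * y[:, None]   # well-separated toy classes
mean_acc, std_acc = repeated_kfold(X, y, fit_centroids, accuracy)
```

The Mean ± Standard Deviation entries in Tables 4.2 and 4.3 are aggregates of exactly this kind, taken over the 100 fold-level scores.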
The second study is identical to the first, but a different feature extraction mask is used: the
one that only takes into account the information from the ROIs, as previously described in
section 3.4.1.B.
The third study is similar to the first one, i.e., the classifier was also built based on information
from the entire brain region, but MI ranking was used as the feature selection criterion.
Finally, the procedures of the second study were repeated, with the features again selected
according to the MI criterion.
For all the approaches described, the highest results are presented in Table 4.2, regardless of the
number of selected features. The results correspond to the mean values of the 10-fold CV procedure.
Figure 4.9 shows the best ROC curves of all these approaches, where the red trace corresponds
to the first classifier (WOW test), the blue trace to the second (WW test), the green to the third (WW
test) and the black trace to the fourth one (WOW test).
Table 4.2: The highest results, in terms of Mean±Standard Deviation, for the SVM classifier when MCI subjects are used for training. Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

SVM           Tests  ACC(%)     SENS(%)    SPEC(%)    BA(%)      AUC(%)
PCC           WOW    67,2±2,4   42,7±4,0   84,6±2,3   60,9±2,9   64,6±2,1
              WW     64,6±2,9   49,0±5,1   74,2±3,8   61,2±3,2   64,4±2,7
              WWBA   64,1±2,5   52,5±3,6   70,7±3,7   61,6±2,3   64,1±2,5
ROIs & PCC    WOW    70,1±1,4   37,7±3,2   90,9±2,7   62,3±1,2   68,3±1,4
              WW     71,0±0,9   47,7±4,1   86,3±1,4   65,2±1,2   68,3±1,4
              WWBA   68,7±2,2   52,9±4,2   80,1±3,9   64,1±1,8   67,8±1,1
MI            WOW    65,7±2,5   36,9±5,9   83,2±4,4   59,3±2,1   62,9±2,6
              WW     63,4±2,5   45,6±5,1   74,7±4,0   59,1±3,3   63,1±2,3
              WWBA   63,5±2,3   52,1±2,8   70,8±2,5   61,1±2,1   62,4±2,1
ROIs & MI     WOW    70,6±1,1   38,1±3,2   92,6±1,2   62,3±1,1   68,3±1,2
              WW     70,0±1,0   46,9±3,8   85,8±1,2   63,9±1,6   67,8±1,0
              WWBA   68,0±1,4   54,2±3,6   77,7±2,6   64,3±1,4   67,8±1,4
Figure 4.9: A comparison between the best ROC curves achieved when MCI subjects are used for training and testing, using the information from the whole brain (250 features), from the ROIs (10.000 features), from the whole brain with MI as the feature selection criterion (500 features), and applying the same criterion to the ROIs (10.000 features).
When the classifier is trained and tested with MCI subjects, the performance levels are lower,
because MCI information is noisier due to the variety of metabolic patterns found across the MCI
population. As already mentioned, patients do not follow a perfectly linear transition, so it is
sometimes difficult to know which stage a patient is in. Once again, when only the information
from the ROIs is used, the results are slightly better. Although the SENS shows no improvement,
and in the WOW and WW tests is even worse, the SPEC of the model increases by between 6%
and 12% and the AUC by about 4%. The superiority of the latter classifier can be verified in
Figure 4.9, where the corresponding ROC curve (blue trace) almost always lies above the ROC
curve of the first study, in which all voxels inside the brain volume are used for classification
(red trace).
In the third and fourth studies, in which MI is used as the feature selection procedure, the results
are very close to those achieved in the first and second ones, respectively (see Figure 4.9), and
thus it is not possible to say which feature selection method is best for this problem.
In all the tests performed in each study, the obtained results are in accordance with our
expectations: when weights are introduced, the SENS of the model increases and, at the same
time, the SPEC decreases. The study in which these differences are most noticeable is the last
one, where the WOW and WWBA tests differ by more than 15% (see Figure 4.10); however, these
large differences appear in all studies, because the number of MCI-NC subjects is more than twice
the number of MCI-C subjects (see Figure 4.1 for more details).
(a) Sensitivity (b) Specificity
Figure 4.10: Sensitivity and Specificity variation as a function of the number of features for an SVM classifier when MCI subjects are used as the training set and only the voxels inside the ROIs are considered. Here the features are selected according to the MI ranking criterion.
4.3.3 Training with all classes and testing with MCI population
During this study, the model is learnt in a different way: instead of a binary classifier, a
multiclass approach is used, as explained in Section 3.2.1.C. All classes, CN, MCI-NC, MCI-C
and AD, were used for training, but only the MCI individuals were considered for testing. Once
again, three different tests were performed. Here, different penalties were applied according to
the number of individuals in each class: the majority class has the lowest mistake penalization,
that is, 1, and the class with the fewest individuals has the highest one, found by dividing the
total number of examples in the majority class by the total number of examples in the least
represented one. To avoid overfitting, the separation between the MCI training set and the MCI
testing set was done by 10-fold CV, repeated 10 times. The number of selected features varies
between 50 and 10.000, and they correspond to those with the highest PCC.
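The penalty rule just described can be written down directly. The class counts below are hypothetical, used purely to illustrate the rule; the actual group sizes are given elsewhere in the thesis.

```python
# Hypothetical class sizes for the four training groups (illustrative only).
counts = {"CN": 68, "MCI-NC": 91, "MCI-C": 52, "AD": 70}

# The majority class gets penalty 1; every other class is penalised by the
# ratio N_majority / N_class, so mistakes on rarer classes cost more.
n_major = max(counts.values())
penalty = {c: n_major / n for c, n in counts.items()}
```

With these counts, MCI-NC (the majority) keeps penalty 1 while MCI-C, the smallest group, receives the largest penalty, 91/52.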
The other three studies follow the same principles as before: the second study is a repetition of
the first one using only the features that fall inside the ROIs; the third study also follows the
first one but uses MI as the feature selection criterion; and the fourth study is a repetition of
the second one with MI as the feature selection method.
For all the approaches described in this section, the highest results are presented in Table 4.3,
regardless of the number of selected features. The results correspond to the mean values of the
10-fold CV procedure.
Figure 4.11 shows the best ROC curves of all these approaches, where the red trace corresponds
to the first classifier (WWBA test), the blue trace to the second (WWBA test), the green to the
third (WOW test) and the black trace to the fourth one (WWBA test).
Table 4.3: The highest results, in terms of Mean±Standard Deviation, for the SVM classifier when all subjects are used for training. Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

SVM           Tests  ACC(%)     SENS(%)    SPEC(%)    BA(%)      AUC(%)
PCC           WOW    71,0±0,9   41,0±2,0   89,3±0,7   64,9±1,4   72,6±0,8
              WW     70,6±0,8   61,7±3,7   84,9±1,8   66,1±1,6   72,8±0,7
              WWBA   69,5±1,4   63,8±5,8   82,4±2,9   66,3±2,7   73,9±1,2
ROIs & PCC    WOW    72,7±0,8   45,2±3,4   90,2±1,2   66,1±0,7   75,6±1,7
              WW     69,4±1,1   57,5±2,5   78,7±2,7   66,3±1,1   76,5±1,2
              WWBA   70,1±1,0   59,8±2,5   78,7±1,7   67,2±0,9   77,6±2,0
MI            WOW    71,0±1,5   41,9±3,1   89,6±0,9   64,8±1,7   74,0±1,2
              WW     70,7±1,2   59,6±3,1   80,0±1,9   67,2±1,1   73,5±1,1
              WWBA   69,9±1,5   60,7±3,7   79,8±2,5   66,5±1,5   73,5±1,5
ROIs & MI     WOW    72,5±1,7   47,9±4,5   91,2±0,9   67,1±1,9   74,9±0,8
              WW     69,9±1,8   56,7±4,6   80,3±2,2   66,2±2,7   76,0±1,8
              WWBA   70,8±1,4   58,1±3,5   81,1±1,9   67,0±1,3   76,8±1,1
Figure 4.11: A comparison between the best ROC curves achieved when all subjects are used for training and testing, using the information from the whole brain (5000 features), from the ROIs (10.000 features), from the whole brain with MI as the feature selection criterion (2500 features), and applying the same criterion to the ROIs (5000 features).
When all classes are considered for training, the results for the same number of features (10.000)
exhibit an intermediate behaviour. As can be seen in Figure 4.12, the ROC curve of this study
lies below the one corresponding to the classifier built with the CN and AD patients' information
(red trace) but above the one obtained when only MCI patients are considered for training (black
trace), since the SENS of the model improves by up to 11% and the AUC increases by about 9%.
Figure 4.12: A comparison between the three different ways of training the SVM classifier explored herein, when the ROIs are used.
Once again, it is important to highlight that the SENS of the model increases considerably with
the introduction of weights. This improvement holds for all the studies and sometimes reaches the
remarkable value of 23%, as in the first study. The classifier built on information from the ROIs
shows a performance improvement of almost 4%.
As in the case where only the MCI information is considered, there are no significant differences
between the two feature selection procedures, which is also shown by the ROC curves in
Figure 4.11, where the red and green traces, as well as the blue and black traces, almost overlap.
To conclude the SVM analysis, the best way to build a classifier is to use the CN and AD subjects
as the training set and test the model on the MCI patients, since this classifier achieves the best
performance in terms of SENS and AUC. Here, the best feature selection procedure is MI, since
the obtained results are slightly better. Moreover, in the other two ways of building a classifier,
i.e., when only the MCI class or when all classes are used during the training phase, the feature
selection method does not make much difference, probably due to the noisy nature of the MCI
data. On the other hand, the feature extraction mask applied during the pre-processing phase
plays an important role: when only the voxels inside the ROIs are used, the achieved results are
at least 3% better, which proves the high discriminative power of those regions in distinguishing
between MCI converters and non-converters, and also puts in evidence that the feature selection
procedures cannot be perfect. Although the number of features needed to obtain these results
is more than double that of the pre-processing step explained in section 3.4.1.A (10.000 features),
the computational time decreases, since the dimensionality of the problem is much smaller
(almost 12 times). Still regarding the SVM classifiers, although for the majority of the studies
the best AUC results were achieved through the WOW test, the best way to tune the C parameter
is the WWBA method, because in almost every study the SENS increases, meaning that the
model becomes more prone to detecting a new positive case, which is also a very important
aspect when deciding which classifier is best for the problem at hand. The differences in AUC
between these two tests are not very significant, and in several cases the AUC in the WWBA
test registers a higher value. This analysis allows us to conclude that the introduction of weights
does not have much influence on the overall performance, but strongly influences the capacity of
the classifier to detect a new positive case.
Up to now, only SVMs were used for classification. As already mentioned, two other
classification approaches were also explored; the methods and results are explained in the next
two sections. At the end, Table 4.8 presents a summary of all the results obtained during this
thesis.
4.4 AdaBoost Results
Two different ways to build the AdaBoost classifier were investigated:
1. The classifier is trained with CN and AD population and tested on the MCI set;
2. The classifier is trained and tested based only on MCI information.
These options correspond to the first two considered when building an SVM classifier. The third
option, i.e., using all classes for training, is not considered, since the AdaBoost toolbox used in
these experiments does not handle multiclass classification problems.
Before presenting the results, it is important to note that this method was only tested with the
ROIs and not with the whole-brain voxels approach. In the Boosting method, the features are
selected in a different way: the AdaBoost algorithm automatically selects the most important
features for classification and therefore does not need a prior feature selection step, so the
performance level can be high even in high-dimensional problems. To do so, however, the
AdaBoost algorithm builds KN classifiers, in which K is the number of features and N is the
total number of training examples. To reduce the computational cost, and in order to directly
compare these studies with the ones performed with the SVM classifier, only the smallest
problems were considered.
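A compact sketch of AdaBoost with decision stumps, showing both the K×N stump search per round and the implicit feature selection. This is a generic textbook implementation on synthetic data, not the toolbox used in the experiments; the data sizes, number of rounds and the single planted informative feature are illustrative assumptions.

```python
import numpy as np

def best_stump(X, y, w):
    """Search all K features x N thresholds for the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)                     # (error, feature, thr, sign)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= t, 1.0, -1.0)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, rounds=10):
    w = np.full(len(y), 1.0 / len(y))              # uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = best_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = s * np.where(X[:, j] <= t, 1.0, -1.0)
        w *= np.exp(-alpha * y * pred)             # boost misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    agg = sum(a * s * np.where(X[:, j] <= t, 1.0, -1.0)
              for a, j, t, s in ensemble)
    return np.where(agg >= 0, 1.0, -1.0)

rng = np.random.default_rng(5)
y = np.repeat([1.0, -1.0], 15)
X = rng.normal(size=(30, 8))
X[:, 2] += 3.0 * y                                 # only feature 2 is informative
ens = adaboost(X, y)
used_features = {j for _, j, _, _ in ens}          # the implicitly selected voxels
train_acc = float(np.mean(predict(ens, X) == y))
```

Because each weak learner touches a single feature, the set `used_features` is the feature selection that comes for free with boosting, which is the property exploited in the studies below.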
4.4.1 Training with CN and AD subjects and using only the voxels within the ROIs
The model was built using the information from the ROIs of the CN and AD population previously
mentioned and tested on the MCI set. The number of features used during the classification
process varies between 50 and 1.000, chosen by boosting from a total of around 28.000
possibilities. The results are shown in Table 4.4.
Table 4.4: The highest results for the AdaBoost classifier when CN and AD subjects are used for training and only the voxels within the ROIs are used (4.4.1). Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

Study     ACC(%)  SENS(%)  SPEC(%)  BA(%)  AUC(%)
AdaBoost  79,0    76,9     80,2     78,6   83,4
If we compare this method with the second and fourth studies explained in section 4.3.1, which
cover classification with the SVM algorithm, improvements of 2,1% and 1,5% in AUC, respectively,
are noted. Figure 4.13 compares the best ROC curve of each of those studies; the ROC curve
corresponding to this study (red trace) is almost always above the others, and was obtained with
far fewer features (20 times fewer). The SENS is also almost 7% higher than the best obtained in
the second SVM study, which is a point in favour of AdaBoost, since misclassifying a patient as
a healthy subject may bring severe consequences and thus the SENS should be as high as
possible [39]. When compared with the fourth SVM study, in which MI was used as the feature
selection criterion, the SENS shows no difference, but this value is reached with only 50 features,
whereas the SVM needs 50 times more features to obtain the same level of performance.
Figure 4.13: A comparison between the best ROC curves achieved for different feature selection criteria when CN and AD subjects are used for training and only the voxels inside the ROIs are considered.
To conclude, and as discussed in section 3.2.2, the features chosen are those that minimize the
training error (see Equation 3.26). Each weak classifier selects its feature independently, and
thus the selected pattern is highly dispersed (Figure 4.14).
Figure 4.14: Features selected by Boosting when CN and AD subjects are used for training the model.
4.4.2 Training and testing with MCI data and using only the voxels within the ROIs
In this study, the features that belong to the ROIs were used to train and test an AdaBoost
classifier with the MCI subjects. As in the SVM studies, the separation between the MCI training
set and the MCI testing set was done by 10-fold CV, repeated 10 times. The results in terms of
ACC, SENS, SPEC, BA and AUC are presented in Table 4.5. The weight of each subject at each
iteration was also analysed and can be seen in Figure 4.15.
Table 4.5: The highest results, in terms of Mean±Standard Deviation, for the AdaBoost classifier when MCI subjects are used for training and only the voxels within the ROIs are used (4.4.2). Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

Study     ACC(%)    SENS(%)   SPEC(%)   BA(%)     AUC(%)
AdaBoost  69,0±1,8  51,7±1,8  80,5±3,2  66,0±1,8  69,5±1,6
Figure 4.15: Subjects' weights at each iteration.
This study shows that using the MCI subjects as the training set decreases the performance of
the classifier, even with AdaBoost. Nevertheless, the obtained results are slightly better (by 1%
or 2%) than when the SVM is used, and with fewer features, which is an important aspect since
it reduces the computational time.
As can be seen in Figure 4.15, some examples of the training set increase their weight at each
iteration, which suggests that they are mislabelled. However, because this is not a dominant
behaviour, the low performance of the classifier is probably due to the nature of the data, that
is, the difficulty in distinguishing between two patterns as similar as those of MCI converters and
non-converters. This is an important aspect to highlight, since the AdaBoost algorithm puts
more emphasis on the misclassified data; thus, if the examples are mislabelled, the importance
given to them does not reflect the truth and the model cannot achieve good generalization power.
4.5 AdaSVM Results
Finally, the AdaBoost and SVM algorithms are joined together to form a new classifier called
AdaSVM. To perform this kind of classification, the features are first selected by AdaBoost and
then fed to the SVM classifier. Only the two studies previously mentioned in Section 4.4 were
repeated for AdaSVM classification. Once again, only the voxels that lie inside the ROIs were
considered, due to computational time limitations.
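The two-stage AdaSVM pipeline — boosting picks the features, the SVM does the final classification — can be sketched end to end. To keep the example self-contained, the boosting stage is reduced to ranking features by their best single-stump error, and the SVM is a plain subgradient-descent linear version; data sizes, the planted informative features and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 40
y = np.repeat([1.0, -1.0], 40)
X = rng.normal(size=(n, p))
X[:, [3, 7, 11]] += 1.5 * y[:, None]            # three informative "voxels"

def stump_error(x, y):
    """Lowest training error of a one-feature threshold classifier."""
    return min(min(np.mean(np.where(x <= t, 1.0, -1.0) != y),
                   np.mean(np.where(x <= t, -1.0, 1.0) != y))
               for t in np.unique(x))

# Stage 1 (stand-in for AdaBoost's selection): rank features by stump error
# and keep the unique indices of the best ones ("repeated features" removed).
errors = np.array([stump_error(X[:, j], y) for j in range(p)])
selected = np.unique(np.argsort(errors)[:5])

# Stage 2: train a linear SVM by subgradient descent on those features only.
Xs = X[:, selected]
w, b = np.zeros(len(selected)), 0.0
for _ in range(300):
    viol = y * (Xs @ w + b) < 1                 # margin violations
    w -= 0.01 * (w - y[viol] @ Xs[viol])
    b -= 0.01 * (-np.sum(y[viol]))
train_acc = float(np.mean(np.sign(Xs @ w + b) == y))
```

The design point is that stage 1 shrinks the feature space drastically before the SVM ever sees the data, which is where the computational saving reported below comes from.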
4.5.1 Training with CN and AD subjects and using only the voxels within the ROIs
The set of features previously selected by boosting in section 4.4.1 is filtered to eliminate
repeated features and then used to train an SVM classifier, with the CN and AD population as
the training set and the MCI subjects for testing. The same three tests explained in section 4.2
were performed, and the highest results are presented in Table 4.6.
Table 4.6: The highest results for the AdaSVM classifier when CN and AD are used for training and only the voxels within the ROIs are used (4.5.1). Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

AdaSVM  ACC(%)  SENS(%)  SPEC(%)  BA(%)  AUC(%)
WOW     75,5    78,8     75,8     75,4   79,6
WW      74,1    80,8     75,8     75,0   79,2
WWBA    75,5    76,9     76,9     75,8   80,2
As Table 4.6 shows, the results in terms of SENS are at least 4% better than the ones presented
in the second and fourth SVM studies when CN and AD are used for training. In terms of ACC,
BA and SPEC, the results of all the aforementioned studies are very close. The AUC is also very
similar, but the same performance is achieved with a much smaller number of features, which is
a point in favour of boosting as the feature selection procedure when compared with PCC or MI.
Nevertheless, when compared with AdaBoost, this method obtained worse results, except for the
SENS, which improved by 4%. As expected, the introduction of different penalizations improves
the SENS of the model.
A comparison between the ROC curves achieved by the four methods is presented in Figure 4.16.
Here we can verify that, for the same number of features (500), the SVM classifier with PCC as
the feature selection method (blue trace) has the worst performance. It is followed by the SVM
with MI as the feature selection criterion (black trace), AdaSVM (green trace) and, finally,
AdaBoost (red trace), which shows the best result. From this direct comparison, it is also
possible to conclude that not only is AdaBoost the best classifier for differentiating between
MCI-NC and MCI-C when CN and AD are used as the training population, but Boosting is also
the best feature selection procedure for this classification task.
Figure 4.16: A comparison between different classifiers and classification procedures for a fixed number of features (500) when CN and AD subjects are used for training and only the voxels that lie inside the ROIs are used.
4.5.2 Training and testing with MCI data and using only the voxels within the ROIs
As in the first AdaSVM study, the set of features previously selected by boosting in section 4.4.2
is filtered to eliminate repeated features and then used to train an SVM classifier with the MCI
data. Once again, to avoid overfitting, the classification results are an average over a 10 times
repeated 10-fold CV procedure, with each of the 10 folds having its corresponding set of features.
The same three tests (WOW, WW, WWBA) were performed and the results are presented in
Table 4.7.
Table 4.7: The highest results for the AdaSVM classifier when MCI data is used for training and only the voxels within the ROIs are used (4.5.2). Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).

AdaSVM  ACC(%)    SENS(%)   SPEC(%)   BA(%)     AUC(%)
WOW     63,4±2,9  47,5±4,7  72,7±3,9  59,8±2,8  62,1±1,8
WW      63,2±2,0  52,9±2,6  69,0±2,1  60,9±2,0  63,2±2,5
WWBA    62,6±1,8  52,5±3,7  68,9±2,1  60,4±1,9  62,5±1,7
In terms of SENS, the results are slightly better (by 1% or 2%) than the ones obtained with
AdaBoost, but in general, when compared with the earlier studies that use the same conditions
to build the classifier, this method obtained worse results, which once again highlights the noisy
nature of the data. Figure 4.17 compares the best ROC curves for the aforementioned studies,
in which only the MCI patients' information was considered during the training and testing
phases and only the voxels that fall within the ROIs were used (the feature extraction procedure
explained in 3.4.1.B). As can be seen, when MCI subjects are used for training, resorting to
Boosting as the feature selection criterion is not a good option. Boosting is an embedded method
of AdaBoost, and when it is applied as a feature selection method for an SVM trained on noisy
data, the discriminative power of the classifier becomes poor.
Figure 4.17: A comparison between different classifiers and classification procedures for a fixed number of features (500) when MCI subjects are used for training and only the voxels that lie inside the ROIs are used.
It is also important to note that for the AdaSVM classifier the best way of tuning the C parameter
is again the WWBA method, since in almost every study both the SENS and the AUC increase, with
the exception of the first study mentioned herein, which presents a higher SENS when the WW
method is used.
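One common way to implement such a C-tuning scheme is a cross-validated grid search scored on balanced accuracy with balanced class weights; this is a hedged sketch of the general idea, not the exact WWBA procedure, and the grid and data are illustrative:

```python
# Tuning the SVM C parameter by grid search on balanced accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=200, weights=[0.6, 0.4],
                           random_state=0)
search = GridSearchCV(SVC(kernel="linear", class_weight="balanced"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      scoring="balanced_accuracy", cv=5).fit(X, y)
print("best C:", search.best_params_["C"])
```

Scoring on balanced accuracy rather than raw accuracy prevents the chosen C from simply favouring the majority (non-converter) class.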
4.6 Summary of all the Results
Table 4.8 summarizes the best results achieved in each study in terms of ACC, SENS, SPEC, BA and
AUC, regardless of the number of features or the way in which the C parameter was tuned. When the
study was performed with a 10 times 10-fold CV procedure, the results appear in Mean ± Standard
Deviation format.
Table 4.8: A summary of all the results obtained during this thesis. Training Classes (TC), Feature Extraction (FE), Voxels in the Entire Brain volume (VEB), Feature Selection (FS), Accuracy (ACC), Sensitivity (SENS), Specificity (SPEC), Balanced Accuracy (BA), Area Under the Curve (AUC).
Classifier   TC      FE    FS        ACC(%)      SENS(%)     SPEC(%)     BA(%)       AUC(%)
SVM          CN&AD   VEB   PCC       71,3        67,3        80,2        70,0        77,0
SVM          CN&AD   ROIs  PCC       74,1        71,2        84,6        72,9        81,3
SVM          CN&AD   VEB   MI        73,4        76,9        78,0        74,2        78,0
SVM          CN&AD   ROIs  MI        75,5        76,9        84,6        74,3        81,9
SVM          MCI     VEB   PCC       67,2±2,4    52,5±3,6    84,6±2,3    61,6±2,3    64,6±2,1
SVM          MCI     ROIs  PCC       71,0±0,9    52,9±4,2    90,9±2,7    65,2±1,2    68,3±1,4
SVM          MCI     VEB   MI        65,7±2,5    52,1±2,8    83,2±4,4    61,1±2,1    63,1±2,3
SVM          MCI     ROIs  MI        70,6±1,1    54,2±3,6    92,6±1,2    64,3±1,4    68,3±1,2
SVM          ALL     VEB   PCC       71,0±0,9    63,8±5,8    89,3±0,7    66,3±2,7    73,9±1,2
SVM          ALL     ROIs  PCC       72,7±0,8    59,8±2,5    90,2±1,2    67,2±0,9    77,6±2,0
SVM          ALL     VEB   MI        71,0±1,5    60,7±3,7    89,6±0,9    67,2±1,1    74,0±1,2
SVM          ALL     ROIs  MI        72,5±1,7    58,1±3,5    91,2±0,9    67,1±1,9    76,8±1,1
AdaBoost     CN&AD   ROIs  Boosting  79,0        76,9        80,2        78,6        83,4
AdaBoost     MCI     ROIs  Boosting  69,0±1,8    51,7±1,8    80,5±3,2    66,0±1,8    69,5±1,6
AdaSVM       CN&AD   ROIs  Boosting  75,5        80,8        76,9        75,8        80,2
AdaSVM       MCI     ROIs  Boosting  63,4±2,9    52,9±2,6    72,7±3,9    60,9±2,0    63,2±2,5
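The metrics in the table relate to the confusion matrix as follows; the counts below are hypothetical, chosen only to approximately reproduce the best AdaBoost row (with MCI-C taken as the positive class):

```python
# Hypothetical confusion-matrix counts (MCI-C = positive class), chosen to
# approximately reproduce the best AdaBoost row in Table 4.8.
tp, fn, fp, tn = 40, 12, 19, 77

acc = (tp + tn) / (tp + tn + fp + fn)   # Accuracy
sens = tp / (tp + fn)                   # Sensitivity: detected converters
spec = tn / (tn + fp)                   # Specificity: detected non-converters
ba = (sens + spec) / 2                  # Balanced Accuracy
print(f"ACC={acc:.1%}  SENS={sens:.1%}  SPEC={spec:.1%}  BA={ba:.1%}")
```

Balanced accuracy averages the per-class rates, which is why it is the fairer figure when converters are the minority class.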
The following paragraphs summarize all the results presented so far. The best way to perform
feature extraction and feature selection, and the best classifier for each training and testing
situation, are among the questions addressed. The best way of tuning the model parameters in SVM
and AdaSVM, as well as the variation of the per-subject weighting vector obtained with the
AdaBoost classifier (see Figure 4.15), are two further topics that are not left out.
In SVMs, the best results were achieved when the classifier uses the CN and AD population as a
training set. In this case, the chosen feature selection procedure also proves to be an important
aspect, with MI outperforming PCC in both tests (see Figure 4.2). The same did not hold when
the model was built based only on MCI subjects' information, where the results obtained with the
different feature selection methods are too close to say which method better captures the
differences between converter and non-converter patients (see Figure 4.9). When all classes are
used for training, an intermediate performance level is achieved, which clearly shows the noisy
nature of MCI data and how much a classifier benefits from stable data such as CN and AD
patients' information (see Figure 4.12). For SVM classifiers, the best way of tuning the C
parameter proves to be the WWBA method.
The AdaBoost algorithm outperforms SVMs in all studies (see Figures 4.16 and 4.17), which can be
explained by differences in model construction. These high performances are obtained with fewer
features and, consequently, the necessary computing time is also reduced.
Since patients do not follow a linear evolution between the normal and AD stages, when Boosting
is used as the feature selection procedure the performance of SVMs improves in the case where CN
and AD were used as the training set, but the same did not hold in the case where MCI data was
considered, which shows that the Boosting methodology is not a good option as a feature selection
procedure when noisy data is used to build a classifier.
As with the SVM classifiers, for AdaSVM the WWBA method proved to be the best way of tuning the C
parameter. The hypothesis raised in Section 3.2.3, that joining the AdaBoost and SVM algorithms
sequentially would improve the predictive power of the model, is not verified. AdaBoost
outperforms AdaSVM in all studies, no matter which subjects are used to build the classifier, in
almost all aspects, with the exception of SENS, where the results improve by 4% for models built
on the CN and AD population and by 1% when only MCI information is used to train the SVM model.
To better understand what is happening, we looked at the outputs of the AdaSVM and AdaBoost
classifiers when the CN and AD population is used for training, and when MCI subjects are used
for the same purpose (see Figures 4.18 and 4.19).
[Plots: decision values per subject for the AdaBoost (left) and AdaSVM (right) classifiers, with CN, MCI-NC, MCI-C and AD subjects marked.]
Figure 4.18: A comparison between the outputs of the AdaSVM and AdaBoost classifiers when CN and AD patients are used for training.
[Plots: decision values per subject for the AdaBoost (left) and AdaSVM (right) classifiers, with MCI-NC and MCI-C training and test examples marked.]
Figure 4.19: A comparison between the outputs of the AdaSVM and AdaBoost classifiers when MCI subjects are used for training.
As can be seen in Figure 4.18, although AdaBoost presents a better class separation, AdaSVM
misclassifies fewer positive test examples, and thus the SENS of the model, i.e., its capacity to
detect a positive example, is higher.
In Figure 4.19 the noisy nature of the data is once again evident. Looking at the AdaSVM
classifier output, where the classifier commits several classification errors even within the
training classes, we can understand its worse classification performance. However, it is
important to highlight that even when MCI subjects are used in both the training and testing
phases, the AdaSVM classifier makes fewer errors among the positive testing examples and
therefore registers a better performance in terms of SENS than AdaBoost.
To conclude this discussion, the classifier that shows the best performance is AdaBoost trained
on CN and AD patients and tested on the MCI cohort, which achieved an ACC of 79%, with a SENS of
76,9%, a SPEC of 80,2% and a BA of 78,6%. The AUC was also computed and registered a remarkable
value of 83,4%.
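The AUC values reported throughout can be computed directly from a classifier's continuous decision values; the labels and scores below are illustrative, not the thesis data:

```python
# AUC from decision values: the probability that a randomly chosen positive
# (converter) is scored above a randomly chosen negative. Values illustrative.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]                      # 1 = converter
scores = [-1.2, -0.4, 0.1, 0.8, 1.5, -0.2, 2.0, 0.3]   # decision values
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.3f}")  # -> 0.875
```

Unlike accuracy, the AUC does not depend on a single decision threshold, which is why it complements the ACC/SENS/SPEC figures in the tables above.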
5 Conclusions and Future Work
The early detection of AD constitutes a major and important challenge today, as it allows early
treatment and an improvement in quality of life. In the last 20 years much attention has been
given to the detection of MCI conversion, which is, by definition, an intermediate stage between
normal elderly and AD, and new forms of CAD have emerged.
The present work tries to bring some alternatives to CAD in the task of detecting MCI-to-AD
conversion. Up to now, detection has mainly been based on SVM algorithms, so we tried to bring
something new to the state of the art by using AdaBoost for classification. A variant of SVM,
known as AdaSVM, was also studied. In addition, different methods for feature selection and
feature extraction were tested with the aim of understanding which is the best approach to
differentiate between the two patterns, i.e., between MCI converters and non-converters.
The classification method, as well as the way in which the classifier is built, has a strong
influence on the final classification performance. Three different classifiers were tested here:
SVM, AdaBoost and AdaSVM. AdaBoost proved to be the best choice. The way in which the classifier
is built also plays a role of major importance. In this thesis, three different approaches were
used for SVM and two for AdaBoost and AdaSVM. The one that showed the highest classification
performance was AdaBoost with the model trained on CN and AD information and then applied to the
MCI subjects. This is due to the noisy nature of MCI information, which introduces uncertainty
into the classification.
In the context of feature selection procedures, three different approaches were studied: PCC, MI
and Boosting. Here, it is difficult to say which one is the best choice, because it depends on
the classifier and on the conditions under which it is constructed. For example, although
Boosting achieves the best classification performance in AdaSVM when the classifier is trained
with the CN and AD population and tested on MCI subjects, the same does not hold when the model
is built with MCI information. Another example of such differences occurs with the SVM
classifier: when the model is built with CN and AD information, MI proves to be the best option,
but when the same classifier is trained with MCI patients, PCC and MI achieve similar results.
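The two filter criteria can be sketched as follows: each scores every voxel against the label and the top-ranked voxels are kept (synthetic data and k = 20 are illustrative, not the thesis values):

```python
# PCC vs. MI filter feature selection: score each feature against the label,
# keep the top-k. Synthetic data stands in for the PET voxels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=120, n_features=300, n_informative=10,
                           random_state=0)
pcc = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
mi = mutual_info_classif(X, y, random_state=0)
top_pcc = np.argsort(pcc)[::-1][:20]
top_mi = np.argsort(mi)[::-1][:20]
print("overlap between the two rankings:", len(set(top_pcc) & set(top_mi)))
```

PCC only measures linear association with the label, whereas MI can also capture nonlinear dependencies, which is one reason the two rankings need not coincide.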
In terms of feature extraction procedures, VI is used in all the studies performed in this
thesis. In addition, two different masks were applied in a pre-processing step to reduce the
dimensionality of the problem. The first mask considers only the voxels inside the brain, so all
the PET image background is discarded. The second mask is more selective and takes into account
only the voxels inside specific brain regions, i.e., the Left Lateral Temporal Lobe, the Left
Dorsolateral Parietal, the Right Dorsolateral Parietal, the Superior Anterior Cingulate, and the
Posterior Cingulate and Precuneus. These regions showed a higher predictive power in the
detection of MCI-to-AD conversion, which indicates that they are the regions most affected during
the transition from MCI-NC to MCI-C.
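This masking step amounts to indexing the 3D volume with a binary mask and flattening the in-ROI voxels into a feature vector; the shapes below are illustrative, not the actual ADNI image dimensions:

```python
# ROI-mask feature extraction: a binary mask selects the voxels inside the
# chosen regions and flattens them into a 1D feature vector.
import numpy as np

rng = np.random.default_rng(0)
volume = rng.random((91, 109, 91))       # one synthetic PET volume
mask = np.zeros_like(volume, dtype=bool)
mask[30:60, 40:70, 30:60] = True         # stand-in for the ROI labels
features = volume[mask]                  # 1D vector of in-ROI voxels only
print("voxels kept:", features.size, "of", volume.size)
```

The same mask must be applied to every subject so that feature i always corresponds to the same voxel location across the cohort.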
The aim of this thesis was to compare different classification methods for detecting MCI-to-AD
conversion. In summary, the AdaBoost classifier trained on CN and AD patients and using only the
voxels that fall inside the ROIs shows a better discriminative power than classifiers trained on
the MCI set, which is consistent with the noisy nature of MCI data and demonstrates that a
classifier that can separate CN from AD subjects is also able to separate MCI-NC from MCI-C.
AdaBoost as well as AdaSVM also show a high sensitivity, i.e., the capacity to detect a positive
example, which is very important since misclassifying an AD patient can have severe consequences.
Naturally, some changes can be made with the aim of improving the classification performance. One
of them is to perform multiclass classification with the AdaBoost algorithm. This approach was
tested only with the SVM algorithm, where it performed better than the SVM model trained with MCI
information alone, so it would be interesting to compare the results achieved by a multiclass SVM
with those achieved by AdaBoost and verify whether, once again, AdaBoost outperforms SVM in the
classification task.
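A hedged sketch of such a multiclass setup, using a one-vs-rest linear SVM over four synthetic classes standing in for CN, MCI-NC, MCI-C and AD (data and parameters are illustrative):

```python
# Multiclass (one-vs-rest) linear SVM over four classes standing in for
# CN, MCI-NC, MCI-C and AD. Synthetic data; all values illustrative.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=100, n_informative=12,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
clf = LinearSVC(dual=False).fit(X, y)    # one-vs-rest by default
print("decision_function shape:", clf.decision_function(X).shape)
```

The same four-class label vector could be fed directly to AdaBoost (e.g. via the SAMME multiclass extension), which is the comparison suggested above.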
Still among the classification algorithm alternatives, a robust version of the AdaBoost algorithm
could also be tried. Since AdaBoost focuses mainly on the errors committed, if an example is
wrongly labelled, AdaBoost will give it more and more importance iteration after iteration and
the performance level becomes poor. With a robust version of AdaBoost, wrongly labelled examples
would be ignored and the classifier performance could improve, which constitutes a good strategy
when dealing with noisy data such as MCI data.
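This robustification idea can be illustrated with a toy AdaBoost loop whose weight update is capped, so persistently misclassified (possibly mislabelled) examples cannot dominate later rounds; this is one simple variant among several, and all values are illustrative:

```python
# Toy AdaBoost with a hard cap on example weights, limiting the influence of
# possibly mislabelled examples. The cap value and data are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def capped_adaboost(X, y, n_rounds=20, cap=0.05):
    """y in {-1, +1}; returns the stumps and their alpha coefficients."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)        # standard AdaBoost update
        w = np.minimum(w / w.sum(), cap)      # cap suspicious examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] > 0, 1, -1)
y[:5] *= -1                                   # inject some label noise
stumps, alphas = capped_adaboost(X, y)
score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
print("training accuracy:", float(np.mean(np.sign(score) == y)))
```

Without the cap, the five flipped labels would accumulate ever-larger weights; the cap keeps their influence bounded, which mimics the behaviour a robust AdaBoost variant would seek on noisy MCI labels.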
In terms of feature selection, methodologies other than PCC, MI and Boosting could be employed in
an attempt to improve the classification performance. Several studies have been done in the area
of feature selection procedures, such as those by Bicacro [70] and Morgado [71], but these works
are not focused on the detection of MCI-to-AD conversion, which constitutes a gap that should be
filled.
In terms of feature extraction procedures, there are also different approaches that can be
followed, some of them explored by the aforementioned authors, such as two- and three-dimensional
local binary patterns, local variance and 3D Haar-like features. All these strategies can now be
applied to the problem of distinguishing between MCI-NC and MCI-C with the intent of improving
the classification performance. On top of this, we could also try to find more discriminative
regions that would make the problem smaller in terms of the number of features.
All the ideas mentioned so far have the single goal of improving the early detection of AD. Other
strategies, such as different machine learning techniques, can also be explored to detect AD with
the intention of starting treatment as soon as possible to delay neuronal degeneration and,
ideally, in the future, in line with recent developments, to fully treat the disease and thus
avoid its progression.
In recent years, much work has been done with CAD systems, especially to detect the presence or
absence of AD. The problem of detecting MCI-to-AD conversion is harder because of the similarity
between the patterns involved, and needs to be explored more broadly. These 20 years of work are
just the beginning of a long but interesting journey.
Bibliography
[1] C. P. Ferri, R. Sousa, E. Albanese, W. S. Ribeiro, and M. Honyashiki, “World Alzheimer Report
2009,” 2009.
[2] S. G. Mueller, M. W. Weiner, L. J. Thal, R. C. Petersen, C. R. Jack, W. Jagust, J. Q. Tro-
janowski, A. W. Toga, and L. Beckett, “Ways toward an early diagnosis in Alzheimer’s disease:
The Alzheimer’s Disease Neuroimaging Initiative (ADNI),” Alzheimer´s and Dementia, pp. 55–66,
2005.
[3] Alzheimer’s Association, “Basics of Alzheimer Disease: What it is and what you can do,” 2012.
[4] A. Wimo and M. Prince, “World Alzheimer Report 2010,” September 2009.
[5] Alzheimer’s Society, “What is Alzheimer’s Disease,” 2013.
[6] D. Religa, K. Spangberg, A. Wimo, A.-K. Edlun, B. Winblad, and M. Eriksdotter Jonhagen, “De-
mentia Diagnosis Differs in Men and Women and Depends on Age and Dementia Severity: Data
from SveDem, the Swedish Dementia Quality Registry,” Dementia and Geriatric Cognitive Disor-
ders, vol. 33, pp. 90–95, 2012.
[7] W. Thies and L. Bleiler, “2013 Alzheimer’s Disease Facts and Figures,” Alzheimer’s and Demen-
tia, vol. 9, pp. 208–245, 2013.
[8] M. Prince, R. Bryce, and C. Ferri, “World Alzheimer Report 2011,” September 2011.
[9] A. Wimo, L. Jonsson, J. Bond, M. Prince, and B. Winblad, “The worldwide economic impact of
dementia 2010,” Alzheimer’s and Dementia, vol. 9, pp. 1–11, 2013.
[10] C. Misra, Y. Fan, and C. Davatzikos, “Baseline and longitudinal patterns of brain atrophy in
MCI patients, and their use in prediction of short-term conversion to AD: Results from ADNI,”
NeuroImage, vol. 44, pp. 1415–142, 2009.
[11] L. Mosconi, R. Mistur, R. Switalski, W. H. Tsui, L. Glodzik, Y. Li, E. Pirraglia, S. De Santi, B. Reis-
berg, T. Wisniewski, and M. J. de Leon, “FDG-PET changes in brain glucose metabolism from
normal cognition to pathologically verified Alzheimer’s Disease,” European Journal of Nuclear
Medicine and Molecular Imaging, vol. 36, pp. 811–822, January 2009.
[12] L. Mosconi, M. Brys, L. Glodzik Sobanska, S. De Santi, H. Rusinek, and M. J. de Leon, “Early
detection of Alzheimer’s Disease using neuroimaging,” Experimental Gerontology, vol. 42, pp.
129–138, July 2006.
[13] H. Braak and E. Braak, “Neuropathological stageing of Alzheimer-related changes,” Acta Neu-
ropathologia, vol. 82, pp. 239–259, June 1991.
[14] J. Jackson Siegal, “Our current understanding of the pathophysiology of Alzheimer’s Disease,”
August 2005.
B. Dubois, H. H. Feldman, C. Jacova, S. T. DeKosky, P. Barberger-Gateau, J. Cummings,
A. Delacourte, D. Galasko, S. Gauthier, G. Jicha, K. Meguro, J. O'Brien, F. Pasquier, P. Robert,
M. Rossor, S. Salloway, Y. Stern, P. J. Visser, and P. Scheltens, “Research criteria for the
diagnosis of Alzheimer's Disease: revising the NINCDS-ADRDA criteria,” The Lancet Neurology,
vol. 6, no. 8, pp. 734–746, 2007.
[16] Alzheimer’s Society, “The Mini Mental Stade Examination (MMSE),” 2012.
O. Querbes, F. Aubry, J. Pariente, J.-A. Lotterie, J.-F. Demonet, V. Duret, M. Puel, I. Berry,
J.-C. Fort, P. Celsis, and The Alzheimer's Disease Neuroimaging Initiative, “Early diagnosis of
Alzheimer's Disease using cortical thickness: impact of cognitive reserve,” Brain, vol. 132,
pp. 2036–2047, 2009.
[18] S. Duchesne, C. Bocti, K. De Sousa, G. B. Frisoni, H. Chertkow, and D. L. Collins, “Amnestic
MCI future clinical status prediction using baseline MRI features,” Neurobiology of aging, vol. 31,
no. 9, pp. 1606–1617, 2010.
[19] Alzheimer’s Society, “Drug treatments for Alzheimer’s Disease,” 2014.
[20] Medical Press. (2014) Compound reverses symptoms of Alzheimer's Disease in mice, research
shows. [Online]. Available: http://medicalxpress.com/news/
2014-05-compound-reverses-symptoms-alzheimer-disease.html
[21] Andrew Webb, Introduction to Biomedical Imaging. IEEE Press Series in Biomedical Engineer-
ing, 2003.
[22] Holger Grull, Nuclear Imaging and Radiochemistry, 2013.
[23] L. G. Apostolova and P. M. Thompson, “Mapping progressive brain structural changes in early
Alzheimer’s Disease and Mild Cognitive Impairment,” Neuropsychologia, vol. 46, no. 6, pp. 1597–
1612, 2008.
[24] E. Salmon, F. Lekeu, C. Bastin, G. Garraux, and F. Collette, “Functional imaging of cognition in
Alzheimer’s Disease using positron emission tomography,” Neuropsychologia, vol. 46, no. 6, pp.
1613–1623, 2008.
[25] S. F. Eskildsen, P. Coupe, D. García-Lorenzo, V. Fonov, J. C. Pruessner, and D. L. Collins,
“Prediction of Alzheimer’s Disease in subjects with Mild Cognitive Impairment from the ADNI
cohort using patterns of cortical thinning,” NeuroImage, vol. 65, pp. 511–521, 2013.
[26] Christopher M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.
[27] Y. Cui, P. S. Sachdev, D. M. Lipnicki, J. S. Jin, S. Luo, W. Zhu, N. A. Kochan, S. Reppermund,
T. Liu, J. N. Trollor et al., “Predicting the development of Mild Cognitive Impairment: A new use
of pattern recognition,” Neuroimage, vol. 60, no. 2, pp. 894–901, 2012.
[28] C. Davatzikos, P. Bhatt, L. M. Shaw, K. N. Batmanghelich, and J. Q. Trojanowski, “Prediction
of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification,” Neurobiology of
aging, vol. 32, no. 12, pp. 2322–e19, 2011.
[29] E. Westman, A. Simmons, J. Muehlboeck, P. Mecocci, B. Vellas, M. Tsolaki, I. Kłoszewska,
H. Soininen, M. W. Weiner, S. Lovestone et al., “AddNeuroMed and ADNI: similar patterns of
Alzheimer’s atrophy and automated MRI classification accuracy in Europe and North America,”
Neuroimage, vol. 58, no. 3, pp. 818–828, 2011.
[30] R. Cuingnet, E. Gerardin, J. Tessieras, G. Auzias, S. Lehericy, M.-O. Habert, M. Chupin, H. Be-
nali, and O. Colliot, “Automatic classification of patients with Alzheimer’s Disease from structural
MRI: a comparison of ten methods using the ADNI database,” Neuroimage, vol. 56, no. 2, pp.
766–781, 2011.
[31] R. Wolz, V. Julkunen, J. Koikkalainen, E. Niskanen, D. P. Zhang, D. Rueckert, H. Soininen,
J. Lotjonen, Alzheimer’s Disease Neuroimaging Initiative et al., “Multi-method analysis of MRI
images in early diagnostics of Alzheimer’s Disease,” PloS one, vol. 6, no. 10, p. e25446, 2011.
[32] P. Coupe, S. F. Eskildsen, J. V. Manjon, V. S. Fonov, J. C. Pruessner, M. Allard, and D. L. Collins,
“Scoring by nonlocal image patch estimator for early detection of Alzheimer’s Disease,” NeuroIm-
age: clinical, vol. 1, no. 1, pp. 141–152, 2012.
[33] Y. Cho, J.-K. Seong, Y. Jeong, and S. Y. Shin, “Individual subject classification for Alzheimer’s
Disease based on incremental learning using a spatial frequency representation of cortical thick-
ness data,” Neuroimage, vol. 59, no. 3, pp. 2217–2230, 2012.
[34] J. Young, M. Modat, M. J. Cardoso, A. Mendelson, D. Cash, and S. Ourselin, “Accurate multi-
modal probabilistic prediction of conversion to Alzheimer’s Disease in patients with Mild Cognitive
Impairment,” NeuroImage: clinical, vol. 2, pp. 735–745, 2013.
[35] D. Zhang, Y. Wang, L. Zhou, H. Yuan, and D. Shen, “Multimodal classification of Alzheimer’s
Disease and Mild Cognitive Impairment,” Neuroimage, vol. 55, no. 3, pp. 856–867, 2011.
[36] P. Vemuri, S. D. Weigand, D. S. Knopman, K. Kantarci, B. F. Boeve, R. C. Petersen, and C. R.
Jack Jr, “Time-to-event voxel-based techniques to assess regional atrophy associated with MCI
risk of progression to AD,” Neuroimage, vol. 54, no. 2, pp. 985–991, 2011.
[37] S. J. Teipel, C. Born, M. Ewers, A. L. Bokde, M. F. Reiser, H.-J. Moller, and H. Hampel, “Multivari-
ate deformation-based analysis of brain atrophy to predict Alzheimer’s disease in Mild Cognitive
Impairment,” Neuroimage, vol. 38, no. 1, pp. 13–24, 2007.
[38] C. Hinrichs, V. Singh, L. Mukherjee, G. Xu, M. K. Chung, and S. C. Johnson, “Spatially aug-
mented LPboosting for AD classification with evaluations on the ADNI dataset,” Neuroimage,
vol. 48, no. 1, pp. 138–149, 2009.
[39] C.-Y. Wee, P.-T. Yap, D. Zhang, K. Denny, J. N. Browndyke, G. G. Potter, K. A. Welsh Bohmer,
L. Wang, and D. Shen, “Identification of MCI individuals using structural and functional connec-
tivity networks,” Neuroimage, vol. 59, no. 3, pp. 2045–2056, 2012.
[40] G. Chetelat, B. Landeau, F. Eustache, F. Mezenge, F. Viader, V. de La Sayette, B. Desgranges,
and J.-C. Baron, “Using voxel-based morphometry to map the structural changes associated
with rapid conversion in MCI: a longitudinal MRI study,” Neuroimage, vol. 27, no. 4, pp. 934–946,
2005.
[41] M. Silveira and J. Marques, “Boosting Alzheimer’s Disease diagnosis using PET images,” in
Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 2556–
2559.
[42] C. Hinrichs, V. Singh, G. Xu, and S. C. Johnson, “Predictive markers for AD in a multi-modality
framework: an analysis of MCI progression in the ADNI population,” Neuroimage, vol. 55, no. 2,
pp. 574–589, 2011.
[43] ADNI. Alzheimer’s Disease Neuroimaging Initiative. [Online]. Available: http://adni-info.org/
Home.aspx
[44] PET Technical Procedures Manual. [Online]. Available: http://www.adni-info.org/Scientists/
ADNIStudyProcedures.aspx
[45] K. R. Gray, R. Wolz, R. A. Heckemann, P. Aljabar, A. Hammers, and D. Rueckert, “Multi-region
analysis of longitudinal FDG-PET for the classification of Alzheimer’s disease,” NeuroImage,
vol. 60, no. 1, pp. 221–229, 2012.
[46] ADNI. PET Pre-processing. [Online]. Available: http://adni.loni.usc.edu/methods/pet-analysis/
pre-processing/
[47] V. Vapnik, “Pattern recognition using generalized portrait method,” Automation and remote con-
trol, vol. 24, pp. 774–780, 1963.
[48] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–
297, 1995.
[49] R. Filipovych and C. Davatzikos, “Semi-supervised pattern classification of medical images: ap-
plication to mild cognitive impairment (MCI),” NeuroImage, vol. 55, no. 3, pp. 1109–1119, 2011.
[50] E. Mayoraz and E. Alpaydin, “Support vector machines for multi-class classification,” in Engineer-
ing Applications of Bio-Inspired Artificial Neural Networks. Springer, 1999, pp. 833–842.
[51] J. H. Morra, Z. Tu, L. G. Apostolova, A. E. Green, A. W. Toga, and P. M. Thompson, “Comparison
of Adaboost and support vector machines for detecting Alzheimer’s Disease through automated
hippocampal segmentation,” Medical Imaging, IEEE Transactions on, vol. 29, no. 1, pp. 30–43,
2010.
[52] A. Savio, M. García-Sebastián, M. Graña, and J. Villanua, “Results of an Adaboost approach
on Alzheimer’s Disease detection on MRI,” in Bioinspired Applications in Artificial and Natural
Computation. Springer, 2009, pp. 114–123.
[53] Robert E. Schapire, “Explaining Adaboost.”
[54] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer
vision, vol. 57, no. 2, pp. 137–154, 2004.
[55] X. Li, L. Wang, and E. Sung, “A study of Adaboost with SVM based weak learners,” in Neural
Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, vol. 1.
IEEE, 2005, pp. 196–201.
[56] L. Huang, Z. Pan, H. Lu et al., “Automated Diagnosis of Alzheimer’s Disease with Degenerate
SVM-Based Adaboost,” in Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2013
5th International Conference on, vol. 2. IEEE, 2013, pp. 298–301.
[57] C. Bergmeir, M. Costantini, and J. M. Benıtez, “On the usefulness of cross-validation for direc-
tional forecast evaluation,” Computational Statistics & Data Analysis, vol. 76, pp. 132–143, 2014.
[58] R. Kohavi et al., “A study of cross-validation and bootstrap for accuracy estimation and model
selection,” in IJCAI, vol. 14, no. 2, 1995, pp. 1137–1145.
[59] R. Jensen and Q. Shen, Computational intelligence and feature selection: rough and fuzzy ap-
proaches. John Wiley & Sons, 2008, vol. 8.
[60] P. Langley et al., Selection of relevant features in machine learning. Defense Technical Infor-
mation Center, 1994.
[61] N. Hoque, D. Bhattacharyya, and J. Kalita, “MIFS-ND: A mutual information-based feature selec-
tion method,” Expert Systems with Applications, vol. 41, no. 14, pp. 6371–6385, 2014.
[62] P. Sedgwick, “Pearson’s correlation coefficient,” BMJ: British Medical Journal, vol. 345, 2012.
[63] P. Ahlgren, B. Jarneving, and R. Rousseau, “Requirements for a cocitation similarity measure,
with special reference to Pearson’s correlation coefficient,” Journal of the American Society for
Information Science and Technology, vol. 54, no. 6, pp. 550–560, 2003.
[64] J. Wang, “Pearson Correlation Coefficient,” in Encyclopedia of Systems Biology. Springer, 2013,
pp. 1671–1671.
[65] R. F. Tate, “Correlation between a discrete and a continuous variable. Point-biserial correlation,”
The Annals of mathematical statistics, pp. 603–607, 1954.
[66] R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig, “The mutual information: detecting
and evaluating dependencies between variables,” Bioinformatics, vol. 18, no. suppl 2, pp. S231–
S240, 2002.
[67] G. Brown, A. Pocock, M.-J. Zhao, and M. Lujan, “Conditional likelihood maximisation: a uni-
fying framework for information theoretic feature selection,” The Journal of Machine Learning
Research, vol. 13, no. 1, pp. 27–66, 2012.
[68] G. Brown, “A new perspective for information theoretic feature selection,” in International Confer-
ence on Artificial Intelligence and Statistics, 2009, pp. 49–56.
[69] L. Florack, Mathematical Techniques for Image Analysis, 2008.
[70] Eduardo Bicacro, “Alzheimer's Disease Diagnosis using 3D Brain Images,” Master's thesis,
Instituto Superior Tecnico, Universidade de Lisboa, 2011.
[71] Pedro Morgado, “Automated Diagnosis of Alzheimer's Disease using PET Images,” Master's
thesis, Instituto Superior Tecnico, Universidade de Lisboa, 2012.
A ROIs
(a) (12) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal.
(b) (13) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal; Blue - Left Lateral Temporal; Green - Right Lateral Temporal.
(c) (14) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal; Blue - Left Lateral Temporal; Green - Right Lateral Temporal.
(d) (15) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal; Blue - Left Lateral Temporal; Green - Right Lateral Temporal.
(e) (16) Orange - Left Mesial Temporal; Brown - Right Mesial Temporal; Blue - Left Lateral Temporal; Green - Right Lateral Temporal.
(f) (17) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(g) (18) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(h) (19) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(i) (20) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
Figure A.1: Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation.
(a) (21) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(b) (22) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(c) (23) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(d) (24) Yellow - Left Mesial Temporal; Orange - Right Mesial Temporal; Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Orbitofrontal.
(e) (25) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(f) (26) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(g) (27) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(h) (28) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(i) (29) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
Figure A.2: (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation.
(a) (30) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(b) (31) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(c) (32) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(d) (33) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(e) (34) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Brown - Inferior Anterior Cingulate.
(f) (35) Blue - Left Lateral Temporal; Light Blue - Right Lateral Temporal; Yellow - Inferior Anterior Cingulate; Brown - Posterior Cingulate and Precuneus.
(g) (36) Brown - Posterior Cingulate and Precuneus; Red - Superior Anterior Cingulate.
(h) (37) Brown - Posterior Cingulate and Precuneus; Red - Superior Anterior Cingulate; Yellow - Left Dorsolateral Parietal; Orange - Right Dorsolateral Parietal.
(i) (38) Brown - Posterior Cingulate and Precuneus; Red - Superior Anterior Cingulate; Yellow - Left Dorsolateral Parietal; Orange - Right Dorsolateral Parietal.
Figure A.3: (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation.
(a) (39) (b) (40) (c) (41)
(d) (42) (e) (43) (f) (44)
(g) (45) (h) (46) (i) (47)
Figure A.4: (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation. Brown - Posterior Cingulate and Precuneus; Red - Superior Anterior Cingulate; Yellow - Left Dorsolateral Parietal; Orange - Right Dorsolateral Parietal.
(a) (48) (b) (49) (c) (50)
Figure A.5: (Continued) Axial slices of the brain regions delimited by Dr. Durval Campos Costa from the Champalimaud Foundation. Brown - Superior Anterior Cingulate.
A-6