Fuzzy Modeling for the Prediction of Vasopressors Administration in the ICU Using Ensemble and Mixed Fuzzy Clustering Approaches
Carlos Santos Azevedo
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Supervisor: Prof. Susana Margarida da Silva Vieira
Examination Committee
Chairperson: Prof. João Rogério Caldas Pinto
Members of the Committee: Prof. Luís Manuel Fernandes Mendonça
Prof. João Miguel da Costa Sousa
October 2015
Acknowledgments
I would like to express my gratitude to my supervisor, Professor Susana Vieira, for the opportunity to work in this field of research and for her useful comments, remarks, patience and engagement throughout the learning process of my master thesis. Besides my supervisor, I would like to thank the rest of my thesis committee.
Furthermore, I would like to thank Catia Salgado for her friendship, availability and the stimulating discussions and suggestions. I am also grateful for the support and advice of my faculty colleagues and friends Marta Ferreira, Rita Viegas, Joaquim Viegas and Hugo Proenca. I must acknowledge as well the support of my friends Bruno Vidal, Jonas Haggenjos and Henrique Goncalves, without whom part of my journey at IST would not have been as pleasant.
A special thanks to my family. Words cannot express how grateful I am to my mother and father for all of the sacrifices that they have made on my behalf.
Abstract
Severe sepsis and septic shock are major health care problems and remain among the leading causes of death in critically ill patients. Therapy with vasopressor agents is usually initiated in this group of patients. The main objective of this work is to describe and implement a data mining solution to predict the need for vasopressor administration in septic shock patients in the Intensive Care Unit (ICU). The MIMIC II database was used to extract clinical data from 32 physiological and 5 static variables for a cohort of patients of interest. Two different analyses were conducted using these data: one of the patients' clinical state and one of the patients' clinical evolution. The former is studied under an ensemble modelling approach. Feature selection was performed for four different ensemble criteria (a priori, a posteriori, arithmetic mean and distance-weighted mean) and for the single model. The ensemble approaches benefited from feature selection, whereas the single model performed best with all 32 input variables. The spatio-temporal analysis was approached using two different clustering techniques: Fuzzy C-Means (FCM) and Mixed Fuzzy Clustering (MFC). The latter weights the relevance of the temporal component of the data in the clustering process, allowing a more flexible identification of structures in datasets composed of mixed (temporal and static) features. Two modelling approaches based on MFC were tested and compared with similar approaches based on the traditional FCM, where both clustering algorithms are used either for transforming the feature space of the input variables into membership degrees, or for determining the antecedent fuzzy sets of Takagi-Sugeno fuzzy models. The use of feature transformation showed better performance than the other methods; however, when sequential feature selection is combined with fuzzy modelling, FCM is the best performer. Overall, the best results obtained are AUC=0.82±0.01 and AUC=0.80±0.06 for the ensemble and MFC strategies, respectively. Additionally, considering that the percentage of imputed data is around 14.6% for MFC and 74.7% for the ensemble, MFC should preferentially be considered for predicting vasopressor administration in critically ill patients.
Keywords: data mining, vasopressors, feature selection, data pre-processing, time series, mixed data, ensemble modelling, fuzzy clustering, specialized modelling
Resumo
Sepsis grave e choque séptico são um dos grandes problemas existentes em cuidados médicos, sendo uma das principais causas de morte em doentes críticos. A terapia por vasopressores é usualmente usada neste grupo de pacientes. O principal objectivo deste trabalho é descrever e implementar uma solução de data mining para prever a necessidade de administração de vasopressores em doentes de cuidados intensivos em choque séptico. A base de dados MIMIC II foi utilizada para extrair dados clínicos de 32 variáveis fisiológicas e 5 variáveis estáticas para os pacientes que fazem parte do grupo de interesse. Foram executados dois tipos de análises: uma que avalia o estado clínico do paciente para um dado instante e outra que tem em conta a evolução clínica do paciente. O primeiro tipo foi estudado utilizando uma abordagem multimodelo. Foi feita uma selecção de variáveis para o modelo singular e para quatro critérios multimodelo: a priori, a posteriori, média aritmética e média pesada pela distância aos clusters. A abordagem multimodelo beneficiou da selecção de variáveis, enquanto que o modelo singular teve melhor desempenho usando as 32 variáveis fisiológicas. Foi feita também uma abordagem espaço-temporal através do uso de duas técnicas de clustering diferentes: Fuzzy C-means (FCM) e Mixed Fuzzy Clustering (MFC). Esta última tem a particularidade de pesar a componente temporal durante o processo de partição, permitindo uma identificação mais flexível das estruturas existentes nos dados compostos por atributos temporais e estáticos. Duas abordagens de modelação baseadas no MFC foram testadas e comparadas com abordagens similares baseadas no FCM, em que ambos os algoritmos de partição foram usados para transformar as variáveis em matrizes de partição, ou para determinar os antecedentes dos conjuntos fuzzy de modelos fuzzy Takagi-Sugeno. O uso da transformação revelou um melhor desempenho face aos outros métodos; no entanto, quando a selecção de variáveis é combinada com modelação fuzzy, o FCM sem transformação deu melhores resultados. Os melhores resultados obtidos foram AUC=0.82±0.01 para a abordagem multimodelo e AUC=0.80±0.06 para a abordagem MFC. Considerando o facto de que a quantidade de inserção artificial de dados ronda os 14.6% para MFC e 74.7% para os multimodelos, a abordagem MFC deve ser considerada para prever a administração da medicação para pacientes em estado crítico.
Palavras-chave: data mining, vasopressores, selecção de atributos, pré-processamento de dados, dados temporais, acoplamento de dados estáticos com dinâmicos, multimodelos, fuzzy clustering, modelação especializada
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data mining applied to the prediction of vasopressor dependency . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Methods 7
2.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Fuzzy C-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Mixed Fuzzy Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Fuzzy Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Modelling Based on MFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 FCM Fuzzy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 FCM Fuzzy Model with FCM feature transformation . . . . . . . . . . . . . . . . . 14
2.3.3 MFC fuzzy model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 FCM Fuzzy Model with MFC feature transformation . . . . . . . . . . . . . . . . . 15
2.4 Ensemble Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Subgroup selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Subgroup modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 Ensemble decision criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 Sequential Forward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 SFS with ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.3 SFS with MFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Validation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.1 AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.2 Clustering Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Data collection and preprocessing 28
3.1 Structure of MIMIC II Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Preprocessing Data - General Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.1 Chosen Input/Output Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Removing outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 Removing deceased patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Preprocessing Data - Clinical Actual State Analysis . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Data Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.2 Enabling the dataset for prediction purposes . . . . . . . . . . . . . . . . . . . . . 49
3.4 Preprocessing Data - Clinical State Evolution Analysis . . . . . . . . . . . . . . . . . . . . 51
3.4.1 Data imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.2 Enabling the dataset for prediction purposes . . . . . . . . . . . . . . . . . . . . . 55
4 Results and Discussions 56
4.1 Ensemble Modelling - Punctual Data Results . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.1 Unsupervised Clustering Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.2 Fixing parameters of the models & Feature Selection . . . . . . . . . . . . . . . . . 62
4.1.3 Model Assessment based on overall performance . . . . . . . . . . . . . . . . . . 68
4.1.4 Model Assessment based on the singular models’ performance . . . . . . . . . . . 73
4.2 Mixed Fuzzy Clustering - Time-series data approach . . . . . . . . . . . . . . . . . . . . . 74
4.2.1 Fixing parameters of the models & Feature Selection . . . . . . . . . . . . . . . . . 75
4.2.2 Model Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5 Conclusions 79
5.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Bibliography 87
A Outliers - Expert Knowledge versus Inter-quartile method 89
B Removing Deceased Patients 98
C ID’s of the same variable 105
D Clustering Validation Analysis - Methods 106
E Clustering Validation Analysis - Distributions 108
F Fixing ensemble modelling parameters 113
G Influence of the fuzziness parameter in FCM clusters centres 114
H Histograms feature selection for punctual data 117
I Previous Results 122
List of Tables
3.1 Features and sampling rates (measurements/day) in each dataset. . . . . . . . . . . . . . 35
3.2 List of vasopressors and participation in the datasets. . . . . . . . . . . . . . . . . . . . . 36
3.3 Delimiting data (Inter-quartile is applied on ALL dataset) . . . . . . . . . . . . . . . . . . . 40
3.4 After alignment with Heart Rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 After imputation of data using ZOH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Removal of data due to lack of data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Percentages of imputation by dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.8 Percentages of imputations by input time-varying variable . . . . . . . . . . . . . . . . . . 49
3.9 Example of the output shifting procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.10 Before shifting the output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.11 After shifting the output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.12 Percentages of imputation given the vector length. . . . . . . . . . . . . . . . . . . . . . . 54
3.13 Considering an interval between measurements of x hour (for ALL dataset) . . . . . . . . 55
4.1 Classes percentages in each dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Clustering validation indexes for dataset ALL performed 10 times with different partitions.
(+) means a higher value is better and (-) means the opposite. . . . . . . . . . . . . . . . 59
4.3 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Most selected features for single model for dataset ALL. . . . . . . . . . . . . . . . . . . . 64
4.6 Most selected features for a priori for dataset ALL. . . . . . . . . . . . . . . . . . . . . . . 64
4.7 Most selected features for a posteriori for dataset ALL. . . . . . . . . . . . . . . . . . . . . 64
4.8 Most selected features for arithmetic mean for dataset ALL. . . . . . . . . . . . . . . . . . 64
4.9 Most selected features for distance-weighted mean for dataset ALL. . . . . . . . . . . . . 64
4.10 Most selected features for single model for dataset BOTH. . . . . . . . . . . . . . . . . . . 65
4.11 Most selected features for a priori for dataset BOTH. . . . . . . . . . . . . . . . . . . . . . 65
4.12 Most selected features for a posteriori for dataset BOTH. . . . . . . . . . . . . . . . . . . . 65
4.13 Most selected features for arithmetic mean for dataset BOTH. . . . . . . . . . . . . . . . . 65
4.14 Most selected features for distance-weighted mean for dataset BOTH. . . . . . . . . . . . 65
4.15 Most selected features for single model for dataset PNM. . . . . . . . . . . . . . . . . . . 65
4.16 Most selected features for a priori for dataset PNM. . . . . . . . . . . . . . . . . . . . . . . 65
4.17 Most selected features for a posteriori for dataset PNM. . . . . . . . . . . . . . . . . . . . 65
4.18 Most selected features for arithmetic mean for dataset PNM. . . . . . . . . . . . . . . . . 65
4.19 Most selected features for distance-weighted mean for dataset PNM. . . . . . . . . . . . . 65
4.20 Most selected features for single model for dataset PAN. . . . . . . . . . . . . . . . . . . . 66
4.21 Most selected features for a priori for dataset PAN. . . . . . . . . . . . . . . . . . . . . . . 66
4.22 Most selected features for a posteriori for dataset PAN. . . . . . . . . . . . . . . . . . . . . 66
4.23 Most selected features for arithmetic mean for dataset PAN. . . . . . . . . . . . . . . . . . 66
4.24 Most selected features for distance-weighted mean for dataset PAN. . . . . . . . . . . . . 66
4.25 Mean and standard deviation of the number of features selected through the 50 runs. . . 66
4.26 Features selected using SFS for the dataset PAN. . . . . . . . . . . . . . . . . . . . . . . 67
4.27 Features selected using SFS for the dataset PNM. . . . . . . . . . . . . . . . . . . . . . . 67
4.28 Features selected using SFS for the dataset BOTH. . . . . . . . . . . . . . . . . . . . . . 67
4.29 Features selected using SFS for the dataset ALL. . . . . . . . . . . . . . . . . . . . . . . . 67
4.30 Feature selection results for PAN dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.31 Feature selection results for PNM dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.32 Feature selection results for BOTH dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.33 Feature selection results for ALL dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.34 Comparison against single model with all variables (no FS). . . . . . . . . . . . . . . . . . 69
4.35 Results without FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.36 Results with FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.37 Results for models based on the singular models’ performance: PAN dataset. . . . . . . . 74
4.38 Results for models based on the singular models’ performance: PNM dataset. . . . . . . 74
4.39 Results for models based on the singular models’ performance: BOTH dataset. . . . . . . 74
4.40 Results for models based on the singular models’ performance: ALL dataset. . . . . . . . 74
4.41 Best parameters and selected feature according to AUC. . . . . . . . . . . . . . . . . . . . 76
4.42 Most selected features by FCM FM (mean number of selected features: 5.8). . . . . . . . 76
4.43 Most selected features by FCM-FCM FM (mean number of selected features: 4.3). . . . . 76
4.44 Most selected features by MFC FM (mean number of selected features: 8.2). . . . . . . . 76
4.45 Most selected features by MFC-FCM FM (mean number of selected features: 6.5). . . . . 76
4.46 MA data using all variables with best FS data parameters according to AUC. . . . . . . . 77
4.47 FCM FM vs FCM-FCM FM with FCM FM selected features. . . . . . . . . . . . . . . . . . 77
4.48 FCM FM vs FCM-FCM FM with FCM-FCM FM selected features. . . . . . . . . . . . . . 77
4.49 MFC FM vs MFC-FCM FM with MFC FM selected features. . . . . . . . . . . . . . . . . . 77
4.50 MFC FM vs MFC-FCM FM with MFC-FCM FM selected features. . . . . . . . . . . . . . . 78
5.1 Best results for punctual and evolution state analysis for the dataset ALL. . . . . . . . . . 80
C.1 List of the IDs associated with each variable that were grouped into one. . . . . . . . . 105
D.1 Clustering validation indexes score for dataset PAN performed 10 times with different
partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
D.2 Clustering validation indexes score for dataset PNM performed 10 times with different
partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
D.3 Clustering validation indexes score for dataset BOTH performed 10 times with different
partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
D.4 Clustering validation indexes score for dataset ALL performed 10 times with different par-
titions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
E.1 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
E.2 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
E.3 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.4 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.5 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
E.6 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
E.7 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
E.8 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
F.1 Best parameters all features; c=2:5; m=1.1:2. . . . . . . . . . . . . . . . . . . . . . . . . . 113
F.2 Best parameters with feature selection; c=2:5; m=1.1:2. . . . . . . . . . . . . . . . . . . . 113
List of Figures
2.1 Functional block diagram of a fuzzy inference system (FIS). . . . . . . . . . . . . . . . . . 12
2.2 Methods used for time-variant data coupled with time-invariant data. . . . . . . . . . . . . 14
2.3 Schematic representation of the single and multimodel approaches; [*] cluster centres is
relevant for the case where feature selection is performed independently for each sub-
group/unsupervised cluster, see Section 2.5.2. . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Left: SFS applied in individual subgroups; Right: SFS applied to the ensemble methodology. 22
2.5 ROC curve example and points of interest (inspired by [20]). . . . . . . . . . . . . . . . 24
3.1 Schematic of data collection and database construction [1]. . . . . . . . . . . . . . . . . . 31
3.2 From raw data to usable data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Interval between two consecutive vasopressors intake (with a random added value of
±0.015, after their interval identification, for density observation) considering only patients
with more than 6 hours of data before vasopressors administration. Figure (b) is a zoom
in of the figure (a) to the region of interest. . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 1 to 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 12,13,16
and 20. Each point corresponds to the mean value of the measurements taken in a 2
hours window. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Artificial data and results for three different methods on applying min-max normalization. . 44
3.7 Alignment of misaligned and unevenly sampled data (inspired by [12]). . . . . . . . . . . 45
3.8 Alignment of the data for the punctual data analysis, using the variable with highest sam-
pling rate as template, covering all the case scenarios. . . . . . . . . . . . . . . . . . . . . 46
3.9 Preprocessing steps and resulting datasets of the MIMIC II for the punctual data case. . . 47
3.10 Preprocessing steps and resulting datasets of the MIMIC II for the time-series data case. 52
3.11 Procedure to adapt the real measurements to a vector of length 10 for all variables includ-
ing data imputation when needed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.12 How the data is enabled for prediction purposes. . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 Example on PAN dataset to get the subset data for clustering validation. . . . . . . . . . . 58
4.2 Distribution along clusters for the dataset ALL. . . . . . . . . . . . . . . . . . . . . . . . . 61
A.1 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 1 to 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.2 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 5 to 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.3 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 9 to 12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.4 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 13 to 16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.5 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 17 to 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.6 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 21 to 24. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.7 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 25 to 28. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
A.8 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 29 to 32. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
B.1 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 1-6.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 99
B.2 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 7-12.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 100
B.3 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 13-18.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 101
B.4 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 19-24.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 102
B.5 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 25-30.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 103
B.6 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 30-32.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 104
B.7 Related to the above figures. Left: number of measurements considered in each 2-hour
time window. Right: number of patients taken into account in each 2-hour time window. . 104
E.1 Division in clusters for the dataset PAN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
E.2 Division in clusters for the dataset PNM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.3 Division in clusters for the dataset BOTH. . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
E.4 Division in clusters for the dataset ALL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
H.1 Frequency with which a feature was selected during 50 runs for dataset ALL. . . . . . . 118
H.2 Frequency with which a feature was selected during 50 runs for dataset BOTH. . . . . . 119
H.3 Frequency with which a feature was selected during 50 runs for dataset PNM. . . . . . . 120
H.4 Frequency with which a feature was selected during 50 runs for dataset PAN. . . . . . . 121
I.1 Tables extracted from (a) [24] (b) [23] for results comparison. . . . . . . . . . . . . . . . . 122
Notation
Symbols
βi Degree of activation of the ith rule
δ Euclidean distance
λ Temporal component weight
xsi Vector with the spatial component of the samples
µAij Membership function of Aij
µij Membership degree of sample j to the ith cluster (FCM approach)
vsl Spatial prototypes for each cluster l
Aij The antecedent fuzzy set for rule Ri and the jth feature
Ai Positive definite symmetric matrix
ci The N-dimension centre of the ith cluster
d2λ Distance function between a sample and the spatial and temporal prototype of a cluster
dij Distance from sample j to the ith cluster prototype (FCM approach)
fi Consequent function of rule Ri
g Cluster that minimizes the criterion
J Objective function of the augmented FCM (MFC approach)
Jm Objective Function (FCM approach)
Kc Number of individual models that belong to an MCS
m Fuzziness parameter
N Total number of features
nc Number of clusters
Ns Number of samples
p Total number of temporal features (MFC approach)
q Total number of temporal samples
r Total number of spatial features (MFC approach)
SRi Sample rate of the ith feature
t Threshold
U Partition matrix
ul,i The degree of membership of sample i to cluster l
vtl,k Temporal prototypes for each cluster l and feature k
X Collection of data samples
xj The N-dimensional jth sample
xi Matrix that includes Xti and xsi
Xti Matrix q × p with the temporal component of the samples
y Output of the model
Ymm,j The prediction made by the MCS for sample j
z Iteration step number
Acronyms
ACC Accuracy
ADI Alternative Dunn Index
ALL Dataset with no restriction in terms of diseases
AUC Area Under the receiver operating characteristics Curve
BOTH Pancreatitis and Pneumonia dataset
CE Classification Entropy
DI Dunn’s Index
EHR Electronic Health Records
FCM Fuzzy C-Means
FCM-FCM FM FCM fuzzy model with FCM feature transformation
FIS Fuzzy Inference Systems
FM Fuzzy Model
FN False Negatives
FP False Positives
FS Feature Selection
ICU Intensive Care Unit
ID Identification
KDD Knowledge Discovery in Databases
MCS Multiple Classifier System
MFC Mixed Fuzzy Clustering
MFC-FCM FM FCM fuzzy model with MFC feature transformation
MIMIC II Multiparameter Intelligent Monitoring in Intensive Care II
MRN Medical Record Number
OLS Ordinary-Least Squares
PAN Pancreatitis dataset
PC Partition Coefficient
PNM Pneumonia dataset
prec Precision
ROC Receiver Operating Characteristics
S Separation Index
SC Partition Index
sens Sensitivity
SFS Sequential Forward Selection
spec Specificity
TN True Negatives
TP True Positives
TS Takagi-Sugeno
XB Xie-Beni Index
Chapter 1
Introduction
This study addresses the need for vasopressor administration in patients in septic shock in intensive care units (ICU), attempting to predict that need in a timely manner. It proposes several ways to pre-process the data, adding details that had not been taken into consideration so far: the use of expert knowledge provided by physicians, in place of statistical methods, to remove outliers; the removal of patients whose physiological variables' behaviour may misguide the classification; a new normalization approach that uses the patient's healthy condition to evaluate only the deviation from their own natural state, rather than comparing them with their peers; and a readjusted time interval between vasopressor administrations so that administration can be considered continuous. The data is preprocessed to fit two different approaches to predicting the need for vasopressors: data that comprises information about the state of a patient at a given time, neglecting its history/evolution during the ICU stay (called punctual data throughout this work), and data organized as a sequence of events ordered by their occurrence in time, giving information about the evolution of each variable during the patient's stay jointly with static data (in this case, demographic data and ICU scoring systems). The latter uses an algorithm capable of evaluating time-variant and time-invariant data together, this being the first time it is used in modelling to obtain fuzzy models.
1.1 Motivation
Severe sepsis and septic shock are major health care problems and remain one of the leading causes
of death in critically ill patients [2] along with substantial consumption of health care resources [3]. It
affects millions of people around the world each year, killing one in four (and often more), and increasing
in incidence [3] [16] [45] [42] [17].
Sepsis is a systemic, deleterious host response to infection that may lead to severe sepsis and septic
shock: severe sepsis refers to acute organ dysfunction secondary to documented or suspected infection,
while septic shock is severe sepsis plus hypotension (low blood pressure) that is not reversed with fluid
resuscitation, i.e., despite the increase in cardiac output obtained by volume expansion.
The initial priority in managing septic shock is to maintain a reasonable mean arterial pressure
and cardiac output to keep the patient alive allowing organ perfusion while the source of infection is iden-
tified and addressed, and measures to interrupt the pathogenic sequence leading to septic shock are
undertaken. While these latter goals are being pursued, adequate organ system perfusion and function
must be maintained, guided by cardiovascular monitoring [33]. This necessity motivates the administration
of vasopressor agents, when vascular failure is such that the increase in cardiac output is
insufficient to maintain a mean arterial pressure compatible with adequate organ perfusion [15].
Essentially, vasopressors are drugs that cause the blood vessels to constrict, increasing blood pressure
in critically ill patients.
As in other types of shock, septic shock, in parallel with the treatment of infection, demands an urgent
need to restore blood pressure and cardiac output for organ perfusion, and thus stop the process that
aggravates the oxygen debt [50] [14] [18]. As with polytrauma, acute myocardial infarction, or stroke,
the speed and appropriateness of the therapy administered in the initial hours after severe sepsis develops
are likely to influence the outcome.
Current recommendations are to try to restore both the cardiac output and blood pressure by volume
expansion/fluid resuscitation. It is only when fluid administration fails to restore an adequate arterial
pressure and organ perfusion that therapy with vasopressor agents should be initiated.
1.2 Data mining applied to the prediction of vasopressor dependency
Medical or health care services are traditionally rendered by numerous providers who operate in-
dependently of one another. Providers may include, for example, hospitals, clinics, doctors, therapists
and diagnostic laboratories. A single patient may obtain the services of a number of these providers
when being treated for a particular illness or injury. Over the course of a lifetime, a patient may receive
the services of a large number of providers. Each medical service provider typically maintains medical
records for services the provider renders for a patient, but rarely if ever has medical records generated
by other providers. Such documents may include, for example, new patient information or admission
records, doctors’ notes, and lab and test results. Each provider will identify a patient with a medical
record number (MRN) of its own choosing to track medical records the provider generates in connection
with the patient. In order to make health care management more efficient, improve the quality of health
care delivered and eliminate inefficiencies in the delivery of the services, there is a desire to collect all
of a patient’s medical records into a central location for access by health care managers and providers.
A central database of medical information about its patients enables a network or organization to deter-
mine and set practices that help to reduce costs. It also fosters sharing of information between health
care providers about specific patients that will tend to improve the quality of health care delivered to the
patients and reduce duplication of services.
Nowadays technological advancements in the form of computer-based patient record software and
personal computer hardware are making the collection of and access to health care data more
manageable. The use of data mining in medicine is a rapidly growing field, which aims at discovering
structure in large, heterogeneous clinical data [11]. This interest arose due to the rapid emergence of
electronic data management methods, holding valuable and complex information. Human experts are
limited and may overlook important details, while automated discovery tools can analyse the raw data
and extract high level information for the decision-maker [30].
In this era databases are typically very large and contain high percentages of missing data. In
most cases, the missing data come from multiple heterogeneous sources [29]. After data collection
and problem definition, pre-processing is very important for data analysis, especially for retrospective
evaluations. Medical databases are a good example where the preprocessing is essential. Clearly, the
quality of the results from data analysis strongly depends on the careful execution of the preprocessing
steps [8].
The typical problems associated with medical databases are listed below [44]:
1. Each of the patients has a different length of stay in the medical unit. For each patient, a different
number of variables is documented.
2. Different data is measured at different times of day with different frequencies.
3. Some hospitals may not have data recorded online. Since the data can be transferred from hand-
written records to the database, typing errors are a common error source.
4. Many variables have a high percentage of missing values, caused by faults or simply by infrequent
measurement.

This study deals with all the aforementioned irregularities in order to obtain more concise datasets
that are less vulnerable to inaccurate data collection and medical decisions; with the preprocessing
approach proposed in this work, it was possible to achieve better modelling results. Data mining
techniques were used to search for relationships in such a large clinical database.
The present problem has been addressed through the use of machine learning algorithms to classify
whether ICU patients need vasopressors. In [32], heart rate and arterial blood pressure are used
as inputs to a fuzzy-logic based algorithm generating a "vasopressor advisability index". In [13], a
multimodel approach is tested for predicting the risk of death in septic shock patients, where two models
were used in parallel to enable selective sensitivity or specificity. In [24], fuzzy modelling with feature
selection was used to classify the use of vasopressors in septic shock patients, and in [23] the authors
take the fluid resuscitation intake into account to filter the database and use fuzzy modelling to predict
the need of vasopressors. Both use disease-based subsets of patients, showing that this approach
improves the prediction results when compared with a general model.
At present there is no time-series analysis dedicated to this problem, even though the use of
time-varying data has been shown to be useful in discovering patterns and extracting knowledge from
data in domains as diverse as telecommunications, environmental analysis, medical problems and
financial markets [40]. All the aforementioned studies are based on what is called, throughout this work,
punctual data, which refers to the state of a patient at a given time, neglecting its history/evolution
during the ICU stay.
1.3 Contributions
Following the work done in [24] and [23], this thesis offers a detailed description of the whole pre-
processing, including improvements that resulted from a thorough inspection of the data and led to the
redefinition of some previous assumptions. These changes led to improved results (previous results
are presented in Appendix I) and include: the exclusion of diseased patients that did not take
vasopressors; the correction of an inconsistency revealed by the analysis of the intervals between
drug intakes; a new normalization for the dataset (which proved to output worse results, although the
idea deserves an exhaustive study given that it was applied blindly to every variable); outlier removal
based on expert knowledge; and the use of data starting only at the point where each variable has at
least one measurement (not extrapolating backwards, and using only data that would be available in a
real-time situation).
This thesis performs feature selection for each of the criteria proposed for the ensemble modelling
approach: a priori, a posteriori, arithmetic mean and weighted mean-distance. It shows two ways of
selecting the most predictive features: one based on the overall performance of the multiple classifier
system, and another based on the performance of the single models. This highlights the difference
between having a specialized tuned system and having a system that contains specialized tuned models.
A spatiotemporal approach is undertaken, using the physiological variables as time-series
(temporal data) and the demographic data along with scores as non-varying data (spatial data). This
includes pre-processing the data so that fuzzy models based on Mixed Fuzzy Clustering can be built. A
previous study [22] showed that using the resulting partition matrix of the Mixed Fuzzy Clustering as
input to the models improved the results when compared to using the spatiotemporal data directly. This
idea is now also applied to modelling based on the Fuzzy C-Means (FCM), where the partition matrix
produced by the FCM algorithm is used as input to the models. Four approaches use the data in this
setup; feature selection was performed for each of them and the results are discussed. The proposed
feature transformations were shown to deal better with higher-order feature sets than using raw
measurements as model inputs, and their intuitiveness preserves the transparency needed in health
care data for further interpretation.
The output of the collaborative work developed in this thesis includes two conference papers [22]
[52]. Furthermore, the work developed in this thesis and in [52] is expected to be extended into a
journal paper in the near future.
1.4 Outline
Chapter 2 presents the underlying methods used throughout this study. The theoretical aspects of
clustering are introduced, followed by a description of fuzzy modelling. Next, feature selection and its
variants are presented, and validation measures are described.
Chapter 3 introduces the MIMIC II database and covers the data pre-processing, by giving a de-
tailed description of the transformation from the original database to the final datasets - punctual and
spatiotemporal - to which the methods will be applied.
In Chapter 4, the main results are shown, discussed and compared. This includes all the decisions
that were taken based on performance indexes to proceed with the study - fixing parameters and feature
selection. Feature selection results are shown and discussed, as well as the model assessment results.
This is performed both for ensemble modelling and for the time-series analysis coupled with time-invariant
features. Finally, in Chapter 5, the results are summarized and conclusions are drawn. The advantages and
disadvantages of each method are discussed, their limitations are presented and future work is proposed.
Chapter 2
Methods
This chapter goes in depth into the theoretical background of the methods used throughout this study.
It starts with the concept of clustering and its role in data partitioning, covering in particular
the Fuzzy C-Means algorithm (FCM) and the Mixed Fuzzy Clustering algorithm (MFC), whose
main difference lies in the distance function: the former clusters individual data points, without
differentiating between time-variant and time-invariant data, while the latter is suited for clustering
time-series combined with time-invariant data, where the weight of the time-variant component is given
through a parameter λ. Next it covers fuzzy modelling, namely the Takagi-Sugeno (TS) fuzzy models,
crucial to the development of this study, and the proposed variants of this model are addressed in
detail: antecedents of the fuzzy model obtained through FCM and MFC, and feature transformation
using FCM and MFC. Then ensemble modelling (or multimodel approach) is presented, an alternative
to the common modelling structure that passes through different stages of clustering in order to build a
group of models, each trained to deal with a specific partition of the data and playing its role in the final
prediction through four different criteria. Then the concept of feature selection is introduced, along with
how it is applied in this context. Finally, the validation measures, both for clustering validation and model
validation, are introduced.
2.1 Clustering
Clustering is the problem of grouping data based on similarity, i.e., it is the task of dividing data
points into homogeneous classes or clusters so that items in the same class are as similar as possible
and items in different classes are as dissimilar as possible. Clustering can also be thought of as a
form of data compression, where a large number of samples are converted into a small number (when
compared with the size of the dataset) of representative prototypes or clusters. Depending on the data
and the application, different types of similarity measures may be used to identify classes, where the
similarity measure controls how the clusters are formed and is often expressed in terms of a distance
norm between the data vectors, or between data vectors and the cluster centres (prototypes). Depending
on whether the output (if known) is used to aid the partitioning task, the algorithm is called supervised
or unsupervised, respectively.
There exist a large number of clustering algorithms but, generally speaking, clustering algorithms
can be divided into four groups: partitioning methods, hierarchical methods, density-based methods and
grid-based methods. Along the same lines, clustering can be divided into two categories according
to the way the data is partitioned: hard and fuzzy clustering.
In hard (non-fuzzy) clustering, data is divided into crisp clusters, where each data point belongs
to exactly one cluster. In fuzzy clustering, the data points can belong to more than one cluster, and
associated with each of the data points are membership grades which indicate the degree to which the
data points belong to each cluster. In this study, only fuzzy partitioning methods, which are the basis
for all the modelling work, are covered in detail: the Fuzzy C-Means (FCM) algorithm and the Mixed
Fuzzy Clustering (MFC) algorithm.
2.1.1 Fuzzy C-Means
The FCM algorithm is one of the most widely used fuzzy clustering algorithms. It attempts to partition
a finite collection of elements X = {x1, x2, ..., xNs} into a collection of nc fuzzy clusters, allowing one
data point to belong to two or more clusters, given some criterion. This method (developed by Dunn in
1973 and improved by Bezdek in 1981) is frequently used in pattern recognition, image processing, data
mining and fuzzy modelling [38]. It is based on the minimization of the objective function in Equation 2.1:

$$J_m = \sum_{i=1}^{n_c} \sum_{j=1}^{N_s} \mu_{ij}^m \, d_{ij}^2(x_j, c_i), \quad 1 \le m < \infty \qquad (2.1)$$

where $n_c$ is the number of clusters, $N_s$ is the number of samples, $m$ is the fuzziness parameter,
which can take any real number greater than 1, $\mu_{ij}$ is the membership degree of sample $x_j$ to
the $i$th cluster, $x_j$ is the $j$th N-dimensional measured data point, $c_i$ is the N-dimensional
centre of the $i$th cluster, and $d_{ij}^2(x_j, c_i)$ is any norm expressing the similarity between the
measured data $x_j$ and the prototype $c_i$.
The information contained in each $\mu_{ij}$ can be organized into a matrix $U$, as in Equation 2.2,
commonly called the partition matrix:

$$U = \begin{bmatrix} \mu_{11} & \cdots & \mu_{1 N_s} \\ \vdots & \ddots & \vdots \\ \mu_{n_c 1} & \cdots & \mu_{n_c N_s} \end{bmatrix} \qquad (2.2)$$
There are three properties inherent to $\mu_{ij}$. The first is that $\mu_{ij} \in [0, 1]\ \forall\, i, j$,
where zero implies that sample $x_j$ does not belong at all to cluster $c_i$, and one implies that
sample $x_j$ completely belongs to cluster $c_i$ (superimposed). The second is that the sum of the
membership degrees of any sample over all clusters must be equal to one, according to Equation 2.3:

$$\sum_{i=1}^{n_c} \mu_{ij} = 1 \quad \forall\, j \qquad (2.3)$$
The third is that the total membership of all samples to any cluster must be greater than zero and
smaller than $N_s$ (Equation 2.4), implying that no cluster is empty and no cluster contains all of the data:

$$0 < \sum_{j=1}^{N_s} \mu_{ij} < N_s \quad \forall\, i \qquad (2.4)$$
After defining $m$ and the number of clusters $n_c$, the matrix $U$ is randomly initialized and the fuzzy
partitioning is carried out through an iterative optimization of the objective function 2.1, with the update
of the prototypes $c_i$ as in Equation 2.5:

$$c_i = \frac{\sum_{j=1}^{N_s} \mu_{ij}^m \, x_j}{\sum_{j=1}^{N_s} \mu_{ij}^m} \qquad (2.5)$$
followed by the update of the membership degrees $\mu_{ij}$, Equation 2.6:

$$\mu_{ij} = \frac{1}{\sum_{k=1}^{n_c} \left( \frac{d_{ij}}{d_{kj}} \right)^{\frac{2}{m-1}}} \qquad (2.6)$$
The stopping condition can be anything that works for the problem. In this case the iteration stops when
$\max_{i,j} \left| \mu_{ij}^{(z+1)} - \mu_{ij}^{(z)} \right| < \varepsilon$, where $\varepsilon$ is a
stopping tolerance between 0 and 1, and $z$ is the iteration step. This procedure converges to a local
minimum or a saddle point of $J_m$.
In the present study, each sample is assigned to each cluster with a certain degree of membership.
This degree depends on the distance between the sample and the cluster prototype, which in a
general way can be computed as:
$$d_{ij}^2(x_j, c_i) = \|x_j - c_i\|_{A_i}^2 = (x_j - c_i)^T A_i (x_j - c_i) \qquad (2.7)$$

where $A_i$ is a positive definite symmetric matrix, usually equal to the identity matrix in the FCM algorithm.
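As a concrete illustration of the loop described above (random initialization of the partition matrix, prototype update of Equation 2.5, membership update of Equation 2.6, stopping test on the partition matrix), the following is a minimal NumPy sketch; the function name, defaults and tolerance are illustrative, not the implementation used in this work:

```python
import numpy as np

def fcm(X, n_clusters, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal Fuzzy C-Means sketch following Equations 2.1-2.6."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Random initial partition matrix with columns summing to one (Eq. 2.3).
    U = rng.random((n_clusters, n_samples))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Prototype update (Eq. 2.5): fuzzy weighted mean of the samples.
        C = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Euclidean sample-to-prototype distances (Eq. 2.7 with A_i = I).
        d = np.linalg.norm(X[None, :, :] - C[:, None, :], axis=2) + 1e-12
        # Membership update (Eq. 2.6), written as normalized inverse distances.
        U_new = d ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)
        if np.max(np.abs(U_new - U)) < eps:  # stopping condition
            return U_new, C
        U = U_new
    return U, C
```

Each column of the returned $U$ holds one sample's membership degrees, which sum to one as required by Equation 2.3.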
2.1.2 Mixed Fuzzy Clustering
Mixed Fuzzy Clustering (MFC) is a clustering method based on Fuzzy C-Means [7] and its principle is to
couple time-invariant data with time-variant data. This method generalizes the spatio-temporal concept
to any set of time-variant and time-invariant features and its extension to the analysis of multiple time-
series. In health-care data this means using information such as demographic data, standard scores
or other information that can be considered static, while still considering time-varying physiological
data, i.e., data with a sampling rate that cannot be neglected and whose evolution is measured and
taken into account.
In this approach each sample xi is composed by features that are constant during the sampling time
in analysis and by features that change over time (multiple time-series).
$$x_i = (x_i^s, X_i^t) \qquad (2.8)$$
In order to extend the spatio-temporal clustering method proposed in [35] which only deals with one
time-series to the case of multiple time-series, a new dimension is introduced, to handle p time-variant
features. As presented in Equations 2.9 and 2.10, the spatial component of each sample is represented
by $x_i^s$, where $r$ is the number of spatial features, and the temporal component is represented by
the matrix $X_i^t$, whose number of columns equals the number of temporal features $p$ and whose
number of rows equals the number of temporal samples $q$. The MFC algorithm clusters the dataset
using an augmented form of the FCM. The main difference between the augmented and the classical
FCM lies in the distance function: in the augmented FCM a new weighting element $\lambda$ is included,
setting the importance given to the time-variant component, and the distance is computed separately
for each time-series.
$$x_i^s = (x_{i,1}^s, \ldots, x_{i,r}^s) \qquad (2.9)$$
$$X_i^t = \begin{bmatrix} x_{i,1,1}^t & x_{i,1,2}^t & \cdots & x_{i,1,p}^t \\ x_{i,2,1}^t & x_{i,2,2}^t & \cdots & x_{i,2,p}^t \\ \vdots & \vdots & \ddots & \vdots \\ x_{i,q,1}^t & x_{i,q,2}^t & \cdots & x_{i,q,p}^t \end{bmatrix} \qquad (2.10)$$
The spatial prototypes are represented for each cluster $l = 1, 2, \ldots, n_c$ by $v_l^s$ and are
computed following Equation 2.11:

$$v_l^s = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_i^s}{\sum_{i=1}^{n} u_{l,i}^m} \qquad (2.11)$$
The temporal prototypes for each cluster $l$ and feature $k$ are represented by $v_{l,k}^t$, computed
following Equation 2.12, and the matrix of temporal prototypes for cluster $l$ is represented by $V_l^t$:

$$v_{l,k}^t = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_{i,k}^t}{\sum_{i=1}^{n} u_{l,i}^m} \qquad (2.12)$$
The distance between a sample and the spatial and temporal prototypes of a cluster is computed
following Equation 2.13, where $\delta$ represents the Euclidean distance:

$$d_\lambda^2(v_l^s, V_l^t, x_i) = \|v_l^s - x_i^s\|^2 + \lambda \sum_{k=1}^{p} \delta(v_{l,k}^t, x_{i,k}^t) \qquad (2.13)$$
From this point, both the membership degree of sample $i$ to cluster $l$ (Equation 2.14) and the
objective function of the augmented FCM (Equation 2.15) have the same format as their FCM
counterparts in Section 2.1.1; the only difference is the distance measure, which now has to consider
the spatial and temporal distances:

$$u_{l,i} = \frac{1}{\sum_{o=1}^{n_c} \left( \frac{d_\lambda(v_l^s, V_l^t, x_i)}{d_\lambda(v_o^s, V_o^t, x_i)} \right)^{\frac{2}{m-1}}} \qquad (2.14)$$

$$J = \sum_{l=1}^{n_c} \sum_{i=1}^{n} u_{l,i}^m \, d_\lambda^2(v_l^s, V_l^t, x_i) \qquad (2.15)$$
The MFC algorithm is described in Algorithm 1. Its inputs are the spatial data $X^s$ and temporal data
$X^t$, the number of clusters $n_c$, the initial partition matrix $U$, the fuzzification parameter $m$ and
the temporal component weight $\lambda$. It returns the final partition matrix $U$, the spatial
prototypes $V^s$ and the temporal prototypes $V^t$.
Algorithm 1 Mixed Fuzzy Clustering (MFC)

1: Input:
2:   X^s: N_s × r matrix of spatial data
3:   X^t: N_s × q × p matrix of temporal data
4:   n_c: number of cluster prototypes
5:   U: n_c × N_s initial partition matrix
6:   m: fuzzification parameter
7:   λ: temporal component weight
8: Output:
9:   U: n_c × N_s partition matrix
10:  V^s: n_c × r spatial cluster prototypes
11:  V^t: n_c × q × p temporal cluster prototypes
12: while ∆J > ε do
13:   Compute the spatial cluster prototypes V^s (Eq. 2.11)
14:   for k in {1, ..., p} do
15:     Compute the temporal cluster prototype v^t_k (Eq. 2.12)
16:   end for
17:   Compute the distances d_λ (Eq. 2.13)
18:   Update the partition matrix U (Eq. 2.14)
19:   Compute ∆J
20: end while
2.2 Fuzzy Modelling
In the present thesis, fuzzy modelling is used in order to obtain classification models. These models
are created using the training data of a target system and are expected to reproduce its behaviour,
i.e., to correctly assign samples of the validation set to the corresponding
label [37]. Fuzzy modelling systems are non-linear systems capable of inferring complex non-linear
relationships between input and output data when there is little or no previous knowledge of the problem
to be modelled.
In contrast to classical set theory and its underlying Boolean logic, fuzzy logic allows various possible
degrees of membership of an element in a given fuzzy set [19]. Fuzzy logic can be viewed as
an extension of classical sets that handles the concept of partial truth, whose value may vary between
0 (completely false) and 1 (completely true).
Furthermore, when linguistic variables are used, these degrees may be managed by specific functions
offering a more realistic framework for human reasoning (also called approximate reasoning) than the
traditional two-valued logic [59].
Fuzzy systems (static or dynamic systems that make use of fuzzy sets and fuzzy logic) [4] can be
used for a variety of purposes, such as modelling, data analysis, prediction and control. The process of
formulating the mapping from a given input to an output using fuzzy logic is called fuzzy inference. This
process comprises four parts, presented in Figure 2.1 and described as follows [54]:
Fuzzification of the input variables - the first step is to take the inputs and determine the degree to
which they belong to each of the appropriate fuzzy sets via membership functions (conversion
of the input variables into linguistic values). A fuzzy operator (AND or OR) is applied in the an-
tecedent: after the inputs are fuzzified, the degree to which each part of the antecedent is satisfied
for each rule is known. If the antecedent of a given rule has more than one part, the fuzzy operator
is applied (to two or more membership values from the fuzzified inputs variables) to obtain one
number that represents the result of the antecedent for that rule. This number is then applied to
the output function resulting in a single truth value.
Knowledge base - contains the main relationships between inputs and outputs. It is composed of a
database, where the membership functions for linguistic terms are defined, and a rule base, generally
represented by if-then statements.
Inference engine - is responsible for computing the fuzzy output of the system, using the information
contained in the rule base and the given input value to produce an output. Here occurs the aggre-
gation of the consequents across the rules.
Defuzzification of the output fuzzy set - provides a crisp value from the fuzzy set output.
Figure 2.1: Functional block diagram of a fuzzy inference system (FIS).
Different consequent constituents result in different fuzzy inference systems, but their antecedents
are always the same [37]. The two best known types of Fuzzy Inference Systems (FIS) are:
Linguistic or Mamdani fuzzy model / type inference - where both the antecedent and the consequent
are fuzzy propositions. It expects the output membership functions to be fuzzy sets. After the
aggregation process, there is a fuzzy set for each output variable that needs defuzzification.
Takagi-Sugeno (TS) fuzzy model / type inference - where the consequents are crisp functions of the
antecedent variables rather than a fuzzy proposition. The first two parts of the fuzzy inference
process, fuzzifying the inputs and applying the fuzzy operator, are exactly the same as the former.
The main difference is that the TS output membership functions/consequent are either a constant
or a linear equation, zero-order or first-order respectively.
So, the main difference between these two types of inference systems lies in the consequent of the
fuzzy rules, and thus in their aggregation and defuzzification procedures, which change the way the
crisp output is generated from the fuzzy inputs; the first two parts, fuzzifying the inputs and applying
the fuzzy operator, remain unchanged. At the cost of losing the expressive power and interpretability of
the Mamdani output (since the consequents of TS rules are not fuzzy), TS offers better processing time,
since the weighted average replaces the time-consuming defuzzification process, making it by far the
most popular candidate for sample-data-based fuzzy modelling. Moreover, TS works better with
optimization and adaptive techniques, which makes it very attractive for dynamic non-linear systems.
These adaptive techniques can be used to customize the membership functions so that the fuzzy
system best models the data.
In some domains, like medicine, it is preferable not to use black-box models, so that the user can
understand the classifier and evaluate its results. Due to the grey-box nature of fuzzy models, this
approach is appealing, as it provides not only a transparent, non-crisp model, but also a linguistic
interpretation in the form of if-then rules, which can potentially be embedded into clinical decision
support processes. In this
work, first-order Takagi-Sugeno fuzzy models (TS-FM) [55] are derived from the data. These consist
of fuzzy rules where each rule describes a local input-output relation. With TS-FM, each discriminant
function consists, for the binary classification case, of rules of the type

$$R_i: \text{If } x_1 \text{ is } A_{i1} \text{ and } \ldots \text{ and } x_N \text{ is } A_{iN} \text{ then } y = f_i(x), \quad i = 1, 2, \ldots, K \qquad (2.16)$$

where $f_i$ is the consequent function of rule $R_i$. The output of the discriminant function $y$ can be
interpreted as a score (or evidence) for the positive class given the input feature vector $x$. The degree
of activation of the $i$th rule is given by $\beta_i = \prod_{j=1}^{N} \mu_{A_{ij}}(x_j)$, where
$\mu_{A_{ij}} : \mathbb{R} \to [0, 1]$. The discriminant output is computed by aggregating the individual
rule contributions in a weighted-average fashion:
$d(x) = \sum_{i=1}^{K} \beta_i f_i(x) \big/ \sum_{i=1}^{K} \beta_i$. A sample $x$ is considered positive
if the score is higher than a certain threshold $t$, i.e., $y > t$, transforming the continuous
discriminant output into a binary classifier.
The number of rules K of the type Ri and the antecedent fuzzy sets Aij are determined by fuzzy
clustering in the product space of the input variables. The consequent functions fi(x) are linear functions
determined by ordinary-least squares (OLS) in the space of the input and output variables.
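The inference step of a first-order TS-FM (rule activation by a product of antecedent memberships, then a weighted average of the linear consequents) can be sketched as follows; the membership functions and consequent parameters are placeholders for those identified by fuzzy clustering and OLS:

```python
import numpy as np

def ts_fm_score(x, rules, consequents):
    """First-order Takagi-Sugeno discriminant output (sketch).

    rules:       list of K rules, each a list of N membership functions
                 mu_Aij mapping a scalar input to [0, 1].
    consequents: list of K pairs (w, b) defining f_i(x) = w . x + b.
    Returns the continuous score d(x); comparing it with a threshold t
    turns the discriminant into a binary classifier.
    """
    betas, outputs = [], []
    for mfs, (w, b) in zip(rules, consequents):
        # Degree of activation: product of the antecedent memberships.
        beta = float(np.prod([mf(xj) for mf, xj in zip(mfs, x)]))
        betas.append(beta)
        outputs.append(float(np.dot(w, x) + b))
    betas = np.asarray(betas)
    return float(np.dot(betas, outputs) / betas.sum())
```

With Gaussian membership functions, for example, samples near a rule's antecedent prototypes activate that rule almost exclusively, so the score approaches that rule's consequent value.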
2.3 Modelling Based on MFC
The methods presented here deal with mixed data: time-variant data paired with time-invariant data.
There are four different modelling approaches, two of which work as baselines for comparison with the
other two that make use of the MFC. Figure 2.2 offers an overview of what comes next.
Figure 2.2: Methods used for time-variant data coupled with time-invariant data.

The feature transformation that is applied (second and fourth methods in Figure 2.2) consists in passing
the data through a clustering method (FCM or MFC) and using the resulting partition matrix as input to
the models. Contrary to most data transformations, this one adds a layer of interpretability: based
on the membership degrees of a patient's data to the existing clusters (which can be seen as
categorical values), the model classifies whether the patient will need vasopressors or not. Medical
significance is thus retained, since each patient's data keeps a membership degree to each
cluster/category.
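A sketch of this transformation, assuming cluster prototypes already obtained from a previous FCM (or MFC) run: each sample is replaced by its vector of membership degrees, so the model input dimension becomes the number of clusters regardless of how many raw variables there were (function name and defaults are illustrative):

```python
import numpy as np

def membership_features(X, prototypes, m=2.0):
    """Replace raw samples by their membership degrees to given clusters."""
    # Distances from each sample to each prototype (small offset avoids 0).
    d = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2) + 1e-12
    # FCM-style memberships: normalized inverse distances (cf. Eq. 2.6).
    U = d ** (-2.0 / (m - 1.0))
    U /= U.sum(axis=1, keepdims=True)
    return U  # shape (n_samples, n_clusters); each row sums to one
```

The resulting columns can be read as categorical degrees ("how much this patient's data looks like cluster k"), which is what preserves the interpretability discussed above.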
2.3.1 FCM Fuzzy Model
In FCM FM, the time-invariant features and the samples of the time-variant features are all treated
equally as features, and the antecedent fuzzy sets are determined using the partition matrix generated
by FCM. This is one of the most commonly used clustering methods for the identification of TS-FM and
has been used in multiple health care applications [34].
2.3.2 FCM Fuzzy Model with FCM feature transformation
In FCM-FCM FM, the time-invariant and time-variant features are initially clustered using FCM and
the generated partition matrix is used as the feature set for an FCM fuzzy model. In this case the
number of features after transformation is equal to the number of clusters specified for the clustering
stage, and the features are the degrees of membership of each sample to those clusters.
This approach can be seen as a type of feature transformation method for which the resulting fea-
tures represent the degree of membership of each point to the different clusters generated by the FCM
algorithm.
2.3.3 MFC fuzzy model
In MFC FM, the antecedent fuzzy sets of the TS FM are determined based on the partition matrix
generated by the MFC algorithm.
This methodology was developed based on the belief that, in the presence of a mixture of time-variant
and time-invariant features, the identification of the fuzzy membership functions should be based on a
non-conventional clustering algorithm: time-variant features should not be directly blended with
time-invariant features when calculating distances, and different time-variant features should also be
dealt with separately.
2.3.4 FCM Fuzzy Model with MFC feature transformation
In MFC-FCM FM, the time-invariant and time-variant features are first clustered using MFC, and the
generated partition matrix is used as the feature set of an FCM fuzzy model. In this case, the number
of features after transformation equals the number of clusters specified for the MFC algorithm, each
feature being the degree of membership of a sample to one of the mixed clusters.
This approach can be seen as a type of feature transformation method for which the resulting features
represent the degree of membership of each patient's data to the different clusters generated by the
MFC algorithm.
2.4 Ensemble Modelling
Traditionally, it is assumed that there is a single best model for making inferences from data. Some
authors suggest, however, that inference should be based on a full set of models built on data subgroups,
where relevant triggering and aggregating mechanisms are defined to activate, or define a suitable
interaction between, the single models, allowing multimodel/ensemble model inference [48].
The rationale behind ensemble machine learning systems is to create many classifiers and combine their
outputs so that the performance improves compared with each single classifier [49]. The strategies used
for combining classifiers can be divided into two types: classifier selection and classifier fusion.
Classifier selection - each classifier is trained to be an expert in a subspace of the feature space, and
the answer is obtained from a single classifier selected according to the input data. In this case,
each model has its own threshold to transform the continuous output into a binary output.
Classifier fusion - the classifiers are trained over the entire feature space, and all individual
classifiers are combined into one stronger classifier that ultimately provides the decision. In this
case, the threshold that transforms the continuous output into a binary output is defined for the
combined output and not for each model.
In [21], a fuzzy multimodel approach to an ensemble classifier is proposed that models individual clas-
sifiers on specific subgroups of data, obtained by clustering samples with common characteristics. Two
decision criteria are proposed: one based on the arithmetic mean of the clusters' outputs and another
based on the average of each cluster's output weighted by the distance to the corresponding cluster.
The performance of the proposed multimodel is compared with a previous multimodel
developed by [21], which uses two decision criteria: an a priori decision based on the distance from
the clusters' centres to the sample characteristics, and an a posteriori decision based on how far the
model's output response lies from the threshold of each model. The proposed criteria seek to mimic the
natural decision-making process humans tend to follow: consulting the opinion of several experts before
making a decision. Following the majority opinion is usually preferable, and produces better outcomes,
than following the opinion of a single expert whose experience may differ significantly from the others'.
In this context, final decisions are usually approximated by an appropriate combination of different
opinions, each with a different underlying weight. The objective is to verify whether greater predictive
performance can be achieved with aggregation techniques (arithmetic mean and distance-weighted mean
criteria) than with selection techniques based on distance metrics (a priori and a posteriori criteria)
when building an ensemble classifier.
2.4.1 Subgroup selection
The Fuzzy C-Means (FCM) clustering algorithm was initially used to divide the dataset into similar inde-
pendent subgroups in the N-dimensional space of the input variables (unsupervised clustering). Each
sample of the training dataset is assigned to the cluster for which its degree of membership is highest.
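In code, this hard assignment is just a column-wise arg-max over the partition matrix. The sketch below uses a hypothetical matrix `U` of the shape produced by FCM (clusters by samples, columns summing to one):

```python
import numpy as np

# U[i, j]: membership degree of training sample j to unsupervised cluster i
# (columns sum to one, as produced by FCM). Values here are illustrative.
U = np.array([[0.8, 0.1, 0.55],
              [0.2, 0.9, 0.45]])

# Each sample joins the subgroup whose membership degree is highest.
subgroup = np.argmax(U, axis=0)

# Index the training samples belonging to each subgroup.
groups = {c: np.where(subgroup == c)[0] for c in range(U.shape[0])}
```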
The FCM clustering process requires the definition of two parameters: (i) the number of clusters C and
(ii) the degree of fuzziness m of the clustering, i.e., the weighting exponent of the clustering algorithm
[6]. These two model parameters were selected using the methods presented in Section 2.6.2.
2.4.2 Subgroup modelling
A first order Takagi-Sugeno fuzzy model was developed for each unsupervised cluster/subgroup,
resulting in Kc individual models. The number of rules Ri, the antecedent fuzzy sets AiN and the
consequent parameters were determined by means of FCM clustering in the product space of the input
and output variables (supervised clustering), where the number of clusters translates into the number of
fuzzy rules.
The complete test set is evaluated on each unsupervised cluster's model (subgroup model), resulting
in Kc different predictions for each sample. From the combination or selection of these individual
outcomes, a new final prediction, the multimodel prediction Ymm, is obtained for each sample based on
one of four criteria: a priori, a posteriori, arithmetic mean and distance-weighted mean. These criteria
are covered in detail in Section 2.4.3.
Since this is a classification problem and y ∈ [0, 1], a threshold t is required to turn the continuous
output into a binary output y ∈ {0, 1}. The threshold for each model is determined by evaluating the
training set with the corresponding criteria: the predicted output is 1 if y ≥ t and 0 if y < t, and
the optimal threshold is found by balancing sensitivity and specificity. The optimal number of clusters
nc and degree of fuzziness m for each model Kc are determined by grid search.
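A sketch of this threshold choice (with hypothetical scores and labels), scanning the candidate thresholds and keeping the one that minimises the gap between sensitivity and specificity:

```python
import numpy as np

def pick_threshold(y_true, y_score):
    """Choose the threshold on a continuous output that best balances
    sensitivity and specificity, i.e. minimises |sensitivity - specificity|."""
    best_t, best_gap = 0.5, np.inf
    for t in np.unique(y_score):            # each score is a candidate cut-off
        pred = (y_score >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fn = np.sum((pred == 0) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0))
        fp = np.sum((pred == 1) & (y_true == 0))
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

y_true  = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.4, 0.7, 0.9])
t = pick_threshold(y_true, y_score)         # here sens = spec = 2/3 at t = 0.6
```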
Figure 2.3: Schematic representation of the single and multimodel approaches; [*] cluster centres are relevant for the case where feature selection is performed independently for each subgroup/unsupervised cluster, see Section 2.5.2.
A schematic representation of the multimodel, where a 10-fold cross validation was used, is depicted
in Figure 2.3.
2.4.3 Ensemble decision criteria
The decision strategies that were mentioned, two based on classifier selection (a priori and a pos-
teriori) and two based on classifier fusion (arithmetic mean and distance-weighted mean), are formally
described below:
Distance to the clusters centres - a priori
Each sample represents a point in an N-dimensional space. The distance of each sample to each
cluster's centre is calculated; the cluster closest to the point is selected, and the classification
of the multimodel is given by the classification of that cluster's model:
\[ g = \arg\min_i \, (d_{ij}) \tag{2.17} \]
\[ Y_{mm,j} = Y_g, \tag{2.18} \]
where dij represents the Euclidean distance between sample j and cluster i, g the cluster that
minimizes that distance, and Ymm,j the prediction made by the multimodel for sample j.
Difference between the threshold and the predicted outcome - a posteriori
For each sample, the difference between the threshold t and the predicted value by each model is
calculated. Higher differences mean more discrimination, i.e., the predicted value is more certainly
assigned to one of the classes, depending whether it is above or below the threshold. Hence, the
model that gives the classification is the model that gives a prediction more distant to its threshold:
\[ g = \arg\max_i \, (|t_i - y_{ij}|) \tag{2.19} \]
\[ Y_{mm,j} = Y_g \tag{2.20} \]
Arithmetic mean
In this case, the multimodel prediction is the arithmetic mean of the single models' predictions
(equation (2.21)). The idea is to check whether combining the outputs of several classifiers by
averaging can reduce the risk of selecting a poorly performing classifier. Even though the average
may not beat the performance of the best classifier in the ensemble, it may reduce the overall risk
of making a poor selection.
\[ Y_{mm,j} = \frac{\sum_{i=1}^{K_c} y_{ij}}{K_c} \tag{2.21} \]
Distance-weighted mean
A mean of the single models' outputs, weighted by the Euclidean distance to the corresponding cluster
centre and given by equation (2.22), is calculated for each test sample. The idea is to combine the
individual models while giving higher credit to those classifiers trained with data closest to the
sample under evaluation.
\[ Y_{mm,j} = \frac{\sum_{i=1}^{K_c} \frac{1}{d_{ij}} y_{ij}}{\sum_{i=1}^{K_c} \frac{1}{d_{ij}}} \tag{2.22} \]
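For a single test sample, the four criteria reduce to a few lines of code. The sketch below uses hypothetical values for the model outputs, the distances to the cluster centres and the per-model thresholds (the variable names are ours):

```python
import numpy as np

# Hypothetical ensemble state for one test sample j:
y = np.array([0.30, 0.80])   # continuous prediction of each single model
d = np.array([2.00, 0.50])   # Euclidean distance to each cluster centre
t = np.array([0.50, 0.55])   # threshold of each model (a posteriori rule)

# a priori (eqs. 2.17-2.18): trust the model of the closest cluster
y_apriori = y[np.argmin(d)]

# a posteriori (eqs. 2.19-2.20): trust the most "decided" model,
# i.e. the one whose output lies farthest from its own threshold
y_aposteriori = y[np.argmax(np.abs(t - y))]

# arithmetic mean (eq. 2.21)
y_mean = y.mean()

# distance-weighted mean (eq. 2.22): inverse-distance weights
w = 1.0 / d
y_weighted = np.sum(w * y) / np.sum(w)
```

For these values, both selection rules pick the second model (closest cluster and largest threshold gap), while the fusion rules return 0.55 and 0.70 respectively.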
2.5 Feature Selection
The concept of Feature Selection (FS) is to reduce the dimensionality of the datasets in order
to keep only the features that are most relevant, based on the optimization of a specified criterion.
One desirable property of a machine-learned model is low variance, i.e., it should not overfit the
training data and lose the ability to generalize to unseen data. One way to achieve this is to minimize
the number of features the model uses, keeping only the most informative ones. Feature selection is
different from dimensionality reduction: both seek to reduce the number of attributes in the dataset,
but dimensionality reduction methods do so by creating new combinations of attributes, whereas feature
selection methods include and exclude attributes present in the data without changing them. Examples of
dimensionality reduction methods include Principal Component Analysis, Singular Value Decomposition and
Sammon's Mapping.

There are many motivations to perform feature selection. Besides the direct benefits of reducing the
dimensionality of the data, lowering the processing and storage requirements and improving visualization
and understanding of the data [27], there are side benefits that can be of even higher value: cutting
down the data storage requirements can reduce equipment needs, avoiding useless investment, and in some
cases it can improve the results by removing variables that do not contribute to the objective and act
as noise, misguiding the judgement. The process also has the advantage of highlighting sets of features
not previously thought of as the most relevant for prediction, owing to its capability of choosing
features that in isolation carry no relevant information but, in association with other features, can be
of utmost importance. In health care applications there is the potential to take advantage of every
aspect mentioned: cutting measurements that are routinely taken reduces laboratory costs, improves the
management of resources, and leads to a better understanding of the condition under study and to the
disclosure of the information contained in some features or groups of features. From an engineering
perspective, it is a powerful tool to reduce the complexity of the model and remove inputs that are
redundant or do not improve the classification performance.
The machine learning community classifies feature selection into four different categories: filters,
wrappers, hybrids and embedded [10] [53].
Filter algorithms - the selection method is used as a preprocessing step that does not attempt to
directly optimize the predictor performance. Examples of filter methods include the Chi-squared test,
information gain and correlation coefficient scores, as well as measurements of entropy, variance,
correlation or mutual information of single and multiple variables.
Wrapper algorithms - the selection method directly optimizes the predictor performance. An example of
a wrapper method is the recursive feature elimination algorithm.
Hybrid algorithms - a filter method is applied first to obtain a reduced set of features, and a wrapper
method is then used to select the most relevant ones, according to the performance of a selected
machine learning algorithm.
Embedded algorithms - feature selection and model tuning are carried out at the same time. This will
probably require some experience to know where to stop for greedy algorithms, like backward and forward
selection, as well as to tune the parameters of regularization-based models. Regularization refers to
the method of preventing overfitting of the training sample by explicitly controlling the model
complexity. Examples of regularization algorithms are the Lasso, Elastic Net and Ridge Regression.
A greedy algorithm is a procedure that looks for simple, easy-to-implement solutions to complex,
multi-step problems by deciding, at each step, which next move provides the most obvious benefit. Such
algorithms are called greedy because, while the optimal solution to each smaller instance provides an
immediate output, the algorithm does not consider the larger problem as a whole; once a decision has
been made, it is never reconsidered. Greedy algorithms work by recursively constructing a set of objects
from the smallest possible constituent parts, where recursion means that the solution to a problem
depends on solutions to smaller instances of the same problem. The advantage of a greedy algorithm is
that the solutions to the smaller instances can be straightforward and easy to understand. The
disadvantage is that the optimal short-term choices may lead to the worst possible long-term outcome.
There are two ways to use greedy algorithms for feature selection: forward selection, where variables
are progressively included, growing the feature subset according to the relevance of the variables when
grouped; and backward elimination, which starts from the full feature set and iteratively removes the
least relevant attribute until the performance criterion stops increasing [27]. Another possibility is
the use of meta-heuristics, whose increased computational effort is compensated by a more refined
search [58].
The present thesis makes use of a greedy search strategy implemented in a wrapper way, following
the Sequential Forward Selection (SFS) approach.
2.5.1 Sequential Forward Selection
Following the wrapper concept, whose objective function is the optimization of the predictor perfor-
mance, the SFS method builds a model for each of the features under consideration and evaluates the
models' performance in order to define the next best candidate feature: the one that returns the best
value of the performance criterion, which is then added to the already chosen features, extending the
feature space by one dimension.
The process thus starts by evaluating which single feature is best for prediction, fixes that feature,
and then tries the combination of the selected feature with each of the remaining ones, repeating this
process until the value of the performance criterion stops increasing. Discrimination based on the area
under the receiver operating characteristic curve (see Section 2.6.1) [31] was used as the performance
criterion in this study.
As mentioned before, the drawback is the same as that stated for greedy algorithms: the high likelihood
of getting stuck in a sub-optimal solution. Nevertheless, its general acceptance in the medical
community and the advantages related to its simplicity, computational efficiency and transparent
interpretation of the results make it the sole contestant in this study, since the high-dimensional
datasets limit the usability of other algorithms.
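The procedure above can be sketched as a generic greedy loop around a user-supplied evaluation function. In the sketch below (names and toy data are ours), `evaluate` returns the AUC of a candidate subset, computed by the rank-sum identity, and scores a subset by the AUC of the mean of its standardised columns; any model-based scorer could be substituted:

```python
import numpy as np

def auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def sfs(n_features, evaluate):
    """Greedy forward selection: add the feature that most improves the
    criterion; stop as soon as no candidate improves it."""
    selected, best = [], -np.inf
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        scores = {f: evaluate(selected + [f]) for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:          # criterion stopped increasing
            break
        selected.append(f_best)
        best = scores[f_best]
    return selected, best

# Toy data: feature 0 is informative, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = np.column_stack([y + rng.normal(0, 0.3, 100), rng.normal(0, 1, 100)])

def evaluate(subset):
    # Score a subset by the AUC of the mean of its standardised columns.
    Z = (X[:, subset] - X[:, subset].mean(0)) / X[:, subset].std(0)
    return auc(y, Z.mean(axis=1))

selected, score = sfs(X.shape[1], evaluate)
```

On this toy data SFS keeps only the informative feature: adding the noise column dilutes the combined score, so the loop stops after the first pass.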
2.5.2 SFS with ensemble
Given the two-layered structure of the ensemble modelling, two different approaches were considered
when performing feature selection: FS based on overall performance and FS based on the singular models'
performance. Both procedures are summarized in Figure 2.4.
FS based on overall performance - The feature selection procedure uses the final predicted outcome as
the performance criterion, adding the “best” feature to the whole operation (to all classifiers).

FS based on the singular models' performance - The data is first partitioned using all the variables;
SFS is then performed for each group of data, resulting in a different subset of features for each
group, i.e., for each single model dedicated to each group.
The idea behind the latter procedure is that different subgroups (unsupervised clusters) may have
heterogeneous relevant features: in the first case, every feature that is added improves the final
result but may harm one of the clusters while benefiting another. By considering each group
independently, it is guaranteed that each added feature improves the performance of its model without
harming the others, making each model more specialised for its subgroup.
The problem that arises is that different features will be selected while the unsupervised clustering
still considers all 32 variables, meaning that one of the purposes of feature selection, reducing the
variables that need to be collected, has no effect in this case: all 32 time-variant variables must be
collected to perform the unsupervised clustering and fix the subgroup centres in the feature space. In
the former approach, the feature space can be reduced by discarding the features that do not improve
the performance, because both the unsupervised clustering and the modelling procedures are performed
only with the features being considered at each step.
An advantage of the latter is that feature selection is performed only once for each single model, and
the ensemble model approach is then applied directly based on those single models; in the former, since
the evaluation is based on the final prediction, the feature selection procedure has to be performed
for each ensemble modelling criterion.
Figure 2.4: Left: SFS applied in individual subgroups; Right: SFS applied to the ensemble methodology.
2.5.3 SFS with MFC
SFS was performed directly for each of the four approaches presented in Section 2.3, in order to check
how far each system can go in terms of outcome prediction when tuned. Nevertheless, the core question
is the comparison of the MFC approach with its competitor, to verify whether the added complexity
brings any benefit in this case.
2.6 Validation Measures
This section covers the validation measures that were used to evaluate the results and support the
decisions made throughout the study. First, the concept of area under the receiver operating
characteristic curve (AUC) is introduced, a measure that is crucial to evaluate performance during the
stages of feature selection and model assessment. Then follows a group of clustering validation
methods: Partition Coefficient, Classification Entropy, Partition Index, Separation Index, Xie-Beni
Index, Dunn's Index and Alternative Dunn Index.
2.6.1 AUC
In order to assess the performance of binary classifiers, it is not enough to consider the percentage
of correct classifications (accuracy, as in equation (2.23)): if one of the classes is less common,
misclassifying it has little impact on the accuracy value, and it is not possible to give more weight
to a class that might be of greater importance.
\[ \text{Accuracy (ACC)} = \frac{\#\text{ of correct classifications}}{\#\text{ of classified samples}} = \frac{TN + TP}{TN + TP + FP + FN} \tag{2.23} \]
In the present case the datasets are heavily unbalanced, so the sensitivity (hit rate) and the
specificity (one minus the false alarm rate) must be considered. The most standard method used in
medical applications and in the machine learning community to merge these two measures into the
assessment is the analysis of the area under the receiver operating characteristic curve (AUC).
A receiver operating characteristic (ROC) graph is a technique for visualizing, organizing and
selecting classifiers based on their performance, depicting the trade-off between hit rates and false
alarm rates. This performance graphing method has properties that make it particularly useful for
domains with skewed class distributions and unequal classification error costs, characteristics that
have become increasingly important as research continues into cost-sensitive learning and learning in
the presence of unbalanced classes.
A classification model (or classifier) is a mapping from instances to predicted classes. Some classifi-
cation models produce a continuous output to which different thresholds may be applied to predict class
membership. Given a classifier and an instance, there are four possibilities for the outcome:
True Positive (TP) - correctly classified positive instances;
False Negative (FN) - incorrectly classified positive instances;
True Negative (TN) - correctly classified negative instances;
False Positive (FP) - incorrectly classified negative instances.
The true positive rate (also called hit rate, recall or sensitivity) of a classifier is estimated by
equation (2.24), and the true negative rate (specificity) is given by equation (2.25).

\[ TP_{rate} \approx \frac{\#TP}{\#TP + \#FN} \tag{2.24} \]
\[ TN_{rate} \approx \frac{\#TN}{\#TN + \#FP} \tag{2.25} \]
An additional term associated with ROC curves is precision which in this context is defined by the
equation (2.26).
\[ \text{precision (prec)} = \frac{\#TP}{\#TP + \#FP} \tag{2.26} \]
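These definitions translate directly into code. The sketch below (on hypothetical, deliberately unbalanced labels) computes sensitivity, specificity, precision and accuracy from a binary prediction:

```python
import numpy as np

def rates(y_true, y_pred):
    """Sensitivity (eq. 2.24), specificity (eq. 2.25), precision (eq. 2.26)
    and accuracy (eq. 2.23) of a binary prediction."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # correctly classified positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed positives
    tn = np.sum((y_pred == 0) & (y_true == 0))   # correctly classified negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, precision, accuracy

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
sens, spec, prec, acc = rates(y_true, y_pred)
```

Here accuracy is 0.7 even though a quarter of the positives are missed, which is exactly the limitation of accuracy on unbalanced data discussed above.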
ROC graphs are two-dimensional graphs in which TPrate (sensitivity) corresponds to the Y axis and one
minus TNrate (the false positive rate, i.e., one minus specificity) corresponds to the X axis. The
sensitivity and specificity can take any value in [0, 1], depending on the selected threshold.
Several points in ROC space are important to note, as depicted in Figure 2.5:

Point (0,0) - never assigns a positive classification;

Point (1,1) - opposed to point (0,0), assigns a positive classification to every instance;

Point (0,1) - represents perfect classification.
Generally, one point in ROC space is better than another if it is closer to the point (0,1).
Classifiers appearing on the left-hand side of the graph, close to the X axis, can be thought of as
“conservative”: they make positive classifications only when there is strong evidence, so they make few
false positive errors but often have low true positive rates as well. On the other hand, classifiers on
the upper right-hand side of a ROC graph may be thought of as “liberal”: they make positive
classifications on weak evidence and classify nearly all positives correctly, but the rate of false
positives also increases. Other conclusions can be drawn by observing the ROC graph. A classifier
appearing in the lower right triangle performs worse than random guessing, which corresponds to the
diagonal connecting the points (0,0) and (1,1). Nevertheless, such a classifier can be thought of as
having useful information that it is applying incorrectly: if the classifier is negated, reversing its
classification, it is brought to the upper side of the diagonal (its true positive classifications
become false negative mistakes, and its false positives become true negatives).
Each threshold value produces a different point in ROC space. Thus, it is possible to choose where the
classifier stands along the ROC curve by choosing the threshold: to bring the classifier from the
conservative side to the liberal area of the graph, it suffices to lower the threshold. In this thesis,
the criterion that settles the threshold is the minimization of the absolute difference
|sensitivity − specificity|.
Figure 2.5: ROC curve example and points of interest (inspired on [20]).
2.6.2 Clustering Validation
With the significant resurgence of interest in new clustering techniques arises the need to validate
the quality of the resulting partitioning, since each clustering strategy has its own advantages and
shortcomings. The typical requirements for a good clustering technique in data mining are the
following [28]:
Scalability - The clustering method should be applicable to huge databases, and its performance should
degrade no worse than linearly as the data size increases.
Versatility - The objects to cluster can be of different types: numerical, boolean or categorical data.
Ideally, a clustering method should be suitable for all of them.
Ability to discover clusters with different shapes - This is an important requirement for spatial data
clustering. Many clustering algorithms can only discover clusters with spherical shapes, e.g., the
FCM.
Minimal input parameters - The method should require a minimum of domain knowledge for correct
clustering. However, most current clustering algorithms have several key parameters, which makes them
less practical for real-world applications.
Robust with regard to noise - This is important because noise exists everywhere in practical prob-
lems. A good clustering algorithm should be able to perform successfully even in the presence of
a great deal of noise.
Insensitivity to the data input order - The clustering method should give consistent results
irrespective of the order in which the data are presented.
Scalability to high dimensionality - The ability to handle high dimensionality is very challenging,
but real data sets are often multidimensional.
No single algorithm can fulfil all these requirements, so it is important to understand the
characteristics of each algorithm in order to select the one that best suits the problem at hand [36].
One of the main difficulties with clustering algorithms is how to assess the quality of the returned
clusters. Many popular clustering algorithms are known to perform poorly on several types of data sets.
In addition, virtually all current clustering algorithms require their parameters to be tweaked for the
best results, which is impossible if one cannot assess the quality of the output. While visual
inspection can be of use for low-dimensional data, most clustering tasks are of higher dimension than a
human inspector can analyse.
Clustering validation serves the purpose of evaluating the quality of the clustering results [46]. It has
two main categories: internal clustering validation and external clustering validation [43].
External clustering validation - uses external information. An example of an external validation
measure is entropy, which evaluates the “purity” of clusters based on the given class labels [56];
since the number of clusters/labels is already known, it is mainly used to choose the best clustering
algorithm for a specific dataset.

Internal clustering validation - relies only on information in the data, evaluating the goodness of the
clustering structure without respect to external information [47]; internal validation measures can be
used to choose the most suitable clustering algorithm as well as the optimal number of clusters and
other input parameters not known a priori.
Cluster properties such as compactness and separation are often taken into consideration as major
characteristics by which to validate clusters and are used for validity methods that are based only on
the data. Compactness is an indicator of the variation or scattering of the data within a cluster, and
separation is an indicator of the isolation of clusters from one another [39]. Thus, a good fuzzy partition
is expected to have a low degree of overlap and a large separation distance.
As already stated in Section 2.1, the aim of fuzzy clustering is to partition a data set into nc
homogeneous fuzzy clusters. The FCM algorithm requires the user to pre-define the number of clusters
(nc); however, there are many cases where knowing nc in advance is not possible. Given that the
resulting fuzzy partitions of the FCM algorithm depend on the choice of nc, it is necessary to validate
each of the fuzzy c-partitions once they are found. This validation is performed by a cluster validity
index, which evaluates each of the fuzzy c-partitions and determines the optimal partition or the
optimal number of clusters (nc). It is important to mention that conventional approaches to measuring
compactness suffer from a tendency to decrease monotonically as the number of clusters approaches the
number of data points.
Partition Coefficient (PC) - The PC measures the amount of “overlapping” between clusters. It is a
heuristic measure, since it has no connection to any property of the data, and is defined by Bezdek [6]
as in equation (2.27). Maximal values imply a good partition in the sense of a least fuzzy clustering.

\[ PC(c) = \frac{1}{N_s} \sum_{i=1}^{n_c} \sum_{j=1}^{N_s} (\mu_{ij})^m \tag{2.27} \]
Classification Entropy (CE) - The CE, defined by equation (2.28), measures only the fuzziness of the
cluster partition, being similar to the PC. It measures whether a particular location (prototype) has
high membership values in any of the classes. Minimal values imply a good partition in the sense of a
crisper partition.

\[ CE(c) = -\frac{1}{N_s} \sum_{i=1}^{n_c} \sum_{j=1}^{N_s} \mu_{ij} \log(\mu_{ij}) \tag{2.28} \]
Partition Index (SC) - The SC is the ratio of the sum of compactness and separation of the clusters,
given by equation (2.29). It is useful when comparing different partitions having an equal number of
clusters [5]. A lower value of SC indicates a better partition.

\[ SC(c) = \sum_{i=1}^{n_c} \frac{\sum_{j=1}^{N_s} (\mu_{ij})^m \|x_j - \upsilon_i\|^2}{N_i \sum_{k=1}^{n_c} \|\upsilon_k - \upsilon_i\|^2} \tag{2.29} \]
Separation Index (S) - The S, given by equation (2.30), uses a minimum-distance separation for
partition validity, contrary to SC. A better partition is given by a lower value of S.

\[ S(n_c) = \frac{\sum_{i=1}^{n_c} \sum_{j=1}^{N_s} (\mu_{ij})^m \|x_j - \upsilon_i\|^2}{N_s \min_{i,k} \|\upsilon_k - \upsilon_i\|^2} \tag{2.30} \]
Xie-Beni Index (XB) - Xie and Beni [57] proposed a validity index that focuses on two properties:
compactness and separation.

\[ XB(n_c) = \frac{\sum_{i=1}^{n_c} \sum_{j=1}^{N_s} (\mu_{ij})^m \|x_j - \upsilon_i\|^2}{N_s \min_{i,j} \|x_j - \upsilon_i\|^2} \tag{2.31} \]
In equation (2.31), the numerator indicates the compactness of the fuzzy partition, while the
denominator indicates the strength of the separation between clusters. A good partition produces a
small value for the compactness, and well-separated υi produce a high value for the separation. Hence,
the best number of clusters is the one that yields the lowest value of the index.
Dunn's Index (DI) - The DI is defined by equation (2.32),

\[ DI(n_c) = \min_{i \in n_c} \left\{ \min_{j \in n_c,\, i \neq j} \left\{ \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{k \in n_c} \{ \max_{x, y \in C_k} d(x, y) \}} \right\} \right\} \tag{2.32} \]

where Ci is a cluster of vectors, and x and y are any two n-dimensional feature vectors assigned to
the same cluster Ci. A more intuitive formulation is given by equation (2.33):

\[ DI(n_c) = \frac{d_{min}}{d_{max}} \tag{2.33} \]
Clearly, compact clusters that are well separated in the feature space manifest themselves in small
values of dmax and large values of dmin, leading to an increased value of DI. This index is easy
to implement and has a low computational complexity, but it is vulnerable to outliers as only two
distances are used. A larger value of DI indicates a better clustering.
Alternative Dunn Index (ADI) - The original Dunn's index was modified so that its calculation becomes simpler: the dissimilarity between two clusters (\min_{x \in C_i, y \in C_j} d(x, y)) is bounded from below using the triangle inequality, as in equation (2.34),
d(x, y) ≥ |d(y, υj)− d(x, υj)| (2.34)
where υj is the cluster centre of the jth cluster. After the alteration, the ADI is given by the equation
(2.35).
ADI(c) = \min_{i \in n_c} \left\{ \min_{j \in n_c, j \neq i} \left\{ \frac{\min_{x \in C_i, y \in C_j} |d(y, \upsilon_j) - d(x, \upsilon_j)|}{\max_{k \in n_c} \{\max_{x, y \in C_k} d(x, y)\}} \right\} \right\}    (2.35)
The optimal partition (or an optimal number of clusters) is obtained by maximizing PC (or minimizing CE) with respect to the number of clusters, because this yields compact clusters with higher values of \mu_{ij}. The major drawback of these indices is that they use only the fuzzy membership degrees \mu_{ij} of each cluster, without considering the data structure of the clusters.
Note that SC, S and XB differ only in how they measure the separation between clusters. In the case of overlapping clusters, the values of DI and ADI are not really reliable, because these indices were tailored for hard clustering analysis and take into account neither the fuzziness parameter nor the membership degrees.
Chapter 3
Data collection and preprocessing
Recent improvements in modern ICUs have eased the capture and storage of health-care-related data as high temporal resolution data, including lab results, electronic documentation, and bedside monitor trends and waveforms. The growing interest in and capacity to acquire large datasets adds statistical robustness to the data, making the results of predictive models more reliable, since they can be trained and tested with more data.
Modern electronic health records (EHR’s) are designed to capture and render clinical data during
the health care process [41]. Using them, health care providers can enter and access clinical data
when it is needed. This collected data can then be integrated to form a hospital information system.
EHR’s can incorporate decision support technologies to assist clinicians in providing better care through
data mining technologies to automatically extract useful models and assist in constructing the logic for
decision support systems.
However, because the main function of EHR’s is to store and report clinical data collected for the
purpose of health care delivery, the characteristics of the available raw medical data are widely dis-
tributed, heterogeneous in nature, and voluminous and may not be optimal for data mining and other
data analysis operations [11]. One challenge in applying data mining to clinical data is to convert the data into an appropriate form for this activity. Data mining algorithms can then be applied to the preprocessed data; whether the mining is successful often hinges on the adequacy of the data preparation. Thus, despite the wealth of data available within healthcare systems, there is a lack of effective analysis tools to discover hidden relationships and trends in the data. Data mining techniques may help in answering several important and critical questions related to health care.
Knowledge Discovery in Databases is a well-defined process consisting of an iterative sequence of data cleaning, data integration, data selection, data mining, pattern recognition and knowledge presentation, which aims to discover hidden relationships and trends in data and has attracted a great deal of interest in the information industry [26]. A formal definition of Knowledge Discovery in Databases is given as follows: "We analyse Knowledge Discovery and define it as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data" [25].
Data mining is an essential step of knowledge discovery that may accomplish class description, association, classification, clustering, prediction and time series analysis. Data mining, in contrast to traditional data analysis, is discovery driven, which results in the disclosure of hidden but useful knowledge from massive databases. Medical data mining technology has great potential and provides a user-oriented approach to novel and hidden patterns in data sets of the medical domain. These patterns can be utilized for clinical diagnosis and used by healthcare administrators and medical practitioners to improve the quality of service, reducing the number of adverse drug effects and suggesting less expensive therapeutically equivalent alternatives. Anticipating a patient's future behaviour based on the given history is one of the important applications of data mining techniques in health care management.
Data preprocessing allows the transformation of the original data into a suitable shape to be used by
a particular mining algorithm. So, before applying the data mining algorithm, a number of general data
preprocessing tasks have to be addressed, namely:
Data cleaning - One of the major preprocessing tasks, to remove irrelevant items and log entries that
are not needed for the mining process such as graphics and scripts.
Data transformation and enrichment - Consists in calculating new attributes from the existing ones; converting numerical attributes into nominal attributes; providing meaning to references contained in the log; etc.
Data integration - Integration and synchronization of data from heterogeneous sources.
Data reduction - Reducing data dimensionality which, per se, removes unnecessary and noisy data
while simplifying the problem.
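The first three tasks above can be sketched with a toy pandas example (the table layout, column names and event ID 211, used here as a stand-in for heart rate, are illustrative, not the actual MIMIC II schema):

```python
import pandas as pd

# Hypothetical events table: one measurement per row
events = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "event_id":   [211, 211, 211, 818],
    "chart_time": pd.to_datetime(
        ["2001-01-01 00:00", "2001-01-01 01:00",
         "2001-01-01 00:30", "2001-01-01 02:00"]),
    "value":      [80.0, 82.0, 300.0, 1.5],
})
static = pd.DataFrame({"subject_id": [1, 2], "age": [54, 67]})

# Data cleaning: drop physiologically impossible heart-rate readings
clean = events[~((events["event_id"] == 211) & (events["value"] > 250))]

# Data integration: attach the static data to every event
merged = clean.merge(static, on="subject_id", how="left")

# Data transformation: derive hours since each patient's first event
t0 = merged.groupby("subject_id")["chart_time"].transform("min")
merged["hours_in"] = (merged["chart_time"] - t0).dt.total_seconds() / 3600
```

Data reduction (dropping redundant or noisy variables) would follow the same pattern and is omitted here.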
This chapter introduces the MIMIC II database and the procedures taken to build the final datasets, which allow predicting the need of vasopressors administration, the problem this project deals with. The preprocessing step, explained in detail, results in two different groups of data: punctual data and time-series data. Each group comprises four datasets regarding the disease headings: pancreatitis, pneumonia, pancreatitis and pneumonia, and all patients regardless of the disease they were diagnosed with (addressed throughout this thesis as PAN, PNM, BOTH and ALL, respectively). The expected output is binary, 1 pointing to the need of vasopressors in a time window of 2 hours, and 0 otherwise. In order to clarify the difference between the two mentioned groups, a brief description is given next:
Punctual Data - Data that comprises information about the state of a patient at a given time neglecting
its history/evolution during the ICU stay. The output is binary, 1 or 0, for each state.
Time-series Data - Data organized as a sequence of events ordered by its occurrence in time, giving
information about the evolution of each variable during the patient’s stay. The output is binary, 1 or
0, after the temporal evolution.
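The two dataset shapes can be illustrated with pandas (the long-format layout, column names and values are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical long-format data: one row per (patient, time, variable)
rows = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "hours_in":   [0.0, 0.0, 1.0, 1.0],
    "variable":   ["heart_rate", "spo2", "heart_rate", "spo2"],
    "value":      [88.0, 97.0, 95.0, 93.0],
})

# Punctual data: each (patient, time) snapshot is an independent sample
punctual = rows.pivot_table(index=["subject_id", "hours_in"],
                            columns="variable", values="value")

# Time-series data: one ordered sequence of snapshots per patient
series = {sid: g.sort_index(level="hours_in").to_numpy()
          for sid, g in punctual.groupby(level="subject_id")}
```

Each row of `punctual` would receive its own binary label, whereas each sequence in `series` receives a single label after the temporal evolution.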
Note that the datasets are built using different strategies only to appropriately fit the different data mining algorithms. The classification problem is the same in all cases, i.e., the final result - patient was on vasopressors or patient was not on vasopressors - is the same regardless of the scenario. The major difference is that the punctual case has more data, since it has fewer restrictions. If the time-series method shows up as a strong competitor, a later study should use the exact same information from the database for both approaches, allowing a direct comparison that cannot be achieved in the present study. Nevertheless, the punctual data will still have the advantage of needing only one measurement of each feature to compute an outcome. It might therefore be interesting to integrate both for prediction purposes during the whole ICU stay, assuming that in the long run the time-series approach performs better.
The punctual case was already studied in [24], but since there are some minor changes in the preprocessing steps it is still worth deepening its study. The main advantage of the present study in comparison to the previous one is that it can be used in real-time analysis to support decisions.
This chapter presents the steps taken to construct the final datasets that are the basis for this thesis. It starts with an overview of the raw data, which is essential to understand the steps needed in the data treatment phase. The detailed description provided in this section gives the means for future replication and facilitates the implementation of possible modifications to the proposed approach, given that all the studies on vasopressors intake mentioned until now lack a detailed description of the procedure to obtain the final data sets. Then follow the general preprocessing steps that are common to both the punctual and time-varying data: the variables included in the study, removal of outliers, removal of deceased patients, and an insight into alternative ways of normalization. Finally, preprocessing is performed separately for each type of data set, since each is treated differently to attain different data structures, and the particularities that lead to its final form are explained.
3.1 Structure of MIMIC II Database
The data were collected over a seven year period, beginning in 2001, from Boston’s Beth Israel Dea-
coness Medical Center (BIDMC). Any patient who was admitted to the ICU on more than one occasion
may be represented by multiple patient visits. The adult ICUs (for patients aged 15 years and over)
include medical (MICU), surgical (SICU), coronary (CCU), and cardiac surgery (CSRU) care units. Data
were also collected from the neonatal ICU (NICU).
Figure 3.1 illustrates the data acquisition process, which did not interfere with the clinical care of
patients, since databases were dumped off-line and bedside waveform data and derived trends were
collected by an archiving agent over TCP/IP. Source data for the MIMIC II database consists of a) bed-
side monitor waveforms and associated numeric trends derived from the raw signals, b) clinical data
derived from Philips’ CareVue system, c) data from hospital electronic archives, and d) mortality data
from the Social Security Death Index (SSDI). These data are assembled in a protected and encrypted
database (both flat files for the waveforms and trends, and in the form of a relational database for all
other data). Once the data have been assembled in a central repository and time aligned, the waveforms
and trends for each individual are linked to the corresponding individuals’ data in the relational database.
The data are then de-identified to produce a final set of data for public consumption and, due to its sensitive information, access to the MIMIC II Clinical Database is restricted to registered users. It includes
calculated standardized severity scores on the database and also user feedback and corrections.
The resulting records contain realistic patient measurements with all the associated challenges (such
as noise or missing data gaps) that advanced monitoring and clinical decision support systems (CDSS)
algorithms would receive as input data.
Figure 3.1: Schematic of data collection and database construction [1].
The MIMIC II database is composed of two distinctive groups of data. The first group, the clinical
database, consists of data integrated from different information systems in the hospital and contains
diverse information such as: patient demographics, medications, results of lab tests and more. The
second group, contains high resolution waveforms recorded from the bedside monitors in the intensive
care units.
Due to the richness of data provided by MIMIC II there are almost limitless applications and investigations that can be performed. MIMIC II is a database that is gradually growing and should be expected to continue expanding, not only in size, by adding patients, but also in detail, by making new information/attributes available. Version 2.6, released on the 24th of February 2012, is used here and, despite the limitless combinations of data that can be used, as shown previously in 3.1, only a small part of its information was used in this study. Next follows a brief description of the structure of the used raw data and the major steps performed to organize this data.
The database is downloadable by everyone after completing a general ethics test. Once all the data is extracted, one ends up with as many folders as the number of patients the database contains, and each folder number (ranging between 1 and 32208, with some missing numbers) refers to the ID number of a patient. Inside these folders is all the information relative to the given patient. It is not in the scope of this study to provide an extensive, comprehensive guide to the database; nevertheless, an attempt is made to guide the reader through the process of extracting information for this particular case, so only the part of the database that is necessary for, and integrates, this project is covered. Below, the content and the use of each file of information that can be found inside each patient folder are described. All files are named as [CHART NAME]-[ID PATIENT].txt and, for the sake of generality, [ID PATIENT] is kept as ID.
ICD9-ID.txt1 - This file maps International Classification of Diseases (ICD9) codes to each admitted
patient.
CHARTEVENTS-ID.txt - This file contains physiological measurements that were acquired during the
patient stay as well as the time at which the measurement occurred.
ICUSTAY DETAIL-ID.txt - All the so-called static data, such as gender, age, height, weight, date of admission and discharge, death date, SOFA and SAPSI - to mention a few - for each patient and particular ICU stay were retrieved from this file.
IOEVENTS-ID.txt - This file stands for input/output events; the data for the variable urine output foley, as well as its timings, was recovered from it.
MEDEVENTS-ID.txt - All the considered output events (vasopressors administration) and the times of
its administrations were collected from this file.
For more details - A more detailed description of these files/tables can be found in [1].
The aforementioned files have an "event" column with the ID of what is being measured, and the mapping between those IDs and their descriptions is given in the so-called D tables, D CHARTITEMS.txt, D IOITEMS.txt and D MEDITEMS.txt, where the correspondence is evident. Since the information is heterogeneous, meaning that it comes from different sources, the same event can be referenced with different IDs and descriptions; for instance, there are different IDs for lactic acid and lactate, which in medical terms are the same. In an attempt to remove this caveat from the data, all the variables with the same meaning were converted from multiple IDs to one unique ID. Returning to the previous example, lactate has the ID 818 and lactic acid the ID 1531, and the latter was switched to 818, ending up with only one ID for this variable. Some variables have more than one or two IDs linked to the same physical variable, and a list of all the aggregations is provided in Table C.1. This conversion results in repeated/redundant measurements, which was prevented using the timestamps of each event: every time the same ID (after switching to a common one) is coupled with the same timing more than once, regarding the same exact measurement, only one occurrence prevails.
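A sketch of this ID aggregation and deduplication with pandas follows (the alias map shows only the lactate example mentioned above; the full list of aggregations is in Table C.1, and the table layout is illustrative):

```python
import pandas as pd

# Every variant ID is mapped to one canonical ID,
# e.g. lactic acid (1531) becomes lactate (818)
ALIASES = {1531: 818}

chart = pd.DataFrame({
    "subject_id": [7, 7, 7],
    "event_id":   [818, 1531, 818],
    "chart_time": pd.to_datetime(
        ["2001-01-01 08:00", "2001-01-01 08:00", "2001-01-01 12:00"]),
    "value":      [2.1, 2.1, 1.8],
})

# Map variant IDs onto the canonical one
chart["event_id"] = chart["event_id"].replace(ALIASES)

# The mapping can create duplicates (same ID and timestamp for the same
# measurement), so only the first occurrence is kept
chart = chart.drop_duplicates(subset=["subject_id", "event_id", "chart_time"])
```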
At this stage, all the data with potential to be used is compiled in a single table with the following information: subject ID, event ID, time of occurrence of the event and value measured, plus static data for each subject ID. Figure 3.2 depicts the path from the raw data to the data that will be used for modelling.
Figure 3.2: From raw data to usable data. (The diagram shows the flow from the source files - ICD9-ID.txt, CHARTEVENTS-ID.txt, ICUSTAY_DETAIL-ID.txt, IOEVENTS-ID.txt and MEDEVENTS-ID.txt - through the selection of the patients under study by disease (PAN, PNM, BOTH or ALL), the physiological variables, static data, vasopressors administrations and input/output events, to the useful information and, after preprocessing suited to the intended algorithm, the usable information.)
The useful information is called so because it still contains all the data, in its original structure, related to the variables and patients under study that can in some way be used; for instance, it still contains outliers, which can be used to apply statistical algorithms such as the inter-quartile method. The usable information is the data preprocessed and structured in such a way that the intended data mining algorithms can be applied directly.
3.2 Preprocessing Data - General Steps
In this section the preprocessing steps that are common to both punctual and time-varying data are described. These steps aim to filter the data and obtain a dataset with no loss of information relative to the given initial assumptions, so all the data that is unnecessary or does not fit the problem is eliminated, keeping what is considered useful information for the algorithms in mind. After that, this data can be manoeuvred in several ways to analyse the same problem differently. Later, the preprocessing steps that lead to the final datasets, be they punctual or time-varying, will be addressed separately, since their inner structures differ.

An overview of the assumptions that were made follows; the explanations for these assumptions are given throughout this chapter.
General assumptions:
ICU stay number - Only the first ICU stay is taken into account to avoid cumulative problems;
Patient inclusion - At least one measurement for each of the input variables is necessary to
consider the patient (two measurements for the time-series case);
Input variables - Only the selected variables have potential for predictive power over the out-
come considered;
Output variables - The output variables in study comprise all the existing vasopressors (so there are no class 0 cases that had another type of vasopressor, which would bias the result);
Prediction window - Predicting the need of vasopressors with 2 hours in advance is sufficient
to alert the medical staff and initiate the treatment adequately;
Vasopressor administrations - To be considered as being under vasopressor administration, three aspects must hold:

• Two consecutive administrations are considered to be continuous iff the interval between the events is less than or equal to 1.09 hours;

• To be considered a positive intake of vasopressors, the minimum (continuous) length of administration must be greater than or equal to 2 hours. Sometimes vasopressors are administered for a short time period, which can mean that the physician was being conservative; this criterion avoids such cases. Moreover, all patients whose first vasopressors administration lasted for a short period of time (less than 2 hours) are discarded (assigned neither class 0 nor class 1). This reduces the bias of more conservative physicians or weak decisions;

• There must be a minimum length of 6 hours between the admission to the ICU and the vasopressors administration, in order to guarantee that there is an evolution towards the need of the medication and that the patient did not enter the ICU in a state that should have been under drug administration before admission (applies only to class 1 patients);
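The labelling rules above can be sketched as a small function. This is a hedged interpretation, not the actual extraction code: variable names are illustrative, and the handling of an episode that is long enough but starts less than 6 hours after admission (here simply skipped in favour of a later episode) is an assumption:

```python
import pandas as pd

GAP_H = 1.09       # administrations <= 1.09 h apart count as continuous
MIN_LEN_H = 2.0    # a positive episode must last at least 2 h
MIN_ADMIT_H = 6.0  # at least 6 h between ICU admission and episode start

def first_valid_episode(times, admit):
    """Return the start time of the first continuous vasopressor episode
    satisfying the three rules, or None (patient discarded or class 0).
    `times` is a chronologically sorted list of administration timestamps."""
    if not times:
        return None
    start = prev = times[0]
    for t in times[1:] + [None]:  # None closes the last episode
        if t is not None and (t - prev).total_seconds() / 3600 <= GAP_H:
            prev = t  # still the same continuous episode
            continue
        length = (prev - start).total_seconds() / 3600
        if length < MIN_LEN_H:
            return None  # first episode too short: discard the patient
        if (start - admit).total_seconds() / 3600 >= MIN_ADMIT_H:
            return start  # valid class 1 episode
        if t is not None:  # long enough but too early: try the next episode
            start = prev = t
    return None
```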
3.2.1 Chosen Input/Output Variables
As already mentioned, this study relies on many assumptions, intended to narrow the data down to something practicable and feasible given the available time and resources. This narrowing starts by defining which variables should be considered: both input features and output variables must be defined. Maintaining the same input variables given in [24], and taking as output variables the ones labelled as "Pressor Medications" by [51] (with the exception of Isuprel, which has a warning note saying that it can have a reversed effect), this results in 37 and 10 variables, respectively2. Table 3.1 shows the input variables and their corresponding sampling rates.
Table 3.1: Features and sampling rates (measurements/day) in each dataset.
 #  Time variant input       [Unit]            ALL          BOTH         PNM          PAN
 1  Heart Rate               [BPM]             29,5 ± 6,4   27,7 ± 4,5   27,7 ± 4,5   27,5 ± 3,8
 2  Temperature              [◦C]              11,5 ± 6,9    9,3 ± 4,8    9,2 ± 4,5    9,6 ± 5,4
 3  Arterial BP              [mmHg]            24,1 ± 9,3   22,2 ± 7,5   22,2 ± 7,5   21,7 ± 7,5
 4  Arterial BP Diastolic    [mmHg]            24,1 ± 9,3   22,2 ± 7,5   22,2 ± 7,5   21,7 ± 7,5
 5  Respiratory Rate         [BPM]             28,8 ± 6,7   26,8 ± 5,2   26,7 ± 5,3   26,9 ± 4,3
 6  SpO2                     [%]               29,2 ± 6,2   28,0 ± 4,4   28,0 ± 4,4   27,6 ± 3,8
 7  Hematocrit               [%]                2,5 ± 1,4    2,0 ± 0,9    1,9 ± 0,9    2,1 ± 1,0
 8  Potassium                [mEq/L]            3,1 ± 1,7    2,5 ± 1,2    2,4 ± 1,1    2,7 ± 1,3
 9  Glucose (70-105)         [−]                3,9 ± 2,8    2,9 ± 1,9    2,8 ± 1,9    3,1 ± 1,8
10  Creatinine (0-1.3)       [mg/dL]            1,7 ± 0,8    1,6 ± 0,6    1,6 ± 0,6    1,8 ± 0,8
11  BUN (6-20)               [mg/dL]            1,7 ± 0,8    1,6 ± 0,6    1,6 ± 0,6    1,8 ± 0,8
12  Platelets                [cells×10³/µL]     1,7 ± 0,9    1,5 ± 0,6    1,4 ± 0,6    1,6 ± 0,6
13  WBC (4-11,000)           [cells×10³/µL]     1,5 ± 0,8    1,4 ± 0,5    1,4 ± 0,5    1,5 ± 0,6
14  RBC                      [cells×10³/µL]     1,5 ± 0,7    1,4 ± 0,5    1,4 ± 0,5    1,5 ± 0,6
15  Sodium                   [mEq/L]            1,9 ± 1,0    1,8 ± 0,7    1,7 ± 0,7    1,9 ± 0,9
16  Chloride                 [mEq/L]            1,7 ± 0,8    1,6 ± 0,6    1,6 ± 0,6    1,8 ± 0,8
17  Arterial CO2 (Calc)      [−]                3,2 ± 2,1    2,9 ± 1,7    2,9 ± 1,7    2,9 ± 1,6
18  Magnesium                [mg/dL]            1,7 ± 0,8    1,6 ± 0,6    1,6 ± 0,6    1,8 ± 0,8
19  NBP                      [mmHg]             8,7 ± 6,7    8,0 ± 6,0    8,0 ± 6,0    8,3 ± 5,8
20  NBP Mean                 [mmHg]             8,6 ± 6,6    7,9 ± 5,9    7,9 ± 5,9    8,2 ± 5,7
21  PTT (22-35)              [−]                1,3 ± 1,0    1,1 ± 0,8    1,1 ± 0,8    1,2 ± 0,9
22  INR (2-4 ref. range)     [−]                1,3 ± 0,9    1,0 ± 0,7    1,0 ± 0,7    1,2 ± 0,8
23  Arterial PaCO2           [mmHg]             3,2 ± 2,1    2,9 ± 1,7    2,9 ± 1,7    2,9 ± 1,6
24  Arterial PaO2            [mmHg]             3,2 ± 2,1    2,9 ± 1,7    2,9 ± 1,7    2,9 ± 1,6
25  Arterial pH              [−]                3,4 ± 2,1    3,0 ± 1,7    3,0 ± 1,7    2,9 ± 1,7
26  Arterial Base Excess     [mEq/L]            3,2 ± 2,1    2,9 ± 1,7    2,9 ± 1,7    2,9 ± 1,6
27  Ionized Calcium          [−]                1,9 ± 1,6    1,3 ± 1,2    1,3 ± 1,2    1,5 ± 1,2
28  Phosphorous (2.7-4.5)    [mEq/L]            1,4 ± 0,8    1,4 ± 0,6    1,4 ± 0,6    1,6 ± 0,8
29  Lactic Acid (0.5-2.0)    [mg/dL]            1,1 ± 1,4    0,9 ± 1,1    0,9 ± 1,1    1,2 ± 1,0
30  Calcium (8.4-10.2)       [mg/dL]            1,4 ± 0,8    1,4 ± 0,7    1,3 ± 0,6    1,6 ± 0,8
31  CVP                      [−]               15,7 ± 8,2   13,7 ± 7,2   13,5 ± 7,1   14,3 ± 7,0
32  Urine Out Foley          [ml]              18,8 ± 5,6   17,8 ± 5,5   17,7 ± 5,5   18,3 ± 5,7

 #  Time invariant input                       ALL          BOTH         PNM          PAN
33  Gender                                     -            -            -            -
34  Age                                        -            -            -            -
35  Weight                                     -            -            -            -
36  SAPS                                       -            -            -            -
37  SOFA                                       -            -            -            -
The sampling rates were computed using the final dataset, which includes all adult patients with at least one measurement for each input variable. The time between the first and last measurement for each patient was used, as well as the number of measurements of each variable during that interval for each patient. The result was then obtained by averaging over all patients, as in equation (3.1), where i = [1, 2, ..., number of features] and j = [1, 2, ..., number of patients].
SR_i = \frac{\sum_{j} \frac{\text{number of measurements of variable}_{ij}}{\text{end time}_j - \text{start time}_j}}{\text{number of patients}}    (3.1)
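Equation (3.1) can be sketched in pandas (the long-format table and column names are assumptions for illustration, not the actual extraction code used in this work):

```python
import pandas as pd

def sampling_rates(events):
    """Average sampling rate (measurements/day) per variable, eq. (3.1):
    for each patient, the count of measurements of each variable is divided
    by the span (in days) between that patient's first and last measurement,
    and the per-patient rates are then averaged over all patients."""
    span = events.groupby("subject_id")["chart_time"].agg(["min", "max"])
    span_days = (span["max"] - span["min"]).dt.total_seconds() / 86400
    counts = events.groupby(["subject_id", "event_id"]).size()
    per_patient = counts / span_days.reindex(
        counts.index.get_level_values("subject_id")).to_numpy()
    return per_patient.groupby(level="event_id").mean()

# Toy example: patient 1 has 3 measurements over 1 day (3/day),
# patient 2 has 2 measurements over half a day (4/day) -> mean 3.5/day
events = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "event_id":   [211, 211, 211, 211, 211],
    "chart_time": pd.to_datetime(
        ["2001-01-01 00:00", "2001-01-01 12:00", "2001-01-02 00:00",
         "2001-01-01 00:00", "2001-01-01 12:00"]),
})
rates = sampling_rates(events)
```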
The vasopressor agents considered are listed in Table 3.2, and it is important to mention that some
of them do not show up in the datasets.
2For the punctual data, only 32 input variables are considered, being it the time-variant variables (treated in a static manner forthe punctual data analysis, whereas for the time-series analysis the same variables are studied taking into account their evolution);The remaining 5 time-invariant variables are only considered for the mixed case: time-series data coupled with static data, beingit part of the latter.
Table 3.2: List of vasopressors and participation in the datasets.
Vasopressor Name Presence in the data sets
Levophed            All
Levophed-k          All
Dopamine            All
Dopamine Drip       None
Epinephrine         All except for PNM
Epinephrine-k       All
Epinephrine Drip    None
Vasopressin         All
Neosynephrine       All
Neosynephrine-k     All
An important assumption is that, for the output to be considered positive (occurrence of administration of vasopressors), the administration has to last at least 2 hours continuously, and it is considered continuous, adding to the previous administration, if and only if the interval between two separate intakes is less than or equal to 1.09 hours. This differs from what was done in [24], where an interval of 1 hour was considered instead. This minor change is due to the observation that an interval of 1 hour excludes many administrations, because many administrations occur at intervals slightly over 1 hour (1 hour and 5.4 minutes); this phenomenon is shown in Figure 3.3, and it was thought that they should still be considered continuous administrations. The percentage of vasopressors administrations that occur between 1 hour and 1.09 hours after the previous administration corresponds to 27.9% (for the dataset that contains all patients, ALL, without discriminating the vasopressors).
3.2.2 Removing outliers
Removing outliers is supported by expert knowledge, contrary to the inter-quartile method used in [24], which is a statistical method that turns out to be much more conservative than the information acquired from expert knowledge. However, it was not possible to collect expert knowledge for all variables and, in order to avoid excessively narrowing the range of possible values for the variables that lack expert knowledge, their ranges were narrowed individually by visual inspection, meaning that data is removed only when there are measurements clearly out of range compared with the most common measurements. Plots for variables 1-4 containing the expert-knowledge limits, the limits imposed by visual inspection and the ones obtained by the inter-quartile method are provided in Figure 3.4 (details about these limit values are given in Table 3.3, and plots for all the variables can be found in Annex A). The data presented here relates to the case where all patients are considered, the ALL dataset. Outside the green dashed lines the data points are considered outliers according to expert knowledge or visual inspection (when the former information is lacking) and are replaced by NaN; the black dashed lines have the same meaning when using the inter-quartile method, shown just for the sake of visual comparison. This data is then removed, which in many cases subtracts patients from the data set if they do not have at least one measurement for each variable (two measurements for the time-varying analysis), as in the previous case. These plots were zoomed in, leaving out some absurdly high and low measurements, to improve the visualization3.
3There are several superimposed data points and there is no point in trying to extract some tendency out of these plots; the purpose is just to give an idea of the dispersion of the data. The inter-quartile limits can help to give an idea of the densities. The blue and red dots are data points extracted from patients that never had vasopressors and from those that were under its administration, respectively, not to be confused with data points of class 1 and class 0, since class 1 patients have both.
Figure 3.3: Interval between two consecutive vasopressors intakes (with a random added value of ±0.015, after their interval identification, for density observation), considering only patients with more than 6 hours of data before vasopressors administration. Panel (b) is a zoom of panel (a) into the region of interest.
Figure 3.4: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (greendashed line) and inter-quartile (black dashed line) method for the input variables 1 to 4.
The major point here is that, since most of the data is of class 0, the inter-quartile method tends to consider as outliers data that is perfectly acceptable, just because it deviates from the most common measurements; in doing so it may remove exactly the trends that help distinguish between classes. The inter-quartile method is conservative, excluding noticeably more data than the expert-knowledge approach, because it is expected to lead to a solid range of what is considered the standard value, creating a dense area of measurements that pulls both the upper and lower limits inwards.
Table 3.3 (please refer to Table 3.1 for the variables' identification) shows the limits applied, and the consequent percentage of data considered outliers, when using expert knowledge along with the ranges established by visual inspection; for comparison, it also shows the limits and percentage of outliers when the inter-quartile method is used.
Table 3.3: Delimiting data (Inter-quartile is applied on ALL dataset)
          Expert Knowledge          Visual Inspection          Inter-Quartile
 # var    Min    Max    % Rem.      Min    Max    % Rem.       Min    Max    % Rem.
  1       0      250    0,00        -      -      -            52,5   121,5  5,19
  2       25     42     0,05        -      -      -            35,5   38,9   5,21
  3       -      -      -           0,001  250    0,32         67,5   172,5  4,66
  4       -      -      -           0,001  200    22,21        33,5   84,5   24,31
  5       0      200    0,00        -      -      -            6,5    33,5   4,64
  6       60     100    0,12        -      -      -            92,0   104,0  3,12
  7       19     60     0,33        -      -      -            21,9   37,3   7,07
  8       2,2    8      0,09        -      -      -            3,0    5,1    6,56
  9       -      -      -           0,001  1000   0,01         55,0   193,0  8,69
 10       0,1    9      0,42        -      -      -            -0,7   2,7    14,40
 11       4      500    0,24        -      -      -            -19,5  73,5   10,65
 12       3      1000   0,20        -      -      -            -93,5  455,5  8,16
 13       0,4    50     0,57        -      -      -            0,6    23,7   8,01
 14       2      8      0,24        -      -      -            2,4    4,2    6,74
 15       120    160    0,19        -      -      -            128,5  149,5  5,40
 16       80     130    0,18        -      -      -            94,0   118,0  4,88
 17       -      -      -           0,001  60     0,02         14,5   35,5   6,24
 18       0      10     0,01        -      -      -            1,4    2,6    10,59
 19       30     300    0,29        -      -      -            70,0   166,0  3,95
 20       10     186,7  0,06        -      -      -            44,0   106,0  5,42
 21       -      -      -           0,001  150    0,00         1,1    69,2   11,55
 22       -      -      -           0,001  30     0,02         0,7    1,9    13,70
 23       -      -      -           0,001  100    0,00         25,0   55,0   8,08
 24       -      -      -           0,001  500    0,00         27,5   186,5  7,50
 25       6,8    7,8    0,02        -      -      -            7,3    7,5    7,10
 26       -30    20     0,06        -      -      -            -9,0   9,0    5,72
 27       -      -      -           0,001  20     0,16         1,0    1,3    8,29
 28       -      -      -           0,001  20     0,01         1,1    5,9    7,87
 29       0      10     3,71        -      -      -            -0,9   4,5    14,15
 30       4,8    12     0,32        -      -      -            6,7    9,7    6,54
 31       -      -      -           0,001  50     0,59         0,5    21,5   5,94
 32       0      2000   2,16        -      -      -            -92,5  252,5  11,31
 Total    -      -      0,34        -      -      7,96         -      -      8,14
Using the inter-quartile method leads to a total of 8.14% of outliers, whereas expert knowledge combined with visual inspection leads to a total of 3.00% (all these results are prior to the imputation step and still include the patients that will be removed after outlier removal due to lack of data).
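The two kinds of limits can be sketched as follows (a sketch only; per Table 3.3 the expert limits for variable 1, heart rate, are 0-250, and the inter-quartile limits are computed just for comparison, as in the table):

```python
import numpy as np
import pandas as pd

def apply_limits(values, lo, hi):
    """Replace measurements outside [lo, hi] by NaN, using
    expert-knowledge or visual-inspection limits."""
    v = values.astype(float).copy()
    v[(v < lo) | (v > hi)] = np.nan
    return v

def iqr_limits(values):
    """Inter-quartile limits (Q1 - 1.5 IQR, Q3 + 1.5 IQR),
    computed only for comparison, as in Table 3.3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

hr = pd.Series([72.0, 85.0, 90.0, 600.0])  # 600 bpm is clearly an outlier
cleaned = apply_limits(hr, 0, 250)         # expert limits for heart rate
lo, hi = iqr_limits(pd.Series([1.0, 2.0, 3.0, 4.0]))
```

The NaN replacement mirrors the procedure described above: patients are only dropped afterwards, if too few valid measurements remain.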
3.2.3 Removing deceased patients
At the end what is intended with this project is to mimic the physicians decision as accurate as pos-
sible and it is desirable to remove the bias and noise that can misguide the obtained models. Another
source of data that have the potential to mislead the models (in pair with the cases where the vasopres-
sors intake are of short duration) are the data related with patients that died and were not prescribed
vasopressors since the physiological variables behave in a similar way to the patients that had vaso-
pressors: they start to decay as shown in Figure 3.5 for the variables platelets, WBC, chloride and NBP
mean (the plots for all the variables can be found in Appendix B).
The decision not to prescribe vasopressors can occur deliberately or not, and some hypotheses might be: failure to give the right diagnosis, no purpose in trying, vasopressors would not solve the problem, it was too late to prescribe vasopressors, the cause of death had nothing to do with shock, lack of resources, etc. Given that all these assumptions, if true, will bias the data, this can be prevented simply by excluding these cases. There is still interest in considering deceased class 1 patients, since the outcome is not part of this study. The need to remove this class of patients came with the observation of the plots that follow (for the ALL dataset), where each point is the mean of all the measurements taken in a 30-minute window for a particular group of patients. Despite the differences in the mean value, which can vary depending on the patients being considered to compute the mean (the number of patients under account increases along the x-axis, because that is when short stays start to appear), a clear tendency is shown that in some cases is similar between patients that took vasopressors and patients of class 0 that eventually died during their first ICU stay. The final value of the x-axis has different meanings:
For class 1 patients - it is the moment those patients started the vasopressor intake.
For class 0 patients - it is the moment they were discharged from the ICU.
For class 0 deceased patients - it is the moment the patient died.
The vertical black line is placed 2 hours before the end of the xx axis corresponding to the prediction
window prior to the starting point of the vasopressors administration.
Figure 3.5: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 12, 13, 16 and 20. Each point corresponds to the mean value of the measurements taken in a 2-hour window.
Three different ways were considered to evaluate whether the patients that were not under vasopressor administration (class 0 patients) ended up dying or not:

1. In the ICU stay details data there is a flag variable with the description 'Subject death during hospital admission.' Whenever this flag takes a positive value (Y) it is considered that the patient died during that ICU stay (the first stay, since it is the only one taken into account), and that the patient lived if it is negative (N);

2. Another flag variable, with the description 'Died in ICU (assumed from icustay outtime > hospital discharge AND died in hospital)', was evaluated in the same way as the previous one;

3. The last way uses two variables:

• 'Subject's date of death'

• 'Hospital discharge date'

Using this information, it is assumed that the patient died if the time between the first and the second variable is less than 24 hours, and that the patient lived otherwise.

When the information is inconsistent, priority is given in the order the conditions were presented. These assumptions could introduce erroneous information, since patients with none of this information were considered survivors. This seems a valid assumption, as it is more likely that a survival goes unreported than a death.

The reason for using three ways has to do with the possibility of missing data: the MIMIC database documentation [1] states that these variables are nullable, meaning that they can be empty.
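A minimal sketch of this priority scheme follows. The argument names are hypothetical stand-ins for the two MIMIC II flags and the two dates; since all four fields are nullable, missing values are modelled as None:

```python
from datetime import datetime

def patient_died(hosp_death_flag, icu_death_flag, dod, discharge_date):
    """Resolve possibly missing/inconsistent death information.

    Priority follows the order of the three conditions in the text.
    hosp_death_flag / icu_death_flag: 'Y', 'N' or None (nullable fields).
    dod / discharge_date: datetime or None.
    Returns True if the patient is assumed to have died.
    """
    if hosp_death_flag in ('Y', 'N'):                 # condition 1
        return hosp_death_flag == 'Y'
    if icu_death_flag in ('Y', 'N'):                  # condition 2
        return icu_death_flag == 'Y'
    if dod is not None and discharge_date is not None:  # condition 3
        return abs((dod - discharge_date).total_seconds()) < 24 * 3600
    return False  # no information at all: assume survivor

# Example: no flags recorded, death and discharge 13 hours apart
died = patient_died(None, None, datetime(2020, 1, 2, 1),
                    datetime(2020, 1, 1, 12))  # assumed dead
```

Returning False when all fields are empty encodes the stated assumption that an unreported survival is more likely than an unreported death.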
3.2.4 Normalization
Normalization of the data is an important step in the Knowledge Discovery in Databases (KDD) process, and depending on the type of dataset being dealt with there are various options to tackle it. In this work, min-max normalization was used, and three normalization approaches were considered, each with its advantages and disadvantages. To visualize the results of these hypotheses, an artificial time-varying dataset with three different behaviours was created, as shown in Figure 3.6.

For each of the three methods, for which a detailed description is provided below, min-max normalization was applied; the only difference lies in what is considered to be the minimum and the maximum. The min-max normalization is given by equation (3.2), resulting in data within the interval [0, 1].

d' = (d − min(p)) / (max(p) − min(p))    (3.2)
Figure 3.6: Artificial data and results for three different methods on applying min-max normalization.
Method 1 - Normalization using all feature values
The procedure of normalization using all measurements of each feature consists of searching for the maximum and minimum value of the feature under consideration over the whole dataset and applying the min-max normalization formula. This is the method that would be used blindly when there is no knowledge about the dataset, which is not the situation of the current problem when clustering time series. It is known that different patients can have dissimilar healthy baseline physiological values; for instance, those with pre-existing hypertension may require higher blood pressures to maintain adequate perfusion [9]. Neglecting this fact will bias the clustering with information it should not be concerned with, increasing the probability of patients being aggregated solely on this basis.
Method 2 - Normalization for each feature considering each patient
This approach consists in normalizing each patient individually, so that after normalization each patient's time series contains a 0 and a 1, corresponding to the minimum and the maximum measurement taken from that patient, respectively. The problem that arises is that there is no longer any perception of what is a high or a low variation: a patient can be stable, with only slight variations during the ICU stay, yet after normalization be comparable to a patient with highly unstable measurements. This phenomenon is observed when comparing the blue lines with the black one, which has a higher oscillation.
Method 3 - Normalization using all feature values after removing mean for each patient
This normalization is similar to the first one, with the difference that the mean value of the temporal series is removed for each patient. This approach has the smallest drawback: the problem of patients with heterogeneous healthy physiological measurements is removed and the variations become distinguishable. The only issue not taken into account is that a high variation for a given patient might be considered low for another, but compared with the other drawbacks this complication is unusual and neglecting it should not affect the analysis significantly.
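The three methods can be sketched as follows; the two toy series below (a stable and an unstable patient) are illustrative, not taken from the dataset:

```python
import numpy as np

def normalize(patients, method):
    """Min-max normalization (Eq. 3.2) of per-patient time series.

    patients: list of 1-D numpy arrays, one time series per patient.
    method 1: global min/max over all patients.
    method 2: per-patient min/max.
    method 3: subtract each patient's mean, then global min/max.
    """
    if method == 2:
        return [(p - p.min()) / (p.max() - p.min()) for p in patients]
    if method == 3:
        patients = [p - p.mean() for p in patients]
    lo = min(p.min() for p in patients)
    hi = max(p.max() for p in patients)
    return [(p - lo) / (hi - lo) for p in patients]

stable = np.array([120.0, 121.0, 119.0, 120.0])
unstable = np.array([80.0, 140.0, 90.0, 130.0])
m2 = normalize([stable, unstable], method=2)  # both now span [0, 1]
m3 = normalize([stable, unstable], method=3)  # variations stay comparable
```

Under method 2 the stable patient's tiny oscillation is stretched to the full [0, 1] range, making it indistinguishable from the unstable one; under method 3 the stable series remains flat around 0.5 while the unstable one spans [0, 1], which is the behaviour argued for above.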
3.3 Preprocessing Data - Clinical Actual State Analysis
This section describes the steps taken to bring the data to its final form, suitable for the analysis concerning the prediction of the need for vasopressor agents given the state of the patient at a fixed time (punctual data). The big picture of this procedure can be consulted in Figure 3.9.

At this point, the filtered data, i.e., after applying the constraints, consists of measurements of the inputs and outputs and their timings. This poses a major problem: the data is not aligned, as there are differences in the sampling rate of each variable - not only across variables, but also across patients and their conditions - while it is desired to evaluate a certain state of the patient by having all the variables measured at a certain moment. The approach that was taken is described in [12], which proposes a method to align misaligned and unevenly sampled data.
Misaligned and unevenly sampled data refers to data that neither occur at the same time points nor
are equally spaced in time. The suggested alignment creates the need to calculate new values for the
new locations generated by the template variable, i.e. , by the variable with the highest sampling rate,
which is normally done through some form of interpolation.
In the present case, the Heart Rate samples, Heart Rate being the most frequently measured variable, were used as the template to align all the remaining variables. The alignment is performed by moving each measurement to the closest sample of the template variable, using time as the distance measure, as depicted in Figure 3.7.
[Figure legend: template variable; variable to align; variable after alignment; missing data after alignment.]

Figure 3.7: Alignment of misaligned and unevenly sampled data (inspired by [12]).
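A minimal sketch of this nearest-sample alignment follows; the variable names are hypothetical, and the choice of keeping the later measurement when two of them snap to the same template slot is an assumption not specified in the text:

```python
import numpy as np

def align_to_template(template_times, var_times, var_values):
    """Snap each measurement of a variable to the closest time point of
    the template variable (here, Heart Rate). Template slots that
    receive no measurement stay NaN; if two measurements map to the
    same slot, the later one wins (an assumption).
    """
    template_times = np.asarray(template_times, dtype=float)
    aligned = np.full(len(template_times), np.nan)
    for t, v in zip(var_times, var_values):
        idx = int(np.argmin(np.abs(template_times - t)))
        aligned[idx] = v
    return aligned

hr_times = [0.0, 0.5, 1.0, 1.5, 2.0]          # template (Heart Rate)
temp_times, temp_vals = [0.6, 1.9], [36.5, 37.0]
aligned = align_to_template(hr_times, temp_times, temp_vals)
# the measurement at 0.6 snaps to slot 0.5, the one at 1.9 to slot 2.0
```

The NaN slots left by this step are exactly the missing data that the imputation step of Section 3.3.1 later fills with a zero-order hold.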
Figure 3.8 shows in more detail what happens in the most extreme scenario present in the data: misaligned and unevenly sampled data combined with differences in the starting and ending times of collection. The latter is an issue because at least one measurement of each variable is needed; the solution is to remove all the data prior to the moment at which a measurement of every variable is available.
[Figure content: data collected before each variable has at least one measurement is not taken into account; from that moment on, preceding values are carried forward using a zero-order hold, and each measurement is aligned with the next sample of the template of the variable with the highest sampling rate.]

Figure 3.8: Alignment of the data for the punctual data analysis, using the variable with the highest sampling rate as template, covering all the case scenarios.
[Flowchart text not reproducible after extraction. Recoverable notes: *[0] Patient IDs run up to 32809, but some IDs are missing in between, leading to a total of 32535 patients. *[1] 677 (pancreatitis) + 3771 (pneumonia) − 4296 (pancreatitis or pneumonia) = 152 patients with both diseases. *[2] 1st ICU stay: interval between two consecutive measurements of the variable heart rate < 24 h. *[4] Conditions to be considered as class 1: (1) 6 hours of collected data before vasopressor administration; (2) length of vasopressor administration > 2 h; (3) interval between two vasopressor administration events < 1 h. *[5] Patients that had vasopressors but did not satisfy the conditions in *[4].]
Figure 3.9: Preprocessing steps and resulting datasets of the MIMIC II for the punctual data case.
3.3.1 Data Imputation
After the alignment, as stated before, one has to deal with missing data (Table 3.4). Similarly to what physicians do, it is assumed that the last measurement is the most accurate one for the time being, and a zero-order hold procedure was applied (Table 3.5).
Table 3.4: After alignment with Heart Rate.

Heart Rate   Temperature   General Var
123          NaN           NaN
 67          NaN            90
 69          NaN           NaN
 72          NaN            87
 74          NaN           NaN
 77           35           NaN
 78          NaN           NaN
 75           36            85
 71          NaN           NaN
Table 3.5: After imputation of data using ZOH.

Heart Rate   Temperature   General Var
123          NaN           NaN
 67          NaN            90
 69          NaN            90
 72          NaN            87
 74          NaN            87
 77           35            87
 78           35            87
 75           36            85
 71           36            85
Finally, all the data points that are still incomplete (contain NaN) after those steps are eliminated (Table 3.6). This data corresponds to the first stages of the ICU stay, when not all variables had been measured at least once.
Table 3.6: Removal of data due to lack of data.

Heart Rate   Temperature   General Var
rem.         rem.          rem.
rem.         rem.          rem.
rem.         rem.          rem.
rem.         rem.          rem.
rem.         rem.          rem.
 77           35            87
 78           35            87
 75           36            85
 71           36            85
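The two steps above (ZOH imputation followed by dropping the leading incomplete rows) can be sketched as follows; the toy table mirrors the structure of Tables 3.4-3.6 with hypothetical values:

```python
import numpy as np

def zoh_impute_and_trim(table):
    """Forward-fill each column with the last observed value
    (zero-order hold), then drop the leading rows that still contain
    NaN because some variable had not yet been measured.

    table: 2-D array-like, rows = aligned time points, cols = variables.
    """
    table = np.asarray(table, dtype=float).copy()
    for col in range(table.shape[1]):
        last = np.nan
        for row in range(table.shape[0]):
            if np.isnan(table[row, col]):
                table[row, col] = last  # carry the last measurement forward
            else:
                last = table[row, col]
    complete = ~np.isnan(table).any(axis=1)  # rows with every variable known
    return table[complete]

nan = np.nan
data = [[123, nan, nan],
        [ 67, nan,  90],
        [ 69, nan, nan],
        [ 77,  35, nan],
        [ 75,  36,  85]]
clean = zoh_impute_and_trim(data)  # only the last two rows are complete
```

The rows dropped by `complete` are exactly the early-stay rows removed in Table 3.6.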
The result of this approach is a set of rows of data, where each row represents the patient at a given time through its time-varying input variables and output variable. The output variable goes through the same process; with no added complexity, it is like another input variable that takes the value 1 if the drug is being administered and 0 if not. Typically, all patients contribute class 0 data (patients that never had vasopressors, plus the early stages of patients that will eventually need the medication), and only the patients that take vasopressors contribute class 1 data.

As a result of the aforementioned data preprocessing step, Table 3.7 shows the percentage of imputed data for each dataset.
Table 3.7: Percentages of imputation by dataset.
Data set % of imputation
PAN 74,84
PNM 75,19
BOTH 75,11
ALL 74,73
These high percentages are not unexpected, given the sampling rate differences with respect to the most frequently sampled variable, Heart Rate (which is used as the template). Eliminating variables with a low collection rate would greatly decrease the percentage of imputations. As an example, Table 3.8 shows the percentage of imputations for each variable in the dataset ALL.
Table 3.8: Percentages of imputations by input time-varying variable
variable #   1     2     3     4     5     6     7     8
% imp        0,00  0,66  0,22  0,22  0,03  0,04  0,93  0,91

variable #   9     10    11    12    13    14    15    16
% imp        0,90  0,94  0,94  0,95  0,95  0,95  0,94  0,94

variable #   17    18    19    20    21    22    23    24
% imp        0,90  0,94  0,70  0,70  0,96  0,96  0,90  0,90

variable #   25    26    27    28    29    30    31    32
% imp        0,90  0,91  0,95  0,95  0,97  0,95  0,45  0,35
3.3.2 Enabling the dataset for prediction purposes
Up to this point, the dataset is only able to tell whether at a given state the output is 1 or 0. Nevertheless, this information is not yet usable, since the aim is to predict a patient's transition to vasopressor dependence within a time window of 2 hours, for three main reasons: (i) a central line would only be inserted if the patient will in fact need it in the future (reducing the number of times this procedure is performed); (ii) the central line insertion protocol could be initiated with enough time, prior to the moment vasopressors are needed; (iii) it has been demonstrated that a delay in delivering the drug can drastically change the outcome.

Bearing this in mind, the process of transforming this direct input-output relationship into an input capable of predicting the output consists in shifting the output 2 hours, so that every input row reflects whether, in that given condition, the patient will be vasopressor dependent after the prediction window. An example of the procedure and its final result follows.

Consider that the second row in Table 3.9 contains the times at which each measurement occurred, comprising 12 instances (in this case, the times at which the heart rate was measured). One way to guarantee the 2-hour shift is to use Algorithm 2.
Since it is not possible to have the desired interval for every instance, due to the unevenly sampled rates, the prediction instead lies in the interval ]1, 3[ hours, and this algorithm excludes all the data that does
Algorithm 2 Shifting the output by 2 hours for prediction purposes.

 1: for i ← 1 to n − 1 do
 2:     Δ ← ∞
 3:     for j ← i + 1 to n do
 4:         δij ← |2 − (xj − xi)|
 5:     end for
 6:     if min(δi:) < 1 then
 7:         xnew_i ← find(δij == min(δi:))
 8:     else
 9:         xnew_i ← NaN
10:     end if
11: end for
12: return xnew
not have the capability to predict within the specified interval. Applying Algorithm 2 results in the last line of Table 3.9.
Table 3.9: Example of the output shifting procedure.

instance #         1   2   3     4     5   6     7     8    9    10    11    12
time (hours)       0   1   2     2,5   3   5     10    11   13   13,5  15    18
minimum distance   2   2   3     2,5   2   5     3     2    2    1,5   3     -
new time (hours)   2   3   NaN   5     5   NaN   NaN   13   15   15    NaN   NaN
As a result, instances [3, 6, 7, 11] are deleted due to the impossibility of shifting the output and having information about the drug intake after ]1, 3[ hours, as shown by the third row. Instance [12] is deleted as well, but because it is the last measurement, so there is no information about what happens after it. Tables 3.10 and 3.11 show an example of the data before and after applying the algorithm, respectively. The data could contain any number of input variables; the point here is that the input variables keep the same positioning while the output is shifted upwards by two hours (in reality, between ]1, 3[) whenever possible, and the lines that cannot be shifted are removed.
Table 3.10: Before shifting the output.

instance   hours   Input   Output
 1          0       78      0
 2          1       75      0
 3          2       71      0
 4          2,5     73      0
 5          3       78      0
 6          5       67      1
 7         10       68      1
 8         11       77      1
 9         13       79      1
10         13,5     74      1
11         15       72      1
12         18       71      1
Table 3.11: After shifting the output.

instance   hours   Input   Output
 3          2       78      0
 5          3       75      0
NaN        NaN     NaN     NaN
 6          5       73      1
 6          5       78      1
NaN        NaN     NaN     NaN
NaN        NaN     NaN     NaN
 9         13       77      1
11         15       79      1
11         15       74      1
NaN        NaN     NaN     NaN
NaN        NaN     NaN     NaN
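A Python sketch of this shifting step, using the absolute time gap to the 2-hour horizon with a tolerance of 1 hour so that the accepted prediction window lies in ]1, 3[ hours; the times and outputs reproduce the example of Table 3.9:

```python
import numpy as np

def shift_output(times, outputs, horizon=2.0, tol=1.0):
    """Shift the output `horizon` hours into the future.

    For each instance i, find the later instance j whose time gap to i
    is closest to `horizon`; accept it only if
    |horizon - (t_j - t_i)| < tol. Instances with no acceptable j
    (including the last one) are dropped.
    Returns (kept_indices, shifted_outputs).
    """
    kept, shifted = [], []
    n = len(times)
    for i in range(n - 1):
        gaps = np.array([abs(horizon - (times[j] - times[i]))
                         for j in range(i + 1, n)])
        j = int(np.argmin(gaps))
        if gaps[j] < tol:
            kept.append(i)
            shifted.append(outputs[i + 1 + j])  # future output label
    return kept, shifted

times = [0, 1, 2, 2.5, 3, 5, 10, 11, 13, 13.5, 15, 18]
outs  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
kept, y = shift_output(times, outs)
# instances 3, 6, 7, 11 and 12 (1-based) are dropped, matching Table 3.9
```

The kept indices correspond exactly to the rows that survive in Table 3.11.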
3.4 Preprocessing Data - Clinical State Evolution Analysis
This section describes the particularities of composing the datasets using the time feature, in order to capture the evolution of the patient during the ICU stay. The constraints differ from the previous case: here each patient is required to have at least two measurements of each variable (in the previous case only one was necessary) and, since demographic data is to be used, every patient must also have a record of it during the first ICU stay.

Following the same line of reasoning, where one must predict at least 2 hours in advance, all data collected from 2 hours before the initiation of the vasopressor intake until the end of the ICU stay was removed from the dataset, so that it is composed of a time series for each variable, as long as there is a minimum of two measurements per variable (a prerequisite for a time series).
A lower number of imputations is expected in this case, since all taken measurements are used before the time series starts being filled with estimated values using a ZOH approach. Figure 3.10 shows the major steps towards the final datasets. To clarify in more detail what is described in the bottom boxes of the flowchart, a brief description follows with respect to the PAN dataset, which can be extended to all datasets.
“No data before 1st admin: -80” - At the moment of the first intake of vasopressors there was no information on all input variables, removing 80 patients from the dataset.

“Deceased class 0 patient: -4” - There were 4 class 0 patients that died during the first ICU stay, and they were discarded from the dataset.

“Less than 2 entries for any variable: -33” - At least two measurements of each temporal variable are required; 33 cases did not meet this constraint and were discarded (note that these filters are applied in the order presented here, so this happens after removing patients that do not have at least one measurement).

“No data for all vars 2h before admin: -11” - This could be combined with the filter immediately above; it refers to patients that do not have two measurements of a temporal variable once the data collected during the prediction window - from two hours before the vasopressor intake to the moment of administration - is neglected (note that at this point the dataset is already able to predict 2 hours in advance, since the data in this time window is removed). There are 11 such cases.

“Missing static data: -4” - There are 4 patients that do not have all the static input data available, and they were discarded as well.
[Flowchart text not reproducible after extraction. Recoverable per-dataset filter counts (no data before 1st admin. / deceased class 0 patient / less than 2 entries for any variable / no data for all vars 2 h before admin. / missing demographic-static data, followed by the resulting class 1 / class 0 patient counts): −80 / −4 / −34 / −11 / −4 → 22 / 55 (PAN); −536 / −37 / −162 / −60 / −13 → 112 / 202; −595 / −41 / −187 / −68 / −14 → 124 / 236; −2127 / −130 / −649 / −183 / −48 → 269 / 676 (ALL).]
Figure 3.10: Preprocessing steps and resulting datasets of the MIMIC II for the time-series data case.
After filtering the data to include only what fulfils the requirements, the data needs to be organized so that it is structurally equal for all patients. Since there is a different number of measurements for each variable and for each patient, and due to computational constraints, it is desirable to limit the quantity of data being dealt with.
3.4.1 Data imputation
The algorithm that will be used to analyse the time-series dataset requires all time-varying data to have the same length. Since there are variables with only two measurements and different lengths are to be studied, the problem of imputation arises. Among the many solutions for imputation, the one used here is the one that most closely resembles the approach taken by physicians; it has the advantage of using the most recently recorded data and of treating each variable independently: there is no requirement to align the measurements.

The logic behind this method is that the sampling rate of each variable is related to how fast it changes, and aligning the time-series variables with each other would require a template, similar to what was done for the punctual state analysis. Using a template in this case would largely increase the number of imputations and would remove the evolution of some time series (they would maintain a constant value). Another drawback is that at least as many measurements as the considered length would be needed and, akin to the previous approach, all data collected before the moment of having at least one measurement of each variable would be unused. For example, considering a vector length of 10 as a template, at least 10 measurements of the template variable would be needed (not hard to achieve if the template variable has a high sampling rate); if only two measurements of another variable are available and the first one occurs after the first considered measurement of the template variable, there would be empty measurement slots and that patient would have to be discarded. Two cases would have to be considered:
Select the Heart Rate as template variable - This would result in variables that have only one real measurement, the rest being imputations of that same measurement, due to their inherently low sampling rate compared with that of the Heart Rate. One way to mitigate this is to increase the number of measurements considered but, as mentioned, the upper limit is ten in order to keep the problem computationally viable.

Select a pre-defined time between measurements - Due to the different sampling rates of each variable, selecting a pre-defined time between samples would be a difficult balancing problem (it could also be done by selecting another variable as template). In the punctual state case there was a reason to select the Heart Rate as template: it is the variable with the highest sampling rate. In this case the choice would be 'something in the middle', so as not to favour the variables whose sampling rates are closest to the selected one - which would otherwise be the variables with the fewest imputations and the most accurate measurements, without any reason for it. Such an approach would also discard a considerable amount of data that is more up-to-date than what would be used instead.
To avoid such problems, the simplest approach was taken: consider the last measurements of each variable and, whenever there is an insufficient number of measurements, fix the times of the first and last measurement of the considered variable, divide the time in between into x equal parts (x being the considered length) and impute the data using the ZOH approach, taking into account the times at which all measurements were taken. This process of imputation and of selecting the data that constitutes the time series is shown in Figure 3.11.
[Figure content: for each feature, if at least 10 records exist, the 10 last records are used; otherwise the interval between the first and last measurement is divided into 10 equal parts and filled by ZOH. The considered span is the ICU stay length for class 0 patients, or the ICU stay length until 2 hours before vasopressor administration for class 1 patients, which has to exceed 6 hours for class 1 patients.]

Figure 3.11: Procedure to adapt the real measurements to a vector of length 10 for all variables, including data imputation when needed.
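The two branches of the procedure (keep the 10 last records, or stretch a shorter series over 10 equally spaced points with a zero-order hold) can be sketched as follows; the toy series is illustrative:

```python
import numpy as np

def to_fixed_length(times, values, length=10):
    """Reduce a patient's time series to a fixed-length vector.

    If there are at least `length` measurements, keep the last
    `length` ones. Otherwise, split [first, last] measurement times
    into `length` equally spaced points and fill each with the most
    recent real measurement (zero-order hold).
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    if len(values) >= length:
        return values[-length:]
    grid = np.linspace(times[0], times[-1], length)
    # index of the last measurement taken at or before each grid time
    idx = np.searchsorted(times, grid, side='right') - 1
    return values[idx]

# A variable with only 4 measurements, stretched to length 10
t = [0.0, 3.0, 6.0, 9.0]
v = [1.0, 2.0, 3.0, 4.0]
vec = to_fixed_length(t, v, length=10)
# each real value is repeated until the next measurement arrives
```

Because each variable is resampled over its own [first, last] interval, no cross-variable alignment template is needed, which is the design argued for above.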
The aforementioned approach leads to the imputation percentages shown in Table 3.12, which are far better than the percentages of the punctual data approach.
Table 3.12: Percentages of imputation given the vector length.
                     % imputation
vector length   PAN     PNM     BOTH    ALL
 2              0       0       0       0
 4              1.8     1.78    1.83    2.55
 6              5.08    4.49    4.66    6.48
 8              9.19    7.30    7.68    10.62
10              12.93   10.22   10.74   14.62
For the purpose of clarification, Table 3.13 shows the percentage of imputation and the number of patients that would be removed in order to have all the data aligned with an equally spaced time template with intervals of 1, 2, 6 and 12 hours.
Table 3.13: Considering an interval between measurements of x hours (for the ALL dataset).
vector length intervals between measurements
1h 2h 6h 12h
2 37.94 (7 rem.) 35.92 (15 rem.) 31.02 (42 rem.) 21.79 (84 rem.)
4 61.05 (19 rem.) 55.15 (42 rem.) 41.84 (110 rem.) 32.19 (205 rem.)
6 67.13 (38 rem.) 60.45 (71 rem.) 48.84 (174 rem.) 34.86 (316 rem.)
8 70.80 (50 rem.) 62.61 (94 rem.) 49.60 (242 rem.) 36.23 (415 rem.)
10 73.16 (59 rem.) 64.51 (109 rem.) 51.76 (292 rem.) 37.06 (487 rem.)
The results in this table are easily interpretable. As the interval between two measurements increases, the percentage of imputation is reduced, because variables with a low sampling rate have an increased chance of appearing as real measurements and not as imputations. As the length of time needed for each patient goes up (by increasing the vector length, the interval between measurements, or both), the number of removed patients increases, because fewer and fewer patients have a stay long enough to be considered. At first it might look inconsistent to have a different number of patients removed: for instance, the (length, interval) combinations (2, 2) and (4, 1) might be expected to remove the same number of patients, since the considered interval appears to be the same, 4 hours. Nevertheless, the last time point is fixed at the last considered measurement, so in the case (2, 2) the patients need 2 hours of information for all variables, whereas in the case (4, 1) only 3 hours are needed, because the length is the number of points and not the number of intervals (intervals = number of points − 1).
As stated, this approach would decrease the volume of the dataset, hence the reason to stick to the former one, where there are fewer imputations (more dynamics included), more patients are taken into account, and only the most recently collected data is considered.
3.4.2 Enabling the dataset for prediction purposes
In this case there is no need for any algorithm to shift the data. Each patient has a row of data for each variable, corresponding to its evolution over time. Since the objective is to predict the need for vasopressors in a timely manner, the data collected in the interval of 2 hours before the administration is discarded, as shown in Figure 3.12.
Figure 3.12: How the data is enabled for prediction purposes.
Chapter 4
Results and Discussions
In the present chapter the main results obtained through the methodology described in chapter 2 are
presented.
First, the results concerning the punctual data analysis are presented and discussed; there it is
shown that the way the raw database was handled provides better results when compared to previous
studies on the same subject. The chapter starts with the unsupervised clustering analysis in Section 4.1.1,
followed by Section 4.1.2, where the results for feature selection are discussed, considering both overall
performance and single models' performance. Finally, in Sections 4.1.3 and 4.1.4, the model assessment
results covering both overall and singular models' performance are presented and discussed.
Next, the results concerning the time-series analysis coupled with static data are approached. First,
the results for feature selection are shown along with the fixed parameters; it is shown that the static
variables have their place side by side with the time-series, and that the normalization of the data
should be selected consciously according to the dataset under study. The model assessment part follows,
where the results obtained through the four methods are shown and compared.
4.1 Ensemble Modelling - Punctual Data Results
4.1.1 Unsupervised Clustering Validation
As presented previously, the multimodel approach consists of two stages of clustering. First the
data is divided using unsupervised clustering, and then supervised clustering is performed on each of
the clusters that resulted from the former partition. To avoid a grid search approach, which would be
computationally heavier, cluster validation analysis was performed to find the most suitable parameters
for the unsupervised clustering algorithm, FCM.
The unsupervised clustering validation is performed in two steps: evaluation of the clustering
validation indexes (analytical methods) and a study of the distribution of the data points along the
clusters. The latter turned out to be a necessity, since the results attained by the formal analytical
approach are not conclusive about one of the parameters of the FCM algorithm,
the number of unsupervised clusters Kc. As a result, the other parameter, the fuzziness degree m, is
defined using the clustering validation methods, whereas an analysis of the distribution of the
data points is conducted in order to arrive at an optimal number of clusters Kc.
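For reference, the FCM updates whose parameters the validation study tunes (fuzziness m and number of clusters Kc) can be sketched in a few lines of NumPy. This is a minimal illustrative implementation of the standard alternating updates, not the code used in the thesis:

```python
import numpy as np

def fcm(X, Kc, m=1.4, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch: returns prototypes V (Kc x d)
    and the membership matrix U (Kc x N, columns sum to 1)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((Kc, N))
    U /= U.sum(axis=0)                                  # valid fuzzy partition
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)    # prototype update
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        w = d ** (-2.0 / (m - 1.0))                     # membership update
        U_new = w / w.sum(axis=0)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return V, U

# Toy usage: two well-separated groups of 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
V, U = fcm(X, Kc=2, m=1.4)
print(np.round(V, 1))  # prototypes near (0, 0) and (5, 5), in either order
```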
Due to the time-consuming task of obtaining an output from the analytical methods, it is desirable to
reduce the quantity of input data and work with a subset of each dataset that represents the whole
set. Rather than reducing the datasets indiscriminately, the reduction of data volume followed the same
steps as in Sections 4.1.2 and 4.1.3 to obtain the training set of data, where feature selection and model
assessment are conducted, respectively. The idea is to keep the same amount of data for clustering vali-
dation as will be presented to train the models. The stages to obtain the subsets for clustering
validation are depicted in Figure 4.1, where the dataset PAN serves as an example. Summarizing these
steps:
1. The data is divided into two subsets containing 50% of the data each, while keeping the same
distribution of classes: this replicates the division of the data into a Feature
Selection (FS) subset and a Model Assessment (MA) subset.

2. Those 50% of the data are divided into 10 folds (10-fold cross-validation being the method used
here) and 9 of the folds are then grouped, leaving one fold out (test data), to replicate the
amount of data that will be used to train the models.

3. In order to avoid a highly skewed subset (as shown in Table 4.1), the data is balanced to
70% of class 0 and, consequently, 30% of class 1. This reproduces what is performed
for the training data sets (the test data keeps the real balance of the initial dataset).
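Step 3 (undersampling class 0 down to a 70/30 balance) can be sketched as below. The function name and the plain random undersampling are illustrative assumptions; in the study this is applied per training fold:

```python
import numpy as np

def balance_70_30(X, y, seed=0):
    """Undersample class 0 so the result keeps roughly 70% class 0 and
    30% class 1 (assumes class 1 is the minority class, as in Table 4.1)."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    n0 = int(round(len(idx1) * 70 / 30))        # class-0 count for a 70/30 split
    keep0 = rng.choice(idx0, size=min(n0, len(idx0)), replace=False)
    keep = np.concatenate([keep0, idx1])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy example with a 1000/100 class imbalance.
X = np.arange(1100).reshape(-1, 1)
y = np.array([0] * 1000 + [1] * 100)
Xb, yb = balance_70_30(X, y)
print((yb == 0).sum(), (yb == 1).sum())  # 233 100
```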
Table 4.1: Classes percentages in each dataset.

Dataset   Class 0 samples   Class 1 samples   % Class 0   % Class 1
PAN             25149             3409           88,06       11,94
PNM            111049            10931           91,04        8,96
BOTH           124473            13214           90,40        9,60
ALL            255777            34640           88,07       11,93
With the problem of the high-volume datasets addressed, the analysis of the clustering validation
follows.
[Figure 4.1 flowchart: the PAN dataset (28553 samples, normalized data) is split into its 25149 class 0
and 3409 class 1 samples; each class is reduced by 50% (~12575 class 0, ~1705 class 1), divided into
10 folds, the 9 training folds are kept (~11318 class 0, ~1535 class 1), and the data is then balanced
to 70% class 0 / 30% class 1, giving a final subset of ~3582 class 0 and ~1535 class 1 samples.]

Figure 4.1: Example, on the PAN dataset, of obtaining the subset data for clustering validation.
Clustering Validation based on analytical methods
In an attempt to solve the problem of finding the optimal parameters for the unsupervised FCM
algorithm (fuzziness parameter m and number of clusters Kc) a search through the methods presented
in Section 4.1.1 was performed for all datasets. The whole results of these methods are presented under
appendix D, whereas only the results from the dataset ALL are presented here (Table 4.2 since there is
no contrast between datasets to discuss each separately so the following analysis can be extended to
the missing datasets too).
It is easily observed that the best scores are attained using the lowest m. This comes as no surprise,
since most indexes penalize high degrees of fuzziness and overlapping, and a low m delivers low degrees of
both. Only the methods designed for hard clustering (DI and ADI) do not penalize the fuzziness degree
and look only at the final result of the clustering, evaluating inter- and intra-cluster distances. The major
differences occur for PC, SC and S, where m = 1.4 delivers better scores.
Table 4.2: Clustering validation indexes for dataset ALL, performed 10 times with different partitions. (+) means a
higher value is better and (-) means the opposite.

m     Index    Kc=2                   Kc=3                   Kc=4                   Kc=5                   Kc=6
1.4   PC(+)    7,58E-01 ± 1,17E-16    6,44E-01 ± 0,00E+00    5,74E-01 ± 1,17E-16    5,25E-01 ± 1,17E-16    4,88E-01 ± 1,17E-16
      CE(-)    6,93E-01 ± 1,17E-16    1,10E+00 ± 2,34E-16    1,39E+00 ± 2,34E-16    1,61E+00 ± 4,68E-16    1,79E+00 ± 0,00E+00
      SC(-)    8,53E+06 ± 1,96E-09    9,19E+06 ± 1,96E-09    5,95E+06 ± 9,82E-10    7,57E+06 ± 9,82E-10    3,25E+06 ± 4,91E-10
      S(-)     1,64E+02 ± 3,00E-14    2,67E+02 ± 5,99E-14    1,70E+02 ± 3,00E-14    2,15E+02 ± 5,99E-14    9,37E+01 ± 1,50E-14
      XB(-)    2,12E+00 ± 4,68E-16    1,80E+00 ± 4,68E-16    1,61E+00 ± 4,68E-16    1,47E+00 ± 2,34E-16    1,37E+00 ± 2,34E-16
      DI(+)    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00
      ADI(+)   5,19E-03 ± 9,14E-19    5,18E-03 ± 9,14E-19    5,17E-03 ± 0,00E+00    5,16E-03 ± 9,14E-19    5,14E-03 ± 9,14E-19
1.7   PC(+)    6,16E-01 ± 1,17E-16    4,63E-01 ± 5,85E-17    3,79E-01 ± 5,85E-17    3,24E-01 ± 5,85E-17    2,85E-01 ± 5,85E-17
      CE(-)    6,93E-01 ± 0,00E+00    1,10E+00 ± 2,34E-16    1,39E+00 ± 2,34E-16    1,61E+00 ± 0,00E+00    1,79E+00 ± 0,00E+00
      SC(-)    1,17E+10 ± 2,01E-06    1,52E+10 ± 0,00E+00    1,60E+10 ± 0,00E+00    6,45E+09 ± 0,00E+00    2,48E+09 ± 5,03E-07
      S(-)     2,26E+05 ± 6,14E-11    4,77E+05 ± 6,14E-11    4,78E+05 ± 1,23E-10    2,12E+05 ± 0,00E+00    7,24E+04 ± 0,00E+00
      XB(-)    1,72E+00 ± 2,34E-16    1,30E+00 ± 0,00E+00    1,06E+00 ± 0,00E+00    9,07E-01 ± 1,17E-16    7,99E-01 ± 1,17E-16
      DI(+)    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00
      ADI(+)   5,22E-03 ± 0,00E+00    5,22E-03 ± 9,14E-19    5,22E-03 ± 9,14E-19    5,22E-03 ± 9,14E-19    5,22E-03 ± 9,14E-19
2.0   PC(+)    5,00E-01 ± 1,17E-16    3,33E-01 ± 5,85E-17    2,50E-01 ± 5,85E-17    2,00E-01 ± 5,85E-17    1,67E-01 ± 0,00E+00
      CE(-)    6,93E-01 ± 0,00E+00    1,10E+00 ± 2,34E-16    1,39E+00 ± 2,34E-16    1,61E+00 ± 2,34E-16    1,79E+00 ± 2,34E-16
      SC(-)    4,12E+10 ± 0,00E+00    1,13E+10 ± 2,01E-06    1,50E+10 ± 0,00E+00    6,45E+09 ± 0,00E+00    3,27E+09 ± 0,00E+00
      S(-)     7,93E+05 ± 1,23E-10    3,56E+05 ± 6,14E-11    4,51E+05 ± 6,14E-11    2,12E+05 ± 0,00E+00    9,68E+04 ± 1,53E-11
      XB(-)    1,40E+00 ± 0,00E+00    9,33E-01 ± 1,17E-16    7,00E-01 ± 1,17E-16    5,60E-01 ± 0,00E+00    4,67E-01 ± 5,85E-17
      DI(+)    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,04E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    8,98E-03 ± 1,83E-18
      ADI(+)   5,22E-03 ± 0,00E+00    5,22E-03 ± 9,14E-19    5,22E-03 ± 0,00E+00    5,22E-03 ± 0,00E+00    5,22E-03 ± 9,14E-19

PC - This index measures the undesired amount of overlapping between clusters, and as m increases
a higher overlap is expected. This method therefore penalizes higher values of the m parameter,
compared with smaller ones, by delivering a lower score. A higher number of clusters is also
expected to decrease the score, since the distances between clusters become smaller, thus
increasing the overlapping.
CE - Similarly to PC, this method also penalizes the fuzziness of the cluster partitions. Although this is
not evident from its equation (2.28), since m has no direct influence on this score, a higher m will still
even out the membership degrees µij, reducing the confidence in the classification based on the
partitioning; thus higher scores would be expected for higher values of the m parameter. However,
this does not occur.
SC - In the SC equation (2.29), since a better partitioning corresponds to a lower value of SC, it
is desirable to have a low numerator (which measures compactness) and a high
denominator (which measures separation). The parameter m impacts the numerator both directly and
indirectly, and the smaller the m the smaller the numerator, whereas the denominator is strictly
related to the prototypes' positions. While the prototypes' positions change with m, it is shown (in
Appendix G) that for slight variations of m (such as the ones tested in these methods) the change in
the cluster centres can be neglected; the denominator can therefore be treated as constant,
and the numerator outweighs these minor differences in the denominator, hence the best results for
smaller m.
S - This method clearly penalizes a higher number of clusters through the denominator of equation (2.30):
a higher number of clusters leads to smaller inter-cluster distances. Nevertheless, changing
m, as stated above, does not considerably affect the positioning of the prototypes, and, similarly to
the preceding method, a lower m returns a lower value for the numerator of S, which is the desired
result; so the smaller the m, the better the score should be.
XB - The numerator of this method is the same as in S, and the denominator also changes with the cluster
centres, differing only in the way it measures the separation; so when it comes to defining the
best value for the parameter m, the comments made for S apply. Nonetheless, the results
show improvements for higher values of m.
DI and ADI - In these two methods, designed for the validation of hard clustering algorithms, the only
elements that play a role are the data points and the cluster centres. The only influence of the
parameter m is on the positioning of the clusters, and, as shown in Appendix G, these differences can
be neglected. The scores obtained by these two methods also serve as evidence that the
parameter m has a small influence on the positioning of the prototypes (and subsequently on the
division of the data points), as the scores remain approximately constant while changing m. The same
is observed when increasing the number of clusters.
The CE and XB indexes do not follow the same trend as PC, SC and S. As mentioned, these measures
are not deterministic, and an overall balance between complexity, versatility (to avoid over-fitting) and
quality of the partitioning, assessed using the validation measures, had to be struck.
After this study, and since more indexes agree on a lower m, it is clear that better scores can be
achieved by lowering this parameter. One could expect to reach even better scores by testing
smaller values of m; nevertheless, since the lowest value tested was m = 1.4, this study proceeds
with the parameter fixed at that value. However, there is no clear evidence regarding the best number
of clusters. Most methods increase or decrease monotonically with the number
of clusters, so a high number of clusters should be avoided in the absence of a clear peak to settle on.
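Two of the indexes discussed above, PC and CE, are simple functions of the membership matrix U. The sketch below is illustrative (U of shape Kc × N); its toy values coincide with the Kc = 2, m = 2.0 entries of Table 4.2, which are exactly what a fully uniform membership matrix would give:

```python
import numpy as np

def partition_coefficient(U):
    """PC (higher is better): 1 for a crisp partition, 1/Kc for
    completely uniform memberships."""
    return float((U ** 2).sum() / U.shape[1])

def classification_entropy(U, eps=1e-12):
    """CE (lower is better): 0 for a crisp partition, ln(Kc) for
    completely uniform memberships."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[1])

# Uniform memberships over 2 clusters: PC = 0.5, CE = ln 2 ~ 0.693.
U = np.full((2, 1000), 0.5)
print(partition_coefficient(U), classification_entropy(U))
```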
Clustering Validation based on data distribution
The indecision about the optimal number of clusters called for a distribution analysis focused on the
class distribution (Figures 4.2(a) and 4.2(c)), the disease distribution (Figures 4.2(b) and 4.2(d)), the
volume distribution along the clusters, and the inter-cluster distances (Tables 4.3 and 4.4)¹.

¹ In the main text only the figures related to the dataset ALL are presented and analysed; however, these observations can be
extended to all datasets. The exception is the PAN dataset, which shows that a third cluster might be viable, as can be seen in
appendix E; the study nevertheless proceeded without considering that exception.
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure 4.2: Distribution along clusters for the dataset ALL.
Table 4.3: Euclidean distances between clusters (Kc = 2).
ALL cluster 1 cluster 2
cluster 1 - 4,30E-04
cluster 2 4,30E-04 -
Table 4.4: Euclidean distances between clusters (Kc = 3).
ALL cluster 1 cluster 2 cluster 3
cluster 1 - 6,66E-04 7,98E-05
cluster 2 6,66E-04 - 5,90E-04
cluster 3 7,98E-05 5,90E-04 -
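The inter-cluster (prototype-to-prototype) Euclidean distances reported in Tables 4.3 and 4.4 can be computed as follows; a small illustrative sketch:

```python
import numpy as np

def prototype_distances(V):
    """Pairwise Euclidean distances between cluster centres (rows of V),
    as used for the inter-cluster analysis in Tables 4.3 and 4.4."""
    diff = V[:, None, :] - V[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy prototypes: a 3-4-5 triangle gives an off-diagonal distance of 5.
V = np.array([[0.0, 0.0], [3.0, 4.0]])
print(prototype_distances(V))
```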
The trend observed when adding one more cluster (Kc = 2 to Kc = 3) is repeated as more clusters are
added (Kc = 4, 5, 6, ...), and the following is observed for each subsequent added cluster:

Volume distribution - The new cluster has a considerably lower volume than two of the existing clusters,
and an even lower one than the "previous" added cluster.

Classes distribution - The new cluster tends to mimic the class distribution of one of
the two higher-volume clusters.

Diseases distribution - Same behaviour as for the class distribution.

Inter-cluster distances - The added cluster is at least 10 times closer to one of the two high-volume
clusters than those two clusters are to each other.
Some conclusions can be drawn from this information. Knowing that the volume of a third cluster is
much lower than the other two does not, per se, allow one to conclude much; but combined with the
fact that the smaller cluster is at least 10 times closer to one of the clusters than the distance between
the two main clusters, it shows that the new cluster is "stealing data" from one of the higher-volume
clusters. The task of dividing the unsupervised clusters internally belongs to the fuzzy modelling,
which performs supervised clustering in order to define the antecedent part of the rules. Note that by
setting a high number of clusters Kc it might be possible to reach an even distribution of data (this was
tested with Kc = 8 and such behaviour was not achieved), although that would bring repercussions
during the modelling stage due to the lack of data: insufficient data would be seen during the training
stages, leading to over-fitting problems, and the trustworthiness of the validation would be compromised.
Given these observations, it seems appropriate to assume that the best number of clusters Kc is two.
Inspecting Figure 4.2, it is clear that the distribution is not as expected: the clustering is not grouping
data based on diseases nor on classes. Instead, it is partitioning the data based on volumes of
data points that show no correlation to the diseases, at least when using the FCM algorithm. This
does not mean that other clustering algorithms could not make the distinction between diseases or
classes; for instance, the Gustafson-Kessel (GK) clustering algorithm might be more suitable,
since the differentiation in physiological variables between the diseases might only be noticeable for a
few selected variables. The FCM algorithm acts in every direction (every feature) uniformly, assuming
the hypothesis that clusters are spherical, whereas the GK algorithm associates with each cluster its own
centre and covariance, allowing it to identify ellipsoidal clusters. However, the GK algorithm is out of
the scope of this thesis and will not be covered in more detail. Another possibility is to conduct this
study with FCM after reducing the number of variables to the most informative ones.
4.1.2 Fixing parameters of the models & Feature Selection
The datasets were equally divided into two groups: a feature selection (FS) group, used to
find the best combination of parameters and, furthermore, to obtain the best set of features given some
criterion, and a model assessment (MA) group, whose function is to evaluate the prediction power of the
models with the parameters and features fixed using the other half of the data (FS). The
procedure is the same depicted in Figure 4.1: the data is reduced by 50% and then divided
into 10 folds for 10-fold cross-validation, in which the 9 training folds are balanced to 70% of class 0
and 30% of class 1, while the remaining fold (used for testing the trained models) keeps the original
balance².
² Due to the low percentage of class 1 samples, the classes were balanced for all datasets to respect the percentages of 70%
for class 0 and 30% for class 1. A better approach would be a grid search in which different balances were tested by carrying
the whole study through to the end for each of them; however, that is unattainable due to time restrictions. The idea of balancing
to those percentages is to guarantee a higher presence of class 1 data while keeping, to some extent, the original property that
class 0 is predominant. This shows the models more class 1 samples during the training stage. Doing this
For fixing the parameters (number of clusters cn and fuzziness parameter m) and selecting the most infor-
mative features, only the data pertaining to the FS part is used; this way, neither the parameters nor the
features are over-fitted to the data that will be used to assess the models (the MA group), which allows
more confidence in the results obtained in the model assessment.
This procedure has two stages: one to fix the parameters cn and m, and another to select the
features.
Fixing parameters for:

1. FS based on overall performance - In order to fix the FCM parameters, a grid search was
performed ranging over [2,6] for cn and [1.4,2] for m (using the same random seed to create
the initial partition matrix and the train and test sets), and the combination that yielded the best
AUC was selected.

2. FS based on singular models' performance - In this case, more parameters had to be fixed:
the unsupervised cluster centres (including all features), the fuzziness parameter for each
group of data (m1 and m2) and the number of clusters for each model (nc1 and nc2). Since
the feature selection is performed for each cluster of data individually, each subset of most
predictive features is fitted to a particular volume of data/feature space, hence the ne-
cessity to fix the prototypes' positions. The criterion used to decide the best parameters in this
scenario is the maximization of the index (4.1).
[% of data in cluster 1]× [AUC cluster 1] + [% of data in cluster 2]× [AUC cluster 2] (4.1)
It is known that smaller sets of data lead to high variation in performance measures, and
this approach discourages unbalanced partitions, preventing a smaller cluster from having the same
weight in the final decision. To emphasize the importance of a balanced criterion, here
is an exaggerated example: assume that the first cluster has only 10 points (per 10-fold run) and
classifies well every time, with an AUC of 1, while the other classifier performs randomly, with an AUC
of 0.5 on 10000 samples. An unweighted average would give 0.75 AUC, even though the classifier
performed randomly most of the time.
A grid search was performed ranging over [2,6] for cn and [1.4,2] for m, testing every possible
combination of these parameters for the two models.
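Index (4.1) itself is a one-line computation. The sketch below is illustrative and replays the exaggerated example from the text, showing that the volume-weighted score stays near 0.5 instead of the misleading 0.75 an unweighted average would give:

```python
def weighted_auc_score(frac1, auc1, frac2, auc2):
    """Index (4.1): volume-weighted AUC over the two unsupervised clusters."""
    return frac1 * auc1 + frac2 * auc2

# The example from the text: 10 samples with AUC 1.0 versus
# 10000 samples with AUC 0.5 (a random classifier).
f1 = 10 / 10010
print(weighted_auc_score(f1, 1.0, 1 - f1, 0.5))  # ~0.5005, not 0.75
```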
Feature selection for:
1. FS based on overall performance - After fixing cn and m, the feature selection was run 50 times
with that combination of parameters. The combination of features that occurred most often was
selected as the best subset of features, bringing more information to the classification;
is risk-free due to the volume of the datasets: it can be done without jeopardizing the statistical value of the results. This was
done for the clustering validation and for the training stage in the modelling part; the test part of the modelling kept
the real percentages.
in this way the variability can be reduced. When there was no repetition in those 50 runs, the
set of features with the best AUC was selected.
2. FS based on singular models' performance - The features were fixed at the same time as
the parameters, since changing the seed would lead to different partitions and, consequently,
to feature sets that are unrelated across seeds.
A 10-fold cross-validation procedure was used to validate the results. The
parameters that resulted from the first step (fixing parameters) are shown in appendix F, since they are
not particularly relevant for analysis purposes. With all the parameters fixed for each criterion, feature
selection is performed and the results are shown and discussed in the following sections.
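The greedy sequential forward selection (SFS) procedure referred to in Tables 4.26 - 4.29 can be sketched generically as follows. Here `score_fn` stands for whatever wrapper criterion is used (e.g., cross-validated AUC of the fuzzy models), and the simple stopping rule is an illustrative assumption:

```python
import numpy as np

def sfs(score_fn, n_features, max_features=None, tol=0.0):
    """Greedy sequential forward selection: at each step, add the feature
    that most improves score_fn(selected), and stop when no candidate
    improves the current best score by more than `tol`."""
    selected, best_score = [], -np.inf
    max_features = max_features or n_features
    while len(selected) < max_features:
        scores = {f: score_fn(selected + [f])
                  for f in range(n_features) if f not in selected}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score + tol:
            break
        selected.append(f_best)
        best_score = scores[f_best]
    return selected, best_score

# Toy scorer: subsets containing features 0 and 2 are best,
# with a small penalty per extra feature.
score = lambda feats: len(set(feats) & {0, 2}) - 0.01 * len(feats)
print(sfs(score, n_features=5))  # selects features 0 and 2, then stops
```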
1. Feature Selection based on overall performance
The purpose of the histograms and selection order of the features is most relevant to know which
variables are predominant observing the ones that occur more often and the ones that are selected first
since at some stage the order is more or less random, showing that those variables are not predominant
and add little to the prediction performance. In this study this happens a lot and is in accordance with
the fact that using all variables conduct to better results as it will be shown later.
Histograms showing the number of times that each feature was selected are shown in appendix H.
Below, tables show the most selected feature for each selection position, for each dataset. The
frequency was computed by dividing the number of times the mentioned feature was selected by the
number of times any feature in that position was selected, i.e., feature selection runs that did not reach
a certain number of features are not counted in the frequency computation.
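The frequency just described can be computed as follows; an illustrative sketch in which each run is the ordered list of features it selected:

```python
from collections import Counter

def position_frequency(runs, position):
    """Frequency of the most selected feature at a given 0-based position,
    counting only the runs that reached that position (as in Tables 4.5-4.24)."""
    at_pos = [run[position] for run in runs if len(run) > position]
    feature, count = Counter(at_pos).most_common(1)[0]
    return feature, count / len(at_pos)

# Toy example: four runs; the second run stopped after two features.
runs = [[29, 26, 19], [29, 26], [29, 12, 19], [29, 26, 13]]
print(position_frequency(runs, 1))  # (26, 0.75)
```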
Table 4.5: Most selected features for single model for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 29 26 19 19 13 7
Frequency 1 0,92 0,36 0,22 0,36 0,25
Table 4.6: Most selected features for a priori for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 29 26 19 12 3 3
Frequency 1 1 0,88 0,68 0,38 0,48
Table 4.7: Most selected features for a posteriori for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 17 12 13 13 29 23
Frequency 1 0,5 0,38 0,45 0,59 0,14
Table 4.8: Most selected features for arithmetic mean for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 19 3 29 4 7
Frequency 1 0,54 0,66 0,48 0,32 0,22
Table 4.9: Most selected features for distance-weighted mean for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 29 26 19 3 4 7
Frequency 1 1 0,84 0,74 0,26 0,24
Table 4.10: Most selected features for single model for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 19 29 13 12 12
Frequency 1 0,64 0,76 0,46 0,4 0,35
Table 4.11: Most selected features for a priori for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 13 12 20 29
Frequency 1 1 0,94 0,94 0,58 0,64
Table 4.12: Most selected features for a posteriori for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 20 8 14 14 5
Frequency 0,94 0,72 0,26 0,17 0,4 0,38
Table 4.13: Most selected features for arithmetic mean for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 19 29 3 13 12
Frequency 1 0,78 0,72 0,6 0,74 0,66
Table 4.14: Most selected features for distance-weighted mean for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 13 12 19 29
Frequency 1 0,5 0,72 0,72 0,42 0,52
Table 4.15: Most selected features for single model for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 29 12 20 7
Frequency 1 0,90 0,56 0,34 0,30 0,23
Table 4.16: Most selected features for a priori for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 29 20 12 12
Frequency 1 1 1 0,84 0,52 0,48
Table 4.17: Most selected features for a posteriori for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 19 27 22 2 27
Frequency 0,94 0,5 0,43 0,2 0,19 0,18
Table 4.18: Most selected features for arithmetic mean for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 20 3 29 13 12
Frequency 1 0,58 0,58 0,58 0,44 0,56
Table 4.19: Most selected features for distance-weighted mean for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 29 20 13 12
Frequency 1 1 1 0,98 0,58 0,64
Table 4.20: Most selected features for single model for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 25 4 12 13 15
Frequency 1 1 1 0,92 0,88 0,46
Table 4.21: Most selected features for a priori for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 25 4 12 13 10
Frequency 1 1 0,82 0,82 0,8 0,24
Table 4.22: Most selected features for a posteriori for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 20 4 4 7 13 28
Frequency 0,56 0,64 0,37 0,32 0,22 0,21
Table 4.23: Most selected features for arithmetic mean for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 4 12 20 25 15 25
Frequency 0,54 0,28 0,13 0,16 0,18 0,19
Table 4.24: Most selected features for distance-weighted mean for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 25 4 12 14 15
Frequency 1 1 1 0,98 0,52 0,30
From the tables above, one common pattern emerges: depending on the criterion, the sequence of
features obtained is more or less consistent. The single model, the a priori and the distance-weighted
mean criteria show a higher consistency, whereas the a posteriori and the arithmetic mean criteria are
more irregular in the order in which they select the features, meaning that, by using such criteria, less
robustness is achieved, making them highly dependent on the subset of data they are trained with and
tested on (more prone to over-fitting).
Table 4.25 shows the mean and standard deviation of the number of features selected by each criterion
for each dataset. The criteria that can make use of a higher number of features while improving
the performance are the ones that are more consistent.
Table 4.25: Mean and standard deviation of the number of features selected through the 50 runs.
Model PAN PNM BOTH ALL
Single model 8,4 ± 1,4 6,5 ± 1,0 7,2 ± 0,9 4,3 ± 1,4
a priori 13,9 ± 2,3 8,3 ± 1,2 8,6 ± 1,6 12,1 ± 2,4
a posteriori 5,9 ± 2,0 5,0 ± 1,8 4,6 ± 1,3 5,0 ± 2,1
mean 9,4 ± 3,4 8,9 ± 1,4 9,6 ± 1,4 9,0 ± 1,7
wgd avg distance 10,2 ± 2,6 9,4 ± 1,2 9,8 ± 1,3 8,1 ± 2,7
Finally, the sets of features considered best for each criterion and dataset are presented in
Tables 4.26 - 4.29. If one had to name the most characteristic variables of each dataset, given their
ubiquity, those would be:

PAN - 4 (Arterial BP Diastolic), 12 (Platelets), 13 (WBC), 19 (NBP), 25 (Arterial pH);
PNM - 3 (Arterial BP), 12 (Platelets), 13 (WBC), 20 (NBP Mean), 26 (Arterial Base Excess), 29 (Lactic
Acid);
BOTH - 3 (Arterial BP), 12 (Platelets), 13 (WBC), 19 (NBP), 26 (Arterial Base Excess), 29 (Lactic Acid);
ALL - 3 (Arterial BP), 7 (Hematocrit), 13 (WBC), 19 (NBP), 26 (Arterial Base Excess), 29 (Lactic Acid).
Table 4.26: Features selected using SFS for the dataset PAN.
Criterion Selected Features
Single model 2, 4, 12, 13, 14, 15, 19, 25, 27, 30
a priori 1, 4, 5, 10, 11, 12, 13, 14, 15, 19, 23, 25, 27, 28, 29, 30
a posteriori 4, 12, 13, 19, 20, 28, 32
mean 1, 2, 3, 4, 5, 12, 13, 14, 15, 17, 18, 19, 20, 24, 25, 28
wgd avg distance 1, 2, 3, 4, 5, 9, 12, 13, 14, 15, 18, 19, 20, 25, 27, 28, 29, 30
Table 4.27: Features selected using SFS for the dataset PNM.
Criterion Selected Features
Single model 3, 12, 13, 20, 26, 29
a priori 3, 7, 11, 12, 13, 19, 20, 23, 24, 26, 27, 29
a posteriori 4, 13, 15, 17, 19, 23, 29
mean 3, 7, 11, 12, 13, 20, 26, 29
wgd avg distance 3, 7, 11, 12, 13, 20, 23, 26, 27, 29
Table 4.28: Features selected using SFS for the dataset BOTH.
Criterion Selected Features
Single model 3, 10, 12, 13, 19, 22, 26, 29
a priori 1, 3, 5, 7, 11, 12, 13, 19, 20, 24, 26, 29, 31
a posteriori 4, 7, 15, 19, 21, 26, 30
mean 1, 3, 4, 12, 13, 14, 19, 20, 24, 26, 27, 29, 31, 32
wgd avg distance 3, 4, 7, 12, 13, 19, 20, 26, 29, 30, 31, 32
Table 4.29: Features selected using SFS for the dataset ALL.
Criterion Selected Features
Single model 3, 7, 11, 13, 19, 26, 29, 30
a priori 1, 2, 3, 4, 7, 11, 12, 13, 19, 20, 21, 26, 29, 31, 32
a posteriori 10, 12, 13, 16, 17, 23, 29, 32
mean 3, 4, 7, 12, 13, 14, 19, 20, 26, 29, 32
wgd avg distance 3, 4, 5, 7, 11, 12, 13, 19, 20, 26, 29, 32
2. Feature Selection based on the singular models’ performance
As mentioned, due to the need to fix the cluster centres and the fact that each unsupervised cluster
of data goes through its own feature selection procedure, it does not make sense to present the results
for FS in the same manner as shown previously. In this case the FS process has an high variance in the
selected features and cannot be correlated since each run of the FS will give the most predictive features
67
for the resulting unsupervised prototypes/subgroups. So, it will only be presented the features that were
selected and the final score given by the equation (4.1). These results are shown in Tables 4.30-4.33.
Comparing the score obtained during the feature selection procedure to the model assessment results
in Tables 4.37-4.40 it can be concluded that the singular models’ FS approach has an high tendency
to over-fit, something that was already expected due to the high oscillation during the feature selection
procedure.
Table 4.30: Feature selection results for PAN dataset.

Cluster     Features                     c    m     Score
cluster 1   13, 28                       5    1,7   0,95 × 0,51 + 0,99 × 0,49 = 0,97
cluster 2   3, 6, 19, 21, 27             4    1,7

Table 4.31: Feature selection results for PNM dataset.

Cluster     Features                     c    m     Score
cluster 1   2, 11, 13, 14, 17, 18, 20    2    1,7   0,90 × 0,50 + 0,95 × 0,50 = 0,93
cluster 2   8, 9, 11, 18, 21, 22, 26     2    1,7

Table 4.32: Feature selection results for BOTH dataset.

Cluster     Features                     c    m     Score
cluster 1   7, 13, 21, 27, 32            3    2     0,99 × 0,50 + 0,97 × 0,50 = 0,98
cluster 2   1, 13, 22, 24, 25, 27        2    1,7

Table 4.33: Feature selection results for ALL dataset.

Cluster     Features                     c    m     Score
cluster 1   7, 13, 21, 27, 32            3    2     0,90 × 0,50 + 0,98 × 0,50 = 0,94
cluster 2   1, 13, 22, 24, 25, 27        2    1,7
4.1.3 Model Assessment based on overall performance
Under this section the results obtained for model assessment based on overall performance are
presented. This will cover the results with and without feature selection. The parameters of the models
for the study without feature selection were obtained by extensive search, training and testing the models
68
with FS data and fixing the parameters that output the best AUC, while the procedure of fixing parameters
for the case with reduced feature space is described in 4.1.2. The results for the model assessment
considering all features are presented under Table 4.35 and with the reduced feature space in Table
4.36.
It must be pointed out that the AUC computed for the a priori and a posteriori criteria is an artefact,
since it cannot be computed directly given the nature of the output: different models are asked to
provide the answer depending on the data point. For these cases, the AUC was computed after fixing the output of the model
that was selected to provide the answer and then varying the threshold from 0 to 1. While it is known
that this does not reflect the true AUC, it is treated as if it did. An alternative would be to compute an
AUC for each model output and average them for each threshold; either way, neither approach
would reflect the sensitivity and specificity shown in the results. The sensitivity and specificity do reflect
the true values, since they are computed from the final decision output against the ground truth.
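The workaround described above can be sketched as follows: sweep the decision threshold from 0 to 1, record the resulting (sensitivity, specificity) pairs, integrate the ROC points for the pseudo-AUC, and, for the a posteriori criterion, pick the threshold that minimises the sensitivity-specificity gap. This is a minimal illustration under those assumptions, not the thesis's implementation.

```python
import numpy as np

def sens_spec(y_true, y_score, thr):
    """Sensitivity and specificity of the decision rule y_score >= thr."""
    pred = y_score >= thr
    pos, neg = y_true == 1, y_true == 0
    return pred[pos].mean(), (~pred)[neg].mean()

def auc_by_threshold_sweep(y_true, y_score, n_thresholds=101):
    """Pseudo-AUC: sweep the threshold from 0 to 1, collect the ROC points
    (1 - specificity, sensitivity) and integrate them with the trapezoidal
    rule -- the artefact described in the text, not a true AUC in general."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    pts = sorted((1.0 - sp, se)
                 for se, sp in (sens_spec(y_true, y_score, t) for t in thresholds))
    fpr = np.array([p[0] for p in pts])
    tpr = np.array([p[1] for p in pts])
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

def balanced_threshold(y_true, y_score, n_thresholds=101):
    """Threshold with the smallest |sensitivity - specificity|: the rule
    used when fixing the a posteriori criterion's operating point."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return min(thresholds,
               key=lambda t: abs(np.subtract(*sens_spec(y_true, y_score, t))))
```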
single model - As shown in previous studies (Appendix I), the single model produces better results without
feature selection. The only real competitor when all variables are considered is the a priori criterion,
which has a higher AUC at the cost of lower sensitivity and specificity. On the other hand, with
selected features, while the a priori criterion is the best performer in terms of AUC, the mean and
weighted mean criteria outperform the single model with selected features and, for some datasets,
perform as well as or better than the single model with all features considered (see Table 4.34).
Table 4.34: Comparison against single model with all variables (no FS).
dataset criterion AUC ACC sensitivity specificity
PAN   mean w/ FS      ↓  ↓  ↓  ≈
PAN   wgt mean w/ FS  ↑  ↑  ↑  ↑
PNM   mean w/ FS      ↓  ↓  ↓  ↓
PNM   wgt mean w/ FS  ↑  ≈  ≈  ≈
BOTH  mean w/ FS      ↑  ≈  ↓  ≈
BOTH  wgt mean w/ FS  ↑  ≈  ↓  ≈
ALL   mean w/ FS      ≈  ↓  ↓  ↓
ALL   wgt mean w/ FS  ↑  ≈  ≈  ↑
a priori - This criterion, similarly to the single model, loses performance when working with selected
features. It is the criterion that selects the most variables (the exception being the BOTH dataset, where
mean includes one more variable from the FS procedure). While this might be good, meaning that
it can make use of more variables (always at most half of the whole feature set), it also
means that the other criteria are capable of similar performances with fewer variables. Compared
with the single model, the tendencies are the same (degradation of all performance measures with
FS), but the single model uses far fewer features.
a posteriori - This criterion outputs better results with feature selection, and is more balanced in terms of
sensitivity and specificity than when all variables are considered. Recall that the goal, when
defining the threshold, is to select the one with the lowest difference between sensitivity and
specificity; here these two statistical measures of performance are not well balanced,
and the closest to an equilibrium is the PAN dataset, where the difference between both
is close to 0.09, while for the other datasets the difference lies between 0.24 and 0.44. Con-
sidering only the most informative variables, these differences range from 0.02 to 0.17.
Nevertheless, both results are consistent: the standard deviation does not show spikes or high
variation in the way it classifies, and it can be seen as a conservative classifier, since the
specificity is mostly higher than the sensitivity, meaning that the decision to administer
vasopressors is rarely made. The accuracy is high and, coupled with the
high specificity, only means that there is more class 0 data. In terms of AUC, it improves with
selected features; however, this is the least reliable classifier among the criteria considered. This criterion
is also the one that stops improving earliest with added features.
mean - Using the mean of the outputs of both models is only better than the weighted mean when
considering all variables for the PAN dataset, and even then the difference is minor. This criterion
performs well overall and, except for the PAN dataset, its results improve with feature selection.
weighted mean - This is the criterion that performs best with feature selection, outperforming every
other criterion in terms of sensitivity and specificity. In terms of variables needed it is not the
greediest (except for the PAN dataset, with 18 selected features), using a number of features between
those of the mean and a priori criteria.
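As a concrete sketch of the two fusion criteria, the snippet below averages the per-cluster model outputs (mean) or weights them by FCM-style memberships derived from the sample's distances to the cluster prototypes (weighted). The membership-based weighting is an assumption standing in for the thesis's "wgd avg dist" rule, whose exact form is defined earlier in the document.

```python
import numpy as np

def fuse_outputs(model_outputs, x, prototypes, criterion="mean", m=1.7):
    """Combine the outputs of the per-cluster models for one sample x.
    model_outputs: shape (n_models,), each model's score for x.
    prototypes:    shape (n_models, n_features), one centre per model.
    'mean' averages the scores; 'weighted' weights them by FCM-style
    memberships computed from the distances of x to each prototype
    (an assumption -- the thesis's exact weighting is not reproduced here)."""
    model_outputs = np.asarray(model_outputs, dtype=float)
    if criterion == "mean":
        return float(model_outputs.mean())
    d = np.linalg.norm(np.asarray(prototypes) - np.asarray(x), axis=1)
    d = np.maximum(d, 1e-12)                 # avoid division by zero
    u = d ** (-2.0 / (m - 1.0))              # FCM membership, unnormalised
    w = u / u.sum()                          # weights sum to 1
    return float(w @ model_outputs)
```

With `x` close to prototype 0, the weighted criterion tilts the fused score towards model 0's output, while the mean ignores position entirely.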
Overall, apart from the a posteriori criterion, all the criteria perform comparably well. The single model
is best when using all the variables; however, similarly to a priori, it loses performance with FS while all
the remaining methods improve. The weighted mean criterion with feature selection is the best classifier
in this study, providing the best sensitivity and specificity with the use of fewer features. The single model is
still a good approach, since it performs well with all variables and, after feature selection, still outputs
good results with fewer variables than any other criterion, except for a posteriori, which is the worst
criterion considered.
Table 4.35: Results without FS
MIMIC II - Pancreatitis without FS
Model AUC Accuracy Sensitivity Specificity
Single model 0,91 ± 0,02 0,83 ± 0,02 0,83 ± 0,03 0,83 ± 0,02
a priori 0,93 ± 0,01 0,79 ± 0,01 0,80 ± 0,03 0,79 ± 0,01
a posteriori 0,84 ± 0,02 0,85 ± 0,01 0,77 ± 0,04 0,86 ± 0,02
mean 0,90 ± 0,02 0,82 ± 0,01 0,83 ± 0,04 0,82 ± 0,01
wgd avg dist 0,89 ± 0,02 0,82 ± 0,01 0,82 ± 0,04 0,82 ± 0,01
MIMIC II - Pneumonia without FS
Model AUC Accuracy Sensitivity Specificity
Single model 0,85 ± 0,02 0,78 ± 0,01 0,78 ± 0,02 0,78 ± 0,01
a priori 0,86 ± 0,01 0,76 ± 0,01 0,76 ± 0,03 0,76 ± 0,01
a posteriori 0,69 ± 0,04 0,82 ± 0,03 0,53 ± 0,05 0,85 ± 0,03
mean 0,81 ± 0,02 0,75 ± 0,02 0,75 ± 0,02 0,75 ± 0,02
wgd avg dist 0,83 ± 0,01 0,76 ± 0,01 0,77 ± 0,02 0,76 ± 0,01
MIMIC II - Both without FS
Model AUC Accuracy Sensitivity Specificity
Single model 0,84 ± 0,02 0,77 ± 0,01 0,77 ± 0,01 0,77 ± 0,01
a priori 0,86 ± 0,01 0,76 ± 0,02 0,76 ± 0,02 0,76 ± 0,02
a posteriori 0,70 ± 0,02 0,86 ± 0,01 0,51 ± 0,04 0,90 ± 0,01
mean 0,81 ± 0,02 0,75 ± 0,01 0,74 ± 0,02 0,75 ± 0,01
wgd avg dist 0,82 ± 0,01 0,76 ± 0,01 0,76 ± 0,02 0,75 ± 0,01
MIMIC II - All patients without FS
Model AUC Accuracy Sensitivity Specificity
Single model 0,82 ± 0,02 0,75 ± 0,01 0,75 ± 0,01 0,75 ± 0,02
a priori 0,85 ± 0,01 0,73 ± 0,01 0,73 ± 0,01 0,73 ± 0,01
a posteriori 0,67 ± 0,02 0,82 ± 0,01 0,52 ± 0,04 0,86 ± 0,02
mean 0,78 ± 0,01 0,71 ± 0,01 0,71 ± 0,02 0,71 ± 0,01
wgd avg dist 0,80 ± 0,01 0,73 ± 0,01 0,73 ± 0,02 0,73 ± 0,01
Table 4.36: Results with FS
MIMIC II - Pancreatitis with FS
Model #FS AUC Accuracy Sensitivity Specificity
Single model 10 0,88 ± 0,03 0,81 ± 0,03 0,81 ± 0,03 0,81 ± 0,03
a priori 16 0,92 ± 0,01 0,78 ± 0,01 0,80 ± 0,02 0,77 ± 0,02
a posteriori 7 0,85 ± 0,02 0,83 ± 0,01 0,74 ± 0,05 0,84 ± 0,02
mean 16 0,89 ± 0,03 0,82 ± 0,01 0,82 ± 0,04 0,82 ± 0,01
wgd avg dist 18 0,91 ± 0,01 0,83 ± 0,01 0,84 ± 0,03 0,83 ± 0,01
MIMIC II - Pneumonia with FS
Model #FS AUC Accuracy Sensitivity Specificity
Single model 6 0,83 ± 0,03 0,76 ± 0,02 0,76 ± 0,02 0,76 ± 0,02
a priori 12 0,86 ± 0,01 0,75 ± 0,01 0,74 ± 0,02 0,75 ± 0,01
a posteriori 7 0,78 ± 0,03 0,77 ± 0,02 0,69 ± 0,04 0,78 ± 0,02
mean 8 0,83 ± 0,01 0,76 ± 0,01 0,75 ± 0,02 0,76 ± 0,01
wgd avg dist 10 0,85 ± 0,01 0,78 ± 0,01 0,78 ± 0,02 0,78 ± 0,01
MIMIC II - Both with FS
Model #FS AUC Accuracy Sensitivity Specificity
Single model 8 0,83 ± 0,01 0,76 ± 0,01 0,75 ± 0,01 0,76 ± 0,01
a priori 13 0,86 ± 0,01 0,74 ± 0,01 0,74 ± 0,02 0,74 ± 0,01
a posteriori 7 0,77 ± 0,01 0,76 ± 0,01 0,68 ± 0,02 0,77 ± 0,01
mean 14 0,84 ± 0,01 0,77 ± 0,01 0,77 ± 0,02 0,77 ± 0,01
wgd avg dist 12 0,85 ± 0,01 0,77 ± 0,01 0,77 ± 0,02 0,77 ± 0,01
MIMIC II - All patients with FS
Model #FS AUC Accuracy Sensitivity Specificity
Single model 8 0,79 ± 0,02 0,73 ± 0,01 0,73 ± 0,01 0,73 ± 0,01
a priori 15 0,85 ± 0,01 0,73 ± 0,01 0,72 ± 0,01 0,73 ± 0,01
a posteriori 8 0,77 ± 0,04 0,70 ± 0,02 0,74 ± 0,05 0,70 ± 0,03
mean 11 0,82 ± 0,01 0,74 ± 0,01 0,74 ± 0,01 0,74 ± 0,01
wgd avg dist 12 0,82 ± 0,01 0,75 ± 0,01 0,75 ± 0,01 0,75 ± 0,01
4.1.4 Model Assessment based on the singular models’ performance
The approach analysed in this section proved to be inappropriate for the datasets under study.
The problems that arose with the use of this method, in which over-fitting is pervasive, are listed below:
Feature selection does not bring any advantage - The feature selection procedure for this approach
cannot reduce the number of features that will be used: all the variables must be
considered to compute the distance between the centres of the unsupervised clusters and the data
points, and, as seen before, single models tend to perform better with all features, so
there is no improvement in performance with the use of fewer variables for each model separately.
The results during the feature selection procedure were remarkably high (see Tables 4.30-4.33),
but looking at the model assessment results one can infer that this was because the models
were over-fitted to the presented data. Even with the use of cross-validation, the positions of the
prototypes are fixed using all the FS data in order to divide the data into subgroups, and thus do not
change with the folds.
Data partition - The data was well partitioned during the feature selection procedure (see Tables 4.30-
4.33), more closely resembling the distribution shown in Section 4.1.1. However, since the cluster
centres had to be fixed (otherwise the features selected by the FS procedure would not make
sense), the result is that for a different group of data (the MA data) the partition is highly unbalanced
(the exception being the BOTH dataset). This procedure cannot be generalized, and the cluster
centres did over-fit the FS data.
The procedure for selecting features and FCM parameters - The idea of having models suited to dif-
ferent zones of the feature space is very appealing; however, since different clustering results ben-
efit from different variables and parameters (and this association is very sensitive), fixing both the
features and the parameters for a particular set of data compromises the generalization ability.
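The over-fitting mechanism listed above is easy to reproduce: once the prototypes are frozen on the FS data, new data is merely assigned to them through the standard FCM membership formula, so nothing adapts to the new samples. A minimal sketch under that assumption (function names are illustrative, not from the thesis):

```python
import numpy as np

def fcm_memberships(X, centres, m=1.7):
    """Membership of each sample in X to each FIXED prototype, using the
    standard FCM membership formula u_ik = 1 / sum_j (d_ik/d_jk)^(2/(m-1)).
    The centres are not re-fitted: this mirrors the assessed setup, where
    prototypes fixed on the FS data are reused on the MA data."""
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                 # guard against zero distance
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def cluster_shares(X, centres, m=1.7):
    """Percentage of samples assigned (by highest membership) to each
    prototype -- the '% data cluster' figures reported in Tables 4.37-4.40."""
    labels = fcm_memberships(X, centres, m).argmax(axis=1)
    return np.bincount(labels, minlength=len(centres)) / len(X) * 100.0
```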
The model assessment results for each dataset are presented in Tables 4.37-4.40. The first
thing to notice is that every approach presented in this section performs worse than the single
model (shown in Tables 4.35 and 4.36). The percentages of data inside each cluster shown in Tables
4.37-4.40 clearly show that the fixed positions of the prototypes do not match the data presented
in model assessment (the exception being BOTH, which has a balanced distribution, possibly by mere
coincidence). The a priori and a posteriori criteria show high variance in the sensitivity and
specificity, while the mean and weighted mean criteria remain consistent. The latter are the
top performers in this case, having the highest AUC, sensitivity and specificity, and both show very
similar behaviour.
This might be a good approach for well-behaved datasets, which is not the current case: here, each
clustering initialization leads to a different set of cluster centres while keeping nearly the same distribution
percentages. This turns the clustering procedure into a push and pull game between prototypes,
which is not the purpose. This also happens in the previous case (the datasets are the same); the
difference is that here the models are less flexible due to the over-fitting carried over from the FS data
to the MA data, namely the cluster centres and their selected features. This has the potential to produce
great results, as shown during the FS stage, but it requires the MA data to be very similar to the
FS data, because slight variations in the initial clustering lead to large differences in the final result (high
sensitivity).
Table 4.37: Results for models based on the singular models' performance: PAN dataset (75,9% of the data in cluster 1; 24,1% in cluster 2).
criterion         AUC          Accuracy     Sensitivity  Specificity
a priori          0,72 ± 0,04  0,70 ± 0,03  0,64 ± 0,05  0,71 ± 0,03
a posteriori      0,73 ± 0,05  0,48 ± 0,04  0,88 ± 0,07  0,43 ± 0,06
mean              0,75 ± 0,06  0,68 ± 0,05  0,67 ± 0,06  0,68 ± 0,05
wgd avg distance  0,75 ± 0,06  0,68 ± 0,05  0,67 ± 0,06  0,68 ± 0,05
Table 4.38: Results for models based on the singular models' performance: PNM dataset (89,1% of the data in cluster 1; 10,9% in cluster 2).
criterion         AUC          Accuracy     Sensitivity  Specificity
a priori          0,74 ± 0,02  0,69 ± 0,01  0,68 ± 0,03  0,69 ± 0,02
a posteriori      0,72 ± 0,02  0,70 ± 0,01  0,70 ± 0,02  0,70 ± 0,02
mean              0,76 ± 0,01  0,70 ± 0,02  0,70 ± 0,01  0,70 ± 0,02
wgd avg distance  0,76 ± 0,01  0,70 ± 0,02  0,70 ± 0,01  0,70 ± 0,02
Table 4.39: Results for models based on the singular models' performance: BOTH dataset (55,2% of the data in cluster 1; 44,8% in cluster 2).
criterion         AUC          Accuracy     Sensitivity  Specificity
a priori          0,76 ± 0,01  0,76 ± 0,01  0,64 ± 0,02  0,78 ± 0,01
a posteriori      0,73 ± 0,02  0,69 ± 0,02  0,71 ± 0,04  0,69 ± 0,02
mean              0,76 ± 0,01  0,70 ± 0,01  0,70 ± 0,03  0,70 ± 0,01
wgd avg distance  0,76 ± 0,01  0,70 ± 0,01  0,70 ± 0,03  0,70 ± 0,01
Table 4.40: Results for models based on the singular models' performance: ALL dataset (85,9% of the data in cluster 1; 14,1% in cluster 2).
criterion         AUC          Accuracy     Sensitivity  Specificity
a priori          0,61 ± 0,01  0,73 ± 0,05  0,37 ± 0,07  0,78 ± 0,07
a posteriori      0,59 ± 0,04  0,66 ± 0,04  0,69 ± 0,04  0,66 ± 0,05
mean              0,73 ± 0,01  0,68 ± 0,02  0,68 ± 0,02  0,68 ± 0,02
wgd avg distance  0,73 ± 0,01  0,68 ± 0,02  0,67 ± 0,02  0,68 ± 0,02
4.2 Mixed Fuzzy Clustering - Time-series data approach
In this section, four different approaches to modelling are considered. For each of them, two
ways of normalizing the data are studied, resulting in eight combinations.
The modelling approaches were presented in Section 2.3: FCM FM, FCM-FCM FM, MFC FM
and MFC-FCM FM. The normalization methods were mentioned in Section 3.2.4, corresponding to
methods 1 and 3, which are the ones that seem reasonable for the present cases. The
imputation method is the one presented in Section 3.4.1.
The adopted workflow is similar to that of the ensemble modelling: first, the parameter tuning and feature
selection are performed; then, the model assessment results are presented in order to show the behaviour
of the performance under each assumption.
4.2.1 Fixing parameters of the models & Feature Selection
The first drawback noticed, compared to the previous approach, is that all the methods presented in
this section, particularly the ones that use MFC, are computationally heavier, which led to the decision
to limit the number of parameter combinations. The first limitation is that the length of the
time-series is fixed at ten3. The fuzziness parameter then took the values {1.2, 2} and the number of clusters
the values {2, 3, 4, 5}. For MFC FM and MFC-FCM FM there is an added parameter, λ, which took the
values {0, 1, 2, 5}. Running the best-performing parameter combination 50 times in order to fix the features
is impracticable in this scenario, so the parameters were fixed as the combination that led to the best
AUC, and that combination was then run 10 times with a different seed each time. The final combination
of features is the one that resulted in the best AUC over the 10 runs.
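The parameter-fixing step above amounts to an exhaustive search over a small grid. In the sketch below, `evaluate_auc` is a hypothetical stand-in for the thesis's train-and-test routine; only the grid itself comes from the text.

```python
import itertools

def grid_search(evaluate_auc, use_mfc):
    """Exhaustive search over the limited parameter grid described above.
    evaluate_auc(m, c, lam) is assumed to train/test a model and return its
    AUC; it stands in for the actual training routine, which is not shown."""
    m_values = [1.2, 2]                                # fuzziness parameter
    c_values = [2, 3, 4, 5]                            # number of clusters
    lam_values = [0, 1, 2, 5] if use_mfc else [None]   # λ only for MFC methods
    # Keep the (m, c, λ) combination with the highest AUC.
    return max(itertools.product(m_values, c_values, lam_values),
               key=lambda p: evaluate_auc(*p))
```

The winning combination is then re-run 10 times with different seeds, and the feature set from the best of those runs is kept.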
The data was split equally into an FS data set and an MA data set; however, the class balance within
each was kept the same as in the whole data set, in order to avoid discarding data, of which
there is less than in the punctual case.
Table 4.41 shows the results of the conducted feature selection: best parameters and most predictive
features. At first glance, there is an obvious improvement in AUC when using normalization type 1,
while the strategy of using the resulting partition matrix as input variables has a higher impact on the
number of selected features. Normalization type 1 also leads to a higher number of predictive features, meaning that
this simple assumption enables the methods to explore the available data more appropriately.
Using the transformation also reduces the number of selected features, which
is intuitive because none of the variables is used directly; instead, they go through a transformation
process that reduces the dimensionality of the input to the models, and the inputs are transformed again in the
fuzzification step. This makes the value of each variable by itself less important.
Static features, from 33 to 37, appear amongst the time-varying ones, meaning that
these features play a role in prediction. The variables SAPS (36) and SOFA (37) were expected to
be useful, since they condense information about physiological variables (SOFA even includes age),
but variables such as gender (33) and age (34) are also selected. It is also observed that the combination
with the highest value of λ is the one that does not select any static variable, which is in accordance with its
meaning: the best performance is achieved by giving a higher weight to the time-varying data.
3However, out of curiosity, smaller lengths for the time-series were tested and showed improved results. This observation alone might mean that a punctual approach is better, since the smaller the time-series, the more similar it gets to the first presented approach.
Table 4.41: Best parameters and selected features according to AUC.
Method      Norm. Type  AUC   m  # of clusters  λ  Features
FCM FM      1           0.85  2  3              -  1, 7, 19, 24, 26, 29, 31 — 33
FCM FM      3           0.74  2  3              -  2, 7, 19, 24, 26, 27 — 36
FCM-FCM FM  1           0.84  2  3              -  3, 19, 20, 29, 31
FCM-FCM FM  3           0.73  2  4              -  2, 22 — 33
MFC FM      1           0.86  2  2              1  1, 5, 7, 15, 17, 19, 21, 23, 25, 26, 29, 31 — 33, 36
MFC FM      3           0.81  2  3              2  2, 6, 13, 15, 17, 19, 23, 24, 29 — 36, 37
MFC-FCM FM  1           0.87  2  5              5  3, 4, 7, 19, 22, 29, 31
MFC-FCM FM  3           0.76  2  2              2  1, 4, 12, 17, 19, 31 — 34, 36, 37
Considering the 10 runs, a frequency analysis of the feature selection procedure was conducted,
restricted to normalization type 1 (the normalization type 3 FS analysis is dropped from now on).
Tables 4.42-4.45 show the preferred order and frequency with which the features are selected for each
method.
Table 4.42: Most selected features by FCM FM (mean number of selected features: 5.8).
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 29 10 7, 31 7 24
Frequency 0.70 0.40 0.22 0.29 0.29 0.40
Table 4.43: Most selected features by FCM-FCM FM (mean number of selected features: 4.3).
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 3 29 18, 31 20 22, 32
Frequency 0.70 0.40 0.44 0.25 0.75 0.50
Table 4.44: Most selected features by MFC FM (mean number of selected features: 8.2).
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 7, 29, 31 31 7 24 4, 26
Frequency 1.00 0.30 0.60 0.50 0.40 0.30
Table 4.45: Most selected features by MFC-FCM FM (mean number of selected features: 6.5).
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 7 3 29 31 4
Frequency 0.80 0.40 0.50 0.56 0.38 0.38
4.2.2 Model Assessment
This section presents the model assessment of the four methods under study. For this purpose,
half of the data, tagged as the MA dataset, is used in two different scenarios: considering all the variables,
with the fixed parameters tuned using the FS data with all variables (Table 4.46), and considering
only the variables obtained through FS, with the respective parameters (Tables 4.47-4.50).
Focusing on the results obtained for normalization type 1, Table 4.46 shows that the
transformation based on the partition matrix, both for FCM-FCM FM and MFC-FCM FM, improves the
results. Not only are the performance measures better, but the stability, in terms of variance and of the
difference between sensitivity and specificity, is also improved.
Tables 4.47-4.50 show the results of model assessment considering the features that were selected
by FS. The results are compared in pairs: Table 4.47 shows the results of FCM FM and FCM-FCM FM
considering the features selected by FCM FM, and Table 4.48 for the features selected by FCM-FCM
Table 4.46: MA data using all variables with the best FS data parameters according to AUC.
Method      Norm. Type  AUC          Accuracy     Sensitivity  Specificity
FCM FM      1           0.70 ± 0.11  0.68 ± 0.09  0.56 ± 0.14  0.72 ± 0.12
FCM FM      3           0.66 ± 0.06  0.64 ± 0.06  0.54 ± 0.11  0.68 ± 0.07
FCM-FCM FM  1           0.75 ± 0.08  0.72 ± 0.07  0.74 ± 0.15  0.71 ± 0.09
FCM-FCM FM  3           0.57 ± 0.07  0.55 ± 0.08  0.45 ± 0.17  0.59 ± 0.08
MFC FM      1           0.70 ± 0.11  0.68 ± 0.09  0.56 ± 0.14  0.72 ± 0.12
MFC FM      3           0.66 ± 0.06  0.64 ± 0.06  0.54 ± 0.11  0.68 ± 0.07
MFC-FCM FM  1           0.79 ± 0.09  0.77 ± 0.06  0.63 ± 0.08  0.82 ± 0.09
MFC-FCM FM  3           0.68 ± 0.09  0.65 ± 0.09  0.62 ± 0.11  0.66 ± 0.12
FM, while Table 4.49 shows the results of MFC FM and MFC-FCM FM for the features selected by MFC
FM, and Table 4.50 for the features selected by MFC-FCM FM.
As expected, each method performs better with its respective feature set. The methods that use
feature transformation, FCM-FCM FM and MFC-FCM FM, show closer values of sensitivity and speci-
ficity, but also a higher variance than when no transformation is considered. These same
methods show better results when no FS is considered, contrary to FCM FM and MFC FM, which show
improved results with FS. Considering the results in Table 4.41, one might say that the models that use
feature transformation have a higher tendency to overfit; given the results without feature selection in
Table 4.46, it can also be concluded that they are less prone to being affected by noisy variables, and that
the feature transformation functions as a filter, with the potential to extract conclusive information from a
group of variables that were not considered during FS. On the other hand, the methods that do not use
feature transformation are affected by the noise, thus showing better results with the selected features.
Overall, the best performer after feature selection is FCM FM.
Table 4.47: FCM FM vs FCM-FCM FM with FCM FM selected features.
Model AUC Accuracy Sensitivity Specificity
FCM FM      0.80 ± 0.06  0.75 ± 0.05  0.70 ± 0.15  0.77 ± 0.06
FCM-FCM FM  0.71 ± 0.08  0.66 ± 0.09  0.67 ± 0.12  0.66 ± 0.12
Table 4.48: FCM FM vs FCM-FCM FM with FCM-FCM FM selected features.
Model AUC Accuracy Sensitivity Specificity
FCM FM      0.71 ± 0.08  0.68 ± 0.03  0.60 ± 0.12  0.71 ± 0.06
FCM-FCM FM  0.72 ± 0.12  0.69 ± 0.08  0.68 ± 0.17  0.69 ± 0.10
Table 4.49: MFC FM vs MFC-FCM FM with MFC FM selected features.
Model AUC Accuracy Sensitivity Specificity
MFC FM 0.78 ± 0.08 0.72 ± 0.06 0.63 ± 0.06 0.76 ± 0.07
MFC-FCM FM 0.62 ± 0.11 0.68 ± 0.07 0.11 ± 0.09 0.91 ± 0.10
Table 4.50: MFC FM vs MFC-FCM FM with MFC-FCM FM selected features.
Model AUC Accuracy Sensitivity Specificity
MFC FM 0.75 ± 0.10 0.70 ± 0.06 0.65 ± 0.18 0.72 ± 0.06
MFC-FCM FM 0.77 ± 0.09 0.73 ± 0.09 0.72 ± 0.13 0.73 ± 0.12
Chapter 5
Conclusions
The present thesis was developed with the aim of predicting patients’ need of vasopressors in ICUs,
based on patients’ physiological variables and demographics/static information measured during their
hospitalization.
It proposes some changes to the preprocessing proposed in [24] for the clinical actual state / punctual
data analysis, to which ensemble modelling was applied. The preprocessing alone contributed to
better results for the single model, and some configurations of the ensemble modelling proved
effective for this case study. The study conducted for the unsupervised clustering, used to divide the
data into subgroups, has shown that the patients are not divided by their disease, as was previously
thought, nor by the need for vasopressors, meaning that the disease label per se is not discriminatory
for the purpose of this study. Each subgroup has approximately the same data balance in terms
of diseases and vasopressor administration.
Regarding the ensemble modelling, two different approaches were conducted: one tuned with re-
spect to the overall performance, and another where the subgroups are tuned individually. Although the
modelling based on the single models' performance did not show improvements, due to its high tendency
to overfit to the training data, the modelling based on overall performance did improve the results.
The ensemble modelling has been shown to take advantage of different features depending on the subgroups.
Two strategies were used for the combination of classifiers: classifier selection, which integrates the a pri-
ori and a posteriori approaches, and classifier fusion, which integrates the mean and weighted mean criteria.
The a posteriori criterion did not perform well compared with its peers, which delivered better results than
the single model when feature selection is used. However, when all features are considered, the single
model is the best performer, showing a high level of versatility and generalization.
Another preprocessing was proposed to suit the analysis of the data following its temporal evolution,
combined with demographic/static data. This approach requires a low level of data imputation, making
it more reliable than the former pre-processing.
Four different models were constructed, for which feature selection and model assessment were per-
formed. These models can be divided into two groups: FCM FM and FCM-FCM FM, where the former
receives as input the time-series and static data pointwise and the latter receives the partition matrix
that results from the FCM clustering of the same data; and MFC FM and MFC-FCM FM, where the
first uses MFC (instead of FCM) to construct the model, and MFC-FCM FM, in which MFC is used to
transform the data and feed the resulting partition matrix to the FCM FM.
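The transformation step shared by FCM-FCM FM and MFC-FCM FM can be illustrated with a compact FCM implementation: the raw data X is clustered, and the resulting partition matrix U (one membership column per cluster) replaces X as the input to the second-stage model. This is a minimal sketch under stated assumptions (random initialisation, fixed iteration count), not the thesis's implementation.

```python
import numpy as np

def fcm(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal FCM: returns (centres, U), where U[i, k] is the membership of
    sample i in cluster k. A compact stand-in for the clustering step; the
    actual implementation details (initialisation, stopping) may differ."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # rows of U sum to 1
    for _ in range(iters):
        W = U ** m                               # fuzzified memberships
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.maximum(np.linalg.norm(X[:, None] - centres[None], axis=2), 1e-12)
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centres, U

# The partition-matrix idea: U, not the raw data, becomes the input to the
# second-stage fuzzy model (only the transformation is shown here).
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
_, U = fcm(X, c=2)   # U has shape (4, 2), with rows summing to 1
```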
For the mixed data approach, regarding the effectiveness of the feature transformation
based on the resulting partition matrices, it was concluded that, for this particular case, the transfor-
mation brings consistency, making better use of higher-dimensional inputs than their
counterparts (feeding the models directly with the data). However, after feature selection, the models
that do not use the feature transformation perform better, showing that the proposed transformation acts as
a filter, with the capability to extract predictive information from variables that are not kept after
feature selection while avoiding being affected by noise. These advantages do not make up for the poor
results against the simplicity and performance of FCM FM, which is the best performer after feature
selection.
The two approaches cannot be compared directly due to the difference in the structure of the data
and the information contained in each. Nevertheless, both sets of results were satisfactory;
the best results for each approach, in terms of data structure, are shown in Table 5.1, and
both make use of the feature selection results.
Table 5.1: Best results for punctual and evolution state analysis for the dataset ALL.
Model # of features AUC Accuracy Sensitivity Specificity
wgd avg dist 12 0.82 ± 0.01 0.75 ± 0.01 0.75 ± 0.01 0.75 ± 0.01
FCM-FCM FM 8 0.80 ± 0.06 0.75 ± 0.05 0.70 ± 0.15 0.77 ± 0.06
It is important to note that the mixed data is less voluminous due to the added constraints of its
structure. It is also more reliable for the purpose of this study: the imputation of data is around 14.6%
versus 74.7%, and it predicts the initiation of vasopressor intake, contrary to the punctual state
data, which contains data until the end of the first vasopressor administration, thus predicting the continuous
administration of the drug and not only its initiation.
5.1 Limitations
Throughout this thesis, several assumptions had to be made, each bringing limitations at its step.
During the pre-processing:

Septic shock filter - Since MIMIC II does not contain an identifier for the septic shock condition, the
filter was applied to diseases where the incidence of septic shock is higher: pancreatitis and
pneumonia. However, this does not guarantee that every patient is under a septic shock condition.

Eliminating deceased patients that did not have vasopressors - While this assumption eliminates un-
desired bias, it raises the question of how the model will behave when presented with such
cases in real time.
Data alignment - When clinical state data is used, it is desirable to align the data. The approach consists
of picking the variable with the highest sampling frequency and aligning the data based on a zero-order hold (ZOH).
The lower sampling rate of most variables leads to repetition of data, or to data points that are equal in most
features, with only the highly sampled features changing. The impact of this was not computed; however,
it might be the cause of such good classification results. It also leads to the high percentage of
imputations.
Most likely, the initiation of the administration is not predicted - Again, for the clinical state/punctual
data, the class 1 data is composed of the data in the interval between the starting point of
vasopressor administration and the end of its continuous administration. This means that most of the
class 1 data does not indicate that vasopressors will start within a given time window (only one class
1 data point per patient carries this information); instead, most of the data says that the patient will continue to be
administered vasopressors from that point on. Considering only the point that would predict the
administration in a timely manner would require having only one class 1 data point per patient that had va-
sopressors, which would be impracticable due to the highly unbalanced classes. So, the clinical state
scenario is answering the question "will the patient be prescribed OR continue with vasopressor
administration?", and the latter part is probably the one that is answered correctly most of the
time.
Comparison between clinical state and evolution state is made impossible - Given the data struc-
ture of each scenario, a viable comparison cannot be made. It would even be unfair to the
time-series analysis since, in this case, the two aspects immediately above benefit the performance
of the clinical state analysis. The time-series case is less prone to inconsistencies and better
represents a real case.
The feature selection was also problematic. Due to the variation observed during the FS procedure,
it is only known which variables work best alone or in conjunction with a few other features. This is
shown by observing that only the first selected variables are constant, while the subsequent ones depend
strongly on the randomness inherent to the training data and the initialization of the clustering algorithms.
This idea is reinforced by the fact that the single model performs best when all features are available. So,
the greedy algorithm approach for feature selection has proved inefficient in highlighting the most
predictive variables beyond a certain point.
5.2 Future Work
There is always room for improvement in a study, and as such, the following future work is proposed:
Feature Selection - The feature selection should be explored with other algorithms. This has been done
before for the single model, but finding a more consistent algorithm would highlight the differences
between the criteria as far as features are concerned.
Find a way to compare both analyses - At the end of this thesis, the problem of comparing both ap-
proaches arises. It would be interesting to be able to compare which kind of analysis performs
best: clinical state vs. evolutionary state. This would require reconstructing the clinical state data
to avoid the repetition of data (at least during the test phase) and considering only data points that
predict the initiation of vasopressors. The time-series analysis is already well approached.
Length of the time-series - It would be interesting to analyse the implications of increasing and
reducing the length of the time-series.
Same methods using Dynamic Time Warping (DTW) - Unfortunately, due to the heavy computational
effort of this algorithm, the study was not fully conducted, but from what was tested it is expected
to lead to better results.
Normalization - As shown, the proposed normalization led to worse results. However, from
the tests that were conducted, one might want to try using this normalization only for
variables that are more likely to have different "healthy" values for each patient.
Eliminating class 0 deceased patients - An analysis can be conducted to evaluate how the models
behave in the presence of such patients. While it makes sense to remove these patients from the
training data, in a real case scenario they will still be present, so adding them to the test data and
analysing the output could be valuable. It would also be worth investigating what distinguishes
these patients from patients who need vasopressors.
Enable different sizes for each feature in the time-series - The algorithm that is available at the moment
does not allow different time-series lengths for different features. This could be useful, since it
was observed that different variables tend towards "failure" at different times and at different rates.
Insightful analysis of the features' behaviour - Enabling the algorithm to use different time-series sizes
is of little use if the differences in behaviour between the variables are not understood.
Clustering algorithm - Instead of FCM, the Gustafson-Kessel algorithm should be tried; thanks to its
flexibility, it might perform better.
Other modelling approaches - Ensemble modelling can be applied solely to the mixed data approach,
or an ensemble can be built that encompasses both approaches: punctual state data and state
evolution data. The unsupervised clustering might also be done through MFC instead of FCM.
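For the DTW item above, the standard dynamic-programming formulation can be sketched as follows. This is a generic textbook implementation for 1-D series, written for illustration only; it is not the variant whose computational cost prevented the full study.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D series, using the
    standard O(len(a)*len(b)) dynamic program with squared local costs,
    so that the plain (unwarped) Euclidean distance is an upper bound."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

# A series and a phase-shifted copy: DTW can realign the shift, so its
# distance never exceeds (and here is below) the Euclidean one.
t = np.linspace(0, 2 * np.pi, 50)
s1, s2 = np.sin(t), np.sin(t - 0.5)
print(dtw_distance(s1, s2), np.linalg.norm(s1 - s2))
```

Because the unwarped diagonal path is itself a valid alignment, this squared-cost DTW can never exceed the plain Euclidean distance, and on misaligned series such as the shifted sine above it is smaller; this tolerance to temporal misalignment is what makes it attractive for physiological signals that evolve at different rates.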
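One way to read the normalization item is sketched below: instead of cohort-wide min-max scaling, each patient's series is expressed as a relative deviation from that patient's own early "healthy" baseline. The baseline window and the synthetic heart-rate-like values are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def baseline_normalize(series, baseline_window=6):
    """Scale a patient's series by that patient's own early baseline
    (median of the first `baseline_window` samples), expressing values
    as relative deviation from the patient-specific baseline."""
    baseline = np.median(series[:baseline_window])
    return (series - baseline) / (abs(baseline) + 1e-12)

def minmax_normalize(series, lo, hi):
    """Population-level min-max scaling using cohort-wide bounds."""
    return (series - lo) / (hi - lo)

# Two hypothetical patients whose usual values differ: the same absolute
# reading at the end means a much larger deviation for the first patient.
p1 = np.array([60, 62, 61, 60, 63, 61, 95, 110.0])
p2 = np.array([90, 92, 91, 90, 93, 91, 95, 110.0])
print(baseline_normalize(p1)[-1], baseline_normalize(p2)[-1])
```

Under this scheme the final value of 110 is a large relative excursion for the first patient but a modest one for the second, which is exactly the distinction that cohort-wide min-max scaling cannot make.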
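For the clustering item, a minimal sketch of the Gustafson-Kessel algorithm is given below, highlighting the step that distinguishes it from FCM: each cluster adapts its own norm-inducing matrix from a fuzzy covariance, so clusters may be ellipsoidal rather than spherical. This is a generic illustration (fixed iteration count, simple regularization), not a production implementation.

```python
import numpy as np

def gustafson_kessel(X, c=2, m=2.0, n_iter=50, seed=0, reg=1e-6):
    """Minimal Gustafson-Kessel clustering sketch. Unlike FCM's fixed
    Euclidean norm, each cluster i uses A_i = det(F_i)^(1/n) * inv(F_i),
    built from its fuzzy covariance F_i."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    U = rng.random((c, N))
    U /= U.sum(axis=0)                                  # columns sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)    # cluster prototypes
        D = np.empty((c, N))
        for i in range(c):
            diff = X - V[i]
            F = (Um[i, :, None] * diff).T @ diff / Um[i].sum()  # fuzzy covariance
            F += reg * np.eye(n)                        # guard against singularity
            A = (np.linalg.det(F) ** (1.0 / n)) * np.linalg.inv(F)
            D[i] = np.einsum('kj,jl,kl->k', diff, A, diff)      # squared GK norm
        D = np.fmax(D, 1e-12)
        U = 1.0 / (D ** (1.0 / (m - 1)) * np.sum((1.0 / D) ** (1.0 / (m - 1)), axis=0))
    return U, V

# Two elongated, well-separated synthetic clusters with different orientations:
rng = np.random.default_rng(1)
a = rng.normal([0, 0], [1.0, 0.2], size=(100, 2))
b = rng.normal([5, 5], [0.2, 1.0], size=(100, 2))
U, V = gustafson_kessel(np.vstack([a, b]), c=2)
print(np.round(V, 1))
```

The adaptive norm is what lets GK fit the flat and the upright ellipsoid with the same code path; FCM with a fixed Euclidean norm has no such per-cluster shape parameter.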
Bibliography
[1] MIMIC II V2.6, 2011 (accessed June 6, 2015). URL http://mimic.physionet.org/schema/
latest/.
[2] D. C. Angus and T. van der Poll. Severe sepsis and septic shock. New England Journal of Medicine,
369(9):840–851, 2013.
[3] D. C. Angus, W. T. Linde-Zwirble, J. Lidicker, G. Clermont, J. Carcillo, and M. R. Pinsky. Epidemiol-
ogy of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of
care. Critical care medicine, 29(7):1303–1310, 2001.
[4] R. Babuska. Fuzzy systems, modeling and identification. 2001.
[5] A. M. Bensaid, L. O. Hall, J. C. Bezdek, L. P. Clarke, M. L. Silbiger, J. A. Arrington, and R. F.
Murtagh. Validity-guided (re) clustering with applications to image segmentation. Fuzzy Systems,
IEEE Transactions on, 4(2):112–123, 1996.
[6] J. C. Bezdek. Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Pub-
lishers, 1981.
[7] J. C. Bezdek, R. Ehrlich, and W. Full. FCM: The fuzzy c-means clustering algorithm. Computers &
Geosciences, 10(2):191–203, 1984.
[8] R. Brause, F. Hamker, and J. Paetz. Septic shock diagnosis by neural networks and rule based
systems. In Computational intelligence processing in medical diagnosis, pages 323–356. Springer,
2002.
[9] P. Bromage. Vasopressors. Canadian Journal of Anesthesia/Journal canadien d’anesthesie, 7(3):
310–316, 1960.
[10] Y.-W. Chen and C.-J. Lin. Combining SVMs with various feature selection strategies. In I. Guyon,
S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction: Foundations and Applications,
Studies in Fuzziness and Soft Computing, pages 315–324. Springer-Verlag, 2006.
[11] K. J. Cios and G. W. Moore. Uniqueness of medical data mining. Artificial intelligence in medicine,
26(1):1–24, 2002.
[12] F. Cismondi, A. S. Fialho, S. M. Vieira, J. M. Sousa, S. R. Reti, M. D. Howell, and S. N. Finkel-
stein. Computational intelligence methods for processing misaligned, unevenly sampled time se-
ries containing missing data. In Computational Intelligence and Data Mining (CIDM), 2011 IEEE
Symposium on, pages 224–231. IEEE, 2011.
[13] F. Cismondi, A. L. Horn, A. S. Fialho, S. M. Vieira, S. R. Reti, J. M. Sousa, and S. Finkelstein.
Multi-stage modeling using fuzzy multi-criteria feature selection to improve survival prediction of ICU
septic shock patients. Expert Systems with Applications, 39(16):12332–12339, 2012.
[14] T. D. Correa, M. Vuda, A. R. Blaser, J. Takala, S. Djafarzadeh, M. W. Dunser, E. Silva, M. Lensch,
L. Wilkens, and S. M. Jakob. Effect of treatment delay on disease severity and need for resuscitation
in porcine fecal peritonitis. Critical care medicine, 40(10):2841–2849, 2012.
[15] R. P. Dellinger, M. M. Levy, A. Rhodes, D. Annane, H. Gerlach, S. M. Opal, J. E. Sevransky, C. L.
Sprung, I. S. Douglas, R. Jaeschke, et al. Surviving sepsis campaign: international guidelines for
management of severe sepsis and septic shock, 2012. Intensive care medicine, 39(2):165–228,
2013.
[16] R. P. Dellinger et al. Cardiovascular management of septic shock. Critical care medicine, 31(3):
946–955, 2003.
[17] V. Y. Dombrovskiy, A. A. Martin, J. Sunderram, and H. L. Paz. Rapid increase in hospitalization and
mortality rates for severe sepsis in the United States: a trend analysis from 1993 to 2003. Critical
care medicine, 35(5):1244–1250, 2007.
[18] C. M. Dunham, J. H. Siegel, L. Weireter, M. Fabian, S. Goodarzi, P. Guadalupi, L. Gettings,
S. E. Linberg, and T. C. Vary. Oxygen debt and metabolic acidemia as quantitative
predictors of mortality and the severity of the ischemic insult in hemorrhagic shock. Critical care
medicine, 19(2):231–243, 1991.
[19] A. P. Engelbrecht. Computational intelligence: an introduction. John Wiley & Sons, 2007.
[20] T. Fawcett. An introduction to ROC analysis. Pattern recognition letters, 27(8):861–874, 2006.
[21] M. P. Fernandes, C. F. Silva, S. M. Vieira, and J. Sousa. Multimodeling for the prediction of patient
readmissions in intensive care units. In Fuzzy Systems (FUZZ-IEEE), 2014 IEEE International
Conference on, pages 1837–1842. IEEE, 2014.
[22] M. C. Ferreira, C. M. Salgado, J. L. Viegas, H. Schafer, C. S. Azevedo, S. M. Vieira, and J. M. C.
Sousa. Fuzzy modeling based on mixed fuzzy clustering for health care applications. In Fuzzy
Systems (FUZZ-IEEE), 2015 IEEE International Conference on. IEEE, 2015.
[23] A. Fialho, L. Celi, F. Cismondi, S. Vieira, S. Reti, J. Sousa, S. Finkelstein, et al. Disease-based
modeling to predict fluid response in intensive care units. Methods Inf Med, 52(6):494–502, 2013.
[24] A. S. Fialho, F. Cismondi, S. M. Vieira, J. Sousa, S. R. Reti, L. Celi, M. D. Howell, S. N. Finkelstein,
et al. Fuzzy modeling to predict administration of vasopressors in intensive care unit patients. In
Fuzzy Systems (FUZZ), 2011 IEEE International Conference on, pages 2296–2303. IEEE, 2011.
[25] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases: An
overview. AI magazine, 13(3):57, 1992.
[26] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical inference and data mining. Com-
munications of the ACM, 39(11):35–41, 1996.
[27] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine
Learning Research, 3:1157–1182, 2003.
[28] J. Han, M. Kamber, and J. Pei. Data mining, southeast asia edition: Concepts and techniques.
Morgan kaufmann, 2006.
[29] J. Han, M. Kamber, and J. Pei. Data mining: concepts and techniques: concepts and techniques.
Elsevier, 2011.
[30] D. J. Hand, H. Mannila, and P. Smyth. Principles of data mining. MIT press, 2001.
[31] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating char-
acteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
[32] S. Herget-Rosenthal, F. Saner, and L. S. Chawla. Approach to hemodynamic shock and vasopres-
sors. Clinical Journal of the American Society of Nephrology, 3(2):546–553, 2008.
[33] S. M. Hollenberg. Inotrope and vasopressor therapy of septic shock. Critical care nursing clinics of
North America, 23(1):127–148, 2011.
[34] A. L. Horn, F. Cismondi, A. S. Fialho, S. M. Vieira, J. M. Sousa, S. Reti, M. Howell, and S. Finkel-
stein. Multi-objective performance evaluation using fuzzy criteria: Increasing sensitivity prediction
for outcome of septic shock patients. In Proceedings of 18th world congress of the international
federation of automatic control (IFAC), volume 18, pages 14042–14047, 2011.
[35] H. Izakian, W. Pedrycz, and I. Jamal. Clustering spatiotemporal data: An augmented fuzzy c-
means. Fuzzy Systems, IEEE Transactions on, 21(5):855–868, 2013.
[36] A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.
[37] J.-S. R. Jang, C.-T. Sun, and E. Mizutani. Neuro-fuzzy and soft computing; a computational ap-
proach to learning and machine intelligence. 1997.
[38] U. Kaymak and M. Setnes. Extended fuzzy clustering algorithms. ERIM Report Series Reference
No. ERS-2001-51-LIS, 2000.
[39] D.-W. Kim, K. H. Lee, and D. Lee. On cluster validity index for estimation of the optimal number of
fuzzy clusters. Pattern Recognition, 37(10):2009–2025, 2004.
[40] T. W. Liao. Clustering of time series data—a survey. Pattern recognition, 38(11):1857–1874, 2005.
[41] J.-H. Lin and P. J. Haug. Data preparation framework for preprocessing clinical data in data mining.
In AMIA Annual Symposium Proceedings, volume 2006, page 489. American Medical Informatics
Association, 2006.
[42] W. T. Linde-Zwirble and D. C. Angus. Severe sepsis epidemiology: sampling, selection, and society.
Critical Care, 8(4):222, 2004.
[43] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu. Understanding of internal clustering validation measures.
In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 911–916. IEEE, 2010.
[44] F. J. Marques, A. Moutinho, S. M. Vieira, and J. M. Sousa. Preprocessing of clinical databases to
improve classification accuracy of patient diagnosis. In World Congress, volume 18, pages 14121–
14126, 2011.
[45] G. S. Martin, D. M. Mannino, S. Eaton, and M. Moss. The epidemiology of sepsis in the United
States from 1979 through 2000. New England Journal of Medicine, 348(16):1546–1554, 2003.
[46] U. Maulik and S. Bandyopadhyay. Performance evaluation of some clustering algorithms and valid-
ity indices. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(12):1650–1654,
2002.
[47] T. Pang-Ning, M. Steinbach, V. Kumar, et al. Introduction to data mining. In Library of Congress,
page 74, 2006.
[48] W. Pedrycz. Fuzzy multimodels. Fuzzy Systems, IEEE Transactions on, 4(2):139–148, 1996.
[49] R. Polikar. Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE, 6
(3):21–45, 2006.
[50] E. Rivers, B. Nguyen, S. Havstad, J. Ressler, A. Muzzin, B. Knoblich, E. Peterson, and M. Tom-
lanovich. Early goal-directed therapy in the treatment of severe sepsis and septic shock. New
England Journal of Medicine, 345(19):1368–1377, 2001.
[51] M. Saeed, C. Lieu, G. Raber, and R. Mark. MIMIC II: a massive temporal ICU patient database to
support research in intelligent patient monitoring. In Computers in Cardiology, 2002, pages 641–
644. IEEE, 2002.
[52] C. M. Salgado, C. S. Azevedo, J. Garibaldi, and S. M. Vieira. Ensemble fuzzy classifiers design
using weighted aggregation criteria. In Fuzzy Systems (FUZZ-IEEE), 2015 IEEE International Con-
ference on. IEEE, 2015.
[53] N. Sanchez-Marono, A. Alonso-Betanzos, and M. Tombilla-Sanroman. Filter methods for feature
selection–a comparative study. In Intelligent Data Engineering and Automated Learning-IDEAL
2007, pages 178–187. Springer, 2007.
[54] J. M. Sousa and U. Kaymak. Fuzzy decision making in modeling and control, volume 27. World
Scientific, 2002.
[55] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and
control. Systems, Man and Cybernetics, IEEE Transactions on, (1):116–132, 1985.
[56] J. Wu, H. Xiong, and J. Chen. Adapting the right measures for K-means clustering. In Proceedings
of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
877–886. ACM, 2009.
[57] X. L. Xie and G. Beni. A validity measure for fuzzy clustering. IEEE Transactions on pattern analysis
and machine intelligence, 13(8):841–847, 1991.
[58] S. C. Yusta. Different metaheuristic strategies to solve the feature selection problem. Pattern
Recognition Letters, 30(5):525–534, 2009.
[59] L. A. Zadeh. The concept of a linguistic variable and its application to approximate reasoning—i.
Information sciences, 8(3):199–249, 1975.
Appendix A
Outliers - Expert Knowledge versus
Inter-quartile method
Figure A.1: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 1 to 4.
Figure A.2: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 5 to 8.
Figure A.3: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 9 to 12.
Figure A.4: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 13 to 16.
Figure A.5: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 17 to 20.
Figure A.6: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 21 to 24.
Figure A.7: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 25 to 28.
Figure A.8: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 29 to 32.
Appendix B
Removing Deceased Patients
Figure B.1: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 1-6. Each point corresponds to the mean value of a 2-hour window.
Figure B.2: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 7-12. Each point corresponds to the mean value of a 2-hour window.
Figure B.3: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 13-18. Each point corresponds to the mean value of a 2-hour window.
Figure B.4: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 19-24. Each point corresponds to the mean value of a 2-hour window.
Figure B.5: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 25-30. Each point corresponds to the mean value of a 2-hour window.
Figure B.6: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 30-32. Each point corresponds to the mean value of a 2-hour window.
Figure B.7: Related to the figures above. Left figure: number of measurements considered in each 2-hour time window. Right figure: number of patients taken into account in each 2-hour time window.
Appendix C
ID’s of the same variable
Table C.1: List of the IDs associated with each variable that were grouped into one.

Variable Name | Primary ID | Secondary ID
WBC | 861 | 1542, 1127, 4200
Arterial Base Excess | 776 | 74, 3740, 4196
Arterial pH / pH | 780 | 865, 1126, 4202, 4753
Arterial PaO2 | 779 | 1155
Arterial PaCO2 | 778 | 1156
Calcium | 786 | 1522
Central Venous Pressure (CVP) | 113 | 1103
Ionized Calcium | 816 | 1350, 4453, 8177, 8325
INR | 815 | 1530
Lactate or Lactic Acid | 818 | 1531
Phosphorous | 827 | 1534
PTT | 825 | 1533, 739
NBP | 455 | 751, 1149
Magnesium | 821 | 1532
Sodium | 837 | 1536, 3803
RBC | 833 | 3797, 4197
Chloride | 788 | 1523
Platelets | 828 | 3790
BUN | 781 | 1162
Creatinine | 791 | 3750, 1525
Glucose | 811 | 1529
Potassium | 829 | 1535, 3792
Hematocrit | 813 | 3761
Arterial BP | 51 | 6, 6701, 6926
Temperature C | 676 | 677
Appendix D
Clustering Validation Analysis -
Methods
The next tables show the results of the clustering validation methods for each of the datasets: PAN, PNM, BOTH
and ALL.
Table D.1: Clustering validation indexes score for dataset PAN performed 10 times with different partitions.

m | Index | c=2 | c=3 | c=4 | c=5 | c=6
1.4 | PC | 7,71E-01 ± 1,17E-16 | 6,57E-01 ± 0,00E+00 | 5,85E-01 ± 1,17E-16 | 5,35E-01 ± 0,00E+00 | 4,98E-01 ± 5,85E-17
1.4 | CE | 6,62E-01 ± 0,00E+00 | 1,06E+00 ± 2,34E-16 | 1,35E+00 ± 0,00E+00 | 1,58E+00 ± 2,34E-16 | 1,76E+00 ± 2,34E-16
1.4 | SC | 5,84E+01 ± 1,50E-14 | 6,85E+01 ± 1,50E-14 | 8,40E+01 ± 1,50E-14 | 9,84E+01 ± 0,00E+00 | 1,11E+02 ± 0,00E+00
1.4 | S | 1,14E-02 ± 1,83E-18 | 1,60E-02 ± 3,66E-18 | 1,96E-02 ± 0,00E+00 | 2,26E-02 ± 0,00E+00 | 2,50E-02 ± 3,66E-18
1.4 | XB | 1,58E+00 ± 2,34E-16 | 1,35E+00 ± 0,00E+00 | 1,20E+00 ± 2,34E-16 | 1,10E+00 ± 0,00E+00 | 1,02E+00 ± 2,34E-16
1.4 | DI | 1,91E-02 ± 3,66E-18 | 1,84E-02 ± 0,00E+00 | 1,67E-02 ± 3,66E-18 | 4,76E-03 ± 9,14E-19 | 1,95E-02 ± 0,00E+00
1.4 | ADI | 4,78E-02 ± 7,31E-18 | 3,98E-02 ± 7,31E-18 | 5,33E-03 ± 0,00E+00 | 6,42E-03 ± 9,14E-19 | 4,80E-03 ± 9,14E-19
1.7 | PC | 6,16E-01 ± 1,17E-16 | 4,63E-01 ± 0,00E+00 | 3,79E-01 ± 5,85E-17 | 3,24E-01 ± 0,00E+00 | 2,85E-01 ± 5,85E-17
1.7 | CE | 6,93E-01 ± 0,00E+00 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 0,00E+00 | 1,79E+00 ± 2,34E-16
1.7 | SC | 4,50E+09 ± 1,01E-06 | 2,73E+09 ± 5,03E-07 | 2,97E+09 ± 5,03E-07 | 1,13E+09 ± 0,00E+00 | 8,32E+08 ± 0,00E+00
1.7 | S | 8,79E+05 ± 1,23E-10 | 8,88E+05 ± 0,00E+00 | 8,69E+05 ± 0,00E+00 | 3,25E+05 ± 6,14E-11 | 2,45E+05 ± 3,07E-11
1.7 | XB | 1,25E+00 ± 2,34E-16 | 9,42E-01 ± 0,00E+00 | 7,70E-01 ± 0,00E+00 | 6,59E-01 ± 1,17E-16 | 5,80E-01 ± 1,17E-16
1.7 | DI | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18
1.7 | ADI | 6,59E-02 ± 1,46E-17 | 6,59E-02 ± 1,46E-17 | 6,59E-02 ± 1,46E-17 | 1,15E-02 ± 1,83E-18 | 9,65E-03 ± 1,83E-18
2.0 | PC | 5,00E-01 ± 0,00E+00 | 3,33E-01 ± 5,85E-17 | 2,50E-01 ± 5,85E-17 | 2,00E-01 ± 0,00E+00 | 1,67E-01 ± 0,00E+00
2.0 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 4,68E-16 | 1,79E+00 ± 2,34E-16
2.0 | SC | 7,06E+09 ± 2,01E-06 | 3,47E+09 ± 0,00E+00 | 4,33E+09 ± 1,01E-06 | 2,34E+09 ± 0,00E+00 | 1,99E+09 ± 5,03E-07
2.0 | S | 1,38E+06 ± 2,45E-10 | 1,13E+06 ± 2,45E-10 | 1,27E+06 ± 2,45E-10 | 6,76E+05 ± 1,23E-10 | 5,86E+05 ± 1,23E-10
2.0 | XB | 1,02E+00 ± 0,00E+00 | 6,78E-01 ± 1,17E-16 | 5,08E-01 ± 0,00E+00 | 4,07E-01 ± 1,17E-16 | 3,39E-01 ± 5,85E-17
2.0 | DI | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18
2.0 | ADI | 6,59E-02 ± 0,00E+00 | 6,59E-02 ± 0,00E+00 | 1,15E-02 ± 0,00E+00 | 1,15E-02 ± 1,83E-18 | 9,65E-03 ± 0,00E+00
Table D.2: Clustering validation indexes score for dataset PNM performed 10 times with different partitions.

m | Index | c=2 | c=3 | c=4 | c=5 | c=6
1.4 | PC | 7,60E-01 ± 1,17E-16 | 6,47E-01 ± 1,17E-16 | 5,78E-01 ± 0,00E+00 | 5,39E-01 ± 1,17E-16 | 5,08E-01 ± 1,17E-16
1.4 | CE | 6,89E-01 ± 0,00E+00 | 1,09E+00 ± 0,00E+00 | 1,37E+00 ± 2,34E-16 | 1,56E+00 ± 2,34E-16 | 1,72E+00 ± 2,34E-16
1.4 | SC | 4,93E+02 ± 0,00E+00 | 4,08E+02 ± 5,99E-14 | 2,42E+02 ± 3,00E-14 | 6,59E+01 ± 1,50E-14 | 4,09E+01 ± 7,49E-15
1.4 | S | 3,01E-02 ± 0,00E+00 | 3,01E-02 ± 0,00E+00 | 1,78E-02 ± 3,66E-18 | 4,80E-03 ± 9,14E-19 | 2,93E-03 ± 4,57E-19
1.4 | XB | 1,82E+00 ± 4,68E-16 | 1,55E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,29E+00 ± 2,34E-16 | 1,22E+00 ± 2,34E-16
1.4 | DI | 6,84E-03 ± 9,14E-19 | 1,80E-02 ± 0,00E+00 | 1,09E-02 ± 1,83E-18 | 1,38E-02 ± 0,00E+00 | 6,27E-03 ± 9,14E-19
1.4 | ADI | 3,49E-03 ± 9,14E-19 | 8,43E-03 ± 1,83E-18 | 4,46E-03 ± 9,14E-19 | 3,97E-04 ± 0,00E+00 | 5,01E-04 ± 1,14E-19
1.7 | PC | 6,16E-01 ± 0,00E+00 | 4,63E-01 ± 0,00E+00 | 3,79E-01 ± 0,00E+00 | 3,24E-01 ± 5,85E-17 | 2,85E-01 ± 5,85E-17
1.7 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 4,68E-16 | 1,79E+00 ± 2,34E-16
1.7 | SC | 1,58E+10 ± 0,00E+00 | 5,02E+09 ± 1,01E-06 | 1,91E+10 ± 0,00E+00 | 5,08E+09 ± 1,01E-06 | 2,27E+09 ± 5,03E-07
1.7 | S | 9,66E+05 ± 0,00E+00 | 4,14E+05 ± 6,14E-11 | 1,50E+06 ± 2,45E-10 | 4,84E+05 ± 0,00E+00 | 2,28E+05 ± 0,00E+00
1.7 | XB | 1,47E+00 ± 0,00E+00 | 1,10E+00 ± 0,00E+00 | 9,02E-01 ± 0,00E+00 | 7,72E-01 ± 0,00E+00 | 6,79E-01 ± 1,17E-16
1.7 | DI | 6,84E-03 ± 9,14E-19 | 6,84E-03 ± 9,14E-19 | 6,84E-03 ± 9,14E-19 | 6,84E-03 ± 9,14E-19 | 6,84E-03 ± 9,14E-19
1.7 | ADI | 4,19E-03 ± 0,00E+00 | 4,19E-03 ± 9,14E-19 | 4,19E-03 ± 9,14E-19 | 4,19E-03 ± 0,00E+00 | 4,19E-03 ± 9,14E-19
2.0 | PC | 5,00E-01 ± 1,17E-16 | 3,33E-01 ± 5,85E-17 | 2,50E-01 ± 0,00E+00 | 2,00E-01 ± 2,93E-17 | 1,67E-01 ± 2,93E-17
2.0 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 0,00E+00 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 2,34E-16 | 1,79E+00 ± 0,00E+00
2.0 | SC | 1,68E+10 ± 0,00E+00 | 1,46E+10 ± 0,00E+00 | 2,00E+10 ± 4,02E-06 | 9,47E+09 ± 0,00E+00 | 2,62E+09 ± 5,03E-07
2.0 | S | 1,02E+06 ± 1,23E-10 | 1,19E+06 ± 0,00E+00 | 1,77E+06 ± 2,45E-10 | 9,02E+05 ± 1,23E-10 | 2,65E+05 ± 6,14E-11
2.0 | XB | 1,19E+00 ± 2,34E-16 | 7,94E-01 ± 0,00E+00 | 5,95E-01 ± 1,17E-16 | 4,76E-01 ± 1,17E-16 | 3,97E-01 ± 5,85E-17
2.0 | DI | 6,84E-03 ± 9,14E-19 | 1,48E-02 ± 3,66E-18 | 1,63E-02 ± 3,66E-18 | 1,63E-02 ± 3,66E-18 | 6,84E-03 ± 9,14E-19
2.0 | ADI | 4,19E-03 ± 9,14E-19 | 4,19E-03 ± 0,00E+00 | 4,19E-03 ± 9,14E-19 | 4,19E-03 ± 0,00E+00 | 4,19E-03 ± 9,14E-19
Table D.3: Clustering validation indexes score for dataset BOTH performed 10 times with different partitions.

m | Index | c=2 | c=3 | c=4 | c=5 | c=6
1.4 | PC | 7,61E-01 ± 1,17E-16 | 6,48E-01 ± 1,17E-16 | 5,79E-01 ± 0,00E+00 | 5,31E-01 ± 1,17E-16 | 4,96E-01 ± 0,00E+00
1.4 | CE | 6,85E-01 ± 1,17E-16 | 1,09E+00 ± 0,00E+00 | 1,37E+00 ± 2,34E-16 | 1,59E+00 ± 2,34E-16 | 1,76E+00 ± 0,00E+00
1.4 | SC | 2,62E+02 ± 5,99E-14 | 2,56E+02 ± 5,99E-14 | 2,33E+02 ± 0,00E+00 | 1,90E+02 ± 3,00E-14 | 1,39E+02 ± 3,00E-14
1.4 | S | 1,32E-02 ± 0,00E+00 | 1,57E-02 ± 0,00E+00 | 1,42E-02 ± 1,83E-18 | 1,14E-02 ± 0,00E+00 | 8,18E-03 ± 1,83E-18
1.4 | XB | 1,76E+00 ± 4,68E-16 | 1,49E+00 ± 2,34E-16 | 1,33E+00 ± 2,34E-16 | 1,22E+00 ± 2,34E-16 | 1,14E+00 ± 2,34E-16
1.4 | DI | 7,15E-03 ± 0,00E+00 | 1,04E-02 ± 1,83E-18 | 8,44E-03 ± 1,83E-18 | 1,43E-02 ± 0,00E+00 | 1,43E-02 ± 0,00E+00
1.4 | ADI | 7,86E-02 ± 1,46E-17 | 1,15E-02 ± 1,83E-18 | 6,71E-03 ± 9,14E-19 | 5,29E-03 ± 9,14E-19 | 1,19E-03 ± 2,29E-19
1.7 | PC | 6,16E-01 ± 1,17E-16 | 4,63E-01 ± 5,85E-17 | 3,79E-01 ± 5,85E-17 | 3,24E-01 ± 5,85E-17 | 2,85E-01 ± 5,85E-17
1.7 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 2,34E-16 | 1,79E+00 ± 2,34E-16
1.7 | SC | 1,07E+10 ± 2,01E-06 | 8,20E+09 ± 2,01E-06 | 9,31E+09 ± 0,00E+00 | 6,62E+09 ± 0,00E+00 | 2,05E+09 ± 2,51E-07
1.7 | S | 5,39E+05 ± 0,00E+00 | 5,99E+05 ± 0,00E+00 | 6,77E+05 ± 0,00E+00 | 5,47E+05 ± 0,00E+00 | 1,70E+05 ± 3,07E-11
1.7 | XB | 1,43E+00 ± 0,00E+00 | 1,08E+00 ± 0,00E+00 | 8,83E-01 ± 0,00E+00 | 7,55E-01 ± 1,17E-16 | 6,65E-01 ± 1,17E-16
1.7 | DI | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18
1.7 | ADI | 8,60E-02 ± 0,00E+00 | 1,54E-02 ± 0,00E+00 | 1,28E-02 ± 1,83E-18 | 1,28E-02 ± 1,83E-18 | 1,22E-02 ± 1,83E-18
2.0 | PC | 5,00E-01 ± 1,17E-16 | 3,33E-01 ± 5,85E-17 | 2,50E-01 ± 5,85E-17 | 2,00E-01 ± 2,93E-17 | 1,67E-01 ± 2,93E-17
2.0 | CE | 6,93E-01 ± 0,00E+00 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 2,34E-16 | 1,79E+00 ± 4,68E-16
2.0 | SC | 1,34E+10 ± 0,00E+00 | 1,36E+10 ± 0,00E+00 | 1,10E+10 ± 2,01E-06 | 6,97E+09 ± 2,01E-06 | 2,71E+09 ± 5,03E-07
2.0 | S | 6,74E+05 ± 1,23E-10 | 9,85E+05 ± 0,00E+00 | 8,12E+05 ± 1,23E-10 | 5,77E+05 ± 1,23E-10 | 2,25E+05 ± 3,07E-11
2.0 | XB | 1,17E+00 ± 2,34E-16 | 7,77E-01 ± 1,17E-16 | 5,83E-01 ± 1,17E-16 | 4,66E-01 ± 0,00E+00 | 3,88E-01 ± 5,85E-17
2.0 | DI | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18 | 7,37E-03 ± 9,14E-19 | 1,37E-02 ± 1,83E-18 | 7,15E-03 ± 0,00E+00
2.0 | ADI | 8,60E-02 ± 0,00E+00 | 1,54E-02 ± 3,66E-18 | 1,28E-02 ± 1,83E-18 | 1,28E-02 ± 1,83E-18 | 8,09E-03 ± 1,83E-18
Table D.4: Clustering validation indexes score for dataset ALL performed 10 times with different partitions.

m | Index | c=2 | c=3 | c=4 | c=5 | c=6
1.4 | PC | 7,58E-01 ± 1,17E-16 | 6,44E-01 ± 0,00E+00 | 5,74E-01 ± 1,17E-16 | 5,25E-01 ± 1,17E-16 | 4,88E-01 ± 1,17E-16
1.4 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 4,68E-16 | 1,79E+00 ± 0,00E+00
1.4 | SC | 8,53E+06 ± 1,96E-09 | 9,19E+06 ± 1,96E-09 | 5,95E+06 ± 9,82E-10 | 7,57E+06 ± 9,82E-10 | 3,25E+06 ± 4,91E-10
1.4 | S | 1,64E+02 ± 3,00E-14 | 2,67E+02 ± 5,99E-14 | 1,70E+02 ± 3,00E-14 | 2,15E+02 ± 5,99E-14 | 9,37E+01 ± 1,50E-14
1.4 | XB | 2,12E+00 ± 4,68E-16 | 1,80E+00 ± 4,68E-16 | 1,61E+00 ± 4,68E-16 | 1,47E+00 ± 2,34E-16 | 1,37E+00 ± 2,34E-16
1.4 | DI | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00
1.4 | ADI | 5,19E-03 ± 9,14E-19 | 5,18E-03 ± 9,14E-19 | 5,17E-03 ± 0,00E+00 | 5,16E-03 ± 9,14E-19 | 5,14E-03 ± 9,14E-19
1.7 | PC | 6,16E-01 ± 1,17E-16 | 4,63E-01 ± 5,85E-17 | 3,79E-01 ± 5,85E-17 | 3,24E-01 ± 5,85E-17 | 2,85E-01 ± 5,85E-17
1.7 | CE | 6,93E-01 ± 0,00E+00 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 0,00E+00 | 1,79E+00 ± 0,00E+00
1.7 | SC | 1,17E+10 ± 2,01E-06 | 1,52E+10 ± 0,00E+00 | 1,60E+10 ± 0,00E+00 | 6,45E+09 ± 0,00E+00 | 2,48E+09 ± 5,03E-07
1.7 | S | 2,26E+05 ± 6,14E-11 | 4,77E+05 ± 6,14E-11 | 4,78E+05 ± 1,23E-10 | 2,12E+05 ± 0,00E+00 | 7,24E+04 ± 0,00E+00
1.7 | XB | 1,72E+00 ± 2,34E-16 | 1,30E+00 ± 0,00E+00 | 1,06E+00 ± 0,00E+00 | 9,07E-01 ± 1,17E-16 | 7,99E-01 ± 1,17E-16
1.7 | DI | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00
1.7 | ADI | 5,22E-03 ± 0,00E+00 | 5,22E-03 ± 9,14E-19 | 5,22E-03 ± 9,14E-19 | 5,22E-03 ± 9,14E-19 | 5,22E-03 ± 9,14E-19
2.0 | PC | 5,00E-01 ± 1,17E-16 | 3,33E-01 ± 5,85E-17 | 2,50E-01 ± 5,85E-17 | 2,00E-01 ± 5,85E-17 | 1,67E-01 ± 0,00E+00
2.0 | CE | 6,93E-01 ± 0,00E+00 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 2,34E-16 | 1,79E+00 ± 2,34E-16
2.0 | SC | 4,12E+10 ± 0,00E+00 | 1,13E+10 ± 2,01E-06 | 1,50E+10 ± 0,00E+00 | 6,45E+09 ± 0,00E+00 | 3,27E+09 ± 0,00E+00
2.0 | S | 7,93E+05 ± 1,23E-10 | 3,56E+05 ± 6,14E-11 | 4,51E+05 ± 6,14E-11 | 2,12E+05 ± 0,00E+00 | 9,68E+04 ± 1,53E-11
2.0 | XB | 1,40E+00 ± 0,00E+00 | 9,33E-01 ± 1,17E-16 | 7,00E-01 ± 1,17E-16 | 5,60E-01 ± 0,00E+00 | 4,67E-01 ± 5,85E-17
2.0 | DI | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,04E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 8,98E-03 ± 1,83E-18
2.0 | ADI | 5,22E-03 ± 0,00E+00 | 5,22E-03 ± 9,14E-19 | 5,22E-03 ± 0,00E+00 | 5,22E-03 ± 0,00E+00 | 5,22E-03 ± 9,14E-19
Appendix E
Clustering Validation Analysis -
Distributions
In order to validate the number of clusters in the cases where the clustering validation methods presented
in 2.6.2 gave no clear evidence, a study based on the distribution of the data along the clusters and on the
inter-cluster distances was conducted, delivering the following results (comments on these can be found in
section 4.1.1):
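The inter-cluster distances reported in the tables of this appendix can be computed directly from the matrix of cluster centroids; a minimal sketch, using a hypothetical centroid matrix, is:

```python
import numpy as np

def centroid_distance_matrix(V):
    """Pairwise Euclidean distances between cluster centroids
    (rows of V); the result is symmetric with a zero diagonal."""
    diff = V[:, None, :] - V[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical 2-cluster centroid matrix (rows = centroids).
V = np.array([[0.10, 0.20, 0.30],
              [0.12, 0.24, 0.31]])
print(centroid_distance_matrix(V))
```

Small off-diagonal values, as in several of the tables below, indicate centroids that nearly coincide, which is why the distribution of the data over the clusters was inspected as well.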
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure E.1: Division in clusters for the dataset PAN.
Table E.1: Euclidean distances between clusters.
PAN cluster 1 cluster 2
cluster 1 0,00E+00 1,66E-01
cluster 2 1,66E-01 0,00E+00
Table E.2: Euclidean distances between clusters.
PAN cluster 1 cluster 2 cluster 3
cluster 1 0,00E+00 1,84E-01 3,94E-04
cluster 2 1,84E-01 0,00E+00 1,85E-01
cluster 3 3,94E-04 1,85E-01 0,00E+00
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure E.2: Division in clusters for the dataset PNM.
Table E.3: Euclidean distances between clusters.

PNM cluster 1 cluster 2

cluster 1 0,00E+00 4,93E-04

cluster 2 4,93E-04 0,00E+00
Table E.4: Euclidean distances between clusters.

PNM cluster 1 cluster 2 cluster 3

cluster 1 0,00E+00 4,65E-05 1,17E-03

cluster 2 4,65E-05 0,00E+00 1,13E-03

cluster 3 1,17E-03 1,13E-03 0,00E+00
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure E.3: Division in clusters for the dataset BOTH.
Table E.5: Euclidean distances between clusters.

BOTH cluster 1 cluster 2

cluster 1 0,00E+00 1,65E-04

cluster 2 1,65E-04 0,00E+00
Table E.6: Euclidean distances between clusters.

BOTH cluster 1 cluster 2 cluster 3

cluster 1 0,00E+00 2,36E-02 1,90E-02

cluster 2 2,36E-02 0,00E+00 4,65E-03

cluster 3 1,90E-02 4,65E-03 0,00E+00
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure E.4: Division in clusters for the dataset ALL.
Table E.7: Euclidean distances between clusters.

ALL cluster 1 cluster 2

cluster 1 0,00E+00 4,30E-04

cluster 2 4,30E-04 0,00E+00
Table E.8: Euclidean distances between clusters.

ALL cluster 1 cluster 2 cluster 3

cluster 1 0,00E+00 6,66E-04 7,98E-05

cluster 2 6,66E-04 0,00E+00 5,90E-04

cluster 3 7,98E-05 5,90E-04 0,00E+00
Appendix F
Fixing ensemble modelling
parameters
Table F.1: Best parameters with all features; c = 2:5; m = 1.1:2.

criterion | PAN | PNM | BOTH | ALL
single model | c=4; m=2.0 | c=4; m=2.0 | c=5; m=2.0 | c=4; m=2.0
a priori | c=2; m=2.0 | c=5; m=2.0 | c=2; m=1.9 | c=2; m=2.0
a posteriori | c=2; m=2.0 | c=4; m=1.3 | c=2; m=2.0 | c=2; m=1.9
arithmetic mean | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0 | c=5; m=1.8
distance-weighted mean | c=5; m=1.9 | c=2; m=2.0 | c=2; m=2.0 | c=5; m=2.0
Table F.2: Best parameters with feature selection; c = 2:5; m = 1.1:2.

criterion | PAN | PNM | BOTH | ALL
single model | c=2; m=1.5 | c=3; m=1.3 | c=3; m=2.0 | c=4; m=1.4
a priori | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0
a posteriori | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0
arithmetic mean | c=4; m=1.9 | c=2; m=1.6 | c=2; m=2.0 | c=3; m=1.9
distance-weighted mean | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0 | c=3; m=1.7
Appendix G
Influence of the fuzziness parameter
in FCM clusters centres
From what is seen, one can conclude that it might be desirable to have a low m parameter, akin to
K-means clustering. Knowing that the architecture of the multimodel depends heavily on the distance
of the data to the prototypes/centroids that result from this partition, one may draw the conclusion that
a higher distance between those centroids makes the weighting process less ambiguous. Another
aspect that should be noted is that, in terms of the partition of the data, there is no considerable difference
for m ≤ 10, and higher values were only used in this short analysis out of curiosity. Nonetheless,
the decision on the most suitable m will be taken by evaluating the scores obtained with the mentioned
validation measures.
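The tendency discussed above can be reproduced with a small FCM sketch on synthetic two-cluster data: as m grows, the prototypes are pulled towards each other (and, in the limit, towards the grand mean), although the effect is modest for moderate m, consistent with the observation that the partition changes little for m ≤ 10. The implementation below is generic and purely illustrative.

```python
import numpy as np

def fcm_prototypes(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means, returning only the prototypes V."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                       # columns sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        D = np.fmax(((X[None] - V[:, None]) ** 2).sum(-1), 1e-12)
        U = 1.0 / (D ** (1 / (m - 1)) * (1.0 / D ** (1 / (m - 1))).sum(0))
    return V

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
spread = {m: np.linalg.norm(np.diff(fcm_prototypes(X, m=m), axis=0))
          for m in (1.4, 2.0, 10.0)}
print(spread)  # inter-prototype distance shrinks as m grows
```

With well-separated prototypes the membership weights, and hence the multimodel weighting, remain clearly distinguishable, which is the argument made above for preferring a low m.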
The possibility to choose m adds one more degree of freedom (DOF), which can be useful for extensive
examination and might lead to better results in some cases, so the use of K-means as opposed to FCM
is not yet justifiable based on this short example alone. Hopefully this study clarified the influence of the
parameter m in this context. The following figures show the tracking of the position of the prototypes as
m varies:
Appendix H
Histograms feature selection for
punctual data
(k) Frequency a feature was selected during 50 runs of single model FS.
(l) Frequency a feature was selected during 50 runs of a priori FS.
(m) Frequency a feature was selected during 50 runs of a posteriori FS.
(n) Frequency a feature was selected during 50 runs of arithmetic mean FS.
(o) Frequency a feature was selected during 50 runs of distance-weighted mean FS.
Figure H.1: Frequency a feature was selected during 50 runs for dataset ALL.
(a) Frequency a feature was selected during 50 runs of single model FS.
(b) Frequency a feature was selected during 50 runs of a priori FS.
(c) Frequency a feature was selected during 50 runs of a posteriori FS.
(d) Frequency a feature was selected during 50 runs of arithmetic mean FS.
(e) Frequency a feature was selected during 50 runs of distance-weighted mean FS.
Figure H.2: Frequency a feature was selected during 50 runs for dataset BOTH.
(a) Frequency a feature was selected during 50 runs of single model FS.
(b) Frequency a feature was selected during 50 runs of a priori FS.
(c) Frequency a feature was selected during 50 runs of a posteriori FS.
(d) Frequency a feature was selected during 50 runs of arithmetic mean FS.
(e) Frequency a feature was selected during 50 runs of distance-weighted mean FS.
Figure H.3: Frequency a feature was selected during 50 runs for dataset PNM.
(a) Frequency a feature was selected during 50 runs of single model FS.
(b) Frequency a feature was selected during 50 runs of a priori FS.
(c) Frequency a feature was selected during 50 runs of a posteriori FS.
(d) Frequency a feature was selected during 50 runs of arithmetic mean FS.
(e) Frequency a feature was selected during 50 runs of distance-weighted mean FS.
Figure H.4: Frequency a feature was selected during 50 runs for dataset PAN.
Appendix I
Previous Results
(a)
(b)
Figure I.1: Tables extracted from (a) [24] and (b) [23] for results comparison.