Fuzzy Modeling for the Prediction of Vasopressors Administration in the ICU Using Ensemble and Mixed Fuzzy Clustering Approaches
Carlos Santos Azevedo
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Supervisor: Prof. Susana Margarida da Silva Vieira
Examination Committee
Chairperson: Prof. João Rogério Caldas Pinto
Members of the Committee: Prof. Luís Manuel Fernandes Mendonça
Prof. João Miguel da Costa Sousa
October 2015
Acknowledgments
I would like to express my gratitude to my supervisor, Professor Susana Vieira, for the opportunity to work in this field of research and for her useful comments, remarks, patience and engagement throughout the learning process of my master thesis. Besides my supervisor, I would like to thank the rest of my thesis committee.
Furthermore, I would like to thank Catia Salgado for her friendship, availability and the stimulating discussions and suggestions. I am also grateful for the support and advice of my faculty colleagues and friends Marta Ferreira, Rita Viegas, Joaquim Viegas and Hugo Proenca. I must acknowledge as well the support of my friends Bruno Vidal, Jonas Haggenjos and Henrique Goncalves, without whom part of my journey at IST would not have been as pleasant.
A special thanks to my family. Words cannot express how grateful I am to my mother and father for all of the sacrifices that they have made on my behalf.
Abstract
Severe sepsis and septic shock are major health care problems and remain among the leading causes of death in critically ill patients. Therapy with vasopressor agents is usually initiated in this group of patients. The main objective of this work is to describe and implement a data mining solution to predict the need for vasopressor administration in septic shock patients in the Intensive Care Unit (ICU). The MIMIC II database was used to extract clinical data from 32 physiological and 5 static variables for a cohort of patients of interest. Two different analyses were conducted using these data: one of the patients' clinical state and one of the patients' clinical evolution. The former is studied under an ensemble modelling approach. Feature selection was performed for four different ensemble criteria (a priori, a posteriori, arithmetic mean and distance-weighted mean) and for the single model. The ensemble approaches benefited from feature selection, whereas the single model performed best with all 32 input variables. The spatio-temporal analysis was approached using two different clustering techniques: Fuzzy C-Means (FCM) and Mixed Fuzzy Clustering (MFC). The latter weights the relevance of the temporal component of the data in the clustering process, allowing a more flexible identification of structures in datasets composed of mixed (temporal and static) features. Two modelling approaches based on MFC were tested and compared with similar approaches based on the traditional FCM, where both clustering algorithms are used either for transforming the feature space of the input variables into membership degrees, or for determining the antecedent fuzzy sets of Takagi-Sugeno fuzzy models. The use of feature transformation showed better performance than the other methods; however, when sequential feature selection is combined with fuzzy modelling, FCM is the best performer. Overall, the best results obtained are AUC=0.82±0.01 and AUC=0.80±0.06 for the ensemble and MFC strategies, respectively. Additionally, considering that the percentage of imputed data is around 14.6% for MFC and 74.7% for the ensemble, MFC should preferentially be considered for predicting vasopressor administration in critically ill patients.
Keywords: data mining, vasopressors, feature selection, data pre-processing, time series, mixed data, ensemble modelling, fuzzy clustering, specialized modelling
Resumo
Sepsis grave e choque séptico são um dos grandes problemas existentes em cuidados médicos, sendo uma das principais causas de morte em doentes críticos. A terapia por vasopressores é usualmente usada neste grupo de pacientes. O principal objectivo deste trabalho é descrever e implementar uma solução de data mining para prever a necessidade de administração de vasopressores em doentes de cuidados intensivos em choque séptico. A base de dados MIMIC II foi utilizada para extrair dados clínicos de 32 variáveis fisiológicas e 5 variáveis estáticas para os pacientes que fazem parte do grupo de interesse. Foram executados dois tipos de análises: uma que avalia o estado clínico do paciente para um dado instante e outra que tem em conta a evolução clínica do paciente. O primeiro tipo foi estudado utilizando uma abordagem multimodelo. Foi feita uma selecção de variáveis para o modelo singular e para quatro critérios multimodelo: a priori, a posteriori, média aritmética e média pesada pela distância aos clusters. A abordagem multimodelo beneficiou da selecção de variáveis, enquanto que o modelo singular teve melhor desempenho usando as 32 variáveis fisiológicas. Foi feita também uma abordagem espaço-temporal através do uso de duas técnicas de clustering diferentes: Fuzzy C-means (FCM) e Mixed Fuzzy Clustering (MFC). Esta última tem a particularidade de pesar a componente temporal durante o processo de partição, permitindo uma identificação mais flexível das estruturas existentes nos dados compostos por atributos temporais e estáticos. Duas abordagens de modelação baseadas no MFC foram testadas e comparadas com abordagens similares baseadas no FCM, em que ambos os algoritmos de partição foram usados para transformar as variáveis em matrizes de partição, ou para determinar os antecedentes dos conjuntos fuzzy de modelos fuzzy Takagi-Sugeno. O uso da transformação revelou um melhor desempenho face aos outros métodos; no entanto, quando a selecção de variáveis é combinada com modelação fuzzy, o FCM sem transformação deu melhores resultados. Os melhores resultados obtidos foram AUC=0.82±0.01 para a abordagem multimodelo e AUC=0.80±0.06 para a abordagem MFC. Considerando o facto de que a quantidade de inserção artificial de dados ronda os 14.6% para MFC e 74.7% para os multimodelos, a abordagem MFC deve ser considerada para prever a administração da medicação para pacientes em estado crítico.
Palavras-chave: data mining, vasopressores, selecção de atributos, pré-processamento de dados, dados temporais, acoplamento de dados estáticos com dinâmicos, multimodelos, fuzzy clustering, modelação especializada
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data mining applied to the prediction of vasopressor dependency . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Methods 7
2.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Fuzzy C-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Mixed Fuzzy Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Fuzzy Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Modelling Based on MFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 FCM Fuzzy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 FCM Fuzzy Model with FCM feature transformation . . . . . . . . . . . . . . . . . 14
2.3.3 MFC fuzzy model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 FCM Fuzzy Model with MFC feature transformation . . . . . . . . . . . . . . . . . 15
2.4 Ensemble Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Subgroup selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Subgroup modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 Ensemble decision criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 Sequential Forward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 SFS with ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.3 SFS with MFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Validation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.1 AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.2 Clustering Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Data collection and preprocessing 28
3.1 Structure of MIMIC II Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Preprocessing Data - General Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.1 Chosen Input/Output Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Removing outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 Removing deceased patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Preprocessing Data - Clinical Actual State Analysis . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Data Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.2 Enabling the dataset for prediction purposes . . . . . . . . . . . . . . . . . . . . . 49
3.4 Preprocessing Data - Clinical State Evolution Analysis . . . . . . . . . . . . . . . . . . . . 51
3.4.1 Data imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.2 Enabling the dataset for prediction purposes . . . . . . . . . . . . . . . . . . . . . 55
4 Results and Discussions 56
4.1 Ensemble Modelling - Punctual Data Results . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.1 Unsupervised Clustering Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.2 Fixing parameters of the models & Feature Selection . . . . . . . . . . . . . . . . . 62
4.1.3 Model Assessment based on overall performance . . . . . . . . . . . . . . . . . . 68
4.1.4 Model Assessment based on the singular models’ performance . . . . . . . . . . . 73
4.2 Mixed Fuzzy Clustering - Time-series data approach . . . . . . . . . . . . . . . . . . . . . 74
4.2.1 Fixing parameters of the models & Feature Selection . . . . . . . . . . . . . . . . . 75
4.2.2 Model Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5 Conclusions 79
5.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Bibliography 87
A Outliers - Expert Knowledge versus Inter-quartile method 89
B Removing Deceased Patients 98
C ID’s of the same variable 105
D Clustering Validation Analysis - Methods 106
E Clustering Validation Analysis - Distributions 108
F Fixing ensemble modelling parameters 113
G Influence of the fuzziness parameter in FCM clusters centres 114
H Histograms feature selection for punctual data 117
I Previous Results 122
List of Tables
3.1 Features and sampling rates (measurements/day) in each dataset. . . . . . . . . . . . . . 35
3.2 List of vasopressors and participation in the datasets. . . . . . . . . . . . . . . . . . . . . 36
3.3 Delimiting data (Inter-quartile is applied on ALL dataset) . . . . . . . . . . . . . . . . . . . 40
3.4 After alignment with Heart Rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 After imputation of data using ZOH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Removal of data due to lack of data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Percentages of imputation by dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.8 Percentages of imputations by input time-varying variable . . . . . . . . . . . . . . . . . . 49
3.9 Example of the output shifting procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.10 Before shifting the output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.11 After shifting the output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.12 Percentages of imputation given the vector length. . . . . . . . . . . . . . . . . . . . . . . 54
3.13 Considering an interval between measurements of x hour (for ALL dataset) . . . . . . . . 55
4.1 Classes percentages in each dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Clustering validation indexes for dataset ALL performed 10 times with different partitions.
(+) means a higher value is better and (-) means the opposite. . . . . . . . . . . . . . . . 59
4.3 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Most selected features for single model for dataset ALL. . . . . . . . . . . . . . . . . . . . 64
4.6 Most selected features for a priori for dataset ALL. . . . . . . . . . . . . . . . . . . . . . . 64
4.7 Most selected features for a posteriori for dataset ALL. . . . . . . . . . . . . . . . . . . . . 64
4.8 Most selected features for arithmetic mean for dataset ALL. . . . . . . . . . . . . . . . . . 64
4.9 Most selected features for distance-weighted mean for dataset ALL. . . . . . . . . . . . . 64
4.10 Most selected features for single model for dataset BOTH. . . . . . . . . . . . . . . . . . . 65
4.11 Most selected features for a priori for dataset BOTH. . . . . . . . . . . . . . . . . . . . . . 65
4.12 Most selected features for a posteriori for dataset BOTH. . . . . . . . . . . . . . . . . . . . 65
4.13 Most selected features for arithmetic mean for dataset BOTH. . . . . . . . . . . . . . . . . 65
4.14 Most selected features for distance-weighted mean for dataset BOTH. . . . . . . . . . . . 65
4.15 Most selected features for single model for dataset PNM. . . . . . . . . . . . . . . . . . . 65
4.16 Most selected features for a priori for dataset PNM. . . . . . . . . . . . . . . . . . . . . . . 65
4.17 Most selected features for a posteriori for dataset PNM. . . . . . . . . . . . . . . . . . . . 65
4.18 Most selected features for arithmetic mean for dataset PNM. . . . . . . . . . . . . . . . . 65
4.19 Most selected features for distance-weighted mean for dataset PNM. . . . . . . . . . . . . 65
4.20 Most selected features for single model for dataset PAN. . . . . . . . . . . . . . . . . . . . 66
4.21 Most selected features for a priori for dataset PAN. . . . . . . . . . . . . . . . . . . . . . . 66
4.22 Most selected features for a posteriori for dataset PAN. . . . . . . . . . . . . . . . . . . . . 66
4.23 Most selected features for arithmetic mean for dataset PAN. . . . . . . . . . . . . . . . . . 66
4.24 Most selected features for distance-weighted mean for dataset PAN. . . . . . . . . . . . . 66
4.25 Mean and standard deviation of the number of features selected through the 50 runs. . . 66
4.26 Features selected using SFS for the dataset PAN. . . . . . . . . . . . . . . . . . . . . . . 67
4.27 Features selected using SFS for the dataset PNM. . . . . . . . . . . . . . . . . . . . . . . 67
4.28 Features selected using SFS for the dataset BOTH. . . . . . . . . . . . . . . . . . . . . . 67
4.29 Features selected using SFS for the dataset ALL. . . . . . . . . . . . . . . . . . . . . . . . 67
4.30 Feature selection results for PAN dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.31 Feature selection results for PNM dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.32 Feature selection results for BOTH dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.33 Feature selection results for ALL dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.34 Comparison against single model with all variables (no FS). . . . . . . . . . . . . . . . . . 69
4.35 Results without FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.36 Results with FS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.37 Results for models based on the singular models’ performance: PAN dataset. . . . . . . . 74
4.38 Results for models based on the singular models’ performance: PNM dataset. . . . . . . 74
4.39 Results for models based on the singular models’ performance: BOTH dataset. . . . . . . 74
4.40 Results for models based on the singular models’ performance: ALL dataset. . . . . . . . 74
4.41 Best parameters and selected feature according to AUC. . . . . . . . . . . . . . . . . . . . 76
4.42 Most selected features by FCM FM (mean number of selected features: 5.8). . . . . . . . 76
4.43 Most selected features by FCM-FCM FM (mean number of selected features: 4.3). . . . . 76
4.44 Most selected features by MFC FM (mean number of selected features: 8.2). . . . . . . . 76
4.45 Most selected features by MFC-FCM FM (mean number of selected features: 6.5). . . . . 76
4.46 MA data using all variables with best FS data parameters according to AUC. . . . . . . . 77
4.47 FCM FM vs FCM-FCM FM with FCM FM selected features. . . . . . . . . . . . . . . . . . 77
4.48 FCM FM vs FCM-FCM FM with FCM-FCM FM selected features. . . . . . . . . . . . . . 77
4.49 MFC FM vs MFC-FCM FM with MFC FM selected features. . . . . . . . . . . . . . . . . . 77
4.50 MFC FM vs MFC-FCM FM with MFC-FCM FM selected features. . . . . . . . . . . . . . . 78
5.1 Best results for punctual and evolution state analysis for the dataset ALL. . . . . . . . . . 80
C.1 List of the IDs associated with each variable that were grouped into one. . . . . . . . . 105
D.1 Clustering validation indexes score for dataset PAN performed 10 times with different
partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
D.2 Clustering validation indexes score for dataset PNM performed 10 times with different
partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
D.3 Clustering validation indexes score for dataset BOTH performed 10 times with different
partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
D.4 Clustering validation indexes score for dataset ALL performed 10 times with different par-
titions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
E.1 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
E.2 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
E.3 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.4 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.5 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
E.6 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
E.7 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
E.8 Euclidean distances between clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
F.1 Best parameters all features; c=2:5; m=1.1:2. . . . . . . . . . . . . . . . . . . . . . . . . . 113
F.2 Best parameters with feature selection; c=2:5; m=1.1:2. . . . . . . . . . . . . . . . . . . . 113
List of Figures
2.1 Functional block diagram of a fuzzy inference system (FIS). . . . . . . . . . . . . . . . . . 12
2.2 Methods used for time-variant data coupled with time-invariant data. . . . . . . . . . . . . 14
2.3 Schematic representation of the single and multimodel approaches; [*] cluster centres is
relevant for the case where feature selection is performed independently for each sub-
group/unsupervised cluster, see Section 2.5.2. . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Left: SFS applied in individual subgroups; Right: SFS applied to the ensemble methodology. 22
2.5 ROC curve example and points of interest (inspired by [20]). . . . . . . . . . . . . . . . 24
3.1 Schematic of data collection and database construction [1]. . . . . . . . . . . . . . . . . . 31
3.2 From raw data to usable data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Interval between two consecutive vasopressors intake (with a random added value of
±0.015, after their interval identification, for density observation) considering only patients
with more than 6 hours of data before vasopressors administration. Figure (b) is a zoom
in of the figure (a) to the region of interest. . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 1 to 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 12,13,16
and 20. Each point corresponds to the mean value of the measurements taken in a 2
hours window. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Artificial data and results for three different methods on applying min-max normalization. . 44
3.7 Alignment of misaligned and unevenly sampled data (inspired by [12]). . . . . . . . . . . 45
3.8 Alignment of the data for the punctual data analysis, using the variable with highest sam-
pling rate as template, covering all the case scenarios. . . . . . . . . . . . . . . . . . . . . 46
3.9 Preprocessing steps and resulting datasets of the MIMIC II for the punctual data case. . . 47
3.10 Preprocessing steps and resulting datasets of the MIMIC II for the time-series data case. 52
3.11 Procedure to adapt the real measurements to a vector of length 10 for all variables includ-
ing data imputation when needed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.12 How the data is enabled for prediction purposes. . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 Example on PAN dataset to get the subset data for clustering validation. . . . . . . . . . . 58
4.2 Distribution along clusters for the dataset ALL. . . . . . . . . . . . . . . . . . . . . . . . . 61
A.1 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 1 to 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.2 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 5 to 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.3 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 9 to 12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.4 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 13 to 16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.5 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 17 to 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.6 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 21 to 24. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.7 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 25 to 28. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
A.8 Dispersion of data points and boundaries given by expert knowledge plus visual inspec-
tion (green dashed line) and inter-quartile (black dashed line) method for the input vari-
ables 29 to 32. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
B.1 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 1-6.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 99
B.2 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 7-12.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 100
B.3 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 13-18.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 101
B.4 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 19-24.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 102
B.5 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 25-30.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 103
B.6 Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 30-32.
Each point corresponds to mean value of a 2 hours window. . . . . . . . . . . . . . . . . . 104
B.7 Related to the above figures. Left: number of measurements considered in each 2-hour
time window. Right: number of patients taken into account in each 2-hour time window. . 104
E.1 Division in clusters for the dataset PAN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
E.2 Division in clusters for the dataset PNM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.3 Division in clusters for the dataset BOTH. . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
E.4 Division in clusters for the dataset ALL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
H.1 Frequency with which a feature was selected during 50 runs for dataset ALL. . . . . . . 118
H.2 Frequency with which a feature was selected during 50 runs for dataset BOTH. . . . . . 119
H.3 Frequency with which a feature was selected during 50 runs for dataset PNM. . . . . . . 120
H.4 Frequency with which a feature was selected during 50 runs for dataset PAN. . . . . . . 121
I.1 Tables extracted from (a) [24] (b) [23] for results comparison. . . . . . . . . . . . . . . . . 122
Notation
Symbols
βi Degree of activation of the ith rule
δ Euclidean distance
λ Temporal component weight
xsi Vector with the spatial component of the samples
µAij Membership function of Aij
µij Membership degree of sample j to the ith cluster (FCM approach)
vsl Spatial prototypes for each cluster l
Aij The antecedent fuzzy set for rule Ri and the jth feature
Ai Positive definite symmetric matrix
ci The N-dimension centre of the ith cluster
d2λ Distance function between a sample and the spatial and temporal prototype of a cluster
dij Distance from sample j to the ith cluster prototype (FCM approach)
fi Consequent function of rule Ri
g Cluster that minimizes the criterion
J Objective function of the augmented FCM (MFC approach)
Jm Objective Function (FCM approach)
Kc Number of individual models that belong to an MCS
m Fuzziness parameter
N Total number of features
nc Number of clusters
Ns Number of samples
p Total number of temporal features (MFC approach)
q Total number of temporal samples
r Total number of spatial features (MFC approach)
SRi Sample rate of the ith feature
t Threshold
U Partition matrix
ul,i The degree of membership of sample i to cluster l
vtl,k Temporal prototypes for each cluster l and feature k
X Collection of data samples
xj The N-dimensional jth sample
xi Matrix that includes Xti and xsi
Xti Matrix q × p with the temporal component of the samples
y Output of the model
Ymm,j The prediction made by the MCS for sample j
z Iteration step number
Acronyms
ACC Accuracy
ADI Alternative Dunn Index
ALL Dataset with no restriction in terms of diseases
AUC Area Under the receiver operating characteristics Curve
BOTH Pancreatitis and Pneumonia dataset
CE Classification Entropy
DI Dunn’s Index
EHR Electronic Health Records
FCM Fuzzy C-Means
FCM-FCM FM FCM fuzzy model with FCM feature transformation
FIS Fuzzy Inference Systems
FM Fuzzy Model
FN False Negatives
FP False Positives
FS Feature Selection
ICU Intensive Care Unit
ID Identification
KDD Knowledge Discovery in Databases
MCS Multiple Classifier System
MFC Mixed Fuzzy Clustering
MFC-FCM FM FCM fuzzy model with MFC feature transformation
MIMIC II Multiparameter Intelligent Monitoring in Intensive Care II
MRN Medical Record Number
OLS Ordinary-Least Squares
PAN Pancreatitis dataset
PC Partition Coefficient
PNM Pneumonia dataset
prec Precision
ROC Receiver Operating Characteristics
S Separation Index
SC Partition Index
sens Sensitivity
SFS Sequential Forward Selection
spec Specificity
TN True Negatives
TP True Positives
TS Takagi-Sugeno
XB Xie-Beni Index
Chapter 1
Introduction
This study addresses the need for vasopressor administration in patients in septic shock in intensive care units (ICU), attempting to predict that need in a timely manner. It proposes several ways to pre-process the data, adding details that had not been taken into consideration so far: the use of expert knowledge provided by physicians, in place of statistical methods, to remove outliers; the removal of patients whose physiological variables' behaviour may misguide the classification; a new normalization approach that uses the patient's healthy condition to evaluate only the deviation from their own natural state, rather than comparing them with their peers; and a readjusted time interval between vasopressor administrations so that administration can be considered continuous. The data is preprocessed to fit two different approaches to predicting the need for vasopressors: data that comprises information about the state of a patient at a given time, neglecting its history/evolution during the ICU stay (called punctual data throughout this work), and data organized as a sequence of events ordered by their occurrence in time, giving information about the evolution of each variable during the patient's stay jointly with static data (in this case, demographic data and ICU scoring systems). The latter uses an algorithm capable of evaluating time-variant and time-invariant data together, this being the first time it is used in modelling to obtain fuzzy models.
1.1 Motivation
Severe sepsis and septic shock are major health care problems and remain one of the leading causes
of death in critically ill patients [2] along with substantial consumption of health care resources [3]. It
affects millions of people around the world each year, killing one in four (and often more), and increasing
in incidence [3] [16] [45] [42] [17].
Sepsis is a systemic, deleterious host response to infection that may lead to severe sepsis and septic
shock: severe sepsis refers to acute organ dysfunction secondary to documented or suspected infection,
while septic shock is severe sepsis plus hypotension (low blood pressure) that is not reversed with fluid
resuscitation, i.e., despite the increase in cardiac output obtained by volume expansion.
The initial priority in managing septic shock is to maintain a reasonable mean arterial pressure
and cardiac output to keep the patient alive allowing organ perfusion while the source of infection is iden-
tified and addressed, and measures to interrupt the pathogenic sequence leading to septic shock are
undertaken. While these latter goals are being pursued, adequate organ system perfusion and function
must be maintained, guided by cardiovascular monitoring [33]. This necessity motivates the administration
of vasopressor agents, when vascular failure is such that the increase in cardiac output is
insufficient to maintain a mean arterial pressure compatible with adequate organ perfusion [15].
Essentially, vasopressors are drugs that cause the blood vessels to constrict, increasing blood pressure
in critically ill patients.
As in other types of shock, septic shock, in parallel with the treatment of infection, demands an urgent
need to restore blood pressure and cardiac output for organ perfusion, and thus stop the process that
aggravates the oxygen debt [50] [14] [18]. As with polytrauma, acute myocardial infarction, or stroke,
the speed and appropriateness of the therapy administered in the initial hours after severe sepsis develops
are likely to influence the outcome.
Current recommendations are to try to restore both the cardiac output and blood pressure by volume
expansion/fluid resuscitation. It is only when fluid administration fails to restore an adequate arterial
pressure and organ perfusion that therapy with vasopressor agents should be initiated.
1.2 Data mining applied to the prediction of vasopressor dependency
Medical or health care services are traditionally rendered by numerous providers who operate in-
dependently of one another. Providers may include, for example, hospitals, clinics, doctors, therapists
and diagnostic laboratories. A single patient may obtain the services of a number of these providers
when being treated for a particular illness or injury. Over the course of a lifetime, a patient may receive
the services of a large number of providers. Each medical service provider typically maintains medical
records for services the provider renders for a patient, but rarely if ever has medical records generated
by other providers. Such documents may include, for example, new patient information or admission
records, doctors’ notes, and lab and test results. Each provider will identify a patient with a medical
record number (MRN) of its own choosing to track medical records the provider generates in connection
with the patient. In order to make health care management more efficient, improve the quality of health
care delivered and eliminate inefficiencies in the delivery of the services, there is a desire to collect all
of a patient’s medical records into a central location for access by health care managers and providers.
A central database of medical information about its patients enables a network or organization to deter-
mine and set practices that help to reduce costs. It also fosters sharing of information between health
care providers about specific patients that will tend to improve the quality of health care delivered to the
patients and reduce duplication of services.
Nowadays technological advancements in the form of computer-based patient record software and
personal computer hardware are making the collection of and access to health care data more
manageable. The use of data mining in medicine is a rapidly growing field, which aims at discovering
structure in large, heterogeneous clinical data [11]. This interest arose due to the rapid emergence of
electronic data management methods, holding valuable and complex information. Human experts are
limited and may overlook important details, while automated discovery tools can analyse the raw data
and extract high level information for the decision-maker [30].
In this era databases are typically very large and contain high percentages of missing data. In
most cases, the missing data come from multiple heterogeneous sources [29]. After data collection
and problem definition, pre-processing is very important for data analysis, especially for retrospective
evaluations. Medical databases are a good example where the preprocessing is essential. Clearly, the
quality of the results from data analysis strongly depends on the careful execution of the preprocessing
steps [8].
The typical problems associated with medical databases are listed below [44]:
1. Each of the patients has a different length of stay in the medical unit. For each patient, a different
number of variables is documented.
2. Different data is measured at different times of day with different frequencies.
3. Some hospitals may not have data recorded online. Since the data can be transferred from hand-
written records to the database, typing errors are a common error source.
4. Many variables have a high percentage of missing values, caused by faults or simply by infrequent
measurement.

This study deals with all the aforementioned irregularities in order to obtain more concise datasets
that are less vulnerable to inaccurate data collection and medical decisions; with the preprocessing
approach proposed in this work, it was possible to achieve better modelling results. Data mining
techniques were used to search for relationships in such a large clinical database.
The present problem has been addressed through the use of machine learning algorithms to classify
whether ICU patients need vasopressors. In [32], heart rate and arterial blood pressure are used
as inputs to a fuzzy-logic based algorithm generating a "vasopressor advisability index". In [13], a
multimodel approach is tested for predicting the risk of death in septic shock patients, where two models
were used in parallel to enable selective sensitivity or specificity. In [24], fuzzy modelling with feature
selection was used to classify the use of vasopressors in septic shock patients, and in [23] the authors
take the fluid resuscitation intake into account to filter the database and use fuzzy modelling to predict
the need of vasopressors. Both use disease-based subsets of patients, showing that this approach
improves the prediction results when compared with a general model.
At present there is no time-series analysis dedicated to this problem, even though the use of
time-varying data has been shown to be useful in discovering patterns and extracting knowledge from
data in domains as diverse as telecommunications, environmental analysis, medical problems and
financial markets [40]. All the aforementioned studies are based on what is called, throughout this work,
punctual data, which refers to the state of a patient at a given time, neglecting its history/evolution
during the ICU stay.
1.3 Contributions
Following the work done in [24] and [23], this thesis offers a detailed description of the whole pre-
processing, including improvements that resulted from a thorough inspection of the data and led to the
redefinition of some previous assumptions. These changes led to improved results (previous results
are presented in Appendix I) and include: the exclusion of diseased patients that did not take
vasopressors; the correction of an inconsistency revealed by the analysis of the intervals between
drug intakes; a new normalization for the dataset (which proved to output worse results, although the
idea deserves an exhaustive study given that it was applied blindly to every variable); outlier removal
based on expert knowledge; and the use of data starting only at the point where each variable has at
least one measurement (not extrapolating backwards, and using only data that would be available in a
real-time situation).
This thesis performs feature selection for each of the criteria proposed for the ensemble modelling
approach: a priori, a posteriori, arithmetic mean and weighted mean-distance. It shows two ways of
selecting the most predictive features: one based on the overall performance of the multiple classifier
system, and another based on the performance of the single models. This highlights the difference
between having a specialized tuned system and having a system that contains specialized tuned models.
A spatiotemporal approach is undertaken, using the physiological variables as time-series
(temporal data) and the demographic data along with scores as non-varying data (spatial data). This
includes pre-processing the data so that fuzzy models based on Mixed Fuzzy Clustering can be built. A
previous study [22] showed that using the resulting partition matrix of the Mixed Fuzzy Clustering as
input to the models improved the results when compared to using the spatiotemporal data directly. This
idea is now also applied to modelling based on the Fuzzy C-Means (FCM), where the partition matrix
produced by the FCM algorithm is used as input to the models. Four approaches use the data in this
setup; feature selection was performed for each of them and the results are discussed. The proposed
feature transformations were shown to deal better with higher-order feature sets than using raw
measurements as model inputs, and their intuitiveness preserves the transparency needed in health
care data for further interpretation.
The output of the collaborative work developed in this thesis includes two conference papers [22]
[52]. Furthermore, the work developed in this thesis and in [52] is expected to be extended into a
journal paper in the near future.
1.4 Outline
Chapter 2 presents the underlying methods used throughout this study. The theoretical aspects of
clustering are introduced, followed by a description of fuzzy modelling. Next, feature selection and its
variants are presented, and validation measures are described.
Chapter 3 introduces the MIMIC II database and covers the data pre-processing, by giving a de-
tailed description of the transformation from the original database to the final datasets - punctual and
spatiotemporal - to which the methods will be applied.
In Chapter 4, the main results are shown, discussed and compared. This includes all the decisions
that were taken based on performance indexes to proceed with the study - fixing parameters and feature
selection. Feature selection results are shown and discussed, as well as the model assessment results.
This is performed both for ensemble modelling and for the time-series analysis coupled with time-invariant
features. Finally, in Chapter 5, the results are summarized and conclusions are drawn. The advantages and
disadvantages of each method are discussed, their limitations are presented and future work is proposed.
Chapter 2
Methods
This chapter goes in depth into the theoretical background of the methods used throughout this study.
It starts with the concept of clustering and its role in data partitioning, covering in particular
the Fuzzy C-Means algorithm (FCM) and the Mixed Fuzzy Clustering algorithm (MFC), whose
main difference lies in the distance function: the former clusters individual data points, without
differentiating between time-variant and time-invariant data, while the latter is suited for clustering
time-series combined with time-invariant data, where the weight of the time-variant component is given
through a parameter λ. Next it covers fuzzy modelling, namely the Takagi-Sugeno (TS) fuzzy models,
crucial to the development of this study, and the proposed variants of this model are addressed in
detail: antecedents of the fuzzy model obtained through FCM and MFC, and feature transformation
using FCM and MFC. Then ensemble modelling (or multimodel approach) is presented, an alternative
to the common modelling structure that passes through different stages of clustering in order to build a
group of models, each trained to deal with a specific partition of the data and playing its role in the final
prediction through four different criteria. Then the concept of feature selection is introduced, along with
how it is applied in this context. Finally, the validation measures, both for clustering validation and model
validation, are introduced.
2.1 Clustering
Clustering is the problem of grouping data based on similarity, i.e., it is the task of dividing data
points into homogeneous classes or clusters so that items in the same class are as similar as possible
and items in different classes are as dissimilar as possible. Clustering can also be thought of as a
form of data compression, where a large number of samples are converted into a small number (when
compared with the size of the dataset) of representative prototypes or clusters. Depending on the data
and the application, different types of similarity measures may be used to identify classes, where the
similarity measure controls how the clusters are formed and is often expressed in terms of a distance
norm between the data vectors, or between data vectors and the cluster centres (prototypes). Depending
on whether the output (if known) is used to aid the partitioning task, the algorithm is called supervised
or unsupervised, respectively.
There exist a large number of clustering algorithms but, generally speaking, clustering algorithms
can be divided into four groups: partitioning methods, hierarchical methods, density-based methods and
grid-based methods. Along the same lines, clustering can be divided into two categories according
to the way the data is partitioned: hard and fuzzy clustering.
In hard (non-fuzzy) clustering, data is divided into crisp clusters, where each data point belongs
to exactly one cluster. In fuzzy clustering, the data points can belong to more than one cluster, and
associated with each of the data points are membership grades which indicate the degree to which the
data points belong to each cluster. In this study, only fuzzy partitioning methods, which are the basis
for all the modelling work, are covered in detail: the Fuzzy C-Means (FCM) algorithm and the Mixed
Fuzzy Clustering (MFC) algorithm.
2.1.1 Fuzzy C-Means
The FCM algorithm is one of the most widely used fuzzy clustering algorithms. It attempts to partition
a finite collection of elements X = {x1, x2, ..., xNs} into a collection of nc fuzzy clusters, allowing one
data point to belong to two or more clusters, given some criterion. This method (developed by Dunn in
1973 and improved by Bezdek in 1981) is frequently used in pattern recognition, image processing, data
mining and fuzzy modelling [38]. It is based on the minimization of the objective function in Equation 2.1:

$$J_m = \sum_{i=1}^{n_c} \sum_{j=1}^{N_s} \mu_{ij}^m \, d_{ij}^2(x_j, c_i), \quad 1 \le m < \infty \qquad (2.1)$$

where $n_c$ is the number of clusters, $N_s$ is the number of samples, $m$ is the fuzziness parameter,
which can take any real number greater than 1, $\mu_{ij}$ is the membership degree of sample $x_j$ to
the $i$th cluster, $x_j$ is the $j$th N-dimensional measured data point, $c_i$ is the N-dimensional
centre of the $i$th cluster, and $d_{ij}^2(x_j, c_i)$ is any norm expressing the similarity between the
measured data $x_j$ and the prototype $c_i$.
The information contained in each $\mu_{ij}$ can be organized into a matrix $U$, as in Equation 2.2,
commonly called the partition matrix:

$$U = \begin{bmatrix} \mu_{11} & \cdots & \mu_{1 N_s} \\ \vdots & \ddots & \vdots \\ \mu_{n_c 1} & \cdots & \mu_{n_c N_s} \end{bmatrix} \qquad (2.2)$$
There are three properties inherent to $\mu_{ij}$. The first is that $\mu_{ij} \in [0, 1]\ \forall\, i, j$,
where zero implies that sample $x_j$ does not belong at all to cluster $c_i$, and one implies that
sample $x_j$ completely belongs to cluster $c_i$ (superimposed). The second is that the sum of the
membership degrees of any sample over all clusters must be equal to one, according to Equation 2.3:

$$\sum_{i=1}^{n_c} \mu_{ij} = 1 \quad \forall\, j \qquad (2.3)$$
The third is that the total membership of all samples to any cluster must be greater than zero and
smaller than $N_s$ (Equation 2.4), implying that no cluster is empty and no cluster contains all of the data:

$$0 < \sum_{j=1}^{N_s} \mu_{ij} < N_s \quad \forall\, i \qquad (2.4)$$
After defining $m$ and the number of clusters $n_c$, the matrix $U$ is randomly initialized and the fuzzy
partitioning is carried out through an iterative optimization of the objective function 2.1, with the update
of the prototypes $c_i$ as in Equation 2.5:

$$c_i = \frac{\sum_{j=1}^{N_s} \mu_{ij}^m \, x_j}{\sum_{j=1}^{N_s} \mu_{ij}^m} \qquad (2.5)$$
followed by the update of the membership degrees $\mu_{ij}$, Equation 2.6:

$$\mu_{ij} = \frac{1}{\sum_{k=1}^{n_c} \left( \frac{d_{ij}}{d_{kj}} \right)^{\frac{2}{m-1}}} \qquad (2.6)$$
The stopping condition can be anything that works for the problem. In this case the iteration stops when
$\max_{i,j} \left| \mu_{ij}^{(z+1)} - \mu_{ij}^{(z)} \right| < \varepsilon$, where $\varepsilon$ is a
stopping tolerance between 0 and 1, and $z$ is the iteration step. This procedure converges to a local
minimum or a saddle point of $J_m$.
In the present study, each sample is assigned to each cluster with a certain degree of membership.
This degree depends on the distance between the sample and the cluster prototype, which in a
general way can be computed as:
$$d_{ij}^2(x_j, c_i) = \|x_j - c_i\|_{A_i}^2 = (x_j - c_i)^T A_i (x_j - c_i) \qquad (2.7)$$

where $A_i$ is a positive definite symmetric matrix, usually equal to the identity matrix in the FCM algorithm.
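As a concrete illustration of the loop described above (random initialization of the partition matrix, prototype update of Equation 2.5, membership update of Equation 2.6, stopping test on the partition matrix), the following is a minimal NumPy sketch; the function name, defaults and tolerance are illustrative, not the implementation used in this work:

```python
import numpy as np

def fcm(X, n_clusters, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal Fuzzy C-Means sketch following Equations 2.1-2.6."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Random initial partition matrix with columns summing to one (Eq. 2.3).
    U = rng.random((n_clusters, n_samples))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Prototype update (Eq. 2.5): fuzzy weighted mean of the samples.
        C = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Euclidean sample-to-prototype distances (Eq. 2.7 with A_i = I).
        d = np.linalg.norm(X[None, :, :] - C[:, None, :], axis=2) + 1e-12
        # Membership update (Eq. 2.6), written as normalized inverse distances.
        U_new = d ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)
        if np.max(np.abs(U_new - U)) < eps:  # stopping condition
            return U_new, C
        U = U_new
    return U, C
```

Each column of the returned $U$ holds one sample's membership degrees, which sum to one as required by Equation 2.3.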
2.1.2 Mixed Fuzzy Clustering
Mixed Fuzzy Clustering (MFC) is a clustering method based on Fuzzy C-Means [7] and its principle is to
couple time-invariant data with time-variant data. This method generalizes the spatio-temporal concept
to any set of time-variant and time-invariant features and its extension to the analysis of multiple time-
series. In health-care data this means using information such as demographic data, standard scores
or other information that can be considered static, while still considering time-varying physiological
data, i.e., data with a sampling rate that cannot be neglected and whose evolution is measured and
taken into account.
In this approach each sample xi is composed by features that are constant during the sampling time
in analysis and by features that change over time (multiple time-series).
$$x_i = (x_i^s, X_i^t) \qquad (2.8)$$
In order to extend the spatio-temporal clustering method proposed in [35] which only deals with one
time-series to the case of multiple time-series, a new dimension is introduced, to handle p time-variant
features. As presented in Equations 2.9 and 2.10, the spatial component of each sample is represented
by $x_i^s$, where $r$ is the number of spatial features, and the temporal component is represented by
the matrix $X_i^t$, whose number of columns equals the number of temporal features $p$ and whose
number of rows equals the number of temporal samples $q$. The MFC algorithm clusters the dataset
using an augmented form of the FCM. The main difference between the augmented and the classical
FCM lies in the distance function: in the augmented FCM a new weighting element $\lambda$ is included,
setting the importance given to the time-variant component, and the distance is computed separately
for each time-series.
$$x_i^s = (x_{i,1}^s, \ldots, x_{i,r}^s) \qquad (2.9)$$
$$X_i^t = \begin{bmatrix} x_{i,1,1}^t & x_{i,1,2}^t & \cdots & x_{i,1,p}^t \\ x_{i,2,1}^t & x_{i,2,2}^t & \cdots & x_{i,2,p}^t \\ \vdots & \vdots & \ddots & \vdots \\ x_{i,q,1}^t & x_{i,q,2}^t & \cdots & x_{i,q,p}^t \end{bmatrix} \qquad (2.10)$$
The spatial prototypes are represented for each cluster $l = 1, 2, \ldots, n_c$ by $v_l^s$ and are
computed following Equation 2.11:

$$v_l^s = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_i^s}{\sum_{i=1}^{n} u_{l,i}^m} \qquad (2.11)$$
The temporal prototypes for each cluster $l$ and feature $k$ are represented by $v_{l,k}^t$, computed
following Equation 2.12, and the matrix of temporal prototypes for cluster $l$ is represented by $V_l^t$:

$$v_{l,k}^t = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_{i,k}^t}{\sum_{i=1}^{n} u_{l,i}^m} \qquad (2.12)$$
The distance between a sample and the spatial and temporal prototypes of a cluster is computed
following Equation 2.13, where $\delta$ represents the Euclidean distance:

$$d_\lambda^2(v_l^s, V_l^t, x_i) = \|v_l^s - x_i^s\|^2 + \lambda \sum_{k=1}^{p} \delta(v_{l,k}^t, x_{i,k}^t) \qquad (2.13)$$
From this point, both the membership degree of sample $i$ to cluster $l$ (Equation 2.14) and the
objective function of the augmented FCM (Equation 2.15) have the same format as their FCM
counterparts in Section 2.1.1; the only difference is the distance measure, which now has to consider
the spatial and temporal distances:

$$u_{l,i} = \frac{1}{\sum_{o=1}^{n_c} \left( \frac{d_\lambda(v_l^s, V_l^t, x_i)}{d_\lambda(v_o^s, V_o^t, x_i)} \right)^{\frac{2}{m-1}}} \qquad (2.14)$$

$$J = \sum_{l=1}^{n_c} \sum_{i=1}^{n} u_{l,i}^m \, d_\lambda^2(v_l^s, V_l^t, x_i) \qquad (2.15)$$
The MFC algorithm is described in Algorithm 1. Its inputs are the spatial data $X^s$ and temporal data
$X^t$, the number of clusters $n_c$, the initial partition matrix $U$, the fuzzification parameter $m$ and
the temporal component weight $\lambda$. It returns the final partition matrix $U$, the spatial
prototypes $V^s$ and the temporal prototypes $V^t$.
Algorithm 1 Mixed Fuzzy Clustering (MFC)

1: Input:
2:   X^s: N_s × r matrix of spatial data
3:   X^t: N_s × q × p matrix of temporal data
4:   n_c: number of cluster prototypes
5:   U: n_c × N_s initial partition matrix
6:   m: fuzzification parameter
7:   λ: temporal component weight
8: Output:
9:   U: n_c × N_s partition matrix
10:  V^s: n_c × r spatial cluster prototypes
11:  V^t: n_c × q × p temporal cluster prototypes
12: while ∆J > ε do
13:   Compute the spatial cluster prototypes V^s (Eq. 2.11)
14:   for k in {1, ..., p} do
15:     Compute the temporal cluster prototype v^t_k (Eq. 2.12)
16:   end for
17:   Compute the distances d_λ (Eq. 2.13)
18:   Update the partition matrix U (Eq. 2.14)
19:   Compute ∆J
20: end while
2.2 Fuzzy Modelling
In the present thesis, fuzzy modelling is used in order to obtain classification models. These models
are created using the training data of a target system and are expected to reproduce its behaviour,
i.e., to correctly assign samples of the validation set to the corresponding
label [37]. Fuzzy modelling systems are non-linear systems capable of inferring complex non-linear
relationships between input and output data when there is little or no previous knowledge of the problem
to be modelled.
In contrast to classical set theory and its underlying Boolean logic, fuzzy logic allows various possible
degrees of membership of an element in a given fuzzy set [19]. Fuzzy logic can be viewed as
an extension of classical sets that handles the concept of partial truth, whose value may vary between
0 (completely false) and 1 (completely true).
Furthermore, when linguistic variables are used, these degrees may be managed by specific functions
offering a more realistic framework for human reasoning (also called approximate reasoning) than the
traditional two-valued logic [59].
Fuzzy systems (static or dynamic systems that make use of fuzzy sets and fuzzy logic) [4] can be
used for a variety of purposes, such as modelling, data analysis, prediction and control. The process of
formulating the mapping from a given input to an output using fuzzy logic is called fuzzy inference. This
process comprises four parts, presented in Figure 2.1 and described as follows [54]:
Fuzzification of the input variables - the first step is to take the inputs and determine the degree to
which they belong to each of the appropriate fuzzy sets via membership functions (conversion
of the input variables into linguistic values). A fuzzy operator (AND or OR) is applied in the an-
tecedent: after the inputs are fuzzified, the degree to which each part of the antecedent is satisfied
for each rule is known. If the antecedent of a given rule has more than one part, the fuzzy operator
is applied (to two or more membership values from the fuzzified inputs variables) to obtain one
number that represents the result of the antecedent for that rule. This number is then applied to
the output function resulting in a single truth value.
Knowledge base - contains the main relationships between inputs and outputs. It is composed of a
database, where the membership functions for linguistic terms are defined, and a rule base, generally
represented by if-then statements.
Inference engine - is responsible for computing the fuzzy output of the system, using the information
contained in the rule base and the given input value to produce an output. Here occurs the aggre-
gation of the consequents across the rules.
Defuzzification of the output fuzzy set - provides a crisp value from the fuzzy set output.
Figure 2.1: Functional block diagram of a fuzzy inference system (FIS).
Different consequent constituents result in different fuzzy inference systems, but their antecedents
are always the same [37]. The two best known types of Fuzzy Inference Systems (FIS) are:
Linguistic or Mamdani fuzzy model / type inference - where both the antecedent and the consequent
are fuzzy propositions. It expects the output membership functions to be fuzzy sets. After the
aggregation process, there is a fuzzy set for each output variable that needs defuzzification.
Takagi-Sugeno (TS) fuzzy model / type inference - where the consequents are crisp functions of the
antecedent variables rather than a fuzzy proposition. The first two parts of the fuzzy inference
process, fuzzifying the inputs and applying the fuzzy operator, are exactly the same as the former.
The main difference is that the TS output membership functions/consequent are either a constant
or a linear equation, zero-order or first-order respectively.
So, the main difference between these two types of inference systems lies in the consequent of the
fuzzy rules, and thus in their aggregation and defuzzification procedures, which change the way the
crisp output is generated from the fuzzy inputs; the first two parts, fuzzifying the inputs and applying
the fuzzy operator, remain unchanged. At the cost of losing the expressive power and interpretability of
the Mamdani output (since the consequents of TS rules are not fuzzy), TS offers better processing time,
since the weighted average replaces the time-consuming defuzzification process, making it by far the
most popular candidate for sample-data-based fuzzy modelling. Moreover, TS works better with
optimization and adaptive techniques, which makes it very attractive for dynamic non-linear systems.
These adaptive techniques can be used to customize the membership functions so that the fuzzy
system best models the data.
In some domains, like medicine, it is preferable not to use black-box models, so that the user can
understand the classifier and evaluate its results. Due to the grey-box nature of fuzzy models, this
approach is appealing, as it provides not only a transparent, non-crisp model, but also a linguistic
interpretation in the form of if-then rules, which can potentially be embedded into clinical decision
support processes. In this
work, first-order Takagi-Sugeno fuzzy models (TS-FM) [55] are derived from the data. These consist
of fuzzy rules where each rule describes a local input-output relation. With TS-FM, each discriminant
function consists, for the binary classification case, of rules of the type

$$R_i: \text{If } x_1 \text{ is } A_{i1} \text{ and } \ldots \text{ and } x_N \text{ is } A_{iN} \text{ then } y = f_i(x), \quad i = 1, 2, \ldots, K \qquad (2.16)$$

where $f_i$ is the consequent function of rule $R_i$. The output of the discriminant function $y$ can be
interpreted as a score (or evidence) for the positive class given the input feature vector $x$. The degree
of activation of the $i$th rule is given by $\beta_i = \prod_{j=1}^{N} \mu_{A_{ij}}(x_j)$, where
$\mu_{A_{ij}} : \mathbb{R} \to [0, 1]$. The discriminant output is computed by aggregating the individual
rule contributions in a weighted-average fashion:
$d(x) = \sum_{i=1}^{K} \beta_i f_i(x) \big/ \sum_{i=1}^{K} \beta_i$. A sample $x$ is considered positive
if the score is higher than a certain threshold $t$, i.e., $y > t$, transforming the continuous
discriminant output into a binary classifier.
The number of rules K of the type Ri and the antecedent fuzzy sets Aij are determined by fuzzy
clustering in the product space of the input variables. The consequent functions fi(x) are linear functions
determined by ordinary-least squares (OLS) in the space of the input and output variables.
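The inference step of a first-order TS-FM (rule activation by a product of antecedent memberships, then a weighted average of the linear consequents) can be sketched as follows; the membership functions and consequent parameters are placeholders for those identified by fuzzy clustering and OLS:

```python
import numpy as np

def ts_fm_score(x, rules, consequents):
    """First-order Takagi-Sugeno discriminant output (sketch).

    rules:       list of K rules, each a list of N membership functions
                 mu_Aij mapping a scalar input to [0, 1].
    consequents: list of K pairs (w, b) defining f_i(x) = w . x + b.
    Returns the continuous score d(x); comparing it with a threshold t
    turns the discriminant into a binary classifier.
    """
    betas, outputs = [], []
    for mfs, (w, b) in zip(rules, consequents):
        # Degree of activation: product of the antecedent memberships.
        beta = float(np.prod([mf(xj) for mf, xj in zip(mfs, x)]))
        betas.append(beta)
        outputs.append(float(np.dot(w, x) + b))
    betas = np.asarray(betas)
    return float(np.dot(betas, outputs) / betas.sum())
```

With Gaussian membership functions, for example, samples near a rule's antecedent prototypes activate that rule almost exclusively, so the score approaches that rule's consequent value.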
2.3 Modelling Based on MFC
The methods presented here deal with mixed data: time-variant data paired with time-invariant data.
There are four different modelling approaches, two of which work as baselines for comparison with the
other two that make use of the MFC. Figure 2.2 offers an overview of what comes next.
Figure 2.2: Methods used for time-variant data coupled with time-invariant data.

The feature transformation that is applied (second and fourth methods in Figure 2.2) consists in passing
the data through a clustering method (FCM or MFC) and using the resulting partition matrix as input to
the models. Contrary to most data transformations, this one adds a layer of interpretability: based
on the membership degrees of a patient's data to the existing clusters (which can be seen as
categorical values), the model classifies whether the patient will need vasopressors or not. Medical
significance is thus retained, since each patient's data keeps a membership degree to each
cluster/category.
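A sketch of this transformation, assuming cluster prototypes already obtained from a previous FCM (or MFC) run: each sample is replaced by its vector of membership degrees, so the model input dimension becomes the number of clusters regardless of how many raw variables there were (function name and defaults are illustrative):

```python
import numpy as np

def membership_features(X, prototypes, m=2.0):
    """Replace raw samples by their membership degrees to given clusters."""
    # Distances from each sample to each prototype (small offset avoids 0).
    d = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2) + 1e-12
    # FCM-style memberships: normalized inverse distances (cf. Eq. 2.6).
    U = d ** (-2.0 / (m - 1.0))
    U /= U.sum(axis=1, keepdims=True)
    return U  # shape (n_samples, n_clusters); each row sums to one
```

The resulting columns can be read as categorical degrees ("how much this patient's data looks like cluster k"), which is what preserves the interpretability discussed above.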
2.3.1 FCM Fuzzy Model
In FCM FM, the time-invariant features and the samples of the time-variant features are all treated
equally as features, and the antecedent fuzzy sets are determined using the partition matrix generated
by FCM. This is one of the most commonly used clustering methods for the identification of TS-FM and
has been used in multiple health care applications [34].
2.3.2 FCM Fuzzy Model with FCM feature transformation
In FCM-FCM FM, the time-invariant and time-variant features are initially clustered using FCM and
the generated partition matrix is used as the feature set for an FCM fuzzy model. In this case the
number of features after transformation is equal to the number of clusters specified for the clustering
stage, and the features are the degrees of membership of each sample to those clusters.
This approach can be seen as a type of feature transformation method for which the resulting fea-
tures represent the degree of membership of each point to the different clusters generated by the FCM
algorithm.
2.3.3 MFC fuzzy model
In MFC FM, the antecedent fuzzy sets of the TS FM are determined based on the partition matrix
generated by the MFC algorithm.
This methodology was developed based on the belief that, in the presence of a mixture of time-variant
and time-invariant features, the identification of the fuzzy membership functions should be based on a
non-conventional clustering algorithm: time-variant features should not be directly blended with
time-invariant features when calculating distances, and different time-variant features should also be
dealt with separately.
2.3.4 FCM Fuzzy Model with MFC feature transformation
In MFC-FCM FM, the time-invariant and time-variant features are first clustered using MFC, and the
generated partition matrix is used as the feature set of an FCM fuzzy model. In this case, the number
of features after transformation equals the number of clusters specified for the MFC algorithm, each
feature being the degree of membership of a sample to one of the mixed clusters.
This approach can be seen as a type of feature transformation method for which the resulting features
represent the degree of membership of each patient's data to the different clusters generated by the
MFC algorithm.
2.4 Ensemble Modelling
Traditionally, it is assumed that there is a single best model for making inferences from data. Some
authors suggest, however, that inference should be based on a full set of models built on data subgroups,
where relevant triggering and aggregating mechanisms are defined to activate, or define a suitable
interaction between, the single models, allowing multimodel/ensemble model inference [48].
The rationale behind ensemble machine learning systems is to create many classifiers and combine their
outputs so that the performance improves compared with each single classifier [49]. The strategies used
for combining classifiers can be divided into two types: classifier selection and classifier fusion.
Classifier selection - each classifier is trained to be an expert in a subspace of the feature space, and
the answer is obtained from a single classifier selected according to the input data. In this case,
each model has its own threshold to transform the continuous output into a binary output.
Classifier fusion - the classifiers are trained over the entire feature space, and all individual
classifiers are combined into one stronger classifier that ultimately provides the decision. In this
case, the threshold that transforms the continuous output into a binary output is defined for the
combined output and not for each model.
In [21], a fuzzy multimodel approach to an ensemble classifier is proposed that models individual clas-
sifiers on specific subgroups of data, obtained by clustering samples with common characteristics. Two
decision criteria are proposed: one based on the arithmetic mean of the clusters' outputs and another
based on the average of each cluster's output weighted by the distance to the corresponding cluster.
The performance of the proposed multimodel is compared with a previous multimodel
developed by [21], which uses two decision criteria: an a priori decision based on the distance from
the clusters' centres to the sample characteristics, and an a posteriori decision based on how far the
model's output response lies from the threshold of each model. The proposed criteria seek to mimic the
natural decision-making process humans tend to follow: consulting the opinion of several experts before
making a decision. Following the majority opinion is usually preferable, and produces better outcomes,
than following the opinion of a single expert whose experience may differ significantly from the others'.
In this context, final decisions are usually approximated by an appropriate combination of different
opinions, each with a different underlying weight. The objective is to verify whether greater predictive
performance can be achieved with aggregation techniques (arithmetic mean and distance-weighted mean
criteria) than with selection techniques based on distance metrics (a priori and a posteriori criteria)
when building an ensemble classifier.
2.4.1 Subgroup selection
The Fuzzy C-Means (FCM) clustering algorithm was initially used to divide the dataset into similar inde-
pendent subgroups in the N-dimensional space of the input variables (unsupervised clustering). Each
sample of the training dataset is assigned to the cluster for which its degree of membership is highest.
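In code, this hard assignment is just a column-wise arg-max over the partition matrix. The sketch below uses a hypothetical matrix `U` of the shape produced by FCM (clusters by samples, columns summing to one):

```python
import numpy as np

# U[i, j]: membership degree of training sample j to unsupervised cluster i
# (columns sum to one, as produced by FCM). Values here are illustrative.
U = np.array([[0.8, 0.1, 0.55],
              [0.2, 0.9, 0.45]])

# Each sample joins the subgroup whose membership degree is highest.
subgroup = np.argmax(U, axis=0)

# Index the training samples belonging to each subgroup.
groups = {c: np.where(subgroup == c)[0] for c in range(U.shape[0])}
```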
The FCM clustering process requires the definition of two parameters: (i) the number of clusters C and
(ii) the degree of fuzziness m of the clustering, i.e., the weighting exponent of the clustering algorithm
[6]. These two model parameters were selected using the methods presented in Section 2.6.2.
2.4.2 Subgroup modelling
A first order Takagi-Sugeno fuzzy model was developed for each unsupervised cluster/subgroup,
resulting in Kc individual models. The number of rules Ri, the antecedent fuzzy sets AiN and the
consequent parameters were determined by means of FCM clustering in the product space of the input
and output variables (supervised clustering), where the number of clusters translates into the number of
fuzzy rules.
The complete test set is evaluated on each unsupervised cluster's model (subgroup model), resulting
in Kc different predictions for each sample. From the combination or selection of these individual
outcomes, a new final prediction, the multimodel prediction Ymm, is obtained for each sample based on
one of four criteria: a priori, a posteriori, arithmetic mean and distance-weighted mean. These criteria
are covered in detail in Section 2.4.3.
Since this is a classification problem and y ∈ [0, 1], a threshold t is required to turn the continuous
output into a binary output y ∈ {0, 1}. The threshold for each model is determined by evaluating the
training set with the corresponding criteria: the predicted output is 1 if y ≥ t and 0 if y < t, and
the optimal threshold is found by balancing sensitivity and specificity. The optimal number of clusters
nc and degree of fuzziness m for each model Kc are determined by grid search.
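A sketch of this threshold choice (with hypothetical scores and labels), scanning the candidate thresholds and keeping the one that minimises the gap between sensitivity and specificity:

```python
import numpy as np

def pick_threshold(y_true, y_score):
    """Choose the threshold on a continuous output that best balances
    sensitivity and specificity, i.e. minimises |sensitivity - specificity|."""
    best_t, best_gap = 0.5, np.inf
    for t in np.unique(y_score):            # each score is a candidate cut-off
        pred = (y_score >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fn = np.sum((pred == 0) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0))
        fp = np.sum((pred == 1) & (y_true == 0))
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

y_true  = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.4, 0.7, 0.9])
t = pick_threshold(y_true, y_score)         # here sens = spec = 2/3 at t = 0.6
```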
Figure 2.3: Schematic representation of the single and multimodel approaches; [*] cluster centres are relevant for the case where feature selection is performed independently for each subgroup/unsupervised cluster, see Section 2.5.2.
A schematic representation of the multimodel, where a 10-fold cross validation was used, is depicted
in Figure 2.3.
2.4.3 Ensemble decision criteria
The decision strategies that were mentioned, two based on classifier selection (a priori and a pos-
teriori) and two based on classifier fusion (arithmetic mean and distance-weighted mean), are formally
described below:
Distance to the clusters centres - a priori
Each sample represents a point in an N-dimensional space. The distance of each sample to each
cluster's centre is calculated; the cluster closest to the point is selected, and the classification
of the multimodel is given by the classification of that cluster's model:
\[ g = \arg\min_i \, (d_{ij}) \tag{2.17} \]
\[ Y_{mm,j} = Y_g, \tag{2.18} \]
where dij represents the Euclidean distance between sample j and cluster i, g the cluster that
minimizes that distance, and Ymm,j the prediction made by the multimodel for sample j.
Difference between the threshold and the predicted outcome - a posteriori
For each sample, the difference between the threshold t and the predicted value by each model is
calculated. Higher differences mean more discrimination, i.e., the predicted value is more certainly
assigned to one of the classes, depending whether it is above or below the threshold. Hence, the
model that gives the classification is the model that gives a prediction more distant to its threshold:
\[ g = \arg\max_i \, (|t_i - y_{ij}|) \tag{2.19} \]
\[ Y_{mm,j} = Y_g \tag{2.20} \]
Arithmetic mean
In this case, the multimodel prediction is the arithmetic mean of the single models' predictions
(equation (2.21)). The idea is to check whether combining the outputs of several classifiers by
averaging can reduce the risk of selecting a poorly performing classifier. Even though the average
may not beat the performance of the best classifier in the ensemble, it may reduce the overall risk
of making a poor selection.
\[ Y_{mm,j} = \frac{\sum_{i=1}^{K_c} y_{ij}}{K_c} \tag{2.21} \]
Distance-weighted mean
A mean of the single models' outputs, weighted by the Euclidean distance to the corresponding cluster
centre and given by equation (2.22), is calculated for each test sample. The idea is to combine the
individual models while giving higher credit to those classifiers trained with data closest to the
sample under evaluation.
\[ Y_{mm,j} = \frac{\sum_{i=1}^{K_c} \frac{1}{d_{ij}} y_{ij}}{\sum_{i=1}^{K_c} \frac{1}{d_{ij}}} \tag{2.22} \]
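For a single test sample, the four criteria reduce to a few lines of code. The sketch below uses hypothetical values for the model outputs, the distances to the cluster centres and the per-model thresholds (the variable names are ours):

```python
import numpy as np

# Hypothetical ensemble state for one test sample j:
y = np.array([0.30, 0.80])   # continuous prediction of each single model
d = np.array([2.00, 0.50])   # Euclidean distance to each cluster centre
t = np.array([0.50, 0.55])   # threshold of each model (a posteriori rule)

# a priori (eqs. 2.17-2.18): trust the model of the closest cluster
y_apriori = y[np.argmin(d)]

# a posteriori (eqs. 2.19-2.20): trust the most "decided" model,
# i.e. the one whose output lies farthest from its own threshold
y_aposteriori = y[np.argmax(np.abs(t - y))]

# arithmetic mean (eq. 2.21)
y_mean = y.mean()

# distance-weighted mean (eq. 2.22): inverse-distance weights
w = 1.0 / d
y_weighted = np.sum(w * y) / np.sum(w)
```

For these values, both selection rules pick the second model (closest cluster and largest threshold gap), while the fusion rules return 0.55 and 0.70 respectively.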
2.5 Feature Selection
The concept of Feature Selection (FS) is to reduce the dimensionality of the datasets in order
to keep only the features that are most relevant, based on the optimization of a specified criterion.
One desirable property of a machine-learned model is low variance, i.e., it should not overfit the
training data and lose the ability to generalize to unseen data. One way to achieve this is to minimize
the number of features the model uses, keeping only the most informative ones. Feature selection is
different from dimensionality reduction: both seek to reduce the number of attributes in the dataset,
but dimensionality reduction methods do so by creating new combinations of attributes, whereas feature
selection methods include and exclude attributes present in the data without changing them. Examples of
dimensionality reduction methods include Principal Component Analysis, Singular Value Decomposition and
Sammon's Mapping.

There are many motivations to perform feature selection. Besides the direct benefits of reducing the
dimensionality of the data, lowering the processing and storage requirements and improving visualization
and understanding of the data [27], there are side benefits that can be of even higher value: cutting
down the data storage requirements can reduce equipment needs, avoiding useless investment, and in some
cases it can improve the results by removing variables that do not contribute to the objective and act
as noise, misguiding the judgement. The process also has the advantage of highlighting sets of features
not previously thought of as the most relevant for prediction, owing to its capability of choosing
features that in isolation carry no relevant information but, in association with other features, can be
of utmost importance. In health care applications there is the potential to take advantage of every
aspect mentioned: cutting measurements that are routinely taken reduces laboratory costs, improves the
management of resources, and leads to a better understanding of the condition under study and to the
disclosure of the information contained in some features or groups of features. From an engineering
perspective, it is a powerful tool to reduce the complexity of the model and remove inputs that are
redundant or do not improve the classification performance.
The machine learning community classifies feature selection into four different categories: filters,
wrappers, hybrids and embedded [10] [53].
Filter algorithms - the selection method is used as a preprocessing step that does not attempt to
directly optimize the predictor performance. Examples of filter methods include the Chi-squared test,
information gain and correlation coefficient scores, as well as measurements of entropy, variance,
correlation or mutual information of single and multiple variables.
Wrapper algorithms - the selection method directly optimizes the predictor performance. An example of
a wrapper method is the recursive feature elimination algorithm.
Hybrid algorithms - a filter method is applied first to obtain a reduced set of features, and a wrapper
method is then used to select the most relevant ones, according to the performance of a selected
machine learning algorithm.
Embedded algorithms - feature selection and model tuning are carried out at the same time. This will
probably require some experience to know where to stop for greedy algorithms, like backward and forward
selection, as well as to tune the parameters of regularization-based models. Regularization refers to
the method of preventing overfitting of the training sample by explicitly controlling the model
complexity. Examples of regularization algorithms are the Lasso, Elastic Net and Ridge Regression.
A greedy algorithm is a procedure that looks for simple, easy-to-implement solutions to complex,
multi-step problems by deciding, at each step, which next move provides the most obvious benefit. Such
algorithms are called greedy because, while the optimal solution to each smaller instance provides an
immediate output, the algorithm does not consider the larger problem as a whole; once a decision has
been made, it is never reconsidered. Greedy algorithms work by recursively constructing a set of objects
from the smallest possible constituent parts, where recursion means that the solution to a problem
depends on solutions to smaller instances of the same problem. The advantage of a greedy algorithm is
that the solutions to the smaller instances can be straightforward and easy to understand. The
disadvantage is that the optimal short-term choices may lead to the worst possible long-term outcome.
There are two ways to use greedy algorithms for feature selection: forward selection, where variables
are progressively included, growing the feature subset according to the relevance of the variables when
grouped; and backward elimination, which starts from the full feature set and iteratively removes the
least relevant attribute until the performance criterion stops increasing [27]. Another possibility is
the use of meta-heuristics, whose increased computational effort is compensated by a more refined
search [58].
The present thesis makes use of a greedy search strategy implemented in a wrapper way, following
the Sequential Forward Selection (SFS) approach.
2.5.1 Sequential Forward Selection
Following the wrapper concept, whose objective function is the optimization of the predictor perfor-
mance, the SFS method builds a model for each of the features under consideration and evaluates the
models' performance in order to define the next best candidate feature: the one that returns the best
value of the performance criterion, which is then added to the already chosen features, extending the
feature space by one dimension.
The process thus starts by evaluating which single feature is best for prediction, fixes that feature,
and then tries the combination of the selected feature with each of the remaining ones, repeating this
process until the value of the performance criterion stops increasing. Discrimination based on the area
under the receiver operating characteristic curve (see Section 2.6.1) [31] was used as the performance
criterion in this study.
As mentioned before, the drawback is the same as that stated for greedy algorithms: the high likelihood
of getting stuck in a sub-optimal solution. Nevertheless, its general acceptance in the medical
community and the advantages related to its simplicity, computational efficiency and transparent
interpretation of the results make it the sole contestant in this study, since the high-dimensional
datasets limit the usability of other algorithms.
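The procedure above can be sketched as a generic greedy loop around a user-supplied evaluation function. In the sketch below (names and toy data are ours), `evaluate` returns the AUC of a candidate subset, computed by the rank-sum identity, and scores a subset by the AUC of the mean of its standardised columns; any model-based scorer could be substituted:

```python
import numpy as np

def auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def sfs(n_features, evaluate):
    """Greedy forward selection: add the feature that most improves the
    criterion; stop as soon as no candidate improves it."""
    selected, best = [], -np.inf
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        scores = {f: evaluate(selected + [f]) for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:          # criterion stopped increasing
            break
        selected.append(f_best)
        best = scores[f_best]
    return selected, best

# Toy data: feature 0 is informative, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = np.column_stack([y + rng.normal(0, 0.3, 100), rng.normal(0, 1, 100)])

def evaluate(subset):
    # Score a subset by the AUC of the mean of its standardised columns.
    Z = (X[:, subset] - X[:, subset].mean(0)) / X[:, subset].std(0)
    return auc(y, Z.mean(axis=1))

selected, score = sfs(X.shape[1], evaluate)
```

On this toy data SFS keeps only the informative feature: adding the noise column dilutes the combined score, so the loop stops after the first pass.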
2.5.2 SFS with ensemble
Given the two-layered structure of the ensemble modelling, two different approaches were considered
when performing feature selection: FS based on overall performance and FS based on the singular models'
performance. Both procedures are summarized in Figure 2.4.
FS based on overall performance - The feature selection procedure uses the final predicted outcome as
the performance criterion, adding the “best” feature to the whole operation (to all classifiers).

FS based on the singular models' performance - The data is first partitioned using all the variables;
SFS is then performed for each group of data, resulting in a different subset of features for each
group, i.e., for each single model dedicated to each group.
The idea behind the latter procedure is that different subgroups (unsupervised clusters) may have
heterogeneous relevant features: in the first case, every feature that is added improves the final
result but may harm one of the clusters while benefiting another. By considering each group
independently, it is guaranteed that each added feature improves the performance of its model without
harming the others, making each model more specialised for its subgroup.
The problem that arises is that different features will be selected while the unsupervised clustering
still considers all 32 variables, meaning that one of the purposes of feature selection, reducing the
variables that need to be collected, has no effect in this case: all 32 time-variant variables must be
collected to perform the unsupervised clustering and fix the subgroup centres in the feature space. In
the former approach, the feature space can be reduced by discarding the features that do not improve
the performance, because both the unsupervised clustering and the modelling procedures are performed
only with the features being considered at each step.
An advantage of the latter is that feature selection is performed only once for each single model, and
the ensemble model approach is then applied directly based on those single models; in the former, since
the evaluation is based on the final prediction, the feature selection procedure has to be performed
for each ensemble modelling criterion.
Figure 2.4: Left: SFS applied in individual subgroups; Right: SFS applied to the ensemble methodology.
2.5.3 SFS with MFC
SFS was performed directly for each of the four approaches presented in Section 2.3, in order to check
how far each system can go in terms of outcome prediction when tuned. Nevertheless, the core question
is the comparison of the MFC approach with its competitor, to verify whether the added complexity
brings any benefit in this case.
2.6 Validation Measures
This section covers the validation measures that were used to evaluate the results and support the
decisions made throughout the study. First, the concept of area under the receiver operating
characteristic curve (AUC) is introduced, a measure that is crucial to evaluate performance during the
stages of feature selection and model assessment. Then follows a group of clustering validation
methods: Partition Coefficient, Classification Entropy, Partition Index, Separation Index, Xie-Beni
Index, Dunn's Index and Alternative Dunn Index.
2.6.1 AUC
In order to assess the performance of binary classifiers, it is not enough to consider the percentage
of correct classifications (accuracy, as in equation (2.23)): if one of the classes is less common,
misclassifying it has little impact on the accuracy value, and it is not possible to give more weight
to a class that might be of greater importance.
\[ \text{Accuracy (ACC)} = \frac{\#\text{ of correct classifications}}{\#\text{ of classified samples}} = \frac{TN + TP}{TN + TP + FP + FN} \tag{2.23} \]
In the present case the datasets are heavily unbalanced, so the sensitivity (hit rate) and the
specificity (one minus the false alarm rate) must be considered. The most standard method used in
medical applications and in the machine learning community to merge these two measures into the
assessment is the analysis of the area under the receiver operating characteristic curve (AUC).
A receiver operating characteristic (ROC) graph is a technique for visualizing, organizing and
selecting classifiers based on their performance, depicting the trade-off between hit rates and false
alarm rates. This performance graphing method has properties that make it particularly useful for
domains with skewed class distributions and unequal classification error costs, characteristics that
have become increasingly important as research continues into cost-sensitive learning and learning in
the presence of unbalanced classes.
A classification model (or classifier) is a mapping from instances to predicted classes. Some classifi-
cation models produce a continuous output to which different thresholds may be applied to predict class
membership. Given a classifier and an instance, there are four possibilities for the outcome:
True Positive (TP) - correctly classified positive instances;
False Negative (FN) - incorrectly classified positive instances;
True Negative (TN) - correctly classified negative instances;
False Positive (FP) - incorrectly classified negative instances.
The true positive rate (also called hit rate, recall or sensitivity) of a classifier is estimated by
equation (2.24), and the true negative rate (specificity) is given by equation (2.25).

\[ TP_{rate} \approx \frac{\#TP}{\#TP + \#FN} \tag{2.24} \]
\[ TN_{rate} \approx \frac{\#TN}{\#TN + \#FP} \tag{2.25} \]
An additional term associated with ROC curves is precision which in this context is defined by the
equation (2.26).
\[ \text{precision (prec)} = \frac{\#TP}{\#TP + \#FP} \tag{2.26} \]
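These definitions translate directly into code. The sketch below (on hypothetical, deliberately unbalanced labels) computes sensitivity, specificity, precision and accuracy from a binary prediction:

```python
import numpy as np

def rates(y_true, y_pred):
    """Sensitivity (eq. 2.24), specificity (eq. 2.25), precision (eq. 2.26)
    and accuracy (eq. 2.23) of a binary prediction."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # correctly classified positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed positives
    tn = np.sum((y_pred == 0) & (y_true == 0))   # correctly classified negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, precision, accuracy

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
sens, spec, prec, acc = rates(y_true, y_pred)
```

Here accuracy is 0.7 even though a quarter of the positives are missed, which is exactly the limitation of accuracy on unbalanced data discussed above.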
ROC graphs are two-dimensional graphs in which TPrate (sensitivity) corresponds to the Y axis and one
minus TNrate (the false positive rate, i.e., one minus specificity) corresponds to the X axis. The
sensitivity and specificity can take any value in [0, 1], depending on the selected threshold.
Several points in ROC space are important to note, as depicted in Figure 2.5:

Point (0,0) - never assigns a positive classification;

Point (1,1) - opposed to point (0,0), assigns a positive classification to every instance;

Point (0,1) - represents perfect classification.
Generally, one point in ROC space is better than another if it is closer to the point (0,1).
Classifiers appearing on the left-hand side of the graph, close to the X axis, can be thought of as
“conservative”: they make positive classifications only when there is strong evidence, so they make few
false positive errors but often have low true positive rates as well. On the other hand, classifiers on
the upper right-hand side of a ROC graph may be thought of as “liberal”: they make positive
classifications on weak evidence and classify nearly all positives correctly, but the rate of false
positives also increases. Other conclusions can be drawn by observing the ROC graph. A classifier
appearing in the lower right triangle performs worse than random guessing, which corresponds to the
diagonal connecting the points (0,0) and (1,1). Nevertheless, such a classifier can be thought of as
having useful information that it is applying incorrectly: if the classifier is negated, reversing its
classification, it is brought to the upper side of the diagonal (its true positive classifications
become false negative mistakes, and its false positives become true negatives).
Each threshold value produces a different point in ROC space. Thus, it is possible to choose where the
classifier stands along the ROC curve by choosing the threshold: to bring the classifier from the
conservative side to the liberal area of the graph, it suffices to lower the threshold. In this thesis,
the criterion that settles the threshold is the minimization of the absolute difference
|sensitivity − specificity|.
Figure 2.5: ROC curve example and points of interest (inspired on [20]).
2.6.2 Clustering Validation
With the significant resurgence of interest in new clustering techniques arises the need to validate
the quality of the resulting partitioning, since each clustering strategy has its own advantages and
shortcomings. The typical requirements for a good clustering technique in data mining are the
following [28]:
Scalability - The clustering method should be applicable to huge databases, and its performance should
degrade no worse than linearly as the data size increases.
Versatility - The objects to cluster can be of different types: numerical, boolean or categorical data.
Ideally, a clustering method should be suitable for all of them.
Ability to discover clusters with different shapes - This is an important requirement for spatial data
clustering. Many clustering algorithms can only discover clusters with spherical shapes, e.g., the
FCM.
Minimal input parameters - The method should require a minimum of domain knowledge for correct
clustering. However, most current clustering algorithms have several key parameters, which makes them
less practical for real-world applications.
Robust with regard to noise - This is important because noise exists everywhere in practical prob-
lems. A good clustering algorithm should be able to perform successfully even in the presence of
a great deal of noise.
Insensitivity to the data input order - The clustering method should give consistent results
irrespective of the order in which the data are presented.
Scalability to high dimensionality - The ability to handle high dimensionality is very challenging,
but real data sets are often multidimensional.
No single algorithm can fulfil all these requirements, so it is important to understand the
characteristics of each algorithm in order to select the one that best suits the problem at hand [36].
One of the main difficulties with clustering algorithms is how to assess the quality of the returned
clusters. Many popular clustering algorithms are known to perform poorly on several types of data sets.
In addition, virtually all current clustering algorithms require their parameters to be tweaked for the
best results, which is impossible if one cannot assess the quality of the output. While visual
inspection can be of use for low-dimensional data, most clustering tasks are of higher dimension than a
human inspector can analyse.
Clustering validation serves the purpose of evaluating the quality of the clustering results [46]. It has
two main categories: internal clustering validation and external clustering validation [43].
External clustering validation - uses external information. An example of an external validation
measure is entropy, which evaluates the “purity” of clusters based on the given class labels [56];
since the number of clusters/labels is already known, it is mainly used to choose the best clustering
algorithm for a specific dataset.

Internal clustering validation - relies only on information in the data, evaluating the goodness of the
clustering structure without respect to external information [47]; internal validation measures can be
used to choose the most suitable clustering algorithm as well as the optimal number of clusters and
other input parameters not known a priori.
Cluster properties such as compactness and separation are often taken into consideration as major
characteristics by which to validate clusters and are used for validity methods that are based only on
the data. Compactness is an indicator of the variation or scattering of the data within a cluster, and
separation is an indicator of the isolation of clusters from one another [39]. Thus, a good fuzzy partition
is expected to have a low degree of overlap and a large separation distance.
As already stated in Section 2.1, the aim of fuzzy clustering is to partition a data set into nc
homogeneous fuzzy clusters. The FCM algorithm requires the user to pre-define the number of clusters
(nc); however, there are many cases where knowing nc in advance is not possible. Given that the
resulting fuzzy partitions of the FCM algorithm depend on the choice of nc, it is necessary to validate
each of the fuzzy c-partitions once they are found. This validation is performed by a cluster validity
index, which evaluates each of the fuzzy c-partitions and determines the optimal partition or the
optimal number of clusters (nc). It is important to mention that conventional approaches to measuring
compactness suffer from a tendency to decrease monotonically as the number of clusters approaches the
number of data points.
Partition Coefficient (PC) - The PC measures the amount of “overlapping” between clusters. It is a
heuristic measure, since it has no connection to any property of the data, and is defined by Bezdek [6]
as in equation (2.27). Maximal values imply a good partition in the sense of a least fuzzy clustering.

\[ PC(c) = \frac{1}{N_s} \sum_{i=1}^{n_c} \sum_{j=1}^{N_s} (\mu_{ij})^m \tag{2.27} \]
Classification Entropy (CE) - The CE, defined by equation (2.28), measures only the fuzziness of the
cluster partition, being similar to the PC. It measures whether a particular location (prototype) has
high membership values in any of the classes. Minimal values imply a good partition in the sense of a
crisper partition.

\[ CE(c) = -\frac{1}{N_s} \sum_{i=1}^{n_c} \sum_{j=1}^{N_s} \mu_{ij} \log(\mu_{ij}) \tag{2.28} \]
Partition Index (SC) - The SC is the ratio of the sum of compactness and separation of the clusters,
given by equation (2.29). It is useful when comparing different partitions having an equal number of
clusters [5]. A lower value of SC indicates a better partition.

\[ SC(c) = \sum_{i=1}^{n_c} \frac{\sum_{j=1}^{N_s} (\mu_{ij})^m \|x_j - \upsilon_i\|^2}{N_i \sum_{k=1}^{n_c} \|\upsilon_k - \upsilon_i\|^2} \tag{2.29} \]
Separation Index (S) - The S, given by equation (2.30), uses a minimum-distance separation for
partition validity, contrary to SC. A better partition is given by a lower value of S.

\[ S(n_c) = \frac{\sum_{i=1}^{n_c} \sum_{j=1}^{N_s} (\mu_{ij})^m \|x_j - \upsilon_i\|^2}{N_s \min_{i,k} \|\upsilon_k - \upsilon_i\|^2} \tag{2.30} \]
Xie-Beni Index (XB) - Xie and Beni [57] proposed a validity index that focuses on two properties:
compactness and separation.

\[ XB(n_c) = \frac{\sum_{i=1}^{n_c} \sum_{j=1}^{N_s} (\mu_{ij})^m \|x_j - \upsilon_i\|^2}{N_s \min_{i,j} \|x_j - \upsilon_i\|^2} \tag{2.31} \]
In equation (2.31), the numerator indicates the compactness of the fuzzy partition, while the
denominator indicates the strength of the separation between clusters. A good partition produces a
small value for the compactness, and well-separated υi produce a high value for the separation. Hence,
the best number of clusters is the one that yields the lowest value of the index.
Dunn's Index (DI) - The DI is defined by equation (2.32),

\[ DI(n_c) = \min_{i \in n_c} \left\{ \min_{j \in n_c,\, i \neq j} \left\{ \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{k \in n_c} \{ \max_{x, y \in C_k} d(x, y) \}} \right\} \right\} \tag{2.32} \]

where Ci is a cluster of vectors, and x and y are any two n-dimensional feature vectors assigned to
the same cluster Ci. A more intuitive formulation is given by equation (2.33):

\[ DI(n_c) = \frac{d_{min}}{d_{max}} \tag{2.33} \]
Clearly, compact clusters that are well separated in the feature space manifest themselves in small
values of dmax and large values of dmin, leading to an increased value of DI. This index is easy
to implement and has a low computational complexity, but it is vulnerable to outliers as only two
distances are used. A larger value of DI indicates a better clustering.
Alternative Dunn Index (ADI) - The original Dunn's index was modified so that its calculation becomes simpler: the dissimilarity between two clusters (\min_{x \in C_i, y \in C_j} d(x, y)) is bounded from below using the triangle inequality, as in equation (2.34),
d(x, y) ≥ |d(y, υj)− d(x, υj)| (2.34)
where υj is the cluster centre of the jth cluster. After the alteration, the ADI is given by the equation
(2.35).
ADI(c) = \min_{i \in n_c} \left\{ \min_{j \in n_c, j \neq i} \left\{ \frac{\min_{x \in C_i, y \in C_j} |d(y, \upsilon_j) - d(x, \upsilon_j)|}{\max_{k \in n_c} \{\max_{x, y \in C_k} d(x, y)\}} \right\} \right\}    (2.35)
The optimal partition (or an optimal number of clusters) is obtained by maximizing PC (or minimizing CE) with respect to the number of clusters, because this yields compact clusters with higher values of \mu_{ij}. The major drawback of these indices is that they use only the fuzzy membership degrees \mu_{ij} of each cluster, without considering the data structure of the clusters.
Note that SC, S and XB differ only in how they measure the separation between clusters. In the case of overlapping clusters, the values of DI and ADI are not really reliable, because these indices were tailored for hard clustering analysis and take into account neither the fuzziness parameter nor the membership degrees.
Chapter 3
Data collection and preprocessing
Recent improvements in modern ICUs have eased the capture and storage of health-care-related data as high temporal resolution data, including lab results, electronic documentation, and bedside monitor trends and waveforms. The growing interest in and capacity to acquire large datasets adds statistical robustness to the data, making the results of predictive models more reliable, since they can be trained and tested with more data.
Modern electronic health records (EHR’s) are designed to capture and render clinical data during
the health care process [41]. Using them, health care providers can enter and access clinical data
when it is needed. This collected data can then be integrated to form a hospital information system.
EHR’s can incorporate decision support technologies to assist clinicians in providing better care through
data mining technologies to automatically extract useful models and assist in constructing the logic for
decision support systems.
However, because the main function of EHR’s is to store and report clinical data collected for the
purpose of health care delivery, the characteristics of the available raw medical data are widely dis-
tributed, heterogeneous in nature, and voluminous and may not be optimal for data mining and other
data analysis operations [11]. One challenge in applying data mining to clinical data is to convert the data into an appropriate form for this activity. Data mining algorithms can then be applied to the preprocessed data; whether the mining is successful often hinges on the adequacy of the data preparation. Thus, despite the wealth of data available within healthcare systems, there is a lack of effective analysis tools to discover hidden relationships and trends in the data. Data mining techniques may help in answering several important and critical questions related to health care.
Knowledge Discovery in Databases is a well-defined process consisting of an iterative sequence of data cleaning, data integration, data selection, data mining, pattern recognition and knowledge presentation, which aims to discover hidden relationships and trends in data and has attracted a great deal of interest in the information industry [26]. A formal definition of Knowledge Discovery in Databases is given as follows: "We analyse Knowledge Discovery and define it as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data" [25].
Data mining is an essential step of knowledge discovery that may accomplish class description, association, classification, clustering, prediction and time series analysis. Data mining, in contrast to traditional data analysis, is discovery driven, which results in the disclosure of hidden but useful knowledge from massive databases. Medical data mining technology has great potential and provides a user-oriented approach to novel and hidden patterns in data sets of the medical domain. These patterns can be utilized for clinical diagnosis and used by healthcare administrators and medical practitioners to improve the quality of service, reducing the number of adverse drug effects and suggesting less expensive therapeutically equivalent alternatives. Anticipating a patient's future behaviour based on the given history is one of the important applications of data mining techniques in health care management.
Data preprocessing allows the transformation of the original data into a suitable shape to be used by
a particular mining algorithm. So, before applying the data mining algorithm, a number of general data
preprocessing tasks have to be addressed, namely:
Data cleaning - One of the major preprocessing tasks, to remove irrelevant items and log entries that
are not needed for the mining process such as graphics and scripts.
Data transformation and enrichment - Consists in calculating new attributes from the existing ones; converting numerical attributes into nominal attributes; providing meaning to references contained in the log; etc.
Data integration - Integration and synchronization of data from heterogeneous sources.
Data reduction - Reducing data dimensionality which, per se, removes unnecessary and noisy data
while simplifying the problem.
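The first three tasks above can be sketched with a toy pandas example (the table layout, column names and event ID 211, used here as a stand-in for heart rate, are illustrative, not the actual MIMIC II schema):

```python
import pandas as pd

# Hypothetical events table: one measurement per row
events = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "event_id":   [211, 211, 211, 818],
    "chart_time": pd.to_datetime(
        ["2001-01-01 00:00", "2001-01-01 01:00",
         "2001-01-01 00:30", "2001-01-01 02:00"]),
    "value":      [80.0, 82.0, 300.0, 1.5],
})
static = pd.DataFrame({"subject_id": [1, 2], "age": [54, 67]})

# Data cleaning: drop physiologically impossible heart-rate readings
clean = events[~((events["event_id"] == 211) & (events["value"] > 250))]

# Data integration: attach the static data to every event
merged = clean.merge(static, on="subject_id", how="left")

# Data transformation: derive hours since each patient's first event
t0 = merged.groupby("subject_id")["chart_time"].transform("min")
merged["hours_in"] = (merged["chart_time"] - t0).dt.total_seconds() / 3600
```

Data reduction (dropping redundant or noisy variables) would follow the same pattern and is omitted here.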
This chapter introduces the MIMIC II database and the procedures taken to build the final datasets, which allow predicting the need of vasopressors administration, the problem this project deals with. The preprocessing step, explained in detail, results in two different groups of data: punctual data and time-series data. Each group comprises four datasets regarding the disease headings: pancreatitis, pneumonia, pancreatitis and pneumonia, and all patients regardless of the disease they were diagnosed with (addressed throughout this thesis as PAN, PNM, BOTH and ALL, respectively). The expected output is binary, 1 pointing to the need of vasopressors in a time window of 2 hours, and 0 otherwise. In order to clarify the difference between the two mentioned groups, a brief description is given next:
Punctual Data - Data that comprises information about the state of a patient at a given time neglecting
its history/evolution during the ICU stay. The output is binary, 1 or 0, for each state.
Time-series Data - Data organized as a sequence of events ordered by its occurrence in time, giving
information about the evolution of each variable during the patient’s stay. The output is binary, 1 or
0, after the temporal evolution.
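The two dataset shapes can be illustrated with pandas (the long-format layout, column names and values are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical long-format data: one row per (patient, time, variable)
rows = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "hours_in":   [0.0, 0.0, 1.0, 1.0],
    "variable":   ["heart_rate", "spo2", "heart_rate", "spo2"],
    "value":      [88.0, 97.0, 95.0, 93.0],
})

# Punctual data: each (patient, time) snapshot is an independent sample
punctual = rows.pivot_table(index=["subject_id", "hours_in"],
                            columns="variable", values="value")

# Time-series data: one ordered sequence of snapshots per patient
series = {sid: g.sort_index(level="hours_in").to_numpy()
          for sid, g in punctual.groupby(level="subject_id")}
```

Each row of `punctual` would receive its own binary label, whereas each sequence in `series` receives a single label after the temporal evolution.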
Note that the datasets are built using different strategies only to appropriately fit the different data mining algorithms. The classification problem is the same in all cases, i.e., the final result - patient was on vasopressors or patient was not on vasopressors - is the same regardless of the scenario. The major difference is that the punctual case has more data, since it has fewer restrictions. If the time-series method shows up as a strong competitor, a later study should use the exact same information from the database for both approaches, allowing a direct comparison that cannot be achieved in the present study. Nevertheless, the punctual data will still have the advantage of needing only one measurement of each feature to compute an outcome. It might therefore be interesting to integrate both for prediction purposes during the whole ICU stay, assuming that in the long run the time-series approach performs better.
The punctual case was already studied in [24], but since there are some minor changes in the preprocessing steps it is still worth deepening its study. The main advantage of the present study in comparison to the previous one is that it can be used in real-time analysis to support decisions.
This chapter presents the steps taken to construct the final datasets that are the basis for this thesis. It starts with an overview of the raw data, which is essential to understand the steps needed in the data treatment phase. The detailed description provided in this section gives the means for future replication and facilitates the implementation of possible modifications to the proposed approach, given that all the studies on vasopressors intake mentioned until now lack a detailed description of the procedure to obtain the final data sets. Then follow the general preprocessing steps that are common to both the punctual and time-varying data: the variables included in the study, removal of outliers, removal of deceased patients, and an insight into alternative ways of normalization. Finally, preprocessing is performed separately for each type of data set, since each is treated differently to attain different data structures, and the particularities that lead to its final form are explained.
3.1 Structure of MIMIC II Database
The data were collected over a seven year period, beginning in 2001, from Boston’s Beth Israel Dea-
coness Medical Center (BIDMC). Any patient who was admitted to the ICU on more than one occasion
may be represented by multiple patient visits. The adult ICUs (for patients aged 15 years and over)
include medical (MICU), surgical (SICU), coronary (CCU), and cardiac surgery (CSRU) care units. Data
were also collected from the neonatal ICU (NICU).
Figure 3.1 illustrates the data acquisition process, which did not interfere with the clinical care of
patients, since databases were dumped off-line and bedside waveform data and derived trends were
collected by an archiving agent over TCP/IP. Source data for the MIMIC II database consists of a) bed-
side monitor waveforms and associated numeric trends derived from the raw signals, b) clinical data
derived from Philips’ CareVue system, c) data from hospital electronic archives, and d) mortality data
from the Social Security Death Index (SSDI). These data are assembled in a protected and encrypted
database (both flat files for the waveforms and trends, and in the form of a relational database for all
other data). Once the data have been assembled in a central repository and time aligned, the waveforms
and trends for each individual are linked to the corresponding individuals’ data in the relational database.
The data are then de-identified to produce a final set of data for public consumption and, due to its sensitive information, access to the MIMIC II Clinical Database is restricted to registered users. It includes
calculated standardized severity scores on the database and also user feedback and corrections.
The resulting records contain realistic patient measurements with all the associated challenges (such
as noise or missing data gaps) that advanced monitoring and clinical decision support systems (CDSS)
algorithms would receive as input data.
Figure 3.1: Schematic of data collection and database construction [1].
The MIMIC II database is composed of two distinctive groups of data. The first group, the clinical
database, consists of data integrated from different information systems in the hospital and contains
diverse information such as: patient demographics, medications, results of lab tests and more. The
second group, contains high resolution waveforms recorded from the bedside monitors in the intensive
care units.
Due to the richness of data provided by MIMIC II there are almost limitless applications and investigations that can be performed. MIMIC II is a database that is gradually growing and should be expected to continue expanding, not only in size, by adding patients, but also in detail, by making new information/attributes available. Version 2.6, released on the 24th of February 2012, is used here and, despite the limitless combinations of data that can be used, as shown previously in 3.1, only a small part of its information was used in this study. Next follows a brief description of the structure of the used raw data and the major steps performed to organize this data.
The database is downloadable by everyone after completing a general ethics test. Once all the data is extracted, one ends up with as many folders as the number of patients the database contains, and each folder number (ranging between 1 and 32208, with some missing numbers) refers to the ID number of a patient. Inside these folders is all the information relative to the given patient. It is not in the scope of this study to provide an extensive, comprehensive guide to the database; nevertheless, an attempt is made to guide the reader through the process of extracting information for this particular case, so only the part of the database that is necessary for, and integrates, this project is covered. Below, the content and the use of each file of information that can be found inside each patient folder are described. All files are named as [CHART NAME]-[ID PATIENT].txt and, for the sake of generality, [ID PATIENT] is kept as ID.
ICD9-ID.txt1 - This file maps International Classification of Diseases (ICD9) codes to each admitted
patient.
CHARTEVENTS-ID.txt - This file contains physiological measurements that were acquired during the
patient stay as well as the time at which the measurement occurred.
ICUSTAY DETAIL-ID.txt - All the so-called static data, such as gender, age, height, weight, date of admission and discharge, death date, SOFA and SAPSI - to mention a few - for each patient and particular ICU stay were retrieved from this file.
IOEVENTS-ID.txt - This file stands for input/output events; the data for the variable urine output foley, as well as its timings, was recovered from it.
MEDEVENTS-ID.txt - All the considered output events (vasopressors administration) and the times of
its administrations were collected from this file.
For more details - A more detailed description of these files/tables can be found in [1].
The aforementioned files have an "event" column with the ID of what is being measured, and the mapping between those IDs and their descriptions is given in the so-called D tables, D CHARTITEMS.txt, D IOITEMS.txt and D MEDITEMS.txt, where the correspondence is evident. Since the information is heterogeneous, meaning that it comes from different sources, the same event can be referenced with different IDs and descriptions; for instance, there are different IDs for lactic acid and lactate, which in medical terms are the same. In an attempt to remove this caveat from the data, all the variables with the same meaning were converted from multiple IDs to one unique ID. Returning to the previous example, lactate has the ID 818 and lactic acid the ID 1531, and the latter was switched to 818, ending up with only one ID for this variable. Some variables have more than one or two IDs linked to the same physical variable, and a list of all the aggregations is provided in Table C.1. This conversion results in repeated/redundant measurements, which was prevented using the timestamps of each event: every time the same ID (after switching to a common one) is coupled with the same timing more than once, regarding the same exact measurement, only one occurrence prevails.
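A sketch of this ID aggregation and deduplication with pandas follows (the alias map shows only the lactate example mentioned above; the full list of aggregations is in Table C.1, and the table layout is illustrative):

```python
import pandas as pd

# Every variant ID is mapped to one canonical ID,
# e.g. lactic acid (1531) becomes lactate (818)
ALIASES = {1531: 818}

chart = pd.DataFrame({
    "subject_id": [7, 7, 7],
    "event_id":   [818, 1531, 818],
    "chart_time": pd.to_datetime(
        ["2001-01-01 08:00", "2001-01-01 08:00", "2001-01-01 12:00"]),
    "value":      [2.1, 2.1, 1.8],
})

# Map variant IDs onto the canonical one
chart["event_id"] = chart["event_id"].replace(ALIASES)

# The mapping can create duplicates (same ID and timestamp for the same
# measurement), so only the first occurrence is kept
chart = chart.drop_duplicates(subset=["subject_id", "event_id", "chart_time"])
```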
At this stage, all the data with potential to be used is compiled in a single table with the following information: subject ID, event ID, time of occurrence of the event and value measured, plus static data for each subject ID. Figure 3.2 depicts the path from the raw data to the data that will be used for modelling.
Figure 3.2: From raw data to usable data. (The diagram shows the flow from the source files - ICD9-ID.txt, CHARTEVENTS-ID.txt, ICUSTAY_DETAIL-ID.txt, IOEVENTS-ID.txt and MEDEVENTS-ID.txt - through the selection of the patients under study by disease (PAN, PNM, BOTH or ALL), the physiological variables, static data, vasopressors administrations and input/output events, to the useful information and, after preprocessing suited to the intended algorithm, the usable information.)
The useful information is called so because it still contains all the data, in its original structure, related to the variables and patients under study that can in some way be used; for instance, it still contains outliers, which can be used to apply statistical algorithms such as the inter-quartile method. The usable information is the data preprocessed and structured in such a way that the intended data mining algorithms can be applied directly.
3.2 Preprocessing Data - General Steps
In this section the preprocessing steps that are common to both punctual and time-varying data are described. These steps aim to filter the data and obtain a dataset with no loss of information relative to the given initial assumptions, so all the data that is unnecessary or does not fit the problem is eliminated, keeping what is considered useful information for the algorithms in mind. After that, this data can be manoeuvred in several ways to analyse the same problem differently. Later, the preprocessing steps that lead to the final datasets, be they punctual or time-varying, will be addressed separately, since their inner structures differ.

An overview of the assumptions that were made follows; the explanations for these assumptions are given throughout this chapter.
General assumptions:
ICU stay number - Only the first ICU stay is taken into account to avoid cumulative problems;
Patient inclusion - At least one measurement for each of the input variables is necessary to
consider the patient (two measurements for the time-series case);
Input variables - Only the selected variables have potential for predictive power over the out-
come considered;
Output variables - The output variables in study comprise all the existing vasopressors (so there are no class 0 cases that had another type of vasopressor, which would bias the result);
Prediction window - Predicting the need of vasopressors with 2 hours in advance is sufficient
to alert the medical staff and initiate the treatment adequately;
Vasopressor administrations - To be considered as being under vasopressor administration, three aspects must hold:

• Two consecutive administrations are considered to be continuous iff the interval between the events is less than or equal to 1.09 hours;

• To be considered a positive intake of vasopressors, the minimum (continuous) length of administration must be greater than or equal to 2 hours. Sometimes vasopressors are administered for a short time period, which can mean that the physician was being conservative; this criterion avoids such cases. Moreover, all patients whose first vasopressors administration lasted for a short period of time (less than 2 hours) are discarded (assigned neither class 0 nor class 1). This reduces the bias of more conservative physicians or weak decisions;

• There must be a minimum length of 6 hours between the admission to the ICU and the vasopressors administration, in order to guarantee that there is an evolution towards the need of the medication and that the patient did not enter the ICU in a state that should have been under drug administration before admission (applies only to class 1 patients);
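The labelling rules above can be sketched as a small function. This is a hedged interpretation, not the actual extraction code: variable names are illustrative, and the handling of an episode that is long enough but starts less than 6 hours after admission (here simply skipped in favour of a later episode) is an assumption:

```python
import pandas as pd

GAP_H = 1.09       # administrations <= 1.09 h apart count as continuous
MIN_LEN_H = 2.0    # a positive episode must last at least 2 h
MIN_ADMIT_H = 6.0  # at least 6 h between ICU admission and episode start

def first_valid_episode(times, admit):
    """Return the start time of the first continuous vasopressor episode
    satisfying the three rules, or None (patient discarded or class 0).
    `times` is a chronologically sorted list of administration timestamps."""
    if not times:
        return None
    start = prev = times[0]
    for t in times[1:] + [None]:  # None closes the last episode
        if t is not None and (t - prev).total_seconds() / 3600 <= GAP_H:
            prev = t  # still the same continuous episode
            continue
        length = (prev - start).total_seconds() / 3600
        if length < MIN_LEN_H:
            return None  # first episode too short: discard the patient
        if (start - admit).total_seconds() / 3600 >= MIN_ADMIT_H:
            return start  # valid class 1 episode
        if t is not None:  # long enough but too early: try the next episode
            start = prev = t
    return None
```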
3.2.1 Chosen Input/Output Variables
As already mentioned, this study relies on many assumptions, intended to narrow the data down to something practicable and feasible given the available time and resources. This narrowing starts by defining which variables should be considered: both input features and output variables must be defined. Maintaining the same input variables given in [24], and taking as output variables the ones labelled as "Pressor Medications" by [51] (with the exception of Isuprel, which has a warning note saying that it can have a reversed effect), this results in 37 and 10 variables, respectively2. Table 3.1 shows the input variables and their corresponding sampling rates.
Table 3.1: Features and sampling rates (measurements/day) in each dataset.
 #  Time variant input       [Unit]            ALL          BOTH         PNM          PAN
 1  Heart Rate               [BPM]             29,5 ± 6,4   27,7 ± 4,5   27,7 ± 4,5   27,5 ± 3,8
 2  Temperature              [◦C]              11,5 ± 6,9    9,3 ± 4,8    9,2 ± 4,5    9,6 ± 5,4
 3  Arterial BP              [mmHg]            24,1 ± 9,3   22,2 ± 7,5   22,2 ± 7,5   21,7 ± 7,5
 4  Arterial BP Diastolic    [mmHg]            24,1 ± 9,3   22,2 ± 7,5   22,2 ± 7,5   21,7 ± 7,5
 5  Respiratory Rate         [BPM]             28,8 ± 6,7   26,8 ± 5,2   26,7 ± 5,3   26,9 ± 4,3
 6  SpO2                     [%]               29,2 ± 6,2   28,0 ± 4,4   28,0 ± 4,4   27,6 ± 3,8
 7  Hematocrit               [%]                2,5 ± 1,4    2,0 ± 0,9    1,9 ± 0,9    2,1 ± 1,0
 8  Potassium                [mEq/L]            3,1 ± 1,7    2,5 ± 1,2    2,4 ± 1,1    2,7 ± 1,3
 9  Glucose (70-105)         [−]                3,9 ± 2,8    2,9 ± 1,9    2,8 ± 1,9    3,1 ± 1,8
10  Creatinine (0-1.3)       [mg/dL]            1,7 ± 0,8    1,6 ± 0,6    1,6 ± 0,6    1,8 ± 0,8
11  BUN (6-20)               [mg/dL]            1,7 ± 0,8    1,6 ± 0,6    1,6 ± 0,6    1,8 ± 0,8
12  Platelets                [cells×10³/µL]     1,7 ± 0,9    1,5 ± 0,6    1,4 ± 0,6    1,6 ± 0,6
13  WBC (4-11,000)           [cells×10³/µL]     1,5 ± 0,8    1,4 ± 0,5    1,4 ± 0,5    1,5 ± 0,6
14  RBC                      [cells×10³/µL]     1,5 ± 0,7    1,4 ± 0,5    1,4 ± 0,5    1,5 ± 0,6
15  Sodium                   [mEq/L]            1,9 ± 1,0    1,8 ± 0,7    1,7 ± 0,7    1,9 ± 0,9
16  Chloride                 [mEq/L]            1,7 ± 0,8    1,6 ± 0,6    1,6 ± 0,6    1,8 ± 0,8
17  Arterial CO2 (Calc)      [−]                3,2 ± 2,1    2,9 ± 1,7    2,9 ± 1,7    2,9 ± 1,6
18  Magnesium                [mg/dL]            1,7 ± 0,8    1,6 ± 0,6    1,6 ± 0,6    1,8 ± 0,8
19  NBP                      [mmHg]             8,7 ± 6,7    8,0 ± 6,0    8,0 ± 6,0    8,3 ± 5,8
20  NBP Mean                 [mmHg]             8,6 ± 6,6    7,9 ± 5,9    7,9 ± 5,9    8,2 ± 5,7
21  PTT (22-35)              [−]                1,3 ± 1,0    1,1 ± 0,8    1,1 ± 0,8    1,2 ± 0,9
22  INR (2-4 ref. range)     [−]                1,3 ± 0,9    1,0 ± 0,7    1,0 ± 0,7    1,2 ± 0,8
23  Arterial PaCO2           [mmHg]             3,2 ± 2,1    2,9 ± 1,7    2,9 ± 1,7    2,9 ± 1,6
24  Arterial PaO2            [mmHg]             3,2 ± 2,1    2,9 ± 1,7    2,9 ± 1,7    2,9 ± 1,6
25  Arterial pH              [−]                3,4 ± 2,1    3,0 ± 1,7    3,0 ± 1,7    2,9 ± 1,7
26  Arterial Base Excess     [mEq/L]            3,2 ± 2,1    2,9 ± 1,7    2,9 ± 1,7    2,9 ± 1,6
27  Ionized Calcium          [−]                1,9 ± 1,6    1,3 ± 1,2    1,3 ± 1,2    1,5 ± 1,2
28  Phosphorous (2.7-4.5)    [mEq/L]            1,4 ± 0,8    1,4 ± 0,6    1,4 ± 0,6    1,6 ± 0,8
29  Lactic Acid (0.5-2.0)    [mg/dL]            1,1 ± 1,4    0,9 ± 1,1    0,9 ± 1,1    1,2 ± 1,0
30  Calcium (8.4-10.2)       [mg/dL]            1,4 ± 0,8    1,4 ± 0,7    1,3 ± 0,6    1,6 ± 0,8
31  CVP                      [−]               15,7 ± 8,2   13,7 ± 7,2   13,5 ± 7,1   14,3 ± 7,0
32  Urine Out Foley          [ml]              18,8 ± 5,6   17,8 ± 5,5   17,7 ± 5,5   18,3 ± 5,7

 #  Time invariant input                       ALL          BOTH         PNM          PAN
33  Gender                                     -            -            -            -
34  Age                                        -            -            -            -
35  Weight                                     -            -            -            -
36  SAPS                                       -            -            -            -
37  SOFA                                       -            -            -            -
The sampling rates were computed using the final dataset, which includes all adult patients with at least one measurement for each input variable. The time between the first and last measurement for each patient was used, as well as the number of measurements of each variable during that interval for each patient. The result was then obtained by averaging over all patients, as in equation (3.1), where i = [1, 2, ..., number of features] and j = [1, 2, ..., number of patients].
SR_i = \frac{\sum_{j} \frac{\text{number of measurements of variable}_{ij}}{\text{end time}_j - \text{start time}_j}}{\text{number of patients}}    (3.1)
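Equation (3.1) can be sketched in pandas (the long-format table and column names are assumptions for illustration, not the actual extraction code used in this work):

```python
import pandas as pd

def sampling_rates(events):
    """Average sampling rate (measurements/day) per variable, eq. (3.1):
    for each patient, the count of measurements of each variable is divided
    by the span (in days) between that patient's first and last measurement,
    and the per-patient rates are then averaged over all patients."""
    span = events.groupby("subject_id")["chart_time"].agg(["min", "max"])
    span_days = (span["max"] - span["min"]).dt.total_seconds() / 86400
    counts = events.groupby(["subject_id", "event_id"]).size()
    per_patient = counts / span_days.reindex(
        counts.index.get_level_values("subject_id")).to_numpy()
    return per_patient.groupby(level="event_id").mean()

# Toy example: patient 1 has 3 measurements over 1 day (3/day),
# patient 2 has 2 measurements over half a day (4/day) -> mean 3.5/day
events = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "event_id":   [211, 211, 211, 211, 211],
    "chart_time": pd.to_datetime(
        ["2001-01-01 00:00", "2001-01-01 12:00", "2001-01-02 00:00",
         "2001-01-01 00:00", "2001-01-01 12:00"]),
})
rates = sampling_rates(events)
```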
The vasopressor agents considered are listed in Table 3.2, and it is important to mention that some
of them do not show up in the datasets.
2For the punctual data, only 32 input variables are considered, being it the time-variant variables (treated in a static manner forthe punctual data analysis, whereas for the time-series analysis the same variables are studied taking into account their evolution);The remaining 5 time-invariant variables are only considered for the mixed case: time-series data coupled with static data, beingit part of the latter.
Table 3.2: List of vasopressors and participation in the datasets.
Vasopressor Name Presence in the data sets
Levophed            All
Levophed-k          All
Dopamine            All
Dopamine Drip       None
Epinephrine         All except for PNM
Epinephrine-k       All
Epinephrine Drip    None
Vasopressin         All
Neosynephrine       All
Neosynephrine-k     All
An important assumption is that, for the output to be considered positive (occurrence of administration of vasopressors), the administration has to last at least 2 hours continuously, and it is considered continuous, adding to the previous administration, if and only if the interval between two separate intakes is less than or equal to 1.09 hours. This differs from what was done in [24], where an interval of 1 hour was considered instead. This minor change is due to the observation that an interval of 1 hour excludes many administrations, because many administrations occur at intervals slightly over 1 hour (1 hour and 5.4 minutes); this phenomenon is shown in Figure 3.3, and it was thought that they should still be considered continuous administrations. The percentage of vasopressors administrations that occur between 1 hour and 1.09 hours after the previous administration corresponds to 27.9% (for the dataset that contains all patients, ALL, without discriminating the vasopressors).
3.2.2 Removing outliers
Removing outliers is supported by expert knowledge, contrary to the inter-quartile method used in [24], which is a statistical method that turns out to be much more conservative than the information acquired from expert knowledge. However, it was not possible to collect expert knowledge for all variables and, in order to avoid excessively narrowing the range of possible values for the variables that lack expert knowledge, their ranges were narrowed individually by visual inspection, meaning that data is removed only when there are measurements clearly out of range compared with the most common measurements. Plots for variables 1-4 containing the expert-knowledge limits, the limits imposed by visual inspection and the ones obtained by the inter-quartile method are provided in Figure 3.4 (details about these limit values are given in Table 3.3, and plots for all the variables can be found in Annex A). The data presented here relates to the case where all patients are considered, the ALL dataset. Outside the green dashed lines the data points are considered outliers according to expert knowledge or visual inspection (when the former information is lacking) and are replaced by NaN; the black dashed lines have the same meaning when using the inter-quartile method, shown just for the sake of visual comparison. This data is then removed, which in many cases subtracts patients from the data set if they do not have at least one measurement for each variable (two measurements for the time-varying analysis), as in the previous case. These plots were zoomed in, leaving out some absurdly high and low measurements, to improve the visualization3.
3There are several superimposed data points and there is no point in trying to extract some tendency out of these plots; the purpose is just to give an idea of the dispersion of the data. The inter-quartile limits can help to give an idea of the densities. The blue and red dots are data points extracted from patients that never had vasopressors and from those that were under its administration, respectively, not to be confused with data points of class 1 and class 0, since class 1 patients have both.
Figure 3.3: Interval between two consecutive vasopressors intakes (with a random added value of ±0.015, after their interval identification, for density observation), considering only patients with more than 6 hours of data before vasopressors administration. Panel (b) is a zoom of panel (a) into the region of interest.
Figure 3.4: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (greendashed line) and inter-quartile (black dashed line) method for the input variables 1 to 4.
The major point here is that, since most of the data is of class 0, the inter-quartile method tends to consider as outliers data that is perfectly acceptable, just because it deviates from the most common measurements; in doing so it may remove exactly the trends that help distinguish between classes. The inter-quartile method is conservative, excluding noticeably more data than the expert-knowledge approach, because it is expected to lead to a solid range of what is considered the standard value, creating a dense area of measurements that pulls both the upper and lower limits inwards.
Table 3.3 (please refer to Table 3.1 for the variables' identification) shows the limits applied, and the consequent percentage of data considered outliers, when using expert knowledge along with the ranges established by visual inspection; for comparison, it also shows the limits and percentage of outliers when the inter-quartile method is used.
Table 3.3: Delimiting data (Inter-quartile is applied on ALL dataset)
          Expert Knowledge          Visual Inspection          Inter-Quartile
 # var    Min    Max    % Rem.      Min    Max    % Rem.       Min    Max    % Rem.
  1       0      250    0,00        -      -      -            52,5   121,5  5,19
  2       25     42     0,05        -      -      -            35,5   38,9   5,21
  3       -      -      -           0,001  250    0,32         67,5   172,5  4,66
  4       -      -      -           0,001  200    22,21        33,5   84,5   24,31
  5       0      200    0,00        -      -      -            6,5    33,5   4,64
  6       60     100    0,12        -      -      -            92,0   104,0  3,12
  7       19     60     0,33        -      -      -            21,9   37,3   7,07
  8       2,2    8      0,09        -      -      -            3,0    5,1    6,56
  9       -      -      -           0,001  1000   0,01         55,0   193,0  8,69
 10       0,1    9      0,42        -      -      -            -0,7   2,7    14,40
 11       4      500    0,24        -      -      -            -19,5  73,5   10,65
 12       3      1000   0,20        -      -      -            -93,5  455,5  8,16
 13       0,4    50     0,57        -      -      -            0,6    23,7   8,01
 14       2      8      0,24        -      -      -            2,4    4,2    6,74
 15       120    160    0,19        -      -      -            128,5  149,5  5,40
 16       80     130    0,18        -      -      -            94,0   118,0  4,88
 17       -      -      -           0,001  60     0,02         14,5   35,5   6,24
 18       0      10     0,01        -      -      -            1,4    2,6    10,59
 19       30     300    0,29        -      -      -            70,0   166,0  3,95
 20       10     186,7  0,06        -      -      -            44,0   106,0  5,42
 21       -      -      -           0,001  150    0,00         1,1    69,2   11,55
 22       -      -      -           0,001  30     0,02         0,7    1,9    13,70
 23       -      -      -           0,001  100    0,00         25,0   55,0   8,08
 24       -      -      -           0,001  500    0,00         27,5   186,5  7,50
 25       6,8    7,8    0,02        -      -      -            7,3    7,5    7,10
 26       -30    20     0,06        -      -      -            -9,0   9,0    5,72
 27       -      -      -           0,001  20     0,16         1,0    1,3    8,29
 28       -      -      -           0,001  20     0,01         1,1    5,9    7,87
 29       0      10     3,71        -      -      -            -0,9   4,5    14,15
 30       4,8    12     0,32        -      -      -            6,7    9,7    6,54
 31       -      -      -           0,001  50     0,59         0,5    21,5   5,94
 32       0      2000   2,16        -      -      -            -92,5  252,5  11,31
 Total    -      -      0,34        -      -      7,96         -      -      8,14
Using the inter-quartile method leads to a total of 8.14% of outliers, whereas expert knowledge combined with visual inspection leads to a total of 3.00% (all these results are prior to the imputation step and still include the patients that will be removed after outlier removal due to lack of data).
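The two kinds of limits can be sketched as follows (a sketch only; per Table 3.3 the expert limits for variable 1, heart rate, are 0-250, and the inter-quartile limits are computed just for comparison, as in the table):

```python
import numpy as np
import pandas as pd

def apply_limits(values, lo, hi):
    """Replace measurements outside [lo, hi] by NaN, using
    expert-knowledge or visual-inspection limits."""
    v = values.astype(float).copy()
    v[(v < lo) | (v > hi)] = np.nan
    return v

def iqr_limits(values):
    """Inter-quartile limits (Q1 - 1.5 IQR, Q3 + 1.5 IQR),
    computed only for comparison, as in Table 3.3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

hr = pd.Series([72.0, 85.0, 90.0, 600.0])  # 600 bpm is clearly an outlier
cleaned = apply_limits(hr, 0, 250)         # expert limits for heart rate
lo, hi = iqr_limits(pd.Series([1.0, 2.0, 3.0, 4.0]))
```

The NaN replacement mirrors the procedure described above: patients are only dropped afterwards, if too few valid measurements remain.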
3.2.3 Removing deceased patients
At the end what is intended with this project is to mimic the physicians decision as accurate as pos-
sible and it is desirable to remove the bias and noise that can misguide the obtained models. Another
source of data that have the potential to mislead the models (in pair with the cases where the vasopres-
sors intake are of short duration) are the data related with patients that died and were not prescribed
vasopressors since the physiological variables behave in a similar way to the patients that had vaso-
pressors: they start to decay as shown in Figure 3.5 for the variables platelets, WBC, chloride and NBP
mean (the plots for all the variables can be found in Appendix B).
The decision not to prescribe vasopressors can occur deliberately or not, and some hypotheses might be: failure to give the right diagnosis, no purpose in trying, vasopressors would not solve the problem, it was too late to prescribe vasopressors, the cause of death had nothing to do with shock, lack of resources, etc. Given that all these assumptions, if true, will bias the data, this can be prevented simply by excluding these cases. There is still interest in considering deceased class 1 patients, since the outcome is not part of this study. The need to remove this class of patients came with the observation of the plots that follow (for the ALL dataset), where each point is the mean of all the measurements taken in a 30-minute window for a particular group of patients. Despite the differences in the mean value, which can vary depending on the patients being considered to compute the mean (the number of patients under account increases along the x-axis, because that is when short stays start to appear), a clear tendency is shown that in some cases is similar between patients that took vasopressors and patients of class 0 that eventually died during their first ICU stay. The final value of the x-axis has different meanings:
For class 1 patients - it is the moment those patients started the vasopressor intake.
For class 0 patients - it is the moment they were discharged from the ICU.
For class 0 deceased patients - it is the moment the patient died.
The vertical black line is placed 2 hours before the end of the xx axis corresponding to the prediction
window prior to the starting point of the vasopressors administration.
Figure 3.5: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 12, 13, 16 and 20. Each point corresponds to the mean value of the measurements taken in a 2-hour window.
Three different ways were considered to evaluate whether the patients that were not under vasopressor administration (class 0 patients) ended up dying or not:

1. In the ICU stay details data there is a flag variable with the description 'Subject death during hospital admission.' Whenever this flag takes a positive value (Y) it is considered that the patient died during that ICU stay (the first stay, since it is the only one taken into account), and that the patient lived if it is negative (N);

2. Another flag variable, with the description 'Died in ICU (assumed from icustay outtime > hospital discharge AND died in hospital)', was evaluated in the same way as the previous one;

3. The last way uses two variables:

• 'Subject's date of death'

• 'Hospital discharge date'

Using this information, it is assumed that the patient died if the time between the first and the second variable is less than 24 hours, and that the patient lived otherwise.

When the information is inconsistent, priority is given in the order the conditions were presented. These assumptions could introduce erroneous information, since patients with none of this information were considered survivors. This seems a valid assumption, as it is more likely that a survival goes unreported than a death.

The reason for using three ways has to do with the possibility of missing data: the MIMIC database documentation [1] states that these variables are nullable, meaning that they can be empty.
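A minimal sketch of this priority scheme follows. The argument names are hypothetical stand-ins for the two MIMIC II flags and the two dates; since all four fields are nullable, missing values are modelled as None:

```python
from datetime import datetime

def patient_died(hosp_death_flag, icu_death_flag, dod, discharge_date):
    """Resolve possibly missing/inconsistent death information.

    Priority follows the order of the three conditions in the text.
    hosp_death_flag / icu_death_flag: 'Y', 'N' or None (nullable fields).
    dod / discharge_date: datetime or None.
    Returns True if the patient is assumed to have died.
    """
    if hosp_death_flag in ('Y', 'N'):                 # condition 1
        return hosp_death_flag == 'Y'
    if icu_death_flag in ('Y', 'N'):                  # condition 2
        return icu_death_flag == 'Y'
    if dod is not None and discharge_date is not None:  # condition 3
        return abs((dod - discharge_date).total_seconds()) < 24 * 3600
    return False  # no information at all: assume survivor

# Example: no flags recorded, death and discharge 13 hours apart
died = patient_died(None, None, datetime(2020, 1, 2, 1),
                    datetime(2020, 1, 1, 12))  # assumed dead
```

Returning False when all fields are empty encodes the stated assumption that an unreported survival is more likely than an unreported death.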
3.2.4 Normalization
Normalization of the data is an important step in the Knowledge Discovery in Databases (KDD) process, and depending on the type of dataset being dealt with there are various options to tackle it. In this work, min-max normalization was used, and three normalization approaches were considered, each with its advantages and disadvantages. To visualize the results of these hypotheses, an artificial time-varying dataset with three different behaviours was created, as shown in Figure 3.6.

For each of the three methods, for which a detailed description is provided below, min-max normalization was applied; the only difference lies in what is considered to be the minimum and the maximum. The min-max normalization is given by equation (3.2), resulting in data within the interval [0, 1].

d' = (d − min(p)) / (max(p) − min(p))    (3.2)
Figure 3.6: Artificial data and results for three different methods on applying min-max normalization.
Method 1 - Normalization using all feature values
The procedure of normalization using all measurements of each feature consists of searching for the maximum and minimum value of the feature under consideration over the whole dataset and applying the min-max normalization formula. This is the method that would be used blindly when there is no knowledge about the dataset, which is not the situation of the current problem when clustering time series. It is known that different patients can have dissimilar healthy baseline physiological values; for instance, those with pre-existing hypertension may require higher blood pressures to maintain adequate perfusion [9]. Neglecting this fact will bias the clustering with information it should not be concerned with, increasing the probability of patients being aggregated solely on this basis.
Method 2 - Normalization for each feature considering each patient
This approach consists in normalizing each patient individually, so that after normalization each patient's time series contains a 0 and a 1, corresponding to the minimum and the maximum measurement taken from that patient, respectively. The problem that arises is that there is no longer any perception of what is a high or a low variation: a patient can be stable, with only slight variations during the ICU stay, yet after normalization be comparable to a patient with highly unstable measurements. This phenomenon is observed when comparing the blue lines with the black one, which has a higher oscillation.
Method 3 - Normalization using all feature values after removing mean for each patient
This normalization is similar to the first one, with the difference that the mean value of the temporal series is removed for each patient. This approach has the smallest drawback: the problem of patients with heterogeneous healthy physiological measurements is removed and the variations become distinguishable. The only issue not taken into account is that a high variation for a given patient might be considered low for another, but compared with the other drawbacks this complication is unusual and neglecting it should not affect the analysis significantly.
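The three methods can be sketched as follows; the two toy series below (a stable and an unstable patient) are illustrative, not taken from the dataset:

```python
import numpy as np

def normalize(patients, method):
    """Min-max normalization (Eq. 3.2) of per-patient time series.

    patients: list of 1-D numpy arrays, one time series per patient.
    method 1: global min/max over all patients.
    method 2: per-patient min/max.
    method 3: subtract each patient's mean, then global min/max.
    """
    if method == 2:
        return [(p - p.min()) / (p.max() - p.min()) for p in patients]
    if method == 3:
        patients = [p - p.mean() for p in patients]
    lo = min(p.min() for p in patients)
    hi = max(p.max() for p in patients)
    return [(p - lo) / (hi - lo) for p in patients]

stable = np.array([120.0, 121.0, 119.0, 120.0])
unstable = np.array([80.0, 140.0, 90.0, 130.0])
m2 = normalize([stable, unstable], method=2)  # both now span [0, 1]
m3 = normalize([stable, unstable], method=3)  # variations stay comparable
```

Under method 2 the stable patient's tiny oscillation is stretched to the full [0, 1] range, making it indistinguishable from the unstable one; under method 3 the stable series remains flat around 0.5 while the unstable one spans [0, 1], which is the behaviour argued for above.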
3.3 Preprocessing Data - Clinical Actual State Analysis
This section describes the steps taken to bring the data to its final form, suitable for the analysis concerning the prediction of the need for vasopressor agents given the state of the patient at a fixed time (punctual data). The big picture of this procedure can be consulted in Figure 3.9.

At this point, the filtered data, i.e., after applying the constraints, consists of measurements of the inputs and outputs and their timings. This poses a major problem: the data is not aligned, as there are differences in the sampling rate of each variable - not only across variables, but also across patients and their conditions - while it is desired to evaluate a certain state of the patient by having all the variables measured at a certain moment. The approach that was taken is described in [12], which proposes a method to align misaligned and unevenly sampled data.
Misaligned and unevenly sampled data refers to data that neither occur at the same time points nor
are equally spaced in time. The suggested alignment creates the need to calculate new values for the
new locations generated by the template variable, i.e. , by the variable with the highest sampling rate,
which is normally done through some form of interpolation.
In the present case, the Heart Rate samples, Heart Rate being the most frequently measured variable, were used as the template to align all the remaining variables. The alignment is performed by moving each measurement to the closest sample of the template variable, using time as the distance measure, as depicted in Figure 3.7.
[Figure legend: template variable; variable to align; variable after alignment; missing data after alignment.]

Figure 3.7: Alignment of misaligned and unevenly sampled data (inspired by [12]).
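A minimal sketch of this nearest-sample alignment follows; the variable names are hypothetical, and the choice of keeping the later measurement when two of them snap to the same template slot is an assumption not specified in the text:

```python
import numpy as np

def align_to_template(template_times, var_times, var_values):
    """Snap each measurement of a variable to the closest time point of
    the template variable (here, Heart Rate). Template slots that
    receive no measurement stay NaN; if two measurements map to the
    same slot, the later one wins (an assumption).
    """
    template_times = np.asarray(template_times, dtype=float)
    aligned = np.full(len(template_times), np.nan)
    for t, v in zip(var_times, var_values):
        idx = int(np.argmin(np.abs(template_times - t)))
        aligned[idx] = v
    return aligned

hr_times = [0.0, 0.5, 1.0, 1.5, 2.0]          # template (Heart Rate)
temp_times, temp_vals = [0.6, 1.9], [36.5, 37.0]
aligned = align_to_template(hr_times, temp_times, temp_vals)
# the measurement at 0.6 snaps to slot 0.5, the one at 1.9 to slot 2.0
```

The NaN slots left by this step are exactly the missing data that the imputation step of Section 3.3.1 later fills with a zero-order hold.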
Figure 3.8 shows in more detail what happens in the most extreme scenario present in the data: misaligned and unevenly sampled data combined with differences in the starting and ending times of collection. The latter is an issue because at least one measurement of each variable is needed; the solution is to remove all the data prior to the moment at which a measurement of every variable is available.
[Figure content: data collected before each variable has at least one measurement is not taken into account; from that moment on, preceding values are carried forward using a zero-order hold, and each measurement is aligned with the next sample of the template of the variable with the highest sampling rate.]

Figure 3.8: Alignment of the data for the punctual data analysis, using the variable with the highest sampling rate as template, covering all the case scenarios.
[Flowchart text not reproducible after extraction. Recoverable notes: *[0] Patient IDs run up to 32809, but some IDs are missing in between, leading to a total of 32535 patients. *[1] 677 (pancreatitis) + 3771 (pneumonia) − 4296 (pancreatitis or pneumonia) = 152 patients with both diseases. *[2] 1st ICU stay: interval between two consecutive measurements of the variable heart rate < 24 h. *[4] Conditions to be considered as class 1: (1) 6 hours of collected data before vasopressor administration; (2) length of vasopressor administration > 2 h; (3) interval between two vasopressor administration events < 1 h. *[5] Patients that had vasopressors but did not satisfy the conditions in *[4].]
Figure 3.9: Preprocessing steps and resulting datasets of the MIMIC II for the punctual data case.
3.3.1 Data Imputation
After the alignment, as stated before, one has to deal with missing data (Table 3.4). Similarly to what physicians do, it is assumed that the last measurement is the most accurate one for the time being, and a zero-order hold procedure was applied (Table 3.5).
Table 3.4: After alignment with Heart Rate.

Heart Rate   Temperature   General Var
123          NaN           NaN
 67          NaN            90
 69          NaN           NaN
 72          NaN            87
 74          NaN           NaN
 77           35           NaN
 78          NaN           NaN
 75           36            85
 71          NaN           NaN
Table 3.5: After imputation of data using ZOH.

Heart Rate   Temperature   General Var
123          NaN           NaN
 67          NaN            90
 69          NaN            90
 72          NaN            87
 74          NaN            87
 77           35            87
 78           35            87
 75           36            85
 71           36            85
Finally, all the data points that are still incomplete (contain NaN) after those steps are eliminated (Table 3.6). This data corresponds to the first stages of the ICU stay, when not all variables had been measured at least once.
Table 3.6: Removal of data due to lack of data.

Heart Rate   Temperature   General Var
rem.         rem.          rem.
rem.         rem.          rem.
rem.         rem.          rem.
rem.         rem.          rem.
rem.         rem.          rem.
 77           35            87
 78           35            87
 75           36            85
 71           36            85
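The two steps above (ZOH imputation followed by dropping the leading incomplete rows) can be sketched as follows; the toy table mirrors the structure of Tables 3.4-3.6 with hypothetical values:

```python
import numpy as np

def zoh_impute_and_trim(table):
    """Forward-fill each column with the last observed value
    (zero-order hold), then drop the leading rows that still contain
    NaN because some variable had not yet been measured.

    table: 2-D array-like, rows = aligned time points, cols = variables.
    """
    table = np.asarray(table, dtype=float).copy()
    for col in range(table.shape[1]):
        last = np.nan
        for row in range(table.shape[0]):
            if np.isnan(table[row, col]):
                table[row, col] = last  # carry the last measurement forward
            else:
                last = table[row, col]
    complete = ~np.isnan(table).any(axis=1)  # rows with every variable known
    return table[complete]

nan = np.nan
data = [[123, nan, nan],
        [ 67, nan,  90],
        [ 69, nan, nan],
        [ 77,  35, nan],
        [ 75,  36,  85]]
clean = zoh_impute_and_trim(data)  # only the last two rows are complete
```

The rows dropped by `complete` are exactly the early-stay rows removed in Table 3.6.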
The result of this approach is a set of rows of data, where each row represents the patient at a given time through its time-varying input variables and output variable. The output variable goes through the same process; with no added complexity, it is like another input variable that takes the value 1 if the drug is being administered and 0 if not. Typically, all patients contribute class 0 data (patients that never had vasopressors, plus the early stages of patients that will eventually need the medication), and only the patients that take vasopressors contribute class 1 data.

As a result of the aforementioned data preprocessing step, Table 3.7 shows the percentage of imputed data for each dataset.
Table 3.7: Percentages of imputation by dataset.
Data set % of imputation
PAN 74,84
PNM 75,19
BOTH 75,11
ALL 74,73
These high percentages are not unexpected, given the sampling rate differences with respect to the most frequently sampled variable, Heart Rate (which is used as the template). Eliminating variables with a low collection rate would greatly decrease the percentage of imputations. As an example, Table 3.8 shows the percentage of imputations for each variable in the dataset ALL.
Table 3.8: Percentages of imputations by input time-varying variable
variable #   1     2     3     4     5     6     7     8
% imp        0,00  0,66  0,22  0,22  0,03  0,04  0,93  0,91

variable #   9     10    11    12    13    14    15    16
% imp        0,90  0,94  0,94  0,95  0,95  0,95  0,94  0,94

variable #   17    18    19    20    21    22    23    24
% imp        0,90  0,94  0,70  0,70  0,96  0,96  0,90  0,90

variable #   25    26    27    28    29    30    31    32
% imp        0,90  0,91  0,95  0,95  0,97  0,95  0,45  0,35
3.3.2 Enabling the dataset for prediction purposes
Up to this point, the dataset is only able to tell whether at a given state the output is 1 or 0. Nevertheless, this information is not yet usable, since the aim is to predict a patient's transition to vasopressor dependence within a time window of 2 hours, for three main reasons: (i) a central line would only be inserted if the patient will in fact need it in the future (reducing the number of times this procedure is performed); (ii) the central line insertion protocol could be initiated with enough time, prior to the moment vasopressors are needed; (iii) it has been demonstrated that a delay in delivering the drug can drastically change the outcome.

Bearing this in mind, the process of transforming this direct input-output relationship into an input capable of predicting the output consists in shifting the output 2 hours, so that every input row reflects whether, in that given condition, the patient will be vasopressor dependent after the prediction window. An example of the procedure and its final result follows.

Consider that the second row in Table 3.9 contains the times at which each measurement occurred, comprising 12 instances (in this case, the times at which the heart rate was measured). One way to guarantee the 2-hour shift is to use Algorithm 2.
Since it is not possible to have the desired interval for every instance, due to the unevenly sampled rates, the prediction instead lies in the interval ]1, 3[ hours, and this algorithm excludes all the data that does
Algorithm 2 Shifting the output by 2 hours for prediction purposes.

 1: for i ← 1 to n − 1 do
 2:     Δ ← ∞
 3:     for j ← i + 1 to n do
 4:         δij ← |2 − (xj − xi)|
 5:     end for
 6:     if min(δi:) < 1 then
 7:         xnew_i ← find(δij == min(δi:))
 8:     else
 9:         xnew_i ← NaN
10:     end if
11: end for
12: return xnew
not have the capability to predict within the specified interval. Applying Algorithm 2 results in the last line of Table 3.9.
Table 3.9: Example of the output shifting procedure.

instance #         1   2   3     4     5   6     7     8    9    10    11    12
time (hours)       0   1   2     2,5   3   5     10    11   13   13,5  15    18
minimum distance   2   2   3     2,5   2   5     3     2    2    1,5   3     -
new time (hours)   2   3   NaN   5     5   NaN   NaN   13   15   15    NaN   NaN
As a result, instances [3, 6, 7, 11] are deleted due to the impossibility of shifting the output and having information about the drug intake after ]1, 3[ hours, as shown by the third row. Instance [12] is deleted as well, but because it is the last measurement, so there is no information about what happens after it. Tables 3.10 and 3.11 show an example of the data before and after applying the algorithm, respectively. The data could contain any number of input variables; the point here is that the input variables keep the same positioning while the output is shifted upwards by two hours (in reality, between ]1, 3[) whenever possible, and the lines that cannot be shifted are removed.
Table 3.10: Before shifting the output.

instance   hours   Input   Output
 1          0       78      0
 2          1       75      0
 3          2       71      0
 4          2,5     73      0
 5          3       78      0
 6          5       67      1
 7         10       68      1
 8         11       77      1
 9         13       79      1
10         13,5     74      1
11         15       72      1
12         18       71      1
Table 3.11: After shifting the output.

instance   hours   Input   Output
 3          2       78      0
 5          3       75      0
NaN        NaN     NaN     NaN
 6          5       73      1
 6          5       78      1
NaN        NaN     NaN     NaN
NaN        NaN     NaN     NaN
 9         13       77      1
11         15       79      1
11         15       74      1
NaN        NaN     NaN     NaN
NaN        NaN     NaN     NaN
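A Python sketch of this shifting step, using the absolute time gap to the 2-hour horizon with a tolerance of 1 hour so that the accepted prediction window lies in ]1, 3[ hours; the times and outputs reproduce the example of Table 3.9:

```python
import numpy as np

def shift_output(times, outputs, horizon=2.0, tol=1.0):
    """Shift the output `horizon` hours into the future.

    For each instance i, find the later instance j whose time gap to i
    is closest to `horizon`; accept it only if
    |horizon - (t_j - t_i)| < tol. Instances with no acceptable j
    (including the last one) are dropped.
    Returns (kept_indices, shifted_outputs).
    """
    kept, shifted = [], []
    n = len(times)
    for i in range(n - 1):
        gaps = np.array([abs(horizon - (times[j] - times[i]))
                         for j in range(i + 1, n)])
        j = int(np.argmin(gaps))
        if gaps[j] < tol:
            kept.append(i)
            shifted.append(outputs[i + 1 + j])  # future output label
    return kept, shifted

times = [0, 1, 2, 2.5, 3, 5, 10, 11, 13, 13.5, 15, 18]
outs  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
kept, y = shift_output(times, outs)
# instances 3, 6, 7, 11 and 12 (1-based) are dropped, matching Table 3.9
```

The kept indices correspond exactly to the rows that survive in Table 3.11.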
3.4 Preprocessing Data - Clinical State Evolution Analysis
This section describes the particularities of composing the datasets using the time feature, in order to capture the evolution of the patient during the ICU stay. The constraints differ from the previous case: here each patient is required to have at least two measurements of each variable (in the previous case only one was necessary) and, since demographic data is to be used, every patient must also have a record of it during the first ICU stay.

Following the same line of reasoning, where one must predict at least 2 hours in advance, all data collected from 2 hours before the initiation of the vasopressor intake until the end of the ICU stay was removed from the dataset, so that it is composed of a time series for each variable, as long as there is a minimum of two measurements per variable (a prerequisite for a time series).
A lower number of imputations is expected in this case, since all taken measurements are used before the time series starts being filled with estimated values using a ZOH approach. Figure 3.10 shows the major steps towards the final datasets. To clarify in more detail what is described in the bottom boxes of the flowchart, a brief description follows with respect to the PAN dataset, which can be extended to all datasets.
“No data before 1st admin: -80” - At the moment of the first intake of vasopressors there was no information on all input variables, removing 80 patients from the dataset.

“Deceased class 0 patient: -4” - There were 4 class 0 patients that died during the first ICU stay, and they were discarded from the dataset.

“Less than 2 entries for any variable: -33” - At least two measurements of each temporal variable are required; 33 cases did not meet this constraint and were discarded (note that these filters are applied in the order presented here, so this happens after removing patients that do not have at least one measurement).

“No data for all vars 2h before admin: -11” - This could be combined with the filter immediately above; it refers to patients that do not have two measurements of a temporal variable once the data collected during the prediction window - from two hours before the vasopressor intake to the moment of administration - is neglected (note that at this point the dataset is already able to predict 2 hours in advance, since the data in this time window is removed). There are 11 such cases.

“Missing static data: -4” - There are 4 patients that do not have all the static input data available, and they were discarded as well.
[Flowchart text not reproducible after extraction. Recoverable per-dataset filter counts (no data before 1st admin. / deceased class 0 patient / less than 2 entries for any variable / no data for all vars 2 h before admin. / missing demographic-static data, followed by the resulting class 1 / class 0 patient counts): −80 / −4 / −34 / −11 / −4 → 22 / 55 (PAN); −536 / −37 / −162 / −60 / −13 → 112 / 202; −595 / −41 / −187 / −68 / −14 → 124 / 236; −2127 / −130 / −649 / −183 / −48 → 269 / 676 (ALL).]
Figure 3.10: Preprocessing steps and resulting datasets of the MIMIC II for the time-series data case.
After filtering the data to include only what fulfils the requirements, the data needs to be organized so that it is structurally equal for all patients. Since there is a different number of measurements for each variable and for each patient, and due to computational constraints, it is desirable to limit the quantity of data being dealt with.
3.4.1 Data imputation
The algorithm that will be used to analyse the time-series dataset requires all time-varying data to have the same length. Since there are variables with only two measurements and different lengths are to be studied, the problem of imputation arises. Among the many solutions for imputation, the one used here is the one that most closely resembles the approach taken by physicians; it has the advantage of using the most recently recorded data and of treating each variable independently: there is no requirement to align the measurements.

The logic behind this method is that the sampling rate of each variable is related to how fast it changes, and aligning the time-series variables with each other would require a template, similar to what was done for the punctual state analysis. Using a template in this case would largely increase the number of imputations and would remove the evolution of some time series (they would maintain a constant value). Another drawback is that at least as many measurements as the considered length would be needed and, akin to the previous approach, all data collected before the moment of having at least one measurement of each variable would be unused. For example, considering a vector length of 10 as a template, at least 10 measurements of the template variable would be needed (not hard to achieve if the template variable has a high sampling rate); if only two measurements of another variable are available and the first one occurs after the first considered measurement of the template variable, there would be empty measurement slots and that patient would have to be discarded. Two cases would have to be considered:
Select the Heart Rate as template variable - This would result in variables that have only one real measurement, the rest being imputations of that same measurement, due to their inherently low sampling rate compared with that of the Heart Rate. One way to mitigate this is to increase the number of measurements considered but, as mentioned, the upper limit is ten in order to keep the problem computationally viable.

Select a pre-defined time between measurements - Due to the different sampling rates of each variable, selecting a pre-defined time between samples would be a difficult balancing problem (it could also be done by selecting another variable as template). In the punctual state case there was a reason to select the Heart Rate as template: it is the variable with the highest sampling rate. In this case the choice would be 'something in the middle', so as not to favour the variables whose sampling rates are closest to the selected one - which would otherwise be the variables with the fewest imputations and the most accurate measurements, without any reason for it. Such an approach would also discard a considerable amount of data that is more up-to-date than what would be used instead.
To avoid such problems, the simplest approach was taken: consider the last measurements of each variable and, whenever there is an insufficient number of measurements, fix the times of the first and last measurement of the considered variable, divide the time in between into x equal parts (x being the considered length) and impute the data using the ZOH approach, taking into account the times at which all measurements were taken. This process of imputation and of selecting the data that constitutes the time series is shown in Figure 3.11.
[Figure content: for each feature, if at least 10 records exist, the 10 last records are used; otherwise the interval between the first and last measurement is divided into 10 equal parts and filled by ZOH. The considered span is the ICU stay length for class 0 patients, or the ICU stay length until 2 hours before vasopressor administration for class 1 patients, which has to exceed 6 hours for class 1 patients.]

Figure 3.11: Procedure to adapt the real measurements to a vector of length 10 for all variables, including data imputation when needed.
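The two branches of the procedure (keep the 10 last records, or stretch a shorter series over 10 equally spaced points with a zero-order hold) can be sketched as follows; the toy series is illustrative:

```python
import numpy as np

def to_fixed_length(times, values, length=10):
    """Reduce a patient's time series to a fixed-length vector.

    If there are at least `length` measurements, keep the last
    `length` ones. Otherwise, split [first, last] measurement times
    into `length` equally spaced points and fill each with the most
    recent real measurement (zero-order hold).
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    if len(values) >= length:
        return values[-length:]
    grid = np.linspace(times[0], times[-1], length)
    # index of the last measurement taken at or before each grid time
    idx = np.searchsorted(times, grid, side='right') - 1
    return values[idx]

# A variable with only 4 measurements, stretched to length 10
t = [0.0, 3.0, 6.0, 9.0]
v = [1.0, 2.0, 3.0, 4.0]
vec = to_fixed_length(t, v, length=10)
# each real value is repeated until the next measurement arrives
```

Because each variable is resampled over its own [first, last] interval, no cross-variable alignment template is needed, which is the design argued for above.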
The aforementioned approach leads to the imputation percentages shown in Table 3.12, which are far better than the percentages of the punctual data approach.
Table 3.12: Percentages of imputation given the vector length.
                     % imputation
vector length   PAN     PNM     BOTH    ALL
 2              0       0       0       0
 4              1.8     1.78    1.83    2.55
 6              5.08    4.49    4.66    6.48
 8              9.19    7.30    7.68    10.62
10              12.93   10.22   10.74   14.62
For the purpose of clarification, Table 3.13 shows the percentage of imputation and the number of patients that would be removed in order to have all the data aligned with an equally spaced time template with intervals of 1, 2, 6 and 12 hours.
Table 3.13: Considering an interval between measurements of x hours (for the ALL dataset).
vector length intervals between measurements
1h 2h 6h 12h
2 37.94 (7 rem.) 35.92 (15 rem.) 31.02 (42 rem.) 21.79 (84 rem.)
4 61.05 (19 rem.) 55.15 (42 rem.) 41.84 (110 rem.) 32.19 (205 rem.)
6 67.13 (38 rem.) 60.45 (71 rem.) 48.84 (174 rem.) 34.86 (316 rem.)
8 70.80 (50 rem.) 62.61 (94 rem.) 49.60 (242 rem.) 36.23 (415 rem.)
10 73.16 (59 rem.) 64.51 (109 rem.) 51.76 (292 rem.) 37.06 (487 rem.)
The results in this table are easily interpretable. As the interval between two measurements increases, the percentage of imputation is reduced, because variables with a low sampling rate have an increased chance of appearing as real measurements and not as imputations. As the length of time needed for each patient goes up (by increasing the vector length, the interval between measurements, or both), the number of removed patients increases, because fewer and fewer patients have a stay long enough to be considered. At first it might look inconsistent to have a different number of patients removed: for instance, the (length, interval) combinations (2, 2) and (4, 1) might be expected to remove the same number of patients, since the considered interval appears to be the same, 4 hours. Nevertheless, the last time point is fixed at the last considered measurement, so in the case (2, 2) the patients need 2 hours of information for all variables, whereas in the case (4, 1) only 3 hours are needed, because the length is the number of points and not the number of intervals (intervals = number of points − 1).
As stated, this approach would decrease the volume of the dataset, hence the reason to stick to the former one, where there are fewer imputations (more dynamics included), more patients are taken into account, and only the most recently collected data is considered.
3.4.2 Enabling the dataset for prediction purposes
In this case there is no need for any algorithm to shift the data. Each patient has a row of data for each variable, corresponding to its evolution over time. Since the objective is to predict the need for vasopressors in a timely manner, the data collected in the interval of 2 hours before the administration is discarded, as shown in Figure 3.12.
Figure 3.12: How the data is enabled for prediction purposes.
Chapter 4
Results and Discussions
In the present chapter the main results obtained through the methodology described in chapter 2 are
presented.
First, the results concerning the punctual data analysis are presented and discussed; there it is
shown that the way the raw database was handled provides better results when compared to previous
studies on the same subject. The chapter starts with the unsupervised clustering analysis in Section 4.1.1,
followed by Section 4.1.2, where the results for feature selection are discussed, considering both overall
performance and single models' performance. Finally, in Sections 4.1.3 and 4.1.4, the model assessment
results covering both overall and singular models' performance are presented and discussed.
Next, the results concerning the time-series analysis coupled with static data are approached. First,
the results for feature selection are shown along with the fixed parameters; it is shown that the static
variables have their place side by side with the time-series, and that the normalization of the data
should be selected consciously according to the dataset under study. The model assessment part follows,
where the results obtained through the four methods are shown and compared.
4.1 Ensemble Modelling - Punctual Data Results
4.1.1 Unsupervised Clustering Validation
As presented previously, the multimodel approach consists of two stages of clustering. First the
data is divided using unsupervised clustering, and then supervised clustering is performed on each of
the clusters that resulted from the former partition. To avoid a grid search approach, which would be
computationally heavier, cluster validation analysis was performed to find the most suitable parameters
for the unsupervised clustering algorithm, FCM.
The unsupervised clustering validation is performed in two steps: evaluation of the clustering
validation indexes (analytical methods) and a study of the distribution of the data points along the
clusters. The latter turned out to be a necessity, since the results attained by the formal analytical
approach are not conclusive about one of the parameters of the FCM algorithm,
the number of unsupervised clusters Kc. As a result, the other parameter, the fuzziness degree m, is
defined using the clustering validation methods, whereas an analysis of the distribution of the
data points is conducted in order to arrive at an optimal number of clusters Kc.
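For reference, the FCM updates whose parameters the validation study tunes (fuzziness m and number of clusters Kc) can be sketched in a few lines of NumPy. This is a minimal illustrative implementation of the standard alternating updates, not the code used in the thesis:

```python
import numpy as np

def fcm(X, Kc, m=1.4, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch: returns prototypes V (Kc x d)
    and the membership matrix U (Kc x N, columns sum to 1)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((Kc, N))
    U /= U.sum(axis=0)                                  # valid fuzzy partition
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)    # prototype update
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        w = d ** (-2.0 / (m - 1.0))                     # membership update
        U_new = w / w.sum(axis=0)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return V, U

# Toy usage: two well-separated groups of 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
V, U = fcm(X, Kc=2, m=1.4)
print(np.round(V, 1))  # prototypes near (0, 0) and (5, 5), in either order
```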
Due to the time-consuming task of obtaining an output from the analytical methods, it is desirable to
reduce the quantity of input data and work with a subset of each dataset that represents the whole
set. Rather than reducing the datasets indiscriminately, the reduction of data volume followed the same
steps as in Sections 4.1.2 and 4.1.3 to obtain the training set of data, where feature selection and model
assessment are conducted, respectively. The idea is to keep the same amount of data for clustering vali-
dation as will be presented to train the models. The stages to obtain the subsets for clustering
validation are depicted in Figure 4.1, where the dataset PAN serves as an example. Summarizing these
steps:
1. The data is divided into two subsets containing 50% of the data each, while keeping the same
distribution of classes: this replicates the division of the data into a Feature
Selection (FS) subset and a Model Assessment (MA) subset.

2. Those 50% of the data are divided into 10 folds (10-fold cross-validation being the method used
here) and 9 of the folds are then grouped, leaving one fold out (test data), to replicate the
amount of data that will be used to train the models.

3. In order to avoid a highly skewed subset (as shown in Table 4.1), the data is balanced to
70% of class 0 and, consequently, 30% of class 1. This reproduces what is performed
for the training data sets (the test data keeps the real balance of the initial dataset).
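Step 3 (undersampling class 0 down to a 70/30 balance) can be sketched as below. The function name and the plain random undersampling are illustrative assumptions; in the study this is applied per training fold:

```python
import numpy as np

def balance_70_30(X, y, seed=0):
    """Undersample class 0 so the result keeps roughly 70% class 0 and
    30% class 1 (assumes class 1 is the minority class, as in Table 4.1)."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    n0 = int(round(len(idx1) * 70 / 30))        # class-0 count for a 70/30 split
    keep0 = rng.choice(idx0, size=min(n0, len(idx0)), replace=False)
    keep = np.concatenate([keep0, idx1])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy example with a 1000/100 class imbalance.
X = np.arange(1100).reshape(-1, 1)
y = np.array([0] * 1000 + [1] * 100)
Xb, yb = balance_70_30(X, y)
print((yb == 0).sum(), (yb == 1).sum())  # 233 100
```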
Table 4.1: Classes percentages in each dataset.

Dataset   Class 0 samples   Class 1 samples   % Class 0   % Class 1
PAN             25149             3409           88,06       11,94
PNM            111049            10931           91,04        8,96
BOTH           124473            13214           90,40        9,60
ALL            255777            34640           88,07       11,93
With the problem of the high-volume datasets addressed, the analysis of the clustering validation
follows.
[Figure 4.1 flowchart: the PAN dataset (28553 samples, normalized data) is split into its 25149 class 0
and 3409 class 1 samples; each class is reduced by 50% (~12575 class 0, ~1705 class 1), divided into
10 folds, the 9 training folds are kept (~11318 class 0, ~1535 class 1), and the data is then balanced
to 70% class 0 / 30% class 1, giving a final subset of ~3582 class 0 and ~1535 class 1 samples.]

Figure 4.1: Example, on the PAN dataset, of obtaining the subset data for clustering validation.
Clustering Validation based on analytical methods
In an attempt to solve the problem of finding the optimal parameters for the unsupervised FCM
algorithm (fuzziness parameter m and number of clusters Kc) a search through the methods presented
in Section 4.1.1 was performed for all datasets. The whole results of these methods are presented under
appendix D, whereas only the results from the dataset ALL are presented here (Table 4.2 since there is
no contrast between datasets to discuss each separately so the following analysis can be extended to
the missing datasets too).
It is easily observed that the best scores are attained using the lowest m. This comes as no surprise,
since most indexes penalize high degrees of fuzziness and overlapping, and a low m delivers low degrees of
both. Only the methods designed for hard clustering (DI and ADI) do not penalize the fuzziness degree
and look only at the final result of the clustering, evaluating inter- and intra-cluster distances. The major
differences occur for PC, SC and S, where m = 1.4 delivers better scores.
Table 4.2: Clustering validation indexes for dataset ALL, performed 10 times with different partitions. (+) means a
higher value is better and (-) means the opposite.

m     Index    Kc=2                   Kc=3                   Kc=4                   Kc=5                   Kc=6
1.4   PC(+)    7,58E-01 ± 1,17E-16    6,44E-01 ± 0,00E+00    5,74E-01 ± 1,17E-16    5,25E-01 ± 1,17E-16    4,88E-01 ± 1,17E-16
      CE(-)    6,93E-01 ± 1,17E-16    1,10E+00 ± 2,34E-16    1,39E+00 ± 2,34E-16    1,61E+00 ± 4,68E-16    1,79E+00 ± 0,00E+00
      SC(-)    8,53E+06 ± 1,96E-09    9,19E+06 ± 1,96E-09    5,95E+06 ± 9,82E-10    7,57E+06 ± 9,82E-10    3,25E+06 ± 4,91E-10
      S(-)     1,64E+02 ± 3,00E-14    2,67E+02 ± 5,99E-14    1,70E+02 ± 3,00E-14    2,15E+02 ± 5,99E-14    9,37E+01 ± 1,50E-14
      XB(-)    2,12E+00 ± 4,68E-16    1,80E+00 ± 4,68E-16    1,61E+00 ± 4,68E-16    1,47E+00 ± 2,34E-16    1,37E+00 ± 2,34E-16
      DI(+)    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00
      ADI(+)   5,19E-03 ± 9,14E-19    5,18E-03 ± 9,14E-19    5,17E-03 ± 0,00E+00    5,16E-03 ± 9,14E-19    5,14E-03 ± 9,14E-19
1.7   PC(+)    6,16E-01 ± 1,17E-16    4,63E-01 ± 5,85E-17    3,79E-01 ± 5,85E-17    3,24E-01 ± 5,85E-17    2,85E-01 ± 5,85E-17
      CE(-)    6,93E-01 ± 0,00E+00    1,10E+00 ± 2,34E-16    1,39E+00 ± 2,34E-16    1,61E+00 ± 0,00E+00    1,79E+00 ± 0,00E+00
      SC(-)    1,17E+10 ± 2,01E-06    1,52E+10 ± 0,00E+00    1,60E+10 ± 0,00E+00    6,45E+09 ± 0,00E+00    2,48E+09 ± 5,03E-07
      S(-)     2,26E+05 ± 6,14E-11    4,77E+05 ± 6,14E-11    4,78E+05 ± 1,23E-10    2,12E+05 ± 0,00E+00    7,24E+04 ± 0,00E+00
      XB(-)    1,72E+00 ± 2,34E-16    1,30E+00 ± 0,00E+00    1,06E+00 ± 0,00E+00    9,07E-01 ± 1,17E-16    7,99E-01 ± 1,17E-16
      DI(+)    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00
      ADI(+)   5,22E-03 ± 0,00E+00    5,22E-03 ± 9,14E-19    5,22E-03 ± 9,14E-19    5,22E-03 ± 9,14E-19    5,22E-03 ± 9,14E-19
2.0   PC(+)    5,00E-01 ± 1,17E-16    3,33E-01 ± 5,85E-17    2,50E-01 ± 5,85E-17    2,00E-01 ± 5,85E-17    1,67E-01 ± 0,00E+00
      CE(-)    6,93E-01 ± 0,00E+00    1,10E+00 ± 2,34E-16    1,39E+00 ± 2,34E-16    1,61E+00 ± 2,34E-16    1,79E+00 ± 2,34E-16
      SC(-)    4,12E+10 ± 0,00E+00    1,13E+10 ± 2,01E-06    1,50E+10 ± 0,00E+00    6,45E+09 ± 0,00E+00    3,27E+09 ± 0,00E+00
      S(-)     7,93E+05 ± 1,23E-10    3,56E+05 ± 6,14E-11    4,51E+05 ± 6,14E-11    2,12E+05 ± 0,00E+00    9,68E+04 ± 1,53E-11
      XB(-)    1,40E+00 ± 0,00E+00    9,33E-01 ± 1,17E-16    7,00E-01 ± 1,17E-16    5,60E-01 ± 0,00E+00    4,67E-01 ± 5,85E-17
      DI(+)    1,03E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    1,04E-02 ± 0,00E+00    1,03E-02 ± 0,00E+00    8,98E-03 ± 1,83E-18
      ADI(+)   5,22E-03 ± 0,00E+00    5,22E-03 ± 9,14E-19    5,22E-03 ± 0,00E+00    5,22E-03 ± 0,00E+00    5,22E-03 ± 9,14E-19

PC - This index measures the undesired amount of overlapping between clusters, and as m increases
a higher overlap is expected. This method therefore penalizes higher values of the m parameter,
compared with smaller ones, by delivering a lower score. A higher number of clusters is also
expected to decrease the score, since the distances between clusters become smaller, thus
increasing the overlapping.
CE - Similarly to PC, this method also penalizes the fuzziness of the cluster partitions. Although this is
not evident from its equation (2.28), since m has no direct influence on this score, a higher m will still
even out the membership degrees µij, reducing the confidence in the classification based on the
partitioning; thus higher scores would be expected for higher values of the m parameter. However,
this does not occur.
SC - In the SC equation (2.29), since a better partitioning corresponds to a lower value of SC, it
is desirable to have a low numerator (which measures compactness) and a high
denominator (which measures separation). The parameter m impacts the numerator both directly and
indirectly, and the smaller the m the smaller the numerator, whereas the denominator is strictly
related to the prototypes' positions. While the prototypes' positions change with m, it is shown (in
Appendix G) that for slight variations of m (such as the ones tested in these methods) the change in
the cluster centres can be neglected; the denominator can therefore be treated as constant,
and the numerator outweighs these minor differences in the denominator, hence the best results for
smaller m.
S - This method clearly penalizes a higher number of clusters through the denominator of equation (2.30):
a higher number of clusters leads to smaller inter-cluster distances. Nevertheless, changing
m, as stated above, does not considerably affect the positioning of the prototypes, and, similarly to
the preceding method, a lower m returns a lower value for the numerator of S, which is the desired
result; so the smaller the m, the better the score should be.
XB - The numerator of this method is the same as in S, and the denominator also changes with the cluster
centres, differing only in the way it measures the separation; so when it comes to defining the
best value for the parameter m, the comments made for S apply. Nonetheless, the results
show improvements for higher values of m.
DI and ADI - In these two methods, designed for the validation of hard clustering algorithms, the only
elements that play a role are the data points and the cluster centres. The only influence of the
parameter m is on the positioning of the clusters, and, as shown in Appendix G, these differences can
be neglected. The scores obtained by these two methods also serve as evidence that the
parameter m has a small influence on the positioning of the prototypes (and subsequently on the
division of the data points), as the scores remain approximately constant while changing m. The same
is observed when increasing the number of clusters.
The CE and XB indexes do not follow the same trend as PC, SC and S. As mentioned, these measures
are not deterministic, and an overall balance between complexity, versatility (to avoid over-fitting) and
quality of the partitioning, assessed using the validation measures, had to be struck.
After this study, and since more indexes agree on a lower m, it is clear that better scores can be
achieved by lowering this parameter. One could expect to reach even better scores by testing
smaller values of m; nevertheless, since the lowest value tested was m = 1.4, this study proceeds
with the parameter fixed at that value. However, there is no clear evidence regarding the best number
of clusters. Most methods increase or decrease monotonically with the number
of clusters, so a high number of clusters should be avoided in the absence of a clear peak to settle on.
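Two of the indexes discussed above, PC and CE, are simple functions of the membership matrix U. The sketch below is illustrative (U of shape Kc × N); its toy values coincide with the Kc = 2, m = 2.0 entries of Table 4.2, which are exactly what a fully uniform membership matrix would give:

```python
import numpy as np

def partition_coefficient(U):
    """PC (higher is better): 1 for a crisp partition, 1/Kc for
    completely uniform memberships."""
    return float((U ** 2).sum() / U.shape[1])

def classification_entropy(U, eps=1e-12):
    """CE (lower is better): 0 for a crisp partition, ln(Kc) for
    completely uniform memberships."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[1])

# Uniform memberships over 2 clusters: PC = 0.5, CE = ln 2 ~ 0.693.
U = np.full((2, 1000), 0.5)
print(partition_coefficient(U), classification_entropy(U))
```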
Clustering Validation based on data distribution
The indecision about the optimal number of clusters called for a distribution analysis focused on the
class distribution (Figures 4.2(a) and 4.2(c)), the disease distribution (Figures 4.2(b) and 4.2(d)), the
volume distribution along the clusters, and the inter-cluster distances (Tables 4.3 and 4.4)¹.

¹ In the main text only the figures related to the dataset ALL are presented and analysed; however, these observations can be
extended to all datasets. The exception is the PAN dataset, which shows that a third cluster might be viable, as can be seen in
appendix E; the study nevertheless proceeded without considering that exception.
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure 4.2: Distribution along clusters for the dataset ALL.
Table 4.3: Euclidean distances between clusters (Kc = 2).
ALL cluster 1 cluster 2
cluster 1 - 4,30E-04
cluster 2 4,30E-04 -
Table 4.4: Euclidean distances between clusters (Kc = 3).
ALL cluster 1 cluster 2 cluster 3
cluster 1 - 6,66E-04 7,98E-05
cluster 2 6,66E-04 - 5,90E-04
cluster 3 7,98E-05 5,90E-04 -
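The inter-cluster (prototype-to-prototype) Euclidean distances reported in Tables 4.3 and 4.4 can be computed as follows; a small illustrative sketch:

```python
import numpy as np

def prototype_distances(V):
    """Pairwise Euclidean distances between cluster centres (rows of V),
    as used for the inter-cluster analysis in Tables 4.3 and 4.4."""
    diff = V[:, None, :] - V[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy prototypes: a 3-4-5 triangle gives an off-diagonal distance of 5.
V = np.array([[0.0, 0.0], [3.0, 4.0]])
print(prototype_distances(V))
```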
The trend observed when adding one more cluster (Kc = 2 to Kc = 3) is repeated as more clusters are
added (Kc = 4, 5, 6, ...), and the following is observed for each subsequent added cluster:

Volume distribution - The new cluster has a considerably lower volume than two of the existing clusters,
and an even lower one than the "previous" added cluster.

Classes distribution - The new cluster tends to mimic the class distribution of one of
the two higher-volume clusters.

Diseases distribution - Same behaviour as for the class distribution.

Inter-cluster distances - The added cluster is at least 10 times closer to one of the two high-volume
clusters than those two clusters are to each other.
Some conclusions can be drawn from this information. Knowing that the volume of a third cluster is
much lower than the other two does not, per se, allow one to conclude much; but combined with the
fact that the smaller cluster is at least 10 times closer to one of the clusters than the distance between
the two main clusters, it shows that the new cluster is "stealing data" from one of the higher-volume
clusters. The task of dividing the unsupervised clusters internally belongs to the fuzzy modelling,
which performs supervised clustering in order to define the antecedent part of the rules. Note that by
setting a high number of clusters Kc it might be possible to reach an even distribution of data (this was
tested with Kc = 8 and such behaviour was not achieved), although that would bring repercussions
during the modelling stage due to the lack of data: insufficient data would be seen during the training
stages, leading to over-fitting problems, and the trustworthiness of the validation would be compromised.
Given these observations, it seems appropriate to assume that the best number of clusters Kc is two.
Inspecting Figure 4.2, it is clear that the distribution is not as expected: the clustering is not grouping
data based on diseases nor on classes. Instead, it is partitioning the data based on volumes of
data points that show no correlation to the diseases, at least when using the FCM algorithm. This
does not mean that other clustering algorithms could not make the distinction between diseases or
classes; for instance, the Gustafson-Kessel (GK) clustering algorithm might be more suitable,
since the differentiation in physiological variables between the diseases might only be noticeable for a
few selected variables. The FCM algorithm acts in every direction (every feature) uniformly, assuming
the hypothesis that clusters are spherical, whereas the GK algorithm associates with each cluster its own
centre and covariance, allowing it to identify ellipsoidal clusters. However, the GK algorithm is out of
the scope of this thesis and will not be covered in more detail. Another possibility is to conduct this
study with FCM after reducing the number of variables to the most informative ones.
4.1.2 Fixing parameters of the models & Feature Selection
The datasets were equally divided into two groups: a feature selection (FS) group, used to
find the best combination of parameters and, furthermore, to obtain the best set of features given some
criterion, and a model assessment (MA) group, whose function is to evaluate the prediction power of the
models with the parameters and features fixed using the other half of the data (FS). The
procedure is the same depicted in Figure 4.1: the data is reduced by 50% and then divided
into 10 folds for 10-fold cross-validation, in which the 9 training folds are balanced to 70% of class 0
and 30% of class 1, while the remaining fold (used for testing the trained models) keeps the original
balance².
² Due to the low percentage of class 1 samples, the classes were balanced for all datasets to respect the percentages of 70%
for class 0 and 30% for class 1. A better approach would be a grid search in which different balances were tested by carrying
the whole study through to the end for each of them; however, that is unattainable due to time restrictions. The idea of balancing
to those percentages is to guarantee a higher presence of class 1 data while keeping, to some extent, the original property that
class 0 is predominant. This shows the models more class 1 samples during the training stage. Doing this
For fixing the parameters (number of clusters cn and fuzziness parameter m) and selecting the most infor-
mative features, only the data pertaining to the FS part is used; this way, neither the parameters nor the
features are over-fitted to the data that will be used to assess the models (the MA group), which allows
more confidence in the results obtained in the model assessment.
This procedure has two stages: one to fix the parameters cn and m, and another to select the
features.
Fixing parameters for:

1. FS based on overall performance - In order to fix the FCM parameters, a grid search was
performed ranging over [2,6] for cn and [1.4,2] for m (using the same random seed to create
the initial partition matrix and the train and test sets), and the combination that yielded the best
AUC was selected.

2. FS based on singular models' performance - In this case, more parameters had to be fixed:
the unsupervised cluster centres (including all features), the fuzziness parameter for each
group of data (m1 and m2) and the number of clusters for each model (nc1 and nc2). Since
the feature selection is performed for each cluster of data individually, each subset of most
predictive features is fitted to a particular volume of data/feature space, hence the ne-
cessity to fix the prototypes' positions. The criterion used to decide the best parameters in this
scenario is the maximization of the index (4.1).
[% of data in cluster 1]× [AUC cluster 1] + [% of data in cluster 2]× [AUC cluster 2] (4.1)
It is known that smaller sets of data lead to high variation in performance measures, and
this approach discourages unbalanced partitions, preventing a smaller cluster from having the same
weight in the final decision. To emphasize the importance of a balanced criterion, here
is an exaggerated example: assume that the first cluster has only 10 points (per 10-fold run) and
classifies well every time, with an AUC of 1, while the other classifier performs randomly, with an AUC
of 0.5 on 10000 samples. An unweighted average would give 0.75 AUC, even though the classifier
performed randomly most of the time.
A grid search was performed ranging over [2,6] for cn and [1.4,2] for m, testing every possible
combination of these parameters for the two models.
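Index (4.1) itself is a one-line computation. The sketch below is illustrative and replays the exaggerated example from the text, showing that the volume-weighted score stays near 0.5 instead of the misleading 0.75 an unweighted average would give:

```python
def weighted_auc_score(frac1, auc1, frac2, auc2):
    """Index (4.1): volume-weighted AUC over the two unsupervised clusters."""
    return frac1 * auc1 + frac2 * auc2

# The example from the text: 10 samples with AUC 1.0 versus
# 10000 samples with AUC 0.5 (a random classifier).
f1 = 10 / 10010
print(weighted_auc_score(f1, 1.0, 1 - f1, 0.5))  # ~0.5005, not 0.75
```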
Feature selection for:
1. FS based on overall performance - After fixing cn and m, the feature selection was run 50 times
with that combination of parameters. The combination of features that occurred most often was
selected as the best subset of features, bringing more information to the classification;
is risk-free due to the volume of the datasets: it can be done without jeopardizing the statistical value of the results. This was
done for the clustering validation and for the training stage in the modelling part; the test part of the modelling kept
the real percentages.
in this way the variability can be reduced. When there was no repetition in those 50 runs, the
set of features with the best AUC was selected.
2. FS based on singular models' performance - The features were fixed at the same time as
the parameters, since changing the seed would lead to different partitions and, consequently,
to feature sets that are unrelated across seeds.
A 10-fold cross-validation procedure was used to validate the results. The
parameters that resulted from the first step (fixing parameters) are shown in appendix F, since they are
not particularly relevant for analysis purposes. With all the parameters fixed for each criterion, feature
selection is performed and the results are shown and discussed in the following sections.
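The greedy sequential forward selection (SFS) procedure referred to in Tables 4.26 - 4.29 can be sketched generically as follows. Here `score_fn` stands for whatever wrapper criterion is used (e.g., cross-validated AUC of the fuzzy models), and the simple stopping rule is an illustrative assumption:

```python
import numpy as np

def sfs(score_fn, n_features, max_features=None, tol=0.0):
    """Greedy sequential forward selection: at each step, add the feature
    that most improves score_fn(selected), and stop when no candidate
    improves the current best score by more than `tol`."""
    selected, best_score = [], -np.inf
    max_features = max_features or n_features
    while len(selected) < max_features:
        scores = {f: score_fn(selected + [f])
                  for f in range(n_features) if f not in selected}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score + tol:
            break
        selected.append(f_best)
        best_score = scores[f_best]
    return selected, best_score

# Toy scorer: subsets containing features 0 and 2 are best,
# with a small penalty per extra feature.
score = lambda feats: len(set(feats) & {0, 2}) - 0.01 * len(feats)
print(sfs(score, n_features=5))  # selects features 0 and 2, then stops
```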
1. Feature Selection based on overall performance
The purpose of the histograms and selection order of the features is most relevant to know which
variables are predominant observing the ones that occur more often and the ones that are selected first
since at some stage the order is more or less random, showing that those variables are not predominant
and add little to the prediction performance. In this study this happens a lot and is in accordance with
the fact that using all variables conduct to better results as it will be shown later.
Histograms showing the number of times that each feature was selected are shown in appendix H.
Below, tables show the most selected feature for each selection position, for each dataset. The
frequency was computed by dividing the number of times the mentioned feature was selected by the
number of times any feature in that position was selected, i.e., feature selection runs that did not reach
a certain number of features are not counted in the frequency computation.
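The frequency just described can be computed as follows; an illustrative sketch in which each run is the ordered list of features it selected:

```python
from collections import Counter

def position_frequency(runs, position):
    """Frequency of the most selected feature at a given 0-based position,
    counting only the runs that reached that position (as in Tables 4.5-4.24)."""
    at_pos = [run[position] for run in runs if len(run) > position]
    feature, count = Counter(at_pos).most_common(1)[0]
    return feature, count / len(at_pos)

# Toy example: four runs; the second run stopped after two features.
runs = [[29, 26, 19], [29, 26], [29, 12, 19], [29, 26, 13]]
print(position_frequency(runs, 1))  # (26, 0.75)
```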
Table 4.5: Most selected features for single model for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 29 26 19 19 13 7
Frequency 1 0,92 0,36 0,22 0,36 0,25
Table 4.6: Most selected features for a priori for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 29 26 19 12 3 3
Frequency 1 1 0,88 0,68 0,38 0,48
Table 4.7: Most selected features for a posteriori for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 17 12 13 13 29 23
Frequency 1 0,5 0,38 0,45 0,59 0,14
Table 4.8: Most selected features for arithmetic mean for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 19 3 29 4 7
Frequency 1 0,54 0,66 0,48 0,32 0,22
Table 4.9: Most selected features for distance-weighted mean for dataset ALL.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 29 26 19 3 4 7
Frequency 1 1 0,84 0,74 0,26 0,24
Table 4.10: Most selected features for single model for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 19 29 13 12 12
Frequency 1 0,64 0,76 0,46 0,4 0,35
Table 4.11: Most selected features for a priori for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 13 12 20 29
Frequency 1 1 0,94 0,94 0,58 0,64
Table 4.12: Most selected features for a posteriori for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 20 8 14 14 5
Frequency 0,94 0,72 0,26 0,17 0,4 0,38
Table 4.13: Most selected features for arithmetic mean for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 19 29 3 13 12
Frequency 1 0,78 0,72 0,6 0,74 0,66
Table 4.14: Most selected features for distance-weighted mean for dataset BOTH.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 13 12 19 29
Frequency 1 0,5 0,72 0,72 0,42 0,52
Table 4.15: Most selected features for single model for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 29 12 20 7
Frequency 1 0,90 0,56 0,34 0,30 0,23
Table 4.16: Most selected features for a priori for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 29 20 12 12
Frequency 1 1 1 0,84 0,52 0,48
Table 4.17: Most selected features for a posteriori for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 19 27 22 2 27
Frequency 0,94 0,5 0,43 0,2 0,19 0,18
Table 4.18: Most selected features for arithmetic mean for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 20 3 29 13 12
Frequency 1 0,58 0,58 0,58 0,44 0,56
Table 4.19: Most selected features for distance-weighted mean for dataset PNM.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 26 3 29 20 13 12
Frequency 1 1 1 0,98 0,58 0,64
Table 4.20: Most selected features for single model for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 25 4 12 13 15
Frequency 1 1 1 0,92 0,88 0,46
Table 4.21: Most selected features for a priori for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 25 4 12 13 10
Frequency 1 1 0,82 0,82 0,8 0,24
Table 4.22: Most selected features for a posteriori for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 20 4 4 7 13 28
Frequency 0,56 0,64 0,37 0,32 0,22 0,21
Table 4.23: Most selected features for arithmetic mean for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 4 12 20 25 15 25
Frequency 0,54 0,28 0,13 0,16 0,18 0,19
Table 4.24: Most selected features for distance-weighted mean for dataset PAN.
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 25 4 12 14 15
Frequency 1 1 1 0,98 0,52 0,30
From the tables above, one common pattern emerges: depending on the criterion, the sequence of
features obtained is more or less consistent. The single model, the a priori and the distance-weighted
mean criteria show a higher consistency, whereas the a posteriori and the arithmetic mean criteria are
more irregular in the order in which they select the features, meaning that, by using such criteria, less
robustness is achieved, making them highly dependent on the subset of data they are trained with and
tested on (more prone to over-fitting).
Table 4.25 shows the mean and standard deviation of the number of features selected by each criterion
for each dataset. The criteria that can make use of a higher number of features while improving
the performance are the ones that are more consistent.
Table 4.25: Mean and standard deviation of the number of features selected through the 50 runs.
Model PAN PNM BOTH ALL
Single model 8,4 ± 1,4 6,5 ± 1,0 7,2 ± 0,9 4,3 ± 1,4
a priori 13,9 ± 2,3 8,3 ± 1,2 8,6 ± 1,6 12,1 ± 2,4
a posteriori 5,9 ± 2,0 5,0 ± 1,8 4,6 ± 1,3 5,0 ± 2,1
mean 9,4 ± 3,4 8,9 ± 1,4 9,6 ± 1,4 9,0 ± 1,7
wgd avg distance 10,2 ± 2,6 9,4 ± 1,2 9,8 ± 1,3 8,1 ± 2,7
Finally, the sets of features considered best for each criterion and dataset are presented in
Tables 4.26 - 4.29. If one had to name the most characteristic variables of each dataset, given their
ubiquity, those would be:

PAN - 4 (Arterial BP Diastolic), 12 (Platelets), 13 (WBC), 19 (NBP), 25 (Arterial pH);
PNM - 3 (Arterial BP), 12 (Platelets), 13 (WBC), 20 (NBP Mean), 26 (Arterial Base Excess), 29 (Lactic
Acid);
BOTH - 3 (Arterial BP), 12 (Platelets), 13 (WBC), 19 (NBP), 26 (Arterial Base Excess), 29 (Lactic Acid);
ALL - 3 (Arterial BP), 7 (Hematocrit), 13 (WBC), 19 (NBP), 26 (Arterial Base Excess), 29 (Lactic Acid).
Table 4.26: Features selected using SFS for the dataset PAN.
Criterion Selected Features
Single model 2, 4, 12, 13, 14, 15, 19, 25, 27, 30
a priori 1, 4, 5, 10, 11, 12, 13, 14, 15, 19, 23, 25, 27, 28, 29, 30
a posteriori 4, 12, 13, 19, 20, 28, 32
mean 1, 2, 3, 4, 5, 12, 13, 14, 15, 17, 18, 19, 20, 24, 25, 28
wgd avg distance 1, 2, 3, 4, 5, 9, 12, 13, 14, 15, 18, 19, 20, 25, 27, 28, 29, 30
Table 4.27: Features selected using SFS for the dataset PNM.
Criterion Selected Features
Single model 3, 12, 13, 20, 26, 29
a priori 3, 7, 11, 12, 13, 19, 20, 23, 24, 26, 27, 29
a posteriori 4, 13, 15, 17, 19, 23, 29
mean 3, 7, 11, 12, 13, 20, 26, 29
wgd avg distance 3, 7, 11, 12, 13, 20, 23, 26, 27, 29
Table 4.28: Features selected using SFS for the dataset BOTH.
Criterion Selected Features
Single model 3, 10, 12, 13, 19, 22, 26, 29
a priori 1, 3, 5, 7, 11, 12, 13, 19, 20, 24, 26, 29, 31
a posteriori 4, 7, 15, 19, 21, 26, 30
mean 1, 3, 4, 12, 13, 14, 19, 20, 24, 26, 27, 29, 31, 32
wgd avg distance 3, 4, 7, 12, 13, 19, 20, 26, 29, 30, 31, 32
Table 4.29: Features selected using SFS for the dataset ALL.
Criterion Selected Features
Single model 3, 7, 11, 13, 19, 26, 29, 30
a priori 1, 2, 3, 4, 7, 11, 12, 13, 19, 20, 21, 26, 29, 31, 32
a posteriori 10, 12, 13, 16, 17, 23, 29, 32
mean 3, 4, 7, 12, 13, 14, 19, 20, 26, 29, 32
wgd avg distance 3, 4, 5, 7, 11, 12, 13, 19, 20, 26, 29, 32
2. Feature Selection based on the singular models’ performance
As mentioned, due to the need to fix the cluster centres and the fact that each unsupervised cluster
of data goes through its own feature selection procedure, it does not make sense to present the results
for FS in the same manner as shown previously. In this case the FS process has an high variance in the
selected features and cannot be correlated since each run of the FS will give the most predictive features
67
for the resulting unsupervised prototypes/subgroups. So, it will only be presented the features that were
selected and the final score given by the equation (4.1). These results are shown in Tables 4.30-4.33.
Comparing the score obtained during the feature selection procedure to the model assessment results
in Tables 4.37-4.40 it can be concluded that the singular models’ FS approach has an high tendency
to over-fit, something that was already expected due to the high oscillation during the feature selection
procedure.
Table 4.30: Feature selection results for PAN dataset.

Cluster     Features                     c    m     Score
cluster 1   13, 28                       5    1,7   0,95 × 0,51 + 0,99 × 0,49 = 0,97
cluster 2   3, 6, 19, 21, 27             4    1,7

Table 4.31: Feature selection results for PNM dataset.

Cluster     Features                     c    m     Score
cluster 1   2, 11, 13, 14, 17, 18, 20    2    1,7   0,90 × 0,50 + 0,95 × 0,50 = 0,93
cluster 2   8, 9, 11, 18, 21, 22, 26     2    1,7

Table 4.32: Feature selection results for BOTH dataset.

Cluster     Features                     c    m     Score
cluster 1   7, 13, 21, 27, 32            3    2     0,99 × 0,50 + 0,97 × 0,50 = 0,98
cluster 2   1, 13, 22, 24, 25, 27        2    1,7

Table 4.33: Feature selection results for ALL dataset.

Cluster     Features                     c    m     Score
cluster 1   7, 13, 21, 27, 32            3    2     0,90 × 0,50 + 0,98 × 0,50 = 0,94
cluster 2   1, 13, 22, 24, 25, 27        2    1,7
4.1.3 Model Assessment based on overall performance
Under this section the results obtained for model assessment based on overall performance are
presented. This will cover the results with and without feature selection. The parameters of the models
for the study without feature selection were obtained by extensive search, training and testing the models
68
with FS data and fixing the parameters that output the best AUC, while the procedure of fixing parameters
for the case with reduced feature space is described in 4.1.2. The results for the model assessment
considering all features are presented under Table 4.35 and with the reduced feature space in Table
4.36.
It must be pointed out that the AUC computed for the a priori and a posteriori criteria is an artefact,
since it cannot be computed directly given the nature of the output: different models are asked to
provide the answer depending on the data point. For these cases, the AUC was computed after fixing the output of the model
that was selected to provide the answer and then varying the threshold from 0 to 1. While it is known
that this does not reflect the true AUC, it is treated as if it did. An alternative would be to compute an
AUC for each model output and average them for each threshold; either way, neither approach
would reflect the sensitivity and specificity shown in the results. The sensitivity and specificity do reflect
the true values, since they are computed from the final decision output against the ground truth.
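The workaround described above can be sketched as follows: sweep the decision threshold from 0 to 1, record the resulting (sensitivity, specificity) pairs, integrate the ROC points for the pseudo-AUC, and, for the a posteriori criterion, pick the threshold that minimises the sensitivity-specificity gap. This is a minimal illustration under those assumptions, not the thesis's implementation.

```python
import numpy as np

def sens_spec(y_true, y_score, thr):
    """Sensitivity and specificity of the decision rule y_score >= thr."""
    pred = y_score >= thr
    pos, neg = y_true == 1, y_true == 0
    return pred[pos].mean(), (~pred)[neg].mean()

def auc_by_threshold_sweep(y_true, y_score, n_thresholds=101):
    """Pseudo-AUC: sweep the threshold from 0 to 1, collect the ROC points
    (1 - specificity, sensitivity) and integrate them with the trapezoidal
    rule -- the artefact described in the text, not a true AUC in general."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    pts = sorted((1.0 - sp, se)
                 for se, sp in (sens_spec(y_true, y_score, t) for t in thresholds))
    fpr = np.array([p[0] for p in pts])
    tpr = np.array([p[1] for p in pts])
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

def balanced_threshold(y_true, y_score, n_thresholds=101):
    """Threshold with the smallest |sensitivity - specificity|: the rule
    used when fixing the a posteriori criterion's operating point."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return min(thresholds,
               key=lambda t: abs(np.subtract(*sens_spec(y_true, y_score, t))))
```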
single model - As shown in previous studies (Appendix I), the single model produces better results without
feature selection. The only real competitor when all variables are considered is the a priori criterion,
which has a higher AUC at the cost of lower sensitivity and specificity. On the other hand, with
selected features, while the a priori criterion is the best performer in terms of AUC, the mean and
weighted mean criteria outperform the single model with selected features and, for some datasets,
perform as well as or better than the single model with all features considered (see Table 4.34).
Table 4.34: Comparison against single model with all variables (no FS).
dataset criterion AUC ACC sensitivity specificity
PAN   mean w/ FS      ↓  ↓  ↓  ≈
PAN   wgt mean w/ FS  ↑  ↑  ↑  ↑
PNM   mean w/ FS      ↓  ↓  ↓  ↓
PNM   wgt mean w/ FS  ↑  ≈  ≈  ≈
BOTH  mean w/ FS      ↑  ≈  ↓  ≈
BOTH  wgt mean w/ FS  ↑  ≈  ↓  ≈
ALL   mean w/ FS      ≈  ↓  ↓  ↓
ALL   wgt mean w/ FS  ↑  ≈  ≈  ↑
a priori - This criterion, similarly to the single model, loses performance when working with selected
features. It is the criterion that selects the most variables (the exception being the BOTH dataset, where
mean includes one more variable from the FS procedure). While this might be good, meaning that
it can make use of more variables (always at most half of the whole feature set), it also
means that the other criteria are capable of similar performances with fewer variables. Compared
with the single model, the tendencies are the same (degradation of all performance measures with
FS), but the single model uses far fewer features.
a posteriori - This criterion outputs better results with feature selection, and is more balanced in terms of
sensitivity and specificity than when all variables are considered. Recall that the goal, when
defining the threshold, is to select the one with the lowest difference between sensitivity and
specificity; here these two statistical measures of performance are not well balanced,
and the closest to an equilibrium is the PAN dataset, where the difference between both
is close to 0.09, while for the other datasets the difference lies between 0.24 and 0.44. Con-
sidering only the most informative variables, these differences range from 0.02 to 0.17.
Nevertheless, both results are consistent: the standard deviation does not show spikes or high
variation in the way it classifies, and it can be seen as a conservative classifier, since the
specificity is mostly higher than the sensitivity, meaning that the decision to administer
vasopressors is rarely made. The accuracy is high and, coupled with the
high specificity, only means that there is more class 0 data. In terms of AUC, it improves with
selected features; however, this is the least reliable classifier among the criteria considered. This criterion
is also the one that stops improving earliest with added features.
mean - Using the mean of the outputs of both models is only better than the weighted mean when
considering all variables for the PAN dataset, and even then the difference is minor. This criterion
performs well overall and, except for the PAN dataset, its results improve with feature selection.
weighted mean - This is the criterion that performs best with feature selection, outperforming every
other criterion in terms of sensitivity and specificity. In terms of variables needed it is not the
greediest (except for the PAN dataset, with 18 selected features), using a number of features between
those of the mean and a priori criteria.
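As a concrete sketch of the two fusion criteria, the snippet below averages the per-cluster model outputs (mean) or weights them by FCM-style memberships derived from the sample's distances to the cluster prototypes (weighted). The membership-based weighting is an assumption standing in for the thesis's "wgd avg dist" rule, whose exact form is defined earlier in the document.

```python
import numpy as np

def fuse_outputs(model_outputs, x, prototypes, criterion="mean", m=1.7):
    """Combine the outputs of the per-cluster models for one sample x.
    model_outputs: shape (n_models,), each model's score for x.
    prototypes:    shape (n_models, n_features), one centre per model.
    'mean' averages the scores; 'weighted' weights them by FCM-style
    memberships computed from the distances of x to each prototype
    (an assumption -- the thesis's exact weighting is not reproduced here)."""
    model_outputs = np.asarray(model_outputs, dtype=float)
    if criterion == "mean":
        return float(model_outputs.mean())
    d = np.linalg.norm(np.asarray(prototypes) - np.asarray(x), axis=1)
    d = np.maximum(d, 1e-12)                 # avoid division by zero
    u = d ** (-2.0 / (m - 1.0))              # FCM membership, unnormalised
    w = u / u.sum()                          # weights sum to 1
    return float(w @ model_outputs)
```

With `x` close to prototype 0, the weighted criterion tilts the fused score towards model 0's output, while the mean ignores position entirely.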
Overall, apart from the a posteriori criterion, all the criteria perform comparably well. The single model
is best when using all the variables; however, similarly to a priori, it loses performance with FS while all
the remaining methods improve. The weighted mean criterion with feature selection is the best classifier
in this study, providing the best sensitivity and specificity with the use of fewer features. The single model is
still a good approach, since it performs well with all variables and, after feature selection, still outputs
good results with fewer variables than any other criterion, except for a posteriori, which is the worst
criterion considered.
Table 4.35: Results without FS
MIMIC II - Pancreatitis without FS
Model AUC Accuracy Sensitivity Specificity
Single model 0,91 ± 0,02 0,83 ± 0,02 0,83 ± 0,03 0,83 ± 0,02
a priori 0,93 ± 0,01 0,79 ± 0,01 0,80 ± 0,03 0,79 ± 0,01
a posteriori 0,84 ± 0,02 0,85 ± 0,01 0,77 ± 0,04 0,86 ± 0,02
mean 0,90 ± 0,02 0,82 ± 0,01 0,83 ± 0,04 0,82 ± 0,01
wgd avg dist 0,89 ± 0,02 0,82 ± 0,01 0,82 ± 0,04 0,82 ± 0,01
MIMIC II - Pneumonia without FS
Model AUC Accuracy Sensitivity Specificity
Single model 0,85 ± 0,02 0,78 ± 0,01 0,78 ± 0,02 0,78 ± 0,01
a priori 0,86 ± 0,01 0,76 ± 0,01 0,76 ± 0,03 0,76 ± 0,01
a posteriori 0,69 ± 0,04 0,82 ± 0,03 0,53 ± 0,05 0,85 ± 0,03
mean 0,81 ± 0,02 0,75 ± 0,02 0,75 ± 0,02 0,75 ± 0,02
wgd avg dist 0,83 ± 0,01 0,76 ± 0,01 0,77 ± 0,02 0,76 ± 0,01
MIMIC II - Both without FS
Model AUC Accuracy Sensitivity Specificity
Single model 0,84 ± 0,02 0,77 ± 0,01 0,77 ± 0,01 0,77 ± 0,01
a priori 0,86 ± 0,01 0,76 ± 0,02 0,76 ± 0,02 0,76 ± 0,02
a posteriori 0,70 ± 0,02 0,86 ± 0,01 0,51 ± 0,04 0,90 ± 0,01
mean 0,81 ± 0,02 0,75 ± 0,01 0,74 ± 0,02 0,75 ± 0,01
wgd avg dist 0,82 ± 0,01 0,76 ± 0,01 0,76 ± 0,02 0,75 ± 0,01
MIMIC II - All patients without FS
Model AUC Accuracy Sensitivity Specificity
Single model 0,82 ± 0,02 0,75 ± 0,01 0,75 ± 0,01 0,75 ± 0,02
a priori 0,85 ± 0,01 0,73 ± 0,01 0,73 ± 0,01 0,73 ± 0,01
a posteriori 0,67 ± 0,02 0,82 ± 0,01 0,52 ± 0,04 0,86 ± 0,02
mean 0,78 ± 0,01 0,71 ± 0,01 0,71 ± 0,02 0,71 ± 0,01
wgd avg dist 0,80 ± 0,01 0,73 ± 0,01 0,73 ± 0,02 0,73 ± 0,01
Table 4.36: Results with FS
MIMIC II - Pancreatitis with FS
Model #FS AUC Accuracy Sensitivity Specificity
Single model 10 0,88 ± 0,03 0,81 ± 0,03 0,81 ± 0,03 0,81 ± 0,03
a priori 16 0,92 ± 0,01 0,78 ± 0,01 0,80 ± 0,02 0,77 ± 0,02
a posteriori 7 0,85 ± 0,02 0,83 ± 0,01 0,74 ± 0,05 0,84 ± 0,02
mean 16 0,89 ± 0,03 0,82 ± 0,01 0,82 ± 0,04 0,82 ± 0,01
wgd avg dist 18 0,91 ± 0,01 0,83 ± 0,01 0,84 ± 0,03 0,83 ± 0,01
MIMIC II - Pneumonia with FS
Model #FS AUC Accuracy Sensitivity Specificity
Single model 6 0,83 ± 0,03 0,76 ± 0,02 0,76 ± 0,02 0,76 ± 0,02
a priori 12 0,86 ± 0,01 0,75 ± 0,01 0,74 ± 0,02 0,75 ± 0,01
a posteriori 7 0,78 ± 0,03 0,77 ± 0,02 0,69 ± 0,04 0,78 ± 0,02
mean 8 0,83 ± 0,01 0,76 ± 0,01 0,75 ± 0,02 0,76 ± 0,01
wgd avg dist 10 0,85 ± 0,01 0,78 ± 0,01 0,78 ± 0,02 0,78 ± 0,01
MIMIC II - Both with FS
Model #FS AUC Accuracy Sensitivity Specificity
Single model 8 0,83 ± 0,01 0,76 ± 0,01 0,75 ± 0,01 0,76 ± 0,01
a priori 13 0,86 ± 0,01 0,74 ± 0,01 0,74 ± 0,02 0,74 ± 0,01
a posteriori 7 0,77 ± 0,01 0,76 ± 0,01 0,68 ± 0,02 0,77 ± 0,01
mean 14 0,84 ± 0,01 0,77 ± 0,01 0,77 ± 0,02 0,77 ± 0,01
wgd avg dist 12 0,85 ± 0,01 0,77 ± 0,01 0,77 ± 0,02 0,77 ± 0,01
MIMIC II - All patients with FS
Model #FS AUC Accuracy Sensitivity Specificity
Single model 8 0,79 ± 0,02 0,73 ± 0,01 0,73 ± 0,01 0,73 ± 0,01
a priori 15 0,85 ± 0,01 0,73 ± 0,01 0,72 ± 0,01 0,73 ± 0,01
a posteriori 8 0,77 ± 0,04 0,70 ± 0,02 0,74 ± 0,05 0,70 ± 0,03
mean 11 0,82 ± 0,01 0,74 ± 0,01 0,74 ± 0,01 0,74 ± 0,01
wgd avg dist 12 0,82 ± 0,01 0,75 ± 0,01 0,75 ± 0,01 0,75 ± 0,01
4.1.4 Model Assessment based on the singular models’ performance
The approach analysed in this section proved to be inappropriate for the datasets under study.
The problems that arose with the use of this method, in which over-fitting is pervasive, are listed below:
Feature selection does not bring any advantage - The feature selection procedure for this approach
cannot reduce the number of features that will be used: all the variables must be
considered to compute the distance between the centres of the unsupervised clusters and the data
points, and, as seen before, single models tend to perform better with all features, so
there is no improvement in performance with the use of fewer variables for each model separately.
The results during the feature selection procedure were remarkably high (see Tables 4.30-4.33),
but looking at the model assessment results one can infer that this was because the models
were over-fitted to the presented data. Even with the use of cross-validation, the positions of the
prototypes are fixed using all the FS data in order to divide the data into subgroups, and thus do not
change with the folds.
Data partition - The data was well partitioned during the feature selection procedure (see Tables 4.30-
4.33), more closely resembling the distribution shown in Section 4.1.1. However, since the cluster
centres had to be fixed (otherwise the features selected by the FS procedure would not make
sense), the result is that for a different group of data (the MA data) the partition is highly unbalanced
(the exception being the BOTH dataset). This procedure cannot be generalized, and the cluster
centres did over-fit the FS data.
The procedure for selecting features and FCM parameters - The idea of having models suited to dif-
ferent zones of the feature space is very appealing; however, since different clustering results ben-
efit from different variables and parameters (and this association is very sensitive), fixing both the
features and the parameters for a particular set of data compromises the generalization ability.
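The over-fitting mechanism listed above is easy to reproduce: once the prototypes are frozen on the FS data, new data is merely assigned to them through the standard FCM membership formula, so nothing adapts to the new samples. A minimal sketch under that assumption (function names are illustrative, not from the thesis):

```python
import numpy as np

def fcm_memberships(X, centres, m=1.7):
    """Membership of each sample in X to each FIXED prototype, using the
    standard FCM membership formula u_ik = 1 / sum_j (d_ik/d_jk)^(2/(m-1)).
    The centres are not re-fitted: this mirrors the assessed setup, where
    prototypes fixed on the FS data are reused on the MA data."""
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                 # guard against zero distance
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def cluster_shares(X, centres, m=1.7):
    """Percentage of samples assigned (by highest membership) to each
    prototype -- the '% data cluster' figures reported in Tables 4.37-4.40."""
    labels = fcm_memberships(X, centres, m).argmax(axis=1)
    return np.bincount(labels, minlength=len(centres)) / len(X) * 100.0
```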
The model assessment results for each dataset are presented in Tables 4.37-4.40. The first
thing to notice is that every approach presented in this section performs worse than the single
model (shown in Tables 4.35 and 4.36). The percentages of data inside each cluster shown in Tables
4.37-4.40 clearly show that the fixed positions of the prototypes do not match the data presented
in model assessment (the exception being BOTH, which has a balanced distribution, possibly by mere
coincidence). The a priori and a posteriori criteria show high variance in the sensitivity and
specificity, while the mean and weighted mean criteria remain consistent. The latter are the
top performers in this case, having the highest AUC, sensitivity and specificity, and both show very
similar behaviour.
This might be a good approach for well-behaved datasets, which is not the current case: here, each
clustering initialization leads to a different set of cluster centres while keeping nearly the same distribution
percentages. This turns the clustering procedure into a push and pull game between prototypes,
which is not the purpose. This also happens in the previous case (the datasets are the same); the
difference is that here the models are less flexible due to the over-fitting carried over from the FS data
to the MA data, namely the cluster centres and their selected features. This has the potential to produce
great results, as shown during the FS stage, but it requires the MA data to be very similar to the
FS data, because slight variations in the initial clustering lead to large differences in the final result (high
sensitivity).
Table 4.37: Results for models based on the singular models' performance: PAN dataset (75,9% of the data in cluster 1; 24,1% in cluster 2).
criterion         AUC          Accuracy     Sensitivity  Specificity
a priori          0,72 ± 0,04  0,70 ± 0,03  0,64 ± 0,05  0,71 ± 0,03
a posteriori      0,73 ± 0,05  0,48 ± 0,04  0,88 ± 0,07  0,43 ± 0,06
mean              0,75 ± 0,06  0,68 ± 0,05  0,67 ± 0,06  0,68 ± 0,05
wgd avg distance  0,75 ± 0,06  0,68 ± 0,05  0,67 ± 0,06  0,68 ± 0,05
Table 4.38: Results for models based on the singular models' performance: PNM dataset (89,1% of the data in cluster 1; 10,9% in cluster 2).
criterion         AUC          Accuracy     Sensitivity  Specificity
a priori          0,74 ± 0,02  0,69 ± 0,01  0,68 ± 0,03  0,69 ± 0,02
a posteriori      0,72 ± 0,02  0,70 ± 0,01  0,70 ± 0,02  0,70 ± 0,02
mean              0,76 ± 0,01  0,70 ± 0,02  0,70 ± 0,01  0,70 ± 0,02
wgd avg distance  0,76 ± 0,01  0,70 ± 0,02  0,70 ± 0,01  0,70 ± 0,02
Table 4.39: Results for models based on the singular models' performance: BOTH dataset (55,2% of the data in cluster 1; 44,8% in cluster 2).
criterion         AUC          Accuracy     Sensitivity  Specificity
a priori          0,76 ± 0,01  0,76 ± 0,01  0,64 ± 0,02  0,78 ± 0,01
a posteriori      0,73 ± 0,02  0,69 ± 0,02  0,71 ± 0,04  0,69 ± 0,02
mean              0,76 ± 0,01  0,70 ± 0,01  0,70 ± 0,03  0,70 ± 0,01
wgd avg distance  0,76 ± 0,01  0,70 ± 0,01  0,70 ± 0,03  0,70 ± 0,01
Table 4.40: Results for models based on the singular models' performance: ALL dataset (85,9% of the data in cluster 1; 14,1% in cluster 2).
criterion         AUC          Accuracy     Sensitivity  Specificity
a priori          0,61 ± 0,01  0,73 ± 0,05  0,37 ± 0,07  0,78 ± 0,07
a posteriori      0,59 ± 0,04  0,66 ± 0,04  0,69 ± 0,04  0,66 ± 0,05
mean              0,73 ± 0,01  0,68 ± 0,02  0,68 ± 0,02  0,68 ± 0,02
wgd avg distance  0,73 ± 0,01  0,68 ± 0,02  0,67 ± 0,02  0,68 ± 0,02
4.2 Mixed Fuzzy Clustering - Time-series data approach
In this section, four different approaches to modelling are considered. For each of them, two
ways of normalizing the data are studied, resulting in eight combinations.
The modelling approaches were presented in Section 2.3: FCM FM, FCM-FCM FM, MFC FM
and MFC-FCM FM. The normalization methods were mentioned in Section 3.2.4, corresponding to
methods 1 and 3, which are the ones that seem reasonable for the present cases. The
imputation method is the one presented in Section 3.4.1.
The adopted workflow is similar to that of the ensemble modelling: first, the parameter tuning and feature
selection are performed; then, the model assessment results are presented in order to show the behaviour
of the performance under each assumption.
4.2.1 Fixing parameters of the models & Feature Selection
The first drawback noticed, compared to the previous approach, is that all the methods presented in
this section, particularly the ones that use MFC, are computationally heavier, which led to the decision
to limit the number of parameter combinations. The first limitation is that the length of the
time-series is fixed at ten3. The fuzziness parameter then took the values {1.2, 2} and the number of clusters
the values {2, 3, 4, 5}. For MFC FM and MFC-FCM FM there is an added parameter, λ, which took the
values {0, 1, 2, 5}. Running the best-performing parameter combination 50 times in order to fix the features
is impracticable in this scenario, so the parameters were fixed as the combination that led to the best
AUC, and that combination was then run 10 times with a different seed each time. The final combination
of features is the one that resulted in the best AUC over the 10 runs.
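The parameter-fixing step above amounts to an exhaustive search over a small grid. In the sketch below, `evaluate_auc` is a hypothetical stand-in for the thesis's train-and-test routine; only the grid itself comes from the text.

```python
import itertools

def grid_search(evaluate_auc, use_mfc):
    """Exhaustive search over the limited parameter grid described above.
    evaluate_auc(m, c, lam) is assumed to train/test a model and return its
    AUC; it stands in for the actual training routine, which is not shown."""
    m_values = [1.2, 2]                                # fuzziness parameter
    c_values = [2, 3, 4, 5]                            # number of clusters
    lam_values = [0, 1, 2, 5] if use_mfc else [None]   # λ only for MFC methods
    # Keep the (m, c, λ) combination with the highest AUC.
    return max(itertools.product(m_values, c_values, lam_values),
               key=lambda p: evaluate_auc(*p))
```

The winning combination is then re-run 10 times with different seeds, and the feature set from the best of those runs is kept.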
The data was split equally into an FS data set and an MA data set; however, the class balance within
each was kept the same as in the whole data set, in order to avoid discarding data, of which
there is less than in the punctual case.
Table 4.41 shows the results of the conducted feature selection: best parameters and most predictive
features. At first glance, there is an obvious improvement in AUC when using normalization type 1,
while the strategy of using the resulting partition matrix as input variables has a higher impact on the
number of selected features. Normalization type 1 also leads to a higher number of predictive features, meaning that
this simple assumption enables the methods to explore the available data more appropriately.
Using the transformation also reduces the number of selected features, which
is intuitive because none of the variables is used directly; instead, they go through a transformation
process that reduces the dimensionality of the input to the models, and the inputs are transformed again in the
fuzzification step. This makes the value of each variable by itself less important.
Static features, from 33 to 37, appear amongst the time-varying ones, meaning that
these features play a role in prediction. The variables SAPS (36) and SOFA (37) were expected to
be useful, since they condense information about physiological variables (SOFA even includes age),
but variables such as gender (33) and age (34) are also selected. It is also observed that the combination
with the highest value of λ is the one that does not select any static variable, which is in accordance with its
meaning: the best performance is achieved by giving a higher weight to the time-varying data.
3However, out of curiosity, smaller lengths for the time-series were tested and showed improved results. This observation alone might mean that a punctual approach is better, since the smaller the time-series, the more similar it gets to the first presented approach.
Table 4.41: Best parameters and selected features according to AUC.
Method      Norm. Type  AUC   m  # of clusters  λ  Features
FCM FM      1           0.85  2  3              -  1, 7, 19, 24, 26, 29, 31 — 33
FCM FM      3           0.74  2  3              -  2, 7, 19, 24, 26, 27 — 36
FCM-FCM FM  1           0.84  2  3              -  3, 19, 20, 29, 31
FCM-FCM FM  3           0.73  2  4              -  2, 22 — 33
MFC FM      1           0.86  2  2              1  1, 5, 7, 15, 17, 19, 21, 23, 25, 26, 29, 31 — 33, 36
MFC FM      3           0.81  2  3              2  2, 6, 13, 15, 17, 19, 23, 24, 29 — 36, 37
MFC-FCM FM  1           0.87  2  5              5  3, 4, 7, 19, 22, 29, 31
MFC-FCM FM  3           0.76  2  2              2  1, 4, 12, 17, 19, 31 — 34, 36, 37
Considering the 10 runs, a frequency analysis of the feature selection procedure was conducted,
restricted to normalization type 1 (the normalization type 3 FS analysis is dropped from now on).
Tables 4.42-4.45 show the preferred order and frequency with which the features are selected for each
method.
Table 4.42: Most selected features by FCM FM (mean number of selected features: 5.8).
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 29 10 7, 31 7 24
Frequency 0.70 0.40 0.22 0.29 0.29 0.40
Table 4.43: Most selected features by FCM-FCM FM (mean number of selected features: 4.3).
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 3 29 18, 31 20 22, 32
Frequency 0.70 0.40 0.44 0.25 0.75 0.50
Table 4.44: Most selected features by MFC FM (mean number of selected features: 8.2).
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 7, 29, 31 31 7 24 4, 26
Frequency 1.00 0.30 0.60 0.50 0.40 0.30
Table 4.45: Most selected features by MFC-FCM FM (mean number of selected features: 6.5).
Most selected feature by order
1st 2nd 3rd 4th 5th 6th
Feature # 19 7 3 29 31 4
Frequency 0.80 0.40 0.50 0.56 0.38 0.38
4.2.2 Model Assessment
This section presents the model assessment of the four methods under study. For this purpose,
half of the data, tagged as the MA dataset, is used in two different scenarios: considering all the variables,
with the fixed parameters tuned using the FS data with all variables (Table 4.46), and considering
only the variables obtained through FS, with the respective parameters (Tables 4.47-4.50).
Focusing on the results obtained for normalization type 1, Table 4.46 shows that the
transformation based on the partition matrix, both for FCM-FCM FM and MFC-FCM FM, improves the
results. Not only are the performance measures better, but the stability, in terms of variance and of the
difference between sensitivity and specificity, is also improved.
Tables 4.47-4.50 show the results of model assessment considering the features that were selected
by FS. The results are compared in pairs: Table 4.47 shows the results of FCM FM and FCM-FCM FM
considering the features selected by FCM FM, and Table 4.48 for the features selected by FCM-FCM
Table 4.46: MA data using all variables with the best FS data parameters according to AUC.
Method      Norm. Type  AUC          Accuracy     Sensitivity  Specificity
FCM FM      1           0.70 ± 0.11  0.68 ± 0.09  0.56 ± 0.14  0.72 ± 0.12
FCM FM      3           0.66 ± 0.06  0.64 ± 0.06  0.54 ± 0.11  0.68 ± 0.07
FCM-FCM FM  1           0.75 ± 0.08  0.72 ± 0.07  0.74 ± 0.15  0.71 ± 0.09
FCM-FCM FM  3           0.57 ± 0.07  0.55 ± 0.08  0.45 ± 0.17  0.59 ± 0.08
MFC FM      1           0.70 ± 0.11  0.68 ± 0.09  0.56 ± 0.14  0.72 ± 0.12
MFC FM      3           0.66 ± 0.06  0.64 ± 0.06  0.54 ± 0.11  0.68 ± 0.07
MFC-FCM FM  1           0.79 ± 0.09  0.77 ± 0.06  0.63 ± 0.08  0.82 ± 0.09
MFC-FCM FM  3           0.68 ± 0.09  0.65 ± 0.09  0.62 ± 0.11  0.66 ± 0.12
FM, while Table 4.49 shows the results of MFC FM and MFC-FCM FM for the features selected by MFC
FM, and Table 4.50 for the features selected by MFC-FCM FM.
As expected, each method performs better with its respective feature set. The methods that use
feature transformation, FCM-FCM FM and MFC-FCM FM, show closer values of sensitivity and speci-
ficity, but also a higher variance than when no transformation is considered. These same
methods show better results when no FS is considered, contrary to FCM FM and MFC FM, which show
improved results with FS. Considering the results in Table 4.41, one might say that the models that use
feature transformation have a higher tendency to overfit; given the results without feature selection in
Table 4.46, it can also be concluded that they are less prone to being affected by noisy variables, and that
the feature transformation functions as a filter, with the potential to extract conclusive information from a
group of variables that were not considered during FS. On the other hand, the methods that do not use
feature transformation are affected by the noise, thus showing better results with the selected features.
Overall, the best performer after feature selection is FCM FM.
Table 4.47: FCM FM vs FCM-FCM FM with FCM FM selected features.
Model AUC Accuracy Sensitivity Specificity
FCM FM      0.80 ± 0.06  0.75 ± 0.05  0.70 ± 0.15  0.77 ± 0.06
FCM-FCM FM  0.71 ± 0.08  0.66 ± 0.09  0.67 ± 0.12  0.66 ± 0.12
Table 4.48: FCM FM vs FCM-FCM FM with FCM-FCM FM selected features.
Model AUC Accuracy Sensitivity Specificity
FCM FM      0.71 ± 0.08  0.68 ± 0.03  0.60 ± 0.12  0.71 ± 0.06
FCM-FCM FM  0.72 ± 0.12  0.69 ± 0.08  0.68 ± 0.17  0.69 ± 0.10
Table 4.49: MFC FM vs MFC-FCM FM with MFC FM selected features.
Model AUC Accuracy Sensitivity Specificity
MFC FM 0.78 ± 0.08 0.72 ± 0.06 0.63 ± 0.06 0.76 ± 0.07
MFC-FCM FM 0.62 ± 0.11 0.68 ± 0.07 0.11 ± 0.09 0.91 ± 0.10
Table 4.50: MFC FM vs MFC-FCM FM with MFC-FCM FM selected features.
Model AUC Accuracy Sensitivity Specificity
MFC FM 0.75 ± 0.10 0.70 ± 0.06 0.65 ± 0.18 0.72 ± 0.06
MFC-FCM FM 0.77 ± 0.09 0.73 ± 0.09 0.72 ± 0.13 0.73 ± 0.12
Chapter 5
Conclusions
The present thesis was developed with the aim of predicting patients’ need of vasopressors in ICUs,
based on patients’ physiological variables and demographics/static information measured during their
hospitalization.
It proposes some changes to the preprocessing proposed in [24] for the clinical actual state / punctual
data analysis, to which ensemble modelling was applied. The preprocessing alone contributed to
better results for the single model, and some configurations of the ensemble modelling proved
effective for this case study. The study conducted for the unsupervised clustering, used to divide the
data into subgroups, has shown that the patients are not divided by their disease, as was previously
thought, nor by the need for vasopressors, meaning that the disease label per se is not discriminatory
for the purpose of this study. Each subgroup has approximately the same data balance in terms
of diseases and vasopressor administration.
Regarding the ensemble modelling, two different approaches were conducted: one tuned with re-
spect to the overall performance, and another where the subgroups are tuned individually. Although the
modelling based on the single models' performance did not show improvements, due to its high tendency
to overfit to the training data, the modelling based on overall performance did improve the results.
The ensemble modelling has been shown to take advantage of different features depending on the subgroups.
Two strategies were used for the combination of classifiers: classifier selection, which integrates the a pri-
ori and a posteriori approaches, and classifier fusion, which integrates the mean and weighted mean criteria.
The a posteriori criterion did not perform well compared with its peers, which delivered better results than
the single model when feature selection is used. However, when all features are considered, the single
model is the best performer, showing a high level of versatility and generalization.
Another preprocessing was proposed to suit the analysis of the data following its temporal evolution,
combined with demographic/static data. This approach requires a low level of data imputation, making
it more reliable than the former pre-processing.
Four different models were constructed, for which feature selection and model assessment were per-
formed. These models can be divided into two groups: FCM FM and FCM-FCM FM, where the former
receives as input the time-series and static data pointwise and the latter receives the partition matrix
that results from the FCM clustering of the same data; and MFC FM and MFC-FCM FM, where the
first uses MFC (instead of FCM) to construct the model, and MFC-FCM FM, in which MFC is used to
transform the data and feed the resulting partition matrix to the FCM FM.
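The transformation step shared by FCM-FCM FM and MFC-FCM FM can be illustrated with a compact FCM implementation: the raw data X is clustered, and the resulting partition matrix U (one membership column per cluster) replaces X as the input to the second-stage model. This is a minimal sketch under stated assumptions (random initialisation, fixed iteration count), not the thesis's implementation.

```python
import numpy as np

def fcm(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal FCM: returns (centres, U), where U[i, k] is the membership of
    sample i in cluster k. A compact stand-in for the clustering step; the
    actual implementation details (initialisation, stopping) may differ."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # rows of U sum to 1
    for _ in range(iters):
        W = U ** m                               # fuzzified memberships
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.maximum(np.linalg.norm(X[:, None] - centres[None], axis=2), 1e-12)
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centres, U

# The partition-matrix idea: U, not the raw data, becomes the input to the
# second-stage fuzzy model (only the transformation is shown here).
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
_, U = fcm(X, c=2)   # U has shape (4, 2), with rows summing to 1
```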
For the mixed data approach, regarding the effectiveness of the feature transformation
based on the resulting partition matrices, it was concluded that, for this particular case, the transfor-
mation brings consistency, making better use of higher-dimensional inputs than their
counterparts (feeding the models directly with the data). However, after feature selection, the models
that do not use the feature transformation perform better, showing that the proposed transformation acts as
a filter, with the capability to extract predictive information from variables that are not kept after
feature selection while avoiding being affected by noise. These advantages do not make up for the poor
results against the simplicity and performance of FCM FM, which is the best performer after feature
selection.
The two approaches cannot be compared directly due to the difference in the structure of the data
and the information contained in each. Nevertheless, both sets of results were satisfactory;
the best results for each approach, in terms of data structure, are shown in Table 5.1, and
both make use of the feature selection results.
Table 5.1: Best results for punctual and evolution state analysis for the dataset ALL.
Model # of features AUC Accuracy Sensitivity Specificity
wgd avg dist 12 0.82 ± 0.01 0.75 ± 0.01 0.75 ± 0.01 0.75 ± 0.01
FCM-FCM FM 8 0.80 ± 0.06 0.75 ± 0.05 0.70 ± 0.15 0.77 ± 0.06
It is important to note that the mixed data is less voluminous due to the added constraints of its
structure. It is also more reliable for the purpose of this study: the imputation of data is around 14.6%
versus 74.7%, and it predicts the initiation of vasopressor intake, contrary to the punctual state
data, which contains data until the end of the first vasopressor administration, thus predicting the continuous
administration of the drug and not only its initiation.
5.1 Limitations
Throughout this thesis, several assumptions had to be made, each bringing limitations at its step.
During the pre-processing:

Septic shock filter - Since MIMIC II does not contain an identifier for the septic shock condition, the
filter was applied to diseases where the incidence of septic shock is higher: pancreatitis and
pneumonia. However, this does not guarantee that every patient is under a septic shock condition.

Eliminating deceased patients that did not have vasopressors - While this assumption eliminates un-
desired bias, it raises the question of how the model will behave when presented with such
cases in real time.
Data alignment - When clinical state data is used, it is desirable to align the data. The approach consists
of picking the variable with the highest sampling frequency and aligning the data based on a zero-order hold (ZOH).
The lower sampling rate of most variables leads to repetition of data, or to data points that are equal in most
features, with only the highly sampled features changing. The impact of this was not computed; however,
it might be the cause of such good classification results. It also leads to the high percentage of
imputations.
Most likely, the initiation of the administration is not predicted - Again, for the clinical state/punctual
data, the class 1 data is composed of the data in the interval between the starting point of
vasopressor administration and the end of its continuous administration. This means that most of the
class 1 data does not indicate that vasopressors will start within a given time window (only one class
1 data point per patient carries this information); instead, most of the data says that the patient will continue to be
administered vasopressors from that point on. Considering only the point that would predict the
administration in a timely manner would require having only one class 1 data point per patient that had va-
sopressors, which would be impracticable due to the highly unbalanced classes. So, the clinical state
scenario is answering the question "will the patient be prescribed OR continue with vasopressor
administration?", and the latter part is probably the one that is answered correctly most of the
time.
Comparison between clinical state and evolution state is made impossible - Given the data struc-
ture of each scenario, a viable comparison cannot be made. It would even be unfair to the
time-series analysis since, in this case, the two aspects immediately above benefit the performance
of the clinical state analysis. The time-series case is less prone to inconsistencies and better
represents a real case.
The feature selection was also problematic. Due to the variation observed during the FS procedure,
it is only known which variables work best alone or in conjunction with a few other features. This is
shown by observing that only the first selected variables are constant, while the subsequent ones depend
strongly on the randomness inherent to the training data and the initialization of the clustering algorithms.
This idea is reinforced by the fact that the single model performs best when all features are available. So,
the greedy algorithm approach for feature selection has proved inefficient in highlighting the most
predictive variables beyond a certain point.
5.2 Future Work
There is always room for improvement in a study, and as such, the following future work is proposed:
Feature Selection - The feature selection should be explored with other algorithms. This has been done
before for the single model, but finding a more consistent algorithm would highlight the differences
between the criteria as far as features are concerned.
Find a way to compare both analyses - At the end of this thesis, the problem of comparing both ap-
proaches arises. It would be interesting to be able to compare which kind of analysis performs
best: clinical state vs. evolutionary state. This would require reconstructing the clinical state data
to avoid the repetition of data (at least during the test phase) and considering only data points that
predict the initiation of vasopressors. The time-series analysis is already well approached.
Length of the time-series - It would be interesting to analyse the implications of increasing and
reducing the length of the time-series.
Same methods using Dynamic Time Warping (DTW) - Unfortunately, due to the heavy computational
effort of this algorithm, the study was not fully conducted, but from what was tested it is expected
to lead to better results.
Normalization - As shown, the proposed normalization led to worse results. However, from
the tests that were conducted, one might want to try using this normalization only for
variables that are more likely to have different "healthy" values for each patient.
Eliminating class 0 deceased patients - An analysis can be conducted to evaluate how the models
behave in the presence of such patients. While it makes sense to remove these patients from the
training data, in a real case scenario they will still be present, so adding them to the test data and
analysing the output could be valuable. It would also be worth investigating what distinguishes
these patients from patients who need vasopressors.
Enable different sizes for each feature in the time-series - The algorithm that is available at the moment
does not allow different time-series lengths for different features. This could be useful, since it
was observed that different variables tend towards "failure" at different times and at different rates.
Insightful analysis of the features' behaviour - Enabling the algorithm to use different time-series sizes
is of little use if the differences in behaviour between the variables are not understood.
Clustering algorithm - Instead of FCM, the Gustafson-Kessel algorithm should be tried; thanks to its
flexibility, it might perform better.
Other modelling approaches - Ensemble modelling can be applied solely to the mixed data approach,
or an ensemble can be built that encompasses both approaches: punctual state data and state
evolution data. The unsupervised clustering might also be done through MFC instead of FCM.
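For the DTW item above, the standard dynamic-programming formulation can be sketched as follows. This is a generic textbook implementation for 1-D series, written for illustration only; it is not the variant whose computational cost prevented the full study.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D series, using the
    standard O(len(a)*len(b)) dynamic program with squared local costs,
    so that the plain (unwarped) Euclidean distance is an upper bound."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

# A series and a phase-shifted copy: DTW can realign the shift, so its
# distance never exceeds (and here is below) the Euclidean one.
t = np.linspace(0, 2 * np.pi, 50)
s1, s2 = np.sin(t), np.sin(t - 0.5)
print(dtw_distance(s1, s2), np.linalg.norm(s1 - s2))
```

Because the unwarped diagonal path is itself a valid alignment, this squared-cost DTW can never exceed the plain Euclidean distance, and on misaligned series such as the shifted sine above it is smaller; this tolerance to temporal misalignment is what makes it attractive for physiological signals that evolve at different rates.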
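One way to read the normalization item is sketched below: instead of cohort-wide min-max scaling, each patient's series is expressed as a relative deviation from that patient's own early "healthy" baseline. The baseline window and the synthetic heart-rate-like values are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def baseline_normalize(series, baseline_window=6):
    """Scale a patient's series by that patient's own early baseline
    (median of the first `baseline_window` samples), expressing values
    as relative deviation from the patient-specific baseline."""
    baseline = np.median(series[:baseline_window])
    return (series - baseline) / (abs(baseline) + 1e-12)

def minmax_normalize(series, lo, hi):
    """Population-level min-max scaling using cohort-wide bounds."""
    return (series - lo) / (hi - lo)

# Two hypothetical patients whose usual values differ: the same absolute
# reading at the end means a much larger deviation for the first patient.
p1 = np.array([60, 62, 61, 60, 63, 61, 95, 110.0])
p2 = np.array([90, 92, 91, 90, 93, 91, 95, 110.0])
print(baseline_normalize(p1)[-1], baseline_normalize(p2)[-1])
```

Under this scheme the final value of 110 is a large relative excursion for the first patient but a modest one for the second, which is exactly the distinction that cohort-wide min-max scaling cannot make.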
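For the clustering item, a minimal sketch of the Gustafson-Kessel algorithm is given below, highlighting the step that distinguishes it from FCM: each cluster adapts its own norm-inducing matrix from a fuzzy covariance, so clusters may be ellipsoidal rather than spherical. This is a generic illustration (fixed iteration count, simple regularization), not a production implementation.

```python
import numpy as np

def gustafson_kessel(X, c=2, m=2.0, n_iter=50, seed=0, reg=1e-6):
    """Minimal Gustafson-Kessel clustering sketch. Unlike FCM's fixed
    Euclidean norm, each cluster i uses A_i = det(F_i)^(1/n) * inv(F_i),
    built from its fuzzy covariance F_i."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    U = rng.random((c, N))
    U /= U.sum(axis=0)                                  # columns sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)    # cluster prototypes
        D = np.empty((c, N))
        for i in range(c):
            diff = X - V[i]
            F = (Um[i, :, None] * diff).T @ diff / Um[i].sum()  # fuzzy covariance
            F += reg * np.eye(n)                        # guard against singularity
            A = (np.linalg.det(F) ** (1.0 / n)) * np.linalg.inv(F)
            D[i] = np.einsum('kj,jl,kl->k', diff, A, diff)      # squared GK norm
        D = np.fmax(D, 1e-12)
        U = 1.0 / (D ** (1.0 / (m - 1)) * np.sum((1.0 / D) ** (1.0 / (m - 1)), axis=0))
    return U, V

# Two elongated, well-separated synthetic clusters with different orientations:
rng = np.random.default_rng(1)
a = rng.normal([0, 0], [1.0, 0.2], size=(100, 2))
b = rng.normal([5, 5], [0.2, 1.0], size=(100, 2))
U, V = gustafson_kessel(np.vstack([a, b]), c=2)
print(np.round(V, 1))
```

The adaptive norm is what lets GK fit the flat and the upright ellipsoid with the same code path; FCM with a fixed Euclidean norm has no such per-cluster shape parameter.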
Bibliography
[1] MIMIC II V2.6, 2011 (accessed June 6, 2015). URL http://mimic.physionet.org/schema/
latest/.
[2] D. C. Angus and T. van der Poll. Severe sepsis and septic shock. New England Journal of Medicine,
369(9):840–851, 2013.
[3] D. C. Angus, W. T. Linde-Zwirble, J. Lidicker, G. Clermont, J. Carcillo, and M. R. Pinsky. Epidemiol-
ogy of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of
care. Critical care medicine, 29(7):1303–1310, 2001.
[4] R. Babuska. Fuzzy systems, modeling and identification. 2001.
[5] A. M. Bensaid, L. O. Hall, J. C. Bezdek, L. P. Clarke, M. L. Silbiger, J. A. Arrington, and R. F.
Murtagh. Validity-guided (re) clustering with applications to image segmentation. Fuzzy Systems,
IEEE Transactions on, 4(2):112–123, 1996.
[6] J. C. Bezdek. Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Pub-
lishers, 1981.
[7] J. C. Bezdek, R. Ehrlich, and W. Full. FCM: The fuzzy c-means clustering algorithm. Computers &
Geosciences, 10(2):191–203, 1984.
[8] R. Brause, F. Hamker, and J. Paetz. Septic shock diagnosis by neural networks and rule based
systems. In Computational intelligence processing in medical diagnosis, pages 323–356. Springer,
2002.
[9] P. Bromage. Vasopressors. Canadian Journal of Anesthesia/Journal canadien d’anesthesie, 7(3):
310–316, 1960.
[10] Y.-W. Chen and C.-J. Lin. Combining SVMs with various feature selection strategies. In I. Guyon,
S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction: Foundations and Applications,
Studies in Fuzziness and Soft Computing, pages 315–324. Springer-Verlag, 2006.
[11] K. J. Cios and G. W. Moore. Uniqueness of medical data mining. Artificial intelligence in medicine,
26(1):1–24, 2002.
[12] F. Cismondi, A. S. Fialho, S. M. Vieira, J. M. Sousa, S. R. Reti, M. D. Howell, and S. N. Finkel-
stein. Computational intelligence methods for processing misaligned, unevenly sampled time se-
ries containing missing data. In Computational Intelligence and Data Mining (CIDM), 2011 IEEE
Symposium on, pages 224–231. IEEE, 2011.
[13] F. Cismondi, A. L. Horn, A. S. Fialho, S. M. Vieira, S. R. Reti, J. M. Sousa, and S. Finkelstein.
Multi-stage modeling using fuzzy multi-criteria feature selection to improve survival prediction of ICU
septic shock patients. Expert Systems with Applications, 39(16):12332–12339, 2012.
[14] T. D. Correa, M. Vuda, A. R. Blaser, J. Takala, S. Djafarzadeh, M. W. Dunser, E. Silva, M. Lensch,
L. Wilkens, and S. M. Jakob. Effect of treatment delay on disease severity and need for resuscitation
in porcine fecal peritonitis. Critical care medicine, 40(10):2841–2849, 2012.
[15] R. P. Dellinger, M. M. Levy, A. Rhodes, D. Annane, H. Gerlach, S. M. Opal, J. E. Sevransky, C. L.
Sprung, I. S. Douglas, R. Jaeschke, et al. Surviving sepsis campaign: international guidelines for
management of severe sepsis and septic shock, 2012. Intensive care medicine, 39(2):165–228,
2013.
[16] R. P. Dellinger et al. Cardiovascular management of septic shock. Critical care medicine, 31(3):
946–955, 2003.
[17] V. Y. Dombrovskiy, A. A. Martin, J. Sunderram, and H. L. Paz. Rapid increase in hospitalization and
mortality rates for severe sepsis in the United States: a trend analysis from 1993 to 2003. Critical
care medicine, 35(5):1244–1250, 2007.
[18] C. M. Dunham, J. H. Siegel, L. Weireter, M. Fabian, S. Goodarzi, P. Guadalupi, L. Gettings,
S. E. Linberg, and T. C. Vary. Oxygen debt and metabolic acidemia as quantitative
predictors of mortality and the severity of the ischemic insult in hemorrhagic shock. Critical care
medicine, 19(2):231–243, 1991.
[19] A. P. Engelbrecht. Computational intelligence: an introduction. John Wiley & Sons, 2007.
[20] T. Fawcett. An introduction to ROC analysis. Pattern recognition letters, 27(8):861–874, 2006.
[21] M. P. Fernandes, C. F. Silva, S. M. Vieira, and J. Sousa. Multimodeling for the prediction of patient
readmissions in intensive care units. In Fuzzy Systems (FUZZ-IEEE), 2014 IEEE International
Conference on, pages 1837–1842. IEEE, 2014.
[22] M. C. Ferreira, C. M. Salgado, J. L. Viegas, H. Schafer, C. S. Azevedo, S. M. Vieira, and J. M. C.
Sousa. Fuzzy modeling based on mixed fuzzy clustering for health care applications. In Fuzzy
Systems (FUZZ-IEEE), 2015 IEEE International Conference on. IEEE, 2015.
[23] A. Fialho, L. Celi, F. Cismondi, S. Vieira, S. Reti, J. Sousa, S. Finkelstein, et al. Disease-based
modeling to predict fluid response in intensive care units. Methods Inf Med, 52(6):494–502, 2013.
[24] A. S. Fialho, F. Cismondi, S. M. Vieira, J. Sousa, S. R. Reti, L. Celi, M. D. Howell, S. N. Finkelstein,
et al. Fuzzy modeling to predict administration of vasopressors in intensive care unit patients. In
Fuzzy Systems (FUZZ), 2011 IEEE International Conference on, pages 2296–2303. IEEE, 2011.
[25] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases: An
overview. AI magazine, 13(3):57, 1992.
[26] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical inference and data mining. Com-
munications of the ACM, 39(11):35–41, 1996.
[27] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine
Learning Research, 3:1157–1182, 2003.
[28] J. Han, M. Kamber, and J. Pei. Data mining, southeast asia edition: Concepts and techniques.
Morgan kaufmann, 2006.
[29] J. Han, M. Kamber, and J. Pei. Data mining: concepts and techniques: concepts and techniques.
Elsevier, 2011.
[30] D. J. Hand, H. Mannila, and P. Smyth. Principles of data mining. MIT press, 2001.
[31] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating char-
acteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
[32] S. Herget-Rosenthal, F. Saner, and L. S. Chawla. Approach to hemodynamic shock and vasopres-
sors. Clinical Journal of the American Society of Nephrology, 3(2):546–553, 2008.
[33] S. M. Hollenberg. Inotrope and vasopressor therapy of septic shock. Critical care nursing clinics of
North America, 23(1):127–148, 2011.
[34] A. L. Horn, F. Cismondi, A. S. Fialho, S. M. Vieira, J. M. Sousa, S. Reti, M. Howell, and S. Finkel-
stein. Multi-objective performance evaluation using fuzzy criteria: Increasing sensitivity prediction
for outcome of septic shock patients. In Proceedings of 18th world congress of the international
federation of automatic control (IFAC), volume 18, pages 14042–14047, 2011.
[35] H. Izakian, W. Pedrycz, and I. Jamal. Clustering spatiotemporal data: An augmented fuzzy c-
means. Fuzzy Systems, IEEE Transactions on, 21(5):855–868, 2013.
[36] A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.
[37] J.-S. R. Jang, C.-T. Sun, and E. Mizutani. Neuro-fuzzy and soft computing; a computational ap-
proach to learning and machine intelligence. 1997.
[38] U. Kaymak and M. Setnes. Extended fuzzy clustering algorithms. ERIM Report Series Reference
No. ERS-2001-51-LIS, 2000.
[39] D.-W. Kim, K. H. Lee, and D. Lee. On cluster validity index for estimation of the optimal number of
fuzzy clusters. Pattern Recognition, 37(10):2009–2025, 2004.
[40] T. W. Liao. Clustering of time series data—a survey. Pattern recognition, 38(11):1857–1874, 2005.
[41] J.-H. Lin and P. J. Haug. Data preparation framework for preprocessing clinical data in data mining.
In AMIA Annual Symposium Proceedings, volume 2006, page 489. American Medical Informatics
Association, 2006.
[42] W. T. Linde-Zwirble and D. C. Angus. Severe sepsis epidemiology: sampling, selection, and society.
Critical Care, 8(4):222, 2004.
[43] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu. Understanding of internal clustering validation measures.
In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 911–916. IEEE, 2010.
[44] F. J. Marques, A. Moutinho, S. M. Vieira, and J. M. Sousa. Preprocessing of clinical databases to
improve classification accuracy of patient diagnosis. In World Congress, volume 18, pages 14121–
14126, 2011.
[45] G. S. Martin, D. M. Mannino, S. Eaton, and M. Moss. The epidemiology of sepsis in the United
States from 1979 through 2000. New England Journal of Medicine, 348(16):1546–1554, 2003.
[46] U. Maulik and S. Bandyopadhyay. Performance evaluation of some clustering algorithms and valid-
ity indices. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(12):1650–1654,
2002.
[47] T. Pang-Ning, M. Steinbach, V. Kumar, et al. Introduction to data mining. In Library of Congress,
page 74, 2006.
[48] W. Pedrycz. Fuzzy multimodels. Fuzzy Systems, IEEE Transactions on, 4(2):139–148, 1996.
[49] R. Polikar. Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE, 6
(3):21–45, 2006.
[50] E. Rivers, B. Nguyen, S. Havstad, J. Ressler, A. Muzzin, B. Knoblich, E. Peterson, and M. Tom-
lanovich. Early goal-directed therapy in the treatment of severe sepsis and septic shock. New
England Journal of Medicine, 345(19):1368–1377, 2001.
[51] M. Saeed, C. Lieu, G. Raber, and R. Mark. MIMIC II: a massive temporal ICU patient database to
support research in intelligent patient monitoring. In Computers in Cardiology, 2002, pages 641–
644. IEEE, 2002.
[52] C. M. Salgado, C. S. Azevedo, J. Garibaldi, and S. M. Vieira. Ensemble fuzzy classifiers design
using weighted aggregation criteria. In Fuzzy Systems (FUZZ-IEEE), 2015 IEEE International Con-
ference on. IEEE, 2015.
[53] N. Sanchez-Marono, A. Alonso-Betanzos, and M. Tombilla-Sanroman. Filter methods for feature
selection–a comparative study. In Intelligent Data Engineering and Automated Learning-IDEAL
2007, pages 178–187. Springer, 2007.
[54] J. M. Sousa and U. Kaymak. Fuzzy decision making in modeling and control, volume 27. World
Scientific, 2002.
[55] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and
control. Systems, Man and Cybernetics, IEEE Transactions on, (1):116–132, 1985.
[56] J. Wu, H. Xiong, and J. Chen. Adapting the right measures for K-means clustering. In Proceedings
of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
877–886. ACM, 2009.
[57] X. L. Xie and G. Beni. A validity measure for fuzzy clustering. IEEE Transactions on pattern analysis
and machine intelligence, 13(8):841–847, 1991.
[58] S. C. Yusta. Different metaheuristic strategies to solve the feature selection problem. Pattern
Recognition Letters, 30(5):525–534, 2009.
[59] L. A. Zadeh. The concept of a linguistic variable and its application to approximate reasoning—i.
Information sciences, 8(3):199–249, 1975.
Appendix A
Outliers - Expert Knowledge versus
Inter-quartile method
Figure A.1: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 1 to 4.
Figure A.2: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 5 to 8.
Figure A.3: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 9 to 12.
Figure A.4: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 13 to 16.
Figure A.5: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 17 to 20.
Figure A.6: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 21 to 24.
Figure A.7: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 25 to 28.
Figure A.8: Dispersion of data points and boundaries given by expert knowledge plus visual inspection (green dashed line) and the inter-quartile (black dashed line) method for the input variables 29 to 32.
Appendix B
Removing Deceased Patients
Figure B.1: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 1-6. Each point corresponds to the mean value of a 2-hour window.
Figure B.2: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 7-12. Each point corresponds to the mean value of a 2-hour window.
Figure B.3: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 13-18. Each point corresponds to the mean value of a 2-hour window.
Figure B.4: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 19-24. Each point corresponds to the mean value of a 2-hour window.
Figure B.5: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 25-30. Each point corresponds to the mean value of a 2-hour window.
Figure B.6: Temporal evolution of class 1, class 0 and class 0 deceased patients for variables 30-32. Each point corresponds to the mean value of a 2-hour window.
Figure B.7: Related to the figures above. Left figure: number of measurements considered in each 2-hour time window. Right figure: number of patients taken into account in each 2-hour time window.
Appendix C
ID’s of the same variable
Table C.1: List of the IDs associated with each variable that were grouped into one.

Variable Name | Primary ID | Secondary ID
WBC | 861 | 1542, 1127, 4200
Arterial Base Excess | 776 | 74, 3740, 4196
Arterial pH / pH | 780 | 865, 1126, 4202, 4753
Arterial PaO2 | 779 | 1155
Arterial PaCO2 | 778 | 1156
Calcium | 786 | 1522
Central Venous Pressure (CVP) | 113 | 1103
Ionized Calcium | 816 | 1350, 4453, 8177, 8325
INR | 815 | 1530
Lactate or Lactic Acid | 818 | 1531
Phosphorous | 827 | 1534
PTT | 825 | 1533, 739
NBP | 455 | 751, 1149
Magnesium | 821 | 1532
Sodium | 837 | 1536, 3803
RBC | 833 | 3797, 4197
Chloride | 788 | 1523
Platelets | 828 | 3790
BUN | 781 | 1162
Creatinine | 791 | 3750, 1525
Glucose | 811 | 1529
Potassium | 829 | 1535, 3792
Hematocrit | 813 | 3761
Arterial BP | 51 | 6, 6701, 6926
Temperature C | 676 | 677
Appendix D
Clustering Validation Analysis -
Methods
The next tables show the results of the clustering validation methods for each of the datasets: PAN, PNM, BOTH
and ALL.
Table D.1: Clustering validation indexes score for dataset PAN performed 10 times with different partitions.

m | Index | c=2 | c=3 | c=4 | c=5 | c=6
1.4 | PC | 7,71E-01 ± 1,17E-16 | 6,57E-01 ± 0,00E+00 | 5,85E-01 ± 1,17E-16 | 5,35E-01 ± 0,00E+00 | 4,98E-01 ± 5,85E-17
1.4 | CE | 6,62E-01 ± 0,00E+00 | 1,06E+00 ± 2,34E-16 | 1,35E+00 ± 0,00E+00 | 1,58E+00 ± 2,34E-16 | 1,76E+00 ± 2,34E-16
1.4 | SC | 5,84E+01 ± 1,50E-14 | 6,85E+01 ± 1,50E-14 | 8,40E+01 ± 1,50E-14 | 9,84E+01 ± 0,00E+00 | 1,11E+02 ± 0,00E+00
1.4 | S | 1,14E-02 ± 1,83E-18 | 1,60E-02 ± 3,66E-18 | 1,96E-02 ± 0,00E+00 | 2,26E-02 ± 0,00E+00 | 2,50E-02 ± 3,66E-18
1.4 | XB | 1,58E+00 ± 2,34E-16 | 1,35E+00 ± 0,00E+00 | 1,20E+00 ± 2,34E-16 | 1,10E+00 ± 0,00E+00 | 1,02E+00 ± 2,34E-16
1.4 | DI | 1,91E-02 ± 3,66E-18 | 1,84E-02 ± 0,00E+00 | 1,67E-02 ± 3,66E-18 | 4,76E-03 ± 9,14E-19 | 1,95E-02 ± 0,00E+00
1.4 | ADI | 4,78E-02 ± 7,31E-18 | 3,98E-02 ± 7,31E-18 | 5,33E-03 ± 0,00E+00 | 6,42E-03 ± 9,14E-19 | 4,80E-03 ± 9,14E-19
1.7 | PC | 6,16E-01 ± 1,17E-16 | 4,63E-01 ± 0,00E+00 | 3,79E-01 ± 5,85E-17 | 3,24E-01 ± 0,00E+00 | 2,85E-01 ± 5,85E-17
1.7 | CE | 6,93E-01 ± 0,00E+00 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 0,00E+00 | 1,79E+00 ± 2,34E-16
1.7 | SC | 4,50E+09 ± 1,01E-06 | 2,73E+09 ± 5,03E-07 | 2,97E+09 ± 5,03E-07 | 1,13E+09 ± 0,00E+00 | 8,32E+08 ± 0,00E+00
1.7 | S | 8,79E+05 ± 1,23E-10 | 8,88E+05 ± 0,00E+00 | 8,69E+05 ± 0,00E+00 | 3,25E+05 ± 6,14E-11 | 2,45E+05 ± 3,07E-11
1.7 | XB | 1,25E+00 ± 2,34E-16 | 9,42E-01 ± 0,00E+00 | 7,70E-01 ± 0,00E+00 | 6,59E-01 ± 1,17E-16 | 5,80E-01 ± 1,17E-16
1.7 | DI | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18
1.7 | ADI | 6,59E-02 ± 1,46E-17 | 6,59E-02 ± 1,46E-17 | 6,59E-02 ± 1,46E-17 | 1,15E-02 ± 1,83E-18 | 9,65E-03 ± 1,83E-18
2.0 | PC | 5,00E-01 ± 0,00E+00 | 3,33E-01 ± 5,85E-17 | 2,50E-01 ± 5,85E-17 | 2,00E-01 ± 0,00E+00 | 1,67E-01 ± 0,00E+00
2.0 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 4,68E-16 | 1,79E+00 ± 2,34E-16
2.0 | SC | 7,06E+09 ± 2,01E-06 | 3,47E+09 ± 0,00E+00 | 4,33E+09 ± 1,01E-06 | 2,34E+09 ± 0,00E+00 | 1,99E+09 ± 5,03E-07
2.0 | S | 1,38E+06 ± 2,45E-10 | 1,13E+06 ± 2,45E-10 | 1,27E+06 ± 2,45E-10 | 6,76E+05 ± 1,23E-10 | 5,86E+05 ± 1,23E-10
2.0 | XB | 1,02E+00 ± 0,00E+00 | 6,78E-01 ± 1,17E-16 | 5,08E-01 ± 0,00E+00 | 4,07E-01 ± 1,17E-16 | 3,39E-01 ± 5,85E-17
2.0 | DI | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18 | 1,91E-02 ± 3,66E-18
2.0 | ADI | 6,59E-02 ± 0,00E+00 | 6,59E-02 ± 0,00E+00 | 1,15E-02 ± 0,00E+00 | 1,15E-02 ± 1,83E-18 | 9,65E-03 ± 0,00E+00
Table D.2: Clustering validation indexes score for dataset PNM performed 10 times with different partitions.

m | Index | c=2 | c=3 | c=4 | c=5 | c=6
1.4 | PC | 7,60E-01 ± 1,17E-16 | 6,47E-01 ± 1,17E-16 | 5,78E-01 ± 0,00E+00 | 5,39E-01 ± 1,17E-16 | 5,08E-01 ± 1,17E-16
1.4 | CE | 6,89E-01 ± 0,00E+00 | 1,09E+00 ± 0,00E+00 | 1,37E+00 ± 2,34E-16 | 1,56E+00 ± 2,34E-16 | 1,72E+00 ± 2,34E-16
1.4 | SC | 4,93E+02 ± 0,00E+00 | 4,08E+02 ± 5,99E-14 | 2,42E+02 ± 3,00E-14 | 6,59E+01 ± 1,50E-14 | 4,09E+01 ± 7,49E-15
1.4 | S | 3,01E-02 ± 0,00E+00 | 3,01E-02 ± 0,00E+00 | 1,78E-02 ± 3,66E-18 | 4,80E-03 ± 9,14E-19 | 2,93E-03 ± 4,57E-19
1.4 | XB | 1,82E+00 ± 4,68E-16 | 1,55E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,29E+00 ± 2,34E-16 | 1,22E+00 ± 2,34E-16
1.4 | DI | 6,84E-03 ± 9,14E-19 | 1,80E-02 ± 0,00E+00 | 1,09E-02 ± 1,83E-18 | 1,38E-02 ± 0,00E+00 | 6,27E-03 ± 9,14E-19
1.4 | ADI | 3,49E-03 ± 9,14E-19 | 8,43E-03 ± 1,83E-18 | 4,46E-03 ± 9,14E-19 | 3,97E-04 ± 0,00E+00 | 5,01E-04 ± 1,14E-19
1.7 | PC | 6,16E-01 ± 0,00E+00 | 4,63E-01 ± 0,00E+00 | 3,79E-01 ± 0,00E+00 | 3,24E-01 ± 5,85E-17 | 2,85E-01 ± 5,85E-17
1.7 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 4,68E-16 | 1,79E+00 ± 2,34E-16
1.7 | SC | 1,58E+10 ± 0,00E+00 | 5,02E+09 ± 1,01E-06 | 1,91E+10 ± 0,00E+00 | 5,08E+09 ± 1,01E-06 | 2,27E+09 ± 5,03E-07
1.7 | S | 9,66E+05 ± 0,00E+00 | 4,14E+05 ± 6,14E-11 | 1,50E+06 ± 2,45E-10 | 4,84E+05 ± 0,00E+00 | 2,28E+05 ± 0,00E+00
1.7 | XB | 1,47E+00 ± 0,00E+00 | 1,10E+00 ± 0,00E+00 | 9,02E-01 ± 0,00E+00 | 7,72E-01 ± 0,00E+00 | 6,79E-01 ± 1,17E-16
1.7 | DI | 6,84E-03 ± 9,14E-19 | 6,84E-03 ± 9,14E-19 | 6,84E-03 ± 9,14E-19 | 6,84E-03 ± 9,14E-19 | 6,84E-03 ± 9,14E-19
1.7 | ADI | 4,19E-03 ± 0,00E+00 | 4,19E-03 ± 9,14E-19 | 4,19E-03 ± 9,14E-19 | 4,19E-03 ± 0,00E+00 | 4,19E-03 ± 9,14E-19
2.0 | PC | 5,00E-01 ± 1,17E-16 | 3,33E-01 ± 5,85E-17 | 2,50E-01 ± 0,00E+00 | 2,00E-01 ± 2,93E-17 | 1,67E-01 ± 2,93E-17
2.0 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 0,00E+00 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 2,34E-16 | 1,79E+00 ± 0,00E+00
2.0 | SC | 1,68E+10 ± 0,00E+00 | 1,46E+10 ± 0,00E+00 | 2,00E+10 ± 4,02E-06 | 9,47E+09 ± 0,00E+00 | 2,62E+09 ± 5,03E-07
2.0 | S | 1,02E+06 ± 1,23E-10 | 1,19E+06 ± 0,00E+00 | 1,77E+06 ± 2,45E-10 | 9,02E+05 ± 1,23E-10 | 2,65E+05 ± 6,14E-11
2.0 | XB | 1,19E+00 ± 2,34E-16 | 7,94E-01 ± 0,00E+00 | 5,95E-01 ± 1,17E-16 | 4,76E-01 ± 1,17E-16 | 3,97E-01 ± 5,85E-17
2.0 | DI | 6,84E-03 ± 9,14E-19 | 1,48E-02 ± 3,66E-18 | 1,63E-02 ± 3,66E-18 | 1,63E-02 ± 3,66E-18 | 6,84E-03 ± 9,14E-19
2.0 | ADI | 4,19E-03 ± 9,14E-19 | 4,19E-03 ± 0,00E+00 | 4,19E-03 ± 9,14E-19 | 4,19E-03 ± 0,00E+00 | 4,19E-03 ± 9,14E-19
Table D.3: Clustering validation indexes score for dataset BOTH performed 10 times with different partitions.

m | Index | c=2 | c=3 | c=4 | c=5 | c=6
1.4 | PC | 7,61E-01 ± 1,17E-16 | 6,48E-01 ± 1,17E-16 | 5,79E-01 ± 0,00E+00 | 5,31E-01 ± 1,17E-16 | 4,96E-01 ± 0,00E+00
1.4 | CE | 6,85E-01 ± 1,17E-16 | 1,09E+00 ± 0,00E+00 | 1,37E+00 ± 2,34E-16 | 1,59E+00 ± 2,34E-16 | 1,76E+00 ± 0,00E+00
1.4 | SC | 2,62E+02 ± 5,99E-14 | 2,56E+02 ± 5,99E-14 | 2,33E+02 ± 0,00E+00 | 1,90E+02 ± 3,00E-14 | 1,39E+02 ± 3,00E-14
1.4 | S | 1,32E-02 ± 0,00E+00 | 1,57E-02 ± 0,00E+00 | 1,42E-02 ± 1,83E-18 | 1,14E-02 ± 0,00E+00 | 8,18E-03 ± 1,83E-18
1.4 | XB | 1,76E+00 ± 4,68E-16 | 1,49E+00 ± 2,34E-16 | 1,33E+00 ± 2,34E-16 | 1,22E+00 ± 2,34E-16 | 1,14E+00 ± 2,34E-16
1.4 | DI | 7,15E-03 ± 0,00E+00 | 1,04E-02 ± 1,83E-18 | 8,44E-03 ± 1,83E-18 | 1,43E-02 ± 0,00E+00 | 1,43E-02 ± 0,00E+00
1.4 | ADI | 7,86E-02 ± 1,46E-17 | 1,15E-02 ± 1,83E-18 | 6,71E-03 ± 9,14E-19 | 5,29E-03 ± 9,14E-19 | 1,19E-03 ± 2,29E-19
1.7 | PC | 6,16E-01 ± 1,17E-16 | 4,63E-01 ± 5,85E-17 | 3,79E-01 ± 5,85E-17 | 3,24E-01 ± 5,85E-17 | 2,85E-01 ± 5,85E-17
1.7 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 2,34E-16 | 1,79E+00 ± 2,34E-16
1.7 | SC | 1,07E+10 ± 2,01E-06 | 8,20E+09 ± 2,01E-06 | 9,31E+09 ± 0,00E+00 | 6,62E+09 ± 0,00E+00 | 2,05E+09 ± 2,51E-07
1.7 | S | 5,39E+05 ± 0,00E+00 | 5,99E+05 ± 0,00E+00 | 6,77E+05 ± 0,00E+00 | 5,47E+05 ± 0,00E+00 | 1,70E+05 ± 3,07E-11
1.7 | XB | 1,43E+00 ± 0,00E+00 | 1,08E+00 ± 0,00E+00 | 8,83E-01 ± 0,00E+00 | 7,55E-01 ± 1,17E-16 | 6,65E-01 ± 1,17E-16
1.7 | DI | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18
1.7 | ADI | 8,60E-02 ± 0,00E+00 | 1,54E-02 ± 0,00E+00 | 1,28E-02 ± 1,83E-18 | 1,28E-02 ± 1,83E-18 | 1,22E-02 ± 1,83E-18
2.0 | PC | 5,00E-01 ± 1,17E-16 | 3,33E-01 ± 5,85E-17 | 2,50E-01 ± 5,85E-17 | 2,00E-01 ± 2,93E-17 | 1,67E-01 ± 2,93E-17
2.0 | CE | 6,93E-01 ± 0,00E+00 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 2,34E-16 | 1,79E+00 ± 4,68E-16
2.0 | SC | 1,34E+10 ± 0,00E+00 | 1,36E+10 ± 0,00E+00 | 1,10E+10 ± 2,01E-06 | 6,97E+09 ± 2,01E-06 | 2,71E+09 ± 5,03E-07
2.0 | S | 6,74E+05 ± 1,23E-10 | 9,85E+05 ± 0,00E+00 | 8,12E+05 ± 1,23E-10 | 5,77E+05 ± 1,23E-10 | 2,25E+05 ± 3,07E-11
2.0 | XB | 1,17E+00 ± 2,34E-16 | 7,77E-01 ± 1,17E-16 | 5,83E-01 ± 1,17E-16 | 4,66E-01 ± 0,00E+00 | 3,88E-01 ± 5,85E-17
2.0 | DI | 1,37E-02 ± 1,83E-18 | 1,37E-02 ± 1,83E-18 | 7,37E-03 ± 9,14E-19 | 1,37E-02 ± 1,83E-18 | 7,15E-03 ± 0,00E+00
2.0 | ADI | 8,60E-02 ± 0,00E+00 | 1,54E-02 ± 3,66E-18 | 1,28E-02 ± 1,83E-18 | 1,28E-02 ± 1,83E-18 | 8,09E-03 ± 1,83E-18
Table D.4: Clustering validation indexes score for dataset ALL performed 10 times with different partitions.

m | Index | c=2 | c=3 | c=4 | c=5 | c=6
1.4 | PC | 7,58E-01 ± 1,17E-16 | 6,44E-01 ± 0,00E+00 | 5,74E-01 ± 1,17E-16 | 5,25E-01 ± 1,17E-16 | 4,88E-01 ± 1,17E-16
1.4 | CE | 6,93E-01 ± 1,17E-16 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 4,68E-16 | 1,79E+00 ± 0,00E+00
1.4 | SC | 8,53E+06 ± 1,96E-09 | 9,19E+06 ± 1,96E-09 | 5,95E+06 ± 9,82E-10 | 7,57E+06 ± 9,82E-10 | 3,25E+06 ± 4,91E-10
1.4 | S | 1,64E+02 ± 3,00E-14 | 2,67E+02 ± 5,99E-14 | 1,70E+02 ± 3,00E-14 | 2,15E+02 ± 5,99E-14 | 9,37E+01 ± 1,50E-14
1.4 | XB | 2,12E+00 ± 4,68E-16 | 1,80E+00 ± 4,68E-16 | 1,61E+00 ± 4,68E-16 | 1,47E+00 ± 2,34E-16 | 1,37E+00 ± 2,34E-16
1.4 | DI | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00
1.4 | ADI | 5,19E-03 ± 9,14E-19 | 5,18E-03 ± 9,14E-19 | 5,17E-03 ± 0,00E+00 | 5,16E-03 ± 9,14E-19 | 5,14E-03 ± 9,14E-19
1.7 | PC | 6,16E-01 ± 1,17E-16 | 4,63E-01 ± 5,85E-17 | 3,79E-01 ± 5,85E-17 | 3,24E-01 ± 5,85E-17 | 2,85E-01 ± 5,85E-17
1.7 | CE | 6,93E-01 ± 0,00E+00 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 0,00E+00 | 1,79E+00 ± 0,00E+00
1.7 | SC | 1,17E+10 ± 2,01E-06 | 1,52E+10 ± 0,00E+00 | 1,60E+10 ± 0,00E+00 | 6,45E+09 ± 0,00E+00 | 2,48E+09 ± 5,03E-07
1.7 | S | 2,26E+05 ± 6,14E-11 | 4,77E+05 ± 6,14E-11 | 4,78E+05 ± 1,23E-10 | 2,12E+05 ± 0,00E+00 | 7,24E+04 ± 0,00E+00
1.7 | XB | 1,72E+00 ± 2,34E-16 | 1,30E+00 ± 0,00E+00 | 1,06E+00 ± 0,00E+00 | 9,07E-01 ± 1,17E-16 | 7,99E-01 ± 1,17E-16
1.7 | DI | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00
1.7 | ADI | 5,22E-03 ± 0,00E+00 | 5,22E-03 ± 9,14E-19 | 5,22E-03 ± 9,14E-19 | 5,22E-03 ± 9,14E-19 | 5,22E-03 ± 9,14E-19
2.0 | PC | 5,00E-01 ± 1,17E-16 | 3,33E-01 ± 5,85E-17 | 2,50E-01 ± 5,85E-17 | 2,00E-01 ± 5,85E-17 | 1,67E-01 ± 0,00E+00
2.0 | CE | 6,93E-01 ± 0,00E+00 | 1,10E+00 ± 2,34E-16 | 1,39E+00 ± 2,34E-16 | 1,61E+00 ± 2,34E-16 | 1,79E+00 ± 2,34E-16
2.0 | SC | 4,12E+10 ± 0,00E+00 | 1,13E+10 ± 2,01E-06 | 1,50E+10 ± 0,00E+00 | 6,45E+09 ± 0,00E+00 | 3,27E+09 ± 0,00E+00
2.0 | S | 7,93E+05 ± 1,23E-10 | 3,56E+05 ± 6,14E-11 | 4,51E+05 ± 6,14E-11 | 2,12E+05 ± 0,00E+00 | 9,68E+04 ± 1,53E-11
2.0 | XB | 1,40E+00 ± 0,00E+00 | 9,33E-01 ± 1,17E-16 | 7,00E-01 ± 1,17E-16 | 5,60E-01 ± 0,00E+00 | 4,67E-01 ± 5,85E-17
2.0 | DI | 1,03E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 1,04E-02 ± 0,00E+00 | 1,03E-02 ± 0,00E+00 | 8,98E-03 ± 1,83E-18
2.0 | ADI | 5,22E-03 ± 0,00E+00 | 5,22E-03 ± 9,14E-19 | 5,22E-03 ± 0,00E+00 | 5,22E-03 ± 0,00E+00 | 5,22E-03 ± 9,14E-19
Appendix E
Clustering Validation Analysis -
Distributions
In order to validate the number of clusters in the cases where the clustering validation methods presented
in 2.6.2 gave no clear evidence, a study based on the distribution of the data along the clusters and on the
inter-cluster distances was conducted, delivering the following results (comments on these can be found in
section 4.1.1):
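The inter-cluster distances reported in the tables of this appendix can be computed directly from the matrix of cluster centroids; a minimal sketch, using a hypothetical centroid matrix, is:

```python
import numpy as np

def centroid_distance_matrix(V):
    """Pairwise Euclidean distances between cluster centroids
    (rows of V); the result is symmetric with a zero diagonal."""
    diff = V[:, None, :] - V[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical 2-cluster centroid matrix (rows = centroids).
V = np.array([[0.10, 0.20, 0.30],
              [0.12, 0.24, 0.31]])
print(centroid_distance_matrix(V))
```

Small off-diagonal values, as in several of the tables below, indicate centroids that nearly coincide, which is why the distribution of the data over the clusters was inspected as well.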
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure E.1: Division in clusters for the dataset PAN.
Table E.1: Euclidean distances between clusters.
PAN cluster 1 cluster 2
cluster 1 0,00E+00 1,66E-01
cluster 2 1,66E-01 0,00E+00
Table E.2: Euclidean distances between clusters.
PAN cluster 1 cluster 2 cluster 3
cluster 1 0,00E+00 1,84E-01 3,94E-04
cluster 2 1,84E-01 0,00E+00 1,85E-01
cluster 3 3,94E-04 1,85E-01 0,00E+00
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure E.2: Division in clusters for the dataset PNM.
Table E.3: Euclidean distances between clusters.

PNM cluster 1 cluster 2

cluster 1 0,00E+00 4,93E-04

cluster 2 4,93E-04 0,00E+00
Table E.4: Euclidean distances between clusters.

PNM cluster 1 cluster 2 cluster 3

cluster 1 0,00E+00 4,65E-05 1,17E-03

cluster 2 4,65E-05 0,00E+00 1,13E-03

cluster 3 1,17E-03 1,13E-03 0,00E+00
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure E.3: Division in clusters for the dataset BOTH.
Table E.5: Euclidean distances between clusters.

BOTH cluster 1 cluster 2

cluster 1 0,00E+00 1,65E-04

cluster 2 1,65E-04 0,00E+00
Table E.6: Euclidean distances between clusters.

BOTH cluster 1 cluster 2 cluster 3

cluster 1 0,00E+00 2,36E-02 1,90E-02

cluster 2 2,36E-02 0,00E+00 4,65E-03

cluster 3 1,90E-02 4,65E-03 0,00E+00
(a) Data divided into two clusters based on the output. (b) Data divided into two clusters based on the disease.
(c) Data divided into three clusters based on the output. (d) Data divided into three clusters based on the disease.
Figure E.4: Division in clusters for the dataset ALL.
Table E.7: Euclidean distances between clusters.

ALL cluster 1 cluster 2

cluster 1 0,00E+00 4,30E-04

cluster 2 4,30E-04 0,00E+00
Table E.8: Euclidean distances between clusters.

ALL cluster 1 cluster 2 cluster 3

cluster 1 0,00E+00 6,66E-04 7,98E-05

cluster 2 6,66E-04 0,00E+00 5,90E-04

cluster 3 7,98E-05 5,90E-04 0,00E+00
Appendix F
Fixing ensemble modelling
parameters
Table F.1: Best parameters with all features; c = 2:5; m = 1.1:2.

criterion | PAN | PNM | BOTH | ALL
single model | c=4; m=2.0 | c=4; m=2.0 | c=5; m=2.0 | c=4; m=2.0
a priori | c=2; m=2.0 | c=5; m=2.0 | c=2; m=1.9 | c=2; m=2.0
a posteriori | c=2; m=2.0 | c=4; m=1.3 | c=2; m=2.0 | c=2; m=1.9
arithmetic mean | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0 | c=5; m=1.8
distance-weighted mean | c=5; m=1.9 | c=2; m=2.0 | c=2; m=2.0 | c=5; m=2.0
Table F.2: Best parameters with feature selection; c = 2:5; m = 1.1:2.

criterion | PAN | PNM | BOTH | ALL
single model | c=2; m=1.5 | c=3; m=1.3 | c=3; m=2.0 | c=4; m=1.4
a priori | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0
a posteriori | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0
arithmetic mean | c=4; m=1.9 | c=2; m=1.6 | c=2; m=2.0 | c=3; m=1.9
distance-weighted mean | c=2; m=2.0 | c=2; m=2.0 | c=2; m=2.0 | c=3; m=1.7
Appendix G
Influence of the fuzziness parameter
in FCM clusters centres
From what is seen, one can conclude that it might be desirable to have a low m parameter, akin to
K-means clustering. Knowing that the architecture of the multimodel depends heavily on the distance
of the data to the prototypes/centroids that result from this partition, one may draw the conclusion that
a higher distance between those centroids makes the weighting process less ambiguous. Another
aspect that should be noted is that, in terms of the partition of the data, there is no considerable difference
for m ≤ 10, and higher values were only used in this short analysis out of curiosity. Nonetheless,
the decision on the most suitable m will be taken by evaluating the scores obtained with the mentioned
validation measures.
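The tendency discussed above can be reproduced with a small FCM sketch on synthetic two-cluster data: as m grows, the prototypes are pulled towards each other (and, in the limit, towards the grand mean), although the effect is modest for moderate m, consistent with the observation that the partition changes little for m ≤ 10. The implementation below is generic and purely illustrative.

```python
import numpy as np

def fcm_prototypes(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means, returning only the prototypes V."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                       # columns sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        D = np.fmax(((X[None] - V[:, None]) ** 2).sum(-1), 1e-12)
        U = 1.0 / (D ** (1 / (m - 1)) * (1.0 / D ** (1 / (m - 1))).sum(0))
    return V

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
spread = {m: np.linalg.norm(np.diff(fcm_prototypes(X, m=m), axis=0))
          for m in (1.4, 2.0, 10.0)}
print(spread)  # inter-prototype distance shrinks as m grows
```

With well-separated prototypes the membership weights, and hence the multimodel weighting, remain clearly distinguishable, which is the argument made above for preferring a low m.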
The possibility to choose m adds one more degree of freedom (DOF), which can be useful for extensive
examination and might lead to better results in some cases, so the use of K-means as opposed to FCM
is not yet justifiable based on this short example alone. Hopefully this study clarified the influence of the
parameter m in this context. The following figures show the tracking of the position of the prototypes as
m varies:
Appendix H
Histograms feature selection for
punctual data
(k) Frequency a feature was selected during 50 runs of single model FS.
(l) Frequency a feature was selected during 50 runs of a priori FS.
(m) Frequency a feature was selected during 50 runs of a posteriori FS.
(n) Frequency a feature was selected during 50 runs of arithmetic mean FS.
(o) Frequency a feature was selected during 50 runs of distance-weighted mean FS.
Figure H.1: Frequency a feature was selected during 50 runs for dataset ALL.
(a) Frequency a feature was selected during 50 runs of single model FS.
(b) Frequency a feature was selected during 50 runs of a priori FS.
(c) Frequency a feature was selected during 50 runs of a posteriori FS.
(d) Frequency a feature was selected during 50 runs of arithmetic mean FS.
(e) Frequency a feature was selected during 50 runs of distance-weighted mean FS.
Figure H.2: Frequency a feature was selected during 50 runs for dataset BOTH.
(a) Frequency a feature was selected during 50 runs of single model FS.
(b) Frequency a feature was selected during 50 runs of a priori FS.
(c) Frequency a feature was selected during 50 runs of a posteriori FS.
(d) Frequency a feature was selected during 50 runs of arithmetic mean FS.
(e) Frequency a feature was selected during 50 runs of distance-weighted mean FS.
Figure H.3: Frequency a feature was selected during 50 runs for dataset PNM.
(a) Frequency a feature was selected during 50 runs of single model FS.
(b) Frequency a feature was selected during 50 runs of a priori FS.
(c) Frequency a feature was selected during 50 runs of a posteriori FS.
(d) Frequency a feature was selected during 50 runs of arithmetic mean FS.
(e) Frequency a feature was selected during 50 runs of distance-weighted mean FS.
Figure H.4: Frequency a feature was selected during 50 runs for dataset PAN.
Appendix I
Previous Results
(a)
(b)
Figure I.1: Tables extracted from (a) [24] and (b) [23] for results comparison.