
Feature Selection for Adverse Event Prediction

A dissertation submitted to The University of

Manchester for the degree of Master of Science by Research

in the Faculty of Engineering and Physical Sciences

2011

Elisabeta Marinoiu

School of Computer Science


Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Copyright Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Local Feature Selection - towards personalized medicine . . . . . . . . 2

1.3 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Project Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Literature review 6

2.1 Feature Selection. Introduction . . . . . . . . . . . . . . . . . . . . . 6

2.2 Feature Selection using Information Theory . . . . . . . . . . . . . . 9

2.2.1 Overview of basic Information Theory Concepts . . . . . . . . 9

2.2.2 Relevancy. Redundancy. Relevancy in context . . . . . . . . . 11

2.2.3 Ranking criterion . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.4 Mutual Information Feature Selection Criterion . . . . . . . . 12

2.2.5 Double Input Symmetrical Relevance Criterion . . . . . . . . . 13

2.2.6 Joint Mutual Information Criterion . . . . . . . . . . . . . . . 14

2.3 Local Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.1 Natural clustering . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.2 Measuring dissimilarity of subproblems . . . . . . . . . . . . . 17

2.3.3 Local feature selection using a clustering-like approach . . . . 18

2.3.4 Local feature selection and dynamic integration of classifiers . 20

2.4 Class-Specific Feature Selection . . . . . . . . . . . . . . . . . . . . . 21

3 Data Preprocessing and Initial Experiments 25

3.1 Data Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1.1 Adverse Events Data Set (not used in the experiments) . . . . 26

3.1.2 Subjects Data Set . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.1.3 Concomitant Medication Data set . . . . . . . . . . . . . . . . 27

3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2.1 Converting string discrete variables into numbers . . . . . . . 27

3.2.2 Discretization of continuous variables . . . . . . . . . . . . . 27

3.2.3 Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2.4 Sparse Features and Special Cases . . . . . . . . . . . . . . . . 29

3.3 Initial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3.1 Experiment 1: Ranking features according to Mutual Information . . . 30

3.3.2 Permutation test . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3.3 Experiment 2: Local analysis of individual feature importance 33

3.3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4 Analysis of feature importance within subsets 39

4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.2 Assumptions and Limitations . . . . . . . . . . . . . . . . . . . . . . 40

4.2.1 Feature Selection Criterion . . . . . . . . . . . . . . . . . . . . 41

4.3 Identifying the most discriminant features . . . . . . . . . . . . . . . 41

4.3.1 Consistency Index for feature selection . . . . . . . . . . . . . 42

4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44


4.4 Local Analysis of biomarkers . . . . . . . . . . . . . . . . . . . . . . . 46

4.4.1 Description of the method . . . . . . . . . . . . . . . . . . . . 47

4.4.2 Computing the scores . . . . . . . . . . . . . . . . . . . . . . . 47

4.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.4.4 Summary and conclusions . . . . . . . . . . . . . . . . . . . . 52

5 Predictive Model Building 54

5.1 Measures of assessing performance . . . . . . . . . . . . . . . . . . . . 55

5.2 Local vs. Global Analysis . . . . . . . . . . . . . . . . . . . . . . . . 57

5.2.1 Local predictive models . . . . . . . . . . . . . . . . . . . . . 57

5.3 Model Building - Phase I . . . . . . . . . . . . . . . . . . . . . . . 58

5.3.1 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . 59

5.4 Model Building - Phase II . . . . . . . . . . . . . . . . . . . . . . . 64

5.4.1 Balancing class distribution . . . . . . . . . . . . . . . . . . . 64

5.4.2 Ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . 70

5.5 Chapter summary and Conclusions . . . . . . . . . . . . . . . . . . . 73

5.5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6 Conclusions 75

6.1 Summary of the research and conclusions . . . . . . . . . . . . . . . . 75

6.1.1 Data Preprocessing and Initial Experiments . . . . . . . . . . 75

6.1.2 Analysis of feature importance within subsets . . . . . . . . . 76

6.1.3 Predictive model building . . . . . . . . . . . . . . . . . . . . 78

6.1.4 How can the proposed techniques be transferred to new data sets? . . 81

6.1.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.1.6 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . 84

References 88


List of Figures

2.1 A unified view of the feature selection process [11] . . . . . . . . . 7
2.2 Multivariate feature selection [18] . . . . . . . . . . . . . . . . . . 13
2.3 Natural clustering of data with regard to the pathogens [12] - an example of problem decomposition for a particular microbiological data set using prior information . . . . . . 16
2.4 Schematic view of the general wrapper approach to class-dependent feature selection [17] . . . . . . 23
3.1 Feature ranking according to normalized mutual information for Appetite and Neutropenia . . . . . . 32
3.2 Feature ranking according to normalized mutual information for Nail disorder and Neuropathy . . . . . . 32
3.3 Feature ranking according to normalized mutual information for Appetite in the subset of people who had Large Cell Carcinoma . . . . . . 34
3.4 Feature ranking according to normalized mutual information for Appetite in the subset of Females . . . . . . 35
3.5 Feature ranking according to normalized mutual information for Nail disorder in the subset of Caucasian people . . . . . . 35
3.6 Feature ranking according to normalized mutual information for Neutropenia in the subset of Males . . . . . . 36
3.7 Feature ranking according to normalized mutual information for Appetite in different clusters . . . . . . 37
4.1 Schema for computing the feature scores within subsets . . . . . . 47
4.2 Locally important biomarkers for Appetite when splitting the data on Body Mass Index (left) and on Number of Cycles (right) . . . . . . 50
4.3 Locally important biomarkers for Neutropenia when splitting the data on lmsite9 (metastasis in Lymph Nodes) - left, and on prt25 (prior chemotherapy with Vinorelbine) - right . . . . . . 50
4.4 Locally important biomarkers for Neuropathy when splitting the data on cm163 (H2-receptor antagonists) - left, and on lmsite8 (metastasis in Hepatic System including Gall Bladder) - right . . . . . . 51
4.5 Locally important biomarkers for Nail Disorder when splitting the data on cm133 (combinations of penicillin) - left, and on lmsite3 (metastasis in Bone or Locomotor System) - right . . . . . . 51
5.1 Schema for building a local predictive model . . . . . . 58
5.2 Appetite disorder prediction using Logistic regression. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features . . . . . . 60
5.3 Neutropenia prediction using Logistic regression. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features . . . . . . 60
5.4 Nail Disorder prediction using Logistic regression. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features . . . . . . 61
5.5 Neuropathy prediction using Logistic regression. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features . . . . . . 61
5.6 Variation of Sensitivity as the number of features increases. Left: Adaboost (base classifier: Logistic Regression) for predicting Neutropenia. Right: Random Forest for predicting Neutropenia . . . . . . 66
5.7 Neutropenia prediction using Adaboost. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features . . . . . . 67
5.8 Appetite prediction using Adaboost. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features . . . . . . 67
5.9 Nail Disorder prediction using Adaboost. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features . . . . . . 68
5.10 Neuropathy prediction using Adaboost. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features . . . . . . 69
5.11 Neuropathy (Left) and Neutropenia (Right) prediction using SVM . . . . . . 71
5.12 Nail Disorder (Left) and Appetite (Right) prediction using SVM . . . . . . 72


List of Tables

4.1 Top 5 most discriminant features for Appetite . . . . . . 44
4.2 Top 5 most discriminant features for Neutropenia . . . . . . 45
4.3 Top 5 most discriminant features for Nail Disorder . . . . . . 46
4.4 Top 5 most discriminant features for Neuropathy . . . . . . 46
5.1 Global and Local performance obtained using Naive Bayes for Appetite, Neutropenia, Neuropathy and Nail disorder . . . . . . 62
5.2 Global and Local performance obtained using Decision Trees for Appetite, Neutropenia, Neuropathy and Nail disorder . . . . . . 63
5.3 Global and Local performance obtained using Random Forest for Appetite, Neutropenia, Nail Disorder and Neuropathy . . . . . . 70

Word Count: 20 548


Abstract

This document presents an investigation into applying machine learning techniques to predict the occurrence of four adverse events (Appetite Disorder, Neutropenia, Nail Disorder and Neuropathy) in lung cancer patients participating in a clinical trial conducted by the pharmaceutical company AstraZeneca.

The focus of the project is to investigate the hypothesis that biomarkers show a different importance in different subareas of the input space, and to develop techniques that identify which biomarkers are only locally predictive. This is a step towards personalized medicine, which attempts to tailor medical practice to the needs of each patient. The first research area proposes a method for discovering the most discriminant features based on the Kuncheva Consistency Index for feature subsets. A discriminant feature is one that splits the original data into two subsets such that the features that are predictive of a specific adverse event in one subset differ from those that are predictive of the same adverse event in the other subset. The second investigation proposes a technique for highlighting biomarkers that are only locally important in the subsets previously identified.

The last part of the thesis develops a method for building local predictive models and comparing their performance with that of global ones. The research showed that the only adverse event that could be predicted from the measurements provided was Neutropenia. For this, the local models always had a better negative predictive value than the global ones, while maintaining similar or better sensitivity and specificity, depending on the particular learning algorithm used. The methodology developed during this project should be immediately transferable to new data sets.


Declaration

No portion of the work referred to in the dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright Statement

i The author of this dissertation (including any appendices and/or schedules to this dissertation) owns certain copyright or related rights in it (the "Copyright") and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii Copies of this dissertation, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has entered into. This page must form part of any such copies made.

iii The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the "Intellectual Property") and any reproductions of copyright works in the dissertation, for example graphs and tables ("Reproductions"), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv Further information on the conditions under which disclosure, publication and commercialization of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's Guidance for the Presentation of Dissertations.


Acknowledgements

First of all, I would like to thank Dr. Gavin Brown for giving me the opportunity to work on this project and for his continuous guidance and constructive feedback throughout the dissertation. I would also like to thank Dr. Diederik Pietersma and the entire staff at AstraZeneca for their helpful support and for making this project possible. Secondly, I would like to thank my parents for their moral and financial support. I am also grateful to the Dinu Patriciu Foundation for awarding me the 'Open Horizons' Scholarship, which helped fund my postgraduate studies.


Chapter 1

Introduction

1.1 Motivation

The development of a novel and useful drug extends over many years and involves the efforts of specialists in many domains, from medicine to data engineering. One of the most important steps is the clinical trial, in which patients (volunteers) are given different doses of the new drug at different time intervals, while possible unexpected reactions are observed. Clinical trials are used to determine the efficacy and safety of a new product, as well as to provide valuable information early in development regarding the cost effectiveness of the drug [19]. Moreover, clinical trials can be a chance for patients to gain access to the latest therapy available. On the other hand, conducting a clinical trial can be a very costly procedure for a company, as it involves both financial resources for payment of the volunteers and human resources for gathering information about possible adverse reactions and monitoring the participants' health.

An adverse drug reaction is defined in [7] as "an appreciably harmful or unpleasant reaction, resulting from an intervention related to the use of a medical product, which predicts hazard from future administration and warrants prevention or specific treatment, or alteration of the dosage regimen, or withdrawal of the product", and can range from a minor alteration of a patient's health to death. Thus, it would be desirable to identify, as a first step, the biomarkers that influence the occurrence of an adverse event, and then to predict in an automated manner whether a new patient will experience a particular adverse drug reaction. This involves gathering data by measuring different characteristics of a patient and using specialized algorithms to extract the desired meaning from it, usually by performing feature selection followed by a classification task.

Designing a classifier from a biomedical dataset is a Machine Learning task and has been an active research field in recent years. However, the quality of the prediction depends heavily on the quality of the data used. Nowadays the datasets produced (from DNA sequencing or clinical measurements) can have hundreds or thousands of attributes. This enormous quantity of information puts more pressure on developing efficient algorithms to extract meaning from it. In general, the actual number of characteristics that are important for making a classification is much smaller, and the others act as noise, hindering the classifier and causing misleading results. This is why an effective feature selection step is essential before applying a classification procedure.

1.2 Local Feature Selection - towards personalized medicine

Local Feature Selection attempts to formalize the intuition that each person is unique, and thus that the key characteristics that should be examined to predict an adverse effect in one patient might be different from the characteristics that are meaningful for another patient. An intuitive yet simplified example is that what causes asthma to appear in a child might be different from the causes of asthma in an old man. Features like smoking and the degree of pollution in the working environment might be meaningful for the old man but irrelevant for the child. Here, what differentiates the two categories is age, and it is obvious that we should treat them as separate problems and look at different medical measurements for a good prediction.

However, when there are thousands of possible characteristics to choose from, and a certain subcategory of people is defined by a combination of them, efficient and intelligent automated computation must be employed. Thus, the aim of a local feature selection algorithm is to identify, for each person or subgroup of persons, the most informative set of features that will lead to an accurate prediction. This is in fact a step towards developing personalized medicine, which intends to tailor each medical action to the specific needs of individual patients. In our framework (predicting adverse effects for lung cancer patients), this can be seen as determining which features to use for each person (or subgroup of persons) in order to ensure better classification results.

1.3 Aims and Objectives

1.3.1 Aims

• To investigate the hypothesis that biomarkers can hold a different predictive importance when considered within different groups of people;

• To investigate how local feature selection can be integrated into building predictive models, and to provide a comparative analysis of the results obtained.

1.3.2 Objectives

• Identifying the degree of statistical dependence between individual features and each of the target variables, in order to gain initial insights into the relationship between the measurements and the adverse event to be predicted;

• Modeling a local feature selection procedure that can identify, for each adverse event, the biomarkers that are only locally important, along with the subspaces (groups of people) where they are meaningful;

• Developing a procedure for building local classification models, integrating the information obtained in the previous step, and comparing their performance in predicting the occurrence of an adverse event with that of the global ones.

1.4 Project Outline

The structure of the document is as follows:

Literature review. The Literature review chapter starts by providing an introduction to feature selection techniques. It then describes information theoretic measures and how these are integrated into feature selection methods. The last part of the chapter focuses on reviewing the current literature on local feature selection algorithms (instance-based and class-specific).

Data preprocessing and initial experiments. This chapter provides a detailed description of the data sets supplied for carrying out this research project, along with the motivation behind the choices made in preprocessing them. In addition, it analyzes the results of three initial experiments aimed at gaining early insights into the data sets. The experiments were designed to understand the degree of information contained in individual features relative to each of the adverse events to be predicted. They also attempted a preliminary local analysis on subsets obtained by performing different splits of the data and applying clustering algorithms.

Analysis of feature importance within subsets. This chapter is structured in two sections. The first proposes a method for identifying the splits that generate the most distinctive subsets of data (in the sense that the features that are predictive of an adverse event in one subset are different from those that are predictive in another subset). Based on the results obtained at this stage, the second part of the chapter proposes a method for highlighting which biomarkers are only locally important, and in which group of people.

Predictive model building. This chapter proposes a method for building local predictive models and explores the performance of different classification algorithms, both locally and globally, in two stages. In the first, in order to keep the model simple and possibly interpretable, less complex classifiers are used, while the second employs more advanced classification algorithms together with a method for balancing the class distribution.

Conclusions. This chapter provides a summary of the research conducted, highlighting the conclusions drawn together with possible directions for further investigation.


Chapter 2

Literature review

2.1 Feature Selection. Introduction

Feature selection is the part of the field of Machine Learning that aims to select, from the input variables, those that are the most relevant and have the best predictive power in a classification task. With the advancement of techniques that generate datasets with thousands of features (web streams, gene expression, etc.), performing feature selection before applying a classification algorithm has become an indispensable preprocessing step [8]. The aims of feature selection are:

• To reduce the computational costs associated with the prediction process and lower the storage requirements;

• To reduce the prediction time;

• To help improve the model's comprehensibility;

• To provide a higher prediction accuracy by removing irrelevant, redundant or noisy features [8] [11] [3].


A general classification process that uses feature selection is shown in Figure 2.1 below, introduced in [11]. Phase I represents the actual feature selection process, while Phase II summarizes the model fitting and performance evaluation.

Figure 2.1: A unified view of the feature selection process [11].

Performing feature selection involves iterating through three steps:

1. Generating a candidate feature subset using a search strategy;

2. Evaluating and adjusting (adding/removing features) the set by means of a selection criterion;

3. Deciding when the current set is good enough to be used further in the model fitting phase [11].

Model fitting consists mainly of training a chosen learning algorithm with the previously selected features and testing its performance using a test data set that has not been used in the training step.
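The two phases above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the thesis: the criterion scores, feature names and the 1-nearest-neighbour learner are invented for the example.

```python
def select_top_k(scores, k):
    """Phase I (filter style): rank features by a criterion score, keep the top k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def evaluate_1nn(train, test, keep):
    """Phase II: fit/evaluate a 1-nearest-neighbour classifier restricted to the
    selected features, scoring accuracy on a held-out test set."""
    def dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in keep)

    correct = 0
    for x, y in test:
        _, predicted = min(train, key=lambda t: dist(x, t[0]))
        correct += (predicted == y)
    return correct / len(test)

# Hypothetical criterion scores (e.g. mutual information with the class):
scores = {"age": 0.9, "smoker": 0.5, "noise": 0.1}
keep = select_top_k(scores, 2)  # keeps the two highest-scoring features
train = [({"age": 0.0, "smoker": 0.0, "noise": 5.0}, 0),
         ({"age": 1.0, "smoker": 1.0, "noise": -3.0}, 1)]
test = [({"age": 0.1, "smoker": 0.1, "noise": 9.0}, 0),
        ({"age": 0.9, "smoker": 0.9, "noise": 9.0}, 1)]
accuracy = evaluate_1nn(train, test, keep)
```

The point of the split is that the "noise" feature, scored lowest by the criterion, never reaches the learner, and the test set plays no role in choosing the features.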


Depending on the search strategy used, the techniques developed for feature selection fall into three categories:

1. Filter approaches;

2. Wrapper approaches;

3. Embedded methods.

Feature selection algorithms of the filter model are based on analyzing the intrinsic relationship between features and target. In the evaluation of a candidate set there is no learning algorithm involved [8]. The evaluation is carried out using information theory measures or probabilistic approaches. Among the most important advantages of filter methods are that they can be used regardless of the learning algorithm, and that the actual algorithm used has a very simple structure (usually forward selection or backward elimination) and is therefore easy to understand [11]. Moreover, filters are generally faster than the other types of feature selection methods.

On the other hand, wrapper approaches use a learning algorithm to judge the performance of a candidate set of features. The most popular methods for searching the space in a wrapper approach are greedy (Forward Selection or Backward Elimination) [9]. In the Forward Selection method we start with the empty set of features and repeatedly add each available feature, evaluate the performance of the learning algorithm, and retain the feature that yields the highest gain in performance. In the Backward approach, we start with the full feature set and progressively discard the features whose removal causes the smallest drop in performance [9].
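The Forward Selection loop just described can be sketched as follows; the `score` callable stands in for training the learning algorithm on a candidate subset and measuring its performance, and the toy weights below are invented purely for illustration:

```python
def forward_selection(features, score, k):
    """Greedy forward selection: start from the empty set, repeatedly add the
    single feature that yields the highest score, and stop at k features or
    when no remaining candidate improves on the current set."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if selected and score(selected + [best]) <= score(selected):
            break  # no remaining feature improves performance
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy stand-in for a wrapper's accuracy estimate: additive feature worth
# with a small penalty per extra feature (all values hypothetical).
weights = {"smoker": 0.5, "age": 0.4, "bmi": 0.3, "noise": 0.0}
toy_score = lambda subset: sum(weights[f] for f in subset) - 0.01 * len(subset)
chosen = forward_selection(["age", "bmi", "smoker", "noise"], toy_score, 2)
```

Backward Elimination is the mirror image: start from the full set and, at each step, drop the feature whose removal decreases the score the least.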

Integrating the learning algorithm into the feature selection process has both positive and negative implications. The major drawback is that the feature selection model is no longer independent of the learning process, and as a consequence a set of features obtained once cannot be reused with another algorithm. Moreover, as the learning task is carried out for every modified set, the whole feature selection process is slow.

However, the important advantage of wrapper methods is that they generally lead to higher prediction accuracy. Embedded methods use the same idea of assessing the features' usefulness with a learning algorithm, but the difference from filter and wrapper methods lies in the way feature selection and learning interact, as there is no separation between the two steps [9]. Embedded methods are more prone to overfitting than filters, and thus if only small amounts of data are available it is expected that filters will perform better than embedded methods, whereas embedded methods will outperform filters as the training data increases [9].

2.2 Feature Selection using Information Theory

This section presents an overview of different attempts to design a filter feature selection algorithm based on the mutual information shared between variables. These can then be used in the process of performing local feature selection, which is introduced in the next section. In all the approaches presented, a multi-class classification problem is considered. Given a set of m examples {x_k, y_k} (k = 1...m), where x_k = (x_k1, ..., x_ki, ..., x_kn) is the k-th instance consisting of n input features, and Y = {y_1, ..., y_i, ..., y_c} is the set of possible classes that each input instance can belong to, the problem is to select from the n features those that are best at predicting the true class of an unseen (test) set of examples.

2.2.1 Overview of basic Information Theory Concepts

1. Entropy. The entropy of a random variable X measures the degree of uncer-

tainty (randomness) in the distribution of X [4]. It is defined in the following

9

Page 22: Feature Selection for Adverse Event Prediction

Feature Selection for Adverse Event Prediction

way:

H(X) = −∑x∈X

p(x) log p(x) (1)

The entropy has a maximum value when all the events have the same probabil-

ity of occurrence (for example rolling a dice: each face has the same probability:

1/6) as the uncertainty is maximum.

2. Conditional Entropy of X given Y measures the uncertainty that still re-

mains in X when we know the outcome of Y [4].

H(X|Y ) = −∑y∈Y

p(y)∑x∈X

p(x|y) log p(x|y) (2)

3. Mutual Informationdenotes the information shared between two random

variables [5] and it is defined as follows:

I(X;Y) = H(X) - H(X|Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \qquad (3)

4. Conditional Mutual Information measures the information still shared

between the variable X and Y when the value of Z is known [4].

I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = \sum_{z \in Z} p(z) \sum_{x \in X} \sum_{y \in Y} p(x,y|z) \log \frac{p(x,y|z)}{p(x|z)p(y|z)} \qquad (4)
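As an illustration, all four quantities can be estimated for discrete samples with simple plug-in (frequency-count) estimators. The sketch below is illustrative and not part of any cited work; it uses the entropy identities I(X;Y) = H(X) + H(Y) - H(X,Y) and I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), which are algebraically equivalent to equations (3) and (4):

```python
from collections import Counter
from math import log2

def entropy(*cols):
    """Plug-in estimate of the (joint) entropy, in bits, of one or more
    aligned columns of discrete samples."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(x, z) + entropy(y, z)
            - entropy(x, y, z) - entropy(z))
```

For example, a fair coin has entropy 1 bit, two independent variables have zero mutual information, and for Z = X XOR Y the pair (X, Z) fully determines Y, so I(X;Y|Z) = 1 bit.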

A possible way of assessing the usefulness of a feature set in a classification problem

is to rank the features according to a defined criterion that measures the intrinsic

relation between each feature and the target [4]. In the past 20 years many different

approaches based on information theory measures have been proposed. The fol-

lowing section introduces the basic notions taken into consideration when building

a criterion and presents some of the most important filter criteria based on them,

highlighting their strengths and weaknesses and explaining why they might be of

interest in the context of the project.


2.2.2 Relevancy. Redundancy. Relevancy in context

The measures introduced above can be used to quantify the usefulness of a feature Xk

in relation to the target Y. The features selected to further be used in classification

should be relevant to the target class and not redundant. Having redundant features

means adding computational burden without adding relevant information.

Relevancy. The relevancy of a single feature Xk with respect to the output class Y

is the mutual information shared between Xk and Y. (Equation 3)

Redundancy. The redundancy of a feature Xk is computed with respect to the

already selected features. If we denote by S the set of the features already selected

then Xk is redundant if it has high mutual information with the elements in S.

Relevancy in context. This measure takes into account the already selected

features, thus the relevancy of a feature Xk to the output, when a subset S of

features has already been selected is denoted by the conditional mutual information

between Xk, Y given each of the features in S [3]. (Equation 4)

Using these notions, in [4] it has been shown that most heuristic criteria that attempt

to increase relevancy and at the same time lower redundancy follow a general form:

J = I(X_k;Y) - \beta \sum_{j \in S} I(X_j;X_k) + \gamma \sum_{j \in S} I(X_j;X_k|Y) \qquad (5)

where the first term accounts for relevancy, the second for redundancy and the third

for relevancy in the context of other features. The parameters β and γ control the

importance placed on each of the terms they weight.


2.2.3 Ranking criterion

A simple way of selecting relevant features is to rank them according to the mutual

information between each feature and the target variable in descending order and

then keep selecting features until either a certain threshold has been reached or the

performance of a classifier starts degrading. This in fact corresponds to setting both β and γ in

equation (5) to zero, which means it measures only the individual relevancy of a feature

to the output class. Although the criterion is simple to implement and understand

as well as very fast, its major drawback is the fact that it assumes that all variables

are independent of each other and doesn’t take into account that features may be

redundant or may be relevant only in the context of others [4].
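A minimal sketch of this ranking criterion for discrete data, reusing plug-in entropy estimates (the function names are illustrative):

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint entropy (bits) of one or more aligned columns of discrete samples."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(x) + H(y) - H(x, y)

def rank_features(X, y):
    """Return feature indices sorted by I(X_i;Y), most relevant first.
    X is a list of instances; each instance is a list of discrete values."""
    cols = list(zip(*X))  # transpose to per-feature columns
    scores = [I(c, y) for c in cols]
    return sorted(range(len(cols)), key=lambda i: -scores[i])
```

A feature identical to the target ranks above a constant, uninformative one; as the text notes, the criterion sees nothing about redundancy or interaction between the ranked features.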

2.2.4 Mutual Information Feature Selection Criterion

In order to overcome some of the problems of the ranking criterion, in [2] Battiti

proposed a criterion that attempts to avoid selection of redundant features. This is

achieved by keeping the idea of maximizing mutual information between each feature

and the class variable, but adding a penalty if the current investigated features have

a high mutual information with the already selected ones. If we denote by Xk the

feature to which we want to assign a score and by S the set of already selected features, then

the Mutual Information Feature Selection Criterion can be expressed as:

J = I(X_k;Y) - \beta \sum_{X_j \in S} I(X_k;X_j)

The first term accounts for the information shared between feature Xk and the

target Y, while the second one is a summation over the information shared between

Xk and the already selected features in S. The parameter β has to be chosen by the

user. Even though this criterion penalizes redundancy, it fails to consider that

there are cases when a feature that is useless by itself can be useful in the context of


other features [8].
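Battiti's greedy selection under this criterion can be sketched as follows (a minimal illustrative implementation for discrete data; the helper names and default β are assumptions, with β left to the user as in the original):

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint entropy (bits) of aligned columns of discrete samples."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):
    """Mutual information I(X;Y)."""
    return H(x) + H(y) - H(x, y)

def mifs(X, y, k, beta=1.0):
    """Greedy MIFS: repeatedly add the feature maximizing
    I(X_k;Y) - beta * sum over already selected X_j of I(X_k;X_j)."""
    cols = list(zip(*X))
    selected = []
    remaining = list(range(len(cols)))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: I(cols[f], y)
                   - beta * sum(I(cols[f], cols[j]) for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a duplicated (fully redundant) copy of an informative feature in the pool, the redundancy penalty steers the second pick toward a different, complementary feature.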

2.2.5 Double Input Symmetrical Relevance Criterion

In [3], a criterion is proposed that attempts to take into account the fact that some-

times a set of variables can have higher mutual information with the output class

than the sum of the variables taken individually. Thus, the authors introduce a new

idea of variable complementarity. The rationale behind trying to formalize this is

also very clearly expressed in [8]. The example presented below was given in [18]

and expresses in a simple, yet intuitive way the importance of taking into account

variable interaction.

Figure 2.2: Multivariate feature selection [18]

The first graph from the left shows a two-class classification problem with two

features, x1 and x2. If we consider only x1, then we can see that there is much

overlap between the classes. Considering only x2 will result in even worse classifica-

tion accuracy, as the two classes overlap perfectly. However, if we look at the graph

considering both features together, then we can notice that the two classes can be

separated with high accuracy. In this way, we can say that x1 is more relevant once

x2 has been considered.

In the same way, the right graph shows that two features that are useless considered

individually can be very relevant when considered together. This is what the Double

Input Symmetrical Relevance criterion attempts to take into account: the case when

variables complement each other and are more useful considered jointly than indi-

vidually. The complementarity of two random variables with respect to the output

class Y is defined as:

C_Y(X_i;X_j) = I(X_{i,j};Y) - I(X_i;Y) - I(X_j;Y)

Another idea that led to the final formulation of the criterion was the authors' intu-

ition that when we have no knowledge about how to combine subsets of d variables,

the best subset can be obtained by combining subsets of d-1 variables [3]. This

heuristic was theoretically proved in the article and the Double Input Symmetrical

Relevance criterion was defined as:

X_{DISR} = \arg\max_{X_i \in X-S} \sum_{X_j \in S} SR(X_{i,j};Y)

where SR(X;Y) = I(X;Y)/H(X,Y)

is the symmetrical relevance between X and Y. The

normalization term does not follow a theoretical background but is motivated by

the fact that mutual information is biased towards higher arity features. The most

important advantage of Double Input Symmetrical Relevance criterion is that it

favors selecting a variable that is complementary with an already selected one [3].

However, it does not take explicitly into account the problem of selecting redundant

features.
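A greedy selection loop under this criterion can be sketched as below (illustrative, not the authors' code; the joint variable X_{i,j} is formed by pairing the candidate with each selected feature, and seeding the first pick by plain mutual information is an assumption not fixed by the formula above):

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint entropy (bits) of aligned columns of discrete samples."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):
    """Mutual information I(X;Y)."""
    return H(x) + H(y) - H(x, y)

def disr_select(X, y, k):
    """Greedy DISR: seed with the best single feature by plain MI, then add
    the feature maximizing sum_j SR(X_{i,j};Y), where
    SR(A;Y) = I(A;Y) / H(A,Y) is the symmetrical relevance."""
    cols = list(zip(*X))
    remaining = list(range(len(cols)))
    selected = [max(remaining, key=lambda f: I(cols[f], y))]
    remaining.remove(selected[0])
    def score(f):
        total = 0.0
        for j in selected:
            pair = list(zip(cols[f], cols[j]))  # joint variable X_{f,j}
            total += I(pair, y) / H(pair, y)
        return total
    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

On an XOR-style target, where no single feature is informative, the criterion correctly pairs the two complementary features instead of an irrelevant one.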

2.2.6 Joint Mutual Information Criterion

Another criterion that takes into account the complementarity of features is Joint

Mutual Information, proposed in [21]. For a candidate feature Xn, given the already

selected features X1 . . . Xn−1, the criterion assigns the following score:

J_{jmi}(X_n) = \sum_{k=1}^{n-1} I(X_n X_k; Y)


The criterion is expressed as the sum of the mutual information between the target

variable Y and a joint random variable XnXk obtained by pairing the current feature

under investigation with each of the already selected ones. The idea is to select a

feature that carries complementary information to the ones that have already been

selected.
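The score itself can be sketched directly (illustrative helper names; `cols` holds the feature columns and `selected` the indices of the already chosen features):

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint entropy (bits) of aligned columns of discrete samples."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):
    """Mutual information I(X;Y)."""
    return H(x) + H(y) - H(x, y)

def jmi_score(cols, y, f, selected):
    """JMI score of candidate feature f: sum over selected j of
    I((X_f, X_j); Y), the MI between the paired variable and the target."""
    return sum(I(list(zip(cols[f], cols[j])), y) for j in selected)
```

On an XOR-style target the candidate that completes the pattern with an already selected feature scores a full bit, while an irrelevant candidate scores zero.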

2.3 Local Feature Selection

Instance-Based Feature Selection

The previously presented techniques can be very efficient for some problems and, when

applied globally as a preprocessing step in a classification problem, can substantially

improve the accuracy. However, there are cases when a global attempt to select

relevant features is not suitable, but we should take into account the fact that

sometimes features are more important in specific regions of the whole space and

less important in others [6]. Ignoring this aspect can lead to discarding features

that, though irrelevant in most of the feature space, are very important in a

small region. Conversely, we may select features that are relevant in most

of the space but still hinder the classifier in certain regions [6]. In order to deal

with this problem, different solutions for identifying a heterogeneous problem and

then applying local feature selection have been proposed.

2.3.1 Natural clustering

In [10] the authors explore the effects of local feature selection compared to global fea-

ture selection using natural clusters. Here, the problem of decomposing the space

has been solved using expert knowledge about the dataset (microbiological data)

and thus the promising results obtained after applying local feature selection can

serve as a motivation for finding methods to cluster the data in an automated way


to form homogeneous feature subspaces. Their study was meant to investigate the

impact of incorporating knowledge of domain experts in the preprocessing step on

classifying antibiotic resistance (sensitive, resistant, intermediate) and in particular

how the classification accuracy differs when applying local versus global dimension-

ality reduction techniques.

Though in the above mentioned article, both feature extraction and feature selec-

tion techniques were applied, only the results obtained using feature selection are

presented below as this is the focus of this research project.

The distribution of data after clustering is shown in the figure below:

Figure 2.3: Natural clustering of data with regard to the pathogens [12] - an example of problem decomposition for particular microbiological data using prior information.

The techniques used for local feature selection are of a wrapper type: Forward

Feature Selection, Backward Feature Elimination and Bidirectional Search. The

evaluation of the selected features was done using a knn classifier (k=7). The results

revealed that feature selection applied locally at the second level of splitting (gram+

and gram-) improved the classification accuracy. Moreover, the total number of

features selected locally is always smaller than the number of features selected when

applying global feature selection.


Though the results are encouraging, in practice the problem is more complex as

in general we do not have such knowledge about how to cluster the data, but we

have to use traditional clustering techniques or other methods for decomposition

of heterogeneous problems. Moreover, the evaluation was done only with wrapper

methods, which are computationally expensive, and using a knn classifier, which requires

large memory resources as the model is the training data itself. The fact that good

results were obtained only when splitting the data in two clusters and applying

feature selection locally suggests that the decomposition into subproblems should be

carefully analyzed as there is a risk of overfitting. Moreover, ways of analyzing how

different two subsets are in terms of feature relevance would be useful.

2.3.2 Measuring dissimilarity of subproblems

An attempt to measure the degree of dissimilarity between two given subproblems

is given in [1]. The author’s idea was to define for each subproblem a vector of

dimension f (the number of features) where the ith element is a measure of the

importance (merit) of the ith feature. The angle between the two vectors (called

the Importance Profile Angle) will denote how different the two regions are as far

as feature importance is considered. Formally, the IPA is defined in the following

manner:

IPA = \frac{2}{\pi} \arccos \frac{\sum_{i=1}^{f} M_{ai} M_{bi}}{\left(\sum_{i=1}^{f} M_{ai}^{2}\right)^{1/2} \left(\sum_{i=1}^{f} M_{bi}^{2}\right)^{1/2}}

where Ma1, Ma2, . . . , Maf is the merit vector for the first subproblem and Mb1, Mb2, . . . , Mbf

the corresponding vector for the second subproblem. The IPA defined above is the

angle between these vectors, normalized. A threshold above which to consider that

the two problems are different should be set experimentally. In order to measure the

feature importance, the authors proposed three methods based on Gini index and

entropy. Though these measures are faster and easier to compute, they also share

the great disadvantage that they only measure the correlation between a single

feature and the class target, without taking into consideration feature interaction


(as discussed in the first part of this chapter)[1].
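Since the IPA is just a normalized angle between two merit vectors, it can be sketched directly (illustrative function name; merit values are assumed non-negative, so the result lies in [0, 1], with 0 for identical relevance profiles and 1 for orthogonal ones):

```python
from math import acos, pi, sqrt

def importance_profile_angle(ma, mb):
    """Normalized angle between two feature-merit vectors: 0 means the two
    subproblems rank features identically, 1 means orthogonal profiles."""
    dot = sum(a * b for a, b in zip(ma, mb))
    norms = sqrt(sum(a * a for a in ma)) * sqrt(sum(b * b for b in mb))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp rounding error
    return 2 / pi * acos(cosine)
```

Two proportional merit vectors give an IPA of 0, while merit vectors with disjoint support give 1; the threshold at which two subproblems are declared different must still be set experimentally, as the authors note.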

The authors first define IPA for categorical features with binary values. It involves

generating a split for each feature, computing the IPA and then choosing the split

with the largest angle between merit vectors. The process is similar to building a

tree, but the difference is that the aim of splitting is to obtain homogeneous problems

and from this point any other classifier can be used together with a feature selection

method. Though a method for dealing with multiple valued features is proposed, this

involves computing more splitting points which can be computationally expensive

and infeasible for large datasets with thousands of features. For numerical features,

the authors proposed first applying a discretization method.

However, this method is based on a global analysis of the data and may not be

suitable for heterogeneous problems [1]. Another issue that has to be taken into

account when applying IPA is the stopping criterion. The authors have not proposed

a specific method, but they mention as guidelines that the criterion should probably

include a threshold for IPA and one for the number of instances in the subproblems

[1]. Moreover, as was also expressed in [12], the fragmentation of the initial dataset

should not go too far in order to avoid overfitting [1]. Some possible extensions of

this method would be using IPA to assess the splitting obtained by other clustering

techniques and adapt it to compute the degree of dissimilarity between multiple

clusters.

2.3.3 Local feature selection using a clustering-like approach

An algorithm that attempts to perform local feature selection using a clustering-like

approach is proposed in [6]. The method is of a wrapper type, using the knn classification

algorithm with k = 1, and aims to select different relevant features for each new

instance to be classified. This is done by constructing an instance space from the

initial training data in the following manner. At the beginning the instance space


is exactly the training data and an initial accuracy is computed. Each instance finds the

nearest example that has the same class and assumes that the features that differ

by more than one standard deviation are not relevant. These features are dropped

from the instance and the classifier is run again, comparing the current accuracy

with the accuracy obtained by keeping the features. If there is an improvement then

the features will be deleted, else they are kept and the algorithm will not attempt

to do any feature selection for this instance [6]. The algorithm is stopped when no

improvement can be achieved by performing feature selection on any instance.

The distance to the closest example has to be adapted to the fact that by deleting

some features from instances, we obtain vectors with different size. Thus, the author

employs normalized Euclidean distance for numeric features and a version of the

distance proposed by Stanfill and Waltz [6] to deal with symbolic ones (considering

that when a feature is discarded from an instance, this is marked by a special value

*) [6].

A great disadvantage of this method, as with any other wrapper feature selection

algorithm that also uses knn as a classifier, is the computational cost of running the

algorithm each time feature selection is attempted. Moreover, there is also a mem-

ory cost associated with retaining all instances which may result in the algorithm

not being suitable for large datasets and large feature sets. The average number

of features kept by this algorithm is generally higher than the number of features

retained using Backward Sequential Selection or Forward Feature Selection, because

the algorithm will drop a feature from all instances only if it is irrelevant in all sub-

spaces (globally irrelevant). The accuracy reported using this type of local feature

selection was higher than the accuracy using global feature selection algorithms and

in order to verify that this difference comes from the fact that the algorithm takes

into account the idea that some features are relevant only in parts of the instance

space, artificial data sets were created and used in testing.


2.3.4 Local feature selection and dynamic integration of classifiers

A further step in the field of local feature selection was done by Tsymbal et al.

in [15]. Besides proposing a method of selecting features relevant for each new

instance, the authors also investigate how the most suitable classifier can be used in

each case. The main idea behind their approach is to use classifiers built on different

feature subsets, store information about their prediction errors, and for a new

instance to be classified a meta-level classifier (weighted nearest neighbor) will be

used to determine which classifier is best. In order to restrict the possible classifiers,

a decision tree is constructed on the entire dataset and for a new instance x, the

classifiers built on features that are not in the path followed by x in the tree will be

discarded.

The algorithm consists of a learning phase and an application phase. In the learning

part, possible feature subsets are generated and cross-validation is used to estimate

the errors of a classifier on a specific feature set. For the application phase two

versions are proposed, depending on how each classifier contributes to the final

assignment of a class. In the static version the classifier with the smallest predicted

error is chosen to make the final classification, while in the dynamic version, each

classifier has a weight and the final classification is obtained using weighted voting

[15].

The experimental results showed that with this technique a 10% accuracy improve-

ment can be reached using less than a half of the initial features. However, the

datasets used for testing the proposed method are relatively small, with at most

432 instances and no more than 57 features. It would be interesting to see how the

algorithm handles much larger datasets. In particular, in the feature subset

generation step a method for avoiding exhaustive enumeration should be used (the

authors mention the possibility of employing heuristic methods).


2.4 Class-Specific Feature Selection

A different type of local feature selection is class-dependent feature selection. Here,

the focus changes from selecting different features for each instance to be classified,

to selecting a possibly different feature subset for each class, depending on their

discriminating properties [17][14].

An intuitive and motivating example why such an approach would be useful is given

in [17] : supposing that we are dealing with medical data and for each patient we

have measurements like blood pressure, weight, cholesterol, age, height etc and that

we have to diagnose whether the patient suffers from disease A, B or C. Assume that

if someone has blood pressure above a certain limit it means they have disease A,

looking at the weight we can tell if they have disease B, and a certain level of cholesterol

means they suffer from disease C [17]. A general feature selection algorithm will select

blood pressure, weight and cholesterol as being important. If the patient suffers from

disease A, then the weight and the blood pressure will act as noise (analogous in the

other cases) and we can end up with misleading results [17]. On the other hand, if

we know what features are relevant for each class then, we can build three classifiers,

each of them distinguishing a possible disease from all the rest [17].

Considering the project framework, that is, performing feature selection for predict-

ing adverse events, class specific feature selection may be important when dealing

with different degrees of severity of an adverse event. For example, if a drug may

produce nausea, headache and heart attack, then we might be interested first in

the important features for predicting heart attack, for which we would like high

accuracy in prediction. For this scenario, class specific feature selection might be

more useful. Furthermore, we may have classes that have very few examples and so,

performing global feature selection would favor the richer class.

A general wrapper approach for selection of class dependent features is proposed in

[17]. The algorithm uses the idea of one against all, meaning that it will construct


C classifiers (where C is the number of classes), with classifier i distinguishing be-

tween class i and all the others. The process is illustrated in Figure 2.4. For each

of those classifiers a hybrid feature selection method is used, that combines wrap-

per approach with a filter one. Firstly all features will be ranked according to an

importance measure such as RELIEF [10] weight measure and Class Separability

Measure (CSM)[17]. Any other ranking measure can be used, including those based

on information theory. After that, following a forward selection search, different

subsets of features are added for each class in the order of ranking [17] using SVM

as a classifier. As a stopping condition, one can either check whether the val-

idation accuracy starts to decrease or stop when all the features have been added [17].

For classifying an instance, a heuristic method is proposed that assigns weights to

the output of each model before carrying out the comparison between them and

selecting the final output [17].


Figure 2.4: Schematic view of general wrapper approach to class-dependent featureselection [17].

The method proposed in [17] is very flexible as it allows customizing the class-

dependent feature selection algorithm by choosing what classifier to use as well as what

filter method for ranking of the features. The results reported by the authors using

RELIEF [10], CSM[17] and mRMR (Minimal-Redundancy-Maximal-Relevancy)[13]

as ranking measures and SVM as classifier are promising, as the accuracy using

this method is always higher than performing class-independent feature selection.

Moreover, the average number of features selected is smaller in the case of the

proposed method. The drawback is the extra computational cost added by using a

wrapper approach for each of the binary classifiers.


A similar approach is proposed in [14] where again the idea of transforming a C-

class classification problem into C binary problems is used. Unlike the previous

method, here the problem of obtaining imbalanced binary problems (one class having

considerably more examples than the other) is addressed. Their idea was to use an

oversampling method before applying a conventional feature selection method on

a binary problem [14]. Moreover, the authors experimented with both filters and

wrappers as feature selection methods on the subproblems, using Naive Bayes,

C4.5, MLP and knn (k=1 and k=3) as classifiers.

For classification of a new instance, the authors have also used a heuristic method to

select between the outputs of the C classifiers. They compared the results obtained

with no feature selection, traditional feature selection and class specific feature selec-

tion on 15 datasets and concluded that usually class specific feature selection yields

better results than traditional feature selection which in turn helps obtaining better

results than applying a classifier without any feature selection as a preprocessing

step. Even if some of the datasets used in the experiments reached a large number

of instances (12,960 for the largest), the number of features is relatively small (at most

64) and there is no record on how the proposed framework deals with large feature

sets.


Chapter 3

Data Preprocessing and Initial

Experiments

This chapter aims at giving a detailed description of the supplied data sets and

the choices made in preprocessing them in order to be able to conduct the desired

experiments in Matlab. Moreover, as the choices made in early steps will influence

the final results, the motivation behind each step is given. The initial experiments

were designed to gain first insights into the data which will guide the choices of

further possibilities of experimentation.

3.1 Data Overview

The data on which the experiments were carried out come from a clinical study

conducted by the pharmaceutical company AstraZeneca and consisted of 3 distinct

datasets along with the explanations of the measurements recorded. A general

description of each of them is given below.


3.1.1 Adverse Events Data Set (not used in the experiments)

This dataset contains information about the occurrences of adverse events within

the patients that were included in the clinical trial along with an internationally

accepted classification of adverse events obtained from MedDRA ontology. The set

contained 6868 incidents of adverse events and 593 different types of adverse events.

Though the actual data set was not used in the experiments, the 4 adverse events

were grouped according to this ontology in order to increase the occurrences of

positive cases. The grouped adverse events were also supplied.

3.1.2 Subjects Data Set

The data set is made up of 129 measurements recorded for 613 patients, including the

variables associated with the occurrences of the 4 adverse events to be investigated.

Their distribution among the patients is listed below:

• Anorexia - occurs 186 times, with a severity over 3 in 7 cases.

• Neutropenia - occurs 219 times, with a severity over 3 in 197 cases.

• Nail disorder - occurs 67 times, with a severity over 3 in 0 cases.

• Neuropathy - occurs 71 times, with a severity over 3 in 3 cases.

One of the particularities of this data set was that it contained mixed types of

features:

• Continuous (Age, Baseline Weight, Baseline Height, Baseline Body Mass In-

dex, Baseline Body Surface, Baseline BFGF, etc);

• Binary discrete (Sex, Smoking status, tumor stage, location of the metastasis


(lmsite1-lmsite15), prior chemotherapy treatment, reduction of doses (aered1-

aered8),etc);

• Categorical data (Race, Country, Histology Type, etc.).

3.1.3 Concomitant Medication Data set

Concomitant Medication data set contained 390 features recorded for the 613 pa-

tients, where each feature was a binary variable representing whether or not a patient

took a particular medicine that could possibly be linked with the occurrence of

one of the adverse events. The number of cycles of doses a person received was also

recorded along with another binary variable accounting for whether a patient had a

dose reduction or not. This data set was used along with the Subjects data set.

3.2 Data Preprocessing

3.2.1 Converting string discrete variables into numbers

The binary variables had as possible values the strings Y or N. Variables Race,

Histology type and Country had multiple categories. All string values were assigned

a numeric label starting from 0, in alphabetical order of the string values

denoting the categories (e.g. for Race, there were 4 categories: Black, Caucasian,

Oriental, Other, which were converted into 0, 1, 2, 3, respectively).
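This encoding scheme can be sketched in a few lines (illustrative function name):

```python
def encode(column):
    """Assign numeric labels from 0 upward, in alphabetical order of the
    distinct category strings in the column."""
    labels = {v: i for i, v in enumerate(sorted(set(column)))}
    return [labels[v] for v in column]
```

For the Race example above, Black, Caucasian and Oriental map to 0, 1 and 2 regardless of the order in which they appear in the data.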

3.2.2 Discretization of continuous variables

The project focuses on a mutual information framework and thus, the whole data

has to be discrete. Moreover, the number of continuous variables was much smaller


(only 9) than the number of already discrete variables. The following techniques

were taken into consideration for this preprocessing step:

• Discretization using the mean value into two binary classes. One

of the advantages of using this technique is that the resulting categories are

easy to interpret and that since most of the variables are binary, the mutual

information computed using that variable will not be biased. However, the

major drawback and the main reason why this technique was not employed is

that by simple binarization we may lose important information.

• Discretization using prior knowledge (manual discretization). This

implies choosing both the number of categories and the minimum and maxi-

mum values allowed for each of them according to intuition and general infor-

mation (for example, considering 3 categories for Age: [20-40] years, [41-60]

years, over 60 years). Although this technique may seem reasonable it is highly

dependent on personal experience and restricted by the available prior knowl-

edge. Therefore, it cannot be employed for all continuous features (such as

BBFGF).

• Discretization by minimization of the information loss. This technique

uses the target variable in the process of choosing the optimal thresholds. In

order to explain it, the variable Age will be used as an example. For all other

continuous variables, the process was analogous.

Firstly, the range of the variable was computed (20-82 years). Then, 20 initial

possible thresholds were placed at equal distance. The number of categories was

chosen to be 5 (4 thresholds) as this number turned out to be at the best trade-

off between minimizing information loss on the one hand (fewer categories: more

information loss) and dealing with the problem of the mutual information’s bias

towards high arity features as well as a significant increase in computational time,

on the other hand. All the possible ways of placing the 4 thresholds were generated


(4845) and for each of them the mutual information between the candidate discretization and the

target variable was computed. The final discretization was chosen to be the one

that maximized the dependency between the variable and the adverse event.
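The threshold search described above can be sketched as an exhaustive search over combinations of evenly spaced candidate cut points (an illustrative reconstruction, not the exact code used; with 20 candidates and 4 thresholds it evaluates C(20,4) = 4845 combinations, the count mentioned above):

```python
from bisect import bisect_right
from collections import Counter
from itertools import combinations
from math import log2

def H(*cols):
    """Joint entropy (bits) of aligned columns of discrete samples."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):
    """Mutual information I(X;Y)."""
    return H(x) + H(y) - H(x, y)

def discretize_max_mi(values, y, n_candidates=20, n_thresholds=4):
    """Spread candidate cut points evenly over the variable's range, then
    keep the combination of cuts maximizing I(binned X; Y)."""
    lo, hi = min(values), max(values)
    cands = [lo + (hi - lo) * i / (n_candidates + 1)
             for i in range(1, n_candidates + 1)]
    best_cuts, best_mi = None, -1.0
    for cuts in combinations(cands, n_thresholds):
        binned = [bisect_right(cuts, v) for v in values]  # bin index per value
        mi = I(binned, y)
        if mi > best_mi:
            best_cuts, best_mi = list(cuts), mi
    return best_cuts
```

When the target is perfectly separated by a single value range, the chosen cuts recover all of H(Y) bits of dependency.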

3.2.3 Missing Values

In order to be able to use the data points that had missing values in the experiments, two

methods of approximating the lack of information were analyzed. The first method

was to compute the mean of that particular value for all the other patients and use

it to fill in the missing one. This approximation is poor as it does not consider

the particular characteristics of the patient and it will fill use the same value (the

mean) for all patients that had a missing record on a particular characteristic. The

second one and the one that was employed in obtaining the preprocessed data was

to compute, for each data point that had a missing value on the ith feature, the closest

10 data points in the sense of Hamming distance. Then, among these 10 closest

neighbors, the value that occurred most often for the ith feature was voted to fill in

the missing one.
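A small sketch of this neighbour-voting scheme (an illustrative reconstruction; here `np.nan` marks a missing entry, and the Hamming distance counts mismatches over the features observed in both rows):

```python
import numpy as np

def impute_hamming_knn(X, k=10):
    """Fill each missing entry with the majority value of that feature among
    the k nearest rows under Hamming distance."""
    X = X.astype(float).copy()
    n, d = X.shape
    for i in range(n):
        missing = np.where(np.isnan(X[i]))[0]
        if missing.size == 0:
            continue
        dists = []
        for j in range(n):
            if j == i:
                continue
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            dists.append((np.sum(X[i, both] != X[j, both]), j))
        neighbours = [j for _, j in sorted(dists)[:k]]
        for f in missing:
            votes = [X[j, f] for j in neighbours if not np.isnan(X[j, f])]
            if votes:
                vals, counts = np.unique(votes, return_counts=True)
                X[i, f] = vals[np.argmax(counts)]  # most frequent value wins
    return X
```

The O(n²) neighbour search is acceptable at the scale of this data set (a few hundred patients).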

3.2.4 Sparse Features and Special Cases

Sparse features were defined as those with more than 50% missing values; they were not used in the experiments, as they did not carry sufficient information for a valid analysis.

Some of the features needed to be treated separately as they exhibited special prop-

erties:

• The variable Country initially had 25 values. Since the mutual information

is biased towards high arity features and the majority of features are binary,

we decided to make this feature binary as well, by grouping the countries into


European and Non-European.

• For mixed continuous and categorical data (such as variable GENERLT (EGFR

gene amplification)), a value of NO RESULT was treated as a missing value.

• For each adverse event, the features that represented whether it occurred with

a severity over 3 (e.g. for anorexia: aeg3p1, aeg3p5) were not included in the

experiments where the target variable was that particular event. The reason

behind this choice is that the variable representing the degree of severity of

an adverse event is conditioned on knowing that that particular adverse event

happened. Therefore, its value can be found only after we know if the adverse

event occurred or not.

3.3 Initial Experiments

The initial experiments were designed to help gain first insights into the data set

and into the way the features are correlated with the target variables (the 4 adverse

events). The results obtained at this step influenced the choices considered for next

steps.

3.3.1 Experiment 1: Ranking features according to Mutual

Information

This experiment aims at understanding what amount of information is shared be-

tween an individual biomarker and the target variable which was considered in turn:

Appetite, Neutropenia, Nail Disorder and Neuropathy. After computing the mutual

information between each feature and the target variable, the features were sorted

in descending order of the mutual information and the top most important were

displayed.


3.3.2 Permutation test

Computing the mutual information between a feature and the target variable involves approximating the joint probability distribution of the two random variables. The accuracy of the approximation is influenced by the number of available

data points as well as by the noise present in the data [20]. Therefore, in order to be

able to say to what extent a feature is useful, a threshold has to be set [20]. One way

of choosing this threshold is performing a permutation test that will also involve a

formal hypothesis test [20].

A permutation test asks how likely the observed value of a statistic θ̂i, computed over two vectors xi and y of length n, would be if the vectors were independent, in which case the statistic should be zero [20]. This is done by estimating the distribution of the random variable θ̂ from the values obtained over permutations of the vector x (or y), and then computing the proportion of those values that are larger than θ̂i

[20]. In our context, the permutation test can be employed in order to assess the

significance of the mutual information between each feature and the variable to be

predicted and to automatically discard those that do not pass the permutation test.

The level of significance was chosen to be 1% and, since computing the total number

of permutations (n!) would add a significant computational cost, only 500 permuta-

tions were considered. In addition to this, the mutual information was normalized

to be between 0 and 1 in the following manner:

NormalizedMI(X, Y) = I(X, Y) / min(H(X), H(Y)),

where H(X) is the entropy of X.
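The test and the normalization can be sketched as below. This is an illustrative implementation (function names hypothetical); it uses 500 permutations as in the text, with the significance level as a parameter.

```python
import numpy as np

def entropy(labels):
    # empirical Shannon entropy in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    joint = [f"{a}|{b}" for a, b in zip(x, y)]  # encode each pair as one symbol
    return entropy(x) + entropy(y) - entropy(joint)

def normalized_mi(x, y):
    return mutual_information(x, y) / min(entropy(x), entropy(y))

def passes_permutation_test(x, y, n_perm=500, alpha=0.01, seed=0):
    """Keep a feature only if the proportion of permuted statistics reaching
    the observed I(x;y) falls below the significance level alpha."""
    rng = np.random.default_rng(seed)
    observed = mutual_information(x, y)
    exceed = sum(mutual_information(rng.permutation(x), y) >= observed
                 for _ in range(n_perm))
    return exceed / n_perm < alpha
```

Shuffling x breaks any dependence with y while preserving both marginals, so the permuted statistics approximate the null distribution of the mutual information.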


Figure 3.1: Feature ranking according to normalized mutual information for Appetite and Neutropenia

Figure 3.2: Feature ranking according to normalized mutual information for Nail disorder and Neuropathy

In Figure 3.1 it can be noticed that the maximum value reached for Normalized

Mutual Information computed for Appetite and Neutropenia is very low (0.23 for

Appetite, 0.19 for Neutropenia, when the maximum possible is 1), which means that

there is little predictive power in individual features as far as Appetite disorders or

Neutropenia are concerned. For Nail and Neuropathy (Figure 3.2), the maximum

value of Normalized Mutual Information is higher (>0.4). However, for all adverse

events, it should be noticed that the first two features are those referring to the

reduction of doses in the case of occurrence of that adverse event (the grouped

or the specific one): aered7 and aered3 for Nail Disorder, aered8 and aered4 for


Neuropathy, aered2 and aered6 for Neutropenia and aered1 and aered5 for Appetite.

In the case of Nail Disorder and Neuropathy, all other features have a considerably

lower value. Moreover, for these two adverse events only 5 and 4 features, respectively, passed the permutation test, which again indicates low predictive power in

individual features.

3.3.3 Experiment 2: Local analysis of individual feature im-

portance

The previous experiment revealed that the mutual information between individual

features and the adverse event is small when considering the whole data set. In

the second one, the hypothesis that the importance of features may be different in

different subsets is taken into consideration.

The experimental framework is the same as in the previous one: computing the mutual information between all the features and the target variables and then displaying the

ones that passed the permutation test in descending order. The difference is that the

mutual information is computed firstly by considering subsets of the data defined

by splitting on particular features and then by applying clustering algorithms.

This experiment does not attempt to analyze all the possible subsets that can be

obtained by splitting the data on the categories defined by a feature, or to propose

an optimal split. The purpose is to gain more information about the structure of

the data and to offer some possible pathways to be investigated rigorously in the

next chapters. This is why the subsets analyzed are only a small selection and

were obtained by splitting the data on the following features: Sex, Race, hstltyp

(Histology type), Smoking habits and Tstage (Cancer Stage).

Some of the results are displayed below. Though the increase in the mutual infor-

mation is small, the tendency to rise can still be noticed. For example, in Figure 3.1,


the maximum normalized mutual information between the features and Neutrope-

nia was in the previous experiment less than 0.19, while considering only the subset

of people who had Large Cell Carcinoma (Figure 3.3), the most predictive feature

has normalized mutual information higher than 0.35. Moreover, another thing that

should be noticed is that the ranking of the features differs from one set to another

when considering a particular adverse event.

For subsets smaller than 20 data points a stability check was done along with a

permutation test in order to ensure the validity of the experiments. The stability

check consisted of removing each data point in turn from the subset and running

the permutation test over the new subset, recording at each iteration the features

that passed it. Only those that passed the test for every iteration are displayed as

being statistically valid.

Figure 3.3: Feature ranking according to normalized mutual information for Appetite in the subset of people who had Large Cell Carcinoma


Figure 3.4: Feature ranking according to normalized mutual information for Appetite in the subset of Females

Figure 3.5: Feature ranking according to normalized mutual information for Nail disorder in the subset of Caucasian people


Figure 3.6: Feature ranking according to normalized mutual information for Neutropenia in the subset of Males

The next step was to use automated techniques to cluster the data and then to

compute the mutual information and rank the features accordingly within the sub-

sets. For this purpose K-means clustering was employed with k = 4. Since the

data is categorical, the metric employed was Hamming distance which computes for

two data points the number of mismatches across all features. The results are similar to those obtained on the manually defined subsets, in that the normalized mutual information

slightly increases as compared to that computed on the entire data set and that

the same features seem to change their predictive power for an adverse event when

considering different clusters.
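K-means with Hamming distance on categorical data is, in effect, the k-modes variant; a minimal sketch is below (an illustrative reimplementation with a deterministic spread initialization, not necessarily the initialization used in the original experiments):

```python
import numpy as np

def hamming_kmeans(X, k=4, n_iter=20):
    """K-means-style clustering for categorical data: distance = number of
    mismatched features, centroid = column-wise mode of the cluster."""
    X = np.asarray(X, dtype=int)
    n, d = X.shape
    # deterministic initialization: k rows spread evenly through the data
    centroids = X[:: max(1, n // k)][:k].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assign each point to the centroid with the fewest mismatches
        dists = np.array([[np.sum(row != c) for c in centroids] for row in X])
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = [np.bincount(col).argmax() for col in members.T]
    return labels
```

Using the per-feature mode as the centroid keeps every centroid a valid categorical profile, which the arithmetic mean of standard k-means would not.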


Figure 3.7: Feature ranking according to normalized mutual information for Appetite in different clusters

One of the greatest disadvantages of applying clustering techniques is that the resulting clusters are hard to interpret. Since the problem under investigation arises in a medical context, the project aims, in a first stage, at developing local techniques that preserve the meaning of the subsets analyzed. For this reason, the following chapter focuses on analyzing subsets of patients obtained by splitting the data on a particular variable with a clear meaning (such as Males and Females) rather than on automated clustering methods.


3.3.4 Conclusions

These initial experiments revealed that individual features share very little mutual information with the occurrence of any of the four adverse events. Analyzing mutual information in subsets defined by splitting the data on different features, or in clusters, resulted in a small increase in the predictivity of the features. Moreover, the relative ranking of the features for a particular adverse event changes between subsets. These results lead to the next steps of the project: considering features jointly rather than individually, proposing methods to analyze the difference in feature importance between subsets, and automatically detecting which splits generate the subsets that differ the most in terms of which features are important in predicting each of the four adverse events.


Chapter 4

Analysis of feature importance

within subsets

This chapter aims at understanding how the importance of features varies within

different subsets of data. Firstly, the choice for a feature selection criterion is mo-

tivated. Then, a method is proposed to choose the features for splitting the data

in such a way that the resulted subsets are the most dissimilar as far as the top

10 predictive features are concerned. A further step is done in order to identify

the locally important biomarkers by assigning to each feature two scores for the two

subsets it appears in. A heuristic is then applied to identify the features that are

only locally important and change their predictive power within the subsets.

4.1 Definitions

A discriminant feature1 is one that splits the original data into 2 subsets with

the following characteristic: the features that are predictive for a specific adverse

event in one subset are different than the features that are predictive for the same

1 In some disciplines the term discriminant feature can be used generically for an important feature. In this context it will be employed with a different meaning, as mentioned in the definition.


adverse event in the other subset.

A locally important feature is a feature that is predictive of an adverse event

only in a sub area of the input space.

Two classification subproblems are considered different/dissimilar if:

• the data sets on which the classification tasks are defined come from the same

initial set and were obtained by splitting it on a particular feature;

• the variable to be predicted is the same for both subsets (one of the adverse

events);

• the set of features that are the most relevant in predicting the target, computed on the first subset, is different from the set of most important features computed on the second subset.

4.2 Assumptions and Limitations

In this study the local analysis is performed by dividing the initial data set each time

only into two subsets. The rationale behind this choice was that making a further split

would result in producing small-size subsets that could not be used for a reliable

analysis.

Moreover, for the features that have multiple categories, a grouping of those was

used in order to create only two subsets. For example, the feature Race had 4

categories (Caucasian, Black, Oriental and Others) and since the Caucasian subset

had a size of 360, the grouping was done in the following manner: Caucasians −

first subset, Other Races (Black, Oriental, Others) − second subset.


4.2.1 Feature Selection Criterion

The previous chapter revealed a very small amount of information in individual

features for all 4 adverse events. As a consequence, the following analysis will be

done by adopting a feature selection criterion that takes into account the possibility

that features may carry more information when considered jointly (that is they

are complementary to one another). The criterion adopted in this study was Joint

Mutual Information [21] which is defined as:

Jjmi = Σ_{k=1}^{n−1} I(XnXk; Y)

In order to select the next feature the criterion computes the information between

the targets and a joint random variable, defined by pairing the candidate Xn with

each current feature.
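A greedy forward-selection sketch of this criterion is given below. This is illustrative only; the helper names are hypothetical, and the first feature is seeded with the highest individual mutual information, which is the usual convention for JMI but is not stated in the text.

```python
import numpy as np

def entropy(labels):
    _, c = np.unique(labels, return_counts=True)
    p = c / c.sum()
    return float(-np.sum(p * np.log2(p)))

def mi(x, y):
    joint = [f"{a}|{b}" for a, b in zip(x, y)]
    return entropy(x) + entropy(y) - entropy(joint)

def jmi_select(X, y, n_select=10):
    """Greedy forward selection: the next feature Xn maximizes
    sum over already-selected Xk of I(Xn, Xk; Y)."""
    n, d = X.shape
    # seed with the feature of highest individual mutual information
    selected = [max(range(d), key=lambda f: mi(X[:, f], y))]
    while len(selected) < min(n_select, d):
        def score(f):
            # I(Xn, Xk; Y): pair the candidate with each selected feature
            return sum(mi([f"{a}|{b}" for a, b in zip(X[:, f], X[:, k])], y)
                       for k in selected)
        candidates = [f for f in range(d) if f not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

Because the score pairs the candidate with each already-selected feature, JMI can reward features that are individually weak but complementary, which is exactly the property motivated in the text.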

4.3 Identifying the most discriminant features

The first step in performing a local analysis and building local predictive models

is to identify the sub-demographics present in the data that define two different

problems. One way to measure the degree of heterogeneity between two subsets

and the one adopted in this research is to analyze how the most predictive features

relative to each of the 4 adverse events differ in one subset compared to another.

Steps in this direction have been taken in [1], where for each possible split the angle

between the 2 vectors containing the mutual information for all features within the

2 subsets was computed. A small angle would indicate that the features have similar

importance in the 2 subsets. The major drawback of this method in the context

of the supplied data set is that it requires using a merit measure that would assess

features individually. As the mutual information of individual features is very small


and knowing that only a small number of them passed the permutation test, the

significance of the results returned can be affected.

4.3.1 Consistency Index for feature selection

The method proposed for computing the dissimilarity between 2 subsets of data

is based on the Consistency Index for feature selection introduced in [22]. The

index assesses the similarity of two sequences of features of the same length obtained at different runs of a feature selection algorithm. The formal definition of the

Consistency index for two subsets A ⊂ X and B ⊂ X such that |A| = |B| = k,

where 0 < k < |X| = n and r = |A ∩B| is:

IC = (r − k²/n) / (k − k²/n) = (rn − k²) / (k(n − k)) [22]

The Consistency Index satisfies the following 3 properties:

• Monotonicity. The larger the intersection of the two feature subsets, the

higher the value of the index;

• Limits. The index is bounded by constants (−1 ≤ IC ≤ 1) that do not depend on k or n. The maximum value of the consistency index is 1 and is reached when r = k, that is, when A is the same as B. The minimum value (−1) is obtained when the intersection of A and B is the empty set (r = 0) and their size is half of the total number of available features (k = n/2);

• Correction for chance. IC(A, B) will have a value around 0 when A and B are independently drawn, as r is expected to be around k²/n [22].

A major advantage of employing this index in the analysis of feature importance in

subsets for the current problem is that since it compares sets of features and not

the individual mutual information carried by each attribute, it is compatible with


a feature selection criterion that considers features jointly, such as JMI. Having

these properties, the Consistency Index has been chosen as a measure of similarity

between the sets of features obtained by running a feature selection algorithm on

each of the two subpopulations defined by splitting the original data on a particular

feature.

Splitting the data into subpopulations implies a considerable reduction in the size of

the available data, which affects the reliability of the distribution approximation of

each feature. In an attempt to overcome this problem, in the analysis the feature

selection algorithm is run 20 times on 20 bootstrap samples of the original data

set, and in order to reduce the variance the mean of the Consistency Index will be

considered as a final value. The procedure is done in turn for every possible split and

the output will be features in ascending order of the Consistency Index computed on

the two subproblems that each feature defines. The method is summarized below:

Compute 20 bootstrap samples of the initial data set
For each feature f that defines a valid split
    For each bootstrap sample
        Split the data into subpopulation A and subpopulation B
        Compute the set S1 of the 10 most predictive features for subset A, using JMI
        Compute the set S2 of the 10 most predictive features for subset B, using JMI
        Compute the consistency index IC(S1, S2)
    Average the consistency index of feature f over the 20 bootstrap samples
Sort all the features in ascending order of IC and display the first 5
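The procedure above can be sketched as follows (illustrative only; `select_top(X, y, k)` stands in for any feature selection routine — the dissertation uses JMI — and the split feature is assumed binary):

```python
import numpy as np

def average_split_consistency(X, y, split_feature, select_top,
                              n_boot=20, k=10, seed=0):
    """Mean Kuncheva consistency index between the top-k feature sets
    selected on the two subpopulations defined by `split_feature`,
    averaged over n_boot bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # bootstrap resample with replacement
        Xb, yb = X[idx], y[idx]
        mask = Xb[:, split_feature] == 1
        S1 = set(select_top(Xb[mask], yb[mask], k))
        S2 = set(select_top(Xb[~mask], yb[~mask], k))
        r = len(S1 & S2)
        scores.append((r * d - k * k) / (k * (d - k)))
    return float(np.mean(scores))
```

Averaging over bootstrap samples reduces the variance caused by the small subpopulation sizes, as argued in the text; a low average index flags a discriminant split feature.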


4.3.2 Results

This section shows the top most discriminant features, that is the features that

produced the most dissimilar subsets for a particular adverse event. In this context,

two subsets are considered different (i.e. they define different problems that may be

tackled separately) if the top 10 features important in predicting a certain adverse

event are different. The results are shown for each of the 4 adverse events in turn

together with a brief explanation of what the name of the feature means.

Appetite Disorder

Feature   Explanation                              Averaged Consistency Index
BLBSAM    Baseline Body Surface Area               0.4236
BLBMI     Baseline Body Mass Index                 0.4290
Ncycles   Number of Cycles Received during Study   0.4500
Lmsite8   LocAdv/Meta Site: Hepatic                0.4710
Age       Age                                      0.4762

Table 4.1: Top 5 most discriminant features for Appetite

It can be noticed that the features which are the most significant to split the data on in order to perform a local analysis are intuitively related to Appetite Disorder/Anorexia (body mass index, body surface area, metastasis in the Hepatic System). For example, the second smallest consistency index was achieved when the data was split on body mass index, resulting in two categories of people: light-weight (BMI ≤ 25) and heavy-weight (BMI > 25). This indicates that the features which are predictive of Appetite disorder in the subset of light-weight people are different from those that predict Appetite disorder in the subset of heavy-weight people.


Neutropenia

Feature   Explanation                              Averaged Consistency Index
Cm71      Benzodiazepine derivatives               0.4133
Prt25     Prior Cancer Therapy: Vinorelbine        0.4238
Cm223     Opium alkaloids and derivatives          0.4290
Lmsite8   LocAdv/Meta Site: Hepatic                0.4448
Lmsite9   LocAdv/Meta Site: Lymph Nodes            0.4605

Table 4.2: Top 5 most discriminant features for Neutropenia

The extent to which the results are given a medical interpretation is limited by

experience and prior knowledge in understanding the medical terms involved. Table

4.2 shows that among the 5 most useful features to split the data on in order to obtain

more homogeneous problems for predicting Neutropenia is the one that indicates

whether the patient has metastasis in the Lymph nodes. ('Lymph nodes are found all

through the body, and act as filters or traps for foreign particles. They are important

in the proper functioning of the immune system. They are packed tightly with white

blood cells.' [27]).

On the other hand, Neutropenia is the type of disease which affects the number

of white blood cells in the blood. ('Neutropenia, [...] is a hematological disorder

characterized by an abnormally low number of neutrophils, the most important type

of white blood cell. Neutrophils usually make up 50-70% of circulating white blood

cells and serve as the primary defense against infections by destroying bacteria in the

blood.' [28]). The small value of the averaged consistency index indicates that the

features important in predicting Neutropenia for the patients who had metastasis

in the lymph nodes are different from those considered for patients who have not

experienced this condition.

The same type of results (i.e. the top 5 features that define the most dissimilar subsets) is shown in Table 4.3 and Table 4.4 for Nail Disorder and Neuropathy.


Nail Disorder

Feature   Explanation                              Averaged Consistency Index
Ncycles   Number of Cycles Received during Study   0.3190
Cm113     Any CM: Combs of penicillins incl beta   0.3924
Lmsite3   LocAdv/Meta Site: Bone and Locomotor     0.4029
Cm3       Any CM: Acetic acid derivatives          0.4133
Prt19     Prior Cancer Therapy: Paclitaxel         0.4290

Table 4.3: Top 5 most discriminant features for Nail Disorder

Neuropathy

Feature   Explanation                              Averaged Consistency Index
Cm163     Any CM: H2-receptor antagonists          0.3767
Lmsite8   LocAdv/Meta Site: Hepatic                0.3976
Ncycles   Number of Cycles Received during Study   0.4081
NOORGAN   Number of Organs                         0.4343
Lmsite9   LocAdv/Meta Site: Lymph Nodes            0.4395

Table 4.4: Top 5 most discriminant features for Neuropathy

4.4 Local Analysis of biomarkers

The previous steps indicated the splits that define subsets of data where feature importance changes the most. This section takes the analysis further and inspects which features are important in one subset and less important in the other. The purpose of this analysis is to gain a deeper understanding of which biomarkers influence the occurrence of an adverse event conditioned on the fact that a patient belongs to a specific group or has taken a certain treatment.


4.4.1 Description of the method

As in the previous step, in order to increase the level of validity of the results, the

analysis was carried out on 20 bootstrap samples of the initial data set. The

main idea was to split the data on a particular feature, run a feature selection

algorithm (JMI) on each of the two resulting subsets and record how many times a

particular feature appears in top k most important features over the 20 bootstrap

samples. The method is summarized in Figure 4.1 where the split considered was

Males/Females.

Figure 4.1: Schema for computing the feature scores within subsets

4.4.2 Computing the scores

Each attribute can appear at most 20 times in the top k most important features for

a subset. As a consequence, the maximum score is 20 and the minimum is 0. This

was normalized and the final measure was limited by 0 and 1. The value of k was

iterated only between 2 and 5, as the analysis is primarily focused on identifying


differences in the behavior of the biomarkers that are the most informative. For a fixed k, the features that appeared either in the first subset or in the second were recorded along with their scores. If a feature appeared only in one of the subsets,

then the score for the other subset was set to 0.

In order to decide whether a feature changes its behavior from one subset to another, a threshold had to be set on the two scores assigned for the two subsets. If we denote by s1 the score a feature has in the first subset and by s2 its score in the second subset, then a feature is considered to have a different importance in the two subsets in the following 2 cases:

1. min(s1, s2) = 0 and max(s1, s2) ≥ 0.5

2. (s1 > 0.7 or s2 > 0.7) and max(s1, s2)/min(s1, s2) > 1.5

The first case means that if a feature appears 50% of the time in the top k important features for one subset, but never in the top k important features for the other subset, then it is considered a biomarker that is relevant only locally. The second situation accounts for the case when neither of the scores is 0. In this case, in order to consider a biomarker only locally important, the score associated with one subset has to be at least 50% higher than the score associated with the biomarker in the other subset. Moreover, in order to avoid situations where both of the scores are very small but the ratio between them is higher than 1.5 (e.g. s1 = 0.1 and s2 = 0.2), another condition was set: at least one of the scores has to be higher than 0.7, which means that the feature has a high importance in that particular subset.
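The two rules translate directly into code (a sketch; the scores are assumed to be already normalized to [0, 1]):

```python
def locally_important(s1, s2):
    """True if a biomarker's occurrence scores in the two subsets differ
    enough to call it only locally important."""
    lo, hi = min(s1, s2), max(s1, s2)
    if lo == 0:
        return hi >= 0.5                 # present in one subset only
    return hi > 0.7 and hi / lo > 1.5    # strong in one subset, much weaker in the other
```

Handling the lo == 0 case first also avoids the division by zero that rule 2 would otherwise produce.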

4.4.3 Results

This section presents the locally important biomarkers identified by the method described above. For each of the 4 adverse events, the analysis was carried out by splitting the data on each of the 5 features described in Tables 4.1-4.4. However,


the results will be displayed only for the splits obtained using 2 of the 5 features,

that were considered representative for pointing out how the biomarkers change their

predictive power in subsets.

Appetite Disorder

Figure 4.2 shows the biomarkers that are important only locally, in one of the subsets returned by splitting the data on Baseline Body Mass Index and on Number of Cycles. It can be noticed that Age occurs in the top 5 predictive features for Appetite in the subset of people whose BMI is greater than 25 more than 70% of the time, whereas in the subgroup of people whose BMI is less than or equal to 25, Age occurs only 30% of the time. A different behavior within the 2 subsets can be observed for the feature denoting whether a patient had a treatment involving electrolyte solutions (cm147), which is present more than 50% of the time in the top 5 predictive features in the group of heavier people (BLBMI > 25), but never for the group of lighter people. The same type of behavior is exhibited by the biomarkers BBFGF and cm229 (Osmotically acting laxatives), which are present more than 50% of the time in the top 5 predictive features for Appetite in the group of people who received more than 3 doses, but never occur as being important for those who received less than 3 doses. On the other hand, Cancer histology is more predictive of Appetite disorder in the group of people receiving less than 3 doses compared to those who received more.


Figure 4.2: Locally important biomarkers for Appetite when splitting the data on Body Mass Index (left) and on Number of Cycles (right).

Neutropenia

The results obtained for Neutropenia (Figure 4.3) revealed that a particular biomarker (cm108, colony stimulating factors) can be very important for two groups of people (those who had metastasis located in the Lymph nodes and those who had not had prior chemotherapy with Vinorelbine) while carrying a significantly smaller importance for the two complementary groups. Moreover, for the group of people who had Vinorelbine as a prior chemotherapy medicine, whether they also had an H2-receptor antagonist (cm163) appears in the top 4 predictive features for Neutropenia approx. 50% of the time, while for those who did not have Vinorelbine it never occurs as being important.

Figure 4.3: Locally important biomarkers for Neutropenia when splitting the data on lmsite9 (metastasis in Lymph Nodes), left, and on prt25 (prior chemotherapy with Vinorelbine), right.


Neuropathy and Nail Disorder

Analogous to the results shown for Appetite and Neutropenia are those for Neuropathy (Figure 4.4) and Nail Disorder (Figure 4.5). It can be noticed that there are biomarkers which tend to change their predictive power more (e.g. in Figure 4.4, left, aered8, reduction of doses, appears 80% of the time in the top 5 predictive features for Neuropathy in the subset of people who had H2-receptor antagonists and never for the subset of people who did not) and others which show a smaller difference in the percentage of times they appeared as being important in the two subsets (e.g. Figure 4.5, right, the case of BLBSAM).

Figure 4.4: Locally important biomarkers for Neuropathy when splitting the data on cm163 (H2-receptor antagonists), left, and on lmsite8 (metastasis in Hepatic System including Gall Bladder), right.

Figure 4.5: Locally important biomarkers for Nail Disorder when splitting the data on cm133 (combinations of penicillin), left, and on lmsite3 (metastasis in Bone or Locomotor System), right.


4.4.4 Summary and conclusions

This chapter proposed a method of identifying the splits that create the two most different subsets (in the sense of having different predictive features for a particular target). A further analysis was then done to identify which biomarkers change their importance the most when the data is restricted to one subset or the other. The main findings can be summarized as follows:

Cm229: Osmotically acting laxatives:

• For No. of Cycles > 3, cm229 is in the top 4 predictive features for Appetite 60% of the time

• For No. of Cycles ≤ 3, cm229 is never in the top 2 predictive features

Cm147: Electrolyte Solutions

• For BLBMI > 25, cm147 is in the top 4 predictive features for Appetite 60% of the time

• For BLBMI ≤ 25, cm147 is never in the top 2 predictive features

Cm108: Colony Stimulating factors

• For metastasis in lymph nodes (lmsite9): cm108 is in the top 2 predictive

features for Neutropenia 100% of the time

• For no metastasis in lymph nodes (lmsite9): cm108 is in the top 2 predictive

features for Neutropenia 5% of the time

• For no prior chemo 25 (Vinorelbine): cm108 is in the top 4 predictive features

for Neutropenia 100% of the time


• For prior chemo 25 (Vinorelbine): cm108 is never in the top 4 predictive

features for Neutropenia

Cm163: H2-receptor antagonists

• For no prior chemo 25 (Vinorelbine): cm163 is never in the top 4 predictive

features for Neutropenia

• For prior chemo 25 (Vinorelbine): cm163 is in the top 4 predictive features for Neutropenia 50% of the time

BLBMI: Baseline Body Mass Index

• For metastasis in Hepatic System (Gall Bladder included): BLBMI is in the top 5 predictive features for Neuropathy 15% of the time

• For no metastasis in Hepatic System: BLBMI is in the top 5 predictive features for Neuropathy 80% of the time

• For people who had cm163 (H2-receptor antagonists): BLBMI is in the top 5 predictive features for Neuropathy 75% of the time

• For people who have not had cm163 (H2-receptor antagonists): BLBMI is in the top 5 predictive features for Neuropathy 20% of the time

The results confirmed that features sometimes carry different importance in different parts of the input space, and that adopting a local analysis can reveal relationships between the features and the target variable that would otherwise have been hidden by the averaging effect of a global analysis.


Chapter 5

Predictive Model Building

This chapter aims at building and assessing the performance of different predictive

models for the four adverse events. The analysis was carried both globally and

locally on different subsets of the initial data, as computed in the previous chapter.

The outline of the chapter is as follows: firstly, an overview of different measures for

assessing performance along with the choices made in this study is presented. Then,

the method proposed for building local classifiers is explained. The third part of

the chapter will focus on the actual classifiers employed and the results obtained.

This part is built in two phases: the first is concerned with keeping the construction of the classifier relatively simple in order to maintain interpretability, while the second will attempt to improve performance by using more advanced classifiers and techniques to deal with the particularities of the data set.

Since the problem under investigation involves predicting four different adverse events, this chapter has a more exploratory character, experimenting with different classifiers, numbers of features and parameter settings rather than focusing on a single configuration.


5.1 Measures of assessing performance

In order to have a meaningful assessment of a classifier’s performance, the way the

outcome of a model is assessed has to consider the context of the problem as well

as the particularities of the data set on which the classification problem is defined.

In the case considered in this study, the choice of measures has to reflect the fact that the classification task is set in a medical framework and is concerned with predicting whether a person will develop an adverse event or not. This matters because the real cost of misclassifying a data point greatly depends on the true class of that data point: the cost of predicting that someone will not develop an adverse event when in fact they will is greater than the cost of predicting that a person will develop an adverse event when they will not.

Moreover, for all four events the class distribution is imbalanced with a greater

number of examples for the negative class than for the positive one.

As a consequence, the measures used in this study to assess the performance of

the different predictive models are sensitivity, specificity and negative predictive

value. A short overview of each of them along with a motivation for why they are

appropriate is given below. The following notations are used:

• TP=True Positives (the number of positive examples correctly classified);

• TN=True Negatives (the number of negative examples correctly classified);

• FP=False Positives (the number of negative examples incorrectly classified as

positive);

• FN=False Negatives (the number of positive examples incorrectly classified as

negative).

Sensitivity measures how good the classifier was at predicting the positive class,

that is it reports how many of the positive cases it has actually predicted as being


positive. In our context, the sensitivity will inform about the rate at which the

classifier identifies the occurrence of an adverse event.

Sensitivity = TP / (TP + FN)

Specificity measures the ability of the classifier to identify the negative class. In the context of predicting adverse events, the specificity will inform us how many of the negative cases the classifier correctly identified as negative.

Specificity = TN / (TN + FP)

In the case of an imbalanced class distribution, the rare class has a smaller impact on accuracy than the prevalent class [23], and for this reason recording the performance on each of the two classes separately (sensitivity and specificity) is a more meaningful way of assessing the real performance of the classifier. For example, in the case of Neuropathy and Nail Disorder, where the positive class has only 10% of the examples, a trivial classifier that always predicts the negative class (0, no adverse event) will have an accuracy of 90%; however, its sensitivity would be 0.

The negative predictive value indicates the proportion of the persons classified as negative (who will not experience an adverse event) that are correctly classified. In a medical framework a high negative predictive value is desirable, as it means that when the model indicates that a person will not experience an adverse event, it is highly probable that this is a correct result.

Negative Predictive Value = TN / (TN + FN)
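Taken together, the three measures can be computed directly from the confusion-matrix counts. A minimal sketch (the counts below are illustrative, not values from this study):

```python
# Compute sensitivity, specificity and negative predictive value
# from confusion-matrix counts (TP, TN, FP, FN).

def sensitivity(tp, fn):
    return tp / (tp + fn)          # TP / (TP + FN)

def specificity(tn, fp):
    return tn / (tn + fp)          # TN / (TN + FP)

def negative_predictive_value(tn, fn):
    return tn / (tn + fn)          # TN / (TN + FN)

# Illustrative counts: 30 positives (20 found), 70 negatives (60 found)
tp, fn, tn, fp = 20, 10, 60, 10
print(round(sensitivity(tp, fn), 3))                 # 0.667
print(round(specificity(tn, fp), 3))                 # 0.857
print(round(negative_predictive_value(tn, fn), 3))   # 0.857
```

Keeping the three measures separate makes the class imbalance visible in a way a single accuracy figure would hide.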

For the visualization of the results, ROC curves will mainly be used, as they allow a good representation of the classifier's performance in terms of false positive rate (1 - specificity) on the x axis and sensitivity on the y axis. Thus, the ideal point on an ROC curve is (0, 1), while a point situated on the line x = y represents the case of randomly guessing the class.

5.2 Local vs. Global Analysis

The results in the previous chapter indicated that features have different importance

in different subsets of the data. As a consequence, this chapter will not only assess the performance of classifiers in predicting the four adverse events globally (on the whole data set) but will also propose a method of building local models and compare their performance with the global ones.

5.2.1 Local predictive models

In order to allow for a fair comparison between local and global models, all the choices made for the global one (number of features to use, classifiers, methods for splitting the data into training and testing, etc.) were also applied in the local models in the following manner:

• The initial data set was split in two subsets P1 and P2 on a particular feature

(as indicated by the consistency index in the previous chapter);

• Each of the subsets and the corresponding targets were treated as a different

problem by separating locally the training and the testing sets and building

two local models, one for each subset;

• The testing set on which all the measures (sensitivity, specificity, etc.) were computed was obtained by concatenating the two testing sets resulting from splitting both P1 and P2 into training and testing data. For each testing point, the predicted value was obtained using the corresponding classifier.
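The three steps above can be sketched as follows; the split feature, classifier and synthetic data are illustrative placeholders, not the dissertation's actual variables:

```python
# Sketch of local-model building: split on one binary feature, train a
# separate classifier per subset, evaluate on the concatenated test sets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def local_models(X, y, split_col):
    preds, truth = [], []
    for value in (0, 1):                      # the two subsets P1 and P2
        mask = X[:, split_col] == value
        X_sub, y_sub = X[mask], y[mask]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_sub, y_sub, test_size=1 / 3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        preds.append(clf.predict(X_te))       # each local model scores its own test set
        truth.append(y_te)
    # concatenated test sets: the performance measures are computed on these
    return np.concatenate(preds), np.concatenate(truth)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))         # illustrative binary features
y = rng.integers(0, 2, size=200)
y_hat, y_true = local_models(X, y, split_col=0)
print(y_hat.shape == y_true.shape)            # True
```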


A schematic view of building the local models is shown in Figure 5.1, assuming that the split has been done on Males/Females. The analysis is analogous for any other chosen split of the initial data.

Figure 5.1: Schema for building a local predictive model

Though the local models will benefit from a more homogeneous data set, their major drawback, which can potentially influence the accuracy of the results, is that the data available for training each of the classifiers is only half of that available for the global model.

5.3 Model Building - Phase I

In the first phase, the choice of classifiers reflected the intention to keep the resulting models simple and easy to interpret. For the same reason, the number of features selected using JMI was varied only between 2 and 11. In the second phase, those restrictions will be removed and more powerful classifiers will be assessed.


5.3.1 Logistic regression

Logistic regression is a model used to predict the probability of a class, given the current configuration of the input variables, by focusing on the relative probability (odds) of obtaining one of the two categories. The general form for computing the posterior probability of a class is:

p(C1|φ) = y(φ) = σ(w^T φ) [26]

where φ is the feature vector and σ(·) is the logistic sigmoid function, defined as:

σ(a) = 1 / (1 + exp(−a)) [26]

The inverse of the logistic function is given by a = ln(σ / (1 − σ)). As a consequence, the natural logarithm of the odds is expressed as a linear function of the features, and the resulting model is linear and lends itself to interpretation.

ln( p(C1|φ) / (1 − p(C1|φ)) ) = w^T φ [26]
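A quick numerical check of these relations, with illustrative weights and feature vector:

```python
# Verify numerically that the log-odds of the sigmoid output recovers
# the linear term w^T phi (weights and features here are illustrative).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.5, -1.2, 2.0])
phi = np.array([1.0, 0.3, -0.7])
a = w @ phi                        # w^T phi
p = sigmoid(a)                     # p(C1 | phi)
logit = np.log(p / (1.0 - p))      # ln(p / (1 - p))
print(bool(np.isclose(logit, a)))  # True
```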

The results obtained are shown in Figures 5.2-5.5 for each of the 4 adverse events.

Each point in the ROC curve corresponds to one of the models obtained by varying

the number of features between 2 and 11. The splitting into training and testing data was done by allowing 2/3 of the available set for training and 1/3 for testing. The procedure was repeated 20 times, each time shuffling the data so that the classifier would learn and be tested on different data points; the results plotted (for negative predictive value) are the 95% confidence intervals computed over the 20 outcomes. This method was preferred to cross-validation because of the imbalanced class distribution (for example, in the case of Nail Disorder the positive class makes up 10% of the examples, and employing a 10-fold cross-validation technique may give misleading results, as it is highly probable that a testing fold will contain no or very few positive cases, which will affect the computation of sensitivity and specificity). This methodology will be maintained throughout the experiments.
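The evaluation protocol just described (20 random 2/3-1/3 splits instead of cross-validation) can be sketched as follows, using synthetic data in place of the clinical data set:

```python
# 20 shuffled 2/3-1/3 splits; sensitivity is recorded on each shuffle.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + rng.normal(size=300) > 0.5).astype(int)  # imbalanced target

sens = []
for shuffle in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1 / 3, random_state=shuffle)    # reshuffle each round
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    sens.append(tp / (tp + fn))

print(len(sens))  # 20
```

A 95% confidence interval over the 20 recorded values can then be reported, as in the figures.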

Figure 5.2: Appetite disorder prediction using Logistic regression. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features.

Figure 5.3: Neutropenia prediction using Logistic regression. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features.


Figure 5.4: Nail Disorder prediction using Logistic regression. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features.

Figure 5.5: Neuropathy prediction using Logistic regression. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features.

The results indicate in all four cases a poor ability of the classifier to correctly identify the positive class (low sensitivity), both for the global and for the local model. However, it can still be noticed that for Neutropenia the sensitivity is higher (it reaches 60% for a specificity of 70%) compared to the other adverse events, where it rarely rises above 20%. Moreover, for all adverse events the local model had a higher negative predictive value than the global one, while maintaining comparable results for sensitivity and specificity, which indicates a higher degree of certainty when predicting that a person will not experience a particular adverse event. The results shown are only for splitting the data on the first feature indicated by the consistency index in the previous chapter, as the others were similar.

Under the same restrictions of attempting to keep the model interpretable, with

decision rules that are easy to explain, Naive Bayes and Decision Trees were also

applied as classifiers. However, the results obtained did not differ significantly from

those obtained using logistic regression. A summary is shown in Tables 5.1 and 5.2, where the results are averaged over the 20 shuffling rounds for 10 features.

Adverse Event   Type of Analysis   Sensitivity   Specificity   Negative Predictive Value
Appetite        Global             0.2145        0.8493        0.6578
Appetite        Local (BLBSAM)     0.2031        0.8527        0.7091
Neutropenia     Global             0.5793        0.8364        0.7424
Neutropenia     Local (cm71)       0.5829        0.8130        0.7743
Nail disorder   Global             0.1615        0.9672        0.8909
Nail disorder   Local (ncycles)    0.1369        0.9675        0.9002
Neuropathy      Global             0.1529        0.9887        0.8978
Neuropathy      Local (cm163)      0.1652        0.9601        0.9013

Table 5.1: Global and Local performance obtained using Naive Bayes for Appetite, Neutropenia, Neuropathy and Nail disorder


Adverse Event   Type of Analysis   Sensitivity   Specificity   Negative Predictive Value
Appetite        Global             0.2674        0.7870        0.6255
Appetite        Local (BLBSAM)     0.2679        0.7658        0.7017
Neutropenia     Global             0.5554        0.7954        0.7098
Neutropenia     Local (cm71)       0.5644        0.7954        0.7640
Nail disorder   Global             0.1241        0.9593        0.8686
Nail disorder   Local (ncycles)    0.1355        0.9403        0.8966
Neuropathy      Global             0.1898        0.9505        0.8600
Neuropathy      Local (cm163)      0.1706        0.9497        0.8922

Table 5.2: Global and Local performance obtained using Decision Trees for Appetite, Neutropenia, Neuropathy and Nail disorder

The results obtained for Decision Trees and Naive Bayes indicate the same poor performance of the classifiers in correctly identifying the positive class. As far as local vs. global analysis is concerned, the values obtained for sensitivity and specificity are comparable. However, it can be noticed that the negative predictive value is higher for the local model in all cases. Among the possible explanations for the low sensitivity obtained are the inappropriateness of the classifiers chosen, the number of features considered and the highly imbalanced class distribution, especially for Nail Disorder and Neuropathy.

It can be noticed that as the number of positive examples increases (Appetite (30%), Neutropenia (35%)), so does the sensitivity. On the other hand, for these adverse events the results also indicate smaller values for specificity and negative predictive value. However, as the problem is concerned with predicting adverse events, the focus of a predictive model is identifying as many people as possible from those who will develop an adverse event (high sensitivity) while maintaining a reasonable number of false positives (which implies a high specificity).


5.4 Model Building - Phase II

In the second part of model building, in an attempt to improve performance and address the possible problems identified in the first part, the constraints imposed to maintain interpretability are removed. Therefore, the classifiers applied will make use of more complex rules for creating the nonlinear boundaries that separate the data. Moreover, since the data is highly imbalanced against the positive class, methods for balancing the training data will be considered.

5.4.1 Balancing class distribution

This section briefly overviews some of the most popular techniques for dealing with the imbalanced class problem and explains the method chosen for the following experiments. The techniques for balancing class distribution can be broadly classified into under-sampling, over-sampling and techniques that create new data points.

• In the random undersampling technique, the size of the majority class is decreased by removing randomly chosen points belonging to it from the data set [24]. Though this technique gives empirically good results, its major disadvantage is that it discards possibly important information.

• The complementary random oversampling technique works by increasing the number of minority class data points, simply by replicating existing ones. Though this technique shares the simplicity of random undersampling, it has been argued that it does not add any new information to the data set, as it only copies existing information [24].

• A technique that attempts to oversample the minority class while adding new information is SMOTE (Synthetic Minority Oversampling Technique) [27]. The main idea of SMOTE is to choose an existing data point from the minority class and find its n closest neighbors. From those, one neighbor is chosen randomly and a new data point is created as a random point on the line segment joining the initial data point and its neighbor [27]. However, as we deal with categorical data, this technique is not suitable in this framework.

The oversampling method employed in the experiments also adds new information, by approximating the distribution of each feature from the available examples in the positive class and generating new data points of the same class that follow that distribution. The technique is based on the Inverse Transformation Method for discrete random variables [25]. Its main drawback, which will be considered for future work, is that each feature is treated independently of the others, an assumption which may not hold in reality. In order to ensure the validity of the results, the oversampling was applied only to the training data; the performance of the model was assessed on testing data that came from the original data set.
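Under the stated independence assumption, the inverse-transform oversampling could be sketched as follows (synthetic categorical data; not the code used in the study):

```python
# Per-feature oversampling of the positive class: estimate each feature's
# empirical distribution on the positive examples, then draw new examples
# feature by feature via the inverse-transform method for discrete variables.
import numpy as np

def oversample_positives(X_pos, n_new, rng):
    n, d = X_pos.shape
    new = np.empty((n_new, d), dtype=X_pos.dtype)
    for j in range(d):
        values, counts = np.unique(X_pos[:, j], return_counts=True)
        cdf = np.cumsum(counts / n)                  # empirical CDF of feature j
        cdf[-1] = 1.0                                # guard against float drift
        u = rng.random(n_new)                        # uniform draws in [0, 1)
        new[:, j] = values[np.searchsorted(cdf, u)]  # invert the CDF
    return new

rng = np.random.default_rng(0)
X_pos = rng.integers(0, 3, size=(40, 6))   # illustrative positive-class data
synthetic = oversample_positives(X_pos, n_new=100, rng=rng)
print(synthetic.shape)  # (100, 6)
```

Because each column is sampled independently, correlations between features are not preserved; this is exactly the drawback noted above.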

5.4.2 Ensemble classifiers

AdaBoost-Overview

AdaBoost (adaptive boosting) is the most widely used form of boosting, a family of algorithms that combine multiple base classifiers to produce the final outcome of a classification task [8]. This type of algorithm was chosen here as it is known to give good results even when the performance of the base classifiers is only slightly better than random [8] (which is the case for the classifiers applied in the first phase).

In AdaBoost a series of weak classifiers is trained in sequence, each on a weighted training set where the weights are updated according to the performance of the previous classifier: the points that were misclassified by the previous classifier are assigned higher weights when the set is used to train the next classifier in the sequence [8]. To output the final prediction, the outcomes of all the classifiers are combined through a weighted majority voting scheme [8].
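The boosting loop just described can be sketched as a minimal, textbook-style AdaBoost; logistic regression serves as the weak learner as in the experiments, but the data and number of rounds are illustrative:

```python
# Minimal AdaBoost: each round fits a base classifier on reweighted data;
# the ensemble predicts by a weighted majority vote over the rounds.
import numpy as np
from sklearn.linear_model import LogisticRegression

def adaboost_fit(X, y, rounds=10):
    # y must be in {-1, +1} for the weight-update formula below
    n = len(y)
    w = np.full(n, 1.0 / n)
    models, alphas = [], []
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # this round's voting weight
        w *= np.exp(-alpha * y * pred)          # upweight misclassified points
        w /= w.sum()
        models.append(clf)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    votes = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(votes)                       # weighted majority vote

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] > 0, 1, -1)
models, alphas = adaboost_fit(X, y)
print(len(models))  # 10
```

scikit-learn's `AdaBoostClassifier` offers the same scheme off the shelf.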

Experimental design and results

In the experiments, logistic regression was used as the base classifier. The number of features was initially varied between 2 and 50 in steps of 5. However, as shown in Figure 5.6, the sensitivity slowly decreases as the number of features increases above 10. Consequently, the number of features used was, as in the first phase, between 2 and 11. The same observation holds for the other classifiers employed in this set of experiments: increasing the number of features further degrades the classifier's ability to correctly identify the positive class. The separation between training and testing was done as in the previous section, allowing 2/3 of the data for training and 1/3 for testing, but the class distribution in the training data was balanced using the oversampling technique described in section 5.4.1, so that the numbers of positive and negative examples were equal. The results for each of the four events are shown in Figures 5.7, 5.8, 5.9 and 5.10.

Figure 5.6: Variation of sensitivity as the number of features increases. Left: AdaBoost (base classifier: Logistic Regression) for predicting Neutropenia. Right: Random forest for predicting Neutropenia.


Figure 5.7: Neutropenia prediction using AdaBoost. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features.

Compared to the initial results in Phase I, an increase in sensitivity can be noticed, associated with only a small drop in specificity. The negative predictive value has also shown a small increase, from approx. 70% (for the local model) to around 80%. In both phases the local model had a higher negative predictive value than the global one. While in the first phase there was no significant difference between local and global models as far as sensitivity and specificity are concerned, here the local model tends towards a higher sensitivity, while the global one tends towards a higher specificity.

Figure 5.8: Appetite prediction using AdaBoost. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features.


For Appetite, no significant improvement in prediction performance can be noticed. The small increase in sensitivity, associated with a decrease in specificity, shows that the results almost follow the x = y line, which signifies random guessing. The relatively high negative predictive value can be explained by the fact that, since the distribution in the testing data has not been changed (30% positive class), a basic classifier that always predicts the negative class will have a negative predictive value of approx. 70%.

Figure 5.9: Nail Disorder prediction using AdaBoost. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features.

In the case of Nail Disorder, a higher sensitivity (40-60%) came at the cost of a larger drop in specificity (or, conversely, an increase in false positive rate). The difference between the global and local models can be observed only for sensitivity over 40%, where the global one performed better. However, for the negative predictive value, the local one had better results.


Figure 5.10: Neuropathy prediction using AdaBoost. Left: Negative Predictive Value in Local vs. Global Models; Right: ROC points for models built varying the number of features.

The performance in predicting Neuropathy is similar to that obtained for Nail Disorder: an increase in sensitivity is only obtained at a lower specificity. The results for all four adverse events indicate a small increase in sensitivity (compared to the performance obtained in Phase I); however, these values are associated with a smaller specificity. As in the previous phase, the best results were obtained for Neutropenia. In terms of comparison between local and global analysis, the performance is similar as far as sensitivity and specificity are concerned. On the other hand, the negative predictive value was again better for the local analysis for all adverse events.

Random forest

Random forest is an ensemble classifier that uses a number of decision trees built on bootstrap samples of the data, and whose final prediction is the majority vote of the individual classifiers. The randomness comes from the fact that at each node of a decision tree a number of candidate splits are chosen randomly, and the best among them is used. As the results are similar to those obtained using AdaBoost, only a summary is shown in Table 5.3, computed using 10 decision trees and averaging over 20 shuffles of the data.
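A minimal sketch of this setting (10 trees and a 2/3-1/3 split, with synthetic binary data standing in for the clinical features):

```python
# Random forest with 10 trees on synthetic binary data; the target is a
# deterministic function of two features, so the forest should recover it.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 10))
y = X[:, 0] & X[:, 1]                     # illustrative target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)
print(round(forest.score(X_te, y_te), 2))
```

In the dissertation's setting the scores in Table 5.3 are additionally averaged over 20 shuffles of the data.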

Adverse Event   Type of Analysis   Sensitivity   Specificity   Negative Predictive Value
Appetite        Global             0.3757        0.7007        0.6078
Appetite        Local (BLBSAM)     0.4278        0.6073        0.7131
Neutropenia     Global             0.6802        0.7335        0.7147
Neutropenia     Local (cm71)       0.6896        0.6971        0.7943
Nail disorder   Global             0.2208        0.8128        0.8157
Nail disorder   Local (ncycles)    0.3336        0.9510        0.9038
Neuropathy      Global             0.3198        0.8881        0.8176
Neuropathy      Local (cm163)      0.2135        0.8283        0.8890

Table 5.3: Global and Local performance obtained using Random Forest for Appetite, Neutropenia, Nail Disorder and Neuropathy

The most accurate performance was for Neutropenia using the local model (sensitivity 68%, specificity 73% and negative predictive value 71%). For the other adverse events the results follow the same pattern: a low sensitivity and a higher specificity, which shows again that, even using an oversampling method, the classifiers identify only a small percentage of the positive class.

5.4.3 Support Vector Machines

Support Vector Machines are a family of classifiers that construct a decision boundary (a hyperplane), or a set of decision boundaries, separating the data into different classes. An SVM finds the best hyperplane, that is, the one with the maximum margin between the points of the classes. It can also be employed for non-linearly separable data sets by using the so-called kernel trick, which maps the features to a higher dimensional space using different kernel functions; thus, a data set that was not linearly separable in the original space may become linearly separable in the higher-dimensional space.


Experimental design and results

In the experiments the Radial Basis Function (RBF) was used as the kernel function. It is defined as:

φ(x, y) = exp(−γ ||x − y||²)

The gamma parameter (γ), which controls the kernel width, was varied between 10^-5 and 10^9

in order to understand how it affects the classifier's performance. The splitting into training and testing data was done by considering 2/3 for training and 1/3 for testing. In order to decrease the variance and allow the classifier to learn and be tested on different parts of the input space, the data was shuffled 20 times and the results averaged. Each time, the number of positive cases in the training data was increased using the technique described in section 5.4.1, so that there were equal numbers of positive and negative examples. The results are shown in Figures 5.11 and 5.12. For this experiment, the number of features was fixed at 10, and each point in the ROC space represents the classifier obtained for a different gamma parameter.
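The gamma sweep can be sketched as follows; the data, gamma grid and split are illustrative, and each fitted model contributes one ROC point:

```python
# For each RBF width gamma, fit an SVC and record the ROC point
# (false positive rate, sensitivity) on a held-out test set.
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (np.linalg.norm(X[:, :2], axis=1) < 1.2).astype(int)  # non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=0)

roc_points = []
for gamma in [1e-5, 1e-3, 1e-1, 1e1, 1e3]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(
        y_te, clf.predict(X_te), labels=[0, 1]).ravel()
    roc_points.append((gamma, fp / (fp + tn), tp / (tp + fn)))  # (gamma, FPR, TPR)

print(len(roc_points))  # 5
```

Varying gamma trades sensitivity against specificity, which is what the scatter of points in Figures 5.11 and 5.12 shows.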

Figure 5.11: Neuropathy (Left) and Neutropenia (Right) prediction using SVM


Figure 5.12: Nail Disorder (Left) and Appetite (Right) prediction using SVM

For different values of the gamma parameter, different points in the ROC space were obtained. It can be noticed that, generally, a higher sensitivity comes with a drop in specificity. Better results compared to previous ones can be noticed for Neuropathy, while for the others there is no significant difference.

For Neuropathy, the local model was better than the global one in the majority of

cases. Compared with the initial results from the first phase, the models obtained

using SVM and oversampling show a significant increase in sensitivity. However,

this increase came at the cost of a decrease in specificity.

For Nail Disorder, the local model has a higher sensitivity (it reaches 50%), but this increase came at the cost of a considerable reduction in specificity (60%). On the other hand, the results obtained with the global model were less sensitive to varying gamma. As with all the other classifiers used in the experiments, the best results were obtained for the adverse event that was naturally more balanced, Neutropenia.


5.5 Chapter summary and Conclusions

5.5.1 Summary

This chapter investigated the performance of different classifiers in predicting the four adverse events. The experiments were conducted in two phases: in the first, in order to keep the resulting models easy to interpret, less complex classifiers were used (Logistic Regression, Decision Trees and Naive Bayes), while in the second the experiments were conducted using AdaBoost, Random Forests and Support Vector Machines along with an oversampling technique. Based on the results obtained in the previous chapter, where the most different splits were identified, a method for building local models and comparing their performance with global ones was proposed.

5.5.2 Conclusions

The results indicate that the only predictable adverse event is Neutropenia. For all the other events (Neuropathy, Nail Disorder, Appetite), the classification results can be summarized as having low sensitivity and high specificity in the first phase of model building (see Figures 5.2, 5.4, 5.5, Table 5.1 and Table 5.2). Using more powerful classifiers along with oversampling in the second phase resulted in an increase in sensitivity, accompanied by a decrease in specificity (see Figures 5.8, 5.9, 5.10 and Table 5.3).

Neutropenia

In the first phase of model building the results show that, depending on the number of features used, the sensitivity ranges between 40% and 65%, while the specificity varies between 60% and 80% (the higher the sensitivity, the lower the specificity; see Figure 5.3).


In the second phase of model building, the predictability of Neutropenia improved as follows: the sensitivity increased to between 60% and 80%, while the specificity was maintained at the same level (60%-80%); see Figure 5.7. The negative predictive value also improved, from approximately 70% (for the local model; see Figure 5.3) to 80% (Figure 5.7).

The local analysis for Neutropenia was carried out on the two subsets resulting from splitting the initial dataset on feature cm71 (Benzodiazepine derivatives). This feature was indicated as the first choice for splitting (see the chapter Analysis of feature importance within subsets, Table 4.2). While in the first phase of model building,

the performance of local models was similar to that of the global ones as far as sensi-

tivity and specificity are concerned (Figure 5.3), in the second phase using Adaboost

and Random Forest, the local model had a higher sensitivity, but a lower specificity

than the global one (Figure 5.7). In both phases, the negative predictive value was

higher for the local analysis. Using SVM in the second phase revealed that the local

model performed better than the global one in the majority of the settings obtained by varying the gamma parameter (Figure 5.11).

The performance of a classifier in the present setting is mainly concerned with cor-

rectly identifying the people who will experience an adverse event while maintaining

a reasonable rate of false positives. This is because the real cost of allowing a person who will experience an adverse event to take part in a clinical trial (that is, making a false negative error) is higher than the cost of preventing someone who will not experience an adverse event from taking part (a false positive error). For this

reason, the performance of predicting Neutropenia can be considered better in the

local analysis than in the global one.
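For concreteness, the three measures discussed throughout this chapter follow directly from confusion-matrix counts. The sketch below uses illustrative numbers only, not values from the experiments:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Performance measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of adverse-event patients correctly flagged
    specificity = tn / (tn + fp)   # fraction of event-free patients correctly cleared
    npv = tn / (tn + fn)           # confidence that a negative prediction is correct
    return sensitivity, specificity, npv

# Illustrative counts: 10 patients with the event, 100 without.
sens, spec, npv = diagnostic_metrics(tp=6, fp=20, fn=4, tn=80)
```

Under the cost argument above, a false negative (a patient with tp + fn who is missed) weighs more than a false positive, which is why sensitivity and negative predictive value are the measures emphasised in the comparison.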


Chapter 6

Conclusions

This chapter presents an overview of the research carried out in this project along with

the main conclusions drawn. In addition to this, it proposes possible further inves-

tigation steps that have not been covered. The main objective of the project was to

investigate the hypothesis that feature importance differs in different parts of the

input space and based on the results to propose a method for building local clas-

sification models and compare the performance with the global ones. The research

was carried out on the data provided by the pharmaceutical company AstraZeneca and

focused on the investigation of predicting four adverse events: Appetite Disorder,

Neutropenia, Nail Disorder and Neuropathy. The project was carried out in three

main phases as described below.

6.1 Summary of the research and conclusions

6.1.1 Data Preprocessing and Initial Experiments

This part gives a thorough description and motivation of the choices made in preprocessing the data, as these influence the final results obtained. Moreover, initial


experiments are carried out in three phases in order to gain insight into the relationship between the data and the adverse events to be analyzed.

1. The first experiment was to analyze the normalized mutual information be-

tween each feature and the four adverse events.

2. The second consisted in choosing five features (Sex, Race, hstltyp (Histology type), Smoking habits and Tstage (Cancer Stage)) to split the data on and analyzing the mutual information locally, in each of the resulting subsets.

3. The third experiment also focused on local analysis of individual feature im-

portance, but the subsets of the data were obtained by running a k-means clustering algorithm.

These initial experiments revealed a small statistical dependence between the mea-

surements and each of the four adverse events. However, a small increase could be

noticed when the mutual information was computed locally, in different subsets of

the initial data.
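As an illustration of this kind of analysis, a minimal sketch of a normalized mutual information computation for discrete variables is given below. Normalizing by the smaller marginal entropy is one common convention and an assumption here; it may differ from the normalization used in the thesis:

```python
from collections import Counter
from math import log2

def normalized_mi(xs, ys):
    # I(X;Y) divided by min(H(X), H(Y)): 0 for independent variables,
    # 1 when one variable fully determines the other.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    def entropy(counts):
        return -sum(c / n * log2(c / n) for c in counts.values())
    mi = sum(c / n * log2((c / n) / ((px[a] / n) * (py[b] / n)))
             for (a, b), c in pxy.items())
    h = min(entropy(px), entropy(py))
    return mi / h if h > 0 else 0.0
```

A local analysis simply applies the same computation after restricting `xs` and `ys` to the rows of one subset.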

6.1.2 Analysis of feature importance within subsets

The second part of the project attempted to understand how the importance of each

feature in predicting a specific adverse event changes within different subsets and

identify the features that are only locally important. This part of the research is

structured into two subsections that aim at answering two questions:

1. Which are the most discriminant features? A discriminant feature is considered one that splits the original data into two subsets with the following characteristic: the features that are predictive for a specific adverse event in one subset are different from the features that are predictive for the same adverse event in the other subset.


2. Which are the biomarkers that are only locally important? A locally important

biomarker is considered one that is predictive of an adverse event only in a subregion of the input space.

Firstly, a method is proposed for discovering the features that produce the most dissimilar subsets in terms of their top 10 predictive features. The subsets are obtained by splitting the initial data on the categories each feature defines. The method is based on the Consistency Index introduced in [22], which measures the similarity between two feature sets. Then, a score is associated with each feature based on its frequency of appearance among the top predictive features within different subsets. A threshold on this score is proposed for considering a feature only locally important.

The results indicate that the most discriminant feature is:

• For Appetite Disorder: BLBSAM (Baseline Body surface Area)

• For Neutropenia: cm71 (Benzodiazepine derivatives)

• For Nail Disorder: ncycles (Number of Cycles during Study)

• For Neuropathy: cm163 (H2-receptor antagonists)

The second investigation revealed that there are features that change their impor-

tance within different subsets and show only a local predictive power. A summary of

the most significant results is given below, mentioning the biomarker, its associated score of occurrence among the top predictive features, the adverse event and the group of people for which it is important. In the complementary group, the same biomarker

was either never present among top predictive features or showed a considerably

lower score of occurrence (50% smaller).


Cm229: Osmotically acting laxatives

• 60% of the time present in top 4 predictive features for Appetite in the group

of people who had more than 3 cycles of doses

Cm147: Electrolyte Solutions

• 60% of the time present in top 5 predictive features for Appetite for people

with BLBMI > 25

Cm108: Colony Stimulating factors

• 100% of the time present in top 2 predictive features for Neutropenia for people

who have metastasis in lymph nodes (lmsite9)

• 100% of the time present in top 4 predictive features for Neutropenia in the

group of people who did not have prior chemotherapy with Vinorelbine (prt25)

BLBMI: Baseline Body Mass Index

• 80% of the time present in top 5 predictive features for Neuropathy for people

with no metastasis in Hepatic System (lmsite8)

• 80% of the time present in top 5 predictive features for Neuropathy for people

who had cm163 (H2-receptor antagonists)

6.1.3 Predictive model building

The last part of the project analyzes the performance of different classifiers in pre-

dicting the four adverse events. The analysis was carried out in two phases:


1. Firstly, in order to maintain a simple, possibly interpretable model, less com-

plex classifiers were used: Logistic Regression, Naive Bayes and Decision Trees.

2. In the second phase, more advanced classification methods are employed (Ran-

dom Forest, Adaboost, Support Vector Machines) together with an oversam-

pling technique to balance the distribution of the positive and negative exam-

ples in the training data.
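The oversampling step can be sketched as follows, under the feature-independence assumption the technique makes (see also the future-work note on relaxing it). The function name and toy data are illustrative only:

```python
import random

def independent_oversample(positives, n_new, seed=0):
    # Generative oversampling sketch: each synthetic positive example draws
    # every feature value independently from that feature's empirical
    # distribution within the positive (minority) class.
    rng = random.Random(seed)
    columns = list(zip(*positives))          # per-feature value pools
    return [tuple(rng.choice(col) for col in columns) for _ in range(n_new)]

minority = [(1.0, 0), (3.0, 1), (2.0, 0)]    # toy positive-class examples
synthetic = independent_oversample(minority, n_new=4)
```

The synthetic examples are appended to the training set only, so the test data remain untouched by the balancing step.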

In addition to this, based on the results obtained in the second part of the project

(Analysis of feature importance within subsets), a method for building local predic-

tive models and comparing their performance with the global ones was proposed.

The local analysis was carried out for both phases considered in this chapter.

The results obtained in this part showed that the only adverse event that can be

predicted from the measurements provided is Neutropenia. The data set on which

the prediction task was carried out consists of measurements from patients who

received the placebo in the clinical trial. The poor prediction performance obtained

for Appetite Disorder, Nail Disorder and Neuropathy indicates that there is not a

strong relationship between the occurrence of these adverse events and the clinical

measures taken. Repeating the investigation on the people that actually received

the new drug and comparing the prediction performance with the current results

can reveal whether the drug has a greater influence on the occurrence of any of the

adverse events.

For Appetite Disorder, Neuropathy and Nail Disorder, although the analysis used both simple and complex classifiers with different numbers of features and different parameter settings, along with an oversampling technique to artificially increase the number of positive cases, the results obtained indicate that there is not

a significant dependence between the biomarkers measured and the occurrence of

these events. The performance of the classifiers in predicting Appetite Disorder,

Neuropathy and Nail Disorder can be summarized as showing a low sensitivity and


a high specificity in the first phase of the analysis (see Figures 5.2, 5.4 and 5.5),

while in the second phase an increase in the sensitivity can be obtained, but only

associated with a significant decrease in specificity (see Figures 5.8-5.10).

In predicting the occurrence of Neutropenia, the performance obtained was a sen-

sitivity between 40%-65% associated with a specificity between 60%-80% (a higher

sensitivity corresponds to a lower specificity) for the simple classifiers such as Lo-

gistic Regression, Naive Bayes and Decision Trees. The variation of the results is

attributed to the different numbers of features used: between 2 and 11 (Figure 5.3 and Table 5.1). Using more complex classifiers (AdaBoost, Random Forest) and oversampling the positive class so that there is an equal number of positive and negative examples in the training data, the results can be improved: the sensitivity increased, while the specificity was maintained at the same level (see Figure 5.7). The negative predictive value was also higher in the second phase of model building, that is, when the complex classifiers were used.

The local models for predicting Neutropenia were built on the subsets obtained by

splitting the initial data on cm71 (Benzodiazepine derivatives) as computed in the

chapter Analysis of Feature importance within subsets. As for comparing the perfor-

mance of local vs. global models, the local model had a higher negative predictive

value in all the experiments. In terms of specificity and sensitivity, the results for

the first phase of model building were similar for both approaches (the differences

are mainly associated with the different numbers of features used). However, in the second phase a tendency can be noticed for the local model towards higher sensitivity and for the global model towards higher specificity. Moreover, using SVM

with different parameter settings the performance of the local model was better than

the performance of the global one in the majority of cases (Figure 5.11-Right).

Since the predictive task is set in a medical framework, obtaining a high sensitivity (without a major loss of specificity) is of greater importance, as this means a larger number of people correctly identified as suffering an adverse event. All things


considered, since the local model was either similar or better than the global one

in terms of specificity and sensitivity and always better for the negative predictive

value, the local analysis has an advantage over the global one.

6.1.4 How can the proposed techniques be transferred to

new data sets?

The thesis presented research into locally important biomarkers on a specific data set. However, the methods proposed can easily be transferred to new data. This section summarizes the main steps of the procedures employed.

Step 1: Identifying the most discriminant feature

This procedure identifies the most discriminant feature in a data set, which in this context is the one that splits the original data into two subsets such that the features that are predictive for a target variable in one subset are different from the features that are predictive for the same variable in the other subset.

Begin: Define a valid split of the data in the context of the given problem

(minimum number of datapoints in a group, merging subcategories, etc)

For each feature that creates a valid split

Split the data into two subsets: A and B

FS1 = top k predictive features after running a feature selection algorithm on A

FS2 = top k predictive features after running a feature selection algorithm on B

Compute the Kuncheva Consistency Index¹ for FS1 and FS2

End For

Display the feature with the lowest Kuncheva Consistency Index (most discriminant).

End Procedure

Depending on the number of instances available and on the constraints imposed for

¹For data sets where individual features have a significant mutual information with the target variable, the Importance Profile Angle proposed in [1] can also be used.


a valid split, the procedure can be repeated recursively, such that A and B can each be treated as an individual problem and split further. However, it should be noted that the smaller the problem, the less accurate the results obtained and the higher the risk of overfitting.

Step 2: Identifying locally important biomarkers

The second step is to identify which biomarkers have only local importance. The discriminant features obtained in Step 1 are employed to decompose the initial problem into subspaces for a local analysis of biomarker importance.

Begin: Compute n bootstrap samples of the initial data

For each bootstrap sample

Split the data on a discriminant feature (see Step 1) into subgroups A and B

Select top k important features for A using a feature selection algorithm

and update their frequency of occurrence, frq1

Select top k important features for B using a feature selection algorithm

and update their frequency of occurrence, frq2

End For

F = the union of features that appeared as being important at least once in one of the two subsets (frq1 > 0 or frq2 > 0)

For each feature f in F

Compute the scores of occurrence by normalizing the frequencies:

s1 = frq1/n

s2 = frq2/n

If (min(s1, s2) = 0 and max(s1, s2) ≥ 0.5) or
((s1 > 0.7 or s2 > 0.7) and max(s1, s2) / min(s1, s2) > 1.5)²
then f is a locally important feature.

End If

End For

End Procedure

²The threshold may be adjusted to meet the particularities of a specific data set.
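The decision rule above translates directly into code; the thresholds are those from the procedure, subject to the adjustment caveat in footnote 2:

```python
def is_locally_important(frq1, frq2, n):
    # frq1, frq2: number of bootstrap runs (out of n) in which the feature
    # appeared among the top-k predictive features of subsets A and B.
    s1, s2 = frq1 / n, frq2 / n
    lo, hi = min(s1, s2), max(s1, s2)
    if lo == 0:
        # present in one subset only, at least half the time
        return hi >= 0.5
    # strong in one subset, considerably weaker in the other
    return hi > 0.7 and hi / lo > 1.5
```

For example, a biomarker appearing in 6 of 10 bootstrap runs in subset A and never in subset B satisfies the first branch of the rule.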


Step 3: Building local predictive models

The last step is the actual building of the local predictive models. General guidance on how this can be done so as to allow comparison with the global models is given below:

• Split the data into two subsets³, A and B, on a particular discriminant feature;

• Consider each of the subsets and the corresponding target as a different prob-

lem by separating locally the training and the testing sets and building two

local models, one for each subset;

• Concatenate the two testing sets that result from splitting both A and B into training and testing data, and predict the class of each point using the model built on the subset it belongs to.
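These guidelines can be sketched as follows. The nearest-centroid classifier is only a stand-in for the classifiers of Chapter 5, and the split and evaluation details are illustrative assumptions:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Minimal stand-in classifier: one centroid per class.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = list(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

def local_predictions(X, y, split_col, test_frac=0.3, seed=0):
    # One local model per subset of the discriminant feature; each held-out
    # point is predicted by the model of the subset it belongs to.
    rng = np.random.default_rng(seed)
    y_true, y_pred = [], []
    for mask in (X[:, split_col] > 0, X[:, split_col] <= 0):
        Xs, ys = X[mask], y[mask]
        idx = rng.permutation(len(ys))
        n_test = int(len(ys) * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        model = nearest_centroid_fit(Xs[train], ys[train])
        y_true.extend(ys[test])
        y_pred.extend(nearest_centroid_predict(model, Xs[test]))
    return np.array(y_true), np.array(y_pred)
```

Because the two held-out sets are concatenated, the resulting predictions can be scored with the same sensitivity/specificity measures as a single global model, making the local and global approaches directly comparable.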

6.1.5 Future work

There are several possible paths that can be considered for further investigation of

the data set and of the hypothesis that local feature selection can provide better results than a global approach.

Analysis of feature importance within subsets

• Integrating prior knowledge in identifying the degree of heterogeneity of the

dataset and choosing a more meaningful decomposition into subsets. Having

a medical opinion regarding how to group the patients that are more likely to

have similar causes for an adverse event could potentially improve the classi-

fication results, as it decreases the chances that a classifier would be confused

by features that are only meaningful for a certain category of people.

³An extension to more than two subsets can also be employed in an analogous manner.


• Repeating the analysis of the local biomarkers using feature selection criteria other than JMI and investigating the stability of the obtained results.

• A more fine-grained investigation of the most suitable threshold on feature

scores in order to consider a biomarker only locally important.

6.1.6 Model Building

• Since the current oversampling technique assumes that the features are inde-

pendent, steps could be considered for relaxing this assumption and taking

into account the possible dependencies between the features.

• A further analysis of other kernels, as only the RBF kernel was investigated in applying SVM.

• Experimenting with different classifiers for the local and global models. In the

present study, the same classifier and the same feature selection method were used for both the global and local models.


References

[ 1 ] Apte C, Hong J, Hosking J, Lepre J, Pednault E, Rosen B. 1997. Decomposition of heterogeneous classification problems. Proceedings of the Second International Symposium on Advances in Intelligent Data Analysis, Springer-Verlag, 17-28.

[ 2 ] Battiti R. 1994. Using mutual information for selecting features in supervised

neural net learning. IEEE Transactions on Neural Networks. 5(4):537-550

[ 3 ] Bontempi G, Meyer P. 2006. On the Use of Variable Complementarity for

Feature Selection in Cancer Classification. Applications of Evolutionary Com-

puting 91-102.

[ 4 ] Brown G, Pocock A, Zhao M, Lujan M 2010. Feature Selection via Conditional

Likelihood. Journal of Machine Learning Research 1-48.

[ 5 ] Brown G, 2010. Lecture Pack1, Machine Learning and Data Mining. The

University of Manchester.

[ 6 ] Domingos P. Context-Sensitive Feature Selection for Lazy Learners. 1997.

Journal of Artificial Intelligence Review, 11(1-5), 227-253.

[ 7 ] Edwards R, Aronson JK 2000. Adverse drug reactions: definitions, diagnosis

and management. The Lancet, vol 356.

[ 8 ] Guyon I, Elisseeff A 2003. An Introduction to Variable and Feature Selection.

Journal of Machine Learning Research 3(2003) 1157-1182.


[ 9 ] Guyon I, Gunn S, Nikravesh M, Zadeh L.A (Eds.)2006. Feature Extraction.

Foundations and Applications. Springer-Verlag Berlin Heidelberg.

[ 10 ] Kira K, Rendell L 1992. A practical Approach to Feature Selection In Pro-

ceedings of the ninth international workshop on Machine learning (ML92).

Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 249-256.

[ 11 ] Liu H, Motoda H, Setiono R, Zhao Z 2010. Feature Selection: An Ever

Evolving Frontier in Data Mining. Journal of Machine Learning Research:

Workshop and Conference Proceedings 10: 4-13

[ 12 ] Pechenizkiy M, Tsymbal A, Puuronen S. 2006. Local Dimensionality Reduc-

tion and Supervised Learning Within Natural Clusters for Biomedical Data

Analysis. IEEE Transactions on Information Technology in Biomedicine.

10(3):533-9.

[ 13 ] Peng H, Long F, Ding C.(2005) Feature Selection based on MI: Criteria of

max-dependency, max-relevance, and min-redundancy. IEEE Transactions on

Pattern Analysis and Machine Intelligence vol 27, no 8 1226-1238.

[ 14 ] Pineda-Bautista B, Carrasco-Ochoa J. A, Martinez-Trinidad J. 2010. General

framework for class-specific feature selection. Expert Systems with Applica-

tions. Volume 38, Issue 8.

[ 15 ] Puuronen S, Tsymbal A. 2001. Local Feature Selection with Dynamic Integra-

tion of Classifiers. Journal Fundamenta Informaticae Intelligent Systems, vol

47, issue 1-2, 91-117.

[ 16 ] Puuronen S, Tsymbal A, Skrypnyk I. 2000. Advanced Local Feature Selec-

tion in Medical Diagnostics In Proceedings of the 13th IEEE Symposium on

Computer-Based Medical Systems (CBMS’00) (CBMS ’00). IEEE Computer

Society, Washington, DC, USA, 25-.


[ 17 ] Wang L, Zhou N, Chu F. 2008, A general wrapper Approach to Selection of

Class-Dependent Features. IEEE Transactions on Neural Networks vol 19, no

7.

[ 18 ] VideoLectures.Net (2010) Guyon I. Presentation on: Introduction to feature

selection. [Online] Available at :http://videolectures.net/bootcamp07_

guyon_ifs/ [Accessed : May 2011]

[ 19 ] AstraZeneca (2011) Clinical Trials [Online] Available at:http://www.astrazeneca.

co.uk/rnd/clinical-trials/ [Accessed : May 2011]

[ 20 ] Francois D., Wertz V., Verleysen M., 2006. The permutation test for feature

selection by mutual information. ESANN’2006 proceedings - European Sym-

posium on Artificial Neural Networks. Bruges (Belgium), 26-28 April 2006,

d-side publi., ISBN 2-930307-06-4.

[ 21 ] Hua H., Moody J., 1999. Feature Selection Based on Joint Mutual Informa-

tion. Advances in Intelligent Data Analysis. Rochester New York

[ 22 ] Kuncheva, L.I.: A stability index for feature selection. In: Proceedings of

the 25th International Multi-Conference on Artificial Intelligence and Appli-

cations, February 2007, 390-395.

[ 23 ] M. V. Joshi, V. Kumar, and R. C. Agarwal. Evaluating boosting algorithms

to classify rare classes: Comparison and improvements. In Proceedings of the First IEEE International Conference on Data Mining (ICDM'01), 2001.

[ 24 ] Liu A, Ghosh J, Martin C (2007), Generative Oversampling for Mining Im-

balanced Datasets, In Proceedings of the 2007 International Conference on

Data Mining, Las Vegas, Nevada, USA 2007, CSREA Press.

[ 25 ] Luc Devroye (1986). Non-Uniform Random Variate Generation. New York:

Springer-Verlag.


[ 26 ] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning

(Information Science and Statistics). Springer-Verlag New York, Inc., Secau-

cus, NJ, USA.

[ 27 ] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip

Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J.

Artif. Int. Res. 16, 1 (June 2002), 321-357.

[ 28 ] Wikipedia-Lymph node [Online] Available at http://en.wikipedia.org/

wiki/Lymph_node [Accessed : 15 August 2011]

[ 29 ] Wikipedia- Neutropenia [Online] Available at http://en.wikipedia.org/

wiki/Neutropenia [Accessed : 15 August 2011]
