40
CLASSIFICATION ALGORITHMS IN MALIGNANT ASTROCYTOMAS DIAGNOSIS USING INFORMATION ON GENETIC BIOMARKERS Estudiante: Luis Pérez del Villar MÁSTER EN BIOINFORMÁTICA Y BIOLOGÍA COMPUTACIONAL ESCUELA NACIONAL DE SALUD- INSTITUTO DE SALUD CARLOS III 2013-2014 Computational Intelligence Group, Department of Artificial Intelligence, Technical University of Madrid, Madrid, Spain DIRECTOR DE LA TESIS: Dr. Pedro Larrañaga CODIRECTOR DE LA TESIS Dr. Alfonso Valencia FECHA: September 2014

CLASSIFICATION ALGORITHMS IN MALIGNANT ASTROCYTOMAS ...€¦ · MALIGNANT ASTROCYTOMAS DIAGNOSIS USING INFORMATION ... with uncontrolled cellular proliferation, ... glioblastomas

  • Upload
    buique

  • View
    216

  • Download
    0

Embed Size (px)

Citation preview

CLASSIFICATION ALGORITHMS IN

MALIGNANT ASTROCYTOMAS DIAGNOSIS USING INFORMATION

ON GENETIC BIOMARKERS

Estudiante: Luis Pérez del Villar

MÁSTER EN BIOINFORMÁTICA Y BIOLOGÍA COMPUTACIONAL ESCUELA NACIONAL DE SALUD- INSTITUTO DE SALUD CARLOS III

2013-2014

Computational Intelligence Group, Department of Artificial Intelligence,

Technical University of Madrid, Madrid, Spain

DIRECTOR DE LA TESIS: Dr. Pedro Larrañaga

CODIRECTOR DE LA TESIS Dr. Alfonso Valencia

FECHA: September 2014

¡Gracias! Me gustaría agradecer a Concha Bielza e Irene Rodríguez Hernández sus ideas y

aportaciones a este trabajo; así como a mis directores, Pedro Larrañaga y

Alfonso Valencia, por las facilidades que me han prestado para el desarrollo de

las prácticas.

Índice

1   OBJECTIVES  .................................................................................................................  4  

2   INTRODUCTION  ..........................................................................................................  5  

3   MATERIAL AND METHODS  ......................................................................................  9  

4   RESULTS  .....................................................................................................................  19  

5   DISCUSSION  ...............................................................................................................  29  

6   CONCLUSIONS  ..........................................................................................................  35  

7   REFERENCES  .............................................................................................................  36  

4 OBJECTIVES

1 OBJECTIVES The main objective of the present work was to use of a machine learning approach

based on supervised classification and feature subset selection data mining to

predict 105 grade II to IV astrocytoma diagnosis according to 39 common

alterations in TP53, EGFR-PI3K and Rb pathways, IDH1 mutations, alterations in

DNA repair genes and epigenetic factors. In addition, we also performed

unsupervised hierarchical cluster analysis to gain further support for high-grade

astrocytoma patients’ biology and outcome. In this context, we define the following

specific objectives:

1) To use five well known supervised classification algorithms (logistic regression,

support vector machine, Decision Tree, Naïve Bayes and k-NN) and two further

ensemble methods (bagging and boosting) in order to determine astrocytoma

diagnosis on the basis of its molecular characteristics.

2) To perform a feature selection technique in order to improve the accuracy of

supervised classification algorithms and to seed light about those features that are

relevant for the diagnosis of astrocytomas.

3) To perform unsupervised classification algorithms to differentiate informative

groups about astrocymas survival and to explored those features that are relevant in

the prognosis.

5 INTRODUCTION

2 INTRODUCTION

Diffuse infiltrating astrocytomas are the most common and aggressive

primary brain tumors in adults. They account for 70% of all primary brain tumors

with an incidence rate adjusted to the European Standard Population of 5.27 per 100

000 persons per year [1,2]. Malignant astrocytomas are histologically

heterogeneous and invasive tumours classified on the basis of histopathological

characteristics [3]. Analysis of tumor differentiation, cellularity, cytonuclear atypia,

mitotic activity, microvascular proliferation, and necrosis grades astrocytomas as

low-grade astrocytomas (WHO grade II), anaplastic astrocytomas (WHO grade III)

and glioblastomas (WHO grade IV) with increasing aggressiveness [3].

All diffuse astrocytomas are characterized by a high propensity to infiltrate

the surrounding brain structures that renders them incurable by surgery. Low-grade

astrocytomas occur in young adults between the ages of 30 and 45 years. Low-grade

astrocytomas are well-differentiated and slow-growing tumors with long clinical

courses. They tend to relapse and progress over time to astrocytomas of higher

grades of malignancy with a median survival of 5 to 10 years. Anaplastic

astrocytomas are considered a progression of low-grade astrocytomas. They are

more rapidly fatal and exhibit increased cellularity, nuclear atypia and high mitotic

activity. Anaplastic astrocytomas occur in adults around the age of 45 years and

patients have a median survival of around 2 years [2,4]. Glioblastomas are the most

frequent and aggresive astrocytomas, with a median survival of 12-14 months after

diagnosis despite optimal treatment [5]. Glioblastomas are highly heterogeneous

and infiltrating tumours, with uncontrolled cellular proliferation, high mitotic

activity, regions of necrosis and microvascular proliferation. On the basis on their

clinical presentation, glioblastomas can be separated in primary and secondary

glioblastomas. Primary glioblastomas present in an acute de novo manner with no

evidence of antecedent lower grade pathology. They account for more than 95% of

glioblastomas and occur in patients older than 50 years. In contrast, secondary

6 INTRODUCTION

glioblastomas derives from the progressive transformation of lower grade

astrocytomas, are quite rare and tend to occur in patients below the age of 45 years.

However, primary and secondary glioblastomas are morphologically and clinically

indistinguishable [2,6].

Malignant transformation in astrocytomas results from the sequential

accumulation of genetic aberrations and the deregulation of growth-factor signaling

pathways that are normally associated with the grade of malignance of the tumour.

The classical genetic alterations in astrocytoma target pathways governing cellular

proliferation, cellular survival (apoptosis and necrosis), invasion, angiogenesis and

metabolism [1,2]. Thus, low-grade astrocytomas are characterized by mutations in

TP53 and IDH1 genes, and overexpression of the (platelet-derived growth factor

receptor (PDGFR). In the progress to anaplastic astrocytomas, tumours acquire

mutations in retinoblastoma gene and loss of heterozygosity of chromosomes 9 and

19. In addition, anaplastic astrocytomas could progress to secondary glioblastomas

that carry the same mutations of lower grades of malignance plus loss of

heterozygosity of chromosome 10 where PTEN gene is located. On the other hand,

primary glioblastomas are characterized by loss of heterozygosity of chromosome

10 and consequently of PTEN gene, EGFR amplification, loss of p16, whereas

mutations in TP53 and IDH1 are less frequent. A summary of common alterations

of these tumors are include in figure 1. In general pathologist utilize a variety of

microscopic, genetic, and immunologic techniques to make site-specific diagnosis.

However current techniques are limited in their ability to distinguish different tumor

types. Many specimens are incorrectly classified due to their morphological

similarity to other tumor types. Therefore, the aim of the present work was to use of

a machine learning approach based on supervised classification and feature subset

selection data mining to predict 105 grade II to IV astrocytoma diagnosis according

to 39 common alterations in TP53, EGFR-PI3K and Rb pathways, IDH1 mutations,

alterations in DNA repair genes and epigenetic factors.

7 INTRODUCTION

Figure 1. Gene and genomic regions involved in the biology of diffuse astrocytomas. IDH1=isocitrate dehydrogenase 1. IDH2=isocitrate dehydrogenase 2. PTEN=phosphatase and tensin homologue. PI3K=phosphoinositide-3-kinase. EGFR=epidermal growth factor receptor. MDM2=mdm2 p53 binding protein homologue (mouse). PDGFR=platelet-derived growth factor receptor. Rb=Retinoblastoma. MGMT= methylguanine DNA methyltransferase. LOH=Loss of heterozygosity. CDKN2A=cyclin-dependent kinase inhibitor 2A encoding for p16.

We tested several different supervised classifier algorithms (see materials

and method). From a wide point of view, supervised machine learning is typically

associated with prediction where for each observation of the predictor

measurements (also known as feature variables) there is an associated response

measurement (also known as the class label). In this context, we aim to predict the

final diagnosis of astrocytomas based on their molecular alterations.

In addition, to gain further support for high-grade (grade III and IV) astrocytoma

patients’ biology and outcome, we also performed unsupervised hierarchical cluster

analysis. Unsupervised machine learning is a more open-ended style of statistical

learning. Instead of using labelled data sets, unsupervised learning is a set of

statistical tools intended for applications where there is only a set of feature

8 INTRODUCTION

variables measured across a number of observations. In this case, prediction is not

the goal because the data set is unlabeled, i.e. there is no associated response

variable that can supervise the analysis. Rather, the goal is to discover interesting

things about the measurements on the feature variables. For example, you might

find an informative way to visualize the data, or discover subgroups among the

variables or the observations. Methods to classify tumors according to key

molecular events that regulate growth of their most aggressive cellular component

and to predict changes that accompany disease recurrence might greatly facilitate

the development of targeted therapies.

3 MATERIAL AND METHODS Patients and Samples

One hundred and five astrocytoma samples were collected from the

Neurosurgery Department, University Hospital of Salamanca (Spain). All tumors

were histologically classified at the time of diagnosis and reviewed by an

independent blinded pathologist to the original diagnosis according to the 2007

WHO classification [3]. Tumors were classified as 24 low-grade astrocytomas

(grade II), 16 anaplastic astrocytomas (grade III) and 65 primary glioblastomas

(grade IV). Thus, the class to be predicted in the present study is the final diagnosis

of astrocytomas split on the three described categories. All patients gave informed

consent to participate and the study was approved by the local ethics committee.

Blood and tumor tissue samples of all patients were obtained at diagnosis before

initiation of treatment. Matched DNA from peripheral blood and frozen tumor

specimens were extracted by standard phenol/chloroform procedure. Tumor tissues

were also fixed in formalin and embedded in paraffin for immunohistochemical

analysis.

Mutational analysis

Exons and flanking intronic sequences of TP53 (all exons), PTEN (all

exons), PI3KCA (exons 7, 9 and 20), IDH1 (exon 4), IDH2 (exon 4) and BRAF

(exons 11 and 15) were screened for mutations in tumor DNA by PCR (Polymerase

Chain Reaction) amplification and direct sequencing analysis using an ABI Prism

3100 Genetic Analyzer (Applied Biosystems). Primers sequences are available upon

request.

Polymorphism analysis

Genotyping of TP53 (rs1042522) and MDM2 (rs10425229) polymorphisms

was performed by the PCR-RFLP (Polymerase Chain Reaction - Restriction

Fragment Length Polymorphism) technique using DNA extracted from peripheral

blood of patients as described [7,8]. Genotyping of BCL2 (rs2279115), BAX

10 MATERIAL AND METHODS

(rs4645878), KRAS-LCS6 (rs61764370) ERCC1 rs11615, ERCC2 rs13181, ERCC6

rs4253079, APEX1 rs1130409, XRCC1 rs25487, XRCC3 rs861539 and MLH1

rs1800734 polymorphisms was performed using TaqMan 5’-exonuclease allelic

discrimination assays (Applied Biosystems) [9]. To assess reproducibility, a

randomly selected 5% of the samples were re-genotyped and all of these genotypes

matched with genotypes initially designated.

Methylation-specific multiplex ligation-dependent probe amplification (MS-

MLPA)

Methylation status of MGMT, MLH1, MSH2 and MSH6 promoter regions

was detected using the SALSA MS-MLPA Kit ME011 (MRC-Holland) using

specific probes that recognized sequences containing a methylation-sensitive HhaI

restriction site. All reactions, processing and analysis were carried out as reported

previously [10,11].

Reverse transcription-polymerase chain reaction (RT-PCR)

Total RNA was extracted from frozen tumor tissues using Trizol reagent

(Invitrogen) according to manufacturer protocol. Complementary DNA (cDNA)

was synthesized with the Improm-IITM Reverse Transcription System (Promega).

EGFRvIII expression was determined by RT-PCR following the procedure reported

elsewhere [12].

Fluorescence In Situ Hybridization (FISH)

EGFR amplification (locus 7p12) and PTEN deletion (locus 10q23) were

analyzed by Fluorescence In Situ Hybridization (FISH). Dual-probe FISH analysis

was performed on tumor tissue sections using locus-specific probes for centromere

7/EGFR gene and centromere 10/PTEN gene (Vysis, Dowerners Grove, IL). FISH

studies were carried out following well-established methods as described before

[11,13].

Immunohistochemistry

p53, p63, ki67, VRK1, VRK2, MLH1, MSH2, MSH6, HDAC1, HDAC2

and HDAC3 protein expression was assessed by immunohistochemistry using a

tissue microarray containing the 105 astrocytomas and 8 different tissue controls as

it has been previously described [10,11]. The primary monoclonal antibodies used

11 MATERIAL AND METHODS

were p53 (clone DO-7, Novocastra), p63 (clone 4A4, Dako), ki-67 (clone MIB-1,

Dako), VRK1 [14], VRK2 [15], MLH1 (clone G168-15, BD Pharmingen), MSH2

(clone FE11, Biocare Medical), MSH6 (clone BC/44, Biocare Medical), HDAC1

(ab7028 Abcam), HDAC2 (ab7029 Abcam) and HDAC3 (ab7030 Abcam). Protein

expression was evaluated semiquantitatively by a blinded pathologist to clinical and

molecular information as reported before [10,11].

Microsatellite Instability (MSI)

MSI was assessed by PCR in paired normal and tumor DNA using a panel

of 8 markers: 3 mononucleotide markers (Bat25, Bat26 and Bat40), 3 dinucleotide

markers (D2S123, D5S346 and D17S250) and 2 tetranuclotide markers (Mycl and

Pax6). The procedures and analysis for MSI determination have been previously

reported [10].

Supervised classification methods

We used the supervised classification algorithms described above.

Logistic regression is useful to predict the presence or absence of an

outcome or a characteristic based upon the values of a set of predictor variables. We

consider the following definition for binary classification:

𝑝 𝐶 = 1 𝑥 = 1/[1+ 𝑒! !!! !"  !!!!!! ]

Where x represents an instance to be classified meanwhile 𝛽!,𝛽!,… ,𝛽! are

the parameters of the model. These parameters should be estimated from the data in

order to obtain a concrete model. The parameter estimation is performed by means

of the maximum likelihood estimation method as previously described [16].

Nearest-neighbor (K-NN) classifier is a simple but effective classification

algorithm. The training process is basically memorizing all the training data. To

predict a new data point, we found the closest K (a tunable parameter) neighbours

from the training set and let them vote for the final prediction, as shown in figure 2.

12 MATERIAL AND METHODS

Figure 2: Example of a k-nearest-neighbor classification. The problem consists of two variables, X1 and X2 , and two classes, circle and cross. The circles and crosses represent the known examples, and the question mark a new instance that we need to classify. A 1-nearest neighbor classifies an unlabeled instance as the class of the known instance closest to the instance. In this case, a 1-nearest neighbor would classify the question mark as a cross. A 2-nearest neighbor looks at the two closest examples. In our case, we have a circle and a cross and thus have to choose a way to break the ties. A 3-nearest neighbor would classify the question mark as a circle (we have two circles and a cross). Setting the k value at an odd value allows us to avoid ties in the class assignment.

The IB-K algorithm used in this paper is a case-based K-NN classifier. The

main idea of K-NN is the following: to assign labels to new samples depending on

their similarity with a reference set of already labelled instances (K=5) [17].

Implementations commonly use Euclidean distance for numeric attributes and

nominal-overlap for symbolic features. In the present study Manhattan metric was

used as measure of the degree identity between the test sample and the training

instances. The strength of K-NN is its simplicity, as no model needs to be trained.

Incremental learning is automatic when more data arrives (and old data can be

13 MATERIAL AND METHODS

deleted as well). The weakness of K-NN is that it does not handle high number of

dimensions well.

Decision or Classification Trees are one of the oldest machine-learning

models and it is usually applied to illustrate the very basic idea of machine learning.

Based on a tree of decision nodes, the learning approach is to recursively divide the

training data into nodes of homogeneous members through the most discriminative

dividing criteria. The measurement of "homogeneity" is based on the output label.

The related literature proposes a broad range of metrics to measure this correlation

degree between the known class C and the instance mainly based on information

theory as describes later. The model is fitted using binary recursive partitioning.

Each internal node splits the instance space into two or more sub-spaces according

to a certain discrete function of the input attributes values. The selection of this

variable is used to expand a tree in as many branches as possible values the variable

takes. Terminal nodes will ideally have samples of only one of the classes, although

a mixture of classes is usually found. During the training, various dividing criteria

based on the input will be tried (using in a greedy heuristic; that is, selecting at each

step the best splitting variable according to the selected criterion). Tree algorithms

can take different data type of input and output variables (categorical, binary and

numeric value). It can handle missing values and outliers well. The decision tree is

also good in explaining reasoning for its prediction and therefore gives good insight

about the structure underlying the data.

ID3 algorithm. One of the most common decision trees used is the

algorithm ID3 described by Quinlan (1986). The correlation measure between the

variable 𝑋! and the variable Class C is based on information gain:

I X! C =  H C − H(C|X!)

Information gain is the measure of the difference in entropy from before to after

the variable Class C is split on an attribute the variable 𝑋!.

14 MATERIAL AND METHODS

C.4.5 Algorithm is an extension of Quinlan´s earlier ID3 algorithm. In this

case the measure of correlation between the variable 𝑋! and the Class variable C is

based on gain ratio defined as follows:

I X! C /H(X!)

This ratio avoids that the variables with more possible values were

advantageous for its selection. In the present study the tree was constructed using

the C4.5 algorithm [18].

Naive Bayes algorithm is the simplest Bayesian classifier. It apply Bayes’

theorem with the assumption of conditional independence of the predictive

variables given the class [19]. Given an instance, described by its feature vector

(𝑋!…  𝑋!), and a class target c, Bayes’ theorem allows us to express the conditional

probability 𝑃 𝑐 𝑥 as a product of simpler probabilities using the naıve

independence assumption:

𝑃 𝑐 𝑥 =𝑃 𝑐 𝑃 𝑥 𝑐

𝑃(𝑥) =  𝑃 𝑐 Π!!!!  𝑃(𝑥!|𝑐)

𝑃(𝑥)  ∝ 𝑃 𝑐 𝑃(𝑥!|𝑐)!

!!!

Since 𝑃 𝑐 is constant for a given instance; the most probable a posteriori

assignment of the class variable is calculated as follows:

𝑐 = argmax!𝑃 𝑐   𝑃(𝑥!|𝑐)

!

!!!

It assumes that all domain variables are independent when the class value is

known. This assumption dramatically simplifies the exposed statistics, and only

univariate class-conditioned terms  𝑃 𝑋! 𝑐 are needed. Although this assumption is

violated in numerous occasions in real domains, the paradigm still performs well in

many situations.

Suppport vector machine (SVM) provides a binary classification

mechanism based on finding a dividing hyperplane between a set of samples with

positive and negative outputs. It assumes the data is linearly separable after

transformation. The problem can be structured as a quadratic programming

15 MATERIAL AND METHODS

optimization problem as maximizing the margin subjected to a set of linear

constraints (ie: data output on one side of the line must be positive while the other

side is negative). This can be solved by a quadratic programming technique.

If the data is not linearly separable due to noise (majority is still linearly

separable), then an error term will be added to penalize the optimization. If the data

distribution is fundamentally non-linear, the trick is to transform the data to a higher

dimension and hopefully the data is linearly separable in that higher dimension. The

optimization term turns out to be a dot product of the transformed points in the high

dimensional space, which was found to be equivalent to perform a kernel function

in the original (before transformation) space.

The kernel function provides a cheap way to equivalently transform the

original point to a high dimension (since we do not actually transform it) and

perform the quadratic optimization in that high dimension space. SVM with Kernel

function is a highly effective model and works well across a wide range of problem

sets. Although it was designed as a binary classifier, it can be easily extended to

multi-class classification by training a group of binary classifiers and using “one vs

all” or “one vs one” to predict. SVM predicts the output based on the distance to the

dividing hyperplane, which does not directly provide a probability estimation of its

prediction. SVM is a very powerful technique and perform good in a wide range of

non-linear classification problems. It works best when you have a small set of input

features because it will expand the features into higher dimension space, providing

that you also have a good size of training data (otherwise, overfit can happen).

Ensemble methods. Although the most common approach is to use a single

model for class prediction, the combination of classifiers with different biases is

gaining popularity in the machine learning community. As each classifier defines its

own decision surface to discriminate between problem classes, the combination

could construct a more flexible and accurate decision surface. While the first

approaches proposed in the literature were based on simple combinative models

(majority vote, unanimity vote), more complex approaches are now demonstrating

notable predictive accuracies. Among these we can cite the bagging, boosting,

16 MATERIAL AND METHODS

stacked generalization, random forest, or Bayesian combinative approaches. Due to

the negative effect of small sample sizes on bioinformatics problems, model

combination approaches are broadly used due to their ability to enhance the

robustness of the final classifier (also known as the meta-classifier). On the other

hand, the expressiveness and transparency of the induced final models are

diminished [20].

Model selection and evaluation

The estimation method used to evaluate the models was the k-fold cross-

validation method. This method involves partitioning the examples randomly into k

folds or partitions. One partition is used as a test set and the remaining partitions

form the training set. The process is repeated k times using each of the partitions as

the test set. For each classification algorithms we obtained the overall accuracy as

well as the accuracy for each tumor diagnosis. The performance results were finally

compiled with the mean and standard errors statistics. We further compared the

results obtained from the classification paradigms thought one-way ANOVA test to

determine which algorithms were most effective.

Additionally, the methods were also evaluated by the score-based ROC

statistics of Area Under the Curve (AUC). The AUC quantifies the probability that

a classifier will give a randomly drawn deleterious example a lower score than a

randomly drawn neutral example.

Feature subset selection

The designed feature set is composed of 39 descriptors. We performed

feature selection since some of these features could be irrelevant to the prediction

and characterization of astrocytoma molecular diagnosis. Feature selection methods

yield parsimonious models that reduce information cost, are easy to explain and

understand, and increase model applicability and robustness. We utilized a feature

selection of genetic biomarker based on the following methods:

ü Mutual information: in this work we considered the use of mutual

information to evaluate “information content” of each individual feature

17 MATERIAL AND METHODS

with regard to the output class [InfoGain(Class, feature) = H(Class) -

H(Class | feature)]

ü Mutual information ratio: A univariate filter algorithm that ranks the

predictive variables according to their Gain Ratio with the class label and

keeps the best 6 variables. [GainR(Class, feature) = (H(Class) - H(Class |

Attribute)) / H(feature)]

ü Correlation feature selection algorithms: this algorithm tries to find a subset

of predictive variables that is highly correlated with the class, but has low

intercorrelation between the predictive variables. It starts with an empty

subset and iteratively adds the variable that yields a subset with the highest

correlation value. The correlation measures the symmetric uncertainty of

each variable in the subset with the class (to maximize), and adjusts it to

take into account the symmetric uncertainty between the predictive variables

(to minimize). The symmetric uncertainty is a measure of correlation based

on the marginal entropies and the joint entropies between pairs of variables

[21].

The number of variables selected in mutual information algorithm was

performed under the criteria of χ2 distribution (p-value<0.001) and then compared

with the ranking features obtained Mutual Information Ratio algorithm; meanwhile,

the selection process using a CSF was also performed using internal k-fold cross-

validation scheme. The results obtained for each classification paradigms before

and after Feature Selection process was compared using unpaired t-test statistic.

Unsupervised classification

Unsupervised classification algorithms were applied in order to discover

groups that share similar genetic profiles within the dataset. Its starting point is a

training database formed by a set of N independent samples DN = (x1 , . . ., xN )

drawn from a joint and unknown probability distribution p(x, c). Each sample is

characterized by a group of d predictive variables or features {X1, ..., Xd} [22] and

18 MATERIAL AND METHODS

C is a hidden variable that represents the cluster membership of each instance. In

contrast to supervised classification, there is no label that denotes the class

membership of an instance, and no information is available about the annotation of

the database samples in the analysis. A key concept in clustering is the type of

distance measure that determines the similarity degree between samples. This will

dramatically influence the shape and configuration of the induced clusters, and its

election should be carefully studied. Usual distance functions are the Euclidean,

Manhattan, or Mahalanobis.

We used hierarchical clustering applying Manhattan distance for computing

similarity matrix between pairs of observations. The output of a hierarchical

clustering algorithm is a nested and hierarchical set of partitions/clusters

represented by a tree diagram or dendrogram, with individual samples at one end

(bottom) and a single cluster containing every element at the other (top).

Ward´s linkage clustering method was used to define the clusters. The ultimate

cluster results are obtained by slicing the dendrogram at a particular level. In our

case, this level is when only two cluster remain, attempting to find differences in

mortality or prognosis of patients.

Survival analysis

Survival curves were estimated by the Kaplan-Meier method and compared among

patient subsets using the log-rank test. Furthermore, associations between

genetic/clinical features and clusters identified previously were analysed using the

χ2 contingency test and the Fisher’s exact test when necessary (expected values

below 5).

19 RESULTS

4 RESULTS Characteristics of the patients included in the analysis

A total of 105 patients newly diagnosed with primary grade II to IV

astrocytomas were included in this study. The class to be predicted was low-grade

astrocytomas (24 patients) and high-grade astrocytomas (81 patients). The later

class included in the analysis was the final diagnosis of high-grade astrocytomas in

anaplastic astrocytomas (16 patients) and glioblastomas (65 patients). The median

age at diagnosis of patients was 57 years (range 14-79 years) and male to female

ratio was 1.6:1. Detailed clinicopathological characteristics of the patients were

summarized in Table 1. The data set comprised 39 features or molecular biomarkers

belonging to core pathways for astrocytoma pathogenesis.

Classification of astrocytoma diagnosis based on supervised machine learning

methods

We seek to classify the astrocytoma diagnosis in two steps procedure. First,

we differentiated between low-grade and high-grade astrocytomas; therefore we

merged AA and GBM into high-grade astrocytomas. Secondly, we tried to

distinguish between AA and GBM diagnosis. A representation of the performed

analysis was included in figure 3.

20 RESULTS

Table 1. Summary of clinicopathological characteristics of the patients.

Patients, No (%) LGA n=24

AA n=16

GBM n=65

Sex Male 14 (58.3%) 11 (68.8%) 40 (61.5%) Female 10 (41.7%) 5 (31.2%) 25 (38.5%) Median age, years [IQR] 35 [29.3-46.0] 55.0 [45.3-66.0] 64 [55.0-69.5] Tumor Localization Tumor Side Right 9 (37.5%) 9 (56.2%) 39 (60.0%) Left 12 (50.0%) 7 (43.8%) 24 (36.9%) Other 3 (12.5%) 0 (0.0%) 2 (3.1%) Tumor Region Temporal 9 (37.5%) 7 (43.8%) 26 (40.0%) Frontal 7 (29.2%) 6 (37.5%) 23 (35.4%) Parietal 3 (12.5%) 2 (12.5%) 6 (9.2%) Occipital 2 (8.3%) 1 (6.2%) 7 (10.8%) Other 3 (12.5%) 0 (0.0%) 3 (4.6%) Surgery Total resection 14 (58.3%) 10 (62.5%) 49 (75.4%) Subtotal resection 10 (41.7%) 6 (37.5%) 16 (246%) Treatment No treatment 12 (50.0%) 4 (25.0%) 6 (9.2%) Radiotherapy 7 (29.2%) 5 (31.2%) 44 (67.7%) Radiotherapy + Chemotherapy 5 (20.8%) 7 (43.8%) 15 (23.1%)

Median of survival, months [IC 95%]

98.9 [52.4-145.4] 14.0 [12.3-15.7] 11.0 [8.4-13.6]

LGA: low-grade astrocytoma; AA: anaplastic astrocytoma; GBM: glioblastoma.

21 RESULTS

Figure 3: Flowchart showing the classification performance of astrocytoma diagnosis. The first step aims to classify low-grade and high-grade astrocytomas, meanwhile the second step is focused on the classification of high grade astrocytomas (Anaplasic Astrocytomas and glioblastomas).

In both steps, feature selection techniques were applied in order to simplify

the classification task and to improve the accuracy of the methods. This rationally

lead to increase the accuracy of the classification compared to the classification

obtained with the three categories separately.

Classification of Low-grade and High-grade

We first explored the use of supervised algorithms to classify low-grade and

high-grade astrocytomas. Classification power of logistic regression, k-NN (k-star),

classification tree (J48), Naïve Bayes and support vector machine (SMO)

classification techniques as well as the two ensemble methods (Bagging and

Boosting) were tested in the astrocytoma dataset without any biomarker selection.

These methods were mainly evaluated by the percent of corrected classified and the

score-based independent ROC statistic of AUC. In our series, the classification

accuracy ranges from 78.09 % to 90.55%. The results indicated that Bagging

ensemble method was the classification paradigm that best classified astrocytoma

22 RESULTS

tumors in low-grade and high-grade categories. Although, no significant differences

were detected among the classifiers in terms of accuracy or AUC, Bagging method

reached to correctly classify the 90.55% of patients; meanwhile, Naïve-Bayes

showed the highest AUC 0.913 but only classify correctly 86.73% of patients

(Table 2A).

Choosing a robust and reliable set of relevant features is crucial in

classification analysis. Feature selection procedures were applied in order to

identify the most relevant genetic features and to reduce redundant information.

Given the initial set of 39 genetic biomarkers, we found the subset with k<n

features that were maximally informative about the diagnosis. Therefore, we studied

how to select features according to the maximal statistical dependency criterion

based on mutual information, mutual information ratio and correlation feature

selection in order to classify low-grade and high-grade astrocytomas. Results of

mutual information algorithm were summarized in Figure 4, showing that IDH1

mutation, ki67 expression and p63 expression seed information about the grade of

these heterogeneous tumors (p-value<0.001). Furthermore, IDH1 mutation, ki67

expression and p63 expression were also the three first ranked features based on

mutual information ratio. These results were also confirmed through correlation

feature selection method.

23 RESULTS

Figure 4. Scree-plot representing Mutual Information between grade of astrocytomas and the genetic markers use for the classification task. IDH1 mutation, ki67 expression and p63 expression were the three first variables ranked thought Mutual Information algorithm (p-value<0.001).

We further analysed the classification results obtained by supervised

classification algorithm after feature selection algorithms (Table 2B). The

classification power was also measured in terms of % accuracy and AUC using k-

fold cross-validation method. Unpaired t-test was applied to compare the

classification results obtained by supervised classification algorithms with feature

selection and non-feature selection approaches. The results suggested that the

supervised techniques improve their classification power when it was applied a

feature selection technique; however, a significant difference was only detected in

K-NN algorithm (p-value<0.05) (Table 2). K-NN was configured with k=5 after

trying some preliminary test, this configuration obtained better accuracy than k=1,

k=3 and k=7. In spite of K-NN being the simplest algorithm used to classify

atrocytomas, the results obtained are quite competitive with other approaches.

Interestingly, we found that the Bagging meta-classifier was also the supervised

classification paradigm that better classify the astrocytoma subtypes after feature

selection approach. This method reached to correctly classify 96 of the 105 patients

Mut

ual I

nfor

mat

ion

0.00

0.05

0.10

0.15

0.20

0.25

0.30

IDH1 m

utatio

n

ki67 e

xpres

sion

p63 e

xpres

sion

PTEN muta

tion

EGFR ampli

ficati

on

TP53 m

utatio

n

EGFRvIII e

xpres

sion

BCL2 rs

2279

115

p53 e

xpres

sion

MSH2 exp

ressio

n

PTEN loss

HDAC1 exp

ressio

n

VRK2 exp

ressio

n

XRCC1 rs2

5487 MSI

p16 e

xpres

sion

ERCC1 rs1

1615

ERCC2 rs1

3181

TP53 rs

1042

522

MDM2 rs2

2797

44

HDAC2 exp

ressio

n

BRAF muta

tion

MDM2 exp

ressio

n

PIK3C

A muta

tion

XRCC3 rs8

6153

9

VRK1 exp

ressio

n

MGMT meth

ylatio

n

MLH1 r

s180

0734

MLH1 e

xpres

sion

KRAS LCS6 r

s617

6437

0

APEX1 rs1

1304

09

Ikaros

expre

ssion

MSH6 exp

ressio

n

HDAC3 exp

ressio

n

MLH1 m

ethyla

tion

MSH6 meth

ylatio

n

MSH2 meth

ylatio

n

BAX rs46

4587

8

ERCC6 rs4

2530

79

24 RESULTS

(91.45%) included in the study using the selected IDH1 mutation, ki67 expression

and p63 expression features.

Classification of Anaplastic Astrocytomas and Glioblastomas

Once we classified astrocytomas in low-grade and high-grade categories

based on their genetic features, we further determined the final diagnosis of high-

grade astrocytomas (AA and GBM) using the same supervised algorithms and

genetic features. Naïve Bayes and Bagging classifiers were the methods that best

classify high-grade astrocytomas based on overall accuracy and AUC (Table 3).

However, none of these techniques were able to reach an accurate classification of

anaplastic astrocytoma category. It was performed feature selection techniques in

order to improve the results obtained. However, we did not detect any improvement

in terms of accuracy or AUC in any of the supervised methods (data not shown) and

the classifiers also yielded low accuracies in anaplastic astrocytomas diagnosis,

suggesting a similar molecular profile between AA and GBM diagnostic groups.

These results were not surprising since distinguishing both high-grade astrocytomas

subtypes is proved to be difficult for the experts and it has been described that both

tumor types share common genetic alterations.

25 RESULTS

Tab

le 2

. Cla

ssifi

catio

n pe

rfor

man

ce o

f low

-gra

de a

nd h

igh-

grad

e as

troc

ytom

as w

ithou

t fea

ture

sele

ctio

n an

d us

ing

feat

ure

sele

ctio

n ap

proa

ch.

L

R

K-N

N

CT

N

B

SMO

B

aggi

ng

Boo

stin

g A

) No

Feat

ure

Sele

ctio

n

Low

-gra

de

15/2

4 15

/24

14/2

4 14

/24

13/2

4 15

/24

17/2

4 H

igh-

grad

e 70

/81

68/8

1 77

/81

77/8

1 69

/81

80/8

1 77

/81

Ove

rall

accu

racy

(SD

) 81

(9.7

3)

79.2

7 (1

3.77

)*

86.7

3 (1

3.02

) 86

.73

(9.4

) 78

.09

(11.

74)

90.5

5 (9

.79)

89

.55

(14)

A

UC

0.

797

0.81

3 0.

691

0.91

3 0.

697

0.86

8 0.

905

B) F

eatu

re S

elec

tion

Lo

w-g

rade

17

/24

14/2

4 15

/24

15/2

4 14

/24

15/2

4 16

/24

Hig

h-gr

ade

77/8

1 81

/81

78/8

1 79

/81

77/8

1 81

/81

78/8

1 O

vera

ll ac

cura

cy (S

D)

89.4

5 (9

.81)

90

.55

(7.6

9) *

88

.55

(13.

52)

89.5

5 (9

.62)

86

.64

(13.

14)

91.4

5 (8

.26)

89

.45

(11.

86)

AU

C

0.87

9 0.

92

0.73

0 0.

906

0.76

7 0.

879

0.87

9 LR

: log

istic

regr

essi

on; K

-NN

: K-N

eare

st N

eigh

bor;

CT:

cla

ssifi

catio

n tre

e; N

B: N

aïve

Bay

es; S

MO

: sup

port

vect

or m

achi

ne

Tab

le 3

. Cla

ssifi

catio

n pe

rfor

man

ce o

f Ana

plas

ic A

stro

cyto

mas

and

Glio

blas

tom

as w

ithou

t fea

ture

sele

ctio

n ap

proa

ch.

L

R

K-N

N

DT

N

B

SMO

B

aggi

ng

Boo

stin

g A

A

7/16

3/

16

0/16

3/

16

6/16

7/

16

0/16

G

liobl

asto

ma

46/6

5 45

/65

59/6

5 56

/65

49/6

5 51

/65

60/6

5 O

vera

ll ac

cura

cy (S

D)

65.2

8 (1

8.63

) 59

.17

(8.7

4)

72.7

8 (2

2.72

) 72

.64

(13.

10)

67.7

8 (2

0.71

) 71

.6 (0

.54)

74

.03

(0.5

1)

AU

C

0.55

0 0.

521

0.37

3 0.

913

0.56

4 0.

549

0.46

4 LR

: log

istic

regr

essi

on; K

-NN

: K-N

eare

st N

eigh

bor;

DT:

cla

ssifi

catio

n tre

e; N

B: N

aïve

Bay

es; S

MO

: sup

port

vect

or m

achi

ne.

26 RESULTS

Due to these discrepancies, we decided to perform unsupervised analysis of

high-grade astrocytomas through hierarchical cluster and to explore the clinical and

molecular differences of the clusters obtained. Two groups were defined through

unsupervised cluster method named as cluster-1 that included 30 patients and

cluster-2 that comprised 51 patients. It should be notice that none of the selected

cluster was significantly enriched for AA or GBM diagnosis as shown in Figure 6.

Figure 6. Hierarchical clustering of high-grade astrocytomas based on genetic and molecular features revealed two different groups. AA: Anaplasic Astrocytomas, GBM: Glioblastomas.

Survival differences between genetic cluster-1 and cluster-2 were evaluated

and compared with AA and GBM diagnostic groups survival. We did not observe

differences in survival between patients diagnosed with AA compared with GBM

(574d vs 475d p-value=0.321; Figure 7A). Meanwhile, survival analysis of

molecular clusters revealed that cluster-2 of patients had a significantly better

median overall survival compared with cluster-1 (594d vs 310d p-value=0.006;

Figure 7B).

GBM

GBM GBM

GBM AA

GBM

GBM

GBM

GBM GBM

GBMGBM

AA

GBMGBM

GBM

GBM

GBM

GBM

GBM

GBM

GBM AA

GBM AA

GBM GBM

GBM GBM

GBM

GBM

GBM

GBM

GBM

GBM

GBM

AA

GBM

GBM

GBM GBM

AA

GBM AA

GBM GBM

GBM

AA

GBM

AA

GBMGBM

AAAA

GBM

AA

GBM

GBM

GBM

GBM

AA

GBM GBM

GBM

GBM

GBM AAAAAA GBM

GBM

GBM

GBM

GBM

GBM

GBM

GBM

GBM GBM

GBM

GBM

01020304050

hclust (*, "ward.D2")distance

Height

27 RESULTS

Figure 7. Kaplan-Meier estimates overall survival in high-grade astrocytomas according to (A) histological diagnosis and (B) molecular clusters. AA: Anaplasic Astrocytomas, GBM: Glioblastoma.

These results support the misclassification obtained through supervised

classification algorithms of high-grade astrocytomas. We also analyzed the impact

of patients’ age and treatment received since these both clinical parameters are

important for astrocytoma patients’ survival. No differences in patients´ age neither

in treatment management were observed between cluster-1 and cluster-2, supporting

the relevance of molecular alterations in astrocytomas’ outcome (Table 4). We

further explored the molecular features that distinguish both clusters. For this

purpose, feature selection approaches were performed above the cluster-1 and

cluster-2 categories. The six representative molecular alterations associated with the

two molecular clusters were commonly identified through mutual information,

mutual information ratio and correlation feature selection algorithms (Table 4).

0 1000 2000 3000 4000

020

4060

8010

0

Time (days)

Sur

viva

l pro

babi

lity

DiagnosisAA (n=16)GBM (n=65)

0 1000 2000 3000 40000

2040

6080

100

Time (days)

Sur

viva

l pro

babi

lity

ClassificationCluster-1 (n=30)Cluster-2 (n=51)

p = 0.321

p = 0.006

28 RESULTS

Table 4. Clinical and genetic features of the two genetic clusters

Clinical features Cluster-1 (n=30) (bad-prognosis)

Cluster-2 (n=51) (good-prognosis) p-value

Diagnosis 0.266 AA 13.3% 23.5% GBM 86.7% 76.5% Median age, years [IQR] 62.0 [54.8-68.3] 65.0 [53.0-70.0] 0.932 Surgery 0.939 Total resection 73.3% 72.5% Subtotal resection 26.7% 27.5% Medical treatment 0.206 No treatment 20.0% 7.8% Radiotherapy 60.0% 60.8% Radiotherapy + Chemotherapy 20.0% 31.4%

ki67 expression <0.001 <5% 17.4% 37.0% 5-10% 8.7% 47.8% >10% 73.9% 15.2% MLH1 expression <0.001 positive 48.3% 96.1% negative 51.7% 3.9% MSH6 expression <0.001 positive 40.7% 82.0% negative 59.3% 18.0% HDAC3 expression <0.001 Positive 34.6% 80.0% negative 65.4% 20.0% VRK2 expression <0.001 Positive 17.9% 94.1% negative 82.1% 5.9% PTEN mutation <0.001 wild-type 51.7% 89.1% mutated 48.3% 10.9%

Thus, cluster 1 was characterized by poor survival, high ki67 expression,

loss of MLH1, MSH6, HDAC3 and VRK2 expression and mutations in PTEN gene;

whereas cluster 2 was distinguished by better prognosis, low levels of ki67, positive

expression of MLH1, MSH6, HDAC3 and VRK2 proteins and wild-type PTEN.

29 DISCUSSION

5 DISCUSSION

Malignant astrocytomas are one of the most devastating cancers with a

dismal prognosis. Virtually all astrocytomas progress and locally relapse regardless

of multi-modality treatment approach [5,23]. Treatment decisions and management

as well as the prognostic prediction of astrocytoma patients rely mainly on the

histological grade of the tumor, often combined with perceived clinical features

[1,5,23]. However, astrocytomas are highly heterogeneous tumors and the currently

used morphology-based grading system does not always adequately represent the

high heterogeneity of these tumors [6,24]. Therefore is a need for more accurate

methods of astrocytoma classification. Astrocytomas are characterized by an

infiltrating and aggressive behavior directly related to their genetic alterations in

core signaling pathways [1]. Recent large-scale genomic studies have revealed

different subclasses of gliomas with different molecular and/or clinical phenotypes

or responses to therapy, but most of them are mainly focused on glioblastomas and

relatively little has been done for the molecular classification of anaplastic

astrocytomas and low-grade astrocytomas [25-30]. Thus, identification of new

genetics biomarkers may contribute to a better classification of both low-grade and

high-grade astrocytomas and can guide clinical decision-making. The explanatory

variables used for all the analysis of this project comprised a range of factors

involved in astrocytomas pathogenesis related with key cellular processes including

cellular proliferation, survival, apoptosis, cellular damage and metabolism. They

include common alterations in TP53, EGFR-PI3K and Rb pathways, IDH1

mutations, alterations in DNA repair genes or epigenetic factors.

The first stage of the current study was to investigate whether genetic and

molecular alterations involved in tumor proliferation, survival, apoptosis,

metabolism and DNA repair were able to predict the histological classification of

astrocytomas using supervised algorithms. Despite the usefulness of data mining in

medical applications [16], there is no previous implementation of such machine

30 DISCUSSION

learning algorithms in the classification of malignant astrocytoma subtypes based

on genetic and molecular features.

Therefore, in this methodological study we have compared different

supervised classification methods used in machine learning to distinguish between

astrocytoma diagnoses based on their molecular alterations. Our results showed that

supervised classification is an effective approach for low-grade and high-grade

astrocytomas classification, with accuracy higher than 85% in all applied algorithms

using the three selected features: IDH1 mutation, ki67 expression and p63

expression. Thus, the preselection of the variables using feature selection techniques

enhance the performance of supervised methods. Models obtained using feature

selection methods are desirable, not only because a higher accuracy is achieved, but

also because more parsimonious and easily understood models are obtained.

Furthermore, we suggest Bagging meta-classifier as the optimal algorithm to

classify astrocytoma in low-grade and high-grade subtypes since it was able to

correctly classify 91.5% of tumors after feature selection procedures. Ensemble

methodology is an efficient technique that has increasingly been adopted to

combine multiple learning algorithms in order to improve overall prediction

accuracy [31]. In addition, ensemble techniques have the advantage to alleviate the

small sample size problem by averaging and incorporating over multiple

classification models to reduce the potential for overfitting the training data [32]

Therefore, our main finding is that the results obtained after applying feature

selection approach in combination with the classificatory paradigms emphasize the

importance of IDH1 mutation and proliferation status, determined by both ki67 and

p63 expression, to discriminate between low-grade and high-grade tumors. In

keeping with previous data reported by the literature, IDH1 mutation and ki67

expression are highly associated with the grade of malignance of astrocytomas

[33,34]. IDH1 is a NADP+-dependent isocitrate dehydrogenase involved in cellular

metabolism. Mutations in IDH1 gene are one of the most frequent alterations in

low-grade astrocytomas and have been described as an early event in the

pathogenesis of these tumors. [28,34]. Regarding Ki67, its expression is a marker of

mitotic activity and cellular proliferation but its precise function is still unclear [33].

31 DISCUSSION

Ki67 expression is among the WHO criteria established for the grading of

malignant astrocytomas [3]. Interestingly, our results suggest for the first time the

role of the expression of p63 in astrocytoma classification. P63 is a p53 family

member with different functions in controlling cell proliferation, maintenance,

differentiation and apoptosis [35]. Its expression has been associated with more

aggressive tumours and poor prognosis in breast or lung cancer [35,36]. These three

features, IDH1 mutation, ki67 and p63 expression, along with the classical WHO

histological criteria [3] could help in the first step of diagnostic classification

between low-grade and high-grade astrocytomas.

Next, we attempted to deep in high-grade astrocytoma classification and to

discriminate between anaplastic astrocytomas and glioblastomas using supervised

algorithms based on common alterations in core signaling pathways. In this case, no

more than 70% of diagnostic accuracy was obtained regardless the use of feature

selection techniques and, furthermore, more than 50% of anaplastic astrocytomas

were misclassify. It should remark that pathology data presents special challenges

for researchers, including data imbalance for particular responses or predictors, and

high individual patient data variation that makes patter recognition and rule

detection difficult. In addition, the resources and time required for collecting and

genotyping specimens limit the number of samples in each class, particularly in

Anaplasic Astrocytoma. This limitation makes statistical inference very difficult to

carry out. From a clinical point of view, a potential explanation for misclassification

results may be that some anaplastic astrocytomas were actually glioblastomas and

that the histological specimen had not reflected the true nature of the tumor. Indeed,

it is often difficult and subjected to a large interobserver variability to distinguish

anaplastic astrocytomas from glioblastoma based on the presence of necrosis and/or

microvascular proliferation due to high-grade astrocytomas’ intratumoral variability

[3,37]. In addition, in the sequential progression from anaplastic astrocytoma to

glioblastoma, anaplastic astrocytoma could rapidly acquire characteristic

glioblastoma genetic alterations lacking some histological glioblastoma

characteristics like extensive endothelial cell proliferation [37,38]. Several studies

32 DISCUSSION

support the hypothesis that anaplastic astrocytomas and glioblastomas represent a

mix of molecular and genetic subtypes [24,29,39,40].

In our study, the hierarchical clustering of high-grade astrocytomas based on

the molecular alterations of the tumors was included in order to gather information

about the tumor biology or prognosis. This performance could be useful when

comparing similar phenotypic diagnosis do not seed light about the prognosis of

tumor. In fact, the results obtained through cluster analysis revealed two clearly

different clusters with different prognosis that are different from histological

subgroups. Importantly, no differences on age or clinical management were

detected between the clusters defined. In fact, the molecular classification obtained

predicts better the outcome of high-grade astrocytomas than the histological

classification since we did not find significant differences in survival between

anaplastic astrocytomas and glioblastomas in our series. Although glioblastoma is

associated with uniform poor survival rates, a small but discrete subgroup of long-

term survivors exists [3,41] and, furthermore, the clinical outcome of anaplastic

astrocytoma varies widely, ranging from rapid progression to prolonged survival

[42], indicating that diagnostic groups do not fully predict the variations in

astrocytoma disease outcome.

One of the challenges of astrocytoma research is to identify molecular

factors that predict clinical outcome. Six specific molecular alterations were

significantly associated with the two clusters, ki67, MLH1, MSH6, HDAC3 and

VRK2 expression and PTEN mutations, indicating a high prognostic potential for

these molecular classifiers. These alterations are mainly involved in tumor

proliferation, DNA repair and epigenetic mechanisms. Several of these alterations

were previously found to be individually associated with astrocytoma prognosis. As

mention before, ki67 is a marker of proliferation related with increasing grade of

malignance and prognosis [33]. PTEN is a tumor suppressor gene involved in the

regulation of several cellular processes including cell cycle control, cellular

survival, angiogenesis or cellular metabolism [43]. PTEN gene is frequently

inactivated in high-grade astrocytomas and has been widely associated with more

33 DISCUSSION

aggressive tumors and reduced survival of these patients [44,45]. VRK2 is a

member of a subfamily of serine-threonine kinases involved in the regulation of

several signal transduction pathways in the context of cancer biology [46]. VRK2

expression has been correlated with a proliferative phenotype in head and neck

squamous cell carcinomas and astrocytomas [11,47]. Furthermore, VRK2

expression has recently been associated with better survival in breast cancer and

high-grade astrocytomas that could be explain by the role of VRK2 in the

modulation of mitogenic, stress or apoptosis signaling [11,48]. Interestingly, loss of

expression of two mismatch repair proteins, MLH1 and MSH6, were found to be

associated with poor prognosis in our series, suggesting an important role of this

cellular pathway in astrocytoma biology. Mismatch repair system maintains DNA

stability by repairing errors acquired during DNA replication, and its defective

function is mainly associated with the development of colorectal and endometrial

carcinomas [49,50]. In astrocytomas, loss of MSH6 function confers resistance to

alkylating agents and it has been associated with tumor growth and progression

[51,52]. Although less is known about the role of MLH1 in astrocytomas, its

expression have been linked to therapy response and tumor recurrence [53,54].

Finally, HDAC3 is involved in the control of gene expression through histone

deacetylation. Its expression has been inversely associated with tumor grade and

decreased expression of HDAC3 was correlated with unfavorable patient outcome

[55].

To sum up, our results suggest that supervised classification using IDH1 mutation,

ki67 expression and p63 expression could be a complementary approach to

histological classification of astrocytomas in low-grade and high-grade tumors.

Furthermore, unsupervised classification of high-grade astrocytomas using the

above 6-markers signature related to cell proliferation, DNA repair and histone

deacetylation predicts better their outcome than histological criteria. This finding

provides supporting evidence that genetic-based classification is a more accurate

method to classify high-grade astrocytomas. These six markers could be used as

prognostic factors of high-grade astrocytomas and might seed light in the clinical

management and treatment approach of these patients. Nonetheless, further studies

34 DISCUSSION

in larger series of patients are necessary to confirm our observations and the

external validation would be crucial in future investigations.

35 CONCLUSIONS

6 CONCLUSIONS

1) Our results suggest that supervised classification methods using molecular

information could be a complementary approach to histological

classification of astrocytomas in low-grade and high-grade tumors.

2) Feature selection techniques reveals the importance of IDH1 mutation, ki67

expression and p63 expression in the classification of low-grade and high-

grade astrocytomas

3) Unsupervised classification of high-grade astrocytomas based on

hierarchical clustering methods predicts better their outcome than

histological criteria. Suggesting that unsupervised classification methods as

suitable methodology to gather information about the tumor biology or

prognosis

4) Six molecular markers signature of ki67, MLH1, MSH6, HDAC3 and

VRK2 expression and PTEN mutations related to cell proliferation, DNA

repair and histone deacetylation could be used as prognostic factors of high-

grade astrocytomas and might seed light in the clinical management and

treatment approach of these patients.

36 REFERENCES

7 REFERENCES

1. Ricard D, Idbaih A, Ducray F, Lahutte M, Hoang-Xuan K, et al. (2012) Primary brain

tumours in adults. Lancet 379: 1984-1996.

2. Wen PY, Kesari S (2008) Malignant gliomas in adults. N Engl J Med 359: 492-507.

3. Louis DN, Ohgaki H, Wiestler OD, Cavenee WK, Burger PC, et al. (2007) The 2007

WHO classification of tumours of the central nervous system. Acta Neuropathol

114: 97-109.

4. Ohgaki H, Kleihues P (2009) Genetic alterations and signaling pathways in the

evolution of gliomas. Cancer Sci 100: 2235-2241.

5. Stupp R, Hegi ME, Mason WP, van den Bent MJ, Taphoorn MJ, et al. (2009) Effects

of radiotherapy with concomitant and adjuvant temozolomide versus

radiotherapy alone on survival in glioblastoma in a randomised phase III study:

5-year analysis of the EORTC-NCIC trial. Lancet Oncol 10: 459-466.

6. Furnari FB, Fenton T, Bachoo RM, Mukasa A, Stommel JM, et al. (2007) Malignant

astrocytic glioma: genetics, biology, and paths to treatment. Genes Dev 21:

2683-2710.

7. Gomez-Sanchez JC, Delgado-Esteban M, Rodriguez-Hernandez I, Sobrino T, Perez

de la Ossa N, et al. (2011) The human Tp53 Arg72Pro polymorphism explains

different functional prognosis in stroke. J Exp Med 208: 429-437.

8. Pastor-Idoate S, Rodriguez-Hernandez I, Rojas J, Fernandez I, Garcia-Gutierrez MT,

et al. (2013) The T309G MDM2 gene polymorphism is a novel risk factor for

proliferative vitreoretinopathy. PLoS One 8: e82283.

9. Rodriguez-Hernandez I, Perdomo S, Santos-Briz A, Garcia JL, Gomez-Moreta JA, et

al. (2014) Analysis of DNA repair gene polymorphisms in glioblastoma. Gene

536: 79-83.

10. Rodriguez-Hernandez I, Garcia JL, Santos-Briz A, Hernandez-Lain A, Gonzalez-

Valero JM, et al. (2013) Integrated analysis of mismatch repair system in

malignant astrocytomas. PLoS One 8: e76401.

37 REFERENCES

11. Rodriguez-Hernandez I, Vazquez-Cedeira M, Santos-Briz A, Garcia JL, Fernandez

IF, et al. (2013) VRK2 identifies a subgroup of primary high-grade astrocytomas

with a better prognosis. BMC Clin Pathol 13: 23.

12. Mellinghoff IK, Wang MY, Vivanco I, Haas-Kogan DA, Zhu S, et al. (2005)

Molecular determinants of the response of glioblastomas to EGFR kinase

inhibitors. N Engl J Med 353: 2012-2024.

13. Garcia JL, Perez-Caro M, Gomez-Moreta JA, Gonzalez F, Ortiz J, et al. (2010)

Molecular analysis of ex-vivo CD133+ GBM cells revealed a common invasive

and angiogenic profile but different proliferative signatures among high grade

gliomas. BMC Cancer 10: 454.

14. Valbuena A, Lopez-Sanchez I, Vega FM, Sevilla A, Sanz-Garcia M, et al. (2007)

Identification of a dominant epitope in human vaccinia-related kinase 1 (VRK1)

and detection of different intracellular subpopulations. Arch Biochem Biophys

465: 219-226.

15. Blanco S, Klimcakova L, Vega FM, Lazo PA (2006) The subcellular localization of

vaccinia-related kinase-2 (VRK2) isoforms determines their different effect on

p53 stability in tumour cell lines. Febs J 273: 2487-2504.

16. Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, et al. (2006) Machine

learning in bioinformatics. Brief Bioinform 7: 86-112.

17. Aha DW, Kibler, D., Albert, M.K. (1991) Instance-based learning algorithms.

Machine Learning 6.

18. JR Q (1993) C4.5: Programs for Machine Learning.; kaufmann SFM, editor.

19. 1961 M (1961) Steps toward artficial intelligence. Proceedings of the Institute of

Radio Engineers 49.

20. K.W.DeBock KC, and D.VandenPoel (2010) Ensemble classification based on

generalized additive models. Computational Statistics and Data Analysis 54:

1535-1546.

21. Hall M (1999) Correlation-based Feature selection for Machine Learning.

22. Osareh A, Shadgar B (2013) An efficient ensemble learning method for gene

microarray classification. Biomed Res Int 2013: 478410.

23. Wehming FM, Wiese B, Nakamura M, Bremer M, Karstens JH, et al. (2012)

Malignant glioma grade 3 and 4: how relevant is timing of radiotherapy? Clin

Neurol Neurosurg 114: 617-621.

38 REFERENCES

24. Gravendeel LA, Kouwenhoven MC, Gevaert O, de Rooi JJ, Stubbs AP, et al. (2009)

Intrinsic gene expression profiles of gliomas are a better predictor of survival

than histology. Cancer Res 69: 9065-9072.

25. Killela PJ, Pirozzi CJ, Reitman ZJ, Jones S, Rasheed BA, et al. (2014) The genetic

landscape of anaplastic astrocytoma. Oncotarget 5: 1452-1457.

26. (2008) Comprehensive genomic characterization defines human glioblastoma genes

and core pathways. Nature 455: 1061-1068.

27. Brennan CW, Verhaak RG, McKenna A, Campos B, Noushmehr H, et al. (2013)

The somatic genomic landscape of glioblastoma. Cell 155: 462-477.

28. Parsons DW, Jones S, Zhang X, Lin JC, Leary RJ, et al. (2008) An integrated

genomic analysis of human glioblastoma multiforme. Science 321: 1807-1812.

29. Phillips HS, Kharbanda S, Chen R, Forrest WF, Soriano RH, et al. (2006) Molecular

subclasses of high-grade glioma predict prognosis, delineate a pattern of disease

progression, and resemble stages in neurogenesis. Cancer Cell 9: 157-173.

30. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, et al. (2010) Integrated

genomic analysis identifies clinically relevant subtypes of glioblastoma

characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer

Cell 17: 98-110.

31. Dietterich T (2000) Ensemble methods in machine learning. Proceedings of the

Multiple Classifier System Conference: 1-15.

32. Dietterich TG (2000) An experimental comparison of three meth- ods for

constructing ensembles of decision trees: bagging, boosting, and randomization.

Machine Learning 40: 139-157.

33. Johannessen AL, Torp SH (2006) The clinical value of Ki-67/MIB-1 labeling index

in human astrocytomas. Pathol Oncol Res 12: 143-147.

34. Yan H, Parsons DW, Jin G, McLendon R, Rasheed BA, et al. (2009) IDH1 and

IDH2 mutations in gliomas. N Engl J Med 360: 765-773.

35. Graziano V, De Laurenzi V (2011) Role of p63 in cancer development. Biochim

Biophys Acta 1816: 57-66.

36. Petitjean A, Hainaut P, Caron de Fromentel C (2006) TP63 gene in stress response

and carcinogenesis: a broader role than expected. Bull Cancer 93: E126-135.

37. Ohgaki H, Kleihues P (2007) Genetic pathways to primary and secondary

glioblastoma. Am J Pathol 170: 1445-1453.

39 REFERENCES

38. Sonoda Y, Ozawa T, Aldape KD, Deen DF, Berger MS, et al. (2001) Akt pathway

activation converts anaplastic astrocytoma to glioblastoma multiforme in a

human astrocyte model of glioma. Cancer Res 61: 6674-6678.

39. Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, et al. (2003) Gene

expression-based classification of malignant gliomas correlates better with

survival than histological classification. Cancer Res 63: 1602-1607.

40. Freije WA, Castro-Vargas FE, Fang Z, Horvath S, Cloughesy T, et al. (2004) Gene

expression profiling of gliomas strongly predicts survival. Cancer Res 64: 6503-

6510.

41. Burger PC, Scheithauer BW, Vogel FS (2002) The brain tumors. Surgical pathology

of the central nervous system and its coverings. 4th ed: New York: Churchill

Livingstone. pp. 180–198.

42. Tortosa A, Vinolas N, Villa S, Verger E, Gil JM, et al. (2003) Prognostic

implication of clinical, radiologic, and pathologic features in patients with

anaplastic gliomas. Cancer 97: 1063-1071.

43. Song MS, Salmena L, Pandolfi PP (2012) The functions and regulation of the PTEN

tumour suppressor. Nat Rev Mol Cell Biol 13: 283-296.

44. Smith JS, Tachibana I, Passe SM, Huntley BK, Borell TJ, et al. (2001) PTEN

mutation, EGFR amplification, and outcome in patients with anaplastic

astrocytoma and glioblastoma multiforme. J Natl Cancer Inst 93: 1246-1256.

45. Xiao WZ, Han DH, Wang F, Wang YQ, Zhu YH, et al. (2014) Relationships

between PTEN gene mutations and prognosis in glioma: a meta-analysis.

Tumour Biol 35: 6687-6693.

46. Valbuena A, Sanz-Garcia M, Lopez-Sanchez I, Vega FM, Lazo PA (2011) Roles of

VRK1 as a new player in the control of biological processes required for cell

division. Cell Signal 23: 1267-1272.

47. Santos CR, Rodriguez-Pinilla M, Vega FM, Rodriguez-Peralto JL, Blanco S, et al.

(2006) VRK1 signaling pathway in the context of the proliferation phenotype in

head and neck squamous cell carcinoma. Mol Cancer Res 4: 177-185.

48. Fernandez IF, Blanco S, Lozano J, Lazo PA (2010) VRK2 inhibits mitogen-

activated protein kinase signaling and inversely correlates with ErbB2 in human

breast cancer. Mol Cell Biol 30: 4687-4697.

40 REFERENCES

49. Lynch HT, Lynch PM, Lanspa SJ, Snyder CL, Lynch JF, et al. (2009) Review of the

Lynch syndrome: history, molecular genetics, screening, differential diagnosis,

and medicolegal ramifications. Clin Genet 76: 1-18.

50. Koornstra JJ, Mourits MJ, Sijmons RH, Leliveld AM, Hollema H, et al. (2009)

Management of extracolonic tumours in patients with Lynch syndrome. Lancet

Oncol 10: 400-408.

51. Hunter C, Smith R, Cahill DP, Stephens P, Stevens C, et al. (2006) A hypermutation

phenotype and somatic MSH6 mutations in recurrent human malignant gliomas

after alkylator chemotherapy. Cancer Res 66: 3987-3991.

52. Yip S, Miao J, Cahill DP, Iafrate AJ, Aldape K, et al. (2009) MSH6 mutations arise

in glioblastomas during temozolomide therapy and mediate temozolomide

resistance. Clin Cancer Res 15: 4622-4629.

53. Friedman HS, McLendon RE, Kerby T, Dugan M, Bigner SH, et al. (1998) DNA

mismatch repair and O6-alkylguanine-DNA alkyltransferase analysis and

response to Temodal in newly diagnosed malignant glioma. J Clin Oncol 16:

3851-3857.

54. Gomori E, Pal J, Meszaros I, Doczi T, Matolcsy A (2007) Epigenetic inactivation of

the hMLH1 gene in progression of gliomas. Diagn Mol Pathol 16: 104-107.

55. Campos B, Bermejo JL, Han L, Felsberg J, Ahmadi R, et al. (2011) Expression of

nuclear receptor corepressors and class I histone deacetylases in astrocytic

gliomas. Cancer Sci 102: 387-392.