A Study of RandomForests Learning Mechanism with Application to the Identification of Informative Gene Interactions in Microarray Data

Jorge M. Arevalillo and Hilario Navarro

Department of Statistics and Operational Research

Universidad Nacional de Educación a Distancia


Salford Analytics and Data Mining Conference 2012. San Diego

Outline

Jorge M. Arevalillo and H. Navarro. RF Learning Mechanism with Application to the Identification of Gene Interactions

Weak Marginal / Strong bivariate genetic interactions

RF learning mechanism

RF bivariate interaction detector procedure

Controlling the curse of dimensionality

Handling the small sample effect

Application to microarray data

Conclusions

Human Genetics Basics


DNA is often described as the blueprint of living organisms. It is composed of two complementary strands of nucleotides (A-T, C-G)

Adenine (A) pairs with thymine (T) and cytosine (C) with guanine (G)

Basically, a gene is a piece of the DNA that contains the genetic information for the synthesis of a protein

The human genome in numbers

23 pairs of chromosomes

2 meters of DNA

A sequence of 3 billion bps length

30000 – 40000 genes

Over 99% of the genome is identical in all human beings

The central dogma of molecular biology


The expression of the genetic information stored in the DNA occurs in two stages

1) TRANSCRIPTION. At this stage DNA is transcribed into messenger RNA (mRNA)

2) TRANSLATION. At this stage mRNA is transported to the cell cytoplasm and translated to produce a protein

Amino acids are used to construct proteins which in turn will determine the observed phenotype

DNA microarray technologies make it possible to measure the abundance of mRNA by monitoring the expression levels of hundreds or thousands of genes under different conditions of the phenotype

Weak marginal / Strong bivariate genetic interactions


In binary classification we define a WM/SB bivariate gene to gene interaction as a pair of variables (genes) whose joint distribution discriminates the outcome but whose marginal distributions are irrelevant for class separation
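The definition can be illustrated with a small synthetic sketch. It assumes an XOR-type pattern (two inputs uniform on (-1, 1), class 1 when their signs differ); the function names and sample size are illustrative choices, not part of the original study. No single-input threshold rule does much better than chance, while a rule using both inputs separates the classes exactly.

```python
import random

random.seed(0)

def make_xor(n):
    """Draw n cases with two inputs uniform on (-1, 1); class is 1 when the signs differ."""
    data = []
    for _ in range(n):
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        y = int((x1 > 0) != (x2 > 0))
        data.append((x1, x2, y))
    return data

def best_marginal_accuracy(data, j):
    """Best accuracy of any single-threshold rule on input j (either orientation)."""
    n = len(data)
    best = 0.0
    for t in (row[j] for row in data):
        hits = sum(int(row[j] > t) == row[2] for row in data)
        best = max(best, hits / n, 1 - hits / n)
    return best

def joint_accuracy(data):
    """Accuracy of the rule that looks at both inputs at once."""
    hits = sum(int((x1 > 0) != (x2 > 0)) == y for x1, x2, y in data)
    return hits / len(data)

data = make_xor(600)
marginal_1 = best_marginal_accuracy(data, 0)   # stays near 0.5: weak marginal
marginal_2 = best_marginal_accuracy(data, 1)   # stays near 0.5: weak marginal
joint = joint_accuracy(data)                   # strong bivariate signal
print(marginal_1, marginal_2, joint)
```

Each input on its own looks like noise, which is why rankings based on marginal relevance tend to discard both members of such a pair.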

RF learning mechanism


Random Forest is an ensemble of decision trees grown in a special way

Randomness is injected into the RF mechanism in two ways: by bootstrap resampling to grow each tree in the forest, and by finding the best splitter at each node within a randomly selected subset of inputs

The number ntree of trees in the forest and the number R of candidate inputs for splitting each node must be set in advance. Defaults: ntree = 500 and R = square root of the number p of inputs

Each tree is grown on nearly 63% of the data. The classification error rate is estimated using the remaining 37% of the observations, which are left out of the bootstrap sample. The error rate evaluated on these out-of-bag cases is called oob
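The 63% / 37% figures follow from bootstrap resampling: a given case is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e, about 0.368. The following sketch checks this by simulation; the sample size and number of repetitions are arbitrary illustrative values.

```python
import math
import random

random.seed(1)

def in_bag_fraction(n):
    """Fraction of distinct cases that appear in one bootstrap sample of size n."""
    drawn = {random.randrange(n) for _ in range(n)}
    return len(drawn) / n

n_cases = 1000
fractions = [in_bag_fraction(n_cases) for _ in range(200)]
mean_in_bag = sum(fractions) / len(fractions)

exact = 1 - (1 - 1 / n_cases) ** n_cases   # expected in-bag fraction for this n
limit = 1 - math.exp(-1)                   # large-n limit, roughly 0.632

print(mean_in_bag, exact, limit)
```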


RF table of variable importance


The high dimensional nature of the data obtained by gene expression microarray experiments has created the need for variable selection procedures that separate relevant predictors (genes), which carry useful information for classifying the phenotype, from irrelevant predictors (genes)

RF generates variable importance measures that make it possible to rank predictors according to their contribution to the predictive accuracy of the ensemble

RF gives two measures of variable importance

1) GINI MEASURE. Each variable is assigned a score that accumulates all the improvements in the Gini index over all the nodes of the trees in the forest that use the variable as splitting variable

2) PERMUTATION BASED MEASURE. For each variable, its values are randomly permuted across the cases, which turns it into a noisy predictor; this noisy predictor is used in place of the original predictor and the oob is computed again. The importance of the variable is defined by the difference between the oob errors after and before permutation
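Both measures can be sketched in a few lines. This is a simplified illustration, not the RF implementation: the Gini improvement is shown for a single split, and the permutation measure uses a fixed rule (`classify`, a hypothetical stand-in for a fitted forest) evaluated on one dataset, whereas RF evaluates each tree on its own out-of-bag cases.

```python
import random

random.seed(2)

def gini(labels):
    """Gini index of a node with binary labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def gini_improvement(parent, left, right):
    """Decrease in the Gini index achieved by splitting parent into left and right."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

def error_rate(classify, rows, labels):
    return sum(classify(r) != y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(classify, rows, labels, j):
    """Error after permuting input j across the cases, minus the error before."""
    before = error_rate(classify, rows, labels)
    column = [r[j] for r in rows]
    random.shuffle(column)
    permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
    return error_rate(classify, permuted, labels) - before

# Toy data: input 0 determines the class, input 1 is pure noise.
rows = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
labels = [int(r[0] > 0) for r in rows]

def classify(r):
    """Hypothetical stand-in for a fitted forest: uses input 0 only."""
    return int(r[0] > 0)

imp_signal = permutation_importance(classify, rows, labels, 0)   # large increase in error
imp_noise = permutation_importance(classify, rows, labels, 1)    # no change in error
pure_split = gini_improvement([0, 0, 1, 1], [0, 0], [1, 1])      # a perfect split
print(imp_signal, imp_noise, pure_split)
```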


The oob error rate degradation in high dimensional settings


An extreme synthetic example. XOR interaction pattern

The oob error rate degrades rapidly as the number of noisy inputs increases; hence the XOR signal is lost

The interaction is captured as long as it appears alone, without the disturbance of the noisy inputs; so an exhaustive search among all pairs of inputs is required if we want the RF learning mechanism to detect the interaction

Our proposal offers shortcuts and tricks that simplify the search


Search procedure. Sequential stage


RF ranking of variable importance gives new insights regarding the degradation of the oob error rate

Alternatives that explore this ranking sequentially, such as Díaz Uriarte (2006) and Genuer (2008), have been proposed to identify relevant patterns correlated with the outcome


Search procedure. Hunting stage


The second stage is designed to hunt bivariate associations that are difficult to uncover and are lost by sequential search strategies

The idea is to group the inputs in blocks, then run RF on all the variables belonging to each pair of blocks and use its oob error to highlight the block matches where WM / SB interactions are more likely to appear. This limits the search
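A minimal sketch of the block bookkeeping follows. It assumes that blocks are formed from consecutive inputs, which is an illustrative choice; the helper names are hypothetical. Scoring each match with RF is left out: in the procedure each match would be given an error measure and the matches ranked by it.

```python
from itertools import combinations

def make_blocks(n_inputs, bsize):
    """Split input indices 0..n_inputs-1 into consecutive blocks of at most bsize."""
    return [list(range(s, min(s + bsize, n_inputs))) for s in range(0, n_inputs, bsize)]

def block_matches(blocks):
    """All unordered pairs of distinct blocks, each with the union of their inputs."""
    return [((i, j), blocks[i] + blocks[j])
            for i, j in combinations(range(len(blocks)), 2)]

def n_pairs(p):
    """Number of input pairs an exhaustive bivariate search would have to visit."""
    return p * (p - 1) // 2

blocks = make_blocks(n_inputs=30, bsize=6)
matches = block_matches(blocks)

print(len(blocks), len(matches))   # 5 blocks give 10 block matches
print(n_pairs(30))                 # 435 pairs among 30 inputs
print(n_pairs(2000))               # 1999000 pairs among 2000 inputs
```

Every pair of inputs falls inside at least one match, so ranking the matches narrows the search without excluding any pair beforehand.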

Figure: Block i and Block j are paired into Match (i, j); the block matches are then ranked


Drawback with the oob error rate


Simulation experiment with block size = 6

The boxplots show that the oob error rate cannot distinguish between block matches containing a weak marginal / strong bivariate association and block matches with only noisy inputs

The curse of dimensionality and the low sample size come up again

Figure: boxplots of the oob error rate for block matches containing the XOR interaction (XOR) and block matches with only noisy inputs (NOISY INPUTS). First panel: sample sizes (40, 40), overlap = 0.31. Second panel: overlap = 0.42


Data augmentation


To overcome this drawback, the data are artificially augmented and then the oob error rate of an RF run on the augmented data is computed

Data perturbation is carried out in accordance with the following scheme

r is the sample range of X

b is the number of bins into which the range is divided. It controls the amount of perturbation

An augmentation parameter k, which gives the factor by which the dataset must be amplified, is also introduced

The new oob error computed on the augmented merged dataset is actually a perturbed error rate measure. We call it perturbed oob
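The perturbation formula itself is not reproduced in this text, so the following is only a sketch under stated assumptions: each value is shifted by uniform noise of half-width r / (2b), that is, at most half a bin, and k perturbed copies are merged with the original cases. Both choices, and the function name `augment`, are assumptions made for illustration; the cited paper gives the actual scheme.

```python
import random

random.seed(3)

def augment(rows, labels, b=5, k=3):
    """Merge the original cases with k perturbed copies of them.

    ASSUMPTION: each value is shifted by uniform noise in [-r/(2b), r/(2b)],
    where r is the sample range of that input. Whether the original cases
    count towards the factor k is also an assumption.
    """
    n_inputs = len(rows[0])
    ranges = [max(r[j] for r in rows) - min(r[j] for r in rows) for j in range(n_inputs)]
    out_rows, out_labels = list(rows), list(labels)
    for _ in range(k):
        for row, y in zip(rows, labels):
            noisy = tuple(x + random.uniform(-0.5, 0.5) * ranges[j] / b
                          for j, x in enumerate(row))
            out_rows.append(noisy)
            out_labels.append(y)   # a perturbed case keeps the class of its source
    return out_rows, out_labels

rows = [(0.0, 10.0), (1.0, 20.0), (2.0, 30.0), (4.0, 50.0)]
labels = [0, 1, 1, 0]
aug_rows, aug_labels = augment(rows, labels, b=5, k=3)
print(len(aug_rows))
```

A larger b gives narrower bins and therefore smaller perturbations; a larger k gives more perturbed cases per original case.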


Details in Arevalillo and Navarro (2011), Fundamenta Informaticae Special issue on Machine Learning in Bioinformatics

The perturbed oob measure


The perturbed oob measure overcomes the initial drawback

Figure: perturbed oob (vertical axis) against the augmentation parameter k (horizontal axis). First panel: overlap = 0.15, 0.07, 0.05, 0.05 and 0.03 for k = 1, 3, 5, 7 and 9. Second panel: overlap = 0.35, 0.24, 0.18, 0.16 and 0.14 for the same values of k. For comparison, the oob error rate boxplots (XOR versus NOISY INPUTS) are shown again, with overlap = 0.31 and overlap = 0.42



Summary of the algorithm


The settings bsize = 6 or 8, b = 5 and k = 3, 5 or 7 usually work well

Strategies for the sequential search step include screeplots of variable importance, VARSEL (Díaz Uriarte. BMC Bioinformatics 2006) and oob error smoothing (Genuer et al. INRIA 2008)


The details about the implementation of the algorithm can be seen in Arevalillo and Navarro (2011), Fundamenta Informaticae Special issue on Machine Learning in Bioinformatics

Application to the colon cancer data


Gene expression levels for 40 tumor and 22 healthy tissue samples were collected with an Affymetrix oligonucleotide Hum6000 array (Alon et al. PNAS 1999). The expression levels were arranged in a matrix with 62 rows (samples) and 2000 columns (genes), along with a column containing the clinical outcome variable Y: Y = 1 for tumorous samples and Y = 0 for healthy samples


The data are publicly available and can be downloaded from the colonCA package of Bioconductor, www.bioconductor.org

Data pre-processing


Gene expression intensities were pre-processed with a log transformation and a standardization across genes
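A possible reading of this step is sketched below. It assumes a natural logarithm and that "across genes" means each sample is centered and scaled over its own genes; both are assumptions, since neither the base of the logarithm nor the direction of the standardization is stated here. The toy intensities are invented for illustration.

```python
import math

def preprocess(matrix):
    """Log-transform intensities, then center and scale each sample across its genes.

    ASSUMPTION: natural log and row-wise (per sample) standardization.
    """
    out = []
    for sample in matrix:
        logged = [math.log(v) for v in sample]
        mean = sum(logged) / len(logged)
        sd = math.sqrt(sum((v - mean) ** 2 for v in logged) / (len(logged) - 1))
        out.append([(v - mean) / sd for v in logged])
    return out

# Toy intensities: 2 samples (rows) by 4 genes (columns).
toy = [[120.0, 850.0, 40.0, 3100.0],
       [95.0, 1200.0, 55.0, 2800.0]]
processed = preprocess(toy)
print(processed)
```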

The figure shows the potential outliers given by the RF outlier detector. Cases 18, 20, 52, 55 and 58 were previously identified as outliers in the specialized literature (Chow et al. Physiol. Genomics 2001; Ambroise and McLachlan. PNAS 2002)


These outliers might be caused by different sources of error while collecting the data. We eliminate them from the analysis and end up with a data set containing 57 cases and 2000 predictors

A first selection. Sequential search


List of genes selected after the sequential search step. It agrees closely with previous selections (Ben-Dor et al. J. Comp. Biol. 2000)

Simple inspection of the screeplot of RF variable importance allows us to identify the most relevant variables. A forward sequential search strategy as in Genuer (2008) gives a selection containing the most informative genes for classifying the clinical outcome

Results


Control parameters for the hunting stage of the procedure were set to block size = 5, k = 5 and b = 5. The RF controls ntree and mtry (the number R of candidate inputs at each node) were set to their default values

Findings for three top ranked block matches (heat map plots of the oob for each match and the scatter plots for the selected gene to gene interactions)

Bivariate gene interaction

(X86693, M80815)

(R60883, U04953)

(L12350, X86693)


Summary and conclusions


RF is a widely used algorithm for classification and variable selection in high dimensional small sample data. However, sequential search strategies based on the oob error and the RF ranking of variable importance usually fail to uncover hidden weak marginal / strong bivariate interactions in these data structures

This happens because of the curse of dimensionality and the small sample size; both degrade the performance of the RF classifier. Data augmentation and an exhaustive exploration of the feature space by blocks, which uses RF as the search engine, protect us from this phenomenon

A perturbed oob measure is obtained when RF is run for all the features belonging to every pair of blocks in the augmented dataset

The ranking of perturbed oobs therefore limits the search from the set of all possible bivariate interactions to the variables within the top ranked blocks

The application of the proposed bivariate interaction detector algorithm to a real gene expression dataset uncovered WM/SB gene to gene interactions associated with the phenotype


Future research


The method was proposed for binary classification. Its extension to multi-class problems and the development of tricks and shortcuts that reduce the computational cost open future research avenues

The interaction detector algorithm uses RF as the search engine. The use of other search engines based on classifiers like LDA, QDA, SVM, … is also an issue for future research. Recently, Arevalillo and Navarro (2011, BMC Bioinformatics) proposed QDA as the search engine

The development of an R package that incorporates all these improvements

Finally, the study of the problem of finding informative WM/SB genomic interactions in SNP data is an open research issue


Thank you for your attention


Jorge M. Arevalillo: [email protected] Hilario Navarro: [email protected]

Department of Statistics and Operational Research

Universidad Nacional de Educación a Distancia

Paseo Senda del Rey nº 9. 28040 Madrid

