Jorge M. Arevalillo and Hilario Navarro
Dpt. Statistics and Operational Research
Universidad Nacional de Educación a Distancia
A Study of Random Forests Learning Mechanism with Application to the
Identification of Informative Gene Interactions in Microarray Data
Salford Analytics and Data Mining Conference 2012. San Diego
Outline
Jorge M. Arevalillo and H. Navarro. RF Learning Mechanism with Application to the Identification of Gene Interactions
Weak Marginal / Strong bivariate genetic interactions
RF learning mechanism
RF bivariate interaction detector procedure
Controlling the curse of dimensionality
Handling the small sample effect
Application to microarray data
Conclusions
Human Genetics Basics
DNA is often described as the blueprint of living organisms. It is composed of two complementary strands of nucleotides (A-T, C-G)
Adenine (A) pairs with thymine (T) and cytosine (C) with guanine (G)
Basically, a gene is a piece of the DNA that contains the genetic information for the synthesis of a protein
The human genome in numbers
23 pairs of chromosomes
2 meters of DNA
A sequence 3 billion base pairs (bp) long
30000 – 40000 genes
Over 99% of the genome is identical in all human beings
The central dogma of molecular biology
The expression of the genetic information stored in the DNA occurs in two stages
1) TRANSCRIPTION. DNA is transcribed into messenger RNA (mRNA)
2) TRANSLATION. mRNA is transported to the cell cytoplasm and translated to produce a protein
Amino acids are used to construct proteins which in turn will determine the observed phenotype
DNA microarray technologies make it possible to measure the abundance of mRNA by monitoring the expression levels of hundreds or thousands of genes under different conditions of the phenotype
Weak marginal / Strong bivariate genetic interactions
In binary classification we define a WM/SB (weak marginal / strong bivariate) gene to gene interaction as a pair of variables (genes) whose joint distribution discriminates the outcome but whose marginal distributions are irrelevant for class separation
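As an illustration of this definition, the following minimal sketch simulates an XOR pattern, the synthetic example used later in the talk. The function names, the uniform inputs and the sign-based rules are assumptions made only for this example; they are not the authors' simulation design.

```python
import random

def make_xor_data(n, seed=0):
    """Simulate two inputs whose signs jointly determine a binary outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0)
        y = 1 if (x1 > 0) != (x2 > 0) else 0  # XOR of the two signs
        rows.append((x1, x2, y))
    return rows

def marginal_accuracy(rows, index):
    """Best accuracy of a sign rule that looks at a single input."""
    hits = sum((row[index] > 0) == (row[2] == 1) for row in rows)
    acc = hits / len(rows)
    return max(acc, 1.0 - acc)

def joint_accuracy(rows):
    """Accuracy of a rule that uses both inputs through the sign of their product."""
    hits = sum((x1 * x2 < 0) == (y == 1) for x1, x2, y in rows)
    return hits / len(rows)

data = make_xor_data(2000)
acc_x1 = marginal_accuracy(data, 0)
acc_x2 = marginal_accuracy(data, 1)
acc_joint = joint_accuracy(data)
print(acc_x1, acc_x2, acc_joint)
```

Each input on its own classifies at roughly chance level, while the pair separates the classes almost perfectly: a weak marginal / strong bivariate pattern.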
RF learning mechanism
Random Forest is an ensemble of decision trees grown in a special way
Randomness is injected into the RF mechanism in two ways: each tree in the forest is grown on a bootstrap resample, and the best splitter at each node is sought within a randomly selected subset of inputs
The number ntree of trees in the forest and the number R of candidate inputs for splitting each node must be set in advance. Defaults: ntree = 500 and R = square root of the number p of inputs
Each tree is grown on nearly 63% of the data. The classification error rate is estimated using the remaining 37% of left-out observations. The error rate evaluated on these out-of-bag cases is called the oob error
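The 63% figure follows from sampling n cases with replacement: the expected share of distinct cases is 1 - (1 - 1/n)^n, which tends to 1 - 1/e. The sketch below checks this by simulation; the sample size, the number of resamples and the variable names are hypothetical, and the integer part of the square root is used for R as an assumption about rounding.

```python
import math
import random

def in_bag_fraction(n, rng):
    """Fraction of distinct cases drawn in one bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

rng = random.Random(1)
n_cases = 1000
n_trees = 200  # hypothetical value, smaller than the default ntree = 500
fractions = [in_bag_fraction(n_cases, rng) for _ in range(n_trees)]
mean_in_bag = sum(fractions) / n_trees
expected = 1.0 - (1.0 - 1.0 / n_cases) ** n_cases  # tends to 1 - 1/e

# Default number of candidate splitters for p inputs, rounded down (assumption).
p = 2000
default_r = math.floor(math.sqrt(p))
print(round(mean_in_bag, 3), round(expected, 3), default_r)
```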
RF table of variable importance
The high dimensional nature of the data obtained from gene expression microarray experiments has created the need for variable selection procedures that separate relevant predictors (genes), which carry useful information for classifying the phenotype, from irrelevant ones
RF generates variable importance measures that allow predictors to be ranked according to their contribution to the predictive accuracy of the ensemble
RF gives two measures of variable importance
1) GINI MEASURE. Each variable is assigned a score that accumulates the improvements in the Gini index over all the nodes, in all the trees of the forest, that use the variable as the splitting variable
2) PERMUTATION BASED MEASURE. For each variable, its values are randomly permuted across the cases to produce a noisy predictor; this noisy predictor is used in place of the original one and the oob error is computed again. The importance of the variable is defined as the difference between the oob errors after and before permutation
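The permutation idea can be sketched in a few lines. This is not the RF implementation: a fixed toy classifier stands in for the fitted forest and a simulated evaluation set stands in for the out-of-bag cases; all names and data here are hypothetical.

```python
import random

def error_rate(predict, rows):
    """Share of cases whose predicted class differs from the observed class."""
    return sum(predict(x) != y for x, y in rows) / len(rows)

def permutation_importance(predict, rows, var, rng):
    """Increase in error after shuffling one input across the cases."""
    baseline = error_rate(predict, rows)
    shuffled = [x[var] for x, _ in rows]
    rng.shuffle(shuffled)
    permuted_rows = []
    for (x, y), value in zip(rows, shuffled):
        x_new = list(x)
        x_new[var] = value  # replace the input by its noisy, permuted version
        permuted_rows.append((x_new, y))
    return error_rate(predict, permuted_rows) - baseline

rng = random.Random(2)
# Hypothetical evaluation cases: input 0 drives the class, input 1 is pure noise.
cases = []
for _ in range(500):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    cases.append((x, 1 if x[0] > 0 else 0))

def toy_classifier(x):
    """Stand-in for a fitted forest: it only looks at input 0."""
    return 1 if x[0] > 0 else 0

importance = [permutation_importance(toy_classifier, cases, v, rng) for v in (0, 1)]
print(importance)
```

Permuting the informative input raises the error sharply, whereas permuting the noisy input leaves it unchanged, so the first input ranks higher.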
The oob error rate degradation in high dimensional settings
An extreme synthetic example. XOR interaction pattern
The oob error rate degrades rapidly as the number of noisy inputs increases; hence the XOR signal is lost
The interaction is captured as long as it appears alone, without the disturbance of the noisy inputs; so an exhaustive search among all the pairs of inputs is required if we want the RF learning mechanism to detect the interaction
Our proposal offers shortcuts and devices that simplify the search
Search procedure. Sequential stage
RF ranking of variable importance gives new insights regarding the degradation of the oob error rate
Some alternatives that explore this ranking in a sequential manner, Díaz Uriarte (2006) and Genuer (2008), have been proposed to identify relevant patterns correlated with the outcome
Search procedure. Hunting stage
The second stage is designed to hunt for hard-to-uncover bivariate associations, which are missed by sequential search strategies
The idea is to group the inputs in blocks, then run RF on all the variables belonging to each pair of blocks and use its oob error to highlight the block matches where the WM / SB interactions are more likely to appear. This will limit the search
[Figure: Block i and Block j form Match (i, j); the block matches are then ranked]
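The saving offered by the block search can be seen by counting RF runs. The sketch below is only a counting illustration under stated assumptions: the number of inputs is hypothetical, blocks are formed from consecutive inputs, and a match is taken to be a pair of distinct blocks (i < j).

```python
from itertools import combinations

def make_blocks(variables, bsize):
    """Split an ordered list of inputs into consecutive blocks of size bsize."""
    return [variables[i:i + bsize] for i in range(0, len(variables), bsize)]

def block_matches(blocks):
    """Enumerate every pair of distinct blocks as a match (i, j) with i < j."""
    return list(combinations(range(len(blocks)), 2))

p = 120      # hypothetical number of inputs
bsize = 6    # block size
variables = list(range(p))
blocks = make_blocks(variables, bsize)
matches = block_matches(blocks)

n_pairs = p * (p - 1) // 2   # RF runs needed by a pair-by-pair exhaustive search
n_matches = len(matches)     # RF runs needed to rank the block matches

# Inputs passed to RF for one match: the union of the two blocks.
i, j = matches[0]
match_inputs = blocks[i] + blocks[j]
print(len(blocks), n_matches, n_pairs, len(match_inputs))
```

Every pair of inputs, including pairs inside the same block, appears together in at least one match, so no bivariate interaction is excluded from the search; only the top ranked matches then need a pair-by-pair inspection.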
Drawback with the oob error rate
Simulation experiment with block size = 6
The boxplots show that the oob error rate cannot distinguish between block matches containing a weak marginal / strong bivariate association and block matches with only noisy inputs
The curses of dimensionality and low sample size are coming up again
[Figure: boxplots of the oob error rate for block matches containing the XOR interaction versus block matches with only noisy inputs. Sample sizes (40,40): overlap = 0.31. Sample sizes (40,20): overlap = 0.42]
Data augmentation
To overcome this drawback, the data are artificially augmented and the oob error rate of an RF run on the augmented data is then computed
Data perturbation is carried out according to the following scheme
r is the sample range of X
b is the number of bins into which the range is divided. It controls the amount of perturbation
An augmentation parameter k, which gives the factor by which the dataset must be amplified, is also introduced
The new oob error computed on the augmented merged dataset is actually a perturbed error rate measure. We call it perturbed oob
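The perturbation scheme itself was shown in a figure that is not reproduced in this text, so the following is only one plausible reading and not the authors' scheme. It assumes that each value of an input X is shifted by uniform noise in [-r/(2b), r/(2b)], that class labels are kept, and that k perturbed copies are merged with the original cases. The function name and the simulated data are hypothetical.

```python
import random

def augment(rows, b, k, rng):
    """Merge the original cases with k perturbed copies (assumed scheme).

    Assumption: each value of input X is shifted by uniform noise in
    [-r/(2b), r/(2b)], where r is the sample range of X; class labels are kept.
    """
    n_inputs = len(rows[0][0])
    ranges = []
    for v in range(n_inputs):
        column = [x[v] for x, _ in rows]
        ranges.append(max(column) - min(column))
    augmented = list(rows)
    for _ in range(k):
        for x, y in rows:
            noisy = [value + rng.uniform(-0.5, 0.5) * ranges[v] / b
                     for v, value in enumerate(x)]
            augmented.append((noisy, y))
    return augmented

rng = random.Random(3)
original = [([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], rng.randint(0, 1))
            for _ in range(40)]
merged = augment(original, b=5, k=3, rng=rng)
print(len(original), len(merged))
```

Under this reading, a larger b gives smaller perturbations and a larger k gives a larger merged dataset; RF would then be run on the merged cases to obtain the perturbed oob.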
Details in Arevalillo and Navarro (2011), Fundamenta Informaticae Special issue on Machine Learning in Bioinformatics
The perturbed oob measure
The perturbed oob measure overcomes the initial drawback
[Figure: boxplots of the perturbed oob against the augmentation factor k, shown next to the boxplots of the unperturbed oob error rate for XOR versus noisy inputs]
Sample sizes (40,40): overlap = 0.15, 0.07, 0.05, 0.05 and 0.03 for k = 1, 3, 5, 7 and 9 (unperturbed oob error rate: overlap = 0.31)
Sample sizes (40,20): overlap = 0.35, 0.24, 0.18, 0.16 and 0.14 for k = 1, 3, 5, 7 and 9 (unperturbed oob error rate: overlap = 0.42)
Summary of the algorithm
Usually bsize = 6 or 8, b = 5 and k = 3, 5 or 7 are good settings
Strategies for this step include: screeplots for variable importance, VARSEL (Díaz Uriarte, BMC Bioinformatics, 2006) and oob error smoothing (Genuer et al., INRIA, 2008)
The details of the implementation of the algorithm are given in Arevalillo and Navarro (2011), Fundamenta Informaticae, Special issue on Machine Learning in Bioinformatics
Application to the colon cancer data
Gene expression levels corresponding to 40 tumor and 22 healthy tissue samples were collected with an Affymetrix oligonucleotide Hum6000 array (Alon et al. PNAS 1999). The expression levels were arranged in a matrix with 2000 columns (genes) and 62 rows, along with a column containing the clinical outcome variable Y: Y = 1 for tumorous samples and Y = 0 for healthy samples
The data are publicly available and can be downloaded from the colonCA package of Bioconductor (www.bioconductor.org)
Data pre-processing
Gene expression intensities were pre-processed with a log transformation and a standardization across genes
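A minimal sketch of this kind of pre-processing is given below. It assumes a natural logarithm and reads "standardization across genes" as centering and scaling each sample over its genes; standardizing each gene over the samples would be the alternative reading. The intensities and the function name are hypothetical.

```python
import math
import statistics

def preprocess(matrix):
    """Log-transform intensities, then standardize each sample across its genes."""
    processed = []
    for sample in matrix:
        logged = [math.log(value) for value in sample]
        mean = statistics.fmean(logged)
        sd = statistics.stdev(logged)
        processed.append([(value - mean) / sd for value in logged])
    return processed

# Hypothetical raw intensities: 3 samples (rows) by 4 genes (columns).
raw = [[120.0, 85.5, 3021.0, 44.2],
       [98.3, 110.0, 2750.0, 51.9],
       [143.7, 79.1, 3310.0, 39.5]]
scaled = preprocess(raw)
row_means = [statistics.fmean(row) for row in scaled]
row_sds = [statistics.stdev(row) for row in scaled]
print(row_means, row_sds)
```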
The figure shows the potential outliers flagged by the RF outlier detector. Cases 18, 20, 52, 55 and 58 were previously identified as outliers in the specialized literature (Chow et al. Physiol. Genomics 2001; Ambroise and McLachlan. PNAS 2002)
These outliers might be caused by different sources of error while collecting the data. We eliminate them from the analysis and end up with a data set containing 57 cases and 2000 predictors
A first selection. Sequential search
List of genes selected after the sequential search step. It agrees closely with previous selections (Ben-Dor et al. J.Comp.Biol. 2000)
Simple inspection of the screeplot of RF variable importance allows us to identify the most relevant variables. A forward sequential search strategy as in Genuer (2008) gives a selection containing the most informative genes for classifying the clinical outcome
Results
Control parameters for the hunting stage of the procedure have been set to block size = 5, k = 5 and b = 5. RF controls ntree and mtry were set to their default values
Findings for three top ranked block matches (heat map plots of the oob for each match and the scatter plots for the selected gene to gene interactions)
Bivariate gene interaction
(X86693, M80815)
(R60883, U04953)
(L12350, X86693)
Summary and conclusions
RF is a widely used algorithm for classification and variable selection in high dimensional small sample data. However, sequential search strategies based on the oob error and its ranking of variable importance usually fail to uncover hidden weak marginal / strong bivariate interactions in these data structures
This happens because of the curse of dimensionality and the small sample size; both degrade the performance of the RF classifier. Data augmentation and an exhaustive exploration of the feature space by blocks, which uses RF as the search engine, protect us from this phenomenon
A perturbed oob measure is obtained when RF is run for all the features belonging to every pair of blocks in the augmented dataset
So the ranking of perturbed oobs will limit the search from the set of all possible bivariate interactions to the variables within the top ranked blocks
The application of the proposed bivariate interaction detector algorithm to a real gene expression data set was able to uncover WM/SB gene to gene interactions associated with the phenotype
Future research
The method was proposed for binary classification. Its extension to multi-class problems and the development of tricks and shortcuts that reduce the computational cost open future research avenues
The interaction detector algorithm uses RF as the search engine. The use of other search engines based on classifiers like LDA, QDA, SVM, … is also an issue for future research. Recently, Arevalillo and Navarro (2011, BMC Bioinformatics) have proposed QDA as the search engine
The development of an R package that incorporates all these improvements
Finally, the study of the problem of finding informative WM/SB genomic interactions in SNP data is an open research issue
Thank you for your attention
Jorge M. Arevalillo: [email protected] Hilario Navarro: [email protected]
Department of Statistics and Operational Research
Universidad Nacional de Educación a Distancia
Paseo Senda del Rey nº 9. 28040 Madrid