Cluster validation and integration
ICES Bioinformatics
Overview
INTRODUCTION
MICROARRAY ANALYSIS
VALIDATION OF THE RESULTS
• Statistical validation
• Biological validation
INTEGRATION
Cluster validation
[Figure: combinations of Preprocessing 1 to 2, Clustering Algorithm 1 to 3 and Parameter Setting 1 to 3, all leading to Validation]
Why cluster validation?
• Different algorithms, parameters
• Intrinsic properties of the dataset (sensitivity to noise, to outliers)
STATISTICAL VALIDATION
• Sensitivity analysis
– Leave-one-out cross validation (FOM)
– Sensitivity analysis
  • Gaussian noise
  • ANOVA
• Cluster coherence testing
– Euclidean distance score
– Gap statistics
Statistical validation
Figure of Merit (sensitivity towards an experiment)
• Tested cluster algorithm is applied to all experimental conditions except the left out condition
• Hypothesis: if the cluster algorithm is robust it can predict the measured values of the left out condition
• To estimate the predictive power of the algorithm FOM is calculated
FOM is the root mean square deviation, in the left-out condition e, of the individual gene expression levels from their cluster means
This is repeated for every condition in turn and the average FOM is calculated
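The leave-one-out procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Yeung et al.: the function names (`kmeans`, `fom`, `average_fom`) are hypothetical, the tiny k-means only stands in for "the tested cluster algorithm", and the data layout (one list of expression values per gene) is an assumption.

```python
import math
import random


def kmeans(rows, k, iters=20, seed=0):
    """Tiny k-means, used only as a stand-in for the tested cluster algorithm."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every gene to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])))
            for r in rows
        ]
        # Recompute the centers; an empty cluster keeps its previous center.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def fom(data, e, k, cluster_fn=kmeans):
    """RMS deviation, in left-out condition e, of gene values from their cluster means."""
    reduced = [row[:e] + row[e + 1:] for row in data]  # cluster without condition e
    labels = cluster_fn(reduced, k)
    total = 0.0
    for c in set(labels):
        values = [row[e] for row, lab in zip(data, labels) if lab == c]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return math.sqrt(total / len(data))


def average_fom(data, k, cluster_fn=kmeans):
    """Average of the FOM over all conditions, each left out in turn."""
    n_conditions = len(data[0])
    return sum(fom(data, e, k, cluster_fn) for e in range(n_conditions)) / n_conditions
```

A lower average FOM suggests that the clusters found without a condition still describe that condition well. Any other algorithm can be plugged in through `cluster_fn`, as long as it takes the rows and k and returns one label per gene.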
Yeung et al., 2001
Sensitivity analysis towards the signal to noise ratio
Sensitivity analysis = A way of assigning confidence to the cluster membership
– Create new in silico replicas of the dataset of interest by adding a small amount of noise to the original data
– Treat the new datasets like the original one and cluster them
– Genes that are consistently clustered together over all in silico replicas are considered robust towards added noise
How to determine the noise?
• Gaussian noise with mean 0 and a standard deviation estimated as the median, over all genes, of the standard deviation of the log-ratios across the different experiments (Bittner et al. 2000)
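A minimal sketch of this noise model, assuming the data are stored as one list of log-ratios per gene. The function names are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption; the slide does not say which estimator is meant.

```python
import random
import statistics


def noise_sd(log_ratios):
    """Median over genes of the per-gene standard deviation across experiments."""
    return statistics.median(statistics.stdev(row) for row in log_ratios)


def noisy_replicas(log_ratios, n_replicas, seed=0):
    """Create in silico replicas by adding zero-mean Gaussian noise to every value."""
    rng = random.Random(seed)
    sd = noise_sd(log_ratios)
    return [
        [[x + rng.gauss(0.0, sd) for x in row] for row in log_ratios]
        for _ in range(n_replicas)
    ]
```

Each replica has the same shape as the original dataset and can be clustered in the same way.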
How to determine the noise? How to generate simulated datasets?
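$$y^{*}_{ijkg} = \hat{\mu} + \hat{A}_i + \hat{D}_j + \hat{G}_g + (\widehat{AG})_{ig} + (\widehat{VG})_{kg} + (\widehat{DG})_{jg} + \epsilon^{*}_{ijkg}$$
• Noise based on the appropriate ANOVA model: the residual term of the fitted model describes the noise
• The hatted values are the estimates from the original fit
• The $\epsilon^{*}_{ijkg}$ are drawn with replacement from the studentized residuals of the original fit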
Clustering is repeated on the simulated datasets
Comparing cluster results
• Cluster label known: determine the stability of a gene, i.e. the percentage of bootstrap cluster experiments in which the gene is assigned to the same cluster
• Cluster label unknown:
– Identify pairs of genes that cluster together in Ĉ and count the frequency with which such pairs cluster together in the bootstrapped clusterings Ĉ*. When the pairs of genes cluster together reliably, stable clusters emerge
– RAND index (Yeung et al. 2001)
– Jaccard coefficient (Ben-Hur et al. 2002)
Approximate the confidence in the clustering output of a gene
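For the case where cluster labels are unknown, the pair counting can be sketched as follows. This is an illustrative sketch under stated assumptions: each clustering is a list with one cluster label per gene, and `pair_stability` is a hypothetical name.

```python
from itertools import combinations


def pair_stability(original_labels, replica_labelings):
    """For every gene pair clustered together originally, return the fraction
    of replica clusterings in which the pair is still clustered together."""
    stability = {}
    for i, j in combinations(range(len(original_labels)), 2):
        if original_labels[i] == original_labels[j]:
            together = sum(1 for labels in replica_labelings if labels[i] == labels[j])
            stability[(i, j)] = together / len(replica_labelings)
    return stability
```

Because only co-membership is compared, the cluster numbering may differ between the original clustering and the replicas.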
[Figure: four cluster experiments (Cluster exp 1 to 4), shown once with a single cluster C1 per experiment and once with clusters C1, C2, C3, … per experiment]
RAND index
• Statistic designed to assess the degree of agreement between two partitions
• Usually an unknown partition against an external standard
Adjusted RAND index
• Adjusted so that the expected value of the RAND index between two random partitions is zero
$$\mathrm{RAND} = \frac{a + d}{a + b + c + d}$$
a: the number of object pairs that are clustered together in dataset 1 and in dataset 2
b: the number of object pairs that are clustered together in dataset 1 but not in dataset 2
c: the number of object pairs that are clustered together in dataset 2 but not in dataset 1
d: the number of object pairs that are put in different clusters in both datasets
a, d: agreement between cluster results
b, c: disagreement between cluster results
The Rand index is defined as the fraction of agreement, that is, the number of pairs of objects that are either in the same group in both partitions (a) or in different groups in both partitions (d), divided by the total number of pairs of objects (a + b + c + d). The Rand index lies between 0 and 1.
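The pair counts a, b, c and d can be obtained directly from two label vectors. The sketch below assumes each clustering is given as a list with one label per object; the function names are illustrative.

```python
from itertools import combinations


def pair_counts(labels1, labels2):
    """Return (a, b, c, d) as defined above for two clusterings of the same objects."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1  # together in both
        elif same1:
            b += 1  # together in dataset 1 only
        elif same2:
            c += 1  # together in dataset 2 only
        else:
            d += 1  # separated in both
    return a, b, c, d


def rand_index(labels1, labels2):
    """Fraction of object pairs on which the two clusterings agree."""
    a, b, c, d = pair_counts(labels1, labels2)
    return (a + d) / (a + b + c + d)
```

For example, `rand_index([0, 0, 1, 1], [0, 0, 1, 2])` gives a = 1, b = 1, c = 0, d = 4 and therefore 5/6.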
• Jaccard coefficient
Matrix A: cluster results of dataset 1 (# genes × # genes, binary)
Matrix B: cluster results of dataset 2 (# genes × # genes, binary)
C_ij = 1 if x_i and x_j belong to the same cluster; 0 otherwise
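$$\langle C^{(1)}, C^{(2)} \rangle = \sum_{i,j} C^{(1)}_{ij} C^{(2)}_{ij}$$
$$J(L_1, L_2) = \frac{\langle C^{(1)}, C^{(2)} \rangle}{\langle C^{(1)}, C^{(1)} \rangle + \langle C^{(2)}, C^{(2)} \rangle - \langle C^{(1)}, C^{(2)} \rangle}$$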
Jaccard coefficient
Based on the clusters of one dataset, a binary pair vector is calculated, in which each element corresponds to a unique pair of genes and has the value one if both genes were clustered into the same cluster and zero otherwise.
From two such pair vectors, one derived from the first dataset and the other from the second dataset, the Jaccard coefficient is computed. This coefficient measures the agreement between the two binary matrices.
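A minimal sketch of this computation, using the matrix form: the dot product of two co-membership matrices counts the gene pairs that are together in both clusterings. Setting the diagonal to 0, so that a gene is not paired with itself, is an assumption made here; the function names are illustrative.

```python
def comembership(labels):
    """Binary matrix with entry 1 if genes i and j (i != j) share a cluster."""
    n = len(labels)
    return [
        [1 if i != j and labels[i] == labels[j] else 0 for j in range(n)]
        for i in range(n)
    ]


def dot(c1, c2):
    """Sum over all entries of the element-wise product of two matrices."""
    return sum(x * y for row1, row2 in zip(c1, c2) for x, y in zip(row1, row2))


def jaccard(labels1, labels2):
    """Jaccard coefficient between two clusterings of the same genes."""
    c1, c2 = comembership(labels1), comembership(labels2)
    d12 = dot(c1, c2)
    return d12 / (dot(c1, c1) + dot(c2, c2) - d12)
```

Unlike the Rand index, pairs that are separated in both clusterings do not contribute. The sketch does not handle the degenerate case in which both clusterings consist of singletons only (zero denominator).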
Cluster coherence testing
k points (genes) in cluster S
p experiments (dimensions)
$\bar{x}_j$: average profile of the cluster (mean over its genes in experiment j)
$$V_w = \frac{1}{p} \sum_{j=1}^{p} \left[ \frac{1}{k} \sum_{i \in S} (x_{ij} - \bar{x}_j)^2 \right]$$
Vw:
• Variance of the genes about the cluster average, averaged over all experiments
• Minimizing Vw maximizes the coherence of the genes within a cluster
Euclidean distance
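$$V_T = V_B + V_w$$
$$V_w = \frac{1}{p} \sum_{j=1}^{p} \left[ \frac{1}{k} \sum_{i \in S} (x_{ij} - \bar{x}_j)^2 \right]$$
$$V_B = \frac{1}{p} \sum_{j=1}^{p} (\bar{x}_j - \bar{x})^2$$
p experiments
$\bar{x}_j$: cluster average profile
$\bar{x}$: average of the cluster average profile over the p experiments
VB:
• Describes how the average at each experimental point oscillates around the average of the cluster average profile
• Maximizing VB maximizes the variance across experiments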
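The two variance terms can be computed directly from the profiles of one cluster. The sketch below assumes a cluster is given as a list of k gene profiles with p values each; `cluster_variances` is a hypothetical name.

```python
def cluster_variances(cluster):
    """Return (V_w, V_B, V_T) for one cluster of k gene profiles of length p."""
    k, p = len(cluster), len(cluster[0])
    # Cluster average profile: mean over the genes at every experiment j.
    avg_profile = [sum(gene[j] for gene in cluster) / k for j in range(p)]
    # Average of the cluster average profile over all experiments.
    grand_mean = sum(avg_profile) / p
    v_w = sum(
        sum((gene[j] - avg_profile[j]) ** 2 for gene in cluster) / k for j in range(p)
    ) / p
    v_b = sum((m - grand_mean) ** 2 for m in avg_profile) / p
    return v_w, v_b, v_w + v_b
```

With these definitions the sum of the two terms equals the variance of all k × p values about the overall mean, which is a convenient check. For the cluster `[[1, 3], [3, 5]]` both terms are 1 and the total is 2.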
Gap score
Score function:
• R² selects clusters containing tightly co-expressed genes (minimal Vw) that show a highly variable profile (high VB) across the experiments (i.e. affected by the signal studied).
• The score is compared to a similar score calculated for a randomly generated cluster (bootstrapping)
• The difference between the score of the randomly generated cluster and that of the cluster of interest is calculated (gap statistics)
$$R^2 = 100 \cdot \frac{V_B}{V_T} = 100 \cdot \frac{V_B / V_w}{1 + V_B / V_w}$$
$$\text{Gapscore} = S_k = R^2_k - R^2_{\text{random}}$$
Gap statistics
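A rough sketch of the gap score, under several assumptions that the slide does not settle: R² is taken as 100 · VB / (VB + Vw), a "randomly generated cluster" is a random sample of genes of the same size as the cluster of interest, and the random score is averaged over `n_random` such samples. The function names are hypothetical.

```python
import random


def r_squared(cluster):
    """100 * V_B / (V_B + V_w) for a cluster given as a list of gene profiles."""
    k, p = len(cluster), len(cluster[0])
    avg_profile = [sum(gene[j] for gene in cluster) / k for j in range(p)]
    grand_mean = sum(avg_profile) / p
    v_w = sum(
        sum((gene[j] - avg_profile[j]) ** 2 for gene in cluster) / k for j in range(p)
    ) / p
    v_b = sum((m - grand_mean) ** 2 for m in avg_profile) / p
    if v_w + v_b == 0:
        return 0.0  # all values identical: no variance to explain
    return 100.0 * v_b / (v_b + v_w)


def gap_score(data, members, n_random=100, seed=0):
    """R^2 of the cluster of interest minus the mean R^2 of random clusters."""
    rng = random.Random(seed)
    r2_cluster = r_squared([data[i] for i in members])
    r2_random = [r_squared(rng.sample(data, len(members))) for _ in range(n_random)]
    return r2_cluster - sum(r2_random) / n_random
```

A clearly positive gap indicates that the cluster is more coherent and more variable across experiments than a random set of genes of the same size.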
Biological validation
dataset
small clusters
• contain genes with highly similar profile (+)
• some information given up in first step (-)
validate “core” clusters
• Motif finding (DNA level)
• literature/knowledge
extend clusters
big clusters
• contain all real positives (+)
• increasing number of false positives (-)
Microarrays and TextMining
Rationale:
Clustering Accession Nrs
AC0020D11428
SRS, Medline, GeneCards, ...
Manual query: huge task
data Literature/knowledge
Controlled vocabularies
Cluster  Number   MIPS functional category                      ORFs within          P-value
number   of ORFs  (top-level)                                   functional category  (-log10)
1        426      energy                                        47                   10
                  transport facilitation                        40                   5
3        196      cell growth, cell division and DNA synthesis  48                   5
4        149      protein synthesis                             71                   50
                  cellular organisation                         107                  19
5        159      cell rescue, defense, cell death and ageing   20                   4
6        171      cell growth, cell division and DNA synthesis  76                   24
9        78       cell growth, cell division and DNA synthesis  23                   4
37       11       metabolism                                    9                    6
[Column "Graphical representation of cluster": images not shown]
Cumulative hypergeometric distribution
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i} \binom{g-f}{n-i}}{\binom{g}{n}} = \sum_{i=k}^{\min(n,f)} \frac{\binom{f}{i} \binom{g-f}{n-i}}{\binom{g}{n}}$$
P is the p-value that this degree of enrichment could have occurred by chance (implemented in Onto-Express)
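A minimal sketch of the cumulative hypergeometric p-value. The slide does not define the symbols; the usual reading, assumed here, is g = total number of genes, f = number of genes in the functional category, n = number of genes in the cluster and k = number of genes in the cluster that belong to the category. The function name is illustrative and is not the Onto-Express API.

```python
from math import comb


def enrichment_p_value(g, f, n, k):
    """Probability of seeing at least k category genes in a cluster of n genes
    drawn at random from g genes, f of which belong to the category."""
    upper = min(n, f)
    favourable = sum(comb(f, i) * comb(g - f, n - i) for i in range(k, upper + 1))
    return favourable / comb(g, n)
```

For example, with g = 10, f = 4, n = 3 and k = 2 the sum is (6 · 6 + 4 · 1) / 120 = 1/3. The table above reports such values as -log10 of the p-value.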
χ² test or Fisher exact test (as implemented in the FatiGO software)
$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - E_i)^2}{E_i}$$
N1: number of genes on the chip
N2: number of differentially expressed genes
Microarrays and Motif Finding
cDNA arrays
Motif finding
Clustering
Preprocessing of the data
EMBL, BLAST
Upstream regions
Gibbs sampling
Overview
INTRODUCTION
MICROARRAY ANALYSIS
VALIDATION OF THE RESULTS
• Statistical validation
• Biological validation
INTEGRATION
• IT level
• Algorithmic level
Integration
Need for integrated tool
Integration
Need for integrated algorithms
• Retain high sensitivity (minimize number of false negatives)
• Reduce level of noise (minimize number of false positives)
• Incorporate a priori information
• Combine data from different sources that can mutually confirm each other
• Example: sequence information and expression profiles
• Server rMotif (Lapidot and Pilpel, 2003)
• Selects genes from a microarray if they
– contain a motif
– have a highly correlated expression profile
• Motif diagnosis tool
• Measures the extent to which a set of genes (that contain a given motif in their promoter) display expression profiles similar to each other at a given set of conditions (analyzed by microarrays)
• The score (EC, expression coherence) of a set of N genes is defined as the number p of pairs of genes in the set for which the Euclidean distance between the mean- and variance-normalized profiles falls below a threshold D, divided by the total number of pairs in the set
• EC = p / [0.5 · N · (N − 1)]
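The EC score can be sketched as follows. This is an illustration, not the rMotif implementation: the function names are hypothetical, each profile is assumed to be a list of expression values, and normalization uses the population standard deviation (dividing by the number of conditions), which is an assumption.

```python
import math
from itertools import combinations


def normalize(profile):
    """Rescale a profile to mean 0 and variance 1."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if sd == 0:
        raise ValueError("a constant profile cannot be variance-normalized")
    return [(x - mean) / sd for x in profile]


def expression_coherence(profiles, threshold):
    """Fraction of gene pairs whose normalized profiles lie closer than threshold D."""
    normed = [normalize(p) for p in profiles]
    n = len(normed)
    close_pairs = sum(1 for u, v in combinations(normed, 2) if math.dist(u, v) < threshold)
    return close_pairs / (0.5 * n * (n - 1))
```

For the three profiles `[1, 2, 3]`, `[2, 4, 6]` and `[3, 2, 1]` with D = 0.5, only the first two are close after normalization, so p = 1 and EC = 1/3.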