Unsupervised analysis of gene expression data
Bing Zhang
Department of Biomedical Informatics, Vanderbilt University
Overall workflow of a microarray study
Biological question
Experiment design
Microarray experiment
Image analysis
Pre-processing
Data analysis
Hypothesis
Experimental verification
Applied Bioinformatics, Spring 2011 2
Three major goals of gene expression studies
Class comparison (supervised analysis) e.g. disease biomarker discovery
Differential expression analysis
Input: gene expression data, class label of the samples
Output: differentially expressed genes
Class detection (unsupervised analysis) e.g. patient subgroup detection
Clustering analysis
Input: gene expression data
Output: groups of similar samples or genes
Class prediction (supervised learning) e.g. disease diagnosis and prognosis
Machine learning techniques
Input: gene expression data, class label of the samples (training data)
Output: prediction model
[Table: example gene expression matrix. Rows are probe sets (Probe_set_ID, e.g. 1007_s_at, 1053_at, 117_at); columns are six samples in two groups of three. The numeric values are not recoverable.]
What is clustering
Clustering algorithms are methods that divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities
Unsupervised techniques that do not require sample annotation in the process
Example expression matrix (rows: genes; columns: samples)

          Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  ……
TNNC1     14.82     14.46     14.76     11.22     11.55     ……
DKK4      10.71     10.37     11.23     19.74     19.73     ……
ZNF185    15.20     14.96     15.07     12.57     12.37     ……
CHST3     13.40     13.18     13.15     11.18     10.99     ……
FABP3     15.87     15.80     15.85     13.16     12.99     ……
MGST1     12.76     12.80     12.67     14.92     15.02     ……
DEFA5     10.63     10.47     10.54     15.52     15.52     ……
VIL1      11.47     11.69     11.87     13.94     14.01     ……
AKAP12    18.26     18.10     18.50     15.60     15.69     ……
HS3ST1    10.61     10.67     10.50     12.44     12.23     ……
……        ……        ……        ……        ……        ……        ……
Why clustering?
Exploratory data analysis, providing rough maps and suggesting directions for further study
Representing distances among high-dimensional expression profiles in a concise, visually effective way, such as a tree or dendrogram
Identifying candidate subgroups in complex data, e.g. novel sub-types in cancer or sets of co-expressed genes
Functional annotation based on guilt by association: a gene of unknown function that clusters with genes of known function is a candidate for sharing that function
Clustering methods
Hierarchical clustering: generate a hierarchy of clusters going from 1 cluster to n clusters
Partitioning: divide the data into g groups using a reallocation algorithm, e.g. K-means
Hierarchical clustering
Agglomerative clustering (bottom-up)
  Start with every sample unit in its own cluster: n clusters of size 1.
  At each step, the pair of clusters with the shortest distance between them is combined into a single cluster.
  The algorithm stops when all sample units are combined into a single cluster of size n.
Divisive clustering (top-down)
  Start with all sample units in a single cluster of size n.
  At each step, a cluster is partitioned into a pair of daughter clusters, chosen to maximize the distance between the two daughters.
  The algorithm stops when the sample units are partitioned into n clusters of size 1.
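The agglomerative steps above can be sketched in a few lines of plain Python. This is a naive illustration, not an efficient implementation: the function name `agglomerative`, the toy one-dimensional data, and the default of taking the smallest pairwise distance between two clusters are all assumptions made for the example.

```python
from itertools import combinations

def agglomerative(points, dist, linkage=min):
    """Naive bottom-up clustering; returns the merges in the order performed."""
    # Start with n clusters of size 1 (each cluster is a tuple of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the shortest distance between them.
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Combine that pair into a single cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    # The loop ends when one cluster of size n remains.
    return merges

# Hypothetical toy data: five objects measured on a single variable.
points = [1.0, 2.0, 6.0, 7.0, 15.0]
merges = agglomerative(points, dist=lambda x, y: abs(x - y))
for a, b, d in merges:
    print(a, "+", b, "merged at distance", d)
```

Each recorded merge distance corresponds to the height of a join in a dendrogram.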
Agglomerative clustering
Requires two kinds of distance measurement
  Between two objects
  Between two clusters
Between objects distance measurement
Euclidean distance
  Focuses on the absolute expression values
Pearson correlation coefficient
  Focuses on the shape of the expression profile
  Parametric: assumes normally distributed data and a linear relationship
Spearman correlation coefficient
  Focuses on the shape of the expression profile
  Non-parametric: makes no distributional assumption
  Less sensitive but more robust than Pearson
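Euclidean distance:

$$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Pearson correlation coefficient:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Correlation-based distance:

$$d = 1 - r$$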
Different measurement, different distance
[Figure: gene expression level (log2) of GeneA, GeneB, GeneC, and GeneD over time (hr), time points 1 to 7]
Most similar profile to GeneA (blue) under each distance measure:
  Euclidean: GeneB (pink)
  Pearson: GeneC (green)
  Spearman: GeneD (red)
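A minimal sketch of the three between-object measures, showing how each can pick a different nearest neighbour for the same gene. The profile values below are invented for illustration (the data behind the figure are not available), and Spearman is computed here as the Pearson correlation of average ranks, with both correlations turned into distances as 1 - r.

```python
import math

def euclidean(x, y):
    # Square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den

def ranks(v):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

# Hypothetical profiles over five time points (not the values from the figure).
profiles = {
    "GeneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GeneB": [3.0, 2.0, 3.0, 4.0, 3.0],       # close in absolute value
    "GeneC": [10.0, 20.0, 30.0, 50.0, 40.0],  # nearly linear in GeneA, different scale
    "GeneD": [0.0, 0.1, 0.2, 0.3, 10.0],      # same rank order, non-linear
}
measures = {
    "Euclidean": euclidean,
    "Pearson": lambda x, y: 1 - pearson(x, y),
    "Spearman": lambda x, y: 1 - spearman(x, y),
}
target = profiles["GeneA"]
nearest = {
    name: min((g for g in profiles if g != "GeneA"),
              key=lambda g: d(target, profiles[g]))
    for name, d in measures.items()
}
print(nearest)  # {'Euclidean': 'GeneB', 'Pearson': 'GeneC', 'Spearman': 'GeneD'}
```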
Between cluster distance measurement
Single linkage: the smallest of all pairwise distances between members of the two clusters
Complete linkage: the largest of all pairwise distances between members of the two clusters
Average linkage: the average of all pairwise distances between members of the two clusters
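As a small worked example, the three linkage rules can be written as functions over the pairwise distances between two clusters. The helper names and the two toy clusters are assumptions for illustration.

```python
def pairwise(cluster_a, cluster_b, dist):
    # All distances between one member of cluster_a and one member of cluster_b.
    return [dist(a, b) for a in cluster_a for b in cluster_b]

def single_linkage(cluster_a, cluster_b, dist):
    return min(pairwise(cluster_a, cluster_b, dist))

def complete_linkage(cluster_a, cluster_b, dist):
    return max(pairwise(cluster_a, cluster_b, dist))

def average_linkage(cluster_a, cluster_b, dist):
    d = pairwise(cluster_a, cluster_b, dist)
    return sum(d) / len(d)

# Hypothetical clusters of objects measured on a single variable.
cluster_a = [1.0, 2.0]
cluster_b = [6.0, 9.0]
absdiff = lambda x, y: abs(x - y)

print(single_linkage(cluster_a, cluster_b, absdiff))    # 4.0
print(complete_linkage(cluster_a, cluster_b, absdiff))  # 8.0
print(average_linkage(cluster_a, cluster_b, absdiff))   # 6.0
```

The pairwise distances here are 5, 8, 4, and 7, so the same two clusters are 4, 8, or 6 apart depending on the rule chosen.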
Visualization and interpretation of hierarchical clustering results
Dendrogram
  Output of a hierarchical clustering
  Tree structure with the genes or samples as the leaves
  The height of each join indicates the distance between the left branch and the right branch
Heat map
  Graphical representation of data in which the values are represented as colors
Partitioning
General idea
  Select the number of groups, g
  Randomly divide the objects into g groups
  Iteratively reassign the objects until a stop condition is met
Representative methods
  K-means
  Self-Organizing Map (SOM)
K-means
1. Define k = number of clusters
2. Randomly initialize a seed vector for each cluster
3. Go through all objects, and assign each object to the cluster to which it is most similar
4. Recalculate each seed vector as the mean of the patterns in its cluster
5. Repeat steps 3 and 4 until a stop condition is met (e.g. until all objects are assigned to the same partition twice in a row)
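The K-means steps above can be sketched in plain Python as follows. Several details are assumptions made for the example rather than part of the method as stated: seeds are drawn from the objects themselves, similarity is measured by squared Euclidean distance, an empty cluster keeps its previous seed, and the function name, `max_iter`, and `seed` parameters are hypothetical.

```python
import random

def kmeans(objects, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly initialize one seed vector per cluster
    # (assumption: pick k of the objects as the initial seeds).
    seeds = rng.sample(objects, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to the seed it is most similar to
        # (smallest squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(obj, seeds[c])))
            for obj in objects
        ]
        # Stop condition: the same partition twice in a row.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recalculate each seed as the mean of its cluster members.
        for c in range(k):
            members = [obj for obj, a in zip(objects, assignment) if a == c]
            if members:  # an empty cluster keeps its previous seed
                seeds[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, seeds

# Hypothetical toy data: two well-separated groups of three objects each.
objects = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0],
           [11.0, 11.0], [12.0, 12.0], [13.0, 13.0]]
assignment, seeds = kmeans(objects, k=2)
print(assignment)
print(seeds)
```

Which group receives label 0 and which receives label 1 depends on the random initialization, so only the grouping itself is meaningful. On less clearly separated data, different initial seeds can also lead to different final partitions.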
K-means
[Figure: K-means illustrated with two seed vectors (seed vector 1, seed vector 2), in panels:]
  Randomly initialize seeds
  Objects join with closest seed
  Recalculate seeds
  Reassign objects
  Recalculate seeds
  Reassign objects
  Seeds become stable: final clusters
Cool animations
Hierarchical clustering: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
K-means: http://animation.yihui.name/mvstat:k-means_cluster_algorithm
Resources
Data sources
  Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
  ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
Microarray data analysis tools
  Bioconductor: http://www.bioconductor.org/
  Expression Profiler: http://www.ebi.ac.uk/expressionprofiler/
Summary
Agglomerative clustering
  Bottom-up
  Between-object distance measurement: Euclidean distance, Pearson’s correlation coefficient, Spearman’s correlation coefficient
  Between-cluster distance measurement: single linkage, complete linkage, average linkage
  Visualization: dendrogram, heat map
K-means clustering
  Partitioning
Exercise
Data set: evan_deneris_2010_5ht_top500diff.txt
500 selected probe sets
Four groups (Rostral_5ht, Rostral_non5ht, Caudal_5ht, Caudal_non5ht)
No missing values; already normalized; already log-transformed
Use hierarchical clustering in Expression Profiler (http://www.ebi.ac.uk/expressionprofiler) to generate a heat map