Unsupervised analysis of gene expression data · bioinfo.vanderbilt.edu/zhanglab/lectures/AB2011Lecture13.pdf

Unsupervised analysis of gene expression data

Bing Zhang

Department of Biomedical Informatics

Vanderbilt University


Overall workflow of a microarray study

  Biological question
  Experiment design
  Microarray experiment
  Image analysis
  Pre-processing
  Data analysis
  Hypothesis
  Experimental verification

Applied Bioinformatics, Spring 2011

Three major goals of gene expression studies

  Class comparison (supervised analysis), e.g. disease biomarker discovery
    Differential expression analysis
    Input: gene expression data, class labels of the samples
    Output: differentially expressed genes

  Class detection (unsupervised analysis), e.g. patient subgroup detection
    Clustering analysis
    Input: gene expression data
    Output: groups of similar samples or genes

  Class prediction (supervised learning), e.g. disease diagnosis and prognosis
    Machine learning techniques
    Input: gene expression data, class labels of the samples (training data)
    Output: prediction model




What is clustering

  Clustering algorithms divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities

  They are unsupervised techniques: no sample annotation is required in the process


Example expression matrix (rows: genes; columns: samples)

Gene      Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  ……
TNNC1     14.82     14.46     14.76     11.22     11.55     ……
DKK4      10.71     10.37     11.23     19.74     19.73     ……
ZNF185    15.20     14.96     15.07     12.57     12.37     ……
CHST3     13.40     13.18     13.15     11.18     10.99     ……
FABP3     15.87     15.80     15.85     13.16     12.99     ……
MGST1     12.76     12.80     12.67     14.92     15.02     ……
DEFA5     10.63     10.47     10.54     15.52     15.52     ……
VIL1      11.47     11.69     11.87     13.94     14.01     ……
AKAP12    18.26     18.10     18.50     15.60     15.69     ……
HS3ST1    10.61     10.67     10.50     12.44     12.23     ……
……        ……        ……        ……        ……        ……        ……


Why clustering?

  Exploratory data analysis, providing rough maps and suggesting directions for further study

  Representing distances among high-dimensional expression profiles in a concise, visually effective way, such as a tree or dendrogram

  Identifying candidate subgroups in complex data, e.g. novel sub-types in cancer or sets of co-expressed genes

  Functional annotation based on guilt by association: a gene of unknown function that clusters with genes of known function is a candidate for the same function


Clustering methods

  Hierarchical clustering: generates a hierarchy of clusters ranging from 1 cluster to n clusters

  Partitioning: divides the data into g groups using a reallocation algorithm, e.g. K-means


Hierarchical clustering

  Agglomerative clustering (bottom-up)
    Start with each sample unit in its own cluster: n clusters of size 1.
    At each step, the pair of clusters with the shortest distance is combined into a single cluster.
    The algorithm stops when all sample units are combined into a single cluster of size n.

  Divisive clustering (top-down)
    Start with all sample units in a single cluster of size n.
    At each step, a cluster is split into a pair of daughter clusters, chosen to maximize the distance between the two daughters.
    The algorithm stops when the sample units are partitioned into n clusters of size 1.
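The agglomerative (bottom-up) procedure can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation. Several things here are assumptions made for the example: objects are compared with Euclidean distance, the distance between two clusters is taken to be the smallest distance over all cross-cluster pairs (one choice among several), and the function names and data points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smallest_pairwise(cluster_a, cluster_b, points):
    """Assumed between-cluster distance: smallest distance over all cross-cluster pairs."""
    return min(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerate(points, cluster_distance=smallest_pairwise):
    """Merge clusters bottom-up; return the merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(points))]  # n clusters of size 1
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the shortest distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical profiles of four objects measured in two samples.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 7.0)]
merges = agglomerate(points)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 3))
```

The loop stops after n - 1 merges, when a single cluster of size n remains. The recorded merge distances are the heights at which a tree drawing of the result would join its branches.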


Agglomerative clustering

  Requires a distance measure
    Between two objects
    Between two clusters


Between objects distance measurement

  Euclidean distance
    Focuses on the absolute expression values

  Pearson correlation coefficient
    Focuses on the shape of the expression profile
    Parametric: assumes the data are normally distributed and follow a linear regression model

  Spearman correlation coefficient
    Focuses on the shape of the expression profile
    Non-parametric: makes no distributional assumption
    Less sensitive but more robust than Pearson


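Euclidean distance:

\[ d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

Pearson correlation coefficient:

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

Correlation-based distance:

\[ d = 1 - r \]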


Different measurement, different distance

[Figure: gene expression level (log2) versus time (hr, 1 to 7) for GeneA, GeneB, GeneC and GeneD]

Most similar profile to GeneA (blue) based on different distance measurements:

  Euclidean: GeneB (pink)
  Pearson: GeneC (green)
  Spearman: GeneD (red)
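The three between-object measures, and the way they can disagree about which profile is closest, can be sketched in plain Python. The profile values below are hypothetical and chosen only to reproduce the qualitative effect; they are not the values plotted on the slide. The function names are illustrative, and Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math


def mean(values):
    return sum(values) / len(values)


def euclidean_distance(x, y):
    """d = sqrt(sum_i (x_i - y_i)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den  # undefined (division by zero) for a constant profile


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_distance(r):
    """d = 1 - r"""
    return 1 - r


# Hypothetical profiles (NOT the values plotted in the lecture figure).
profiles = {
    "GeneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GeneB": [3.0, 2.5, 3.0, 3.5, 3.0],       # close to GeneA in absolute value
    "GeneC": [12.0, 14.0, 17.0, 16.0, 20.0],  # roughly linear in GeneA
    "GeneD": [10.0, 10.1, 10.2, 10.3, 20.0],  # same ordering as GeneA, nonlinear
}

measures = {
    "Euclidean": euclidean_distance,
    "Pearson": lambda x, y: correlation_distance(pearson(x, y)),
    "Spearman": lambda x, y: correlation_distance(spearman(x, y)),
}

nearest = {}
for name, distance in measures.items():
    others = [g for g in profiles if g != "GeneA"]
    nearest[name] = min(others, key=lambda g: distance(profiles["GeneA"], profiles[g]))
    print(name, "->", nearest[name])
```

With these made-up values, Euclidean distance favours the profile with similar absolute levels, Pearson favours the profile that is most nearly a linear function of GeneA, and Spearman favours the profile whose ordering of time points matches GeneA exactly.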


Between cluster distance measurement

  Single linkage: the smallest of all pairwise distances between members of the two clusters

  Complete linkage: the largest of all pairwise distances between members of the two clusters

  Average linkage: the average of all pairwise distances between members of the two clusters
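As a small illustration, the three linkage rules can each be written as a function that takes two clusters and a between-object distance. The function names, the one-dimensional example clusters, and the use of absolute difference as the object distance are assumptions made for this sketch.

```python
def single_linkage(cluster_a, cluster_b, distance):
    """Smallest of all pairwise distances between the two clusters."""
    return min(distance(a, b) for a in cluster_a for b in cluster_b)


def complete_linkage(cluster_a, cluster_b, distance):
    """Largest of all pairwise distances between the two clusters."""
    return max(distance(a, b) for a in cluster_a for b in cluster_b)


def average_linkage(cluster_a, cluster_b, distance):
    """Mean of all pairwise distances between the two clusters."""
    pairwise = [distance(a, b) for a in cluster_a for b in cluster_b]
    return sum(pairwise) / len(pairwise)


def absolute_difference(a, b):
    """Illustrative between-object distance for one-dimensional values."""
    return abs(a - b)


# Hypothetical one-dimensional clusters.
cluster_1 = [1.0, 2.0]
cluster_2 = [4.0, 8.0]

# The four pairwise distances are 3, 7, 2 and 6.
print(single_linkage(cluster_1, cluster_2, absolute_difference))    # 2.0
print(complete_linkage(cluster_1, cluster_2, absolute_difference))  # 7.0
print(average_linkage(cluster_1, cluster_2, absolute_difference))   # 4.5
```

The same pair of clusters is therefore 2.0, 7.0 or 4.5 apart depending on the rule, which is why the choice of linkage can change the order in which clusters are merged.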


Visualization and interpretation of hierarchical clustering results

  Dendrogram
    Output of a hierarchical clustering
    Tree structure with the genes or samples as the leaves
    The height of a join indicates the distance between its left branch and its right branch

  Heat map
    Graphical representation of data where the values are represented as colors


Partitioning

  General idea
    Select the number of groups, g
    Randomly divide the objects into g groups
    Iteratively rearrange the objects until a stop condition is met

  Representative methods
    K-means
    Self-Organizing Map (SOM)


K-means

  1. Define k, the number of clusters
  2. Randomly initialize a seed vector for each cluster
  3. Go through all objects and assign each object to the cluster to which it is most similar
  4. Recalculate each seed vector as the mean of the patterns in its cluster
  5. Repeat steps 3 and 4 until a stop condition is met (e.g. all objects are assigned to the same partition twice in a row)
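The steps above can be sketched in plain Python. This is an illustrative version under stated assumptions rather than a reference implementation: similarity is measured with Euclidean distance, the initial seed vectors are k randomly chosen objects, a cluster that becomes empty keeps its previous seed, and `max_iterations` is a safety limit added for the example. The data points and all function names are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(objects, k, rng, max_iterations=100):
    # Randomly pick k distinct objects as the initial seed vectors (assumption).
    seeds = [list(v) for v in rng.sample(objects, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assign each object to the cluster with the closest seed.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(obj, seeds[c])) for obj in objects
        ]
        # Stop condition: the same partition twice in a row.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recalculate each seed as the mean of the objects in its cluster.
        for c in range(k):
            members = [obj for obj, a in zip(objects, assignment) if a == c]
            if members:  # an empty cluster keeps its old seed
                seeds[c] = mean_vector(members)
    return assignment, seeds


# Hypothetical data: two well-separated groups of two-sample profiles.
objects = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignment, seeds = k_means(objects, k=2, rng=random.Random(0))
print(assignment)
print(seeds)
```

Cluster labels are arbitrary, so only the grouping is meaningful. On less cleanly separated data the result can depend on the random initialization, so running the procedure from several different seeds is a common precaution.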


K-means

[Figure: K-means with two seed vectors (seed vector 1, seed vector 2). Panels in order: randomly initialize seeds; objects join the closest seed; recalculate seeds; reassign objects; recalculate seeds; reassign objects; seeds become stable (final clusters)]


Cool animations

  Hierarchical clustering
    http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html

  K-means
    http://animation.yihui.name/mvstat:k-means_cluster_algorithm


Resources

  Data source
    Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
    ArrayExpress: http://www.ebi.ac.uk/arrayexpress/

  Microarray data analysis tools
    Bioconductor: http://www.bioconductor.org/
    Expression Profiler: http://www.ebi.ac.uk/expressionprofiler/


Summary

  Agglomerative clustering
    Bottom-up
    Between-object distance measurement
      Euclidean distance
      Pearson’s correlation coefficient
      Spearman’s correlation coefficient
    Between-cluster distance measurement
      Single linkage
      Complete linkage
      Average linkage
    Visualization
      Dendrogram
      Heat map

  k-means clustering
    Partitioning


Exercise

  Data set: evan_deneris_2010_5ht_top500diff.txt
    500 selected probe sets
    Four groups (Rostral_5ht, Rostral_non5ht, Caudal_5ht, Caudal_non5ht)
    No missing values; already normalized; already log-transformed

  Use hierarchical clustering in Expression Profiler (http://www.ebi.ac.uk/expressionprofiler) to generate a heat map

