8. GENE EXPRESSION ANALYSIS MICROARRAYS - 2
BIOINFORMATICS COURSE MTAT.03.239
30.10.2012
GENE EXPRESSION ANALYSIS MICROARRAYS
Slides adapted from Konstantin Tretyakov’s 2011/2012 and Kaur Alasoo’s 2012/2013 year slides
3 “Gene expression analysis - microarrays" Bioinformatics Course
FLOW OF GENETIC INFORMATION
http://www.nature.com/scitable/topicpage/gene-expression-14121669
GENE EXPRESSION is the presence of the gene’s product in the cell in the form of a protein or mRNA
FLOW OF GENETIC INFORMATION
gene expression http://www.nature.com/scitable/topicpage/gene-expression-14121669
QUESTIONS FOR GENE EXPRESSION
• How does gene expression differ between cell types?
• How does gene expression differ between normal and diseased (e.g. cancer) cells?
• How does gene expression change during an organism's life span?
• How is gene expression regulated: which genes regulate which, and how?
• How does gene expression change when a cell is treated with a drug?
http://www.cs.helsinki.fi/bioinformatiikka/mbi/courses/06-07/pcmda/slides/Microarrays_Brazma_lecture1.pdf
COMPUTATIONAL TASKS
Differential expression: which genes have different expression levels across two groups?
Clustering: which genes seem to be regulated together? Which treatments/individuals have similar expression profiles?
Classification: to which functional class does a given gene belong? To which class does a given sample belong (e.g. determine the cancer type)?
Visualization: how to show these visually?
http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/04-expressionanalysis.pdf
MANY WAYS OF LOOKING AT DATA
There is no single right way to look at data; use your imagination and invent new approaches.
EXAMPLE DATA
EXAMPLE DATASET
> library(ArrayExpress)
> library(affy)
# Download the experiment files
> affydata = ArrayExpress("E-GEOD-31215")
# Normalize the data
> normdata = rma( affydata )
> expdata = exprs( normdata )
# Set CEL file groups (the same order as in the expression matrix)
> k <- c( "ewsfli1", "empty", "ewsfli1", "empty", "ewsfli1", "empty", "ewsfli1", "empty" )
PREPROCESSING (SINGLE CHANNEL)
Background correction: PM/MM probes, against GC content
Normalization: key assumption is that most probes are not differentially expressed; the distribution of intensities is approximately equal across arrays.
Summarization: from probes to probesets (approximately, genes)
http://www.bioconductor.org/help/course-materials/2010/SeattleJan10/day2/PreProcessing.pdf
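The normalization assumption (intensity distributions approximately equal across arrays) is what quantile normalization enforces directly. The following is a minimal base-R sketch on simulated data, not the code that rma() runs internally; `quantile_normalize` and the `toy` matrix are hypothetical names used only for illustration.

```r
# Toy quantile normalization (illustrative helper, base R only)
quantile_normalize = function(x) {
  ranks  = apply(x, 2, rank, ties.method = "first")  # rank of each value within its array
  sorted = apply(x, 2, sort)                         # each array sorted in ascending order
  ref    = rowMeans(sorted)                          # reference distribution: mean of each quantile
  out    = apply(ranks, 2, function(r) ref[r])       # replace every value by the reference quantile
  dimnames(out) = dimnames(x)
  out
}

set.seed(1)
# three simulated "arrays" with different location and spread
toy = cbind(a = rnorm(100, 7, 1), b = rnorm(100, 8, 2), c = rnorm(100, 6, 0.5))
toy.qn = quantile_normalize(toy)

# after normalization every column has the same set of values
summary(toy.qn)
```

Within each array the ordering of the probes is preserved; only the scale is forced to the shared reference distribution.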
DIFFERENTIAL EXPRESSION
To understand the effect of a drug, we might want to know which genes are up-regulated (increased in expression) or down-regulated (decreased in expression) between the treatment and control groups.
Find genes with different expression between conditions
DIFFERENTIAL EXPRESSION METHODS
• Use a t-test or its derivatives
• Limma R package
> library(limma)
• RankProd R package
> library(RankProd)
• Fold change
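The two simplest methods in the list, fold change and a per-gene t-test, can be sketched in base R. The data below are simulated (the `toy` matrix, the group labels and the spiked-in genes are assumptions made for illustration), so the numbers say nothing about the real dataset.

```r
set.seed(42)
# 200 simulated genes on 8 arrays, alternating treated / control
groups = rep(c("treated", "control"), times = 4)
toy = matrix(rnorm(200 * 8, mean = 8), nrow = 200,
             dimnames = list(paste0("gene", 1:200), NULL))
# spike in a real effect for the first 10 genes
toy[1:10, groups == "treated"] = toy[1:10, groups == "treated"] + 3

# log fold change: data are on a log scale, so it is a difference of group means
log.fc = rowMeans(toy[, groups == "treated"]) - rowMeans(toy[, groups == "control"])

# Welch two-sample t-test for every gene
p.val = apply(toy, 1, function(g)
  t.test(g[groups == "treated"], g[groups == "control"])$p.value)

head(sort(p.val))
```

With only four arrays per group the per-gene variance estimates are noisy, which is the usual motivation for moderated statistics such as those in limma.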
DIFFERENTIAL EXPRESSION METHODS
> library(limma)
# Design matrix with one column per group (factor levels are in alphabetical order)
> mm = model.matrix(~ as.factor( k ) - 1)
> colnames(mm) = c( "empty", "ewsfli1" )
> fit = lmFit(expdata, mm)
# Contrast of interest: ewsfli1 versus empty
> contr = makeContrasts( ewsfli1 - empty, levels = colnames(mm) )
> fit = contrasts.fit(fit, contr)
> fit = eBayes(fit)
> dT = decideTests(fit, adjust.method="fdr", p.value=0.05)
> tT = topTable( fit, number = Inf )
# If topTable() returns the probe IDs as row names instead of an "ID" column, use rownames( tT ) here
> up = tT[ tT$logFC > 0 & tT$adj.P.Val <= 0.05, "ID" ]
> down = tT[ tT$logFC < 0 & tT$adj.P.Val <= 0.05, "ID" ]
> table( dT )
dT
   -1     0     1
   17 54495   163
CO-EXPRESSION
Find similarly behaving genes using correlation or distance metrics
• use dist() for distance measures in R
• use cor() for correlation measures in R
Unsupervised data exploration – clustering
• use hclust() for hierarchical clustering in R
• use kmeans() for k-means clustering in R
HOW TO SEE MANY DIMENSIONS
> rsamp = sample( 1:nrow( expdata ), 25 )
> expdata.sample = expdata[rsamp,]
> image( t( expdata.sample ) )
HOW TO SEE MANY DIMENSIONS
> rsamp = sample( 1:nrow( expdata ), 100 )
> expdata.sample = expdata[rsamp,]
> image( t( expdata.sample ) )
> heatmap( expdata.sample )
HOW TO SEE MANY DIMENSIONS
> plot( expdata.sample[1,], type = "l", xlab = "Experiments", ylab = "intensity", ylim = c( 2, 11 ) )
> for( i in 2:50 ) lines( expdata.sample[i, ], col=i )
PROJECT INTO 2-DIMENSIONS
> plot( expdata.sample[,1], expdata.sample[,2], xlab = colnames( expdata.sample)[1], ylab = colnames( expdata.sample)[2] )
PRINCIPAL COMPONENT ANALYSIS (PCA)
# PCA with the genes (probesets) as observations
> pc = prcomp( expdata, retx = TRUE )
> plot( pc )
> plot( pc$x[,c(1,2)] )
PRINCIPAL COMPONENT ANALYSIS (PCA)
# PCA with the samples (arrays) as observations
> pc = prcomp( t(expdata), retx = TRUE )
> plot( pc )
> plot( pc$x[,c(1,2)] )
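To read a PCA of samples it helps to check how much variance each component explains and to colour the points by group. A small sketch on simulated data follows; the `toy` matrix and the spiked-in group effect are assumptions for illustration, and the group labels merely mirror the alternating layout of `k`.

```r
set.seed(7)
groups = rep(c("ewsfli1", "empty"), times = 4)
# 500 simulated genes on 8 arrays; 100 genes respond to the treatment
toy = matrix(rnorm(500 * 8, mean = 8), nrow = 500)
toy[1:100, groups == "ewsfli1"] = toy[1:100, groups == "ewsfli1"] + 3

# samples as observations, so the matrix is transposed
pc = prcomp(t(toy), retx = TRUE)

# proportion of variance explained by each principal component
var.explained = pc$sdev^2 / sum(pc$sdev^2)
round(var.explained, 2)

# first two components, coloured by group
plot(pc$x[, 1:2], col = ifelse(groups == "ewsfli1", "red", "blue"), pch = 19)
```

If the treatment is the dominant source of variation, the two groups separate along the first component; if they separate along a later one, something else (batch, array quality) may dominate.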
PRINCIPAL COMPONENT ANALYSIS (PCA) [5372 chips] http://www.nature.com/nbt/journal/v28/n4/fig_tab/nbt0410-322_F1.html
KEGGANIM http://biit.cs.ut.ee/kegganim/
DISTANCE BETWEEN GENES
Euclidean distance
> dist( expdata.sample[c(1:2),], method = "euclidean" )
Correlation distance
# x and y are the expression profiles of two genes
> x = expdata.sample[1,]
> y = expdata.sample[2,]
# computed by hand
> covariance = sum( ( x - mean( x ) ) * ( y - mean( y ) ) ) / ( length( x ) - 1 )
> pearson = covariance / ( sd( x ) * sd( y ) )
> 1 - pearson
# the same with built-in functions
> covariance = cov( x, y )
> pearson = cor.test( x, y )
> 1 - pearson$estimate
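The pairwise calculation extends to a whole distance matrix: cor() on the transposed matrix correlates every pair of genes, and as.dist() turns 1 - correlation into an object that clustering functions accept. A sketch on simulated profiles (the `toy` matrix and its constructed rows are assumptions for illustration):

```r
set.seed(3)
toy = matrix(rnorm(6 * 8), nrow = 6,
             dimnames = list(paste0("gene", 1:6), NULL))
toy["gene2", ] = toy["gene1", ] * 2 + 5   # same shape as gene1, different scale and offset
toy["gene3", ] = -toy["gene1", ]          # mirror image of gene1

# correlation distance between all pairs of genes (rows)
cor.dist = as.dist(1 - cor(t(toy)))
# Euclidean distance for comparison
euc.dist = dist(toy, method = "euclidean")

round(as.matrix(cor.dist)[1:3, 1:3], 2)
round(as.matrix(euc.dist)[1:3, 1:3], 2)
```

Correlation distance ignores scale and offset, so gene1 and gene2 are at distance 0 even though their Euclidean distance is large; an anti-correlated pair sits at the maximum distance of 2.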
CORRELATION
CLUSTERING is grouping genes so that similar genes are in the same group and genes different from each other are in separate groups
HIERARCHICAL CLUSTERING
> c = hclust( dist( expdata.sample, "euclidean" ), "complete" )
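A hierarchical clustering is usually inspected as a dendrogram and then cut into groups. A minimal sketch with plot() and cutree() on simulated genes that follow two opposite trends (the `toy` data and the choice of k = 2 are assumptions for illustration):

```r
set.seed(11)
base = seq(-2, 2, length.out = 8)
# 20 genes rising across the experiments, 20 genes falling, plus noise
rising  = t(replicate(20,  base + rnorm(8, sd = 0.3)))
falling = t(replicate(20, -base + rnorm(8, sd = 0.3)))
toy = rbind(rising, falling)
rownames(toy) = paste0("gene", 1:40)

hc = hclust(dist(toy, method = "euclidean"), method = "complete")
plot(hc)                      # dendrogram

clusters = cutree(hc, k = 2)  # cut the tree into two groups
table(clusters)
```

cutree() can also cut at a given height (argument h) instead of a given number of groups.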
K-MEANS CLUSTERING
> pc = prcomp( expdata , retx = TRUE ) # PCA
> c = kmeans( expdata, 20) # kmeans clustering
> plot( pc$x[,c(1,2)], col = c$cluster )
K-MEANS CLUSTERING
> c$size
[1] 1288 5279 807 330 6332 1742 2338 600 2484 2556 3587 7102 357 173 6387 2592 375 2754 3114 4478
K-MEANS CLUSTERING
> par( mfrow=c(3,4))
> for( i in 9:20 ) plot( c$centers[i,], type = "l", col = i )
K-MEANS CLUSTERING
> par( mfrow=c(3,4))
> for( i in 9:20 ) plot( c$centers[i,], type = "l", col = i )
> par( mfrow=c(1,1))
# Plot the mean-centred profiles of all genes in cluster 14
> my.expdata = expdata[ which( c$cluster == 14 ), ]
> my.expdata = my.expdata - rowMeans( my.expdata )
> plot( my.expdata[1,], type = "l", ylim = range( my.expdata ) )
> for( i in 2:nrow( my.expdata ) ) lines( my.expdata[i, ], col=i )
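kmeans() starts from random centres, so cluster numbers and sizes change from run to run. A sketch of a reproducible run on simulated profiles, using set.seed() and several random starts; the `patterns`, the `toy` matrix and the choice of three clusters are assumptions for illustration.

```r
set.seed(2012)
# three expression patterns across 8 experiments
patterns = rbind(up   = seq(-2, 2, length.out = 8),
                 down = seq(2, -2, length.out = 8),
                 peak = c(-1, 0, 1, 2, 2, 1, 0, -1))
# 30 noisy genes per pattern
toy = patterns[rep(1:3, each = 30), ] + matrix(rnorm(90 * 8, sd = 0.3), nrow = 90)

# nstart = 25 keeps the best of 25 random initialisations
km = kmeans(toy, centers = 3, nstart = 25)
km$size

# cluster centres as expression profiles
matplot(t(km$centers), type = "l", lty = 1,
        xlab = "Experiments", ylab = "centred intensity")
```

On real data the number of clusters is not known in advance; comparing km$tot.withinss across several values of k is one common heuristic.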
CLUSTERING
http://www.bioconductor.org/packages/release/BiocViews.html#___Clustering
MEM http://biit.cs.ut.ee/mem/index.cgi
FUNCTIONAL ANALYSIS maps genes or genomic regions to biological annotations such as ontology categories, pathways and disease states, giving insight into gene function in biological processes and physiological states
GENE ONTOLOGY
Tries to unify the representation of gene and gene product attributes across all species
Aims to:
• Maintain and develop its controlled vocabulary of gene and gene product attributes
• Annotate genes and gene products, and assimilate and disseminate annotation data
• Provide tools for easy access to all aspects of the data
http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:F5GZI2&session_id=3393amigo1347355013
GENE ONTOLOGY
GENE ONTOLOGY
> up
"205440_s_at" "1554384_at" "203936_s_at" "1554385_a_at" "235874_at" "217561_at" "223644_s_at"
"204818_at" "202746_at" "205306_x_at" "209616_s_at" "211138_s_at" "203929_s_at" "208072_s_at"
"205826_at" "222919_at" "230650_at" "223809_at" "206645_s_at" "235944_at" "220948_s_at"
"229850_at" "241436_at" "206401_s_at" "219837_s_at" "228863_at" "202419_at" "205535_s_at"
"201648_at" "208433_s_at" "228715_at" "204201_s_at" "219427_at" "1569256_a_at" "209685_s_at"
"213201_s_at" "240950_s_at" "238417_at" "202747_s_at" "1558279_a_at" "206190_at" "205656_at"
"204087_s_at" "1552508_at" "209791_at" "207957_s_at" "206326_at" "210941_at" "227289_at"
"227115_at" "205307_s_at" "243856_at" "204916_at" "206191_at" "219551_at" "239297_at"
"204229_at" "217783_s_at" "204364_s_at" "218976_at" "228224_at" "216080_s_at" "229139_at"
"228737_at" "205534_at" "227610_at" "213933_at" "229485_x_at" "210964_s_at" "235079_at"
"230593_at" "202709_at" "207178_s_at" "224963_at" "209541_at" "202023_at" "204223_at"
"214455_at" "202421_at" "242817_at" "224959_at" "215695_s_at" "225379_at" "235924_at"
"218182_s_at" "205818_at" "219908_at" "229040_at" "227875_at" "217495_x_at" "205227_at"
"39966_at" "225564_at" "219806_s_at" "225864_at" "45288_at" "227405_s_at" "206595_at"
"224178_s_at" "204365_s_at" "222379_at" "229383_at" "226865_at" "209652_s_at" "1553878_at"
"209757_s_at" "205932_s_at" "205899_at" "220108_at" "204105_s_at" "213368_x_at" "225619_at"
"201976_s_at" "206002_at" "228262_at" "205097_at" "228214_at" "227498_at" "37986_at"
"229242_at" "227750_at" "203928_x_at" "206915_at" "230839_at" "221011_s_at" "238455_at"
"57588_at" "227933_at" "201562_s_at" "212397_at" "214807_at" "221552_at" "232136_s_at"
"210904_s_at" "228640_at" "228981_at" "205637_s_at" "202637_s_at" "204140_at" "236193_at"
"228955_at" "218162_at" "239537_at" "218831_s_at" "213353_at" "223366_at" "215043_s_at"
"201418_s_at" "219343_at" "219892_at" "205051_s_at" "227497_at" "227995_at" "213644_at"
"221530_s_at" "226106_at" "229041_s_at" "227647_at" "227536_at" "220094_s_at" "222760_at"
"229580_at" "231887_s_at"
> down
"210119_at" "226722_at" "219308_s_at" "227565_at" "217997_at" "201810_s_at" "226873_at"
"207604_s_at" "201983_s_at" "212298_at" "202795_x_at" "1556308_at" "220260_at" "212642_s_at"
"217996_at" "1555216_a_at" "210296_s_at"
http://biit.cs.ut.ee/gprofiler
UP-REGULATED GENES
DOWN-REGULATED GENES
MEASURE OF “INTERESTINGNESS”
P[a randomly chosen cluster will have at least as many group representatives as our cluster]
HYPERGEOMETRIC DISTRIBUTION is a discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a finite population of size N that contains m successes
HYPERGEOMETRIC DISTRIBUTION
GO id is “GO:0045596”
N - total number of genes – 14611
m - number of genes in GO – 433
n - number of up-regulated genes – 88
k - number of up-regulated genes in GO – 10
> 1 - phyper( 10, 433, 14611 - 433, 88 ) # P(X > 10): 11 or more
[1] 5.600853e-05
> dhyper( 10, 433, 14611 - 433, 88 ) # P(X = 10): exactly 10
[1] 0.0002137613
> 1 - phyper( 9, 433, 14611 - 433, 88 ) # P(X >= 10): "at least as many", the sum of the two values above
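The probability mass function behind dhyper() is P(X = k) = C(m, k) * C(N - m, n - k) / C(N, n), where C(a, b) is the binomial coefficient. The sketch below checks this against the built-in functions for the GO:0045596 numbers from the slide; here m is the size of the GO category and n the number of up-regulated genes (the draws), and the variable names are chosen only for illustration.

```r
N = 14611   # total number of genes
m = 433     # genes annotated to the GO category
n = 88      # up-regulated genes (the draws)
k = 10      # up-regulated genes that are in the category

# P(X = k) from the formula, on the log scale to avoid huge binomial coefficients
p.exact.manual = exp(lchoose(m, k) + lchoose(N - m, n - k) - lchoose(N, n))
p.exact = dhyper(k, m, N - m, n)

# P(X >= k): "at least as many" needs k - 1 in the upper tail
p.at.least = phyper(k - 1, m, N - m, n, lower.tail = FALSE)
# P(X > k): what 1 - phyper(k, ...) computes
p.more = phyper(k, m, N - m, n, lower.tail = FALSE)

c(exact = p.exact, at.least = p.at.least, more = p.more)
```

The enrichment p-value is the "at least" tail, so it equals the sum of the other two numbers.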
ANNOTATING CLUSTERS
For each functional category
• Count how many genes of the category are in the cluster
• Count how many genes are in the category in total
• Estimate the probability of getting the same result by chance (p-value)
Keep those categories whose p-value is smaller than 0.05
DO NOT FORGET MULTIPLE TESTING CORRECTION!
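The per-category procedure can be sketched as a loop over a list of gene sets. Everything below is simulated: the gene `universe`, the three `categories` and the `cluster` are hypothetical and only illustrate the counting and the hypergeometric tail; tools such as g:Profiler apply their own methods and annotation data.

```r
set.seed(5)
universe = paste0("gene", 1:1000)

# hypothetical functional categories (lists of gene IDs)
categories = list(catA = universe[1:40],
                  catB = sample(universe, 60),
                  catC = sample(universe, 25))

# a cluster of 50 genes, deliberately enriched for catA
cluster = c(universe[1:15], sample(universe[41:1000], 35))

enrich.p = sapply(categories, function(cat.genes) {
  k = length(intersect(cluster, cat.genes))       # category genes found in the cluster
  phyper(k - 1,                                   # P(X >= k)
         length(cat.genes),                       # category size
         length(universe) - length(cat.genes),    # genes outside the category
         length(cluster),                         # cluster size = number of draws
         lower.tail = FALSE)
})

enrich.p
p.adjust(enrich.p, method = "BH")   # corrected for the number of categories tested
```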
MULTIPLE TESTING PROBLEM
The problem of multiplicity arises because, as we increase the number of hypotheses tested, we also increase the likelihood of witnessing a rare event, and therefore the chance of rejecting a null hypothesis when it is true.
With a p-value cut-off of 0.05 we expect to assign the cluster to 5 categories out of 100 by random chance.
With a p-value cut-off of 0.05 we expect to assign the cluster to 5000 categories out of 100 000 by random chance.
> dT = decideTests(fit, adjust.method="none", p.value=0.05)
> table( dT )
dT
   -1     0     1
 2102 50459  2114
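The expected number of chance findings can be checked by simulation. When the null hypothesis is true, the p-value of a continuous test is uniformly distributed on [0, 1], so the sketch below simply draws uniform p-values; the number of tests is the 100 000 from the slide, and nothing here uses the real data.

```r
set.seed(100)
n.tests = 100000

# p-values of 100 000 tests in which the null hypothesis is always true
null.p = runif(n.tests)

sum(null.p <= 0.05)    # number of false positives at the 0.05 cut-off
mean(null.p <= 0.05)   # fraction of false positives, close to 0.05
```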
BONFERRONI CORRECTION
The cut-off p-value that determines significant assignments is divided by the number of tests.
For 100 000 GO categories, only consider categories with p-value <= 0.05 / 100 000
Keep those categories whose p-value <= 5e-07
> pvalues = tT[, "P.Value"]
> p.adjust( pvalues, method="bonferroni", n = length( pvalues ) )
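Dividing the cut-off by the number of tests and multiplying the p-values by the number of tests are two views of the same correction; p.adjust() takes the second view and caps the result at 1. A sketch on simulated p-values (the `pvalues` vector below is made up for illustration and is unrelated to `tT`):

```r
set.seed(8)
# 995 null p-values plus five small ones
pvalues = c(runif(995), 1e-6, 2e-6, 1e-5, 4e-5, 1e-4)
m = length(pvalues)

p.bonf = p.adjust(pvalues, method = "bonferroni")
manual = pmin(1, pvalues * m)     # the same adjustment by hand

sum(p.bonf <= 0.05)               # significant after adjusting the p-values
sum(pvalues <= 0.05 / m)          # significant after adjusting the cut-off
```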
FALSE DISCOVERY RATE
Designed to control the expected proportion of incorrectly rejected null hypotheses (“false discoveries”).
To control FDR at level δ
• Order the unadjusted p-values: p1 ≤ p2 ≤ … ≤ pm
• Find the test with the highest rank, j, for which the p-value, pj, is less than or equal to (j/m) * δ
• Declare the tests of rank 1, 2, …, j significant; do not reject the others
> pvalues = tT[, "P.Value"]
> p.adjust( pvalues, method="BH", n = length( pvalues ) )
FALSE DISCOVERY RATE
> pvalues = tT[ , "P.Value"] # a vector of p-values
> pvalues = sort( pvalues ) # sort the p-values in ascending order
> ranks = seq_along( pvalues ) # rank j of each sorted p-value
> threshold = ( ranks / length( pvalues) ) * 0.05 # BH threshold (j/m) * 0.05 for each rank
> table( pvalues <= threshold )
FALSE  TRUE
54495   180
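Strictly, the Benjamini-Hochberg procedure rejects every test up to the highest rank j that passes its threshold, including any lower-ranked test that misses its own threshold, so counting TRUE values can in principle give a smaller number than the procedure. A sketch on simulated p-values that finds j explicitly and compares it with p.adjust(); the `pvalues` vector is made up for illustration.

```r
set.seed(9)
# 900 null p-values and 100 from tests with a real effect
pvalues = c(runif(900), rbeta(100, 1, 1000))
delta = 0.05
m = length(pvalues)

p.sorted = sort(pvalues)
below = which(p.sorted <= (seq_len(m) / m) * delta)   # ranks that pass their threshold
j = if (length(below) > 0) max(below) else 0          # highest such rank

# number of rejections according to the built-in BH adjustment
n.builtin = sum(p.adjust(pvalues, method = "BH") <= delta)

c(manual = j, builtin = n.builtin)
```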