8. GENE EXPRESSION ANALYSIS MICROARRAYS - 2 · 2013-10-30

8. GENE EXPRESSION ANALYSIS MICROARRAYS - 2

BIOINFORMATICS COURSE MTAT.03.239

30.10.2012

GENE EXPRESSION ANALYSIS MICROARRAYS

Slides adapted from Konstantin Tretyakov’s 2011/2012 and Kaur Alasoo’s 2012/2013 slides

3 “Gene expression analysis - microarrays" Bioinformatics Course

FLOW OF GENETIC INFORMATION

http://www.nature.com/scitable/topicpage/gene-expression-14121669


GENE EXPRESSION is the presence of the gene’s product in the cell in the form of a protein or mRNA




QUESTIONS FOR GENE EXPRESSION


• How does gene expression differ between cell types?

• How does gene expression differ between normal and diseased (e.g. cancer) cells?

• How does gene expression change over an organism’s life span?

• How is gene expression regulated – which genes regulate which, and how?

• How does gene expression change when a cell is treated with a drug?

http://www.cs.helsinki.fi/bioinformatiikka/mbi/courses/06-07/pcmda/slides/Microarrays_Brazma_lecture1.pdf


COMPUTATIONAL TASKS


Differential expression: which genes have different expression levels across two groups?

Clustering: which genes seem to be regulated together? Which treatments/individuals have similar expression profiles?

Classification: to which functional class does a given gene belong? To which class does a given sample belong (e.g. determine the cancer type)?

Visualization: how to show these visually?

http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/04-expressionanalysis.pdf


MANY WAYS OF LOOKING AT DATA

There is no single right way to look at data; use your imagination and invent new approaches.


EXAMPLE DATA


EXAMPLE DATASET


> library(ArrayExpress)

> library(affy)

# Download the experiment files

> affydata = ArrayExpress("E-GEOD-31215")

# Normalize the data

> normdata = rma( affydata )

> expdata = exprs( normdata )

# Set CEL file groups (the same order as in the expression matrix)

> k <- c( "ewsfli1", "empty", "ewsfli1", "empty", "ewsfli1", "empty", "ewsfli1", "empty" )


PREPROCESSING (SINGLE CHANNEL)


Background correction: PM/MM probes, against GC content

Normalization: the key assumption is that most probes are not differentially expressed; the distribution of intensities is approximately equal across arrays.

Summarization: from probes to probesets (approximately, genes)
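The normalization assumption above (intensity distributions are approximately equal across arrays) is what quantile normalization enforces, and rma() performs such a step internally. The following is a minimal base-R sketch on a made-up matrix; the function name quantile_normalize and the toy values are assumptions for illustration, not the code used inside rma().

```r
# Toy intensity matrix: 5 probes (rows) x 3 arrays (columns); values are made up
x <- matrix(c(5, 2, 3, 4, 1,
              4, 1, 4.5, 2, 3,
              3, 4, 6, 8, 5), nrow = 5)

# Hypothetical helper: force every array to share the same intensity distribution
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank of each value within its array
  sorted <- apply(m, 2, sort)                          # each array sorted in ascending order
  target <- rowMeans(sorted)                           # mean of the i-th smallest values
  apply(ranks, 2, function(r) target[r])               # replace each value by the target for its rank
}

xn <- quantile_normalize(x)
apply(xn, 2, sort)   # after normalization, every column holds the same set of values
```

After this step the arrays differ only in which probe carries which value, which is the sense in which the distributions are made equal.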

http://www.bioconductor.org/help/course-materials/2010/SeattleJan10/day2/PreProcessing.pdf


DIFFERENTIAL EXPRESSION


To understand the effect of a drug, we might want to know which genes are up-regulated (increased in expression) or down-regulated (decreased in expression) in the treatment group compared with the control group.

Find genes with different expression between conditions


DIFFERENTIAL EXPRESSION METHODS


• use a t-test or its derivatives

• Limma R package

> library(limma)

• RankProd R package

> library(RankProd)

• Fold change
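The two simplest methods in the list, the t-test and the fold change, can be sketched in base R. The matrix toy, the group labels and the spiked gene below are simulated assumptions, not the E-GEOD-31215 data, and the values are assumed to be on a log2 scale.

```r
set.seed(1)
groups <- rep(c("treated", "control"), times = 4)   # hypothetical labels, one per array
toy <- matrix(rnorm(20 * 8, mean = 8), nrow = 20,
              dimnames = list(paste0("gene", 1:20), NULL))
# Spike one gene so that it is truly up-regulated in the treated arrays
toy[1, groups == "treated"] <- toy[1, groups == "treated"] + 5

is_treated <- groups == "treated"

# Log fold change: difference of group means on the log2 scale
logfc <- rowMeans(toy[, is_treated]) - rowMeans(toy[, !is_treated])

# Welch t-test for each gene (row)
pvals <- apply(toy, 1, function(x) t.test(x[is_treated], x[!is_treated])$p.value)

head(sort(pvals), 3)   # genes with the smallest p-values
```

With only four arrays per group the per-gene variance estimates are noisy, which is one reason moderated statistics such as those in limma are usually preferred.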


DIFFERENTIAL EXPRESSION METHODS


> library(limma)

> mm = model.matrix(~ as.factor( k ) - 1)

> colnames(mm) = c( "empty", "ewsfli1" )

> fit = lmFit(expdata, mm)

> contr = makeContrasts( ewsfli1 - empty, levels = colnames(mm) )

> fit = contrasts.fit(fit,contr)

> fit = eBayes(fit)

> dT = decideTests(fit, adjust.method="fdr", p.value=0.05)

> tT = topTable( fit, number = 10000000 )

> up = tT[ tT$logFC > 0 & tT$adj.P.Val <= 0.05, "ID" ]

> down = tT[ tT$logFC < 0 & tT$adj.P.Val <= 0.05, "ID" ]

> table( dT )

dT

-1 0 1

17 54495 163


CO-EXPRESSION


Find similarly behaving genes using correlation or distance metrics

• use dist() for distance measures in R

• use cor() for correlation measures in R

Unsupervised data exploration – clustering

• use hclust() for hierarchical clustering in R

• use kmeans() for k-means clustering in R


HOW TO SEE MANY DIMENSIONS

> rsamp = sample( 1:nrow( expdata ), 25 )

> expdata.sample = expdata[rsamp,]

> image( t( expdata.sample ) )



> rsamp = sample( 1:nrow( expdata ), 100 )

> expdata.sample = expdata[rsamp,]

> image( t( expdata.sample ) )

> heatmap( expdata.sample )



> plot( expdata.sample[1,], type = "l", xlab = "Experiments", ylab = "intensity", ylim = c( 2, 11 ) )

> for( i in 2:50 ) lines( expdata.sample[i, ], col=i )


PROJECT INTO 2-DIMENSIONS

> plot( expdata.sample[,1], expdata.sample[,2], xlab = colnames( expdata.sample)[1], ylab = colnames( expdata.sample)[2] )



PRINCIPAL COMPONENT ANALYSIS (PCA)

# PCA with genes as observations (one point per probeset)

> pc = prcomp( expdata, retx = TRUE )

> plot( pc )

> plot( pc$x[,c(1,2)])



PRINCIPAL COMPONENT ANALYSIS (PCA)

# PCA with samples as observations (one point per array)

> pc = prcomp( t(expdata), retx = TRUE )

> plot( pc )

> plot( pc$x[,c(1,2)])
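plot( pc ) draws the variance of each component; the same information can be read from pc$sdev as a proportion of the total variance. The sketch below uses a simulated matrix (toy, with a shared signal added to the odd-numbered samples), which is an assumption for illustration and not the course dataset.

```r
set.seed(1)
toy <- matrix(rnorm(200 * 8), nrow = 200)      # 200 hypothetical genes x 8 samples
odd <- c(1, 3, 5, 7)
toy[, odd] <- toy[, odd] + rnorm(200, sd = 3)  # gene-specific shift shared by the odd samples

pc <- prcomp(t(toy), retx = TRUE)              # samples as observations

# Proportion of the total variance captured by each principal component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(var_explained, 3)

# PC1 scores: the odd and even samples should fall on opposite sides of zero
pc$x[, 1]
```

A large share of variance on the first component, together with a clean split of the samples along it, suggests that the dominant source of variation is the group difference.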



PRINCIPAL COMPONENT ANALYSIS (PCA) [5372 chips]

http://www.nature.com/nbt/journal/v28/n4/fig_tab/nbt0410-322_F1.html

KEGGANIM

http://biit.cs.ut.ee/kegganim/


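DISTANCE BETWEEN GENES

Euclidean distance

> dist( expdata.sample[c(1:2),], method = "euclidean" )

Correlation distance

# x and y are the expression profiles of two genes (assumed here: the first two sampled rows)

> x = expdata.sample[1,]

> y = expdata.sample[2,]

# Computed by hand

> covariance = ( sum( ( x - mean( x ) ) * ( y - mean( y ) ) ) ) / ( length( x ) - 1 )

> pearson = covariance/( sd( x ) * sd( y ) )

> 1 - pearson

# The same with built-in functions

> covariance = cov( x, y )

> pearson = cor.test( x, y )

> 1 - pearson$estimate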
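The pairwise calculation above extends to all gene pairs at once: cor() on the transposed matrix gives gene-by-gene correlations, and as.dist() turns 1 - correlation into a distance object. The matrix toy below is simulated for illustration only.

```r
set.seed(1)
toy <- matrix(rnorm(5 * 8), nrow = 5,
              dimnames = list(paste0("gene", 1:5), NULL))
toy[2, ] <- 2 * toy[1, ] + 1   # gene2 is perfectly correlated with gene1
toy[3, ] <- -toy[1, ]          # gene3 is perfectly anti-correlated with gene1

# cor() correlates columns, so transpose to correlate genes (rows)
cor_dist <- as.dist(1 - cor(t(toy)))
round(as.matrix(cor_dist), 2)
```

Correlation distance ranges from 0 (perfectly correlated profiles) to 2 (perfectly anti-correlated profiles), and unlike Euclidean distance it ignores differences in overall expression level and scale.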


CORRELATION



CLUSTERING is grouping genes so that similar genes are in the same group and genes different from each other are in separate groups


HIERARCHICAL CLUSTERING

> c = hclust( dist( expdata.sample, "euclidean" ), "complete" )
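hclust() returns a tree rather than group labels; cutree() cuts the tree into a chosen number of groups. The sketch below runs on simulated data with two obvious groups; the names toy, tree and groups are assumptions for illustration (tree is used instead of c to avoid masking the base function c()).

```r
set.seed(1)
low  <- matrix(rnorm(5 * 8, mean = 4,  sd = 0.3), nrow = 5)   # 5 lowly expressed genes
high <- matrix(rnorm(5 * 8, mean = 10, sd = 0.3), nrow = 5)   # 5 highly expressed genes
toy  <- rbind(low, high)
rownames(toy) <- paste0("gene", 1:10)

tree   <- hclust(dist(toy, method = "euclidean"), method = "complete")
groups <- cutree(tree, k = 2)   # cut the dendrogram into 2 clusters
groups

plot(tree)                      # dendrogram of the 10 genes
```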


K-MEANS CLUSTERING

> pc = prcomp( expdata , retx = TRUE ) # PCA

> c = kmeans( expdata, 20) # kmeans clustering

> plot( pc$x[,c(1,2)], col = c$cluster )
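kmeans() starts from random centers, so repeated runs can give different clusters. A minimal sketch of a reproducible run, assuming simulated data (toy) and the hypothetical name km: set.seed() fixes the random start and nstart keeps the best of several starts.

```r
set.seed(1)                      # make the random starting centers reproducible
low  <- matrix(rnorm(5 * 8, mean = 4,  sd = 0.3), nrow = 5)
high <- matrix(rnorm(5 * 8, mean = 10, sd = 0.3), nrow = 5)
toy  <- rbind(low, high)

# nstart = 10: run k-means from 10 random starts and keep the best solution
km <- kmeans(toy, centers = 2, nstart = 10)

km$cluster        # cluster label of each gene
km$size           # number of genes per cluster
km$tot.withinss   # total within-cluster sum of squares (lower is tighter)
```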


K-MEANS CLUSTERING

> c$size

[1] 1288 5279 807 330 6332 1742 2338 600 2484 2556 3587 7102 357 173 6387 2592 375 2754 3114 4478


K-MEANS CLUSTERING

> par( mfrow=c(3,4))

> for( i in 9:20 ) plot( c$centers[i,], type = "l", col = i )


K-MEANS CLUSTERING




> my.expdata = expdata[ which( c$cluster == 14 ), ]

> for ( i in 1:nrow( my.expdata )) my.expdata[i,] = my.expdata[i,] - mean( my.expdata[i,] )

> plot( my.expdata[1,], type = "l" )

> for( i in 2:nrow( my.expdata ) ) lines( my.expdata[i, ], col=i )



CLUSTERING

http://www.bioconductor.org/packages/release/BiocViews.html#___Clustering

http://www.bioconductor.org/packages/release/BiocViews.html


MEM

http://biit.cs.ut.ee/mem/index.cgi









FUNCTIONAL ANALYSIS is mapping genes or genomic regions to biological annotations such as ontology categories, pathways and disease states, giving insight into the genes' functions in biological processes and physiological states


GENE ONTOLOGY

Tries to unify the representation of gene and gene product attributes across all species

Aims to:

• Maintain and develop its controlled vocabulary of gene and gene product attributes

• Annotate genes and gene products, and assimilate and disseminate annotation data

• Provide tools for easy access to all aspects of the data

40 "Data and Databases" Bioinformatics Course

http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:F5GZI2&session_id=3393amigo1347355013

GENE ONTOLOGY







GENE ONTOLOGY

> up

"205440_s_at" "1554384_at" "203936_s_at" "1554385_a_at" "235874_at" "217561_at" "223644_s_at"
"204818_at" "202746_at" "205306_x_at" "209616_s_at" "211138_s_at" "203929_s_at" "208072_s_at"
"205826_at" "222919_at" "230650_at" "223809_at" "206645_s_at" "235944_at" "220948_s_at"
"229850_at" "241436_at" "206401_s_at" "219837_s_at" "228863_at" "202419_at" "205535_s_at"
"201648_at" "208433_s_at" "228715_at" "204201_s_at" "219427_at" "1569256_a_at" "209685_s_at"
"213201_s_at" "240950_s_at" "238417_at" "202747_s_at" "1558279_a_at" "206190_at" "205656_at"
"204087_s_at" "1552508_at" "209791_at" "207957_s_at" "206326_at" "210941_at" "227289_at"
"227115_at" "205307_s_at" "243856_at" "204916_at" "206191_at" "219551_at" "239297_at"
"204229_at" "217783_s_at" "204364_s_at" "218976_at" "228224_at" "216080_s_at" "229139_at"
"228737_at" "205534_at" "227610_at" "213933_at" "229485_x_at" "210964_s_at" "235079_at"
"230593_at" "202709_at" "207178_s_at" "224963_at" "209541_at" "202023_at" "204223_at"
"214455_at" "202421_at" "242817_at" "224959_at" "215695_s_at" "225379_at" "235924_at"
"218182_s_at" "205818_at" "219908_at" "229040_at" "227875_at" "217495_x_at" "205227_at"
"39966_at" "225564_at" "219806_s_at" "225864_at" "45288_at" "227405_s_at" "206595_at"
"224178_s_at" "204365_s_at" "222379_at" "229383_at" "226865_at" "209652_s_at" "1553878_at"
"209757_s_at" "205932_s_at" "205899_at" "220108_at" "204105_s_at" "213368_x_at" "225619_at"
"201976_s_at" "206002_at" "228262_at" "205097_at" "228214_at" "227498_at" "37986_at"
"229242_at" "227750_at" "203928_x_at" "206915_at" "230839_at" "221011_s_at" "238455_at"
"57588_at" "227933_at" "201562_s_at" "212397_at" "214807_at" "221552_at" "232136_s_at"
"210904_s_at" "228640_at" "228981_at" "205637_s_at" "202637_s_at" "204140_at" "236193_at"
"228955_at" "218162_at" "239537_at" "218831_s_at" "213353_at" "223366_at" "215043_s_at"
"201418_s_at" "219343_at" "219892_at" "205051_s_at" "227497_at" "227995_at" "213644_at"
"221530_s_at" "226106_at" "229041_s_at" "227647_at" "227536_at" "220094_s_at" "222760_at"
"229580_at" "231887_s_at"

> down

"210119_at" "226722_at" "219308_s_at" "227565_at" "217997_at" "201810_s_at" "226873_at"
"207604_s_at" "201983_s_at" "212298_at" "202795_x_at" "1556308_at" "220260_at" "212642_s_at"
"217996_at" "1555216_a_at" "210296_s_at"


http://biit.cs.ut.ee/gprofiler










UP-REGULATED GENES




DOWN-REGULATED GENES



MEASURE OF “INTERESTINGNESS”

P[a randomly chosen cluster will have at least as many group representatives as our cluster]


HYPERGEOMETRIC DISTRIBUTION is a discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing m successes
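Written out with the symbols of the definition above (k successes, n draws, population size N, m successes in the population), the standard probability mass function and the upper tail used as an enrichment p-value are as follows. This is the textbook form of the distribution, added here as a reading aid.

```latex
P(X = k) = \frac{\binom{m}{k}\,\binom{N-m}{n-k}}{\binom{N}{n}},
\qquad
P(X \ge k) = \sum_{i=k}^{\min(n,\,m)} \frac{\binom{m}{i}\,\binom{N-m}{n-i}}{\binom{N}{n}}
```

The expression is unchanged if the roles of n and m are swapped, so it does not matter which of the two gene sets is treated as the draws.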


HYPERGEOMETRIC DISTRIBUTION

GO id is “GO:0045596”

• n - number of genes in GO – 433

• m - number of up-regulated genes – 88

• k - number of up-regulated genes in GO – 10

• N - total number of genes – 14611

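# phyper( q, m, n, k ) gives P[X <= q]; R's argument names differ from the slide: here m = 433 genes in the GO category, n = 14611 - 433 other genes, k = 88 up-regulated genes drawn

# Note: 1 - phyper( 10, ... ) is P[X > 10]; adding dhyper( 10, ... ) = P[X = 10] gives P[X >= 10] ("at least as many")

> 1 - phyper( 10, 433, 14611 - 433, 88 )

[1] 5.600853e-05

> dhyper( 10, 433, 14611 - 433, 88 )

[1] 0.0002137613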
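The measure of interestingness asks for "at least as many" representatives, that is P[X >= 10], whereas 1 - phyper( 10, ... ) gives P[X > 10]. A minimal sketch of the three quantities using the numbers from the slide; the variable names n_go, n_up and the p_* names are assumptions for illustration.

```r
N    <- 14611   # total number of genes
n_go <- 433     # genes annotated to the GO category
n_up <- 88      # up-regulated genes
k    <- 10      # up-regulated genes in the GO category

# P[X >= k]: shift q to k - 1 and ask for the upper tail
p_at_least  <- phyper(k - 1, n_go, N - n_go, n_up, lower.tail = FALSE)

# P[X > k]: what 1 - phyper(k, ...) computes
p_more_than <- 1 - phyper(k, n_go, N - n_go, n_up)

# P[X = k]
p_exactly   <- dhyper(k, n_go, N - n_go, n_up)

c(at_least = p_at_least, more_than = p_more_than, exactly = p_exactly)
```

The first value equals the sum of the other two, so the p-value for "at least 10" is larger than the one obtained from 1 - phyper( 10, ... ).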


ANNOTATING CLUSTERS

For each functional category

• Count how many genes of the category are in the cluster

• Count how many genes are in the category in total

• Estimate the probability of getting the same result by chance (p-value)

Assign those categories whose p-value is smaller than 0.05

DO NOT FORGET MULTIPLE TESTING CORRECTION!
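The annotation steps, including the multiple testing correction, can be sketched in a few lines of base R. The gene universe, the cluster and the three categories below are made up for illustration; real annotations would come from a resource such as Gene Ontology.

```r
universe <- paste0("g", 1:1000)          # hypothetical set of all measured genes
cluster  <- universe[1:50]               # hypothetical cluster of 50 genes
categories <- list(                      # hypothetical functional categories
  catA = universe[1:40],                 # almost entirely inside the cluster
  catB = universe[seq(10, 1000, by = 10)],
  catC = universe[501:600]               # no overlap with the cluster
)

pvals <- sapply(categories, function(members) {
  k <- length(intersect(cluster, members))   # genes of the category in the cluster
  # P[X >= k] under the hypergeometric distribution
  phyper(k - 1, length(members), length(universe) - length(members),
         length(cluster), lower.tail = FALSE)
})

padj     <- p.adjust(pvals, method = "BH")   # multiple testing correction
enriched <- names(padj)[padj <= 0.05]
enriched
```

Here the correction is applied across the categories tested for one cluster; when many clusters are annotated, the number of tests grows accordingly.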


MULTIPLE TESTING PROBLEM

The problem of multiplicity arises from the fact that as we increase the number of hypotheses tested, we also increase the likelihood of witnessing a rare event, and therefore the chance of rejecting a null hypothesis when it is true.

With a p-value cut-off of 0.05 we expect to assign a cluster to about 5 categories out of 100 by random chance.

With a p-value cut-off of 0.05 we expect to assign a cluster to about 5000 categories out of 100 000 by random chance.

> dT = decideTests(fit, adjust.method="none", p.value=0.05)

> table( dT )

dT

-1 0 1

2102 50459 2114
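The size of the problem can be illustrated by simulation: when every null hypothesis is true, p-values are uniformly distributed, so roughly 5% fall below 0.05 by chance alone. The number of tests and the seed below are arbitrary assumptions.

```r
set.seed(42)
m      <- 10000          # number of tests, all with a true null hypothesis
null_p <- runif(m)       # p-values are uniform on [0, 1] under the null

false_pos <- sum(null_p <= 0.05)
false_pos                # close to 0.05 * m = 500, all of them false positives
```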


BONFERRONI CORRECTION

The cut-off p-value that determines significant assignments is divided by the number of tests.

For 100 000 GO categories, only consider categories with p-value <= 0.05 / 100 000, i.e. keep those categories whose p-value <= 5e-07

> pvalues = tT[, "P.Value"]

# p.adjust() multiplies each p-value by n (capped at 1), which is equivalent to dividing the cut-off by n

> p.adjust( pvalues, method="bonferroni", n = length( pvalues ) )
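Dividing the cut-off by the number of tests and multiplying the p-values by the number of tests select the same results. A small check on made-up p-values, assuming 100 000 tests as in the example above:

```r
p <- c(1e-8, 2e-7, 4e-6, 0.003, 0.04, 0.5)   # made-up raw p-values
m <- 100000                                  # total number of tests performed

adj <- p.adjust(p, method = "bonferroni", n = m)   # pmin(1, p * m)

by_adjusted <- adj <= 0.05        # compare adjusted p-values with 0.05
by_cutoff   <- p <= 0.05 / m      # compare raw p-values with 5e-07

identical(by_adjusted, by_cutoff)
```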


FALSE DISCOVERY RATE

Designed to control the expected proportion of incorrectly rejected null hypotheses (“false discoveries”).

To control FDR at level δ

• Order the unadjusted p-values: p1 ≤ p2 ≤ … ≤ pm

• Find the test with the highest rank, j, for which the p-value, pj, is less than or equal to (j/m) * δ

• Declare the tests of rank 1, 2, …, j significant (reject their null hypotheses); the others are not significant

> pvalues = tT[, "P.Value"]

> p.adjust( pvalues, method="BH", n = length( pvalues ) )


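FALSE DISCOVERY RATE

> pvalues = tT[ , "P.Value"] # a vector of p-values

> pvalues = sort( pvalues ) # sort the p-values in ascending order

> ord = order( pvalues ) # ranks 1..m, since the p-values are already sorted

> padj = ( ord / length( pvalues) ) * 0.05 # BH threshold (j/m) * 0.05 for each rank j (a cut-off, not an adjusted p-value)

# Counts p-values at or below their own threshold; the full BH procedure declares every rank up to the largest such j significant

> table( pvalues <= padj )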

FALSE TRUE

54495 180
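The step-up rule (find the highest rank j that meets its threshold, then accept all ranks up to j) can differ from simply counting the p-values below their own thresholds. A minimal sketch on five made-up p-values, compared against p.adjust(); the names delta, thr, j and significant are assumptions for illustration.

```r
p     <- c(0.5, 0.028, 0.001, 0.2, 0.025)   # made-up p-values, unsorted
delta <- 0.05                               # FDR level
m     <- length(p)

o   <- order(p)                   # indices of p in ascending order
thr <- (seq_len(m) / m) * delta   # threshold (j/m) * delta for each rank j

below <- which(p[o] <= thr)                  # ranks that meet their own threshold
j     <- if (length(below) > 0) max(below) else 0

significant <- rep(FALSE, m)
significant[o[seq_len(j)]] <- TRUE           # ranks 1..j are declared significant

sum(p[o] <= thr)                                          # simple count
sum(significant)                                          # step-up result
identical(significant, p.adjust(p, method = "BH") <= delta)
```

In this example the p-value of rank 2 misses its own threshold but is still declared significant because rank 3 meets its threshold, so the simple count is smaller than the step-up result.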
