
Understanding gene function by measuring its expression

Kaur Alasoo

Many slides adapted from Simon Anders: https://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lect/L05-deseq2-anders.pdf

Course organisation

• Consultations on Mondays from 12:00-13:00 in room 225

• More discussions on Piazza!

• Snakemake - new submission deadline.

Topics for today

• What is gene expression?

• How do we measure it with RNA-seq?

• How can we explore the raw data (PCA, clustering)?

• How do we identify genes that are differentially expressed?

• Why do we need to care about multiple testing?

Macrophages eat other cells and pathogens

The macrophage marches on its phagosome: dynamic assays of phagosome function. Russell et al. (2009)

LPS (lipopolysaccharide)

LPS induces a strong response in macrophages

Adipose tissue: A type of connective tissue that is specialized for the storage of neutral lipids.

Transcription factor: A specialized nuclear protein that can bind to DNA and regulate gene expression. Most transcription factors have transactivation or repressor domains but, in addition, they can function as architectural proteins and promote chromatin remodelling by recruiting additional activator or repressor complexes.

Chromatin: Chromatin is composed of DNA together with histones and other associated proteins.

Transcriptional co-regulator: Transcriptional co-regulators lack DNA-binding specificity and must be recruited to their target genes through interactions with transcription factors or by binding to particular chromatin modifications. Co-regulators have an important role in modulating gene expression and in many cases couple transcription factors to downstream effector mechanisms for gene regulation.

regulators (such as N-ethylmaleimide-sensitive factor 2 for cellular stress induced by reactive oxygen species8, hypoxia-inducible factor 1α for the hypoxic response9, X-box-binding protein 1 for the unfolded protein response10 and aryl hydrocarbon receptor for the xenobiotic-induced response11). In each case, the genes that comprise a given transcriptional module are functionally related, which explains the requirement for their coordinated control by dedicated transcriptional regulators. Importantly, the full repertoire of TLR-induced transcriptional modules is currently unknown, as are the transcriptional master regulators of these modules. Nevertheless, the concept of transcriptional modules is useful when considering the heterogeneity of a complex transcriptional response, such as that induced by LPS in macrophages.

Here, we first review what is known regarding the regulation of the LPS-induced transcriptional response by transcription factors, chromatin modifications and transcriptional co-regulators in macrophages, and the relative roles of each of these components in the inflammatory response. Then, we describe the mechanisms by which various signalling pathways modulate this transcriptional programme in distinct biological contexts. Finally, we emphasize the modular nature of the transcriptional control of inflammation as being central to its physiological regulation and therapeutic manipulation.

Transcription factors

Induction of the LPS-dependent transcriptional response in macrophages is orchestrated by many transcription factors, consistent with the complexity of the response. These transcription factors can be divided into three categories on the basis of their mode of activation and function. This classification is not intended to demarcate mutually exclusive groups of transcription factors, but to illustrate general principles regarding their mechanisms of action and their role in the control of various inducible transcriptional modules in macrophages.

The first category (class I) consists of transcription factors that are constitutively expressed by many cell types and that are activated by signal-dependent post-translational modifications. In most cases, these transcription factors are retained in the cytoplasm in the basal state and their signal-dependent activation involves their nuclear translocation. This class is the best characterized of the three categories of transcription factors and it includes proteins that are known to have important roles in inflammation, such as NF-κB, IFN-regulatory factors (IRFs) and cAMP-responsive-element-binding protein 1 (CREB1). The genes that are induced most rapidly by LPS stimulation (the so called primary response genes) are regulated by these transcription factors (FIG. 1).

There are multiple mechanisms that quickly terminate the activation of NF-κB and IRFs; for example, inhibitor of NF-κB-α (IκBα) exports NF-κB from the nucleus

Box 1 | Module-specific transcriptional regulation of inflammatory gene expression

Many of the mechanisms that control gene expression operate in a gene-specific manner, which indicates that they might control specific modules of the inflammatory response. Although module-specific regulatory mechanisms have not yet been explored in detail, we describe a couple of examples to illustrate the biological contexts in which module-specific control has an obvious advantage.

Lipopolysaccharide (LPS) tolerance is a state of hyporesponsiveness to LPS (and other inflammatory stimuli) that is induced during conditions of excessive inflammation (such as sepsis) to limit inflammation-associated pathology. LPS-tolerant cells are refractory to the induction of expression of inflammatory cytokines such as tumour necrosis factor and interleukin-6 (IL-6), and this is due, at least in part, to the downregulation of expression of many inflammatory signalling proteins80. Importantly, however, LPS-induced signalling in LPS-tolerant cells can still induce the expression of genes that encode anti-inflammatory cytokines and antimicrobial peptides. The differential inducibility of these classes of genes (or modules) is associated with distinct patterns of chromatin remodelling81–83. This indicates that the transcriptional regulation of LPS tolerance enables the inhibition of some functional programmes (for example, those encoding inflammatory cytokines) while inducing other programmes (for example, antimicrobial effector functions), which could be advantageous when a host has to deal with a persistent infection82.

As another example of module-specific transcriptional regulation, we consider the multiple functions of different tissue-resident macrophage populations. All macrophages are key orchestrators of the inflammatory response, but they also have tissue-specific functions that are programmed by local factors (see figure). For example, IL-10 induces colonic epithelium macrophages to carry out immunomodulatory functions that are appropriate at this host–commensal interface, whereas white adipose tissue macrophages seem to have a role in metabolic regulation that is also tightly linked to the control of inflammation. These tissue-specific functional programmes are transcriptionally regulated and are conferred by the expression of transcription factors that are unique to these macrophage populations — inhibitor of nuclear factor-κB NS (IκBNS) and peroxisome proliferator-activated receptor-γ (PPARγ) in macrophages of the colonic epithelium and white adipose tissue, respectively57,84.

Figure (Box 1), labels: LPS, TLR4, and the modules: coagulation factors, inflammatory cytokines, chemotactic factors and receptors, antimicrobial effector functions, tissue repair factors, metabolic regulators, pathogen recognition and phagocytosis, antigen processing and presentation.

Nature Reviews Immunology, volume 9, October 2009, p. 693

Transcriptional control of the inflammatory response. Medzhitov et al. (2009)

Experimental design and sample sizes (macrophages, number of donors per assay)

Condition                          Label   RNA   ATAC
Naive                              N       84    42
IFNγ (18 h)                        I       84    41
Salmonella (5 h)                   S       84    31
IFNγ (18 h) + Salmonella (5 h)     I+S     84    31

ARTICLES NATURE GENETICS

Furthermore, only 8% of the caQTL regions overlapped annotated promoters, and 42% overlapped regions marked with acetylated histone H3 K27 modifications25 in macrophages (Supplementary Note). Next, using a statistical interaction test followed by filtering on effect size, we identified 387 response eQTLs and 2,247 response caQTLs with a small or undetectable effect (fold change (FC) < 1.5) in the naive state that increased at least 1.5 fold after stimulation (Methods). The use of an interaction test meant that our analysis should have been robust to false-positive response QTLs that might have arisen because of, for example, weak, undetected QTLs in the naive cell state. We verified this robustness by downsampling from a larger monocyte response eQTL dataset from Fairfax et al.3 (Supplementary Tables 2 and 3, and Supplementary Fig. 6). These genetic effects displayed a variety of activity patterns (Fig. 2a and Supplementary Fig. 7a). Notably, 18% of the response eQTLs appeared only after the cells were exposed to both stimuli (cluster 1), a number exceeding the number that appeared after IFNγ stimulation alone (clusters 5 and 6). Response caQTL regions contained closed chromatin in the naive cells (median transcripts per million = 0.49) and became 3.8 fold more accessible only after the relevant stimulus (Supplementary Fig. 7b). Furthermore, response caQTLs were associated with disruption of stimulus-specific TF motifs (Supplementary Fig. 7c), thus suggesting that they are largely driven by TFs that bind DNA only after stimulation.

Enhancer priming in the macrophage immune response. To quantify the extent of enhancer priming in the macrophage immune response, we next focused on how response eQTLs manifest at the chromatin level. We grouped response eQTLs (Fig. 2a) on the basis of the condition (treatment with IFNγ, Salmonella or both) in which they had the largest effect size. We then used linkage disequilibrium (LD) (R² > 0.8) between the lead variants to identify 145 caQTL–eQTL pairs that were likely to be driven by the same causal variant (Methods). For example, we identified a QTL upstream of GP1BA that had no effect in naive cells but became simultaneously associated with chromatin accessibility and gene expression after IFNγ + Salmonella stimulation (Fig. 2d). The lead caQTL variant (rs4486968) was predicted to disrupt an NF-κB-binding motif (Supplementary Fig. 8), thus illustrating how a caQTL can directly affect stimulus-specific TF binding and gene expression. In contrast, a genetic variant in an intron of NXPH2 modulated the accessibility of a regulatory element in both naive and stimulated cells but became associated with gene expression only after IFNγ stimulation (Fig. 2e). Genome wide, we found that for approximately half of the response eQTLs with a linked caQTL, the caQTL was present in naive cells before stimulation (caQTL FC > 1.5), thus suggesting that many response eQTLs regulate gene expression indirectly by first modulating the extent of enhancer priming in naive cells (Fig. 2b). One potential issue with our analysis is that using LD to identify eQTL–caQTL pairs might sometimes lead to false

Figure 1, panels a–d. Panel titles: a, naive (inactive enhancer) and stimulated (active enhancer); b, naive (primed enhancer) and stimulated (active enhancer); each with chromatin accessibility and gene expression plotted by genotype (CC, CG, GG); c, experimental design and sample sizes; d, signaling pathways activated by Salmonella and IFNγ.

Fig. 1 | Regulation of gene expression in the macrophage immune response. a, Genetic variant has a direct effect on the binding of a stimulation-specific TF (IRF1) and target-gene activation. b, Genetic variant in a primed enhancer disrupts the binding of a cell-type-specific TF (for example, PU.1) that indirectly influences stimulation-specific TF (IRF1) binding via modulation of chromatin accessibility. c, Overview of the experimental design. d, TLR4 recognizes LPS on the Salmonella cell wall and activates NF-κB, AP-1 and IRF3 TFs53. IRF3 stimulates IFNβ production, which in turn culminates in activation of the STAT1–STAT2–IRF9 complex. IFNγ binds to the IFNγ receptor and activates STAT1 and IRF1 TFs54.

NATURE GENETICS | www.nature.com/naturegenetics

© 2018 Nature America Inc., part of Springer Nature. All rights reserved.

What is differential gene expression?

Figure (from the slide "Why we discard non-unique alignments"): gene A and gene B in the control condition and in the treatment condition.

Slide from Simon Anders


Sequencing count data

              control-1  control-2  control-3  treated-1  treated-2
FBgn0000008          78         46         43         47         89
FBgn0000014           2          0          0          0          0
FBgn0000015           1          0          1          0          1
FBgn0000017        3187       1672       1859       2445       4615
FBgn0000018         369        150        176        288        383
[...]

• RNA-Seq
• Tag-Seq
• ChIP-Seq
• HiC
• Bar-Seq
• ...


SummarizedExperiment

Exploratory data analysis

More things to do with shrinkage:

The rlog transformation

Many useful methods want homoscedastic data:

• Hierarchical clustering
• PCA and MDS

But: RNA-Seq data is not homoscedastic.

• On the count scale, large counts have large (absolute) variance.
• After taking the logarithm, small counts show excessive variance.

Conceptual idea of the rlog transform: Log-transform the average across samples of each gene’s normalized count. Then “pull in” the log normalized counts towards the log averages. Pull more for weaker genes.
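To make the two bullet points about heteroscedasticity concrete, here is a minimal sketch that computes the exact standard deviation of a count and of its log for pure Poisson noise. The pseudocount of 1 inside the logarithm and the example means are assumptions chosen for illustration; this is not the rlog transform itself.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of observing count k, computed in log space."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def sd_of(transform, mean):
    """Exact standard deviation of transform(K) for K ~ Poisson(mean)."""
    upper = int(mean + 12 * math.sqrt(mean) + 30)  # covers essentially all mass
    probs = [poisson_pmf(k, mean) for k in range(upper + 1)]
    values = [transform(k) for k in range(upper + 1)]
    total = sum(probs)
    m1 = sum(p * v for p, v in zip(probs, values)) / total
    m2 = sum(p * v * v for p, v in zip(probs, values)) / total
    return math.sqrt(max(m2 - m1 * m1, 0.0))

def identity(k):
    return float(k)

def log2_plus_one(k):
    # Pseudocount of 1 is an assumption so that zero counts stay finite.
    return math.log2(k + 1)

for mean in (2, 20, 200, 2000):
    print(mean, round(sd_of(identity, mean), 2), round(sd_of(log2_plus_one, mean), 3))
```

On the count scale the standard deviation grows with the mean, while on the log scale it is largest for the smallest means, which is the pattern that shrinkage-based transformations aim to flatten.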


Figure panels: naive normalisation, rlog normalisation

Variance stabilising transformation and rlog produce similar results

Figure panels: vst normalisation, rlog normalisation

Testing for differential expression

Normalization for library size

• If sample A has been sampled deeper than sample B, we expect counts to be higher.

• Naive approach: Divide by the total number of reads per sample

• Problem: Genes that are strongly and differentially expressed may distort the ratio of total reads.

Normalization for library size

Figure panels: actual expression, sequenced reads, naively normalized


Normalization for library size

To compare more than two samples:

• Form a “virtual reference sample” by taking, for each gene, the geometric mean of counts over all samples

• Normalize each sample to this reference, to get one scaling factor (“size factor”) per sample.

Anders and Huber, 2010; similar approach: Robinson and Oshlack, 2010
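The two steps above can be sketched as follows, using the five example rows from the sequencing count data slide. The sketch assumes that "normalize each sample to this reference" means taking the median, over genes, of the ratio between a sample's count and the reference (the median-of-ratios idea of Anders and Huber, 2010); the function name `size_factors` is hypothetical, and with only three usable genes the numbers are purely illustrative.

```python
import math
from statistics import median

# Count table copied from the "Sequencing count data" slide (rows = genes).
samples = ["control-1", "control-2", "control-3", "treated-1", "treated-2"]
counts = {
    "FBgn0000008": [78, 46, 43, 47, 89],
    "FBgn0000014": [2, 0, 0, 0, 0],
    "FBgn0000015": [1, 0, 1, 0, 1],
    "FBgn0000017": [3187, 1672, 1859, 2445, 4615],
    "FBgn0000018": [369, 150, 176, 288, 383],
}

def size_factors(count_rows, n_samples):
    """Median-of-ratios size factors against a geometric-mean reference."""
    ratios = [[] for _ in range(n_samples)]
    for row in count_rows:
        if min(row) == 0:
            continue  # geometric mean would be zero, so the gene is skipped
        log_geo_mean = sum(math.log(k) for k in row) / len(row)
        reference = math.exp(log_geo_mean)  # virtual reference sample value
        for j, k in enumerate(row):
            ratios[j].append(k / reference)
    return [median(r) for r in ratios]

factors = size_factors(list(counts.values()), len(samples))
for name, s in zip(samples, factors):
    print(f"{name}: size factor {s:.3f}")

# Normalized counts: divide each sample's counts by its size factor.
normalized = {g: [k / s for k, s in zip(row, factors)] for g, row in counts.items()}
```

Because the median ignores the most extreme ratios, a few strongly differentially expressed genes have little influence on the size factor, which addresses the problem noted for the naive total-count approach. In this toy table treated-2 receives the largest factor and control-2 the smallest.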


Counting noise

In RNA-Seq, noise (and hence power) depends on count level.

Why?

The Poisson distribution

• This bag contains very many small balls, 10% of which are red.

• Several experimenters are tasked with determining the percentage of red balls.

• Each of them is permitted to draw 20 balls out of the bag, without looking.


3 / 20 = 15%

1 / 20 = 5%

2 / 20 = 10%

0 / 20 = 0%


7 / 100 = 7%

10 / 100 = 10%

8 / 100 = 8%

11 / 100 = 11%


Poisson distribution: Counting uncertainty

expected number    standard deviation of     relative error in estimate
of red balls       number of red balls       for the fraction of red balls
10                 √10 = 3                   1 / √10 = 31.6%
100                √100 = 10                 1 / √100 = 10.0%
1,000              √1,000 = 32               1 / √1,000 = 3.2%
10,000             √10,000 = 100             1 / √10,000 = 1.0%
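The table follows directly from the Poisson property that the variance equals the mean. A minimal sketch that reproduces its columns (the function name `poisson_uncertainty` is hypothetical):

```python
import math

def poisson_uncertainty(expected_count):
    """Standard deviation and relative error of a Poisson count."""
    sd = math.sqrt(expected_count)
    return sd, sd / expected_count

for n in (10, 100, 1_000, 10_000):
    sd, rel = poisson_uncertainty(n)
    print(f"{n:>6}  sd = {sd:6.1f}  relative error = {rel:.1%}")
```

A hundredfold increase in the expected count reduces the relative error only tenfold, which is why weakly expressed genes remain noisy even at high sequencing depth.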


The negative binomial distribution

A commonly used generalization of the Poisson distribution with two parameters

The NB from a hierarchical model

• Biological sample with mean µ and variance v (Gamma distribution): this gives the expression level q.

• Poisson distribution with mean q and variance q: this gives the count, given q.

• Result: negative binomial with mean µ and variance µ + v.
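The hierarchical model can be checked by simulation. The sketch below draws an expression level q from a Gamma distribution with mean µ and variance v and then a Poisson count with mean q; by the law of total variance the counts should have mean close to µ and variance close to µ + v. The values of µ and v, the seed, and the helper names are assumptions for illustration.

```python
import math
import random
from statistics import mean, variance

def poisson_draw(rng, rate):
    """Draw one Poisson count with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def gamma_poisson_draw(rng, mu, v):
    """Draw q from a Gamma with mean mu and variance v, then K ~ Poisson(q)."""
    shape = mu * mu / v   # Gamma mean = shape * scale
    scale = v / mu        # Gamma variance = shape * scale**2
    q = rng.gammavariate(shape, scale)
    return poisson_draw(rng, q)

rng = random.Random(1)    # fixed seed so the run is repeatable
mu, v = 20.0, 100.0       # hypothetical biological mean and variance
draws = [gamma_poisson_draw(rng, mu, v) for _ in range(20000)]
print("sample mean:", round(mean(draws), 2), "theory:", mu)
print("sample variance:", round(variance(draws), 2), "theory:", mu + v)
```

Knuth's sampler is only suitable for moderate rates like the ones used here; it is chosen to keep the sketch free of external dependencies.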


Two component noise model

var = μ + c μ²

• μ: shot noise (Poisson)
• c μ²: biological noise

Small counts → Sampling noise dominant → Improve power: deeper coverage

Large counts → Biological noise dominant → Improve power: more biol. replicates

Slide from Wolfgang Huber
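A short sketch of the two-component model makes the crossover visible. The value of c is a hypothetical choice; with it, the two components are equal at a mean of 1 / c = 10.

```python
def noise_components(mean, c):
    """Split var = mean + c * mean**2 into shot noise and biological noise."""
    shot = mean
    biological = c * mean ** 2
    return shot, biological

c = 0.1  # hypothetical value, chosen only for illustration
for mean in (1, 5, 50, 500):
    shot, bio = noise_components(mean, c)
    dominant = "sampling" if shot > bio else "biological"
    print(f"mean={mean:>4}  shot={shot:>6.1f}  biological={bio:>8.1f}  dominant: {dominant}")
```

Below the crossover, sequencing deeper reduces the dominant noise term; above it, only additional biological replicates help.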

Testing: Generalized linear models

Two sample groups: treatment and control.

Model:

• Count value $K_{ij}$ for gene $i$ in sample $j$ is generated by an NB distribution with mean $s_j \mu_{ij}$ and dispersion $\alpha_i$.

• The expected expression strength is:

$$\log \mu_{ij} = \beta_{i0} + x_j \beta_{iT}$$

$x_j = 0$ if $j$ is a control sample
$x_j = 1$ if $j$ is a treatment sample

Null model: $\beta_{iT} = 0$, i.e., expectation is the same for all samples

Alternative model: $\beta_{iT} \neq 0$, i.e., expected expression changes from control to treatment, with log fold change (LFC) $\beta_{iT}$


Testing: Generalized linear models

$$K_{ij} \sim \mathrm{NB}(s_j \mu_{ij}, \alpha_i)$$

$$\log \mu_{ij} = \beta_{i0} + x_j \beta_{iT}$$

$x_j = 0$ if $j$ is a control sample
$x_j = 1$ if $j$ is a treatment sample

Calculate the coefficients β that best fit the observed data K.

Is the value for $\beta_{iT}$ significantly different from zero?

Can we reject the null hypothesis that it is merely caused by noise (as given by the dispersion $\alpha_i$)?

We use a Wald test to get a p value.
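For the two-group design, the Wald test can be sketched in closed form under simplifying assumptions that are not part of the slide: all size factors equal 1 and the dispersion is treated as known. In that case the fitted group means are the sample means, and the standard error of the treatment coefficient follows from the Fisher information of a log-link NB model. The function name and the example dispersion are hypothetical, and real tools estimate the dispersion rather than fixing it.

```python
import math

def wald_test_two_groups(control, treated, alpha):
    """Wald test for the treatment coefficient of a two-group NB log-link GLM.

    Assumes size factors of 1 and a known dispersion alpha, so the fitted
    group means are the sample means. Both group means must be positive.
    """
    mu_c = sum(control) / len(control)
    mu_t = sum(treated) / len(treated)
    beta_t = math.log(mu_t / mu_c)                  # natural-log fold change
    # Fisher information per sample for this model: mu / (1 + alpha * mu)
    info_c = len(control) * mu_c / (1 + alpha * mu_c)
    info_t = len(treated) * mu_t / (1 + alpha * mu_t)
    se = math.sqrt(1 / info_c + 1 / info_t)
    z = beta_t / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail
    return beta_t / math.log(2), z, p_value         # LFC reported in log2

# Counts for FBgn0000008 from the count table; alpha = 0.1 is hypothetical.
lfc, z, p = wald_test_two_groups([78, 46, 43], [47, 89], alpha=0.1)
print(f"log2 fold change = {lfc:.3f}, z = {z:.3f}, p = {p:.3f}")
```

A larger dispersion lowers the Fisher information, widens the standard error and therefore raises the p value for the same fold change.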


Tasks in comparative RNA-Seq analysis

• Estimate fold-change between control and treatment

• Estimate variability within groups

• Determine significance (the hard part)


Dispersion

• Minimum variance of count data: $v = \mu$ (Poisson)

• Actual variance: $v = \mu + \alpha \mu^2$

• $\alpha$: “dispersion”, $\alpha = (v - \mu) / \mu^2$ (squared coefficient of variation of extra-Poisson variability)
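Rearranging v = μ + α μ² gives a simple method-of-moments estimate of the dispersion from replicate counts. The sketch below assumes the replicates already share the same sequencing depth (no size factors) and floors the estimate at zero, since a sample variance below the mean would otherwise give a negative dispersion. The function name is hypothetical.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion: alpha = (v - mu) / mu**2, floored at 0."""
    mu = mean(counts)
    v = variance(counts)          # unbiased sample variance across replicates
    if mu == 0:
        return 0.0
    return max((v - mu) / mu ** 2, 0.0)

print(moment_dispersion([50, 100, 150]))   # overdispersed replicates
print(moment_dispersion([10, 10, 10]))     # no extra-Poisson variability
```

With only a handful of replicates this per-gene estimate is very noisy, which motivates sharing information across genes.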


Shrinkage estimation of dispersion

Love et al. Genome Biology (2014) 15:550 Page 3 of 21

Figure 1 Shrinkage estimation of dispersion. Plot of dispersion estimates over the average expression strength (A) for the Bottomly et al. [16] dataset with six samples across two groups and (B) for five samples from the Pickrell et al. [17] dataset, fitting only an intercept term. First, gene-wise MLEs are obtained using only the respective gene’s data (black dots). Then, a curve (red) is fit to the MLEs to capture the overall trend of dispersion-mean dependence. This fit is used as a prior mean for a second estimation round, which results in the final MAP estimates of dispersion (arrow heads). This can be understood as a shrinkage (along the blue arrows) of the noisy gene-wise estimates toward the consensus represented by the red line. The black points circled in blue are detected as dispersion outliers and not shrunk toward the prior (shrinkage would follow the dotted line). For clarity, only a subset of genes is shown, which is enriched for dispersion outliers. Additional file 1: Figure S1 displays the same data but with dispersions of all genes shown. MAP, maximum a posteriori; MLE, maximum-likelihood estimate.

variation to the extent that the data provide this information, while the fitted curve aids estimation and testing in less information-rich settings.

Our approach is similar to the one used by DSS [6], in that both methods sequentially estimate a prior distribution for the true dispersion values around the fit, and then provide the maximum a posteriori (MAP) as the final estimate. It differs from the previous implementation of DESeq, which used the maximum of the fitted curve and the gene-wise dispersion estimate as the final estimate and tended to overestimate the dispersions (Additional file 1: Figure S2). The approach of DESeq2 differs from that of edgeR [3], as DESeq2 estimates the width of the prior distribution from the data and therefore automatically controls the amount of shrinkage based on the observed properties of the data. In contrast, the default steps in edgeR require a user-adjustable parameter, the prior degrees of freedom, which weighs the contribution of the individual gene estimate and edgeR’s dispersion fit.

Note that in Figure 1 a number of genes with gene-wise dispersion estimates below the curve have their final estimates raised substantially. The shrinkage procedure thereby helps avoid potential false positives, which can result from underestimates of dispersion. If, on the other hand, an individual gene’s dispersion is far above the distribution of the gene-wise dispersion estimates of other genes, then the shrinkage would lead to a greatly reduced final estimate of dispersion. We reasoned that in many cases, the reason for extraordinarily high dispersion of a gene is that it does not obey our modeling assumptions; some genes may show much higher variability than others for biological or technical reasons, even though they have the same average expression levels. In these cases, inference based on the shrunken dispersion estimates could lead to undesirable false positive calls. DESeq2 handles these cases by using the gene-wise estimate instead of the shrunken estimate when the former is more than 2 residual standard deviations above the curve.

Empirical Bayes shrinkage for fold-change estimation

A common difficulty in the analysis of HTS data is the strong variance of LFC estimates for genes with low read count. We demonstrate this issue using the dataset by Bottomly et al. [16]. As visualized in Figure 2A, weakly expressed genes seem to show much stronger differences between the compared mouse strains than strongly expressed genes. This phenomenon, seen in most HTS datasets, is a direct consequence of dealing with count data, in which ratios are inherently noisier when counts are low. This heteroskedasticity (variance of LFCs depending on mean count) complicates downstream analysis and data interpretation, as it makes effect sizes difficult to compare across the dynamic range of the data.

DESeq2 overcomes this issue by shrinking LFC estimates toward zero in a manner such that shrinkage is stronger when the available information for a gene is low, which may be because counts are low, dispersion is high or there are few degrees of freedom. We again employ an empirical Bayes procedure: we first perform

Love et al. Genome Biology (2014) 15:550
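The dispersion shrinkage described on this slide can be illustrated with a deliberately simplified sketch. It assumes a normal approximation on the log-dispersion scale, so that the MAP estimate becomes a precision-weighted average of the gene-wise estimate and the fitted trend; the actual DESeq2 procedure maximizes a posterior built from the NB likelihood, so this is an analogy rather than the published algorithm. The outlier rule (more than 2 residual standard deviations above the curve) is taken from the text; all names and numbers are hypothetical.

```python
import math

def shrink_log_dispersion(gene_alpha, trend_alpha, sampling_var, prior_var,
                          residual_sd, outlier_sds=2.0):
    """Shrink a gene-wise dispersion toward the fitted trend on the log scale."""
    log_gene = math.log(gene_alpha)
    log_trend = math.log(trend_alpha)
    # Outlier rule from the text: keep the gene-wise estimate when it lies
    # more than 2 residual standard deviations above the curve.
    if log_gene - log_trend > outlier_sds * residual_sd:
        return gene_alpha
    # Precision-weighted average (normal likelihood, normal prior on log alpha).
    weight_gene = prior_var / (prior_var + sampling_var)
    log_map = weight_gene * log_gene + (1 - weight_gene) * log_trend
    return math.exp(log_map)

# A gene far below the trend is pulled up; an extreme outlier is left alone.
print(shrink_log_dispersion(0.01, 0.1, sampling_var=1.0, prior_var=0.25, residual_sd=1.0))
print(shrink_log_dispersion(5.0, 0.1, sampling_var=1.0, prior_var=0.25, residual_sd=1.0))
```

The weight on the gene-wise estimate grows as its sampling variance shrinks, so genes with many replicates or high counts keep more of their own estimate.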

Shrinkage of fold change estimates


Figure 2 Effect of shrinkage on logarithmic fold change estimates. Plots of the (A) MLE (i.e., no shrinkage) and (B) MAP estimate (i.e., with shrinkage) for the LFCs attributable to mouse strain, over the average expression strength for a ten vs eleven sample comparison of the Bottomly et al. [16] dataset. Small triangles at the top and bottom of the plots indicate points that would fall outside of the plotting window. Two genes with similar mean count and MLE logarithmic fold change are highlighted with green and purple circles. (C) The counts (normalized by size factors sj) for these genes reveal low dispersion for the gene in green and high dispersion for the gene in purple. (D) Density plots of the likelihoods (solid lines, scaled to integrate to 1) and the posteriors (dashed lines) for the green and purple genes and of the prior (solid black line): due to the higher dispersion of the purple gene, its likelihood is wider and less peaked (indicating less information), and the prior has more influence on its posterior than for the green gene. The stronger curvature of the green posterior at its maximum translates to a smaller reported standard error for the MAP LFC estimate (horizontal error bar). adj., adjusted; LFC, logarithmic fold change; MAP, maximum a posteriori; MLE, maximum-likelihood estimate.

ordinary GLM fits to obtain maximum-likelihood estimates (MLEs) for the LFCs and then fit a zero-centered normal distribution to the observed distribution of MLEs over all genes. This distribution is used as a prior on LFCs in a second round of GLM fits, and the MAP estimates are kept as final estimates of LFC. Furthermore, a standard error for each estimate is reported, which is derived from the posterior’s curvature at its maximum (see Methods for details). These shrunken LFCs and their standard errors are used in the Wald tests for differential expression described in the next section.

The resulting MAP LFCs are biased toward zero in a manner that removes the problem of exaggerated LFCs for low counts. As Figure 2B shows, the strongest LFCs are no longer exhibited by genes with weakest expression. Rather, the estimates are more evenly spread around zero, and for very weakly expressed genes (with less than one read per sample on average), LFCs hardly deviate from zero, reflecting that accurate LFC estimates are not possible here.

The strength of shrinkage does not depend simply on the mean count, but rather on the amount of information available for the fold change estimation (as indicated by the observed Fisher information; see Methods). Two genes with equal expression strength but different dispersions will experience a different amount of shrinkage (Figure 2C,D). The shrinkage of LFC estimates can be described as a bias-variance trade-off [18]: for genes with little information for LFC estimation, a reduction of the strong variance is bought at the cost of accepting a bias toward zero, and this can result in an overall reduction in mean squared error, e.g., when comparing to LFC

Love et al. Genome Biology (2014) 15:550
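The effect of a zero-centered normal prior on LFC estimates can be sketched with the textbook normal-normal update. This assumes the likelihood of each gene's LFC is approximately normal around its MLE with the given standard error, which is a simplification of the second round of GLM fits described above; the prior width and the two example genes are hypothetical.

```python
def shrink_lfc(mle_lfc, standard_error, prior_sd):
    """Posterior mean and SD of an LFC under a zero-centered normal prior."""
    prior_var = prior_sd ** 2
    sampling_var = standard_error ** 2
    weight = prior_var / (prior_var + sampling_var)
    posterior_sd = (prior_var * sampling_var / (prior_var + sampling_var)) ** 0.5
    return weight * mle_lfc, posterior_sd

# Two genes with the same MLE but different amounts of information,
# in the spirit of the green and purple genes in Figure 2.
green = shrink_lfc(mle_lfc=2.0, standard_error=0.3, prior_sd=1.0)
purple = shrink_lfc(mle_lfc=2.0, standard_error=1.5, prior_sd=1.0)
print("low dispersion gene:", green)
print("high dispersion gene:", purple)
```

The gene with the larger standard error is pulled much closer to zero, matching the statement that shrinkage depends on the available information rather than on the mean count alone.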

Multiple testing

https://xkcd.com/882/



88 F. Hahne, W. Huber

Figure panels: “Histogram of tt$p.value” (left) and “Histogram of ttrest$p.value” (right); x-axis: p-value from 0.0 to 1.0; y-axis: frequency.

Figure 6.2. Histograms of p-values. Left: after nonspecific filtering. Right: filtered nonspecific probe sets only.

> table(ALLsfilt$mol.biol)
BCR/ABL     NEG
     37      42
> tt = rowttests(ALLsfilt, "mol.biol")
> names(tt)
[1] "statistic" "dm"        "p.value"

Take a look at the histogram of the resulting p-values in the left panel of Figure 6.2.

> hist(tt$p.value, breaks=50, col=lcol1)

We see a number of probe sets with very low p-values (which correspond to differentially expressed genes) and a whole range of insignificant p-values. This is more or less what we would expect. The expression of the majority of genes is not significantly shifted by the BCR/ABL mutation. To make sure that the nonspecific filtering did not throw away an undue amount of promising candidates, let us take a look at the p-values for those probe sets that we filtered out before. We can compute t-statistics for them as well and plot the histogram of p-values (right panel of Figure 6.2):

> ALLsrest = ALL_bcrneg[sds<sh, ]
> ttrest = rowttests(ALLsrest, "mol.biol")
> hist(ttrest$p.value, breaks=50, col=lcol2)

Exercise 6.1
Comment on the plot; do you think that the nonspecific filtering was appropriate?

Observed p-values are a mix of samples from

• a uniform distribution (from true nulls) and
• distributions concentrated at 0 (from true alternatives)

Diagnostic plot: the histogram of p-values

Slide from Wolfgang Huber
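The mixture behind the diagnostic histogram can be simulated in a few lines. The proportions of true nulls and true alternatives, and the choice of a uniform draw raised to the eighth power as a stand-in for p-values concentrated near 0, are assumptions made only to produce a plausible picture.

```python
import random

rng = random.Random(7)                     # fixed seed for a repeatable run
n_null, n_alt = 9000, 1000                 # hypothetical mix of hypotheses
p_null = [rng.random() for _ in range(n_null)]        # uniform on [0, 1)
p_alt = [rng.random() ** 8 for _ in range(n_alt)]     # piled up near 0
p_values = p_null + p_alt

n_bins = 20
hist = [0] * n_bins
for p in p_values:
    hist[min(int(p * n_bins), n_bins - 1)] += 1

# Text histogram: one '#' per 25 p-values in the bin.
for b, count in enumerate(hist):
    print(f"{b / n_bins:4.2f}-{(b + 1) / n_bins:4.2f} {'#' * (count // 25)}")
```

The flat part of the histogram reflects the true nulls and the spike in the first bin the true alternatives; a histogram that is not flat on the right usually signals a problem with the test or with unmodelled structure in the data.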

False discovery rate

$p_{(1)}, \ldots, p_{(N)}$.

The Benjamini and Hochberg (B-H) algorithm uses the following rule: for a fixed value of $q \in (0,1)$, referred to as the control rate, let $i_{\max}$ be the largest index for which

$$p_{(i)} \le \frac{i}{N} q,$$

and reject $H_{0(i)}$, the null hypothesis corresponding to $p_{(i)}$, if

$$i \le i_{\max},$$

accepting $H_{0(i)}$ otherwise. Figure 1 illustrates how the B-H procedure works.

Benjamini and Hochberg proved the following result [23], which justified their procedure.

Theorem. For independent test statistics, the B-H algorithm controls the expected false discovery proportion (FDP) at q:

$$\mathrm{FDR} \equiv E\{\mathrm{FDP}\} = (N_0/N)\, q \le q,$$

where $\mathrm{FDP} = V/R$, R is the number of cases rejected, V is the number of those that are actually null, and $N_0$ is the number of true null hypotheses.

Clearly, the above FDP control attempts to keep the number of false discoveries under control, and in a sense to keep the precision above a certain level. A good procedure should have as high recall rates as possible with prescribed high precision (or low FDP).
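The rule above translates almost line by line into code. The sketch below is a minimal illustration; the ten p-values are made up, with N = 10 and q = 0.2 chosen to match the setting of the figure.

```python
def benjamini_hochberg(p_values, q):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    n = len(p_values)
    order = sorted(range(n), key=lambda idx: p_values[idx])   # ascending p
    i_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            i_max = rank                  # keep the largest rank that passes
    rejected = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= i_max:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p_values, q=0.2)
print(sum(decisions), "of", len(p_values), "hypotheses rejected")
```

Note that the procedure is a step-up rule: every hypothesis ranked at or below the largest passing index is rejected, even if its own p-value lies above its individual threshold.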

Applying the B-H Procedure to the Peak Picking Problem

We will cast the NMR peak picking problem into the multiple testing framework. In WaVPeak (or PICKY), after data cleaning at the first stage by wavelet smoothing (or by hard thresholding), N potential peaks are identified. We wish to test that, for each $i = 1, \ldots, N$,

$H_{0i}$: the ith peak is a false peak

against

$H_{1i}$: the ith peak is a true peak.

We can view each candidate peak and its surroundings as one population. We have a random sample of intensities, $X_{i1}, \ldots, X_{in}$, from the ith population. The sample size n depends on which method is adopted. For WaVPeak, we have n = 9 if we use a rectangular neighborhood of length 1 in 2D spectra, such as 15N-HSQC; for PICKY, we have n = 1 since we only use one intensity at each candidate peak.

We implement the B-H procedure below in two steps.

• Step I: calculating p-values.

For WaVPeak and PICKY, we use volume ($\mathrm{Vol}_i$) and intensity ($\mathrm{Int}_i$) around the ith candidate peak as the test statistics, respectively. Our decision rule is to reject $H_{0i}$ if $\mathrm{Vol}_i$ or $\mathrm{Int}_i$ is large, respectively. The corresponding p-values are

$$p_i^V = P_{H_{0i}}(\mathrm{Vol}_i \ge \mathrm{vol}_i) \quad \text{for WaVPeak},$$

$$p_i^I = P_{H_{0i}}(\mathrm{Int}_i \ge \mathrm{int}_i) \quad \text{for PICKY},$$

where $\mathrm{vol}_i$ and $\mathrm{int}_i$ are observed values of $\mathrm{Vol}_i$ and $\mathrm{Int}_i$.

• Step II: applying the B-H procedure at FDR = q.

Rank the p-values $p_1, \ldots, p_N$ obtained from Step I in ascending order, and denote the ordered p-values as $p_{(1)}, \ldots, p_{(N)}$. We can then plot $p_{(k)}$ vs k, and apply the B-H procedure.

Calculation of P-values

We now explain how to calculate p-values $p_i^V$ and $p_i^I$ in Step I above. We assume that the observations from different peaks are independent, and that true peaks and false peaks are from two different normal distributions. Then we can rewrite the above testing problem as

$$H_{0i}: X_{i1}, X_{i2}, \ldots, X_{in} \sim \text{i.i.d. } N(\mu_0, \sigma_0^2)$$

against

$$H_{1i}: X_{i1}, X_{i2}, \ldots, X_{in} \sim \text{i.i.d. } N(\mu_1, \sigma_1^2).$$

Typically, the mean intensity $\mu_0$ from false peaks is much smaller than the mean intensity $\mu_1$ from true peaks, usually written as $\mu_1 \gg \mu_0$. However, $\mu_0$ may not be zero, and can be estimated from weak intensities. For variances, we typically have $\sigma_1^2 \ge \sigma_0^2$.

The reason why $\mu_0$ is small (compared with $\mu_1$) but not zero is due to how the candidate peaks are selected. In WaVPeak and PICKY, the volumes and intensities are calculated for a grid of

Figure 1. Illustration of the Benjamini-Hochberg procedure. In this example, the number of hypotheses (N) is 10 and the false discovery proportion (q) is 0.2. The largest index of the hypotheses that is below the line is 6 ($i_{\max} = 6$). Therefore, the first six hypotheses are rejected as the predicted peaks. doi:10.1371/journal.pone.0053112.g001

BH Peak Picking

PLOS ONE | www.plosone.org 3 January 2013 | Volume 8 | Issue 1 | e53112

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289-300.

What are we going to do with the gene lists?

Functional enrichment statistics

Figure labels: “Genes with known function x”, “Your gene list”, and “?” for their overlap.

Does your gene list include more genes with function x than expected by random chance?

p = (formula shown as an image on the slide)

Slide credit: Priit Adler, ELIXIR-EE tools course 2016

g:Profiler toolset http://biit.cs.ut.ee/gprofiler

J. Reimand, M. Kull, H. Peterson, J. Hansen, J. Vilo: g:Profiler - a web-based toolset for functional profiling of gene lists from large-scale experiments (2007) NAR 35 W193-W200

Jüri Reimand, Tambet Arak, Priit Adler, Liis Kolberg, Sulev Reisberg, Hedi Peterson, Jaak Vilo: g:Profiler -- a web server for functional interpretation of gene lists (2016 update) Nucleic Acids Research 2016; doi: 10.1093/nar/gkw199


Reading the output

Statistics

Your genes: 50

GO:0034660 ncRNA metabolic process: 475 genes

10
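One common way to answer the question "more genes with function x than expected by chance?" is a one-sided hypergeometric test, sketched below. This is an illustration under stated assumptions, not a description of how g:Profiler computes its statistics: the number 10 is read as the overlap between the 50 input genes and the 475 annotated genes, and the universe of 20,000 genes is a hypothetical value that does not appear on the slide.

```python
from math import comb

def enrichment_p_value(universe, annotated, list_size, overlap):
    """One-sided hypergeometric P(X >= overlap) for a gene list of list_size."""
    total = comb(universe, list_size)
    upper = min(annotated, list_size)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# universe=20000 is an assumption; the other numbers are read off the slide.
p = enrichment_p_value(universe=20000, annotated=475, list_size=50, overlap=10)
print(f"expected overlap by chance: {50 * 475 / 20000:.2f}")
print(f"p = {p:.2e}")
```

Because one such test is run for every annotation term, the resulting p-values need a multiple testing correction before they are interpreted.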



Supplementary Figure 2. Differential gene expression and chromatin accessibility in macrophage immune response. (A) Principal component analysis of the gene expression data, n = 84 independent donors in each condition. (B) Principal component analysis of the chromatin accessibility data. The number of independent donors in each condition was n = 42 (N), n = 41 (I) and n = 31 (S and I+S). (C) Left panel: 8,758 differentially expressed genes clustered into nine distinct expression patterns (n = 84 unique donors across four conditions). Right panel: Selection of Gene Ontology terms enriched in each cluster. Only enrichments with p < 1×10^-8 are shown in the figure. Enrichment p-values were calculated using g:Profiler1. Differential gene expression patterns closely recapitulated known aspects of macrophage immune response. For example, genes upregulated by Salmonella (cluster 1) were enriched for tumor necrosis factor (TNF) signalling and cell death pathways whereas genes upregulated by IFNɣ (cluster 5) were enriched for IFNɣ response and antigen presentation pathways. (D) Left panel: heatmap of 63,350 differentially accessible regions clustered into seven distinct patterns (n = 16 high quality donors across four conditions (see Supplementary Note)). Right panel: enrichment of TF motifs in four groups of differentially accessible clusters relative to all open chromatin regions. The points represent fold enrichment calculated using two-sided Fisher’s exact test. Due to the large number of differentially accessible regions, the 95% confidence intervals from Fisher’s exact test are too narrow to be visible on the plot. Similarly to the gene expression data (panel C), open chromatin regions in clusters 1 and 2 that became accessible after Salmonella infection were specifically enriched for NF-κB and AP-1 motifs, two main TFs activated downstream of
