Galina Glazko and Yasir Rahmatallah Biomedical Informatics, UAMS IDeA National Resource for Proteomics Workshop PATHWAY ANALYSIS: THEORY AND PRACTICE

PATHWAY ANALYSIS: THEORY AND PRACTICE - IDeAidearesourceproteomics.org/wp-content/uploads/2017/09/Pathway-Analysis.pdf · Galina Glazko and Yasir Rahmatallah Biomedical Informatics,


Galina Glazko and Yasir Rahmatallah

Biomedical Informatics, UAMS

IDeA National Resource for Proteomics Workshop

PATHWAY ANALYSIS: THEORY AND PRACTICE


OUTLINE

•Part I: Pathway analysis for protein lists

- Pitfalls and ways around them

•Part II: Pathway analysis without preselected protein lists

- different hypotheses and power

•Part III: Other Hypotheses

•NB: all approaches are applicable to any omics data type that can be presented in a matrix form


Problem: omics data = thousands of measurements
How to extract the strongest signal from the data?
How to ask questions of the data - statistical viewpoints:
- test different statistical hypotheses
- include prior knowledge
A biological pathway is a list of functionally connected genes/proteins involved in the same biological process: we can treat it as prior information that is available for any omics data set

Omics data: multivariate, correlated


Pathway analysis motivations

1. The challenge is no longer in obtaining omics data but in interpreting the results to gain insight into biological mechanisms

2. Individual changes can be weak and not significant if considered only at the level of a single protein

3. It is easier to do comparative studies using pathways

Pathway analysis increases (1) data interpretability; (2) power*; (3) 'comparability'
*) power = P(reject H0 | H1 is true)

I. Pathway tools for protein lists II. Pathway tools without protein lists


Part I: pathway tools for protein lists

MOTIVATION: Proteins work in groups; if a process is abnormal, co-functioning proteins will be selected together. Hopefully they can be described through a common biological process ('ontology').

Step 1: select differentially abundant (DA) proteins -> form a list
Step 2: select 'ontologies'
Step 3: select a test
RESULT: functional groups that are over-represented in the list, as compared to the rest of the proteins


Step 3: statistical formulations

1. Hypergeometric formulation
H0: DA proteins are randomly sampled from the entire population
H0: N11 ~ Hyper(n, n1+, n+1)

                   In GO         Not in GO      Total
Class 1 (DA)       n11 (100)     n12 (113)      n1+ (213)
Class 2 (not DA)   n21 (9000)    n22 (11000)    n2+ (20000)
Total              n+1 (9100)    n+2 (11113)    n

Fraction of DA proteins: in GO 100/9100 ≈ 0.01; not in GO 113/11113 ≈ 0.01

2. Fisher exact test
H0: the property of belonging to the GO category and that of being DA are independent (contingency table)
H1: there is a relationship between being DA and being in GO
(actually, N11 ~ Hyper(n, n1+, n+1))

3. Chi-square approximation

4. Calculate p-values
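The hypergeometric formulation can be evaluated directly from the counts in the contingency table. The following is a minimal sketch in plain Python, not the tool used on the slides; the function name and the one-sided (over-representation) tail are assumptions made for illustration, and n is taken as 213 + 20000 = 20213.

```python
from math import comb

def hypergeom_upper_tail(n11, n1_plus, n_plus1, n):
    """P(N11 >= n11) when N11 ~ Hyper(n, n1+, n+1).

    n11      observed number of DA proteins inside the GO category
    n1_plus  number of DA proteins (size of the list)
    n_plus1  number of proteins annotated to the GO category
    n        total number of proteins
    """
    denom = comb(n, n1_plus)
    upper = min(n1_plus, n_plus1)
    numer = sum(comb(n_plus1, k) * comb(n - n_plus1, n1_plus - k)
                for k in range(n11, upper + 1))
    return numer / denom  # exact integers, converted to float only at the end

# Counts from the table; n = 213 + 20000
p_value = hypergeom_upper_tail(n11=100, n1_plus=213, n_plus1=9100, n=20213)
print(f"one-sided enrichment p-value: {p_value:.3f}")
```

With about 1% of proteins being DA both inside and outside the GO category, the p-value is far from significant, which agrees with the two fractions quoted on the slide.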


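The Gene Ontology (GO): the most popular annotation database. It describes attributes of proteins in three domains: biological process (BP), molecular function (MF) and cellular component (CC).

[Figure (Alexa et al., 2006): significantly enriched GO terms in a microarray study; a GO term includes all genes annotated to all of its children]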


Pitfalls and ways around them: 1

GO categories are nested: 'antigen presentation' contains 'antigen presentation exogenous'

If a parent and a child have the same proteins => the score is the same, but the child is more specific

One way around: remove all genes that are annotated to a significantly enriched node from all its ancestors (Alexa et al., 2006):

params <- new("GOHyperGParams",
              ……,                      # other arguments elided on the slide
              conditional=TRUE,        # condition on significant child terms
              testDirection="over")    # test for over-representation
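The elimination idea itself (remove the genes of a significantly enriched term from all of its ancestors before testing them) can be illustrated independently of any package. This is a toy sketch, not the algorithm of Alexa et al. as implemented anywhere; the function name, the data layout and the gene names are hypothetical.

```python
def eliminate_from_ancestors(annotations, ancestors, enriched_term):
    """Return a copy of `annotations` in which the genes of `enriched_term`
    have been removed from each of its ancestor terms."""
    pruned = {term: set(genes) for term, genes in annotations.items()}
    for ancestor in ancestors.get(enriched_term, ()):
        pruned[ancestor] -= annotations[enriched_term]
    return pruned

# Hypothetical toy annotation (gene names are made up for illustration)
annotations = {
    "antigen presentation": {"g1", "g2", "g3", "g4"},
    "antigen presentation, exogenous": {"g1", "g2"},
}
ancestors = {"antigen presentation, exogenous": ["antigen presentation"]}

pruned = eliminate_from_ancestors(annotations, ancestors,
                                  "antigen presentation, exogenous")
print(sorted(pruned["antigen presentation"]))
```

After pruning, the parent term is tested only on the genes that the more specific child does not already explain.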


Pitfalls and ways around them: 2

The sample size is usually small in omics experiments => the signal is usually weak (by design or by nature) => finding highly DA proteins to make a list is DIFFICULT

In addition, the multiple-testing correction may make the significance threshold too restrictive, and high correlations between expression levels make FDR/FWER-controlling procedures even more conservative


Part II. Without lists: GSEA

MOTIVATIONS:
I. - no DE genes between patients with type 2 diabetes mellitus (DM2) and normal individuals
   - the # of tests = # of genes ~22,000 (Bonferroni correction)
   - arrange genes into pathways (pre-defined gene sets) and consider differentially expressed PATHWAYS
   - the # of tests = # of pathways ~149
II. 'Gene X is DE' has less explanatory power than 'Pathway X is DE'

H0: the rank ordering of genes according to expression differences between phenotypes is random
H1: the rank ordering of the pathway members is associated with phenotype

(Mootha et al., 2003)
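The gain from testing ~149 pathways instead of ~22,000 genes can be made concrete with the Bonferroni threshold. A minimal arithmetic sketch; the overall level alpha = 0.05 is an assumption, it is not stated on the slide.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

alpha = 0.05  # assumed overall significance level
gene_level = bonferroni_threshold(alpha, 22000)
pathway_level = bonferroni_threshold(alpha, 149)
print(f"per-gene threshold:    {gene_level:.2e}")
print(f"per-pathway threshold: {pathway_level:.2e}")
```

The per-pathway threshold is roughly 150 times less strict than the per-gene one.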


GSEA algorithm 2003 (Mootha et al., 2003)

Q: are the members of a set S enriched at the top of the ranked list?

1. arrange genes according to expression differences between phenotypes

2. consider the ranked genes R1,…,RN and a gene set S with |S|=G

3. for every S calculate the scores Xi (the value of Xi depends on whether Ri ∈ S)

4. calculate the running sum over the N genes and take its maximum. The enrichment score (ES) for a single gene set is defined as the greatest positive deviation of the running sum across all N genes (a running sum = a sequence of partial sums):

   ES = \max_{1 \le j \le N} \sum_{i=1}^{j} X_i

5. permute the class labels 1,000 times, each time recording the maximum ES (MES) over all gene sets (~149 sets)

6. construct a histogram of MES; calculate the probability of an MES as extreme as observed
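The running-sum computation can be sketched as follows. The slide text does not preserve the formula for the scores Xi, so this sketch assumes the definition from Mootha et al. (2003): Xi = sqrt((N-G)/G) if Ri is in S and Xi = -sqrt(G/(N-G)) otherwise, which makes the running sum end at zero. The function name and the convention of returning 0 when the running sum never becomes positive are illustrative choices.

```python
from math import sqrt

def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum over the ranked list."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit = sqrt((n - g) / g)      # X_i when R_i is a member of S (assumed form)
    miss = -sqrt(g / (n - g))    # X_i when R_i is not a member of S (assumed form)
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best

ranked = ["a", "b", "c", "d"]  # genes ordered by expression difference
print(enrichment_score(ranked, {"a", "b"}))  # set members at the top of the list
print(enrichment_score(ranked, {"c", "d"}))  # set members at the bottom of the list
```

A set concentrated at the top of the ranking gets a large ES; a set at the bottom never pushes the running sum above zero.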


GSEA now: online and in Bioconductor

REQUIRES: data matrix
WILL DO: Gene Set Enrichment Analysis, employing the Molecular Signatures Database (MSigDB)
BENEFITS: available online; large coverage of different molecular signatures (MSigs)
DRAWBACKS: might have false positives (FPs); depends on the background; MSigs can be difficult to interpret in the context of a particular analysis

(GSEA's library: 3494 articles)


MSigDB: http://www.broadinstitute.org/gsea/msigdb/index.jsp

C2: curated gene sets (4792 gene sets)

Gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts. The gene set page for each gene set lists its source.

CP: canonical pathways (1392 gene sets)

Gene sets from the pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts.

CGP: chemical and genetic perturbations (3400 gene sets)

Gene sets that represent gene expression signatures of genetic and chemical perturbations. A number of these gene sets come in pairs: an xxx_UP (xxx_DN) gene set representing genes induced (repressed) by the perturbation. The gene set page for each gene set lists the PubMed citation on which it is based.

C3: motif gene sets (863 gene sets)

Gene sets that contain genes that share a cis-regulatory motif that is conserved across the human, mouse, rat, and dog genomes. The motifs are catalogued in Xie, et al. (2005, Nature 434, 338–345) and represent known or likely regulatory elements in promoters and 3'-UTRs. These gene sets make it possible to link changes in a microarray experiment to a conserved, putative cis-regulatory element.

MIR: microRNA targets (222 gene sets)

Gene sets that contain genes that share a 3'-UTR microRNA binding motif.

TFT: transcription factor targets (615 gene sets)

Gene sets that contain genes that share a transcription factor binding site defined in the TRANSFAC (version 7.4, http://www.gene-regulation.com/) database. Each of these gene sets is annotated by a TRANSFAC record.

C4: computational gene sets (883 gene sets)

Computational gene sets defined by mining large collections of cancer-oriented microarray data.

CM: cancer modules (456 gene sets)

Gene sets defined by Segal et al. (Nature Genetics, 2004). Briefly, the authors compiled gene sets ('modules') from a variety of resources such as KEGG, GO, and others. By mining a large compendium of cancer-related microarray data, they identified 456 such modules as significantly changed in a variety of cancer conditions.

CGN: cancer gene neighborhoods (427 gene sets)

Gene sets defined by expression neighborhoods centered on 380 cancer-associated genes (Brentani, Caballero et al. 2003). This collection is identical to that previously reported in (Subramanian, Tamayo et al. 2005).


Since that time…

(Ackermann & Strimmer, 2009) ~261 different variants of gene set enrichment procedures


Methods' Taxonomy (Ackermann & Strimmer, 2009)

MAJOR DIFFERENCES:

I. Univariate (protein-level) or multivariate statistics

e.g. t-statistics: X1,…,Xn -> t1,…,tn
e.g. t^2: t1^2,…,tn^2
e.g. ∑t^2

II. Type of hypotheses

Q1: "competitive"*) (GSEA)
compares differential abundance of a set to a standard defined by the complement of that gene set

Q2: "self-contained" (T2, N)
compares the gene set to a fixed standard that does not depend on the measurements of genes outside the gene set

*) "of limited utility"


How to choose the relevant method?

IN GENERAL, for any test statistic: high power, low false positive (FP) rate.
IN PARTICULAR: high power, low FP rate UNDER REALISTIC BIOLOGICAL SETTINGS:

(1) not all proteins in a set change their abundance. The percentage of proteins that must change their abundance for a pathway to be called DA is an unknown characteristic of test performance

(2) as a rule, proteins in a set are functionally related => correlated. Multivariate tests might have better power because, in contrast to univariate tests, they account for interdependence among genes

WE WOULD LIKE TO SELECT A TEST WITH HIGH POWER AND A LOW FP RATE, GIVEN THAT NOT ALL PROTEINS IN A PATHWAY ARE DA AND THAT THEY ARE CORRELATED


DE pathways: simplest way

Aggregated t-test

Let g1,…,gn be the genes from pathway Pk

Simplest:
1. for every gi (i=1,…,n) calculate an ordinary or moderated t-statistic ti
2. calculate Sk=∑ti: the score for Pk
3. construct the null distribution for the scores (under H0 no gene sets are differentially expressed: no score difference)
4. calculate p-values*) for Sk

*) NB: p-value: the probability of a result at least as extreme as observed: #(score_obs ≤ score_perm)/#(perm)
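The four steps can be sketched for a single pathway with an exact permutation null. This is an illustrative sketch only: it uses the ordinary pooled-variance t-statistic, enumerates every relabelling of the samples (feasible only for small designs), and runs on made-up numbers; the function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def t_statistic(x, y):
    """Ordinary two-sample t-statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))

def pathway_score(data, group1):
    """S_k = sum of gene-level t-statistics; `data` holds one row per gene."""
    n_samples = len(data[0])
    group2 = [j for j in range(n_samples) if j not in group1]
    return sum(t_statistic([gene[j] for j in group1], [gene[j] for j in group2])
               for gene in data)

def permutation_p_value(data, group1):
    """p = #(score_obs <= score_perm) / #(perm) over all relabellings."""
    n_samples = len(data[0])
    observed = pathway_score(data, tuple(group1))
    null_scores = [pathway_score(data, labels)
                   for labels in combinations(range(n_samples), len(group1))]
    p_value = sum(score >= observed for score in null_scores) / len(null_scores)
    return observed, p_value, len(null_scores)

# Made-up pathway: 2 genes, samples 0-2 are cases, samples 3-5 are controls
toy_pathway = [
    [5.1, 5.3, 4.9, 1.0, 1.2, 0.8],
    [7.0, 7.4, 6.8, 2.1, 2.5, 1.9],
]
score, p_value, n_perm = permutation_p_value(toy_pathway, group1=(0, 1, 2))
print(f"S_k = {score:.2f}, p = {p_value}, permutations = {n_perm}")
```

Because the observed labelling is itself one of the enumerated relabellings, the p-value can never be exactly zero.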


DE pathways: illustrative example

Affy yeast: 5812 probe sets mapped onto 81 pathways in KEGG; next step: scores and p-values

1. scores: Sk=∑ti (case: NAM-treated)

00010 -1.8684481 Glycolysis / Gluconeogenesis
00020 -0.7754142 Citrate cycle (TCA cycle)
00030 -0.5792160 Pentose phosphate pathway
00040 -1.4629696 Pentose and glucuronate interconversions
…

2. construct the null distribution for the score using permutations: but?
- class labels that represent 2 samples with 3 observations each can be permuted in C(6,3) = 20 ways
- the lowest p-value*) one can get (aside from 0) for 2x3 is 1/20 = 0.05: 'granularity' of p-values
- "rule of thumb" (Alison et al., 2006): at least 5 observations per sample; C(10,5) = 252, so the lowest p-value is 1/252 ≈ 0.004

*) NB: p-value: the probability of a result at least as extreme as observed: #(score_obs ≤ score_perm)/#(perm)
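The granularity argument is a counting exercise: with two groups of equal size there are only C(2n, n) distinct relabellings, so no permutation p-value can be smaller than one over that count. A short check, assuming equal group sizes; the function name is illustrative.

```python
from math import comb

def min_permutation_p_value(n_per_group):
    """Smallest non-zero p-value from an exact two-group permutation test."""
    return 1 / comb(2 * n_per_group, n_per_group)

for n in (3, 5):
    print(f"{n} per group: {comb(2 * n, n)} relabellings, "
          f"smallest p-value = {min_permutation_p_value(n):.4f}")
```

With 3 observations per group a result can never fall below the conventional 0.05 level; 5 per group already allows p-values of about 0.004.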


What is expected from a test?

- to have high power and a low false positive rate

- to perform well in real biological settings: proteins in a pathway work together, i.e., can be highly correlated. A test should account for that

Several parameters matter: the sample size, N; the number of proteins in a pathway, p; the correlation between proteins, r; the detection call, g (the fraction of truly DA proteins)

How do these parameters influence the test's power?

The best way to find out is by simulation
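A power simulation of this kind can be sketched in a few lines. This is not the simulation design behind the slides: it assumes equicorrelated unit-variance normal data, a mean shift m in a fraction g of the p proteins, a sum of squared t-statistics as the pathway statistic, and a random-permutation null. All names and default values are illustrative.

```python
import random
from math import sqrt

def simulate_group(n, p, r, shift, rng):
    """n samples of p unit-variance normals with pairwise correlation r."""
    rows = []
    for _ in range(n):
        common = rng.gauss(0, 1)  # shared factor that induces the correlation
        rows.append([shift[j] + sqrt(r) * common + sqrt(1 - r) * rng.gauss(0, 1)
                     for j in range(p)])
    return rows

def sum_squared_t(rows, idx1, idx2):
    """Sum over proteins of squared pooled-variance t-statistics."""
    n1, n2 = len(idx1), len(idx2)
    total = 0.0
    for j in range(len(rows[0])):
        x = [rows[i][j] for i in idx1]
        y = [rows[i][j] for i in idx2]
        mx, my = sum(x) / n1, sum(y) / n2
        ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
        total += (mx - my) ** 2 / (ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
    return total

def estimate_power(n, p, r, g, m, reps=100, perms=99, alpha=0.05, seed=1):
    """Fraction of simulated data sets in which H0 is rejected at level alpha."""
    rng = random.Random(seed)
    n_da = round(g * p)                      # number of truly DA proteins
    shift = [m] * n_da + [0.0] * (p - n_da)
    no_shift = [0.0] * p
    rejections = 0
    for _ in range(reps):
        rows = (simulate_group(n, p, r, shift, rng)
                + simulate_group(n, p, r, no_shift, rng))
        idx = list(range(2 * n))
        observed = sum_squared_t(rows, idx[:n], idx[n:])
        exceed = 1                           # the observed labelling counts once
        for _ in range(perms):
            rng.shuffle(idx)
            if sum_squared_t(rows, idx[:n], idx[n:]) >= observed:
                exceed += 1
        if exceed / (perms + 1) <= alpha:
            rejections += 1
    return rejections / reps

print("power, m=1.0:", estimate_power(n=10, p=20, r=0.1, g=0.5, m=1.0, reps=50))
print("FP rate, m=0:", estimate_power(n=10, p=20, r=0.1, g=0.5, m=0.0, reps=50))
```

Running the same function over a grid of N, p, r and g values gives power curves; setting m=0 gives the empirical false positive rate.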


Conclusions: mean (m)

1. Among all factors (dimensionality p, detection call g, pairwise correlations r), the correlations have the strongest impact on the power of the tests

2. When r=0.1, the power of …, T2 and N is approximately the same

3. When g=0.5 and m=0.5, the power for p=100, 60, 20 is 100%, 90%, 80%: pathway dimensionality and detection call only slightly influence power

4. When r=0.1, T2 has lower power in comparison to other tests. When r=0.9, T2 has the highest power when g=0.25, i.e. only a few components of the mean vector need to differ to reject the null


Conclusions: variance

Only the N-statistic has the power to test the full hypothesis F=G against a fixed alternative F≠G

NB from the analysis: multivariate tests have MORE POWER


Three major decisions to make:

Taxonomy!


Gene set analysis approaches for RNA-Seq data

Three major decisions to make! (marked 1, 2, 3 in the diagram)
- approaches employing pre-selected gene lists vs. approaches without pre-selected gene lists
- COMPETITIVE vs. SELF-CONTAINED
- univariate tests vs. multivariate tests

Legend: developed for microarrays (require data transformation) / developed specifically for RNA-Seq

Null hypotheses in the diagram:
- H0: two properties, being DE and belonging to a particular pathway, are independent
- H0: genes in a gene set are randomly associated with the phenotype
- H0: gene-set enrichment score does not differ between phenotypes
- H0: gene-set score (p-value) does not differ between phenotypes
- H0: the equality of mean vectors
- H0: the equality of multivariate distributions

Methods in the diagram:
- GOseq: Gene Ontology analysis on RNA-seq
- Supervised: 1. ROMER (limma) 2. SeqGSEA: Gene Set Enrichment Analysis on RNA-Seq
- Unsupervised: 1. ssGSEA: single sample extension of GSEA 2. GSVA: gene set variation analysis
- 1. SAM-GS 2. eBayes 3. edgeR 4. DESeq; FM for combining p-values
- N-statistic
- 1. Multivariate KS 2. ROAST (limma)


Part III: other hypotheses

1. The space of null hypotheses for pathways is mostly covered by three null hypotheses (Hotelling T2; N-statistic; difference in scores for protein-level tests)

2. Multivariate non-parametric two-sample tests (multivariate (radial) Kolmogorov-Smirnov) can discriminate against several alternatives (Friedman & Rafsky, 1979)

3. Q: "Is the expression of a pathway different between two phenotypes?" is too non-specific

- Employing statistical hypothesis tests that allow more specific alternatives to be formulated can sharpen the initial question itself

- The multivariate radial Kolmogorov-Smirnov (RKS) test is sensitive to alternatives having similar mean vectors but differences in scale (variance)


What about other hypotheses?

1. Pathway (Gene Set Analysis, GSA) approaches primarily focus on identifying differentially abundant protein sets (pathways)

2. Methods for identifying differentially co-abundant pathways also exist but are mostly based on aggregated pairwise correlations, or other pairwise measures of coexpression

3. Suggestion: Gene Set Net Correlations Analysis (GSNCA), a multivariate differential coexpression test that accounts for the complete correlation structure between genes


Gene Set Analysis in R (GSAR)

The GSAR package (Dr. Rahmatallah & Dr. Glazko) is written in R and is freely available on the Bioconductor web site (BMC Bioinf., https://www.ncbi.nlm.nih.gov/pubmed/28118818).

It includes:
1. Multivariate non-parametric tests using minimum spanning trees (MSTs):
   - Wald-Wolfowitz test
   - Kolmogorov-Smirnov test
2. Gene Set Net Correlation Analysis (GSNCA)
3. MST2 for visualizing the backbone structure of the correlation network


Conclusions

TO INCREASE DATA INTERPRETABILITY: test different statistical hypotheses
TO INCREASE POWER: increase sample size and use multivariate tests
USE GSAR!