22
Impact of knowledge accumulation on pathway enrichment analysis Lina Wadi1, Mona Meyer1, Joel Weiser1, Lincoln D. Stein1,3, Jüri Reimand1,2,* 1 - Ontario Institute for Cancer Research 2 - Department of Medical Biophysics, University Of Toronto 3 - Department of Molecular Genetics, University Of Toronto * -

20160530 journal club_jqo

Embed Size (px)

Citation preview

Page 1: 20160530 journal club_jqo

Impact of knowledge accumulation on pathway enrichment analysis

Lina Wadi1, Mona Meyer1, Joel Weiser1, Lincoln D. Stein1,3, Jüri Reimand1,2,*

1 - Ontario Institute for Cancer Research 2 - Department of Medical Biophysics, University Of Toronto

3 - Department of Molecular Genetics, University Of Toronto * -

Page 2: 20160530 journal club_jqo

Pathway enrichment analysis GO analysis

functional enrichment analysis …

High-Throughput experiment

Candidate genes

or proteins

100s of them —what can they tell us globally?

Annotation database (e.g. GO)

ApoptosisCell death

Annotation terms

(e.g. biological process)

Enriched annotated

terms

biological role of thecandidate genes/proteins

many tools…

Page 3: 20160530 journal club_jqo

genes associated with apoptosis

all other genes

Candidate genes

or proteins

75%

Whole genome or

assayed part of it

25%

Genes/proteins related to apoptosis

Enrichment analysis

Page 4: 20160530 journal club_jqo

The content of the annotation database is key for the enrichment analysis

High-Throughput experiment

Candidate genes

or proteins

100s of them —what can they tell us globally?

Annotation database (e.g. GO)

ApoptosisCell death

Annotation terms

(e.g. biological process)

Enriched annotated

terms

biological role of thecandidate genes/proteins

many tools…

Annotation databases constantly grow: gene ontology (GO) is

updated daily, Reactome quarterly

How often do these tools update the version of the annotation database

they use?

Page 5: 20160530 journal club_jqo

Supplementary information, Wadi et al.

Supplementary Figure 1: The majority of web-based pathway analysis tools are out of date. Shown are 21 pathway analysis tools and their time of update. Tools in the blue box have the latest updates prior to 2010.

�5

ConceptGene BayGO goSurfer

L2L GOToolBox

DAVID

EasyGO

GARNet

GeneTrail

GoMiner

FunSpec

GeneCodis

gsGator

WebGestalt

FuncAssociate

Babelomics

GREAT

PANTHER

goEast

ToppGene

g:Profiler

2006 2008 2010 2012 2014 2016

Only a few tools have been updated in the last 6 months

More than half have not been updated in 5 years!!

Page 6: 20160530 journal club_jqo

But most recent publications have not used tools with outdated annotation databases, right?

Page 7: 20160530 journal club_jqo

g:P

rofil

erG

oEas

tPA

NTH

ER

Topp

Gen

e

Bab

elom

ics

GR

EAT

Func

Ass

ocia

te

FunS

pec

Gen

eCod

isG

oMin

erW

ebge

stal

t

Bay

GO

DAV

IDE

asyG

OG

AR

Net

Gen

eTra

ilG

oSur

fer

GO

Tool

Box L2L

Con

cept

Gen

gsG

ator

% c

itatio

ns in

201

5

0

10

20

30

40

50

60

70

80

# c

itatio

ns in

201

5

UPDATE

< 6 mo < 2 yrs < 5 yrs > 5 yrs N/A

0

500

1000

1500

2000

2500

A

C

84% of 2015 publications cited tools that had not been updated within 5

years!!

Figure 1

Page 8: 20160530 journal club_jqo

Are outdated databases missing many new terms?

Page 9: 20160530 journal club_jqo

Probably yes!

• Human GO biological process annotations have more than doubled in 2009-2016

• Similar growth in GO molecular function & cell component

• Similar growth in Reactome (pathway database)

• similar growth in other speciesTe

rms

with

gen

e an

nota

tions

5000

10000

15000

20000

Year

SpeciesArabidopsisFlyHumanMouseYeast

09 10 11 12 13 14 15 16

B

Figure 1

Page 10: 20160530 journal club_jqo

In addition to improvements in quantity, have annotation databases improved in quality and

complexity?

Page 11: 20160530 journal club_jqo

Supplementary information, Wadi et al.

Supplementary Figure 3: Gene Ontology terms are increasingly specific.Histogram shows mean number of paths in the Gene Ontology connecting a given term and the root term. Significant increase in the depth of the GO hierarchy between 2009 and 2016 (p<10-5, permutation test) indicates that the biological vocabulary is increasingly detailed and terms are becoming more specific.

�7

0

5

10

15

20

2 4 6 8 10 12 14 16Mean length of path to root per GO term

% o

f ter

ms

Year20092016

3 5 7 9 11 13 15

Greater detail of the biological vocabulary

Page 12: 20160530 journal club_jqo

Supplementary information, Wadi et al.

Supplementary Figure 4: GO terms are increasingly interconnected.GO terms are connected to root term via significantly more paths in 2016 than in 2009 (p<10-5, permutation test), showing that biological concepts are increasingly interconnected. Complexity and depth of GO directly impacts enrichment analysis as gene annotations are hierarchically propagated.

�8

0.00

0.05

0.10

0.15

0 5 10Number of paths to root per GO term [log2]

Den

sity

Year20092016

Increasing interconnectedness of concepts

Page 13: 20160530 journal club_jqo

1

2.55

10204080

160320640

50 200 800 3200 12800

Pat

hway

s pe

r gen

e

2009 GO + REACTOME

50 200 800 3200 12800

2016 GO + REACTOME

Func

Ass

ocia

te

Median pathway size per gene

Pat

hway

s pe

r gen

e

DC

817 1144

Median value

0.000.050.100.150.20

Level density

1629 The median gene is

involved in 29 pathways

The size of the median functional gene set has

increased to 1,144 genes

Poorly annotated or highly specific genes?

Figure 1

Larger functional gene sets and better per-gene annotation

Page 14: 20160530 journal club_jqo

High-confidence experimental annotations are becoming more common

Fraction of poorly annotated human genes is decreasing

CategoryIEANon-IEAReactome

0

25

50

75

100

2009 2012 2016Year

Val

ue (%

)

DFigure 1

Reactome: manual literature curation IEA = inferred from electronic annotation —> lower confidence

Page 15: 20160530 journal club_jqo

Supplementary information, Wadi et al.

Supplementary Figure 7: Dark matter of protein-coding genome is decreasing.The fraction of human genes with no annotations (i.e. dark matter) has increased consistently over time. Dark matter genes include human protein-coding genes from the most recent CCDS release that have no Reactome or GO annotations, or only have root-level GO annotations.

�11

0

4

8

12

Year

% o

f dar

k m

atte

r gen

es in

CC

DS

09 10 11 12 13 14 15 16

Less genes with no annotation

Supplementary information, Wadi et al.

Supplementary Figure 8: Annotation analysis is affected by mismatches in gene symbols.Use of out-dated annotations with current human gene lists will cause increasingly common mismatches of gene symbols as standard nomenclature is updated. Symbols of protein-coding genes from the latest (2015) release are compared to earlier releases of CCDS.

�12

0.0

2.5

5.0

7.5

10.0

12.5

2009 2011 2012 2013 2014Year

Gen

e sy

mbo

ls n

ot fo

und

in 2

015

CC

DS

(%)

What happens if you try to find the function of a

nomenclature-updated gene (e.g. UCSC) using an outdated annotation

database?

But this is not the only reason for having non-annotated

genes…

Page 16: 20160530 journal club_jqo

Impact of outdated annotation: examples

Page 17: 20160530 journal club_jqo

• Example #1: breast cancer cell lines

• Top essential genes of 77 breast cancer cell lines (essential genes derived from shRNA screens (Marcotte et al. 2016 Cell)

• Enrichment analysis using 2010 and 2016 annotation

• Robust results:

• top 500 and top 100 essential genes

• GO + Reactome as well as GO and Reactome separately

Page 18: 20160530 journal club_jqo

Supplementary information, Wadi et al.

Supplementary Figure 11: Analysis of GO and Reactome pathways with breast cancer data confirms loss of information when using outdated software. Bar plots show the number of 2010-only, 2016-only and commonly detected enriched pathways for GO (top) and Reactome (bottom).

�15

Cell line

Num

ber o

f sig

nific

ant t

erm

s

Reactome: 2010 vs 2016

Num

ber o

f sig

nific

ant t

erm

s

GO: 2010 vs 2016Type

BothNewOld

A.

B.

0

200

400

600

mda

mb4

15su

m10

2su

m18

5su

m13

15hc

c218

5su

m44

evsa

tau

565

cal1

20hc

c221

8m

dam

b157

ocub

mhc

c156

9zr

751

hcc1

008

hcc2

688

mda

mb1

34vi

hcc1

395

mcf

10a

hcc3

8hc

c118

7hc

c159

9zr

75b

X184

b5su

m15

9hc

c315

3hb

l100

skbr

3su

m19

0kp

l1bt

20m

cf7

ly2

t47d

X184

a1m

fm22

3bt

483

skbr

7su

m22

5ca

l148

cal5

1sk

br5

hs57

8thc

c202

du44

75jim

t1m

dam

b361

sw52

7hc

c150

0m

dam

b453

X600

mpe

sum

229

hcc1

428

mcf

12a

cal8

51hc

c114

3m

dam

b468

mda

mb4

36bt

474

hcc1

419

bt54

9m

dam

b231

hcc7

0su

m52

mda

mb3

30ef

m19

2ahc

c193

7hc

c195

4ua

cc89

3ef

m19

mac

ls2

sum

149

cam

a1hd

qp1

mb1

57hc

c180

6m

x1

0

100

200

300

400

mda

mb4

15hc

c218

5su

m13

15su

m18

5ca

l120

sum

44m

dam

b157

evsa

tsu

m10

2au

565

ocub

mhc

c268

8m

dam

b134

vihc

c156

9hc

c221

8m

fm22

3m

cf10

ahc

c100

8zr

751

hcc1

599

zr75

bhc

c38

X184

a1su

m19

0X1

84b5

hcc1

395

hs57

8tsk

br3

mda

mb2

31hc

c315

3hb

l100

mda

mb4

53hc

c118

7kp

l1sk

br5

mda

mb4

68su

m22

5bt

20m

cf7

skbr

7m

dam

b361

cal1

48X6

00m

pehc

c202

du44

75ca

l851

sum

159

bt54

9t4

7dm

cf12

ahc

c142

8bt

483

sw52

7ef

m19

2asu

m52

hdqp

1hc

c114

3ly

2hc

c70

hcc1

500

hcc1

954

sum

229

bt47

4hc

c141

9m

dam

b436

uacc

893

cam

a1m

dam

b330

cal5

1hc

c193

7m

b157

jimt1

mac

ls2

hcc1

806

sum

149

mx1

efm

19

High proportion of enriched terms of 2016 are missed when using old annotation

Page 19: 20160530 journal club_jqo

• Example #2: glioblastoma brain cancer

• 75 significantly mutated driver genes derived from 6,800 tumors (Tamborero et al. 2010 Sci Rep; Rubio-Perez et al. 2015 Cancer Cell)

• including EGFR, PIK3CA, PIK3R1, PTEN, TP53, NF1, RB1

• g

Page 20: 20160530 journal club_jqo

Hematopoiesis/T-cell activation

only 2016only year XBoth

Pro

porti

on o

f ter

ms

(%)

0

25

50

75

100

2009 2010 2011 2012 2013 2014 2015

GO termsF

Supplementary information, Wadi et al.

Supplementary Figure 12: Evolution of pathway information affects recently updated and out-of-date software tools. Analysis of glioblastoma genes shows fraction of commonly detected (yellow), 2016-only (purple) and outdated-only (dark blue) Reactome pathways detected as significantly enriched. Significantly mutated glioblastoma genes were analyzed for pathway enrichments using annotations from 2009-2015 and compared to pathways detected using 2016 annotations.

�16

Year

Prop

ortio

n of

term

s (%

)

Reactome terms

0

25

50

75

100

2009 2010 2011 2012 2013 2014 2015

Figure 1

Enrichment analysis with outdated annotations missed many pathways detected with 2016 data

The 2010 era annotations (which the DAVID software uses) only capture ~20% of the 2016 results (127/827 GO biological

process and 16/128 Reactome pathways)

Page 21: 20160530 journal club_jqo

Tube

Immunesignalling

Histones &chromatin

Cell-cellsignalling

Responseto stimulus

Protein modification

Metabolism

Transcription

SignallingCell migration

Development

CNS

Insulin response

Steroid*

VEGF*

PDGF*

TGFβ*

ERBB*

NGFPI3K*

EGFR*

EGFRvIII*

FGFR*

ERK*

MAPK*

RAS*/RHO

NOTCH*

Neurons,glial cells

Kinase signalling

DNA replication/CC checkpoint

Circadianclock

Cognition/ learning/behaviour

Hematopoiesis/T-cell activation

Catabolism

Apoptosis/Neuron death

GABAergicsynatic transmission

Proliferation/stem cells

Glucose import/transport

Adhesion

Homeostasis

Endocytosis

Protein import/localization

Cell cycle/mitosis

Woundhealing

Cellular componentorganization

Embryo

Sex

Head

only 2016

2009 2010 2011 2012 2013 2014 2015

GFigure 1

only 2016only year XBoth

2009 2010 2011 2012 2013 2014 2015

Page 22: 20160530 journal club_jqo

Summary• gene annotations have become substantially broader, more specific and of higher-quality over the

last 7 years, covering more protein-coding genes and reducing the number of un-annotated genes

• out of 21 enrichment tools evaluated, more than half use annotation databases not updated for >5 years

• >70% of 2015 papers used enrichment analysis tools not updated for >5 years, implying that 1,000s of recent studies have underestimated the functional significance of their genes/proteins

• avoid outdated tools like DAVID! not updated in >5 years but used in a high proportion of recent papers

• only a few enrichment analysis tools use annotation databases updated in the last 6 months —better use this: g:Profiler, Panther, ToppGene, GREAT, goEast

• “only” 21 analysis tools evaluated —check the last update of your favourite tool

• benefit from re-analysing the data…

• side effect: the rapid growth of annotation may complicate replicability when using frequently updated annotations