Upload
javier-quilez-oliete
View
189
Download
1
Embed Size (px)
Citation preview
Impact of knowledge accumulation on pathway enrichment analysis
Lina Wadi1, Mona Meyer1, Joel Weiser1, Lincoln D. Stein1,3, Jüri Reimand1,2,*
1 - Ontario Institute for Cancer Research 2 - Department of Medical Biophysics, University Of Toronto
3 - Department of Molecular Genetics, University Of Toronto * -
Pathway enrichment analysis GO analysis
functional enrichment analysis …
High-Throughput experiment
Candidate genes
or proteins
100s of them —what can they tell us globally?
Annotation database (e.g. GO)
ApoptosisCell death
Annotation terms
(e.g. biological process)
Enriched annotated
terms
biological role of thecandidate genes/proteins
many tools…
genes associated with apoptosis
all other genes
Candidate genes
or proteins
75%
Whole genome or
assayed part of it
25%
Genes/proteins related to apoptosis
Enrichment analysis
The content of the annotation database is key for the enrichment analysis
High-Throughput experiment
Candidate genes
or proteins
100s of them —what can they tell us globally?
Annotation database (e.g. GO)
ApoptosisCell death
Annotation terms
(e.g. biological process)
Enriched annotated
terms
biological role of thecandidate genes/proteins
many tools…
Annotation databases constantly grow: gene ontology (GO) is
updated daily, Reactome quarterly
How often do these tools update the version of the annotation database
they use?
Supplementary information, Wadi et al.
Supplementary Figure 1: The majority of web-based pathway analysis tools are out of date. Shown are 21 pathway analysis tools and their time of update. Tools in the blue box have the latest updates prior to 2010.
�5
ConceptGene BayGO goSurfer
L2L GOToolBox
DAVID
EasyGO
GARNet
GeneTrail
GoMiner
FunSpec
GeneCodis
gsGator
WebGestalt
FuncAssociate
Babelomics
GREAT
PANTHER
goEast
ToppGene
g:Profiler
2006 2008 2010 2012 2014 2016
Only a few tools have been updated in the last 6 months
More than half have not been updated in 5 years!!
But most recent publications have not used tools with outdated annotation databases, right?
g:P
rofil
erG
oEas
tPA
NTH
ER
Topp
Gen
e
Bab
elom
ics
GR
EAT
Func
Ass
ocia
te
FunS
pec
Gen
eCod
isG
oMin
erW
ebge
stal
t
Bay
GO
DAV
IDE
asyG
OG
AR
Net
Gen
eTra
ilG
oSur
fer
GO
Tool
Box L2L
Con
cept
Gen
gsG
ator
% c
itatio
ns in
201
5
0
10
20
30
40
50
60
70
80
# c
itatio
ns in
201
5
UPDATE
< 6 mo < 2 yrs < 5 yrs > 5 yrs N/A
0
500
1000
1500
2000
2500
A
C
84% of 2015 publications cited tools that had not been updated within 5
years!!
Figure 1
Are outdated databases missing many new terms?
Probably yes!
• Human GO biological process annotations have more than doubled in 2009-2016
• Similar growth in GO molecular function & cell component
• Similar growth in Reactome (pathway database)
• similar growth in other speciesTe
rms
with
gen
e an
nota
tions
5000
10000
15000
20000
Year
SpeciesArabidopsisFlyHumanMouseYeast
09 10 11 12 13 14 15 16
B
Figure 1
In addition to improvements in quantity, have annotation databases improved in quality and
complexity?
Supplementary information, Wadi et al.
Supplementary Figure 3: Gene Ontology terms are increasingly specific.Histogram shows mean number of paths in the Gene Ontology connecting a given term and the root term. Significant increase in the depth of the GO hierarchy between 2009 and 2016 (p<10-5, permutation test) indicates that the biological vocabulary is increasingly detailed and terms are becoming more specific.
�7
0
5
10
15
20
2 4 6 8 10 12 14 16Mean length of path to root per GO term
% o
f ter
ms
Year20092016
3 5 7 9 11 13 15
Greater detail of the biological vocabulary
Supplementary information, Wadi et al.
Supplementary Figure 4: GO terms are increasingly interconnected.GO terms are connected to root term via significantly more paths in 2016 than in 2009 (p<10-5, permutation test), showing that biological concepts are increasingly interconnected. Complexity and depth of GO directly impacts enrichment analysis as gene annotations are hierarchically propagated.
�8
0.00
0.05
0.10
0.15
0 5 10Number of paths to root per GO term [log2]
Den
sity
Year20092016
Increasing interconnectedness of concepts
1
2.55
10204080
160320640
50 200 800 3200 12800
Pat
hway
s pe
r gen
e
2009 GO + REACTOME
50 200 800 3200 12800
2016 GO + REACTOME
Func
Ass
ocia
te
Median pathway size per gene
Pat
hway
s pe
r gen
e
DC
817 1144
Median value
0.000.050.100.150.20
Level density
1629 The median gene is
involved in 29 pathways
The size of the median functional gene set has
increased to 1,144 genes
Poorly annotated or highly specific genes?
Figure 1
Larger functional gene sets and better per-gene annotation
High-confidence experimental annotations are becoming more common
Fraction of poorly annotated human genes is decreasing
CategoryIEANon-IEAReactome
0
25
50
75
100
2009 2012 2016Year
Val
ue (%
)
DFigure 1
Reactome: manual literature curation IEA = inferred from electronic annotation —> lower confidence
Supplementary information, Wadi et al.
Supplementary Figure 7: Dark matter of protein-coding genome is decreasing.The fraction of human genes with no annotations (i.e. dark matter) has increased consistently over time. Dark matter genes include human protein-coding genes from the most recent CCDS release that have no Reactome or GO annotations, or only have root-level GO annotations.
�11
0
4
8
12
Year
% o
f dar
k m
atte
r gen
es in
CC
DS
09 10 11 12 13 14 15 16
Less genes with no annotation
Supplementary information, Wadi et al.
Supplementary Figure 8: Annotation analysis is affected by mismatches in gene symbols.Use of out-dated annotations with current human gene lists will cause increasingly common mismatches of gene symbols as standard nomenclature is updated. Symbols of protein-coding genes from the latest (2015) release are compared to earlier releases of CCDS.
�12
0.0
2.5
5.0
7.5
10.0
12.5
2009 2011 2012 2013 2014Year
Gen
e sy
mbo
ls n
ot fo
und
in 2
015
CC
DS
(%)
What happens if you try to find the function of a
nomenclature-updated gene (e.g. UCSC) using an outdated annotation
database?
But this is not the only reason for having non-annotated
genes…
Impact of outdated annotation: examples
• Example #1: breast cancer cell lines
• Top essential genes of 77 breast cancer cell lines (essential genes derived from shRNA screens (Marcotte et al. 2016 Cell)
• Enrichment analysis using 2010 and 2016 annotation
• Robust results:
• top 500 and top 100 essential genes
• GO + Reactome as well as GO and Reactome separately
Supplementary information, Wadi et al.
Supplementary Figure 11: Analysis of GO and Reactome pathways with breast cancer data confirms loss of information when using outdated software. Bar plots show the number of 2010-only, 2016-only and commonly detected enriched pathways for GO (top) and Reactome (bottom).
�15
Cell line
Num
ber o
f sig
nific
ant t
erm
s
Reactome: 2010 vs 2016
Num
ber o
f sig
nific
ant t
erm
s
GO: 2010 vs 2016Type
BothNewOld
A.
B.
0
200
400
600
mda
mb4
15su
m10
2su
m18
5su
m13
15hc
c218
5su
m44
evsa
tau
565
cal1
20hc
c221
8m
dam
b157
ocub
mhc
c156
9zr
751
hcc1
008
hcc2
688
mda
mb1
34vi
hcc1
395
mcf
10a
hcc3
8hc
c118
7hc
c159
9zr
75b
X184
b5su
m15
9hc
c315
3hb
l100
skbr
3su
m19
0kp
l1bt
20m
cf7
ly2
t47d
X184
a1m
fm22
3bt
483
skbr
7su
m22
5ca
l148
cal5
1sk
br5
hs57
8thc
c202
du44
75jim
t1m
dam
b361
sw52
7hc
c150
0m
dam
b453
X600
mpe
sum
229
hcc1
428
mcf
12a
cal8
51hc
c114
3m
dam
b468
mda
mb4
36bt
474
hcc1
419
bt54
9m
dam
b231
hcc7
0su
m52
mda
mb3
30ef
m19
2ahc
c193
7hc
c195
4ua
cc89
3ef
m19
mac
ls2
sum
149
cam
a1hd
qp1
mb1
57hc
c180
6m
x1
0
100
200
300
400
mda
mb4
15hc
c218
5su
m13
15su
m18
5ca
l120
sum
44m
dam
b157
evsa
tsu
m10
2au
565
ocub
mhc
c268
8m
dam
b134
vihc
c156
9hc
c221
8m
fm22
3m
cf10
ahc
c100
8zr
751
hcc1
599
zr75
bhc
c38
X184
a1su
m19
0X1
84b5
hcc1
395
hs57
8tsk
br3
mda
mb2
31hc
c315
3hb
l100
mda
mb4
53hc
c118
7kp
l1sk
br5
mda
mb4
68su
m22
5bt
20m
cf7
skbr
7m
dam
b361
cal1
48X6
00m
pehc
c202
du44
75ca
l851
sum
159
bt54
9t4
7dm
cf12
ahc
c142
8bt
483
sw52
7ef
m19
2asu
m52
hdqp
1hc
c114
3ly
2hc
c70
hcc1
500
hcc1
954
sum
229
bt47
4hc
c141
9m
dam
b436
uacc
893
cam
a1m
dam
b330
cal5
1hc
c193
7m
b157
jimt1
mac
ls2
hcc1
806
sum
149
mx1
efm
19
High proportion of enriched terms of 2016 are missed when using old annotation
• Example #2: glioblastoma brain cancer
• 75 significantly mutated driver genes derived from 6,800 tumors (Tamborero et al. 2010 Sci Rep; Rubio-Perez et al. 2015 Cancer Cell)
• including EGFR, PIK3CA, PIK3R1, PTEN, TP53, NF1, RB1
• g
Hematopoiesis/T-cell activation
only 2016only year XBoth
Pro
porti
on o
f ter
ms
(%)
0
25
50
75
100
2009 2010 2011 2012 2013 2014 2015
GO termsF
Supplementary information, Wadi et al.
Supplementary Figure 12: Evolution of pathway information affects recently updated and out-of-date software tools. Analysis of glioblastoma genes shows fraction of commonly detected (yellow), 2016-only (purple) and outdated-only (dark blue) Reactome pathways detected as significantly enriched. Significantly mutated glioblastoma genes were analyzed for pathway enrichments using annotations from 2009-2015 and compared to pathways detected using 2016 annotations.
�16
Year
Prop
ortio
n of
term
s (%
)
Reactome terms
0
25
50
75
100
2009 2010 2011 2012 2013 2014 2015
Figure 1
Enrichment analysis with outdated annotations missed many pathways detected with 2016 data
The 2010 era annotations (which the DAVID software uses) only capture ~20% of the 2016 results (127/827 GO biological
process and 16/128 Reactome pathways)
Tube
Immunesignalling
Histones &chromatin
Cell-cellsignalling
Responseto stimulus
Protein modification
Metabolism
Transcription
SignallingCell migration
Development
CNS
Insulin response
Steroid*
VEGF*
PDGF*
TGFβ*
ERBB*
NGFPI3K*
EGFR*
EGFRvIII*
FGFR*
ERK*
MAPK*
RAS*/RHO
NOTCH*
Neurons,glial cells
Kinase signalling
DNA replication/CC checkpoint
Circadianclock
Cognition/ learning/behaviour
Hematopoiesis/T-cell activation
Catabolism
Apoptosis/Neuron death
GABAergicsynatic transmission
Proliferation/stem cells
Glucose import/transport
Adhesion
Homeostasis
Endocytosis
Protein import/localization
Cell cycle/mitosis
Woundhealing
Cellular componentorganization
Embryo
Sex
Head
only 2016
2009 2010 2011 2012 2013 2014 2015
GFigure 1
only 2016only year XBoth
2009 2010 2011 2012 2013 2014 2015
Summary• gene annotations have become substantially broader, more specific and of higher-quality over the
last 7 years, covering more protein-coding genes and reducing the number of un-annotated genes
• out of 21 enrichment tools evaluated, more than half use annotation databases not updated for >5 years
• >70% of 2015 papers used enrichment analysis tools not updated for >5 years, implying that 1,000s of recent studies have underestimated the functional significance of their genes/proteins
• avoid outdated tools like DAVID! not updated in >5 years but used in a high proportion of recent papers
• only a few enrichment analysis tools use annotation databases updated in the last 6 months —better use this: g:Profiler, Panther, ToppGene, GREAT, goEast
• “only” 21 analysis tools evaluated —check the last update of your favourite tool
• benefit from re-analysing the data…
• side effect: the rapid growth of annotation may complicate replicability when using frequently updated annotations