STOP Using Just GO: A Multi-Ontology Hypothesis Generation Tool for High Throughput Experimentation
Tobias Wittkop, Emily TerAvest, Uday Evani, K. Mathew Fleisch, Ari E. Berman, Corey Powell, Nigam Shah and Sean D. Mooney
Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
National Center for Biomedical Ontology, Stanford University, Stanford
Gene set analysis
Experiment
• Microarray
• RNASeq
• RNAi
• Yeast-2-hybrid
• ...
List of genes
A2M, ABL1, ADCY5, AGPAT2, AIFM1, AKT1, APEX1, APOC3, APOE, APP, APTX, AR, ARHGAP1, ARNTL, ATF2, ATM, ATP5O, ATR, BAX, BCL2, BDNF, BLM, BMI1, BRCA1, BRCA2, BSCL2, BUB1B, BUB3, CACNA1A, CAT, CCNA2, CDC2, CDC42, CDKN2A, CEBPA, CEBPB, CHEK2, CLOCK,
Hypothesis generation
• Pathway analysis
• Gene Ontology enrichment analysis
• GSEA
Gene annotations outside of GO
• NCBO currently includes over 200 ontologies
• Manually curated gene-term annotations, which are necessary for term enrichment, are not available for the other ontologies
• Gene/protein summaries in Entrez Gene and UniProt are often more up to date than manually curated GO annotations
• NCBO provides an Annotator service that matches text to ontology terms
Idea: Use descriptive text for genes to retrieve up-to-date annotations from genes to a wide spectrum of ontologies
Automatic gene annotation pipeline
1. Collect genome/proteome from UCSC and UniProt
2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
3. Annotate text to over 200 ontologies via NCBO Annotator
Gene/protein specific text as annotation source
• Gene text from Entrez Gene
• Protein text from UniProt
• Gene/Protein summary
• Publication titles
• GO annotations
• Pathway annotations
• GeneRIFs
• Protein complexes, domains, interactions
• We filter for author names, db names, numbers
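The filtering step in the last bullet could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `DB_NAMES` stoplist, the `filter_text` function, and the whitespace tokenisation are all assumptions made for the example; the slides do not describe how the filter is actually built.

```python
# Hypothetical stoplist of database names; the real list is not given in the slides.
DB_NAMES = {"kegg", "ucsc", "genecards", "hgnc", "interpro", "pfam", "prosite"}


def filter_text(text, author_names=()):
    """Drop database names, author names and purely numeric tokens."""
    blocked = DB_NAMES | {name.lower() for name in author_names}
    kept = []
    for raw in text.split():
        token = raw.strip(".,;:()")  # remove surrounding punctuation
        if not token:
            continue
        if token.lower() in blocked:
            continue  # database or author name
        has_digit = any(ch.isdigit() for ch in token)
        has_alpha = any(ch.isalpha() for ch in token)
        if has_digit and not has_alpha:
            continue  # purely numeric token
        kept.append(token)
    return " ".join(kept)


print(filter_text("KEGG; 122830; Catalytic subunit of the NatC complex"))
```

Only the remaining words would then be passed on to the annotation step.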
Q147X3 human
The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. ; A quantitative atlas of mitotic phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal acetyltransferases: nomenclature, subunits and substrates. ; Knockdown of human N alpha-terminal acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b localization. ; Lysine acetylation targets protein complexes and co-regulates major cellular functions. ;
-!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and function of ARL8B.
-!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA.
-!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is composed of NAA35, LSMD1 and NAA30.
-!- SUBCELLULAR LOCATION: Cytoplasm.
-!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental confirmation available;
-!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily.
-!- SIMILARITY: Contains 1 N-acetyltransferase domain.
KEGG; hsa:122830; -.
UCSC; uc001xcx.2; human.
CTD; 122830; -.
GeneCards; GC14P038022; -.
H-InvDB; HIX0011696; -.
HGNC; HGNC:19844; NAA30.
neXtProt; NX_Q147X3; -.
PharmGKB; PA134931315; -.
eggNOG; prNOG15463; -.
GeneTree; ENSGT00390000005665; -.
HOGENOM; HBG282398; -.
HOVERGEN; HBG082671; -.
InParanoid; Q147X3; -.
OMA; AGVHSGE; -.
OrthoDB; EOG4KKZ4S; -.
PhylomeDB; Q147X3; -.
NextBio; 81013; -.
ArrayExpress; Q147X3; -.
Bgee; Q147X3; -.
CleanEx; HS_NAT12; -.
Genevestigator; Q147X3; -.
GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell.
GO; GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC.
InterPro; IPR000182; AcTrfase_GCN5-related_dom.
InterPro; IPR016181; Acyl_CoA_acyltransferase.
Gene3D; G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1.
Pfam; PF00583; Acetyltransf_1; 1.
SUPFAM; SSF55729; Acyl_CoA_acyltransferase; 1.
PROSITE; PS51186; GNAT; 1.
The NCBO annotator
• Simple string matching using mgrep
• Synonyms are annotated
• Annotations are propagated to the root
• Mappings between terms from different ontologies
• No NLP
• Very fast
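The matching behaviour listed above (plain string matching, synonyms, propagation to the root) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the three-term `ONTOLOGY`, its `T1`..`T3` identifiers, and the regex-based matcher are invented and stand in for mgrep and the real NCBO ontologies, whose internals are not described in the slides. Cross-ontology mappings are not shown.

```python
import re

# Toy ontology with made-up ids: each term has labels (name + synonyms) and parents.
ONTOLOGY = {
    "T1": {"labels": ["biological process"], "parents": []},
    "T2": {"labels": ["metabolic process", "metabolism"], "parents": ["T1"]},
    "T3": {"labels": ["cocaine metabolic process", "cocaine metabolism"],
           "parents": ["T2"]},
}


def ancestors(term_id):
    """Return every ancestor of term_id, following parents up to the root."""
    seen = set()
    stack = list(ONTOLOGY[term_id]["parents"])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ONTOLOGY[current]["parents"])
    return seen


def annotate(text):
    """Match labels and synonyms as whole phrases, then add all ancestors."""
    lowered = text.lower()
    direct = set()
    for term_id, term in ONTOLOGY.items():
        for label in term["labels"]:
            if re.search(r"\b" + re.escape(label) + r"\b", lowered):
                direct.add(term_id)
    propagated = set(direct)
    for term_id in direct:
        propagated |= ancestors(term_id)
    return direct, propagated


title = ("Structural basis of heroin and cocaine metabolism by a "
         "promiscuous human drug-processing enzyme")
direct, propagated = annotate(title)
print(sorted(direct), sorted(propagated))
```

Because no language processing is involved, a phrase is annotated whenever it appears verbatim, which is consistent with the high recall and lower precision reported in the evaluation.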
Figure: example ontologies and terms (Gene Ontology; Biological process; Apoptosis signaling; Molecular function; Cellular function; Cell cycle ontology; Biological process; DNA replication initiation; Cytokinetic process; Biological continuant; Acetyltransferase)
1 C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)
561,577,156 annotations of 73,248 genes and 146,271 proteins to 404,347 terms from 246 ontologies for 4 organisms (human, mouse, fly and worm)
Annotation results
10 Most annotated ontologies
Annotation counts shown in the chart: 12,584,050; 13,177,577; 13,341,529; 13,526,445; 14,541,946; 15,937,632; 16,338,723; 33,628,760; 34,137,453; 35,767,064
Ontologies shown in the chart: SNOMED Clinical Terms; NCI Thesaurus; NIFSTD; CRISP Thesaurus, 2006; Read Codes, Clinical Terms Version 3 (CTV3); Galen; Suggested Ontology for Pharmacogenomics; Gene Ontology Extension; Human developmental anatomy, timed version; Gene Ontology
STOP/GO evaluation
• Compared the existing GO annotations of genes/proteins with our automated GO annotations
• High recall 0.95-0.99 (we reproduce existing annotations)
• Lower precision 0.6-0.75 (we add new annotations)
• How good are the novel annotations?
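The recall and precision figures above follow the usual set-based definitions, which can be sketched as below. The function name `precision_recall` and the (protein, term) pairs are made up for illustration; they are not data from the evaluation.

```python
def precision_recall(predicted, curated):
    """Compare predicted (protein, term) pairs with curated ones."""
    predicted, curated = set(predicted), set(curated)
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical example: all curated pairs are reproduced, plus one novel pair.
curated = {("P1", "GO:A"), ("P1", "GO:B"), ("P2", "GO:C")}
predicted = curated | {("P2", "GO:D")}
print(precision_recall(predicted, curated))  # (0.75, 1.0)
```

In this setting a "false positive" is any automated annotation missing from the curated set, so a novel but correct annotation lowers precision just as a wrong one does. That is why the quality of the novel annotations has to be checked separately.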
Novel GO annotation examples
Human protein carboxylesterase 1 (P23141) annotated to “cocaine metabolic process” (GO:0050783) based on the title of a reference paper:
“Structural basis of heroin and cocaine metabolism by a promiscuous human drug-processing enzyme”
C. elegans protein ATP-dependent Clp protease proteolytic subunit 1 (Q27539) annotated to “mitochondrial unfolded protein response” (GO:0034514) based on the title of a reference paper:
“ClpP mediates activation of a mitochondrial unfolded protein response in C. elegans”
Critical assessment of functional annotations (CAFA)
January 2010
June 2010
• Collect all proteins from UniProt that have no GO annotation with experimental evidence
• Submission of predicted GO annotations
• Collect novel experimental GO annotations for the same proteins
• Compare predictions with novel experimentally validated annotations
The Mooney group was an assessing group for CAFA
CAFA results
• Our annotations ranked 15th out of 40 when predicting Molecular Function (MFO) annotations: F-measure 0.4 (best 0.48)
• Our annotations ranked 7th out of 36 when predicting Biological Process (BPO) annotations: F-measure 0.33 (best 0.35)
1 http://biofunctionprediction.org/
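For reference, the F-measure quoted above is presumably the usual harmonic mean of precision $P$ and recall $R$. The slides do not say at which prediction threshold it was computed, so the formula below is only the generic definition:

```latex
F = \frac{2 \, P \, R}{P + R}
```

Under this definition a method with high recall but low precision, or the reverse, still receives a low score, since $F$ is never larger than twice the smaller of the two values.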
Statistical Tracking of Ontological Phrases (STOP)
http://mooneygroup.org/stop
• Results as table
• Termcloud
• Revisit previous results
• Entrez gene id, gene symbol or UniProt id
• Custom background
• Filter results by ontology
Enrichment analysis web application using automated annotations
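To make the enrichment step concrete, here is a minimal sketch of a term enrichment test. It assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis; the slides do not state which statistic or multiple-testing correction STOP uses, and the function name `enrichment_pvalue` is hypothetical.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """P(X >= k) under the hypergeometric distribution.

    k: genes in the input list annotated with the term
    n: size of the input gene list
    K: genes in the background annotated with the term
    N: size of the background (for example a custom background)
    """
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers: 5 of 5 input genes carry a term that 5 of 10 background genes carry.
print(enrichment_pvalue(5, 5, 5, 10))  # 1/252, about 0.004
```

The background size N matters: the same overlap is less surprising against a small custom background than against the whole genome.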
Summary
STOP using just GO .... because we provide
• Automated annotations of genes to terms from over 200 ontologies
• Enrichment analysis on novel annotations corrects for putative false positives and expands the realm of testable hypotheses
• Easy-to-use web interface allows quick analysis and assists in navigation of results
• Human, worm, mouse, fly (...many more to come)
• http://mooneygroup.org/stop
Acknowledgements
Buck Institute for Research on Aging: Sean Mooney, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell
NCBO: Nigam Shah and Trish Whetzel
CAFA: Iddo Friedberg, Predrag Radivojac, Wyatt Clark
Funding: NIH R01 LM009722 (PI: Mooney), NIH U54-HG004028 (PI: Musen), NIH T32-AG000266 (PIs: Campisi, Ellerby), NIH UL1DE019608 (PI: Lithgow), NIH RL9AG032114 (U54 Geroscience), the NCBO and the Buck Trust.
http://mooneygroup.org/stop