STOP Using Just GO: A Multi-Ontology Hypothesis Generation Tool for High Throughput Experimentation

Tobias Wittkop, Emily TerAvest, Uday Evani, K. Mathew Fleisch, Ari E. Berman, Corey Powell, Nigam Shah and Sean D. Mooney

Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
National Center for Biomedical Ontology, Stanford University, Stanford,



Gene set analysis

Experiment

• Microarray

• RNASeq

• RNAi

• Yeast-2-hybrid

• ...

List of genes

A2M, ABL1, ADCY5, AGPAT2, AIFM1, AKT1, APEX1, APOC3, APOE, APP, APTX, AR, ARHGAP1, ARNTL, ATF2, ATM, ATP5O, ATR, BAX, BCL2, BDNF, BLM, BMI1, BRCA1, BRCA2, BSCL2, BUB1B, BUB3, CACNA1A, CAT, CCNA2, CDC2, CDC42, CDKN2A, CEBPA, CEBPB, CHEK2, CLOCK, ...

Hypothesis generation

• Pathway analysis

• Gene Ontology enrichment analysis

• GSEA

Gene annotations outside of GO

• NCBO currently includes over 200 ontologies

• Manually curated gene-term annotations, which are necessary for term enrichment, are not available for the other ontologies

• Gene/protein summaries in Entrez Gene and UniProt are often more up to date than manually curated GO annotations

• NCBO provides an Annotator service that matches text to ontology terms


Idea: Use descriptive text about genes to derive up-to-date annotations that link genes to terms from a wide spectrum of ontologies


Automatic gene annotation pipeline

1. Collect genome/proteome from UCSC and UniProt

2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt

Example of collected text (UniProt entry Q147X3, human):

Q147X3 human The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. ; A quantitative atlas of mitotic phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal acetyltransferases: nomenclature, subunits and substrates. ; Knockdown of human N alpha-terminal acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b localization. ; Lysine acetylation targets protein complexes and co-regulates major cellular functions. ; -!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and function of ARL8B. -!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA. -!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is composed of NAA35, LSMD1 and NAA30. -!- SUBCELLULAR LOCATION: Cytoplasm. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental confirmation available; -!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily. -!- SIMILARITY: Contains 1 N-acetyltransferase domain. -. KEGG; hsa:122830; -. UCSC; uc001xcx.2; human. CTD; 122830; -. GeneCards; GC14P038022; -. H-InvDB; HIX0011696; -. HGNC; HGNC:19844; NAA30. neXtProt; NX_Q147X3; -. PharmGKB; PA134931315; -. eggNOG; prNOG15463; -. GeneTree; ENSGT00390000005665; -. HOGENOM; HBG282398; -. HOVERGEN; HBG082671; -. InParanoid; Q147X3; -. OMA; AGVHSGE; -. OrthoDB; EOG4KKZ4S; -. PhylomeDB; Q147X3; -. NextBio; 81013; -. ArrayExpress; Q147X3; -. Bgee; Q147X3; -. CleanEx; HS_NAT12; -. Genevestigator; Q147X3; -. GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell. GO; GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC. InterPro; IPR000182; AcTrfase_GCN5-related_dom. InterPro; IPR016181; Acyl_CoA_acyltransferase. Gene3D; G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1. Pfam; PF00583; Acetyltransf_1; 1. SUPFAM; SSF55729; Acyl_CoA_acyltransferase; 1. PROSITE; PS51186; GNAT; 1.

3. Annotate text to over 200 ontologies via NCBO Annotator

Example ontology terms shown in the figure:

• Gene Ontology: Biological process, Apoptosis signaling, Molecular function, Cellular function

• Cell cycle ontology: Biological process, DNA replication initiation, Cytokinetic process, Biological continuant, Acetyltransferase
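The three pipeline steps can be sketched as a small driver. This is only an illustration under stated assumptions: `GeneRecord`, `run_pipeline`, `toy_fetch_text` and `toy_annotate` are hypothetical names, and the toy functions stand in for the real Entrez Gene/UniProt downloads and the NCBO Annotator, neither of which is called here.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class GeneRecord:
    """One gene/protein with its descriptive text and derived term annotations."""
    identifier: str
    organism: str
    text: str = ""
    terms: Set[str] = field(default_factory=set)


def run_pipeline(
    identifiers: Iterable[str],
    organism: str,
    fetch_text: Callable[[str], str],
    annotate: Callable[[str], Set[str]],
) -> List[GeneRecord]:
    """Step 1 supplies the identifiers, step 2 fetches text, step 3 annotates it."""
    records = []
    for identifier in identifiers:
        record = GeneRecord(identifier=identifier, organism=organism)
        record.text = fetch_text(identifier)   # step 2: descriptive text
        record.terms = annotate(record.text)   # step 3: text -> ontology terms
        records.append(record)
    return records


# Toy stand-ins (assumptions, not the real data sources or annotator).
TOY_TEXT = {"Q147X3": "Catalytic subunit of the N-terminal acetyltransferase C complex."}
TOY_TERMS = {"acetyltransferase"}


def toy_fetch_text(identifier: str) -> str:
    return TOY_TEXT.get(identifier, "")


def toy_annotate(text: str) -> Set[str]:
    lowered = text.lower()
    return {term for term in TOY_TERMS if term in lowered}


records = run_pipeline(["Q147X3"], "human", toy_fetch_text, toy_annotate)
```

Passing the text source and the annotator in as functions keeps the driver independent of any particular service, so each stage can be swapped or tested separately.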


Gene/protein specific text as annotation source

• Gene text from Entrez Gene

• Protein text from UniProt

• Gene/Protein summary

• Publication titles

• GO annotations

• Pathway annotations

• GeneRIFs

• Protein complexes, domains, interactions

• We filter out author names, database names and numbers
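The filtering step in the last bullet could look roughly like the sketch below. The stop list `DB_NAMES`, the function `clean_text` and the rule for what counts as a number are assumptions made for illustration; the slides do not specify the actual filters.

```python
import re

# Hypothetical stop list; the slides do not say which database names are removed.
DB_NAMES = {"kegg", "ucsc", "ctd", "genecards", "hgnc", "pfam", "interpro"}

# A bare number such as "2009", "122830" or "3.40".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def clean_text(text, author_names=frozenset()):
    """Drop author names, database names and bare numbers from free text."""
    authors = {name.lower() for name in author_names}
    kept = []
    for token in text.split():
        bare = token.strip(".,;:()[]").lower()
        if not bare:
            continue  # token consisted of punctuation only
        if NUMBER.fullmatch(bare):
            continue  # bare number
        if bare in DB_NAMES or bare in authors:
            continue  # database or author name
        kept.append(token)
    return " ".join(kept)


cleaned = clean_text(
    "Doe 2009 KEGG 122830 catalytic subunit of the NatC complex",
    author_names={"Doe"},
)
# cleaned == "catalytic subunit of the NatC complex"
```

Only tokens that are entirely numeric are dropped here, so mixed tokens such as "p53-dependent" survive.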


The NCBO annotator

• Simple string matching using mgrep

• Synonyms are annotated

• Annotations are propagated to the root

• Mappings between terms from different ontologies

• No NLP

• Very fast


¹ C. Jonquet et al., AMIA Summit on Translational Bioinformatics (2009)
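The annotator behaviour listed above (string matching, synonyms, propagation to the root) can be illustrated with a toy dictionary matcher. This is a sketch, not the NCBO Annotator or mgrep: `TOY_ONTOLOGY`, its term ids and the single-parent hierarchy are invented for the example, and real ontologies may give a term several parents.

```python
import re

# term id -> (preferred name, synonyms, parent id or None); hypothetical data.
TOY_ONTOLOGY = {
    "T:root": ("biological process", [], None),
    "T:death": ("cell death", [], "T:root"),
    "T:apoptosis": ("apoptosis", ["apoptotic process"], "T:death"),
}


def build_lexicon(ontology):
    """Map every preferred name and synonym (lowercased) to its term id."""
    lexicon = {}
    for term_id, (name, synonyms, _parent) in ontology.items():
        for label in [name, *synonyms]:
            lexicon[label.lower()] = term_id
    return lexicon


def ancestors(term_id, ontology):
    """Walk parent links up to the root."""
    result = []
    parent = ontology[term_id][2]
    while parent is not None:
        result.append(parent)
        parent = ontology[parent][2]
    return result


def annotate(text, ontology):
    """Return (directly matched terms, matched terms plus all their ancestors)."""
    lowered = text.lower()
    direct = set()
    for label, term_id in build_lexicon(ontology).items():
        # Plain whole-word string match, no linguistic processing.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            direct.add(term_id)
    propagated = set(direct)
    for term_id in direct:
        propagated.update(ancestors(term_id, ontology))
    return direct, propagated


direct, propagated = annotate(
    "p53-dependent apoptotic process and aberrant localization", TOY_ONTOLOGY
)
```

Here the text matches only through the synonym "apoptotic process", and propagation then adds the parent terms up to the root.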

Annotation results

561,577,156 annotations of 73,248 genes and 146,271 proteins to 404,347 terms from 246 ontologies for 4 organisms (human, mouse, fly and worm)

10 most annotated ontologies

Ontologies (chart labels): SNOMED Clinical Terms; NCI Thesaurus; NIFSTD; CRISP Thesaurus, 2006; Read Codes, Clinical Terms Version 3 (CTV3); Galen; Suggested Ontology for Pharmacogenomics; Gene Ontology Extension; Human developmental anatomy, timed version; Gene Ontology

Annotation counts (chart values): 12,584,050; 13,177,577; 13,341,529; 13,526,445; 14,541,946; 15,937,632; 16,338,723; 33,628,760; 34,137,453; 35,767,064

STOP/GO evaluation

• Compared the existing GO annotations of genes/proteins with our automated GO annotations

• High recall 0.95-0.99 (we reproduce existing annotations)

• Lower precision 0.6-0.75 (we add new annotations)

• How good are the novel annotations?
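The recall and precision figures above follow the usual set-based definitions, which can be sketched as below. The (protein, term) pairs are toy data chosen for illustration, and `precision_recall` is a hypothetical helper, not the evaluation code behind the reported numbers.

```python
def precision_recall(predicted, curated):
    """Precision and recall of predicted annotations against curated ones."""
    predicted, curated = set(predicted), set(curated)
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Toy (protein, GO term) pairs: two curated annotations, all reproduced,
# plus one extra automated annotation that is not in the curated set.
curated = {("P1", "GO:0005737"), ("P1", "GO:0004596")}
predicted = curated | {("P1", "GO:0050783")}

precision, recall = precision_recall(predicted, curated)
# recall is 1.0 (every curated annotation is reproduced);
# precision is 2/3 (the extra annotation counts against it, even if it is correct).
```

This is why novel but correct annotations lower the measured precision: they are absent from the curated reference.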

Novel GO annotation examples

• Human protein carboxylesterase 1 (P23141) is annotated to “cocaine metabolic process” (GO:0050783), based on the title of a reference paper: “Structural basis of heroin and cocaine metabolism by a promiscuous human drug-processing enzyme”

• C. elegans protein ATP-dependent Clp protease proteolytic subunit 1 (Q27539) is annotated to “mitochondrial unfolded protein response” (GO:0034514), based on the title of a reference paper: “ClpP mediates activation of a mitochondrial unfolded protein response in C. elegans”

Critical Assessment of Functional Annotations (CAFA)

January 2010

June 2010

• Collect all proteins from UniProt that have no GO annotation with experimental evidence

• Submission of predicted GO annotations

• Collect novel experimental GO annotations for the same proteins

• Compare predictions with novel experimentally validated annotations

The Mooney group was the assessing group for CAFA

CAFA results

• Our annotations ranked 15th out of 40 when predicting Molecular function (MFO) annotations: F-measure 0.4 (best 0.48)

• Our annotations ranked 7th out of 36 when predicting Biological process (BPO) annotations: F-measure 0.33 (best 0.35)

¹ http://biofunctionprediction.org/
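For reference, the F-measure is commonly defined as the harmonic mean of precision and recall. The sketch below shows that standard definition only; the exact CAFA scoring protocol (for example, how score thresholds are handled) is not described on the slide, and `f_measure` is a hypothetical helper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not taken from the CAFA results.
example = f_measure(0.5, 0.25)  # 2 * 0.125 / 0.75 = 1/3
```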



Statistical Tracking of Ontological Phrases (STOP)

http://mooneygroup.org/stop

Enrichment analysis web application using automated annotations

• Results as table

• Termcloud

• Revisit previous results

• Entrez Gene ID, gene symbol or UniProt ID

• Custom background

• Filter results by ontology
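Term enrichment against a custom background is often computed with a one-sided hypergeometric test, sketched below. The slides do not state which statistic STOP uses, so the test itself, the function names `hypergeom_tail` and `enrich`, and the toy genes and annotations are all assumptions.

```python
from math import comb


def hypergeom_tail(hits, list_size, annotated, background_size):
    """P(X >= hits) when list_size genes are drawn from the background."""
    total = comb(background_size, list_size)
    upper = min(list_size, annotated)
    favourable = sum(
        comb(annotated, i) * comb(background_size - annotated, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


def enrich(gene_list, background, annotations):
    """Return (term, hits, p-value) rows sorted by p-value.

    annotations maps each term to the set of genes annotated with it.
    """
    background = set(background)
    genes = set(gene_list) & background
    results = []
    for term, annotated_genes in annotations.items():
        annotated_bg = set(annotated_genes) & background
        hits = len(genes & annotated_bg)
        if hits == 0:
            continue  # term not present in the gene list
        p_value = hypergeom_tail(hits, len(genes), len(annotated_bg), len(background))
        results.append((term, hits, p_value))
    return sorted(results, key=lambda row: row[2])


# Toy data: 10 background genes, two terms, a 3-gene input list.
background = [f"g{i}" for i in range(10)]
annotations = {
    "apoptosis": {"g0", "g1", "g2"},
    "cell cycle": {"g0", "g5", "g6", "g7", "g8"},
}
results = enrich(["g0", "g1", "g2"], background, annotations)
```

With many terms tested at once, the raw p-values would normally be adjusted for multiple testing; that step is left out of this sketch.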

Summary

STOP using just GO ... because we provide

• Automated annotations of genes to terms from over 200 ontologies

• Enrichment analysis on novel annotations that corrects for putative false positives and expands the realm of testable hypotheses

• An easy-to-use web interface that allows quick analysis and assists in navigating the results

• Human, worm, mouse, fly (... many more to come)

• http://mooneygroup.org/stop

Acknowledgements

Buck Institute for Research on Aging: Sean Mooney, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell

NCBO: Nigam Shah and Trish Whetzel

CAFA: Iddo Friedberg, Predrag Radivojac, Wyatt Clark

Funding: NIH R01 LM009722 (PI: Mooney), NIH U54-HG004028 (PI: Musen), NIH T32-AG000266 (PIs: Campisi, Ellerby), NIH UL1DE019608 (PI: Lithgow), NIH RL9AG032114 (U54 Geroscience), the NCBO and the Buck Trust.
