Data integration with identifiers and ontologies
Why are names and graphs not enough?
Egon Willighagen
http://chem-bla-ics.blogspot.com/
@egonwillighagen
ORCID: 0000-0001-7542-0286
Uppsala University
2016-09-12
Acknowledgements
● WikiPathways and PathVisio projects
– Prof. Alex Pico's team, UCSF
– Current and past members of BiGCaT (Prof. Chris Evelo): Marloes Poort
– Pathway Providers: Pieter Giesbertz (TUM), Kozo Nishida (RIKEN)
● Maastricht University
– Toxicology: Rianne Fijten
– MaCSBio team
– Maastricht Science Programme (VOC project)
● Open PHACTS
– Manchester University: Prof. Carole Goble, Christian Brenninkmeijer, Stian Soiland-Reyes
– Heriot-Watt University: Alasdair Gray
– Royal Society of Chemistry: Colin Batchelor
● Others
– Bioclipse: Ola Spjuth (Uppsala University)
– MetaboLights collaboration: Reza Salek, Chandu Venkata, Garima Thakur
– ChEBI collaboration: Christoph Steinbeck, Gareth Owen
– PubChem collaboration: Evan Bolton, Gang Fu
– HMDB, Wikidata teams
Asthma: Detecting and Understanding
Smolinska et al. PLOS ONE. 2014 9:e105447. doi:10.1371/journal.pone.0105447
Systems Biology: pathways
Andón FT, Fadeel B. "Programmed Cell Death: Molecular Mechanisms and Implications for Safety Assessment of Nanomaterials." Acc Chem Res, 2012.
Dopamine metabolism
Marloes Poort
The effect of troglitazone on heme biosynthesis
PathVisio: pathway enrichment (etc)
Van Iersel, M.P., et al. "Presenting and exploring biological pathways with PathVisio." BMC Bioinformatics 9.1 (2008): 399.
http://pathvisio.org/ → Martina Kutmon
We see a lot? But what is it?
● Current techniques can see up to 1000 metabolites in one analysis
– Only part of all 40k metabolites
● Only 10% we can identify
– The other 90% is unknown
Databases & identifiers
● HMDB: Human Metabolome Database
● ChEBI: Database of Chemical Entities of Biological Interest
● ChemSpider, PubChem
● CAS: Chemical Abstracts Service
● InChI: International Chemical Identifier
Acid/Base conjugates
CHEBI:15361 (Pyruvate) -> Ce:CHEBI:32816 (conjugate) -> Ck:C00022 -> [WP2456 HIF1A and PPARG regulation of glycolysis, WP2453 TCA Cycle and PDHc]
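The chain above can be read as a walk over an identifier-mapping graph. The following is a minimal Python sketch, assuming a toy in-memory table that holds only the identifiers shown on this slide. The names `MAPPINGS`, `PATHWAYS` and `pathways_for` are hypothetical and are not part of the BridgeDb or WikiPathways APIs.

```python
from collections import deque

# Toy mapping table with only the links shown on the slide (assumption: illustrative data).
# Identifiers use "system code:identifier" notation (Ce = ChEBI, Ck = KEGG Compound).
MAPPINGS = {
    "Ce:CHEBI:15361": ["Ce:CHEBI:32816"],  # pyruvate -> its acid/base conjugate
    "Ce:CHEBI:32816": ["Ck:C00022"],       # conjugate -> KEGG Compound
}

# Pathways annotated with a given identifier (again, only the slide's example).
PATHWAYS = {
    "Ck:C00022": ["WP2456", "WP2453"],
}


def pathways_for(start):
    """Breadth-first walk over MAPPINGS, collecting pathways for every identifier reached."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        current = queue.popleft()
        for pathway in PATHWAYS.get(current, []):
            if pathway not in found:
                found.append(pathway)
        for neighbour in MAPPINGS.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return found


print(pathways_for("Ce:CHEBI:15361"))
```

Without the conjugate link, the walk from CHEBI:15361 would stop immediately and neither pathway would be found.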
Switching identities: Glucose
Switching identities: Warfarin
Porter, W. (2010). Warfarin: history, tautomerism and activity. Journal of Computer-Aided Molecular Design, 24 (6-7), 553-573. DOI: 10.1007/s10822-010-9335-7
Bridging: identifiers
So, what IDs are used in WikiPathways?
Curated Collection subset
BridgeDb
Van Iersel, M.P., et al. "The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services." BMC Bioinformatics 11.1 (2010): 5.
New tools
● Open PHACTS' Identifier Mapping Service
● R package
● Bioclipse
Metabolite ID Mapping database
● HMDB, ChEBI, Wikidata
BridgeDb: scientific lenses
● Gene
– gene-protein
– gene-probe
● Metabolite
– Tautomers
– Compound class
– Charge (acid/ate)
Brenninkmeijer, CYA, et al. "Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision." Proceedings of 2nd International Workshop on Linked Science. 2012.
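One way to picture a scientific lens is as a filter on which kinds of mapping a task will accept. The sketch below assumes a hypothetical edge list in which every mapping is tagged with its relation kind. The tags, the lens names and the `reachable` helper are illustrative only and do not reflect how BridgeDb or Open PHACTS implement lenses.

```python
# Each mapping edge carries the kind of relation it asserts (assumption: toy data).
# Ca = CAS, Ce = ChEBI; the identifiers are the examples used in these slides.
EDGES = [
    ("Ca:544-63-8", "Ce:28875", "exact"),            # myristic acid, CAS -> ChEBI
    ("Ce:28875", "Ce:15904", "compound_class"),      # -> long-chain fatty acid
    ("Ce:CHEBI:15361", "Ce:CHEBI:32816", "charge"),  # pyruvate -> conjugate
]

# A lens is modelled here as the set of relation kinds that may be followed.
STRICT_LENS = {"exact"}
CLASS_LENS = {"exact", "compound_class"}


def reachable(start, lens):
    """Return identifiers reachable from start using only edges the lens allows."""
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for source, target, kind in EDGES:
            if source == current and kind in lens and target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen - {start}


print(reachable("Ca:544-63-8", STRICT_LENS))
print(reachable("Ca:544-63-8", CLASS_LENS))
```

Under the strict lens the CAS number maps only to its ChEBI entry. Under the class lens it also reaches the compound class, which is the identifier that the pathways are annotated with.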
#1: The breath data set
CAS numbers: 1843
CAS numbers (unique): 1733
CAS numbers with mappings: 718
CAS numbers matches: 54
Pathways found: 76
Matches via CAS: 9
Matches via mapping: 29
Matches via ChEBI super class: 35
Matches via ChEBI charged species: 3
Matches via ChEBI tautomers: 0
CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
What if we add more CAS ID mappings? (e.g. from Wikidata)
INFO: Number of ids in Ch (HMDB): 41514 (changed +0.0%)
INFO: Number of ids in Ce (ChEBI): 64222 (changed +0.0%)
INFO: Number of ids in Kd (KEGG Drug): 2406 (changed +23960.0%)
INFO: Number of ids in Ca (CAS): 38621 (changed +30.5%)
INFO: Number of ids in Wi (Wikipedia): 3991 (changed +0.0%)
INFO: Number of ids in Ck (KEGG Compound): 15896 (changed +0.0%)
INFO: Number of ids in Cpc (PubChem-compound): 29170 (changed +72.5%)
INFO: Number of ids in Wd: 18237
INFO: Number of ids in Cs (Chemspider): 23981 (changed +49.4%)
- 30% more CAS numbers (294 unique IDs in WikiPathways)
- 73% more PubChem compound identifiers (217 unique IDs in WP)
- 50% more Chemspider identifiers (157 unique IDs in WP)
- a lot more KEGG Drug identifiers
#1: The breath data set
CAS numbers: 1843
CAS numbers (unique): 1733
CAS numbers with mappings: 978
CAS numbers matches: 116
Pathways found: 158 (unique: 62)
Matches via CAS: 9
Matches via mapping: 28
Matches via ChEBI super class: 108
Matches via ChEBI charged species: 9
Matches via ChEBI tautomers: 0
Matches via ChEBI roles: 4
CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
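The two runs on the breath data set can be compared with a little arithmetic. The sketch below uses only the counts reported on the slides. Expressing them as a percentage of the 1733 unique CAS numbers is an assumption made here for illustration, and `summarise` is a hypothetical helper.

```python
UNIQUE_CAS = 1733  # unique CAS numbers in the breath data set (same in both runs)

# Counts copied from the two "#1: The breath data set" slides.
before = {
    "with_mappings": 718,
    "matched": 54,
    "via": {"CAS": 9, "mapping": 29, "super class": 35,
            "charged species": 3, "tautomers": 0},
}
after = {
    "with_mappings": 978,
    "matched": 116,
    "via": {"CAS": 9, "mapping": 28, "super class": 108,
            "charged species": 9, "tautomers": 0, "roles": 4},
}


def summarise(run):
    """Return (mapping coverage %, matched %, total matches over all routes) for one run."""
    coverage = round(100 * run["with_mappings"] / UNIQUE_CAS, 1)
    matched = round(100 * run["matched"] / UNIQUE_CAS, 1)
    total = sum(run["via"].values())
    return coverage, matched, total


for label, run in (("before", before), ("after", after)):
    print(label, summarise(run))
```

In both runs the per-route match counts add up to the reported "Pathways found" figure (76 and 158), and most of the increase comes from the ChEBI super class route.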
Wikidata
Mietchen, D. et al. Enabling open science: Wikidata for research (Wiki4R). Research Ideas and Outcomes 1, e7573+ (2015)
Wikidata: identifiers
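As an illustration of how identifier pairs could be pulled from Wikidata, the sketch below builds a SPARQL query for items that carry both a CAS Registry Number (property P231) and a ChEBI ID (property P683), and the corresponding request URL for the public Wikidata Query Service. This is an assumption about one possible approach, not the query used for the results in these slides. The function names are hypothetical and the code only constructs the request, it does not send it.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service


def build_query(limit=10):
    """SPARQL for items with both a CAS Registry Number (P231) and a ChEBI ID (P683)."""
    return (
        "SELECT ?compound ?cas ?chebi WHERE { "
        "?compound wdt:P231 ?cas ; wdt:P683 ?chebi . "
        f"}} LIMIT {int(limit)}"
    )


def build_url(limit=10):
    """Full GET URL asking the endpoint for JSON results."""
    params = {"query": build_query(limit), "format": "json"}
    return ENDPOINT + "?" + urlencode(params)


print(build_query(5))
print(build_url(5))
```

The resulting URL can be fetched with any HTTP client. Each result row would then give one CAS to ChEBI pair that could be added to a mapping database.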
Application Programming Interfaces
Conclusions
● Updated metabolite ID database
– HMDB: still a major workhorse
– ChEBI: charged species, compound classes
– Wikidata: CAS numbers, other missing identifiers
● Pathway Analysis
– Mapping with Bioclipse and PathVisio
– Scientific lenses improve mappings
– Better annotation