The Data Today
Alasdair Gray
Heriot-Watt University, Edinburgh, [email protected]
@gray_alasdair
@gray_alasdair Big Data Integration
Dataset             Downloaded    Version     Licence         Triples
Bio Assay Ontology  -             -           CC-By           10,360
CALOHA              8 Apr 2015    2014-01-22  CC-By-ND        14,552
ChEBI               4 Mar 2015    125         CC-By-SA        1,012,056
ChEMBL              18 Feb 2015   20.0        CC-By-SA        445,732,880
ConceptWiki         12 Dec 2013   -           CC-By-SA        4,331,760
DisGeNET            31 Mar 2015   2.1.0       ODbL            15,011,136
Disease Ontology    -             2015-05-21  CC-By           188,062
DrugBank            19 Feb 2015   4.1         Non-commercial  4,028,767
ENZYME              -             2015_11     CC-By-ND        61,467
FDA Adverse Events  9 Jul 2012    -           CC0             13,557,070
Total: ~3 billion triples
Dataset                    Downloaded    Version   Licence   Triples
Gene Ontology              4 Mar 2015    -         CC-By     1,366,494
Gene Ontology Annotations  17 Feb 2015   -         CC-By     879,448,347
NCATS OPDDR                Nov 2015      Oct 2015  -         2,643
neXTProt (NP)              1 Feb 2014    1.0       CC-By-ND  215,006,108
OPS Chemical Registry      4 Nov 2014    -         CC-By-SA  241,986,722
HMDB                       -             3.6       HMDB      -
MeSH                       -             2015      MeSH      -
PDB Ligands                -             2         PDB       -
OPS Metadata               -             -         CC-By-SA  2,053
UniProt                    -             2015_11   CC-By-ND  1,131,186,434
WikiPathways               -             20151118  CC-By     11,781,627
Total: ~3 billion triples
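As a quick sanity check on the stated total, the triple counts listed in the two tables can be summed. This is only a minimal sketch: the dictionary is transcribed by hand from the slides, rows without a triple count (HMDB, MeSH, PDB Ligands) are left out, and the variable names are illustrative.

```python
# Triple counts transcribed from the two dataset tables.
# Rows that list no triple count (HMDB, MeSH, PDB Ligands) are omitted.
TRIPLES = {
    "Bio Assay Ontology": 10_360,
    "CALOHA": 14_552,
    "ChEBI": 1_012_056,
    "ChEMBL": 445_732_880,
    "ConceptWiki": 4_331_760,
    "DisGeNET": 15_011_136,
    "Disease Ontology": 188_062,
    "DrugBank": 4_028_767,
    "ENZYME": 61_467,
    "FDA Adverse Events": 13_557_070,
    "Gene Ontology": 1_366_494,
    "Gene Ontology Annotations": 879_448_347,
    "NCATS OPDDR": 2_643,
    "neXTProt (NP)": 215_006_108,
    "OPS Chemical Registry": 241_986_722,
    "OPS Metadata": 2_053,
    "UniProt": 1_131_186_434,
    "WikiPathways": 11_781_627,
}

total = sum(TRIPLES.values())
largest = max(TRIPLES, key=TRIPLES.get)  # dataset contributing the most triples

print(f"{total:,} triples in total; largest dataset: {largest}")
```

The sum lands just under 3 billion, consistent with the rounded figure on the slides, and a handful of datasets account for most of it.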
John Wilbanks consulted for us
A framework built around STANDARD well-understood Creative Commons licences – and how they interoperate
Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and data publishers
One size won't fit all requirements
Data Licensing (Or Lack Of!)
Disease
Tissue
Target
Compound
Pathway
STANDARD_TYPE  UNIT_COUNT
-------------  ----------
AC50           7
Activity       421
EC50           39
IC50           46
ID50           42
Ki             23
Log IC50       4
Log Ki         7
Potency        11
log IC50       0
STANDARD_TYPE  STANDARD_UNITS  COUNT(*)
-------------  --------------  --------
IC50           nM              829448
IC50           ug.mL-1         41000
IC50                           38521
IC50           ug/ml           2038
IC50           ug ml-1         509
IC50           mg kg-1         295
IC50           molar ratio     178
IC50           ug              117
IC50           %               113
IC50           uM well-1       52
~100 units
>5000 types
Implemented using the Quantities, Units, Dimensions and Types ontology (QUDT, http://www.qudt.org/)
Quantitative Data Challenges
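The IC50 listing shows the same unit written several ways (ug.mL-1, ug/ml, ug ml-1). A minimal sketch of the normalisation step follows. It uses a hand-written synonym table rather than QUDT, and both the table and the function names are hypothetical, chosen only to illustrate the idea.

```python
import re

# Hypothetical synonym table (not QUDT): spellings seen in the IC50 listing,
# lower-cased, mapped to a single canonical label.
CANONICAL_UNITS = {
    "nm": "nM",
    "um": "uM",
    "ug.ml-1": "ug/mL",
    "ug/ml": "ug/mL",
    "ug ml-1": "ug/mL",
}

# Conversion factors to nanomolar, for molar concentration units only.
TO_NANOMOLAR = {"nM": 1.0, "uM": 1_000.0}


def canonical_unit(raw):
    """Return the canonical label for a raw unit string, or None if unknown."""
    key = re.sub(r"\s+", " ", raw.strip()).lower()
    return CANONICAL_UNITS.get(key)


def to_nanomolar(value, raw_unit):
    """Convert a value to nM, or return None when the unit is not molar."""
    unit = canonical_unit(raw_unit)
    if unit not in TO_NANOMOLAR:
        return None  # e.g. ug/mL needs a molecular weight, % is not convertible
    return value * TO_NANOMOLAR[unit]


print(canonical_unit("ug ml-1"))   # same label as "ug.mL-1" and "ug/ml"
print(to_nanomolar(2.5, "uM"))
print(to_nanomolar(1.0, "mg kg-1"))
```

Returning None for unconvertible units, instead of guessing, keeps values such as "mg kg-1" or "%" visible as data-quality issues rather than silently mixing them with molar concentrations.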
Quality Assurance
ops:OPS437281
✔
ops:OPS380297 ops:OPS380292
is_stereoisomer_of [ci:CHEMINF_000461]
has_stereoundefined_parent [ci:CHEMINF_000456]
Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope
• …
Chemical Registration Service Data
Mappings: Raw
Mappings (Raw): 25,087,328
Mappings: Computed
Mappings (Comp): 200,000,000+
P12047 X31045 P12047 GB:29384 RS_2353
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
DrugBank ChemSpider PubChem
Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Are these records the same?
It depends upon your task!
skos:exactMatch(InChI)
Strict Relaxed
Analysing Browsing
I need to perform an analysis, give me details of the active compound in Gleevec.
skos:closeMatch(Drug Name)
skos:closeMatch(Drug Name)
skos:exactMatch(InChI)
Strict Relaxed
Analysing Browsing
Which targets are known to interact with Gleevec?
A lens defines a conceptual view over the data
Specifies operational equivalence conditions
Consists of:
• Identifier (URI)
• Title (dct:title)
• Description (dct:description)
• Documentation link (dcat:landingPage)
• Creator (pav:createdBy)
• Timestamp (pav:createdOn)
• Equivalence rules (bdb:linksetJustification)
Scientific Lens
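The strict and relaxed views can be sketched as a filter over mapping links. This is a rough illustration, not the Open PHACTS implementation. The Lens class keeps only a few of the fields listed above, and the identifiers, lens URIs and the equivalents helper are all hypothetical. Links are assumed to be symmetric and followed transitively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lens:
    uri: str             # identifier of the lens (hypothetical URIs below)
    title: str
    accepted: frozenset  # (predicate, justification) pairs treated as equivalence


# Hypothetical mapping links: (source, predicate, justification, target).
MAPPINGS = [
    ("drugbank:imatinib", "skos:exactMatch", "InChI", "chemspider:imatinib"),
    ("drugbank:imatinib", "skos:closeMatch", "Drug Name", "chemspider:imatinib-mesylate"),
    ("chemspider:imatinib-mesylate", "skos:exactMatch", "InChI", "pubchem:imatinib-mesylate"),
]

STRICT = Lens("ex:lens/strict", "Strict (analysing)",
              frozenset({("skos:exactMatch", "InChI")}))
RELAXED = Lens("ex:lens/relaxed", "Relaxed (browsing)",
               frozenset({("skos:exactMatch", "InChI"),
                          ("skos:closeMatch", "Drug Name")}))


def equivalents(start, lens, mappings=MAPPINGS):
    """Collect every identifier reachable from start via links the lens accepts."""
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for src, pred, just, dst in mappings:
            if (pred, just) not in lens.accepted:
                continue  # this lens does not treat the link as equivalence
            if src == current:
                other = dst
            elif dst == current:
                other = src  # assumption: links are symmetric
            else:
                continue
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen


print(sorted(equivalents("drugbank:imatinib", STRICT)))
print(sorted(equivalents("drugbank:imatinib", RELAXED)))
```

Under the strict lens only the structure-identical record is returned, which suits analysis. The relaxed lens also pulls in the mesylate salt records through the drug-name link, which suits browsing questions such as which targets interact with Gleevec.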
Lenses: 34 in total
7 Public
25 Chemistry
2 Gene
Data Governance
Contribution must not be underestimated!!!
Alasdair J G [email protected]/~ajg33/
@gray_alasdair
Open [email protected]
@open_phacts