Implementing chemistry platform for OpenPHACTS: Lessons learned
Colin Batchelor, Alexey Pshenichnov, Jon Steele, Valery Tkachenko
Royal Society of Chemistry
ACS Spring 2016, San Diego, CA, March 17th 2016
Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources
Into A Single Open & Sustainable Access Point
@Open_PHACTS
Open PHACTS Practical Semantics
Acknowledgements
GlaxoSmithKline – Coordinator; Universität Wien – Managing entity; Technical University of Denmark; University of Hamburg, Center for Bioinformatics; BioSolveIT GmbH; Consorci Mar Parc de Salut de Barcelona; Leiden University Medical Centre; Royal Society of Chemistry; Vrije Universiteit Amsterdam; Novartis; Merck Serono; H. Lundbeck A/S; Eli Lilly; Netherlands Bioinformatics Centre; Swiss Institute of Bioinformatics; ConnectedDiscovery; EMBL-European Bioinformatics Institute; Janssen; Esteve; Almirall; OpenLink; SciBite; The Open PHACTS Foundation; Spanish National Cancer Research Centre; University of Manchester; Maastricht University; Aqnowledge; University of Santiago de Compostela; Rheinische Friedrich-Wilhelms-Universität Bonn; AstraZeneca; Pfizer
Why is it so hard to….
Competitors?
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Working On Now?
Connections to disease?
Expressed in right cell type?
IP?
Literature, PubChem, Genbank, Patents, Databases, Downloads
Data Analysis, Data Integration, Firewalled Databases
How do R&D companies use public data?
@gray_alasdair Big Data Integration
Patent annotations in Open PHACTS
• Huge amount of knowledge hidden in patent corpus
• Most of which will never be published elsewhere
• Substantial lag between patent and scientific literature
• SureChEMBL system already extracts chemical entities from full-text patent documents
• Text (title, abstract, description, claims), images, molfiles
• Complemented with gene and disease entity annotations
• Using the Termite text-mining tool by SciBite
• Relevance scoring to reduce noise
• Tested for recall
• Patent, compound, gene, disease info available via API
Open PHACTS Expanding EcoSystem
Further Apps
Data Warrior
• VM install of Open PHACTS – Docker Image is now available
• Updating to ver 2.0 Open PHACTS
• Allows you to customise and load your own data into the environment
Want to load your data into Open PHACTS?
Want to run Open PHACTS within your environment?
Usage
>500 million queries
All Users by Sector Type
Challenge of migrating between versions of the API
Upgrading
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium
MOE Collector Cytophacts Utopia Garfield SciBite
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna
openphactsfoundation.org/apps.html
Explorer.openphacts.org
http://data.openphacts.org/artifactory/
[Figure: Open PHACTS platform architecture. Inputs: Public Content, Commercial, Public Ontologies and User Annotations, supplied as RDF, Nanopub and Db sources, each with a VoID description. Core Platform: Data Cache (Virtuoso Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Chemistry Registration, Normalisation & Q/C, Identifier Management Service, Indexing. Output: Apps. Example identifiers shown: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]
We integrate, standardize and host the chemical compound collection underpinning Open PHACTS.
We have developed a structure validation and normalization platform (CVSP) to ensure chemical structures are normalized to rules derived from the FDA structure normalization guidelines and modified based on input from members of EFPIA.
http://cvsp.chemspider.com/
The Royal Society of Chemistry’s role in Open PHACTS
Freely-available (requires logging in) chemical validation system for:
• Structure validation: warning on query atoms, pseudoatoms, nonsensical or unclear stereo
• Standardization workflows.
CVSP and the Open Pharmacological Space Chemical Registration System (OPS CRS)
Chemical data sources
Data source                        Number of records in source
DrugBank                           6828
PDB ligands                        18681
MeSH (extracted by text mining)    24381
ChEBI                              40503
HMDB                               41494
ChEMBL 20                          1456020
SureChEMBL 1.0                     14228299
We generate RDF that:
1. Describes synonyms and identifiers
2. Provides linksets between our data sources and the OPS identifiers
3. Describes molecule–molecule relations of interest to the pharma industry
4. Delivers calculated physicochemical properties of compounds
5. Lists the validation and standardization issues found by CVSP.
Royal Society of Chemistry data provided to Open PHACTS
• Use standard ontologies where possible (CHEMINF for cheminformatics properties, QUDT for units, OBO ontologies elsewhere)
• Use an event-based pattern for cheminformatics outputs. This enables us to add arbitrary provenance information.
Principles
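The event-based pattern above can be sketched in plain Python. This is a minimal illustration, not the RDF that is actually generated: the class names, the provenance fields (`software`, `version`, `started`, `extra`) and the numeric value are all hypothetical. The point it shows is that the value refers to the event that produced it, so further provenance can be attached to the event without changing the value record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CalculationEvent:
    """One run of a calculation; provenance is attached here (fields are hypothetical)."""
    software: str
    version: str
    started: datetime
    extra: dict = field(default_factory=dict)  # arbitrary additional provenance


@dataclass
class PropertyValue:
    """A calculated output that points back to the event that produced it."""
    compound_id: str
    property_name: str        # in the RDF this would be a CHEMINF term
    value: float
    unit: Optional[str]       # in the RDF this would be a QUDT unit
    produced_by: CalculationEvent


# Illustrative values only; "example-predictor" and 1.23 are made up.
event = CalculationEvent(
    software="example-predictor",
    version="1.0",
    started=datetime(2016, 3, 17, tzinfo=timezone.utc),
)
logp = PropertyValue("CS4532", "log P", 1.23, None, event)

# New provenance is added to the event, not to every value it produced.
event.extra["input_format"] = "molfile"
```

Because several values can share one event, provenance recorded once applies to every output of that run.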
Use the CHEMINF ontology: https://github.com/semanticchemistry
Validated ChemSpider synonyms, Unvalidated ChemSpider synonyms, Validated database identifiers, Unvalidated database identifiers, InChI, InChIKey, SMILES, preferred ChemSpider name
1. Synonyms and identifiers
Metadata describing the RDF:
• Can be used to build a directory of the RDF available
• Find what’s there without having to download all of it first
• Describes how Datasets are linked by the Linksets using SKOS.
Recommendations here: http://www.openphacts.org/specs/2013/WD-datadesc-20130912/
2. Linksets: Vocabulary of Interlinked Datasets
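As a rough illustration of the metadata described above, the following sketch builds a Turtle description of a single linkset. It uses terms from the public VoID vocabulary (`void:Linkset`, `void:subjectsTarget`, `void:objectsTarget`, `void:linkPredicate`, `void:triples`) and a SKOS mapping predicate. The function name, the `example.org` IRIs and the triple count are assumptions made for the example, not the Open PHACTS dataset descriptions themselves.

```python
def linkset_turtle(linkset, source, target, predicate, triples):
    """Return a Turtle description of a VoID linkset between two datasets."""
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{linkset}> a void:Linkset ;",
        f"    void:subjectsTarget <{source}> ;",   # dataset the link subjects come from
        f"    void:objectsTarget <{target}> ;",    # dataset the link objects come from
        f"    void:linkPredicate {predicate} ;",   # how the two sides are related
        f"    void:triples {triples} .",           # size, readable without downloading
    ])


# Hypothetical IRIs and count, for illustration only.
ttl = linkset_turtle(
    "http://example.org/linkset/source-to-ops",
    "http://example.org/dataset/source",
    "http://example.org/dataset/ops",
    "skos:exactMatch",
    100,
)
print(ttl)
```

A directory of the available RDF can then be built from these small descriptions alone.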
We relate molecules to “parent” forms, variously, those which are:
• uncharged
• not isotopically-specified
• not stereochemically-specified
• the preferred tautomer
• the largest fragment
• the “superparent” (all of the above)
3. Molecule–molecule relations in CHEMINF
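One way to picture the parent forms listed above is as a set of flags, with the superparent being the combination of all of them. The sketch below is only a model of that relationship: the names `Parent`, `SINGLE_PARENTS` and `normalisations` are hypothetical, and no chemistry is performed.

```python
from enum import Flag, auto


class Parent(Flag):
    """Kinds of parent form; SUPERPARENT combines all the others."""
    UNCHARGED = auto()
    ISOTOPE_FREE = auto()
    STEREO_FREE = auto()
    PREFERRED_TAUTOMER = auto()
    LARGEST_FRAGMENT = auto()
    SUPERPARENT = (
        UNCHARGED | ISOTOPE_FREE | STEREO_FREE
        | PREFERRED_TAUTOMER | LARGEST_FRAGMENT
    )


# Listed explicitly so the order does not depend on Flag iteration behaviour.
SINGLE_PARENTS = (
    Parent.UNCHARGED,
    Parent.ISOTOPE_FREE,
    Parent.STEREO_FREE,
    Parent.PREFERRED_TAUTOMER,
    Parent.LARGEST_FRAGMENT,
)


def normalisations(parent):
    """Names of the individual normalisations implied by a parent kind."""
    return [p.name for p in SINGLE_PARENTS if p in parent]


print(normalisations(Parent.STEREO_FREE))
print(normalisations(Parent.SUPERPARENT))
```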
log P, log D (at pH 5.5 and 7.4), bioconcentration factor, KOC (at pH 5.5 and 7.4), index of refraction, polar surface area, molar refractivity, molar volume, polarizability, surface tension, density at STP, flash point at 1 atm, enthalpy of vaporization at STP, vapour pressure at STP.
4. Calculated physicochemical properties
5. Issues from validation and standardization
We use the CHEMINF ontology again.
We distinguish between information, warnings and errors. Only serious failures to process, such as a structure having an invalid atom, count as errors.
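The three-level distinction can be sketched as an ordered enumeration, where only an error means the record failed to process. This is an assumed model for illustration: the `Issue` class, the issue codes and the helper names are hypothetical and are not CVSP's own data structures. The two example issues reuse cases mentioned in this deck (unclear stereo as a warning, an invalid atom as an error).

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Ordered so that severities can be compared."""
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    code: str            # hypothetical issue code
    severity: Severity
    message: str


def worst(issues):
    """Highest severity among the issues, or None if there are none."""
    return max((i.severity for i in issues), default=None)


def processed_ok(issues):
    """A record is usable unless at least one issue is an error."""
    return all(i.severity < Severity.ERROR for i in issues)


issues = [
    Issue("stereo.unclear", Severity.WARNING, "unclear stereo"),
    Issue("atom.invalid", Severity.ERROR, "structure has an invalid atom"),
]
print(worst(issues).name, processed_ok(issues))
```

Keeping warnings separate from errors lets a structure with questionable stereo still be registered while flagging it for review.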
This is the world we live in
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
Chemistry Validation and Standardization Platform