Implementing chemistry platform for OpenPHACTS

Implementing chemistry platform for OpenPHACTS: Lessons learned

Colin Batchelor, Alexey Pshenichnov, Jon Steele, Valery Tkachenko

Royal Society of Chemistry

ACS Spring 2016, San Diego, CA, March 17th 2016

Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point

info@openphactsfoundation.org @Open_PHACTS

Open PHACTS Practical Semantics

Acknowledgements

GlaxoSmithKline – Coordinator; Universität Wien – Managing entity; Technical University of Denmark; University of Hamburg, Center for Bioinformatics; BioSolveIT GmbH; Consorci Mar Parc de Salut de Barcelona; Leiden University Medical Centre; Royal Society of Chemistry; Vrije Universiteit Amsterdam; Novartis; Merck Serono; H. Lundbeck A/S; Eli Lilly; Netherlands Bioinformatics Centre; Swiss Institute of Bioinformatics; ConnectedDiscovery; EMBL-European Bioinformatics Institute; Janssen; Esteve; Almirall; OpenLink; SciBite; The Open PHACTS Foundation; Spanish National Cancer Research Centre; University of Manchester; Maastricht University; Aqnowledge; University of Santiago de Compostela; Rheinische Friedrich-Wilhelms-Universität Bonn; AstraZeneca; Pfizer

Why is it so hard to….

Competitors?

What’s the structure?

Are they in our file?

What’s similar?

What’s the target?

Pharmacology data?

Known Pathways?

Working On Now?

Connections to disease?

Expressed in right cell type?

IP?

Literature, PubChem, GenBank, Patents, Databases, Downloads

Data Analysis, Data Integration, Firewalled Databases

How do R&D companies use public data?

@gray_alasdair, Big Data Integration

Patent annotations in Open PHACTS

• Huge amount of knowledge hidden in patent corpus
• Most of which will never be published elsewhere
• Substantial lag between patent and scientific literature
• SureChEMBL system already extracts chemical entities from full-text patent documents
• Text (title, abstract, description, claims), images, molfiles
• Complemented with gene and disease entity annotations
• Using the Termite text-mining tool by SciBite
• Relevance scoring to reduce noise
• Tested for recall
• Patent, compound, gene, disease info available via API

Open PHACTS Expanding EcoSystem

Further Apps

Data Warrior

• VM install of Open PHACTS – Docker Image is now available
• Updating to ver 2.0 Open PHACTS
• Allows you to customise and load your own data into the environment

Want to load your data into Open PHACTS?

Want to run Open PHACTS within your environment?

Usage

>500 million queries

All Users by Sector Type

Challenge of migrating between versions of the API

Upgrading

Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium

MOE Collector Cytophacts Utopia Garfield SciBite

KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna

openphactsfoundation.org/apps.html

Explorer.openphacts.org

[Figure: Open PHACTS architecture. Data sources shown: RDF, nanopublications and databases (Db), each with a VoID description, grouped as Public Content, Commercial, Public Ontologies and User Annotations. Components shown: Data Cache (Virtuoso Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Chemistry Registration (Normalisation & Q/C), Identifier Management Service, Indexing, Core Platform, Apps. Example identifiers shown: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]

We integrate, standardize and host the chemical compound collection underpinning Open PHACTS.

We have developed a structure validation and normalization platform (CVSP) to ensure chemical structures are normalized to rules derived from the FDA structure normalization guidelines and modified based on input from members of EFPIA.

http://cvsp.chemspider.com/

The Royal Society of Chemistry’s role in Open PHACTS

Freely-available (requires logging in) chemical validation system for:
• Structure validation: warning on query atoms, pseudoatoms, nonsensical or unclear stereo
• Standardization workflows.

CVSP and the Open Pharmacological Space Chemical Registration System (OPS CRS)

Chemical data sources

Data source                        Number of records in source
DrugBank                           6828
PDB ligands                        18681
MeSH (extracted by text mining)    24381
ChEBI                              40503
HMDB                               41494
ChEMBL 20                          1456020
SureChEMBL 1.0                     14228299

We generate RDF that:
1. Describes synonyms and identifiers
2. Provides linksets between our data sources and the OPS identifiers
3. Describes molecule–molecule relations of interest to the pharma industry
4. Delivers calculated physicochemical properties of compounds
5. Lists the validation and standardization issues found by CVSP.

Royal Society of Chemistry data provided to Open PHACTS

• Use standard ontologies where possible (CHEMINF for cheminformatics properties, QUDT for units, OBO ontologies elsewhere)

• Use an event-based pattern for cheminformatics outputs. This enables us to add arbitrary provenance information.

Principles
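The event-based pattern named in these principles can be illustrated with a small sketch. Instead of attaching a calculated value directly to a molecule, the calculation itself becomes a node, so any amount of provenance can hang off it. Everything below is an assumption made for illustration: the `ex:` predicates, the `CalculationEvent` class, the identifiers and the numeric value are hypothetical, and the real RDF uses CHEMINF and QUDT terms that are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class CalculationEvent:
    """One calculation, modelled as its own node (hypothetical structure)."""
    event_id: str
    molecule: str
    property_name: str
    value: float
    unit: str
    provenance: Dict[str, str] = field(default_factory=dict)

    def to_triples(self) -> List[Triple]:
        # Core statements: every triple has the event as its subject.
        triples = [
            (self.event_id, "ex:hasInput", self.molecule),
            (self.event_id, "ex:hasOutputProperty", self.property_name),
            (self.event_id, "ex:hasValue", repr(self.value)),
            (self.event_id, "ex:hasUnit", self.unit),
        ]
        # Arbitrary provenance attaches to the event, not to the molecule.
        for key, val in sorted(self.provenance.items()):
            triples.append((self.event_id, f"ex:{key}", val))
        return triples


# Made-up example record; the value is illustrative, not real data.
event = CalculationEvent(
    event_id="ex:calc/1",
    molecule="ex:molecule/1",
    property_name="ex:PolarSurfaceArea",
    value=37.3,
    unit="ex:SquareAngstrom",
    provenance={"software": "ex:SomeCalculator", "runDate": "2016-03-17"},
)
for triple in event.to_triples():
    print(triple)
```

A direct triple such as (molecule, property, value) would leave nowhere to record which software produced the number; the event node is what makes that possible.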

Use the CHEMINF ontology: https://github.com/semanticchemistry

Validated ChemSpider synonyms, Unvalidated ChemSpider synonyms, Validated database identifiers, Unvalidated database identifiers, InChI, InChIKey, SMILES, preferred ChemSpider name

1. Synonyms and identifiers

Metadata describing the RDF:
• Can be used to build a directory of the RDF available
• Find what’s there without having to download all of it first
• Describes how Datasets are linked by the Linksets using SKOS.

Recommendations here: http://www.openphacts.org/specs/2013/WD-datadesc-20130912/

2. Linksets: Vocabulary of Interlinked Datasets (VoID)
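As a rough illustration of what a linkset description looks like, the sketch below builds a small Turtle document using standard VoID terms (`void:Linkset`, `void:subjectsTarget`, `void:objectsTarget`, `void:linkPredicate`, `void:triples`) and a SKOS mapping predicate. The function name, the `example.org` URIs and the triple count are hypothetical; the Open PHACTS dataset description recommendations linked above define the properties actually required.

```python
def void_linkset_turtle(linkset_uri, subjects_dataset, objects_dataset,
                        link_predicate="skos:exactMatch", triples=None):
    """Return a minimal Turtle description of a VoID linkset."""
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{linkset_uri}> a void:Linkset ;",
        f"    void:subjectsTarget <{subjects_dataset}> ;",
        f"    void:objectsTarget <{objects_dataset}> ;",
    ]
    if triples is not None:
        # Optional size hint, so consumers know what to expect before downloading.
        lines.append(f"    void:triples {int(triples)} ;")
    lines.append(f"    void:linkPredicate {link_predicate} .")
    return "\n".join(lines)


# Hypothetical URIs and count, for illustration only.
ttl = void_linkset_turtle(
    "http://example.org/linkset/ops-to-source",
    "http://example.org/dataset/ops",
    "http://example.org/dataset/source",
    triples=100,
)
print(ttl)
```

Because the description is small and separate from the data, a client can read it to find out which datasets are linked, and by which predicate, without fetching the linkset itself.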

We relate molecules to “parent” forms, variously, those which are:
• uncharged
• not isotopically-specified
• not stereochemically-specified
• the preferred tautomer
• the largest fragment
• the “superparent” (all of the above)

3. Molecule–molecule relations in CHEMINF
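One of these parent forms, the largest fragment, is simple enough to sketch without a toolkit. The code below is a deliberately crude approximation and not CVSP's actual rule: it assumes the input is a SMILES string whose dot-separated parts are genuinely disconnected (no ring-closure digits spanning a dot), counts heavy atoms with a regular expression, and breaks ties by taking the first fragment. A production system would use a cheminformatics library instead.

```python
import re

# Bracket atoms first, then two-letter halogens, then one-letter
# organic-subset atoms (aliphatic and aromatic).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Bracketed hydrogen such as [H], [2H], [H+]; not counted as a heavy atom.
BRACKET_HYDROGEN = re.compile(r"\[\d*H[+-]?\d*\]")


def heavy_atom_count(fragment: str) -> int:
    """Approximate number of non-hydrogen atoms in a SMILES fragment."""
    return sum(
        1
        for token in ATOM_PATTERN.findall(fragment)
        if not BRACKET_HYDROGEN.fullmatch(token)
    )


def largest_fragment(smiles: str) -> str:
    """Return the dot-separated fragment with the most heavy atoms."""
    fragments = [part for part in smiles.split(".") if part]
    if not fragments:
        raise ValueError("empty SMILES string")
    return max(fragments, key=heavy_atom_count)


# Sodium acetate: the acetate anion is kept, the counter-ion is dropped.
print(largest_fragment("CC(=O)[O-].[Na+]"))
```

The "superparent" in the list above would then be the result of applying this step together with the other four normalizations.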

log P, log D (at pH 5.5 and 7.4), bioconcentration factor, KOC (at pH 5.5 and 7.4), index of refraction, polar surface area, molar refractivity, molar volume, polarizability, surface tension, density at STP, flash point at 1 atm, enthalpy of vaporization at STP, vapour pressure at STP.

4. Calculated physicochemical properties

5. Issues from validation and standardization

We use the CHEMINF ontology again.

We distinguish between information, warnings and errors. Only serious failures to process, such as a structure having an invalid atom, count as errors.
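The three-level distinction can be sketched as follows. This is a minimal illustration, not CVSP's rule set: the symbol groups loosely follow molfile conventions (`A`, `Q`, `*`, `L` for query atoms, `R#` for R-group labels), and the choice to report `D`/`T` isotope shorthand as information is a hypothetical rule added only to show the lowest level. The one behaviour taken from the text is that an invalid atom counts as an error.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    severity: Severity
    message: str


ELEMENTS = set(
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co "
    "Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb "
    "Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re "
    "Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es "
    "Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og".split()
)
ISOTOPE_SHORTHAND = {"D", "T"}       # hypothetical "information" rule
QUERY_SYMBOLS = {"A", "Q", "*", "L"}  # illustrative query-atom symbols
PSEUDO_SYMBOLS = {"R", "R#"}          # illustrative pseudoatom symbols


def check_atom_symbols(symbols: List[str]) -> List[Issue]:
    """Classify each non-element atom symbol into one of three levels."""
    issues = []
    for index, symbol in enumerate(symbols, start=1):
        if symbol in ELEMENTS:
            continue
        if symbol in ISOTOPE_SHORTHAND:
            level, kind = Severity.INFORMATION, "isotope shorthand"
        elif symbol in QUERY_SYMBOLS:
            level, kind = Severity.WARNING, "query atom"
        elif symbol in PSEUDO_SYMBOLS:
            level, kind = Severity.WARNING, "pseudoatom"
        else:
            level, kind = Severity.ERROR, "invalid atom"
        issues.append(Issue(level, f"atom {index}: {kind} '{symbol}'"))
    return issues


def worst_severity(issues: List[Issue]) -> Optional[Severity]:
    """Highest severity present, or None when there are no issues."""
    return max((i.severity for i in issues), key=lambda s: s.value, default=None)


found = check_atom_symbols(["C", "Q", "R#", "Zz", "D"])
for issue in found:
    print(issue.severity.name, issue.message)
```

Keeping the overall outcome as the worst severity found means a record with only warnings can still be processed, while a single error marks it as a failure.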

This is the world we live in

Data quality issue and CVSP

– Robochemistry

– Proliferation of errors in public and private databases

• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL

– Automated quality control system

Chemistry Validation and Standardization Platform


Thank you

Email: tkachenkov@rsc.org

Slides: http://www.slideshare.net/valerytkachenko16
